Exploratory Data Analysis in Healthcare - Working DICOM Data

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
from glob import glob
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from itertools import chain
import pydicom #for working with DICOM files
from skimage.io import imread, imshow
from skimage import io

EDA is open-ended, and it is up to the analyst to decide on what different ways to slice and dice the data. A good starting point is to look at the requirements for the FDA documentation that we will need to produce as submit with our Machine Learning model.

This EDA will focus on investigating case of pneumonia and understand how data looks in the wild. E.g. what other types of diseases it's commonly found with, how often it is found, what ages it affects, etc.

Note that this NIH dataset was not specifically acquired for pneumonia. So, while this is a representation of 'pneumonia in the wild,' the prevalence of pneumonia may be different if you were to take only chest x-rays that were acquired in an ER setting with suspicion of pneumonia.

Some of the EDA tasks we will perform may include investigating:

  • The patient demographic data such as gender, age, patient position,etc. (as it is available)
  • The x-ray views taken (i.e. view position)
  • The number of cases including:
    • number of pneumonia cases,
    • number of non-pneumonia cases
  • The distribution of other diseases that are comorbid with pneumonia
  • Number of disease per patient
  • Pixel-level assessments of the imaging data for healthy & disease states of interest (e.g. histograms of intensity values) and compare distributions across diseases.

Data Understanding and Profiling

## Load NIH data
all_xray_df = pd.read_csv('/data/Data_Entry_2017.csv')
all_xray_df.sample(3)

## We will use'sample_labels.csv' data for pixel level assessments
sample_df = pd.read_csv('sample_labels.csv')
sample_df.sample(5)

Let's look at the shape of our data:

print(all_xray_df.shape)
print(sample_df.shape)
(112120, 12)
(5606, 11)
all_xray_df.sample(3)
Image Index Finding Labels Follow-up # Patient ID Patient Age Patient Gender View Position OriginalImage[Width Height] OriginalImagePixelSpacing[x y] Unnamed: 11
48566 00012309_002.png No Finding 2 12309 49 F AP 2500 2048 0.168 0.168 NaN
14793 00003863_011.png Atelectasis|Effusion 11 3863 52 M PA 2992 2989 0.143 0.143 NaN
90097 00022407_001.png Infiltration 1 22407 61 M PA 2992 2991 0.143 0.143 NaN

Explore all_xray_df dataframe

  • Look for null values (empty records)
  • Check data types
  • Examine how data is being distributed
  • Gain some intuititon and understanding on what the data is about
all_xray_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112120 entries, 0 to 112119
Data columns (total 12 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   Image Index                  112120 non-null  object 
 1   Finding Labels               112120 non-null  object 
 2   Follow-up #                  112120 non-null  int64  
 3   Patient ID                   112120 non-null  int64  
 4   Patient Age                  112120 non-null  int64  
 5   Patient Gender               112120 non-null  object 
 6   View Position                112120 non-null  object 
 7   OriginalImage[Width          112120 non-null  int64  
 8   Height]                      112120 non-null  int64  
 9   OriginalImagePixelSpacing[x  112120 non-null  float64
 10  y]                           112120 non-null  float64
 11  Unnamed: 11                  0 non-null       float64
dtypes: float64(3), int64(5), object(4)
memory usage: 10.3+ MB
  • Noticed an empty column with no records Unnamed: 11.
  • Drop Unnamed: 11 column since it has no values.
all_xray_df.drop(columns=['Unnamed: 11'], inplace=True)
all_xray_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112120 entries, 0 to 112119
Data columns (total 11 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   Image Index                  112120 non-null  object 
 1   Finding Labels               112120 non-null  object 
 2   Follow-up #                  112120 non-null  int64  
 3   Patient ID                   112120 non-null  int64  
 4   Patient Age                  112120 non-null  int64  
 5   Patient Gender               112120 non-null  object 
 6   View Position                112120 non-null  object 
 7   OriginalImage[Width          112120 non-null  int64  
 8   Height]                      112120 non-null  int64  
 9   OriginalImagePixelSpacing[x  112120 non-null  float64
 10  y]                           112120 non-null  float64
dtypes: float64(2), int64(5), object(4)
memory usage: 9.4+ MB
all_xray_df.hist(bins=50, figsize=(20,15))
plt.show()
/opt/conda/lib/python3.7/site-packages/pandas/plotting/_matplotlib/tools.py:298: MatplotlibDeprecationWarning: 
The rowNum attribute was deprecated in Matplotlib 3.2 and will be removed two minor releases later. Use ax.get_subplotspec().rowspan.start instead.
  layout[ax.rowNum, ax.colNum] = ax.get_visible()
/opt/conda/lib/python3.7/site-packages/pandas/plotting/_matplotlib/tools.py:298: MatplotlibDeprecationWarning: 
The colNum attribute was deprecated in Matplotlib 3.2 and will be removed two minor releases later. Use ax.get_subplotspec().colspan.start instead.
  layout[ax.rowNum, ax.colNum] = ax.get_visible()
/opt/conda/lib/python3.7/site-packages/pandas/plotting/_matplotlib/tools.py:304: MatplotlibDeprecationWarning: 
The rowNum attribute was deprecated in Matplotlib 3.2 and will be removed two minor releases later. Use ax.get_subplotspec().rowspan.start instead.
  if not layout[ax.rowNum + 1, ax.colNum]:
/opt/conda/lib/python3.7/site-packages/pandas/plotting/_matplotlib/tools.py:304: MatplotlibDeprecationWarning: 
The colNum attribute was deprecated in Matplotlib 3.2 and will be removed two minor releases later. Use ax.get_subplotspec().colspan.start instead.
  if not layout[ax.rowNum + 1, ax.colNum]:
  • In terms of Patient/X-Ray related type of attributes that are possibly more important to understand are 'Age' , 'Gender' and 'View Position'.
  • Create a plotting function to use to further examine specific attributes/features
  • Adding to the plots is percent value counts for each category for each attribute/feature using value_counts() for categorical and min/max for continous.
cols = ['Patient Age', 'Patient Gender', 'View Position']
# function for plotting
def plot_hist(d, l):
    plt.figure(figsize=(6,6))
    plt.title(f'{l} distribution')
    plt.hist(d)
    plt.xlabel(l)
    plt.show()
    if l == 'Patient Age':
        print(f'Values Range:\nMin:{d.min()}\nMax:{d.max()}')
    else:
        print(d.value_counts(normalize=True))

Data distribution

for c in cols:
    plot_hist(all_xray_df[c], c)
Values Range:
Min:1
Max:414
M    0.56493
F    0.43507
Name: Patient Gender, dtype: float64
PA    0.600339
AP    0.399661
Name: View Position, dtype: float64

Patient Position We have two types of positions PA and AP

Patient Gender it's %56 Males and %43 Female. Not exactly a balanced distribution, but at the same time not extremely skewed to the point that it may cause concern undersampling or oversampling strategies.

Patient Age Note Max age we have 414 which is not a correct value. This hints that we have outliers in the data that could skew the results.

Remove Outliers

There are outliers in Patient Age indicating data entry problem.

Example: Patient Ages around 400 range does not make sense. But the fact they have been diagnosed already indicate that these are real patients, with diagnosis and a Patient ID. We will drop these records since they represent a small number around 16 records that will be dropped.

all_xray_df[all_xray_df['Patient Age'] > 100]['Patient Age']
20852     412
46965     414
48284     148
55742     148
58650     150
62929     149
74884     152
78795     151
84810     411
85404     412
86264     413
91369     412
95794     153
98495     154
101194    155
104590    155
Name: Patient Age, dtype: int64
len(all_xray_df[all_xray_df['Patient Age'] > 100]['Patient Age'])
16
all_xray_df[all_xray_df['Patient Age'] > 100]['Patient Age'].hist()
<matplotlib.axes._subplots.AxesSubplot at 0x7fb6dba85f50>
idx = all_xray_df[all_xray_df['Patient Age'] > 100].index.tolist()
all_xray_df.drop(index=idx, inplace=True)
  • Now let's re-examine the Age distribution after we dropped the outlier records
all_xray_df[all_xray_df['Patient Age'] < 100]['Patient Age'].hist()
<matplotlib.axes._subplots.AxesSubplot at 0x7fb6db9bf5d0>
# what is the current maximum age now?
all_xray_df[all_xray_df['Patient Age'] < 100]['Patient Age'].max()
95

Data Transformation for Further Analysis

  • Finding Labels contains a bunch of diagnostic labels concatenated together via | pipes. This will make it hard to understand the relationship between different diagnostics, their distribution ..etc.
  • Below we will perform a transformation to split each diagnostic and pivot it from ROWS to COLUMNS, this way each diagnostic becomes a feature with 0 or 1 indicating no-presence or presence of the disease.
  • This technique is similar to Dummy Variables or OneHotEncoding
# Different ways to do this:
# option 1: np.unique(list(chain(*[i.split('|') for i in all_xray_df['Finding Labels']])))
# option 2: np.unique(list(chain.from_iterable([i.split('|') for i in all_xray_df['Finding Labels']])))
# option 3: list(set(list(chain.from_iterable([i.split('|') for i in all_xray_df['Finding Labels']]))))

# chain vs chain.from_iterable approach 
# using list comprehension with chain

all_labels = list(set(list(chain.from_iterable([i.split('|') for i in all_xray_df['Finding Labels']]))))
all_labels
['Infiltration',
 'Effusion',
 'No Finding',
 'Nodule',
 'Pleural_Thickening',
 'Pneumothorax',
 'Hernia',
 'Atelectasis',
 'Emphysema',
 'Pneumonia',
 'Fibrosis',
 'Mass',
 'Consolidation',
 'Edema',
 'Cardiomegaly']
# Now, let's create the columns for each disease and check for presence or no presence
for c_label in all_labels:
    all_xray_df[c_label] = all_xray_df['Finding Labels'].map(lambda finding: 1.0 if c_label in finding else 0)
all_xray_df.sample(3)
Image Index Finding Labels Follow-up # Patient ID Patient Age Patient Gender View Position OriginalImage[Width Height] OriginalImagePixelSpacing[x ... Pneumothorax Hernia Atelectasis Emphysema Pneumonia Fibrosis Mass Consolidation Edema Cardiomegaly
23842 00006289_005.png Pneumonia 5 6289 59 F AP 2500 2048 0.168 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
55702 00013935_004.png Nodule 4 13935 35 F PA 2542 2605 0.143 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
76095 00018669_004.png Atelectasis|Effusion|Infiltration 4 18669 63 M AP 3056 2544 0.139 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

3 rows × 26 columns

# check distribution for each disease (percent) and sort them
(all_xray_df[all_labels].sum()/len(all_xray_df)).nlargest(20)
No Finding            0.538366
Infiltration          0.177433
Effusion              0.118783
Atelectasis           0.103101
Nodule                0.056474
Mass                  0.051550
Pneumothorax          0.047286
Consolidation         0.041631
Pleural_Thickening    0.030186
Cardiomegaly          0.024763
Emphysema             0.022443
Edema                 0.020535
Fibrosis              0.015040
Pneumonia             0.012756
Hernia                0.002025
dtype: float64

Analysis from above data

  • Notice from above, around 54% of the records have No Findings, the remaining diseases correspond to the remaining 46%
  • Penumonia has a very low occurance around 1.2%. So, we need to pay attention when we train the data. Since it is a rare occurance in comparison, we need to watch for the imbalance of data.
    • Additionally, for validation or testing we need to keep this in mind since this percentage represents real world expectation on how often this disease occurs.

Examin Pneumonia specific data (where Pneumonia has been identified as a diagnostic)

Age distribution

ppage = all_xray_df[(all_xray_df['Pneumonia'] == 1) & (all_xray_df['Patient Age'] < 100)]['Patient Age']
ppage.hist()
plt.title('Pneumonia - Patient Age Distribution')
plt.xlabel('Patient Age')
print(f'Pneumonia Age range between {ppage.min()} and {ppage.max()}')
print(f"Population Age range between {all_xray_df['Patient Age'].min()} and {all_xray_df['Patient Age'].max()}")
Pneumonia Age range between 2 and 90
Population Age range between 1 and 95
mppage = all_xray_df[(all_xray_df['Pneumonia'] == 1) & (all_xray_df['Patient Age'] < 100) & (all_xray_df['Patient Gender'] == 'M')]['Patient Age']
print(f'Pneumonia Male Age range between {mppage.min()} and {mppage.max()}')
mppage.hist()
plt.title('Pneumonia Male Patients Age Distribution')
plt.xlabel('Male Ages')
plt.show()
Pneumonia Male Age range between 2 and 87
fppage = all_xray_df[(all_xray_df['Pneumonia'] == 1) & (all_xray_df['Patient Age'] < 100) & (all_xray_df['Patient Gender'] == 'F')]['Patient Age']
print(f'Pneumonia Female Age range between {fppage.min()} and {fppage.max()}')
fppage.hist()
plt.title('Pneumonia Female Patients Age Distribution')
plt.xlabel('Female Ages')
plt.show()
Pneumonia Female Age range between 3 and 90

Pneumonia Cases Specific Analysis

Further examine the Data where Pneumonia is True (records that has Penumonia)

plt.figure(figsize=(12,7))
all_xray_df[all_labels].sum().sort_values(ascending=False).plot(kind='bar')
plt.ylabel('Number of images with labels')
plt.xlabel('Finding Label')
plt.title('Number of cases/records per Diagnostic Label')
plt.show()
plt.figure(figsize=(12,7))
(all_xray_df[all_labels].sum()/len(all_xray_df)).sort_values(ascending=False).plot(kind='bar')
plt.ylabel('Number of images with labels')
plt.xlabel('Finding Label')
plt.title('Percent of cases/records per Diagnostic Label')
plt.show()
d = (all_xray_df[all_labels].sum()*100/len(all_xray_df)).sort_values(ascending=False)
v = (all_xray_df[all_labels].sum()*100/len(all_xray_df)).sort_values(ascending=False).values

plt.figure(figsize=(12,7))
ax = d.plot(kind='barh')
ax.set(ylabel='Number of images with labels')
ax.set(xlabel='Finding Label')
for i, v in enumerate(v):
    ax.text(v+0.5, i, str(round(v,2))+'%', color='black')
plt.show()
  • Let's ignore No Finding and focus on Diseases Only
diagnostics = ['Consolidation',
 'Cardiomegaly',
 'Atelectasis',
 'Nodule',
 'Edema',
 'Emphysema',
 'Hernia',
 'Mass',
 'Pleural_Thickening',
 'Effusion',
 'Pneumonia',
 'Fibrosis',
 'Pneumothorax',
 'Infiltration']
d_df = all_xray_df[all_xray_df['Finding Labels'] != 'No Finding']

# or d_df = all_xray_df[all_xray_df['Pneumonia'] == 1]

d = (d_df[diagnostics].sum()*100/len(d_df)).sort_values(ascending=False)
v = (d_df[diagnostics].sum()*100/len(d_df)).sort_values(ascending=False).values

plt.figure(figsize=(12,7))
ax = d.plot(kind='barh')
ax.set(xlabel='Percent of images with labels')
ax.set(ylabel='Diseasee')
ax.set(title='Diseases ONLY Distribution')
for i, v in enumerate(v):
    ax.text(v+0.5, i, str(round(v,2))+'%', color='black')
plt.show()
  • Notice above: that of all the diseases, the majority %86.5 fall between the top 3 diseases Infiltration, Effusion, Atelectas
  • Pneumonia is around %2.8
  • Now let's shift our attention to Pneumonia and co-occurances
d = d_df[diagnostics].sum().sort_values(ascending=False)
v = d_df[diagnostics].sum().sort_values(ascending=False).values

plt.figure(figsize=(12,7))
ax = d.plot(kind='barh')
ax.set(title='Frequency Distribution Label wise of each disease that co occur with Pneumonia')
ax.set(ylabel='Disease')
ax.set(xlabel='Number of images with labels')
for i, v in enumerate(v):
    ax.text(v+10, i, v, color='black')
plt.show()
d_df = d_df[d_df['Pneumonia'] == 1]

d = (d_df[diagnostics].sum()*100/len(d_df)).sort_values(ascending=False)
v = (d_df[diagnostics].sum()*100/len(d_df)).sort_values(ascending=False).values

plt.figure(figsize=(12,7))
ax = d.plot(kind='barh')
ax.set(xlabel='Percent of images with labels')
ax.set(ylabel='Disease')
ax.set(title='Frequency Distribution Label wise of each disease that co occur with Pneumonia')
for i, v in enumerate(v):
    ax.text(v+0.5, i, str(round(v,2))+'%', color='black')
plt.show()
  • Note from above: The dominatint three diseases identified before Infiltration, Effusion, Atelectas are also amongst the top 5 co-occuring diseases with Pneumonia at %42.3, %18.7 and %18.3, with Infiltration dominating at %42.3 alone
d = all_xray_df[all_xray_df['Pneumonia'] == 1]
(d[diagnostics].sum()*100/len(d)).sort_values(ascending=False)
(d[diagnostics].sum()*100/len(d)).sort_values(ascending=False).values
array([100.        ,  42.30769231,  23.77622378,  18.74125874,
        18.32167832,   8.6013986 ,   4.96503497,   4.8951049 ,
         3.35664336,   2.86713287,   2.86713287,   1.60839161,
         0.76923077,   0.20979021])
  • look at how other data being distributed for Pneumonia cases
d_df.shape
(1430, 26)
for c in cols:
    plot_hist(d_df[c], c)
Values Range:
Min:2
Max:90
M    0.585315
F    0.414685
Name: Patient Gender, dtype: float64
AP    0.559441
PA    0.440559
Name: View Position, dtype: float64
all_xray_df.columns
Index(['Image Index', 'Finding Labels', 'Follow-up #', 'Patient ID',
       'Patient Age', 'Patient Gender', 'View Position', 'OriginalImage[Width',
       'Height]', 'OriginalImagePixelSpacing[x', 'y]', 'Infiltration',
       'Effusion', 'No Finding', 'Nodule', 'Pleural_Thickening',
       'Pneumothorax', 'Hernia', 'Atelectasis', 'Emphysema', 'Pneumonia',
       'Fibrosis', 'Mass', 'Consolidation', 'Edema', 'Cardiomegaly'],
      dtype='object')

Correlation (Pearson)

all_xray_df[diagnostics].corr()
Consolidation Cardiomegaly Atelectasis Nodule Edema Emphysema Hernia Mass Pleural_Thickening Effusion Pneumonia Fibrosis Pneumothorax Infiltration
Consolidation 1.000000 0.015355 0.108945 0.031812 0.020835 -0.000526 -0.005414 0.074610 0.028742 0.101130 0.025257 0.003233 0.000487 0.045931
Cardiomegaly 0.015355 1.000000 0.015817 -0.012128 0.028331 -0.007093 0.001761 -0.010670 0.009126 0.130096 0.002859 0.004834 -0.022249 0.014191
Atelectasis 0.108945 0.015817 1.000000 -0.007972 -0.003379 0.032598 0.010830 0.018995 0.025221 0.172462 0.029948 0.011128 0.031437 0.093157
Nodule 0.031812 -0.012128 -0.007972 1.000000 0.000271 -0.007067 -0.002424 0.101300 0.049663 0.019109 -0.003705 0.022474 0.007579 0.042754
Edema 0.020835 0.028331 -0.003379 0.000271 1.000000 -0.009200 -0.002325 0.002939 -0.002018 0.062127 0.174110 -0.013241 -0.022479 0.094265
Emphysema -0.000526 -0.007093 0.032598 -0.007067 -0.009200 1.000000 -0.001466 0.023232 0.026416 0.011195 -0.004880 -0.000910 0.178194 0.000406
Hernia -0.005414 0.001761 0.010830 -0.002424 -0.002325 -0.001466 1.000000 0.011934 0.001331 -0.003658 0.000185 0.007477 -0.001621 -0.003780
Mass 0.074610 -0.010670 0.018995 0.101300 0.002939 0.023232 0.011934 1.000000 0.065206 0.070770 -0.000977 0.009972 0.029980 0.013898
Pleural_Thickening 0.028742 0.009126 0.025221 0.049663 -0.002018 0.026416 0.001331 0.065206 1.000000 0.072037 0.002246 0.053590 0.031682 0.020411
Effusion 0.101130 0.130096 0.172462 0.019109 0.062127 0.011195 -0.003658 0.070770 0.072037 1.000000 0.024112 -0.002779 0.047587 0.118164
Pneumonia 0.025257 0.002859 0.029948 -0.003705 0.174110 -0.004880 0.000185 -0.000977 0.002246 0.024112 1.000000 -0.006862 -0.009969 0.073088
Fibrosis 0.003233 0.004834 0.011128 0.022474 -0.013241 -0.000910 0.007477 0.009972 0.053590 -0.002779 -0.006862 1.000000 0.000095 0.008796
Pneumothorax 0.000487 -0.022249 0.031437 0.007579 -0.022479 0.178194 -0.001621 0.029980 0.031682 0.047587 -0.009969 0.000095 1.000000 0.000597
Infiltration 0.045931 0.014191 0.093157 0.042754 0.094265 0.000406 -0.003780 0.013898 0.020411 0.118164 0.073088 0.008796 0.000597 1.000000
plt.figure(figsize=(12,9))
sns.heatmap(all_xray_df[diagnostics].corr(), cmap='Blues')
<matplotlib.axes._subplots.AxesSubplot at 0x7fb6dc3bd610>

Just looking at the heatmap above, for Pneumonia there is a stronger correlation with Edema and Infiltration cases. Seconday correlation would be Atelectasis and Consolidation cases.

Number of Pneumonia Cases compared to data population

print(f'Number of Pneumonia cases : {len(d_df)}')
print(f'Number of non-Pneumonia cases : {len(all_xray_df)-len(d_df)}')
print(f'Penumonia/Total : {round(len(d_df)*100/len(all_xray_df),2)}%' )
Number of Pneumonia cases : 1430
Number of non-Pneumonia cases : 110674
Penumonia/Total : 1.28%
Number of Images/Records per Patient (top 30)
# Distribution of images per patient analysis 
plt.figure(figsize=(10,6))
all_xray_df.groupby(by='Patient ID')['Image Index'].count().nlargest(30).plot(kind='bar')
plt.ylabel('Number of Records per Patient')
plt.title('Top 30 Patients with Largest Number of Records Example')
plt.show()
# Distribution of images per patient analysis 
all_xray_df.groupby(by='Patient ID')['Image Index'].count().nsmallest(10)
Patient ID
2     1
4     1
6     1
7     1
9     1
10    1
12    1
14    1
15    1
16    1
Name: Image Index, dtype: int64

Common diseases associated with Pneumonia

plt.figure(figsize=(12,6))
d_df['Finding Labels'].value_counts()[0:30].plot(kind='bar')
plt.show()
non_p_labels =[i for i in all_labels if i !='Pneumonia']
plt.figure(figsize=(12,6))
ax = d_df[non_p_labels].sum().nlargest(14).plot(kind='barh')
d_df[non_p_labels].sum().nlargest(20)
Infiltration          605.0
Edema                 340.0
Effusion              268.0
Atelectasis           262.0
Consolidation         123.0
Mass                   71.0
Nodule                 70.0
Pleural_Thickening     48.0
Pneumothorax           41.0
Cardiomegaly           41.0
Emphysema              23.0
Fibrosis               11.0
Hernia                  3.0
No Finding              0.0
dtype: float64
# total cases with Pneumonia and Other Conditions
total_cases_with_Pneumonia = d_df[non_p_labels].sum().sum()
total_cases_with_Pneumonia
1906.0

Number of Diseases per Patient (top 10)

diseases_per_patient = all_xray_df.groupby(by='Patient ID')[diagnostics].max().sum(axis=1)
diseases_per_patient.nlargest(10)
Patient ID
12021    13.0
14022    13.0
16778    13.0
26451    13.0
1836     12.0
10092    12.0
12094    12.0
12834    12.0
13111    12.0
21201    12.0
dtype: float64

Examine Images from data/images_001/images - Pick bottom 5 example

# 5 random images
img_list = !ls /data/images_001/images |sort -R |tail -5 
img_list
['00000632_000.png',
 '00000499_000.png',
 '00001294_000.png',
 '00000491_022.png',
 '00000627_018.png']
for img in img_list:
    imr = imread(f'/data/images_001/images/{img}')
    imshow(imr, cmap='gray')
    plt.show()

sample_df for image analysis

sample_df.head()
Image Index Finding Labels Follow-up # Patient ID Patient Age Patient Gender View Position OriginalImageWidth OriginalImageHeight OriginalImagePixelSpacing_x OriginalImagePixelSpacing_y
0 00000013_005.png Emphysema|Infiltration|Pleural_Thickening|Pneu... 5 13 060Y M AP 3056 2544 0.139 0.139
1 00000013_026.png Cardiomegaly|Emphysema 26 13 057Y M AP 2500 2048 0.168 0.168
2 00000017_001.png No Finding 1 17 077Y M AP 2500 2048 0.168 0.168
3 00000030_001.png Atelectasis 1 30 079Y M PA 2992 2991 0.143 0.143
4 00000032_001.png Cardiomegaly|Edema|Effusion 1 32 055Y F AP 2500 2048 0.168 0.168

Examine DCOM files in the parent folder

dcoms = !ls | grep .dcm
dcoms
['test1.dcm', 'test2.dcm', 'test3.dcm', 'test4.dcm', 'test5.dcm', 'test6.dcm']

View images and intensity distribution

for dcom in dcoms:
    fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2)
    dcm = pydicom.dcmread(dcom)
    img = (dcm.pixel_array - np.mean(dcm.pixel_array))/np.std(dcm.pixel_array)
    ax1.imshow(dcm.pixel_array,cmap='gray')
    ax2.hist(img.ravel(), bins = 256)
    plt.show()

Examin Images based on sample_df data set (labeled data)

# Add file path to images
data_sample_path = {os.path.basename(x): x for x in 
                   glob(os.path.join('/data','images*','*','*.png'))}
sample_df['path'] = sample_df['Image Index'].map(data_sample_path.get)
sample_df.head()
Image Index Finding Labels Follow-up # Patient ID Patient Age Patient Gender View Position OriginalImageWidth OriginalImageHeight OriginalImagePixelSpacing_x OriginalImagePixelSpacing_y path
0 00000013_005.png Emphysema|Infiltration|Pleural_Thickening|Pneu... 5 13 060Y M AP 3056 2544 0.139 0.139 /data/images_001/images/00000013_005.png
1 00000013_026.png Cardiomegaly|Emphysema 26 13 057Y M AP 2500 2048 0.168 0.168 /data/images_001/images/00000013_026.png
2 00000017_001.png No Finding 1 17 077Y M AP 2500 2048 0.168 0.168 /data/images_001/images/00000017_001.png
3 00000030_001.png Atelectasis 1 30 079Y M PA 2992 2991 0.143 0.143 /data/images_001/images/00000030_001.png
4 00000032_001.png Cardiomegaly|Edema|Effusion 1 32 055Y F AP 2500 2048 0.168 0.168 /data/images_001/images/00000032_001.png
# test
imshow(sample_df['path'][0])
<matplotlib.image.AxesImage at 0x7fb6d52aacd0>

Data Transformation

  • Make simnilar transformation that we made for the all_xray_df
sample_all_labels = list(set(list(chain.from_iterable([i.split('|') for i in sample_df['Finding Labels']]))))

for c_label in sample_all_labels:
    sample_df[c_label] = sample_df['Finding Labels'].map(lambda finding: 1.0 if c_label in finding else 0)
sample_df.head()
Image Index Finding Labels Follow-up # Patient ID Patient Age Patient Gender View Position OriginalImageWidth OriginalImageHeight OriginalImagePixelSpacing_x ... Pneumothorax Hernia Atelectasis Emphysema Pneumonia Fibrosis Mass Consolidation Edema Cardiomegaly
0 00000013_005.png Emphysema|Infiltration|Pleural_Thickening|Pneu... 5 13 060Y M AP 3056 2544 0.139 ... 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
1 00000013_026.png Cardiomegaly|Emphysema 26 13 057Y M AP 2500 2048 0.168 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0
2 00000017_001.png No Finding 1 17 077Y M AP 2500 2048 0.168 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 00000030_001.png Atelectasis 1 30 079Y M PA 2992 2991 0.143 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 00000032_001.png Cardiomegaly|Edema|Effusion 1 32 055Y F AP 2500 2048 0.168 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0

5 rows × 27 columns

# split data: Pneumonia Only
sample_df_p = sample_df[sample_df['Pneumonia'] == 1]
# sample_df_p = sample_df[sample_df['Finding Labels'] == 'Pneumonia']

Lets now look at only Pneumonia images and pixel distribution

# Only Pneumonia

for d in sample_df_p.head(5)['path']:
    fig, (ax1, ax2) = plt.subplots(1, 2)
    img = io.imread(d)
    ax1.imshow(img, cmap='gray')
    img = (img-np.mean(img)) / np.std(img)
    ax2.hist(img.ravel(), bins = 256)
    plt.show()

Lets now look at No Findings images and pixel distribution

# Non Pneumonia
sample_df_nop = sample_df[sample_df['No Finding'] == 1]

for d in sample_df_nop.head(5)['path']:
    fig, (ax1, ax2) = plt.subplots(1, 2)
    img = io.imread(d)
    ax1.imshow(img, cmap='gray')
    img = (img-np.mean(img)) / np.std(img)
    ax2.hist(img.ravel(), bins = 256)
    plt.show()

Pixel Intensity Distribution

Comparing Pneumonia with Non-Pneumonia (not normalized)

  • We will focus more on pixel distribution and not look at the X-ray images to get a larger view of the distribution
# Same as above but more focused on intesnity distribution
# Only Pneumonia
for d in sample_df_p.head(5)['path']:
    plt.figure(figsize=(10,6))
    img = io.imread(d)
#     img = (img-np.mean(img)) / np.std(img)
#     ax = plt.hist(img.ravel(), bins = 256)
    sns.distplot(img, bins=256, kde=True, rug=False)
    plt.title(f'Pneumonia image {d}')
    plt.show()
# No Finding (Clear) 

for d in sample_df_nop.head(5)['path']:
    plt.figure(figsize=(10,6))
    img = io.imread(d)
#     img = (img-np.mean(img)) / np.std(img)
#     ax = plt.hist(img.ravel(), bins = 256)
    sns.distplot(img, bins=256, kde=True, rug=False)
    plt.title(f'Non-Pneumonia image {d}')
    plt.show()
  • From the Pixel Intesity Distribution charts above, the peak from values 0 to 40 represent black color, which is mainly the boundary/ackground color in the X-Rays.
  • Actual representation of the chest images, are more toward white/grayish color (lighter tones).
  • If we remove or ignore the values from 0-40 we will shift the focus on how the chest x-rays themeselves being represented. In other words, ignore the black boundaries/background color.

Replotting above but this time ignoring black boundary

# Same as above but more focused on intesnity distribution
# Only Pneumonia
for d in sample_df_p.head(5)['path']:
    plt.figure(figsize=(10,6))
    img = io.imread(d)
    img = img[img > 40]
#     img = (img-np.mean(img)) / np.std(img)
#     ax = plt.hist(img.ravel(), bins = 256)
    sns.distplot(img, bins=256, kde=True, rug=False)
    plt.title(f'Pneumonia image {d}')
    plt.show()
for d in sample_df_nop.head(5)['path']:
    plt.figure(figsize=(10,6))
    img = io.imread(d)
#     img = (img-np.mean(img)) / np.std(img)
#     ax = plt.hist(img.ravel(), bins = 256)
    img = img[img > 30]
    sns.distplot(img, bins=256, kde=True, rug=False)
    plt.title(f'Non-Pneumonia image {d}')
    plt.show()

Pixel Intensity Distribution Analysis for Diseases

  • Now let's examine the similarity and differences between Pneumonia and other diseases
  • Will do this for 5 groups, picking one picture for each diseas as representation and then plot them.
coll = []
for i in range(5):
    l = {}
    for label in all_labels:
        l[label] = sample_df[sample_df[label] == 1]['path'].iloc[i]
    coll.append(l)
coll[0]
{'Infiltration': '/data/images_001/images/00000013_005.png',
 'Effusion': '/data/images_001/images/00000032_001.png',
 'No Finding': '/data/images_001/images/00000017_001.png',
 'Nodule': '/data/images_001/images/00000061_025.png',
 'Pleural_Thickening': '/data/images_001/images/00000013_005.png',
 'Pneumothorax': '/data/images_001/images/00000013_005.png',
 'Hernia': '/data/images_004/images/00006624_002.png',
 'Atelectasis': '/data/images_001/images/00000030_001.png',
 'Emphysema': '/data/images_001/images/00000013_005.png',
 'Pneumonia': '/data/images_002/images/00001373_010.png',
 'Fibrosis': '/data/images_001/images/00000181_001.png',
 'Mass': '/data/images_001/images/00000040_003.png',
 'Consolidation': '/data/images_001/images/00000040_003.png',
 'Edema': '/data/images_001/images/00000032_001.png',
 'Cardiomegaly': '/data/images_001/images/00000013_026.png'}
difference = {}
fig, axes = plt.subplots(4, 4, figsize = (16, 16))
fig.suptitle('Distributions of Pixel Intensity')
p = io.imread(coll[0]['Pneumonia'])
p = p[p > 40]
pneumonia = np.array(p.ravel())
for ax, label, path in zip(axes.flatten(), coll[0].keys(), coll[0].values()):
    img2 = io.imread(path)
    img2 = img2[img2 > 40]
    img2 = img2.ravel()
    sns.distplot(img2, ax=ax, bins=256)
#     diff = abs(pneumonia_arr-np.array(img2)).sum()
#     difference[label] = diff
#     ax.set(title=f'{label} - {diff}')
    if label == 'Pneumonia':
        ax.set_title(f'{label} , m: {round(img2.mean(),2)} s: {round(img2.std(),2)}', color="red")
    else:
        ax.set_title(f'{label} , m: {round(img2.mean(),2)} s: {round(img2.std(),2)}')
  • Examinig above, the closest three diseases in terms of shape, mean and std (spread), is Edeam, Mass and nodule
# {k: v for k, v in sorted(difference.items(), key=lambda item: item[1])}
difference = {}
fig, axes = plt.subplots(4, 4, figsize = (16, 16))
fig.suptitle('Distributions of Pixel Intensity')
p = io.imread(coll[1]['Pneumonia'])
p = p[p > 40]
pneumonia = np.array(p.ravel())
for ax, label, path in zip(axes.flatten(), coll[1].keys(), coll[1].values()):
    img2 = io.imread(path)
    img2 = img2[img2 > 40]
    img2 = img2.ravel()
    sns.distplot(img2, ax=ax, bins=256)
#     diff = abs(pneumonia_arr-np.array(img2)).sum()
#     difference[label] = diff
#     ax.set(title=f'{label} - {diff}')
    if label == 'Pneumonia':
        ax.set_title(f'{label} , m: {round(img2.mean(),2)} s: {round(img2.std(),2)}', color="red")
    else:
        ax.set_title(f'{label} , m: {round(img2.mean(),2)} s: {round(img2.std(),2)}')
  • Examinig above, the closest three diseases in terms of shape, mean and std (spread), are Edema, Fibrosis and Infiltration
# {k: v for k, v in sorted(difference.items(), key=lambda item: item[1])}
difference = {}
fig, axes = plt.subplots(4, 4, figsize = (16, 16))
fig.suptitle('Distributions of Pixel Intensity')
p = io.imread(coll[2]['Pneumonia'])
p = p[p > 40]
pneumonia = np.array(p.ravel())
for ax, label, path in zip(axes.flatten(), coll[2].keys(), coll[2].values()):
    img2 = io.imread(path)
    img2 = img2[img2 > 40]
    img2 = img2.ravel()
    sns.distplot(img2, ax=ax, bins=256)
#     diff = abs(pneumonia_arr-np.array(img2)).sum()
#     difference[label] = diff
#     ax.set(title=f'{label} - {diff}')
    if label == 'Pneumonia':
        ax.set_title(f'{label} , m: {round(img2.mean(),2)} s: {round(img2.std(),2)}', color="red")
    else:
        ax.set_title(f'{label} , m: {round(img2.mean(),2)} s: {round(img2.std(),2)}')
  • Examinig above, the closest three diseases in terms of shape, mean and std (spread), are Nodule, Mass and Infiltration
# {k: v for k, v in sorted(difference.items(), key=lambda item: item[1])}
difference = {}
fig, axes = plt.subplots(4, 4, figsize = (16, 16))
fig.suptitle('Distributions of Pixel Intensity')
p = io.imread(coll[3]['Pneumonia'])
p = p[p > 40]
pneumonia = np.array(p.ravel())
for ax, label, path in zip(axes.flatten(), coll[3].keys(), coll[3].values()):
    img2 = io.imread(path)
    img2 = img2[img2 > 40]
    img2 = img2.ravel()
    sns.distplot(img2, ax=ax, bins=256)
#     diff = abs(pneumonia_arr-np.array(img2)).sum()
#     difference[label] = diff
#     ax.set(title=f'{label} - {diff}')
    if label == 'Pneumonia':
        ax.set_title(f'{label} , m: {round(img2.mean(),2)} s: {round(img2.std(),2)}', color="red")
    else:
        ax.set_title(f'{label} , m: {round(img2.mean(),2)} s: {round(img2.std(),2)}')
  • Examinig above, the closest two diseases in terms of shape, mean and std (spread), are Mass and Infiltration
# {k: v for k, v in sorted(difference.items(), key=lambda item: item[1])}
difference = {}
fig, axes = plt.subplots(4, 4, figsize = (16, 16))
fig.suptitle('Distributions of Pixel Intensity')
pneumonia = io.imread(coll[0]['Pneumonia']).ravel()
p = io.imread(coll[4]['Pneumonia'])
p = p[p > 40]
pneumonia = np.array(p.ravel())
for ax, label, path in zip(axes.flatten(), coll[4].keys(), coll[4].values()):
    img2 = io.imread(path)
    img2 = img2[img2 > 40]
    img2 = img2.ravel()
    sns.distplot(img2, ax=ax, bins=256)
#     diff = abs(np.array(pneumonia)-np.array(img2)).sum()
#     difference[label] = diff
    if label == 'Pneumonia':
        ax.set_title(f'{label} , m: {round(img2.mean(),2)} s: {round(img2.std(),2)}', color="red")
    else:
        ax.set_title(f'{label} , m: {round(img2.mean(),2)} s: {round(img2.std(),2)}')
  • Examinig above, the closest two diseases in terms of shape, mean and std (spread), is Mass and Infiltration