AI for Healthcare (Part 1)
Part 1 Exploratory Data Analysis using Python for 2D Imaging
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
from glob import glob
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from itertools import chain
import pydicom #for working with DICOM files
from skimage.io import imread, imshow
from skimage import io
EDA is open-ended, and it is up to the analyst to decide on what different ways to slice and dice the data. A good starting point is to look at the requirements for the FDA documentation that we will need to produce as submit with our Machine Learning model.
This EDA will focus on investigating case of pneumonia and understand how data looks in the wild. E.g. what other types of diseases it's commonly found with, how often it is found, what ages it affects, etc.
Note that this NIH
dataset was not specifically acquired for pneumonia. So, while this is a representation of 'pneumonia in the wild,' the prevalence of pneumonia may be different if you were to take only chest x-rays that were acquired in an ER setting with suspicion of pneumonia.
Some of the EDA tasks we will perform may include investigating:
- The patient demographic data such as gender, age, patient position,etc. (as it is available)
- The x-ray views taken (i.e. view position)
- The number of cases including:
- number of pneumonia cases,
- number of non-pneumonia cases
- The distribution of other diseases that are comorbid with pneumonia
- Number of disease per patient
- Pixel-level assessments of the imaging data for healthy & disease states of interest (e.g. histograms of intensity values) and compare distributions across diseases.
## Load NIH data
all_xray_df = pd.read_csv('/data/Data_Entry_2017.csv')
all_xray_df.sample(3)
## We will use'sample_labels.csv' data for pixel level assessments
sample_df = pd.read_csv('sample_labels.csv')
sample_df.sample(5)
Let's look at the shape of our data:
print(all_xray_df.shape)
print(sample_df.shape)
all_xray_df.sample(3)
all_xray_df.info()
- Noticed an empty column with no records
Unnamed: 11
. - Drop
Unnamed: 11
column since it has no values.
all_xray_df.drop(columns=['Unnamed: 11'], inplace=True)
all_xray_df.info()
all_xray_df.hist(bins=50, figsize=(20,15))
plt.show()
- In terms of Patient/X-Ray related type of attributes that are possibly more important to understand are 'Age' , 'Gender' and 'View Position'.
- Create a plotting function to use to further examine specific attributes/features
- Adding to the plots is percent value counts for each category for each attribute/feature using
value_counts()
for categorical andmin/max
for continous.
cols = ['Patient Age', 'Patient Gender', 'View Position']
# function for plotting
def plot_hist(d, l):
plt.figure(figsize=(6,6))
plt.title(f'{l} distribution')
plt.hist(d)
plt.xlabel(l)
plt.show()
if l == 'Patient Age':
print(f'Values Range:\nMin:{d.min()}\nMax:{d.max()}')
else:
print(d.value_counts(normalize=True))
for c in cols:
plot_hist(all_xray_df[c], c)
Patient Position
We have two types of positionsPA
andAP
Patient Gender
it's %56 Males and %43 Female. Not exactly a balanced distribution, but at the same time not extremely skewed to the point that it may cause concern undersampling or oversampling strategies.
Patient Age
Note Max age we have 414 which is not a correct value. This hints that we have outliers in the data that could skew the results.
There are outliers in Patient Age
indicating data entry problem.
Example: Patient Ages around 400 range does not make sense. But the fact they have been diagnosed already indicate that these are real patients, with diagnosis and a Patient ID. We will drop these records since they represent a small number around 16
records that will be dropped.
all_xray_df[all_xray_df['Patient Age'] > 100]['Patient Age']
len(all_xray_df[all_xray_df['Patient Age'] > 100]['Patient Age'])
all_xray_df[all_xray_df['Patient Age'] > 100]['Patient Age'].hist()
idx = all_xray_df[all_xray_df['Patient Age'] > 100].index.tolist()
all_xray_df.drop(index=idx, inplace=True)
- Now let's re-examine the Age distribution after we dropped the outlier records
all_xray_df[all_xray_df['Patient Age'] < 100]['Patient Age'].hist()
# what is the current maximum age now?
all_xray_df[all_xray_df['Patient Age'] < 100]['Patient Age'].max()
Data Transformation for Further Analysis
Finding Labels
contains a bunch of diagnostic labels concatenated together via|
pipes. This will make it hard to understand the relationship between different diagnostics, their distribution ..etc.- Below we will perform a transformation to split each diagnostic and pivot it from ROWS to COLUMNS, this way each diagnostic becomes a feature with
0
or1
indicating no-presence or presence of the disease. - This technique is similar to Dummy Variables or OneHotEncoding
# Different ways to do this:
# option 1: np.unique(list(chain(*[i.split('|') for i in all_xray_df['Finding Labels']])))
# option 2: np.unique(list(chain.from_iterable([i.split('|') for i in all_xray_df['Finding Labels']])))
# option 3: list(set(list(chain.from_iterable([i.split('|') for i in all_xray_df['Finding Labels']]))))
# chain vs chain.from_iterable approach
# using list comprehension with chain
all_labels = list(set(list(chain.from_iterable([i.split('|') for i in all_xray_df['Finding Labels']]))))
all_labels
# Now, let's create the columns for each disease and check for presence or no presence
for c_label in all_labels:
all_xray_df[c_label] = all_xray_df['Finding Labels'].map(lambda finding: 1.0 if c_label in finding else 0)
all_xray_df.sample(3)
# check distribution for each disease (percent) and sort them
(all_xray_df[all_labels].sum()/len(all_xray_df)).nlargest(20)
Analysis from above data
- Notice from above, around 54% of the records have
No Findings
, the remaining diseases correspond to the remaining 46% Penumonia
has a very low occurance around 1.2%. So, we need to pay attention when we train the data. Since it is a rare occurance in comparison, we need to watch for the imbalance of data.- Additionally, for validation or testing we need to keep this in mind since this percentage represents real world expectation on how often this disease occurs.
ppage = all_xray_df[(all_xray_df['Pneumonia'] == 1) & (all_xray_df['Patient Age'] < 100)]['Patient Age']
ppage.hist()
plt.title('Pneumonia - Patient Age Distribution')
plt.xlabel('Patient Age')
print(f'Pneumonia Age range between {ppage.min()} and {ppage.max()}')
print(f"Population Age range between {all_xray_df['Patient Age'].min()} and {all_xray_df['Patient Age'].max()}")
mppage = all_xray_df[(all_xray_df['Pneumonia'] == 1) & (all_xray_df['Patient Age'] < 100) & (all_xray_df['Patient Gender'] == 'M')]['Patient Age']
print(f'Pneumonia Male Age range between {mppage.min()} and {mppage.max()}')
mppage.hist()
plt.title('Pneumonia Male Patients Age Distribution')
plt.xlabel('Male Ages')
plt.show()
fppage = all_xray_df[(all_xray_df['Pneumonia'] == 1) & (all_xray_df['Patient Age'] < 100) & (all_xray_df['Patient Gender'] == 'F')]['Patient Age']
print(f'Pneumonia Female Age range between {fppage.min()} and {fppage.max()}')
fppage.hist()
plt.title('Pneumonia Female Patients Age Distribution')
plt.xlabel('Female Ages')
plt.show()
plt.figure(figsize=(12,7))
all_xray_df[all_labels].sum().sort_values(ascending=False).plot(kind='bar')
plt.ylabel('Number of images with labels')
plt.xlabel('Finding Label')
plt.title('Number of cases/records per Diagnostic Label')
plt.show()
plt.figure(figsize=(12,7))
(all_xray_df[all_labels].sum()/len(all_xray_df)).sort_values(ascending=False).plot(kind='bar')
plt.ylabel('Number of images with labels')
plt.xlabel('Finding Label')
plt.title('Percent of cases/records per Diagnostic Label')
plt.show()
d = (all_xray_df[all_labels].sum()*100/len(all_xray_df)).sort_values(ascending=False)
v = (all_xray_df[all_labels].sum()*100/len(all_xray_df)).sort_values(ascending=False).values
plt.figure(figsize=(12,7))
ax = d.plot(kind='barh')
ax.set(ylabel='Number of images with labels')
ax.set(xlabel='Finding Label')
for i, v in enumerate(v):
ax.text(v+0.5, i, str(round(v,2))+'%', color='black')
plt.show()
- Let's ignore
No Finding
and focus on Diseases Only
diagnostics = ['Consolidation',
'Cardiomegaly',
'Atelectasis',
'Nodule',
'Edema',
'Emphysema',
'Hernia',
'Mass',
'Pleural_Thickening',
'Effusion',
'Pneumonia',
'Fibrosis',
'Pneumothorax',
'Infiltration']
d_df = all_xray_df[all_xray_df['Finding Labels'] != 'No Finding']
# or d_df = all_xray_df[all_xray_df['Pneumonia'] == 1]
d = (d_df[diagnostics].sum()*100/len(d_df)).sort_values(ascending=False)
v = (d_df[diagnostics].sum()*100/len(d_df)).sort_values(ascending=False).values
plt.figure(figsize=(12,7))
ax = d.plot(kind='barh')
ax.set(xlabel='Percent of images with labels')
ax.set(ylabel='Diseasee')
ax.set(title='Diseases ONLY Distribution')
for i, v in enumerate(v):
ax.text(v+0.5, i, str(round(v,2))+'%', color='black')
plt.show()
- Notice above: that of all the diseases, the majority %86.5 fall between the top 3 diseases
Infiltration
,Effusion
,Atelectas
Pneumonia
is around %2.8
- Now let's shift our attention to Pneumonia and co-occurances
d = d_df[diagnostics].sum().sort_values(ascending=False)
v = d_df[diagnostics].sum().sort_values(ascending=False).values
plt.figure(figsize=(12,7))
ax = d.plot(kind='barh')
ax.set(title='Frequency Distribution Label wise of each disease that co occur with Pneumonia')
ax.set(ylabel='Disease')
ax.set(xlabel='Number of images with labels')
for i, v in enumerate(v):
ax.text(v+10, i, v, color='black')
plt.show()
d_df = d_df[d_df['Pneumonia'] == 1]
d = (d_df[diagnostics].sum()*100/len(d_df)).sort_values(ascending=False)
v = (d_df[diagnostics].sum()*100/len(d_df)).sort_values(ascending=False).values
plt.figure(figsize=(12,7))
ax = d.plot(kind='barh')
ax.set(xlabel='Percent of images with labels')
ax.set(ylabel='Disease')
ax.set(title='Frequency Distribution Label wise of each disease that co occur with Pneumonia')
for i, v in enumerate(v):
ax.text(v+0.5, i, str(round(v,2))+'%', color='black')
plt.show()
- Note from above: The dominatint three diseases identified before
Infiltration
,Effusion
,Atelectas
are also amongst the top 5 co-occuring diseases with Pneumonia at %42.3, %18.7 and %18.3, withInfiltration
dominating at %42.3 alone
d = all_xray_df[all_xray_df['Pneumonia'] == 1]
(d[diagnostics].sum()*100/len(d)).sort_values(ascending=False)
(d[diagnostics].sum()*100/len(d)).sort_values(ascending=False).values
- look at how other data being distributed for Pneumonia cases
d_df.shape
for c in cols:
plot_hist(d_df[c], c)
all_xray_df.columns
all_xray_df[diagnostics].corr()
plt.figure(figsize=(12,9))
sns.heatmap(all_xray_df[diagnostics].corr(), cmap='Blues')
Just looking at the heatmap above, for Pneumonia there is a stronger correlation with Edema
and Infiltration
cases. Seconday correlation would be Atelectasis
and Consolidation
cases.
print(f'Number of Pneumonia cases : {len(d_df)}')
print(f'Number of non-Pneumonia cases : {len(all_xray_df)-len(d_df)}')
print(f'Penumonia/Total : {round(len(d_df)*100/len(all_xray_df),2)}%' )
# Distribution of images per patient analysis
plt.figure(figsize=(10,6))
all_xray_df.groupby(by='Patient ID')['Image Index'].count().nlargest(30).plot(kind='bar')
plt.ylabel('Number of Records per Patient')
plt.title('Top 30 Patients with Largest Number of Records Example')
plt.show()
# Distribution of images per patient analysis
all_xray_df.groupby(by='Patient ID')['Image Index'].count().nsmallest(10)
plt.figure(figsize=(12,6))
d_df['Finding Labels'].value_counts()[0:30].plot(kind='bar')
plt.show()
non_p_labels =[i for i in all_labels if i !='Pneumonia']
plt.figure(figsize=(12,6))
ax = d_df[non_p_labels].sum().nlargest(14).plot(kind='barh')
d_df[non_p_labels].sum().nlargest(20)
# total cases with Pneumonia and Other Conditions
total_cases_with_Pneumonia = d_df[non_p_labels].sum().sum()
total_cases_with_Pneumonia
diseases_per_patient = all_xray_df.groupby(by='Patient ID')[diagnostics].max().sum(axis=1)
diseases_per_patient.nlargest(10)
# 5 random images
img_list = !ls /data/images_001/images |sort -R |tail -5
img_list
for img in img_list:
imr = imread(f'/data/images_001/images/{img}')
imshow(imr, cmap='gray')
plt.show()
sample_df.head()
dcoms = !ls | grep .dcm
dcoms
View images and intensity distribution
for dcom in dcoms:
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2)
dcm = pydicom.dcmread(dcom)
img = (dcm.pixel_array - np.mean(dcm.pixel_array))/np.std(dcm.pixel_array)
ax1.imshow(dcm.pixel_array,cmap='gray')
ax2.hist(img.ravel(), bins = 256)
plt.show()
# Add file path to images
data_sample_path = {os.path.basename(x): x for x in
glob(os.path.join('/data','images*','*','*.png'))}
sample_df['path'] = sample_df['Image Index'].map(data_sample_path.get)
sample_df.head()
# test
imshow(sample_df['path'][0])
sample_all_labels = list(set(list(chain.from_iterable([i.split('|') for i in sample_df['Finding Labels']]))))
for c_label in sample_all_labels:
sample_df[c_label] = sample_df['Finding Labels'].map(lambda finding: 1.0 if c_label in finding else 0)
sample_df.head()
# split data: Pneumonia Only
sample_df_p = sample_df[sample_df['Pneumonia'] == 1]
# sample_df_p = sample_df[sample_df['Finding Labels'] == 'Pneumonia']
# Only Pneumonia
for d in sample_df_p.head(5)['path']:
fig, (ax1, ax2) = plt.subplots(1, 2)
img = io.imread(d)
ax1.imshow(img, cmap='gray')
img = (img-np.mean(img)) / np.std(img)
ax2.hist(img.ravel(), bins = 256)
plt.show()
# Non Pneumonia
sample_df_nop = sample_df[sample_df['No Finding'] == 1]
for d in sample_df_nop.head(5)['path']:
fig, (ax1, ax2) = plt.subplots(1, 2)
img = io.imread(d)
ax1.imshow(img, cmap='gray')
img = (img-np.mean(img)) / np.std(img)
ax2.hist(img.ravel(), bins = 256)
plt.show()
# Same as above but more focused on intesnity distribution
# Only Pneumonia
for d in sample_df_p.head(5)['path']:
plt.figure(figsize=(10,6))
img = io.imread(d)
# img = (img-np.mean(img)) / np.std(img)
# ax = plt.hist(img.ravel(), bins = 256)
sns.distplot(img, bins=256, kde=True, rug=False)
plt.title(f'Pneumonia image {d}')
plt.show()
# No Finding (Clear)
for d in sample_df_nop.head(5)['path']:
plt.figure(figsize=(10,6))
img = io.imread(d)
# img = (img-np.mean(img)) / np.std(img)
# ax = plt.hist(img.ravel(), bins = 256)
sns.distplot(img, bins=256, kde=True, rug=False)
plt.title(f'Non-Pneumonia image {d}')
plt.show()
- From the Pixel Intesity Distribution charts above, the peak from values 0 to 40 represent black color, which is mainly the boundary/ackground color in the X-Rays.
- Actual representation of the chest images, are more toward white/grayish color (lighter tones).
- If we remove or ignore the values from 0-40 we will shift the focus on how the chest x-rays themeselves being represented. In other words, ignore the black boundaries/background color.
# Same as above but more focused on intesnity distribution
# Only Pneumonia
for d in sample_df_p.head(5)['path']:
plt.figure(figsize=(10,6))
img = io.imread(d)
img = img[img > 40]
# img = (img-np.mean(img)) / np.std(img)
# ax = plt.hist(img.ravel(), bins = 256)
sns.distplot(img, bins=256, kde=True, rug=False)
plt.title(f'Pneumonia image {d}')
plt.show()
for d in sample_df_nop.head(5)['path']:
plt.figure(figsize=(10,6))
img = io.imread(d)
# img = (img-np.mean(img)) / np.std(img)
# ax = plt.hist(img.ravel(), bins = 256)
img = img[img > 30]
sns.distplot(img, bins=256, kde=True, rug=False)
plt.title(f'Non-Pneumonia image {d}')
plt.show()
coll = []
for i in range(5):
l = {}
for label in all_labels:
l[label] = sample_df[sample_df[label] == 1]['path'].iloc[i]
coll.append(l)
coll[0]
difference = {}
fig, axes = plt.subplots(4, 4, figsize = (16, 16))
fig.suptitle('Distributions of Pixel Intensity')
p = io.imread(coll[0]['Pneumonia'])
p = p[p > 40]
pneumonia = np.array(p.ravel())
for ax, label, path in zip(axes.flatten(), coll[0].keys(), coll[0].values()):
img2 = io.imread(path)
img2 = img2[img2 > 40]
img2 = img2.ravel()
sns.distplot(img2, ax=ax, bins=256)
# diff = abs(pneumonia_arr-np.array(img2)).sum()
# difference[label] = diff
# ax.set(title=f'{label} - {diff}')
if label == 'Pneumonia':
ax.set_title(f'{label} , m: {round(img2.mean(),2)} s: {round(img2.std(),2)}', color="red")
else:
ax.set_title(f'{label} , m: {round(img2.mean(),2)} s: {round(img2.std(),2)}')
- Examinig above, the closest three diseases in terms of shape, mean and std (spread), is
Edeam
,Mass
andnodule
# {k: v for k, v in sorted(difference.items(), key=lambda item: item[1])}
difference = {}
fig, axes = plt.subplots(4, 4, figsize = (16, 16))
fig.suptitle('Distributions of Pixel Intensity')
p = io.imread(coll[1]['Pneumonia'])
p = p[p > 40]
pneumonia = np.array(p.ravel())
for ax, label, path in zip(axes.flatten(), coll[1].keys(), coll[1].values()):
img2 = io.imread(path)
img2 = img2[img2 > 40]
img2 = img2.ravel()
sns.distplot(img2, ax=ax, bins=256)
# diff = abs(pneumonia_arr-np.array(img2)).sum()
# difference[label] = diff
# ax.set(title=f'{label} - {diff}')
if label == 'Pneumonia':
ax.set_title(f'{label} , m: {round(img2.mean(),2)} s: {round(img2.std(),2)}', color="red")
else:
ax.set_title(f'{label} , m: {round(img2.mean(),2)} s: {round(img2.std(),2)}')
- Examinig above, the closest three diseases in terms of shape, mean and std (spread), are
Edema
,Fibrosis
andInfiltration
# {k: v for k, v in sorted(difference.items(), key=lambda item: item[1])}
difference = {}
fig, axes = plt.subplots(4, 4, figsize = (16, 16))
fig.suptitle('Distributions of Pixel Intensity')
p = io.imread(coll[2]['Pneumonia'])
p = p[p > 40]
pneumonia = np.array(p.ravel())
for ax, label, path in zip(axes.flatten(), coll[2].keys(), coll[2].values()):
img2 = io.imread(path)
img2 = img2[img2 > 40]
img2 = img2.ravel()
sns.distplot(img2, ax=ax, bins=256)
# diff = abs(pneumonia_arr-np.array(img2)).sum()
# difference[label] = diff
# ax.set(title=f'{label} - {diff}')
if label == 'Pneumonia':
ax.set_title(f'{label} , m: {round(img2.mean(),2)} s: {round(img2.std(),2)}', color="red")
else:
ax.set_title(f'{label} , m: {round(img2.mean(),2)} s: {round(img2.std(),2)}')
- Examinig above, the closest three diseases in terms of shape, mean and std (spread), are
Nodule
,Mass
andInfiltration
# {k: v for k, v in sorted(difference.items(), key=lambda item: item[1])}
difference = {}
fig, axes = plt.subplots(4, 4, figsize = (16, 16))
fig.suptitle('Distributions of Pixel Intensity')
p = io.imread(coll[3]['Pneumonia'])
p = p[p > 40]
pneumonia = np.array(p.ravel())
for ax, label, path in zip(axes.flatten(), coll[3].keys(), coll[3].values()):
img2 = io.imread(path)
img2 = img2[img2 > 40]
img2 = img2.ravel()
sns.distplot(img2, ax=ax, bins=256)
# diff = abs(pneumonia_arr-np.array(img2)).sum()
# difference[label] = diff
# ax.set(title=f'{label} - {diff}')
if label == 'Pneumonia':
ax.set_title(f'{label} , m: {round(img2.mean(),2)} s: {round(img2.std(),2)}', color="red")
else:
ax.set_title(f'{label} , m: {round(img2.mean(),2)} s: {round(img2.std(),2)}')
- Examinig above, the closest two diseases in terms of shape, mean and std (spread), are
Mass
andInfiltration
# {k: v for k, v in sorted(difference.items(), key=lambda item: item[1])}
difference = {}
fig, axes = plt.subplots(4, 4, figsize = (16, 16))
fig.suptitle('Distributions of Pixel Intensity')
pneumonia = io.imread(coll[0]['Pneumonia']).ravel()
p = io.imread(coll[4]['Pneumonia'])
p = p[p > 40]
pneumonia = np.array(p.ravel())
for ax, label, path in zip(axes.flatten(), coll[4].keys(), coll[4].values()):
img2 = io.imread(path)
img2 = img2[img2 > 40]
img2 = img2.ravel()
sns.distplot(img2, ax=ax, bins=256)
# diff = abs(np.array(pneumonia)-np.array(img2)).sum()
# difference[label] = diff
if label == 'Pneumonia':
ax.set_title(f'{label} , m: {round(img2.mean(),2)} s: {round(img2.std(),2)}', color="red")
else:
ax.set_title(f'{label} , m: {round(img2.mean(),2)} s: {round(img2.std(),2)}')
- Examinig above, the closest two diseases in terms of shape, mean and std (spread), is
Mass
andInfiltration