create_COVIDx.ipynb is a long notebook, so this walkthrough presents and explains the code in segments.
The first segment is as follows:
import numpy as np
import pandas as pd
import os
import random
from shutil import copyfile
import pydicom as dicom
import cv2
# set parameters here
savepath = 'data'
seed = 0
np.random.seed(seed) # Reset the seed so all runs are the same.
random.seed(seed)
MAXVAL = 255 # Range [0 255]
# path to covid-19 dataset from https://github.com/ieee8023/covid-chestxray-dataset
cohen_imgpath = '../covid-chestxray-dataset/images'
cohen_csvpath = '../covid-chestxray-dataset/metadata.csv'
# path to covid-19 dataset from https://github.com/agchung/Figure1-COVID-chestxray-dataset
fig1_imgpath = '../Figure1-COVID-chestxray-dataset/images'
fig1_csvpath = '../Figure1-COVID-chestxray-dataset/metadata.csv'
# path to covid-19 dataset from https://github.com/agchung/Actualmed-COVID-chestxray-dataset
actmed_imgpath = '../Actualmed-COVID-chestxray-dataset/images'
actmed_csvpath = '../Actualmed-COVID-chestxray-dataset/metadata.csv'
# path to covid-19 dataset from https://www.kaggle.com/tawsifurrahman/covid19-radiography-database
sirm_imgpath = '../COVID-19-Radiography-Database/COVID-19'
sirm_csvpath = '../COVID-19-Radiography-Database/COVID-19.metadata.xlsx'
# path to https://www.kaggle.com/c/rsna-pneumonia-detection-challenge
rsna_datapath = '../rsna-pneumonia-detection-challenge'
# get all the normal from here
rsna_csvname = 'stage_2_detailed_class_info.csv'
# get all the 1s from here since 1 indicate pneumonia
# found that images that aren't pneumonia and also not normal are classified as 0s
rsna_csvname2 = 'stage_2_train_labels.csv'
rsna_imgpath = 'stage_2_train_images'
# parameters for COVIDx dataset
train = []
test = []
test_count = {'normal': 0, 'pneumonia': 0, 'COVID-19': 0}
train_count = {'normal': 0, 'pneumonia': 0, 'COVID-19': 0}
mapping = dict()
mapping['COVID-19'] = 'COVID-19'
mapping['SARS'] = 'pneumonia'
mapping['MERS'] = 'pneumonia'
mapping['Streptococcus'] = 'pneumonia'
mapping['Klebsiella'] = 'pneumonia'
mapping['Chlamydophila'] = 'pneumonia'
mapping['Legionella'] = 'pneumonia'
mapping['Normal'] = 'normal'
mapping['Lung Opacity'] = 'pneumonia'
mapping['1'] = 'pneumonia'
# train/test split
split = 0.1
# to avoid duplicates
patient_imgpath = {}
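The parameters above fix a random seed and a split ratio of 0.1. The cells that consume them are not shown in this segment, so the following is only a sketch, under the assumption that the split is made per patient. The helper name pick_test_patients and the sample patient ids are hypothetical, and the sketch uses the standard-library random module rather than np.random.

```python
import random


def pick_test_patients(patient_ids, split=0.1, seed=0):
    """Hypothetical helper: choose a reproducible patient-level test subset."""
    rng = random.Random(seed)  # local generator, global random state is untouched
    unique_ids = sorted(set(patient_ids))  # sort so the input order does not matter
    num_test = max(1, round(split * len(unique_ids)))  # always keep at least one
    return set(rng.sample(unique_ids, num_test))


ids = [f"p{i}" for i in range(20)]  # made-up patient ids
first = pick_test_patients(ids)
second = pick_test_patients(ids)
print(first == second)  # the same seed gives the same selection
```

Splitting by patient rather than by image is one way to keep images of the same patient from ending up in both train and test, which may be what the patient_imgpath dictionary is meant to support.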
Installing pydicom
conda install -c conda-forge pydicom
If you get errors such as "package not found", run the following commands to switch to the Tsinghua mirror:
conda config --add channels mirrors.tuna.tsinghua.edu.cn/anaconda/cl…
conda config --add channels mirrors.tuna.tsinghua.edu.cn/anaconda/cl…
conda config --add channels mirrors.tuna.tsinghua.edu.cn/anaconda/pk…
conda config --set show_channel_urls yes
Installing the cv2 library
conda install -c conda-forge opencv
or
pip install opencv-python
If the package still cannot be found, try:
conda install --channel conda.anaconda.org/menpo opencv3
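If that still does not work, download the package and install it offline. The steps are:
1) Go to the Tsinghua mirror at pypi.tuna.tsinghua.edu.cn/simple/open… and download the opencv wheel that matches your environment;
2) Place the downloaded xxx.whl file in Anaconda3\Lib\site-packages\;
3) Change into Anaconda3\Lib\site-packages\;
4) Run pip install .\xxx.whl to install it.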
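After any of the installation routes above, it can be useful to confirm that both packages are visible to the interpreter before running the notebook. The following is a minimal sketch using only the standard library; check_modules is a hypothetical helper name, not part of the notebook.

```python
import importlib.util


def check_modules(names):
    """Return {module_name: bool} telling whether each module can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# The notebook imports these two third-party packages as `pydicom` and `cv2`.
status = check_modules(["pydicom", "cv2"])
for name, ok in status.items():
    print(f"{name}: {'installed' if ok else 'missing'}")
```

Note that the import name cv2 differs from the pip package name opencv-python, so the check uses the import name.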
The second segment is as follows:
# adapted from https://github.com/mlmed/torchxrayvision/blob/master/torchxrayvision/datasets.py#L814
cohen_csv = pd.read_csv(cohen_csvpath, nrows=None)
#idx_pa = csv["view"] == "PA" # Keep only the PA view
views = ["PA", "AP", "AP Supine", "AP semi erect", "AP erect"]
cohen_idx_keep = cohen_csv.view.isin(views)
cohen_csv = cohen_csv[cohen_idx_keep]
fig1_csv = pd.read_csv(fig1_csvpath, encoding='ISO-8859-1', nrows=None)
actmed_csv = pd.read_csv(actmed_csvpath, nrows=None)
sirm_csv = pd.read_excel(sirm_csvpath)
cohen_csvpath is assigned in the first segment as '../covid-chestxray-dataset/metadata.csv'. In practice, you download the repository from GitHub (ieee8023/covid-chestxray-dataset, an open database of COVID-19 cases with chest X-ray or CT images) and place it at the same level as the COVID-Net directory. You can of course put it somewhere else, as long as you update the corresponding path in the code, as docs/COVIDx.md describes.
The pandas.read_csv() function
The signature below is the one documented for an older pandas release; several of these parameters have since been deprecated or removed.
pandas.read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal='.', lineterminator=None, quotechar='"', quoting=0, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=False, error_bad_lines=True, warn_bad_lines=True, skipfooter=0, skip_footer=0, doublequote=True, delim_whitespace=False, as_recarray=False, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, memory_map=False, float_precision=None)
cohen_csv = pd.read_csv(cohen_csvpath, nrows=None)
This statement reads the contents of covid-chestxray-dataset/metadata.csv into a DataFrame. nrows=None is the default and means that every row is read.
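To make the effect of nrows concrete, here is a small sketch that reads an in-memory CSV instead of the real metadata file. The three data rows are made up for illustration; only the column names patientid, view and finding come from the code in this article.

```python
import io

import pandas as pd

# Made-up stand-in for metadata.csv: a header line plus three data rows.
csv_text = (
    "patientid,view,finding\n"
    "2,PA,COVID-19\n"
    "4,L,COVID-19\n"
    "7,AP,SARS\n"
)

all_rows = pd.read_csv(io.StringIO(csv_text), nrows=None)  # read everything
first_two = pd.read_csv(io.StringIO(csv_text), nrows=2)    # read only the first 2 data rows

print(len(all_rows), len(first_two))
```

Setting nrows to a small number can be handy for a quick look at a large file, while the notebook keeps nrows=None so that no rows are dropped.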
cohen_idx_keep = cohen_csv.view.isin(views)
cohen_csv = cohen_csv[cohen_idx_keep]
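These two statements keep only the rows of metadata.csv whose "view" column is "PA", "AP", "AP Supine", "AP semi erect" or "AP erect", and rebuild cohen_csv from them. This is a data-cleaning step that filters out the rows that are not needed.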
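The filtering step can be tried on a tiny DataFrame. This is a sketch with made-up file names and views; only the views list is taken from the notebook code.

```python
import pandas as pd

# Made-up metadata: two frontal views and two views that should be dropped.
meta = pd.DataFrame({
    "filename": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
    "view": ["PA", "L", "AP Supine", "Axial"],
})

views = ["PA", "AP", "AP Supine", "AP semi erect", "AP erect"]

keep = meta["view"].isin(views)  # boolean Series, True where the view is in the list
filtered = meta[keep]            # boolean indexing keeps only the True rows

print(filtered)
```

The sketch uses meta["view"] rather than the attribute form meta.view that the notebook uses. Both select the same column here; the bracket form simply avoids any chance of a column name colliding with a DataFrame attribute.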
The third segment is as follows:
# get non-COVID19 viral, bacteria, and COVID-19 infections from covid-chestxray-dataset, figure1 and actualmed
# stored as patient id, image filename and label
filename_label = {'normal': [], 'pneumonia': [], 'COVID-19': []}
count = {'normal': 0, 'pneumonia': 0, 'COVID-19': 0}
covid_ds = {'cohen': [], 'fig1': [], 'actmed': [], 'sirm': []}
for index, row in cohen_csv.iterrows():
    f = row['finding'].split(',')[0] # take the first finding, for the case of COVID-19, ARDS
    if f in mapping:
        count[mapping[f]] += 1
        entry = [str(row['patientid']), row['filename'], mapping[f], 'cohen']
        filename_label[mapping[f]].append(entry)
        if mapping[f] == 'COVID-19':
            covid_ds['cohen'].append(str(row['patientid']))

for index, row in fig1_csv.iterrows():
    if not str(row['finding']) == 'nan':
        f = row['finding'].split(',')[0] # take the first finding
        if f in mapping:
            count[mapping[f]] += 1
            if os.path.exists(os.path.join(fig1_imgpath, row['patientid'] + '.jpg')):
                entry = [row['patientid'], row['patientid'] + '.jpg', mapping[f], 'fig1']
            elif os.path.exists(os.path.join(fig1_imgpath, row['patientid'] + '.png')):
                entry = [row['patientid'], row['patientid'] + '.png', mapping[f], 'fig1']
            filename_label[mapping[f]].append(entry)
            if mapping[f] == 'COVID-19':
                covid_ds['fig1'].append(row['patientid'])

for index, row in actmed_csv.iterrows():
    if not str(row['finding']) == 'nan':
        f = row['finding'].split(',')[0]
        if f in mapping:
            count[mapping[f]] += 1
            entry = [row['patientid'], row['imagename'], mapping[f], 'actmed']
            filename_label[mapping[f]].append(entry)
            if mapping[f] == 'COVID-19':
                covid_ds['actmed'].append(row['patientid'])
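All three loops follow the same pattern: take the first comma-separated finding, look it up in mapping, and record the entry under the mapped label. The sketch below reproduces that pattern on plain dictionaries so it runs without the datasets. The rows, the reduced mapping, the 'demo' source tag and the helper name first_mapped_label are all made up for illustration.

```python
# Reduced copy of the notebook's mapping, enough for the demo rows below.
mapping = {"COVID-19": "COVID-19", "SARS": "pneumonia", "Streptococcus": "pneumonia"}


def first_mapped_label(finding, mapping):
    """Hypothetical helper: label of the first comma-separated finding, or None."""
    if not isinstance(finding, str):  # a missing cell is read by pandas as float NaN
        return None
    return mapping.get(finding.split(",")[0])


# Made-up rows standing in for DataFrame rows.
rows = [
    {"patientid": 2, "filename": "x1.jpg", "finding": "COVID-19, ARDS"},
    {"patientid": 3, "filename": "x2.jpg", "finding": "SARS"},
    {"patientid": 4, "filename": "x3.jpg", "finding": "No Finding"},
    {"patientid": 5, "filename": "x4.jpg", "finding": float("nan")},
]

filename_label = {"normal": [], "pneumonia": [], "COVID-19": []}
count = {"normal": 0, "pneumonia": 0, "COVID-19": 0}

for row in rows:
    label = first_mapped_label(row["finding"], mapping)
    if label is None:  # unmapped or missing finding: skip the row
        continue
    count[label] += 1
    filename_label[label].append([str(row["patientid"]), row["filename"], label, "demo"])

print(count)
```

In this sketch "COVID-19, ARDS" is counted as COVID-19 because only the text before the first comma is looked up, while "No Finding" and the missing value are skipped because they have no entry in mapping.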