1. Install HELP from GitHub
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Skip this cell if you have already installed HELP.

.. code:: ipython3

    !pip install git+https://github.com/giordamaug/HELP.git

2. Download the input files
~~~~~~~~~~~~~~~~~~~~~~~~~~~

For a chosen tissue (here ``Kidney``), download from GitHub the label
file (here ``Kidney_HELP.csv``, computed as in Example 1) and the
attribute files (here BIO ``Kidney_BIO.csv``, CCcfs
``Kidney_CCcfs_0.csv``, …, ``Kidney_CCcfs_4.csv``, and N2V
``Kidney_EmbN2V_128.csv``). Skip this step if you already have these
input files locally.

.. code:: ipython3

    tissue='Kidney'
    !wget https://raw.githubusercontent.com/giordamaug/HELP/main/data/{tissue}_HELP.csv
    !wget https://raw.githubusercontent.com/giordamaug/HELP/main/data/{tissue}_BIO.csv
    # the CCcfs attributes are stored as 5 chunk files, indexed 0..4
    for i in range(5):
        !wget https://raw.githubusercontent.com/giordamaug/HELP/main/data/{tissue}_CCcfs_{i}.csv
    !wget https://raw.githubusercontent.com/giordamaug/HELP/main/data/{tissue}_EmbN2V_128.csv

Note that the CCcfs file has been split into 5 separate files because of
storage limitations on GitHub.

3. Load the input files and process the tissue attributes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- The label file (``Kidney_HELP.csv``) is loaded via ``read_csv``; its
  three-class labels (``E``, ``aE``, ``sNE``) are converted to
  two-class labels (``E``, ``NE``).
- The tissue gene attributes are loaded and assembled via
  ``feature_assemble_df`` from the downloaded BIO, CCcfs (split into 5
  subfiles, ``'nchunks': 5``) and embedding data files. Missing values
  are not fixed (``'fixna': False``), while data scaling
  (``'normalize': 'std'``) is applied to the BIO and CCcfs attributes;
  the embedding attributes are left unscaled (``'normalize': None``).

.. code:: ipython3

    tissue='Kidney'
    import pandas as pd
    from HELPpy.preprocess.loaders import feature_assemble_df
    # load the labels and merge aE and sNE into a single NE class
    df_y = pd.read_csv(f"{tissue}_HELP.csv", index_col=0)
    df_y = df_y.replace({'aE': 'NE', 'sNE': 'NE'})
    print(df_y.value_counts(normalize=False))
    # one entry per attribute file, each with its own preprocessing options
    features = [{'fname': f'{tissue}_BIO.csv', 'fixna' : False, 'normalize': 'std'},
                {'fname': f'{tissue}_CCcfs.csv', 'fixna' : False, 'normalize': 'std', 'nchunks' : 5},
                {'fname': f'{tissue}_EmbN2V_128.csv', 'fixna' : False, 'normalize': None}]
    df_X, df_y = feature_assemble_df(df_y, features=features, saveflag=False, verbose=True)

.. parsed-literal::

    label
    NE       16678
    E         1253
    Name: count, dtype: int64
    Majority NE 16678 minority E 1253
    [Kidney_BIO.csv] found 52532 Nan...
    [Kidney_BIO.csv] Normalization with std ...

.. parsed-literal::

    Loading file in chunks: 100%|██████████| 5/5 [00:02<00:00,  2.05it/s]

.. parsed-literal::

    [Kidney_CCcfs.csv] found 6676644 Nan...
    [Kidney_CCcfs.csv] Normalization with std ...
    [Kidney_EmbN2V_128.csv] found 0 Nan...
    [Kidney_EmbN2V_128.csv] No normalization...
    17236 labeled genes over a total of 17931
    (17236, 3456) data input

4. Estimate the performance of EGs prediction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Instantiate the prediction model described in the HELP paper
(soft-voting ensemble ``VotingSplitClassifier`` of ``n_voters=10``
classifiers) and estimate its performance via 5-fold cross-validation
(``k_fold_cv`` with ``n_splits=5``). Then, print the obtained average
performances (``df_scores``)…

.. code:: ipython3

    from HELPpy.models.prediction import VotingSplitClassifier, k_fold_cv
    clf = VotingSplitClassifier(n_voters=10, n_jobs=-1, random_state=-1)
    df_scores, scores, predictions = k_fold_cv(df_X, df_y, clf, n_splits=5, seed=0, verbose=True)
    df_scores

.. parsed-literal::

    {'E': 0, 'NE': 1}
    label
    NE       16010
    E         1224
    Name: count, dtype: int64
    Classification with VotingSplitClassifier...

.. parsed-literal::

    5-fold: 100%|██████████| 5/5 [01:15<00:00, 15.08s/it]

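=========== ============================
measure     value
=========== ============================
ROC-AUC     0.9584±0.0043
Accuracy    0.8848±0.0025
BA          0.8939±0.0070
Sensitivity 0.9044±0.0156
Specificity 0.8833±0.0031
MCC         0.5354±0.0079
CM          [[1107, 117], [1868, 14142]]
=========== ============================
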
… and those in each fold (``scores``)

.. code:: ipython3

    scores

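==== ======== ======== ======== =========== =========== ======== ========================
fold ROC-AUC  Accuracy BA       Sensitivity Specificity MCC      CM
==== ======== ======== ======== =========== =========== ======== ========================
0    0.954258 0.878735 0.889496 0.902041    0.876952    0.522809 [[221, 24], [394, 2808]]
1    0.953289 0.873223 0.894068 0.918367    0.869769    0.520189 [[225, 20], [417, 2785]]
2    0.955901 0.884827 0.890891 0.897959    0.883823    0.532617 [[220, 25], [372, 2830]]
3    0.960578 0.882507 0.895296 0.910204    0.880387    0.533671 [[223, 22], [383, 2819]]
4    0.965238 0.880731 0.901747 0.926230    0.877264    0.536883 [[226, 18], [393, 2809]]
==== ======== ======== ======== =========== =========== ======== ========================
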
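The measures in each row can be recomputed from the confusion matrix in the ``CM`` column. The sketch below is illustrative only: the helper ``metrics_from_cm`` is hypothetical (it is not part of HELPpy), and the matrix layout (rows are true classes, columns are predicted classes, with ``E`` first and treated as the positive class) is an assumption inferred from the reported values rather than taken from the library source.

.. code:: ipython3

    import math

    def metrics_from_cm(cm):
        """Derive summary measures from a 2x2 confusion matrix.

        Assumed layout (inferred from the tables, not from HELPpy):
        [[TP, FN], [FP, TN]] with E as the positive class.
        """
        (tp, fn), (fp, tn) = cm
        sensitivity = tp / (tp + fn)   # true positive rate on E
        specificity = tn / (tn + fp)   # true negative rate on NE
        accuracy = (tp + tn) / (tp + fn + fp + tn)
        ba = (sensitivity + specificity) / 2   # balanced accuracy
        mcc = (tp * tn - fp * fn) / math.sqrt(
            (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        return {"Accuracy": accuracy, "BA": ba, "Sensitivity": sensitivity,
                "Specificity": specificity, "MCC": mcc}

    # confusion matrix of fold 0, copied from the table
    fold0 = metrics_from_cm([[221, 24], [394, 2808]])
    for name, value in fold0.items():
        print(f"{name}: {value:.6f}")

Under this layout the computed values agree with the fold 0 row to the precision shown there, which supports the assumed orientation of the matrix. ROC-AUC is not included because it cannot be derived from a confusion matrix alone.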
Show the labels, predictions and their probabilities (``predictions``) and save them to a CSV file.

.. code:: ipython3

    predictions

====== ===== ========== =============
gene   label prediction probabilities
====== ===== ========== =============
A2M    1     1          0.016435
A2ML1  1     1          0.001649
AAGAB  1     1          0.230005
AANAT  1     1          0.002823
AARS2  1     0          0.529173
...    ...   ...        ...
ZSCAN9 1     1          0.004752
ZSWIM6 1     1          0.007049
ZUP1   1     0          0.532555
ZYG11A 1     1          0.005995
ZZEF1  1     1          0.075781
====== ===== ========== =============

17234 rows × 3 columns
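In the rows shown, the ``probabilities`` column appears to hold the predicted probability of class ``E`` (encoded as ``0``), with a gene predicted as ``E`` when that probability exceeds 0.5. Both the meaning of the column and the 0.5 threshold are assumptions read off the displayed rows, not documented behaviour of ``k_fold_cv``. A minimal check on those rows:

.. code:: ipython3

    import pandas as pd

    # the ten rows displayed in the table, copied by hand
    shown = pd.DataFrame(
        {"prediction": [1, 1, 1, 1, 0, 1, 1, 0, 1, 1],
         "probabilities": [0.016435, 0.001649, 0.230005, 0.002823, 0.529173,
                           0.004752, 0.007049, 0.532555, 0.005995, 0.075781]},
        index=pd.Index(["A2M", "A2ML1", "AAGAB", "AANAT", "AARS2",
                        "ZSCAN9", "ZSWIM6", "ZUP1", "ZYG11A", "ZZEF1"],
                       name="gene"))

    # assumption: probabilities = P(E); E is encoded 0 and NE is encoded 1
    derived = (shown["probabilities"] <= 0.5).astype(int)
    consistent = bool((derived == shown["prediction"]).all())
    print(consistent)

If the assumption holds on the full table as well, genes such as ``AARS2`` and ``ZUP1`` (labelled ``NE`` but predicted ``E`` with a probability just above 0.5) are borderline cases.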

.. code:: ipython3

    predictions.to_csv(f"csEGs_{tissue}_EvsNE.csv", index=True)

5. Compute TPR for ucsEGs and csEGs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Read the result files for ucsEGs (``ucsEG_Kidney.txt``) and csEGs
(``csEGs_Kidney_EvsNE.csv``) already computed for the tissue, compute
the TPRs (``tpr``) and show them in a bar plot.

.. code:: ipython3

    import seaborn as sns
    import matplotlib.pyplot as plt
    import numpy as np
    labels = []
    data = []
    tpr = []
    genes = {}
    !wget https://raw.githubusercontent.com/giordamaug/HELP/main/data/ucsEG_{tissue}.txt
    ucsEGs = pd.read_csv(f"ucsEG_{tissue}.txt", index_col=0, header=None).index.values
    !wget https://raw.githubusercontent.com/giordamaug/HELP/main/data/csEGs_{tissue}_EvsNE.csv
    predictions = pd.read_csv(f"csEGs_{tissue}_EvsNE.csv", index_col=0)
    # restrict the predictions to the ucsEGs (label 0 encodes E, 1 encodes NE)
    indices = np.intersect1d(ucsEGs, predictions.index.values)
    preds = predictions.loc[indices]
    # TPR on ucsEGs (num1/den1) and on all csEGs (num2/den2)
    num1 = len(preds[preds['label'] == preds['prediction']])
    den1 = len(preds[preds['label'] == 0])
    den2 = len(predictions[predictions['label'] == 0])
    num2 = len(predictions[(predictions['label'] == 0) & (predictions['label'] == predictions['prediction'])])
    labels += [f"ucsEGs\n{tissue}", f"csEGs\n{tissue}"]
    data += [float(f"{num1 /den1:.3f}"), float(f"{num2 /den2:.3f}")]
    tpr += [f"{num1}/{den1}", f"{num2}/{den2}"]
    # correctly (_y) and wrongly (_n) predicted genes in each group
    genes[f'ucsEGs_{tissue}_y'] = preds[preds['label'] == preds['prediction']].index.values
    genes[f'ucsEGs_{tissue}_n'] = preds[preds['label'] != preds['prediction']].index.values
    genes[f'csEGs_{tissue}_y'] = predictions[(predictions['label'] == 0) & (predictions['label'] == predictions['prediction'])].index.values
    genes[f'csEGs_{tissue}_n'] = predictions[(predictions['label'] == 0) & (predictions['label'] != predictions['prediction'])].index.values
    print(f"ucsEG {tissue} TPR = {num1 /den1:.3f} ({num1}/{den1}) csEG {tissue} TPR = {num2/den2:.3f} ({num2}/{den2})")
    f, ax = plt.subplots(figsize=(4, 4))
    palette = sns.color_palette("pastel", n_colors=2)
    sns.barplot(y = data, x = labels, ax=ax, hue= data, palette = palette, orient='v', legend=False)
    ax.set_ylabel('TPR')
    ax.set(yticklabels=[])
    # write the raw counts inside each bar and the TPR value on top of it
    for i,l,t in zip(range(4),labels,tpr):
        ax.text(-0.15 + (i * 1.03), 0.2, f"({t})", rotation='vertical')
    for i in ax.containers:
        ax.bar_label(i,)

.. parsed-literal::

    zsh:1: command not found: wget
    zsh:1: command not found: wget
    ucsEG Kidney TPR = 0.780 (46/59) csEG Kidney TPR = 0.897 (1114/1242)

.. image:: output_17_1.png

This code can be used to produce Fig 5(B) of the HELP paper by iterating
over both the ``Kidney`` and ``Lung`` tissues.

Finally, we print the list of correctly predicted ucsEGs for the tissue.

.. code:: ipython3

    genes[f'ucsEGs_{tissue}_y']

.. parsed-literal::

    array(['ACTG1', 'ACTR6', 'ARF4', 'ARPC4', 'CDK6', 'CHMP7', 'COPS3',
           'DCTN3', 'DDX11', 'DDX52', 'EMC3', 'EXOSC1', 'GEMIN7', 'GET3',
           'HGS', 'HTATSF1', 'KIF4A', 'MCM10', 'MDM2', 'METAP2', 'MLST8',
           'NCAPH2', 'NDOR1', 'OXA1L', 'PFN1', 'PIK3C3', 'PPIE', 'PPP1CA',
           'PPP4R2', 'RAB7A', 'RAD1', 'RBM42', 'RBMX2', 'RTEL1', 'SNRPB2',
           'SPTLC1', 'SRSF10', 'TAF1D', 'TMED10', 'TMED2', 'UBA5', 'UBC',
           'UBE2D3', 'USP10', 'VPS52', 'YWHAZ'], dtype=object)
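
When the computation is repeated for several tissues, the TPR logic of step 5 can be factored into a small function. The following is a minimal sketch, not part of HELPpy: the helper name ``tpr_for_genes`` and the toy gene names are hypothetical, and it assumes the same encoding as above (``0`` for ``E``). Unlike ``num1`` in step 5, it counts hits only among genes whose label is ``E``.

.. code:: ipython3

    import numpy as np
    import pandas as pd

    def tpr_for_genes(predictions, genes=None, positive=0):
        """Return (hits, total, tpr) for the positive class.

        If `genes` is given, only those genes that appear in the
        index of `predictions` are considered.
        """
        if genes is not None:
            keep = np.intersect1d(genes, predictions.index.values)
            predictions = predictions.loc[keep]
        positives = predictions[predictions['label'] == positive]
        hits = int((positives['prediction'] == positive).sum())
        total = len(positives)
        return hits, total, (hits / total if total else float('nan'))

    # toy data with made-up gene names, for illustration only
    toy = pd.DataFrame({'label':      [0, 0, 0, 1, 1],
                        'prediction': [0, 1, 0, 1, 0]},
                       index=pd.Index(['G1', 'G2', 'G3', 'G4', 'G5'], name='gene'))
    all_hits, all_total, all_tpr = tpr_for_genes(toy)
    sub_hits, sub_total, sub_tpr = tpr_for_genes(toy, genes=['G1', 'G2', 'G9'])
    print(f"all: {all_hits}/{all_total}  subset: {sub_hits}/{sub_total}")

Called once with the ucsEG list and once without it on the ``predictions`` table of each tissue, a function like this would yield the pairs of values plotted in the bar chart.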