1. Install HELP from GitHub ~~~~~~~~~~~~~~~~~~~~~~~~~~~ Skip this cell if you already have installed HELP. .. code:: ipython3 !pip install git+https://github.com/giordamaug/HELP.git 2. Download the input files ~~~~~~~~~~~~~~~~~~~~~~~~~~~ For a chosen tissue (here ``Kidney``), download from GitHub the label file (here ``Kidney_HELP.csv``, computed as in Example 1) and the attribute files (here BIO ``Kidney_BIO.csv``, CCcfs ``Kidney_CCcfs_1.csv``, …, ``Kidney_CCcfs_5.csv``, and N2V ``Kidney_EmbN2V_128.csv``). Skip this step if you already have these input files locally. .. code:: ipython3 tissue='Kidney' !wget https://raw.githubusercontent.com/giordamaug/HELP/main/data/{tissue}_HELP.csv !wget https://raw.githubusercontent.com/giordamaug/HELP/main/data/{tissue}_BIO.csv for i in range(5): !wget https://raw.githubusercontent.com/giordamaug/HELP/main/data/{tissue}_CCcfs_{i}.csv !wget https://raw.githubusercontent.com/giordamaug/HELP/main/data/{tissue}_EmbN2V_128.csv Observe that the CCcfs file has been subdivided into 5 separate files for storage limitations on GitHub. 3. Load the input files and process the tissue attributes ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - The label file (``Kidney_HELP.csv``) can be loaded via ``read_csv``; its three-class labels (``E``, ``aE``, ``sNE``) are converted to two-class labels (``E``, ``NE``); - The tissue gene attributes are loaded and assembled via ``feature_assemble_df`` using the downloaded datafiles BIO, CCcfs subdivided into 5 subfiles (``'nchunks': 5``) and embedding. We do not apply missing values fixing (``'fixna': False``), while we do apply data scaling (``'normalize': 'std'``) to the BIO and CCcfs attributes. .. code:: ipython3 tissue='Kidney' import pandas as pd from HELPpy.preprocess.loaders import feature_assemble_df df_y = pd.read_csv(f"{tissue}_HELP.csv", index_col=0) df_y = df_y.replace({'aE': 'NE', 'sNE': 'NE'}) print(df_y.value_counts(normalize=False)) features = [{'fname': f'{tissue}_BIO.csv', 'fixna' : False, 'normalize': 'std'}, {'fname': f'{tissue}_CCcfs.csv', 'fixna' : False, 'normalize': 'std', 'nchunks' : 5}, {'fname': f'{tissue}_EmbN2V_128.csv', 'fixna' : False, 'normalize': None}] df_X, df_y = feature_assemble_df(df_y, features=features, saveflag=False, verbose=True) .. parsed-literal:: label NE 16678 E 1253 Name: count, dtype: int64 Majority NE 16678 minority E 1253 [Kidney_BIO.csv] found 52532 Nan... [Kidney_BIO.csv] Normalization with std ... .. parsed-literal:: Loading file in chunks: 100%|██████████| 5/5 [00:02<00:00, 2.05it/s] .. parsed-literal:: [Kidney_CCcfs.csv] found 6676644 Nan... [Kidney_CCcfs.csv] Normalization with std ... [Kidney_EmbN2V_128.csv] found 0 Nan... [Kidney_EmbN2V_128.csv] No normalization... 17236 labeled genes over a total of 17931 (17236, 3456) data input 4. Estimate the performance of EGs prediction ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Instantiate the prediction model described in the HELP paper (soft-voting ensemble ``VotingSplitClassifier`` of ``n_voters=10`` classifiers) and estimate its performance via 5-fold cross-validation (``k_fold_cv`` with ``n_splits=5``). Then, print the obtained average performances (``df_scores``)… .. code:: ipython3 from HELPpy.models.prediction import VotingSplitClassifier, k_fold_cv clf = VotingSplitClassifier(n_voters=10, n_jobs=-1, random_state=-1) df_scores, scores, predictions = k_fold_cv(df_X, df_y, clf, n_splits=5, seed=0, verbose=True) df_scores .. parsed-literal:: {'E': 0, 'NE': 1} label NE 16010 E 1224 Name: count, dtype: int64 Classification with VotingSplitClassifier... .. parsed-literal:: 5-fold: 100%|██████████| 5/5 [01:15<00:00, 15.08s/it] .. raw:: html

	measure
ROC-AUC	0.9584±0.0043
Accuracy	0.8848±0.0025
BA	0.8939±0.0070
Sensitivity	0.9044±0.0156
Specificity	0.8833±0.0031
MCC	0.5354±0.0079
CM	[[1107, 117], [1868, 14142]]

… and those in each fold (``scores``) .. code:: ipython3 scores .. raw:: html

	ROC-AUC	Accuracy	BA	Sensitivity	Specificity	MCC	CM
0	0.954258	0.878735	0.889496	0.902041	0.876952	0.522809	[[221, 24], [394, 2808]]
1	0.953289	0.873223	0.894068	0.918367	0.869769	0.520189	[[225, 20], [417, 2785]]
2	0.955901	0.884827	0.890891	0.897959	0.883823	0.532617	[[220, 25], [372, 2830]]
3	0.960578	0.882507	0.895296	0.910204	0.880387	0.533671	[[223, 22], [383, 2819]]
4	0.965238	0.880731	0.901747	0.926230	0.877264	0.536883	[[226, 18], [393, 2809]]

Show labels, predictions and their probabilities (``predictions``) and save them in a csv file .. code:: ipython3 predictions .. raw:: html

	label	prediction	probabilities
gene
A2M	1	1	0.016435
A2ML1	1	1	0.001649
AAGAB	1	1	0.230005
AANAT	1	1	0.002823
AARS2	1	0	0.529173
...	...	...	...
ZSCAN9	1	1	0.004752
ZSWIM6	1	1	0.007049
ZUP1	1	0	0.532555
ZYG11A	1	1	0.005995
ZZEF1	1	1	0.075781

17234 rows × 3 columns

.. code:: ipython3 predictions.to_csv(f"csEGs_{tissue}_EvsNE.csv", index=True) 5. Compute TPR for ucsEGs and csEGs ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Read the result files for ucsEGs (``ucsEG_Kidney.txt``) and csEGs (``csEGs_Kidney_EvsNE.csv``) already computed for the tissue, compute the TPRs (tpr) and show their bar plot. .. code:: ipython3 import seaborn as sns import matplotlib.pyplot as plt import numpy as np labels = [] data = [] tpr = [] genes = {} !wget https://raw.githubusercontent.com/giordamaug/HELP/main/data/ucsEG_{tissue}.txt ucsEGs = pd.read_csv(f"ucsEG_{tissue}.txt", index_col=0, header=None).index.values !wget https://raw.githubusercontent.com/giordamaug/HELP/main/data/csEGs_{tissue}_EvsNE.csv predictions = pd.read_csv(f"csEGs_{tissue}_EvsNE.csv", index_col=0) indices = np.intersect1d(ucsEGs, predictions.index.values) preds = predictions.loc[indices] num1 = len(preds[preds['label'] == preds['prediction']]) den1 = len(preds[preds['label'] == 0]) den2 = len(predictions[predictions['label'] == 0]) num2 = len(predictions[(predictions['label'] == 0) & (predictions['label'] == predictions['prediction'])]) labels += [f"ucsEGs\n{tissue}", f"csEGs\n{tissue}"] data += [float(f"{num1 /den1:.3f}"), float(f"{num2 /den2:.3f}")] tpr += [f"{num1}/{den1}", f"{num2}/{den2}"] genes[f'ucsEGs_{tissue}_y'] = preds[preds['label'] == preds['prediction']].index.values genes[f'ucsEGs_{tissue}_n'] = preds[preds['label'] != preds['prediction']].index.values genes[f'csEGs_{tissue}_y'] = predictions[(predictions['label'] == 0) & (predictions['label'] == predictions['prediction'])].index.values genes[f'csEGs_{tissue}_n'] = predictions[(predictions['label'] == 0) & (predictions['label'] != predictions['prediction'])].index.values print(f"ucsEG {tissue} TPR = {num1 /den1:.3f} ({num1}/{den1}) ucsEG {tissue} TPR = {num2/den2:.3f} ({num2}/{den2})") f, ax = plt.subplots(figsize=(4, 4)) palette = sns.color_palette("pastel", n_colors=2) sns.barplot(y = data, x = labels, ax=ax, hue= data, palette = palette, orient='v', legend=False) ax.set_ylabel('TPR') ax.set(yticklabels=[]) for i,l,t in zip(range(4),labels,tpr): ax.text(-0.15 + (i * 1.03), 0.2, f"({t})", rotation='vertical') for i in ax.containers: ax.bar_label(i,) .. parsed-literal:: zsh:1: command not found: wget zsh:1: command not found: wget ucsEG Kidney TPR = 0.780 (46/59) ucsEG Kidney TPR = 0.897 (1114/1242) .. image:: output_17_1.png This code can be used to produce Fig 5(B) of the HELP paper by executing an iteration cycle for both ``kidney`` and ``lung`` tissues. At the end, we print the list of ucs_EGs for the tissue. .. code:: ipython3 genes[f'ucsEGs_{tissue}_y'] .. parsed-literal:: array(['ACTG1', 'ACTR6', 'ARF4', 'ARPC4', 'CDK6', 'CHMP7', 'COPS3', 'DCTN3', 'DDX11', 'DDX52', 'EMC3', 'EXOSC1', 'GEMIN7', 'GET3', 'HGS', 'HTATSF1', 'KIF4A', 'MCM10', 'MDM2', 'METAP2', 'MLST8', 'NCAPH2', 'NDOR1', 'OXA1L', 'PFN1', 'PIK3C3', 'PPIE', 'PPP1CA', 'PPP4R2', 'RAB7A', 'RAD1', 'RBM42', 'RBMX2', 'RTEL1', 'SNRPB2', 'SPTLC1', 'SRSF10', 'TAF1D', 'TMED10', 'TMED2', 'UBA5', 'UBC', 'UBE2D3', 'USP10', 'VPS52', 'YWHAZ'], dtype=object)