1. Install HELP from GitHub

Skip this cell if you already have installed HELP.

!pip install git+https://github.com/giordamaug/HELP.git

2. Download the input files

For a chosen tissue (here Kidney), download from GitHub the label file (here Kidney_HELP.csv, computed as in Example 1) and the attribute files (here BIO Kidney_BIO.csv, CCcfs Kidney_CCcfs_1.csv, …, Kidney_CCcfs_5.csv, and N2V Kidney_EmbN2V_128.csv).

Skip this step if you already have these input files locally.

tissue='Kidney'
!wget https://raw.githubusercontent.com/giordamaug/HELP/main/data/{tissue}_HELP.csv
!wget https://raw.githubusercontent.com/giordamaug/HELP/main/data/{tissue}_BIO.csv
for i in range(5):
  !wget https://raw.githubusercontent.com/giordamaug/HELP/main/data/{tissue}_CCcfs_{i}.csv
!wget https://raw.githubusercontent.com/giordamaug/HELP/main/data/{tissue}_EmbN2V_128.csv

Observe that the CCcfs file has been subdivided into 5 separate files for storage limitations on GitHub.

3. Load the input files and process the tissue attributes

The label file (Kidney_HELP.csv) can be loaded via read_csv; its three-class labels (E, aE, sNE) are converted to two-class labels (E, NE);
The tissue gene attributes are loaded and assembled via feature_assemble_df using the downloaded datafiles BIO, CCcfs subdivided into 5 subfiles ('nchunks': 5) and embedding. We do not apply missing values fixing ('fixna': False), while we do apply data scaling ('normalize': 'std') to the BIO and CCcfs attributes.

tissue='Kidney'
import pandas as pd
from HELPpy.preprocess.loaders import feature_assemble_df
df_y = pd.read_csv(f"{tissue}_HELP.csv", index_col=0)
df_y = df_y.replace({'aE': 'NE', 'sNE': 'NE'})
print(df_y.value_counts(normalize=False))
features = [{'fname': f'{tissue}_BIO.csv', 'fixna' : False, 'normalize': 'std'},
            {'fname': f'{tissue}_CCcfs.csv', 'fixna' : False, 'normalize': 'std', 'nchunks' : 5},
            {'fname': f'{tissue}_EmbN2V_128.csv', 'fixna' : False, 'normalize': None}]
df_X, df_y = feature_assemble_df(df_y, features=features, saveflag=False, verbose=True)

label
NE       16678
E         1253
Name: count, dtype: int64
Majority NE 16678 minority E 1253
[Kidney_BIO.csv] found 52532 Nan...
[Kidney_BIO.csv] Normalization with std ...

Loading file in chunks: 100%|██████████| 5/5 [00:02<00:00,  2.05it/s]

[Kidney_CCcfs.csv] found 6676644 Nan...
[Kidney_CCcfs.csv] Normalization with std ...
[Kidney_EmbN2V_128.csv] found 0 Nan...
[Kidney_EmbN2V_128.csv] No normalization...
17236 labeled genes over a total of 17931
(17236, 3456) data input

4. Estimate the performance of EGs prediction

Instantiate the prediction model described in the HELP paper (soft-voting ensemble VotingSplitClassifier of n_voters=10 classifiers) and estimate its performance via 5-fold cross-validation (k_fold_cv with n_splits=5). Then, print the obtained average performances (df_scores)…

from HELPpy.models.prediction import VotingSplitClassifier, k_fold_cv
clf = VotingSplitClassifier(n_voters=10, n_jobs=-1, random_state=-1)
df_scores, scores, predictions = k_fold_cv(df_X, df_y, clf, n_splits=5, seed=0, verbose=True)
df_scores

{'E': 0, 'NE': 1}
label
NE       16010
E         1224
Name: count, dtype: int64
Classification with VotingSplitClassifier...

5-fold: 100%|██████████| 5/5 [01:15<00:00, 15.08s/it]

	measure
ROC-AUC	0.9584±0.0043
Accuracy	0.8848±0.0025
BA	0.8939±0.0070
Sensitivity	0.9044±0.0156
Specificity	0.8833±0.0031
MCC	0.5354±0.0079
CM	[[1107, 117], [1868, 14142]]

… and those in each fold (scores)

scores

	ROC-AUC	Accuracy	BA	Sensitivity	Specificity	MCC	CM
0	0.954258	0.878735	0.889496	0.902041	0.876952	0.522809	[[221, 24], [394, 2808]]
1	0.953289	0.873223	0.894068	0.918367	0.869769	0.520189	[[225, 20], [417, 2785]]
2	0.955901	0.884827	0.890891	0.897959	0.883823	0.532617	[[220, 25], [372, 2830]]
3	0.960578	0.882507	0.895296	0.910204	0.880387	0.533671	[[223, 22], [383, 2819]]
4	0.965238	0.880731	0.901747	0.926230	0.877264	0.536883	[[226, 18], [393, 2809]]

Show labels, predictions and their probabilities (predictions) and save them in a csv file

predictions

	label	prediction	probabilities
gene
A2M	1	1	0.016435
A2ML1	1	1	0.001649
AAGAB	1	1	0.230005
AANAT	1	1	0.002823
AARS2	1	0	0.529173
...	...	...	...
ZSCAN9	1	1	0.004752
ZSWIM6	1	1	0.007049
ZUP1	1	0	0.532555
ZYG11A	1	1	0.005995
ZZEF1	1	1	0.075781

17234 rows × 3 columns

predictions.to_csv(f"csEGs_{tissue}_EvsNE.csv", index=True)

5. Compute TPR for ucsEGs and csEGs

Read the result files for ucsEGs (ucsEG_Kidney.txt) and csEGs (csEGs_Kidney_EvsNE.csv) already computed for the tissue, compute the TPRs (tpr) and show their bar plot.

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
labels = []
data = []
tpr = []
genes = {}
!wget https://raw.githubusercontent.com/giordamaug/HELP/main/data/ucsEG_{tissue}.txt
ucsEGs = pd.read_csv(f"ucsEG_{tissue}.txt", index_col=0, header=None).index.values
!wget https://raw.githubusercontent.com/giordamaug/HELP/main/data/csEGs_{tissue}_EvsNE.csv
predictions = pd.read_csv(f"csEGs_{tissue}_EvsNE.csv", index_col=0)
indices = np.intersect1d(ucsEGs, predictions.index.values)
preds = predictions.loc[indices]
num1 = len(preds[preds['label'] == preds['prediction']])
den1 = len(preds[preds['label'] == 0])
den2 = len(predictions[predictions['label'] == 0])
num2 = len(predictions[(predictions['label'] == 0) & (predictions['label'] == predictions['prediction'])])
labels += [f"ucsEGs\n{tissue}", f"csEGs\n{tissue}"]
data += [float(f"{num1 /den1:.3f}"), float(f"{num2 /den2:.3f}")]
tpr += [f"{num1}/{den1}", f"{num2}/{den2}"]
genes[f'ucsEGs_{tissue}_y'] = preds[preds['label'] == preds['prediction']].index.values
genes[f'ucsEGs_{tissue}_n'] = preds[preds['label'] != preds['prediction']].index.values
genes[f'csEGs_{tissue}_y'] = predictions[(predictions['label'] == 0) & (predictions['label'] == predictions['prediction'])].index.values
genes[f'csEGs_{tissue}_n'] = predictions[(predictions['label'] == 0) & (predictions['label'] != predictions['prediction'])].index.values
print(f"ucsEG {tissue} TPR = {num1 /den1:.3f} ({num1}/{den1}) ucsEG {tissue} TPR =  {num2/den2:.3f} ({num2}/{den2})")

f, ax = plt.subplots(figsize=(4, 4))
palette = sns.color_palette("pastel", n_colors=2)
sns.barplot(y = data, x = labels, ax=ax, hue= data, palette = palette, orient='v', legend=False)
ax.set_ylabel('TPR')
ax.set(yticklabels=[])
for i,l,t in zip(range(4),labels,tpr):
    ax.text(-0.15 + (i * 1.03), 0.2, f"({t})", rotation='vertical')
for i in ax.containers:
    ax.bar_label(i,)

zsh:1: command not found: wget
zsh:1: command not found: wget
ucsEG Kidney TPR = 0.780 (46/59) ucsEG Kidney TPR =  0.897 (1114/1242)

This code can be used to produce Fig 5(B) of the HELP paper by executing an iteration cycle for both kidney and lung tissues.

At the end, we print the list of ucs_EGs for the tissue.

genes[f'ucsEGs_{tissue}_y']

array(['ACTG1', 'ACTR6', 'ARF4', 'ARPC4', 'CDK6', 'CHMP7', 'COPS3',
       'DCTN3', 'DDX11', 'DDX52', 'EMC3', 'EXOSC1', 'GEMIN7', 'GET3',
       'HGS', 'HTATSF1', 'KIF4A', 'MCM10', 'MDM2', 'METAP2', 'MLST8',
       'NCAPH2', 'NDOR1', 'OXA1L', 'PFN1', 'PIK3C3', 'PPIE', 'PPP1CA',
       'PPP4R2', 'RAB7A', 'RAD1', 'RBM42', 'RBMX2', 'RTEL1', 'SNRPB2',
       'SPTLC1', 'SRSF10', 'TAF1D', 'TMED10', 'TMED2', 'UBA5', 'UBC',
       'UBE2D3', 'USP10', 'VPS52', 'YWHAZ'], dtype=object)