1. Install HELP from GitHub

Skip this cell if you have already installed HELP.

!pip install git+https://github.com/giordamaug/HELP.git

2. Download the input files

For a chosen tissue (here Kidney), download the following files from GitHub: the label file (Kidney_HELP.csv, computed as in Example 1) and the attribute files, namely BIO (Kidney_BIO.csv), CCcfs (Kidney_CCcfs_0.csv, …, Kidney_CCcfs_4.csv, matching the indices produced by range(5) in the cell below), and N2V (Kidney_EmbN2V_128.csv). Skip this step if you already have these input files locally.

tissue='Kidney'
!wget https://raw.githubusercontent.com/giordamaug/HELP/main/data/{tissue}_HELP.csv
!wget https://raw.githubusercontent.com/giordamaug/HELP/main/data/{tissue}_BIO.csv
for i in range(5):
  !wget https://raw.githubusercontent.com/giordamaug/HELP/main/data/{tissue}_CCcfs_{i}.csv
!wget https://raw.githubusercontent.com/giordamaug/HELP/main/data/{tissue}_EmbN2V_128.csv
#!wget https://raw.githubusercontent.com/giordamaug/HELP/main/data/{tissue}_CCBeder.csv

Other attribute files (CCBeder) are listed but commented out, so that users can experiment with different data.
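Since the download can be skipped when the files are already present, a small check such as the following may help decide whether it is needed. This is a minimal sketch, not part of HELP: `expected_input_files` and `missing_files` are hypothetical helper names, and the chunk indices 0 to 4 are assumed from the `range(5)` loop in the download cell.

```python
from pathlib import Path

def expected_input_files(tissue, n_chunks=5):
    """Names of the label and attribute files used in this example (assumed from the wget cell)."""
    files = [f"{tissue}_HELP.csv", f"{tissue}_BIO.csv"]
    files += [f"{tissue}_CCcfs_{i}.csv" for i in range(n_chunks)]
    files.append(f"{tissue}_EmbN2V_128.csv")
    return files

def missing_files(tissue, datapath="."):
    """Return the expected files that are not present in datapath."""
    return [name for name in expected_input_files(tissue)
            if not (Path(datapath) / name).is_file()]

# An empty list means every input file is already available locally
print(missing_files("Kidney"))
```

If the printed list is empty, the download cell can be skipped.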

3. Download the script for the experiments and show its help message

Download the batch script for essential gene (EG) prediction used in the experiments and display its help message:

!wget https://raw.githubusercontent.com/giordamaug/HELP/main/HELPpy/notebooks/EG_prediction.py
!python EG_prediction.py -h
usage: EG_prediction.py [-h] -i <inputfile> [<inputfile> ...]
                        [-c <chunks> [<chunks> ...]]
                        [-X <excludelabels> [<excludelabels> ...]]
                        [-L <labelname>] -l <labelfile> [-A <aliases>]
                        [-b <seed>] [-r <repeat>] [-f <folds>] [-j <jobs>]
                        [-B] [-v <voters>] [-ba] [-fx] [-n <normalize>]
                        [-o <outfile>] [-s <scorefile>] [-p <predfile>]

PLOS COMPBIO

options:
  -h, --help            show this help message and exit
  -i <inputfile> [<inputfile> ...], --inputfile <inputfile> [<inputfile> ...]
                        input attribute filename list
  -c <chunks> [<chunks> ...], --chunks <chunks> [<chunks> ...]
                        no of chunks for attribute filename list
  -X <excludelabels> [<excludelabels> ...], --excludelabels <excludelabels> [<excludelabels> ...]
                        labels to exclude (default NaN, values any list)
  -L <labelname>, --labelname <labelname>
                        label name (default label)
  -l <labelfile>, --labelfile <labelfile>
                        label filename
  -A <aliases>, --aliases <aliases>
                        the dictionary for label renaming (es: {"oldlabel1":
                        "newlabel1", ..., "oldlabelN": "newlabelN"})
  -b <seed>, --seed <seed>
                        random seed (default: 1)
  -r <repeat>, --repeat <repeat>
                        n. of iteration (default: 10)
  -f <folds>, --folds <folds>
                        n. of cv folds (default: 5)
  -j <jobs>, --jobs <jobs>
                        n. of parallel jobs (default: -1)
  -B, --batch           enable batch mode (no output)
  -v <voters>, --voters <voters>
                        n. of voter predictors (default: 1 - one classifier)
  -ba, --balanced       enable balancing in classifier (default disabled)
  -fx, --fixna          enable fixing NaN (default disabled)
  -n <normalize>, --normalize <normalize>
                        normalization mode (default None)
  -o <outfile>, --outfile <outfile>
                        output file for performance measures sumup
  -s <scorefile>, --scorefile <scorefile>
                        output file reporting all measurements
  -p <predfile>, --predfile <predfile>
                        output file reporting predictions
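
According to the usage line above, only -i and -l are mandatory. Outside a notebook, the command can be assembled as a list of arguments rather than as one interpolated string, which avoids escaping the -A dictionary by hand. The following is a minimal sketch: `build_command` is a hypothetical helper, not part of HELP, and it uses only flags that appear in the help text.

```python
import shlex

def build_command(inputs, labelfile, extra_args=()):
    """Assemble an EG_prediction.py command line as a list of arguments."""
    return ["python", "EG_prediction.py", "-i", *inputs, "-l", labelfile, *extra_args]

cmd = build_command(
    inputs=["Kidney_BIO.csv"],
    labelfile="Kidney_HELP.csv",
    extra_args=["-A", "{'aE': 'NE', 'sNE': 'NE'}", "-n", "std", "-ba"],
)

# shlex.join quotes each argument so the resulting string is safe to paste into a shell
print(shlex.join(cmd))
```

The list form can also be passed directly to `subprocess.run(cmd, check=True)`.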

4. Run the E vs NE experiments

This cell’s code reproduces the results for Kidney reported in Table 3 (A) of the HELP paper.

datapath = "."
tissue = "Kidney"                               # or 'Lung'
labelfile = f"{tissue}_HELP.csv"                # label filename
aliases = "-A \"{'aE': 'NE', 'sNE':'NE'}\""     # dictionary for renaming labels before prediction, e.g. {'oldlabel': 'newlabel'}
excludeflags = ""                               # label to remove (none for E vs NE problem)
njobs = "-1"                                    # parallelism level: -1 = all cpus, 1 = sequential
nchunks = "-c 1 5 1"                            # no. of chunks for each input attribute file, in -i order: BIO is one chunk, CCcfs is split into 5 chunks, N2V is one chunk
voters = "-v 10"                                # no. of voters on classifier ensemble
repeats = "-r 10"                               # no. of iterations for experiments
!python EG_prediction.py -i {datapath}/{tissue}_BIO.csv \
                            {datapath}/{tissue}_CCcfs.csv \
                            {datapath}/{tissue}_EmbN2V_128.csv \
                            {nchunks} \
                            -l {datapath}/{labelfile} \
                            {aliases} {excludeflags}  \
                            {voters} {repeats} \
                            -n std -ba \
                            -j {njobs} -B
METHOD: LGBM        VOTERS: 10      BALANCE: yes
PROBL: E vs NE
INPUT: Kidney_BIO.csv Kidney_CCcfs.csv Kidney_EmbN2V_128.csv
LABEL: Kidney_HELP.csv DISTRIB: E : 1242, NE: 15994
+-------------+----------------------------------+
|             | measure                          |
|-------------+----------------------------------|
| ROC-AUC     | 0.9572±0.0057                    |
| Accuracy    | 0.8939±0.0037                    |
| BA          | 0.8904±0.0089                    |
| Sensitivity | 0.8862±0.0190                    |
| Specificity | 0.8945±0.0044                    |
| MCC         | 0.5483±0.0114                    |
| CM          | [[11007, 1413], [16876, 143064]] |
+-------------+----------------------------------+
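
The scalar measures in the table can be related to the CM row. The sketch below recomputes them from the pooled confusion matrix, assuming rows are the true classes (E, NE) and columns the predicted classes (E, NE). That layout is an assumption, but it is consistent with the row sums being 10 times the DISTRIB counts, as expected for 10 repeats. The table reports means over repeats, so the pooled values need only agree to about three decimals.

```python
from math import sqrt

# Confusion matrix copied from the output above.
# Assumed layout: rows = true class (E, NE), columns = predicted class (E, NE).
cm = [[11007, 1413], [16876, 143064]]
(tp, fn), (fp, tn) = cm

sensitivity = tp / (tp + fn)                      # recall on the E class
specificity = tn / (tn + fp)                      # recall on the NE class
accuracy = (tp + tn) / (tp + fn + fp + tn)
balanced_accuracy = (sensitivity + specificity) / 2
mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

for name, value in [("Accuracy", accuracy), ("BA", balanced_accuracy),
                    ("Sensitivity", sensitivity), ("Specificity", specificity),
                    ("MCC", mcc)]:
    print(f"{name:12s}{value:.4f}")
```

Because the classes are strongly imbalanced (1242 E vs 15994 NE), BA and MCC are more informative here than plain accuracy.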

5. Run the E vs sNE experiments

This cell’s code reproduces the results for Kidney reported in Table 4 (A) of the HELP paper. Genes labeled aE are excluded (excludeflags = "-X aE"), which leaves a two-class E vs sNE problem.

datapath = "."
tissue = "Kidney"                               # or 'Lung'
labelfile = f"{tissue}_HELP.csv"                # label filename
aliases = ""                                    # dictionary for renaming labels before prediction, e.g. {'oldlabel': 'newlabel'}
excludeflags = "-X aE"                          # label to remove, e.g. -X aE (for E vs sNE problem)
njobs = "-1"                                    # parallelism level: -1 = all cpus, 1 = sequential
nchunks = "-c 1 5 1"                            # no. of chunks for each input attribute file, in -i order: BIO is one chunk, CCcfs is split into 5 chunks, N2V is one chunk
voters = "-v 8"                                 # no. of voters on classifier ensemble
repeats = "-r 10"                               # no. of iterations for experiments
!python EG_prediction.py -i {datapath}/{tissue}_BIO.csv \
                            {datapath}/{tissue}_CCcfs.csv \
                            {datapath}/{tissue}_EmbN2V_128.csv \
                            {nchunks} \
                            -l {datapath}/{labelfile} \
                            {aliases} {excludeflags}  \
                            -n std -ba \
                            {voters} {repeats} \
                            -j {njobs} -B
METHOD: LGBM        VOTERS: 8       BALANCE: yes
PROBL: E vs sNE
INPUT: Kidney_BIO.csv Kidney_CCcfs.csv Kidney_EmbN2V_128.csv
LABEL: Kidney_HELP.csv DISTRIB: E : 1242, sNE: 12886
+-------------+---------------------------------+
|             | measure                         |
|-------------+---------------------------------|
| ROC-AUC     | 0.9724±0.0039                   |
| Accuracy    | 0.9221±0.0044                   |
| BA          | 0.9129±0.0085                   |
| Sensitivity | 0.9018±0.0186                   |
| Specificity | 0.9241±0.0053                   |
| MCC         | 0.6579±0.0135                   |
| CM          | [[11200, 1220], [9779, 119081]] |
+-------------+---------------------------------+

Please be aware that these experiments take a while when run sequentially (njobs = "1").
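
The two experiments differ only in how the original labels are handled: -A merges aE and sNE into NE, while -X aE drops the aE genes. The effect on the class counts can be illustrated in plain Python. This is a sketch under stated assumptions: the aE count of 3108 is inferred as 15994 - 12886 from the two DISTRIB lines, and `rename_labels` and `exclude_labels` are hypothetical helpers, not HELP functions.

```python
from collections import Counter

# Class counts for Kidney; the aE count is inferred as 15994 - 12886 (an assumption)
labels = ["E"] * 1242 + ["aE"] * 3108 + ["sNE"] * 12886

def rename_labels(labels, aliases):
    """Replace each label found in aliases, as the -A option is described to do."""
    return [aliases.get(label, label) for label in labels]

def exclude_labels(labels, excluded):
    """Drop every item whose label is in excluded, as the -X option is described to do."""
    return [label for label in labels if label not in excluded]

e_vs_ne = Counter(rename_labels(labels, {"aE": "NE", "sNE": "NE"}))
e_vs_sne = Counter(exclude_labels(labels, {"aE"}))

print(dict(e_vs_ne))   # {'E': 1242, 'NE': 15994}
print(dict(e_vs_sne))  # {'E': 1242, 'sNE': 12886}
```

Both results match the DISTRIB lines printed by the two runs.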