1. Install HELP from GitHub

Skip this cell if you already have installed HELP.

!pip install git+https://github.com/giordamaug/HELP.git

2. Download the input files

For a chosen tissue (here Kidney), download from GitHub the label file (here Kidney_HELP.csv, computed as in Example 1) and the attribute files (here BIO Kidney_BIO.csv, CCcfs Kidney_CCcfs_1.csv, …, Kidney_CCcfs_5.csv, and N2V Kidney_EmbN2V_128.csv). Skip this step if you already have these input files locally.

tissue='Kidney'
!wget https://raw.githubusercontent.com/giordamaug/HELP/main/data/{tissue}_HELP.csv
!wget https://raw.githubusercontent.com/giordamaug/HELP/main/data/{tissue}_BIO.csv
for i in range(5):
  !wget https://raw.githubusercontent.com/giordamaug/HELP/main/data/{tissue}_CCcfs_{i}.csv
!wget https://raw.githubusercontent.com/giordamaug/HELP/main/data/{tissue}_EmbN2V_128.csv
#!wget https://raw.githubusercontent.com/giordamaug/HELP/main/data/{tissue}_CCBeder.csv

Other attribute files (CCBeder) are shown but commented to help the user experiment with different data.

3. Download the script for the experiments and show the man page

Download the batch script for EG prediction used for the experiments and show its manual page:

!wget https://raw.githubusercontent.com/giordamaug/HELP/main/HELPpy/notebooks/EG_prediction.py
!python EG_prediction.py -h

usage: EG_prediction.py [-h] -i <inputfile> [<inputfile> ...]
                        [-c <chunks> [<chunks> ...]]
                        [-X <excludelabels> [<excludelabels> ...]]
                        [-L <labelname>] -l <labelfile> [-A <aliases>]
                        [-b <seed>] [-r <repeat>] [-f <folds>] [-j <jobs>]
                        [-B] [-v <voters>] [-ba] [-fx] [-n <normalize>]
                        [-o <outfile>] [-s <scorefile>] [-p <predfile>]

PLOS COMPBIO

options:
  -h, --help            show this help message and exit
  -i <inputfile> [<inputfile> ...], --inputfile <inputfile> [<inputfile> ...]
                        input attribute filename list
  -c <chunks> [<chunks> ...], --chunks <chunks> [<chunks> ...]
                        no of chunks for attribute filename list
  -X <excludelabels> [<excludelabels> ...], --excludelabels <excludelabels> [<excludelabels> ...]
                        labels to exclude (default NaN, values any list)
  -L <labelname>, --labelname <labelname>
                        label name (default label)
  -l <labelfile>, --labelfile <labelfile>
                        label filename
  -A <aliases>, --aliases <aliases>
                        the dictionary for label renaming (es: {"oldlabel1":
                        "newlabel1", ..., "oldlabelN": "newlabelN"})
  -b <seed>, --seed <seed>
                        random seed (default: 1)
  -r <repeat>, --repeat <repeat>
                        n. of iteration (default: 10)
  -f <folds>, --folds <folds>
                        n. of cv folds (default: 5)
  -j <jobs>, --jobs <jobs>
                        n. of parallel jobs (default: -1)
  -B, --batch           enable batch mode (no output)
  -v <voters>, --voters <voters>
                        n. of voter predictors (default: 1 - one classifier)
  -ba, --balanced       enable balancing in classifier (default disabled)
  -fx, --fixna          enable fixing NaN (default disabled)
  -n <normalize>, --normalize <normalize>
                        normalization mode (default None)
  -o <outfile>, --outfile <outfile>
                        output file for performance measures sumup
  -s <scorefile>, --scorefile <scorefile>
                        output file reporting all measurements
  -p <predfile>, --predfile <predfile>
                        output file reporting predictions

4. Run the E vs NE experiments

This cell’s code reproduces the results for Kidney reported in Table 3 (A) of the HELP paper.

datapath = "."
tissue = "Kidney"                               # or 'Lung'
labelfile = f"{tissue}_HELP.csv"                # label filename
aliases = "-A \"{'aE': 'NE', 'sNE':'NE'}\""     # dictionary for renaming labels before prediction: es. {'oldlabel': 'newlabel'}
excludeflags = ""                               # label to remove (none for E vs NE problem)
njobs = "-1"                                    # parallelism level: -1 = all cpus, 1 = sequential
nchunks = "-c 1 5 1"                            # no. of chunks for each input attribute file: es. 1 5 (Bio is one chunk, CCcfs split in 5 chunks)
voters = "-v 10"                                # no. of voters on classifier ensemble
repeats = "-r 10"                               # no. of iterations for experiments
!python EG_prediction.py -i {datapath}/{tissue}_BIO.csv \
                            {datapath}/{tissue}_CCcfs.csv \
                            {datapath}/{tissue}_EmbN2V_128.csv \
                            {nchunks} \
                            -l {datapath}/{labelfile} \
                            {aliases} {excludeflags}  \
                            {voters} {repeats} \
                            -n std -ba \
                            -j -1 -B

METHOD: LGBM        VOTERS: 10      BALANCE: yes
PROBL: E vs NE
INPUT: Kidney_BIO.csv Kidney_CCcfs.csv Kidney_EmbN2V_128.csv
LABEL: Kidney_HELP.csv DISTRIB: E : 1242, NE: 15994
+-------------+----------------------------------+
|             | measure                          |
|-------------+----------------------------------|
| ROC-AUC     | 0.9572±0.0057                    |
| Accuracy    | 0.8939±0.0037                    |
| BA          | 0.8904±0.0089                    |
| Sensitivity | 0.8862±0.0190                    |
| Specificity | 0.8945±0.0044                    |
| MCC         | 0.5483±0.0114                    |
| CM          | [[11007, 1413], [16876, 143064]] |
+-------------+----------------------------------+

5. Run the E vs sNE experiments

This cell’s code reproduces the results for Kidney reported in Table 4 (A) of the HELP paper, removing the aE flags (excludeflags = "-X aE").

datapath = "."
tissue = "Kidney"                               # or 'Lung'
labelfile = f"{tissue}_HELP.csv"                # label filename
aliases = ""                                    # dictionary for renaming labels before prediction: es. {'oldlabel': 'newlabel'}
excludeflags = "-X aE"                          # label to remove: es. -X aE (for E vs sNE problem)
njobs = "-1"                                    # parallelism level: -1 = all cpus, 1 = sequential
nchunks = "-c 1 5 1"                            # no. of chunks for each input attribute file: es. 1 5 (Bio is one chunk, CCcfs split in 5 chunks)
voters = "-v 8"                                 # no. of voters on classifier ensemble
repeats = "-r 10"                               # no. of iterations for experiments
!python EG_prediction.py -i {datapath}/{tissue}_BIO.csv \
                            {datapath}/{tissue}_CCcfs.csv \
                            {datapath}/{tissue}_EmbN2V_128.csv \
                            {nchunks} \
                            -l {datapath}/{labelfile} \
                            {aliases} {excludeflags}  \
                            -n std -ba \
                            {voters} {repeats} \
                            -j {njobs} -B

METHOD: LGBM        VOTERS: 8       BALANCE: yes
PROBL: E vs sNE
INPUT: Kidney_BIO.csv Kidney_CCcfs.csv Kidney_EmbN2V_128.csv
LABEL: Kidney_HELP.csv DISTRIB: E : 1242, sNE: 12886
+-------------+---------------------------------+
|             | measure                         |
|-------------+---------------------------------|
| ROC-AUC     | 0.9724±0.0039                   |
| Accuracy    | 0.9221±0.0044                   |
| BA          | 0.9129±0.0085                   |
| Sensitivity | 0.9018±0.0186                   |
| Specificity | 0.9241±0.0053                   |
| MCC         | 0.6579±0.0135                   |
| CM          | [[11200, 1220], [9779, 119081]] |
+-------------+---------------------------------+

Please be aware that this will take a while in sequential execution.