1. Install HELP from GitHub

Skip this cell if you have already installed HELP.

!pip install git+https://github.com/giordamaug/HELP.git

2. Download the input files

Download the DepMap gene deletion effect scores (CRISPRGeneEffect.csv) and the mapping between cell lines and tissues (Model.csv). Skip this step if you already have these input files locally.

!wget -c https://figshare.com/ndownloader/files/43346616 -O CRISPRGeneEffect.csv
!wget -c https://figshare.com/ndownloader/files/43746708 -O Model.csv
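If you prefer to make the "skip if already present" check explicit, the download can also be done from Python. The helper below is only a sketch: `download_if_missing` is a hypothetical name, not part of HELP, and unlike `wget -c` it does not resume a partial download (any non-empty file is treated as complete).

```python
import os
import urllib.request

def download_if_missing(url: str, path: str) -> bool:
    """Download `url` to `path` unless a non-empty file already exists there.

    Returns True if a download was performed, False if it was skipped.
    Hypothetical helper for illustration; it does not resume partial files.
    """
    if os.path.exists(path) and os.path.getsize(path) > 0:
        return False
    urllib.request.urlretrieve(url, path)
    return True

# Usage (uncomment to run; same URLs as the wget commands above):
# download_if_missing("https://figshare.com/ndownloader/files/43346616", "CRISPRGeneEffect.csv")
# download_if_missing("https://figshare.com/ndownloader/files/43746708", "Model.csv")
```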

3. Load the input files

Load the CRISPR data and show the content.

import pandas as pd
import os
# keep the part of each column name before the first space, then transpose (rows = genes)
df = (pd.read_csv("CRISPRGeneEffect.csv")
        .rename(columns={'Unnamed: 0': 'gene'})
        .rename(columns=lambda x: x.split(' ')[0])
        .set_index('gene')
        .T)
print(f'{df.isna().sum().sum()} NaN over {len(df)*len(df.columns)} values')
df
739493 NaN over 20287300 values
gene ACH-000001 ACH-000004 ACH-000005 ACH-000007 ACH-000009 ACH-000011 ACH-000012 ACH-000013 ACH-000015 ACH-000017 ... ACH-002693 ACH-002710 ACH-002785 ACH-002799 ACH-002800 ACH-002834 ACH-002847 ACH-002922 ACH-002925 ACH-002926
A1BG -0.122637 0.019756 -0.107208 -0.031027 0.008888 0.022670 -0.096631 0.049811 -0.099040 -0.044896 ... -0.072582 -0.033722 -0.053881 -0.060617 0.025795 -0.055721 -0.009973 -0.025991 -0.127639 -0.068666
A1CF 0.025881 -0.083640 -0.023211 -0.137850 -0.146566 -0.057743 -0.024440 -0.158811 -0.070409 -0.115830 ... -0.237311 -0.108704 -0.114864 -0.042591 -0.132627 -0.121228 -0.119813 -0.007706 -0.040705 -0.107530
A2M 0.034217 -0.060118 0.200204 0.067704 0.084471 0.079679 0.041922 -0.003968 -0.029389 0.024537 ... -0.065940 0.079277 0.069333 0.030989 0.249826 0.072790 0.044097 -0.038468 0.134556 0.067806
A2ML1 -0.128082 -0.027417 0.116039 0.107988 0.089419 0.227512 0.039121 0.034778 0.084594 -0.003710 ... 0.101541 0.038977 0.066599 0.043809 0.064657 0.021916 0.041358 0.236576 -0.047984 0.112071
A3GALT2 -0.031285 -0.036116 -0.172227 0.007992 0.065109 -0.130448 0.028947 -0.120875 -0.052288 -0.336776 ... 0.005374 -0.144070 -0.256227 -0.116473 -0.294305 -0.221940 -0.146565 -0.239690 -0.116114 -0.149897
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
ZYG11A -0.289724 0.032983 -0.201273 -0.100344 -0.112703 0.013401 0.005124 -0.089180 -0.005409 -0.070396 ... -0.296880 -0.084936 -0.128569 -0.110504 -0.087171 0.024959 -0.119911 -0.079342 -0.043555 -0.045115
ZYG11B -0.062972 -0.410392 -0.178877 -0.462160 -0.598698 -0.296421 -0.131949 -0.145737 -0.216393 -0.257916 ... -0.332415 -0.193408 -0.327408 -0.257879 -0.349111 0.015259 -0.289412 -0.347484 -0.335270 -0.307900
ZYX 0.074180 0.113156 -0.055349 -0.001555 0.095877 0.067705 -0.109147 -0.034886 -0.137350 0.029457 ... -0.005090 -0.218960 -0.053033 -0.041612 -0.057478 -0.306562 -0.195097 -0.085302 -0.208063 0.070671
ZZEF1 0.111244 0.234388 -0.002161 -0.325964 -0.026742 -0.232453 -0.164482 -0.175850 -0.168087 -0.284838 ... -0.188751 -0.120449 -0.267081 0.006148 -0.189602 -0.148368 -0.206400 -0.095965 -0.094741 -0.187813
ZZZ3 -0.467908 -0.088306 -0.186842 -0.486660 -0.320759 -0.347234 -0.277397 -0.519586 -0.282338 -0.247634 ... -0.239991 -0.311396 -0.202158 -0.195154 -0.107107 -0.579576 -0.486525 -0.346272 -0.222404 -0.452143

18443 rows × 1100 columns
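To see what the loading cell does, here is the same chain of calls on a tiny in-memory CSV. The layout is an assumption inferred from the code above (an unnamed first column holding the cell-line IDs, and gene columns whose names start with the gene symbol followed by a space); the values and the IDs in parentheses are made up for illustration.

```python
import io
import pandas as pd

# Tiny stand-in for CRISPRGeneEffect.csv (illustrative values only).
csv_text = """,A1BG (1),A1CF (29974),A2M (2)
ACH-000001,-0.12,0.03,
ACH-000004,0.02,-0.08,-0.06
"""

toy = (
    pd.read_csv(io.StringIO(csv_text))
    .rename(columns={"Unnamed: 0": "gene"})       # name the ID column
    .rename(columns=lambda x: x.split(" ")[0])    # "A1BG (1)" -> "A1BG"
    .set_index("gene")
    .T                                            # rows = genes, columns = cell lines
)
n_nan = int(toy.isna().sum().sum())
print(toy)
print(f"{n_nan} NaN over {toy.size} values")
```

In the real output above, 18443 rows × 1100 columns gives the 20287300 values reported, so the 739493 missing entries are about 3.6% of the matrix.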

Then load the mapping information and show the content.

df_map = pd.read_csv("Model.csv")
df_map
ModelID PatientID CellLineName StrippedCellLineName DepmapModelType OncotreeLineage OncotreePrimaryDisease OncotreeSubtype OncotreeCode LegacyMolecularSubtype ... TissueOrigin CCLEName CatalogNumber PlateCoating ModelDerivationMaterial PublicComments WTSIMasterCellID SangerModelID COSMICID LegacySubSubtype
0 ACH-000001 PT-gj46wT NIH:OVCAR-3 NIHOVCAR3 HGSOC Ovary/Fallopian Tube Ovarian Epithelial Tumor High-Grade Serous Ovarian Cancer HGSOC NaN ... NaN NIHOVCAR3_OVARY HTB-71 NaN NaN NaN 2201.0 SIDM00105 905933.0 high_grade_serous
1 ACH-000002 PT-5qa3uk HL-60 HL60 AML Myeloid Acute Myeloid Leukemia Acute Myeloid Leukemia AML NaN ... NaN HL60_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE CCL-240 NaN NaN NaN 55.0 SIDM00829 905938.0 M3
2 ACH-000003 PT-puKIyc CACO2 CACO2 COAD Bowel Colorectal Adenocarcinoma Colon Adenocarcinoma COAD NaN ... NaN CACO2_LARGE_INTESTINE HTB-37 NaN NaN NaN NaN SIDM00891 NaN NaN
3 ACH-000004 PT-q4K2cp HEL HEL AML Myeloid Acute Myeloid Leukemia Acute Myeloid Leukemia AML NaN ... NaN HEL_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE ACC 11 NaN NaN NaN 783.0 SIDM00594 907053.0 M6
4 ACH-000005 PT-q4K2cp HEL 92.1.7 HEL9217 AML Myeloid Acute Myeloid Leukemia Acute Myeloid Leukemia AML NaN ... NaN HEL9217_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE HEL9217 NaN NaN NaN NaN SIDM00593 NaN M6
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1916 ACH-003157 PT-QDEP9D ABM-T0822 ABMT0822 ZIMMMPLC Lung Non-Cancerous Immortalized MPLC Cells NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1917 ACH-003158 PT-nszsxG ABM-T9220 ABMT9220 ZIMMSMCI Muscle Non-Cancerous Immortalized Smooth Muscle Cells, Intestinal NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1918 ACH-003159 PT-AUxVvV ABM-T9233 ABMT9233 ZIMMRSCH Hair Non-Cancerous Immortalized Hair Follicle Inner Root Sheath C... NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1919 ACH-003160 PT-AUxVvV ABM-T9249 ABMT9249 ZIMMGMCH Hair Non-Cancerous Immortalized Hair Germinal Matrix Cells NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1920 ACH-003161 PT-or1hkT ABM-T9430 ABMT9430 ZIMMPSC Pancreas Non-Cancerous Immortalized Pancreatic Stromal Cells NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

1921 rows × 36 columns

Example 1.1 two-class labeling of essential genes (EGs) based on tissue information

Filter the information to be exploited

Remove the cell lines mapped to tissues (OncotreeLineage column in the mapping file) that have fewer than minlines cell lines. The genes (rows) are left unchanged; only columns are dropped.

from HELPpy.utility.selection import filter_crispr_by_model
df = filter_crispr_by_model(df, df_map, minlines=10, line_group='OncotreeLineage')
df
gene ACH-000001 ACH-000004 ACH-000005 ACH-000007 ACH-000009 ACH-000011 ACH-000012 ACH-000013 ACH-000015 ACH-000017 ... ACH-002693 ACH-002710 ACH-002785 ACH-002799 ACH-002800 ACH-002834 ACH-002847 ACH-002922 ACH-002925 ACH-002926
A1BG -0.122637 0.019756 -0.107208 -0.031027 0.008888 0.022670 -0.096631 0.049811 -0.099040 -0.044896 ... -0.072582 -0.033722 -0.053881 -0.060617 0.025795 -0.055721 -0.009973 -0.025991 -0.127639 -0.068666
A1CF 0.025881 -0.083640 -0.023211 -0.137850 -0.146566 -0.057743 -0.024440 -0.158811 -0.070409 -0.115830 ... -0.237311 -0.108704 -0.114864 -0.042591 -0.132627 -0.121228 -0.119813 -0.007706 -0.040705 -0.107530
A2M 0.034217 -0.060118 0.200204 0.067704 0.084471 0.079679 0.041922 -0.003968 -0.029389 0.024537 ... -0.065940 0.079277 0.069333 0.030989 0.249826 0.072790 0.044097 -0.038468 0.134556 0.067806
A2ML1 -0.128082 -0.027417 0.116039 0.107988 0.089419 0.227512 0.039121 0.034778 0.084594 -0.003710 ... 0.101541 0.038977 0.066599 0.043809 0.064657 0.021916 0.041358 0.236576 -0.047984 0.112071
A3GALT2 -0.031285 -0.036116 -0.172227 0.007992 0.065109 -0.130448 0.028947 -0.120875 -0.052288 -0.336776 ... 0.005374 -0.144070 -0.256227 -0.116473 -0.294305 -0.221940 -0.146565 -0.239690 -0.116114 -0.149897
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
ZYG11A -0.289724 0.032983 -0.201273 -0.100344 -0.112703 0.013401 0.005124 -0.089180 -0.005409 -0.070396 ... -0.296880 -0.084936 -0.128569 -0.110504 -0.087171 0.024959 -0.119911 -0.079342 -0.043555 -0.045115
ZYG11B -0.062972 -0.410392 -0.178877 -0.462160 -0.598698 -0.296421 -0.131949 -0.145737 -0.216393 -0.257916 ... -0.332415 -0.193408 -0.327408 -0.257879 -0.349111 0.015259 -0.289412 -0.347484 -0.335270 -0.307900
ZYX 0.074180 0.113156 -0.055349 -0.001555 0.095877 0.067705 -0.109147 -0.034886 -0.137350 0.029457 ... -0.005090 -0.218960 -0.053033 -0.041612 -0.057478 -0.306562 -0.195097 -0.085302 -0.208063 0.070671
ZZEF1 0.111244 0.234388 -0.002161 -0.325964 -0.026742 -0.232453 -0.164482 -0.175850 -0.168087 -0.284838 ... -0.188751 -0.120449 -0.267081 0.006148 -0.189602 -0.148368 -0.206400 -0.095965 -0.094741 -0.187813
ZZZ3 -0.467908 -0.088306 -0.186842 -0.486660 -0.320759 -0.347234 -0.277397 -0.519586 -0.282338 -0.247634 ... -0.239991 -0.311396 -0.202158 -0.195154 -0.107107 -0.579576 -0.486525 -0.346272 -0.222404 -0.452143

18443 rows × 1091 columns
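The filtering step can be pictured with plain pandas. The function below is a minimal sketch of this kind of filter, not HELP's implementation: `filter_by_group_size` is a hypothetical name, and it assumes that group sizes are counted only over cell lines that actually appear as columns of the score matrix.

```python
import pandas as pd

def filter_by_group_size(scores, model_map, minlines=10,
                         line_group="OncotreeLineage", id_col="ModelID"):
    """Keep the columns (cell lines) whose group has at least `minlines` scored lines."""
    # Assumption: only cell lines present as columns of `scores` are counted.
    present = model_map[model_map[id_col].isin(scores.columns)]
    counts = present[line_group].value_counts()
    large_groups = counts[counts >= minlines].index
    keep = present.loc[present[line_group].isin(large_groups), id_col]
    return scores.loc[:, scores.columns.isin(keep)]

# Toy data: 3 kidney lines and 1 scored testis line, 2 genes (made-up values).
scores = pd.DataFrame(
    [[-0.1, 0.0, 0.2, -0.3], [-1.2, -0.9, -1.1, -0.2]],
    index=["GENE_A", "GENE_B"],
    columns=["ACH-1", "ACH-2", "ACH-3", "ACH-4"],
)
model_map = pd.DataFrame({
    "ModelID": ["ACH-1", "ACH-2", "ACH-3", "ACH-4", "ACH-5"],
    "OncotreeLineage": ["Kidney", "Kidney", "Kidney", "Testis", "Testis"],
})
filtered = filter_by_group_size(scores, model_map, minlines=2)
print(list(filtered.columns))
```

With `minlines=2` the single scored testis line is dropped and the three kidney lines are kept, while both genes remain as rows.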

Show the tissues available in the mapping file and the number of cell lines in each:

print(df_map[['OncotreeLineage']].value_counts())
OncotreeLineage
Lung                         249
Lymphoid                     211
CNS/Brain                    122
Skin                         120
Esophagus/Stomach             95
Breast                        94
Bowel                         89
Head and Neck                 84
Bone                          77
Myeloid                       77
Ovary/Fallopian Tube          75
Kidney                        73
Pancreas                      66
Peripheral Nervous System     56
Soft Tissue                   55
Biliary Tract                 44
Uterus                        41
Fibroblast                    41
Bladder/Urinary Tract         39
Normal                        39
Pleura                        35
Liver                         29
Cervix                        25
Eye                           21
Thyroid                       18
Prostate                      15
Testis                         7
Vulva/Vagina                   5
Muscle                         5
Ampulla of Vater               4
Hair                           2
Other                          1
Embryonal                      1
Adrenal Gland                  1
Name: count, dtype: int64

Select only the cell lines of a chosen tissue (here Kidney) and remove the genes (rows) whose percentage of NaN values reaches a given threshold (here 95%):

from HELPpy.utility.selection import select_cell_lines, delrows_with_nan_percentage
from HELPpy.models.labelling import labelling
tissue = 'Kidney'
# columns (cell lines) belonging to the chosen tissue
cell_lines = select_cell_lines(df, df_map, [tissue])
print(f"Selecting {len(cell_lines)} cell-lines")
# remove rows (genes) with at least perc% NaN values
df_nonan = delrows_with_nan_percentage(df, perc=95)
Selecting 37 cell-lines
Removed 512 rows from 18443 with at least 95% NaN
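As an illustration of the NaN filter, the sketch below drops rows by their percentage of missing values with plain pandas. It is not HELP's code: `drop_rows_by_nan_percentage` is a hypothetical name, and the "at least perc" comparison is taken from the log message printed above.

```python
import numpy as np
import pandas as pd

def drop_rows_by_nan_percentage(df, perc=95.0):
    """Drop rows whose percentage of NaN values is at least `perc`."""
    nan_pct = df.isna().mean(axis=1) * 100   # per-row share of NaN, in percent
    kept = df.loc[nan_pct < perc]
    print(f"Removed {len(df) - len(kept)} rows from {len(df)} with at least {perc}% NaN")
    return kept

# Toy matrix: GENE_A has 0% NaN, GENE_B 50%, GENE_C 100% (made-up values).
toy = pd.DataFrame(
    {"line1": [0.1, np.nan, np.nan], "line2": [-0.2, -1.0, np.nan]},
    index=["GENE_A", "GENE_B", "GENE_C"],
)
clean = drop_rows_by_nan_percentage(toy, perc=95)
print(list(clean.index))
```

Note that in the cell above the filter is applied to the whole matrix `df`, so the percentage is computed over all remaining cell lines and not only over the Kidney ones. In the real run, 18443 - 512 = 17931 genes remain, which matches the total of the label counts reported below.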

Apply two-class HELP labelling

Compute the two-class labeling (mode='flat-multi') using the Otsu algorithm (algorithm='otsu') and save the results in a CSV file (Kidney_HELP_twoClasses.csv):

df_label2 = labelling(df_nonan, columns = cell_lines, n_classes=2,
                      labelnames={0: 'E', 1: 'NE'},
                      mode='flat-multi', algorithm='otsu', verbose=True)
# save the result
df_label2.to_csv(f"{tissue}_HELP_twoClasses.csv")
performing flat mode on 2-class labelling (flat-multi).
[flat-multi]: 1. multi-class labelling:
100%|██████████| 37/37 [00:00<00:00, 598.29it/s]
label
NE       16678
E         1253
Name: count, dtype: int64

Example 1.2 three-class labeling of EGs based on tissue information

The data have already been filtered according to tissue information in Example 1.1 (df_nonan and cell_lines), so only the labelling step remains.

Apply three-class HELP labelling

Compute the three-class labeling (mode='two-by-two') using the Otsu algorithm (algorithm='otsu') and save the results in a CSV file (Kidney_HELP_threeClasses.csv):

df_label3 = labelling(df_nonan, columns = cell_lines, n_classes=2,
                      labelnames={0: 'E', 1: 'aE', 2: 'sNE'},
                      mode='two-by-two', algorithm='otsu', verbose=True)
# save the result
df_label3.to_csv(f"{tissue}_HELP_threeClasses.csv")
performing flat mode on 3-class labelling (two-by-two).
[two-by-two]: 1. Two-class labelling:
100%|██████████| 37/37 [00:00<00:00, 572.91it/s]
(17931,)
[two-by-two]: 2. Two-class labelling on 1-label rows:
100%|██████████| 37/37 [00:00<00:00, 737.77it/s]
(16678,)
label
sNE      13457
aE        3221
E         1253
Name: count, dtype: int64
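The verbose log suggests a two-step procedure: a first two-class split over all 17931 genes, then a second two-class split over the 16678 genes that were not labelled E, which yields 3221 aE and 13457 sNE (3221 + 13457 = 16678). The sketch below mimics that idea on synthetic data. It is an illustration under assumptions, not HELP's code: `otsu_threshold`, `two_class` and `two_by_two` are hypothetical helpers, and the row-wise mode is assumed as the way to combine cell lines.

```python
import numpy as np
import pandas as pd

def otsu_threshold(values, bins=256):
    """Plain NumPy Otsu: the bin edge maximizing the between-class variance."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]
    hist, edges = np.histogram(values, bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2
    w0 = np.cumsum(hist)
    w1 = w0[-1] - w0
    s0 = np.cumsum(hist * centers)
    mu0 = s0 / np.maximum(w0, 1)
    mu1 = (s0[-1] - s0) / np.maximum(w1, 1)
    between = w0 * w1 * (mu0 - mu1) ** 2
    return edges[np.argmax(between) + 1]

def two_class(scores):
    """Per column: 0 below the Otsu threshold, 1 otherwise; then row-wise mode."""
    per_line = scores.apply(lambda col: (col >= otsu_threshold(col)).astype(int))
    return per_line.mode(axis=1)[0].astype(int)

def two_by_two(scores, names=("E", "aE", "sNE")):
    """First split off the lowest class, then split the remaining rows again."""
    first = two_class(scores)            # 0 = lowest class, 1 = everything else
    rest = scores.loc[first == 1]        # rows that were not put in the lowest class
    second = two_class(rest)             # 0 = middle class, 1 = highest class
    labels = pd.Series(names[0], index=scores.index, name="label")
    labels.loc[second[second == 0].index] = names[1]
    labels.loc[second[second == 1].index] = names[2]
    return labels

# Toy matrix: 20 genes near -1.5, 50 near -0.4, 130 near 0; 5 cell lines.
rng = np.random.default_rng(0)
centers = np.repeat([-1.5, -0.4, 0.0], [20, 50, 130])
scores = pd.DataFrame(
    centers[:, None] + rng.uniform(-0.1, 0.1, size=(200, 5)),
    index=[f"g{i}" for i in range(200)],
    columns=[f"line{j}" for j in range(5)],
)
labels = two_by_two(scores)
print(labels.value_counts())
```

The toy groups were chosen so that the first split isolates the lowest group. With a middle group lying closer to the lowest one, a single Otsu split may instead fall between the middle and the highest group, so the outcome of a two-step scheme depends on the score distribution.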