1. Install HELP from GitHub

Skip this cell if you have already installed HELP.

!pip install git+https://github.com/giordamaug/HELP.git

2. Download the input files

Download the gene deletion expression scores (CRISPRGeneEffect.csv) and the mapping between cell lines and tissues (Model.csv) from the DepMap portal. Skip this step if you already have these input files locally.

!wget -c https://figshare.com/ndownloader/files/43346616 -O CRISPRGeneEffect.csv
!wget -c https://figshare.com/ndownloader/files/43746708 -O Model.csv

3. Load the input files

Load the CRISPR data and show the content.

import pandas as pd
import os
df = (pd.read_csv("CRISPRGeneEffect.csv")
      .rename(columns={'Unnamed: 0': 'gene'})
      .rename(columns=lambda x: x.split(' ')[0])  # keep only the text before the first space
      .set_index('gene')
      .T)  # transpose: genes as rows, cell lines as columns
print(f'{df.isna().sum().sum()} NaN over {len(df)*len(df.columns)} values')
df

739493 NaN over 20287300 values

gene	ACH-000001	ACH-000004	ACH-000005	ACH-000007	ACH-000009	ACH-000011	ACH-000012	ACH-000013	ACH-000015	ACH-000017	...	ACH-002693	ACH-002710	ACH-002785	ACH-002799	ACH-002800	ACH-002834	ACH-002847	ACH-002922	ACH-002925	ACH-002926
A1BG	-0.122637	0.019756	-0.107208	-0.031027	0.008888	0.022670	-0.096631	0.049811	-0.099040	-0.044896	...	-0.072582	-0.033722	-0.053881	-0.060617	0.025795	-0.055721	-0.009973	-0.025991	-0.127639	-0.068666
A1CF	0.025881	-0.083640	-0.023211	-0.137850	-0.146566	-0.057743	-0.024440	-0.158811	-0.070409	-0.115830	...	-0.237311	-0.108704	-0.114864	-0.042591	-0.132627	-0.121228	-0.119813	-0.007706	-0.040705	-0.107530
A2M	0.034217	-0.060118	0.200204	0.067704	0.084471	0.079679	0.041922	-0.003968	-0.029389	0.024537	...	-0.065940	0.079277	0.069333	0.030989	0.249826	0.072790	0.044097	-0.038468	0.134556	0.067806
A2ML1	-0.128082	-0.027417	0.116039	0.107988	0.089419	0.227512	0.039121	0.034778	0.084594	-0.003710	...	0.101541	0.038977	0.066599	0.043809	0.064657	0.021916	0.041358	0.236576	-0.047984	0.112071
A3GALT2	-0.031285	-0.036116	-0.172227	0.007992	0.065109	-0.130448	0.028947	-0.120875	-0.052288	-0.336776	...	0.005374	-0.144070	-0.256227	-0.116473	-0.294305	-0.221940	-0.146565	-0.239690	-0.116114	-0.149897
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
ZYG11A	-0.289724	0.032983	-0.201273	-0.100344	-0.112703	0.013401	0.005124	-0.089180	-0.005409	-0.070396	...	-0.296880	-0.084936	-0.128569	-0.110504	-0.087171	0.024959	-0.119911	-0.079342	-0.043555	-0.045115
ZYG11B	-0.062972	-0.410392	-0.178877	-0.462160	-0.598698	-0.296421	-0.131949	-0.145737	-0.216393	-0.257916	...	-0.332415	-0.193408	-0.327408	-0.257879	-0.349111	0.015259	-0.289412	-0.347484	-0.335270	-0.307900
ZYX	0.074180	0.113156	-0.055349	-0.001555	0.095877	0.067705	-0.109147	-0.034886	-0.137350	0.029457	...	-0.005090	-0.218960	-0.053033	-0.041612	-0.057478	-0.306562	-0.195097	-0.085302	-0.208063	0.070671
ZZEF1	0.111244	0.234388	-0.002161	-0.325964	-0.026742	-0.232453	-0.164482	-0.175850	-0.168087	-0.284838	...	-0.188751	-0.120449	-0.267081	0.006148	-0.189602	-0.148368	-0.206400	-0.095965	-0.094741	-0.187813
ZZZ3	-0.467908	-0.088306	-0.186842	-0.486660	-0.320759	-0.347234	-0.277397	-0.519586	-0.282338	-0.247634	...	-0.239991	-0.311396	-0.202158	-0.195154	-0.107107	-0.579576	-0.486525	-0.346272	-0.222404	-0.452143

18443 rows × 1100 columns
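To make the loading chain easier to follow, here is a minimal sketch that applies the same steps to a tiny inline CSV. The toy values and the three-gene layout are made up for illustration; the "SYMBOL (ID)" column naming is an assumption inferred from the `x.split(' ')[0]` step.

```python
import io
import pandas as pd

# Toy stand-in for CRISPRGeneEffect.csv (values are invented):
# an unnamed first column of model IDs, then one column per gene.
toy_csv = io.StringIO(
    ",A1BG (1),A1CF (29974),A2M (2)\n"
    "ACH-000001,-0.12,0.03,\n"
    "ACH-000004,0.02,-0.08,-0.06\n"
)

toy = (
    pd.read_csv(toy_csv)
    .rename(columns={"Unnamed: 0": "gene"})     # name the ID column
    .rename(columns=lambda x: x.split(" ")[0])  # "A1BG (1)" -> "A1BG"
    .set_index("gene")
    .T                                          # genes as rows, cell lines as columns
)

n_nan = int(toy.isna().sum().sum())
print(toy)
print(f"{n_nan} NaN over {toy.size} values")
```

After the transpose, the column axis keeps the name `gene`, which is why that word appears in the top-left corner of the table shown above.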

Then load the mapping information and show the content.

df_map = pd.read_csv("Model.csv")
df_map

	ModelID	PatientID	CellLineName	StrippedCellLineName	DepmapModelType	OncotreeLineage	OncotreePrimaryDisease	OncotreeSubtype	OncotreeCode	LegacyMolecularSubtype	...	TissueOrigin	CCLEName	CatalogNumber	PlateCoating	ModelDerivationMaterial	PublicComments	WTSIMasterCellID	SangerModelID	COSMICID	LegacySubSubtype
0	ACH-000001	PT-gj46wT	NIH:OVCAR-3	NIHOVCAR3	HGSOC	Ovary/Fallopian Tube	Ovarian Epithelial Tumor	High-Grade Serous Ovarian Cancer	HGSOC	NaN	...	NaN	NIHOVCAR3_OVARY	HTB-71	NaN	NaN	NaN	2201.0	SIDM00105	905933.0	high_grade_serous
1	ACH-000002	PT-5qa3uk	HL-60	HL60	AML	Myeloid	Acute Myeloid Leukemia	Acute Myeloid Leukemia	AML	NaN	...	NaN	HL60_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE	CCL-240	NaN	NaN	NaN	55.0	SIDM00829	905938.0	M3
2	ACH-000003	PT-puKIyc	CACO2	CACO2	COAD	Bowel	Colorectal Adenocarcinoma	Colon Adenocarcinoma	COAD	NaN	...	NaN	CACO2_LARGE_INTESTINE	HTB-37	NaN	NaN	NaN	NaN	SIDM00891	NaN	NaN
3	ACH-000004	PT-q4K2cp	HEL	HEL	AML	Myeloid	Acute Myeloid Leukemia	Acute Myeloid Leukemia	AML	NaN	...	NaN	HEL_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE	ACC 11	NaN	NaN	NaN	783.0	SIDM00594	907053.0	M6
4	ACH-000005	PT-q4K2cp	HEL 92.1.7	HEL9217	AML	Myeloid	Acute Myeloid Leukemia	Acute Myeloid Leukemia	AML	NaN	...	NaN	HEL9217_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE	HEL9217	NaN	NaN	NaN	NaN	SIDM00593	NaN	M6
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1916	ACH-003157	PT-QDEP9D	ABM-T0822	ABMT0822	ZIMMMPLC	Lung	Non-Cancerous	Immortalized MPLC Cells	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1917	ACH-003158	PT-nszsxG	ABM-T9220	ABMT9220	ZIMMSMCI	Muscle	Non-Cancerous	Immortalized Smooth Muscle Cells, Intestinal	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1918	ACH-003159	PT-AUxVvV	ABM-T9233	ABMT9233	ZIMMRSCH	Hair	Non-Cancerous	Immortalized Hair Follicle Inner Root Sheath C...	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1919	ACH-003160	PT-AUxVvV	ABM-T9249	ABMT9249	ZIMMGMCH	Hair	Non-Cancerous	Immortalized Hair Germinal Matrix Cells	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1920	ACH-003161	PT-or1hkT	ABM-T9430	ABMT9430	ZIMMPSC	Pancreas	Non-Cancerous	Immortalized Pancreatic Stromal Cells	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

1921 rows × 36 columns

Example 1.1 two-class labeling of EGs based on tissue information

Filter the information to be exploited

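Remove the cell lines mapped to tissues (OncotreeLineage column in the mapping file) that have fewer than minlines cell lines. With minlines=10, the table goes from 1100 to 1091 cell lines (columns), while the 18443 genes (rows) are unchanged.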

from HELPpy.utility.selection import filter_crispr_by_model
df = filter_crispr_by_model(df, df_map, minlines=10, line_group='OncotreeLineage')
df

gene	ACH-000001	ACH-000004	ACH-000005	ACH-000007	ACH-000009	ACH-000011	ACH-000012	ACH-000013	ACH-000015	ACH-000017	...	ACH-002693	ACH-002710	ACH-002785	ACH-002799	ACH-002800	ACH-002834	ACH-002847	ACH-002922	ACH-002925	ACH-002926
A1BG	-0.122637	0.019756	-0.107208	-0.031027	0.008888	0.022670	-0.096631	0.049811	-0.099040	-0.044896	...	-0.072582	-0.033722	-0.053881	-0.060617	0.025795	-0.055721	-0.009973	-0.025991	-0.127639	-0.068666
A1CF	0.025881	-0.083640	-0.023211	-0.137850	-0.146566	-0.057743	-0.024440	-0.158811	-0.070409	-0.115830	...	-0.237311	-0.108704	-0.114864	-0.042591	-0.132627	-0.121228	-0.119813	-0.007706	-0.040705	-0.107530
A2M	0.034217	-0.060118	0.200204	0.067704	0.084471	0.079679	0.041922	-0.003968	-0.029389	0.024537	...	-0.065940	0.079277	0.069333	0.030989	0.249826	0.072790	0.044097	-0.038468	0.134556	0.067806
A2ML1	-0.128082	-0.027417	0.116039	0.107988	0.089419	0.227512	0.039121	0.034778	0.084594	-0.003710	...	0.101541	0.038977	0.066599	0.043809	0.064657	0.021916	0.041358	0.236576	-0.047984	0.112071
A3GALT2	-0.031285	-0.036116	-0.172227	0.007992	0.065109	-0.130448	0.028947	-0.120875	-0.052288	-0.336776	...	0.005374	-0.144070	-0.256227	-0.116473	-0.294305	-0.221940	-0.146565	-0.239690	-0.116114	-0.149897
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
ZYG11A	-0.289724	0.032983	-0.201273	-0.100344	-0.112703	0.013401	0.005124	-0.089180	-0.005409	-0.070396	...	-0.296880	-0.084936	-0.128569	-0.110504	-0.087171	0.024959	-0.119911	-0.079342	-0.043555	-0.045115
ZYG11B	-0.062972	-0.410392	-0.178877	-0.462160	-0.598698	-0.296421	-0.131949	-0.145737	-0.216393	-0.257916	...	-0.332415	-0.193408	-0.327408	-0.257879	-0.349111	0.015259	-0.289412	-0.347484	-0.335270	-0.307900
ZYX	0.074180	0.113156	-0.055349	-0.001555	0.095877	0.067705	-0.109147	-0.034886	-0.137350	0.029457	...	-0.005090	-0.218960	-0.053033	-0.041612	-0.057478	-0.306562	-0.195097	-0.085302	-0.208063	0.070671
ZZEF1	0.111244	0.234388	-0.002161	-0.325964	-0.026742	-0.232453	-0.164482	-0.175850	-0.168087	-0.284838	...	-0.188751	-0.120449	-0.267081	0.006148	-0.189602	-0.148368	-0.206400	-0.095965	-0.094741	-0.187813
ZZZ3	-0.467908	-0.088306	-0.186842	-0.486660	-0.320759	-0.347234	-0.277397	-0.519586	-0.282338	-0.247634	...	-0.239991	-0.311396	-0.202158	-0.195154	-0.107107	-0.579576	-0.486525	-0.346272	-0.222404	-0.452143

18443 rows × 1091 columns
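The filtering step can be sketched in plain pandas as follows. `drop_small_groups` is a hypothetical helper written for illustration, not the HELPpy implementation; in particular, counting only the cell lines that actually appear as columns of the score table is an assumption.

```python
import pandas as pd

def drop_small_groups(scores, model_map, minlines, line_group="OncotreeLineage"):
    """Hypothetical helper: keep the columns whose group has >= minlines cell lines."""
    # Count only the cell lines present as columns of `scores` (an assumption).
    screened = model_map[model_map["ModelID"].isin(scores.columns)]
    counts = screened[line_group].value_counts()
    keep_groups = counts[counts >= minlines].index
    keep_lines = screened.loc[screened[line_group].isin(keep_groups), "ModelID"]
    return scores.loc[:, scores.columns.isin(keep_lines)]

# Toy data: 2 genes x 5 cell lines, plus one mapped line (L6) without scores.
scores = pd.DataFrame(
    [[-0.1, 0.0, 0.2, -0.3, 0.1], [-1.0, -0.9, -1.1, -0.8, -1.2]],
    index=["G1", "G2"],
    columns=["L1", "L2", "L3", "L4", "L5"],
)
model_map = pd.DataFrame({
    "ModelID": ["L1", "L2", "L3", "L4", "L5", "L6"],
    "OncotreeLineage": ["Kidney", "Kidney", "Kidney", "Hair", "Lung", "Lung"],
})

filtered = drop_small_groups(scores, model_map, minlines=2)
print(filtered.columns.tolist())
```

In this toy case only the three Kidney lines survive: Hair has one line, and Lung has only one line with scores.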

List the tissues available in the mapping file, with the number of cell lines in each:

print(df_map[['OncotreeLineage']].value_counts())

OncotreeLineage
Lung                         249
Lymphoid                     211
CNS/Brain                    122
Skin                         120
Esophagus/Stomach             95
Breast                        94
Bowel                         89
Head and Neck                 84
Bone                          77
Myeloid                       77
Ovary/Fallopian Tube          75
Kidney                        73
Pancreas                      66
Peripheral Nervous System     56
Soft Tissue                   55
Biliary Tract                 44
Uterus                        41
Fibroblast                    41
Bladder/Urinary Tract         39
Normal                        39
Pleura                        35
Liver                         29
Cervix                        25
Eye                           21
Thyroid                       18
Prostate                      15
Testis                         7
Vulva/Vagina                   5
Muscle                         5
Ampulla of Vater               4
Hair                           2
Other                          1
Embryonal                      1
Adrenal Gland                  1
Name: count, dtype: int64
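These counts cover every model listed in Model.csv, which presumably explains why Kidney shows 73 entries here while only 37 Kidney cell lines are selected from the CRISPR table in the next step. A minimal sketch on toy data shows how the two counts can be compared; the IDs and the `screened_ids` list are made up for illustration.

```python
import pandas as pd

# Toy mapping table; IDs and tissues are invented.
model_map = pd.DataFrame({
    "ModelID": ["M1", "M2", "M3", "M4", "M5"],
    "OncotreeLineage": ["Kidney", "Kidney", "Kidney", "Lung", "Lung"],
})
screened_ids = ["M1", "M3", "M4"]  # pretend only these have CRISPR scores

# Count all mapped models, then only those with scores.
all_counts = model_map["OncotreeLineage"].value_counts()
screened_counts = (
    model_map.loc[model_map["ModelID"].isin(screened_ids), "OncotreeLineage"]
    .value_counts()
)

# Align the two counts side by side on the tissue name.
comparison = pd.DataFrame({"in_map": all_counts, "screened": screened_counts})
print(comparison)
```

In the notebook, the analogous restriction would be `df_map[df_map['ModelID'].isin(df.columns)]`, assuming the columns of `df` are DepMap model IDs as in the tables above.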

Select only the cell lines of a chosen tissue (here Kidney) and remove the genes (rows) having at least a certain percentage of NaN values (here 95%):

tissue = 'Kidney'
from HELPpy.utility.selection import select_cell_lines, delrows_with_nan_percentage
from HELPpy.models.labelling import labelling
cell_lines = select_cell_lines(df, df_map, [tissue])
print(f"Selecting {len(cell_lines)} cell-lines")
# remove rows (genes) with at least perc% NaN values
df_nonan = delrows_with_nan_percentage(df, perc=95)

Selecting 37 cell-lines
Removed 512 rows from 18443 with at least 95% NaN
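The row filter can be sketched in a few lines of pandas. `drop_rows_by_nan_pct` is a hypothetical stand-in, not the HELPpy function; the inclusive threshold ("at least") is taken from the log message above.

```python
import numpy as np
import pandas as pd

def drop_rows_by_nan_pct(frame, perc):
    """Hypothetical helper: drop rows whose share of NaN is at least `perc` percent."""
    nan_pct = frame.isna().mean(axis=1) * 100  # percentage of NaN per row
    kept = frame.loc[nan_pct < perc]
    print(f"Removed {len(frame) - len(kept)} rows from {len(frame)} "
          f"with at least {perc}% NaN")
    return kept

# Toy table: G1 has 0% NaN, G2 has 100%, G3 has 50%.
toy = pd.DataFrame(
    {"L1": [0.1, np.nan, np.nan], "L2": [-0.5, np.nan, 0.2]},
    index=["G1", "G2", "G3"],
)
toy_nonan = drop_rows_by_nan_pct(toy, perc=95)
```

Only G2 is dropped in this example, since it is the only row at or above the 95% threshold.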

Apply two-class HELP labelling

Compute the two-class labeling (mode='flat-multi') using the Otsu algorithm (algorithm='otsu') and save the results in a csv file (Kidney_HELP_twoClasses.csv):

df_label2 = labelling(df_nonan, columns = cell_lines, n_classes=2,
                      labelnames={0: 'E', 1: 'NE'},
                      mode='flat-multi', algorithm='otsu', verbose=True)
# save the result
df_label2.to_csv(f"{tissue}_HELP_twoClasses.csv")

performing flat mode on 2-class labelling (flat-multi).
[flat-multi]: 1. multi-class labelling:

100%|██████████| 37/37 [00:00<00:00, 598.29it/s]

label
NE       16678
E         1253
Name: count, dtype: int64
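As a rough illustration of the idea behind this step, the sketch below thresholds each cell line's scores with Otsu's criterion (maximum between-class variance) and then assigns each gene its most frequent label across cell lines. It is not the HELPpy code. The per-cell-line pass followed by a majority vote is an assumption based on the progress bar running over the 37 cell lines, the mapping of the lower class to 'E' is also an assumption, and the threshold search is an exact, unbinned variant rather than a histogram-based one. NaN handling is omitted.

```python
from collections import Counter

def otsu_threshold(values):
    """Cut t maximizing between-class variance for classes x <= t and x > t."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    best_t, best_var, left_sum = xs[0], -1.0, 0.0
    for i in range(1, n):                 # i = size of the lower class
        left_sum += xs[i - 1]
        if xs[i] == xs[i - 1]:
            continue                      # cannot cut between equal values
        mu0, mu1 = left_sum / i, (total - left_sum) / (n - i)
        var_between = (i / n) * ((n - i) / n) * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, (xs[i - 1] + xs[i]) / 2
    return best_t

# Toy scores (invented): one list per cell line, one value per gene.
genes = ["G1", "G2", "G3", "G4", "G5"]
scores_by_line = {
    "L1": [-1.2, -0.9, 0.0, 0.1, -0.1],
    "L2": [-1.0, -0.2, 0.1, 0.0, -0.1],
    "L3": [-1.1, -1.0, 0.05, -0.05, 0.0],
}

# Label every gene within each cell line, then vote across cell lines.
votes = {g: [] for g in genes}
for line, scores in scores_by_line.items():
    t = otsu_threshold(scores)
    for g, s in zip(genes, scores):
        votes[g].append("E" if s <= t else "NE")  # lower score -> 'E' (assumption)

labels = {g: Counter(v).most_common(1)[0][0] for g, v in votes.items()}
print(labels)
```

In this toy example G2 is labelled NE in cell line L2 but E in the other two, so the vote assigns it E.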

Example 1.2 three-class labeling of EGs based on tissue information

The data have already been filtered according to tissue information in Example 1.1, so only the labelling step remains:

Apply three-class HELP labelling

Compute the three-class labeling (mode='two-by-two') using the Otsu algorithm (algorithm='otsu') and save the results in a csv file ('Kidney_HELP_threeClasses.csv'):

df_label3 = labelling(df_nonan, columns = cell_lines, n_classes=2,
                      labelnames={0: 'E', 1: 'aE', 2: 'sNE'},
                      mode='two-by-two', algorithm='otsu', verbose=True)
# save the result
df_label3.to_csv(f"{tissue}_HELP_threeClasses.csv")

performing flat mode on 3-class labelling (two-by-two).
[two-by-two]: 1. Two-class labelling:

100%|██████████| 37/37 [00:00<00:00, 572.91it/s]

(17931,)
[two-by-two]: 2. Two-class labelling on 1-label rows:

100%|██████████| 37/37 [00:00<00:00, 737.77it/s]

(16678,)
label
sNE      13457
aE        3221
E         1253
Name: count, dtype: int64
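The log above suggests a nested scheme: a first two-class split over all 17931 genes, then a second two-class split restricted to the 16678 genes labelled NE, which yields aE and sNE (3221 + 13457 = 16678) while the 1253 E genes stay unchanged. A minimal sketch of that nesting on a single toy score vector follows. It is an illustration under these assumptions, not the HELPpy implementation, and `otsu_cut` is a brute-force stand-in that needs at least two distinct values.

```python
from statistics import fmean

def otsu_cut(values):
    """Brute-force Otsu-style cut: maximize n0 * n1 * (mean0 - mean1)^2."""
    vals = list(values)
    xs = sorted(set(vals))

    def between_var(t):
        lo = [v for v in vals if v <= t]
        hi = [v for v in vals if v > t]
        return len(lo) * len(hi) * (fmean(lo) - fmean(hi)) ** 2

    # Candidate cuts are midpoints between consecutive distinct values.
    cuts = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return max(cuts, key=between_var)

def two_by_two(scores):
    """Split into E / NE, then split the NE genes again into aE / sNE."""
    t1 = otsu_cut(scores.values())
    rest = {g: s for g, s in scores.items() if s > t1}   # first-pass NE genes
    t2 = otsu_cut(rest.values())
    return {
        g: "E" if s <= t1 else ("aE" if s <= t2 else "sNE")
        for g, s in scores.items()
    }

# Toy scores for one cell line (invented values).
toy_scores = {"G1": -1.5, "G2": -1.4, "G3": -0.5, "G4": -0.45,
              "G5": 0.0, "G6": 0.05, "G7": 0.1}
toy_labels = two_by_two(toy_scores)
print(toy_labels)
```

Here the first cut separates G1 and G2 from the rest, and the second cut, computed only on the remaining five genes, separates G3 and G4 (aE) from G5 to G7 (sNE).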
