1. Install HELP from GitHub
Skip this cell if you alread have installed HELP.
!pip install git+https://github.com/giordamaug/HELP.git
2. Download the input files
Download from the DepMap portal the gene deletion expression scores
(CRISPRGeneEffect.csv
) and the map between cell-lines and tissues
(Model.csv
). Skip this step if you already have these input files
locally.
!wget -c https://figshare.com/ndownloader/files/43346616 -O CRISPRGeneEffect.csv
!wget -c https://figshare.com/ndownloader/files/43746708 -O Model.csv
3. Load the input file
Load the CRISPR data and show the content.
import pandas as pd
import os
df_orig = pd.read_csv("CRISPRGeneEffect.csv").rename(columns={'Unnamed: 0': 'gene'}).rename(columns=lambda x: x.split(' ')[0]).set_index('gene').T
print(f'{df_orig.isna().sum().sum()} NaNs over {len(df_orig)*len(df_orig.columns)} values')
df_orig
/Users/maurizio/HELPold/help/datafinal
739493 NaNs over 20287300 values
gene | ACH-000001 | ACH-000004 | ACH-000005 | ACH-000007 | ACH-000009 | ACH-000011 | ACH-000012 | ACH-000013 | ACH-000015 | ACH-000017 | ... | ACH-002693 | ACH-002710 | ACH-002785 | ACH-002799 | ACH-002800 | ACH-002834 | ACH-002847 | ACH-002922 | ACH-002925 | ACH-002926 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A1BG | -0.122637 | 0.019756 | -0.107208 | -0.031027 | 0.008888 | 0.022670 | -0.096631 | 0.049811 | -0.099040 | -0.044896 | ... | -0.072582 | -0.033722 | -0.053881 | -0.060617 | 0.025795 | -0.055721 | -0.009973 | -0.025991 | -0.127639 | -0.068666 |
A1CF | 0.025881 | -0.083640 | -0.023211 | -0.137850 | -0.146566 | -0.057743 | -0.024440 | -0.158811 | -0.070409 | -0.115830 | ... | -0.237311 | -0.108704 | -0.114864 | -0.042591 | -0.132627 | -0.121228 | -0.119813 | -0.007706 | -0.040705 | -0.107530 |
A2M | 0.034217 | -0.060118 | 0.200204 | 0.067704 | 0.084471 | 0.079679 | 0.041922 | -0.003968 | -0.029389 | 0.024537 | ... | -0.065940 | 0.079277 | 0.069333 | 0.030989 | 0.249826 | 0.072790 | 0.044097 | -0.038468 | 0.134556 | 0.067806 |
A2ML1 | -0.128082 | -0.027417 | 0.116039 | 0.107988 | 0.089419 | 0.227512 | 0.039121 | 0.034778 | 0.084594 | -0.003710 | ... | 0.101541 | 0.038977 | 0.066599 | 0.043809 | 0.064657 | 0.021916 | 0.041358 | 0.236576 | -0.047984 | 0.112071 |
A3GALT2 | -0.031285 | -0.036116 | -0.172227 | 0.007992 | 0.065109 | -0.130448 | 0.028947 | -0.120875 | -0.052288 | -0.336776 | ... | 0.005374 | -0.144070 | -0.256227 | -0.116473 | -0.294305 | -0.221940 | -0.146565 | -0.239690 | -0.116114 | -0.149897 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
ZYG11A | -0.289724 | 0.032983 | -0.201273 | -0.100344 | -0.112703 | 0.013401 | 0.005124 | -0.089180 | -0.005409 | -0.070396 | ... | -0.296880 | -0.084936 | -0.128569 | -0.110504 | -0.087171 | 0.024959 | -0.119911 | -0.079342 | -0.043555 | -0.045115 |
ZYG11B | -0.062972 | -0.410392 | -0.178877 | -0.462160 | -0.598698 | -0.296421 | -0.131949 | -0.145737 | -0.216393 | -0.257916 | ... | -0.332415 | -0.193408 | -0.327408 | -0.257879 | -0.349111 | 0.015259 | -0.289412 | -0.347484 | -0.335270 | -0.307900 |
ZYX | 0.074180 | 0.113156 | -0.055349 | -0.001555 | 0.095877 | 0.067705 | -0.109147 | -0.034886 | -0.137350 | 0.029457 | ... | -0.005090 | -0.218960 | -0.053033 | -0.041612 | -0.057478 | -0.306562 | -0.195097 | -0.085302 | -0.208063 | 0.070671 |
ZZEF1 | 0.111244 | 0.234388 | -0.002161 | -0.325964 | -0.026742 | -0.232453 | -0.164482 | -0.175850 | -0.168087 | -0.284838 | ... | -0.188751 | -0.120449 | -0.267081 | 0.006148 | -0.189602 | -0.148368 | -0.206400 | -0.095965 | -0.094741 | -0.187813 |
ZZZ3 | -0.467908 | -0.088306 | -0.186842 | -0.486660 | -0.320759 | -0.347234 | -0.277397 | -0.519586 | -0.282338 | -0.247634 | ... | -0.239991 | -0.311396 | -0.202158 | -0.195154 | -0.107107 | -0.579576 | -0.486525 | -0.346272 | -0.222404 | -0.452143 |
18443 rows × 1100 columns
Then load the mapping information and show the content.
df_map = pd.read_csv("Model.csv")
df_map
ModelID | PatientID | CellLineName | StrippedCellLineName | DepmapModelType | OncotreeLineage | OncotreePrimaryDisease | OncotreeSubtype | OncotreeCode | LegacyMolecularSubtype | ... | TissueOrigin | CCLEName | CatalogNumber | PlateCoating | ModelDerivationMaterial | PublicComments | WTSIMasterCellID | SangerModelID | COSMICID | LegacySubSubtype | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ACH-000001 | PT-gj46wT | NIH:OVCAR-3 | NIHOVCAR3 | HGSOC | Ovary/Fallopian Tube | Ovarian Epithelial Tumor | High-Grade Serous Ovarian Cancer | HGSOC | NaN | ... | NaN | NIHOVCAR3_OVARY | HTB-71 | NaN | NaN | NaN | 2201.0 | SIDM00105 | 905933.0 | high_grade_serous |
1 | ACH-000002 | PT-5qa3uk | HL-60 | HL60 | AML | Myeloid | Acute Myeloid Leukemia | Acute Myeloid Leukemia | AML | NaN | ... | NaN | HL60_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE | CCL-240 | NaN | NaN | NaN | 55.0 | SIDM00829 | 905938.0 | M3 |
2 | ACH-000003 | PT-puKIyc | CACO2 | CACO2 | COAD | Bowel | Colorectal Adenocarcinoma | Colon Adenocarcinoma | COAD | NaN | ... | NaN | CACO2_LARGE_INTESTINE | HTB-37 | NaN | NaN | NaN | NaN | SIDM00891 | NaN | NaN |
3 | ACH-000004 | PT-q4K2cp | HEL | HEL | AML | Myeloid | Acute Myeloid Leukemia | Acute Myeloid Leukemia | AML | NaN | ... | NaN | HEL_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE | ACC 11 | NaN | NaN | NaN | 783.0 | SIDM00594 | 907053.0 | M6 |
4 | ACH-000005 | PT-q4K2cp | HEL 92.1.7 | HEL9217 | AML | Myeloid | Acute Myeloid Leukemia | Acute Myeloid Leukemia | AML | NaN | ... | NaN | HEL9217_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE | HEL9217 | NaN | NaN | NaN | NaN | SIDM00593 | NaN | M6 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1916 | ACH-003157 | PT-QDEP9D | ABM-T0822 | ABMT0822 | ZIMMMPLC | Lung | Non-Cancerous | Immortalized MPLC Cells | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1917 | ACH-003158 | PT-nszsxG | ABM-T9220 | ABMT9220 | ZIMMSMCI | Muscle | Non-Cancerous | Immortalized Smooth Muscle Cells, Intestinal | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1918 | ACH-003159 | PT-AUxVvV | ABM-T9233 | ABMT9233 | ZIMMRSCH | Hair | Non-Cancerous | Immortalized Hair Follicle Inner Root Sheath C... | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1919 | ACH-003160 | PT-AUxVvV | ABM-T9249 | ABMT9249 | ZIMMGMCH | Hair | Non-Cancerous | Immortalized Hair Germinal Matrix Cells | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1920 | ACH-003161 | PT-or1hkT | ABM-T9430 | ABMT9430 | ZIMMPSC | Pancreas | Non-Cancerous | Immortalized Pancreatic Stromal Cells | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1921 rows × 36 columns
4. Filter the information to be exploited
Filter the genes mapped to tissues (OncotreeLineage
column in the
mapping file) having less than minlines
cell-lines:
from HELPpy.utility.selection import filter_crispr_by_model
df = filter_crispr_by_model(df_orig, df_map, minlines=10, line_group='OncotreeLineage')
df
gene | ACH-000001 | ACH-000004 | ACH-000005 | ACH-000007 | ACH-000009 | ACH-000011 | ACH-000012 | ACH-000013 | ACH-000015 | ACH-000017 | ... | ACH-002693 | ACH-002710 | ACH-002785 | ACH-002799 | ACH-002800 | ACH-002834 | ACH-002847 | ACH-002922 | ACH-002925 | ACH-002926 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A1BG | -0.122637 | 0.019756 | -0.107208 | -0.031027 | 0.008888 | 0.022670 | -0.096631 | 0.049811 | -0.099040 | -0.044896 | ... | -0.072582 | -0.033722 | -0.053881 | -0.060617 | 0.025795 | -0.055721 | -0.009973 | -0.025991 | -0.127639 | -0.068666 |
A1CF | 0.025881 | -0.083640 | -0.023211 | -0.137850 | -0.146566 | -0.057743 | -0.024440 | -0.158811 | -0.070409 | -0.115830 | ... | -0.237311 | -0.108704 | -0.114864 | -0.042591 | -0.132627 | -0.121228 | -0.119813 | -0.007706 | -0.040705 | -0.107530 |
A2M | 0.034217 | -0.060118 | 0.200204 | 0.067704 | 0.084471 | 0.079679 | 0.041922 | -0.003968 | -0.029389 | 0.024537 | ... | -0.065940 | 0.079277 | 0.069333 | 0.030989 | 0.249826 | 0.072790 | 0.044097 | -0.038468 | 0.134556 | 0.067806 |
A2ML1 | -0.128082 | -0.027417 | 0.116039 | 0.107988 | 0.089419 | 0.227512 | 0.039121 | 0.034778 | 0.084594 | -0.003710 | ... | 0.101541 | 0.038977 | 0.066599 | 0.043809 | 0.064657 | 0.021916 | 0.041358 | 0.236576 | -0.047984 | 0.112071 |
A3GALT2 | -0.031285 | -0.036116 | -0.172227 | 0.007992 | 0.065109 | -0.130448 | 0.028947 | -0.120875 | -0.052288 | -0.336776 | ... | 0.005374 | -0.144070 | -0.256227 | -0.116473 | -0.294305 | -0.221940 | -0.146565 | -0.239690 | -0.116114 | -0.149897 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
ZYG11A | -0.289724 | 0.032983 | -0.201273 | -0.100344 | -0.112703 | 0.013401 | 0.005124 | -0.089180 | -0.005409 | -0.070396 | ... | -0.296880 | -0.084936 | -0.128569 | -0.110504 | -0.087171 | 0.024959 | -0.119911 | -0.079342 | -0.043555 | -0.045115 |
ZYG11B | -0.062972 | -0.410392 | -0.178877 | -0.462160 | -0.598698 | -0.296421 | -0.131949 | -0.145737 | -0.216393 | -0.257916 | ... | -0.332415 | -0.193408 | -0.327408 | -0.257879 | -0.349111 | 0.015259 | -0.289412 | -0.347484 | -0.335270 | -0.307900 |
ZYX | 0.074180 | 0.113156 | -0.055349 | -0.001555 | 0.095877 | 0.067705 | -0.109147 | -0.034886 | -0.137350 | 0.029457 | ... | -0.005090 | -0.218960 | -0.053033 | -0.041612 | -0.057478 | -0.306562 | -0.195097 | -0.085302 | -0.208063 | 0.070671 |
ZZEF1 | 0.111244 | 0.234388 | -0.002161 | -0.325964 | -0.026742 | -0.232453 | -0.164482 | -0.175850 | -0.168087 | -0.284838 | ... | -0.188751 | -0.120449 | -0.267081 | 0.006148 | -0.189602 | -0.148368 | -0.206400 | -0.095965 | -0.094741 | -0.187813 |
ZZZ3 | -0.467908 | -0.088306 | -0.186842 | -0.486660 | -0.320759 | -0.347234 | -0.277397 | -0.519586 | -0.282338 | -0.247634 | ... | -0.239991 | -0.311396 | -0.202158 | -0.195154 | -0.107107 | -0.579576 | -0.486525 | -0.346272 | -0.222404 | -0.452143 |
18443 rows × 1091 columns
and remove also those having more than a certain percentage of NaN values (here 80%):
from HELPpy.utility.selection import delrows_with_nan_percentage
# remove rows with more than perc NaNs
df_nonan = delrows_with_nan_percentage(df, perc=80)
Removed 512 rows from 18443 with at least 80% NaN
5. Compute EGs common to all tissues (pan-tissue labeling)
Here, pan-tissue EGs are obtained by 1. identifying EGs in all tissue-specific cell-lines and 2. computing the label of each gene as the mode of the obtained labels.
In order to do that, we need to select from the mapping file all
cell-lines (tissue_list='all'
) as a nested list of cell-lines (lists
of lists for each tissue, obtained with 'nested=True'
):
from HELPpy.utility.selection import select_cell_lines
cell_lines = select_cell_lines(df_nonan, df_map, tissue_list='all', nested=True)
print(f"Selecting {len(cell_lines)} tissues for a total of {sum([len(x) for x in cell_lines])} cell-lines")
Selecting 24 tissues for a total of 1091 cell-lines
Then, we compute the two-class labeling (mode='flat-multi'
) using
the Otsu algorithm (algorithm='otsu'
), returning the mode of the
labels (due to the input nested list of cell-lines), save the results in
a csv file ('PanTissue_group_HELP.csv'
) and print their summary:
from HELPpy.models.labelling import labelling
# remove rows with all nans
df_common = labelling(df_nonan, columns=cell_lines, n_classes=2, labelnames={0:'E', 1: 'NE'}, mode='flat-multi', algorithm='otsu')
df_common.to_csv("PanTissue_group_HELP.csv")
df_common.value_counts()
0%| | 0/34 [00:00<?, ?it/s]
100%|██████████| 34/34 [00:00<00:00, 154.54it/s]
100%|██████████| 32/32 [00:00<00:00, 159.87it/s]
100%|██████████| 37/37 [00:00<00:00, 173.37it/s]
100%|██████████| 59/59 [00:00<00:00, 141.95it/s]
100%|██████████| 48/48 [00:00<00:00, 134.49it/s]
100%|██████████| 86/86 [00:00<00:00, 131.64it/s]
100%|██████████| 18/18 [00:00<00:00, 204.46it/s]
100%|██████████| 65/65 [00:00<00:00, 150.54it/s]
100%|██████████| 15/15 [00:00<00:00, 153.70it/s]
100%|██████████| 72/72 [00:00<00:00, 135.68it/s]
100%|██████████| 37/37 [00:00<00:00, 131.99it/s]
100%|██████████| 24/24 [00:00<00:00, 139.12it/s]
100%|██████████| 119/119 [00:00<00:00, 126.31it/s]
100%|██████████| 81/81 [00:00<00:00, 127.79it/s]
100%|██████████| 37/37 [00:00<00:00, 119.07it/s]
100%|██████████| 59/59 [00:00<00:00, 121.51it/s]
100%|██████████| 47/47 [00:00<00:00, 149.28it/s]
100%|██████████| 41/41 [00:00<00:00, 164.70it/s]
100%|██████████| 19/19 [00:00<00:00, 132.71it/s]
100%|██████████| 10/10 [00:00<00:00, 252.74it/s]
100%|██████████| 71/71 [00:00<00:00, 127.23it/s]
100%|██████████| 36/36 [00:00<00:00, 185.67it/s]
100%|██████████| 11/11 [00:00<00:00, 106.69it/s]
100%|██████████| 33/33 [00:00<00:00, 137.99it/s]
label
NE 16681
E 1250
Name: count, dtype: int64
An alternative way for computing pan-tissue EGs could be to select all
cell-lines as a flat list of identifiers ('nested=False'
), so
disregarding their mapping to tissues, and compute the EG labeling:
from HELPpy.utility.selection import select_cell_lines
cell_lines_un = select_cell_lines(df_nonan, df_map, tissue_list='all', nested=False)
print(f"Selecting {len(cell_lines)} tissues for a total of {sum([len(x) for x in cell_lines_un])} cell-lines")
df_common_flat = labelling(df_nonan, columns=cell_lines_un, n_classes=2, labelnames={0:'E', 1: 'NE'}, mode='flat-multi', algorithm='otsu')
df_common_flat.to_csv("PanTissue.csv")
df_common_flat.value_counts()
Selecting 24 tissues for a total of 10910 cell-lines
100%|██████████| 1091/1091 [00:09<00:00, 118.25it/s]
label
NE 16668
E 1263
Name: count, dtype: int64
In this case, the cell-lines contribute in the same way to the labelling criterion regardless of the related tissue, thus providing a different, less stringent labelling.
6. Subtract pan-tissue EGs from those of the chosen tissue
Uncommon context-specific EGs (ucsEGs) for a chosen tissue (here
tissueK = 'Kidney'
) are obtained by subtracting the pan-tissue EGs
computed in the previous step (df_common
) by the EGs identified for
the chosen tissue.
import pandas as pd
#Identification of EGs in Kidney tissue (as in Example 1)
tissueK = 'Kidney'
from HELPpy.utility.selection import select_cell_lines
from HELPpy.models.labelling import labelling
cell_linesK = select_cell_lines(df_nonan, df_map, [tissueK])
print(f"Selecting {len(cell_linesK)} cell-lines")
df_labelK = labelling(df_nonan, columns = cell_linesK, n_classes=2,
labelnames={0: 'E', 1: 'NE'},
mode='flat-multi', algorithm='otsu')
df_labelK.to_csv(f"{tissueK}_HELP_twoClasses.csv")
#Alternatively, you can download the Kidney labels already computed:
#!wget https://raw.githubusercontent.com/giordamaug/HELP/main/data/Kidney_HELP.csv
#Identification of Kidney uncommon context-specific EGs
import numpy as np
import sys
np.set_printoptions(threshold=sys.maxsize)
EG_kidney = df_labelK[df_labelK['label'] == 'E'].index.values
cEG = df_common[df_common['label']=='E'].index.values
cs_EG_kidney = np.setdiff1d(EG_kidney, cEG)
print(cs_EG_kidney)
with open("csEG_Kidney.txt", 'w', encoding='utf-8') as f:
f.write('\n'.join(list(cs_EG_kidney)))
Selecting 37 cell-lines
100%|██████████| 37/37 [00:00<00:00, 202.51it/s]
['ACTG1' 'ACTR6' 'ARF4' 'ARFRP1' 'ARPC4' 'CDK6' 'CFLAR' 'CHMP7' 'COPS3'
'DCTN3' 'DDX11' 'DDX52' 'EMC3' 'EXOSC1' 'FERMT2' 'GEMIN7' 'GET3' 'HGS'
'HNF1B' 'HTATSF1' 'ITGAV' 'KIF4A' 'MCM10' 'MDM2' 'METAP2' 'MLST8'
'NCAPH2' 'NDOR1' 'NHLRC2' 'OXA1L' 'PAX8' 'PFN1' 'PIK3C3' 'PPIE' 'PPP1CA'
'PPP4R2' 'PTK2' 'RAB7A' 'RAD1' 'RBM42' 'RBMX2' 'RTEL1' 'SEPHS2' 'SNAP23'
'SNRPB2' 'SPTLC1' 'SRSF10' 'TAF1D' 'TMED10' 'TMED2' 'TRIM37' 'UBA5' 'UBC'
'UBE2D3' 'USP10' 'VPS33A' 'VPS52' 'WDR25' 'YWHAZ' 'ZNG1B']
Visualizing the obtained results
Show the supervenn plot of pan-tissue EGs, Kidney EGs and Kidney ucsEGs.
from HELPpy.visualization.plot import svenn_intesect
svenn_intesect([set(cs_EG_kidney),set(EG_kidney), set(cEG)], labels=['kidney ucsEGs', 'kidney EGs', 'common EGs'], ylabel='EGs', figsize=(8,4))

The plot shows that the Kidney tissue shares 1193 EGs with all the other tissues (over a total of 1250 ucEGs) and has 60 ucsEGs.
Show the supervenn plot of Kidney ucsEGs against Lung ucsEGs.
from HELPpy.visualization.plot import svenn_intesect
from HELPpy.utility.selection import select_cell_lines
from HELPpy.models.labelling import labelling
tissueL = 'Lung'
#a) Identify Lung EGs (as in Example 1)
cell_linesL = select_cell_lines(df_nonan, df_map, [tissueL])
print(f"Selecting {len(cell_linesL)} cell-lines")
df_labelL = labelling(df_nonan, columns = cell_linesL, n_classes=2,
labelnames={0: 'E', 1: 'NE'},
mode='flat-multi', algorithm='otsu')
#b) Compute Lung ucsEGs
np.set_printoptions(threshold=sys.maxsize)
EG_lung = df_labelL[df_labelL['label'] == 'E'].index.values
ucs_EG_lung = np.setdiff1d(EG_lung, cEG)
print(ucs_EG_lung)
with open("ucsEG_Lung.txt", 'w', encoding='utf-8') as f:
f.write('\n'.join(list(ucs_EG_lung)))
#Show the supervenn plot
svenn_intesect([set(cs_EG_kidney), set(ucs_EG_lung)], labels=['kidney', 'lung'], ylabel='ucsEGs', figsize=(8,4))
Selecting 119 cell-lines
0%| | 0/119 [00:00<?, ?it/s]
100%|██████████| 119/119 [00:00<00:00, 215.18it/s]
['ACO2' 'AP2M1' 'ATP5F1D' 'BORA' 'CCDC86' 'CDK2' 'CKS1B' 'DCTN3' 'DDX11'
'DDX39B' 'DGCR8' 'GEMIN7' 'NCAPH2' 'NFYB' 'NUMA1' 'NUP153' 'OXA1L'
'PI4KA' 'PPAT' 'PTCD3' 'SCD' 'SLBP' 'SLC25A3' 'TFRC' 'TRPM7' 'YPEL5'
'YTHDC1' 'ZNF407']

The plot shows that the Kidney and Lung tissues share 5 ucsEGs.