1. Install HELP from GitHub
Skip this cell if you have already installed HELP.
!pip install git+https://github.com/giordamaug/HELP.git
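One way to decide whether the install cell can be skipped is to check whether the package is already importable. The sketch below uses only the standard library. The helper name is_installed is hypothetical, and the import name HELPpy is taken from the import statements used later in this notebook.

```python
import importlib.util

def is_installed(package_name):
    """Return True if `package_name` can be found by the import system."""
    return importlib.util.find_spec(package_name) is not None

# "HELPpy" is the import name used by the later cells of this notebook
if is_installed("HELPpy"):
    print("HELPpy is already installed: the pip cell can be skipped.")
else:
    print("HELPpy not found: run the pip cell above.")
```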
2. Download the input files
Download the DepMap CRISPR gene effect scores (CRISPRGeneEffect.csv) and the map between cell lines and tissues (Model.csv). Skip this step if you already have these input files locally.
!wget -c https://figshare.com/ndownloader/files/43346616 -O CRISPRGeneEffect.csv
!wget -c https://figshare.com/ndownloader/files/43746708 -O Model.csv
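To see whether the download step can be skipped, the following minimal sketch lists which of the two input files are still missing from a directory. The helper name missing_files is hypothetical and is not part of HELP.

```python
import os

def missing_files(filenames, directory="."):
    """Return the names in `filenames` that are not regular files in `directory`."""
    return [name for name in filenames
            if not os.path.isfile(os.path.join(directory, name))]

needed = ["CRISPRGeneEffect.csv", "Model.csv"]
# An empty list means both inputs are already available locally
print("Still to download:", missing_files(needed))
```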
3. Load the input files
Load the CRISPR data and show the content.
import pandas as pd
import os
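# Genes arrive as columns named "SYMBOL (ID)": keep the symbol, then transpose so genes are rows
df = (pd.read_csv("CRISPRGeneEffect.csv")
      .rename(columns={'Unnamed: 0': 'gene'})
      .rename(columns=lambda x: x.split(' ')[0])
      .set_index('gene').T)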
print(f'{df.isna().sum().sum()} NaN over {len(df)*len(df.columns)} values')
df
739493 NaN over 20287300 values
gene | ACH-000001 | ACH-000004 | ACH-000005 | ACH-000007 | ACH-000009 | ACH-000011 | ACH-000012 | ACH-000013 | ACH-000015 | ACH-000017 | ... | ACH-002693 | ACH-002710 | ACH-002785 | ACH-002799 | ACH-002800 | ACH-002834 | ACH-002847 | ACH-002922 | ACH-002925 | ACH-002926 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A1BG | -0.122637 | 0.019756 | -0.107208 | -0.031027 | 0.008888 | 0.022670 | -0.096631 | 0.049811 | -0.099040 | -0.044896 | ... | -0.072582 | -0.033722 | -0.053881 | -0.060617 | 0.025795 | -0.055721 | -0.009973 | -0.025991 | -0.127639 | -0.068666 |
A1CF | 0.025881 | -0.083640 | -0.023211 | -0.137850 | -0.146566 | -0.057743 | -0.024440 | -0.158811 | -0.070409 | -0.115830 | ... | -0.237311 | -0.108704 | -0.114864 | -0.042591 | -0.132627 | -0.121228 | -0.119813 | -0.007706 | -0.040705 | -0.107530 |
A2M | 0.034217 | -0.060118 | 0.200204 | 0.067704 | 0.084471 | 0.079679 | 0.041922 | -0.003968 | -0.029389 | 0.024537 | ... | -0.065940 | 0.079277 | 0.069333 | 0.030989 | 0.249826 | 0.072790 | 0.044097 | -0.038468 | 0.134556 | 0.067806 |
A2ML1 | -0.128082 | -0.027417 | 0.116039 | 0.107988 | 0.089419 | 0.227512 | 0.039121 | 0.034778 | 0.084594 | -0.003710 | ... | 0.101541 | 0.038977 | 0.066599 | 0.043809 | 0.064657 | 0.021916 | 0.041358 | 0.236576 | -0.047984 | 0.112071 |
A3GALT2 | -0.031285 | -0.036116 | -0.172227 | 0.007992 | 0.065109 | -0.130448 | 0.028947 | -0.120875 | -0.052288 | -0.336776 | ... | 0.005374 | -0.144070 | -0.256227 | -0.116473 | -0.294305 | -0.221940 | -0.146565 | -0.239690 | -0.116114 | -0.149897 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
ZYG11A | -0.289724 | 0.032983 | -0.201273 | -0.100344 | -0.112703 | 0.013401 | 0.005124 | -0.089180 | -0.005409 | -0.070396 | ... | -0.296880 | -0.084936 | -0.128569 | -0.110504 | -0.087171 | 0.024959 | -0.119911 | -0.079342 | -0.043555 | -0.045115 |
ZYG11B | -0.062972 | -0.410392 | -0.178877 | -0.462160 | -0.598698 | -0.296421 | -0.131949 | -0.145737 | -0.216393 | -0.257916 | ... | -0.332415 | -0.193408 | -0.327408 | -0.257879 | -0.349111 | 0.015259 | -0.289412 | -0.347484 | -0.335270 | -0.307900 |
ZYX | 0.074180 | 0.113156 | -0.055349 | -0.001555 | 0.095877 | 0.067705 | -0.109147 | -0.034886 | -0.137350 | 0.029457 | ... | -0.005090 | -0.218960 | -0.053033 | -0.041612 | -0.057478 | -0.306562 | -0.195097 | -0.085302 | -0.208063 | 0.070671 |
ZZEF1 | 0.111244 | 0.234388 | -0.002161 | -0.325964 | -0.026742 | -0.232453 | -0.164482 | -0.175850 | -0.168087 | -0.284838 | ... | -0.188751 | -0.120449 | -0.267081 | 0.006148 | -0.189602 | -0.148368 | -0.206400 | -0.095965 | -0.094741 | -0.187813 |
ZZZ3 | -0.467908 | -0.088306 | -0.186842 | -0.486660 | -0.320759 | -0.347234 | -0.277397 | -0.519586 | -0.282338 | -0.247634 | ... | -0.239991 | -0.311396 | -0.202158 | -0.195154 | -0.107107 | -0.579576 | -0.486525 | -0.346272 | -0.222404 | -0.452143 |
18443 rows × 1100 columns
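The loading step above renames the columns and transposes the matrix in a single chained expression. The sketch below replays the same chain on a tiny made-up CSV so that each effect is easy to see. The two genes, two cell lines, and all values are illustrative and are not taken from the real file.

```python
import io
import pandas as pd

# Toy file in the same layout as the input: an unnamed first column of
# cell-line IDs, then one column per gene named "SYMBOL (ID)"
toy_csv = io.StringIO(
    ",A1BG (1),A1CF (29974)\n"
    "ACH-000001,-0.12,0.02\n"
    "ACH-000004,0.01,\n"
)

toy = (pd.read_csv(toy_csv)
       .rename(columns={'Unnamed: 0': 'gene'})      # label the ID column
       .rename(columns=lambda x: x.split(' ')[0])   # "A1BG (1)" -> "A1BG"
       .set_index('gene').T)                        # genes become rows

n_nan = int(toy.isna().sum().sum())
n_values = len(toy) * len(toy.columns)
print(f'{n_nan} NaN over {n_values} values')
print(toy)
```

After the transpose, the index holds gene symbols and the columns hold cell-line IDs, which is the orientation the rest of the notebook relies on.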
Then load the mapping information and show the content.
df_map = pd.read_csv("Model.csv")
df_map
 | ModelID | PatientID | CellLineName | StrippedCellLineName | DepmapModelType | OncotreeLineage | OncotreePrimaryDisease | OncotreeSubtype | OncotreeCode | LegacyMolecularSubtype | ... | TissueOrigin | CCLEName | CatalogNumber | PlateCoating | ModelDerivationMaterial | PublicComments | WTSIMasterCellID | SangerModelID | COSMICID | LegacySubSubtype |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ACH-000001 | PT-gj46wT | NIH:OVCAR-3 | NIHOVCAR3 | HGSOC | Ovary/Fallopian Tube | Ovarian Epithelial Tumor | High-Grade Serous Ovarian Cancer | HGSOC | NaN | ... | NaN | NIHOVCAR3_OVARY | HTB-71 | NaN | NaN | NaN | 2201.0 | SIDM00105 | 905933.0 | high_grade_serous |
1 | ACH-000002 | PT-5qa3uk | HL-60 | HL60 | AML | Myeloid | Acute Myeloid Leukemia | Acute Myeloid Leukemia | AML | NaN | ... | NaN | HL60_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE | CCL-240 | NaN | NaN | NaN | 55.0 | SIDM00829 | 905938.0 | M3 |
2 | ACH-000003 | PT-puKIyc | CACO2 | CACO2 | COAD | Bowel | Colorectal Adenocarcinoma | Colon Adenocarcinoma | COAD | NaN | ... | NaN | CACO2_LARGE_INTESTINE | HTB-37 | NaN | NaN | NaN | NaN | SIDM00891 | NaN | NaN |
3 | ACH-000004 | PT-q4K2cp | HEL | HEL | AML | Myeloid | Acute Myeloid Leukemia | Acute Myeloid Leukemia | AML | NaN | ... | NaN | HEL_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE | ACC 11 | NaN | NaN | NaN | 783.0 | SIDM00594 | 907053.0 | M6 |
4 | ACH-000005 | PT-q4K2cp | HEL 92.1.7 | HEL9217 | AML | Myeloid | Acute Myeloid Leukemia | Acute Myeloid Leukemia | AML | NaN | ... | NaN | HEL9217_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE | HEL9217 | NaN | NaN | NaN | NaN | SIDM00593 | NaN | M6 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1916 | ACH-003157 | PT-QDEP9D | ABM-T0822 | ABMT0822 | ZIMMMPLC | Lung | Non-Cancerous | Immortalized MPLC Cells | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1917 | ACH-003158 | PT-nszsxG | ABM-T9220 | ABMT9220 | ZIMMSMCI | Muscle | Non-Cancerous | Immortalized Smooth Muscle Cells, Intestinal | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1918 | ACH-003159 | PT-AUxVvV | ABM-T9233 | ABMT9233 | ZIMMRSCH | Hair | Non-Cancerous | Immortalized Hair Follicle Inner Root Sheath C... | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1919 | ACH-003160 | PT-AUxVvV | ABM-T9249 | ABMT9249 | ZIMMGMCH | Hair | Non-Cancerous | Immortalized Hair Germinal Matrix Cells | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1920 | ACH-003161 | PT-or1hkT | ABM-T9430 | ABMT9430 | ZIMMPSC | Pancreas | Non-Cancerous | Immortalized Pancreatic Stromal Cells | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1921 rows × 36 columns
Example 1.1 two-class labeling of essential genes (EGs) based on tissue information
Filter the information to be exploited
Remove the cell lines whose tissue (OncotreeLineage column in the mapping file) has fewer than minlines cell lines. The genes (rows) are left unchanged; only cell-line columns are dropped.
from HELPpy.utility.selection import filter_crispr_by_model
df = filter_crispr_by_model(df, df_map, minlines=10, line_group='OncotreeLineage')
df
gene | ACH-000001 | ACH-000004 | ACH-000005 | ACH-000007 | ACH-000009 | ACH-000011 | ACH-000012 | ACH-000013 | ACH-000015 | ACH-000017 | ... | ACH-002693 | ACH-002710 | ACH-002785 | ACH-002799 | ACH-002800 | ACH-002834 | ACH-002847 | ACH-002922 | ACH-002925 | ACH-002926 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A1BG | -0.122637 | 0.019756 | -0.107208 | -0.031027 | 0.008888 | 0.022670 | -0.096631 | 0.049811 | -0.099040 | -0.044896 | ... | -0.072582 | -0.033722 | -0.053881 | -0.060617 | 0.025795 | -0.055721 | -0.009973 | -0.025991 | -0.127639 | -0.068666 |
A1CF | 0.025881 | -0.083640 | -0.023211 | -0.137850 | -0.146566 | -0.057743 | -0.024440 | -0.158811 | -0.070409 | -0.115830 | ... | -0.237311 | -0.108704 | -0.114864 | -0.042591 | -0.132627 | -0.121228 | -0.119813 | -0.007706 | -0.040705 | -0.107530 |
A2M | 0.034217 | -0.060118 | 0.200204 | 0.067704 | 0.084471 | 0.079679 | 0.041922 | -0.003968 | -0.029389 | 0.024537 | ... | -0.065940 | 0.079277 | 0.069333 | 0.030989 | 0.249826 | 0.072790 | 0.044097 | -0.038468 | 0.134556 | 0.067806 |
A2ML1 | -0.128082 | -0.027417 | 0.116039 | 0.107988 | 0.089419 | 0.227512 | 0.039121 | 0.034778 | 0.084594 | -0.003710 | ... | 0.101541 | 0.038977 | 0.066599 | 0.043809 | 0.064657 | 0.021916 | 0.041358 | 0.236576 | -0.047984 | 0.112071 |
A3GALT2 | -0.031285 | -0.036116 | -0.172227 | 0.007992 | 0.065109 | -0.130448 | 0.028947 | -0.120875 | -0.052288 | -0.336776 | ... | 0.005374 | -0.144070 | -0.256227 | -0.116473 | -0.294305 | -0.221940 | -0.146565 | -0.239690 | -0.116114 | -0.149897 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
ZYG11A | -0.289724 | 0.032983 | -0.201273 | -0.100344 | -0.112703 | 0.013401 | 0.005124 | -0.089180 | -0.005409 | -0.070396 | ... | -0.296880 | -0.084936 | -0.128569 | -0.110504 | -0.087171 | 0.024959 | -0.119911 | -0.079342 | -0.043555 | -0.045115 |
ZYG11B | -0.062972 | -0.410392 | -0.178877 | -0.462160 | -0.598698 | -0.296421 | -0.131949 | -0.145737 | -0.216393 | -0.257916 | ... | -0.332415 | -0.193408 | -0.327408 | -0.257879 | -0.349111 | 0.015259 | -0.289412 | -0.347484 | -0.335270 | -0.307900 |
ZYX | 0.074180 | 0.113156 | -0.055349 | -0.001555 | 0.095877 | 0.067705 | -0.109147 | -0.034886 | -0.137350 | 0.029457 | ... | -0.005090 | -0.218960 | -0.053033 | -0.041612 | -0.057478 | -0.306562 | -0.195097 | -0.085302 | -0.208063 | 0.070671 |
ZZEF1 | 0.111244 | 0.234388 | -0.002161 | -0.325964 | -0.026742 | -0.232453 | -0.164482 | -0.175850 | -0.168087 | -0.284838 | ... | -0.188751 | -0.120449 | -0.267081 | 0.006148 | -0.189602 | -0.148368 | -0.206400 | -0.095965 | -0.094741 | -0.187813 |
ZZZ3 | -0.467908 | -0.088306 | -0.186842 | -0.486660 | -0.320759 | -0.347234 | -0.277397 | -0.519586 | -0.282338 | -0.247634 | ... | -0.239991 | -0.311396 | -0.202158 | -0.195154 | -0.107107 | -0.579576 | -0.486525 | -0.346272 | -0.222404 | -0.452143 |
18443 rows × 1091 columns
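To make the effect of the filter concrete, here is a minimal pandas sketch of this kind of filtering on toy data. It is not HELPpy's implementation: the function name keep_lines_in_large_groups is hypothetical, and counting only the cell lines that actually appear in the score matrix is an assumption.

```python
import pandas as pd

def keep_lines_in_large_groups(scores, model_map, minlines,
                               line_group="OncotreeLineage"):
    """Keep the columns (cell lines) whose group has at least `minlines` members.

    Assumption: group sizes are counted over the cell lines present in `scores`.
    """
    # Group label of every cell line in the score matrix (NaN if unmapped)
    groups = model_map.set_index("ModelID")[line_group].reindex(scores.columns)
    counts = groups.value_counts()
    large = counts[counts >= minlines].index
    return scores[groups[groups.isin(large)].index]

toy_scores = pd.DataFrame(
    [[-0.1, 0.0, 0.2, -0.3], [-1.0, -0.9, -1.1, -0.8]],
    index=["G1", "G2"], columns=["L1", "L2", "L3", "L4"])
toy_map = pd.DataFrame({
    "ModelID": ["L1", "L2", "L3", "L4", "L5"],
    "OncotreeLineage": ["Kidney", "Kidney", "Hair", "Kidney", "Hair"]})

filtered = keep_lines_in_large_groups(toy_scores, toy_map, minlines=2)
print(filtered)
```

In this toy case the single Hair line in the matrix (L3) is dropped, while both genes are kept.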
Show the tissues available in the mapping file and the number of cell lines for each:
print(df_map[['OncotreeLineage']].value_counts())
OncotreeLineage
Lung 249
Lymphoid 211
CNS/Brain 122
Skin 120
Esophagus/Stomach 95
Breast 94
Bowel 89
Head and Neck 84
Bone 77
Myeloid 77
Ovary/Fallopian Tube 75
Kidney 73
Pancreas 66
Peripheral Nervous System 56
Soft Tissue 55
Biliary Tract 44
Uterus 41
Fibroblast 41
Bladder/Urinary Tract 39
Normal 39
Pleura 35
Liver 29
Cervix 25
Eye 21
Thyroid 18
Prostate 15
Testis 7
Vulva/Vagina 5
Muscle 5
Ampulla of Vater 4
Hair 2
Other 1
Embryonal 1
Adrenal Gland 1
Name: count, dtype: int64
Select only the cell lines of a chosen tissue (here Kidney) and remove the genes (rows) whose percentage of NaN values reaches a given threshold (here 95%):
tissue = 'Kidney'
from HELPpy.utility.selection import select_cell_lines, delrows_with_nan_percentage
from HELPpy.models.labelling import labelling
cell_lines = select_cell_lines(df, df_map, [tissue])
print(f"Selecting {len(cell_lines)} cell-lines")
# remove genes (rows) with at least perc% NaN values
df_nonan = delrows_with_nan_percentage(df, perc=95)
Selecting 37 cell-lines
Removed 512 rows from 18443 with at least 95% NaN
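The two helpers used above can be pictured with the following pandas sketch on toy data. The function names lines_for_tissues and drop_rows_by_nan_percentage are hypothetical stand-ins, not HELPpy's code. The "at least perc%" rule follows the message printed above, and the rest is an assumption.

```python
import numpy as np
import pandas as pd

def lines_for_tissues(scores, model_map, tissues, line_group="OncotreeLineage"):
    """IDs of the cell lines that belong to `tissues` and appear in `scores`."""
    in_tissue = model_map[line_group].isin(tissues)
    in_scores = model_map["ModelID"].isin(scores.columns)
    return model_map.loc[in_tissue & in_scores, "ModelID"].tolist()

def drop_rows_by_nan_percentage(scores, perc):
    """Drop the rows whose share of NaN values is at least `perc` percent."""
    nan_pct = scores.isna().mean(axis=1) * 100
    return scores.loc[nan_pct < perc]

toy_scores = pd.DataFrame(
    {"L1": [0.1, np.nan, -1.0],
     "L2": [0.2, np.nan, np.nan],
     "L3": [0.0, np.nan, -0.9]},
    index=["G1", "G2", "G3"])
toy_map = pd.DataFrame({
    "ModelID": ["L1", "L2", "L3", "L9"],
    "OncotreeLineage": ["Kidney", "Lung", "Kidney", "Kidney"]})

kidney_lines = lines_for_tissues(toy_scores, toy_map, ["Kidney"])
cleaned = drop_rows_by_nan_percentage(toy_scores, perc=95)
print(kidney_lines)
print(cleaned)
```

Here G2 is entirely NaN and is removed, G3 has one missing value out of three and is kept, and L9 is ignored because it has no column in the score matrix.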
Apply two-class HELP labelling
Compute the two-class labeling (mode='flat-multi') using the Otsu algorithm (algorithm='otsu') and save the results in a CSV file (Kidney_HELP_twoClasses.csv):
df_label2 = labelling(df_nonan, columns=cell_lines, n_classes=2,
                      labelnames={0: 'E', 1: 'NE'},
                      mode='flat-multi', algorithm='otsu', verbose=True)
# save the result
df_label2.to_csv(f"{tissue}_HELP_twoClasses.csv")
performing flat mode on 2-class labelling (flat-multi).
[flat-multi]: 1. multi-class labelling:
100%|██████████| 37/37 [00:00<00:00, 598.29it/s]
label
NE 16678
E 1253
Name: count, dtype: int64
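For intuition about the thresholding step, the sketch below applies Otsu's criterion (choose the cut that maximizes the between-class variance) to one list of scores, using only the standard library. It is not HELPpy's implementation: the function name otsu_threshold and the toy scores are illustrative, the cut is searched over the sorted sample rather than a histogram, and the way HELPpy combines the per-cell-line results into one label per gene is not shown here. Lower gene effect scores indicate stronger dependency, so values below the cut are labelled E.

```python
def otsu_threshold(values):
    """Cut point maximizing the between-class variance w0 * w1 * (m0 - m1)**2.

    Candidate cuts are the midpoints between consecutive distinct sorted values.
    """
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    best_cut, best_var, left_sum = xs[0], -1.0, 0.0
    for i in range(1, n):
        left_sum += xs[i - 1]
        if xs[i] == xs[i - 1]:
            continue  # no cut between equal values
        w0, w1 = i / n, (n - i) / n                      # class weights
        m0, m1 = left_sum / i, (total - left_sum) / (n - i)  # class means
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_cut = var_between, (xs[i - 1] + xs[i]) / 2
    return best_cut

toy_scores = [-1.2, -1.0, -0.9, 0.0, 0.05, -0.1, 0.1]
cut = otsu_threshold(toy_scores)
labels = ['E' if s < cut else 'NE' for s in toy_scores]
print(cut, labels)
```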
Example 1.2 three-class labeling of EGs based on tissue information
The cell lines were already filtered by tissue, and the genes cleaned of NaN values, in Example 1.1, so only the labelling step remains:
Apply three-class HELP labelling
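Compute the three-class labeling (mode='two-by-two') using the Otsu algorithm (algorithm='otsu') and save the results in a CSV file (Kidney_HELP_threeClasses.csv). As the log below shows, this mode runs the two-class labelling twice: first on all genes to separate E from the rest, then on the remaining genes to split them into aE and sNE: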
df_label3 = labelling(df_nonan, columns=cell_lines, n_classes=2,
                      labelnames={0: 'E', 1: 'aE', 2: 'sNE'},
                      mode='two-by-two', algorithm='otsu', verbose=True)
# save the result
df_label3.to_csv(f"{tissue}_HELP_threeClasses.csv")
performing flat mode on 3-class labelling (two-by-two).
[two-by-two]: 1. Two-class labelling:
100%|██████████| 37/37 [00:00<00:00, 572.91it/s]
(17931,)
[two-by-two]: 2. Two-class labelling on 1-label rows:
100%|██████████| 37/37 [00:00<00:00, 737.77it/s]
(16678,)
label
sNE 13457
aE 3221
E 1253
Name: count, dtype: int64
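The two passes visible in the log (17931 genes first, then the 16678 genes that were not labelled E) can be sketched on a single list of scores as follows. This is an illustration under assumptions, not HELPpy's code: the function names and toy scores are hypothetical, the threshold is a sample-based form of Otsu's criterion, and the aggregation across cell lines is left out.

```python
def otsu_threshold(values):
    """Cut point maximizing the between-class variance over the sorted sample."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    best_cut, best_var, left_sum = xs[0], -1.0, 0.0
    for i in range(1, n):
        left_sum += xs[i - 1]
        if xs[i] == xs[i - 1]:
            continue
        w0, w1 = i / n, (n - i) / n
        m0, m1 = left_sum / i, (total - left_sum) / (n - i)
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_cut = var_between, (xs[i - 1] + xs[i]) / 2
    return best_cut

def two_by_two_labels(values):
    """First pass: E vs. rest. Second pass: split the rest into aE and sNE."""
    first_cut = otsu_threshold(values)
    rest = [v for v in values if v >= first_cut]
    second_cut = otsu_threshold(rest)
    labels = []
    for v in values:
        if v < first_cut:
            labels.append('E')
        elif v < second_cut:
            labels.append('aE')
        else:
            labels.append('sNE')
    return labels

toy_scores = [-2.0, -1.9, -0.8, -0.7, 0.0, 0.1, 0.05]
toy_labels = two_by_two_labels(toy_scores)
print(toy_labels)
```

The second threshold is computed only on the values left after the first pass, which mirrors why the second progress bar in the log works on fewer rows than the first.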