1. Install HELP from GitHub
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Skip this cell if you have already installed HELP.

.. code:: ipython3

    !pip install git+https://github.com/giordamaug/HELP.git

2. Download the input files
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Download from the DepMap portal the gene deletion expression scores
(``CRISPRGeneEffect.csv``) and the map between cell-lines and tissues
(``Model.csv``). Skip this step if you already have these input files
locally.

.. code:: ipython3

    !wget -c https://figshare.com/ndownloader/files/43346616 -O CRISPRGeneEffect.csv
    !wget -c https://figshare.com/ndownloader/files/43746708 -O Model.csv

3. Load the input files
~~~~~~~~~~~~~~~~~~~~~~~

Load the CRISPR data and show the content.

.. code:: ipython3

    import pandas as pd
    import os
    # keep only the gene symbol in each column name, then transpose so that genes are rows
    df = pd.read_csv("CRISPRGeneEffect.csv").rename(columns={'Unnamed: 0': 'gene'}).rename(columns=lambda x: x.split(' ')[0]).set_index('gene').T
    # count the missing values over the whole table
    print(f'{df.isna().sum().sum()} NaN over {len(df)*len(df.columns)} values')
    df

.. parsed-literal::

    739493 NaN over 20287300 values

.. raw:: html
gene ACH-000001 ACH-000004 ACH-000005 ACH-000007 ACH-000009 ACH-000011 ACH-000012 ACH-000013 ACH-000015 ACH-000017 ... ACH-002693 ACH-002710 ACH-002785 ACH-002799 ACH-002800 ACH-002834 ACH-002847 ACH-002922 ACH-002925 ACH-002926
A1BG -0.122637 0.019756 -0.107208 -0.031027 0.008888 0.022670 -0.096631 0.049811 -0.099040 -0.044896 ... -0.072582 -0.033722 -0.053881 -0.060617 0.025795 -0.055721 -0.009973 -0.025991 -0.127639 -0.068666
A1CF 0.025881 -0.083640 -0.023211 -0.137850 -0.146566 -0.057743 -0.024440 -0.158811 -0.070409 -0.115830 ... -0.237311 -0.108704 -0.114864 -0.042591 -0.132627 -0.121228 -0.119813 -0.007706 -0.040705 -0.107530
A2M 0.034217 -0.060118 0.200204 0.067704 0.084471 0.079679 0.041922 -0.003968 -0.029389 0.024537 ... -0.065940 0.079277 0.069333 0.030989 0.249826 0.072790 0.044097 -0.038468 0.134556 0.067806
A2ML1 -0.128082 -0.027417 0.116039 0.107988 0.089419 0.227512 0.039121 0.034778 0.084594 -0.003710 ... 0.101541 0.038977 0.066599 0.043809 0.064657 0.021916 0.041358 0.236576 -0.047984 0.112071
A3GALT2 -0.031285 -0.036116 -0.172227 0.007992 0.065109 -0.130448 0.028947 -0.120875 -0.052288 -0.336776 ... 0.005374 -0.144070 -0.256227 -0.116473 -0.294305 -0.221940 -0.146565 -0.239690 -0.116114 -0.149897
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
ZYG11A -0.289724 0.032983 -0.201273 -0.100344 -0.112703 0.013401 0.005124 -0.089180 -0.005409 -0.070396 ... -0.296880 -0.084936 -0.128569 -0.110504 -0.087171 0.024959 -0.119911 -0.079342 -0.043555 -0.045115
ZYG11B -0.062972 -0.410392 -0.178877 -0.462160 -0.598698 -0.296421 -0.131949 -0.145737 -0.216393 -0.257916 ... -0.332415 -0.193408 -0.327408 -0.257879 -0.349111 0.015259 -0.289412 -0.347484 -0.335270 -0.307900
ZYX 0.074180 0.113156 -0.055349 -0.001555 0.095877 0.067705 -0.109147 -0.034886 -0.137350 0.029457 ... -0.005090 -0.218960 -0.053033 -0.041612 -0.057478 -0.306562 -0.195097 -0.085302 -0.208063 0.070671
ZZEF1 0.111244 0.234388 -0.002161 -0.325964 -0.026742 -0.232453 -0.164482 -0.175850 -0.168087 -0.284838 ... -0.188751 -0.120449 -0.267081 0.006148 -0.189602 -0.148368 -0.206400 -0.095965 -0.094741 -0.187813
ZZZ3 -0.467908 -0.088306 -0.186842 -0.486660 -0.320759 -0.347234 -0.277397 -0.519586 -0.282338 -0.247634 ... -0.239991 -0.311396 -0.202158 -0.195154 -0.107107 -0.579576 -0.486525 -0.346272 -0.222404 -0.452143

18443 rows × 1100 columns
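
To make the chained loading call easier to follow, here is a minimal sketch on a tiny in-memory table. It assumes, as the ``x.split(' ')[0]`` step suggests, that the CSV has one row per cell line and gene columns named like ``A1BG (1)``; the CSV text and all values are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical miniature of CRISPRGeneEffect.csv: rows are cell lines,
# columns are genes named "SYMBOL (ID)"; all values are invented.
csv_text = """,A1BG (1),A1CF (29974),A2M (2)
ACH-000001,-0.12,0.03,
ACH-000004,0.02,-0.08,-0.06
"""

raw = pd.read_csv(io.StringIO(csv_text))
toy = (
    raw.rename(columns={"Unnamed: 0": "gene"})     # name the unnamed first column
       .rename(columns=lambda x: x.split(" ")[0])  # "A1BG (1)" -> "A1BG"
       .set_index("gene")
       .T                                          # genes become rows, cell lines columns
)

# Same NaN summary as in the cell above, on the toy table.
n_nan = int(toy.isna().sum().sum())
n_values = toy.shape[0] * toy.shape[1]
print(f"{n_nan} NaN over {n_values} values")
print(toy)
```

The empty last field of the first data row is read as NaN, which is what the NaN count reports.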

Then load the mapping information and show the content.

.. code:: ipython3

    df_map = pd.read_csv("Model.csv")
    df_map

.. raw:: html
ModelID PatientID CellLineName StrippedCellLineName DepmapModelType OncotreeLineage OncotreePrimaryDisease OncotreeSubtype OncotreeCode LegacyMolecularSubtype ... TissueOrigin CCLEName CatalogNumber PlateCoating ModelDerivationMaterial PublicComments WTSIMasterCellID SangerModelID COSMICID LegacySubSubtype
0 ACH-000001 PT-gj46wT NIH:OVCAR-3 NIHOVCAR3 HGSOC Ovary/Fallopian Tube Ovarian Epithelial Tumor High-Grade Serous Ovarian Cancer HGSOC NaN ... NaN NIHOVCAR3_OVARY HTB-71 NaN NaN NaN 2201.0 SIDM00105 905933.0 high_grade_serous
1 ACH-000002 PT-5qa3uk HL-60 HL60 AML Myeloid Acute Myeloid Leukemia Acute Myeloid Leukemia AML NaN ... NaN HL60_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE CCL-240 NaN NaN NaN 55.0 SIDM00829 905938.0 M3
2 ACH-000003 PT-puKIyc CACO2 CACO2 COAD Bowel Colorectal Adenocarcinoma Colon Adenocarcinoma COAD NaN ... NaN CACO2_LARGE_INTESTINE HTB-37 NaN NaN NaN NaN SIDM00891 NaN NaN
3 ACH-000004 PT-q4K2cp HEL HEL AML Myeloid Acute Myeloid Leukemia Acute Myeloid Leukemia AML NaN ... NaN HEL_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE ACC 11 NaN NaN NaN 783.0 SIDM00594 907053.0 M6
4 ACH-000005 PT-q4K2cp HEL 92.1.7 HEL9217 AML Myeloid Acute Myeloid Leukemia Acute Myeloid Leukemia AML NaN ... NaN HEL9217_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE HEL9217 NaN NaN NaN NaN SIDM00593 NaN M6
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1916 ACH-003157 PT-QDEP9D ABM-T0822 ABMT0822 ZIMMMPLC Lung Non-Cancerous Immortalized MPLC Cells NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1917 ACH-003158 PT-nszsxG ABM-T9220 ABMT9220 ZIMMSMCI Muscle Non-Cancerous Immortalized Smooth Muscle Cells, Intestinal NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1918 ACH-003159 PT-AUxVvV ABM-T9233 ABMT9233 ZIMMRSCH Hair Non-Cancerous Immortalized Hair Follicle Inner Root Sheath C... NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1919 ACH-003160 PT-AUxVvV ABM-T9249 ABMT9249 ZIMMGMCH Hair Non-Cancerous Immortalized Hair Germinal Matrix Cells NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1920 ACH-003161 PT-or1hkT ABM-T9430 ABMT9430 ZIMMPSC Pancreas Non-Cancerous Immortalized Pancreatic Stromal Cells NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

1921 rows × 36 columns

Example 1.1 two-class labeling of EGs based on tissue information
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Filter the information to be exploited
''''''''''''''''''''''''''''''''''''''

Remove the cell-lines mapped to tissues (``OncotreeLineage`` column in
the mapping file) that have fewer than ``minlines`` cell-lines.

.. code:: ipython3

    from HELPpy.utility.selection import filter_crispr_by_model
    df = filter_crispr_by_model(df, df_map, minlines=10, line_group='OncotreeLineage')
    df

.. raw:: html
gene ACH-000001 ACH-000004 ACH-000005 ACH-000007 ACH-000009 ACH-000011 ACH-000012 ACH-000013 ACH-000015 ACH-000017 ... ACH-002693 ACH-002710 ACH-002785 ACH-002799 ACH-002800 ACH-002834 ACH-002847 ACH-002922 ACH-002925 ACH-002926
A1BG -0.122637 0.019756 -0.107208 -0.031027 0.008888 0.022670 -0.096631 0.049811 -0.099040 -0.044896 ... -0.072582 -0.033722 -0.053881 -0.060617 0.025795 -0.055721 -0.009973 -0.025991 -0.127639 -0.068666
A1CF 0.025881 -0.083640 -0.023211 -0.137850 -0.146566 -0.057743 -0.024440 -0.158811 -0.070409 -0.115830 ... -0.237311 -0.108704 -0.114864 -0.042591 -0.132627 -0.121228 -0.119813 -0.007706 -0.040705 -0.107530
A2M 0.034217 -0.060118 0.200204 0.067704 0.084471 0.079679 0.041922 -0.003968 -0.029389 0.024537 ... -0.065940 0.079277 0.069333 0.030989 0.249826 0.072790 0.044097 -0.038468 0.134556 0.067806
A2ML1 -0.128082 -0.027417 0.116039 0.107988 0.089419 0.227512 0.039121 0.034778 0.084594 -0.003710 ... 0.101541 0.038977 0.066599 0.043809 0.064657 0.021916 0.041358 0.236576 -0.047984 0.112071
A3GALT2 -0.031285 -0.036116 -0.172227 0.007992 0.065109 -0.130448 0.028947 -0.120875 -0.052288 -0.336776 ... 0.005374 -0.144070 -0.256227 -0.116473 -0.294305 -0.221940 -0.146565 -0.239690 -0.116114 -0.149897
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
ZYG11A -0.289724 0.032983 -0.201273 -0.100344 -0.112703 0.013401 0.005124 -0.089180 -0.005409 -0.070396 ... -0.296880 -0.084936 -0.128569 -0.110504 -0.087171 0.024959 -0.119911 -0.079342 -0.043555 -0.045115
ZYG11B -0.062972 -0.410392 -0.178877 -0.462160 -0.598698 -0.296421 -0.131949 -0.145737 -0.216393 -0.257916 ... -0.332415 -0.193408 -0.327408 -0.257879 -0.349111 0.015259 -0.289412 -0.347484 -0.335270 -0.307900
ZYX 0.074180 0.113156 -0.055349 -0.001555 0.095877 0.067705 -0.109147 -0.034886 -0.137350 0.029457 ... -0.005090 -0.218960 -0.053033 -0.041612 -0.057478 -0.306562 -0.195097 -0.085302 -0.208063 0.070671
ZZEF1 0.111244 0.234388 -0.002161 -0.325964 -0.026742 -0.232453 -0.164482 -0.175850 -0.168087 -0.284838 ... -0.188751 -0.120449 -0.267081 0.006148 -0.189602 -0.148368 -0.206400 -0.095965 -0.094741 -0.187813
ZZZ3 -0.467908 -0.088306 -0.186842 -0.486660 -0.320759 -0.347234 -0.277397 -0.519586 -0.282338 -0.247634 ... -0.239991 -0.311396 -0.202158 -0.195154 -0.107107 -0.579576 -0.486525 -0.346272 -0.222404 -0.452143

18443 rows × 1091 columns
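
As a rough picture of what this filtering step does, the sketch below drops the columns (cell lines) whose lineage has fewer than ``minlines`` members. It is plain pandas on invented data: ``drop_small_groups`` is a hypothetical helper, not the HELPpy implementation, and counting group sizes only over the cell lines present in the score table is an assumption made here.

```python
import pandas as pd

def drop_small_groups(scores, mapping, minlines=10, line_group="OncotreeLineage"):
    """Hypothetical helper: keep only cell lines whose group has >= minlines lines."""
    # Restrict the mapping to cell lines that actually appear as columns.
    present = mapping[mapping["ModelID"].isin(scores.columns)]
    # Number of cell lines per group (for example, per tissue).
    sizes = present[line_group].value_counts()
    big_groups = sizes[sizes >= minlines].index
    keep = present.loc[present[line_group].isin(big_groups), "ModelID"]
    # Boolean column mask, so the original column order is preserved.
    return scores.loc[:, scores.columns.isin(keep)]

# Invented toy data: three Kidney lines and one Hair line.
toy_scores = pd.DataFrame(
    [[-0.1, 0.0, -0.2, 0.1], [-0.9, -1.1, -0.8, -1.0]],
    index=["GENE1", "GENE2"],
    columns=["L1", "L2", "L3", "L4"],
)
toy_map = pd.DataFrame(
    {
        "ModelID": ["L1", "L2", "L3", "L4"],
        "OncotreeLineage": ["Kidney", "Kidney", "Kidney", "Hair"],
    }
)

filtered = drop_small_groups(toy_scores, toy_map, minlines=2)
print(list(filtered.columns))
```

Only columns are removed; the number of rows (genes) is unchanged, which matches the table above going from 1100 to 1091 columns with the same 18443 rows.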

Show the tissues available in the mapping file:

.. code:: ipython3

    print(df_map[['OncotreeLineage']].value_counts())

.. parsed-literal::

    OncotreeLineage
    Lung                        249
    Lymphoid                    211
    CNS/Brain                   122
    Skin                        120
    Esophagus/Stomach            95
    Breast                       94
    Bowel                        89
    Head and Neck                84
    Bone                         77
    Myeloid                      77
    Ovary/Fallopian Tube         75
    Kidney                       73
    Pancreas                     66
    Peripheral Nervous System    56
    Soft Tissue                  55
    Biliary Tract                44
    Uterus                       41
    Fibroblast                   41
    Bladder/Urinary Tract        39
    Normal                       39
    Pleura                       35
    Liver                        29
    Cervix                       25
    Eye                          21
    Thyroid                      18
    Prostate                     15
    Testis                        7
    Vulva/Vagina                  5
    Muscle                        5
    Ampulla of Vater              4
    Hair                          2
    Other                         1
    Embryonal                     1
    Adrenal Gland                 1
    Name: count, dtype: int64

Select only the cell-lines of a chosen tissue (here ``Kidney``) and
remove the rows (genes) having at least a certain percentage of NaN
values (here 95%):

.. code:: ipython3

    tissue = 'Kidney'
    from HELPpy.utility.selection import select_cell_lines, delrows_with_nan_percentage
    from HELPpy.models.labelling import labelling
    cell_lines = select_cell_lines(df, df_map, [tissue])
    print(f"Selecting {len(cell_lines)} cell-lines")
    # remove rows (genes) with at least perc% NaNs
    df_nonan = delrows_with_nan_percentage(df, perc=95)

.. parsed-literal::

    Selecting 37 cell-lines
    Removed 512 rows from 18443 with at least 95% NaN

Apply two-class HELP labelling
''''''''''''''''''''''''''''''

Compute the two-class labeling (``mode='flat-multi'``) using the Otsu
algorithm (``algorithm='otsu'``) and save the results in a CSV file
(``Kidney_HELP_twoClasses.csv``):

.. code:: ipython3

    df_label2 = labelling(df_nonan, columns = cell_lines, n_classes=2, labelnames={0: 'E', 1: 'NE'}, mode='flat-multi', algorithm='otsu', verbose=True)
    # save the result
    df_label2.to_csv(f"{tissue}_HELP_twoClasses.csv")

.. parsed-literal::

    performing flat mode on 2-class labelling (flat-multi).
    [flat-multi]: 1. multi-class labelling:

.. parsed-literal::

    100%|██████████| 37/37 [00:00<00:00, 598.29it/s]

.. parsed-literal::

    label
    NE       16678
    E         1253
    Name: count, dtype: int64

Example 1.2 three-class labeling of EGs based on tissue information
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The data have already been filtered according to tissue information in
Example 1.1, so only the labelling step is needed.

Apply three-class HELP labelling
''''''''''''''''''''''''''''''''

Compute the three-class labeling (``mode='two-by-two'``) using the Otsu
algorithm (``algorithm='otsu'``) and save the results in a CSV file
(``Kidney_HELP_threeClasses.csv``). As the log below shows, this mode
runs the two-class labelling twice: first on all 17931 genes, then on
the 16678 genes that received label 1 in the first pass, which are
split into ``aE`` and ``sNE``:

.. code:: ipython3

    df_label3 = labelling(df_nonan, columns = cell_lines, n_classes=2, labelnames={0: 'E', 1: 'aE', 2: 'sNE'}, mode='two-by-two', algorithm='otsu', verbose=True)
    # save the result
    df_label3.to_csv(f"{tissue}_HELP_threeClasses.csv")

.. parsed-literal::

    performing flat mode on 3-class labelling (two-by-two).
    [two-by-two]: 1. Two-class labelling:

.. parsed-literal::

    100%|██████████| 37/37 [00:00<00:00, 572.91it/s]

.. parsed-literal::

    (17931,)
    [two-by-two]: 2. Two-class labelling on 1-label rows:

.. parsed-literal::

    100%|██████████| 37/37 [00:00<00:00, 737.77it/s]

.. parsed-literal::

    (16678,)
    label
    sNE      13457
    aE        3221
    E         1253
    Name: count, dtype: int64

Example 1.3 two-class labeling of EGs based on disease-related information
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Filter the information to be exploited
''''''''''''''''''''''''''''''''''''''

Show the diseases available in the mapping file
(``OncotreePrimaryDisease`` column):

.. code:: ipython3

    print(df_map[['OncotreePrimaryDisease']].value_counts())

.. parsed-literal::

    OncotreePrimaryDisease
    Non-Small Cell Lung Cancer                                       161
    Non-Cancerous                                                    131
    Mature B-Cell Neoplasms                                          113
    Melanoma                                                         107
    Diffuse Glioma                                                    94
                                                                    ...
    Hepatocellular Carcinoma plus Intrahepatic Cholangiocarcinoma      1
    Myelodysplastic Syndromes                                          1
    Mixed Cervical Carcinoma                                           1
    Hereditary Spherocytosis                                           1
    Acute Leukemias of Ambiguous Lineage                               1
    Name: count, Length: 86, dtype: int64

Select only the cell-lines mapped (via the ``OncotreePrimaryDisease``
column of the mapping file) to a chosen disease (here
``Acute Myeloid Leukemia``) and remove the rows (genes) having at least
a certain percentage of NaN values (here 95%):

.. code:: ipython3

    disease = 'Acute Myeloid Leukemia'
    from HELPpy.utility.selection import select_cell_lines, delrows_with_nan_percentage
    from HELPpy.models.labelling import labelling
    cell_lines = select_cell_lines(df, df_map, [disease], line_group='OncotreePrimaryDisease')
    print(f"Selecting {len(cell_lines)} cell-lines")
    # remove rows (genes) with at least perc% NaNs
    df_nonan = delrows_with_nan_percentage(df[cell_lines], perc=95)

.. parsed-literal::

    Selecting 24 cell-lines
    Removed 512 rows from 18443 with at least 95% NaN

Apply two-class HELP labelling
''''''''''''''''''''''''''''''

Compute the two-class labeling (``mode='flat-multi'``) using the Otsu
algorithm (``algorithm='otsu'``), save the results in a CSV file
(``Acute Myeloid Leukemia_HELP_twoClasses.csv``) and print a summary of
the labels:

.. code:: ipython3

    df_label2 = labelling(df_nonan, columns = cell_lines, n_classes=2, labelnames={0: 'E', 1: 'NE'}, mode='flat-multi', algorithm='otsu', verbose=True)
    # save the result
    df_label2.to_csv(f"{disease}_HELP_twoClasses.csv")
    # print the number of genes per label
    df_label2.value_counts(normalize=False)

.. parsed-literal::

    performing flat mode on 2-class labelling (flat-multi).
    [flat-multi]: 1. multi-class labelling:

.. parsed-literal::

    100%|██████████| 24/24 [00:00<00:00, 492.87it/s]

.. parsed-literal::

    label
    NE       16609
    E         1322
    Name: count, dtype: int64

.. parsed-literal::

    label
    NE       16609
    E         1322
    Name: count, dtype: int64
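
The labelling calls in these examples all pass ``algorithm='otsu'``. As an illustration only, the sketch below computes an Otsu threshold on a one-dimensional set of invented scores and uses it to split them into two classes. The function ``otsu_threshold``, the 64-bin histogram and the synthetic scores are assumptions made for this example; it does not reproduce how HELPpy thresholds each cell line or combines the per-cell-line labels.

```python
import numpy as np

def otsu_threshold(values, bins=64):
    """Return the histogram edge that maximises the between-class variance."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]          # ignore missing scores
    hist, edges = np.histogram(values, bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2
    weights = hist / hist.sum()

    best_cut, best_var = edges[1], -1.0
    for k in range(1, bins):
        w0, w1 = weights[:k].sum(), weights[k:].sum()
        if w0 == 0 or w1 == 0:
            continue                             # one side is empty: not a valid split
        mu0 = (weights[:k] * centers[:k]).sum() / w0
        mu1 = (weights[k:] * centers[k:]).sum() / w1
        between = w0 * w1 * (mu0 - mu1) ** 2     # between-class variance
        if between > best_var:
            best_var, best_cut = between, edges[k]
    return best_cut

# Invented scores: a small strongly negative group (essential-like)
# and a larger group centred on zero (non-essential-like).
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(-1.0, 0.1, 50), rng.normal(0.0, 0.1, 450)])

cut = otsu_threshold(scores)
labels = np.where(scores <= cut, "E", "NE")
print(cut, int((labels == "E").sum()), int((labels == "NE").sum()))
```

Running the same function a second time on the scores labelled ``NE`` would give a three-way split, in the spirit of the second pass reported in the ``two-by-two`` log of Example 1.2.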