HELPpy.utility package

Submodules

HELPpy.utility.selection module

HELPpy.utility.selection.EG_tissues_intersect()[source]

Calculate the intersection and differences of gene sets across multiple tissues.

Parameters:

tissues (Dict[str, pd.DataFrame]) – Dictionary of tissue names and associated dataframes.
common_df (Union[None, pd.DataFrame]) – DataFrame containing common data.
labelname (str) – Name of the label column in the dataframes.
labelval (str) – Value to consider as the target label.
display (bool) – Whether to display a Venn diagram.
verbose (bool) – Whether to print verbose information.
barheight (int) – Height of bars in the Venn diagram.
barwidth (int) – Width of the Venn diagram.
fontsize (int) – Font size of the Venn diagram labels.

Returns:

A tuple containing sets of genes for each tissue, the intersection of genes, and differences in genes.

Return type:

Tuple[Dict[str, set], set, Dict[str, set]]

Example:

tissues = {'Tissue1': pd.DataFrame(...), 'Tissue2': pd.DataFrame(...), ...}
common_df = pd.DataFrame(...)  # Optional
sets, inset, diffs = EG_tissues_intersect(tissues, common_df, labelname='label', labelval='E', display=True)
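
The shape of the returned triple can be pictured with plain Python sets. The sketch below is not the library's implementation: the gene names are made up, and it assumes that a tissue's "difference" means the genes carrying the target label in that tissue and in no other tissue.

```python
# Hypothetical per-tissue sets of genes carrying the target label (e.g. 'E').
sets = {
    "Tissue1": {"GENE_A", "GENE_B", "GENE_C"},
    "Tissue2": {"GENE_B", "GENE_C", "GENE_D"},
    "Tissue3": {"GENE_C", "GENE_E"},
}

# Genes shared by every tissue.
inset = set.intersection(*sets.values())

# Assumed meaning of "difference": genes found in one tissue and in no other.
diffs = {
    name: genes - set().union(*(g for other, g in sets.items() if other != name))
    for name, genes in sets.items()
}

print(sorted(inset))
for name in sets:
    print(name, sorted(diffs[name]))
```

Under that assumption, the three values have the same types as the documented return type: a dictionary of sets, a single set, and a dictionary of sets.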

HELPpy.utility.selection.EG_tissues_intersect_dolabelling(df: DataFrame, df_map: DataFrame, tissues: List[str] = [], subtract_common: bool = False, three_class: bool = False, display: bool = False, verbose: bool = False, barheight: int = 2, barwidth: int = 10, fontsize: int = 17) → DataFrame[source]

Identify overlapping and unique Essential Genes (EGs) by tissues.

Parameters:

df (pd.DataFrame) – DataFrame containing cell line information.
df_map (pd.DataFrame) – DataFrame containing mapping information.
tissues (List[str]) – List of tissues for which EGs need to be identified.
subtract_common (bool) – Whether to subtract common EGs from pantissue labeling.
three_class (bool) – Whether to use a three-class labeling (E, NE, NC).
display (bool) – Whether to display a Venn diagram.
verbose (bool) – Whether to print verbose information.
barheight (int) – Height of the Venn diagram.
barwidth (int) – Width of the Venn diagram.
fontsize (int) – Font size for the Venn diagram.

Returns:

Tuple containing sets of EGs, intersection of EGs, and differences in EGs.

Return type:

Tuple[List[set], set, Dict[str, set]]

Example:

df = pd.DataFrame(...)
df_map = pd.DataFrame(...)
tissues = ['Tissue1', 'Tissue2']
sets, inset, diffs = EG_tissues_intersect_dolabelling(df, df_map, tissues, subtract_common=True, three_class=False, display=True, verbose=True)

HELPpy.utility.selection.delrows_with_nan_percentage(df: DataFrame, perc: float = 100.0, verbose=False)[source]

Filter rows in a DataFrame based on the percentage of NaN values.

Parameters:

df (pd.DataFrame) – The input DataFrame.
perc (float) – The percentage of NaN values allowed in each row. Default is 100.0.
verbose (bool) – Whether to print verbose information. Default is False.

Returns:

A new DataFrame with rows filtered based on the specified percentage of NaN values.

Return type:

pd.DataFrame
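
The row filter can be approximated in plain pandas. This is a minimal sketch rather than the library's code: the helper name `drop_rows_by_nan_percentage` and the sample data are hypothetical, and it assumes a row is dropped once its NaN percentage reaches `perc`, so that the signature default of 100.0 removes only rows that are entirely NaN. The exact boundary handling in HELPpy may differ.

```python
import numpy as np
import pandas as pd


def drop_rows_by_nan_percentage(df, perc=100.0):
    """Illustrative helper: keep rows whose NaN percentage is below `perc`."""
    # Fraction of NaN cells per row, expressed as a percentage.
    nan_pct = df.isna().mean(axis=1) * 100.0
    # Assumption: the threshold itself is excluded (strictly less than).
    return df.loc[nan_pct < perc]


# Made-up scores: genes as rows, two hypothetical cell lines as columns.
scores = pd.DataFrame(
    {"line_a": [0.1, np.nan, np.nan], "line_b": [0.2, 0.5, np.nan]},
    index=["GENE1", "GENE2", "GENE3"],
)

kept_default = drop_rows_by_nan_percentage(scores)           # drops only all-NaN rows
kept_strict = drop_rows_by_nan_percentage(scores, perc=50.0)  # also drops half-NaN rows
print(list(kept_default.index), list(kept_strict.index))
```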

HELPpy.utility.selection.filter_cellmap(df_map: DataFrame, minlines: int = 1, line_group: str = 'OncotreeLineage')[source]

Filters a cell map DataFrame based on the minimum number of lines per group.

Parameters:

df_map (pd.DataFrame) – The input DataFrame containing cell map information.
minlines (int) – The minimum number of lines required to retain a group. Default is 1.
line_group (str) – Column name for the grouping information in the cell map DataFrame. Default is ‘OncotreeLineage’.

Returns:

Filtered DataFrame containing only the groups that meet the minimum lines criteria.

Return type:

pd.DataFrame

Example:

filtered_df = filter_cellmap(cell_map_data, minlines=10, line_group='OncotreeLineage')
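
For readers who want to see the group-size criterion without installing the package, the following sketch reproduces the described behaviour in plain pandas. The helper name `keep_groups_with_min_lines` and the cell line IDs are hypothetical, and "minimum" is assumed to mean "at least `minlines` rows in the group".

```python
import pandas as pd


def keep_groups_with_min_lines(df_map, minlines=1, line_group="OncotreeLineage"):
    """Illustrative helper: keep rows whose group has at least `minlines` rows."""
    # Size of the group each row belongs to, aligned with the original index.
    group_sizes = df_map[line_group].map(df_map[line_group].value_counts())
    return df_map[group_sizes >= minlines]


# Made-up cell map: three kidney lines and a single lung line.
cell_map_data = pd.DataFrame(
    {
        "ModelID": ["LINE1", "LINE2", "LINE3", "LINE4"],
        "OncotreeLineage": ["Kidney", "Kidney", "Lung", "Kidney"],
    }
)

filtered_df = keep_groups_with_min_lines(cell_map_data, minlines=2)
print(filtered_df)
```

With `minlines=2`, the lung group has only one line and is removed, while the rows of the kidney group keep their original order.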

HELPpy.utility.selection.filter_crispr_by_model(df: DataFrame, df_map: DataFrame, minlines: int = 1, line_colname: str = 'ModelID', line_group: str = 'OncotreeLineage')[source]

Filter a CRISPR DataFrame based on a mapping DataFrame and specified conditions.

Parameters:

df (pd.DataFrame) – The CRISPR DataFrame to be filtered.
df_map (pd.DataFrame) – The mapping DataFrame containing information about cell lines and models.
minlines (int) – The minimum number of lines required for a tissue in the model. Default is 1.
line_colname (str) – The column name in both DataFrames representing the cell line ID. Default is ‘ModelID’.
line_group (str) – The column name in the mapping DataFrame representing the tissue/lineage group. Default is ‘OncotreeLineage’.

Returns:

A new DataFrame with CRISPR data filtered based on the selected cell lines and conditions.

Return type:

pd.DataFrame

HELPpy.utility.selection.select_cell_lines(df: DataFrame, df_map: DataFrame, tissue_list: str | List[str], line_group='OncotreeLineage', line_col='ModelID', nested=False, verbose=0)[source]

Select cell lines based on tissue and mapping information.

Parameters:

df (pd.DataFrame) – DataFrame containing cell line information.
df_map (pd.DataFrame) – DataFrame containing mapping information.
tissue_list (str | List[str]) – A tissue name or a list of tissues for which cell lines need to be selected.
line_group (str) – The column in ‘df_map’ to use for tissue selection (default is ‘OncotreeLineage’).
line_col (str) – The column in ‘df_map’ to use for line selection (default is ‘ModelID’).
nested (bool) – Whether to return cell lines as nested lists (one list per tissue, which enables mode-of-modes labelling).
verbose (int) – Verbosity level for printing information.

Returns:

List of selected cell lines, either flattened or nested based on the ‘nested’ parameter.

Return type:

List

Example:

df = pd.DataFrame(...)
df_map = pd.DataFrame(...)
tissue_list = ['Tissue1', 'Tissue2']
selected_lines = select_cell_lines(df, df_map, tissue_list, line_group='OncotreeLineage', line_col='ModelID', nested=False, verbose=1)
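
The difference between the flat and nested return shapes can be illustrated with a small stand-in. This is a sketch under stated assumptions, not the library's code: the helper `pick_lines` and the data are hypothetical, cell lines are assumed to be the columns of `df`, and IDs missing from `df` are assumed to be skipped.

```python
import pandas as pd


def pick_lines(df, df_map, tissue_list, line_group="OncotreeLineage",
               line_col="ModelID", nested=False):
    """Illustrative helper: collect cell line IDs per tissue, flat or nested."""
    if isinstance(tissue_list, str):
        tissue_list = [tissue_list]  # accept a single tissue name
    per_tissue = []
    for tissue in tissue_list:
        ids = df_map.loc[df_map[line_group] == tissue, line_col]
        # Assumption: cell lines are columns of `df`; keep only IDs present there.
        per_tissue.append([i for i in ids if i in df.columns])
    if nested:
        return per_tissue
    return [i for lines in per_tissue for i in lines]


# Made-up data: LINE4 appears in the map but has no column in `df`.
df = pd.DataFrame(
    {"LINE1": [0.1], "LINE2": [0.2], "LINE3": [0.3]}, index=["GENE1"]
)
df_map = pd.DataFrame(
    {
        "ModelID": ["LINE1", "LINE2", "LINE3", "LINE4"],
        "OncotreeLineage": ["Kidney", "Lung", "Kidney", "Lung"],
    }
)

flat = pick_lines(df, df_map, ["Kidney", "Lung"])
grouped = pick_lines(df, df_map, ["Kidney", "Lung"], nested=True)
print(flat, grouped)
```

The nested form keeps one inner list per tissue in the order of `tissue_list`, which is what a per-tissue labelling step followed by a second aggregation across tissues would need.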

HELPpy.utility.selection.set_seed(seed=1)[source]
