HELPpy.utility package
Submodules
HELPpy.utility.selection module
- HELPpy.utility.selection.EG_tissues_intersect()
Calculate the intersection and differences of gene sets across multiple tissues.
- Parameters:
tissues (Dict[str, pd.DataFrame]) – Dictionary of tissue names and associated dataframes.
common_df (Union[None, pd.DataFrame]) – DataFrame containing common data.
labelname (str) – Name of the label column in the dataframes.
labelval (str) – Value to consider as the target label.
display (bool) – Whether to display a Venn diagram.
verbose (bool) – Whether to print verbose information.
barheight (int) – Height of bars in the Venn diagram.
barwidth (int) – Width of the Venn diagram.
fontsize (int) – Font size of the Venn diagram labels.
- Returns:
A tuple containing sets of genes for each tissue, the intersection of genes, and differences in genes.
- Return type:
Tuple[Dict[str, set], set, Dict[str, set]]
- Example:
tissues = {'Tissue1': pd.DataFrame(...), 'Tissue2': pd.DataFrame(...), ...}
common_df = pd.DataFrame(...)  # Optional
sets, inset, diffs = EG_tissues_intersect(tissues, common_df, labelname='label', labelval='E', display=True)
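The shape of the three returned values can be illustrated with plain pandas and Python sets. This is a minimal sketch, not the HELPpy implementation: the helper name `tissue_gene_sets` is hypothetical, each dataframe is assumed to be indexed by gene name, and "differences" is assumed to mean the genes carrying the target label in one tissue and in no other.

```python
import pandas as pd

def tissue_gene_sets(tissues, labelname="label", labelval="E"):
    """Hypothetical helper, not part of HELPpy."""
    # Genes (index values) whose label equals labelval, per tissue
    sets = {
        name: set(df[df[labelname] == labelval].index)
        for name, df in tissues.items()
    }
    # Genes carrying the target label in every tissue
    inset = set.intersection(*sets.values()) if sets else set()
    # Assumption: a tissue's "difference" is its genes found in no other tissue
    diffs = {}
    for name, genes in sets.items():
        others = set().union(*(g for n, g in sets.items() if n != name))
        diffs[name] = genes - others
    return sets, inset, diffs

tissues = {
    "Tissue1": pd.DataFrame({"label": ["E", "E", "NE"]}, index=["A", "B", "C"]),
    "Tissue2": pd.DataFrame({"label": ["E", "NE", "E"]}, index=["A", "B", "C"]),
}
sets, inset, diffs = tissue_gene_sets(tissues)
print(inset)  # {'A'}
print(diffs)  # {'Tissue1': {'B'}, 'Tissue2': {'C'}}
```

Gene A is labelled E in both tissues, so it forms the intersection, while B and C are each specific to a single tissue.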
- HELPpy.utility.selection.EG_tissues_intersect_dolabelling(df: DataFrame, df_map: DataFrame, tissues: List[str] = [], subtract_common: bool = False, three_class: bool = False, display: bool = False, verbose: bool = False, barheight: int = 2, barwidth: int = 10, fontsize: int = 17) -> DataFrame
Identify overlapping and unique Essential Genes (EGs) by tissues.
- Parameters:
df (pd.DataFrame) – DataFrame containing cell line information.
df_map (pd.DataFrame) – DataFrame containing mapping information.
tissues (List[str]) – List of tissues for which EGs need to be identified.
subtract_common (bool) – Whether to subtract common EGs from pantissue labeling.
three_class (bool) – Whether to use a three-class labeling (E, NE, NC).
display (bool) – Whether to display a Venn diagram.
verbose (bool) – Whether to print verbose information.
barheight (int) – Height of the Venn diagram.
barwidth (int) – Width of the Venn diagram.
fontsize (int) – Font size for the Venn diagram.
- Returns:
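Tuple containing sets of EGs, intersection of EGs, and differences in EGs. Note that the signature annotates the return type as DataFrame, which conflicts with the tuple described here.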
- Return type:
Tuple[List[set], set, Dict[str, set]]
- Example:
df = pd.DataFrame(...)
df_map = pd.DataFrame(...)
tissues = ['Tissue1', 'Tissue2']
sets, inset, diffs = EG_tissues_intersect_dolabelling(df, df_map, tissues, subtract_common=True, three_class=False, display=True, verbose=True)
- HELPpy.utility.selection.delrows_with_nan_percentage(df: DataFrame, perc: float = 100.0, verbose=False)
Filter rows in a DataFrame based on the percentage of NaN values.
- Parameters:
df (pd.DataFrame) – The input DataFrame.
perc (float) – The percentage of NaN values allowed in each row. Default is 100.0.
verbose (bool) – Whether to print verbose information. Default is False.
- Returns:
A new DataFrame with rows filtered based on the specified percentage of NaN values.
- Return type:
pd.DataFrame
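The row filter described above can be sketched with `DataFrame.isna`. This is an illustration under stated assumptions rather than the library's code: the helper name `drop_rows_by_nan_percentage` is hypothetical, and the threshold is assumed to be inclusive (a row is kept when its NaN percentage is at most `perc`), which the documentation does not specify.

```python
import numpy as np
import pandas as pd

def drop_rows_by_nan_percentage(df, perc=100.0):
    """Hypothetical helper, not part of HELPpy."""
    # Percentage of NaN cells in each row
    nan_perc = df.isna().mean(axis=1) * 100
    # Assumption: the threshold is inclusive
    return df[nan_perc <= perc]

scores = pd.DataFrame(
    {
        "c1": [1.0, np.nan, np.nan, np.nan],
        "c2": [2.0, 2.0, np.nan, np.nan],
        "c3": [3.0, 3.0, 3.0, np.nan],
        "c4": [4.0, 4.0, 4.0, np.nan],
    },
    index=["g1", "g2", "g3", "g4"],
)
# Rows g1..g4 contain 0%, 25%, 50% and 100% NaN values respectively
kept = drop_rows_by_nan_percentage(scores, perc=50.0)
print(list(kept.index))  # ['g1', 'g2', 'g3']
```

Under this inclusive reading, the default `perc=100.0` would keep every row, including rows that are entirely NaN.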
- HELPpy.utility.selection.filter_cellmap(df_map: DataFrame, minlines: int = 1, line_group: str = 'OncotreeLineage')
Filters a cell map DataFrame based on the minimum number of lines per group.
- Parameters:
df_map (pd.DataFrame) – The input DataFrame containing cell map information.
minlines (int) – The minimum number of lines required to retain a group. Default is 1.
line_group (str) – Column name for the grouping information in the cell map DataFrame. Default is ‘OncotreeLineage’.
- Returns:
Filtered DataFrame containing only the groups that meet the minimum lines criteria.
- Return type:
pd.DataFrame
- Example:
filtered_df = filter_cellmap(cell_map_data, minlines=10, line_group='OncotreeLineage')
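A group-size filter of this kind can be written with `value_counts` and `map`. The sketch below is not taken from HELPpy: the helper name `keep_groups_with_min_lines` and the sample model IDs are hypothetical, and "minimum number of lines" is assumed to mean at least `minlines` rows per group.

```python
import pandas as pd

def keep_groups_with_min_lines(df_map, minlines=1, line_group="OncotreeLineage"):
    """Hypothetical helper, not part of HELPpy."""
    # Number of rows (cell lines) in each group
    counts = df_map[line_group].value_counts()
    # Keep rows whose group has at least minlines members
    return df_map[df_map[line_group].map(counts) >= minlines]

cell_map_data = pd.DataFrame(
    {
        "ModelID": ["M1", "M2", "M3", "M4", "M5"],
        "OncotreeLineage": ["Lung", "Lung", "Lung", "Skin", "Bone"],
    }
)
filtered_df = keep_groups_with_min_lines(cell_map_data, minlines=2)
print(list(filtered_df["ModelID"]))  # ['M1', 'M2', 'M3']
```

Only the Lung group has two or more lines here, so the Skin and Bone rows are dropped.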
- HELPpy.utility.selection.filter_crispr_by_model(df: DataFrame, df_map: DataFrame, minlines: int = 1, line_colname: str = 'ModelID', line_group: str = 'OncotreeLineage')
Filter a CRISPR DataFrame based on a mapping DataFrame and specified conditions.
- Parameters:
df (pd.DataFrame) – The CRISPR DataFrame to be filtered.
df_map (pd.DataFrame) – The mapping DataFrame containing information about cell lines and models.
minlines (int) – The minimum number of lines required for a tissue in the model. Default is 1.
line_colname (str) – The column name in both DataFrames representing the cell line ID. Default is ‘ModelID’.
line_group (str) – The column name in the mapping DataFrame representing the tissue/lineage group. Default is ‘OncotreeLineage’.
- Returns:
A new DataFrame with CRISPR data filtered based on the selected cell lines and conditions.
- Return type:
pd.DataFrame
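One way to read the description above is: keep the CRISPR rows whose cell line belongs to a tissue group with enough lines in the mapping. The sketch below follows that reading and is not the HELPpy implementation. The helper name `filter_by_model`, the sample IDs, and the `GeneA` column are hypothetical, and the CRISPR DataFrame is assumed to hold one row per cell line with the ID in the `line_colname` column, as the parameter description suggests.

```python
import pandas as pd

def filter_by_model(df, df_map, minlines=1,
                    line_colname="ModelID", line_group="OncotreeLineage"):
    """Hypothetical helper, not part of HELPpy."""
    # Mapping rows whose tissue group has at least minlines cell lines
    counts = df_map[line_group].value_counts()
    well_covered = df_map[df_map[line_group].map(counts) >= minlines]
    # Assumption: df stores the cell line ID in a regular column
    return df[df[line_colname].isin(well_covered[line_colname])]

df_map = pd.DataFrame(
    {
        "ModelID": ["M1", "M2", "M3"],
        "OncotreeLineage": ["Lung", "Lung", "Skin"],
    }
)
crispr = pd.DataFrame(
    {
        "ModelID": ["M1", "M2", "M3", "M9"],
        "GeneA": [-1.2, -0.9, 0.1, 0.3],
    }
)
filtered = filter_by_model(crispr, df_map, minlines=2)
print(list(filtered["ModelID"]))  # ['M1', 'M2']
```

M3 is dropped because Skin has a single line, and M9 is dropped because it has no entry in the mapping.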
- HELPpy.utility.selection.select_cell_lines(df: DataFrame, df_map: DataFrame, tissue_list: str | List[str], line_group='OncotreeLineage', line_col='ModelID', nested=False, verbose=0)
Select cell lines based on tissue and mapping information.
- Parameters:
df (pd.DataFrame) – DataFrame containing cell line information.
df_map (pd.DataFrame) – DataFrame containing mapping information.
tissue_list (str | List[str]) – Tissue name or list of tissues for which cell lines need to be selected.
line_group (str) – The column in ‘df_map’ to use for tissue selection (default is ‘OncotreeLineage’).
line_col (str) – The column in ‘df_map’ to use for line selection (default is ‘ModelID’).
nested (bool) – Whether to return cell lines as nested lists (one list per tissue, which enables mode-of-mode labelling).
verbose (int) – Verbosity level for printing information.
- Returns:
List of selected cell lines, either flattened or nested based on the ‘nested’ parameter.
- Return type:
List
- Example:
df = pd.DataFrame(...)
df_map = pd.DataFrame(...)
tissue_list = ['Tissue1', 'Tissue2']
selected_lines = select_cell_lines(df, df_map, tissue_list, line_group='OncotreeLineage', line_col='ModelID', nested=False, verbose=1)
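The difference between the flattened and nested return values can be shown with a small stand-in. This sketch is not the library's code: the helper name `pick_lines` and the sample IDs are hypothetical, and it assumes that `df` has one column per cell line and that only lines present among those columns are returned.

```python
import pandas as pd

def pick_lines(df, df_map, tissue_list,
               line_group="OncotreeLineage", line_col="ModelID", nested=False):
    """Hypothetical helper, not part of HELPpy."""
    # Accept a single tissue name as well as a list of names
    if isinstance(tissue_list, str):
        tissue_list = [tissue_list]
    per_tissue = []
    for tissue in tissue_list:
        ids = df_map.loc[df_map[line_group] == tissue, line_col]
        # Assumption: keep only lines that appear as columns of df
        per_tissue.append([i for i in ids if i in df.columns])
    if nested:
        return per_tissue  # one list per tissue
    return [i for lines in per_tissue for i in lines]  # single flat list

df = pd.DataFrame({"M1": [0.1], "M2": [0.2], "M4": [0.4]}, index=["GeneA"])
df_map = pd.DataFrame(
    {
        "ModelID": ["M1", "M2", "M3", "M4"],
        "OncotreeLineage": ["Lung", "Lung", "Lung", "Skin"],
    }
)
flat = pick_lines(df, df_map, ["Lung", "Skin"])
grouped = pick_lines(df, df_map, ["Lung", "Skin"], nested=True)
print(flat)     # ['M1', 'M2', 'M4']
print(grouped)  # [['M1', 'M2'], ['M4']]
```

The nested form preserves which tissue each line came from, which is what a per-tissue labelling step followed by a cross-tissue aggregation would need.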