HELPpy.preprocess package

Submodules

HELPpy.preprocess.embedding module

HELPpy.preprocess.embedding.PPI_embed(df_net: DataFrame, method: str = 'Node2Vec', dimensions: int = 128, walk_number: int = 10, walk_length: int = 80, workers: int = 1, epochs: int = 1, learning_rate: float = 0.05, seed: int = 42, params: Dict = {'min_count': 1, 'p': 1.0, 'q': 1.0, 'window_size': 5}, source: str = 'A', target: str = 'B', weight: str = 'combined_score', verbose: bool = False)[source]

Embeds a protein-protein interaction (PPI) network using graph embedding techniques.

Parameters:
  • df_net (pd.DataFrame) – The input DataFrame containing the PPI network information.

  • method (str) – The graph embedding method. Options: ‘DeepWalk’, ‘Node2Vec’, ‘AE’. Default is ‘Node2Vec’.

  • dimensions (int) – The dimensionality of the embedding. Default is 128.

  • walk_number (int) – Number of walks per node. Default is 10.

  • walk_length (int) – Length of each walk. Default is 80.

  • workers (int) – Number of parallel workers. Default is 1.

  • epochs (int) – Number of training epochs. Default is 1.

  • learning_rate (float) – Learning rate for the embedding model. Default is 0.05.

  • seed (int) – Random seed for reproducibility. Default is 42.

  • params (Dict) – Additional parameters for the embedding method. Default is {"p": 1.0, "q": 1.0, "window_size": 5, "min_count": 1}.

  • source (str) – Column name for the source nodes in the PPI network DataFrame. Default is ‘A’.

  • target (str) – Column name for the target nodes in the PPI network DataFrame. Default is ‘B’.

  • weight (str) – Column name for the edge weights in the PPI network DataFrame. Default is ‘combined_score’.

  • verbose (bool) – Whether to print progress information. Default is False.

Returns:

DataFrame containing the node embeddings.

Return type:

pd.DataFrame

Example:

df_embedding = PPI_embed(ppi_data, method='Node2Vec', dimensions=128, epochs=5, verbose=True)
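For a self-contained illustration, the sketch below builds a toy network with the default column names (‘A’, ‘B’, ‘combined_score’); the nodes and scores are placeholders, not real interaction data, and the embedding parameters are scaled down for a toy graph:

import pandas as pd
from HELPpy.preprocess.embedding import PPI_embed

# Toy PPI network with the default source/target/weight column names
ppi_data = pd.DataFrame({
    'A': ['P1', 'P1', 'P2', 'P3'],
    'B': ['P2', 'P3', 'P3', 'P4'],
    'combined_score': [0.9, 0.7, 0.8, 0.6],
})

# Small graph, so dimensions and walk settings are reduced accordingly
df_embedding = PPI_embed(ppi_data, method='Node2Vec', dimensions=16,
                         walk_number=5, walk_length=20, seed=42, verbose=True)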

HELPpy.preprocess.imputer module

HELPpy.preprocess.imputer.imputer_knn(df: DataFrame, n_neighbors: int = 5, missing_values=nan, weights: str = 'uniform') → DataFrame[source]

Impute missing values in a DataFrame using K-Nearest Neighbors (KNN) imputation.

Parameters:
  • df (pd.DataFrame) – Input DataFrame with missing values.

  • n_neighbors (int) – Number of neighbors to consider for imputation using KNN. Optional, default is 5.

  • missing_values (any) – The placeholder for missing values in the input DataFrame. Optional, default is np.nan.

  • weights (str) – Weight function used in prediction during KNN imputation. Optional, default is “uniform”.

Returns:

DataFrame with missing values imputed using KNN.

Return type:

pd.DataFrame

Example:

from HELPpy.preprocess.imputer import imputer_knn
import numpy as np
import pandas as pd

# Create a DataFrame with missing values
data = {'A': [1, 2, np.nan, 4], 'B': [5, np.nan, 7, 8]}
df = pd.DataFrame(data)

# Impute missing values using KNN imputation
result = imputer_knn(df, n_neighbors=3, missing_values=np.nan, weights="distance")

HELPpy.preprocess.imputer.imputer_knn_group(df: DataFrame, df_map: DataFrame, line_group: str = 'OncotreeLineage', line_col: str = 'ModelID', n_neighbors: int = 5, missing_values=nan, weights: str = 'uniform', verbose: bool = False) → DataFrame[source]

Impute missing values in a DataFrame grouped by specified lineages using K-Nearest Neighbors (KNN) imputation.

Parameters:
  • df (pd.DataFrame) – Input DataFrame with missing values.

  • df_map (pd.DataFrame) – DataFrame mapping cell lines to lineages.

  • line_group (str) – Column specifying the grouping of cell lines (lineages). Default is ‘OncotreeLineage’.

  • line_col (str) – Column containing cell line identifiers. Default is ‘ModelID’.

  • n_neighbors (int) – Number of neighbors to consider for imputation using KNN. Default is 5.

  • missing_values (any) – The placeholder for missing values in the input DataFrame. Default is np.nan.

  • weights (str) – Weight function used in prediction during KNN imputation. Default is “uniform”.

  • verbose (bool) – If True, print progress information. Default is False.

Returns:

DataFrame with missing values imputed using KNN grouped by lineages.

Return type:

pd.DataFrame

Example:

from HELPpy.preprocess.imputer import imputer_knn_group
import numpy as np
import pandas as pd

# Create a DataFrame with missing values
data = {'A': [1, 2, np.nan, 4], 'B': [5, np.nan, 7, 8]}
df = pd.DataFrame(data)

# Create a mapping DataFrame
data_map = {'ModelID': ['M1', 'M2', 'M3', 'M4'], 'OncotreeLineage': ['L1', 'L2', 'L1', 'L2']}
df_map = pd.DataFrame(data_map)

# Impute missing values using KNN imputation grouped by lineages
result = imputer_knn_group(df, df_map, line_group='OncotreeLineage', line_col='ModelID', n_neighbors=3, missing_values=np.nan, weights="distance", verbose=True)
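The toy frames above illustrate the argument types only. Below is a fuller sketch, under the assumption that the columns of df are cell-line identifiers matching the values in line_col (i.e., a genes-by-cell-lines matrix), so that each lineage group is imputed independently:

import numpy as np
import pandas as pd
from HELPpy.preprocess.imputer import imputer_knn_group

# Hypothetical genes-by-cell-lines matrix; columns match df_map['ModelID']
df = pd.DataFrame({'M1': [1.0, np.nan, 3.0], 'M2': [0.5, 2.0, np.nan],
                   'M3': [np.nan, 1.5, 2.5], 'M4': [1.2, 1.8, 3.1]},
                  index=['GENE1', 'GENE2', 'GENE3'])
df_map = pd.DataFrame({'ModelID': ['M1', 'M2', 'M3', 'M4'],
                       'OncotreeLineage': ['L1', 'L2', 'L1', 'L2']})

# Lineage L1 groups M1/M3 and L2 groups M2/M4; each is imputed separately
result = imputer_knn_group(df, df_map, n_neighbors=2, verbose=True)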

HELPpy.preprocess.imputer.imputer_mean(df: DataFrame, missing_values=nan, strategy: str = 'mean') → DataFrame[source]

Impute missing values in a DataFrame using mean imputation.

Parameters:
  • df (pd.DataFrame) – Input DataFrame with missing values.

  • missing_values (any) – The placeholder for missing values in the input DataFrame. Optional, default is np.nan.

  • strategy (str) – Imputation strategy, e.g., ‘mean’, ‘median’, ‘most_frequent’. Optional, default is “mean”.

Returns:

DataFrame with missing values imputed using mean imputation.

Return type:

pd.DataFrame

Example:

from HELPpy.preprocess.imputer import imputer_mean
import numpy as np
import pandas as pd

# Create a DataFrame with missing values
data = {'A': [1, 2, np.nan, 4], 'B': [5, np.nan, 7, 8]}
df = pd.DataFrame(data)

# Impute missing values using mean imputation
result = imputer_mean(df, missing_values=np.nan, strategy='mean')
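The strategy options mirror those of scikit-learn's SimpleImputer (the likely backend, given the signature); for instance, a median-based variant of the call above:

result_median = imputer_mean(df, missing_values=np.nan, strategy='median')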

HELPpy.preprocess.loaders module

HELPpy.preprocess.loaders.feature_assemble(label_file: str, features: List[Dict[str, str | bool]] = [{'fixna': False, 'fname': 'BIO.csv', 'nchunks': 1, 'normalize': 'std'}], colname: str = 'label', subsample: bool = False, seed: int = 1, fold: int = 4, saveflag: bool = False, verbose: bool = False) → Tuple[DataFrame, DataFrame][source]

Assemble features and labels for machine learning tasks.

Parameters:
  • label_file (str) – Path to the label file.

  • features (List[Dict[str, Union[str, bool]]]) – List of dictionaries specifying the feature files and their processing options. Each dictionary accepts: ‘fname’ (str), the filename of the attribute file (in CSV format); ‘fixna’ (bool), whether to replace missing values with the column mean; ‘normalize’ (‘std’ | ‘max’ | None), the normalization option (z-score, min-max, or no normalization); ‘nchunks’ (int), the number of chunks the attribute file is split into. Default is [{‘fname’: ‘BIO.csv’, ‘fixna’: False, ‘normalize’: ‘std’, ‘nchunks’: 1}].

  • colname (str) – Name of the column in the label file to be used as the target variable. Default is “label”.

  • subsample (bool) – Whether to subsample the data. Default is False.

  • seed (int) – Random seed for reproducibility. Default is 1.

  • fold (int) – Number of folds for subsampling. Default is 4.

  • saveflag (bool) – Whether to save the assembled data to files. Default is False.

  • verbose (bool) – Whether to print verbose messages during processing. Default is False.

Returns:

Tuple containing the assembled features (X) and labels (Y) DataFrames.

Return type:

Tuple[pd.DataFrame, pd.DataFrame]

Example:

label_file = "path/to/label_file.csv"
features = [{'fname': 'path/to/feature_file.csv', 'fixna': True, 'normalize': 'std'}]
colname = "target_column"
subsample = False
seed = 1
fold = 4
saveflag = False
verbose = False

X, Y = feature_assemble(label_file, features, colname, subsample, seed, fold, saveflag, verbose)
HELPpy.preprocess.loaders.feature_assemble_df(lab_df: DataFrame, features: List[Dict[str, str | bool]] = [{'fixna': True, 'fname': 'bio+gtex.csv', 'nchunks': 1, 'normalize': 'std'}], colname: str = 'label', subsample: bool = False, seed: int = 1, fold: int = 4, saveflag: bool = False, verbose: bool = False) → Tuple[DataFrame, DataFrame][source]

Assemble features and labels for machine learning tasks.

Parameters:
  • lab_df (pd.DataFrame) – DataFrame of labels (in the column named colname).

  • features (List[Dict[str, Union[str, bool]]]) – List of dictionaries specifying the feature files and their processing options. Each dictionary accepts: ‘fname’ (str), the filename of the attribute file (in CSV format); ‘fixna’ (bool), whether to replace missing values with the column mean; ‘normalize’ (‘std’ | ‘max’ | None), the normalization option (z-score, min-max, or no normalization); ‘nchunks’ (int), the number of chunks the attribute file is split into. Default is [{‘fname’: ‘bio+gtex.csv’, ‘fixna’: True, ‘normalize’: ‘std’, ‘nchunks’: 1}].

  • colname (str) – Name of the column in the label file to be used as the target variable. Default is “label”.

  • subsample (bool) – Whether to subsample the data. Default is False.

  • seed (int) – Random seed for reproducibility. Default is 1.

  • fold (int) – Number of folds for subsampling. Default is 4.

  • saveflag (bool) – Whether to save the assembled data to files. Default is False.

  • verbose (bool) – Whether to print verbose messages during processing. Default is False.

Returns:

Tuple containing the assembled features (X) and labels (Y) DataFrames.

Return type:

Tuple[pd.DataFrame, pd.DataFrame]

Example:

label_file = "path/to/label_file.csv"
features = [{'fname': 'path/to/feature_file.csv', 'fixna': True, 'normalize': 'std'}]
colname = "target_column"
subsample = False
seed = 1
fold = 4
saveflag = False
verbose = False

df_label = pd.read_csv("label_file.csv2, index_col=0)
X, Y = feature_assemble_df(df_label, colname='label', features, colname, subsample, seed, fold, saveflag, verbose)
HELPpy.preprocess.loaders.scale_to_essentials(ge_fit, ess_genes, noness_genes)[source]

Scales gene expression data to essential and non-essential genes.

Parameters:
  • ge_fit (pd.DataFrame) – DataFrame containing the gene expression data.

  • ess_genes (list) – List of essential genes for scaling.

  • noness_genes (list) – List of non-essential genes for scaling.

Returns:

Scaled gene expression data.

Return type:

pd.DataFrame

Example:

scaled_data = scale_to_essentials(gene_expression_data, ess_genes_list, noness_genes_list)
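A minimal end-to-end sketch, assuming ge_fit holds gene-effect scores with genes as rows and cell lines as columns (the gene names and values below are illustrative placeholders):

import numpy as np
import pandas as pd
from HELPpy.preprocess.loaders import scale_to_essentials

# Hypothetical gene-effect matrix: genes as rows, cell lines as columns
ge_fit = pd.DataFrame(np.random.default_rng(0).normal(size=(5, 3)),
                      index=['ESS1', 'ESS2', 'NONESS1', 'NONESS2', 'OTHER'],
                      columns=['M1', 'M2', 'M3'])

# Scale each cell line relative to the essential and non-essential reference sets
scaled_data = scale_to_essentials(ge_fit, ['ESS1', 'ESS2'], ['NONESS1', 'NONESS2'])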

Module contents