HELPpy.preprocess package

Submodules

HELPpy.preprocess.embedding module

HELPpy.preprocess.embedding.PPI_embed(df_net: DataFrame, method: str = 'Node2Vec', dimensions: int = 128, walk_number: int = 10, walk_length: int = 80, workers: int = 1, epochs: int = 1, learning_rate: float = 0.05, seed: int = 42, params: Dict = {'min_count': 1, 'p': 1.0, 'q': 1.0, 'window_size': 5}, source: str = 'A', target: str = 'B', weight: str = 'combined_score', verbose: bool = False)[source]

Embeds a protein-protein interaction (PPI) network using graph embedding techniques.

Parameters:
  • df_net (pd.DataFrame) – The input DataFrame containing the PPI network information.

  • method (str) – The graph embedding method. Options: ‘DeepWalk’, ‘Node2Vec’, ‘AE’. Default is ‘Node2Vec’.

  • dimensions (int) – The dimensionality of the embedding. Default is 128.

  • walk_number (int) – Number of walks per node. Default is 10.

  • walk_length (int) – Length of each walk. Default is 80.

  • workers (int) – Number of parallel workers. Default is 1.

  • epochs (int) – Number of training epochs. Default is 1.

  • learning_rate (float) – Learning rate for the embedding model. Default is 0.05.

  • seed (int) – Random seed for reproducibility. Default is 42.

  • params (Dict) – Additional parameters for the embedding method. Default is {"p": 1.0, "q": 1.0, "window_size": 5, "min_count": 1}.

  • source (str) – Column name for the source nodes in the PPI network DataFrame. Default is ‘A’.

  • target (str) – Column name for the target nodes in the PPI network DataFrame. Default is ‘B’.

  • weight (str) – Column name for the edge weights in the PPI network DataFrame. Default is ‘combined_score’.

  • verbose (bool) – Whether to print progress information. Default is False.

Returns:

DataFrame containing the node embeddings.

Return type:

pd.DataFrame

Example:

df_embedding = PPI_embed(ppi_data, method='Node2Vec', dimensions=128, epochs=5, verbose=True)
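For a self-contained illustration, the sketch below builds a toy network with the default column names (‘A’, ‘B’, ‘combined_score’); the nodes and scores are placeholders, not real interaction data, and the embedding parameters are scaled down for a toy graph:

import pandas as pd
from HELPpy.preprocess.embedding import PPI_embed

# Toy PPI network with the default source/target/weight column names
ppi_data = pd.DataFrame({
    'A': ['P1', 'P1', 'P2', 'P3'],
    'B': ['P2', 'P3', 'P3', 'P4'],
    'combined_score': [0.9, 0.7, 0.8, 0.6],
})

# Small graph, so dimensions and walk settings are reduced accordingly
df_embedding = PPI_embed(ppi_data, method='Node2Vec', dimensions=16,
                         walk_number=5, walk_length=20, seed=42, verbose=True)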

HELPpy.preprocess.imputer module

HELPpy.preprocess.imputer.imputer_knn(df: DataFrame, n_neighbors: int = 5, missing_values=nan, weights: str = 'uniform') → DataFrame[source]

Impute missing values in a DataFrame using K-Nearest Neighbors (KNN) imputation.

Parameters:
  • df (pd.DataFrame) – Input DataFrame with missing values.

  • n_neighbors (int) – Number of neighbors to consider for imputation using KNN. Optional, default is 5.

  • missing_values (any) – The placeholder for missing values in the input DataFrame. Optional, default is np.nan.

  • weights (str) – Weight function used in prediction during KNN imputation. Optional, default is “uniform”.

Returns:

DataFrame with missing values imputed using KNN.

Return type:

pd.DataFrame

Example:

from HELPpy.preprocess.imputer import imputer_knn
import numpy as np
import pandas as pd

# Create a DataFrame with missing values
data = {'A': [1, 2, np.nan, 4], 'B': [5, np.nan, 7, 8]}
df = pd.DataFrame(data)

# Impute missing values using KNN imputation
result = imputer_knn(df, n_neighbors=3, missing_values=np.nan, weights="distance")

HELPpy.preprocess.imputer.imputer_knn_group(df: DataFrame, df_map: DataFrame, line_group: str = 'OncotreeLineage', line_col: str = 'ModelID', n_neighbors: int = 5, missing_values=nan, weights: str = 'uniform', verbose: bool = False) → DataFrame[source]

Impute missing values in a DataFrame grouped by specified lineages using K-Nearest Neighbors (KNN) imputation.

Parameters:
  • df (pd.DataFrame) – Input DataFrame with missing values.

  • df_map (pd.DataFrame) – DataFrame mapping cell lines to lineages.

  • line_group (str) – Column specifying the grouping of cell lines (lineages). Default is ‘OncotreeLineage’.

  • line_col (str) – Column containing cell line identifiers. Default is ‘ModelID’.

  • n_neighbors (int) – Number of neighbors to consider for imputation using KNN. Default is 5.

  • missing_values (any) – The placeholder for missing values in the input DataFrame. Default is np.nan.

  • weights (str) – Weight function used in prediction during KNN imputation. Default is “uniform”.

  • verbose (bool) – If True, print progress information. Default is False.

Returns:

DataFrame with missing values imputed using KNN grouped by lineages.

Return type:

pd.DataFrame

Example:

from HELPpy.preprocess.imputer import imputer_knn_group
import numpy as np
import pandas as pd

# Create a DataFrame with missing values
data = {'A': [1, 2, np.nan, 4], 'B': [5, np.nan, 7, 8]}
df = pd.DataFrame(data)

# Create a mapping DataFrame
data_map = {'ModelID': ['M1', 'M2', 'M3', 'M4'], 'OncotreeLineage': ['L1', 'L2', 'L1', 'L2']}
df_map = pd.DataFrame(data_map)

# Impute missing values using KNN imputation grouped by lineages
result = imputer_knn_group(df, df_map, line_group='OncotreeLineage', line_col='ModelID', n_neighbors=3, missing_values=np.nan, weights="distance", verbose=True)
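The toy frames above illustrate the argument types only. Below is a fuller sketch, under the assumption that the columns of df are cell-line identifiers matching the values in line_col (i.e., a genes-by-cell-lines matrix), so that each lineage group is imputed independently:

import numpy as np
import pandas as pd
from HELPpy.preprocess.imputer import imputer_knn_group

# Hypothetical genes-by-cell-lines matrix; columns match df_map['ModelID']
df = pd.DataFrame({'M1': [1.0, np.nan, 3.0], 'M2': [0.5, 2.0, np.nan],
                   'M3': [np.nan, 1.5, 2.5], 'M4': [1.2, 1.8, 3.1]},
                  index=['GENE1', 'GENE2', 'GENE3'])
df_map = pd.DataFrame({'ModelID': ['M1', 'M2', 'M3', 'M4'],
                       'OncotreeLineage': ['L1', 'L2', 'L1', 'L2']})

# Lineage L1 groups M1/M3 and L2 groups M2/M4; each is imputed separately
result = imputer_knn_group(df, df_map, n_neighbors=2, verbose=True)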

HELPpy.preprocess.imputer.imputer_mean(df: DataFrame, missing_values=nan, strategy: str = 'mean') → DataFrame[source]

Impute missing values in a DataFrame using mean imputation.

Parameters:
  • df (pd.DataFrame) – Input DataFrame with missing values.

  • missing_values (any) – The placeholder for missing values in the input DataFrame. Optional, default is np.nan.

  • strategy (str) – Imputation strategy, e.g., ‘mean’, ‘median’, ‘most_frequent’. Optional, default is “mean”.

Returns:

DataFrame with missing values imputed using mean imputation.

Return type:

pd.DataFrame

Example:

from HELPpy.preprocess.imputer import imputer_mean
import numpy as np
import pandas as pd

# Create a DataFrame with missing values
data = {'A': [1, 2, np.nan, 4], 'B': [5, np.nan, 7, 8]}
df = pd.DataFrame(data)

# Impute missing values using mean imputation
result = imputer_mean(df, missing_values=np.nan, strategy='mean')
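The strategy options mirror those of scikit-learn's SimpleImputer (the likely backend, given the signature); for instance, a median-based variant of the call above:

result_median = imputer_mean(df, missing_values=np.nan, strategy='median')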

HELPpy.preprocess.loaders module

HELPpy.preprocess.loaders.feature_assemble(label_file: str, features: List[Dict[str, str | bool]] = [{'fixna': False, 'fname': 'BIO.csv', 'nchunks': 1, 'normalize': 'std'}], colname: str = 'label', subsample: bool = False, seed: int = 1, fold: int = 4, saveflag: bool = False, verbose: bool = False) → Tuple[DataFrame, DataFrame][source]

Assemble features and labels for machine learning tasks.

Parameters:
  • label_file (str) – Path to the label file.

  • features (List[Dict[str, Union[str, bool]]]) – List of dictionaries specifying the feature files and their processing options. Each dictionary accepts: ‘fname’ (str), the filename of the attribute file (in CSV format); ‘fixna’ (bool), whether to replace missing values with the column mean; ‘normalize’ (‘std’ | ‘max’ | None), the normalization option (z-score, min-max, or no normalization); ‘nchunks’ (int), the number of chunks the attribute file is split into. Default is [{‘fname’: ‘BIO.csv’, ‘fixna’: False, ‘normalize’: ‘std’, ‘nchunks’: 1}].

  • colname (str) – Name of the column in the label file to be used as the target variable. Default is “label”.

  • subsample (bool) – Whether to subsample the data. Default is False.

  • seed (int) – Random seed for reproducibility. Default is 1.

  • fold (int) – Number of folds for subsampling. Default is 4.

  • saveflag (bool) – Whether to save the assembled data to files. Default is False.

  • verbose (bool) – Whether to print verbose messages during processing. Default is False.

Returns:

Tuple containing the assembled features (X) and labels (Y) DataFrames.

Return type:

Tuple[pd.DataFrame, pd.DataFrame]

Example:

label_file = "path/to/label_file.csv"
features = [{'fname': 'path/to/feature_file.csv', 'fixna': True, 'normalize': 'std'}]
colname = "target_column"
subsample = False
seed = 1
fold = 4
saveflag = False
verbose = False

X, Y = feature_assemble(label_file, features, colname, subsample, seed, fold, saveflag, verbose)
HELPpy.preprocess.loaders.feature_assemble_df(lab_df: DataFrame, features: List[Dict[str, str | bool]] = [{'fixna': True, 'fname': 'bio+gtex.csv', 'nchunks': 1, 'normalize': 'std'}], colname: str = 'label', subsample: bool = False, seed: int = 1, fold: int = 4, saveflag: bool = False, verbose: bool = False) → Tuple[DataFrame, DataFrame][source]

Assemble features and labels for machine learning tasks.

Parameters:
  • lab_df (pd.DataFrame) – DataFrame of labels (in the column named colname).

  • features (List[Dict[str, Union[str, bool]]]) – List of dictionaries specifying the feature files and their processing options. Each dictionary accepts: ‘fname’ (str), the filename of the attribute file (in CSV format); ‘fixna’ (bool), whether to replace missing values with the column mean; ‘normalize’ (‘std’ | ‘max’ | None), the normalization option (z-score, min-max, or no normalization); ‘nchunks’ (int), the number of chunks the attribute file is split into. Default is [{‘fname’: ‘bio+gtex.csv’, ‘fixna’: True, ‘normalize’: ‘std’, ‘nchunks’: 1}].

  • colname (str) – Name of the column in the label file to be used as the target variable. Default is “label”.

  • subsample (bool) – Whether to subsample the data. Default is False.

  • seed (int) – Random seed for reproducibility. Default is 1.

  • fold (int) – Number of folds for subsampling. Default is 4.

  • saveflag (bool) – Whether to save the assembled data to files. Default is False.

  • verbose (bool) – Whether to print verbose messages during processing. Default is False.

Returns:

Tuple containing the assembled features (X) and labels (Y) DataFrames.

Return type:

Tuple[pd.DataFrame, pd.DataFrame]

Example:

label_file = "path/to/label_file.csv"
features = [{'fname': 'path/to/feature_file.csv', 'fixna': True, 'normalize': 'std'}]
colname = "target_column"
subsample = False
seed = 1
fold = 4
saveflag = False
verbose = False

df_label = pd.read_csv("label_file.csv2, index_col=0)
X, Y = feature_assemble_df(df_label, colname='label', features, colname, subsample, seed, fold, saveflag, verbose)
HELPpy.preprocess.loaders.scale_to_essentials(ge_fit, ess_genes, noness_genes)[source]

Scales gene expression data to essential and non-essential genes.

Parameters:
  • ge_fit (pd.DataFrame) – DataFrame containing the gene expression data.

  • ess_genes (list) – List of essential genes for scaling.

  • noness_genes (list) – List of non-essential genes for scaling.

Returns:

Scaled gene expression data.

Return type:

pd.DataFrame

Example:

scaled_data = scale_to_essentials(gene_expression_data, ess_genes_list, noness_genes_list)
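A minimal end-to-end sketch, assuming ge_fit holds gene-effect scores with genes as rows and cell lines as columns (the gene names and values below are illustrative placeholders):

import numpy as np
import pandas as pd
from HELPpy.preprocess.loaders import scale_to_essentials

# Hypothetical gene-effect matrix: genes as rows, cell lines as columns
ge_fit = pd.DataFrame(np.random.default_rng(0).normal(size=(5, 3)),
                      index=['ESS1', 'ESS2', 'NONESS1', 'NONESS2', 'OTHER'],
                      columns=['M1', 'M2', 'M3'])

# Scale each cell line relative to the essential and non-essential reference sets
scaled_data = scale_to_essentials(ge_fit, ['ESS1', 'ESS2'], ['NONESS1', 'NONESS2'])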

Module contents