HELPpy.models package

Submodules

HELPpy.models.labelling module

HELPpy.models.labelling.labelling(df: ~pandas.core.frame.DataFrame, columns: ~typing.List[~typing.List[str]] = [], n_classes: int = 2, verbose: bool = False, labelnames: ~typing.Dict[int, str] = {0: 'E', 1: 'NE'}, mode='flat-multi', rowname: str = 'gene', colname: str = 'label', algorithm='otsu', reducefoo: ~typing.Callable[[~typing.List[int]], int] = <built-in function max>) DataFrame[source]

Main function of the HELP labelling algorithm. Genes are labelled from a selection of columns of the CRISPR DataFrame (columns=[line1, …, lineN]). With the default columns=[], the labelling algorithm uses all columns of the CRISPR DataFrame. If the columns argument is a list of lists, columns=[[list1_of_lines], …, [listN_of_lines]], it represents a partition of the CRISPR lines: in this case the labelling is computed within each group of the partition, and the final gene labels are obtained as the mode of the per-group modes.
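
The "mode of modes" step can be illustrated with a minimal, standard-library sketch. It assumes the cell lines have already been quantized into integer labels and skips the quantization itself. The names mode_with_ties, line_labels, lines and partition are hypothetical and not part of HELPpy.

```python
from collections import Counter
from typing import Dict, List

def mode_with_ties(values: List[int], reducefoo=max) -> int:
    """Most frequent value; ties are resolved by reducefoo."""
    counts = Counter(values)
    top = max(counts.values())
    return reducefoo([v for v, c in counts.items() if c == top])

# Hypothetical quantized labels (0 or 1) of three genes over nine cell lines.
lines = ["l1", "l2", "l3", "l4", "l5", "l6", "l7", "l8", "l9"]
line_labels: Dict[str, List[int]] = {
    "geneA": [0, 0, 1, 0, 0, 0, 1, 1, 0],
    "geneB": [1, 1, 0, 1, 0, 1, 1, 1, 1],
    "geneC": [0, 0, 0, 1, 1, 0, 1, 1, 0],
}
# A partition of the lines into three groups (a list of lists of column names).
partition = [["l1", "l2", "l3"], ["l4", "l5", "l6"], ["l7", "l8", "l9"]]
position = {name: i for i, name in enumerate(lines)}

final = {}
for gene, labels in line_labels.items():
    # Mode inside each group, then mode of the per-group modes.
    per_group = [mode_with_ties([labels[position[c]] for c in group])
                 for group in partition]
    final[gene] = mode_with_ties(per_group)

# For comparison: a single mode over all lines, ignoring the partition.
flat = {gene: mode_with_ties(labels) for gene, labels in line_labels.items()}
```

In this toy data geneC gets label 1 from the partitioned vote but label 0 from the flat vote, which shows why the partition changes the result.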

Parameters:
  • df (pd.DataFrame) – Input DataFrame.

  • columns (List[List[str]]) – List of column names in DataFrame used for labelling (default: []).

  • n_classes (int) – Number of classes for ‘flat-multi’ labelling mode; in ‘two-by-two’ mode the parameter is ignored and two binary labellings are computed (default: 2).

  • verbose (bool) – Verbosity level for printing information (default: False).

  • labelnames (Dict[int, str]) – Dictionary mapping class labels to names (default: {0: ‘E’, 1: ‘NE’}).

  • mode (str) – Quantization mode. ‘flat-multi’: the quantization is applied to the whole matrix and the row-wise mode gives the labels. ‘two-by-two’: a binary quantization is applied to the whole matrix (the row-wise mode gives the binary labels), then a second binary quantization is applied only to the rows labelled 1 (the row-wise mode on these rows gives the labels 1 and 2) (default: ‘flat-multi’).

  • rowname (str) – Name of the DataFrame index (default: ‘gene’).

  • colname (str) – Name of the label column (default: ‘label’).

  • algorithm (str) – quantization algorithm type otsu|linspace (default: ‘otsu’).

  • reducefoo (Callable[[List[int]], int]) – Function used to break ties (ex aequo) when computing the mode (default: max).

Returns:

Output DataFrame with labels.

Return type:

pd.DataFrame

Example:

# Example usage
import pandas as pd
from HELPpy.models.labelling import labelling
input_df = pd.DataFrame(...)
output_df = labelling(input_df, columns=[], n_classes=2, labelnames={0: 'E', 1: 'NE'}, algorithm='otsu', mode='flat-multi')
HELPpy.models.labelling.labelling_core(df: ~pandas.core.frame.DataFrame, columns: ~typing.List[str] = [], n_classes: int = 2, verbose: bool = False, labelnames: ~typing.Dict[int, str] = {0: 'E', 1: 'NE'}, mode='flat-multi', algorithm='otsu', rowname: str = 'gene', colname: str = 'label', reducefoo: ~typing.Callable[[~typing.List[int]], int] = <built-in function max>) Tuple[DataFrame, ndarray][source]

Core function for HELP labelling algorithm.

Parameters:
  • df (pd.DataFrame) – Input DataFrame.

  • columns (List[str]) – List of column names used for labelling computation.

  • n_classes (int) – Number of classes for ‘flat-multi’ labelling mode; in ‘two-by-two’ mode the parameter is ignored and two binary labellings are computed (default: 2).

  • verbose (bool) – Verbosity level for printing information (default: False).

  • labelnames (Dict[int, str]) – Dictionary mapping class labels to names (default: {0: ‘E’, 1: ‘NE’}).

  • mode (str) – Quantization mode. ‘flat-multi’: the quantization is applied to the whole matrix and the row-wise mode gives the labels. ‘two-by-two’: a binary quantization is applied to the whole matrix (the row-wise mode gives the binary labels), then a second binary quantization is applied only to the rows labelled 1 (the row-wise mode on these rows gives the labels 1 and 2) (default: ‘flat-multi’).

  • algorithm (str) – quantization algorithm type otsu|linspace (default: ‘otsu’).

  • rowname (str) – Name of the DataFrame index (default: ‘gene’).

  • colname (str) – Name of the label column (default: ‘label’).

  • reducefoo (Callable[[List[int]], int]) – Function used to break ties (ex aequo) when computing the mode (default: max).

Returns:

Output labelled DataFrame and quantized array.

Return type:

Tuple[pd.DataFrame, np.ndarray]

Example:

# Example usage
import pandas as pd
from HELPpy.models.labelling import labelling_core
input_df = pd.DataFrame(...)
columns_list = ['feature1', 'feature2', 'feature3']
output_df, quantized_array = labelling_core(input_df, columns=columns_list, n_classes=2, labelnames={0: 'E', 1: 'NE'}, algorithm='otsu', mode='flat-multi')
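
The ‘two-by-two’ mode described in the parameters can be sketched in plain Python as two chained binary labellings. This is an illustration under stated assumptions, not the HELPpy implementation: the midpoint cut stands in for the real quantization algorithm, the input has no NaNs, and the names binarize, midpoint and two_by_two are hypothetical.

```python
from collections import Counter

def midpoint(col):
    """Stand-in binary cut (assumption): halfway between column min and max."""
    return (min(col) + max(col)) / 2

def binarize(matrix, cut_fn):
    """Quantize every column into 0/1 with its own cut, then take the row mode."""
    cols = list(zip(*matrix))
    cuts = [cut_fn(col) for col in cols]
    labels = []
    for row in matrix:
        bits = [int(v > c) for v, c in zip(row, cuts)]
        counts = Counter(bits)
        top = max(counts.values())
        # Ties between 0 and 1 are resolved with max, mirroring the default reducefoo.
        labels.append(max(b for b, n in counts.items() if n == top))
    return labels

def two_by_two(matrix):
    first = binarize(matrix, midpoint)                      # labels 0 / 1 on all rows
    ones = [i for i, lab in enumerate(first) if lab == 1]
    second = binarize([matrix[i] for i in ones], midpoint)  # 0 / 1 on the 1-rows only
    final = list(first)
    for i, lab in zip(ones, second):
        final[i] = 1 + lab                                  # becomes label 1 or 2
    return final

# Hypothetical scores: six genes (rows) over three cell lines (columns).
scores = [[-3.0, -2.8, -3.1],
          [-2.9, -3.0, -2.7],
          [-1.0, -1.2, -0.9],
          [-1.1, -0.8, -1.0],
          [0.1, 0.0, 0.2],
          [0.0, 0.2, -0.1]]
labels = two_by_two(scores)
```

The first pass separates the two lowest-scoring rows (label 0) from the rest; the second pass splits only the remaining rows into labels 1 and 2.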
HELPpy.models.labelling.modemax_nan(a: ~numpy.ndarray, reducefoo: ~typing.Callable[[~typing.List[int]], int] = <built-in function max>) ndarray[source]

Computes the mode of an array along each row. In case of ex-aequo modes, return the value computed by reducefoo (default: max).

Parameters:
  • a (np.ndarray) – Input 2D array.

  • reducefoo (Callable[[List[int]], int]) – Reduction function (default: max).

Returns:

Mode values.

Return type:

np.ndarray

Example:

# Example usage
import numpy as np
from HELPpy.models.labelling import modemax_nan
input_array = np.array([[np.nan, 2, 3, 3, 5, 5],
                        [2, np.nan, 4, 4, 6, 6],
                        [4, 5, 6, 6, 8, 8]])
mode_values = modemax_nan(input_array, min)
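
To make the tie-breaking rule concrete, here is a minimal standard-library sketch of a row-wise mode that ignores NaNs and hands tied candidates to reducefoo. The helper row_mode_ignoring_nan is hypothetical, and returning NaN for an all-NaN row is an assumption; the behaviour of modemax_nan in that case is not stated in this page.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence

def row_mode_ignoring_nan(rows: Sequence[Sequence[float]],
                          reducefoo: Callable[[List[float]], float] = max) -> List[float]:
    """Most frequent non-NaN value of each row; ties are resolved by reducefoo."""
    modes = []
    for row in rows:
        counts = Counter(v for v in row if not math.isnan(v))
        if not counts:
            modes.append(math.nan)  # assumption: an all-NaN row has no mode
            continue
        top = max(counts.values())
        tied = [v for v, c in counts.items() if c == top]
        modes.append(reducefoo(tied))
    return modes

nan = math.nan
rows = [[nan, 2, 3, 3, 5, 5],
        [2, nan, 4, 4, 6, 6],
        [4, 5, 6, 6, 8, 8]]
modes_max = row_mode_ignoring_nan(rows)       # ties resolved with max -> [5, 6, 8]
modes_min = row_mode_ignoring_nan(rows, min)  # ties resolved with min -> [3, 4, 6]
```

Every row of this input has two equally frequent values, so the choice of reducefoo fully determines the result.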
HELPpy.models.labelling.multi_threshold_with_nan_by_column(matrix, num_thresholds, algorithm='otsu')[source]

The matrix quantization algorithm.

Parameters:
  • matrix (np.ndarray) – Input matrix.

  • num_thresholds (int) – number of quantized levels.

  • algorithm (str) – quantization algorithm type ‘otsu|linspace|yen’ (default: ‘otsu’).

Returns:

The quantized array and the thresholds used.

Return type:

Tuple[np.ndarray, c]

Example:

# Example usage
import numpy as np
from HELPpy.models.labelling import multi_threshold_with_nan_by_column
data_matrix = np.array([[1.2, 2.1, 3.6, np.nan],
                        [4.5, 5.7, np.nan, 6.3],
                        [4.1, 8.4, 3.0, 10.2],
                        [7.7, np.nan, 9, 10],
                        [2.9, 12.5, 8.2, 1.0],
                        [1.1, 2.2, 9.0, np.nan]])

num_thresholds = 3  # Adjust the number of thresholds as needed
segmented_matrix, thres = multi_threshold_with_nan_by_column(data_matrix, num_thresholds, algorithm='otsu')
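
As a rough illustration of NaN-aware quantization on one column, the sketch below implements a two-class Otsu cut by exhaustive search and keeps NaNs in place. It is a simplified stand-in, not the HELPpy code: it handles a single threshold only, and the names otsu_threshold and quantize_column are hypothetical. For a matrix, the same routine would be applied to each column in turn.

```python
import math
from typing import List

def otsu_threshold(values: List[float]) -> float:
    """Two-class Otsu: choose the cut that maximises the between-class variance."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    best_cut, best_score = xs[0], -1.0
    left_sum = 0.0
    for i in range(1, n):                 # candidate split: xs[:i] versus xs[i:]
        left_sum += xs[i - 1]
        if xs[i] == xs[i - 1]:
            continue                      # no cut between equal values
        w0, w1 = i / n, (n - i) / n
        mu0, mu1 = left_sum / i, (total - left_sum) / (n - i)
        score = w0 * w1 * (mu0 - mu1) ** 2
        if score > best_score:
            best_score, best_cut = score, (xs[i - 1] + xs[i]) / 2
    return best_cut

def quantize_column(column: List[float]) -> List[float]:
    """Label each value 0.0 or 1.0 against the column's Otsu cut; NaNs stay NaN."""
    valid = [v for v in column if not math.isnan(v)]
    if not valid:
        return [math.nan for _ in column]
    cut = otsu_threshold(valid)
    return [math.nan if math.isnan(v) else float(v > cut) for v in column]

column = [1.0, 1.2, 0.9, math.nan, 8.0, 8.5]
cut = otsu_threshold([v for v in column if not math.isnan(v)])
quantized = quantize_column(column)
```

The threshold is computed from the non-NaN entries only, so missing values neither shift the cut nor receive a label.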
HELPpy.models.labelling.rows_with_all_nan(df)[source]

Returns the indices of the DataFrame rows whose values are all NaN.

Parameters:

df (pd.DataFrame) – The input DataFrame.

Returns:

Array of the indices of the DataFrame rows that contain only NaNs.

Return type:

np.ndarray

HELPpy.models.prediction module

class HELPpy.models.prediction.VotingSplitClassifier(n_voters=10, voting='soft', n_jobs=-1, verbose=False, random_state=42, **kwargs)[source]

Bases: BaseEstimator, ClassifierMixin

fit(X, y)[source]
predict(X, y=None)[source]
predict_proba(X, y=None)[source]
set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') VotingSplitClassifier

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

Returns:

self – The updated object.

Return type:

object

HELPpy.models.prediction.k_fold_cv(X, Y, estimator, n_splits=10, saveflag: bool = False, outfile: str = 'predictions.csv', verbose: bool = False, display: bool = False, seed: int = 42)[source]

Perform cross-validated predictions using a classifier.

Parameters:
  • X (DataFrame) – Features DataFrame.

  • Y (DataFrame) – Target variable DataFrame.

  • n_splits (int) – Number of folds for cross-validation.

  • estimator (object) – Classifier (must have fit, predict and predict_proba methods).

  • saveflag (bool) – Whether to save the predictions to a CSV file.

  • outfile (str or None) – File name for saving predictions.

  • verbose (bool) – Whether to print verbose information.

  • display (bool) – Whether to display a confusion matrix plot.

  • seed (int or None) – Random seed for reproducibility.

Returns:

Summary statistics of the cross-validated predictions, single measures and label predictions

Return type:

Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]

Example:

# Example usage
import pandas as pd
from lightgbm import LGBMClassifier
from HELPpy.models.prediction import k_fold_cv
X_data = pd.DataFrame(...)
Y_data = pd.DataFrame(...)
clf = LGBMClassifier(random_state=0)
df_scores, scores, predictions = k_fold_cv(X_data, Y_data, clf, n_splits=5, verbose=True, display=True, seed=42)
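
The general pattern behind cross-validated predictions can be sketched without any third-party library: every sample is predicted exactly once, by a model that did not see it during training. The toy MajorityClassifier and the helper k_fold_predictions are hypothetical; the plain (non-stratified) shuffled split is an assumption made for brevity and may differ from what k_fold_cv does internally.

```python
import random

class MajorityClassifier:
    """Toy estimator exposing the required fit / predict / predict_proba trio."""
    def fit(self, X, y):
        self.classes_ = sorted(set(y))
        self.freq_ = [sum(1 for v in y if v == c) / len(y) for c in self.classes_]
        return self

    def predict_proba(self, X):
        return [list(self.freq_) for _ in X]

    def predict(self, X):
        best = self.classes_[self.freq_.index(max(self.freq_))]
        return [best for _ in X]

def k_fold_predictions(X, y, estimator, n_splits=5, seed=42):
    """Return one out-of-fold prediction per sample."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_splits] for i in range(n_splits)]   # disjoint test folds
    preds = [None] * len(X)
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        estimator.fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i, p in zip(test_idx, estimator.predict([X[i] for i in test_idx])):
            preds[i] = p
    return preds

X = [[float(i)] for i in range(20)]
y = [0] * 14 + [1] * 6
preds = k_fold_predictions(X, y, MajorityClassifier(), n_splits=5)
accuracy = sum(p == t for p, t in zip(preds, y)) / len(y)
```

Any object with the same three methods, such as the LGBMClassifier in the example above, could take the place of the toy estimator.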
HELPpy.models.prediction.predict_cv(X, Y, n_splits=10, method='LGBM', balanced=False, saveflag: bool = False, outfile: str = 'predictions.csv', verbose: bool = False, display: bool = False, seed: int = 42)[source]

Perform cross-validated predictions using a LightGBM classifier.

Parameters:
  • X (DataFrame) – Features DataFrame.

  • Y (DataFrame) – Target variable DataFrame.

  • n_splits (int) – Number of folds for cross-validation.

  • method (str) – Classifier method (default LGBM)

  • balanced (bool) – Whether to use class weights to balance the classes.

  • saveflag (bool) – Whether to save the predictions to a CSV file.

  • outfile (str or None) – File name for saving predictions.

  • verbose (bool) – Whether to print verbose information.

  • display (bool) – Whether to display a confusion matrix plot.

  • seed (int or None) – Random seed for reproducibility.

Returns:

Summary statistics of the cross-validated predictions, single measures and label predictions

Return type:

Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]

Example:

# Example usage
import pandas as pd
from HELPpy.models.prediction import predict_cv
X_data = pd.DataFrame(...)
Y_data = pd.DataFrame(...)
result, _, _ = predict_cv(X_data, Y_data, n_splits=5, balanced=True, saveflag=False, outfile=None, verbose=True, display=True, seed=42)
HELPpy.models.prediction.predict_cv_sv(X, Y, n_voters=1, n_splits=5, colname='label', balanced=False, seed=42, verbose=False)[source]

Performs stratified cross-validation with a LightGBM classifier, using a voting mechanism to handle class imbalance in binary classification. The function takes the feature (X) and label (Y) DataFrames, evaluates the cross-validated predictions, and returns the evaluation scores together with the predicted labels and probabilities.

Parameters:
  • X (DataFrame) – Features DataFrame.

  • Y (DataFrame) – Target variable DataFrame.

  • n_voters (int) – Number of voters to split the majority class.

  • n_splits (int) – Number of folds for cross-validation.

  • colname (str) – Name of the column containing the labels.

  • balanced (bool) – Whether to use class weights to balance the classes.

  • seed (int or None) – Random seed for reproducibility.

  • verbose (bool) – Whether to print verbose information.

Returns:

A DataFrame of evaluation scores and a DataFrame of predicted labels and probabilities.

Return type:

Tuple[pd.DataFrame, pd.DataFrame]

Example:

# Example usage
import pandas as pd
from HELPpy.models.prediction import predict_cv_sv
X_data = pd.DataFrame(...)
Y_data = pd.DataFrame(...)
result, prediction = predict_cv_sv(X_data, Y_data, n_voters=10, n_splits=5, balanced=True, verbose=True, seed=42)
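
One plausible reading of "voters that split the majority class" is sketched below: the majority-class samples are divided into n_voters disjoint chunks, each voter trains on one chunk plus all minority samples, and the voters' probabilities are averaged (soft voting). This is an assumption about the mechanism, not a transcription of the HELPpy code, and the helpers split_majority and soft_vote are hypothetical.

```python
import random
from typing import List

def split_majority(y: List[int], n_voters: int, seed: int = 42) -> List[List[int]]:
    """One training index list per voter: a majority chunk plus every minority row."""
    counts = {c: y.count(c) for c in set(y)}
    majority = max(counts, key=counts.get)
    maj_idx = [i for i, v in enumerate(y) if v == majority]
    min_idx = [i for i, v in enumerate(y) if v != majority]
    random.Random(seed).shuffle(maj_idx)
    return [sorted(maj_idx[k::n_voters] + min_idx) for k in range(n_voters)]

def soft_vote(probas: List[List[float]]) -> List[float]:
    """Average the voters' positive-class probabilities, sample by sample."""
    return [sum(col) / len(col) for col in zip(*probas)]

# Toy labels: 12 majority samples (label 1) and 3 minority samples (label 0).
y = [1] * 12 + [0] * 3
subsets = split_majority(y, n_voters=4)

# Hypothetical probabilities from 4 voters for 2 test samples.
avg = soft_vote([[0.9, 0.2], [0.7, 0.4], [0.8, 0.0], [0.6, 0.2]])
```

With 12 majority and 3 minority samples and four voters, each voter sees a balanced 3-versus-3 training set, which is the point of the split.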
HELPpy.models.prediction.set_seed(seed=1)[source]

Set random and numpy random seed for reproducibility

Parameters:

seed (int) – initialization seed

Returns:

None.
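
A minimal sketch of what seeding both generators typically looks like is given below. The function set_seed_sketch is hypothetical and only mirrors the description above; NumPy is seeded only if it is installed.

```python
import random

def set_seed_sketch(seed: int = 1) -> None:
    """Seed Python's random module and, when available, NumPy's global generator."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is optional in this sketch

set_seed_sketch(1)
first = random.random()
set_seed_sketch(1)
second = random.random()  # same value as `first`, because the seed was reset
```

Resetting the seed before each run is what makes shuffles and splits repeatable across executions.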

Module contents