Statistics

The statistics submodule contains methods for computing statistical measures on data.

Get Correlated Features

This function identifies highly correlated features in your dataset. Use this when you want to:

  • Detect multi-collinearity in your feature set

  • Simplify your model by removing redundant features

  • Understand the relationship between features and the target variable

Insights from this analysis can help in feature selection, reducing overfitting, and improving model interpretability.

ds_utils.preprocess.statistics.get_correlated_features(correlation_matrix: DataFrame, features: List[str], target_feature: str, threshold: float = 0.95) → DataFrame

Find pairs of features whose correlation with each other exceeds a threshold, and return them in a DataFrame together with each feature's correlation to the target feature.

Parameters:
  • correlation_matrix – The correlation matrix.

  • features – List of feature names to analyze.

  • target_feature – Name of the target feature.

  • threshold – Correlation threshold (default 0.95).

Returns:

DataFrame with correlations and correlation to the target feature.

Code Example

This example uses a small sample from a dataset available on Kaggle, which contains loan data from a dummy bank.

Here’s how to use the code:

import pandas as pd
from ds_utils.preprocess.statistics import get_correlated_features

loan_frame = pd.get_dummies(pd.read_csv('path/to/dataset', encoding="latin1", nrows=30))
target = "loan_condition_cat"
features = loan_frame.columns.drop(["loan_condition_cat", "issue_d", "application_type"]).tolist()
correlations = get_correlated_features(loan_frame.corr(), features, target)
print(correlations)

The following table will be the output:

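| level_0                | level_1                | level_0_level_1_corr | level_0_target_corr | level_1_target_corr |
|------------------------|------------------------|----------------------|---------------------|---------------------|
| income_category_Low    | income_category_Medium | 1.0                  | 0.1182165609358650  | 0.11821656093586504 |
| term_ 36 months        | term_ 60 months        | 1.0                  | 0.1182165609358650  | 0.11821656093586504 |
| interest_payments_High | interest_payments_Low  | 1.0                  | 0.1182165609358650  | 0.11821656093586504 |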
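
The pair-finding idea behind this output can be sketched with pandas alone. The following is a minimal illustration, not the library's implementation: the helper name correlated_pairs_sketch, its output column names, the use of absolute correlation, and the toy data are all assumptions made for this example.

```python
from itertools import combinations

import pandas as pd


def correlated_pairs_sketch(corr, features, target, threshold=0.95):
    """Hypothetical helper (not part of ds_utils): list feature pairs
    whose correlation exceeds the threshold."""
    columns = ["feature_a", "feature_b", "pair_corr",
               "a_target_corr", "b_target_corr"]
    rows = []
    # combinations() visits each unordered pair exactly once
    # and never pairs a feature with itself.
    for a, b in combinations(features, 2):
        pair_corr = corr.loc[a, b]
        # Assumption: compare the absolute correlation to the threshold.
        if abs(pair_corr) > threshold:
            rows.append({
                "feature_a": a,
                "feature_b": b,
                "pair_corr": pair_corr,
                "a_target_corr": corr.loc[a, target],
                "b_target_corr": corr.loc[b, target],
            })
    return pd.DataFrame(rows, columns=columns)


# Toy data: y is a linear function of x, z is unrelated noise.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "y": [3, 5, 7, 9, 11, 13],  # y = 2x + 1
    "z": [3, 1, 4, 1, 5, 9],
    "target": [0, 0, 1, 0, 1, 1],
})
result = correlated_pairs_sketch(toy.corr(), ["x", "y", "z"], "target")
print(result)
```

Only the (x, y) pair passes the threshold here. Because y is a linear function of x, both features have the same correlation with the target, which is the kind of redundancy this analysis is meant to surface.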

Extract Statistics DataFrame per Label

This method calculates comprehensive statistical metrics for numerical features grouped by label values. Use this when you want to:

  • Analyze how a numerical feature’s distribution varies across different categories

  • Detect potential patterns or anomalies in feature behavior per group

  • Generate detailed statistical summaries for reporting or analysis

  • Understand the relationship between features and target variables

ds_utils.preprocess.statistics.extract_statistics_dataframe_per_label(df: DataFrame, feature_name: str, label_name: str) → DataFrame

Calculate comprehensive statistical metrics for a specified feature grouped by label.

This method computes various statistical measures for a given numerical feature, broken down by unique values in the specified label column. The statistics include count, null count, mean, standard deviation, min/max values and multiple percentiles.

Parameters:
  • df – Input pandas DataFrame containing the data

  • feature_name – Name of the column to calculate statistics on

  • label_name – Name of the column to group by

Returns:

DataFrame with statistical metrics for each unique label value, with columns:

  • count – Number of non-null observations

  • null_count – Number of null values

  • mean – Average value

  • min – Minimum value

  • 1_percentile – 1st percentile

  • 5_percentile – 5th percentile

  • 25_percentile – 25th percentile

  • median – 50th percentile

  • 75_percentile – 75th percentile

  • 95_percentile – 95th percentile

  • 99_percentile – 99th percentile

  • max – Maximum value

Raises:
  • KeyError – If feature_name or label_name is not found in DataFrame

  • TypeError – If feature_name column is not numeric

Code Example

Here’s how to use the method to analyze numerical features across different categories:

import pandas as pd
from ds_utils.preprocess.statistics import extract_statistics_dataframe_per_label

# Load your dataset
df = pd.DataFrame({
    'amount': [100, 200, 150, 300, 250, 175],
    'category': ['A', 'A', 'B', 'B', 'C', 'C']
})

# Calculate statistics for amount grouped by category
stats = extract_statistics_dataframe_per_label(
    df=df,
    feature_name='amount',
    label_name='category'
)
print(stats)

The output will be a DataFrame containing the following statistics for each category:

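| category | count | null_count | mean  | min | 1_percentile | 5_percentile | 25_percentile | median | 75_percentile | 95_percentile | 99_percentile | max   |
|----------|-------|------------|-------|-----|--------------|--------------|---------------|--------|---------------|---------------|---------------|-------|
| A        | 2     | 0          | 150.0 | 100 | 100.0        | 100.0        | 100.0         | 150.0  | 200.0         | 200.0         | 200.0         | 200.0 |
| B        | 2     | 0          | 225.0 | 150 | 150.0        | 150.0        | 150.0         | 225.0  | 300.0         | 300.0         | 300.0         | 300.0 |
| C        | 2     | 0          | 212.5 | 175 | 175.0        | 175.0        | 175.0         | 212.5  | 250.0         | 250.0         | 250.0         | 250.0 |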

This comprehensive set of statistics helps in understanding the distribution of numerical features across different categories, which can be valuable for:

  • Identifying outliers within specific groups

  • Understanding data skewness per category

  • Detecting potential data quality issues

  • Making informed decisions about feature engineering strategies
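
To make the grouping logic concrete, a comparable per-label summary can be assembled with a plain pandas groupby. This is a rough sketch under stated assumptions, not the library's code: per_label_stats_sketch is a hypothetical helper, and pandas' default linear interpolation for quantiles may produce percentile values that differ from the library's output.

```python
import pandas as pd


def per_label_stats_sketch(df, feature_name, label_name):
    """Hypothetical helper (not part of ds_utils): summary statistics
    of one numerical column for each label value."""
    grouped = df.groupby(label_name)[feature_name]
    stats = pd.DataFrame({
        "count": grouped.count(),  # non-null observations only
        "null_count": df[feature_name].isna().groupby(df[label_name]).sum(),
        "mean": grouped.mean(),
        "min": grouped.min(),
    })
    for pct in (1, 5, 25, 50, 75, 95, 99):
        name = "median" if pct == 50 else f"{pct}_percentile"
        # Assumption: pandas' default (linear) quantile interpolation.
        stats[name] = grouped.quantile(pct / 100)
    stats["max"] = grouped.max()
    return stats


# Toy data with one missing value in category B.
toy = pd.DataFrame({
    "amount": [100.0, 200.0, 150.0, None, 250.0, 175.0],
    "category": ["A", "A", "B", "B", "C", "C"],
})
sketch_stats = per_label_stats_sketch(toy, "amount", "category")
print(sketch_stats)
```

The missing value in category B shows up as a count of 1 and a null_count of 1, which is how a summary like this can flag data quality issues in a specific group.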

Compute Mutual Information

This method computes the mutual information between each feature and a specified target variable. Mutual information measures the dependency between two variables, with a higher value indicating a stronger relationship.

Use this when you want to:

  • Identify the most informative features for a classification task.

  • Perform feature selection based on the relevance of each feature to the target.

  • Gain insight into the underlying relationships within your data, which can guide feature engineering.

ds_utils.preprocess.statistics.compute_mutual_information(df: DataFrame, features: List[str], label_col: str, *, n_neighbors: int = 3, random_state: int | RandomState | None = None, n_jobs: int | None = None, numerical_imputer: TransformerMixin = SimpleImputer(), discrete_imputer: TransformerMixin = SimpleImputer(strategy='most_frequent'), discrete_encoder: TransformerMixin = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)) → DataFrame

Compute mutual information scores between features and a target label.

This function calculates mutual information scores for specified features with respect to a target label column. Features are automatically categorized as numerical or discrete (boolean/categorical) and preprocessed accordingly before computing mutual information.

Any feature column that contains only null (NaN) values will be ignored and assigned a mutual information score of 0. A UserWarning will be issued listing any such columns.

Mutual information measures the mutual dependence between two variables; a higher score indicates a stronger relationship between the feature and the target label.

Parameters:
  • df – Input pandas DataFrame containing the features and label

  • features – List of column names to compute mutual information for

  • label_col – Name of the target label column

  • n_neighbors – Number of neighbors to use for MI estimation for continuous variables. Higher values reduce variance of the estimation, but could introduce a bias.

  • random_state – Random state for reproducible results. Can be int or RandomState instance

  • n_jobs – The number of jobs to use for computing the mutual information. The parallelization is done on the columns. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.

  • numerical_imputer – Sklearn-compatible transformer for numerical features (default: mean imputation)

  • discrete_imputer – Sklearn-compatible transformer for discrete features (default: most frequent imputation)

  • discrete_encoder – Sklearn-compatible transformer for encoding discrete features (default: ordinal encoding with unknown value handling)

Returns:

DataFrame with columns ‘feature_name’ and ‘mi_score’, sorted by MI score (descending)

Raises:
  • KeyError – If any feature or label_col is not found in DataFrame

  • ValueError – If features list is empty or label_col contains non-finite values

Warns:
  • UserWarning – If one or more feature columns contain only null values.

Code Example

This example uses a sample DataFrame to demonstrate the calculation of mutual information.

Here’s how to use the code:

import pandas as pd
from ds_utils.preprocess.statistics import compute_mutual_information

sample_df = pd.DataFrame({
    "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "category": ["A", "A", "A", "B", "B", "B", "C", "C", "C", "C"],
    "text_col": ["x", "y", "z", "x", "y", "z", "x", "y", "z", "x"],
})
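target = "category"
# The function expects the feature names as its second argument
features = ["value", "text_col"]
mutual_information_df = compute_mutual_information(sample_df, features, target)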
print(mutual_information_df)

The following table will be the output:

| feature  | mutual_information |
|----------|--------------------|
| value    | 0.046              |
| text_col | 0.941              |
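
For intuition about what the score measures, the mutual information of two discrete variables can be computed directly from their joint distribution as I(X; Y) = Σ p(x, y) · log(p(x, y) / (p(x) · p(y))). The sketch below does this with pandas and NumPy. It is an illustration only: discrete_mutual_information is a hypothetical helper, and it is not how the library estimates scores (the n_neighbors parameter indicates a neighbor-based estimate for continuous features).

```python
import numpy as np
import pandas as pd


def discrete_mutual_information(x, y):
    """Hypothetical helper (not part of ds_utils): mutual information,
    in nats, between two discrete pandas Series."""
    # Joint probability table p(x, y) estimated from co-occurrence counts.
    joint = pd.crosstab(x, y, normalize=True).to_numpy()
    px = joint.sum(axis=1, keepdims=True)  # marginal p(x), column vector
    py = joint.sum(axis=0, keepdims=True)  # marginal p(y), row vector
    independent = px @ py                  # p(x) * p(y) for every cell
    nonzero = joint > 0                    # skip empty cells (0 * log 0 = 0)
    return float(np.sum(
        joint[nonzero] * np.log(joint[nonzero] / independent[nonzero])
    ))


labels = pd.Series(["A", "A", "B", "B"])
copy_of_labels = pd.Series(["A", "A", "B", "B"])  # fully determines labels
unrelated = pd.Series(["u", "v", "u", "v"])       # independent of labels

mi_identical = discrete_mutual_information(copy_of_labels, labels)
mi_independent = discrete_mutual_information(unrelated, labels)
print(mi_identical, mi_independent)
```

A feature that fully determines a balanced two-class label scores log(2) nats, while a feature that is independent of the label scores 0. Scores from the library fall on the same kind of scale, but estimated values depend on the estimator and on random_state.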