Statistics
The statistics submodule contains methods for computing statistical measures on data.
Extract Statistics DataFrame per Label
This method calculates comprehensive statistical metrics for numerical features grouped by label values. Use this when you want to:
- Analyze how a numerical feature’s distribution varies across different categories
- Detect potential patterns or anomalies in feature behavior per group
- Generate detailed statistical summaries for reporting or analysis
- Understand the relationship between features and target variables
- ds_utils.preprocess.statistics.extract_statistics_dataframe_per_label(df: DataFrame, feature_name: str, label_name: str) → DataFrame
Calculate comprehensive statistical metrics for a specified feature grouped by label.
This method computes various statistical measures for a given numerical feature, broken down by unique values in the specified label column. The statistics include count, null count, mean, standard deviation, min/max values and multiple percentiles.
- Parameters:
df – Input pandas DataFrame containing the data
feature_name – Name of the column to calculate statistics on
label_name – Name of the column to group by
- Returns:
DataFrame with statistical metrics for each unique label value, with columns:
  - count: Number of non-null observations
  - null_count: Number of null values
  - mean: Average value
  - min: Minimum value
  - 1_percentile: 1st percentile
  - 5_percentile: 5th percentile
  - 25_percentile: 25th percentile
  - median: 50th percentile
  - 75_percentile: 75th percentile
  - 95_percentile: 95th percentile
  - 99_percentile: 99th percentile
  - max: Maximum value
- Raises:
Code Example
Here’s how to use the method to analyze numerical features across different categories:
```python
import pandas as pd
from ds_utils.preprocess.statistics import extract_statistics_dataframe_per_label

# Load your dataset
df = pd.DataFrame({
    'amount': [100, 200, 150, 300, 250, 175],
    'category': ['A', 'A', 'B', 'B', 'C', 'C']
})

# Calculate statistics for amount grouped by category
stats = extract_statistics_dataframe_per_label(
    df=df,
    feature_name='amount',
    label_name='category'
)
print(stats)
```
The output will be a DataFrame containing the following statistics for each category:
| category | count | null_count | mean | min | 1_percentile | 5_percentile | 25_percentile | median | 75_percentile | 95_percentile | 99_percentile | max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 2 | 0 | 150.0 | 100 | 100.0 | 100.0 | 100.0 | 150.0 | 200.0 | 200.0 | 200.0 | 200.0 |
| B | 2 | 0 | 225.0 | 150 | 150.0 | 150.0 | 150.0 | 225.0 | 300.0 | 300.0 | 300.0 | 300.0 |
| C | 2 | 0 | 212.5 | 175 | 175.0 | 175.0 | 175.0 | 212.5 | 250.0 | 250.0 | 250.0 | 250.0 |
This comprehensive set of statistics helps in understanding the distribution of numerical features across different categories, which can be valuable for:
- Identifying outliers within specific groups
- Understanding data skewness per category
- Detecting potential data quality issues
- Making informed decisions about feature engineering strategies
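To make the listed columns more concrete, the sketch below approximates a few of them with plain pandas. It is an illustration only and not the ds_utils implementation: the variable name `approx` is hypothetical, and percentiles here use pandas' default linear interpolation, so they may differ from the values the library reports.

```python
import pandas as pd

df = pd.DataFrame({
    "amount": [100, 200, 150, 300, 250, 175],
    "category": ["A", "A", "B", "B", "C", "C"],
})

grouped = df.groupby("category")["amount"]

# Basic aggregates per label value.
approx = grouped.agg(["count", "mean", "min", "median", "max"])

# Number of null values per label value.
approx["null_count"] = df["amount"].isna().groupby(df["category"]).sum()

# Percentiles with pandas' default (linear) interpolation; this is an
# assumption and may not match the library's interpolation method.
for p in (1, 5, 25, 75, 95, 99):
    approx[f"{p}_percentile"] = grouped.quantile(p / 100)

print(approx)
```

A quick groupby like this can serve as a sanity check on counts, means and extremes before relying on the full summary.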
Compute Mutual Information
This method computes the mutual information between each feature and a specified target variable. Mutual information measures the dependency between two variables, with a higher value indicating a stronger relationship.
Use this when you want to:
- Identify the most informative features for a classification task.
- Perform feature selection based on the relevance of each feature to the target.
- Gain insight into the underlying relationships within your data, which can guide feature engineering.
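As a small illustration of what "a higher value indicates a stronger relationship" means, the sketch below uses scikit-learn's `mutual_info_score` on two toy discrete variables. The variable names are hypothetical, and this only demonstrates the measure itself, not how ds_utils computes it.

```python
import math

from sklearn.metrics import mutual_info_score

target = [0, 0, 1, 1]
dependent = [1, 1, 0, 0]    # fully determines the target
independent = [0, 1, 0, 1]  # carries no information about the target

mi_dependent = mutual_info_score(target, dependent)
mi_independent = mutual_info_score(target, independent)

# For a balanced binary target, a fully informative variable reaches
# the target's entropy, ln(2) nats; an independent one scores 0.
print(mi_dependent, math.log(2))
print(mi_independent)
```

Scores are expressed in nats, so they are not bounded by 1; they are most useful for ranking features against each other.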
- ds_utils.preprocess.statistics.compute_mutual_information(df: DataFrame, features: List[str], label_col: str, *, n_neighbors: int = 3, random_state: int | RandomState | None = None, n_jobs: int | None = None, numerical_imputer: TransformerMixin = SimpleImputer(), discrete_imputer: TransformerMixin = SimpleImputer(strategy='most_frequent'), discrete_encoder: TransformerMixin = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)) → DataFrame
Compute mutual information scores between features and a target label.
This function calculates mutual information scores for specified features with respect to a target label column. Features are automatically categorized as numerical or discrete (boolean/categorical) and preprocessed accordingly before computing mutual information.
Any feature column that contains only null (NaN) values will be ignored and assigned a mutual information score of 0. A UserWarning will be issued listing any such columns.
Mutual information measures the mutual dependence between two variables; higher scores indicate a stronger relationship between the feature and the target label.
- Parameters:
df – Input pandas DataFrame containing the features and label
features – List of column names to compute mutual information for
label_col – Name of the target label column
n_neighbors – Number of neighbors to use for MI estimation for continuous variables. Higher values reduce variance of the estimation, but could introduce a bias.
random_state – Random state for reproducible results. Can be int or RandomState instance
n_jobs – The number of jobs to use for computing the mutual information. The parallelization is done on the columns. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
numerical_imputer – Sklearn-compatible transformer for numerical features (default: mean imputation)
discrete_imputer – Sklearn-compatible transformer for discrete features (default: most frequent imputation)
discrete_encoder – Sklearn-compatible transformer for encoding discrete features (default: ordinal encoding with unknown value handling)
- Returns:
DataFrame with columns ‘feature_name’ and ‘mi_score’, sorted by MI score (descending)
- Raises:
KeyError – If any feature or label_col is not found in DataFrame
ValueError – If features list is empty or label_col contains non-finite values
- Warns UserWarning:
If one or more feature columns contain only null values.
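The preprocessing described above (impute, encode discrete features, then estimate mutual information) can be sketched directly with scikit-learn. This is a minimal approximation under the assumption that the estimator behaves like `mutual_info_classif`; it is not the library's actual code, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "value": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "text_col": ["x", "y", "z", "x", "y", "z", "x", "y", "z", "x"],
    "category": ["A", "A", "A", "B", "B", "B", "C", "C", "C", "C"],
})

# Numerical feature: mean imputation (the documented default imputer).
numerical = SimpleImputer().fit_transform(df[["value"]])

# Discrete feature: most-frequent imputation, then ordinal encoding.
discrete = SimpleImputer(strategy="most_frequent").fit_transform(df[["text_col"]])
discrete = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
).fit_transform(discrete)

X = np.hstack([numerical, discrete])
labels = pd.factorize(df["category"])[0]  # integer-encode the target

# Assumption: the scores come from a k-nearest-neighbors MI estimator.
scores = mutual_info_classif(
    X, labels, discrete_features=[False, True], n_neighbors=3, random_state=0
)

result = pd.DataFrame(
    {"feature_name": ["value", "text_col"], "mi_score": scores}
).sort_values("mi_score", ascending=False, ignore_index=True)
print(result)
```

Fixing `random_state` matters here because the estimator adds a small amount of random noise to continuous features, so scores can vary slightly between runs otherwise.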
Code Example
This example uses a sample DataFrame to demonstrate how to compute mutual information scores:
```python
import pandas as pd
from ds_utils.preprocess.statistics import compute_mutual_information

sample_df = pd.DataFrame({
    "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "category": ["A", "A", "A", "B", "B", "B", "C", "C", "C", "C"],
    "text_col": ["x", "y", "z", "x", "y", "z", "x", "y", "z", "x"],
})

# Score every column except the target
features = ["value", "text_col"]
target = "category"
mutual_information_df = compute_mutual_information(sample_df, features, target)
print(mutual_information_df)
```
The following table will be the output:
| feature | mutual_information |
|---|---|
| value | 0.046 |
| text_col | 0.941 |