##########
Statistics
##########

The statistics submodule contains methods for computing statistical measures on data.

***********************
Get Correlated Features
***********************

This function identifies highly correlated features in your dataset. Use this when you want to:

- Detect multi-collinearity in your feature set
- Simplify your model by removing redundant features
- Understand the relationship between features and the target variable

Insights from this analysis can help in feature selection, reducing overfitting, and improving model interpretability.

.. autofunction:: ds_utils.preprocess.statistics.get_correlated_features

Code Example
============

This example uses a small sample from a dataset available on Kaggle, which contains loan data from a dummy bank.

Here's how to use the code::

    import pandas as pd

    from ds_utils.preprocess.statistics import get_correlated_features

    loan_frame = pd.get_dummies(pd.read_csv('path/to/dataset', encoding="latin1", nrows=30))
    target = "loan_condition_cat"
    features = loan_frame.columns.drop(["loan_condition_cat", "issue_d", "application_type"]).tolist()
    correlations = get_correlated_features(loan_frame.corr(), features, target)
    print(correlations)

The following table will be the output:

+----------------------+----------------------+--------------------+-------------------+-------------------+
|level_0               |level_1               |level_0_level_1_corr|level_0_target_corr|level_1_target_corr|
+======================+======================+====================+===================+===================+
|income_category_Low   |income_category_Medium|1.0                 |0.1182165609358650 |0.11821656093586504|
+----------------------+----------------------+--------------------+-------------------+-------------------+
|term\_ 36 months      |term\_ 60 months      |1.0                 |0.1182165609358650 |0.11821656093586504|
+----------------------+----------------------+--------------------+-------------------+-------------------+
|interest_payments_High|interest_payments_Low |1.0                 |0.1182165609358650 |0.11821656093586504|
+----------------------+----------------------+--------------------+-------------------+-------------------+

**************************************
Extract Statistics DataFrame per Label
**************************************

This method calculates comprehensive statistical metrics for numerical features grouped by label values. Use this when you want to:

- Analyze how a numerical feature's distribution varies across different categories
- Detect potential patterns or anomalies in feature behavior per group
- Generate detailed statistical summaries for reporting or analysis
- Understand the relationship between features and target variables

.. autofunction:: ds_utils.preprocess.statistics.extract_statistics_dataframe_per_label

Code Example
============

Here's how to use the method to analyze numerical features across different categories::

    import pandas as pd

    from ds_utils.preprocess.statistics import extract_statistics_dataframe_per_label

    # Load your dataset
    df = pd.DataFrame({
        'amount': [100, 200, 150, 300, 250, 175],
        'category': ['A', 'A', 'B', 'B', 'C', 'C']
    })

    # Calculate statistics for amount grouped by category
    stats = extract_statistics_dataframe_per_label(
        df=df,
        feature_name='amount',
        label_name='category'
    )
    print(stats)

The output will be a DataFrame containing the following statistics for each category:

+----------+-------+-----------+--------+------+-------------+-------------+--------------+--------+--------------+--------------+--------------+-------+
| category | count | null_count| mean   | min  | 1_percentile| 5_percentile| 25_percentile| median | 75_percentile| 95_percentile| 99_percentile| max   |
+==========+=======+===========+========+======+=============+=============+==============+========+==============+==============+==============+=======+
| A        | 2     | 0         | 150.0  | 100  | 100.0       | 100.0       | 100.0        | 150.0  | 200.0        | 200.0        | 200.0        | 200.0 |
+----------+-------+-----------+--------+------+-------------+-------------+--------------+--------+--------------+--------------+--------------+-------+
| B        | 2     | 0         | 225.0  | 150  | 150.0       | 150.0       | 150.0        | 225.0  | 300.0        | 300.0        | 300.0        | 300.0 |
+----------+-------+-----------+--------+------+-------------+-------------+--------------+--------+--------------+--------------+--------------+-------+
| C        | 2     | 0         | 212.5  | 175  | 175.0       | 175.0       | 175.0        | 212.5  | 250.0        | 250.0        | 250.0        | 250.0 |
+----------+-------+-----------+--------+------+-------------+-------------+--------------+--------+--------------+--------------+--------------+-------+

This comprehensive set of statistics helps in understanding the distribution of numerical features across different categories, which can be valuable for:

- Identifying outliers within specific groups
- Understanding data skewness per category
- Detecting potential data quality issues
- Making informed decisions about feature engineering strategies

**************************
Compute Mutual Information
**************************

This method computes the mutual information between each feature and a specified target variable. Mutual information measures the dependency between two variables, with a higher value indicating a stronger relationship. Use this when you want to:

* Identify the most informative features for a classification task.
* Perform feature selection based on the relevance of each feature to the target.
* Gain insight into the underlying relationships within your data, which can guide feature engineering.

.. autofunction:: ds_utils.preprocess.statistics.compute_mutual_information

Code Example
============

This example uses a sample DataFrame to demonstrate the calculation of mutual information.

Here's how to use the code::

    import pandas as pd

    from ds_utils.preprocess.statistics import compute_mutual_information

    sample_df = pd.DataFrame({
        "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
        "category": ["A", "A", "A", "B", "B", "B", "C", "C", "C", "C"],
        "text_col": ["x", "y", "z", "x", "y", "z", "x", "y", "z", "x"],
    })
    target = "category"
    mutual_information_df = compute_mutual_information(sample_df, target)
    print(mutual_information_df)

The following table will be the output:

+----------+--------------------+
| feature  | mutual_information |
+==========+====================+
| value    | 0.046              |
+----------+--------------------+
| text_col | 0.941              |
+----------+--------------------+
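
To build intuition for what a mutual information score means, the sketch below computes the textbook plug-in estimate for two discrete variables using only the Python standard library. It is an illustration of the definition, not the estimator used by ``compute_mutual_information``, so its numbers should not be expected to match the table above. The helper name ``discrete_mutual_information`` and the toy data are hypothetical.

```python
import math
from collections import Counter


def discrete_mutual_information(xs, ys):
    """Plug-in mutual information (in nats) between two discrete sequences.

    Hypothetical helper for illustration only; it is not part of ds_utils.
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    joint_counts = Counter(zip(xs, ys))
    x_counts = Counter(xs)
    y_counts = Counter(ys)
    mi = 0.0
    for (x, y), count in joint_counts.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (count / n) * math.log(count * n / (x_counts[x] * y_counts[y]))
    return mi


# Toy data (assumed for this sketch): one feature determines the target,
# the other is statistically independent of it.
target = ["no", "no", "no", "no", "yes", "yes", "yes", "yes"]
informative = ["low", "low", "low", "low", "high", "high", "high", "high"]
noise = ["a", "b", "a", "b", "a", "b", "a", "b"]

print(discrete_mutual_information(informative, target))  # ln(2), about 0.693
print(discrete_mutual_information(noise, target))        # 0.0
```

A feature that fully determines the target scores the target's entropy (here ``ln 2``), while an independent feature scores zero, which is why higher values point to more informative features.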