Strings
The strings module contains methods that help manipulate and process strings in a DataFrame. These functions are particularly useful for tasks involving text analysis, feature engineering, and data preprocessing in data science and machine learning workflows.
Significant Terms
The extract_significant_terms_from_subset function is used to identify terms that are statistically overrepresented in a subset of documents compared to the full corpus. This can be particularly useful for tasks such as topic modeling, content categorization, or identifying distinctive vocabulary in specific document groups.
- ds_utils.strings.extract_significant_terms_from_subset(data_frame: DataFrame, subset_data_frame: DataFrame, field_name: str, vectorizer: CountVectorizer = CountVectorizer(max_features=500)) → Series
Return interesting or unusual occurrences of terms in a subset.
Based on the Elasticsearch significant_text aggregation: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html#_scripted
- Parameters:
data_frame – The full dataset.
subset_data_frame – The subset partition data over which the scoring will be calculated. It can be filtered by feature or other boolean criteria.
field_name – The feature to parse.
vectorizer – Text count vectorizer which converts a collection of text to a matrix of token counts. See more info here: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
- Returns:
Series of terms with scoring over the subset.
- Author:
Eran Hirsch (https://github.com/eranhirs)
Code Example
This example demonstrates how to use the function to extract significant terms from a subset of documents:
import pandas as pd
from ds_utils.strings import extract_significant_terms_from_subset
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?'
]
data_frame = pd.DataFrame(corpus, columns=["content"])
# Use the last two documents as the subset to compare against the full corpus
subset_data_frame = data_frame[data_frame.index > 1]
terms = extract_significant_terms_from_subset(data_frame, subset_data_frame, "content")
The output for terms will be:
| third | one | and | this | the | is | first | document | second |
|---|---|---|---|---|---|---|---|---|
| 1.0 | 1.0 | 1.0 | 0.67 | 0.67 | 0.67 | 0.5 | 0.25 | 0.0 |
Explanation of output values:
The output is a Series of terms with their corresponding significance scores, sorted from highest to lowest. A higher score means that a term's occurrences are concentrated in the subset rather than in the rest of the corpus. In this example, a score of 1.0 marks terms that appear only in the subset, lower positive scores mark terms that also appear in the remaining documents, and a score of 0.0 marks a term that does not appear in the subset at all.
In this example:
‘third’, ‘one’, and ‘and’ have scores of 1.0: each appears once in the subset and nowhere else in the corpus.
‘this’, ‘the’, and ‘is’ have scores of 0.67: each appears twice in the subset and twice in the other documents, so they are not distinctive to the subset.
‘first’ has a score of 0.5: it appears once in the subset and once outside it.
‘document’ has a low score of 0.25: it appears once in the subset and three times outside it.
‘second’ has a score of 0.0, meaning it doesn’t appear in the subset at all.
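The scores in this example can be reproduced by hand, which may help when interpreting results on real data. The sketch below is not the library's code. It assumes a scoring rule inferred from the example output (a term's count in the subset divided by one plus its count in the rest of the corpus), and it uses a simplified regular-expression tokenizer as a stand-in for CountVectorizer's default tokenization. The helper name count_terms is hypothetical.

```python
import re
from collections import Counter

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]
subset = corpus[2:]  # the same rows selected by data_frame.index > 1


def count_terms(documents):
    """Count lowercase tokens of two or more word characters (hypothetical helper)."""
    counts = Counter()
    for text in documents:
        counts.update(re.findall(r"\b\w\w+\b", text.lower()))
    return counts


full_counts = count_terms(corpus)
subset_counts = count_terms(subset)

# Assumed scoring rule, inferred from the example output:
# count in subset / (count outside the subset + 1)
scores = {
    term: subset_counts[term] / (full_counts[term] - subset_counts[term] + 1)
    for term in full_counts
}

# Print terms from the highest score to the lowest
for term, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{term}: {score:.2f}")
```

Under this assumed rule, a term that occurs several times in the subset and never outside it would score above 1.0, so scores should be read as a ranking rather than as a bounded probability.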
This function is particularly useful for identifying key terms that characterize specific subsets of your data, which can be valuable for tasks like document classification, content summarization, or exploratory data analysis.