Strings

The strings module contains functions for manipulating and processing text columns in a DataFrame. They are useful for text analysis, feature engineering, and data preprocessing in data science and machine learning workflows.

Append Tags to Frame

The append_tags_to_frame function is designed to extract tags from a specified field in a DataFrame and create new binary columns for each unique tag. This is particularly useful for converting comma-separated tag lists into a one-hot encoded format, which is often required for machine learning models.

ds_utils.strings.append_tags_to_frame(X_train: pandas.DataFrame, X_test: pandas.DataFrame, field_name: str, prefix: str = '', max_features: int | None = 500, min_df: int | float = 1, lowercase: bool = False, sparse: bool = False, tokenizer: Callable[[str], List[str]] | None = <function _tokenize>) -> Tuple[DataFrame, DataFrame]

Extract tags from a column and append them as binarized features to the dataframe.

This function processes a specified column in the train and test dataframes that contains tags. It supports columns with either string-based tags (e.g., “tag1,tag2”) or list-based tags (e.g., [“tag1”, “tag2”]). The function identifies a vocabulary of tags from the training data, filters them based on frequency, and then creates new binary columns for each tag.

Supported Input Types for the Tags Column:

  • str: Comma-separated tags. The default tokenizer splits by comma, trims whitespace, and removes non-alphanumeric characters (except “_”, “$”, “-”). Empty strings are treated as having no tags.

  • List[str]: A pre-tokenized list of tags. Empty lists are treated as having no tags.

  • NaN/None: Handled as empty.

Tokenization Rules (for string inputs):

  • The default tokenizer splits the input string by commas (“,”).

  • Whitespace around tags is automatically trimmed.

  • Duplicate tags within the same string (e.g., “tag1,tag1”) are treated as a single occurrence for that row.

  • Casing is preserved unless lowercase=True.
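
The tokenization rules above can be sketched in a few lines of plain Python. This is an illustration only: sketch_tokenize is a hypothetical stand-in, not the library's internal tokenizer, and the ASCII-only character whitelist and the lowercase argument are assumptions made to keep the example short.

```python
import re
from typing import List

# Assumption: "alphanumeric" is treated as ASCII letters and digits here;
# "_", "$" and "-" are kept, everything else is stripped.
_DISALLOWED = re.compile(r"[^A-Za-z0-9_$\-]")


def sketch_tokenize(text: str, lowercase: bool = False) -> List[str]:
    """Hypothetical tokenizer that follows the rules listed above."""
    tags: List[str] = []
    for raw in text.split(","):                   # split on commas
        tag = _DISALLOWED.sub("", raw.strip())    # trim, then drop other characters
        if lowercase:
            tag = tag.lower()
        if tag and tag not in tags:               # skip empties and duplicates
            tags.append(tag)
    return tags


print(sketch_tokenize(" ds, ml ,ds,Deep-Learning!"))  # ['ds', 'ml', 'Deep-Learning']
print(sketch_tokenize("DS,ds", lowercase=True))       # ['ds']
print(sketch_tokenize(""))                            # []
```

A function with the same shape (a string in, a list of strings out) is what the tokenizer parameter expects according to the signature above.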

min_df Behavior:

  • This parameter filters out tags that are not frequent enough in the training data.

  • If int: The absolute minimum number of rows a tag must appear in to be included.

  • If float (between 0.0 and 1.0): The minimum fraction of rows a tag must appear in.

  • This filtering is applied before the final vocabulary is selected and binarized.
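
To make the int and float cases concrete, here is a minimal sketch of document-frequency filtering. The helper filter_by_min_df is hypothetical and written for this page; it assumes that a float threshold is multiplied by the number of training rows and that a tag must meet or exceed the threshold to be kept.

```python
from collections import Counter
from typing import Iterable, List, Set, Union


def filter_by_min_df(rows: Iterable[List[str]], min_df: Union[int, float] = 1) -> Set[str]:
    """Keep tags whose row count reaches min_df (hypothetical helper)."""
    row_sets = [set(tags) for tags in rows]  # a tag counts once per row
    doc_freq = Counter(tag for tags in row_sets for tag in tags)
    # Assumption: float means a fraction of rows, int means an absolute count.
    threshold = min_df * len(row_sets) if isinstance(min_df, float) else min_df
    return {tag for tag, count in doc_freq.items() if count >= threshold}


train_rows = [["ds", "ml", "dl"], ["ds", "ml"], ["ds"]]
print(sorted(filter_by_min_df(train_rows, min_df=1)))    # ['dl', 'ds', 'ml']
print(sorted(filter_by_min_df(train_rows, min_df=2)))    # ['ds', 'ml']
print(sorted(filter_by_min_df(train_rows, min_df=0.5)))  # ['ds', 'ml']
```

With three training rows, min_df=0.5 corresponds to a threshold of 1.5 rows, so “dl” (one row) is dropped while “ml” (two rows) is kept.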

Column Naming Logic:

  • The prefix argument is prepended to each tag to form the new column names.

  • Example: With prefix=“tag_” and a tag “python”, the resulting column will be “tag_python”.

Column Ordering:

  • The generated tag columns are always sorted alphabetically, ensuring a deterministic and stable order that can be relied upon for feature alignment in downstream modeling.
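
The naming and ordering rules can be illustrated with a short sketch. The helper tag_column_names is hypothetical; it sorts the tags and then prepends the prefix, which yields the same order as sorting the prefixed names because every column shares the same prefix.

```python
from typing import Iterable, List


def tag_column_names(vocabulary: Iterable[str], prefix: str = "") -> List[str]:
    """Build prefixed column names in alphabetical order (hypothetical helper)."""
    return [f"{prefix}{tag}" for tag in sorted(vocabulary)]


print(tag_column_names({"python", "ds", "ml"}, prefix="tag_"))
# ['tag_ds', 'tag_ml', 'tag_python']
```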

Parameters:
  • X_train – Pandas DataFrame with the train features.

  • X_test – Pandas DataFrame with the test features.

  • field_name – The name of the column to parse for tags.

  • prefix – A string prefix for the new binarized tag columns.

  • max_features – The maximum number of tags to include, based on frequency. Default is 500.

  • min_df – The minimum document frequency for a tag to be included. Can be an int or a float. Default is 1.

  • lowercase – If True, all tags are converted to lowercase. Default is False.

  • sparse – If True, returns a DataFrame with sparse columns. Default is False.

  • tokenizer – A custom function to tokenize string inputs. Defaults to an internal tokenizer.

Returns:

A tuple containing the transformed train and test DataFrames.

Raises:

KeyError – If field_name is not in the input dataframes.

Code Example

In this example, we’ll create a simple dataset and demonstrate how to use the append_tags_to_frame function:

x_train:

article_name    article_tags
1               ds,ml,dl
2               ds,ml

x_test:

article_name    article_tags
3               ds,ml,py

Here’s how to use the function:

import pandas as pd
from ds_utils.strings import append_tags_to_frame

x_train = pd.DataFrame([
    {"article_name": "1", "article_tags": "ds,ml,dl"},
    {"article_name": "2", "article_tags": "ds,ml"}
])
x_test = pd.DataFrame([
    {"article_name": "3", "article_tags": "ds,ml,py"}
])

x_train_with_tags, x_test_with_tags = append_tags_to_frame(x_train, x_test, "article_tags", "tag_")

The output for x_train_with_tags will be:

article_name    tag_ds    tag_ml    tag_dl
1               1         1         1
2               1         1         0

And the output for x_test_with_tags will be:

article_name    tag_ds    tag_ml    tag_dl
3               1         1         0
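
Note that the test row contains the tag “py”, yet no tag_py column appears, because the vocabulary is identified from the training data only. The sketch below reproduces that behavior in plain Python. It is not the library's implementation, and the encode helper is hypothetical.

```python
train_rows = ["ds,ml,dl", "ds,ml"]
test_rows = ["ds,ml,py"]

# The vocabulary is built from the training rows only.
vocabulary = {tag for row in train_rows for tag in row.split(",")}


def encode(row: str, vocabulary: set) -> dict:
    """Map each vocabulary tag to 1 if present in the row, else 0 (hypothetical helper)."""
    present = set(row.split(","))
    return {tag: int(tag in present) for tag in vocabulary}


encoded_test = encode(test_rows[0], vocabulary)
print(encoded_test == {"ds": 1, "ml": 1, "dl": 0})  # True
print("py" in encoded_test)                         # False: unseen in training
```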

Significant Terms

The extract_significant_terms_from_subset function is used to identify terms that are statistically overrepresented in a subset of documents compared to the full corpus. This can be particularly useful for tasks such as topic modeling, content categorization, or identifying distinctive vocabulary in specific document groups.

ds_utils.strings.extract_significant_terms_from_subset(data_frame: DataFrame, subset_data_frame: DataFrame, field_name: str, vectorizer: CountVectorizer = CountVectorizer(max_features=500)) -> Series

Return interesting or unusual occurrences of terms in a subset.

Based on the Elasticsearch significant_text aggregation: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html#_scripted

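Parameters:
  • data_frame – The full dataset.

  • subset_data_frame – The subset of the dataset whose terms are scored against the full dataset.

  • field_name – The name of the column that contains the text.

  • vectorizer – The CountVectorizer used to extract terms. Default is CountVectorizer(max_features=500).

Returns: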

Series of terms with scoring over the subset.

Author:

Eran Hirsch (https://github.com/eranhirs)

Code Example

This example demonstrates how to use the function to extract significant terms from a subset of documents:

import pandas as pd
from ds_utils.strings import extract_significant_terms_from_subset

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?'
]
data_frame = pd.DataFrame(corpus, columns=["content"])
# Use the last two documents as the subset to compare against the full corpus
subset_data_frame = data_frame[data_frame.index > 1]
terms = extract_significant_terms_from_subset(data_frame, subset_data_frame, "content")

The output for terms will be:

term        score
third       1.0
one         1.0
and         1.0
this        0.67
the         0.67
is          0.67
first       0.5
document    0.25
second      0.0

Explanation of output values:

The output is a series of terms with their corresponding significance scores, sorted from highest to lowest. A higher score means the term is more concentrated in the subset relative to the rest of the corpus. In this example, a score of 1.0 marks terms that appear only in the subset, intermediate scores mark terms that also occur outside the subset, and a score of 0.0 marks a term that does not appear in the subset at all.

In this example:

  • ‘third’, ‘one’, and ‘and’ have scores of 1.0, meaning they only appear in the subset.

  • ‘this’, ‘the’, and ‘is’ have scores of 0.67: they appear in every document, so they occur in the subset as well as in the rest of the corpus.

  • ‘document’ has a low score of 0.25, suggesting it’s common throughout the corpus and not particularly distinctive to the subset.

  • ‘second’ has a score of 0.0, meaning it doesn’t appear in the subset at all.
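
The scores in this example are consistent with the ratio subset_count / (corpus_count - subset_count + 1). That formula is inferred from the numbers above rather than taken from the library's documentation, so treat the following as a sketch under that assumption. It also assumes terms are lowercased words of at least two characters, counted with the standard library instead of a CountVectorizer.

```python
import re
from collections import Counter

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]
subset = corpus[2:]  # the last two documents


def count_terms(documents):
    """Count lowercased words of two or more characters (assumed tokenization)."""
    counts = Counter()
    for doc in documents:
        counts.update(re.findall(r"\b\w\w+\b", doc.lower()))
    return counts


corpus_counts = count_terms(corpus)
subset_counts = count_terms(subset)

# Assumed scoring rule, inferred from the example output above.
scores = {
    term: subset_counts[term] / (corpus_counts[term] - subset_counts[term] + 1)
    for term in corpus_counts
}

for term, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{term}: {score:.2f}")
```

For instance, ‘this’ occurs 4 times in the corpus and 2 times in the subset, giving 2 / (4 - 2 + 1) ≈ 0.67, and ‘first’ occurs 2 times in the corpus and once in the subset, giving 1 / (2 - 1 + 1) = 0.5.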

This function is particularly useful for identifying key terms that characterize specific subsets of your data, which can be valuable for tasks like document classification, content summarization, or exploratory data analysis.