The strings module contains methods that help manipulate and process strings in a dataframe.

Append Tags to Frame

strings.append_tags_to_frame(X_train: pandas.core.frame.DataFrame, X_test: pandas.core.frame.DataFrame, field_name: str, prefix: Optional[str] = '', max_features: Optional[int] = 500, min_df: Union[int, float] = 1, lowercase=False, tokenizer: Optional[Callable[[str], List[str]]] = <function _tokenize>) → Tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame]

Extracts tags from a given field and appends them as new binary columns to the dataframes.

  • X_train – Pandas’ dataframe with the train features.
  • X_test – Pandas’ dataframe with the test features.
  • field_name – the feature to parse.
  • prefix – the prefix for the new tag features.
  • max_features – int or None, default=500. Maximum number of tag names to consider.
  • min_df – float in range [0.0, 1.0] or int, default=1. When building the tag name set, ignore tags that have a document frequency strictly lower than the given threshold. If float, the parameter represents a proportion of documents; if int, absolute counts.
  • lowercase – boolean, default=False. Convert all characters to lowercase before tokenizing the tag names.
  • tokenizer – callable or None. Override the string tokenization step while preserving the preprocessing and n-grams generation steps. The default splits on "," and retains alphanumeric characters along with the special characters "_", "$" and "-".

Returns the train and test dataframes with the tag columns appended.
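The default tokenizer's behavior can be approximated as follows. This is only an illustrative sketch, and `tokenize_tags` is a hypothetical name, not the library's internal function:

```python
import re


def tokenize_tags(text: str) -> list:
    # Hypothetical approximation of the default tokenizer described above:
    # split on "," and keep only tokens made of alphanumeric characters
    # plus "_", "$" and "-".
    return [token for token in text.split(",")
            if re.fullmatch(r"[\w$-]+", token)]


print(tokenize_tags("ds,ml,dl-nlp"))  # ['ds', 'ml', 'dl-nlp']
```

A custom callable with the same `str -> List[str]` shape can be passed through the tokenizer parameter to handle other delimiters.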

Code Example

In this example we’ll create our own simple dataset that looks like this:


x_train:

article_name    article_tags
1               ds,ml,dl
2               ds,ml

x_test:

article_name    article_tags
3               ds,ml,py

and parse it:

import pandas

from ds_utils.strings import append_tags_to_frame

x_train = pandas.DataFrame([{"article_name": "1", "article_tags": "ds,ml,dl"},
                            {"article_name": "2", "article_tags": "ds,ml"}])
x_test = pandas.DataFrame([{"article_name": "3", "article_tags": "ds,ml,py"}])

x_train_with_tags, x_test_with_tags = append_tags_to_frame(x_train, x_test, "article_tags", "tag_")

And the following table will be the output for x_train_with_tags:

article_name    tag_ds    tag_ml    tag_dl
1               1         1         1
2               1         1         0

And the following table will be the output for x_test_with_tags:

article_name    tag_ds    tag_ml    tag_dl
3               1         1         0
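For comparison, the same one-hot expansion of the train set can be sketched with plain pandas. This is not how the library implements it; note that `Series.str.get_dummies` orders the resulting columns alphabetically:

```python
import pandas

x_train = pandas.DataFrame([{"article_name": "1", "article_tags": "ds,ml,dl"},
                            {"article_name": "2", "article_tags": "ds,ml"}])

# Split the comma-separated tags and one-hot encode them into binary columns.
tags = x_train["article_tags"].str.get_dummies(sep=",").add_prefix("tag_")
x_train_with_tags = pandas.concat([x_train[["article_name"]], tags], axis=1)
print(x_train_with_tags)
```

Unlike append_tags_to_frame, this sketch handles only one dataframe at a time and does not align the test columns to the tag vocabulary learned from the train set.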

Significant Terms

strings.extract_significant_terms_from_subset(data_frame: pandas.core.frame.DataFrame, subset_data_frame: pandas.core.frame.DataFrame, field_name: str, vectorizer: sklearn.feature_extraction.text.CountVectorizer = CountVectorizer(encoding='latin1', max_features=500)) → pandas.core.series.Series

Returns interesting or unusual occurrences of terms in a subset.

Based on the Elasticsearch significant_text aggregation.

  • data_frame – the full data set.
  • subset_data_frame – the subset partition of the data over which the scoring will be calculated. This can be a filter by feature value or any other boolean criteria.
  • field_name – the feature to parse.
  • vectorizer – text count vectorizer which converts a collection of text to a matrix of token counts. See the scikit-learn CountVectorizer documentation for more information.

Returns a Series of terms with their scores over the subset.


Author: Eran Hirsch

Code Example

This method helps extract the significant terms that differentiate a subset of documents from the full corpus. Let’s create a simple corpus and extract significant terms from it:

import pandas

from ds_utils.strings import extract_significant_terms_from_subset

corpus = ['This is the first document.', 'This document is the second document.',
          'And this is the third one.', 'Is this the first document?']
data_frame = pandas.DataFrame(corpus, columns=["content"])
# Let's differentiate between the last two documents from the full corpus
subset_data_frame = data_frame[data_frame.index > 1]
terms = extract_significant_terms_from_subset(data_frame, subset_data_frame,
                                              "content")

And the following table will be the output for terms:

term     third    one    and    this    the     is      first    document    second
score    1.0      1.0    1.0    0.67    0.67    0.67    0.5      0.25        0.0
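For intuition, the foreground-versus-background idea behind significant-term scoring can be sketched with the standard library alone. This is an illustrative scorer, not the library's exact formula, and `simple_significance` is a hypothetical helper: a term occurring only inside the subset scores 1.0, a term occurring only outside it scores 0.0.

```python
from collections import Counter


def simple_significance(corpus, subset):
    # Count term occurrences in the full corpus (background) and in the
    # subset (foreground); score each term by its share inside the subset.
    total = Counter(word for doc in corpus for word in doc.lower().split())
    inside = Counter(word for doc in subset for word in doc.lower().split())
    return {term: inside[term] / count for term, count in total.items()}


corpus = ["ds ml dl", "ds ml", "py ds"]
subset = ["py ds"]
scores = simple_significance(corpus, subset)
print(scores["py"])  # 1.0 - "py" occurs only in the subset
print(scores["ml"])  # 0.0 - "ml" occurs only outside the subset
```

Terms that are common everywhere land between the two extremes, which is why generic words like "the" and "is" score lower than subset-specific words like "third" in the example above.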