Strings

The strings module contains methods that help manipulate and process strings in a dataframe.

Append Tags to Frame

strings.append_tags_to_frame(X_train: pandas.core.frame.DataFrame, X_test: pandas.core.frame.DataFrame, field_name: str, prefix: Optional[str] = '', max_features: Optional[int] = 500, min_df: Union[int, float] = 1, lowercase=False, tokenizer: Optional[Callable[[str], List[str]]] = <function _tokenize>) → Tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame]

Extracts tags from a given field and appends them as new columns to the dataframe.

Parameters:
  • X_train – Pandas dataframe with the train features.
  • X_test – Pandas dataframe with the test features.
  • field_name – the feature to parse.
  • prefix – the prefix for the new tag features.
  • max_features – int or None, default=500. The maximum number of tag names to consider.
  • min_df – float in range [0.0, 1.0] or int, default=1. When building the tag name set, ignore tags that have a document frequency strictly lower than the given threshold. If float, the parameter represents a proportion of documents; if int, absolute counts.
  • lowercase – boolean, default=False. Convert all characters to lowercase before tokenizing the tag names.
  • tokenizer – callable or None. Overrides the string tokenization step while preserving the preprocessing and n-grams generation steps. The default splits by "," and retains alphanumeric characters along with the special characters "_", "$" and "-" (see the custom tokenizer sketch below).
Returns:

the train and test dataframes with tag columns appended.
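
A custom tokenizer is any callable that takes the raw field value and returns a list of tags. As a minimal sketch of the tokenizer parameter, here is a hypothetical tokenizer for semicolon-separated tags (the separator choice is illustrative; the default tokenizer already handles comma-separated tags):

from typing import List

import pandas

from ds_utils.strings import append_tags_to_frame


def semicolon_tokenizer(text: str) -> List[str]:
    # Split on ";" and drop surrounding whitespace and empty entries
    return [tag.strip() for tag in text.split(";") if tag.strip()]


x_train = pandas.DataFrame([{"article_name": "1", "article_tags": "ds; ml"}])
x_test = pandas.DataFrame([{"article_name": "2", "article_tags": "ds"}])

x_train_with_tags, x_test_with_tags = append_tags_to_frame(
    x_train, x_test, "article_tags", "tag_", tokenizer=semicolon_tokenizer)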

Code Example

In this example we'll create our own simple dataset that looks like this:

x_train:

article_name    article_tags
1               ds,ml,dl
2               ds,ml

x_test:

article_name    article_tags
3               ds,ml,py

and parse it:

import pandas

from ds_utils.strings import append_tags_to_frame


x_train = pandas.DataFrame([{"article_name": "1", "article_tags": "ds,ml,dl"},
                            {"article_name": "2", "article_tags": "ds,ml"}])
x_test = pandas.DataFrame([{"article_name": "3", "article_tags": "ds,ml,py"}])

x_train_with_tags, x_test_with_tags = append_tags_to_frame(x_train, x_test, "article_tags", "tag_")

And the following table will be the output for x_train_with_tags:

article_name    tag_ds    tag_ml    tag_dl
1               1         1         1
2               1         1         0

And the following table will be the output for x_test_with_tags (note there is no tag_py column, since the tag vocabulary is learned from x_train only):

article_name    tag_ds    tag_ml    tag_dl
3               1         1         0
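
If tag casing varies between records, lowercase=True folds the case variants together before tokenizing. A minimal variant of the example above (the mixed-case tags here are hypothetical):

import pandas

from ds_utils.strings import append_tags_to_frame


x_train = pandas.DataFrame([{"article_name": "1", "article_tags": "DS,ml"},
                            {"article_name": "2", "article_tags": "ds,ML"}])
x_test = pandas.DataFrame([{"article_name": "3", "article_tags": "ds,ml"}])

# lowercase=True folds "DS" and "ds" into the single column tag_ds
x_train_with_tags, x_test_with_tags = append_tags_to_frame(
    x_train, x_test, "article_tags", "tag_", lowercase=True)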

Significant Terms

strings.extract_significant_terms_from_subset(data_frame: pandas.core.frame.DataFrame, subset_data_frame: pandas.core.frame.DataFrame, field_name: str, vectorizer: sklearn.feature_extraction.text.CountVectorizer = CountVectorizer(encoding='latin1', max_features=500)) → pandas.core.series.Series

Returns interesting or unusual occurrences of terms in a subset.

Based on the Elasticsearch significant_text aggregation.

Parameters:
  • data_frame – the full data set.
  • subset_data_frame – the subset partition data over which the scoring will be calculated. Can be a filter by feature or other boolean criteria.
  • field_name – the feature to parse.
  • vectorizer – text count vectorizer which converts a collection of texts to a matrix of token counts. See the scikit-learn CountVectorizer documentation for more info; a hedged configuration sketch follows the code example below.
Returns:

Series of terms with scoring over the subset.

Author:

Eran Hirsch

Code Example

This method helps extract the significant terms that differentiate a subset of documents from the full corpus. Let's create a simple corpus and extract significant terms from it:

import pandas

from ds_utils.strings import extract_significant_terms_from_subset

corpus = ['This is the first document.', 'This document is the second document.',
          'And this is the third one.', 'Is this the first document?']
data_frame = pandas.DataFrame(corpus, columns=["content"])
# Let's differentiate between the last two documents from the full corpus
subset_data_frame = data_frame[data_frame.index > 1]
terms = extract_significant_terms_from_subset(data_frame, subset_data_frame,
                                              "content")

And the following table will be the output for terms:

third       1.0
one         1.0
and         1.0
this        0.67
the         0.67
is          0.67
first       0.5
document    0.25
second      0.0
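
The counting behavior can be customized by passing your own vectorizer. As a hedged sketch, continuing the example above with a CountVectorizer that drops English stop words and also counts bigrams (these settings are illustrative, not the library defaults):

from sklearn.feature_extraction.text import CountVectorizer

# Count unigrams and bigrams, ignoring common English stop words
custom_vectorizer = CountVectorizer(stop_words="english", ngram_range=(1, 2),
                                    max_features=500)
terms = extract_significant_terms_from_subset(data_frame, subset_data_frame,
                                              "content", vectorizer=custom_vectorizer)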