Strings

The strings module contains methods that help manipulate and process string fields in a dataframe.

Append Tags to Frame

strings.append_tags_to_frame(X_train: pandas.core.frame.DataFrame, X_test: pandas.core.frame.DataFrame, field_name: str, prefix: Optional[str] = '', max_features: Optional[int] = 500, min_df: Union[int, float] = 1, lowercase=False, tokenizer: Optional[Callable[[str], List[str]]] = <function _tokenize>) → Tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame]

Extracts tags from a given field and appends them to the dataframe as new features.

Parameters:

X_train – pandas dataframe with the train features.
X_test – pandas dataframe with the test features.
field_name – the feature to parse.
prefix – the prefix to give each new tag feature.
max_features – int or None, default=500. The maximum number of tag names to consider.
min_df – float in range [0.0, 1.0] or int, default=1. When building the tag name set, ignore tags that have a document frequency strictly lower than the given threshold. If float, the parameter represents a proportion of documents; if int, absolute counts.
lowercase – boolean, default=False. Convert all characters to lowercase before tokenizing the tag names.
tokenizer – callable or None. Override the string tokenization step while preserving the preprocessing and n-grams generation steps. The default splits by “,” and retains alphanumeric characters along with the special characters “_”, “$” and “-”.

Returns:

the train and test dataframes with the tag features appended.

Code Example

In this example we’ll create our own simple dataset that looks like this:

x_train:

article_name	article_tags
1	ds,ml,dl
2	ds,ml

x_test:

article_name	article_tags
3	ds,ml,py

and parse it:

import pandas

from ds_utils.strings import append_tags_to_frame


x_train = pandas.DataFrame([{"article_name": "1", "article_tags": "ds,ml,dl"},
                            {"article_name": "2", "article_tags": "ds,ml"}])
x_test = pandas.DataFrame([{"article_name": "3", "article_tags": "ds,ml,py"}])

x_train_with_tags, x_test_with_tags = append_tags_to_frame(x_train, x_test, "article_tags", "tag_")

The output for x_train_with_tags is the following table:

article_name	tag_ds	tag_ml	tag_dl
1	1	1	1
2	1	1	0

The output for x_test_with_tags is the following table:

article_name	tag_ds	tag_ml	tag_dl
3	1	1	0
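
To make the roles of tokenizer, min_df, max_features and prefix more concrete, here is a minimal pure-Python sketch of the same idea. It is not the library's implementation: the helper names tokenize_tags, build_tag_vocabulary and encode_tags are hypothetical, and the tokenizer only approximates the default behavior described above. The vocabulary is built from the train data only, which is why a tag that appears only in the test data (such as py) produces no column.

```python
import re
from collections import Counter
from typing import Dict, List, Optional, Union


def tokenize_tags(text: str) -> List[str]:
    """Hypothetical tokenizer: split on commas, keep only [A-Za-z0-9_$-]."""
    tokens = []
    for raw in text.split(","):
        cleaned = re.sub(r"[^A-Za-z0-9_$-]", "", raw)
        if cleaned:
            tokens.append(cleaned)
    return tokens


def build_tag_vocabulary(documents: List[str],
                         min_df: Union[int, float] = 1,
                         max_features: Optional[int] = None) -> List[str]:
    """Collect tag names from the train documents (assumed selection rule)."""
    # Document frequency: in how many documents each tag appears at least once.
    doc_freq = Counter()
    for document in documents:
        doc_freq.update(set(tokenize_tags(document)))
    # A float min_df is a proportion of documents, an int is an absolute count.
    threshold = min_df * len(documents) if isinstance(min_df, float) else min_df
    kept = [tag for tag, freq in doc_freq.items() if freq >= threshold]
    # Keep the most frequent tags when max_features is set.
    kept.sort(key=lambda tag: (-doc_freq[tag], tag))
    if max_features is not None:
        kept = kept[:max_features]
    return sorted(kept)


def encode_tags(documents: List[str], vocabulary: List[str],
                prefix: str = "") -> List[Dict[str, int]]:
    """Build one row of 0/1 indicators per document."""
    rows = []
    for document in documents:
        present = set(tokenize_tags(document))
        rows.append({prefix + tag: int(tag in present) for tag in vocabulary})
    return rows


train = ["ds,ml,dl", "ds,ml"]
test = ["ds,ml,py"]

vocabulary = build_tag_vocabulary(train, min_df=1)
train_rows = encode_tags(train, vocabulary, prefix="tag_")
test_rows = encode_tags(test, vocabulary, prefix="tag_")
```

With min_df=2 in this sketch, dl would be dropped from the vocabulary because it appears in only one train document.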

Significant Terms

strings.extract_significant_terms_from_subset(data_frame: pandas.core.frame.DataFrame, subset_data_frame: pandas.core.frame.DataFrame, field_name: str, vectorizer: sklearn.feature_extraction.text.CountVectorizer = CountVectorizer(encoding='latin1', max_features=500)) → pandas.core.series.Series

Returns interesting or unusual occurrences of terms in a subset.

Based on the Elasticsearch significant_text aggregation.

Parameters:

data_frame – the full data set.
subset_data_frame – the subset partition of the data, over which the scoring will be calculated. It can be a filter by feature or any other boolean criterion.
field_name – the feature to parse.
vectorizer – text count vectorizer that converts a collection of text documents to a matrix of token counts. See the scikit-learn CountVectorizer documentation for more information.

Returns:

Series of terms with scoring over the subset.

Author:

Eran Hirsch

Code Example

This method helps extract the significant terms that differentiate a subset of documents from the full corpus. Let’s create a simple corpus and extract significant terms from it:

import pandas

from ds_utils.strings import extract_significant_terms_from_subset

corpus = ['This is the first document.', 'This document is the second document.',
          'And this is the third one.', 'Is this the first document?']
data_frame = pandas.DataFrame(corpus, columns=["content"])
# Let's differentiate between the last two documents from the full corpus
subset_data_frame = data_frame[data_frame.index > 1]
terms = extract_significant_terms_from_subset(data_frame, subset_data_frame,
                                              "content")

The output for terms is the following table:

third	one	and	this	the	is	first	document	second
1.0	1.0	1.0	0.67	0.67	0.67	0.5	0.25	0.0
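
The scores in the table above can be reproduced with a simple ratio: the number of times a term occurs in the subset, divided by the number of times it occurs outside the subset plus one. The sketch below is an assumption that happens to match the documented output, not a confirmed description of the library's internals. The helper names count_terms and significant_terms are hypothetical, and the regular expression only imitates a typical word tokenizer (lowercase tokens of two or more characters).

```python
import re
from collections import Counter
from typing import Dict, List


def count_terms(documents: List[str]) -> Counter:
    """Count lowercase word tokens of two or more characters."""
    counts = Counter()
    for document in documents:
        counts.update(re.findall(r"\b\w\w+\b", document.lower()))
    return counts


def significant_terms(corpus: List[str], subset: List[str]) -> Dict[str, float]:
    """Score each term by subset count / (count outside the subset + 1)."""
    superset_counts = count_terms(corpus)
    subset_counts = count_terms(subset)
    scores = {
        term: subset_counts[term] / (total - subset_counts[term] + 1)
        for term, total in superset_counts.items()
    }
    # Highest scoring terms first.
    return dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))


corpus = ['This is the first document.', 'This document is the second document.',
          'And this is the third one.', 'Is this the first document?']
# The subset is the last two documents, as in the example above.
scores = significant_terms(corpus, corpus[2:])
```

Under this rule, terms that occur only in the subset (third, one, and) score 1.0, while a term that never occurs in the subset (second) scores 0.0.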