Strings

The strings module contains methods that help manipulate and process strings in a DataFrame.

Append Tags to Frame

Code Example

In this example we’ll create our own simple dataset that looks like this:

x_train:

article_name	article_tags
1	ds,ml,dl
2	ds,ml

x_test:

article_name	article_tags
3	ds,ml,py

and parse it:

import pandas as pd

from ds_utils.strings import append_tags_to_frame


x_train = pd.DataFrame([{"article_name": "1", "article_tags": "ds,ml,dl"},
                        {"article_name": "2", "article_tags": "ds,ml"}])
x_test = pd.DataFrame([{"article_name": "3", "article_tags": "ds,ml,py"}])

x_train_with_tags, x_test_with_tags = append_tags_to_frame(x_train, x_test, "article_tags", "tag_")

The output for x_train_with_tags is the following table:

article_name	tag_ds	tag_ml	tag_dl
1	1	1	1
2	1	1	0

The output for x_test_with_tags is the following table. Note that py, which appears only in x_test, has no column in the output:

article_name	tag_ds	tag_ml	tag_dl
3	1	1	0
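
To make the behaviour in the tables above concrete, here is a minimal plain-Python sketch that reproduces them. It works on lists of dicts rather than DataFrames, and the function name append_tags_sketch is hypothetical. It is an illustration of the observed output under the assumption that the tag columns are learned from the training data only, not the library's implementation.

```python
def append_tags_sketch(train_rows, test_rows, field, prefix):
    """Hypothetical helper: one-hot encode comma-separated tags in `field`."""
    # Assumption: the tag vocabulary comes from the training rows only,
    # kept in order of first appearance.
    vocabulary = []
    for row in train_rows:
        for tag in row[field].split(","):
            if tag not in vocabulary:
                vocabulary.append(tag)

    def encode(row):
        tags = set(row[field].split(","))
        # Keep every column except the raw tag field.
        encoded = {key: value for key, value in row.items() if key != field}
        # Add one 0/1 column per known tag; unknown tags are ignored.
        for tag in vocabulary:
            encoded[prefix + tag] = int(tag in tags)
        return encoded

    return [encode(row) for row in train_rows], [encode(row) for row in test_rows]


train_rows = [{"article_name": "1", "article_tags": "ds,ml,dl"},
              {"article_name": "2", "article_tags": "ds,ml"}]
test_rows = [{"article_name": "3", "article_tags": "ds,ml,py"}]

train_out, test_out = append_tags_sketch(train_rows, test_rows, "article_tags", "tag_")
```

Because the vocabulary is built from train_rows alone, the py tag in the test row is dropped, which matches the x_test_with_tags table. The resulting lists of dicts can be passed to pd.DataFrame if a frame is needed.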

Significant Terms

Code Example

This method helps extract the significant terms that differentiate a subset of documents from the full corpus. Let’s create a simple corpus and extract significant terms from it:

import pandas as pd

from ds_utils.strings import extract_significant_terms_from_subset

corpus = ['This is the first document.', 'This document is the second document.',
          'And this is the third one.', 'Is this the first document?']
data_frame = pd.DataFrame(corpus, columns=["content"])
# Differentiate the last two documents from the full corpus
subset_data_frame = data_frame[data_frame.index > 1]
terms = extract_significant_terms_from_subset(data_frame, subset_data_frame,
                                              "content")

The output for terms is the following table:

third	one	and	this	the	is	first	document	second
1.0	1.0	1.0	0.67	0.67	0.67	0.5	0.25	0.0
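
The scores in the table above are consistent with the ratio subset_count / (corpus_count - subset_count + 1), computed per term over lowercased word tokens of two or more characters. That formula is inferred from the example output and is an assumption, not a statement of how the library computes it. The plain-Python sketch below, with hypothetical helper names count_terms and significance_scores, reproduces the numbers for the same corpus:

```python
import re
from collections import Counter


def count_terms(documents):
    """Count lowercased word tokens (two or more characters) across documents."""
    counts = Counter()
    for document in documents:
        counts.update(re.findall(r"\b\w\w+\b", document.lower()))
    return counts


def significance_scores(corpus, subset):
    """Assumed scoring: subset_count / (corpus_count - subset_count + 1)."""
    corpus_counts = count_terms(corpus)
    subset_counts = count_terms(subset)
    scores = {
        term: subset_counts[term] / (total - subset_counts[term] + 1)
        for term, total in corpus_counts.items()
    }
    # Highest-scoring terms first.
    return dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))


corpus = ['This is the first document.', 'This document is the second document.',
          'And this is the third one.', 'Is this the first document?']
# The subset is the last two documents, as in the example above.
scores = significance_scores(corpus, corpus[2:])
```

A term that occurs only inside the subset (third, one, and) scores 1.0, while a term that never occurs in the subset (second) scores 0.0. The +1 in the denominator avoids division by zero when every occurrence of a term falls inside the subset.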