Strings
The strings module contains methods that help manipulate and process string columns in a data frame.
Significant Terms
Code Example
This method extracts the significant terms that differentiate a subset of documents from the full corpus. Let’s create a simple corpus and extract significant terms from it:
import pandas as pd

from ds_utils.strings import extract_significant_terms_from_subset

corpus = ['This is the first document.', 'This document is the second document.',
          'And this is the third one.', 'Is this the first document?']
data_frame = pd.DataFrame(corpus, columns=["content"])
# Let's differentiate between the last two documents from the full corpus
subset_data_frame = data_frame[data_frame.index > 1]
terms = extract_significant_terms_from_subset(data_frame, subset_data_frame,
                                              "content")
The following table shows the output for terms:
third | one | and | this | the | is | first | document | second |
---|---|---|---|---|---|---|---|---|
1.0 | 1.0 | 1.0 | 0.67 | 0.67 | 0.67 | 0.5 | 0.25 | 0.0 |
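
To build intuition for what "significant" means here, the sketch below scores each term by the share of its corpus occurrences that fall inside the subset. This is only an illustration under stated assumptions: `term_counts` and `subset_term_ratios` are hypothetical helpers, not part of ds_utils, and the plain count ratio is an assumed stand-in for whatever scoring the library applies, so its scores need not match the table above.

```python
import re
from collections import Counter


def term_counts(documents):
    """Count lowercase word tokens across a list of documents."""
    counts = Counter()
    for document in documents:
        counts.update(re.findall(r"\w+", document.lower()))
    return counts


def subset_term_ratios(corpus, subset):
    """Hypothetical helper (not the ds_utils algorithm): for each term,
    the fraction of its occurrences in the corpus that come from the subset."""
    full_counts = term_counts(corpus)
    subset_counts = term_counts(subset)
    ratios = {term: subset_counts[term] / total
              for term, total in full_counts.items()}
    # Highest ratio first: terms concentrated in the subset rank at the top.
    return dict(sorted(ratios.items(), key=lambda item: item[1], reverse=True))


corpus = ['This is the first document.', 'This document is the second document.',
          'And this is the third one.', 'Is this the first document?']
# Same subset as the example above: the last two documents.
ratios = subset_term_ratios(corpus, corpus[2:])
```

Terms that occur only in the subset, such as "third", get a ratio of 1.0, while a term that never occurs in the subset, such as "second", gets 0.0.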