SentenceEmbeddingTransformer
- class ds_utils.transformers.sentence_embedding.SentenceEmbeddingTransformer(*, model_name: str = 'sentence-transformers/all-MiniLM-L6-v2', batch_size: int = 32, show_progress_bar: bool = False, normalize_embeddings: bool = False, device: str | None = None, precision: str = 'float32', truncate_dim: int | None = None, prompt_name: str | None = None, prompt: str | None = None)[source]
Wrap a sentence-transformers model for use in sklearn pipelines.
Loads a SentenceTransformer model lazily on first fit() and produces a dense float32 (or quantized) embedding matrix from text inputs.
The transformer accepts strings, lists of strings, pandas.Series, pandas.DataFrame (single column), and numpy.ndarray. None and NaN values are replaced with empty strings before encoding.
Note
This transformer requires the optional nlp extras:
pip install data-science-utils[nlp]
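The missing-value handling described above can be pictured with a minimal sketch. The helper name clean_texts is hypothetical and is not part of ds_utils; coercing other values with str() is likewise an assumption made for illustration.

```python
import math
from typing import Any, Iterable, List


def clean_texts(values: Iterable[Any]) -> List[str]:
    """Replace None and NaN with "" (hypothetical helper, not the library's code)."""
    cleaned = []
    for value in values:
        # NaN is a float that is not equal to itself; math.isnan detects it.
        is_nan = isinstance(value, float) and math.isnan(value)
        # Assumption: any remaining value is coerced to text with str().
        cleaned.append("" if value is None or is_nan else str(value))
    return cleaned


texts = clean_texts(["hello", None, float("nan"), "world"])
print(texts)  # ['hello', '', '', 'world']
```

Replacing missing entries with empty strings keeps the number of output rows equal to the number of input rows, which is what downstream sklearn steps expect.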
- Parameters:
model_name – Name or path of a sentence-transformers model (default: 'sentence-transformers/all-MiniLM-L6-v2').
batch_size – Batch size for encoding (default: 32).
show_progress_bar – Whether to show a progress bar during encoding (default: False).
normalize_embeddings – Whether to L2-normalize embeddings to unit length (default: False).
device – Device for computation ('cpu', 'cuda', etc.). None lets the library auto-detect (default: None).
precision – Embedding precision: 'float32', 'int8', 'uint8', 'binary', or 'ubinary' (default: 'float32').
truncate_dim – Truncate embeddings to this many dimensions. Useful for Matryoshka models (default: None, meaning no truncation).
prompt_name – Name of a prompt registered in the model's prompts dictionary (default: None).
prompt – Raw prompt string to prepend to every input sentence (default: None).
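As a rough illustration of what truncate_dim and normalize_embeddings mean for a single vector, here is a pure-Python sketch. The function postprocess is hypothetical, and applying truncation before normalization is an assumption of this sketch rather than a documented guarantee of the transformer.

```python
import math
from typing import List, Optional


def postprocess(
    vector: List[float],
    truncate_dim: Optional[int] = None,
    normalize: bool = False,
) -> List[float]:
    """Illustrative only: truncate, then L2-normalize, a single embedding."""
    if truncate_dim is not None:
        # Keep only the leading dimensions (the Matryoshka-style use case).
        vector = vector[:truncate_dim]
    if normalize:
        # L2 norm is the square root of the sum of squares.
        norm = math.sqrt(sum(x * x for x in vector))
        if norm > 0.0:  # leave an all-zero vector untouched
            vector = [x / norm for x in vector]
    return vector


vec = postprocess([3.0, 4.0, 12.0], truncate_dim=2, normalize=True)
print(vec)  # [0.6, 0.8]
```

With unit-length vectors, a dot product between two embeddings equals their cosine similarity, which is the usual reason to enable normalization.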
- Variables:
- fit(X: ndarray | Series | DataFrame | Sequence[Any], y: Any = None) SentenceEmbeddingTransformer[source]
Load the sentence-transformer model and record embedding metadata.
The model is loaded lazily on the first call to fit(). Subsequent calls reuse the cached model unless the transformer is re-created.
- Parameters:
X – Text data: array-like of strings, pandas.Series, single-column pandas.DataFrame, or numpy.ndarray.
y – Ignored; present for sklearn API compatibility.
- Returns:
This estimator, fitted.
- get_feature_names_out(input_features: None | ndarray | List[str] = None) ndarray[source]
Return output feature names for this transformation.
Names follow dim_0, dim_1, …, dim_{n-1}.
- Parameters:
input_features – Names for the input column(s), or None. When provided, length must match n_features_in_.
- Returns:
numpy.ndarray of shape (embedding_dimension_,), dtype object.
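The naming pattern can be reproduced with a one-line sketch. The helper embedding_feature_names is hypothetical and returns a plain list rather than the numpy.ndarray the method itself returns.

```python
from typing import List


def embedding_feature_names(n_dims: int) -> List[str]:
    """Build dim_0 ... dim_{n-1} (illustrative stand-in for the real method)."""
    return [f"dim_{i}" for i in range(n_dims)]


names = embedding_feature_names(384)
print(names[:3], names[-1])  # ['dim_0', 'dim_1', 'dim_2'] dim_383
```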
- set_params(**params: Any) SentenceEmbeddingTransformer[source]
Set the parameters of this estimator.
- transform(X: ndarray | Series | DataFrame | Sequence[Any]) ndarray[source]
Encode text inputs into dense embedding vectors.
- Parameters:
X – Same accepted forms as fit().
- Returns:
Embedding matrix of shape (n_samples, embedding_dimension_). The output dtype depends on the precision parameter (e.g., float32 or int8). Note: For binary or ubinary precision, the output is a packed uint8 array where dimensions represent packed bits rather than individual embedding dims.
- Raises:
sklearn.exceptions.NotFittedError – If fit() has not been called.
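To make the note about packed bits concrete, the sketch below packs one bit per embedding dimension into bytes, so 8 dimensions become 1 output column. The helper pack_sign_bits is hypothetical. Thresholding at zero and most-significant-bit-first ordering (the default bit order of numpy.packbits) are assumptions, and the transformer's actual quantization may differ.

```python
from typing import List


def pack_sign_bits(vector: List[float]) -> List[int]:
    """Pack one bit per dimension (1 if value > 0) into bytes, MSB first."""
    packed = []
    for start in range(0, len(vector), 8):
        chunk = vector[start:start + 8]
        byte = 0
        for offset, value in enumerate(chunk):
            if value > 0:
                # The first value in each group of 8 lands in the highest bit.
                byte |= 1 << (7 - offset)
        packed.append(byte)
    return packed


packed = pack_sign_bits([0.5, -0.1, 0.2, -0.3, -0.4, 0.9, 0.0, 0.7] * 2)
print(len(packed), packed)  # 2 [165, 165]
```

Under these assumptions a 384-dimensional embedding would occupy 48 byte-sized columns, which is why the column count no longer matches the number of embedding dimensions.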
Prerequisites
This transformer requires the optional nlp dependency group:
pip install data-science-utils[nlp]
Code Examples
The following examples show how to use SentenceEmbeddingTransformer.
Direct usage:
from ds_utils.transformers.sentence_embedding import SentenceEmbeddingTransformer
texts = ["The quick brown fox", "jumps over the lazy dog", "Hello world"]
embed = SentenceEmbeddingTransformer()
embeddings = embed.fit_transform(texts)
names = embed.get_feature_names_out()
embeddings is a numpy array of shape (n_samples, embedding_dimension) (e.g.
(3, 384) for the default sentence-transformers/all-MiniLM-L6-v2 model). Feature
names from get_feature_names_out() follow the pattern dim_0, dim_1, …,
dim_{n-1}. Laid out with the feature names as column headers, the output looks like this:
| dim_0 | dim_1 | dim_2 | … | dim_383 |
|---|---|---|---|---|
| -0.0123 | 0.0456 | 0.0789 | … | 0.0012 |
| 0.0345 | -0.0678 | 0.0901 | … | -0.0234 |
| 0.0567 | 0.0890 | -0.0123 | … | 0.0456 |
Pipeline usage with a classifier:
from ds_utils.transformers.sentence_embedding import SentenceEmbeddingTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
pipeline = Pipeline([
    ('embeddings', SentenceEmbeddingTransformer(
        normalize_embeddings=True,
    )),
    ('classifier', RandomForestClassifier()),
])
# X_train and X_test hold text samples and y_train their labels; define them beforehand.
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
Pipeline usage with pandas output:
from ds_utils.transformers.sentence_embedding import SentenceEmbeddingTransformer
from sklearn.pipeline import Pipeline
pipe = Pipeline([("embed", SentenceEmbeddingTransformer())])
pipe.set_output(transform="pandas")
df = pipe.fit_transform(["hello", "world"])
df will be a pandas.DataFrame with columns dim_0, dim_1, etc.
ColumnTransformer usage:
from ds_utils.transformers.sentence_embedding import SentenceEmbeddingTransformer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
import pandas as pd
df = pd.DataFrame({
    "description": ["a product", "another item"],
    "price": [9.99, 19.99],
})
pre = ColumnTransformer([
    ("text", SentenceEmbeddingTransformer(), ["description"]),
    ("num", StandardScaler(), ["price"]),
])
X_out = pre.fit_transform(df)
Advanced: using prompts and quantized embeddings:
from ds_utils.transformers.sentence_embedding import SentenceEmbeddingTransformer
# Use a prompt for asymmetric search
embed = SentenceEmbeddingTransformer(
    prompt="search_query: ",
    precision="int8",
    truncate_dim=128,
)
embeddings = embed.fit_transform(["How do I reset my password?"])
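Conceptually, the prompt option in the example above amounts to prepending a fixed string to every input before encoding. The sketch below shows that idea with a hypothetical helper, apply_prompt. Treating it as plain string concatenation is an assumption for illustration, since the underlying model may handle prompts internally.

```python
from typing import List, Optional, Sequence


def apply_prompt(texts: Sequence[str], prompt: Optional[str] = None) -> List[str]:
    """Prepend the same prompt to every sentence (illustrative helper only)."""
    if prompt is None:
        return list(texts)
    return [prompt + text for text in texts]


queries = apply_prompt(["How do I reset my password?"], prompt="search_query: ")
print(queries)  # ['search_query: How do I reset my password?']
```

In asymmetric search setups, queries and documents are typically embedded with different prompts, so a second transformer instance with a document-side prompt would usually encode the corpus.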