MultiLabelBinarizerTransformer
- class ds_utils.transformers.multi_label_binarizer.MultiLabelBinarizerTransformer(*, classes: None | ndarray | Sequence[Any] = None, sparse_output: bool = False)
Wrap sklearn.preprocessing.MultiLabelBinarizer for sklearn pipelines.
Learns a binary indicator matrix for multi-label data. Unlike using MultiLabelBinarizer alone, this class implements get_feature_names_out (feature names API, SLEP007) and returns dense float64 output for downstream steps.
Pass one iterable of labels per sample. A flat list of strings is invalid: scikit-learn would iterate over each string and treat every character as a separate label. See MultiLabelBinarizer.
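The flat-list pitfall can be checked against scikit-learn's own MultiLabelBinarizer. The sketch below uses only the inner class, not this wrapper, and the variable names are illustrative.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Wrong: a flat list of strings. Each string is itself an iterable,
# so its characters are collected as labels.
flat = ["action", "romance"]
flat_classes = list(MultiLabelBinarizer().fit(flat).classes_)

# Right: one iterable of labels per sample.
nested = [["action"], ["romance"]]
nested_classes = list(MultiLabelBinarizer().fit(nested).classes_)

print(flat_classes)    # single characters such as 'a', 'c', 'e', ...
print(nested_classes)  # ['action', 'romance']
```

Wrapping each sample's labels in a list, even when a sample has a single label, avoids the problem.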
- Parameters:
classes – Optional fixed ordering of class labels (passed to MultiLabelBinarizer).
sparse_output – If True, the inner binarizer may use sparse storage; transform() still returns a dense float64 ndarray.
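As a rough illustration of these two parameters, the sketch below calls scikit-learn's MultiLabelBinarizer directly. The densifying step is an assumption about how a wrapper could guarantee dense float64 output; it is not taken from this class's implementation.

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

X = [["sci-fi", "action"], ["romance"]]

# classes fixes the column order instead of the default sorted order.
inner = MultiLabelBinarizer(classes=["sci-fi", "romance", "action"], sparse_output=True)
sparse = inner.fit_transform(X)  # SciPy sparse matrix

# Assumed wrapper behaviour: convert to a dense float64 ndarray.
dense = np.asarray(sparse.toarray(), dtype=np.float64)
print(dense)
```

With the fixed ordering above, the columns correspond to sci-fi, romance, and action, in that order.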
- Variables:
mlb – Fitted MultiLabelBinarizer instance (set after fit()).
n_features_in_ – Number of input features (always 1: one multi-label column).
- fit(X: ndarray | Series | DataFrame | Sequence[Any], y: Any = None) -> MultiLabelBinarizerTransformer
Learn label sets from training multi-label data.
- Parameters:
X – Array-like of shape (n_samples,) or (n_samples, 1), or a wide 2D layout where each row is one sample and each column entry is a label for that sample.
y – Ignored; present for sklearn API compatibility.
- Returns:
This estimator, fitted.
- get_feature_names_out(input_features: None | ndarray | List[str] = None) -> ndarray
Return output feature names for this transformation.
Names follow {prefix}_{sanitized_class}. If input_features is omitted, the prefix is "label"; otherwise the prefix is the first validated input feature name.
- Parameters:
input_features – Names for the input column(s), or None. When provided, length must match n_features_in_.
- Returns:
numpy.ndarray of shape (n_classes,), dtype object, of output names.
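The naming rule described above can be sketched in a few lines. The function sketch_feature_names is hypothetical and not part of ds_utils. The sanitization step is not specified here, so this sketch only converts each class to a string and leaves it otherwise unchanged.

```python
import numpy as np

def sketch_feature_names(classes, input_features=None):
    """Hypothetical re-creation of the {prefix}_{class} naming rule."""
    # Default prefix is "label"; otherwise use the first input feature name.
    prefix = "label" if input_features is None else str(input_features[0])
    # Assumption: no real sanitization, only str() conversion.
    return np.array([f"{prefix}_{c}" for c in classes], dtype=object)

default_names = sketch_feature_names(["action", "comedy"])
tagged_names = sketch_feature_names(["action", "comedy"], input_features=["tags"])
print(default_names)  # ['label_action' 'label_comedy']
print(tagged_names)   # ['tags_action' 'tags_comedy']
```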
Code Examples
The following examples show the three main ways to use MultiLabelBinarizerTransformer.
Direct usage:
from ds_utils.transformers.multi_label_binarizer import MultiLabelBinarizerTransformer
X = [["sci-fi", "action"], ["romance"], ["action", "comedy"]]
mlb = MultiLabelBinarizerTransformer()
X_t = mlb.fit_transform(X)
names = mlb.get_feature_names_out()
X_t is a numpy array of shape (n_samples, n_classes), dtype float64, with columns
corresponding to names (e.g. ['label_action', 'label_comedy', 'label_romance', 'label_sci-fi']).
The output will be:
| label_action | label_comedy | label_romance | label_sci-fi |
|---|---|---|---|
| 1.0 | 0.0 | 0.0 | 1.0 |
| 0.0 | 0.0 | 1.0 | 0.0 |
| 1.0 | 1.0 | 0.0 | 0.0 |
Pipeline usage with pandas output:
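from ds_utils.transformers.multi_label_binarizer import MultiLabelBinarizerTransformer
from sklearn.pipeline import Pipeline
# Same multi-label input as in the direct-usage example: one list of labels per sample.
X = [["sci-fi", "action"], ["romance"], ["action", "comedy"]]
pipe = Pipeline([("mlb", MultiLabelBinarizerTransformer())])
# Request DataFrame output; column names come from get_feature_names_out().
pipe.set_output(transform="pandas")
df = pipe.fit_transform(X)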
ColumnTransformer usage:
from ds_utils.transformers.multi_label_binarizer import MultiLabelBinarizerTransformer
from sklearn.compose import ColumnTransformer
import pandas as pd
df = pd.DataFrame({"tags": [["x", "y"], ["z"]], "num": [1.0, 2.0]})
pre = ColumnTransformer(
    [("mlb", MultiLabelBinarizerTransformer(), ["tags"])],
    remainder="passthrough",
)
X_out = pre.fit_transform(df)