MultiLabelBinarizerTransformer

class ds_utils.transformers.multi_label_binarizer.MultiLabelBinarizerTransformer(*, classes: None | ndarray | Sequence[Any] = None, sparse_output: bool = False)

Wrap sklearn.preprocessing.MultiLabelBinarizer for sklearn pipelines.

Learns the label vocabulary from multi-label data and encodes each sample as a row of a binary indicator matrix. Unlike using MultiLabelBinarizer alone, this class implements get_feature_names_out (feature names API, SLEP007) and returns dense float64 output for downstream steps.

Pass one iterable of labels per sample. A flat list of strings is invalid: each string is itself iterable, so scikit-learn would treat every character as a separate label. See MultiLabelBinarizer.
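The flat-list pitfall follows from how Python iterates strings, so it can be illustrated without scikit-learn. The sketch below is a minimal illustration, assuming the vocabulary is built by flattening every sample and sorting the unique labels; learn_classes is a hypothetical helper, not part of ds_utils or scikit-learn.

```python
from itertools import chain


def learn_classes(samples):
    """Hypothetical helper: flatten every sample and return the sorted unique labels."""
    return sorted(set(chain.from_iterable(samples)))


flat = ["action", "drama"]            # invalid: each string is iterated character by character
nested = [[label] for label in flat]  # valid: one iterable of labels per sample

flat_classes = learn_classes(flat)      # single characters, not genre names
nested_classes = learn_classes(nested)  # ['action', 'drama']
```

Wrapping each string in its own list, as in nested, is usually the simplest fix when every sample carries exactly one label.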

Parameters:
  • classes – Optional fixed ordering of class labels (passed to MultiLabelBinarizer).

  • sparse_output – If True, the inner binarizer may use sparse storage; transform() still returns a dense float64 ndarray.

Variables:
  • mlb – Fitted MultiLabelBinarizer instance (set after fit()).

  • n_features_in_ – Number of input features (always 1: one multi-label column).

fit(X: ndarray | Series | DataFrame | Sequence[Any], y: Any = None) → MultiLabelBinarizerTransformer

Learn label sets from training multi-label data.

Parameters:
  • X – Array-like of shape (n_samples,) or (n_samples, 1), or a wide 2D layout where each row is one sample and each column entry is a label for that sample.

  • y – Ignored; present for sklearn API compatibility.

Returns:

This estimator, fitted.

get_feature_names_out(input_features: None | ndarray | List[str] = None) → ndarray

Return output feature names for this transformation.

Names follow {prefix}_{sanitized_class}. If input_features is omitted, the prefix is "label"; otherwise the prefix is the first validated input feature name.

Parameters:

input_features – Names for the input column(s), or None. When provided, length must match n_features_in_.

Returns:

numpy.ndarray of shape (n_classes,), dtype object, of output names.
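The naming rule can be sketched in plain Python. This is an illustration only, not the library's implementation: build_feature_names and its sanitize parameter are hypothetical, the real sanitization rules are not documented here (the default below only converts to str), raising ValueError on a length mismatch is an assumption, and the sketch returns a list rather than an object-dtype ndarray.

```python
def build_feature_names(classes, input_features=None, sanitize=str):
    """Hypothetical sketch of the {prefix}_{sanitized_class} naming scheme.

    `sanitize` stands in for the library's sanitization step, whose exact
    rules are not documented here; the default only converts to str.
    """
    if input_features is None:
        prefix = "label"
    else:
        # n_features_in_ is always 1, so exactly one input name is expected.
        # Raising ValueError here is an assumption about the error type.
        if len(input_features) != 1:
            raise ValueError("input_features must contain exactly one name")
        prefix = str(input_features[0])
    return [f"{prefix}_{sanitize(c)}" for c in classes]


classes = ["action", "comedy", "romance", "sci-fi"]
default_names = build_feature_names(classes)
# ['label_action', 'label_comedy', 'label_romance', 'label_sci-fi']
tagged_names = build_feature_names(classes, input_features=["tags"])
# ['tags_action', 'tags_comedy', 'tags_romance', 'tags_sci-fi']
```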

transform(X: ndarray | Series | DataFrame | Sequence[Any]) → ndarray

Binarize multi-label data using the vocabulary learned in fit().

Parameters:

X – Same layout as for fit().

Returns:

Binary indicator matrix of shape (n_samples, n_classes), dtype float64.
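To make the encoding concrete, here is a pure-Python sketch of how a fixed vocabulary maps samples to indicator rows. It is not the library's code: binarize is a hypothetical helper, it returns nested lists instead of a float64 ndarray, and skipping labels that were not seen during fit is an assumption.

```python
def binarize(samples, classes):
    """Hypothetical pure-Python equivalent of the indicator-matrix encoding."""
    index = {label: col for col, label in enumerate(classes)}
    matrix = []
    for labels in samples:
        row = [0.0] * len(classes)
        for label in labels:
            col = index.get(label)
            if col is not None:  # assumption: labels unseen during fit are skipped
                row[col] = 1.0
        matrix.append(row)
    return matrix


classes = ["action", "comedy", "romance", "sci-fi"]  # vocabulary learned in fit()
X = [["sci-fi", "action"], ["romance"], ["action", "comedy"]]
X_t = binarize(X, classes)
# X_t == [[1.0, 0.0, 0.0, 1.0],
#         [0.0, 0.0, 1.0, 0.0],
#         [1.0, 1.0, 0.0, 0.0]]
```

Each row has one entry per class, so the result has shape (n_samples, n_classes) regardless of how many labels an individual sample carries.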

Code Examples

The following examples show the three main ways to use MultiLabelBinarizerTransformer.

Direct usage:

from ds_utils.transformers.multi_label_binarizer import MultiLabelBinarizerTransformer

X = [["sci-fi", "action"], ["romance"], ["action", "comedy"]]
mlb = MultiLabelBinarizerTransformer()
X_t = mlb.fit_transform(X)
names = mlb.get_feature_names_out()

X_t is a numpy array of shape (n_samples, n_classes), dtype float64, with columns corresponding to names (e.g. ['label_action', 'label_comedy', 'label_romance', 'label_sci-fi']). The output will be:

label_action  label_comedy  label_romance  label_sci-fi
1.0           0.0           0.0            1.0
0.0           0.0           1.0            0.0
1.0           1.0           0.0            0.0

Pipeline usage with pandas output:

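from ds_utils.transformers.multi_label_binarizer import MultiLabelBinarizerTransformer
from sklearn.pipeline import Pipeline

# One iterable of labels per sample
X = [["sci-fi", "action"], ["romance"], ["action", "comedy"]]

pipe = Pipeline([("mlb", MultiLabelBinarizerTransformer())])
pipe.set_output(transform="pandas")  # column names come from get_feature_names_out()
df = pipe.fit_transform(X)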

ColumnTransformer usage:

from ds_utils.transformers.multi_label_binarizer import MultiLabelBinarizerTransformer
from sklearn.compose import ColumnTransformer
import pandas as pd

df = pd.DataFrame({"tags": [["x", "y"], ["z"]], "num": [1.0, 2.0]})
pre = ColumnTransformer(
    [("mlb", MultiLabelBinarizerTransformer(), ["tags"])],
    remainder="passthrough",
)
X_out = pre.fit_transform(df)