Unsupervised

The unsupervised module contains methods that calculate and visualize the evaluation performance of an unsupervised model.

Mostly inspired by the Interpret Results section of the Clustering module in Google’s Machine Learning Crash Course. See the course for more information.

Plot Cluster Cardinality

unsupervised.plot_cluster_cardinality(labels: numpy.ndarray, *, ax: Optional[matplotlib.axes._axes.Axes] = None, **kwargs) → matplotlib.axes._axes.Axes[source]

Cluster cardinality is the number of examples per cluster. This method plots the number of points per cluster as a bar chart.

Parameters:
  • labels – Labels of each point.
  • ax – Axes object to draw the plot onto, otherwise uses the current Axes.
  • kwargs – All other keyword arguments are passed to matplotlib.axes.Axes.pcolormesh().

Returns:

Returns the Axes object with the plot drawn onto it.
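
Cluster cardinality is simply a per-label count; the same numbers the bar chart displays can be computed directly with numpy (a minimal sketch with made-up labels, not part of the library):

```python
import numpy as np

# hypothetical labels for 10 points assigned to 3 clusters
labels = np.array([0, 0, 1, 2, 2, 2, 1, 0, 2, 1])

# number of examples assigned to each cluster
cardinality = np.bincount(labels)
print(cardinality)  # [3 3 4]
```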

In the following examples we are going to use the iris dataset from scikit-learn, so first let’s import it:

from sklearn import datasets


iris = datasets.load_iris()
x = iris.data

We’ll fit a simple K-Means model with k=8 and plot how many points go to each cluster:

from matplotlib import pyplot
from sklearn.cluster import KMeans

from ds_utils.unsupervised import plot_cluster_cardinality


estimator = KMeans(n_clusters=8, random_state=42)
estimator.fit(x)

plot_cluster_cardinality(estimator.labels_)

pyplot.show()

And the following image will be shown:

Cluster Cardinality

Plot Cluster Magnitude

unsupervised.plot_cluster_magnitude(X: numpy.ndarray, labels: numpy.ndarray, cluster_centers: numpy.ndarray, distance_function: Callable[[numpy.ndarray, numpy.ndarray], float], *, ax: Optional[matplotlib.axes._axes.Axes] = None, **kwargs) → matplotlib.axes._axes.Axes[source]

Cluster magnitude is the sum of distances from all examples to the centroid of the cluster. This method plots the Total Point-to-Centroid Distance per cluster as a bar chart.

Parameters:
  • X – Training instances.
  • labels – Labels of each point.
  • cluster_centers – Coordinates of cluster centers.
  • distance_function – The function used to calculate the distance from an instance to its cluster center. It receives two ndarrays, the instance and the center, and returns a float representing the distance between them.
  • ax – Axes object to draw the plot onto, otherwise uses the current Axes.
  • kwargs – All other keyword arguments are passed to matplotlib.axes.Axes.pcolormesh().

Returns:

Returns the Axes object with the plot drawn onto it.
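
The quantity being plotted can be reproduced with plain numpy and scipy: for each cluster, sum the distances from its member points to its center. Here `cluster_magnitudes` is a hypothetical helper with made-up data, not part of the library:

```python
import numpy as np
from scipy.spatial.distance import euclidean


def cluster_magnitudes(X, labels, centers):
    """Sum of point-to-centroid distances for each cluster."""
    return np.array([
        sum(euclidean(point, centers[k]) for point in X[labels == k])
        for k in range(len(centers))
    ])


# two tight clusters on a line: members at distance 0.5 and 1.0 from their centers
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[0.5, 0.0], [11.0, 0.0]])

print(cluster_magnitudes(X, labels, centers))  # [1. 2.]
```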

Again, we’ll fit a simple K-Means model with k=8. This time we’ll plot the sum of distances from points to their centroid:

from matplotlib import pyplot
from sklearn.cluster import KMeans
from scipy.spatial.distance import euclidean

from ds_utils.unsupervised import plot_cluster_magnitude



estimator = KMeans(n_clusters=8, random_state=42)
estimator.fit(x)

plot_cluster_magnitude(x, estimator.labels_, estimator.cluster_centers_, euclidean)

pyplot.show()

And the following image will be shown:

Plot Cluster Magnitude
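
Any callable matching the (instance, center) → float signature can serve as distance_function; for example, scipy’s cityblock (Manhattan) distance could be swapped in for euclidean. A quick check of the signature:

```python
from scipy.spatial.distance import cityblock

# Manhattan distance: |0 - 1| + |0 - 2| = 3.0 — matches the
# (instance, center) -> float signature that distance_function expects
print(cityblock([0.0, 0.0], [1.0, 2.0]))  # 3.0
```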

Magnitude vs. Cardinality

unsupervised.plot_magnitude_vs_cardinality(X: numpy.ndarray, labels: numpy.ndarray, cluster_centers: numpy.ndarray, distance_function: Callable[[numpy.ndarray, numpy.ndarray], float], *, ax: Optional[matplotlib.axes._axes.Axes] = None, **kwargs) → matplotlib.axes._axes.Axes[source]

Higher cluster cardinality tends to result in a higher cluster magnitude, which intuitively makes sense. Clusters are anomalous when cardinality doesn’t correlate with magnitude relative to the other clusters. Find anomalous clusters by plotting magnitude against cardinality as a scatter plot.

Parameters:
  • X – Training instances.
  • labels – Labels of each point.
  • cluster_centers – Coordinates of cluster centers.
  • distance_function – The function used to calculate the distance from an instance to its cluster center. It receives two ndarrays, the instance and the center, and returns a float representing the distance between them.
  • ax – Axes object to draw the plot onto, otherwise uses the current Axes.
  • kwargs – All other keyword arguments are passed to matplotlib.axes.Axes.pcolormesh().

Returns:

Returns the Axes object with the plot drawn onto it.

Now let’s plot Magnitude vs. Cardinality:

from matplotlib import pyplot
from sklearn.cluster import KMeans
from scipy.spatial.distance import euclidean

from ds_utils.unsupervised import plot_magnitude_vs_cardinality



estimator = KMeans(n_clusters=8, random_state=42)
estimator.fit(x)

plot_magnitude_vs_cardinality(x, estimator.labels_, estimator.cluster_centers_, euclidean)

pyplot.show()

And the following image will be shown:

Magnitude vs. Cardinality
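
The same anomaly can be checked numerically: since magnitude should scale with cardinality, a cluster whose mean point-to-centroid distance (magnitude divided by cardinality) is far from the others is suspect. A sketch with made-up per-cluster totals, not library code:

```python
import numpy as np

# hypothetical per-cluster totals for four clusters
cardinality = np.array([50, 48, 7, 52])
magnitude = np.array([60.0, 55.0, 40.0, 58.0])

# mean distance to centroid per cluster; cluster 2 stands out
# (~5.7 versus ~1.1-1.2 for the rest)
mean_distance = magnitude / cardinality
print(mean_distance)
```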

Optimum Number of Clusters

unsupervised.plot_loss_vs_cluster_number(X: numpy.ndarray, k_min: int, k_max: int, distance_function: Callable[[numpy.ndarray, numpy.ndarray], float], *, algorithm_parameters: Dict[str, Any] = None, ax: Optional[matplotlib.axes._axes.Axes] = None, **kwargs) → matplotlib.axes._axes.Axes[source]

k-means requires you to decide on the number of clusters k beforehand. This method runs the k-means algorithm with an increasing number of clusters, using the total magnitude (the sum of point-to-centroid distances) as the loss.

Right now the method only works with sklearn.cluster.KMeans.

Parameters:
  • X – Training instances.
  • k_min – The minimum cluster number.
  • k_max – The maximum cluster number.
  • distance_function – The function used to calculate the distance from an instance to its cluster center. It receives two ndarrays, the instance and the center, and returns a float representing the distance between them.
  • algorithm_parameters – Parameters to use for the algorithm. If None, the default parameters of KMeans will be used.
  • ax – Axes object to draw the plot onto, otherwise uses the current Axes.
  • kwargs – All other keyword arguments are passed to matplotlib.axes.Axes.pcolormesh().

Returns:

Returns the Axes object with the plot drawn onto it.
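
The loop the method runs can be sketched directly. This version uses scikit-learn’s KMeans.inertia_ (sum of squared distances) as the loss, which differs from the method’s distance_function-based sum but shows the same decreasing pattern; it is a sketch of the idea, not the library’s implementation:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

x = load_iris().data
k_range = range(2, 10)

# fit k-means for each candidate k and record the loss
losses = [
    KMeans(n_clusters=k, random_state=42, n_init=10).fit(x).inertia_
    for k in k_range
]

# loss decreases as k grows; look for the "elbow" where it flattens out
print(dict(zip(k_range, np.round(losses, 1))))
```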

The final plot we can use is Loss vs. Cluster Number:

from matplotlib import pyplot
from scipy.spatial.distance import euclidean

from ds_utils.unsupervised import plot_loss_vs_cluster_number



plot_loss_vs_cluster_number(x, 3, 20, euclidean)

pyplot.show()

And the following image will be shown:

Optimum Number of Clusters