Metrics
The metrics module contains methods that help calculate and visualize the evaluation performance of an algorithm.
Plot Confusion Matrix
Code Examples
In the following examples we are going to use the iris dataset from scikit-learn, so first let's import it:
import numpy as np
from sklearn import datasets
IRIS = datasets.load_iris()
RANDOM_STATE = np.random.RandomState(0)
Next we’ll add a small function to add noise:
def _add_noisy_features(x, random_state):
    n_samples, n_features = x.shape
    return np.c_[x, random_state.randn(n_samples, 200 * n_features)]
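As a quick sanity check (with the imports above repeated so the snippet runs on its own), the helper widens the feature matrix by appending 200 noise columns per original feature:

```python
import numpy as np
from sklearn import datasets

IRIS = datasets.load_iris()
RANDOM_STATE = np.random.RandomState(0)


def _add_noisy_features(x, random_state):
    n_samples, n_features = x.shape
    return np.c_[x, random_state.randn(n_samples, 200 * n_features)]


noisy = _add_noisy_features(IRIS.data, RANDOM_STATE)
print(noisy.shape)  # (150, 804): 4 original features plus 200 * 4 noise columns
```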
Binary Classification
We'll use only the first two classes in the iris dataset, build an SVM classifier and evaluate it:
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import svm
from ds_utils.metrics import plot_confusion_matrix
x = IRIS.data
y = IRIS.target
# Add noisy features
x = _add_noisy_features(x, RANDOM_STATE)
# Limit to the two first classes, and split into training and test
x_train, x_test, y_train, y_test = train_test_split(x[y < 2], y[y < 2], test_size=.5,
random_state=RANDOM_STATE)
# Create a simple classifier
classifier = svm.LinearSVC(random_state=RANDOM_STATE)
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)
plot_confusion_matrix(y_test, y_pred, [1, 0])
plt.show()
And the following image will be shown:
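If you want the raw counts behind the plot, scikit-learn's standard `confusion_matrix` function (plain sklearn, not part of ds_utils) produces the same numbers; here it is on a tiny hand-made example with the same `[1, 0]` label order:

```python
from sklearn.metrics import confusion_matrix

# Tiny hand-made example: rows are true labels, columns are predictions,
# ordered according to labels=[1, 0]
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1]
matrix = confusion_matrix(y_true, y_pred, labels=[1, 0])
print(matrix)  # [[2 1]
               #  [1 2]]
```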
Multi-Class Classification
This time we’ll train on all the classes and plot an evaluation:
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn import svm
from ds_utils.metrics import plot_confusion_matrix
x = IRIS.data
y = IRIS.target
# Add noisy features
x = _add_noisy_features(x, RANDOM_STATE)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.5, random_state=RANDOM_STATE)
# Create a simple classifier
classifier = OneVsRestClassifier(svm.LinearSVC(random_state=RANDOM_STATE))
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)
plot_confusion_matrix(y_test, y_pred, [0, 1, 2])
plt.show()
And the following image will be shown:
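Alongside the matrix, a per-class breakdown of precision and recall can be printed with scikit-learn's `classification_report` (again standard sklearn, shown on a tiny hand-made three-class example):

```python
from sklearn.metrics import classification_report

# Tiny hand-made example: true labels vs. predictions over three classes
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
report = classification_report(y_true, y_pred, labels=[0, 1, 2])
print(report)
```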
Plot Metric Growth per Labeled Instances
Code Example
In this example we’ll divide the data into train and test sets, decide on which classifiers we want to measure and plot the results:
from matplotlib import pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from ds_utils.metrics import plot_metric_growth_per_labeled_instances
x = IRIS.data
y = IRIS.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.3, random_state=0)
plot_metric_growth_per_labeled_instances(x_train, y_train, x_test, y_test,
{"DecisionTreeClassifier":
DecisionTreeClassifier(random_state=0),
"RandomForestClassifier":
RandomForestClassifier(random_state=0, n_estimators=5)})
plt.show()
And the following image will be shown:
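The idea behind the plot — train each classifier on increasingly large portions of the labeled training data and score it on a fixed test set — can be sketched with plain scikit-learn. This is a conceptual illustration under the assumption of simple prefix subsets, not the ds_utils internals:

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

# Train on growing prefixes of the (shuffled) training set and track accuracy
scores = {}
for n in [20, 50, len(x_train)]:
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(x_train[:n], y_train[:n])
    scores[n] = accuracy_score(y_test, clf.predict(x_test))
print(scores)
```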
Visualize Accuracy Grouped by Probability
This method was created due to the lack of maintenance of the EthicalML/xai package.
Code Example
The example uses a small sample of a dataset from Kaggle, in which a dummy bank provides loans.
Let’s see how to use the code:
import pandas
from matplotlib import pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from ds_utils.metrics import visualize_accuracy_grouped_by_probability
loan_data = pandas.read_csv("path/to/dataset", encoding="latin1", nrows=11000,
                            parse_dates=["issue_d"]).drop("id", axis=1)
loan_data = loan_data.drop("application_type", axis=1)
loan_data = loan_data.sort_values("issue_d")
loan_data = pandas.get_dummies(loan_data)
train = (loan_data.head(int(loan_data.shape[0] * 0.7)).sample(frac=1)
         .reset_index(drop=True).drop("issue_d", axis=1))
test = loan_data.tail(int(loan_data.shape[0] * 0.3)).drop("issue_d", axis=1)
selected_features = ['emp_length_int', 'home_ownership_MORTGAGE', 'home_ownership_RENT',
'income_category_Low', 'term_ 36 months', 'purpose_debt_consolidation',
'purpose_small_business', 'interest_payments_High']
classifier = RandomForestClassifier(min_samples_leaf=int(train.shape[0] * 0.01),
class_weight="balanced",
n_estimators=1000, random_state=0)
classifier.fit(train[selected_features], train["loan_condition_cat"])
probabilities = classifier.predict_proba(test[selected_features])
visualize_accuracy_grouped_by_probability(test["loan_condition_cat"], 1, probabilities[:, 1],
display_breakdown=False)
plt.show()
And the following image will be shown:
If we choose to display the breakdown:
visualize_accuracy_grouped_by_probability(test["loan_condition_cat"], 1, probabilities[:, 1],
display_breakdown=True)
plt.show()
And the following image will be shown:
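The underlying idea — bucket the predicted probabilities and measure accuracy within each bucket — can be sketched with plain pandas on synthetic data. This is an illustration of the concept under assumed 0.1-wide buckets and a 0.5 decision threshold, not the ds_utils implementation:

```python
import numpy as np
import pandas

rng = np.random.RandomState(0)
y_true = rng.randint(0, 2, size=1000)
# Synthetic probabilities loosely correlated with the true label
probabilities = np.clip(y_true * 0.5 + rng.rand(1000) * 0.6, 0, 1)

frame = pandas.DataFrame({"actual": y_true, "probability": probabilities})
# Bucket probabilities into 0.1-wide intervals and predict with a 0.5 threshold
frame["bucket"] = pandas.cut(frame["probability"], bins=np.arange(0, 1.1, 0.1),
                             include_lowest=True)
frame["predicted"] = (frame["probability"] >= 0.5).astype(int)
frame["correct"] = (frame["actual"] == frame["predicted"]).astype(float)
accuracy_by_bucket = frame.groupby("bucket", observed=True)["correct"].mean()
print(accuracy_by_bucket)
```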