Performance Metrics for Classification in Machine Learning

Using the right performance metric for the right task

Model performance evaluation is a crucial step in any machine learning workflow. Evaluation metrics tell us whether the model is ready for production or still needs parameter fine-tuning.

But how do you choose the appropriate metric for your classification use case?

In this article, we will explain the performance measures that are most applicable to classification tasks.

Metrics we will cover:

  1. Confusion matrix
  2. Accuracy
  3. Recall
  4. Precision
  5. F-beta score
  6. AUC-ROC curve

Before we explain performance metrics, we need to explain the confusion matrix.

Confusion matrix:

It is a table that summarises the performance of a classification algorithm with 4 different combinations of predicted and actual values: TP, TN, FP, FN.

Figure: calculation of Precision, Recall, and Accuracy from the confusion matrix.

These terms are explained as follows:

True Positive (TP) is an outcome where the model correctly predicts the positive class.

True Negative (TN) is an outcome where the model correctly predicts the negative class.

False Positive (FP) is an outcome where the model incorrectly predicts the positive class.

False Negative (FN) is an outcome where the model incorrectly predicts the negative class.

Based on those 4 terms, we determine other performance metrics such as Accuracy, Recall, Precision, Specificity, and AUC-ROC curves.
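
As a quick illustration, here is a minimal sketch (assuming scikit-learn and small made-up label vectors) that builds a confusion matrix and reads off TP, TN, FP, and FN:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground-truth labels and model predictions (1 = positive class)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=3, TN=3, FP=1, FN=1
```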

Accuracy:

The most fundamental statistic for measuring classification is accuracy, which is defined as the ratio of correct predictions to the total number of samples in the dataset.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

or slightly simplified:

Accuracy = Correct Predictions / All Predictions

However, on imbalanced datasets (skewed class distributions), this measure can be misleading, because it does not reflect the predictive capacity on the minority class.
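
To make that caveat concrete, here is a small sketch (the labels below are made up for illustration) where a model that always predicts the majority class still scores a high accuracy:

```python
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced dataset: 9 negatives, 1 positive
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
# A "model" that blindly predicts the majority (negative) class every time
y_pred = [0] * 10

# Accuracy looks strong even though the minority class is never detected
print(accuracy_score(y_true, y_pred))  # 0.9
```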

Recall / sensitivity / True Positive Rate:

The recall is intuitively the ability of the classifier to find all the positive samples.

Mathematically, recall is defined as follows: Recall = TP / (TP + FN)


Recall is used when FN is more costly than FP and we want to minimize FN, e.g. when predicting a disease.

For example, in a heart disease prediction problem, for all the patients who actually have heart disease, recall tells us how many we correctly identified as having heart disease.
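
A minimal sketch of that idea, with hypothetical labels (1 = has heart disease) purely for illustration:

```python
from sklearn.metrics import recall_score

# Hypothetical labels: 1 = patient has heart disease, 0 = healthy
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0]

# Recall = TP / (TP + FN): the share of sick patients we actually caught
print(recall_score(y_true, y_pred))  # 0.75 -> 3 of the 4 sick patients identified
```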

Precision:

Precision is the ratio of true positives to all predicted positives (TP + FP).

Precision = TP / (TP + FP)

We use precision when FP is more costly than FN and we want to minimize FP, e.g. detecting spam.

In the case of spam detection, precision answers the following question: what proportion of the emails predicted as spam was actually spam?

Maximizing precision will minimize the false-positive errors, whereas maximizing recall will minimize the false-negative errors.
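
The spam example can be sketched the same way; the labels below are hypothetical (1 = spam), and recall is printed alongside precision to show the trade-off on the same predictions:

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: 1 = spam, 0 = legitimate email
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0]

# Precision = TP / (TP + FP): of the emails flagged as spam, how many really were spam
print(precision_score(y_true, y_pred))  # ~0.67 -> 2 of the 3 flagged emails were spam
# Recall = TP / (TP + FN) on the same predictions
print(recall_score(y_true, y_pred))     # ~0.67 -> 2 of the 3 spam emails were caught
```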

F-beta score:

The F-beta score is the weighted harmonic mean of precision and recall, reaching its optimal value at 1 and its worst value at 0. The beta parameter determines the weight of recall in the combined score.

It is a useful metric when both precision and recall (both FP and FN) are important but slightly more attention is needed on one or the other; we choose beta depending on the relative importance of FP and FN, as shown in the sketch after the list below.

Mathematically, the F-beta score is defined as follows: F-beta = (1 + beta^2) * (Precision * Recall) / (beta^2 * Precision + Recall)

Three common values for the beta parameter are as follows:

  • F0.5-Measure (beta=0.5): More weight on precision, less weight on recall.
  • F1-Measure (beta=1.0): Balance the weight on precision and recall.
  • F2-Measure (beta=2.0): Less weight on precision, more weight on recall.
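
A minimal sketch of these three variants using scikit-learn's fbeta_score (the labels are made up so that precision is higher than recall, which is why the scores decrease as beta grows):

```python
from sklearn.metrics import fbeta_score

# Hypothetical labels and predictions: precision = 0.67, recall = 0.5
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

# beta < 1 favours precision, beta > 1 favours recall
for beta in (0.5, 1.0, 2.0):
    print(f"F{beta} = {fbeta_score(y_true, y_pred, beta=beta):.3f}")
# Prints roughly F0.5 = 0.625, F1.0 = 0.571, F2.0 = 0.526
```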

AUC-ROC Curve

The AUC-ROC curve is a performance measure for classification problems at various threshold settings.

Let’s start by explaining the ROC curve:

A ROC curve plots TPR vs. FPR at different classification thresholds in binary classification.


The ROC curve plots two parameters:

  • True Positive Rate
  • False Positive Rate

True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows: TPR = TP / (TP + FN)

False Positive Rate (FPR) is defined as follows: FPR = FP / (FP + TN)

Let’s explain AUC:

AUC: Area Under the ROC Curve

AUC stands for “Area Under the ROC Curve” and is used as a summary of the ROC curve. It measures the entire two-dimensional area underneath the ROC curve from (0,0) to (1,1).

It tells us how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at predicting 0 classes as 0 and 1 classes as 1.

An excellent model has an AUC near 1, which means it has a good measure of separability. A model with an AUC near 0.5 has no separability and performs no better than random guessing, while an AUC near 0 means the model is predicting the classes in reverse.
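
Here is a minimal sketch of how TPR, FPR, and AUC are typically obtained from predicted probabilities (the scores below are made up for illustration):

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and predicted probabilities for the positive class
y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]

# roc_curve sweeps the decision threshold and returns matching FPR/TPR pairs
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(list(zip(fpr, tpr)))

# AUC summarises the whole curve in a single number between 0 and 1
print(roc_auc_score(y_true, y_score))  # 0.8125 for these made-up scores
```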

AUC-ROC for Multi-Class Classification

We can extend this method to multiclass classification problems by using the One-vs-Rest or One-vs-One technique.

Here is an example using the One-vs-Rest technique: we plot a ROC curve for each class against the rest and then calculate the AUC in each case.

Figure: One-vs-Rest ROC curves for a multiclass problem.
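
A minimal sketch of the One-vs-Rest computation, assuming a placeholder dataset and model just to obtain per-class probabilities (both are stand-ins for illustration, not part of the original example):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Placeholder data and model, used only to get per-class probabilities
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
proba = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)

# One-vs-Rest: each class is scored against "the rest", then the AUCs are averaged
print(roc_auc_score(y_test, proba, multi_class="ovr"))
```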

AUC-ROC score vs F-beta score:

In general, the ROC curve is computed over many different threshold levels, and each threshold corresponds to a different F-beta score; the F-beta score applies to one particular point on the ROC curve.

You may think of the F-beta score as a measure of precision and recall at a particular threshold value, whereas AUC is the area under the entire ROC curve. For the F-beta score to be high, both precision and recall should be high.

Consequently, when you have a data imbalance between positive and negative samples, you should prefer the F-beta score, because the ROC averages over all possible thresholds [source].
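
To see the threshold dependence concretely, here is a small sketch (scores are made up) where the F1 score changes as the decision threshold moves while the AUC stays the same:

```python
from sklearn.metrics import f1_score, roc_auc_score

# Hypothetical true labels and predicted probabilities
y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]

# AUC is computed from the scores directly and does not depend on a threshold
print("AUC:", roc_auc_score(y_true, y_score))  # 0.8125

# F1 needs hard predictions, so its value depends on where we cut the scores
for threshold in (0.3, 0.5, 0.7):
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    print(f"F1 at threshold {threshold}: {f1_score(y_true, y_pred):.3f}")
# Prints roughly 0.800, 0.750 and 0.857
```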

Conclusion

Based on factors such as how balanced our dataset is and the relative importance of FN and FP, we can choose between performance metrics such as accuracy, recall, precision, the F-beta score, and the AUC-ROC curve.