Model metrics

How to view, use, and compare model metrics.

Developer guide: Upload custom metrics to a model run

Model metrics help you analyze and compare predictions and annotations. They surface low-confidence predictions and areas of agreement or disagreement between model predictions and ground truths. This helps your machine learning team analyze model performance, conduct active learning, detect labeling mistakes, and find model errors.

While some metrics are auto-generated by Labelbox, you can also upload your own custom metrics.

How to use metrics

Metrics (whether auto-generated by Labelbox or uploaded by users) can be used to analyze model predictions and model performance. You can:

  • Analyze the distribution of metrics in the Metrics View
    • For example, you may want to analyze and compare metrics between data splits, between specific slices of data, or between classes
  • Filter and sort on metrics in the Gallery View
    • For example, you may want to surface low-confidence predictions, mispredictions, labeling mistakes, etc.

Auto-generated metrics

These metrics are computed on every feature and aggregated (arithmetic mean) on the data row. These auto-generated metrics are computed for all data rows that contain at least one prediction and at least one annotation.
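
As a minimal sketch of the aggregation described above, the data row value is the arithmetic mean of the per-feature values. The feature names and scores below are hypothetical, and the function is an illustration rather than Labelbox's implementation.

```python
from statistics import mean

def aggregate_data_row(feature_scores):
    """Average per-feature metric values into one data row value.

    feature_scores maps a (hypothetical) feature name to its metric value.
    """
    return mean(list(feature_scores.values()))

# Hypothetical per-feature IoU values for one data row.
row_score = aggregate_data_row({"cat": 0.9, "dog": 0.6, "bird": 0.3})
```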


Ontology schema limit

To view the maximum number of features per ontology allowed for auto metrics to work, visit our limits page.

Once you upload your model predictions to a model run, Labelbox will automatically compute the following metrics:

  • True positive
  • False positive
  • True negative
  • False negative
  • Precision
  • Recall
  • F1 score
  • Intersection over union (IoU)
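
For reference, precision, recall, and F1 score follow from the true positive, false positive, and false negative counts using their standard definitions. The sketch below is illustrative only; returning 0.0 when a denominator is empty is an assumption of this example, not documented Labelbox behavior.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) from raw counts.

    Assumption: a metric with an empty denominator is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Example counts: 8 true positives, 2 false positives, 4 false negatives.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
```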

Confusion matrix

Labelbox automatically generates a confusion matrix. The confusion matrix is designed to help you understand the performance of your model for every class. It also allows you to inspect examples of a specific misprediction.


Confusion matrices for classification models/classes

When working with a classification model, you must set the IoU threshold slider to 0 for the confusion matrix to generate properly.

Diagonal cells of the confusion matrix indicate true positive predictions by the model (i.e., the predicted feature matches the ground truth feature). Conversely, non-diagonal cells of the confusion matrix correspond to false positives and false negatives (i.e., the predicted feature does not match the ground truth feature).

The confusion matrix also includes an additional feature that is not a part of your model run ontology: the None feature. The None feature is useful in identifying predictions that were not matched to any annotation, as well as annotations that were not matched to any prediction.
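
The structure described above can be sketched as a count of (ground truth, prediction) pairs, where a missing side is recorded under the None feature. The pair format and the `confusion_counts` helper are hypothetical and only illustrate the layout of the matrix.

```python
from collections import Counter

NONE_FEATURE = "None"  # extra row/column for unmatched items

def confusion_counts(pairs):
    """Count (ground_truth_class, predicted_class) pairs.

    A Python None on either side marks an unmatched annotation or
    prediction and is recorded under the "None" feature.
    """
    counts = Counter()
    for gt, pred in pairs:
        row = gt if gt is not None else NONE_FEATURE
        col = pred if pred is not None else NONE_FEATURE
        counts[(row, col)] += 1
    return counts

pairs = [
    ("cat", "cat"),   # diagonal cell: true positive
    ("cat", "dog"),   # off-diagonal cell: misprediction
    ("dog", "dog"),
    (None, "dog"),    # prediction not matched to any annotation
    ("cat", None),    # annotation not matched to any prediction
]
matrix = confusion_counts(pairs)
```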

The confusion matrix is interactive. If you click on any matrix cell, it opens the gallery view of the model run and keeps only examples corresponding to this specific cell of the confusion matrix.

Precision-recall curve

Labelbox generates a precision-recall curve. It shows your model's precision and recall at every confidence threshold.

This precision-recall curve is critical for picking the optimal confidence threshold for your model. You can pick the balance between precision and recall (between false positives and false negatives) for your specific use case.

You can display the precision-recall curve for all features or a specific feature. This enables you to pick the optimal confidence threshold for your use case for every class.
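
To illustrate how such a curve is built, the sketch below sweeps a list of confidence thresholds and computes one (precision, recall) point per threshold. It makes simplifying assumptions: each prediction is already labeled correct or incorrect, and precision is reported as 1.0 when no prediction passes the threshold. The data is made up for the example.

```python
def pr_curve(predictions, num_ground_truths, thresholds):
    """Return a list of (threshold, precision, recall) points.

    predictions: list of (confidence, is_correct) tuples (hypothetical format).
    """
    points = []
    for t in thresholds:
        # Predictions below the threshold are ignored.
        kept = [ok for conf, ok in predictions if conf >= t]
        tp = sum(kept)
        fp = len(kept) - tp
        fn = num_ground_truths - tp
        precision = tp / (tp + fp) if kept else 1.0  # assumed convention
        recall = tp / (tp + fn) if num_ground_truths else 0.0
        points.append((t, precision, recall))
    return points

preds = [(0.95, True), (0.80, True), (0.60, False), (0.40, True), (0.20, False)]
curve = pr_curve(preds, num_ground_truths=4, thresholds=[0.0, 0.5, 0.9])
```

Raising the threshold in this example trades recall for precision, which is the balance the curve is meant to expose.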

Confidence thresholds and IoU thresholds


Confidence & IoU thresholds

The confidence threshold is between 0 and 1. Predictions with a confidence score lower than the confidence threshold will be ignored.

The IoU threshold is between 0 and 1. A True Positive is when a prediction and an annotation of the same class have an IoU that is higher than the selected IoU threshold.
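
As an illustration of the true positive rule above, the sketch below computes IoU for two axis-aligned bounding boxes and applies the threshold. The `(x_min, y_min, x_max, y_max)` box format and both function names are assumptions of this example.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred_class, pred_box, gt_class, gt_box, iou_threshold):
    """Same class and IoU strictly higher than the threshold."""
    return pred_class == gt_class and box_iou(pred_box, gt_box) > iou_threshold

iou = box_iou((0, 0, 10, 10), (5, 0, 15, 10))  # overlap 50, union 150
loose = is_true_positive("cat", (0, 0, 10, 10), "cat", (5, 0, 15, 10), 0.3)
strict = is_true_positive("cat", (0, 0, 10, 10), "cat", (5, 0, 15, 10), 0.5)
```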

Labelbox auto-generates metrics for several confidence thresholds and several IoU thresholds. This helps machine learning teams fine-tune the confidence threshold of their model and the IoU threshold for error analysis.

You can analyze model metrics for various confidence and IoU thresholds by changing them in the user interface. Modifying the thresholds shows how they affect the auto-generated metrics and the confusion matrix.

By default, Labelbox allows users to toggle between:

  • 11 values of confidence thresholds: 0, 0.1, 0.2, 0.3, ..., 0.7, 0.8, 0.9, 1
  • 11 values of IoU thresholds: 0, 0.1, 0.2, 0.3, ..., 0.7, 0.8, 0.9, 1

You can refine these thresholds to cover any range you want. For example, it is possible to explore the range of [0.5, 0.51, 0.52, 0.53, ..., 0.59, 0.6] for confidence thresholds.
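
If you want to prepare an evenly spaced list of values to enter in the threshold settings, a small helper like the one below can generate it. `threshold_range` is a hypothetical convenience function, not part of any Labelbox API.

```python
def threshold_range(start, stop, count):
    """Return `count` evenly spaced thresholds from start to stop, inclusive."""
    step = (stop - start) / (count - 1)
    # Round to 4 decimals to avoid floating-point noise such as 0.30000000000000004.
    return [round(start + i * step, 4) for i in range(count)]

default_values = threshold_range(0.0, 1.0, 11)   # 0, 0.1, ..., 1
refined_values = threshold_range(0.5, 0.6, 11)   # 0.5, 0.51, ..., 0.6
```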

To refine the range of thresholds, open the Display panel and click on the settings icon of the confidence threshold and/or IoU threshold. From there, you can customize or delete the 10 values taken by the threshold.

Access the threshold settings

Customize the 10 values taken by the confidence and/or IoU threshold


Absence of confidence score

If a model prediction is uploaded to a model run without a specified confidence score, it is treated as if it had a confidence score of 1.
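
In code terms, the rule above amounts to defaulting a missing confidence to 1 before applying the confidence threshold. The dictionary shape used for predictions here is hypothetical.

```python
def effective_confidence(prediction):
    """Return the prediction's confidence, treating a missing score as 1.0."""
    confidence = prediction.get("confidence")
    return 1.0 if confidence is None else confidence

def filter_by_confidence(predictions, threshold):
    """Keep predictions whose effective confidence is not below the threshold."""
    return [p for p in predictions if effective_confidence(p) >= threshold]

predictions = [
    {"name": "a", "confidence": 0.3},
    {"name": "b"},                      # no confidence: treated as 1.0
    {"name": "c", "confidence": 0.8},
]
kept_names = [p["name"] for p in filter_by_confidence(predictions, 0.5)]
```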

How are auto-generated metrics calculated?

To compute auto-generated metrics and the confusion matrix, Labelbox matches predictions to ground truths for each data row. Here are the main steps of the matching algorithm:

  1. Predictions below the selected confidence threshold are discarded
  2. Predictions and annotations are greedily matched by decreasing IoU
  3. For each prediction/annotation pair:
    1. If the IoU is above the IoU threshold, and the prediction and annotation haven't been matched so far, then they are matched together. The pair results in a true positive (i.e., predicted class is the ground truth class) or a false positive (i.e., predicted class is not the ground truth class).
  4. Unmatched annotations result in false negatives. Unmatched predictions result in false positives.
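
The steps above can be sketched as follows. This is an illustrative reading of the listed steps, not Labelbox's actual implementation: the dictionary format is hypothetical, and geometry is reduced to 1-D intervals (similar to text spans) to keep the example short.

```python
def interval_iou(a, b):
    """IoU of two 1-D intervals given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def match_counts(predictions, annotations, conf_threshold, iou_threshold):
    """Greedy matching by decreasing IoU; returns TP/FP/FN counts."""
    # Step 1: discard predictions below the confidence threshold.
    kept = [p for p in predictions if p.get("confidence", 1.0) >= conf_threshold]
    # Step 2: rank every prediction/annotation pair by decreasing IoU.
    candidates = sorted(
        ((interval_iou(p["span"], a["span"]), pi, ai)
         for pi, p in enumerate(kept)
         for ai, a in enumerate(annotations)),
        key=lambda c: c[0],
        reverse=True,
    )
    matched_preds, matched_anns = set(), set()
    tp = fp = 0
    # Step 3: match pairs above the IoU threshold that are both still free.
    for iou, pi, ai in candidates:
        if iou <= iou_threshold:
            break  # remaining pairs have an even lower IoU
        if pi in matched_preds or ai in matched_anns:
            continue
        matched_preds.add(pi)
        matched_anns.add(ai)
        if kept[pi]["cls"] == annotations[ai]["cls"]:
            tp += 1
        else:
            fp += 1
    # Step 4: leftovers become false positives and false negatives.
    fp += len(kept) - len(matched_preds)
    fn = len(annotations) - len(matched_anns)
    return {"tp": tp, "fp": fp, "fn": fn}

annotations = [
    {"cls": "cat", "span": (0, 10)},
    {"cls": "dog", "span": (20, 30)},
    {"cls": "cat", "span": (40, 50)},
]
predictions = [
    {"cls": "cat", "span": (0, 8), "confidence": 0.9},      # matches first cat
    {"cls": "cat", "span": (21, 30), "confidence": 0.8},    # overlaps the dog
    {"cls": "dog", "span": (100, 110), "confidence": 0.7},  # overlaps nothing
    {"cls": "cat", "span": (40, 50), "confidence": 0.2},    # below threshold
]
counts = match_counts(predictions, annotations, conf_threshold=0.5, iou_threshold=0.5)
```

With these inputs, lowering the confidence threshold to 0 lets the fourth prediction match the last annotation, turning a false negative into a true positive.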

How long do auto-generated metrics take to calculate?

Auto-generated metrics may take a few minutes to compute. Metric filters will not be available until the auto-generated metrics have finished computing.

While auto-generated metrics are computing, there will be a banner to inform you that the current metrics are out-of-sync and are waiting to be updated.

A banner indicates that auto-generated metrics are being computed for 600 data rows

Auto-generated metrics failure state

If the calculation of the auto-generated metric fails, a banner will inform you in the user interface. You can click the Retry button to re-launch the metrics calculation.

Re-launch the metrics calculation if it failed

Supported annotation types

Auto-generated metrics are calculated for the following data types and annotation types:

| Data Type | Annotation Type |
|---|---|
| Image | Classification, bounding box, segmentation, polygon, polyline, point |
| Geospatial | Classification, bounding box, segmentation, polygon, polyline, point |
| Text | Classification, named entity (NER) |
| Video, Document, DICOM, Audio, JSON, HTML, Conversational text | Classification |