Model metrics

Model metrics help you analyze and compare predictions and annotations. They surface low-confidence predictions and areas of agreement or disagreement between model predictions and ground truths. This helps your machine learning team analyze model performance, conduct active learning, detect labeling mistakes, and find model errors.

While some metrics are auto-generated by Labelbox, you can also upload your own custom metrics.

How to use metrics

Metrics (whether auto-generated by Labelbox or uploaded by users) can be used to analyze model predictions and model performance. Users can:

  • Analyze the distribution of metrics in the Metrics View
    • For example, users may want to analyze and compare metrics between data splits, between specific slices of data, or between classes
  • Filter and sort on metrics in the Gallery View
    • For example, you may want to surface low-confidence predictions, mispredictions, labeling mistakes, etc.

You can read more about how to evaluate your model performance using metrics in this section.

Auto-generated metrics


Once you upload your model predictions to a model run, Labelbox will automatically compute the following metrics:

  • true positive
  • false positive
  • true negative
  • false negative
  • precision
  • recall
  • f1 score
  • intersection over union (IoU)

These metrics are computed for every feature and are also aggregated (as an arithmetic mean) at the data row level.
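As a rough illustration of how these quantities relate, the sketch below derives precision, recall, and F1 from true positive, false positive, and false negative counts, then averages the per-feature precision into a data row value. The counts and the `precision_recall_f1` helper are hypothetical, and returning 0.0 for an undefined ratio is an assumption rather than documented Labelbox behavior.

```python
from statistics import mean


def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) from raw counts.

    Assumption: a ratio with a zero denominator is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical (tp, fp, fn) counts for two features on one data row.
feature_counts = {"cat": (3, 1, 0), "dog": (4, 0, 4)}

per_feature = {
    name: precision_recall_f1(*counts)
    for name, counts in feature_counts.items()
}

# Data row value: arithmetic mean of the per-feature precision values.
data_row_precision = mean(p for p, _, _ in per_feature.values())
```

With these counts, "cat" has precision 0.75 and "dog" has precision 1.0, so the data row precision is their mean.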


What data rows are auto-generated metrics computed on?

These auto-generated metrics are computed for all data rows that contain at least one prediction and at least one annotation.


Auto-metrics work with ontologies of 50 schemas or fewer

Auto-metrics are not computed if the model ontology has more than 50 schemas. The limit of 50 schemas includes nested schemas.

Confusion matrix

Labelbox automatically generates a confusion matrix. The confusion matrix is designed to help you understand the performance of your model on every class. It also allows you to inspect examples of a specific misprediction.

Diagonal cells of the confusion matrix indicate true positive predictions by the model (i.e., the predicted feature matches the ground truth feature). Conversely, non-diagonal cells of the confusion matrix correspond to false positives and false negatives (i.e., the predicted feature does not match the ground truth feature).

The confusion matrix also includes an additional feature that isn’t a part of your model run ontology: the None feature. ‘None’ is useful in identifying predictions that were not matched to any annotation, as well as annotations that were not matched to any prediction.

The confusion matrix is interactive. If you click on any cell of the matrix, it opens the Gallery View of Model and keeps only examples corresponding to this specific cell of the confusion matrix.
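To make the role of the None feature concrete, here is a minimal sketch that tallies confusion matrix cells from matched and unmatched pairs. The `confusion_counts` helper and the sample pairs are hypothetical, and the (ground truth, prediction) ordering of each cell is an assumption; the text does not say which axis Labelbox uses for which.

```python
from collections import Counter


def confusion_counts(pairs):
    """Count confusion matrix cells.

    pairs: iterable of (ground_truth_class, predicted_class).
    A None entry means "nothing was matched on that side" and is
    recorded under the string label "None".
    """
    return Counter((gt or "None", pred or "None") for gt, pred in pairs)


# Hypothetical matching results for a handful of features.
pairs = [
    ("cat", "cat"),   # diagonal cell: true positive
    ("cat", "cat"),
    ("cat", "dog"),   # off-diagonal cell: misprediction
    ("dog", None),    # annotation with no matching prediction
    (None, "cat"),    # prediction with no matching annotation
]

cells = confusion_counts(pairs)
```

Each key of `cells` corresponds to one cell of the matrix, which is the set of examples you would see after clicking that cell.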

Precision-recall curve

Labelbox generates a precision-recall curve. It shows the precision and recall of your model at every confidence threshold.

This precision-recall curve is critical for picking the optimal confidence threshold for your model. You can pick the balance between precision and recall (between false positives and false negatives) for your specific use case.

You can display the precision-recall curve for all features, or for a specific feature. This enables you to pick the optimal confidence threshold for your use case, for every class.
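The sketch below shows one way such a curve can be traced: for each confidence threshold, keep the predictions at or above it and recompute precision and recall. It is a simplification under stated assumptions: each prediction carries a fixed true positive flag (a real pipeline would redo the matching at each threshold), precision is reported as 1.0 when no prediction is kept, and the `pr_curve` helper and sample data are hypothetical.

```python
def pr_curve(predictions, num_ground_truths, thresholds):
    """Return a list of (threshold, precision, recall) points.

    predictions: list of (confidence, is_true_positive) pairs.
    """
    points = []
    for threshold in thresholds:
        # Predictions below the threshold are ignored.
        kept = [is_tp for conf, is_tp in predictions if conf >= threshold]
        tp = sum(kept)
        # Assumption: precision is 1.0 when nothing is kept.
        precision = tp / len(kept) if kept else 1.0
        recall = tp / num_ground_truths if num_ground_truths else 0.0
        points.append((threshold, precision, recall))
    return points


# Hypothetical predictions for one class with 4 ground truth annotations.
predictions = [(0.95, True), (0.8, True), (0.6, False), (0.4, True), (0.2, False)]
points = pr_curve(predictions, num_ground_truths=4, thresholds=[0.0, 0.5, 0.9])
```

In this example, raising the threshold from 0.0 to 0.9 moves precision from 0.6 to 1.0 while recall drops from 0.75 to 0.25, which is the trade-off the curve lets you inspect.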

Confidence thresholds and IoU thresholds


Confidence threshold

The confidence threshold is between 0 and 1. Predictions with a confidence score lower than the confidence threshold will be ignored.


IoU threshold

The IoU threshold is between 0 and 1. A prediction counts as a true positive when it and an annotation of the same class have an IoU higher than the selected IoU threshold.
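For bounding boxes, the IoU test can be sketched as follows. The box format (left, top, right, bottom) and both helper functions are assumptions made for illustration; other annotation types such as segmentation masks would need a different overlap computation.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (left, top, right, bottom)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_class, ann_class, iou, iou_threshold):
    """Same class and an IoU strictly higher than the threshold."""
    return pred_class == ann_class and iou > iou_threshold


# Two 10x10 boxes that overlap on half their width: IoU = 50 / 150.
iou = box_iou((0, 0, 10, 10), (5, 0, 15, 10))
```

Here the pair is a true positive at an IoU threshold of 0.3 but not at 0.5, which is why the choice of threshold changes the reported metrics.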

Labelbox auto-generates metrics for several confidence thresholds and several IoU thresholds. This helps machine learning teams fine-tune the confidence threshold of their model and the IoU threshold for error analysis.

You can analyze model metrics for various confidence thresholds and IoU thresholds by changing them in the user interface. By modifying the thresholds, you can analyze how they impact the auto-generated metrics and the confusion matrix.

By default, Labelbox allows users to toggle between:

  • 10 values of confidence thresholds: 0, 0.1, 0.2, ..., 0.9, 1
  • 10 values of IoU thresholds: 0, 0.1, 0.2, ..., 0.9, 1

You can refine these thresholds to cover any range you want. For example, it is possible to explore the range of [0.5, 0.51, 0.52, 0.53, ..., 0.59, 0.6] for confidence thresholds.
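If you want to prepare such a list before entering it in the settings, a small helper can generate evenly spaced values without floating-point noise (for example 0.5700000000000001 instead of 0.57). The `threshold_range` function is a hypothetical convenience, not part of Labelbox; the thresholds themselves are still set in the user interface.

```python
def threshold_range(start, stop, step, ndigits=10):
    """Evenly spaced thresholds from start to stop, both inclusive.

    Values are rounded to `ndigits` decimals to hide float noise.
    """
    count = round((stop - start) / step)
    return [round(start + i * step, ndigits) for i in range(count + 1)]


# The refined confidence range from the example above.
confidence_thresholds = threshold_range(0.5, 0.6, 0.01)
```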

To refine the range of thresholds, open the Display panel and click on the settings icon of the confidence threshold and/or IoU threshold. From there, you can customize or delete the 10 values taken by the threshold.

Access the threshold settings

Customize the 10 values taken by the confidence and/or IoU threshold


Absence of confidence score

If a model prediction is uploaded to a model run without a specified confidence score, it is treated as if it had a confidence score of 1.

How are auto-generated metrics calculated?

To compute auto-generated metrics and the confusion matrix, Labelbox matches predictions to ground truths for each data row. Here are the main steps of the matching algorithm:

  1. Predictions below the selected confidence threshold are discarded
  2. Predictions and annotations are greedily matched, by decreasing IoU
  3. For each prediction/annotation pair:
    1. If the IoU is above the IoU threshold, and the prediction and annotation haven't been matched so far, then they are matched together. The pair results in a true positive (i.e., predicted class is the ground truth class) or a false positive (i.e., predicted class is not the ground truth class).
  4. Unmatched annotations result in false negatives. Unmatched predictions result in false positives.
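The steps above can be sketched in a few lines of Python. This is an illustrative reading of the description, not Labelbox's actual implementation: the dictionary layout, the `match_predictions` helper, and the use of 1D intervals (as for text spans) to keep the IoU function short are all assumptions. A missing confidence score is treated as 1, as described earlier.

```python
from itertools import product


def interval_iou(a, b):
    """IoU of two 1D intervals given as (start, end)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0


def match_predictions(predictions, annotations, conf_threshold, iou_threshold):
    # Step 1: discard predictions below the confidence threshold.
    kept = [p for p in predictions if p.get("confidence", 1.0) >= conf_threshold]

    # Step 2: consider every prediction/annotation pair by decreasing IoU.
    pairs = sorted(
        ((interval_iou(p["geom"], a["geom"]), pi, ai)
         for (pi, p), (ai, a) in product(enumerate(kept), enumerate(annotations))),
        key=lambda item: item[0],
        reverse=True,
    )

    matched_preds, matched_anns = set(), set()
    tp = fp = 0
    # Step 3: match pairs above the IoU threshold that are both still free.
    for iou, pi, ai in pairs:
        if iou <= iou_threshold:
            break  # all remaining pairs have an even lower IoU
        if pi in matched_preds or ai in matched_anns:
            continue
        matched_preds.add(pi)
        matched_anns.add(ai)
        if kept[pi]["cls"] == annotations[ai]["cls"]:
            tp += 1
        else:
            fp += 1

    # Step 4: leftovers become false negatives and false positives.
    fn = len(annotations) - len(matched_anns)
    fp += len(kept) - len(matched_preds)
    return {"tp": tp, "fp": fp, "fn": fn}


predictions = [
    {"cls": "cat", "geom": (0, 10), "confidence": 0.9},
    {"cls": "dog", "geom": (20, 30)},                     # no score: treated as 1
    {"cls": "cat", "geom": (50, 60), "confidence": 0.2},  # discarded in step 1
    {"cls": "cat", "geom": (80, 90), "confidence": 0.8},  # overlaps nothing
]
annotations = [
    {"cls": "cat", "geom": (0, 8)},
    {"cls": "cat", "geom": (21, 30)},
    {"cls": "dog", "geom": (50, 60)},
]
result = match_predictions(predictions, annotations, conf_threshold=0.5, iou_threshold=0.5)
```

In this example one pair matches with the same class (a true positive), one pair matches with different classes (a false positive), one kept prediction stays unmatched (a second false positive), and one annotation stays unmatched (a false negative).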

How long do auto-generated metrics take to calculate?

Auto-generated metrics may take a few minutes to compute. Metric filters will not be available until the auto-generated metrics have finished computing.

While auto-generated metrics are computing, there will be a banner to inform you that the current metrics are out-of-sync and are waiting to be updated.

A banner indicates that auto-generated metrics are being computed for 600 data rows

Auto-generated metrics failure state

If the calculation of the auto-generated metrics fails, a banner will inform you in the user interface. You can click the Retry button to re-launch the metrics calculation.

Re-launch the metrics calculation if it failed

Supported annotation types

Auto-generated metrics are calculated for the following data types and annotation types:

Data Type | Annotation Type
Image | Classification, bounding box, segmentation, polygon, polyline, point
Geospatial | Classification, bounding box, segmentation, polygon, polyline, point
Text | Classification, named entity (NER)
Video, Document, DICOM, Audio, JSON, HTML, Conversational text | Classification

How do I upload custom metrics?

If the auto-generated metrics are not sufficient for your use case, you can upload custom metrics to your model run. This will help you even more precisely evaluate your model performance in Labelbox.

Scalar custom metrics

A ScalarMetric is a custom metric with a single scalar value. It can be uploaded at the following levels of granularity:
1. Data rows
2. Features
3. Nested features

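# The import path below is an assumption; the original import line is truncated.
from labelbox.data.annotation_types import ScalarMetric

# custom metric on a data row
data_row_metric = ScalarMetric(metric_name="iou", value=0.5)

# custom metric on a feature
feature_metric = ScalarMetric(metric_name="iou", feature_name="cat", value=0.5)

# custom metric on a nested feature
# (the original call is truncated; "orange" is a hypothetical subclass name)
subclass_metric = ScalarMetric(metric_name="iou",
                               feature_name="cat",
                               subclass_name="orange",
                               value=0.5)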

Aggregation of custom metrics

Aggregation is an optional field on the ScalarMetric object that controls how custom metrics are aggregated. By default, the aggregation uses ARITHMETIC_MEAN.

Aggregations occur in the following cases:

  • When you provide a feature or nested-feature metric, Labelbox automatically aggregates the metric across features and nested-features on the data row.
    For example, say you provide a custom metric Bounding Box Width (BBW) on the features "cat" and "dog". The data row-level metric for BBW is then the average of these two values.
  • When you create slices, the custom metric is aggregated across data rows of the Slice.
  • When you filter data inside a Model Run, the custom metric is aggregated across the filtered data rows.
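
# If the following metrics are uploaded, then in the Labelbox App, users will see:
#   true positives dog = 4
#   true positives cat = 3
#   true positives = 7
# The original calls are truncated. The arguments below are inferred from these
# totals, and SUM aggregation is assumed because 3 + 4 = 7. Import path assumed.
from labelbox.data.annotation_types import ScalarMetric, ScalarMetricAggregation

cat_metric = ScalarMetric(metric_name="true_positives",
                          feature_name="cat",
                          value=3,
                          aggregation=ScalarMetricAggregation.SUM)

dog_metric = ScalarMetric(metric_name="true_positives",
                          feature_name="dog",
                          value=4,
                          aggregation=ScalarMetricAggregation.SUM)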
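
The effect of the aggregation setting can be illustrated in plain Python, independently of the SDK. The `aggregate` helper and the bounding box widths are hypothetical; the true positive counts are the ones from the example in this section. The sketch only contrasts an arithmetic mean with a sum and does not reproduce how Labelbox computes aggregates internally.

```python
from statistics import mean


def aggregate(values, how="arithmetic_mean"):
    """Combine per-feature metric values into a single value."""
    if how == "arithmetic_mean":
        return mean(values)
    if how == "sum":
        return sum(values)
    raise ValueError(f"unsupported aggregation: {how}")


# Bounding Box Width on "cat" and "dog" (hypothetical widths), default mean.
bbw_data_row = aggregate([120, 80])

# True positives on "cat" and "dog", summed rather than averaged.
true_positives = aggregate([3, 4], how="sum")
```

Counts such as true positives are usually more meaningful as a sum, whereas ratio-like or size-like metrics tend to read better as a mean.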