Metrics view
A series of metrics to help you evaluate your data and your model.
When you select a model from the Model tab, you have three views to choose from: the gallery view, the metrics view, and the projector view.
The metrics view helps you analyze the distribution of annotations and predictions in your data, evaluate the performance of a model, and quantitatively compare two models.

Switch to the metrics view by clicking the metrics icon in the top right corner
Data analytics
The metrics view provides analytics about the distribution of annotations and predictions in the model run.
Annotations distribution
A histogram displays the distribution of annotations in a model run.
Every line in the histogram represents a feature. If the feature has sub-features, you can see the distribution of sub-features by clicking on the arrow to the left of the histogram line.
By default, Labelbox displays the distribution of annotations for the top 100 features. You can display more features by clicking Load more.

The “airplane” feature is the most represented annotation in the model run
Predictions distribution
Similarly, this histogram displays the distribution of predictions in the model run.

The “airplane” feature is the most represented prediction in the model run
Data analytics on a subset of data
The annotations and predictions histograms work just like the gallery view: if you use filters to search data in the model run, only the filtered data rows appear in the histograms. These histograms are designed to help you understand the distribution of annotations and predictions on a specific subset of data.
Data analytics on each data split
Machine learning teams typically want to do the following:
a) analyze the distribution of annotations and predictions on each data split
b) surface discrepancies among splits
To visualize the analytics histograms for a specific data split, click Training, Validate, or Test. The histograms will update in the user interface to reflect the distribution of annotations and predictions on the selected split.
Filter data using analytics histograms
Annotations and predictions histograms are interactive: you can click any histogram bar to visualize the corresponding data rows in the gallery view of the model run.
Here's what is happening behind the scenes:
- Labelbox opens the gallery view of the model run (you were in the metrics view so far) so that you can visualize data rows
- Labelbox adds a filter in the model run to narrow down to the data rows associated with the histogram bar you clicked
Compare data analytics for two model runs
When comparing two model runs, you can compare their distribution of annotations and predictions.
Model metrics
The metrics view provides quantitative metrics to compare predictions and annotations. These metrics are helpful for surfacing areas of agreement and disagreement between model predictions and ground truths, which helps machine learning teams analyze model performance, find model errors, find labeling mistakes, and surface low-confidence predictions.
Some metrics are auto-generated by Labelbox, and users can upload their own custom metrics.
Auto-generated metrics
Once users upload model predictions to a model run, Labelbox automatically computes the following metrics:
- true positive
- false positive
- true negative
- false negative
- precision
- recall
- f1 score
- intersection over union (IoU)
These auto-generated metrics are computed for all data rows that contain at least one prediction and at least one annotation.
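For intuition, here is a minimal sketch (plain Python, not Labelbox code) of how the aggregate metrics relate to the matched true positive, false positive, and false negative counts, with IoU shown for two regions represented as pixel sets:

```python
# Minimal sketch (not Labelbox internals): how the aggregate metrics relate
# to the matched true/false positive and false negative counts.

def precision(tp: int, fp: int) -> float:
    # Fraction of predictions that match a ground-truth annotation.
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    # Fraction of ground-truth annotations that were predicted.
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1_score(tp: int, fp: int, fn: int) -> float:
    # Harmonic mean of precision and recall.
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0

def iou(pred_pixels: set, gt_pixels: set) -> float:
    # Intersection over union for two regions represented as pixel sets.
    union = pred_pixels | gt_pixels
    return len(pred_pixels & gt_pixels) / len(union) if union else 0.0

# Example: 8 true positives, 2 false positives, 4 false negatives.
print(precision(8, 2), recall(8, 4), f1_score(8, 2, 4))
```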
Auto-generated metrics histograms
Precision, recall, f1 score, and IoU metrics show up as histograms in the user interface. Each bar of the histogram corresponds to a class.

The model has the highest f1 score on airplanes
You can also see the distribution of these auto-generated metrics. Each bar of the histogram represents the number of data rows for which the auto-generated metric is in a specific range of values.

34 data rows have an f1-score between 0.5 and 0.6
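Conceptually, this distribution is a bucketing of per-data-row metric values into fixed-width ranges. Here is a minimal sketch with made-up scores:

```python
from collections import Counter

# Hypothetical per-data-row F1 scores (illustrative values only).
f1_by_data_row = {"dr_1": 0.55, "dr_2": 0.58, "dr_3": 0.92, "dr_4": 0.31}

# Assign each score to a 0.1-wide bucket, e.g. 0.55 -> "[0.5, 0.6)".
def bucket(value: float, width: float = 0.1) -> str:
    low = min(int(value / width) * width, 1.0 - width)
    return f"[{low:.1f}, {low + width:.1f})"

histogram = Counter(bucket(v) for v in f1_by_data_row.values())
print(histogram)  # e.g. Counter({'[0.5, 0.6)': 2, '[0.9, 1.0)': 1, '[0.3, 0.4)': 1})
```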
All histograms in this view are interactive. If you click on any bar of any histogram, it will open the gallery view in the Model tab and automatically filter and sort the model run data. More precisely:
- Labelbox will filter only data rows corresponding to the bar of the histogram you clicked on
- Labelbox will sort data rows based on the metric of the histogram you clicked on
These filter and sort capabilities allow you to quickly gain insight into your model's behavior by toggling between a quantitative and qualitative view of your model run.
Auto-generated confusion matrix
Labelbox automatically generates a confusion matrix for your annotation classes. The confusion matrix is designed to help you understand the performance of your model on every class. It also allows you to inspect examples of a specific misprediction.
Every row of the confusion matrix corresponds to a ground truth feature, while every column of the confusion matrix corresponds to a predicted feature.
Diagonal cells of the confusion matrix indicate true positive predictions by the model (i.e., the predicted feature matches the ground truth feature). Conversely, non-diagonal cells of the confusion matrix correspond to false positives and false negatives (i.e., the predicted feature does not match the ground truth feature).
The confusion matrix has one more feature than your model run ontology: the None feature. None is useful to identify predictions that were not matched to any annotation, as well as annotations that were not matched to any prediction.
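As a rough mental model (not Labelbox's internal representation), the matrix is a table of counts indexed by (ground truth class, predicted class), with None standing in for "no match":

```python
from collections import defaultdict

# Sketch: build confusion-matrix counts from (ground truth class, predicted class)
# pairs produced by the matching step. "None" stands in for "no match".
matched_pairs = [
    ("airplane", "airplane"),    # correct prediction -> diagonal cell
    ("airplane", "helicopter"),  # misprediction -> off-diagonal cell
    ("helicopter", None),        # annotation with no matching prediction
    (None, "airplane"),          # prediction with no matching annotation
]

confusion = defaultdict(int)
for gt_class, pred_class in matched_pairs:
    confusion[(gt_class or "None", pred_class or "None")] += 1

for (gt, pred), count in confusion.items():
    print(f"ground truth={gt:<12} predicted={pred:<12} count={count}")
```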
The confusion matrix is interactive. If you click on any cell of the matrix, it opens the gallery view in the Model tab and keeps only examples corresponding to this specific cell of the confusion matrix.

Click on any cell of the confusion matrix to inspect the corresponding data rows
Auto-generated precision-recall curve
Labelbox generates a precision-recall curve, which plots your model's precision and recall at every confidence threshold.
This precision-recall curve is crucial for picking the optimal confidence threshold for your model: you can choose the balance between precision and recall (that is, between false positives and false negatives) for your specific use case.
You can display the precision-recall curve for all features, or for a specific feature. This enables you to pick the optimal confidence threshold for your use case, for every class.
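Conceptually, the curve is produced by sweeping the confidence threshold and recomputing precision and recall at each value. The sketch below is a simplification: it assumes each prediction is already flagged as matched or unmatched, whereas the real computation re-runs IoU-based matching per threshold.

```python
# Simplified precision-recall sweep. Each prediction carries a confidence score
# and a flag saying whether it matched a ground-truth annotation; the values
# below are made up for illustration.
predictions = [
    {"confidence": 0.95, "matched": True},
    {"confidence": 0.80, "matched": True},
    {"confidence": 0.60, "matched": False},
    {"confidence": 0.40, "matched": True},
]
total_annotations = 4

for threshold in [i / 10 for i in range(11)]:
    kept = [p for p in predictions if p["confidence"] >= threshold]
    tp = sum(p["matched"] for p in kept)
    precision = tp / len(kept) if kept else 1.0
    recall = tp / total_annotations
    print(f"threshold={threshold:.1f} precision={precision:.2f} recall={recall:.2f}")
```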
Confidence thresholds and IoU thresholds
Confidence threshold
The confidence threshold is between 0 and 1. Predictions with a confidence score lower than the confidence threshold will be ignored.
IoU threshold
The IoU threshold is between 0 and 1. A prediction counts as a true positive when it and an annotation of the same class have an IoU higher than the selected IoU threshold.
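For intuition, here is how IoU might be computed for two axis-aligned bounding boxes (a sketch, not Labelbox's implementation):

```python
# Sketch: IoU of two axis-aligned bounding boxes given as (x1, y1, x2, y2).
def box_iou(a, b) -> float:
    # Width and height of the intersection rectangle (zero if no overlap).
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# IoU is ~0.47 here, so with an IoU threshold of 0.5 this pair would not match.
print(box_iou((0, 0, 10, 10), (2, 2, 12, 12)))
```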
Labelbox auto-generates metrics for several confidence thresholds and several IoU thresholds. This helps machine learning teams fine-tune the confidence threshold of their model and the IoU threshold for error analysis.
You can analyze model metrics for various confidence and IoU thresholds by changing them in the user interface. When you modify the thresholds, you can see how they impact the auto-generated metrics and the confusion matrix.
There are two ways to change the confidence and IoU thresholds in the Model tab.
- Option #1: Go to the Model runs subtab, select the metrics view, and use the sliders to adjust the confidence threshold and the IoU threshold.

Changing the confidence threshold and/or the IoU threshold will update model metrics
- Option #2: Go to the Model runs subtab, click Display, and in the Display panel use the sliders to adjust the confidence threshold and the IoU threshold.

Open the Display panel

Experiment with various confidence thresholds and IoU thresholds
Customize the threshold settings
By default, Labelbox allows you to toggle between the following preset values:
- confidence thresholds: 0, 0.1, 0.2, ..., 0.9, 1
- IoU thresholds: 0, 0.1, 0.2, ..., 0.9, 1
You can refine these thresholds to cover any range you want. For example, you can explore the range [0.5, 0.51, 0.52, 0.53, ..., 0.59, 0.6] for the confidence threshold.
To refine the range of thresholds, open the Display panel and click the settings icon next to the confidence threshold and/or IoU threshold. From there, you can customize or delete the values taken by the threshold.

Access the threshold settings

Customize the values taken by the confidence and/or IoU threshold
Absence of confidence score
If a model prediction is uploaded to a model run without a specified confidence score, it is treated as if it had a confidence score of 1.
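In other words (a sketch of the described behavior, not Labelbox code), the confidence score effectively defaults to 1.0:

```python
# Sketch: a prediction without an explicit confidence score is treated as if
# its confidence were 1.0.
def effective_confidence(prediction: dict) -> float:
    return prediction.get("confidence", 1.0)

print(effective_confidence({"name": "airplane"}))                     # 1.0
print(effective_confidence({"name": "airplane", "confidence": 0.3}))  # 0.3
```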
How auto-generated metrics are calculated
To compute auto-generated metrics and the confusion matrix, Labelbox matches predictions to ground truths for each data row. Here are the main steps of the matching algorithm (illustrated by the sketch after this list):
- Predictions below the selected confidence threshold are discarded
- Predictions and annotations are greedily matched by decreasing IoU
- For each prediction/annotation pair: if the IoU is above the IoU threshold and neither the prediction nor the annotation has already been matched, they are matched together. The pair results in a true positive (i.e., the predicted class is the ground truth class) or a false positive (i.e., the predicted class is not the ground truth class).
- Unmatched annotations result in false negatives. Unmatched predictions result in false positives.
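The sketch below walks through these steps in simplified form; it is not Labelbox's actual implementation, and pairwise_iou is assumed to be precomputed (for example, with a bounding-box IoU function like the one shown earlier):

```python
from itertools import product

# Simplified sketch of the matching steps described above (not Labelbox internals).
# `predictions` and `annotations` are lists of dicts with a "class" key;
# `pairwise_iou[i][j]` holds the IoU between prediction i and annotation j.
def match(predictions, annotations, pairwise_iou,
          confidence_threshold=0.5, iou_threshold=0.5):
    # 1. Discard predictions below the confidence threshold
    #    (a missing confidence score defaults to 1.0).
    kept = [i for i, p in enumerate(predictions)
            if p.get("confidence", 1.0) >= confidence_threshold]

    # 2. Greedily match prediction/annotation pairs by decreasing IoU.
    candidates = sorted(
        ((pairwise_iou[i][j], i, j)
         for i, j in product(kept, range(len(annotations)))),
        reverse=True,
    )
    matched_preds, matched_anns = set(), set()
    tp = fp = 0
    for iou, i, j in candidates:
        if iou < iou_threshold or i in matched_preds or j in matched_anns:
            continue
        matched_preds.add(i)
        matched_anns.add(j)
        # 3. Matched pair: true positive if the classes agree, false positive otherwise.
        if predictions[i]["class"] == annotations[j]["class"]:
            tp += 1
        else:
            fp += 1

    # 4. Unmatched predictions are false positives; unmatched annotations, false negatives.
    fp += len(kept) - len(matched_preds)
    fn = len(annotations) - len(matched_anns)
    return tp, fp, fn

# Example: a single prediction overlapping a single annotation of the same class.
print(match(
    predictions=[{"class": "airplane", "confidence": 0.9}],
    annotations=[{"class": "airplane"}],
    pairwise_iou=[[0.8]],
))  # -> (1, 0, 0)
```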
Auto-generated metrics loading state
Auto-generated metrics take a few minutes to compute.
While auto-generated metrics are computing, a banner will inform you that the metrics are out-of-sync. Metrics filters will not be available until auto-generated metrics have finished computing.

A banner indicates that auto-generated metrics are being computed for 600 data rows
Auto-generated metrics failure state
If the calculation of the auto-generated metric fails, a banner will inform you in the user interface. You can click the Retry button to re-launch the metrics calculation.

Easily re-launch metrics calculation if they fail
Supported annotation types
Auto-generated metrics are calculated for the following data types and annotation types:
| Data Type | Annotation Type |
| --- | --- |
| Image | Classification, bounding box, segmentation, polygon, polyline, point |
| Geospatial | Classification, bounding box, segmentation, polygon, polyline, point |
| Text | Classification, named entity (NER) |
| Video, Document, DICOM, Audio, JSON, HTML, Conversational text | Classification |
Custom metrics
If auto-generated metrics are not sufficient for your use case, you can upload custom metrics to your model run. This helps you evaluate your model performance in Labelbox even more precisely.
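For example, you might compute a per-data-row metric offline and collect it into records before uploading it with the Python SDK. The sketch below is purely illustrative; the field names (data_row_id, metric_name, value) are placeholders rather than the exact upload schema, so refer to the SDK reference for the actual metric types and upload call:

```python
# Illustrative sketch: compute a custom per-data-row scalar metric (here, the
# mean prediction confidence) and collect it into records. The field names are
# placeholders, not the exact Labelbox upload schema.
predictions_by_data_row = {
    "data_row_1": [{"name": "airplane", "confidence": 0.9},
                   {"name": "helicopter", "confidence": 0.4}],
    "data_row_2": [{"name": "airplane", "confidence": 0.7}],
}

custom_metrics = [
    {
        "data_row_id": data_row_id,
        "metric_name": "mean_prediction_confidence",
        "value": sum(p["confidence"] for p in preds) / len(preds),
    }
    for data_row_id, preds in predictions_by_data_row.items()
]
print(custom_metrics)
```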
Scalar Metrics
Scalar metrics (positive real value metrics) show up as histograms in the user interface. Each bar of the histogram corresponds to a class.

This custom metric takes its highest value on helicopters
You can also see the distribution of scalar metrics. Each bar of the histogram represents the number of data rows for which the scalar metric is in a specific range of values.

The most represented range for this custom metric is between 0.9 and 1
All histograms in this view are interactive. If you click on any bar of any histogram, it will open the gallery view in the Model tab and automatically filter and sort the model run data. More precisely:
- Labelbox will filter only data rows corresponding to the bar of the histogram you clicked on
- Labelbox will sort data rows based on the metric of the histogram you clicked on
This way, you can quickly gain insights into your model's behavior by toggling between a quantitative and qualitative view of your model run.
Confidence scores
Upload confidence scores alongside every prediction
Labelbox allows you to upload a confidence score alongside every prediction. See here for more details.
Filtering and sorting on metrics and confidence scores
You can filter and sort on metrics (both auto-generated and custom) as well as on confidence scores. These filters can be scoped to a specific class if desired.
These filters apply to both the metrics view and the gallery view.

Filtering on IoU and FP count and sorting on IoU
Metrics on a subset of data
Metrics (auto-generated and custom) update dynamically based on the data you are searching. If you filter data in the model run, only the filtered data rows will contribute to the metrics. This is designed to help you analyze model metrics on a specific subset of data.

Model metrics update dynamically based on the filters you apply (FP and IOU here)
Metrics on each data split
Machine learning teams typically want to analyze and compare model metrics on each data split. To do so, click Training, Validate, or Test; the model metrics will update in the user interface to reflect the selected split.

Easily compare model metrics on each split