Metrics view

A series of metrics to help you evaluate your data and your model.

When you select a model from the Model tab, you will have three views to choose from: the gallery view, the metrics view, and the projector view.

The metrics view helps you analyze the distribution of annotations and predictions in your data, evaluate the performance of a model, and quantitatively compare two models.

Switch to the metrics view by clicking the metrics icon in the top right corner.

Data analytics

The metrics view provides analytics about the distribution of annotations and predictions in the model run.

Annotations distribution

A histogram displays the distribution of annotations in a model run.

Every line in the histogram represents a feature. If the feature has sub-features, you can see the distribution of sub-features by clicking on the arrow to the left of the histogram line.

By default, Labelbox displays the distribution of annotations for the top 100 features. You can display more features by clicking Load more.

The "airplane" feature is the most represented annotation in the Model Run

The “airplane” feature is the most represented annotation in the model run

Predictions distribution

Similarly, this histogram displays the distribution of predictions in the model run.

The "airplane" feature is the most represented prediction in the Model Run

The “airplane” feature is the most represented prediction in the model run

Data analytics on a subset of data

The annotations and predictions histograms work just like the gallery view: if you use filters to search data in the model run, only the filtered data rows will appear in the histograms. These histograms are designed to help you understand the distribution of annotations and predictions on a specific subset of data.

Data analytics on each data split

Machine learning teams typically want to do the following:

  • analyze the distribution of annotations and predictions on each data split
  • surface discrepancies among splits

To visualize the analytics histograms for a specific data split, click Training, Validate, or Test. The histograms will update in the user interface to reflect the distribution of annotations and predictions on the selected split.

Filter data using analytics histograms

Annotations and predictions histograms are interactive. You can click on any histogram bar to visualize the corresponding data rows in the gallery view of the model run.

Here's what is happening behind the scenes:

  • Labelbox opens the gallery view of the model run (you were in the metrics view so far) so that you can visualize data rows
  • Labelbox adds a filter in the model run to narrow down to the data rows associated with the histogram bar you clicked

Compare data analytics for two model runs

When comparing two model runs, you can compare their distribution of annotations and predictions.

Model metrics

The metrics view provides quantitative metrics to compare predictions and annotations. These metrics are helpful to surface areas of agreement and disagreement between model predictions and ground truths. This helps machine learning teams analyze model performance, find model errors, find labeling mistakes, and surface low-confidence predictions.

Some metrics are auto-generated by Labelbox, and users can upload their own custom metrics.

Auto-generated metrics

Once you upload model predictions to a model run, Labelbox automatically computes the following metrics:

  • true positive
  • false positive
  • true negative
  • false negative
  • precision
  • recall
  • f1 score
  • intersection over union (IoU)

These auto-generated metrics are computed for all data rows that contain at least one prediction and at least one annotation.
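For reference, the precision, recall, and f1 score in this list follow the standard definitions based on the true positive, false positive, and false negative counts. A minimal illustrative sketch in plain Python (not Labelbox code):

```python
# Illustrative only (not Labelbox code): how precision, recall, and f1
# relate to the true positive / false positive / false negative counts.
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: 40 true positives, 10 false positives, 10 false negatives
precision_recall_f1(40, 10, 10)  # precision 0.8, recall 0.8, f1 0.8
```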

Auto-generated metrics histograms

Precision, recall, f1 score, and IoU metrics show up as histograms in the user interface. Each bar of the histogram corresponds to a class.

The model has the highest f1 score on airplanes

You can also see the distribution of these auto-generated metrics. Each bar of the histogram represents the number of data rows for which the auto-generated metric is in a specific range of values.

34 data rows have an f1-score between 0.5 and 0.6

All histograms in this view are interactive. If you click on any bar of any histogram, it will open the gallery view in the Model tab and automatically filter and sort the model run data. More precisely:

  • Labelbox will filter only data rows corresponding to the bar of the histogram you clicked on
  • Labelbox will sort data rows based on the metric of the histogram you clicked on

These filter and sort capabilities allow you to quickly gain insight into your model's behavior by toggling between a quantitative and qualitative view of your model run.

Auto-generated confusion matrix

Labelbox automatically generates a confusion matrix for your annotation classes. The confusion matrix is designed to help you understand the performance of your model on every class. It also allows you to inspect examples of a specific misprediction.

Every row of the confusion matrix corresponds to a ground truth feature, while every column of the confusion matrix corresponds to a predicted feature.

Diagonal cells of the confusion matrix indicate true positive predictions by the model (i.e., the predicted feature matches the ground truth feature). Conversely, non-diagonal cells of the confusion matrix correspond to false positives and false negatives (i.e., the predicted feature does not match the ground truth feature).

The confusion matrix has one more feature than your model run ontology: the None feature. None is useful to identify predictions that were not matched to any annotation, as well as annotations that were not matched to any prediction.
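As an illustration of this structure (a minimal sketch, not the Labelbox implementation), a confusion matrix with a None row and column can be assembled from matched pairs and unmatched items like this:

```python
from collections import defaultdict

# Minimal sketch (not Labelbox code) of a confusion matrix with a "None" class.
# Rows are ground truth features, columns are predicted features.
def build_confusion_matrix(matches, unmatched_annotations, unmatched_predictions):
    matrix = defaultdict(int)  # keyed by (ground_truth_feature, predicted_feature)
    for gt_feature, pred_feature in matches:
        matrix[(gt_feature, pred_feature)] += 1
    for gt_feature in unmatched_annotations:
        matrix[(gt_feature, "None")] += 1    # annotation matched to no prediction
    for pred_feature in unmatched_predictions:
        matrix[("None", pred_feature)] += 1  # prediction matched to no annotation
    return dict(matrix)

build_confusion_matrix(
    matches=[("airplane", "airplane"), ("airplane", "helicopter")],
    unmatched_annotations=["car"],
    unmatched_predictions=["airplane"],
)
# {('airplane', 'airplane'): 1, ('airplane', 'helicopter'): 1,
#  ('car', 'None'): 1, ('None', 'airplane'): 1}
```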

The confusion matrix is interactive. If you click on any cell of the matrix, it opens the gallery view in the Model tab and keeps only examples corresponding to this specific cell of the confusion matrix.

Click on any cell of the confusion matrix to inspect the corresponding data rows

Auto-generated precision-recall curve

Labelbox generates a precision-recall curve. It represents the precision and recall of your model at every confidence threshold.

This precision-recall curve is crucial for picking the optimal confidence threshold for your model. You can pick the balance between precision and recall (between false positives and false negatives) for your specific use case.

You can display the precision-recall curve for all features, or for a specific feature. This enables you to pick the optimal confidence threshold for your use case, for every class.
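Conceptually, such a curve is traced by sweeping the confidence threshold and recomputing precision and recall at each step. The sketch below is illustrative only; `evaluate` is a hypothetical helper that would re-run the matching described later on this page and return TP/FP/FN counts at a given threshold:

```python
# Illustrative sketch (not Labelbox code): trace a precision-recall curve by
# sweeping the confidence threshold. `evaluate(t)` is a hypothetical helper
# returning (tp, fp, fn) after discarding predictions with confidence < t.
def precision_recall_curve(evaluate, thresholds=(0.0, 0.1, 0.2, 0.3, 0.4,
                                                 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)):
    curve = []
    for t in thresholds:
        tp, fp, fn = evaluate(t)
        precision = tp / (tp + fp) if (tp + fp) else 1.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        curve.append((t, precision, recall))
    return curve
```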

Confidence thresholds and IoU thresholds

📘

Confidence threshold

The confidence threshold is between 0 and 1. Predictions with a confidence score lower than the confidence threshold will be ignored.

📘

IoU threshold

The IoU threshold is between 0 and 1. A True Positive is when a prediction and annotation of the same class have an IoU that is higher than the selected IoU threshold.
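For bounding boxes, for example, IoU is the overlap area divided by the union area of the two boxes. A minimal illustrative sketch (boxes given as (x1, y1, x2, y2) corners):

```python
# Minimal sketch: IoU of two axis-aligned boxes given as (x1, y1, x2, y2).
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union else 0.0

iou((0, 0, 10, 10), (5, 0, 15, 10))  # 50 / 150 ≈ 0.33
```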

Labelbox auto-generates metrics for several confidence thresholds and several IoU thresholds. This helps machine learning teams fine-tune the confidence threshold of their model and the IoU threshold for error analysis.

You can analyze model metrics for various confidence thresholds and IoU thresholds by changing them in the user interface. When you modify the thresholds, you'll be able to see how they impact the auto-generated metrics and the confusion matrix.
There are two ways to change the confidence and IoU thresholds in the Model tab:

  • Option #1: Go to the Model runs subtab, select the metrics view, and use the sliders to adjust the confidence threshold and the IoU threshold.
Changing the confidence threshold and/or the IoU threshold will update model metrics

  • Option #2: Go to the Model runs subtab, click Display, and in the Display panel use the sliders to adjust the confidence threshold and the IoU threshold.
Open the Display panel

Experiment with various confidence thresholds and IoU thresholds

Customize the threshold settings

By default, Labelbox allows users to toggle between:

  • 10 values of confidence thresholds: 0, 0.1, 0.2, ..., 0.9, 1
  • 10 values of IoU thresholds: 0, 0.1, 0.2, ..., 0.9, 1

You can refine these thresholds to cover any range you want. For example, it is possible to explore the range of [0.5, 0.51, 0.52, 0.53, ..., 0.59, 0.6] for confidence thresholds.

To refine the range of thresholds, open the Display panel and click on the settings icon of the confidence threshold and/or IoU threshold. From there, you can customize or delete the 10 values taken by the threshold.

Access the threshold settings

Customize the 10 values taken by the confidence and/or IoU threshold

📘

Absence of confidence score

If a model prediction is uploaded to a model run without a specified confidence score, it is treated as if it had a confidence score of 1.

How auto-generated metrics are calculated

To compute auto-generated metrics and the confusion matrix, Labelbox matches predictions to ground truths for each data row. Here are the main steps of the matching algorithm:

  1. Predictions below the selected confidence threshold are discarded
  2. Predictions and annotations are greedily matched by decreasing IoU
  3. For each prediction/annotation pair:
    1. If the IoU is above the IoU threshold, and the prediction and annotation haven't been matched so far, then they are matched together. The pair results in a true positive (i.e., predicted class is the ground truth class) or a false positive (i.e., predicted class is not the ground truth class).
  4. Unmatched annotations result in false negatives. Unmatched predictions result in false positives.
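The sketch below illustrates these steps in plain Python. It is a simplified approximation, not the exact Labelbox implementation: `iou` is the pairwise IoU function sketched earlier, each prediction and annotation is assumed to be a dict with "class" and "box" keys, and predictions without a confidence score default to 1 (as noted above).

```python
# Simplified sketch of the matching steps above (not the exact Labelbox code).
def match(predictions, annotations, confidence_threshold, iou_threshold):
    # Step 1: discard predictions below the confidence threshold
    preds = [p for p in predictions
             if p.get("confidence", 1.0) >= confidence_threshold]

    # Step 2: consider all prediction/annotation pairs, ordered by decreasing IoU
    pairs = sorted(
        ((iou(p["box"], a["box"]), p_i, a_i)
         for p_i, p in enumerate(preds)
         for a_i, a in enumerate(annotations)),
        reverse=True,
    )

    matched_preds, matched_anns = set(), set()
    tp = fp = 0
    # Step 3: greedily match pairs whose IoU clears the threshold
    for pair_iou, p_i, a_i in pairs:
        if pair_iou < iou_threshold:
            break  # remaining pairs have even lower IoU
        if p_i in matched_preds or a_i in matched_anns:
            continue  # each prediction/annotation is matched at most once
        matched_preds.add(p_i)
        matched_anns.add(a_i)
        if preds[p_i]["class"] == annotations[a_i]["class"]:
            tp += 1  # true positive: predicted class matches ground truth class
        else:
            fp += 1  # false positive: classes disagree

    # Step 4: unmatched predictions are false positives,
    #         unmatched annotations are false negatives
    fp += len(preds) - len(matched_preds)
    fn = len(annotations) - len(matched_anns)
    return tp, fp, fn
```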

Auto-generated metrics loading state

Auto-generated metrics take a few minutes to compute.

While auto-generated metrics are computing, a banner will inform you that the metrics are out-of-sync. Metrics filters will not be available until auto-generated metrics have finished computing.

A banner indicates that auto-generated metrics are being computed for 600 data rows

Auto-generated metrics failure state

If the calculation of the auto-generated metrics fails, a banner will inform you in the user interface. You can click the Retry button to re-launch the metrics calculation.

Easily re-launch the metrics calculation if it fails

Supported annotation types

Auto-generated metrics are calculated for the following data types and annotation types:

  • Image: Classification, bounding box, segmentation, polygon, polyline, point
  • Geospatial: Classification, bounding box, segmentation, polygon, polyline, point
  • Text: Classification, named entity (NER)
  • Video, Document, DICOM, Audio, JSON, HTML, Conversational text: Classification

Custom metrics

If auto-generated metrics are not sufficient for your use case, you can upload custom metrics to your model run. This helps you evaluate your model performance in Labelbox even more precisely.
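As a rough sketch of what an upload might look like with the Labelbox Python SDK (class and method names such as ScalarMetric, Label, get_model_run, and add_predictions are based on recent SDK versions and may differ in yours; the IDs are placeholders, so treat this as an assumption and check the SDK reference):

```python
# Hedged sketch: attaching a custom scalar metric to a data row in a model run
# with the Labelbox Python SDK. Exact class names, fields, and upload methods
# may vary by SDK version; the IDs below are placeholders.
import labelbox as lb
from labelbox.data.annotation_types import Label, ScalarMetric

client = lb.Client(api_key="YOUR_API_KEY")
model_run = client.get_model_run("MODEL_RUN_ID")

label = Label(
    data={"uid": "DATA_ROW_ID"},  # how the data row is referenced can vary by SDK version
    annotations=[
        ScalarMetric(metric_name="custom_quality_score", value=0.87),
    ],
)

# Upload the custom metric alongside (or after) your predictions
model_run.add_predictions(name="custom-metrics-upload", predictions=[label])
```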

Scalar metrics

Scalar metrics (metrics with a positive real value) show up as histograms in the user interface. Each bar of the histogram corresponds to a class.

This custom metric takes its highest value on helicopters

You can also see the distribution of scalar metrics. Each bar of the histogram represents the number of data rows for which the scalar metric is in a specific range of values.

The most represented range for this custom metric is between 0.9 and 1

All histograms in this view are interactive. If you click on any bar of any histogram, it will open the gallery view in the Model tab and automatically filter and sort the model run data. More precisely:

  • Labelbox will filter only data rows corresponding to the bar of the histogram you clicked on
  • Labelbox will sort data rows based on the metric of the histogram you clicked on

This way, you can quickly gain insights into your model's behavior by toggling between a quantitative and qualitative view of your model run.

Confidence scores

📘

Upload confidence scores alongside every prediction

It is now possible to upload confidence scores alongside every prediction in Labelbox.

Labelbox allows users to upload a confidence score alongside every prediction. See here for more details.
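For example, a bounding-box prediction carrying a confidence score might be built like this with the SDK's annotation types (a hedged sketch; the ObjectAnnotation, Rectangle, and Point names and the confidence field are based on recent SDK versions and may differ in yours):

```python
# Hedged sketch: a prediction carrying a confidence score, using Labelbox
# Python SDK annotation types. Field and class names may differ by SDK version.
from labelbox.data.annotation_types import ObjectAnnotation, Point, Rectangle

prediction = ObjectAnnotation(
    name="airplane",  # feature name from the model run ontology
    value=Rectangle(start=Point(x=10, y=20), end=Point(x=110, y=220)),
    confidence=0.84,  # omit this and Labelbox treats the prediction as confidence 1
)
```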

Filtering and sorting on metrics and confidence scores

You can filter and sort on metrics (both auto-generated and custom) as well as on confidence scores. These filters can be scoped to a specific class if desired.

These filters apply to both the metrics view and the gallery view.

Filtering on IoU and FP count and sorting on IoU

Metrics on a subset of data

Metrics (auto-generated and custom) update dynamically based on the data you are searching. If you filter data in the model run, only the filtered data rows will contribute to the metrics. This is designed to help you analyze model metrics on a specific subset of data.

Model metrics update dynamically based on the filters you apply (FP and IoU here)

Metrics on each data split

Machine learning teams typically want to analyze and compare model metrics on each data split. To do so, click Training, Validate, or Test, and the model metrics will update in the user interface to reflect the selected split.

Easily compare model metrics on each split

