Model run metrics

How to view, use, and compare model metrics.

Developer guide: Upload custom metrics to a model run

Model metrics can help you analyze and compare predictions and annotations. They help identify low-confidence predictions and places where model predictions agree with ground truths (or not).

Metrics help you analyze model performance, conduct active learning, detect labeling mistakes, and find model errors.

While some metrics are generated automatically by Labelbox, you can also upload your own custom metrics.

How to use metrics

Metrics (whether generated automatically by Labelbox or uploaded by users) can be used to analyze model predictions and model performance. Users can:

  • Analyze the distribution of metrics in the Metrics View

    You can compare metrics between data splits, between individual slices, or between classes.

  • Filter and sort on metrics in the Gallery View, which helps identify low-confidence predictions, invalid predictions, labeling mistakes, and so on.

Custom metrics

You can use the Python SDK to create and upload custom metrics to a model run.

You can create custom values specific to your data requirements and use them to filter and sort your data rows. This helps you study and better understand your data and your model performance.

You can associate the following types of custom metrics with predictions or annotations:

  • Bounding boxes
  • Bounding boxes with radio subclasses
  • Checklist questions
  • Radio button questions
  • Nested checklist and radio button questions and answers
  • Free-form text
  • Points
  • Polygons
  • Polylines
  • Segmentation masks

View custom metrics

To view custom metrics for a given data row:

  1. Choose Model from the Labelbox main menu and then select the Experiment type.
  2. Use the list of experiments to select the model run containing your custom metrics.
  3. Select a data row to open Detail view.
  4. Use the Annotations and Predictions panels to view custom metrics.

In Model, you can use custom metrics to filter and sort model run data rows. To learn more, see Filtering on custom metrics.

In Annotate, you can filter data rows by the values uploaded as custom metrics.

Automatic metrics

Automatic metrics are computed for every feature and can be sorted and filtered at the feature level. They are computed for all data rows that contain at least one prediction and at least one annotation.

Automatic metrics are supported only for ontologies with fewer than 4,000 features. To learn more about these and other limits, see Limits.

Once you upload your model predictions to a model run, Labelbox will automatically compute the following metrics:

  • True positive
  • False positive
  • True negative
  • False negative
  • Precision
  • Recall
  • F1 score
  • Intersection over union (IoU)
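
As an illustration of how the ratio metrics in this list relate to the counts, here is a minimal sketch using the standard textbook definitions of precision, recall, and F1. The function name `precision_recall_f1` is hypothetical, and returning 0.0 for an empty denominator is a convention chosen for this sketch, not documented Labelbox behavior.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from true positive, false positive,
    and false negative counts, using the standard definitions."""
    # Assumption: an empty denominator yields 0.0 (a convention for this sketch).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Example counts (made up for illustration).
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```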

Confusion matrix

Labelbox automatically generates a confusion matrix. The confusion matrix helps you understand your model's performance for every class. You can also inspect invalid predictions.

You can use a confusion matrix with classification models, but only when the IoU threshold is set to zero (0).

Diagonal cells of the confusion matrix indicate true positive predictions by the model (i.e., the predicted feature matches the ground truth feature). Conversely, non-diagonal cells of the confusion matrix correspond to false positives and false negatives (i.e., the predicted feature does not match the ground truth feature).

The confusion matrix also includes an additional feature that is not a part of your model run ontology: the None feature. The None feature is useful in identifying predictions that were not matched to any annotation, as well as annotations that were not matched to any prediction.
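
To make the role of the None feature concrete, the following sketch counts confusion matrix cells from a list of matched and unmatched pairs. The pair format, the `confusion_counts` helper, and the sample classes are assumptions made for illustration; they do not reflect how Labelbox stores this data.

```python
from collections import Counter

NONE = "None"  # stand-in label for "no match", mirroring the None feature


def confusion_counts(pairs):
    """Count (ground truth class, predicted class) cells.

    Each pair is (truth, predicted); a Python None on either side marks
    an unmatched annotation or an unmatched prediction.
    """
    counts = Counter()
    for truth, predicted in pairs:
        counts[(truth or NONE, predicted or NONE)] += 1
    return counts


pairs = [
    ("cat", "cat"),  # diagonal cell: classes match
    ("cat", "dog"),  # off-diagonal cell: classes differ
    ("dog", None),   # annotation not matched to any prediction
    (None, "dog"),   # prediction not matched to any annotation
]
cells = confusion_counts(pairs)
for (truth, predicted), n in sorted(cells.items()):
    print(f"truth={truth:<5} predicted={predicted:<5} count={n}")
```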

The confusion matrix is interactive. Clicking any matrix cell opens the Gallery View of the model run, filtered to the examples that correspond to that cell.

Precision-recall curve

Labelbox generates a precision-recall curve. It plots your model's precision and recall at every confidence threshold.

This precision-recall curve is critical for picking the optimal confidence threshold for your model. You can pick the balance between precision and recall (between false positives and false negatives) for your specific use case.

You can display the precision-recall curve for all features or a specific feature. This lets you choose the optimal confidence threshold for your use case for every class.
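
The sketch below shows one simplified way to derive precision-recall points from scored predictions. It assumes each prediction has already been labeled correct or incorrect and that this labeling does not change with the threshold, which is a simplification. The `pr_curve` helper, the sample data, and the convention of reporting precision as 1.0 when no predictions remain are all assumptions for illustration.

```python
def pr_curve(predictions, num_annotations, thresholds):
    """Return (threshold, precision, recall) tuples.

    predictions: list of (confidence, is_correct) pairs.
    num_annotations: total number of ground truth annotations.
    """
    curve = []
    for t in thresholds:
        # Predictions below the confidence threshold are discarded.
        kept = [ok for conf, ok in predictions if conf >= t]
        tp = sum(kept)
        fp = len(kept) - tp
        fn = num_annotations - tp
        # Assumption: precision is 1.0 when no predictions are kept.
        precision = tp / (tp + fp) if kept else 1.0
        recall = tp / (tp + fn) if num_annotations else 0.0
        curve.append((t, precision, recall))
    return curve


# Made-up predictions: (confidence, matches ground truth?)
predictions = [(0.95, True), (0.80, True), (0.60, False), (0.40, True), (0.20, False)]
thresholds = [i / 10 for i in range(11)]
curve = pr_curve(predictions, num_annotations=4, thresholds=thresholds)
for t, precision, recall in curve:
    print(f"threshold={t:.1f} precision={precision:.2f} recall={recall:.2f}")
```

Raising the threshold in this example trades recall for precision, which is the balance the curve is meant to expose.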

Confidence thresholds and IoU thresholds

Labelbox automatically generates metrics for several confidence thresholds and several IoU thresholds. This helps machine learning teams fine-tune the confidence threshold of their model and the IoU threshold for error analysis.

Valid values for each threshold range between zero (0) and one (1). Predictions with confidence levels below zero are ignored. Predictions uploaded without confidence scores are treated as if their confidence score was set to one (1).

A true positive occurs when a prediction and an annotation of the same class have an IoU value higher than the selected IoU threshold.
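
For bounding boxes, IoU is the area of overlap divided by the area of the union. The following is a minimal sketch of that calculation and of the threshold comparison described above. The `box_iou` helper and the `(left, top, right, bottom)` box layout are assumptions for illustration, not the Labelbox implementation.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (left, top, right, bottom)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


annotation = (0, 0, 10, 10)
prediction = (5, 0, 15, 10)  # same class assumed; overlaps half of the annotation
iou = box_iou(annotation, prediction)  # overlap 50, union 150

for iou_threshold in (0.3, 0.5):
    print(f"IoU={iou:.3f} threshold={iou_threshold} true_positive={iou > iou_threshold}")
```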

When browsing model run data rows, you can use threshold values to filter the data rows to different value ranges. This helps you understand how each threshold affects the automatic metrics and the confusion matrix.

By default, you can toggle between:

  • 11 values of confidence thresholds: 0, 0.1, 0.2, 0.3, ..., 0.7, 0.8, 0.9, 1
  • 11 values of IoU thresholds: 0, 0.1, 0.2, 0.3, ..., 0.7, 0.8, 0.9, 1

You can refine these thresholds to cover any supported range. For example, it is possible to explore the range of [0.5, 0.51, 0.52, 0.53, ..., 0.59, 0.6] for confidence thresholds.
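
As a quick arithmetic check, both the default range and the refined example above are evenly spaced lists of 11 values. The `threshold_range` helper below is hypothetical and only illustrates the spacing; the thresholds themselves are set in the Labelbox UI.

```python
def threshold_range(start, stop, steps=10):
    """Evenly spaced thresholds from start to stop inclusive (steps + 1 values)."""
    # Rounding removes floating-point noise such as 0.5700000000000001.
    return [round(start + (stop - start) * i / steps, 4) for i in range(steps + 1)]


default = threshold_range(0.0, 1.0)
refined = threshold_range(0.5, 0.6)
print(default)
print(refined)
```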

To refine the range of thresholds, open the Display panel and select the Settings icon for the metric you want to change. From there, you can customize or delete the predefined threshold values.

Access the threshold settings

You can customize the predefined values

Automatic metric calculations

To generate automatic metrics and the confusion matrix, Labelbox matches predictions to ground truths for each data row.

Here's how the matches are calculated:

  1. Predictions below the selected confidence threshold are discarded.
  2. Predictions and annotations are greedily matched by decreasing IoU.
  3. For each prediction/annotation pair:
    1. If the IoU is above the IoU threshold, and the prediction and annotation haven't been matched so far, then they are matched.
    2. The new pair is then classified as either a true positive (predicted class is the ground truth class) or a false positive (predicted class is not the ground truth class).
  4. Unmatched annotations are considered false negatives.
  5. Unmatched predictions are considered false positives.
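
The steps above can be sketched in a few lines of Python. This is an illustrative reading of the list for bounding boxes, not the Labelbox implementation: the dictionary keys (`cls`, `box`, `confidence`), the helper names, and the sample data are assumptions. Following the description of confidence thresholds, a prediction without a confidence score is treated as having a score of 1.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (left, top, right, bottom)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_predictions(predictions, annotations, confidence_threshold, iou_threshold):
    """Greedy matching for one data row; returns tp / fp / fn counts."""
    # Step 1: discard predictions below the confidence threshold.
    kept = [p for p in predictions if p.get("confidence", 1.0) >= confidence_threshold]

    # Step 2: rank every prediction/annotation pair by decreasing IoU.
    candidates = sorted(
        (
            (box_iou(p["box"], a["box"]), pi, ai)
            for pi, p in enumerate(kept)
            for ai, a in enumerate(annotations)
        ),
        key=lambda c: c[0],
        reverse=True,
    )

    matched_preds, matched_anns = set(), set()
    counts = {"tp": 0, "fp": 0, "fn": 0}

    # Step 3: match pairs above the IoU threshold, then classify each new pair.
    for iou, pi, ai in candidates:
        if iou > iou_threshold and pi not in matched_preds and ai not in matched_anns:
            matched_preds.add(pi)
            matched_anns.add(ai)
            same_class = kept[pi]["cls"] == annotations[ai]["cls"]
            counts["tp" if same_class else "fp"] += 1

    # Steps 4 and 5: leftovers become false negatives and false positives.
    counts["fn"] += len(annotations) - len(matched_anns)
    counts["fp"] += len(kept) - len(matched_preds)
    return counts


# Made-up data row for illustration.
annotations = [
    {"cls": "cat", "box": (0, 0, 10, 10)},
    {"cls": "dog", "box": (20, 20, 30, 30)},
    {"cls": "cat", "box": (50, 50, 60, 60)},
]
predictions = [
    {"cls": "cat", "box": (1, 1, 10, 10), "confidence": 0.9},        # right class
    {"cls": "cat", "box": (20, 20, 30, 28), "confidence": 0.8},      # wrong class
    {"cls": "dog", "box": (100, 100, 110, 110), "confidence": 0.7},  # no overlap
    {"cls": "cat", "box": (50, 50, 60, 60), "confidence": 0.2},      # low confidence
]

strict = match_predictions(predictions, annotations, confidence_threshold=0.5, iou_threshold=0.5)
loose = match_predictions(predictions, annotations, confidence_threshold=0.0, iou_threshold=0.5)
print(strict)
print(loose)
```

Lowering the confidence threshold in this example keeps the fourth prediction, which turns one false negative into a true positive.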

Automatic metric update timing

Automatic metrics may take a few minutes to calculate. Metric filters become available after metrics are generated. This means there can be a brief delay before metrics can be filtered.

Notification banners are displayed while metrics are generated.

While Labelbox calculates metrics, a notification banner shows the exact count of data rows remaining.

Automatic metric failure state

A notification banner appears when automatic generation fails. Select Retry to try again.

Supported types

Automatic metrics are calculated for the following asset data types and annotation types:

Data type     Annotation type
Image         Classification, bounding box, segmentation, polygon, polyline, point
Geospatial    Classification, bounding box, segmentation, polygon, polyline, point
Text          Classification, named entity (NER)