Model diagnostics

Throughout the process of developing a machine learning (ML) model, you will want to investigate model failures to prioritize labeling efforts. Examining patterns in model error after each training iteration helps you decide whether you need to rework annotations, revise the ontology, or sample more data. Following this process will reduce labeling costs, speed up iteration, and increase trust in your models.

Supported type

Data type

Annotation type


Bounding Box, Polygon, Segmentation, Point, Polyline, Classification


Bounding box, Polygon, Polyline, Classification



Model diagnostics workflow

Labelbox now offers a Model Diagnostics workflow where you can run experiments on your ML model and analyze the performance of your model's predictions across each experiment.

With the Model Diagnostics workflow, you can:

  • Visualize model predictions and ground truth to better understand model behavior

  • View and plot aggregate model metrics from model training

  • Identify latent patterns in data and model performance with the embedding projector

Here are some important terms that will help you understand how the Model Diagnostics feature works:




Model

In this workflow, the Model object represents a model or machine learning product you are developing. Functionally, a Model is a container for related Model Runs.

Model Run

A Model Run represents an experiment or version of a Model. A Model Run could be a different configuration of hyper-parameters or an expanded training dataset. Model Runs under the same Model can be compared with one another.
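The container relationship between a Model and its Model Runs can be sketched in plain Python. This is only an illustration of the concept, not the Labelbox SDK: the `Model` and `ModelRun` classes, the `mean_iou` field, and all values below are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ModelRun:
    """One experiment or version of a model (hypothetical stand-in)."""
    name: str
    config: dict
    mean_iou: float  # made-up summary metric used to compare runs


@dataclass
class Model:
    """Container that groups related runs (hypothetical stand-in)."""
    name: str
    runs: list = field(default_factory=list)

    def add_run(self, run: ModelRun) -> None:
        self.runs.append(run)

    def best_run(self) -> ModelRun:
        # Runs under the same Model are comparable, so pick the top scorer.
        return max(self.runs, key=lambda r: r.mean_iou)


# Two runs of the same model: a baseline and one with an expanded dataset.
detector = Model(name="mailbox-detector")
detector.add_run(ModelRun("baseline", {"learning_rate": 1e-3}, mean_iou=0.61))
detector.add_run(ModelRun("more-data", {"learning_rate": 1e-3}, mean_iou=0.68))
print(detector.best_run().name)
```

Keeping every run under one parent is what makes a side-by-side comparison meaningful: the runs share a task and differ only in configuration or data.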


Slice

A Slice represents a subset of your training dataset bound by a common characteristic. From the Models tab in the app, you can create slices to visually inspect subsets of your training dataset and view the IoU metrics reported on each slice. More detail on slices is in the following section.
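For readers unfamiliar with IoU (intersection over union), here is a minimal sketch of the metric for two axis-aligned bounding boxes. It assumes boxes are given as `(x_min, y_min, x_max, y_max)`; the function name `box_iou` is hypothetical, and this page does not state exactly how Labelbox computes or aggregates IoU.

```python
def box_iou(box_a, box_b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b

    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    intersection = inter_w * inter_h

    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - intersection

    # Guard against degenerate (zero-area) boxes.
    return intersection / union if union > 0 else 0.0


ground_truth = (0, 0, 10, 10)
prediction = (5, 0, 15, 10)
print(box_iou(ground_truth, prediction))
```

An IoU of 1.0 means the prediction matches the ground truth exactly, and 0.0 means the two boxes do not overlap at all.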


After you build your Model Diagnostics analysis pipeline, you can use the Models tab to visually compare your model’s predictions against your ground truth annotations on your training dataset.

More specifically, you can preview a Data Row and toggle the predictions and annotations on or off to visually assess your model's ability to predict a given annotation class.

Then, you can use Slices to assess your model's performance by annotation class. For example, if you create the filter Annotation = “Mailbox”, Labelbox returns all of the Data Rows that contain a “Mailbox” ground truth annotation or prediction. You can then turn that subset of Data Rows into a slice called “Mailboxes” and use the IoU metrics to check the Data Rows in this slice for discrepancies between the ground truth labels and the model predictions.
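The filtering logic described above can be sketched in plain Python. This is an illustration only and does not use the Labelbox SDK: the record layout, the `make_slice` helper, and the IoU values are all hypothetical and made up for the example.

```python
from statistics import mean

# Hypothetical Data Row records; class names and IoU values are made up.
data_rows = [
    {"id": "row-1", "annotations": ["Mailbox", "Car"], "predictions": ["Mailbox"], "iou": 0.82},
    {"id": "row-2", "annotations": ["Car"], "predictions": ["Mailbox"], "iou": 0.10},
    {"id": "row-3", "annotations": ["Tree"], "predictions": ["Tree"], "iou": 0.90},
    {"id": "row-4", "annotations": ["Mailbox"], "predictions": [], "iou": 0.0},
]


def make_slice(rows, class_name):
    """Keep rows where the class appears in the ground truth OR the predictions."""
    return [
        row for row in rows
        if class_name in row["annotations"] or class_name in row["predictions"]
    ]


mailboxes = make_slice(data_rows, "Mailbox")
slice_ids = [row["id"] for row in mailboxes]
slice_mean_iou = mean(row["iou"] for row in mailboxes)

# Sort lowest IoU first so the largest discrepancies are reviewed first.
worst_first = sorted(mailboxes, key=lambda row: row["iou"])

print(slice_ids)
print(round(slice_mean_iou, 3))
print([row["id"] for row in worst_first])
```

Matching on either the annotation or the prediction matters: it surfaces both missed mailboxes (annotated but not predicted) and false detections (predicted but not annotated).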

Once you fix the errors you identified in your training dataset, you can use the improved training dataset to retrain your ML model to improve its capability to detect mailboxes.

Complete tutorial in Python SDK

Python tutorials:

Model diagnostics guide: Open in Github, Open in Google Colab

Metrics basics: Open in Github, Open in Google Colab

Metrics demo: Open in Github, Open in Google Colab
