Find model errors (error analysis)

Error analysis is the process ML teams use to analyze where model predictions disagree with ground truth labels. A disagreement between model predictions and ground truth labels can be due to a model error (poor model prediction) or a labeling mistake (ground truth is wrong). In this section, we detail 3 workflows to surface model errors—edge cases on which the model is struggling—using Labelbox.

Before you start

Before engaging in error analysis, you should:

  1. Go to the Model tab.

  2. Open the model you want to perform error analysis on.

  3. Select the model run you want to perform error analysis on.

Workflow 1: Find model errors using filters in the gallery view

Follow these steps to use filters and metrics in the gallery view to surface model errors. You can adapt this workflow to your specific use case.

  1. Inside the model run, go to the gallery view by clicking the gallery icon on the right.

  2. Optionally, select the validation or test split. Some machine learning teams prefer doing error analysis on the validation or test splits only, rather than on all the training data.

  3. Filter data rows to keep only disagreements between model predictions and ground truth labels. To do so, add a filter on metrics to keep only data rows with low metric values (IOU between 0 and 0.5 in our example; see the IoU sketch after the figures below). In the image example below, 307 data rows match these filters.

  4. [ Option 1 ] Surface data rows where the disagreement is the highest. To do so, you can sort data rows by metric value in increasing order (IOU in our example). Predictions with the lowest metric values are likely to be model errors.


Filter and sort to keep the largest disagreements between model predictions and ground truth labels, on images.


Filter and sort to keep the largest disagreements between model predictions and ground truth labels, on text.
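The IOU metric used in these filters measures the overlap between a prediction and its ground truth label, from 0 (no overlap) to 1 (perfect match). As a point of reference, here is a minimal, illustrative sketch of how IoU can be computed for two axis-aligned bounding boxes; the box format and function are assumptions for illustration, not Labelbox's internal implementation.

```python
# Illustrative only: how intersection-over-union (IoU) is typically
# computed for two axis-aligned boxes given as (x_min, y_min, x_max, y_max).
# This is NOT Labelbox's internal implementation.

def iou(box_a, box_b):
    """Return IoU in [0, 1] for two (x_min, y_min, x_max, y_max) boxes."""
    # Coordinates of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    # An empty intersection means the IoU is 0.
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    if inter == 0.0:
        return 0.0

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A prediction that only partially overlaps its ground truth falls in
# the 0-0.5 band that the filter above targets.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # ~0.143
```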

  5. [ Option 2 ] Surface data rows where the model is least confident. To do so, you can sort data rows by confidence in increasing order. This assumes you have uploaded model confidence to the model run as a scalar metric (see the sketch at the end of this workflow). Predictions that have both low metrics (IOU) and low model confidence are likely to be edge cases on which the model is struggling.

  6. Then, manually inspect some of the surfaced data rows in detail. The goal is to find patterns of edge cases on which the model is struggling; it is common practice to manually inspect hundreds of data rows to find these patterns. Click a thumbnail to open the detailed view. For the best error analysis experience, change the display setting to Color by feature, so you can easily visualize where predictions and labels disagree.


The Detailed view helps you inspect disagreements. The goal is to find patterns of model failures on the image.

In this image example, you can see several occurrences of data rows where the model predicts a basketball_court instead of a ground_track_field. We found a pattern of model failures: "The model seems to struggle to distinguish ground track fields and basketball courts, especially when they have green and brown colors".


The Detailed view helps you inspect disagreements. The goal is to find patterns of model failures in the text.

In this text example, you can see several occurrences of data rows where the model makes poor predictions on scientific words such as "Gelidium". We found a pattern of model failures (i.e., the model seems to struggle with text related to scientific concepts).

  7. Double-check that you have found a pattern of model failure.

In the image example, we filter to keep only data rows that contain a basketball_court or ground_track_field annotation and that have low IOU. This surfaces many examples of the exact edge case discovered above: "The model struggles to distinguish ground track fields and basketball courts, especially when they have green and brown colors".


Many ground track fields and basketball courts are being mispredicted (low IOU). This is a pattern of model failure.

By browsing through examples in this pattern of model failure, you can see that many basketball courts have brown and green colors, just like ground track fields.


Labelbox helps you find patterns of model failures. In this case, the model struggles to distinguish ground track fields and basketball courts, especially when they have green and brown colors.

In the text example, we filter to keep only data rows that contain a miscellaneous annotation and that have low IOU. This surfaces many examples of the exact edge case discovered above: "The model struggles to make accurate NER predictions on scientific concepts".


Labelbox helps you find patterns of model failures. In this case, the model struggles to make accurate NER predictions on scientific concepts.

After you have surfaced edge cases on which the model is struggling and found a pattern of model failure, you can take action to fix model errors and improve model performance.
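Step 5 above assumes that model confidence was uploaded to the model run as a scalar metric. If you upload predictions with the Labelbox Python SDK, a sketch along the following lines attaches a confidence metric to each prediction; treat the exact identifiers (ScalarMetric, the annotations list, the metric name confidence) as assumptions to verify against the SDK version you use.

```python
# Hedged sketch: attaching model confidence as a scalar metric to each
# prediction Label before uploading it to a model run with the Labelbox
# Python SDK. Verify the class and field names against your SDK version.
from labelbox.data.annotation_types import ScalarMetric

def attach_confidence(labels, confidences):
    """Append a 'confidence' scalar metric to each prediction Label.

    `labels` is your list of Labelbox Label objects and `confidences` a
    parallel list of floats produced by your model (both assumed here).
    """
    for label, conf in zip(labels, confidences):
        label.annotations.append(
            ScalarMetric(metric_name="confidence", value=conf)
        )
    return labels

# After attaching the metric, upload `labels` to the model run the way
# you normally upload predictions; the metric then becomes sortable and
# filterable in the gallery view.
```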

Workflow 2: Find model errors using model metrics

The metrics view is a powerful tool for doing error analysis.

By looking at the scalar metrics, you might notice that the model is struggling to detect basketball_court ground truths. Clicking the histogram bar corresponding to basketball courts opens the gallery view, with filtering and sorting already applied, showing the data rows in that bar of the histogram.

This is an alternative to steps 1-4 of Workflow 1.


The metrics view is a good way to identify classes on which the model is struggling.
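Conceptually, each histogram bar is a per-class aggregate of the prediction metrics. If you ever need to reproduce that signal outside the UI, for example from exported predictions, a minimal sketch could look like the following; the (class_name, iou) pair format is an assumption for illustration, since Labelbox computes this aggregation for you.

```python
# Illustrative only: reproducing a per-class mean-IoU summary from
# exported (class_name, iou) pairs. The pair format is an assumption;
# Labelbox computes this aggregation for you in the metrics view.
from collections import defaultdict

def mean_iou_per_class(pairs):
    """pairs: iterable of (class_name, iou) tuples."""
    totals = defaultdict(lambda: [0.0, 0])  # class -> [sum, count]
    for cls, iou in pairs:
        totals[cls][0] += iou
        totals[cls][1] += 1
    return {cls: s / n for cls, (s, n) in totals.items()}

pairs = [
    ("basketball_court", 0.31),
    ("basketball_court", 0.42),
    ("ground_track_field", 0.78),
]
# Classes with low mean IoU are the ones the model struggles with.
print(mean_iou_per_class(pairs))
```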

Workflow 3: Find model errors using the projector view

The projector view is a powerful way to do error analysis.

In the image below you can see that data rows containing the basketball_court annotation and those containing the ground_track_field annotation overlap. The two classes are not easy to separate in the embedding space. This is an indicator that the model is likely to struggle with the data rows at the intersection of the two clusters.

In the projector view, you can select the data rows that are at the intersection of the basketball_court cluster and the ground_track_field cluster. The model is likely to struggle with these data rows. Once the data rows are selected, you can switch back to the grid view, and inspect these data rows.

This is an alternative to steps 1-4 of Workflow 1.
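The projector view plots a 2D projection of each data row's high-dimensional embedding, which Labelbox computes for you. As a rough illustration of why overlapping clusters signal confusable classes, here is a minimal PCA projection sketch using numpy; the synthetic embeddings and class names are assumptions for illustration, not Labelbox's projection method.

```python
# Illustrative only: a minimal PCA projection of high-dimensional
# embeddings down to 2D, the kind of projection the projector view
# shows. This is not Labelbox's implementation.
import numpy as np

def project_2d(embeddings):
    """embeddings: (n_samples, n_dims) array -> (n_samples, 2) array."""
    centered = embeddings - embeddings.mean(axis=0)
    # Principal directions via SVD; keep the top two components.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
# Two synthetic classes whose embeddings partially overlap.
court = rng.normal(loc=0.0, scale=1.0, size=(100, 64))
track = rng.normal(loc=0.5, scale=1.0, size=(100, 64))
points = project_2d(np.vstack([court, track]))
# Points from different classes that land close together in 2D are
# the confusable data rows worth inspecting first.
print(points.shape)  # (200, 2)
```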


What’s Next