Once you have created a model run and uploaded your predictions, you can start analyzing your model’s performance. This section will guide you through the various tools and features available in Labelbox to help you understand where your model is succeeding and where it is failing.

Filtering and sorting

Labelbox provides powerful tools for exploring the data within your model run. You can filter and sort your data based on a wide range of attributes, including:
  • Annotations: Find data rows with or without certain annotations.
  • Predictions: Filter data based on your model’s predictions.
  • Metrics: Sort your data by performance metrics like IoU and confidence.
  • Metadata: Use your own custom metadata to filter your data.
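These filter attributes can be thought of as predicates over your data rows. The sketch below is purely illustrative and runs locally; the record fields (`annotation`, `prediction`, `iou`, `confidence`) are hypothetical stand-ins for the attributes the model run view exposes, not the Labelbox API itself.

```python
# Hypothetical local records standing in for data rows in a model run.
records = [
    {"id": "dr-1", "annotation": "car", "prediction": "car", "iou": 0.91, "confidence": 0.97},
    {"id": "dr-2", "annotation": "car", "prediction": None, "iou": 0.0, "confidence": 0.0},
    {"id": "dr-3", "annotation": None, "prediction": "truck", "iou": 0.0, "confidence": 0.42},
]

# Annotations filter: data rows that have a ground truth annotation.
with_annotation = [r for r in records if r["annotation"] is not None]

# Predictions filter: data rows where the model predicted something.
with_prediction = [r for r in records if r["prediction"] is not None]

# Metrics sort: order by confidence, lowest first, to surface uncertain rows.
by_confidence = sorted(records, key=lambda r: r["confidence"])
print([r["id"] for r in by_confidence])  # ['dr-2', 'dr-3', 'dr-1']
```

Combining filters (for example, rows with an annotation but no matching prediction) is how you narrow in on specific failure modes before saving the result as a slice.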

Save data as a slice

When you have a set of filters that you want to reuse, you can save them as a slice. A slice is a dynamic, saved query that will automatically update as you add new data to your project. You can use slices to:
  • Track the performance of your model on specific subsets of your data.
  • Identify high-impact data for re-labeling.
  • Create automated data curation workflows.
To manually create a slice in the UI:
  1. Select a subset of data rows using filters in the model run view.
  2. Select the data rows to include.
  3. Click Save slice. You will be prompted to give the slice a name (3 to 30 characters) and an optional description.
To create a slice programmatically, see our Model run slice developer guide. After you create a slice, it will appear in the left side panel of the model run view. You may modify the attributes of the slice later by updating its filters.
See Limits for the constraints that apply when creating slices.
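Conceptually, a slice behaves like a named, saved query: its filters are stored once and re-evaluated against the current data, which is why it picks up rows added later. The sketch below illustrates that idea locally; the predicate and field names are hypothetical and not the Labelbox SDK.

```python
# Conceptual sketch only: a "slice" as a named, saved filter that is
# re-evaluated each time it is applied, so new data is included automatically.
def make_slice(name, predicate):
    return {"name": name, "predicate": predicate}

low_iou = make_slice("low-iou", lambda row: row["iou"] < 0.5)

data_rows = [{"id": "dr-1", "iou": 0.9}, {"id": "dr-2", "iou": 0.3}]
matches = [r["id"] for r in data_rows if low_iou["predicate"](r)]
print(matches)  # ['dr-2']

# New data arrives later: the same saved slice now includes the new match.
data_rows.append({"id": "dr-3", "iou": 0.1})
matches = [r["id"] for r in data_rows if low_iou["predicate"](r)]
print(matches)  # ['dr-2', 'dr-3']
```

This dynamic behavior is what makes slices useful for ongoing tracking rather than one-off analysis.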

Auto-generated slices

When you create a model run and associate ground truth labels, Labelbox automatically generates a set of default slices. These slices act as powerful, pre-built filters that help you immediately begin diagnosing your model’s performance. They provide the foundational building blocks for a comprehensive model error analysis workflow. Here is a detailed breakdown of each auto-generated slice and how to leverage it for model improvement.
True positive
What it is: This slice contains every data row where your model correctly predicted an object that matched a ground truth label, according to the IoU threshold you’ve set.
Why it’s useful: This is your “success” bucket. Analyzing your true positives helps you understand what your model is getting right and how confidently it’s doing so. By sorting this slice by confidence score, you can find examples where your model was “hesitantly correct” (low confidence true positives), which can be just as interesting as your errors.

False positive
What it is: This slice contains every data row where your model made a prediction for which there was no corresponding ground truth label. In essence, your model is “hallucinating” or seeing things that aren’t there.
Why it’s useful: This is one of the most critical slices for debugging. It isolates the specific examples where your model is being confused by background textures, reflections, unusual shapes, or other patterns. Analyzing your false positives is key to reducing noise and increasing your model’s precision.

False negative
What it is: This slice contains every data row where a ground truth label exists, but your model completely failed to predict it. This is your “missed detections” bucket.
Why it’s useful: This slice reveals your model’s blind spots. Are you consistently missing small objects? Objects in shadow? Objects with a rare orientation? False negatives highlight the specific categories and scenarios where your model lacks the ability to see what it’s supposed to. Reducing false negatives is critical for improving your model’s recall and ensuring it is reliable in production.

True negative
What it is: This slice contains every data row where there is no ground truth label and your model also made no prediction. This concept is most relevant for global or document-level classification tasks. For object detection, this slice represents the background where your model correctly remained silent.
Why it’s useful: This slice represents your model’s ability to correctly identify and ignore irrelevant data or background noise. It’s the foundation of a “clean” model that doesn’t produce excessive false alarms. Analyzing this slice helps confirm that your model is confident in its decision to not make a prediction.

Low precision
What it is: This slice contains data rows where your model’s precision for a specific class is low. Precision measures the accuracy of your model’s positive predictions (True Positives / (True Positives + False Positives)). In simple terms, a low precision score means the model is making a high number of False Positive predictions for that class.
Why it’s useful: This slice is your go-to filter for understanding “hallucinations.” It isolates the classes where your model is most frequently predicting objects that aren’t actually there. It’s a more targeted version of the general “False positive” slice, helping you prioritize which classes are contributing the most to model noise.

Low recall
What it is: This slice contains data rows where your model’s recall for a specific class is low. Recall (also known as sensitivity) measures the model’s ability to find all of the actual positive examples (True Positives / (True Positives + False Negatives)). A low recall score means the model is missing a high number of objects for that class.
Why it’s useful: This is your starting point for diagnosing “blind spots.” It immediately shows you which classes your model is struggling to detect. It is a more targeted version of the general “False negative” slice, helping you prioritize which classes need more representative data.

Low F1-score
What it is: This slice identifies data rows belonging to classes with a low F1-score. The F1-score is the harmonic mean of precision and recall, providing a single, balanced measure of a model’s performance. A low F1-score indicates a problem with either precision, recall, or both.
Why it’s useful: This is the ultimate “where should I start?” filter. Sorting your classes by F1-score and starting with the lowest-performing one is often the most efficient way to begin your error analysis. It guides you to the classes that have the most room for improvement, without you having to guess whether precision or recall is the bigger problem.

Low confidence
What it is: This slice contains every prediction where your model’s confidence score was below a certain threshold (e.g., less than 50%). It’s important to note that these are not necessarily incorrect predictions; they are simply predictions where the model is expressing uncertainty.
Why it’s useful: This slice is a direct window into the “edge of your model’s knowledge.” These are the examples that your model found most challenging or ambiguous. This makes them prime candidates for active learning. Reviewing this data can be more impactful than reviewing thousands of high-confidence, “easy” examples.

Candidate mislabels
What it is: This powerful slice identifies data rows where your model made a high-confidence prediction that directly disagrees with the ground truth label. For example, the model predicts “car” with 98% confidence, but the ground truth label says “truck”.
Why it’s useful: This slice flips the script from debugging your model to debugging your data. A high-confidence disagreement often indicates an error in the ground truth label, not in the model’s prediction. Finding and fixing these labeling errors is one of the most effective ways to create a high-quality dataset and, consequently, a better model.

Splits: Training, validation, and test

Labelbox makes it easy to split your data into training, validation, and test sets. This is a crucial step in preventing your model from overfitting to your training data. You can configure the data splits when you create a new model run, or you can modify them later.
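The idea behind the splits can be sketched locally. The function below is a hypothetical illustration of an 80/10/10 shuffled split with a fixed seed for reproducibility; it is not the Labelbox API, which handles split assignment for you.

```python
import random

# Illustrative only: assign data row IDs to training/validation/test splits
# with an 80/10/10 ratio. A fixed seed makes the shuffle reproducible.
def split_data(data_row_ids, train=0.8, valid=0.1, seed=42):
    ids = list(data_row_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train)
    n_valid = int(len(ids) * valid)
    return {
        "training": ids[:n_train],
        "validation": ids[n_train:n_train + n_valid],
        "test": ids[n_train + n_valid:],
    }

splits = split_data([f"dr-{i}" for i in range(100)])
print({k: len(v) for k, v in splits.items()})
# {'training': 80, 'validation': 10, 'test': 10}
```

Shuffling before splitting matters: without it, any ordering in your upload (by date, by source, by class) leaks systematic bias into the splits, and your validation metrics stop reflecting real-world performance.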