Quality analysis

Learn how to set up benchmarks and consensus scoring to analyze the quality of your labels.

Labelbox provides two quality analysis approaches to ensure the accuracy and performance of your labels:

  • Benchmarking allows you to designate a labeled data row as a gold standard for other labels. It’s useful for assessing the accuracy of annotations.
  • Consensus scoring compares annotations on the same data row by different labelers to determine a consensus winner. It’s useful for facilitating immediate corrective actions to improve both training data and model performance.

You can apply benchmarking, consensus scoring, or both to an Annotate project.

Benchmarks

Benchmarks serve as the gold standard for other labels. You can designate specific data rows with annotations as benchmarks, and all other annotations on these data rows are automatically compared to these benchmark reference labels to calculate benchmark agreement scores.

For a benchmark agreement to be calculated, one benchmark label and at least one non-benchmark label need to be on the data row. Whenever annotations are added or updated, the benchmark agreement is recalculated as long as at least one non-benchmark label exists on the data row.

A benchmark agreement score ranges from 0 to 1. A score of 0 indicates no agreement with the benchmark reference label, and a score of 1 represents complete agreement.

📘

Benchmarks in the queue

Labelbox ensures that the first five data rows in your labeling queue are benchmark labels. After these initial five, the order of the remaining benchmarks is not guaranteed. Beyond the first five, there is a 10% chance that any subsequent data row you encounter will be a benchmark label. If you have fewer than five benchmarks, they are the first data rows in your queue, and no additional benchmarks appear unless you add more.

Supported types

Currently, benchmarking is supported for the following asset and annotation types:

Asset typeBounding boxPolygonPolylinePointSegmentation maskEntityRadioChecklist
ImagesN/A
Videos--N/A
AudioN/AN/AN/AN/AN/AN/A
TextN/AN/AN/AN/AN/A
Tiled imagery-N/A
DocumentsN/AN/AN/AN/A
HTMLN/AN/AN/AN/AN/AN/A
Conversational textN/AN/AN/AN/AN/A
Human-generated responsesN/AN/AN/AN/AN/AN/A

Set up benchmarks

To set an individual data row as a benchmark reference:

  1. In the editor, label the data row and click Submit.
  2. Navigate back to the project homepage.
  3. Go to the Data rows tab.
  4. Select the labeled data row from the list. This will open the Data row browser.
  5. Click on the three dots next to the data row and select Add as benchmark from the dropdown.

To bulk-assign labeled data rows as benchmark references:

  1. On the Data rows tab, select the data rows that you want to set as benchmarks.

  2. Click the selected dropdown and select Assign labels as benchmarks.

  3. If you have permissions to access benchmark scores, you can select from the following two options:

    • Infer automatically: Set the labels created within the benchmark quality mode as benchmark references. If the project also uses consensus as the quality mode, this option also sets consensus winners as benchmark references.
    • Target label by score: Select or set a range of benchmark scores to set labeled data rows matching the filter as benchmark references.

    If you don't have permissions to access benchmark scores or if no benchmark scores are available, you can only see and use the Infer automatically option.

  4. Click Submit.

Once a label is designated as a benchmark, the data row is automatically moved to Done. Benchmarked data rows will be served to all labelers in a project. Benchmarked data rows can't be moved to any other step unless the benchmark is removed.

Search and filter data using benchmark scores

You can use the Benchmark agreement filter to set a range of benchmark scores to find qualified data rows on:

  • An Annotate project’s Data Rows tab
  • The Catalog page

Consensus

Consensus represents the agreement within your labeling workforce. Consensus agreement scores are calculated in real time for features and labels that have multiple annotations from different labelers. Whenever an annotation is created, updated, or deleted, the consensus score is recalculated for data rows with two or more labels. There are two scopes of consensus:

  • Feature-level consensus refers to a specific feature in the ontology.
  • Label-level consensus aggregates the tools and classifications of the entire label.

A consensus agreement score ranges from 0 to 1. A score of 0 indicates no agreement among labelers, and a score of 1 indicates complete agreement.

Supported types

Currently, consensus scoring is supported for the following asset and annotation types:

Asset typeBounding boxPolygonPolylinePointSegmentation maskEntityRelationshipRadioChecklistFree-form text
ImageN/AN/A-
Video--N/AN/A-
TextN/AN/AN/AN/AN/A--
ChatN/AN/AN/AN/AN/A--
AudioN/AN/AN/AN/AN/AN/AN/A-
GeospatialN/AN/A-
DocumentsN/AN/AN/AN/AN/A
HTMLN/AN/AN/AN/AN/AN/AN/A
Human-generated responsesN/AN/AN/AN/AN/AN/AN/A

Set up consensus scoring

When adding data rows to an Annotate project, use the Queue batch option to enable consensus scoring and configure additional settings. You can't change these settings after submission.

  • Data row priority: The position in the labeling queue where these data rows are slotted, based on priority.
  • % coverage: The percentage of data rows in the batch that enter the labeling queue as consensus data rows for multi-labeling. Defaults to 0.
  • # labels: The number of labels each consensus data row receives. Defaults to 2.
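To get a feel for how the coverage and label-count settings affect labeling volume, here is a rough sketch. The function name and the rounding rule are assumptions made for illustration; they are not part of the Labelbox SDK, and Labelbox may round the coverage percentage differently.

```python
def estimate_label_count(batch_size: int, coverage_pct: float, labels_per_row: int = 2) -> int:
    """Estimate how many labels a batch needs (hypothetical helper, not an SDK call)."""
    if not 0 <= coverage_pct <= 100:
        raise ValueError("coverage_pct must be between 0 and 100")
    # Assumption: the number of consensus data rows is the rounded share of the batch.
    consensus_rows = round(batch_size * coverage_pct / 100)
    regular_rows = batch_size - consensus_rows
    # Consensus data rows are labeled labels_per_row times; all other rows once.
    return consensus_rows * labels_per_row + regular_rows


# 100 consensus rows x 3 labels + 900 regular rows x 1 label
print(estimate_label_count(1000, 10, 3))
```

With the default coverage of 0, the estimate equals the batch size, because no data rows are multi-labeled.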

📘

Consensus calculation can take up to five minutes

Select consensus winners

After a data row is labeled and enters the review stage, the first set of annotations entered for the data row represents consensus by default. Reviewers can reassign consensus to another set of annotations once the data row has more than one label.

If your data row has been labeled more than once, you'll view all of the label entries on that data row in the data row browser. The following example shows a data row with two sets of labels. The green trophy icon indicates that the first set of annotations is considered "consensus."

To change consensus, click the trophy icon next to the preferred annotations.

🚧

Recalculation of consensus agreement scores

The consensus score reflects agreement among labelers, so changing the winning label might lead to a recalculation of the score based on the new consensus.

📘

Set consensus winners as benchmark references

You can designate consensus winners as benchmarks. See Set up benchmarks.

Search and filter data using consensus scores

You can use the Consensus agreement filter to set a range of label-level consensus scores to find qualified data rows on the Catalog page.

To filter by feature-level consensus scores, navigate to a project and scroll down to the Ground Truth statistics, which show the percentage agreement between features. When you click a score, the activity table automatically applies the corresponding filter so that you can view the labels.

View and assess quality analysis performance

On a project’s Data Rows tab, you can view the average benchmark and consensus agreement scores for each benchmark data row, along with the number of labels associated with the consensus agreement. Additionally, you can assess performance using these scores on the Performance tab:

  • Labeler performance: In the Members Performance section, click on an individual labeler to see their average benchmark and consensus scores.
  • Overall project performance: Under Performance charts > Quality, the left histogram shows the average benchmark score by date range, and the right histogram displays the number of labels within specific benchmark score ranges.

Export quality analysis scores

You can export labels along with their benchmark and consensus agreement scores using the SDK. Each exported label that has a benchmark or consensus agreement includes benchmark_score and consensus_score values in the performance_details section of the resulting JSON file.

For a benchmark data row labeled multiple times, all non-benchmark labels contain a benchmark_reference_label field, which is the ID of the benchmark label they reference. The benchmark label itself doesn’t have a benchmark_reference_label field or an associated benchmark score, as it serves as the standard for comparison, not as a label being compared.
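As an illustration, the following sketch pulls scores out of an export that has already been saved as JSON. It walks the whole structure looking for performance_details sections, so it does not assume where that section is nested. The sample record is hypothetical and heavily trimmed, the placement of benchmark_reference_label inside performance_details is an assumption, and the key spelling consensus_score (with an underscore) is also assumed.

```python
import json


def iter_performance_details(node):
    """Yield every 'performance_details' dict found anywhere in a nested JSON value."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "performance_details" and isinstance(value, dict):
                yield value
            else:
                yield from iter_performance_details(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_performance_details(item)


def collect_scores(export, key):
    """Return all non-null values stored under `key` in performance_details sections."""
    return [d[key] for d in iter_performance_details(export) if d.get(key) is not None]


# Hypothetical sample: one benchmark label (no score) and one label compared against it.
sample_export = json.loads("""
[
  {"labels": [
    {"id": "label-benchmark", "performance_details": {}},
    {"id": "label-2",
     "performance_details": {"benchmark_score": 0.8,
                             "benchmark_reference_label": "label-benchmark"}}
  ]}
]
""")

scores = collect_scores(sample_export, "benchmark_score")
print(scores)
```

The benchmark label contributes nothing to the list, which matches the rule that a benchmark label has no benchmark score of its own.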

Calculation methodologies

Benchmark and consensus agreement scores are calculated using different methodologies depending on the annotation type. The scores for each annotation type are averaged to produce an overall agreement score for a data row, which helps assess the accuracy and consistency of labels.

Benchmarking methodologies

Object-type annotations

The benchmark agreement for bounding box, polygon, and segmentation mask annotations is calculated using Intersection over Union (IoU). The agreement between point annotations and polyline annotations is calculated based on proximity.

  1. First, Labelbox compares each annotation to its corresponding benchmark annotation to generate IoU scores for each annotation. The algorithm first finds the pairs of annotations to maximize the total IoU score, then it assigns an IoU value of 0 for the unmatched annotations.

  2. Then, Labelbox averages the IoU scores for each annotation belonging to the same annotation class to create an overall score for that annotation class.

"Tree" annotation class agreement = (0.99 + 0.99 + 0.97 + 0 + 0) / 5 = 0.59

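The two steps above can be sketched for bounding boxes as follows. This is a minimal illustration, not Labelbox's implementation: it tries every pairing by brute force, and it assumes that a pair with an IoU of 0 counts as two unmatched annotations, each scoring 0. The function names and box format are hypothetical.

```python
from itertools import permutations

Box = tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    overlap_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = overlap_w * overlap_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - intersection
    return intersection / union if union > 0 else 0.0


def class_agreement(benchmark: list[Box], labeled: list[Box]) -> float:
    """Average IoU for one annotation class, with unmatched annotations scoring 0."""
    total_annotations = len(benchmark) + len(labeled)
    if total_annotations == 0:
        raise ValueError("nothing to compare")
    small, large = sorted((benchmark, labeled), key=len)
    # Step 1: pick the pairing with the highest total IoU (brute force, small inputs only).
    best_pairing = max(
        (tuple(iou(s, l) for s, l in zip(small, perm))
         for perm in permutations(large, len(small))),
        key=sum,
    )
    # Assumption: only pairs that actually overlap count as matched.
    matched = sum(1 for score in best_pairing if score > 0)
    # Step 2: average over matched pairs plus every unmatched annotation (each 0).
    return sum(best_pairing) / (total_annotations - matched)


benchmark_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
labeled_boxes = [(0, 0, 10, 10), (100, 100, 110, 110)]
# One perfect match, one missed benchmark box, one extra labeled box.
print(class_agreement(benchmark_boxes, labeled_boxes))
```

Trying every pairing grows factorially with the number of annotations. For larger sets, an assignment solver such as scipy.optimize.linear_sum_assignment would scale better.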
Classifications

The calculation for each classification type varies. One commonality, however, is that if two classifications of the same type are compared and there are no corresponding selections between the two classifications at all, the agreement will be 0%.

  • A radio classification can only have one selected answer. Therefore, the agreement between the two radio classifications will either be 0% or 100%. 0% means no agreement and 100% means agreement.

  • A checklist classification can have more than one selected answer, which makes the agreement calculation a little more complex. The agreement between two checklist classifications is generated by dividing the number of overlapping answers by the number of selected answers.

For child classifications, if two annotations containing child classifications have 0 agreement (resulting in a false positive), the child classifications will automatically be assigned a score of 0 as well.

Labelbox then creates a score for each annotation class by averaging all of the per-annotation scores.
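A small sketch of the radio and checklist rules described above. The function names are hypothetical, and the checklist version assumes that "the number of selected answers" means the unique answers selected in either classification.

```python
def radio_agreement(answer_a: str, answer_b: str) -> float:
    """A radio has exactly one answer, so agreement is all or nothing."""
    return 1.0 if answer_a == answer_b else 0.0


def checklist_agreement(answers_a: set[str], answers_b: set[str]) -> float:
    """Overlapping answers divided by all answers selected in either checklist."""
    selected = answers_a | answers_b  # assumption: unique answers across both
    if not selected:
        raise ValueError("neither checklist has a selected answer")
    return len(answers_a & answers_b) / len(selected)


print(radio_agreement("sunny", "cloudy"))
print(checklist_agreement({"car", "truck"}, {"truck", "bus"}))
```

Two checklists with no answers in common score 0, which is consistent with the rule that classifications with no corresponding selections have 0% agreement.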

For example, when Image X loads in the editor, a labeler has 3 classification questions to choose from (Q1, Q2, Q3), each with two answers. The green boxes indicate the benchmark answers.

Say, for example, a labeler answers Q1 correctly but answers Q2 and Q3 incorrectly.

For classifications, the benchmark agreement is calculated from two counts: the number of unique answer schemas selected in either the benchmark label or the labeler's label, and how many of those selections match the benchmark.

  • Q1-A: 1

  • Q1-B: N/A <-- not included in the final calculation.

  • Q2-A: 0

  • Q2-B: 0

  • Q3-A: 0

  • Q3-B: 0

So the final benchmark calculation for the classifications on Image X is:

(1 + 0 + 0 + 0 + 0) / 5 = 0.20
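The Image X calculation can be reproduced with a short sketch. The function name is hypothetical, and because the original figure is not available, the choice of answer A as the benchmark for every question is an assumption.

```python
def benchmark_classification_score(benchmark: dict[str, str], labeler: dict[str, str]) -> float:
    """Share of considered answer schemas on which the labeler matches the benchmark."""
    # Turn {"Q1": "A"} into answer-schema identifiers such as "Q1-A".
    benchmark_schemas = {f"{question}-{answer}" for question, answer in benchmark.items()}
    labeler_schemas = {f"{question}-{answer}" for question, answer in labeler.items()}
    # Schemas that nobody selected (such as Q1-B here) are left out entirely.
    considered = benchmark_schemas | labeler_schemas
    if not considered:
        raise ValueError("no answers selected")
    return len(benchmark_schemas & labeler_schemas) / len(considered)


score = benchmark_classification_score(
    benchmark={"Q1": "A", "Q2": "A", "Q3": "A"},  # assumed benchmark answers
    labeler={"Q1": "A", "Q2": "B", "Q3": "B"},    # Q1 correct, Q2 and Q3 incorrect
)
print(score)  # 1 matching schema out of 5 considered
```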

Overall score

Labelbox averages the scores for each annotation class (object-type & classification-type) to create an overall score for the asset. Each annotation class is weighted equally. Below is a simplified example:

Benchmark score = (tree annotation class agreement + radio class agreement) / total annotation classes

0.795 = (0.59 + 1.00) / 2
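The same equally weighted average, written as a sketch with a hypothetical function name:

```python
from statistics import mean


def overall_score(class_scores: dict[str, float]) -> float:
    """Average the per-class agreement scores, weighting each class equally."""
    if not class_scores:
        raise ValueError("no annotation classes to average")
    return mean(list(class_scores.values()))


print(round(overall_score({"tree": 0.59, "radio": 1.00}), 3))
```

Because every annotation class carries the same weight, a class with a single annotation influences the overall score as much as a class with many.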

For text and conversations, such as human-generated responses from prompt and response generation, Labelbox also creates a model-based similarity score that accounts for various ways of expressing the same idea, such as using active versus passive voice, synonyms for the same concept, and other variations in writing style.

You can use this metric as an initial indicator of label quality, the clarity of your ontology, and/or the clarity of your labeling instructions.

Consensus scoring methodologies

Object-type annotations

Consensus agreement for bounding box, polygon, and segmentation mask annotations is calculated using Intersection over Union (IoU). The agreement between point annotations and polyline annotations is calculated based on proximity.

  1. First, Labelbox compares each annotation to its corresponding annotation to generate IoU scores for each annotation. The algorithm first finds the pairs of annotations to maximize the total IoU score, then it assigns the IoU value of 0 to any unmatched annotations.

  2. Labelbox then averages the IoU scores for each annotation belonging to the same annotation class to create an overall score for that annotation class.

"Tree" annotation class agreement = (0.99 + 0.99 + 0.97 + 0 + 0) / 5 = 0.59

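For segmentation masks, IoU can be computed pixel by pixel. The sketch below compares two labelers' masks for the same object, represented as nested lists of 0s and 1s. The function name and mask representation are assumptions for illustration only.

```python
Mask = list[list[int]]  # rows of 0/1 pixel values


def mask_iou(a: Mask, b: Mask) -> float:
    """Pixels labeled by both labelers divided by pixels labeled by either."""
    if len(a) != len(b) or any(len(row_a) != len(row_b) for row_a, row_b in zip(a, b)):
        raise ValueError("masks must have the same shape")
    intersection = union = 0
    for row_a, row_b in zip(a, b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            intersection += 1 if pixel_a and pixel_b else 0
            union += 1 if pixel_a or pixel_b else 0
    return intersection / union if union else 0.0


labeler_1_mask = [[1, 1, 0],
                  [1, 1, 0],
                  [0, 0, 0]]
labeler_2_mask = [[0, 1, 1],
                  [0, 1, 1],
                  [0, 0, 0]]
# 2 pixels are shared and 6 pixels are covered by at least one mask.
print(mask_iou(labeler_1_mask, labeler_2_mask))
```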
Text (NER) annotations

The consensus score for two text entity annotations is calculated at the character level. If two entity annotations do not overlap, the consensus score will be 0. Overlapping text entity annotations will have a non-zero score. When there is overlap, Labelbox computes the weighted sum of the overlap length ratios, discounting for already counted overlaps. Whitespace is included in the calculation.

  1. Because the consensus agreement for NER is calculated at the character level, partially overlapping spans receive partial credit. For example, suppose two labelers annotate the word "house" in the same text file: the first labeler's annotation covers house and the second labeler's annotation covers only hous. The agreement score between these two annotations is 0.80, because four of the five characters overlap.
  2. Labelbox then averages the agreements for each annotation created using that annotation class to create an overall score for that annotation class.
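The "house" example can be checked with a simplified sketch that handles a single pair of spans: it divides the number of shared characters by the number of characters covered by either span. This is a simplification, not the full weighted-sum method used when several annotations overlap, and the function name is hypothetical.

```python
def span_agreement(a: tuple[int, int], b: tuple[int, int]) -> float:
    """Character-level overlap ratio for two [start, end) spans."""
    overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    covered = (a[1] - a[0]) + (b[1] - b[0]) - overlap
    return overlap / covered if covered else 0.0


text = "the house is red"
start = text.index("house")
house_span = (start, start + len("house"))  # first labeler: "house"
hous_span = (start, start + len("hous"))    # second labeler: "hous"

print(span_agreement(house_span, hous_span))  # 4 shared characters out of 5 covered
```

Spans that do not overlap at all return 0, matching the rule stated above.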

Classifications

The calculation method for each classification type is different. One commonality, however, is that if two classifications of the same type are compared, and there are no corresponding selections between the two classifications at all, the agreement will be 0%.

  • A radio classification can only have one selected answer. Therefore, the agreement between the two radio classifications will either be 0% or 100%. 0% means no agreement, and 100% means agreement.

  • A checklist classification can have more than one selected answer, which makes the agreement calculation a little more complex. The agreement between two checklist classifications is generated by dividing the number of overlapping answers by the number of selected answers.

For child classifications, if two annotations containing child classifications have 0 agreement (false positive), the child classifications will automatically be assigned a score of 0 as well.

Labelbox then creates a score for each annotation class by averaging all of the annotation scores.

For example, when Image X loads in the editor, the labelers have 3 classification questions to choose from (Q1, Q2, Q3) each with two answers.

Each of the dotted boxes represents a unique answer choice/answer schema.

Say, for example, these 2 labelers have the same answer for Q1 but different answers for Q2 and Q3.

Labeler 1

Labeler 2

For classifications, the consensus agreement is calculated based on how many unique answer schemas are selected by all labelers.

  • Q1-A: 1 (both labelers picked this answer)

  • Q1-B: N/A (neither labeler picked this answer) <-- not included in the final calculation.

  • Q2-A: 0 (Labeler 1 selected, Labeler 2 did not)

  • Q2-B: 0 (Labeler 2 selected, Labeler 1 did not)

  • Q3-A: 0 (Labeler 1 selected, Labeler 2 did not)

  • Q3-B: 0 (Labeler 2 selected, Labeler 1 did not)

So the final consensus calculation for the classifications on Image X is:

(1 + 0 + 0 + 0 + 0) / 5 = 0.20
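The per-schema breakdown above can be generated with a short sketch for two labelers. The function name is hypothetical, and how the calculation extends to three or more labelers is not covered here.

```python
def schema_breakdown(labeler_1: set[str], labeler_2: set[str]) -> dict[str, int]:
    """Score each answer schema picked by at least one labeler: 1 if both picked it, else 0."""
    return {
        schema: int(schema in labeler_1 and schema in labeler_2)
        for schema in sorted(labeler_1 | labeler_2)
    }


labeler_1_answers = {"Q1-A", "Q2-A", "Q3-A"}
labeler_2_answers = {"Q1-A", "Q2-B", "Q3-B"}

breakdown = schema_breakdown(labeler_1_answers, labeler_2_answers)
consensus = sum(breakdown.values()) / len(breakdown)

print(breakdown)  # Q1-B is absent because neither labeler picked it
print(consensus)
```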

Overall score

Labelbox averages the scores for each annotation class (object-type & classification-type) to create an overall score for the asset. Each annotation class is weighted equally. Below is a simplified example:

Consensus score = (tree annotation class agreement + radio class agreement) / total annotation classes

0.795 = (0.59 + 1.00) / 2

For text and conversations, such as human-generated responses from prompt and response generation, Labelbox also creates a model-based similarity score that accounts for various ways of expressing the same idea, such as using active versus passive voice, synonyms for the same concept, and other variations in writing style.

You can use the metric as an initial indicator of label quality, the clarity of your ontology, and/or your labeling instructions.