How to set up benchmarks for quality analysis and then view results.

Benchmarks enable you to designate a labeled asset as a “gold standard” and automatically compare all other annotations on that asset to the benchmark label.

In order for a benchmark agreement to be calculated, there must be one benchmark label and at least one non-benchmark label on the asset. When a data row is labeled (or the annotations are updated), the benchmark agreement will be recalculated as long as there is at least one non-benchmark label on the data row.

When the benchmarks tool is active for your project, the Individual performance section under the Performance tab displays a Benchmarks column that indicates the average benchmark score for each labeler.

Currently, benchmark agreement calculations are only supported for the following asset and annotation types:

| Asset type | Bounding box | Polygon | Polyline | Point | Segmentation mask | Entity | Radio | Checklist | Dropdown | Free-form text |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Image | ✓ | ✓ | ✓ | ✓ | ✓ | N/A | ✓ | ✓ | ✓ | - |
| Video | - | - | - | - | - | N/A | - | - | - | - |
| Audio | N/A | N/A | N/A | N/A | N/A | N/A | ✓ | ✓ | ✓ | - |
| Text | N/A | N/A | N/A | N/A | N/A | ✓ | ✓ | ✓ | ✓ | - |
| Tiled imagery | ✓ | ✓ | ✓ | ✓ | - | N/A | ✓ | ✓ | ✓ | - |
| Human-generated responses | N/A | N/A | N/A | N/A | N/A | N/A | ✓ | ✓ | N/A | ✓ |

✓ = supported, - = not supported, N/A = annotation type not available for this asset type.


Benchmarks in the queue

The Labelbox queuing system will serve benchmarks for all data types. If the annotation type or data type is not supported by our calculation, no benchmark score will be shown in our application, but the labels will be grouped together under the benchmark.

Labelbox places the first 5 benchmarks at the top of the queue. After those, it does not guarantee any particular order for the remaining benchmarks; each subsequent asset you are served has a 10% chance of being a benchmark label.

Set up benchmarks

When you create a project, you will be prompted to select one or more quality settings for your project. The default option is Benchmarks.


Cannot disable selected quality mode

Once you enable benchmarks for your project, either during project creation or on the Settings page, you cannot disable it later.

You can designate labeled data rows as benchmarks, including those with consensus labels. To designate a data row as a benchmark:

  1. In the editor, label the data row and click Submit.
  2. Navigate back to the project homepage.
  3. Go to the Data rows tab.
  4. Select the labeled data row from the list. This will open the Data row browser.
  5. Click on the three dots next to the data row and select Add as benchmark from the dropdown.

Once a label is designated as a benchmark, the data row is automatically moved to Done. Benchmarked data rows are served to all labelers in a project and cannot be moved to any other step unless the benchmark is removed.

View benchmark results

Within a project, navigate to Performance > Quality, and you will see two charts. The histogram on the left displays the average benchmark score for labels created in certain date ranges. The histogram on the right shows the number of labels created with a benchmark score within the specified range.

The benchmark column in the data row activity table contains the average benchmark score for each benchmark data row.

When you click on an individual labeler in the performance tab, the benchmark column reflects the average benchmark score for that labeler.


View benchmark scores in a label export

The benchmark_score field in the JSON for exported labels will have a value between 0 and 1 that denotes the associated benchmark score for the label. This field can be found in the performance_details section of an exported label.

For a benchmark data row that has been labeled more than once, the benchmark_reference_label field contains the ID of the label that has been selected as the benchmark. The actual benchmark label won't have an associated benchmark score or a benchmark_reference_label field.
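As a sketch of how these fields might be read out of an export, the snippet below collects `benchmark_score` values keyed by label ID. Only the `performance_details` field names follow the description above; the list-of-labels layout and `id` field are a simplified assumption, not the exact shape of a real export.

```python
# Hypothetical, simplified export structure: a list of label objects.
# Only performance_details / benchmark_score / benchmark_reference_label
# follow the docs; the rest is illustrative.
sample_labels = [
    {
        "id": "label-a",
        "performance_details": {
            "benchmark_score": 0.87,
            "benchmark_reference_label": "label-gold",
        },
    },
    {
        "id": "label-gold",          # the benchmark label itself has no
        "performance_details": {},   # score and no reference field
    },
]

def benchmark_scores(labels):
    """Map label id -> benchmark_score for labels that have one."""
    return {
        lab["id"]: lab["performance_details"]["benchmark_score"]
        for lab in labels
        if "benchmark_score" in lab.get("performance_details", {})
    }

print(benchmark_scores(sample_labels))  # {'label-a': 0.87}
```

Note that the benchmark label is skipped naturally, since it carries no `benchmark_score`.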


How are object-type annotations factored into the benchmark calculation?

The benchmark agreement for bounding box, polygon, and segmentation mask annotations is calculated using Intersection over Union (IoU). The agreement between point annotations and polyline annotations is calculated based on proximity.

  1. First, Labelbox compares each annotation to its corresponding benchmark annotation to generate a per-annotation IoU score. The algorithm pairs annotations so as to maximize the total IoU, then assigns an IoU of 0 to any unmatched annotations.

  2. Then, Labelbox averages the IoU scores for each annotation belonging to the same annotation class to create an overall score for that annotation class.

"Tree" annotation class agreement = (0.99 + 0.99 + 0.97 + 0 + 0) / 5 = 0.59
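The two steps above can be sketched as follows. This is a minimal illustration, assuming brute-force pairing and that every annotation, matched or not, counts once in the denominator; Labelbox's exact matching and thresholding rules are not documented here.

```python
from itertools import permutations

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def class_agreement(bench, other):
    """Pair the two sets of boxes to maximize total IoU (brute force,
    fine for a handful of annotations), then average over all
    annotations, counting each unmatched one as 0."""
    a, b = (bench, other) if len(bench) >= len(other) else (other, bench)
    if not b:
        return 0.0
    k = len(b)
    best = max(
        sum(iou(perm[i], b[i]) for i in range(k))
        for perm in permutations(a, k)
    )
    return best / len(a)  # unmatched annotations contribute 0 to the sum

bench = [(0, 0, 10, 10), (20, 0, 30, 10), (0, 20, 10, 30)]
other = [(0, 0, 10, 10)]  # the labeler drew only one of three objects
print(class_agreement(bench, other))  # one perfect match, two zeros: (1 + 0 + 0) / 3
```

With the "Tree" numbers above, the same averaging gives (0.99 + 0.99 + 0.97 + 0 + 0) / 5 = 0.59.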

How are classifications factored into the benchmark calculation?

The calculation for each classification type varies. One commonality, however, is that if two classifications of the same type are compared and there are no corresponding selections between the two classifications at all, the agreement will be 0%.

  • A radio classification can have only one selected answer, so the agreement between two radio classifications is either 0% (no agreement) or 100% (full agreement).

  • A checklist classification can have more than one selected answer, which makes the agreement calculation a little more complex. The agreement between two checklist classifications is generated by dividing the number of overlapping answers by the number of selected answers.

  • A dropdown classification can have only one selected answer; however, the answer choices can be nested. The calculation for dropdown is similar to that for checklist classifications, except that the agreement calculation divides the number of overlapping answers by the total depth of the selection (the number of levels). Answers nested under different top-level classifications can still overlap if the classifications at the next level match. Conversely, answers that do not match exactly can still overlap if they are under the same top-level classification.

For child classifications, if two annotations containing child classifications have 0 agreement (resulting in a false positive), the child classifications will automatically be assigned a score of 0 as well.

Labelbox then creates a score for each annotation class by averaging all of the per-annotation scores.

For example, when Image X loads in the editor, a labeler has 3 classification questions to choose from (Q1, Q2, Q3), each with two answers. The green boxes indicate the benchmark answers.


Say, for example, a labeler answers Q1 correctly but answers Q2 and Q3 incorrectly.

For classifications, the benchmark agreement is calculated based on: (1) how many unique answer schemas the labeler selects, and (2) of those selected, how many are correct.

  • Q1-A: 1

  • Q1-B: N/A <-- not included in the final calculation.

  • Q2-A: 0

  • Q2-B: 0

  • Q3-A: 0

  • Q3-B: 0

So the final benchmark calculation for the classifications on Image X is:

(1 + 0 + 0 + 0 + 0) / 5 = 0.20
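The Image X arithmetic can be checked with a small helper that averages the per-answer scores, skipping the N/A entries (represented here as `None`, an illustrative choice):

```python
def classification_agreement(per_answer_scores):
    """Average per-answer scores; None marks N/A answers that neither
    the labeler nor the benchmark selected and are excluded."""
    graded = [s for s in per_answer_scores if s is not None]
    return sum(graded) / len(graded)

# Image X above: Q1-A correct, Q1-B not applicable, Q2/Q3 answers wrong
print(classification_agreement([1, None, 0, 0, 0, 0]))  # 0.2
```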

How is the benchmark score calculated for the data row?

Labelbox averages the scores for each annotation class (object-type & classification-type) to create an overall score for the asset. Each annotation class is weighted equally. Below is a simplified example:

Benchmark score = (tree annotation class agreement + radio class agreement) / total annotation classes

0.795 = (0.59 + 1.00) / 2
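The simplified example above reduces to an equal-weight average over the per-class agreements:

```python
def benchmark_score(class_agreements):
    """Equal-weight average across annotation classes, object-type and
    classification-type alike."""
    return sum(class_agreements) / len(class_agreements)

# Tree annotation class agreement (0.59) and radio class agreement (1.00)
print(round(benchmark_score([0.59, 1.00]), 3))  # 0.795
```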

For text and conversations, such as human-generated responses from prompt and response generation, Labelbox also creates a model-based similarity score that accounts for various ways of expressing the same idea, such as using active versus passive voice, synonyms for the same concept, and other variations in writing style.

You can use this metric as an initial indicator of label quality, the clarity of your ontology, and/or the clarity of your labeling instructions.