Benchmarks

The Benchmarks tool allows you to designate a labeled asset as a "gold standard" and automatically compare all other annotations on that asset to the Benchmark label. This helps you identify weaknesses in labeling efforts and/or problematic ontology configurations (see Benchmarks methodology to learn more). In order for a Benchmark agreement to be calculated, there must be one Benchmark label and at least one non-Benchmark label on the asset. When a new label is created or updated, the Benchmark agreement is recalculated as long as there is at least one non-Benchmark label on the Data Row.

Benchmark labels are marked with a gold star in the Activity table under the Labels tab. When the Benchmarks tool is active for your project, the Individual performance section under the Performance tab will display a Benchmarks column that indicates the average Benchmark score for that labeler.

🚧

Caution

Switching from Benchmarks to Consensus (or vice versa) mid-project may result in duplicate labels.

Currently, Benchmark agreement calculations are only supported for the following:

| Asset type | Bounding box | Polygon | Polyline | Point | Segmentation mask | Entity | Radio | Checklist | Dropdown | Free-form text |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Images | ✓ | ✓ | ✓ | ✓ | ✓ | N/A | ✓ | ✓ | ✓ | ✓ |
| Tiled imagery | ✓ | ✓ | ✓ | ✓ | ✓ | N/A | ✓ | ✓ | ✓ | ✓ |
| Text | N/A | N/A | N/A | N/A | N/A | ✓ | ✓ | ✓ | ✓ | ✓ |
| Video | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |

📘

Queuing benchmarks is supported for all data types in Labelbox

The Labelbox queuing system will serve benchmarks for all data types. If the annotation type or data type is not supported by our calculation, no benchmark score will be shown in our application, but the labels will still be grouped together under the Benchmark.

How are object-type annotations factored into the Benchmark calculation?

Benchmark agreement for Bounding box, Polygon, and Segmentation mask annotations is calculated using Intersection over Union (IoU). The agreement between Point annotations and Polyline annotations is calculated based on proximity.

  1. First, Labelbox compares each annotation to its corresponding Benchmark annotation to generate a per-annotation IoU score. The algorithm first finds the pairing of annotations that maximizes the total IoU score, then assigns an IoU of 0 to any unmatched annotations.

  2. Then, Labelbox averages the IoU scores of all annotations belonging to the same annotation class to create an overall score for that annotation class.

"Tree" annotation class agreement = 0.99 + 0.99 + 0.97 + 0 + 0 / 5 = 0.59

How are classifications factored into the Benchmark calculation?

The calculation for each classification type varies. One commonality, however, is that if two classifications of the same type are compared and there are no corresponding selections between the two classifications at all, the agreement will be 0%.

  • A Radio classification can only have one selected answer. Therefore, the agreement between two radio classifications is either 0% (the answers differ) or 100% (the answers match).

  • A Checklist classification can have more than one selected answer, which makes the agreement calculation a little more complex. The agreement between two checklist classifications is generated by dividing the number of overlapping answers by the number of selected answers.

  • A Dropdown classification can have only one selected answer, however, the answer choices can be nested. The calculation for dropdown is similar to that of checklist classification, except that the agreement calculation divides the number of overlapping answers by the total depth of the selection (how many levels). Answers nested under different top-level classifications can still have overlap if the classifications at the next level match. On the flip side, answers that do not match exactly can still have overlap if they are under the same top-level classification.
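For example, with a hypothetical two-level dropdown: if the Benchmark selection is Vehicle > Car and a labeler selects Vehicle > Truck, one of the two levels overlaps, so the agreement is 1/2 = 0.5; if the labeler also selects Vehicle > Car, the agreement is 2/2 = 1.0.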

For child classifications, if two annotations containing child classifications have 0 agreement (resulting in a false positive), their child classifications are automatically assigned a score of 0 as well.

Labelbox then creates a score for each annotation class by averaging all of the per-annotation scores.

Radio "Is it daytime?" = "Yes" & "Yes" = 1.00

Radio "Is it daytime?" = "Yes" & "Yes" = 1.00

"Is it daytime?" radio class agreement = 1.00 + 1.00 / 2 = 1.00

How is the Benchmark score calculated for the Data Row?

Labelbox averages the scores for each annotation class (object-type & classification-type) to create an overall score for the asset. Each annotation class is weighted equally. Below is a simplified example.

Benchmark score = (Tree annotation class agreement + Radio class agreement) / Total annotation classes

0.795 = (0.59 + 1.00) / 2
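For completeness, the same arithmetic in code, using the values from the simplified example above:

```python
# Unweighted mean of the per-class agreements from the example above.
class_agreements = {"Tree": 0.59, "Is it daytime?": 1.00}
benchmark_score = sum(class_agreements.values()) / len(class_agreements)
print(round(benchmark_score, 3))  # 0.795
```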

You can use this metric as an initial indicator of label quality, the clarity of your ontology, and/or the clarity of your labeling instructions.

How do I set up Benchmarks?

Either Benchmarks or Consensus can be turned on for a project, but it is not possible to have both on at the same time.

  1. Create a project or select an existing one.

  2. Navigate to Settings > Quality and select Benchmarks to turn on this feature for your project.

  3. You can designate a Benchmark label by selecting a label from the Activity table. From the Label browser, click the context menu (three dots) on the desired label and click Add as Benchmark.

  4. Under the Labels tab, there is a Benchmarks table that lists all Benchmark labels for that project. Benchmark labels are marked with a gold star. Click View Results to see all labels associated with that Benchmark label.
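If you prefer to script step 3, older versions of the Labelbox Python SDK exposed a create_benchmark() method on Label objects. The snippet below is a sketch under that assumption; both create_benchmark() and the project.labels() accessor may differ or be deprecated in current SDK versions, so verify them against the SDK reference before relying on this.

```python
import labelbox as lb

client = lb.Client(api_key="YOUR_API_KEY")
project = client.get_project("<PROJECT_ID>")

# Pick the label you want to promote to a Benchmark (selection logic is up to you).
label = next(project.labels())
label.create_benchmark()  # assumed legacy SDK method; check current SDK docs
```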

