Labelbox documentation


Benchmarks is a Labelbox QA tool that automatically compares all labels on a data row to a “gold standard” label you choose. Once an asset with a Benchmark label gets a human- or computer-generated label, the Benchmark agreement score is automatically calculated.

Currently, Benchmark scores are only supported on images for the following annotation types: Bounding boxes, Polygons, and Masks.


Labelbox follows a similar methodology for calculating the agreement scores for both Benchmarks and Consensus. The only difference in the calculations is the entity to which the Labels are compared.

Benchmarks works by interspersing Data Rows that have a Benchmark label into each labeler's queue. Each resulting label is compared against its respective Benchmark label, and an agreement score between 0% and 100% is calculated.

When a Label is created or updated, the Benchmark score is recalculated as long as there is at least one Label on the Data Row. If a Label gets deleted, no Benchmark score will appear for that Data Row.

Generally speaking, calculating agreement for the polygons of a Label involves computing the Intersection-over-Union (IoU) of corresponding annotations and then taking a series of averages to arrive at the final agreement between two Labels on an image.
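As a minimal sketch of the IoU step, the function below computes Intersection-over-Union for two axis-aligned bounding boxes in `(x1, y1, x2, y2)` form. (This illustrates the general IoU formula only; the exact averaging Labelbox applies across annotations is not shown here.)

```python
def iou(box_a, box_b):
    """Intersection-over-Union for two axis-aligned boxes (x1, y1, x2, y2)."""
    # Intersection rectangle: the overlap of the two boxes, if any.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

For example, two identical boxes score 1.0, while two boxes that overlap on half of each other's area score 1/3 (50 intersection over 150 union).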

There are three global classification types supported in Benchmarks: radio, checklist, and dropdown. The calculation method for each classification type is different. One commonality, however, is that if two classifications of the same type are compared and there are no corresponding selections between the two classifications at all, the agreement will be 0%.

Radio classification can only have one selected answer. Therefore, the agreement between two radio classifications is either 0% (no agreement) or 100% (agreement).

Checklist classification can have more than one selected answer, which makes the agreement calculation a little more complex. The agreement between two checklist classifications is generated by dividing the number of overlapping answers by the number of selected answers.
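A sketch of the checklist calculation, assuming "number of selected answers" means the distinct answers selected across both classifications (i.e., intersection over union; the function name and that reading of the denominator are assumptions, not confirmed by this page):

```python
def checklist_agreement(selected_a, selected_b):
    """Agreement between two checklist classifications, as a percentage."""
    a, b = set(selected_a), set(selected_b)
    overlap = a & b        # answers both classifications selected
    combined = a | b       # all answers selected by either classification
    return 100.0 * len(overlap) / len(combined) if combined else 0.0
```

Under this reading, selections {"car", "truck"} and {"truck", "bus"} share one of three distinct answers, giving roughly 33% agreement; no shared answers gives 0%, consistent with the rule above.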

Dropdown classification can have only one selected answer; however, the answer choices can be nested. The calculation for dropdown is similar to that of checklist classification, except that the agreement calculation divides the number of overlapping answers by the total depth of the selection (how many levels). Answers nested under different top-level classifications can still have overlap if the classifications at the next level match. Conversely, answers that do not match exactly can still have overlap if they are under the same top-level classification.
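A sketch of the dropdown calculation under the reading above: each selection is a path of answers from the top level down, answers are compared level by level (so a mismatch at one level does not prevent overlap at another), and the overlap count is divided by the total depth. The path representation and function name are assumptions for illustration.

```python
def dropdown_agreement(path_a, path_b):
    """Agreement between two nested dropdown selections, as a percentage.

    Each path is a list of answers from the top-level choice down to the leaf,
    e.g. ["vehicle", "car"].
    """
    depth = max(len(path_a), len(path_b))
    if depth == 0:
        return 0.0
    # Compare the two selections level by level and count matching answers.
    overlap = sum(1 for ans_a, ans_b in zip(path_a, path_b) if ans_a == ans_b)
    return 100.0 * overlap / depth
```

Under this sketch, ["vehicle", "car"] vs. ["vehicle", "truck"] agree on one of two levels (50%), and so do ["animal", "car"] vs. ["vehicle", "car"], matching the two overlap cases described above.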

Set up Benchmarks

Either Benchmarks or Consensus can be turned on for a project at any given time, but not both at once.

  1. Create a project or select an existing one.

  2. Navigate to Settings > Quality and select Benchmarks to turn on this feature for your project.

  3. Designate a Benchmark label by selecting a label from the Activity table. From the Label browser, click the context menu (three dots) on the desired label and click Add as Benchmark.


View results

Benchmark labels are marked with a gold star in the Activity table under the Labels tab.

Under the Labels tab, there is also a Benchmarks table listing all of the Benchmark labels for that project. Click View Results to see all labels associated with a given Benchmark label.

When the Benchmarks tool is active for your project, the Individual performance section under the Performance tab will display a Benchmarks column that indicates the average Benchmark score for that labeler.