Benchmark
Learn how to set up benchmarks to analyze the quality of your labels.
Benchmarks serve as the gold standard for other labels. You can designate specific data rows with annotations as benchmarks, and all other annotations on these data rows are automatically compared to these benchmark reference labels to calculate benchmark agreement scores.
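Exactly how the agreement score is computed depends on the annotation type; for geometric annotations such as bounding boxes, agreement is commonly expressed as intersection over union (IoU) between the benchmark reference and the other label. The sketch below illustrates that comparison for two axis-aligned boxes; the (x, y, width, height) tuple format is an assumption made for this example, not Labelbox's export format.

```python
# Illustrative only: compare one annotator's bounding box against a benchmark
# reference box using intersection over union (IoU). The (x, y, width, height)
# tuple format is an assumption for this sketch, not Labelbox's export format.

def iou(box_a, box_b):
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Edges of the overlapping rectangle, if any.
    left, top = max(ax, bx), max(ay, by)
    right, bottom = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    if right <= left or bottom <= top:
        return 0.0  # boxes don't overlap, so there is no agreement
    intersection = (right - left) * (bottom - top)
    union = aw * ah + bw * bh - intersection
    return intersection / union

benchmark_box = (10, 10, 100, 50)  # benchmark reference annotation
annotator_box = (12, 8, 98, 52)    # non-benchmark label on the same data row
print(f"Agreement (IoU): {iou(benchmark_box, annotator_box):.2f}")
```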
Benchmarks in the queue
Labelbox ensures that the first five data rows in your labeling queue are benchmark labels. After these initial five, the order of the remaining benchmarks is not guaranteed; each subsequent data row you encounter has a 10% chance of being a benchmark label. If you have fewer than five benchmarks, they are the first data rows in your queue, and no additional benchmarks appear unless you add more.
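These rules can be pictured with a small simulation. The sketch below models only the behavior described above (five guaranteed benchmarks followed by a 10% chance per data row); it is not Labelbox's internal queueing logic.

```python
import random

def serve_benchmark_next(rows_served: int, benchmarks_remaining: int) -> bool:
    """Illustrative model of the queue rules described above."""
    if benchmarks_remaining == 0:
        return False               # nothing left to serve as a benchmark
    if rows_served < 5:
        return True                # the first five data rows are benchmarks
    return random.random() < 0.10  # afterwards, ~10% chance per data row
```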
Supported types
Currently, benchmarking is supported for the following asset and annotation types:
| Asset type | Bounding box | Polygon | Polyline | Point | Segmentation mask | Entity | Radio | Checklist |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Images | ✓ | ✓ | ✓ | ✓ | ✓ | N/A | ✓ | ✓ |
| Videos | ✓ | - | ✓ | ✓ | - | N/A | ✓ | ✓ |
| Audio | N/A | N/A | N/A | N/A | N/A | N/A | ✓ | ✓ |
| Text | N/A | N/A | N/A | N/A | N/A | ✓ | ✓ | ✓ |
| Tiled imagery | ✓ | ✓ | ✓ | ✓ | - | N/A | ✓ | ✓ |
| Documents | ✓ | N/A | N/A | N/A | N/A | ✓ | ✓ | ✓ |
| HTML | N/A | N/A | N/A | N/A | N/A | N/A | ✓ | ✓ |
| Conversational text | N/A | N/A | N/A | N/A | N/A | ✓ | ✓ | ✓ |
| Human-generated responses | N/A | N/A | N/A | N/A | N/A | N/A | ✓ | ✓ |
Set up benchmarks
To set an individual data row as a benchmark reference:
- In the editor, label the data row and click Submit.
- Navigate back to the project homepage.
- Go to the Data rows tab.
- Select the labeled data row from the list. This opens the Data row browser.
- Click the three-dot menu next to the data row and select Add as benchmark from the dropdown.
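If you manage labels through the Labelbox Python SDK, the same designation can be scripted. The sketch below assumes the SDK's `Label.create_benchmark()` method and the `project.labels()` iteration pattern, which have been available in earlier SDK versions; treat both as assumptions and confirm them against the SDK reference for the version you use.

```python
import labelbox as lb

# Assumptions: project.labels() and label.create_benchmark() exist in your
# labelbox SDK version; verify against the SDK reference before relying on this.
client = lb.Client(api_key="YOUR_API_KEY")
project = client.get_project("YOUR_PROJECT_ID")

TARGET_LABEL_IDS = {"LABEL_ID_TO_BENCHMARK"}  # labels you reviewed and trust

for label in project.labels():
    if label.uid in TARGET_LABEL_IDS:
        label.create_benchmark()  # designate this label as a benchmark reference
```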
To bulk-assign labeled data rows as benchmark references:
- On the Data rows tab, select the data rows that you want to set as benchmarks.
- Click the selected dropdown and select Assign labels as benchmarks.
- If you have permissions to access benchmark scores, you can select from the following two options:
  - Infer automatically: Set the labels created within the benchmark quality mode as benchmark references. If the project also uses consensus as the quality mode, this option also sets consensus winners as benchmark references.
  - Target label by score: Select or set a range of benchmark scores to set labeled data rows matching the filter as benchmark references.

  If you don't have permissions to access benchmark scores or if no benchmark scores are available, you can only see and use the Infer automatically option.
- Click Submit.
Once a label is designated as a benchmark, the data row is automatically moved to Done. Benchmarked data rows are served to all labelers in the project and can't be moved to any other step unless the benchmark is removed.
For a benchmark agreement to be calculated, a data row must have one benchmark reference label and at least one non-benchmark label. Whenever annotations are added or updated, the benchmark agreement is recalculated, as long as at least one non-benchmark label exists on the data row.
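In other words, recalculation only happens when the data row carries both kinds of labels. A minimal sketch of that precondition, assuming each label is represented as a dict with an is_benchmark flag (an illustrative shape, not Labelbox's export format):

```python
def can_compute_benchmark_agreement(labels):
    """True when a data row carries one benchmark reference label and at
    least one non-benchmark label to compare against it."""
    benchmark_count = sum(1 for label in labels if label["is_benchmark"])
    other_count = sum(1 for label in labels if not label["is_benchmark"])
    return benchmark_count == 1 and other_count >= 1

labels_on_row = [
    {"id": "lbl-1", "is_benchmark": True},   # benchmark reference
    {"id": "lbl-2", "is_benchmark": False},  # a labeler's submission
]
print(can_compute_benchmark_agreement(labels_on_row))  # True
```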
Search and filter data using benchmark scores
The Benchmark agreement filter helps you find qualified data rows based on benchmark scores. You can apply this filter in the following locations:
When using the filter, you can configure the following options:
- Scope: Specify the type of agreement to measure:
  - Feature-level measures the alignment between annotators' labels and the predefined benchmark reference labels at the individual feature level. If you select this option, further specify one or more feature schemas in the ontology using the dropdown menu.
  - Label-level evaluates the overall agreement of all annotations within a single data row compared to the benchmark reference label.
- Calculation: Choose whether to calculate the agreement as an absolute or average score.
- Range (0-1): Set the score range from 0 to 1, where 0 indicates no agreement with the benchmark reference label and 1 represents complete agreement.
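The Range option maps directly to a numeric filter. If you export labels and their scores, the same logic is easy to reproduce client-side; the benchmark_agreement field name in this sketch is an assumption for illustration, not the exact export schema.

```python
def filter_by_benchmark_agreement(rows, low=0.0, high=1.0):
    """Keep rows whose benchmark agreement score falls within [low, high].
    The `benchmark_agreement` field name is assumed for this sketch."""
    return [row for row in rows if low <= row["benchmark_agreement"] <= high]

exported_rows = [
    {"data_row_id": "row-a", "benchmark_agreement": 0.42},
    {"data_row_id": "row-b", "benchmark_agreement": 0.91},
]
# Data rows with strong agreement against their benchmark reference labels:
print(filter_by_benchmark_agreement(exported_rows, low=0.8, high=1.0))
```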