Labelbox documentation


Consensus is a Labelbox QA tool that compares a single Label on an asset to all of the other Labels on that asset. Once an asset has been labeled more than once, a Consensus score is automatically calculated. Consensus works in real time, so you can take immediate corrective action to improve your training data and model performance.

Consensus task management (collecting votes) works for all data types. However, Consensus scores are only supported for the following annotation types: Bounding boxes on images, Polygons on images, Masks on images, and NER annotations on text.

Whenever an annotation is created, updated, or deleted, the Consensus score is recalculated as long as at least 2 Labels exist on that data row. Recalculation may take up to 5 minutes, depending on the complexity of the labeled asset.

Consensus methodology for image annotations

Generally speaking, agreement for a set of polygon annotations is calculated using Intersection-over-Union (IoU), followed by a series of averages that produce the final agreement between the set of annotations on an image asset.
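The exact averaging steps are not spelled out here, but the core IoU computation can be sketched with axis-aligned boxes (the polygon case works the same way, just with polygon intersection and union areas):

```python
def iou(box_a, box_b):
    """Intersection-over-Union for two axis-aligned boxes (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Two labelers draw slightly different boxes around the same object.
score = iou((0, 0, 10, 10), (2, 0, 10, 10))
# intersection = 8 * 10 = 80; union = 100 + 80 - 80 = 100 → 0.8
```

Perfectly matching annotations score 1.0, and annotations with no overlap score 0.0; per-pair scores like this one are then averaged up to the asset-level agreement.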

Consensus methodology for text annotations (NER)

The Consensus score for two spans of text is calculated at the character-level.


Whitespace is included in the calculation.

If two spans of text do not overlap, the Consensus score will be 0.

Otherwise, partial overlaps between two spans of text will have a non-zero score. When there is overlap, Labelbox computes the weighted sum of the overlap length ratios, discounting overlaps that have already been counted. Because Consensus is calculated at the character level, spans of text can partially match. For example, if one labeler annotates a URL this way: wiki /Bicycle

And another labeler annotates the same URL this way: iki /Bicycle

The Consensus agreement between these two Entity annotations would be 75%.
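As a simplified illustration of character-level comparison, the sketch below treats each span as the set of character indices it covers and scores intersection over union. Note this is not Labelbox's weighted-sum formula, so it will not reproduce the 75% figure above; it only shows why partial overlaps yield partial, non-zero scores:

```python
def char_agreement(span_a, span_b):
    """Simplified character-level agreement between two (start, end) spans.

    This is intersection-over-union on character indices -- an illustrative
    approximation, NOT Labelbox's weighted sum of overlap length ratios.
    """
    a = set(range(*span_a))
    b = set(range(*span_b))
    if not a & b:
        return 0.0  # non-overlapping spans always score 0
    return len(a & b) / len(a | b)

# One labeler selects characters 0-12, another selects characters 1-12
# (the "wiki /Bicycle" vs. "iki /Bicycle" situation above).
score = char_agreement((0, 13), (1, 13))
# → 12/13 ≈ 0.923 under this simplified metric
```

Whitespace characters count like any other character, which is why spans that differ only by a leading or trailing space still lose some agreement.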

Then, any global classifications are averaged into the Consensus agreement.

Consensus methodology for classifications

There are four global classification types: radio, checklist, text, and dropdown. The calculation method for each classification type is different. One commonality, however, is that if two classifications of the same type are compared and there are no corresponding selections between the two classifications at all, the agreement will be 0%.

A Radio classification can only have one selected answer, so the agreement between two radio classifications is either 0% (no agreement) or 100% (agreement). Similarly, a Free-form text classification answer must be an exact match in order to receive 100% agreement; otherwise, the agreement is 0%.
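Both of these all-or-nothing rules reduce to an equality check. A minimal sketch (assuming case-sensitive exact matching for free-form text, since the docs only say "exact match"):

```python
def radio_agreement(answer_a, answer_b):
    """Radio: one selected answer, so agreement is all-or-nothing."""
    return 1.0 if answer_a == answer_b else 0.0

def text_agreement(text_a, text_b):
    """Free-form text: only an exact string match counts as agreement."""
    return 1.0 if text_a == text_b else 0.0

# Two labelers pick the same radio option → full agreement.
same = radio_agreement("cat", "cat")       # 1.0
# Free-form answers that differ at all → no agreement.
diff = text_agreement("bicycle", "Bicycle")  # 0.0
```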

A Checklist classification can have more than one selected answer, which makes the agreement calculation a little more complex. The agreement between two checklist classifications is generated by dividing the number of overlapping answers by the number of selected answers.
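A sketch of that rule, with one labeled assumption: the docs do not say which "number of selected answers" is the denominator, so this example uses the union of distinct selections across both labelers (i.e., Jaccard similarity):

```python
def checklist_agreement(answers_a, answers_b):
    """Checklist: overlapping answers divided by the selected answers.

    "Selected answers" is read here as the union of distinct selections
    across both labelers -- an assumption, since the docs leave the
    denominator unspecified.
    """
    a, b = set(answers_a), set(answers_b)
    if not a | b:
        return 0.0
    return len(a & b) / len(a | b)

# Labeler A picks {"red", "blue"}; labeler B picks {"red", "green"}.
score = checklist_agreement({"red", "blue"}, {"red", "green"})
# 1 overlapping answer / 3 distinct selected answers ≈ 0.33
```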

A Dropdown classification can have only one selected answer; however, the answer choices can be nested. The calculation for dropdown is similar to that of checklist classification, except that the agreement calculation divides the number of overlapping answers by the total depth of the selection (how many levels). Answers nested under different top-level classifications can still have overlap if the classifications at the next level match. Conversely, answers that do not match exactly can still have overlap if they are under the same top-level classification.
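Modeling each dropdown answer as a path through the nested choices, the rule above can be sketched as counting the levels where the two paths agree and dividing by the total depth. Comparing level by level (rather than stopping at the first mismatch) matches the statement that answers under different top-level classifications can still overlap at a deeper level; treating "total depth" as the deeper of the two paths is an assumption:

```python
def dropdown_agreement(path_a, path_b):
    """Dropdown: answers are paths through nested choices.

    Counts levels where the paths match and divides by the total depth
    of the selection (read here as the deeper of the two paths -- an
    assumption; the docs just say "how many levels").
    """
    matching = sum(1 for sa, sb in zip(path_a, path_b) if sa == sb)
    depth = max(len(path_a), len(path_b))
    return matching / depth if depth else 0.0

# Different top-level choices, but matching at the second level,
# still yield partial agreement.
score = dropdown_agreement(["vehicle", "car"], ["animal", "car"])
# 1 matching level / depth 2 → 0.5
```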

Set up Consensus

  1. Create a project or select an existing one.

  2. Navigate to Settings > Quality and select Consensus to turn this QA feature on for your project.

  3. Choose the Coverage percentage and the number of Votes. Votes indicates how many times each asset within the Coverage percentage gets labeled. For example, with Coverage set to 25% and Votes set to 3, 25% of the assets will each be labeled 3 times.
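The labeling volume implied by these settings is simple arithmetic. A quick illustration with hypothetical numbers (1,000 assets is an assumption for the example, not a product default):

```python
# Hypothetical project: 1,000 assets, 25% Coverage, 3 Votes.
total_assets = 1000
coverage = 0.25
votes = 3

consensus_assets = int(total_assets * coverage)  # assets labeled multiple times
consensus_labels = consensus_assets * votes      # labels produced by those assets
# 250 consensus assets → 750 consensus labels
```

Higher Coverage or Votes values yield stronger agreement signals but multiply the labeling effort, so these two settings are the main cost/quality trade-off when enabling Consensus.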


View results

The chart at the bottom of the Overview tab displays the Consensus scores across all labels in the project. The x-axis indicates the agreement percentage and the y-axis indicates the label count.


The Consensus column in the Activity table contains the agreement score for each label and how many labels are associated with that score. When you click on the consensus icon, the Activity table will automatically apply the correct filter to view the labels associated with that consensus score.

When you click on an individual labeler in the Performance tab, the Consensus column reflects the average Consensus score for that labeler.