Consensus

The consensus tool allows you to automatically compare the annotations on a given asset to all other annotations on that asset. Consensus works in real-time so you can take immediate and corrective actions toward improving your training data and model performance.

Once an asset is labeled more than once, a consensus score is automatically calculated. Whenever an annotation is created, updated, or deleted, the consensus score is recalculated as long as at least two labels exist on that data row. Recalculation can take up to about 5 minutes, depending on the complexity of the labeled asset.

Consensus agreement calculations are only supported for the following asset and annotation types.

Asset type | Bounding box | Polygon | Polyline | Point | Segmentation mask | Entity | Relationship | Radio | Checklist | Dropdown | Free-form text
ImageN/A--
Video | - | - | - | - | - | N/A | - | - | - | - | -
TextN/AN/AN/AN/AN/A--
AudioN/AN/AN/AN/AN/A---
Tiled imageryNAN/A--

📘

Queuing consensus submissions is supported for all data types in Labelbox

Consensus task management (collecting votes) works for all data types. However, if the annotation type or data type is not supported by our calculation, no consensus score will be shown in our application but the submissions will be grouped together.

Configure consensus settings

When you create a new project, you’ll be prompted to select a quality setting (benchmark or consensus). You cannot switch your quality setting once the project has been created.

If you have consensus enabled for your project, you'll be able to configure consensus settings for any batch added to that project.

Consensus setting | Description
Data row priority | This value indicates where in the labeling queue these data rows will be slotted, based on priority.
% coverage | This value indicates what percentage of the data rows in the batch will enter the labeling queue.
# labels | Out of the data rows to be labeled, this value indicates how many times the data rows should be labeled.

Select a set of annotations as the “winner”

At project creation, when you select consensus as the quality mode and batches as the queueing mode, Labelbox automatically enables multi-labeling for the data rows queued to that project. This means that data rows included in the % coverage (see section above) can be labeled by more than one labeler.

After a data row is labeled, it can be reviewed. By default, Labelbox will preselect the first entry of annotations on a data row as the “winner”. During the review process, the reviewer can reassign the consensus selection to another set of annotations after the data row is labeled more times.

If your data row has been labeled more than once, you'll be able to see all of the labeler entries on that data row in the data row browser. In the example below, you can see that the data row has been labeled twice and the first entry of annotations is designated as the “winner” (indicated by the green trophy icon).

You can change the “winner” designation to another set of annotations by clicking on the trophy icons.

Watch this video to learn about approving and rejecting data rows in a consensus project.

View consensus results

Within a project, navigate to Performance > Quality and you will see two charts. The histogram on the left displays the average consensus score for labels created in certain date ranges. The histogram on the right shows the number of labels created that have a consensus score within the specified range.

The consensus column in the activity table contains the agreement score for each labeled data row and how many labels are associated with that score. When you click on the consensus icon, the activity table will automatically apply the correct filter to view the labels associated with that consensus score.

When you click on an individual labeler in the performance tab, the consensus column reflects the average consensus score for that labeler.

📘

View consensus scores in a label export

The agreement field in the JSON for exported labels will have a value between 0 and 1 that denotes the associated Consensus score for the label.

For assets that have only been labeled once, and thus do not have a Consensus score, the agreement field in the label export will have a value of -1.
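
As a rough illustration of how the agreement field could be used after an export, the sketch below sorts labels into three groups: not yet scored (-1), below a review threshold, and at or above it. It assumes the export is a JSON array of label objects; only the agreement field comes from the description above, while `label_id`, the function name, and the 0.7 threshold are hypothetical.

```python
import json


def split_by_consensus(export_json, threshold=0.7):
    """Group exported labels by their agreement value.

    Assumes (hypothetically) that the export is a JSON array of objects,
    each with an "agreement" field and a "label_id" field.
    """
    groups = {"unscored": [], "below": [], "at_or_above": []}
    for label in json.loads(export_json):
        score = label.get("agreement", -1)
        if score == -1:
            # Labeled only once, so no consensus score exists yet.
            groups["unscored"].append(label["label_id"])
        elif score < threshold:
            groups["below"].append(label["label_id"])
        else:
            groups["at_or_above"].append(label["label_id"])
    return groups


sample = (
    '[{"label_id": "a", "agreement": 0.9},'
    ' {"label_id": "b", "agreement": 0.4},'
    ' {"label_id": "c", "agreement": -1}]'
)
print(split_by_consensus(sample))
```

Labels in the "below" group are natural candidates for review; the threshold itself is a project-specific choice.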

FAQ

How are object-type calculations factored into the Consensus calculation?

Consensus agreement for bounding box, polygon, and segmentation mask annotations is calculated using Intersection over Union (IoU). The agreement between point annotations and polyline annotations is calculated based on proximity.

  1. First, Labelbox pairs each annotation with its corresponding annotation from the other label and generates an IoU score for each pair. The algorithm chooses the pairing that maximizes the total IoU score, then assigns an IoU value of 0 to any unmatched annotations.

  2. Labelbox then averages the IoU scores for each annotation belonging to the same annotation class to create an overall score for that annotation class.

"Tree" annotation class agreement = (0.99 + 0.99 + 0.97 + 0 + 0) / 5 = 0.59

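The two steps above can be sketched for bounding boxes as follows. This is a minimal illustration, not Labelbox's implementation: it assumes axis-aligned boxes given as (x_min, y_min, x_max, y_max), finds the best pairing by brute force (practical only for a handful of annotations), and treats a pair with zero overlap as two unmatched annotations. The function names are hypothetical.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over Union of two boxes (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def class_agreement(boxes_a, boxes_b):
    """Average IoU for one annotation class across two labels."""
    if not boxes_a and not boxes_b:
        return None  # nothing to compare
    small, large = sorted((boxes_a, boxes_b), key=len)
    best_scores, best_total = [], -1.0
    # Try every way of pairing the shorter list with the longer one
    # and keep the pairing with the highest total IoU.
    for perm in permutations(range(len(large)), len(small)):
        scores = [iou(small[i], large[j]) for i, j in enumerate(perm)]
        if sum(scores) > best_total:
            best_scores, best_total = scores, sum(scores)
    matched = [s for s in best_scores if s > 0]
    # Assumption: every annotation left without an overlapping partner scores 0.
    unmatched = len(boxes_a) + len(boxes_b) - 2 * len(matched)
    all_scores = matched + [0.0] * unmatched
    return sum(all_scores) / len(all_scores)


label_1 = [(0, 0, 10, 10), (20, 20, 30, 30)]
label_2 = [(0, 0, 10, 10), (100, 100, 110, 110)]
print(class_agreement(label_1, label_2))
```

Here one pair matches perfectly and two boxes have no partner, so the class score is the average of 1, 0, and 0. For larger annotation counts, an assignment solver would replace the brute-force search.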
How are text (NER) annotations factored into the consensus calculation?

The consensus score for two text entity annotations is calculated at the character level. If two entity annotations do not overlap, the consensus score will be 0. Overlapping text entity annotations will have a non-zero score. When there is overlap, Labelbox computes the weighted sum of the overlap length ratios, discounting for already counted overlaps. Whitespace is included in the calculation.

  1. Because the consensus agreement for NER is calculated at the character level, two spans that only partially overlap still receive partial credit.

📘

For example:

If labeler #1 annotates the full word "label" and labeler #2 annotates only "labe" in the same word of the text file, the agreement score between these two annotations would be 0.80 (4 of the 5 characters overlap).

  2. Labelbox then averages the agreements for each annotation created using that annotation class to create an overall score for that annotation class.
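
For the simplest case of exactly two entity spans, the character-level agreement can be approximated as the number of shared characters divided by the number of characters covered by either span. The sketch below is a simplification under that assumption; it does not reproduce the weighted handling of multiple overlapping spans described above. Spans are (start, end) character offsets with an exclusive end, and the function name is hypothetical.

```python
def span_agreement(span_a, span_b):
    """Character-level overlap ratio for two (start, end) spans, end exclusive."""
    overlap = min(span_a[1], span_b[1]) - max(span_a[0], span_b[0])
    if overlap <= 0:
        return 0.0  # spans do not overlap
    covered = max(span_a[1], span_b[1]) - min(span_a[0], span_b[0])
    return overlap / covered


# "label" covers characters 0-5, "labe" covers characters 0-4.
print(span_agreement((0, 5), (0, 4)))
```

With the two spans from the example, 4 shared characters out of 5 covered characters gives 0.80.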

How is consensus calculated for classifications?

The calculation method for each classification type is different. One commonality, however, is that if two classifications of the same type are compared and there are no corresponding selections between the two classifications at all, the agreement will be 0%.

  • A radio classification can only have one selected answer. Therefore, the agreement between the two radio classifications will either be 0% or 100%. 0% means no agreement and 100% means agreement.

  • A checklist classification can have more than one selected answer, which makes the agreement calculation a little more complex. The agreement between two checklist classifications is generated by dividing the number of overlapping answers by the number of selected answers.

  • A dropdown classification can have only one selected answer, but the answer choices can be nested. The calculation for dropdown is similar to that of checklist classification, except that the agreement calculation divides the number of overlapping answers by the total depth of the selection (how many levels). Answers nested under different top-level classifications can still overlap if the classifications at the next level match. Conversely, answers that do not match exactly can still overlap if they are under the same top-level classification.
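
The three rules above can be sketched as small functions. These are illustrations under stated assumptions, not the exact production logic: for checklists, "the number of selected answers" is taken to mean the distinct answers selected by either labeler; for dropdowns, each selection is a list of answers from the top level down, levels are compared position by position, and the depth is the longer of the two paths. All function names are hypothetical.

```python
def radio_agreement(answer_a, answer_b):
    """Radio: a single answer each, so agreement is all or nothing."""
    return 1.0 if answer_a == answer_b else 0.0


def checklist_agreement(answers_a, answers_b):
    """Checklist: overlapping answers divided by distinct selected answers."""
    selected = set(answers_a) | set(answers_b)
    if not selected:
        return 0.0
    return len(set(answers_a) & set(answers_b)) / len(selected)


def dropdown_agreement(path_a, path_b):
    """Dropdown: matching levels divided by the depth of the selection."""
    depth = max(len(path_a), len(path_b))
    if depth == 0:
        return 0.0
    matches = sum(1 for a, b in zip(path_a, path_b) if a == b)
    return matches / depth


print(radio_agreement("cat", "dog"))
print(checklist_agreement(["red", "blue"], ["blue", "green"]))
print(dropdown_agreement(["Vehicle", "Car"], ["Vehicle", "Truck"]))
```

In the dropdown call, the two selections share the top-level answer but differ at the second level, so one of two levels overlaps.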

For child classifications, if two annotations containing child classifications have 0 agreement (false positive), the child classifications will automatically be assigned a score of 0 as well.

Labelbox then creates a score for each annotation class by averaging all of the annotation scores.

For example, when Image X loads in the editor, the labelers have 3 classification questions to choose from (Q1, Q2, Q3) each with two answers.

Each of the dotted boxes represents a unique answer choice/answer schema.

Say, for example, these 2 labelers have the same answer for Q1 but different answers for Q2 and Q3.

Labeler 1

Labeler 2

For classifications, the consensus agreement is calculated based on how many unique answer schemas are selected by all labelers.

  • Q1-A: 1 (both labelers picked this answer)

  • Q1-B: N/A (neither labeler picked this answer) <-- not included in the final calculation.

  • Q2-A: 0 (Labeler 1 selected, Labeler 2 did not)

  • Q2-B: 0 (Labeler 2 selected, Labeler 1 did not)

  • Q3-A: 0 (Labeler 1 selected, Labeler 2 did not)

  • Q3-B: 0 (Labeler 2 selected, Labeler 1 did not)

So the final consensus calculation for the classifications on Image X is:

(1 + 0 + 0 + 0 + 0) / 5 = 0.20
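
The Image X calculation can be reproduced with a short sketch. It assumes each labeler's selections are represented as a set of answer-schema names such as "Q1-A"; the function name is hypothetical. Schemas that neither labeler selected never enter the set, which mirrors Q1-B being left out of the calculation.

```python
def schema_agreement(labeler_1, labeler_2):
    """Score each answer schema selected by either labeler, then average."""
    selected = labeler_1 | labeler_2
    if not selected:
        return None  # no answers selected, nothing to score
    # 1 if both labelers picked the schema, 0 if only one of them did.
    scores = {
        schema: 1 if schema in labeler_1 and schema in labeler_2 else 0
        for schema in selected
    }
    return sum(scores.values()) / len(scores)


labeler_1 = {"Q1-A", "Q2-A", "Q3-A"}
labeler_2 = {"Q1-A", "Q2-B", "Q3-B"}
print(schema_agreement(labeler_1, labeler_2))
```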

How is the consensus score calculated for objects + classifications?

Labelbox averages the scores for each annotation class (object-type & classification-type) to create an overall score for the asset. Each annotation class is weighted equally. Below is a simplified example.

Consensus score = (annotation class agreement + radio class agreement) / total annotation classes

0.795 = (0.59 + 1.00) / 2
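
The equal weighting can be expressed as a plain average over per-class scores. The sketch below assumes the class scores have already been computed and are passed in as a dictionary; the dictionary keys and the function name are hypothetical.

```python
def overall_consensus(class_scores):
    """Average per-class agreement scores, weighting every class equally."""
    if not class_scores:
        raise ValueError("at least one annotation class score is required")
    return sum(class_scores.values()) / len(class_scores)


# One object class and one radio classification, as in the example above.
print(overall_consensus({"Tree": 0.59, "Radio question": 1.00}))
```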

You can use the metric as an initial indicator of label quality, the clarity of your ontology, and/or the clarity of your labeling instructions.