Developer guide for importing annotations on document (PDF) data and sample import formats.
Overview
To import annotations in Labelbox, you need to create an annotations payload. In this section, we provide this payload for every supported annotation type.
Annotation payload types
Labelbox supports two formats for the annotations payload:
- Python annotation types (recommended)
  - Provide a seamless transition between third-party platforms, machine learning pipelines, and Labelbox.
  - Allow you to build annotations locally with local file paths, numpy arrays, or URLs.
  - Convert easily to NDJSON format for quick import into Labelbox.
  - Support one-level nested classification (radio, checklist, or free-form text) under a tool or classification annotation.
- JSON
  - Skips formatting the annotation payload as Labelbox Python annotation types.
  - Supports any level of nested classification (radio, checklist, or free-form text) under a tool or classification annotation.
Label import types
Labelbox additionally supports two types of label imports:
- Model-assisted labeling (MAL)
  - This workflow allows you to import computer-generated predictions (or simply annotations created outside of Labelbox) as pre-labels on an asset.
- Ground truth
  - This workflow allows you to bulk import ground truth annotations from an external or third-party labeling system into Labelbox Annotate. Importing external data with the label import API is a useful way to consolidate and migrate all annotations into Labelbox as a single source of truth.
Supported annotations
The following annotations are supported for a document data row:
- Radio
- Checklist
- Free-form text
- Bounding box
- Entity
- Relationship
Annotations are global and page based
Classification annotations are supported only at the global level, while tool annotations are page-specific and are imported on a per-page basis.
Classifications
Radio (single-choice, global)
radio_annotation = lb_types.ClassificationAnnotation(
name="radio_question",
value=lb_types.Radio(answer =
lb_types.ClassificationAnswer(name = "first_radio_answer")
)
)
radio_annotation_ndjson = {
"name": "radio_question",
"answer": {"name": "first_radio_answer"}
}
Checklist (multi-choice, global)
checklist_annotation = lb_types.ClassificationAnnotation(
name="checklist_question",
value=lb_types.Checklist(answer = [
lb_types.ClassificationAnswer(name = "first_checklist_answer"),
lb_types.ClassificationAnswer(name = "second_checklist_answer")
])
)
checklist_annotation_ndjson = {
"name": "checklist_question",
"answer": [
{"name": "first_checklist_answer"},
{"name": "second_checklist_answer"}
]
}
Free-form text (global)
text_annotation = lb_types.ClassificationAnnotation(
name="free_text", # must match your ontology feature's name
value=lb_types.Text(answer="sample text")
)
text_annotation_ndjson = {
"name": "free_text",
"answer": "sample text"
}
Nested Classifications (global)
nested_checklist_annotation = lb_types.ClassificationAnnotation(
name="nested_checklist_question",
value=lb_types.Checklist(
answer=[lb_types.ClassificationAnswer(
name="first_checklist_answer",
classifications=[
lb_types.ClassificationAnnotation(
name="sub_checklist_question",
value=lb_types.Checklist(
answer=[lb_types.ClassificationAnswer(
name="first_sub_checklist_answer"
)]
))
]
)]
)
)
nested_radio_annotation = lb_types.ClassificationAnnotation(
name="nested_radio_question",
value=lb_types.Radio(
answer=lb_types.ClassificationAnswer(
name="first_radio_answer",
classifications=[
lb_types.ClassificationAnnotation(
name="sub_radio_question",
value=lb_types.Radio(
answer=lb_types.ClassificationAnswer(
name="first_sub_radio_answer"
)
)
)
]
)
)
)
nested_checklist_annotation_ndjson = {
"name": "nested_checklist_question",
"answer": [{
"name": "first_checklist_answer",
"classifications" : [
{
"name": "sub_checklist_question",
"answer": {"name": "first_sub_checklist_answer"}
}
]
}]
}
nested_radio_annotation_ndjson = {
"name": "nested_radio_question",
"answer": {
"name": "first_radio_answer",
"classifications": [{
"name":"sub_radio_question",
"answer": { "name" : "first_sub_radio_answer"}
}]
}
}
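Because the two payload formats differ in how many levels of nesting they support, it can help to measure the nesting depth of a classification payload before choosing a format. The following is a minimal sketch in plain Python; `nesting_depth` is a hypothetical helper (not part of the Labelbox SDK) and it assumes the NDJSON layout shown above, where nested questions sit under an answer's `classifications` key.

```python
def nesting_depth(classification):
    """Count how many levels of sub-classifications hang under a classification."""
    answer = classification.get("answer")
    if answer is None or isinstance(answer, str):
        return 0  # free-form text answers carry no nested classifications
    # Radio answers are a single dict, checklist answers are a list of dicts
    answers = answer if isinstance(answer, list) else [answer]
    depths = [
        1 + nesting_depth(sub)
        for item in answers
        for sub in item.get("classifications", [])
    ]
    return max(depths, default=0)

# Same shape as nested_radio_annotation_ndjson above
sample = {
    "name": "nested_radio_question",
    "answer": {
        "name": "first_radio_answer",
        "classifications": [
            {"name": "sub_radio_question", "answer": {"name": "first_sub_radio_answer"}}
        ],
    },
}
depth = nesting_depth(sample)
print(depth)
```

A depth of 1 corresponds to the one-level nesting that Python annotation types support.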
Tools
Bounding box (page specific)
When importing a bounding box annotation, you need to specify its DocumentRectangle value, which defines its precise area on a document page. You can choose from the following RectangleUnit options, which determine the measurement unit used to define the dimensions and coordinates of the bounding box:
- INCHES
- PIXELS
- POINTS
- PERCENT
bbox_annotation = lb_types.ObjectAnnotation(
name="bounding_box", # must match your ontology feature's name
value=lb_types.DocumentRectangle(
start=lb_types.Point(x=102.771, y=135.3), # x = left, y = top
end=lb_types.Point(x=518.571, y=245.143), # x = left + width , y = top + height
page=0,
unit=lb_types.RectangleUnit.POINTS
)
)
bbox_annotation_ndjson = {
"name": "bounding_box",
"bbox": {
"top": 135.3,
"left": 102.771,
"height": 109.843,
"width": 415.8
},
"page": 0,
"unit": "POINTS"
}
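The Python annotation type describes the box by its start and end corners, while the NDJSON payload uses top, left, height, and width. A minimal sketch of the conversion, assuming both corners use the same unit; `points_to_ndjson_bbox` is a hypothetical helper (not part of the Labelbox SDK), and rounding to three decimals is an arbitrary choice to suppress floating-point noise.

```python
def points_to_ndjson_bbox(start, end, ndigits=3):
    """Convert (x, y) start/end corners to the NDJSON top/left/height/width layout."""
    left, top = start        # start corner: x = left, y = top
    right, bottom = end      # end corner: x = left + width, y = top + height
    return {
        "top": top,
        "left": left,
        "height": round(bottom - top, ndigits),
        "width": round(right - left, ndigits),
    }

# Corners taken from the bbox_annotation example above
bbox = points_to_ndjson_bbox(start=(102.771, 135.3), end=(518.571, 245.143))
print(bbox)
```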
Entity (page specific)
textSelections is the payload required for each entity annotation. Each textSelections item in the list requires the following fields:
- The groupId associated with a group of words.
- A list of tokenIds for each word in the group of words.
- The page of the document (1-indexed).
Both tokenIds and groupId can be extracted from the text layer URL attached to the data row. For more information on text layers, visit our import document data guide.
entities_annotations = lb_types.ObjectAnnotation(
name="named_entity",
value= lb_types.DocumentEntity(
name="named_entity",
textSelections=[
lb_types.DocumentTextSelection(
token_ids=[],
group_id="",
page=1
)
]
)
)
entities_annotations_ndjson = {
"name": "named_entity",
"textSelections": [
{
"tokenIds": [
"<UUID>", ## ids associated with each word in a group
],
"groupId": "<UUID>", ## id associated with a group of words
"page": 1,
}
]
}
Tool with nested classifications (page specific)
bbox_with_radio_subclass_annotation = lb_types.ObjectAnnotation(
name="bbox_with_radio_subclass",
value=lb_types.DocumentRectangle(
start=lb_types.Point(x=317.271, y=226.757), # x = left, y = top
end=lb_types.Point(x=566.657, y=420.986), # x = left + width , y = top + height
unit=lb_types.RectangleUnit.POINTS,
page=1
),
classifications=[
lb_types.ClassificationAnnotation(
name="sub_radio_question",
value=lb_types.Radio(
answer=lb_types.ClassificationAnswer(
name="first_sub_radio_answer",
classifications=[
lb_types.ClassificationAnnotation(
name="second_sub_radio_question",
value=lb_types.Radio(
answer=lb_types.ClassificationAnswer(
name="second_sub_radio_answer"
)
)
)
]
)
)
)
]
)
ner_with_checklist_subclass_annotation = lb_types.ObjectAnnotation(
name="ner_with_checklist_subclass",
value=lb_types.DocumentEntity(
name="ner_with_checklist_subclass",
text_selections=[
lb_types.DocumentTextSelection(
token_ids=[],
group_id="",
page=1
)
]
),
classifications=[
lb_types.ClassificationAnnotation(
name="sub_checklist_question",
value=lb_types.Checklist(
answer=[lb_types.ClassificationAnswer(name="first_sub_checklist_answer")]
)
)
]
)
bbox_with_radio_subclass_annotation_ndjson = {
"name": "bbox_with_radio_subclass",
"classifications": [
{
"name": "sub_radio_question",
"answer": {
"name": "first_sub_radio_answer",
"classifications": [
{
"name": "second_sub_radio_question",
"answer": {
"name": "second_sub_radio_answer"}
}
]
}
}
],
"bbox": {
"top": 226.757,
"left": 317.271,
"height": 194.229,
"width": 249.386
},
"page": 1,
"unit": "POINTS"
}
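Later parts of this guide refer to `ner_with_checklist_subclass_annotation_ndjson`, the NDJSON counterpart of the entity-with-checklist annotation. A minimal sketch of that payload follows. It is an assumption that combines the entity `textSelections` layout with the checklist `classifications` layout shown earlier, and the `<UUID>` placeholders must be replaced with IDs taken from the text layer.

```python
ner_with_checklist_subclass_annotation_ndjson = {
    "name": "ner_with_checklist_subclass",
    "classifications": [
        {
            "name": "sub_checklist_question",
            # Checklist answers are written as a list, as in checklist_annotation_ndjson
            "answer": [{"name": "first_sub_checklist_answer"}],
        }
    ],
    "textSelections": [
        {
            "tokenIds": ["<UUID>"],  # placeholder: ids of each word in the group
            "groupId": "<UUID>",     # placeholder: id of the group of words
            "page": 1,
        }
    ],
}
```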
Relationships (page specific)
bbox_source = lb_types.ObjectAnnotation(
name="bounding_box",
value=lb_types.DocumentRectangle(
start=lb_types.Point(x=188.257, y=68.875), # x = left, y = top
end=lb_types.Point(x=270.907, y=149.556), # x = left + width , y = top + height
unit=lb_types.RectangleUnit.POINTS,
page=1
),
)
bbox_target = lb_types.ObjectAnnotation(
name="bounding_box",
value=lb_types.DocumentRectangle(
start=lb_types.Point(x=96.424, y=66.251),
end=lb_types.Point(x=179.074, y=146.932),
unit=lb_types.RectangleUnit.POINTS,
page=1
),
)
bbox_relationship = lb_types.RelationshipAnnotation(
name="relationship",
value=lb_types.Relationship(
source=bbox_source,
target=bbox_target,
type=lb_types.Relationship.Type.UNIDIRECTIONAL,
))
entity_source = lb_types.ObjectAnnotation(
name="named_entity",
value= lb_types.DocumentEntity(
name="named_entity",
textSelections=[
lb_types.DocumentTextSelection(
token_ids=[],
group_id="",
page=1
)
]
)
)
entity_target = lb_types.ObjectAnnotation(
name="named_entity",
value=lb_types.DocumentEntity(
name="named_entity",
textSelections=[
lb_types.DocumentTextSelection(
token_ids=[],
group_id="",
page=1
)
]
)
)
entity_relationship = lb_types.RelationshipAnnotation(
name="relationship",
value=lb_types.Relationship(
source=entity_source,
target=entity_target,
type=lb_types.Relationship.Type.UNIDIRECTIONAL,
))
uuid_source = str(uuid.uuid4())
uuid_target = str(uuid.uuid4())
entity_source_ndjson = {
"name": "named_entity",
"uuid": uuid_source,
"textSelections": [
{
"tokenIds": [
""
],
"groupId": "",
"page": 1
}
]
}
entity_target_ndjson = {
"name": "named_entity",
"uuid": uuid_target,
"textSelections": [
{
"tokenIds": [
""
],
"groupId": "",
"page": 1
}
]
}
ner_relationship_annotation_ndjson = {
"name": "relationship",
"relationship": {
"source": uuid_source, # UUID reference to source annotation
"target": uuid_target, # UUID reference to target annotation
"type": "unidirectional"
}
}
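The NDJSON examples above cover only the entity relationship, while the import step also lists `bbox_source_ndjson`, `bbox_target_ndjson`, and `bbox_relationship_annotation_ndjson`. A minimal sketch of what those payloads could look like, assuming they follow the same layout as `bbox_annotation_ndjson` and `ner_relationship_annotation_ndjson`; the variable names `uuid_source_2` and `uuid_target_2` are illustrative, and the dimensions are derived from the start and end points of `bbox_source` and `bbox_target`.

```python
import uuid

# Illustrative names: each bounding box needs its own UUID so the relationship can reference it
uuid_source_2 = str(uuid.uuid4())
uuid_target_2 = str(uuid.uuid4())

bbox_source_ndjson = {
    "name": "bounding_box",
    "uuid": uuid_source_2,
    # width = 270.907 - 188.257, height = 149.556 - 68.875
    "bbox": {"top": 68.875, "left": 188.257, "height": 80.681, "width": 82.65},
    "page": 1,
    "unit": "POINTS",
}

bbox_target_ndjson = {
    "name": "bounding_box",
    "uuid": uuid_target_2,
    # width = 179.074 - 96.424, height = 146.932 - 66.251
    "bbox": {"top": 66.251, "left": 96.424, "height": 80.681, "width": 82.65},
    "page": 1,
    "unit": "POINTS",
}

bbox_relationship_annotation_ndjson = {
    "name": "relationship",
    "relationship": {
        "source": uuid_source_2,  # UUID reference to the source bounding box
        "target": uuid_target_2,  # UUID reference to the target bounding box
        "type": "unidirectional",
    },
}
```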
Creating text entity annotation
For importing entity annotations, you can use your own text_layer_url or a Labelbox-generated text_layer_url.
You can get the Labelbox-generated text_layer_url by exporting the data row. The following code snippet demonstrates this process.
# Export data row
task = lb.DataRow.export(client=client,global_keys=[global_key])
task.wait_till_done()
if task.has_result():
    stream = task.get_buffered_stream()
    text_layer = ""
    for output in stream:
        output_json = output.json
        text_layer = output_json['media_attributes']['text_layer_url']
    print(text_layer)
import requests
import json
# Helper method for updating text selections
def update_text_selections(annotation, group_id, list_tokens, page):
    return annotation.update({
        "textSelections": [
            {
                "groupId": group_id,
                "tokenIds": list_tokens,
                "page": page
            }
        ]
    })
# Fetch the content of the text layer
res = requests.get(text_layer)
# Phrases that we want to annotate, obtained from the text layer URL
content_phrases = ["Metal-insulator (MI) transitions have been one of the",
                   "T. Sasaki, N. Yoneyama, and N. Kobayashi",
                   "Organic charge transfer salts based on the donor",
                   "the experimental investigations on this issue have not"]
# Parse the text layer
text_selections = []
text_selections_ner = []
text_selections_source = []
text_selections_target = []
for obj in json.loads(res.text):
    for group in obj["groups"]:
        if group["content"] == content_phrases[0]:
            list_tokens = [x["id"] for x in group["tokens"]]
            # build text selections for Python Annotation Types
            document_text_selection = lb_types.DocumentTextSelection(groupId=group["id"], tokenIds=list_tokens, page=1)
            text_selections.append(document_text_selection)
            # build text selection for the NDJson annotations
            update_text_selections(annotation=entities_annotations_ndjson,
                                   group_id=group["id"],  # id representing group of words
                                   list_tokens=list_tokens,  # ids representing individual words from the group
                                   page=1)
        if group["content"] == content_phrases[1]:
            list_tokens_2 = [x["id"] for x in group["tokens"]]
            # build text selections for Python Annotation Types
            ner_text_selection = lb_types.DocumentTextSelection(groupId=group["id"], tokenIds=list_tokens_2, page=1)
            text_selections_ner.append(ner_text_selection)
            # build text selection for the NDJson annotations
            update_text_selections(annotation=ner_with_checklist_subclass_annotation_ndjson,
                                   group_id=group["id"],  # id representing group of words
                                   list_tokens=list_tokens_2,  # ids representing individual words from the group
                                   page=1)
        if group["content"] == content_phrases[2]:
            relationship_source = [x["id"] for x in group["tokens"]]
            # build text selections for Python Annotation Types
            text_selection_entity_source = lb_types.DocumentTextSelection(groupId=group["id"], tokenIds=relationship_source, page=1)
            text_selections_source.append(text_selection_entity_source)
            # build text selection for the NDJson annotations
            update_text_selections(annotation=entity_source_ndjson,
                                   group_id=group["id"],  # id representing group of words
                                   list_tokens=relationship_source,  # ids representing individual words from the group
                                   page=1)
        if group["content"] == content_phrases[3]:
            relationship_target = [x["id"] for x in group["tokens"]]
            # build text selections for Python Annotation Types
            text_selection_entity_target = lb_types.DocumentTextSelection(groupId=group["id"], tokenIds=relationship_target, page=1)
            text_selections_target.append(text_selection_entity_target)
            # build text selection for the NDJson annotations
            update_text_selections(annotation=entity_target_ndjson,
                                   group_id=group["id"],  # id representing group of words
                                   list_tokens=relationship_target,  # ids representing individual words from the group
                                   page=1)
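The four lookups above repeat the same pattern: find the group whose content matches a phrase, then collect its token IDs. A minimal sketch of that pattern as a reusable function, run against a small in-memory stand-in for the text layer. `find_text_selection` is a hypothetical helper, the sample IDs are made up, and the structure (a list of objects with `groups`, each group holding `id`, `content`, and `tokens`) is assumed from the parsing code above.

```python
def find_text_selection(text_layer, phrase, page=1):
    """Return an NDJSON-style text selection for the first group matching phrase, or None."""
    for obj in text_layer:
        for group in obj["groups"]:
            if group["content"] == phrase:
                return {
                    "groupId": group["id"],                           # id of the group of words
                    "tokenIds": [tok["id"] for tok in group["tokens"]],  # ids of the individual words
                    "page": page,
                }
    return None

# Made-up sample data standing in for json.loads(res.text)
sample_text_layer = [
    {
        "groups": [
            {
                "id": "group-1",
                "content": "T. Sasaki, N. Yoneyama, and N. Kobayashi",
                "tokens": [{"id": "tok-1"}, {"id": "tok-2"}],
            }
        ]
    }
]

selection = find_text_selection(sample_text_layer, "T. Sasaki, N. Yoneyama, and N. Kobayashi")
print(selection)
```

Returning `None` for a missing phrase makes it easy to detect phrases that do not match any group exactly before building the payload.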
Re-write the Python annotations to include text selections (only required for Python annotation types)
# re-write the entity annotation with text selections
entities_annotation_document_entity = lb_types.DocumentEntity(name="named_entity", textSelections = text_selections)
entities_annotation = lb_types.ObjectAnnotation(name="named_entity",value=entities_annotation_document_entity)
# re-write the entity annotation + subclassification with text selections
classifications = [
lb_types.ClassificationAnnotation(
name="sub_checklist_question",
value=lb_types.Checklist(
answer=[lb_types.ClassificationAnswer(name="first_sub_checklist_answer")]
)
)
]
ner_annotation_with_subclass = lb_types.DocumentEntity(name="ner_with_checklist_subclass", textSelections= text_selections_ner)
ner_with_checklist_subclass_annotation = lb_types.ObjectAnnotation(name="ner_with_checklist_subclass",
value=ner_annotation_with_subclass,
classifications=classifications)
# re-write the entity source and target annotations with text selections
entity_source_doc = lb_types.DocumentEntity(name="named_entity", text_selections= text_selections_source)
entity_source = lb_types.ObjectAnnotation(name="named_entity", value=entity_source_doc)
entity_target_doc = lb_types.DocumentEntity(name="named_entity", text_selections=text_selections_target)
entity_target = lb_types.ObjectAnnotation(name="named_entity", value=entity_target_doc)
# re-write the entity relationship with the re-created entities
entity_relationship = lb_types.RelationshipAnnotation(
name="relationship",
value=lb_types.Relationship(
source=entity_source,
target=entity_target,
type=lb_types.Relationship.Type.UNIDIRECTIONAL,
))
print(f"entities_annotations_ndjson={entities_annotations_ndjson}")
print(f"entities_annotation={entities_annotation}")
print(f"nested_entities_annotation_ndjson={ner_with_checklist_subclass_annotation_ndjson}")
print(f"nested_entities_annotation={ner_with_checklist_subclass_annotation}")
print(f"entity_source_ndjson={entity_source_ndjson}")
print(f"entity_target_ndjson={entity_target_ndjson}")
print(f"entity_source={entity_source}")
print(f"entity_target={entity_target}")
Example: Import pre-labels or ground truths
The steps to import annotations as pre-labels (model-assisted labeling) are similar to those for importing annotations as ground truth labels. They differ slightly, and the differences are described for each scenario.
Before you start
The following imports are needed to use the code examples in this section.
import uuid
import json
import labelbox as lb
import labelbox.types as lb_types
Replace the value of API_KEY with a valid API key to connect to the Labelbox client.
API_KEY = None
client = lb.Client(API_KEY)
Step 1: Import data rows
Data rows must be uploaded to Catalog before annotations can be attached to them.
This example shows how to create a data row in Catalog by attaching it to a dataset.
global_key = "0801.3483.pdf"
img_url = {
"row_data": {
"pdf_url": "https://storage.googleapis.com/labelbox-datasets/arxiv-pdf/data/99-word-token-pdfs/0801.3483.pdf",
},
"global_key": global_key
}
dataset = client.create_dataset(name="pdf_demo_dataset")
task = dataset.create_data_rows([img_url])
task.wait_till_done()
print(f"Failed data rows: {task.failed_data_rows}")
print(f"Errors: {task.errors}")
Step 2: Set up ontology
Your project ontology should support the tools and classifications required by your annotations. To ensure accurate schema feature mapping, the value used as the name parameter of each ontology feature should match the value of the name field in the corresponding annotation.
For example, the bounding box annotation created above uses the name bounding_box. When you set up your ontology, the bounding box tool must also be named bounding_box. The same alignment must hold for the other tools and classifications in your ontology.
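A quick way to catch name mismatches before importing is to compare the names used in your payloads with the names defined in the ontology. A minimal sketch in plain Python; `missing_ontology_names` is a hypothetical helper (not part of the Labelbox SDK), it checks only top-level names and ignores nested classifications, and the sample lists are illustrative.

```python
def missing_ontology_names(annotations, ontology_names):
    """Return annotation names that have no same-named feature in the ontology."""
    used = {annotation["name"] for annotation in annotations}
    return sorted(used - set(ontology_names))

# Illustrative NDJSON-style payloads and ontology feature names
annotations = [
    {"name": "bounding_box"},
    {"name": "named_entity"},
    {"name": "free_text"},
]
ontology_names = ["bounding_box", "named_entity", "radio_question"]

missing = missing_ontology_names(annotations, ontology_names)
print(missing)
```

An empty result means every top-level annotation name has a matching ontology feature.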
This example shows how to create an ontology containing all supported annotation types.
ontology_builder = lb.OntologyBuilder(
classifications=[ # List of Classification objects
lb.Classification(
class_type=lb.Classification.Type.RADIO,
name="radio_question",
scope = lb.Classification.Scope.GLOBAL,
options=[
lb.Option(value="first_radio_answer"),
lb.Option(value="second_radio_answer")
]
),
lb.Classification(
class_type=lb.Classification.Type.CHECKLIST,
name="checklist_question",
scope = lb.Classification.Scope.GLOBAL,
options=[
lb.Option(value="first_checklist_answer"),
lb.Option(value="second_checklist_answer")
]
),
lb.Classification(
class_type=lb.Classification.Type.TEXT,
name="free_text",
scope = lb.Classification.Scope.GLOBAL
),
lb.Classification(
class_type=lb.Classification.Type.RADIO,
name="nested_radio_question",
scope = lb.Classification.Scope.GLOBAL,
options=[
lb.Option("first_radio_answer",
options=[
lb.Classification(
class_type=lb.Classification.Type.RADIO,
name="sub_radio_question",
options=[lb.Option("first_sub_radio_answer")]
)
])
]
),
lb.Classification(
class_type=lb.Classification.Type.CHECKLIST,
name="nested_checklist_question",
scope = lb.Classification.Scope.GLOBAL,
options=[
lb.Option("first_checklist_answer",
options=[
lb.Classification(
class_type=lb.Classification.Type.CHECKLIST,
name="sub_checklist_question",
options=[lb.Option("first_sub_checklist_answer")]
)
])
]
),
],
tools=[ # List of Tool objects
lb.Tool( tool=lb.Tool.Type.BBOX,name="bounding_box"),
lb.Tool(tool=lb.Tool.Type.NER, name="named_entity"),
lb.Tool(tool=lb.Tool.Type.RELATIONSHIP,name="relationship"),
lb.Tool(tool=lb.Tool.Type.NER,
name="ner_with_checklist_subclass",
classifications=[
lb.Classification(
class_type=lb.Classification.Type.CHECKLIST,
name="sub_checklist_question",
options=[
lb.Option(value="first_sub_checklist_answer")
]
)
]),
lb.Tool( tool=lb.Tool.Type.BBOX,
name="bbox_with_radio_subclass",
classifications=[
lb.Classification(
class_type=lb.Classification.Type.RADIO,
name="sub_radio_question",
options=[
lb.Option(
value="first_sub_radio_answer" ,
options=[
lb.Classification(
class_type=lb.Classification.Type.RADIO,
name="second_sub_radio_question",
options=[lb.Option("second_sub_radio_answer")]
)]
)]
)]
)]
)
ontology = client.create_ontology("Document Annotation Import Demo",
ontology_builder.asdict(),
media_type=lb.MediaType.Document)
Step 3: Set up a labeling project
# Create a Labelbox project
project = client.create_project(name="PDF_annotation_demo",
queue_mode=lb.QueueMode.Batch,
media_type=lb.MediaType.Document)
project.connect_ontology(ontology)
Step 4: Send data rows to project
project.create_batch(
"PDF_annotation_batch", # Each batch in a project must have a unique name
global_keys=[global_key] , # a list of global keys, data rows, or data row ids
priority=5 # priority between 1(highest) - 5(lowest)
)
Step 5: Create annotation payloads
For help understanding annotation payloads, see overview. To declare payloads, you can use Python annotation types (preferred) or NDJSON objects.
These examples demonstrate each format and how to compose annotations into labels attached to data rows.
# create a Label
labels = []
labels.append(
lb_types.Label(
data={"global_key" : global_key },
annotations = [
entities_annotation,
checklist_annotation,
nested_checklist_annotation,
text_annotation,
radio_annotation,
nested_radio_annotation,
bbox_annotation,
bbox_with_radio_subclass_annotation,
ner_with_checklist_subclass_annotation,
entity_source,
entity_target,
entity_relationship,# Only supported for MAL imports
bbox_source,
bbox_target,
bbox_relationship # Only supported for MAL imports
]
)
)
label_ndjson = []
for annot in [
    entities_annotations_ndjson,
    checklist_annotation_ndjson,
    nested_checklist_annotation_ndjson,
    text_annotation_ndjson,
    radio_annotation_ndjson,
    nested_radio_annotation_ndjson,
    bbox_annotation_ndjson,
    bbox_with_radio_subclass_annotation_ndjson,
    ner_with_checklist_subclass_annotation_ndjson,
    entity_source_ndjson,
    entity_target_ndjson,
    ner_relationship_annotation_ndjson,  # Only supported for MAL imports
    bbox_source_ndjson,
    bbox_target_ndjson,
    bbox_relationship_annotation_ndjson  # Only supported for MAL imports
]:
    annot.update({
        "dataRow": {"globalKey": global_key},
    })
    label_ndjson.append(annot)
Step 6: Import annotation payload
For pre-label (model-assisted labeling) scenarios, pass your payload as the value of the predictions parameter. For ground truths, pass the payload to the labels parameter.
Warning
Relationship annotations are not supported for ground truth import jobs.
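If you build one NDJSON payload list for both import types, you can filter relationship annotations out before a ground truth import. A minimal sketch, assuming relationship payloads are identified by a top-level `relationship` key as in the NDJSON examples in this guide; `drop_relationships` is a hypothetical helper and the sample payloads are illustrative.

```python
def drop_relationships(ndjson_annotations):
    """Return only the annotations that are not relationship payloads."""
    return [annot for annot in ndjson_annotations if "relationship" not in annot]

# Illustrative payloads: one classification and one relationship
sample_payloads = [
    {"name": "free_text", "answer": "sample text"},
    {
        "name": "relationship",
        "relationship": {"source": "uuid-a", "target": "uuid-b", "type": "unidirectional"},
    },
]

ground_truth_payloads = drop_relationships(sample_payloads)
print([annot["name"] for annot in ground_truth_payloads])
```

The source and target annotations themselves are ordinary tool annotations, so they remain in the filtered list.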
Option A: Upload as pre-labels (model-assisted labeling)
This option is helpful for speeding up the initial labeling process and reducing the manual labeling workload for high-volume datasets.
# Upload MAL label for this data row in project
upload_job = lb.MALPredictionImport.create_from_objects(
    client=client,
    project_id=project.uid,
    name="mal_job" + str(uuid.uuid4()),
    predictions=labels  # the list of labels built in Step 5
)
print(f"Errors: {upload_job.errors}")
print(f"Status of uploads: {upload_job.statuses}")
Option B: Upload to a labeling project as ground truth
This option is helpful for loading high-confidence labels from another platform or previous projects that just need review rather than manual labeling effort.
# Upload label for this data row in project
upload_job = lb.LabelImport.create_from_objects(
    client=client,
    project_id=project.uid,
    name="label_import_job" + str(uuid.uuid4()),
    labels=labels  # the list of labels built in Step 5, without relationship annotations
)
print(f"Errors: {upload_job.errors}")
print(f"Status of uploads: {upload_job.statuses}")