Import document data

How to import document data and sample import formats.

Specifications

Format: PDF
Recommended size: 100 pages or fewer
Import methods:

  • IAM Delegated Access
  • Signed URLs (https URLs only)
  • Direct upload of local files (256 MB max file size)
    Note: Direct upload does not currently support metadata or attachments; see the Python example below.

Format: JSON
Recommended size: 100 pages or fewer
Import methods:

  • IAM Delegated Access
  • Signed URLs (https URLs only)

When importing document data to Labelbox, you must provide two assets: the PDF itself and an OCR extract in JSON form. This JSON serves as the text layer, rendered on top of the PDF in the Document Editor.

We currently support OCR generated from:

  • Adobe OCR
  • AWS Textract OCR
  • GCP OCR

If you upload only a PDF, Labelbox renders the PDF.js output as the text layer, but the export raw text feature is disabled.

Conversion Scripts

Once you have your PDF asset and the OCR text layer as a JSON file, convert them to a Labelbox-ingestible format using our conversion scripts.

You can find our conversion scripts here.

Text Layer Validation Schema

Your textLayer JSON file must adhere to the following JSON schema.

{
  "type": "array",
  "items": {
    "$ref": "#/$defs/page"
  },
  "$defs": {
    "page": {
      "type": "object",
      "properties": {
        "width": {
          "type": "number"
        },
        "height": {
          "type": "number"
        },
        "number": {
          "type": "number"
        },
        "units": {
          "enum": ["POINTS", "PERCENT"]
        },
        "groups": {
          "type": "array",
          "items": {
            "$ref": "#/$defs/group"
          }
        }
      },
      "required": ["number", "units", "groups"]
    },
    "group": {
      "type": "object",
      "properties": {
        "id": {
          "type": "string"
        },
        "content": {
          "type": "string"
        },
        "geometry": {
          "$ref": "#/$defs/geometry"
        },
        "tokens": {
          "type": "array",
          "items": {
            "$ref": "#/$defs/token"
          }
        }
      },
      "required": ["id", "content", "geometry", "tokens"]
    },
    "geometry": {
      "type": "object",
      "properties": {
        "left": {
          "type": "number"
        },
        "top": {
          "type": "number"
        },
        "width": {
          "type": "number"
        },
        "height": {
          "type": "number"
        }
      },
      "required": ["left", "top", "width", "height"]
    },
    "token": {
      "type": "object",
      "properties": {
        "id": {
          "type": "string"
        },
        "content": {
          "type": "string"
        },
        "geometry": {
          "$ref": "#/$defs/geometry"
        }
      },
      "required": ["id", "geometry", "content"]
    }
  }
}
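As a rough illustration of what the schema requires, the sketch below checks a text layer with the Python standard library only. It is a hand-written approximation of the schema above, not Labelbox's own validator; the function name `validate_text_layer` and the sample IDs, text, and coordinates are made up for the example.

```python
REQUIRED_PAGE_KEYS = ("number", "units", "groups")
REQUIRED_GROUP_KEYS = ("id", "content", "geometry", "tokens")
REQUIRED_TOKEN_KEYS = ("id", "geometry", "content")
GEOMETRY_KEYS = ("left", "top", "width", "height")


def _is_number(value):
    # bool is a subclass of int in Python, but JSON booleans are not numbers.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def _check_geometry(geometry, where, errors):
    if not isinstance(geometry, dict):
        errors.append(f"{where}: geometry must be an object")
        return
    for key in GEOMETRY_KEYS:
        if not _is_number(geometry.get(key)):
            errors.append(f"{where}: geometry.{key} must be a number")


def _check_text_item(item, required, where, errors):
    """Checks shared by groups and tokens; returns False if item is not an object."""
    if not isinstance(item, dict):
        errors.append(f"{where}: must be an object")
        return False
    for key in required:
        if key not in item:
            errors.append(f"{where}: missing required key '{key}'")
    for key in ("id", "content"):
        if key in item and not isinstance(item[key], str):
            errors.append(f"{where}: '{key}' must be a string")
    if "geometry" in item:
        _check_geometry(item["geometry"], where, errors)
    return True


def validate_text_layer(pages):
    """Return a list of problems; an empty list means every check passed."""
    if not isinstance(pages, list):
        return ["top level must be an array of pages"]
    errors = []
    for p, page in enumerate(pages):
        where = f"page[{p}]"
        if not isinstance(page, dict):
            errors.append(f"{where}: must be an object")
            continue
        for key in REQUIRED_PAGE_KEYS:
            if key not in page:
                errors.append(f"{where}: missing required key '{key}'")
        for key in ("number", "width", "height"):
            if key in page and not _is_number(page[key]):
                errors.append(f"{where}: '{key}' must be a number")
        if "units" in page and page["units"] not in ("POINTS", "PERCENT"):
            errors.append(f"{where}: 'units' must be POINTS or PERCENT")
        groups = page.get("groups", [])
        if not isinstance(groups, list):
            errors.append(f"{where}: 'groups' must be an array")
            continue
        for g, group in enumerate(groups):
            gwhere = f"{where}.groups[{g}]"
            if not _check_text_item(group, REQUIRED_GROUP_KEYS, gwhere, errors):
                continue
            tokens = group.get("tokens", [])
            if not isinstance(tokens, list):
                errors.append(f"{gwhere}: 'tokens' must be an array")
                continue
            for t, token in enumerate(tokens):
                _check_text_item(token, REQUIRED_TOKEN_KEYS, f"{gwhere}.tokens[{t}]", errors)
    return errors


# Illustrative one-page text layer; all values are invented for this example.
sample = [{
    "number": 1, "units": "POINTS", "width": 612, "height": 792,
    "groups": [{
        "id": "g1", "content": "Hello world",
        "geometry": {"left": 72, "top": 72, "width": 120, "height": 14},
        "tokens": [
            {"id": "t1", "content": "Hello",
             "geometry": {"left": 72, "top": 72, "width": 55, "height": 14}},
            {"id": "t2", "content": "world",
             "geometry": {"left": 132, "top": 72, "width": 60, "height": 14}},
        ],
    }],
}]

print(validate_text_layer(sample))  # prints []
```

For strict conformance, a full JSON Schema validator run against the schema itself is the safer choice; this sketch only mirrors the required keys and basic types.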

Parameters

Parameter | Required | Description
row_data | Yes | A dictionary of { "pdf_url": str, "text_layer_url": str }. For IAM Delegated Access, each URL must be in virtual-hosted-style format.
row_data['pdf_url'] | Yes | https path to a cloud-hosted PDF. It must be specified within the row_data dictionary.
row_data['text_layer_url'] | Yes | https path to a cloud-hosted JSON extract of the PDF. It must be specified within the row_data dictionary.
global_key | No | Unique user-generated file name or ID for the file. Global keys must be unique within your organization. A data row is not imported if its global key duplicates that of an existing data row.
media_type | No | "PDF" (optional media type to provide better validation and error messaging)
metadata_fields | No | See Metadata.
attachments | No | See Attachments and Asset overlays.
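The parameters above can be assembled with a small helper. The sketch below is one possible way to do it, not part of the Labelbox SDK: the function name `build_document_row` and the example.com URLs are hypothetical, and the https check simply reflects the "https URLs only" requirement stated on this page.

```python
from urllib.parse import urlparse


def build_document_row(pdf_url, text_layer_url, global_key=None):
    """Build one document data row dictionary from its two asset URLs."""
    for label, url in (("pdf_url", pdf_url), ("text_layer_url", text_layer_url)):
        if urlparse(url).scheme != "https":
            raise ValueError(f"{label} must be an https URL, got: {url!r}")
    row = {
        "row_data": {"pdf_url": pdf_url, "text_layer_url": text_layer_url},
        "media_type": "PDF",
    }
    # global_key is optional, so only include it when the caller supplies one.
    if global_key is not None:
        row["global_key"] = global_key
    return row


# Hypothetical URLs, used only to show the resulting shape.
row = build_document_row(
    "https://example.com/docs/report.pdf",
    "https://example.com/docs/report-lb-textlayer.json",
    global_key="report.pdf",
)
print(row["row_data"]["pdf_url"])  # prints https://example.com/docs/report.pdf
```

Optional keys such as metadata_fields and attachments can be added to the returned dictionary before import.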

Import format

[
  {
    "row_data": {
      "pdf_url": "https://lb-test-data.s3.us-west-1.amazonaws.com/document-samples/0801.3483.pdf",
      "text_layer_url": "https://lb-test-data.s3.us-west-1.amazonaws.com/document-samples/0801.3483-lb-textlayer.json"
    },
    "global_key": "https://lb-test-data.s3.us-west-1.amazonaws.com/document-samples/0801.3483.pdf",
    "media_type": "PDF",
    "metadata_fields": [{"schema_id": "cko8s9r5v0001h2dk9elqdidh", "value": "tag_string"}],
    "attachments": [{"type": "HTML", "value": "https://www.wikipedia.org/" }]
  },
  {
    "row_data": {
      "pdf_url": "https://lb-test-data.s3.us-west-1.amazonaws.com/document-samples/0803.1972.pdf",
      "text_layer_url": "https://lb-test-data.s3.us-west-1.amazonaws.com/document-samples/0803.1972-lb-textlayer.json"
    },
    "global_key": "https://lb-test-data.s3.us-west-1.amazonaws.com/document-samples/0803.1972.pdf",
    "media_type": "PDF",
    "metadata_fields": [{"name": "<metadata_field_name>", "value": "tag_string"}],
    "attachments": [{"type": "TEXT_URL", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/text_attachment.txt"}]
  }
]

[
  {
    "row_data": {
      "pdf_url": "https://storage.googleapis.com/labelbox-datasets/arxiv-pdf/data/99-word-token-pdfs/0801.3483.pdf",
      "text_layer_url": "https://storage.googleapis.com/labelbox-datasets/arxiv-pdf/data/99-word-token-pdfs/0801.3483-lb-textlayer.json"
    },
    "global_key": "https://storage.googleapis.com/labelbox-datasets/arxiv-pdf/data/99-word-token-pdfs/0801.3483.pdf",
    "media_type": "PDF",
    "metadata_fields": [{"schema_id": "cko8s9r5v0001h2dk9elqdidh", "value": "tag_string"}],
    "attachments": [{"type": "HTML", "value": "https://www.wikipedia.org/" }]
  }
]
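Because data rows with duplicate global keys are not imported, it can help to look for repeats in a payload before sending it. The sketch below is a minimal local check with a hypothetical helper name, `find_duplicate_global_keys`; it only compares rows within the payload and cannot detect collisions with data rows that already exist in your organization.

```python
from collections import Counter


def find_duplicate_global_keys(rows):
    """Return the global keys that appear more than once in the payload, sorted."""
    counts = Counter(row["global_key"] for row in rows if "global_key" in row)
    return sorted(key for key, n in counts.items() if n > 1)


# Illustrative payload; row_data is omitted because only the keys matter here.
payload = [
    {"global_key": "doc-a.pdf"},
    {"global_key": "doc-b.pdf"},
    {"global_key": "doc-a.pdf"},
    {},  # rows without a global_key are ignored by the check
]
print(find_duplicate_global_keys(payload))  # prints ['doc-a.pdf']
```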

Python example

from labelbox import Client

client = Client(api_key="<YOUR_API_KEY>")

dataset = client.create_dataset(name="Bulk import example - Documents")

assets = [
  {
    "row_data": {
      "pdf_url": "https://storage.googleapis.com/labelbox-datasets/arxiv-pdf/data/99-word-token-pdfs/0801.3483.pdf",
      "text_layer_url": "https://storage.googleapis.com/labelbox-datasets/arxiv-pdf/data/99-word-token-pdfs/0801.3483-lb-textlayer.json"
    },
    "global_key": "https://storage.googleapis.com/labelbox-datasets/arxiv-pdf/data/99-word-token-pdfs/0801.3483.pdf",
    "media_type": "PDF",
    "metadata_fields": [{"name": "<metadata_field_name>", "value": "tag_string"}],
    "attachments": [{"type": "HTML", "value": "https://www.wikipedia.org/" }]
  }
]

task = dataset.create_data_rows(assets)
task.wait_till_done()
print(task.errors)

# Direct upload of local files (metadata and attachments are not supported)
local_file_paths = ['path/to/local/file1', 'path/to/local/file2']  # limit: 15k files

new_dataset = client.create_dataset(name="Local files upload")

try:
    task = new_dataset.create_data_rows(local_file_paths)
    task.wait_till_done()
except Exception as err:
    print(f'Error while creating Labelbox dataset: {err}')
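Since direct upload is limited to 256 MB per file, local files can be screened before they are passed to create_data_rows. The sketch below is a standard-library helper with a hypothetical name, `split_by_size`; it assumes 1 MB means 1024 * 1024 bytes, which this page does not specify.

```python
import os

# Assumption: the 256 MB limit is interpreted as 256 * 1024 * 1024 bytes.
MAX_UPLOAD_BYTES = 256 * 1024 * 1024


def split_by_size(paths, max_bytes=MAX_UPLOAD_BYTES):
    """Split paths into (uploadable, too_large) based on file size on disk."""
    uploadable, too_large = [], []
    for path in paths:
        if os.path.getsize(path) <= max_bytes:
            uploadable.append(path)
        else:
            too_large.append(path)
    return uploadable, too_large
```

For example, `uploadable, too_large = split_by_size(local_file_paths)` returns two lists; only the first would be passed to create_data_rows, and files in the second would need to be hosted in cloud storage and imported by URL instead.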

Verify files are processed

Note: File processing can take up to 20 minutes

Because PDFs and OCR output files can be very large, processing an uploaded file can take up to 20 minutes.

You can verify whether a file conversion is complete by checking the Media Attributes section.

  • If Is text layer valid is true, the file was processed successfully.
  • If Is text layer url is not present, the file was not uploaded successfully.