Import document data

How to import document data and sample import formats.

Specifications

Format: PDF
Recommended size: 100 pages or fewer
Import methods:

  • IAM Delegated Access
  • Signed URLs (https URLs only)

When importing document data to Labelbox, you are no longer required to provide an OCR extract as a JSON file. If a data row doesn't include a text layer, Labelbox generates one automatically during PDF import using Google Document AI. The generated JSON file becomes your text layer, rendered on top of your PDF in the Document Editor.

Note: Existing PDF documents that were created without a text layer are not retroactively given a Labelbox-generated text layer.

Google Document AI has the following limitations:

  • The document must have no more than 15 pages.
  • The file size must not exceed 20 MB.
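
As a rough illustration of the two limits above, the following sketch flags documents that would exceed them before you rely on automatic text layer generation. The function name, the constants, and the choice to treat 1 MB as 1024 × 1024 bytes are assumptions made for this example; they are not part of Labelbox or Google Document AI.

```python
from typing import List

# Limits taken from the list above. Treating 1 MB as 1024 * 1024 bytes is an
# assumption; the documentation does not say which definition applies.
MAX_PAGES = 15
MAX_BYTES = 20 * 1024 * 1024


def document_ai_limit_issues(page_count: int, size_bytes: int) -> List[str]:
    """Return a human-readable issue for each limit the document exceeds."""
    issues = []
    if page_count > MAX_PAGES:
        issues.append(f"{page_count} pages exceeds the {MAX_PAGES}-page limit")
    if size_bytes > MAX_BYTES:
        issues.append(f"{size_bytes} bytes exceeds the {MAX_BYTES}-byte limit")
    return issues


if __name__ == "__main__":
    # Hypothetical document: 18 pages, 5 MB.
    for issue in document_ai_limit_issues(18, 5 * 1024 * 1024):
        print(issue)
```

You would supply the page count and file size yourself, for example from a PDF library and the file system. Documents that exceed either limit are candidates for a text layer you upload yourself.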

Additionally, Google Document AI optimizes documents before OCR processing. This optimization might include rotating images or pages to ensure text appears horizontally. Consequently, token coordinates are calculated based on the rotated/optimized images, resulting in potential discrepancies with the original PDF document.

For example, in a landscape-oriented PDF, the document is rotated 90 degrees before processing. As a result, all tokens in the text layer are also rotated by 90 degrees.

Uploading a text layer of your choice will continue to be supported. We currently support OCR generated from:

  • Adobe OCR
  • AWS Textract OCR
  • GCP OCR

Conversion scripts for custom-generated OCR

Once you have your PDF asset and its OCR text layer as a JSON file, use our conversion scripts to convert them to a format Labelbox can ingest.

You can find our conversion scripts here.

Text Layer Validation Schema

Your textLayer JSON file must adhere to the following JSON schema.

{
  "type": "array",
  "items": {
    "$ref": "#/$defs/page"
  },
  "$defs": {
    "page": {
      "type": "object",
      "properties": {
        "width": {
          "type": "number"
        },
        "height": {
          "type": "number"
        },
        "number": {
          "type": "number"
        },
        "units": {
          "enum": ["POINTS", "PERCENT"]
        },
        "groups": {
          "type": "array",
          "items": {
            "$ref": "#/$defs/group"
          }
        }
      },
      "required": ["number", "units", "groups"]
    },
    "group": {
      "type": "object",
      "properties": {
        "id": {
          "type": "string"
        },
        "content": {
          "type": "string"
        },
        "geometry": {
          "$ref": "#/$defs/geometry"
        },
        "tokens": {
          "type": "array",
          "items": {
            "$ref": "#/$defs/token"
          }
        }
      },
      "required": ["id", "content", "geometry", "tokens"]
    },
    "geometry": {
      "type": "object",
      "properties": {
        "left": {
          "type": "number"
        },
        "top": {
          "type": "number"
        },
        "width": {
          "type": "number"
        },
        "height": {
          "type": "number"
        }
      },
      "required": ["left", "top", "width", "height"]
    },
    "token": {
      "type": "object",
      "properties": {
        "id": {
          "type": "string"
        },
        "content": {
          "type": "string"
        },
        "geometry": {
          "$ref": "#/$defs/geometry"
        }
      },
      "required": ["id", "geometry", "content"]
    }
  }
}
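
To make the schema more concrete, here is a minimal sketch of a one-page text layer together with a hand-written check of the required fields. It uses only the Python standard library and is not the validator Labelbox runs; the function names, the sample IDs, and the coordinate values are all made up for illustration.

```python
def _is_number(value):
    # JSON Schema "number" excludes booleans, which Python treats as ints.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def _check_text_item(item, where, errors):
    """Check the fields shared by groups and tokens: id, content, geometry."""
    for key in ("id", "content"):
        if not isinstance(item.get(key), str):
            errors.append(f"{where}: {key} must be a string")
    geometry = item.get("geometry")
    if not isinstance(geometry, dict):
        errors.append(f"{where}: geometry must be an object")
        return
    for key in ("left", "top", "width", "height"):
        if not _is_number(geometry.get(key)):
            errors.append(f"{where}: geometry.{key} must be a number")


def validate_text_layer(pages):
    """Return a list of schema violations; an empty list means none found."""
    if not isinstance(pages, list):
        return ["text layer must be an array of pages"]
    errors = []
    for p, page in enumerate(pages):
        where = f"page[{p}]"
        if not isinstance(page, dict):
            errors.append(f"{where}: must be an object")
            continue
        if not _is_number(page.get("number")):
            errors.append(f"{where}: number must be a number")
        if page.get("units") not in ("POINTS", "PERCENT"):
            errors.append(f"{where}: units must be POINTS or PERCENT")
        for key in ("width", "height"):  # optional, but typed when present
            if key in page and not _is_number(page[key]):
                errors.append(f"{where}: {key} must be a number")
        groups = page.get("groups")
        if not isinstance(groups, list):
            errors.append(f"{where}: groups must be an array")
            continue
        for g, group in enumerate(groups):
            gwhere = f"{where}.groups[{g}]"
            if not isinstance(group, dict):
                errors.append(f"{gwhere}: must be an object")
                continue
            _check_text_item(group, gwhere, errors)
            tokens = group.get("tokens")
            if not isinstance(tokens, list):
                errors.append(f"{gwhere}: tokens must be an array")
                continue
            for t, token in enumerate(tokens):
                twhere = f"{gwhere}.tokens[{t}]"
                if not isinstance(token, dict):
                    errors.append(f"{twhere}: must be an object")
                    continue
                _check_text_item(token, twhere, errors)
    return errors


# Illustrative text layer: one page, one group, two tokens (values invented).
sample_text_layer = [{
    "number": 1, "units": "POINTS", "width": 612, "height": 792,
    "groups": [{
        "id": "group-1", "content": "Hello world",
        "geometry": {"left": 72, "top": 72, "width": 120, "height": 14},
        "tokens": [
            {"id": "token-1", "content": "Hello",
             "geometry": {"left": 72, "top": 72, "width": 55, "height": 14}},
            {"id": "token-2", "content": "world",
             "geometry": {"left": 132, "top": 72, "width": 60, "height": 14}},
        ],
    }],
}]

if __name__ == "__main__":
    print(validate_text_layer(sample_text_layer) or "no violations found")
```

A general-purpose JSON Schema validator would cover the same ground with less code; the explicit version is shown only so that each required field is visible.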

Parameters

| Parameter | Required | Description |
| --- | --- | --- |
| row_data | Yes | A dictionary of { "pdf_url": str, "text_layer_url": str }. For IAM Delegated Access, URLs must be in virtual-hosted-style format. |
| row_data['pdf_url'] | Yes | HTTPS path to a cloud-hosted PDF. It must be specified within the row_data dictionary. |
| row_data['text_layer_url'] | No | HTTPS path to a cloud-hosted JSON extract of the PDF. |
| global_key | No | Unique user-generated file name or ID for the file. Global keys must be unique within your organization. A data row is not imported if its global key duplicates that of an existing data row. |
| media_type | No | "PDF" (optional media type that enables better validation and error messaging). |
| metadata_fields | No | See Metadata. |
| attachments | No | See Attachments and Asset overlays. |
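
For Amazon S3, virtual-hosted-style means the bucket name is part of the host name rather than the first path segment. The sketch below rewrites a path-style S3 URL into that form using only the standard library. The function name is hypothetical and the host check is a simplification that assumes standard amazonaws.com endpoints.

```python
from urllib.parse import urlparse


def to_virtual_hosted_style(url: str) -> str:
    """Rewrite a path-style S3 URL so the bucket moves into the host name.

    Path-style:           https://s3.<region>.amazonaws.com/<bucket>/<key>
    Virtual-hosted-style: https://<bucket>.s3.<region>.amazonaws.com/<key>
    URLs that do not look like path-style S3 URLs are returned unchanged.
    """
    parsed = urlparse(url)
    host = parsed.netloc
    is_path_style = host.endswith(".amazonaws.com") and host.startswith(("s3.", "s3-"))
    if not is_path_style:
        return url
    bucket, _, key = parsed.path.lstrip("/").partition("/")
    if not bucket or not key:
        raise ValueError(f"cannot find bucket and key in {url!r}")
    # Keep the scheme and any query string; only host and path change.
    return parsed._replace(netloc=f"{bucket}.{host}", path=f"/{key}").geturl()


if __name__ == "__main__":
    print(to_virtual_hosted_style(
        "https://s3.us-west-1.amazonaws.com/lb-test-data/document-samples/0801.3483.pdf"
    ))
```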

Import format

[
  {
    "row_data": {
      "pdf_url": "https://lb-test-data.s3.us-west-1.amazonaws.com/document-samples/0801.3483.pdf",
      // You don't need to provide a text_layer_url. Labelbox automatically generates a text layer when importing an asset without one.
      "text_layer_url": "https://lb-test-data.s3.us-west-1.amazonaws.com/document-samples/0801.3483-lb-textlayer.json"
    },
    "global_key": "https://lb-test-data.s3.us-west-1.amazonaws.com/document-samples/0801.3483.pdf",
    "media_type": "PDF",
    "metadata_fields": [{"schema_id": "cko8s9r5v0001h2dk9elqdidh", "value": "tag_string"}],
    "attachments": [{"type": "HTML", "value": "https://www.wikipedia.org/" }]
  },
  {
    "row_data": {
      "pdf_url": "https://lb-test-data.s3.us-west-1.amazonaws.com/document-samples/0803.1972.pdf",
      // You don't need to provide a text_layer_url. Labelbox automatically generates a text layer when importing an asset without one.
      "text_layer_url": "https://lb-test-data.s3.us-west-1.amazonaws.com/document-samples/0803.1972-lb-textlayer.json"
    },
    "global_key": "https://lb-test-data.s3.us-west-1.amazonaws.com/document-samples/0803.1972.pdf",
    "media_type": "PDF",
    "metadata_fields": [{"name": "<metadata_field_name>", "value": "tag_string"}],
    "attachments": [{"type": "TEXT_URL", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/text_attachment.txt"}]
  }
]

The same format with assets hosted on Google Cloud Storage:

[
  {
    "row_data": {
      "pdf_url": "https://storage.googleapis.com/labelbox-datasets/arxiv-pdf/data/99-word-token-pdfs/0801.3483.pdf",
      // You don't need to provide a text_layer_url. Labelbox automatically generates a text layer when importing an asset without one.
      "text_layer_url": "https://storage.googleapis.com/labelbox-datasets/arxiv-pdf/data/99-word-token-pdfs/0801.3483-lb-textlayer.json"
    },
    "global_key": "https://storage.googleapis.com/labelbox-datasets/arxiv-pdf/data/99-word-token-pdfs/0801.3483.pdf",
    "media_type": "PDF",
    "metadata_fields": [{"schema_id": "cko8s9r5v0001h2dk9elqdidh", "value": "tag_string"}],
    "attachments": [{"type": "HTML", "value": "https://www.wikipedia.org/" }]
  }
]
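
If you generate the import file programmatically, a small helper can keep the rows consistent with the format above. The sketch below builds one data row and leaves out text_layer_url when none is given, so that a text layer can be generated at import. The helper name and its HTTPS check are assumptions for this example, not part of the Labelbox SDK.

```python
import json


def build_document_row(pdf_url, text_layer_url=None, global_key=None):
    """Build one data row dictionary in the import format shown above."""
    for url in (pdf_url, text_layer_url):
        if url is not None and not url.startswith("https://"):
            raise ValueError(f"expected an https URL, got {url!r}")
    row_data = {"pdf_url": pdf_url}
    if text_layer_url is not None:
        # Optional: omit it to rely on the automatically generated text layer.
        row_data["text_layer_url"] = text_layer_url
    row = {"row_data": row_data, "media_type": "PDF"}
    if global_key is not None:
        row["global_key"] = global_key
    return row


if __name__ == "__main__":
    pdf = "https://storage.googleapis.com/labelbox-datasets/arxiv-pdf/data/99-word-token-pdfs/0801.3483.pdf"
    rows = [build_document_row(pdf, global_key=pdf)]
    # json.dumps produces strict JSON, so the output contains no comments.
    print(json.dumps(rows, indent=2))
```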

Python example

from labelbox import Client

client = Client(api_key="<YOUR_API_KEY>")

dataset = client.create_dataset(name="Bulk import example - Documents")

assets = [
  {
    "row_data": {
      "pdf_url": "https://storage.googleapis.com/labelbox-datasets/arxiv-pdf/data/99-word-token-pdfs/0801.3483.pdf",
      # You don't need to provide a text_layer_url. Labelbox automatically
      # generates a text layer when importing an asset without one.
      "text_layer_url": "https://storage.googleapis.com/labelbox-datasets/arxiv-pdf/data/99-word-token-pdfs/0801.3483-lb-textlayer.json"
    },
    "global_key": "https://storage.googleapis.com/labelbox-datasets/arxiv-pdf/data/99-word-token-pdfs/0801.3483.pdf",
    "media_type": "PDF",
    "metadata_fields": [{"name": "<metadata_field_name>", "value": "tag_string"}],
    "attachments": [{"type": "HTML", "value": "https://www.wikipedia.org/" }]
  }
]

task = dataset.create_data_rows(assets)
task.wait_till_done()
print(task.errors)

# Alternatively, upload PDFs from local files.
local_file_paths = ['path/to/local/file1', 'path/to/local/file2'] # limit: 15k files

new_dataset = client.create_dataset(name="Local files upload")

try:
    task = new_dataset.create_data_rows(local_file_paths)
    task.wait_till_done()
except Exception as err:
    print(f'Error while creating Labelbox data rows: {err}')

Verify files are processed

Note: File processing can take up to 20 minutes. PDFs and their OCR text layers can be very large, so conversion after a data upload can sometimes take this long.

By checking the Media Attributes section, you can verify whether a file conversion using a custom or Labelbox-generated text layer is complete.

  • If Is text layer valid = true, the file was successfully processed.