Specifications

Format: PDF
Recommended size: 100 pages or fewer
Import methods:

  • Direct upload (256 MB max file size)
  • IAM Delegated Access
  • Signed URLs (https URLs only)

Format: JSON
Recommended size: 100 pages or fewer
Import methods:

  • Direct upload (256 MB max file size)
  • IAM Delegated Access
  • Signed URLs (https URLs only)

When importing document data to Labelbox, you provide two assets: the PDF itself and an OCR extract of it as a JSON file. This JSON file is your text layer, which is rendered on top of your PDF in the Document Editor.

We currently support Adobe OCR text layer as well as AWS Textract OCR extract. Support for GCP OCR is coming soon.

If you upload only a PDF, we render the PDF.js output as the text layer.

Conversion Scripts

Once you have your PDF asset and its OCR text layer as a JSON file, convert them to a Labelbox-ingestible format using our conversion scripts.

You can find our conversion scripts here.
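
To give a sense of what such a conversion involves, here is a minimal sketch for AWS Textract output. It is not the official Labelbox conversion script. The function name `textract_to_text_layer` and the `page_sizes` argument are hypothetical; the sketch assumes you already know each page's size in points, and that pages are numbered from 1 as in Textract's `Page` field (check what your text layer is expected to use). Textract reports bounding boxes as ratios of the page size, so each box is scaled to points.

```python
from typing import Any


def textract_to_text_layer(
    textract: dict[str, Any],
    page_sizes: dict[int, tuple[float, float]],
) -> list[dict[str, Any]]:
    """Turn Textract LINE/WORD blocks into text-layer pages (illustrative only).

    page_sizes maps a page number to its (width, height) in points.
    """
    blocks = textract["Blocks"]
    by_id = {block["Id"]: block for block in blocks}

    def to_points(block, width, height):
        # Textract bounding boxes are ratios of the page width and height.
        box = block["Geometry"]["BoundingBox"]
        return {
            "left": box["Left"] * width,
            "top": box["Top"] * height,
            "width": box["Width"] * width,
            "height": box["Height"] * height,
        }

    pages = {
        number: {"width": w, "height": h, "number": number,
                 "units": "POINTS", "groups": []}
        for number, (w, h) in page_sizes.items()
    }

    for block in blocks:
        if block["BlockType"] != "LINE":
            continue
        page = pages[block.get("Page", 1)]
        w, h = page["width"], page["height"]
        # A LINE block lists its WORD blocks through CHILD relationships.
        child_ids = [
            child_id
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for child_id in rel["Ids"]
        ]
        tokens = [
            {"id": cid, "content": by_id[cid]["Text"],
             "geometry": to_points(by_id[cid], w, h)}
            for cid in child_ids
            if by_id[cid]["BlockType"] == "WORD"
        ]
        page["groups"].append({
            "id": block["Id"],
            "content": block["Text"],
            "geometry": to_points(block, w, h),
            "tokens": tokens,
        })

    return [pages[number] for number in sorted(pages)]


def _box(left, top, width, height):
    return {"BoundingBox": {"Left": left, "Top": top,
                            "Width": width, "Height": height}}


# A hand-written, Textract-shaped sample: one line made of two words.
sample = {
    "Blocks": [
        {"BlockType": "LINE", "Id": "line-1", "Text": "Hello world", "Page": 1,
         "Geometry": _box(0.25, 0.5, 0.5, 0.125),
         "Relationships": [{"Type": "CHILD", "Ids": ["word-1", "word-2"]}]},
        {"BlockType": "WORD", "Id": "word-1", "Text": "Hello", "Page": 1,
         "Geometry": _box(0.25, 0.5, 0.25, 0.125)},
        {"BlockType": "WORD", "Id": "word-2", "Text": "world", "Page": 1,
         "Geometry": _box(0.5, 0.5, 0.25, 0.125)},
    ]
}

layer = textract_to_text_layer(sample, {1: (612.0, 792.0)})
print(layer[0]["groups"][0]["content"])
```

Each Textract line becomes a group and each of its words becomes a token. The optional typography field is left out because Textract output does not carry font information in these blocks.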

Text Layer Validation Schema

Your textLayer JSON file must adhere to the following JSON schema.

{
  "type": "array",
  "items": {
    "$ref": "#/$defs/page"
  },
  "$defs": {
    "page": {
      "type": "object",
      "properties": {
        "width": {
          "type": "number"
        },
        "height": {
          "type": "number"
        },
        "number": {
          "type": "number"
        },
        "units": {
          "const": "POINTS"
        },
        "groups": {
          "type": "array",
          "items": {
            "$ref": "#/$defs/group"
          }
        }
      },
      "required": ["width", "height", "number", "units", "groups"]
    },
    "group": {
      "type": "object",
      "properties": {
        "id": {
          "type": "string"
        },
        "content": {
          "type": "string"
        },
        "geometry": {
          "$ref": "#/$defs/geometry"
        },
        "typography": {
          "$ref": "#/$defs/typography"
        },
        "tokens": {
          "type": "array",
          "items": {
            "$ref": "#/$defs/token"
          }
        }
      },
      "required": ["id", "content", "geometry", "tokens"]
    },
    "geometry": {
      "type": "object",
      "properties": {
        "left": {
          "type": "number"
        },
        "top": {
          "type": "number"
        },
        "width": {
          "type": "number"
        },
        "height": {
          "type": "number"
        }
      },
      "required": ["left", "top", "width", "height"]
    },
    "typography": {
      "type": "object",
      "properties": {
        "fontSize": {
          "type": "number"
        },
        "fontFamily": {
          "type": "string"
        }
      },
      "required": ["fontSize", "fontFamily"]
    },
    "token": {
      "type": "object",
      "properties": {
        "id": {
          "type": "string"
        },
        "content": {
          "type": "string"
        },
        "geometry": {
          "$ref": "#/$defs/geometry"
        }
      },
      "required": ["id", "geometry", "content"]
    }
  }
}
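
Before uploading, it can help to run a quick local check against the schema above. The following is a minimal sketch using only the Python standard library. It is not a full JSON Schema validator: it checks required keys, the units constant, and that geometry values are numbers, and it does not inspect the optional typography object. The function name `check_text_layer` is hypothetical.

```python
from numbers import Real

PAGE_KEYS = ("width", "height", "number", "units", "groups")
GROUP_KEYS = ("id", "content", "geometry", "tokens")
TOKEN_KEYS = ("id", "geometry", "content")
GEOMETRY_KEYS = ("left", "top", "width", "height")


def _missing(obj, keys, where):
    # List every required key that is absent from obj.
    return [f"{where}: missing '{key}'" for key in keys if key not in obj]


def _check_geometry(geometry, where):
    errors = _missing(geometry, GEOMETRY_KEYS, where)
    for key in GEOMETRY_KEYS:
        if key not in geometry:
            continue
        value = geometry[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, Real):
            errors.append(f"{where}: '{key}' must be a number")
    return errors


def check_text_layer(pages):
    """Return a list of problems; an empty list means none were detected."""
    if not isinstance(pages, list):
        return ["text layer must be a JSON array of pages"]
    errors = []
    for p, page in enumerate(pages):
        where = f"page[{p}]"
        errors += _missing(page, PAGE_KEYS, where)
        if "units" in page and page["units"] != "POINTS":
            errors.append(f"{where}: 'units' must be \"POINTS\"")
        for g, group in enumerate(page.get("groups", [])):
            gwhere = f"{where}.groups[{g}]"
            errors += _missing(group, GROUP_KEYS, gwhere)
            if "geometry" in group:
                errors += _check_geometry(group["geometry"], f"{gwhere}.geometry")
            for t, token in enumerate(group.get("tokens", [])):
                twhere = f"{gwhere}.tokens[{t}]"
                errors += _missing(token, TOKEN_KEYS, twhere)
                if "geometry" in token:
                    errors += _check_geometry(token["geometry"], f"{twhere}.geometry")
    return errors


box = {"left": 72, "top": 72, "width": 40, "height": 12}
example_layer = [{
    "width": 612, "height": 792, "number": 1, "units": "POINTS",
    "groups": [{
        "id": "g1", "content": "Hello", "geometry": box,
        "tokens": [{"id": "t1", "content": "Hello", "geometry": box}],
    }],
}]
print(check_text_layer(example_layer))  # []
```

For a complete check, a JSON Schema library run against the schema itself would be the more thorough option; the sketch only catches the most common structural mistakes.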

Parameters

Parameter | Required | Description
row_data | Yes | A dictionary of { "pdf_url": str, "text_layer_url": str }. For IAM Delegated Access, these URLs must be in virtual-hosted-style format.
row_data['pdf_url'] | Yes | https path to a cloud-hosted PDF. It must be specified within the row_data dictionary.
row_data['text_layer_url'] | Yes | https path to a cloud-hosted JSON extract of the PDF. It must be specified within the row_data dictionary.
global_key | No | Unique user-generated file name or ID for the file. Global keys must be unique within your org; a data row is not imported if its global key duplicates that of an existing data row.
media_type | No | "PDF". Optional media type that enables better validation and error messages.
metadata_fields | No | A list of metadata entries for the data row, each given as a schema_id and a value, as in the import format.
attachments | No | Attachments and asset overlays.
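
As an illustration of how these parameters fit together, here is a minimal sketch of a helper that assembles one data row dictionary. The helper name `build_document_row` is hypothetical and not part of the Labelbox SDK; it only builds a plain dictionary and rejects URLs that are not https, since both paths are described above as https paths.

```python
from urllib.parse import urlparse


def build_document_row(pdf_url, text_layer_url, global_key=None):
    """Build one data row dictionary for a PDF and its text layer."""
    for name, url in (("pdf_url", pdf_url), ("text_layer_url", text_layer_url)):
        if urlparse(url).scheme != "https":
            raise ValueError(f"{name} must be an https URL, got: {url}")
    row = {
        # Both URLs must live inside the row_data dictionary.
        "row_data": {"pdf_url": pdf_url, "text_layer_url": text_layer_url},
        "media_type": "PDF",
    }
    if global_key is not None:
        row["global_key"] = global_key
    return row


base = "https://storage.googleapis.com/labelbox-datasets/arxiv-pdf/data/99-word-token-pdfs"
row = build_document_row(
    pdf_url=f"{base}/0801.3483.pdf",
    text_layer_url=f"{base}/0801.3483-lb-textlayer.json",
    global_key=f"{base}/0801.3483.pdf",
)
print(row["row_data"]["text_layer_url"])
```

Optional keys such as metadata_fields and attachments can be added to the returned dictionary in the same way before the rows are imported.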

Import format

[
  {
    "row_data": {
      "pdf_url": "https://lb-test-data.s3.us-west-1.amazonaws.com/document-samples/sample-document-1.pdf",
      "text_layer_url": "https://lb-test-data.s3.us-west-1.amazonaws.com/document-samples/sample-document-custom-text-layer.json"
    },
    "global_key": "https://lb-test-data.s3.us-west-1.amazonaws.com/document-samples/sample-document-1.pdf",
    "media_type": "PDF",
    "metadata_fields": [{"schema_id": "cko8s9r5v0001h2dk9elqdidh", "value": "tag_string"}],
    "attachments": [{"type": "HTML", "value": "https://www.wikipedia.org/" }]
  },
  {
    "row_data": {
      "pdf_url": "https://lb-test-data.s3.us-west-1.amazonaws.com/document-samples/sample-document-2.pdf",
      "text_layer_url": "https://lb-test-data.s3.us-west-1.amazonaws.com/document-samples/sample-document-2-custom-text-layer.json"
    },
    "global_key": "https://lb-test-data.s3.us-west-1.amazonaws.com/document-samples/sample-document-2.pdf",
    "media_type": "PDF",
    "metadata_fields": [{"schema_id": "cko8s9r5v0001h2dk9elqdidh", "value": "tag_string"}],
    "attachments": [{"type": "TEXT_URL", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/text_attachment.txt"}]
  }
]
[
  {
    "row_data": {
      "pdf_url": "https://storage.googleapis.com/labelbox-datasets/arxiv-pdf/data/99-word-token-pdfs/0801.3483.pdf",
      "text_layer_url": "https://storage.googleapis.com/labelbox-datasets/arxiv-pdf/data/99-word-token-pdfs/0801.3483-lb-textlayer.json"
    },
    "global_key": "https://storage.googleapis.com/labelbox-datasets/arxiv-pdf/data/99-word-token-pdfs/0801.3483.pdf",
    "media_type": "PDF",
    "metadata_fields": [{"schema_id": "cko8s9r5v0001h2dk9elqdidh", "value": "tag_string"}],
    "attachments": [{"type": "HTML", "value": "https://www.wikipedia.org/" }]
  }
]

Python example

from labelbox import Client

# Authenticate with your Labelbox API key
client = Client(api_key="<YOUR_API_KEY>")

# Create the dataset that will hold the document data rows
dataset = client.create_dataset(name="Bulk import example - Documents")

assets = [
  {
    "row_data": {
      "pdf_url": "https://storage.googleapis.com/labelbox-datasets/arxiv-pdf/data/99-word-token-pdfs/0801.3483.pdf",
      "text_layer_url": "https://storage.googleapis.com/labelbox-datasets/arxiv-pdf/data/99-word-token-pdfs/0801.3483-lb-textlayer.json"
    },
    "global_key": "https://storage.googleapis.com/labelbox-datasets/arxiv-pdf/data/99-word-token-pdfs/0801.3483.pdf",
    "media_type": "PDF",
    "metadata_fields": [{"schema_id": "cko8s9r5v0001h2dk9elqdidh", "value": "tag_string"}],
    "attachments": [{"type": "HTML", "value": "https://www.wikipedia.org/" }]
  }
]

# Start the import, wait for the task to finish, then print any errors
task = dataset.create_data_rows(assets)
task.wait_till_done()
print(task.errors)

📘

For additional questions or special requests, please reach out to Labelbox support.

Verify files are processed

🚧

File processing can take up to 20 mins

Because PDFs and OCR output files can be very large, processing an upload can take up to 20 minutes.

You can verify whether a file conversion is complete by checking the Media Attributes section.

  • If Is text layer valid is true, the file was processed successfully.
  • If Is text layer url is not present, the text layer was not uploaded successfully.