Create a dataset
Data row vs dataset
In Labelbox, a data row represents an asset and all of its relevant information. A dataset is a collection of data rows imported to Labelbox at one time.

Key definitions
Term | Definition |
---|---|
Data row | Contains all of the following information for a single asset: the URL to your cloud-hosted file, metadata, media attributes (e.g., data type and size), and attachments (files that provide context for your labelers). |
Dataset | A set of data rows from a single domain or source |
Asset | A single cloud-hosted file to be labeled (e.g., an image, a video, or a text file). |
Attachment | Supplementary information you can attach to an asset that provides contextual information used as an aid during labeling. Learn more about attachments and image layers. |
Global key (recommended) | A customer-specified ID for each data row asset. This field is optional, but it is good practice to use global keys to map your external database or file paths to your Labelbox assets for easy retrieval. Global keys are enforced as unique at the Catalog (organization) level, which helps prevent duplicate data uploads. This is the preferred ID for identifying your assets. |
External ID | Optional ID to map a data row in Labelbox with your external database. It is not uniquely enforced. We recommend using global keys as IDs for your assets. |
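As a rough illustration of mapping file paths to global keys, the sketch below reuses each asset URL as its global key and checks for repeated keys locally before anything is uploaded. The helper names `build_data_rows` and `find_duplicate_global_keys` are hypothetical and not part of the Labelbox SDK. Only the `row_data` and `global_key` field names come from this page.

```python
from collections import Counter


def build_data_rows(urls):
    """Build data row dicts that reuse each asset URL as its global key."""
    return [{"row_data": url, "global_key": url} for url in urls]


def find_duplicate_global_keys(data_rows):
    """Return global keys that appear more than once, in first-seen order."""
    counts = Counter(row["global_key"] for row in data_rows)
    return [key for key, count in counts.items() if count > 1]


urls = [
    "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg",
    "https://storage.googleapis.com/labelbox-sample-datasets/Docs/rgb.jpg",
    "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg",
]
rows = build_data_rows(urls)
duplicates = find_duplicate_global_keys(rows)
print(duplicates)  # the basic.jpg URL is listed once because it occurs twice
```

A local check like this only catches duplicates within one batch. Uniqueness across your organization is still enforced by Labelbox at upload time.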
Supported data types
Name | Kinds | Import specs |
---|---|---|
Images | PNG, JPEG, BMP | Image import format |
Video | MP4 | Video import format |
Text | TXT (UTF-8) | Text import format |
Geospatial imagery | Tile Map Server | Geospatial import format |
Simple tiled | Tile Map Server | Simple tiled import format |
Audio | MP3, WAV, M4A | Audio import format |
Documents (beta) | PDF | Documents import format |
HTML (beta) | HTML | HTML import format |
DICOM | DCM | DICOM import format |
Best practices for creating datasets
Organization
It is best to put data from a single domain or source into a single dataset. Organizing your data this way makes it easier to set up your labeling workflows. For example, you could put all images coming from a particular type of medical device into a single dataset. You can then use metadata to further organize and filter the data rows within that dataset.
Naming
Clear names that explain the source and purpose of a dataset are best. For example, medical-device-type-1 would help identify this dataset as data relating to a particular version of a device. You can use the dataset description to include more context.
Option 1: Create a dataset via the Python SDK (recommended)
Recommended: Python SDK and Delegated Access
The most common way of importing data is via the Python SDK, after setting up an IAM delegated access integration (see Integrations). With IAM delegated access, you can keep your data in your cloud bucket and grant Labelbox limited access to the data on demand.
Limit on uploading data rows in one SDK operation
To ensure performance, we recommend uploading at most 150k data rows at a time with the dataset.create_data_rows method. If you include metadata in the same call, the limit is 30k. If you have a larger dataset to upload, you can split your data rows into chunks and upload them in sequence.
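The chunked upload described above could be sketched as follows. The helpers `chunked` and `upload_in_chunks` are hypothetical names used for illustration. The sketch assumes that `dataset` is a Labelbox dataset object whose `create_data_rows` call returns a task with `wait_till_done()` and `errors`, as in the example script on this page.

```python
def chunked(items, chunk_size):
    """Yield consecutive slices of items with at most chunk_size elements."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def upload_in_chunks(dataset, data_rows, chunk_size=150_000):
    """Upload data_rows one chunk at a time and collect any reported errors.

    Assumption: dataset behaves like a Labelbox dataset object.
    Pass chunk_size=30_000 when the rows include metadata.
    """
    all_errors = []
    for chunk in chunked(data_rows, chunk_size):
        task = dataset.create_data_rows(chunk)
        task.wait_till_done()  # finish this chunk before starting the next
        if task.errors:
            all_errors.append(task.errors)
    return all_errors
```

Waiting for each task before sending the next chunk keeps the uploads sequential, which matches the recommendation above. For rows that carry metadata, you would call `upload_in_chunks(dataset, assets, chunk_size=30_000)`.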
The example script below imports a set of images along with:
- Global keys
- Metadata
- Attachments
import labelbox
from uuid import uuid4 ## to generate unique IDs
import datetime
#Enter your API key
LB_API_KEY = "<INSERT API KEY>"
client = labelbox.Client(api_key=LB_API_KEY)
metadata_ontology = client.get_data_row_metadata_ontology()
dataset = client.create_dataset(name="Bulk import example")
assets = [{"row_data": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg", "global_key": str(uuid4())},
{"row_data": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg", "global_key": str(uuid4())},
{"row_data": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg", "global_key": str(uuid4())},
{"row_data": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg", "global_key": str(uuid4())},
{"row_data": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg", "global_key": str(uuid4())}]
asset_metadata_fields = [{"schema_id": metadata_ontology.reserved_by_name["captureDateTime"].uid, "value": datetime.datetime.utcnow()},
{"schema_id": metadata_ontology.reserved_by_name["tag"].uid, "value": "tag_string"},
{"schema_id": metadata_ontology.reserved_by_name["split"]["train"].parent, "value": metadata_ontology.reserved_by_name["split"]["train"].uid}]
asset_attachments = [{"type": "IMAGE_OVERLAY", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/rgb.jpg", "name": "RGB" },
{"type": "IMAGE_OVERLAY", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/cir.jpg", "name": "CIR"},
{"type": "IMAGE_OVERLAY", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/weeds.jpg", "name": "Weeds"},
{"type": "TEXT", "value": "IOWA, Zone 2232, June 2022 [Text string]"},
{"type": "TEXT", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/text_attachment.txt"},
{"type": "IMAGE", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/disease_attachment.jpeg"},
{"type": "VIDEO", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/drone_video.mp4"},
{"type": "HTML", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/windy.html"}]
# Attach the same metadata and attachments to every asset
for item in assets:
    item["metadata_fields"] = asset_metadata_fields
    item["attachments"] = asset_attachments
task = dataset.create_data_rows(assets)
task.wait_till_done()
print(task.errors)
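# Example: create a single data row, then add attachments to it
import datetime
import labelbox
#Enter your API key
API_KEY = ""
client = labelbox.Client(api_key=API_KEY)
#Get the metadata ontology to look up reserved metadata schema IDs
metadata_ontology = client.get_data_row_metadata_ontology()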
#create a new dataset
dataset = client.create_dataset(name="Data Row attachment example")
#Create metadata fields
metadata_fields = [{"schema_id": metadata_ontology.reserved_by_name["captureDateTime"].uid, "value": datetime.datetime.utcnow()},
{"schema_id": metadata_ontology.reserved_by_name["tag"].uid, "value": "tag_string"},
{"schema_id": metadata_ontology.reserved_by_name["split"]["train"].parent, "value": metadata_ontology.reserved_by_name["split"]["train"].uid}]
#create a data row with external ID
data_row = dataset.create_data_row(row_data="https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg",
external_id="base_image", metadata_fields=metadata_fields)
#Create multiple attachments
data_row.create_attachment(attachment_type="IMAGE_OVERLAY", attachment_value="https://storage.googleapis.com/labelbox-sample-datasets/Docs/rgb.jpg", attachment_name="RGB")
data_row.create_attachment(attachment_type="IMAGE_OVERLAY", attachment_value="https://storage.googleapis.com/labelbox-sample-datasets/Docs/cir.jpg", attachment_name="CIR")
data_row.create_attachment(attachment_type="IMAGE_OVERLAY", attachment_value="https://storage.googleapis.com/labelbox-sample-datasets/Docs/weeds.jpg", attachment_name="Weeds")
data_row.create_attachment(attachment_type="TEXT", attachment_value="IOWA, Zone 2232, June 2022 [Text string]")
data_row.create_attachment(attachment_type="TEXT", attachment_value="https://storage.googleapis.com/labelbox-sample-datasets/Docs/text_attachment.txt")
data_row.create_attachment(attachment_type="IMAGE", attachment_value="https://storage.googleapis.com/labelbox-sample-datasets/Docs/disease_attachment.jpeg")
data_row.create_attachment(attachment_type="VIDEO", attachment_value="https://storage.googleapis.com/labelbox-sample-datasets/Docs/drone_video.mp4")
data_row.create_attachment(attachment_type="HTML", attachment_value="https://storage.googleapis.com/labelbox-sample-datasets/Docs/windy.html")
Special character handling
Please note that certain characters, such as #, are not supported in URLs and should be avoided in your file names to prevent loading issues. A good litmus test for special character handling is to paste the URL into your browser address bar: if it doesn't load properly in your browser, it won't load in Labelbox.
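To see why a character like # causes trouble, the sketch below uses Python's standard `urllib.parse` module to show how a URL is split at the # sign, and how a file name could be percent-encoded. The helper names and the example.com URL are made up for illustration. Whether a percent-encoded URL loads in Labelbox is not confirmed by this page, so renaming the file remains the safer option.

```python
from urllib.parse import quote, urlsplit


def split_at_fragment(url):
    """Return (path, fragment): everything after '#' is parsed as a fragment."""
    parts = urlsplit(url)
    return parts.path, parts.fragment


def encode_file_name(file_name):
    """Percent-encode a file name so reserved characters such as '#' are escaped."""
    return quote(file_name, safe="")


path, fragment = split_at_fragment("https://example.com/bucket/image#1.jpg")
print(path)      # /bucket/image
print(fragment)  # 1.jpg
print(encode_file_name("image#1.jpg"))  # image%231.jpg
```

Because the part after # is treated as a fragment, a request for this URL asks the server for `/bucket/image` rather than `image#1.jpg`, which is why such files fail to load.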
See Common SDK methods for other dataset and data row methods.
Option 2: Upload a JSON file (with Delegated Access URLs)
This option is useful if you are unable to use the Python SDK. When you upload a JSON file, your data can remain in your cloud bucket. See Integrations to learn how to set up IAM delegated access.
- Create a JSON file containing your data, formatted according to its data type.
- Go to the Create a dataset page.
- Drag and drop your JSON file onto the page.
Give it a try using the examples below, which cover images, video, documents, and geospatial tiles, in that order. Copy and paste the content of one example into a text editor and save it as a JSON file (.json extension).
[
{
"row_data": "https://storage.googleapis.com/labelbox-datasets/image_sample_data/image-sample-1.jpg",
"global_key": "https://storage.googleapis.com/labelbox-datasets/image_sample_data/image-sample-1.jpg",
"media_type": "IMAGE",
"metadata_fields": [{"schema_id": "cko8s9r5v0001h2dk9elqdidh", "value": "tag_string"}],
"attachments": [{"type": "IMAGE_OVERLAY", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/rgb.jpg", "name": "RGB" }]
},
{
"row_data": "https://storage.googleapis.com/labelbox-datasets/image_sample_data/image-sample-2.jpg",
"global_key": "https://storage.googleapis.com/labelbox-datasets/image_sample_data/image-sample-2.jpg",
"media_type": "IMAGE",
"metadata_fields": [{"schema_id": "cko8s9r5v0001h2dk9elqdidh", "value": "tag_string"}],
"attachments": [{"type": "TEXT_URL", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/text_attachment.txt"}]
}
]
[
{
"row_data": "https://storage.googleapis.com/labelbox-datasets/video-sample-data/sample-video-1.mp4",
"global_key": "https://storage.googleapis.com/labelbox-datasets/video-sample-data/sample-video-1.mp4",
"media_type": "VIDEO",
"metadata_fields": [{"schema_id": "cko8s9r5v0001h2dk9elqdidh", "value": "tag_string"}],
"attachments": [{"type": "VIDEO", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/drone_video.mp4" }]
},
{
"row_data": "https://storage.googleapis.com/labelbox-datasets/video-sample-data/sample-video-2.mp4",
"global_key": "https://storage.googleapis.com/labelbox-datasets/video-sample-data/sample-video-2.mp4",
"media_type": "VIDEO",
"metadata_fields": [{"schema_id": "cko8s9r5v0001h2dk9elqdidh", "value": "tag_string"}],
"attachments": [{"type": "TEXT_URL", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/text_attachment.txt"}]
}
]
[
{
"row_data": {
"pdf_url": "https://storage.googleapis.com/labelbox-datasets/arxiv-pdf/data/99-word-token-pdfs/0801.3483.pdf",
"text_layer_url": "https://storage.googleapis.com/labelbox-datasets/arxiv-pdf/data/99-word-token-pdfs/0801.3483-lb-textlayer.json"
},
"global_key": "https://storage.googleapis.com/labelbox-datasets/arxiv-pdf/data/99-word-token-pdfs/0801.3483.pdf",
"media_type": "PDF",
"metadata_fields": [{"schema_id": "cko8s9r5v0001h2dk9elqdidh", "value": "tag_string"}],
"attachments": [{"type": "HTML", "value": "https://www.wikipedia.org/" }]
}
]
[
{
"row_data":{
"tile_layer_url": "https://s3-us-west-1.amazonaws.com/lb-tiler-layers/mexico_city/{z}/{x}/{y}.png",
"bounds": [
[
19.405662413477728,
-99.21052827588443
],
[
19.400498983095076,
-99.20534818927473
]
],
"min_zoom": 12,
"max_zoom": 20,
"epsg": "EPSG4326",
"alternative_layers": [
{
"tile_layer_url": "https://api.mapbox.com/styles/v1/mapbox/satellite-streets-v11/tiles/{z}/{x}/{y}?access_token=pk.eyJ1IjoibWFwYm94IiwiYSI6ImNpejY4NXVycTA2emYycXBndHRqcmZ3N3gifQ.rJcFIG214AriISLbB6B5aw",
"name": "Satellite"
},
{
"tile_layer_url": "https://api.mapbox.com/styles/v1/mapbox/navigation-guidance-night-v4/tiles/{z}/{x}/{y}?access_token=pk.eyJ1IjoibWFwYm94IiwiYSI6ImNpejY4NXVycTA2emYycXBndHRqcmZ3N3gifQ.rJcFIG214AriISLbB6B5aw",
"name": "Guidance"
}
]
},
"global_key": "https://s3-us-west-1.amazonaws.com/lb-tiler-layers/mexico_city/{z}/{x}/{y}.png",
"media_type": "TMS_GEO",
"metadata_fields": [{"schema_id": "cko8s9r5v0001h2dk9elqdidh", "value": "tag_string"}],
"attachments": [{"type": "TEXT_URL", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/text_attachment.txt"}]
}
]
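If you have many assets, a JSON file like the image example above could also be generated with a short script instead of being written by hand. This is a minimal sketch using only Python's standard `json` module. The helper names and the output file name `image_import.json` are arbitrary choices for illustration, and the field names mirror the examples above.

```python
import json


def build_image_rows(urls):
    """Build image data row dicts that use each URL as its global key."""
    return [
        {"row_data": url, "global_key": url, "media_type": "IMAGE"}
        for url in urls
    ]


def write_import_file(rows, path):
    """Write the rows as a JSON array to the given path."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f, indent=2)


urls = [
    "https://storage.googleapis.com/labelbox-datasets/image_sample_data/image-sample-1.jpg",
    "https://storage.googleapis.com/labelbox-datasets/image_sample_data/image-sample-2.jpg",
]
write_import_file(build_image_rows(urls), "image_import.json")
```

The resulting file can then be dragged onto the Create a dataset page as described in the steps above.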
Option 3: Upload data directly
Upload file size limitation
Each file you upload can be up to 500MB in size.
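Before uploading local files, you may want to check them against this limit. The sketch below is one way to do that with the standard library. `files_over_limit` is a hypothetical helper, and interpreting 500MB as 500,000,000 bytes is a conservative assumption, since this page does not say whether the limit is measured in decimal or binary megabytes.

```python
import os

# Assumption: "500MB" is read as 500 * 1000 * 1000 bytes (the stricter reading).
MAX_UPLOAD_BYTES = 500 * 1000 * 1000


def files_over_limit(paths, limit=MAX_UPLOAD_BYTES):
    """Return the paths whose file size in bytes exceeds the limit."""
    return [path for path in paths if os.path.getsize(path) > limit]
```

For example, `files_over_limit(['/tmp/test_data_row.txt'])` returns an empty list when that file exists and is within the limit.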
Upload files directly via the UI
- Log into Labelbox.
- Go to the Catalog and select New dataset.
- Upload any supported data types.
Upload files directly via the SDK
import uuid
# Local paths
local_data_path = '/tmp/test_data_row.txt'
with open(local_data_path, 'w') as file:
    file.write("sample data")
# Create a data row directly from the local file
task2 = dataset.create_data_rows([local_data_path])
task2.wait_till_done()
# Note that you cannot set external_ids at this time when uploading from local files.
# To do this, first upload the file to get a URL, then create the data row from that URL.
item_url = client.upload_file(local_data_path)
task4 = dataset.create_data_rows([{
    "row_data": item_url,
    "external_id": str(uuid.uuid4())
}])
task4.wait_till_done()
Append to an existing dataset
Adding data rows to an existing dataset can be accomplished by using the same methods described above.
## Get existing dataset
dataset = client.get_dataset("DATASET_ID")
## Follow data row creation as shown in earlier sections
You can also append data rows to a dataset in the UI. Go to Catalog, select your dataset from the left, then click Append to dataset.

Complete Python SDK tutorials
The tutorials below cover the most common CRUD methods on data rows and datasets.
Each dataset has a unique dataset ID
You can find the dataset ID in the Labelbox UI:
- Go to Catalog
- Select your dataset
- Copy the ID from the URL