Dataset

A developer guide for creating and modifying datasets via the Python SDK.

The most common method for importing data to Labelbox is via Python SDK after setting up a cloud storage integration. With an IAM delegated access integration, you can keep your data in your cloud bucket and grant Labelbox limited access to the data on demand.

πŸ“˜

Examples for all data types

The examples in this developer guide primarily use image assets. For sample approaches with other data modalities, please view the developer guide for importing data nested under each asset type in the Import/Export section of the table of contents.


Client

import labelbox as lb
client = lb.Client(api_key="<YOUR_API_KEY>")

Create a dataset

The only required argument when creating a dataset is the name.

dataset = client.create_dataset(
  name="<dataset_name>",
  description="<dataset_description>",  # optional
  iam_integration=None  # defaults to your organization's default integration; set to None to disable delegated access
)

Get a dataset

dataset = client.get_dataset("<dataset_id>")

# alternatively, you can get a dataset by name
dataset = client.get_datasets(where=lb.Dataset.name == "<dataset_name>").get_one()

Methods

Create data rows

🚧

Special character handling

Certain characters, such as #, <, >, and ||, are not supported in URLs and should be avoided in your file names to prevent loading issues.

See https://datatracker.ietf.org/doc/html/rfc2396#section-2.4.3 for URI standards.

A good test for the handling of special characters is to test URLs in your browser address bar β€” if the URL doesn't load properly in your browser, it won't load in Labelbox.
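As a quick programmatic check, you can also scan file names for these unsupported characters before building your upload payload. A minimal sketch; the character list and helper name are illustrative, not part of the SDK:

```python
# characters that are not supported in asset URLs (per the note above)
UNSUPPORTED = ("#", "<", ">", "||")

def filename_is_safe(filename: str) -> bool:
    """Return False if the file name contains characters that break URL loading."""
    return not any(ch in filename for ch in UNSUPPORTED)

print(filename_is_safe("basic.jpg"))         # True
print(filename_is_safe("scan#01<raw>.jpg"))  # False
```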

The only required argument when creating a data row is the row_data. However, Labelbox strongly recommends supplying each data row with a global key upon creation.

# this example uses the uuid package to generate unique global keys
from uuid import uuid4

data = [
  {
    "row_data": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg",
    "global_key": str(uuid4())
  },
  {
    "row_data": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg",
    "global_key": str(uuid4())
  }
]

dataset.upsert_data_rows(data) 

# or alternatively use 

dataset.create_data_rows(data)

You can also create data rows with metadata, attachments, and image overlays in the same task. The code below contains an end-to-end example for creating a dataset with data rows that include these elements.

import labelbox as lb
from uuid import uuid4
import datetime

# insert your API key
LB_API_KEY = "<API KEY>"
client = lb.Client(api_key=LB_API_KEY)

# get the metadata ontology
metadata_ontology = client.get_data_row_metadata_ontology()

# create the dataset
dataset = client.create_dataset(name="Bulk import example")

# build the assets
assets = [
  {"row_data": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg", "global_key": str(uuid4())},
  {"row_data": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg", "global_key": str(uuid4())},
  {"row_data": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg", "global_key": str(uuid4())}
]

# build the metadata
asset_metadata_fields = [
  {"name": "captureDateTime", "value": datetime.datetime.utcnow()},
  {"name": "tag", "value": "tag_string"},
  {"name": "split", "value": "train"}
]

# build the attachments
asset_attachments = [
  {"type": "IMAGE", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/disease_attachment.jpeg"},
  {"type": "VIDEO", "value":  "https://storage.googleapis.com/labelbox-sample-datasets/Docs/drone_video.mp4"},
  {"type": "RAW_TEXT", "value": "IOWA, Zone 2232, June 2022 [Text string]"},
  {"type": "RAW_TEXT", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/text_attachment.txt"},
  {"type": "TEXT_URL", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/text_attachment.txt"},
  {"type": "HTML", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/windy.html"},
  {"type": "IMAGE_OVERLAY", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/rgb.jpg", "name": "RGB" },
  {"type": "IMAGE_OVERLAY", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/cir.jpg", "name": "CIR"},
  {"type": "IMAGE_OVERLAY", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/weeds.jpg", "name": "Weeds"},
  {"type": "PDF_URL", "value": "https://storage.googleapis.com/labelbox-datasets/arxiv-pdf/data/99-word-token-pdfs/0801.3483.pdf"}
]

# connect the metadata and attachments to the data rows
for item in assets:
  item["metadata_fields"] = asset_metadata_fields
  item["attachments"] = asset_attachments

# create the data rows
task = dataset.upsert_data_rows(assets)  # or alternatively use create_data_rows()
task.wait_till_done()
print(task.errors)

πŸ“˜

Limits

See this page to learn the limits for creating data rows in one bulk operation. Regardless of your tier's limit, if you have a large dataset to upload, you can split the data rows into chunks and upload them sequentially.
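A simple chunking helper makes the sequential upload straightforward. A minimal sketch, assuming data is a list of data row dicts as shown above and the chunk size stays within your tier's limit; the helper name is illustrative:

```python
def chunked(items, chunk_size):
    """Yield successive chunks of at most chunk_size items."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]

# upload each chunk sequentially, waiting for each task to finish
# for chunk in chunked(data, 1000):
#     task = dataset.upsert_data_rows(chunk)
#     task.wait_till_done()
#     print(task.errors)

print([len(c) for c in chunked(list(range(10)), 4)])  # [4, 4, 2]
```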

Create a singular data row

# simplest example
dataset.create_data_row(
  row_data="https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg",
  global_key=str(uuid4())
)

Upload a local file

# get a local file and write some text data
local_data_path = "/tmp/test_data_row.txt"
with open(local_data_path, 'w') as file:
  file.write("sample data")

# create the data row
task = dataset.create_data_rows([local_data_path])
task.wait_till_done()

# you cannot set a global key when creating a data row directly from a local file
# to attach a global key, first upload the file, then create the data row from its URL
local_data_path = "/tmp/test_data_row.txt"
with open(local_data_path, 'w') as file:
  file.write("sample data")

# upload the file to Labelbox storage
file_url = client.upload_file(local_data_path)

# create the data row with a global key
task = dataset.create_data_rows([{
  "row_data": file_url,
  "global_key": "<unique_global_key>"
}])
task.wait_till_done()

Append to an existing dataset

You can add data rows to an existing dataset using the same methods described above for creating data rows. First, get a dataset, then create the rows.

# get existing dataset
dataset = client.get_dataset("<dataset_id>")

# use a method for data row creation as shown in sections above

Export data rows from a dataset

By filtering the export result, you can obtain a list of data row URLs, global keys, or data row IDs.

# Get a dataset
dataset = client.get_dataset("<dataset_id>")

# Start the export task for the dataset
export_task = dataset.export()
export_task.wait_till_done()

import json

export_json = []

# Callback used for JSON Converter; parse each streamed JSON line
def json_stream_handler(output: lb.JsonConverterOutput):
  export_json.append(json.loads(output.json_str))

if export_task.has_errors():
  export_task.get_stream(
    converter=lb.JsonConverter(),
    stream_type=lb.StreamType.ERRORS
  ).start(stream_handler=lambda error: print(error))

if export_task.has_result():
  export_json = export_task.get_stream(
    converter=lb.JsonConverter(),
    stream_type=lb.StreamType.RESULT
  ).start(stream_handler=json_stream_handler)

# Extract the data rows urls from the export result
data_row_urls = [item["data_row"]["row_data"] for item in export_json]

# To export all the global keys 
# global_keys = [item["data_row"]["global_key"] for item in export_json]

# To export all the data row ids
# data_row_ids = [item["data_row"]["id"] for item in export_json]

# Alternatively, export_v2() returns the full result once the task completes

# Get a dataset
dataset = client.get_dataset("<dataset_id>")

# Start the export task for the dataset
export_task = dataset.export_v2()
export_task.wait_till_done()

# Check for any errors in the export task
if export_task.errors:
    print(export_task.errors)
    
export_json = export_task.result

# Extract the data rows urls from the export result
data_row_urls = [item["data_row"]["row_data"] for item in export_json]

# To export all the global keys 
# global_keys = [item["data_row"]["global_key"] for item in export_json]

# To export all the data row ids
# data_row_ids = [item["data_row"]["id"] for item in export_json]

Export data rows with labels and predictions

For more details, see Export data rows from Catalog.

# set the export params to include/exclude certain fields
export_params={
  "attachments": True,
  "metadata_fields": True,
  "data_row_details": True,
  "project_details": True,
  "performance_details": True,
  "project_ids": ["<project_id_1>", "<project_id_2>"],
  "model_run_ids": ["<model_run_id_1>", "<model_run_id_2>"]
}

# set filters
filters={
  "last_activity_at": ["2000-01-01 00:00:00", "2050-01-01 00:00:00"],
  "label_created_at": ["2000-01-01 00:00:00", "2050-01-01 00:00:00"]
}

# get a dataset
dataset = client.get_dataset("<dataset_id>")

export_task = dataset.export(params=export_params, filters=filters)
export_task.wait_till_done()

import json

export_json = []

# Callback used for JSON Converter; parse each streamed JSON line
def json_stream_handler(output: lb.JsonConverterOutput):
  export_json.append(json.loads(output.json_str))

if export_task.has_errors():
  export_task.get_stream(
    converter=lb.JsonConverter(),
    stream_type=lb.StreamType.ERRORS
  ).start(stream_handler=lambda error: print(error))

if export_task.has_result():
  export_json = export_task.get_stream(
    converter=lb.JsonConverter(),
    stream_type=lb.StreamType.RESULT
  ).start(stream_handler=json_stream_handler)

# Alternatively, run the same export with export_v2()

# set the export params to include/exclude certain fields
export_params={
  "attachments": True,
  "metadata_fields": True,
  "data_row_details": True,
  "project_details": True,
  "performance_details": True,
  "project_ids": ["<project_id_1>", "<project_id_2>"],
  "model_run_ids": ["<model_run_id_1>", "<model_run_id_2>"]
}

# set filters
filters={
  "last_activity_at": ["2000-01-01 00:00:00", "2050-01-01 00:00:00"],
  "label_created_at": ["2000-01-01 00:00:00", "2050-01-01 00:00:00"]
}

# get a dataset
dataset = client.get_dataset("<dataset_id>")

# run the export task
export_task = dataset.export_v2(params=export_params, filters=filters)
export_task.wait_till_done()

# view errors and results
if export_task.errors:
  print(export_task.errors)
  
export_json = export_task.result
print("results: ", export_json)

Update a dataset

dataset.update(name="new_dataset_name")

Delete a dataset

❗️

Deleting a dataset cannot be undone

This method deletes the dataset along with all data rows in the dataset and any labels made on these data rows. This action cannot be reverted without the assistance of Labelbox support.

dataset.delete()

Attributes

Get the basics

# name (str)
dataset.name

# description (str)
dataset.description

# updated at (datetime)
dataset.updated_at

# created at (datetime)
dataset.created_at

# created by (relationship to User object)
user = dataset.created_by()

# organization (relationship to Organization object)
organization = dataset.organization()

Get the data rows

The data_rows() method is a relationship to the DataRow objects in the dataset. It retrieves a paginated collection of data rows.

data_rows = dataset.data_rows()

# inspect one data row
next(data_rows)

# inspect a number of data rows
for data_row in data_rows:
  print(data_row)
  
# for ease of use, you can convert the paginated collection to a list
list(data_rows)

Get the number of data rows

Because row_count is a cached attribute, you must re-fetch the dataset after creating data rows to retrieve the updated value.

dataset = client.get_dataset("<dataset_id>")
dataset.row_count