Dataset

Summary of methods for creating and modifying datasets.

The most common method for importing data to Labelbox is via the Python SDK after setting up a cloud storage integration. With an IAM delegated access integration, you can keep your data in your cloud bucket and grant Labelbox limited access to the data on demand.
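
For reference, the snippets in this guide assume an authenticated SDK client. A minimal setup looks like the sketch below, assuming your API key is stored in an environment variable (the variable name here is an assumption).

import os
import labelbox as lb

# create an authenticated client; reading the key from an environment variable is an assumption
client = lb.Client(api_key=os.environ["LB_API_KEY"])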

📘

Examples for all data types

The examples in this developer guide primarily use image assets. For sample approaches with other data modalities, please view the developer guide for importing data nested under each asset type in the Import/Export section of the table of contents.


Create a dataset

The only required argument when creating a dataset is the name. You can also specify an IAM integration that you have set up on your account; if this argument is not specified, your default integration is used, and if it is set to None, no delegated access integration is used. For more details, including how to get your IAM integrations, visit our dedicated IAM integration page.

dataset = client.create_dataset(
  name='<dataset_name>',
  description='<dataset_description>',	# optional
  iam_integration=None		# if not specified, will use default integration, set as None to not use delegated access.
)

Get a dataset

dataset = client.get_dataset("<dataset_id>")

# alternatively, you can get a dataset by name
dataset = client.get_datasets(where=lb.Dataset.name == "<dataset_name>").get_one()

Dataset methods

Create data rows

🚧

Special character handling

Please note that certain characters, such as #, <, >, and |, are not supported in URLs and should be avoided in your file names to prevent loading issues.

For URI standards, refer to https://datatracker.ietf.org/doc/html/rfc2396#section-2.4.3.

A good test for the handling of special characters is to test URLs in your browser address bar — if the URL doesn't load properly in your browser, it won't load in Labelbox.
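
If renaming files is not an option, one way to see how a file name would need to be encoded is to percent-encode it with Python's standard library. This is only an illustrative sketch (the bucket URL and file name below are made up), and avoiding special characters in file names remains the safer approach.

from urllib.parse import quote

bucket_url = "https://storage.googleapis.com/my-bucket/"  # hypothetical bucket
file_name = "field #3 <draft>.jpg"                        # contains unsupported characters

# percent-encode the file name so characters like '#', '<', and '>' cannot break the URL
safe_url = bucket_url + quote(file_name)
print(safe_url)  # https://storage.googleapis.com/my-bucket/field%20%233%20%3Cdraft%3E.jpg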

The only required argument when creating a data row is the row_data. However, Labelbox strongly recommends supplying each data row with a global key upon creation.

# this example uses the uuid package to generate unique global keys
from uuid import uuid4

data = [
  {
    "row_data": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg",
    "global_key": str(uuid4())
  },
  {
    "row_data": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg",
    "global_key": str(uuid4())
  }
]

task = dataset.upsert_data_rows(data)
task.wait_till_done()

# or alternatively use
task = dataset.create_data_rows(data)
task.wait_till_done()

You can also create data rows with metadata, attachments, and image overlays in the same task. The code below contains an end-to-end example for creating a dataset with data rows that include these elements.

import labelbox as lb
from uuid import uuid4
import datetime

# insert your API key
LB_API_KEY = "<API KEY>"
client = lb.Client(api_key=LB_API_KEY)

# get the metadata ontology
metadata_ontology = client.get_data_row_metadata_ontology()

# create the dataset
dataset = client.create_dataset(name="Bulk import example")

# build the assets
assets = [
  {"row_data": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg", "global_key": str(uuid4())},
  {"row_data": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg", "global_key": str(uuid4())},
  {"row_data": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg", "global_key": str(uuid4())}
]

# build the metadata
asset_metadata_fields = [
  {"name": "captureDateTime", "value": datetime.datetime.utcnow()},
  {"name": "tag", "value": "tag_string"},
  {"name": "split", "value": "train"}
]

# build the attachments
asset_attachments = [
  {"type": "IMAGE", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/disease_attachment.jpeg"},
  {"type": "VIDEO", "value":  "https://storage.googleapis.com/labelbox-sample-datasets/Docs/drone_video.mp4"},
  {"type": "RAW_TEXT", "value": "IOWA, Zone 2232, June 2022 [Text string]"},
  {"type": "RAW_TEXT", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/text_attachment.txt"},
  {"type": "TEXT_URL", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/text_attachment.txt"},
  {"type": "HTML", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/windy.html"},
  {"type": "IMAGE_OVERLAY", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/rgb.jpg", "name": "RGB" },
  {"type": "IMAGE_OVERLAY", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/cir.jpg", "name": "CIR"},
  {"type": "IMAGE_OVERLAY", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/weeds.jpg", "name": "Weeds"},
  {"type": "PDF_URL", "value": "https://storage.googleapis.com/labelbox-datasets/arxiv-pdf/data/99-word-token-pdfs/0801.3483.pdf"}
]

# connect the metadata and attachments to the data rows
for item in assets:
  item["metadata_fields"] = asset_metadata_fields
  item["attachments"] = asset_attachments

# create the data rows
task = dataset.upsert_data_rows(assets) # or alternatively use create_data_rows()
task.wait_till_done()
print(task.results)
print(task.errors)

📘

Limits

See this page to learn the limits for creating data rows in one bulk operation. Regardless of your tier's limit, if you have a large dataset to upload, you can split the data rows into chunks and upload them sequentially.
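
If your dataset is larger than the bulk-operation limit, a simple pattern is to slice the payload into fixed-size chunks and upload them one after another. The sketch below assumes a list named data of data row dictionaries (as built above); the chunk size is a placeholder, so set it to your tier's limit.

# a minimal sketch: upload the data rows in sequential chunks
CHUNK_SIZE = 10000  # placeholder value; use your tier's bulk-operation limit

for start in range(0, len(data), CHUNK_SIZE):
  chunk = data[start:start + CHUNK_SIZE]
  task = dataset.upsert_data_rows(chunk)
  task.wait_till_done()
  if task.errors:
    print(task.errors)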

Task results

The task result is a collection of details, typically information on the data rows associated with the task.

dataset = client.get_dataset("<dataset id>")

task = dataset.create_data_rows([{
  "row_data": item_url,
  "global_key": "<unique_global_key>"
}])
task.wait_till_done()

print(task.result) # prints the data rows associated with the create_data_rows operation

🚧

Warning

Large results (over 150,000 data rows) can take up to 10 minutes to process.

Task errors

The task errors work similarly to the task results but contain information on any associated errors.

dataset = client.get_dataset("<dataset id>")

task = dataset.create_data_rows([{
  "row_data": item_url,
  "global_key": "<unique_global_key>"
}])
task.wait_till_done()

print(task.errors) # prints any errors associated with the create_data_rows operation

Create a single data row

# simplest example
data_row = dataset.create_data_row(
  row_data="https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg",
  global_key=str(uuid4())
)
# create_data_row runs synchronously and returns the created DataRow, so there is no task to wait on

Import a local file

Upload a local file

You can upload a local file to Labelbox storage and use the returned URL to create a data row.

# note that you cannot set global keys when creating a data row from a local file
# first we must upload the file then create the data row
local_data_path = "/tmp/test_data_row.txt"
with open(local_data_path, 'w') as file:
  file.write("sample data")

# upload the file to Labelbox storage  
file_url = client.upload_file(local_data_path)

# create the data row with a global key
task = dataset.create_data_rows([{
  "row_data": item_url,
  "global_key": "<unique_global_key>"
}])
task.wait_till_done()

Create a data row from a local file

You can also create a data row from a local file directly, without first uploading it to storage. This creates the file URL and the data row in one step. However, this approach is not recommended because it slows down data row creation.

# get a local file and write some text data
local_data_path = "/tmp/test_data_row.txt"
with open(local_data_path, 'w') as file:
  file.write("sample data")

# create the data row by passing the local file path directly
# note: you cannot set a global key when creating a data row from a local file path
task = dataset.create_data_rows([local_data_path])
task.wait_till_done()

Append to an existing dataset

You can add data rows to an existing dataset using the methods described above. First, get a dataset, then create the rows.

# get existing dataset
dataset = client.get_dataset("<dataset_id>")

# use a method for data row creation as shown in sections above
task = dataset.create_data_rows([{
  "row_data": item_url,
  "global_key": "<unique_global_key>"
}])

Export data rows from a dataset

By filtering the export result, you can obtain a list of asset URLs, global keys, or data row IDs.

# Get a dataset
dataset = client.get_dataset("<dataset_id>")

# Start the export task for the dataset
export_task = dataset.export()
export_task.wait_till_done()

# Check for errors in export
if export_task.has_errors():
  export_task.get_buffered_stream(
    stream_type=lb.StreamType.ERRORS
  ).start(stream_handler=lambda error: print(error))

if export_task.has_result():
  stream = export_task.get_buffered_stream()

  # Extract the data rows urls from the export result
  asset_urls = [item.json["data_row"]["row_data"] for item in stream]

  # To export all the global keys 
  # global_keys = [item.json["data_row"]["global_key"] for item in stream]

  # To export all the data row ids
  # data_row_ids = [item.json["data_row"]["id"] for item in stream]

Export data rows with labels and predictions

For more details, see Export data rows from Catalog.

# set the export params to include/exclude certain fields
export_params={
  "attachments": True,
  "metadata_fields": True,
  "data_row_details": True,
  "project_details": True,
  "performance_details": True,
  "project_ids": ["<project_id_1>", "<project_id_2>"],
  "model_run_ids": ["<model_run_id_1>", "<model_run_id_2>"]
}

# set filters
filters={
  "last_activity_at": ["2000-01-01 00:00:00", "2050-01-01 00:00:00"],
  "label_created_at": ["2000-01-01 00:00:00", "2050-01-01 00:00:00"]
}

# get a dataset
dataset = client.get_dataset("<dataset_id>")

export_task = dataset.export(params=export_params, filters=filters)
export_task.wait_till_done()

# Conditional for errors
if export_task.has_errors():
  export_task.get_buffered_stream(
    stream_type=lb.StreamType.ERRORS
  ).start(stream_handler=lambda error: print(error))

if export_task.has_result():
  stream = export_task.get_buffered_stream()

  # iterate through data rows
  for data_row in stream:
    print(data_row.json)

Update a dataset

dataset.update(name="new_dataset_name")

Delete a dataset

❗️

Deleting a dataset cannot be undone

This method deletes the dataset along with all data rows in the dataset and any labels made on these data rows. This action cannot be reverted without the assistance of Labelbox support.

dataset.delete()

Dataset Attributes

Get the basics

# name (str)
dataset.name

# description (str)
dataset.description

# updated at (datetime)
dataset.updated_at

# created at (datetime)
dataset.created_at

# created by (relationship to User object)
user = dataset.created_by()

# organization (relationship to Organization object)
organization = dataset.organization()

Get the data rows

The data_rows() attribute is a relationship to the DataRow objects in the dataset and retrieves a paginated collection of data rows. For larger datasets, it is recommended to use exports to get data row details directly.

data_rows = dataset.data_rows()

# inspect one data row
next(data_rows)

# inspect a number of data rows
for data_row in data_rows:
  print(data_row)
  
# for ease of use, you can convert the paginated collection to a list
list(data_rows)

Get the number of data rows

The row_count is a cached attribute; thus, you must re-fetch the dataset after creating data rows to retrieve the updated value.

dataset = client.get_dataset("<dataset_id>")
dataset.row_count
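
For example, checking row_count immediately after adding data rows can still return the old value; fetching the dataset again refreshes it. A minimal sketch, reusing the placeholder item_url and global key pattern from above:

task = dataset.create_data_rows([{
  "row_data": item_url,
  "global_key": "<unique_global_key>"
}])
task.wait_till_done()

# re-fetch the dataset to refresh the cached row_count
dataset = client.get_dataset(dataset.uid)
print(dataset.row_count)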