Summary of methods for creating and modifying datasets.
The most common way to import data into Labelbox is via the Python SDK after setting up a cloud storage integration. With an IAM delegated access integration, you can keep your data in your cloud bucket and grant Labelbox limited access to the data on demand.
The examples in this developer guide primarily use image assets. For sample approaches with other data modalities, see the data import guides nested under each asset type in the Import/Export section of the table of contents.
The only required argument when creating a dataset is the name. You can also specify an IAM integration that you have set up on your account; if this argument is not provided, your default integration is used, and setting it to None disables delegated access for the dataset. For more details, including how to get your IAM integrations, visit our dedicated IAM integration page.
dataset = client.create_dataset(
    name='<dataset_name>',
    description='<dataset_description>',  # optional
    iam_integration=None  # if not specified, the default integration is used; set to None to not use delegated access
)
dataset = client.get_dataset("<dataset_id>")

# alternatively, you can get a dataset by name
dataset = client.get_datasets(where=labelbox.Dataset.name == "<dataset_name>").get_one()
A good way to check the handling of special characters is to paste the URL into your browser address bar: if the URL doesn't load properly in your browser, it won't load in Labelbox.
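For instance, if your asset URLs contain spaces or other special characters, you can percent-encode them before import. The snippet below is a minimal sketch that is not Labelbox-specific; the example URL is hypothetical and only the standard urllib library is used.

from urllib.parse import urlsplit, urlunsplit, quote

# hypothetical asset URL containing spaces and parentheses
raw_url = "https://example.com/my bucket/image (1).jpg"

# percent-encode the path and query so the URL loads the same way in a browser and in Labelbox
parts = urlsplit(raw_url)
encoded_url = urlunsplit((
    parts.scheme,
    parts.netloc,
    quote(parts.path),              # encodes spaces and other unsafe characters in the path
    quote(parts.query, safe="=&"),  # keeps query delimiters intact
    parts.fragment
))

print(encoded_url)  # https://example.com/my%20bucket/image%20%281%29.jpg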
The only required argument when creating a data row is the row_data. However, Labelbox strongly recommends supplying each data row with a global key upon creation.
# this example uses the uuid package to generate unique global keys
from uuid import uuid4

data = [
    {
        "row_data": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg",
        "global_key": str(uuid4())
    },
    {
        "row_data": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg",
        "global_key": str(uuid4())
    }
]

dataset.upsert_data_rows(data)
# or alternatively use
dataset.create_data_rows(data)
You can also create data rows with metadata, attachments, and image overlays in the same task. The code below contains an end-to-end example for creating a dataset with data rows that include these elements.
import labelbox as lb
from uuid import uuid4
import datetime

# insert your API key
LB_API_KEY = "<API KEY>"
client = lb.Client(api_key=LB_API_KEY)

# get the metadata ontology
metadata_ontology = client.get_data_row_metadata_ontology()

# create the dataset
dataset = client.create_dataset(name="Bulk import example")

# build the assets
assets = [
    {"row_data": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg", "global_key": str(uuid4())},
    {"row_data": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg", "global_key": str(uuid4())},
    {"row_data": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg", "global_key": str(uuid4())}
]

# build the metadata
asset_metadata_fields = [
    {"name": "captureDateTime", "value": datetime.datetime.utcnow()},
    {"name": "tag", "value": "tag_string"},
    {"name": "split", "value": "train"}
]

# build the attachments
asset_attachments = [
    {"type": "IMAGE", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/disease_attachment.jpeg"},
    {"type": "VIDEO", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/drone_video.mp4"},
    {"type": "RAW_TEXT", "value": "IOWA, Zone 2232, June 2022 [Text string]"},
    {"type": "RAW_TEXT", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/text_attachment.txt"},
    {"type": "TEXT_URL", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/text_attachment.txt"},
    {"type": "HTML", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/windy.html"},
    {"type": "IMAGE_OVERLAY", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/rgb.jpg", "name": "RGB"},
    {"type": "IMAGE_OVERLAY", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/cir.jpg", "name": "CIR"},
    {"type": "IMAGE_OVERLAY", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/weeds.jpg", "name": "Weeds"},
    {"type": "PDF_URL", "value": "https://storage.googleapis.com/labelbox-datasets/arxiv-pdf/data/99-word-token-pdfs/0801.3483.pdf"}
]

# connect the metadata and attachments to the data rows
for item in assets:
    item["metadata_fields"] = asset_metadata_fields
    item["attachments"] = asset_attachments

# create the data rows
task = dataset.upsert_data_rows(assets)  # or alternatively use create_data_rows()
task.wait_till_done()
print(task.results)
print(task.errors)
See this page to learn the limits for creating data rows in one bulk operation. Regardless of your tier’s limit, if you have a large dataset to upload, you can split the data rows into chunks and upload them sequentially.
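For example, a chunked upload might look like the sketch below. This is a minimal sketch, assuming assets is a list of data row dictionaries like the one built in the example above; the chunk size is a placeholder you would adjust to your tier's limit.

# upload a large list of data rows in sequential chunks
CHUNK_SIZE = 10000  # placeholder; adjust to your tier's limit

for start in range(0, len(assets), CHUNK_SIZE):
    chunk = assets[start:start + CHUNK_SIZE]
    task = dataset.create_data_rows(chunk)
    task.wait_till_done()  # wait for each chunk to finish before uploading the next
    if task.errors:
        print(task.errors)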
The task result is a collection of details, typically information on the data rows associated with the task.
dataset = client.get_dataset("<dataset id>")

task = dataset.create_data_rows([{
    "row_data": item_url,
    "global_key": "<unique_global_key>"
}])
task.wait_till_done()

print(task.result)  # prints the data rows associated with the create_data_rows operation
The task errors work similarly to the task results but contain information on any associated errors.
dataset = client.get_dataset("<dataset id>")

task = dataset.create_data_rows([{
    "row_data": item_url,
    "global_key": "<unique_global_key>"
}])
task.wait_till_done()

print(task.errors)  # prints any errors associated with the create_data_rows operation
You can upload a local file to Labelbox storage and use the returned URL to create a data row.
# note that you cannot set a global key when passing a local file path directly,
# so first upload the file to Labelbox storage, then create the data row from the returned URL
local_data_path = "/tmp/test_data_row.txt"
with open(local_data_path, 'w') as file:
    file.write("sample data")

# upload the file to Labelbox storage
file_url = client.upload_file(local_data_path)

# create the data row with a global key
task = dataset.create_data_rows([{
    "row_data": file_url,
    "global_key": "<unique_global_key>"
}])
task.wait_till_done()
You can also create a data row directly from a local file without uploading it to storage first. This creates the file URL and the data row in one step. However, this approach is not recommended because it increases the time it takes to create your data rows.
# path to a local file
local_data_path = "<local_file_path>"

# create the data row directly from the local file
task = dataset.create_data_rows([{
    "row_data": local_data_path,
    "global_key": "<unique_global_key>"
}])
task.wait_till_done()
You can add data rows to an existing dataset using the methods described above. First, get a dataset, then create the rows.
# get an existing dataset
dataset = client.get_dataset("<dataset_id>")

# use a method for data row creation as shown in the sections above
task = dataset.create_data_rows([{
    "row_data": item_url,
    "global_key": "<unique_global_key>"
}])
By filtering the export result, you can obtain a list of asset URLs, global keys, or data row IDs.
# Get a dataset
dataset = client.get_dataset("<dataset_id>")

# Start the export task for the dataset
export_task = dataset.export()
export_task.wait_till_done()

# Stream the export using a callback function
def json_stream_handler(output: labelbox.BufferedJsonConverterOutput):
    print(output.json)

export_task.get_buffered_stream(stream_type=labelbox.StreamType.RESULT).start(stream_handler=json_stream_handler)

# Collect all exported data into a list
export_json = [data_row.json for data_row in export_task.get_buffered_stream()]

# Extract the data row URLs from the export result
asset_urls = [item["data_row"]["row_data"] for item in export_json]

# (Optional) To export all the global keys
global_keys = [item["data_row"]["global_key"] for item in export_json]

# (Optional) To export all the data row IDs
data_row_ids = [item["data_row"]["id"] for item in export_json]
You can also pass export parameters and filters to include or exclude specific fields and to narrow down the exported data rows.

# set the export params to include/exclude certain fields
export_params = {
    "attachments": True,
    "metadata_fields": True,
    "data_row_details": True,
    "project_details": True,
    "performance_details": True,
    "project_ids": ["<project_id_1>", "<project_id_2>"],
    "model_run_ids": ["<model_run_id_1>", "<model_run_id_2>"]
}

# set filters
filters = {
    "last_activity_at": ["2000-01-01 00:00:00", "2050-01-01 00:00:00"],
    "label_created_at": ["2000-01-01 00:00:00", "2050-01-01 00:00:00"]
}

# get a dataset
dataset = client.get_dataset("<dataset_id>")

export_task = dataset.export(params=export_params, filters=filters)
export_task.wait_till_done()

# Stream the export using a callback function
def json_stream_handler(output: labelbox.BufferedJsonConverterOutput):
    print(output.json)

export_task.get_buffered_stream(stream_type=labelbox.StreamType.RESULT).start(stream_handler=json_stream_handler)

# Collect all exported data into a list
export_json = [data_row.json for data_row in export_task.get_buffered_stream()]
A dataset exposes the following attributes and relationships:

# name (str)
dataset.name

# description (str)
dataset.description

# updated at (datetime)
dataset.updated_at

# created at (datetime)
dataset.created_at

# created by (relationship to User object)
user = dataset.created_by()

# organization (relationship to Organization object)
organization = dataset.organization()
The data_rows() attribute is a relationship to the DataRow objects in the dataset. The relationship retrieves a paginated collection of data rows. It is recommended to use exports to get data row details, especially for larger datasets.
data_rows = dataset.data_rows()

# inspect one data row
next(data_rows)

# inspect a number of data rows
for data_row in data_rows:
    print(data_row)

# for ease of use, you can convert the paginated collection to a list
list(data_rows)