Create a dataset

Data Row vs Dataset

In Labelbox, a Data Row represents an asset and all of its relevant information. A Dataset is a collection of Data Rows.


Key definitions

Term

Definition

Data Row

Contains all of the following information for a single Asset:

  • URL to your cloud-hosted file
  • Metadata
  • Media attributes (e.g., data type, size, etc.)
  • Attachments (files that provide context for your labelers)

Dataset

A set of Data Rows from a single domain or source

Asset

A single cloud-hosted file to be labeled (e.g., an image, a video, or a text file).

Attachment

Supplementary information you can attach to an asset that provides contextual information used as an aid during labeling. Learn more about attachments and image layers.

Global key (recommended)

A customer-specified ID for each Data Row. The field is optional, but using global keys is good practice: they map your external database or file paths to your Labelbox assets for easy retrieval.

Global keys are enforced as unique at the Catalog (organization) level, which helps prevent duplicate uploads. This is the preferred ID for identifying your assets.

External ID

Optional ID that maps a Data Row in Labelbox to your external database. It is not enforced as unique. We recommend using global keys as IDs for your assets.
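One way to map external file paths to global keys is to derive each key deterministically from the path, so the same file always receives the same key. The sketch below is illustrative only: the helper name global_key_for, the choice of UUID namespace, and the example.com URLs are assumptions, not part of the Labelbox SDK.

```python
import uuid

# Hypothetical helper (not part of the Labelbox SDK): derive a stable
# global key from an external file path or URL.
def global_key_for(path: str) -> str:
    # uuid5 is deterministic: the same namespace and name always
    # produce the same UUID, so re-running the script reuses the key.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, path))

# Placeholder URLs standing in for your own cloud-hosted files.
paths = [
    "https://example.com/scans/device-1/0001.jpg",
    "https://example.com/scans/device-1/0002.jpg",
]

# Same dict shape as the data rows passed to dataset.create_data_rows.
assets = [{"row_data": p, "global_key": global_key_for(p)} for p in paths]
```

Because the key depends only on the path, uploading the same file twice would produce the same global key, which the uniqueness check can then catch.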

Best practices for creating datasets

Organization

It is best to put data from a single domain or source into a single dataset. Organizing your data this way will make it easier to set up your labeling workflows. For example, it would be easiest to organize a set of images coming from a particular type of medical device into a single dataset. You can then use Metadata to better organize and filter the Data Rows within that dataset.

Naming

Clear names that explain the source and purpose of a dataset are best. For example, medical-device-type-1 identifies the dataset as data relating to a particular version of a device. You can use the dataset description to include more context.

Option 1: Create a dataset via the Python SDK (recommended)

📘

Recommended: Python SDK and Delegated Access

The most common way to import data is via the Python SDK, after setting up an IAM Delegated Access integration (see Integrations). With IAM Delegated Access, you can keep your data in your cloud bucket and grant Labelbox limited access to the data on demand.

📘

Limit on uploading data rows in one SDK operation

To ensure performance, we recommend uploading at most 150k data rows at a time with the dataset.create_data_rows method. If you include metadata in the same call, the limit is 30k. To upload a larger dataset, split your data rows into chunks and upload them in sequence.
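Splitting data rows into chunks can be sketched as follows. This is a minimal illustration, not an SDK feature: the chunked helper and the placeholder rows are assumptions, and the upload calls are shown only as comments because they need a real dataset object.

```python
from typing import Dict, Iterator, List

def chunked(rows: List[Dict], size: int) -> Iterator[List[Dict]]:
    """Yield consecutive slices of `rows` with at most `size` items each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

# Placeholder data rows; in practice these are the dicts you pass
# to dataset.create_data_rows.
rows = [
    {"row_data": f"https://example.com/img-{i}.jpg", "global_key": f"img-{i}"}
    for i in range(7)
]

# A tiny chunk size is used here for illustration; for real uploads you
# would pick a size within the recommended limits (e.g. 30_000 with metadata).
for batch in chunked(rows, size=3):
    # With a real dataset object, upload each batch in sequence:
    #   task = dataset.create_data_rows(batch)
    #   task.wait_till_done()
    print(len(batch))  # prints 3, 3, 1
```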

The example script below imports a set of images along with:

  1. Global keys

  2. Metadata

  3. Attachments

  4. Image layers

import labelbox
from uuid import uuid4 ## to generate unique IDs
import datetime 

#Enter your API key
LB_API_KEY = "<INSERT API KEY>"
client = labelbox.Client(api_key=LB_API_KEY)
metadata_ontology = client.get_data_row_metadata_ontology()

dataset = client.create_dataset(name="Bulk import example")

assets = [{"row_data": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg", "global_key": str(uuid4())},
          {"row_data": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg", "global_key": str(uuid4())},
          {"row_data": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg", "global_key": str(uuid4())},
          {"row_data": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg", "global_key": str(uuid4())},
          {"row_data": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg", "global_key": str(uuid4())}]


asset_metadata_fields = [{"schema_id": metadata_ontology.reserved_by_name["captureDateTime"].uid, "value": datetime.datetime.utcnow()},
                  {"schema_id": metadata_ontology.reserved_by_name["tag"].uid, "value": "tag_string"},
                  {"schema_id": metadata_ontology.reserved_by_name["split"]["train"].parent, "value": metadata_ontology.reserved_by_name["split"]["train"].uid}]

asset_attachments = [{"type": "IMAGE_OVERLAY", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/rgb.jpg", "name": "RGB" },
                     {"type": "IMAGE_OVERLAY", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/cir.jpg", "name": "CIR"},
                     {"type": "IMAGE_OVERLAY", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/weeds.jpg", "name": "Weeds"},
                     {"type": "TEXT", "value": "IOWA, Zone 2232, June 2022 [Text string]"},
                     {"type": "TEXT", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/text_attachment.txt"},
                     {"type": "IMAGE", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/disease_attachment.jpeg"},
                     {"type": "VIDEO", "value":  "https://storage.googleapis.com/labelbox-sample-datasets/Docs/drone_video.mp4"},
                     {"type": "HTML", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/windy.html"}]

# Attach the same metadata and attachments to every asset
for item in assets:
    item["metadata_fields"] = asset_metadata_fields
    item["attachments"] = asset_attachments

task = dataset.create_data_rows(assets)
task.wait_till_done()
print(task.errors)
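
# Alternative: create a single data row, then add attachments one at a time
import datetime
import labelbox

# Enter your API key
API_KEY = "<INSERT API KEY>"
client = labelbox.Client(api_key=API_KEY)
metadata_ontology = client.get_data_row_metadata_ontology()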

#create a new dataset
dataset = client.create_dataset(name="Data Row attachment example")

#Create metadata fields
metadata_fields = [{"schema_id": metadata_ontology.reserved_by_name["captureDateTime"].uid, "value": datetime.datetime.utcnow()},
                  {"schema_id": metadata_ontology.reserved_by_name["tag"].uid, "value": "tag_string"},
                  {"schema_id": metadata_ontology.reserved_by_name["split"]["train"].parent, "value": metadata_ontology.reserved_by_name["split"]["train"].uid}]

#create a data row with external ID
data_row = dataset.create_data_row(row_data="https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg",
                        external_id="base_image", metadata_fields=metadata_fields)

#Create multiple attachments
data_row.create_attachment(attachment_type="IMAGE_OVERLAY", attachment_value="https://storage.googleapis.com/labelbox-sample-datasets/Docs/rgb.jpg", attachment_name="RGB")
data_row.create_attachment(attachment_type="IMAGE_OVERLAY", attachment_value="https://storage.googleapis.com/labelbox-sample-datasets/Docs/cir.jpg", attachment_name="CIR")
data_row.create_attachment(attachment_type="IMAGE_OVERLAY", attachment_value="https://storage.googleapis.com/labelbox-sample-datasets/Docs/weeds.jpg", attachment_name="Weeds")

data_row.create_attachment(attachment_type="TEXT", attachment_value="IOWA, Zone 2232, June 2022 [Text string]")
data_row.create_attachment(attachment_type="TEXT", attachment_value="https://storage.googleapis.com/labelbox-sample-datasets/Docs/text_attachment.txt")
data_row.create_attachment(attachment_type="IMAGE", attachment_value="https://storage.googleapis.com/labelbox-sample-datasets/Docs/disease_attachment.jpeg")
data_row.create_attachment(attachment_type="VIDEO", attachment_value="https://storage.googleapis.com/labelbox-sample-datasets/Docs/drone_video.mp4")
data_row.create_attachment(attachment_type="HTML", attachment_value="https://storage.googleapis.com/labelbox-sample-datasets/Docs/windy.html")

❗️

Special character handling

Certain characters, such as #, are not supported in URLs and should be avoided in your file names to prevent loading issues. A good litmus test for special character handling is to open the URL in your browser address bar; if a URL doesn't load properly in your browser, it won't load in Labelbox.
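A quick local check can flag problematic file names before you build URLs from them. The sketch below uses only the Python standard library; the helper names and the example.com URLs are assumptions for illustration, and the browser test described above remains the final check.

```python
from urllib.parse import quote, urlsplit

def needs_escaping(file_name: str) -> bool:
    """Return True if the file name contains characters that must be percent-encoded in a URL."""
    return quote(file_name, safe="") != file_name

def has_fragment(url: str) -> bool:
    """Return True if part of the URL is parsed as a fragment (everything after '#')."""
    return urlsplit(url).fragment != ""

print(needs_escaping("basic.jpg"))    # False
print(needs_escaping("image#1.jpg"))  # True

# With an unescaped '#', the rest of the file name is treated as a
# fragment and is not part of the path that gets requested.
print(has_fragment("https://example.com/data/image#1.jpg"))  # True

# Percent-encoding the file name keeps the whole name in the path.
print(quote("image#1.jpg", safe=""))  # image%231.jpg
```

Renaming files to avoid such characters is the simpler fix; percent-encoding is shown only to illustrate why the unescaped URL fails.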

See Common SDK methods for other dataset and Data Row methods.

Option 2: Upload a JSON file (with Delegated Access URLs)

This option is useful if you are unable to use the Python SDK. When you upload a JSON file of URLs, your data remains in your cloud bucket. See Integrations to learn how to set up IAM Delegated Access.

  1. Create a JSON file with entries formatted for your data type.

  2. Go to the Create a dataset page.

  3. Drag and drop your JSON file onto the page.

Give it a try using the examples below, which cover an image, a video, a text file, and tiled imagery, in that order. Copy and paste one example into a text editor and save it as a JSON file (.json extension).

Image:

[
    {
        "externalId": "basic.png",
        "imageUrl": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg",
        "attachments": [
            {
                "type": "IMAGE_OVERLAY",
                "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/rgb.jpg",
                "name": "RGB"
            },
            {
                "type": "IMAGE_OVERLAY",
                "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/cir.jpg",
                "name": "CIR"
            },
            {
                "type": "IMAGE_OVERLAY",
                "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/weeds.jpg",
                "name": "Weeds"
            },
            {
                "type": "TEXT",
                "value": "IOWA, Zone 2232, June 2022 [Text string]"
            },
            {
                "type": "TEXT",
                "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/text_attachment.txt"
            },
            {
                "type": "IMAGE",
                "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/disease_attachment.jpeg"
            },
            {
                "type": "VIDEO",
                "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/drone_video.mp4"
            },
            {
                "type": "HTML",
                "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/windy.html"
            }
        ]
    }
]

Video:

[
     {
         "externalId": "DesigningForGoogleCastVideo.mp4",
         "videoUrl": "https://commondatastorage.googleapis.com/gtv-videos-bucket/CastVideos/dash/DesigningForGoogleCastVideo.mp4",
         "attachments": [
             {
                 "type": "TEXT",
                 "value": "Some sample text"
             },
             {
                 "type": "TEXT",
                 "value": "Some more sample text"
             }
         ]
     }
]

Text:

[
     {
         "externalId": "lorem-ipsum.txt",
         "data": "https://storage.googleapis.com/labelbox-sample-datasets/nlp/lorem-ipsum.txt",
         "attachments": [
             {
                 "type": "TEXT",
                 "value": "Some sample text"
             },
             {
                 "type": "TEXT",
                 "value": "Some more sample text"
             }
         ]
     }
]

Tiled imagery:

[
    {
        "externalId": "cklidhv7o0zdk0y4z4282dp6o",
        "tileLayerUrl": "https://s3-us-east-2.amazonaws.com/lb-ron/CACI/ron_mctiles/{z}/{x}/{y}.png",
        "bounds": [
            [
                19.405662413477728,
                -99.21052827588443
            ],
            [
                19.400498983095076,
                -99.20534818927473
            ]
        ],
        "minZoom": 12,
        "maxZoom": 20,
        "epsg": "EPSG4326",
        "version": 2,
        "attachments": [
             {
                 "type": "TEXT",
                 "value": "Some sample text"
             },
             {
                 "type": "TEXT",
                 "value": "Some more sample text"
             }
         ]
     }
]
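If you have many files, a JSON file in this format can also be generated with a short script. The sketch below writes an image import file using only the standard library; the output path and the choice of the file name as externalId are assumptions for illustration.

```python
import json
import os
import tempfile

# Placeholder list; replace with URLs to your own cloud-hosted files.
image_urls = [
    "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg",
]

rows = []
for url in image_urls:
    rows.append({
        # Using the file name as the external ID is a choice, not a requirement.
        "externalId": url.rsplit("/", 1)[-1],
        "imageUrl": url,
        "attachments": [
            {"type": "TEXT", "value": "Some sample text"},
        ],
    })

# Hypothetical output location; any path with a .json extension works.
output_path = os.path.join(tempfile.gettempdir(), "labelbox_import.json")
with open(output_path, "w") as f:
    json.dump(rows, f, indent=4)
```

The resulting file can then be dragged onto the Create a dataset page as described in the steps above.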

Option 3: Upload data directly

❗️

Upload file size limitation

Each uploaded file can be up to 500MB.

Upload files directly via the UI

  1. Log into Labelbox.

  2. Go to the Catalog and select New dataset.

  3. Upload any supported data types.

Upload files directly via the SDK

import uuid

# Assumes `client` and `dataset` were created as shown in Option 1.

# Write a small sample file to a local path
local_data_path = "/tmp/test_data_row.txt"
with open(local_data_path, "w") as file:
    file.write("sample data")

# Create a data row directly from the local file path
task2 = dataset.create_data_rows([local_data_path])
task2.wait_till_done()

# Note that you cannot set external IDs when uploading directly from local files.
# To set one, first upload the file to get a URL, then create the data row from that URL.
item_url = client.upload_file(local_data_path)
task4 = dataset.create_data_rows([{
    "row_data": item_url,
    "external_id": str(uuid.uuid4())
}])
task4.wait_till_done()
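Before uploading local files, it can help to check them against the 500MB per-file limit. The sketch below is a local pre-check only: the helper name is hypothetical, and reading "500MB" as 500 × 1000 × 1000 bytes is a conservative assumption, since the exact byte definition is not stated here.

```python
import os
import tempfile

# Assumption: "500MB" is read as 500 * 1000 * 1000 bytes (the more
# conservative interpretation).
MAX_UPLOAD_BYTES = 500 * 1000 * 1000

def within_upload_limit(path: str, limit: int = MAX_UPLOAD_BYTES) -> bool:
    """Return True if the file at `path` is no larger than `limit` bytes."""
    return os.path.getsize(path) <= limit

# Create a small temporary file to demonstrate the check.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("sample data")
    sample_path = tmp.name

ok = within_upload_limit(sample_path)
print(ok)  # True: the sample file is only a few bytes

os.remove(sample_path)  # clean up the temporary file
```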

Append to an existing dataset

Adding Data Rows to an existing dataset can be accomplished by using the same methods described above.

## Get existing dataset
dataset = client.get_dataset("DATASET_ID")

## Follow data row creation as shown in earlier sections
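
When appending, it can be useful to skip rows whose global keys you have already uploaded, since global keys must be unique. The sketch below filters candidate rows against a set of known keys; the helper name is hypothetical, and how you obtain the existing keys (for example, from your own database) is left as an assumption.

```python
from typing import Dict, List, Set

def only_new_rows(rows: List[Dict], existing_keys: Set[str]) -> List[Dict]:
    """Keep rows whose global key is not already known, dropping repeats within `rows`."""
    seen = set(existing_keys)
    fresh = []
    for row in rows:
        key = row["global_key"]
        if key not in seen:
            seen.add(key)
            fresh.append(row)
    return fresh

# Placeholder: global keys you know are already in the dataset.
existing_keys = {"img-0", "img-1"}

candidates = [
    {"row_data": f"https://example.com/img-{i}.jpg", "global_key": f"img-{i}"}
    for i in range(4)
]

new_rows = only_new_rows(candidates, existing_keys)
print([r["global_key"] for r in new_rows])  # ['img-2', 'img-3']

# With a real dataset object, the filtered rows would then be appended:
#   task = dataset.create_data_rows(new_rows)
#   task.wait_till_done()
```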

Complete Python SDK tutorials

The tutorials below cover the most common CRUD methods on Data Rows and Datasets.

Python Tutorial

Github

Google Colab

Data Rows

Open In Github

Open In Colab

Datasets

Open In Github

Open In Colab

Each dataset has a unique dataset ID

You can find this ID in the Labelbox UI:

  • Go to Catalog.
  • Select your dataset.
  • The dataset ID appears in the URL.