Create a dataset

Data Row vs Dataset

In Labelbox, a Data Row represents an asset and all of its relevant information. A Dataset is a collection of Data Rows.

Key definitions

Data Row

Contains all of the following information for a single Asset:

  • URL to your cloud-hosted file
  • Metadata
  • Media attributes (e.g., data type, size)
  • Attachments (files that provide context for your labelers)

Dataset

A set of Data Rows from a single domain or source.

Asset

A single cloud-hosted file to be labeled (e.g., an image, a video, or a text file).

Attachment

Supplementary information attached to an asset to provide context that aids labelers. Learn more about attachments and image layers.

External ID

An optional ID that maps a Data Row in Labelbox to a record in your external database. Uniqueness is not enforced.

Key

Coming soon. Intended to replace External ID, with uniqueness enforced at the Catalog level.

Best practices for creating datasets

Organization

It is best to put data from a single domain or source into a single dataset. Organizing your data this way will make it easier to set up your labeling workflows. For example, it would be easiest to organize a set of images coming from a particular type of medical device into a single dataset. You can then use Metadata to better organize and filter the Data Rows within that dataset.

Naming

Clear names that explain the source and purpose of a dataset are best. For example, medical-device-type-1 identifies the dataset as data from a particular version of a device. You can use the dataset description to add more context.
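As a loose illustration of this naming convention, a small helper can turn a human-readable source description into a clean, hyphenated dataset name. The helper below is hypothetical, not part of the Labelbox SDK:

```python
import re

def to_dataset_name(description: str) -> str:
    """Lowercase the description and collapse runs of
    non-alphanumeric characters into single hyphens (hypothetical helper)."""
    return re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")

print(to_dataset_name("Medical Device, Type 1"))  # medical-device-type-1
```

The resulting string can then be passed as the name when creating a dataset.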

Option 1: Create a dataset via the Python SDK (recommended)

📘

Recommended: Python SDK and Delegated Access

The most common way of importing data is via the Python SDK after setting up an IAM Delegated Access integration (see Integrations). With IAM Delegated Access, your data stays in your cloud bucket and Labelbox is granted limited access to it on demand.

The example script below imports a set of images along with:

  1. External ID

  2. Metadata

  3. Attachments

  4. Image layers

import labelbox
from uuid import uuid4 ## to generate unique IDs
import datetime 

# Enter your API key (never hard-code or commit a real key)
LB_API_KEY = ""
client = labelbox.Client(api_key=LB_API_KEY)
metadata_ontology = client.get_data_row_metadata_ontology()

dataset = client.create_dataset(name="Bulk import example")

assets = [{"row_data": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg", "external_id": str(uuid4())},
          {"row_data": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg", "external_id": str(uuid4())},
          {"row_data": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg", "external_id": str(uuid4())},
          {"row_data": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg", "external_id": str(uuid4())},
          {"row_data": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg", "external_id": str(uuid4())}]


asset_metadata_fields = [{"schema_id": metadata_ontology.reserved_by_name["captureDateTime"].uid, "value": datetime.datetime.utcnow()},
                  {"schema_id": metadata_ontology.reserved_by_name["tag"].uid, "value": "tag_string"},
                  {"schema_id": metadata_ontology.reserved_by_name["split"]["train"].parent, "value": metadata_ontology.reserved_by_name["split"]["train"].uid}]

asset_attachments = [{"type": "IMAGE_OVERLAY", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/rgb.jpg", "name": "RGB" },
                     {"type": "IMAGE_OVERLAY", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/cir.jpg", "name": "CIR"},
                     {"type": "IMAGE_OVERLAY", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/weeds.jpg", "name": "Weeds"},
                     {"type": "TEXT", "value": "IOWA, Zone 2232, June 2022 [Text string]"},
                     {"type": "TEXT", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/text_attachment.txt"},
                     {"type": "IMAGE", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/disease_attachment.jpeg"},
                     {"type": "VIDEO", "value":  "https://storage.googleapis.com/labelbox-sample-datasets/Docs/drone_video.mp4"},
                     {"type": "HTML", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/windy.html"}]

for item in assets:
  item["metadata_fields"] = asset_metadata_fields
  item["attachments"] = asset_attachments

task = dataset.create_data_rows(assets)
task.wait_till_done()
You can also create a single Data Row and add attachments to it individually:

import labelbox
import datetime

# Enter your API key
API_KEY = ""
client = labelbox.Client(api_key=API_KEY)
metadata_ontology = client.get_data_row_metadata_ontology()

# Create a new dataset
dataset = client.create_dataset(name="Data Row attachment example")

#Create metadata fields
metadata_fields = [{"schema_id": metadata_ontology.reserved_by_name["captureDateTime"].uid, "value": datetime.datetime.utcnow()},
                  {"schema_id": metadata_ontology.reserved_by_name["tag"].uid, "value": "tag_string"},
                  {"schema_id": metadata_ontology.reserved_by_name["split"]["train"].parent, "value": metadata_ontology.reserved_by_name["split"]["train"].uid}]

#create a data row with external ID
data_row = dataset.create_data_row(row_data="https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg",
                        external_id="base_image", metadata_fields=metadata_fields)

#Create multiple attachments
data_row.create_attachment(attachment_type="IMAGE_OVERLAY", attachment_value="https://storage.googleapis.com/labelbox-sample-datasets/Docs/rgb.jpg", attachment_name="RGB")
data_row.create_attachment(attachment_type="IMAGE_OVERLAY", attachment_value="https://storage.googleapis.com/labelbox-sample-datasets/Docs/cir.jpg", attachment_name="CIR")
data_row.create_attachment(attachment_type="IMAGE_OVERLAY", attachment_value="https://storage.googleapis.com/labelbox-sample-datasets/Docs/weeds.jpg", attachment_name="Weeds")

data_row.create_attachment(attachment_type="TEXT", attachment_value="IOWA, Zone 2232, June 2022 [Text string]")
data_row.create_attachment(attachment_type="TEXT", attachment_value="https://storage.googleapis.com/labelbox-sample-datasets/Docs/text_attachment.txt")
data_row.create_attachment(attachment_type="IMAGE", attachment_value="https://storage.googleapis.com/labelbox-sample-datasets/Docs/disease_attachment.jpeg")
data_row.create_attachment(attachment_type="VIDEO", attachment_value="https://storage.googleapis.com/labelbox-sample-datasets/Docs/drone_video.mp4")
data_row.create_attachment(attachment_type="HTML", attachment_value="https://storage.googleapis.com/labelbox-sample-datasets/Docs/windy.html")

❗️

Special character handling

Certain characters, such as #, are not supported in URLs and should be avoided in your file names to prevent loading issues. A good litmus test: paste the URL into your browser address bar; if it doesn't load there, it won't load in Labelbox.
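If renaming files is not an option, one approach (a sketch, not a Labelbox requirement) is to percent-encode the path portion of the URL before importing it:

```python
from urllib.parse import quote

# Percent-encode unsafe characters (like '#' and spaces) in the object path,
# keeping '/' so the URL structure is preserved.
path = "Docs/field #3 photo.jpg"
encoded = quote(path, safe="/")
print(encoded)  # Docs/field%20%233%20photo.jpg
```

The encoded path can then be joined back onto the bucket's base URL.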

See Common SDK methods for other dataset and Data Row methods.

Option 2: Upload a JSON file (with Delegated Access URLs)

This option is useful if you are unable to use Python SDK. By uploading a JSON file, your data is able to remain in your cloud bucket. See Integrations to learn how to set up IAM Delegated Access.

  1. Create a JSON file containing data formatted according to its data type.

  2. Go to the Create a dataset page.

  3. Drag and drop your JSON file onto the page.

Give it a try using the examples below (images, video, text, and tiled imagery). Copy and paste the content into a text editor and save it with a .json extension.

Images:

[
    {
        "externalId": "basic.png",
        "imageUrl": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg",
        "attachments": [
            {
                "type": "IMAGE_OVERLAY",
                "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/rgb.jpg",
                "name": "RGB"
            },
            {
                "type": "IMAGE_OVERLAY",
                "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/cir.jpg",
                "name": "CIR"
            },
            {
                "type": "IMAGE_OVERLAY",
                "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/weeds.jpg",
                "name": "Weeds"
            },
            {
                "type": "TEXT",
                "value": "IOWA, Zone 2232, June 2022 [Text string]"
            },
            {
                "type": "TEXT",
                "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/text_attachment.txt"
            },
            {
                "type": "IMAGE",
                "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/disease_attachment.jpeg"
            },
            {
                "type": "VIDEO",
                "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/drone_video.mp4"
            },
            {
                "type": "HTML",
                "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/windy.html"
            }
        ]
    }
]
Video:

[
     {
         "externalId": "DesigningForGoogleCastVideo.mp4",
         "videoUrl": "https://commondatastorage.googleapis.com/gtv-videos-bucket/CastVideos/dash/DesigningForGoogleCastVideo.mp4",
         "attachments": [
             {
                 "type": "TEXT",
                 "value": "Some sample text"
             },
             {
                 "type": "TEXT",
                 "value": "Some more sample text"
             }
         ]
     }
]
Text:

[
     {
         "externalId": "lorem-ipsum.txt",
         "data": "https://storage.googleapis.com/labelbox-sample-datasets/nlp/lorem-ipsum.txt",
         "attachments": [
             {
                 "type": "TEXT",
                 "value": "Some sample text"
             },
             {
                 "type": "TEXT",
                 "value": "Some more sample text"
             }
         ]
     }
]
Tiled imagery:

[
    {
        "externalId": "cklidhv7o0zdk0y4z4282dp6o",
        "tileLayerUrl": "https://s3-us-east-2.amazonaws.com/lb-ron/CACI/ron_mctiles/{z}/{x}/{y}.png",
        "bounds": [
            [
                19.405662413477728,
                -99.21052827588443
            ],
            [
                19.400498983095076,
                -99.20534818927473
            ]
        ],
        "minZoom": 12,
        "maxZoom": 20,
        "epsg": "EPSG4326",
        "version": 2,
        "attachments": [
             {
                 "type": "TEXT",
                 "value": "Some sample text"
             },
             {
                 "type": "TEXT",
                 "value": "Some more sample text"
             }
         ]
     }
]
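If you would rather generate the import file programmatically, here is a minimal sketch that writes a JSON file in the image format above and re-reads it to confirm it is valid JSON before uploading. The URLs and file name are placeholders:

```python
import json

# Placeholder rows; substitute URLs to your cloud-hosted files
rows = [
    {
        "externalId": f"image-{i}.jpg",
        "imageUrl": f"https://example.com/images/image-{i}.jpg",
    }
    for i in range(3)
]

with open("import.json", "w") as f:
    json.dump(rows, f, indent=4)

# Re-read to confirm the file parses cleanly before uploading it
with open("import.json") as f:
    loaded = json.load(f)
print(len(loaded))  # 3
```

The resulting import.json can be dragged onto the Create a dataset page as described above.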

Option 3: Upload data directly

❗️

Upload file size limitation

Each file you upload can be up to 500 MB.

Upload files directly via the UI

  1. Log into Labelbox.

  2. Go to the Catalog and select New dataset.

  3. Upload any supported data types.

Upload files directly via the SDK

import uuid
from labelbox import DataRow

# Local paths
local_data_path = '/tmp/test_data_row.txt'
with open(local_data_path, 'w') as file:
    file.write("sample data")

task2 = dataset.create_data_rows([local_data_path])
task2.wait_till_done()

# External IDs cannot be set when uploading directly from local files.
# To set one, first upload the file, then create the Data Row from the returned URL.
item_url = client.upload_file(local_data_path)
task4 = dataset.create_data_rows([{
    DataRow.row_data: item_url,
    DataRow.external_id: str(uuid.uuid4())
}])
task4.wait_till_done()
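Before uploading local files, a quick pre-flight check can skip anything over the 500 MB direct-upload limit noted above. This is a sketch, not part of the Labelbox SDK:

```python
import os

MAX_BYTES = 500 * 1024 * 1024  # 500 MB direct-upload limit

def within_upload_limit(path: str) -> bool:
    """Return True if the file at `path` is small enough to upload directly."""
    return os.path.getsize(path) <= MAX_BYTES

# Example with a tiny temporary file
with open("/tmp/sample.txt", "w") as f:
    f.write("sample data")
print(within_upload_limit("/tmp/sample.txt"))  # True
```

Larger files should stay in your cloud bucket and be imported by URL instead.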

Append to an existing dataset

You can add Data Rows to an existing dataset using the same methods described above.

## Get existing dataset
dataset = client.get_dataset("DATASET_ID")

## Follow data row creation as shown in earlier sections

Complete Python SDK tutorials

The tutorials below cover the most common CRUD methods on Data Rows and Datasets.

  • Data Rows: Open in Github | Open in Colab
  • Datasets: Open in Github | Open in Colab

