# Dataset

> Summary of methods for creating and modifying datasets.

The most common way to import data to Labelbox is through the Python SDK after setting up a [cloud storage integration](/docs/iam-delegated-access). With an IAM delegated access integration, you can keep your data in your cloud bucket and grant Labelbox limited access to the data on demand.

<Info>
  ### Examples for all data types

  The examples in this developer guide primarily use image assets. For sample approaches with other data modalities, please view the developer guide for importing data nested under each asset type in the **Import/Export** section of the table of contents.
</Info>

## Create a dataset

The only required argument when creating a dataset is `name`. You can also specify an IAM integration that you have set up on your account. If `iam_integration` is not specified, the dataset uses your default integration; set it to `None` to not use delegated access. For more details, including how to get your IAM integrations, visit our dedicated [IAM integration](/reference/cloud-storage-iam-integration) page.

<CodeGroup>
  ```python Python theme={null}
  dataset = client.create_dataset(
    name='<dataset_name>',
    description='<dataset_description>',  # optional
    iam_integration=None  # if not specified, the default integration is used; set to None to not use delegated access
  )
  ```
</CodeGroup>

## Get a dataset

<CodeGroup>
  ```python Python theme={null}
  dataset = client.get_dataset("<dataset_id>")

  # alternatively, you can get a dataset by name
  dataset = client.get_datasets(where=labelbox.Dataset.name == "<dataset_name>").get_one()
  ```
</CodeGroup>

# Dataset methods

## Create data rows

<Warning>
  ### Special character handling

  Please note that certain characters, such as `#`, `<`, `>`, and `|`, are not supported in URLs and should be avoided in your file names to prevent loading issues.

  Please refer to [https://datatracker.ietf.org/doc/html/rfc2396#section-2.4.3](https://datatracker.ietf.org/doc/html/rfc2396#section-2.4.3) on URI standards.

  A good test for the handling of special characters is to paste the URL into your browser address bar. If the URL doesn't load properly in your browser, it won't load in Labelbox.
</Warning>
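
As a rough pre-upload check, the sketch below flags file names that contain any of the characters listed in the warning above. The helper name `find_unsupported_chars` is hypothetical and is not part of the Labelbox SDK; it relies only on the Python standard library.

```python
# Characters called out in the warning above as unsupported in URLs.
UNSUPPORTED_CHARS = {"#", "<", ">", "|"}


def find_unsupported_chars(file_name):
    """Return the unsupported characters found in file_name, sorted."""
    return sorted(set(file_name) & UNSUPPORTED_CHARS)


# Hypothetical file names used only for illustration.
print(find_unsupported_chars("scan#12|final.jpg"))
print(find_unsupported_chars("basic.jpg"))
```

An empty list means none of the listed characters were found; it does not guarantee that the resulting URL is otherwise valid.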

The only required argument when creating a data row is the `row_data`. However, Labelbox strongly recommends supplying each data row with a [global key](/reference/data-row-global-keys) upon creation.

<CodeGroup>
  ```python Python theme={null}
  # this example uses the uuid package to generate unique global keys
  from uuid import uuid4

  data = [
    {
      "row_data": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg",
      "global_key": str(uuid4())
    },
    {
      "row_data": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg",
      "global_key": str(uuid4())
    }
  ]

  dataset.upsert_data_rows(data)

  # or, alternatively, use create_data_rows (run one or the other, not both)

  dataset.create_data_rows(data)
  ```
</CodeGroup>

You can also create data rows with [metadata](/docs/datarow-metadata), [attachments](/docs/label-data#attachments), and [image overlays](/docs/label-data#image-overlay) in the same task. The code below contains an end-to-end example for creating a dataset with data rows that include these elements.

<CodeGroup>
  ```python Python theme={null}
  import labelbox as lb
  from uuid import uuid4
  import datetime

  # insert your API key
  LB_API_KEY = "<API KEY>"
  client = lb.Client(api_key=LB_API_KEY)

  # get the metadata ontology
  metadata_ontology = client.get_data_row_metadata_ontology()

  # create the dataset
  dataset = client.create_dataset(name="Bulk import example")

  # build the assets
  assets = [
    {"row_data": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg", "global_key": str(uuid4())},
    {"row_data": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg", "global_key": str(uuid4())},
    {"row_data": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg", "global_key": str(uuid4())}
  ]

  # build the metadata
  asset_metadata_fields = [
    {"name": "captureDateTime", "value": datetime.datetime.utcnow()},
    {"name": "tag", "value": "tag_string"},
    {"name": "split", "value": "train"}
  ]

  # build the attachments
  asset_attachments = [
    {"type": "IMAGE", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/disease_attachment.jpeg"},
    {"type": "VIDEO", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/drone_video.mp4"},
    {"type": "RAW_TEXT", "value": "IOWA, Zone 2232, June 2022 [Text string]"},
    {"type": "RAW_TEXT", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/text_attachment.txt"},
    {"type": "TEXT_URL", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/text_attachment.txt"},
    {"type": "HTML", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/windy.html"},
    {"type": "IMAGE_OVERLAY", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/rgb.jpg", "name": "RGB" },
    {"type": "IMAGE_OVERLAY", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/cir.jpg", "name": "CIR"},
    {"type": "IMAGE_OVERLAY", "value": "https://storage.googleapis.com/labelbox-sample-datasets/Docs/weeds.jpg", "name": "Weeds"},
    {"type": "PDF_URL", "value": "https://storage.googleapis.com/labelbox-datasets/arxiv-pdf/data/99-word-token-pdfs/0801.3483.pdf"}
  ]

  # connect the metadata and attachments to the data rows
  for item in assets:
    item["metadata_fields"] = asset_metadata_fields
    item["attachments"] = asset_attachments

  # create the data rows
  task = dataset.upsert_data_rows(assets)  # or alternatively use create_data_rows()
  task.wait_till_done()
  print(task.result)
  print(task.errors)
  ```
</CodeGroup>

<Info>
  ### Limits

  See [this page](/docs/limits) to learn the limits for creating data rows in one bulk operation. Regardless of your tier's limit, if you have a large dataset to upload, you can split the data rows into chunks and upload them sequentially.
</Info>
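
The chunked upload described above can be sketched as follows. This is a minimal illustration, not an official helper: `chunked` and `upload_in_chunks` are hypothetical names, and the default `chunk_size` of 1000 is an arbitrary value rather than a Labelbox limit. It assumes `dataset` is a dataset object whose `create_data_rows` call returns a task with `wait_till_done()`, as in the examples on this page.

```python
def chunked(rows, chunk_size):
    """Yield successive slices of rows, each at most chunk_size long."""
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]


def upload_in_chunks(dataset, rows, chunk_size=1000):
    """Upload rows one chunk at a time, waiting for each task to finish.

    chunk_size=1000 is an illustrative default, not a Labelbox limit.
    """
    tasks = []
    for chunk in chunked(rows, chunk_size):
        task = dataset.create_data_rows(chunk)
        task.wait_till_done()  # finish this chunk before starting the next
        tasks.append(task)
    return tasks


# Example usage (requires a real dataset and a list of data row dicts):
# tasks = upload_in_chunks(dataset, assets, chunk_size=500)
```

Returning the tasks lets you inspect `task.errors` for each chunk afterwards.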

### Task results

The task result is a collection of details, typically information on the data rows associated with the task.

<CodeGroup>
  ```python Python theme={null}
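  dataset = client.get_dataset("<dataset id>")

  # example asset URL; replace with your own
  item_url = "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg"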

  task = dataset.create_data_rows([{
    "row_data": item_url,
    "global_key": "<unique_global_key>"
  }])
  task.wait_till_done()

  print(task.result) # prints the data rows associated with the create_data_rows operation
  ```
</CodeGroup>

<Warning>
  ### Warning

  Large results (over 150,000 data rows) can take up to 10 minutes to process.
</Warning>

### Task errors

The task errors work similarly to the task results but contain information on any associated errors.

<CodeGroup>
  ```python errors theme={null}
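  dataset = client.get_dataset("<dataset id>")

  # example asset URL; replace with your own
  item_url = "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg"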

  task = dataset.create_data_rows([{
    "row_data": item_url,
    "global_key": "<unique_global_key>"
  }])
  task.wait_till_done()

  print(task.errors)  # prints any errors associated with the create_data_rows operation
  ```
</CodeGroup>
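
To check results and errors together once a task has finished, a small helper along these lines may be useful. This is a sketch: `summarize_task` is a hypothetical name, and it assumes that `task.result` and `task.errors` are each either `None` or a list, which is an assumption rather than a documented guarantee.

```python
def summarize_task(task):
    """Return (result_count, error_count) for a finished task.

    Assumes task.result and task.errors are each None or a list
    (an assumption of this sketch).
    """
    results = task.result or []
    errors = task.errors or []
    return len(results), len(errors)


# Example usage, after task.wait_till_done():
# created, failed = summarize_task(task)
# if failed:
#     print(task.errors)
```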

## Create a singular data row

<CodeGroup>
  ```python Python theme={null}
  from uuid import uuid4

  # simplest example; create_data_row returns the created DataRow, not a task
  data_row = dataset.create_data_row(
    row_data="https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg",
    global_key=str(uuid4())
  )
  print(data_row)
  ```
</CodeGroup>

## Importing a local file

### Upload a local file

You can upload a local file to Labelbox storage and use the returned URL to create a data row.

<CodeGroup>
  ```python Upload local file first theme={null}
  # note that you cannot set global keys when creating a data row from a local file
  # first we must upload the file then create the data row
  local_data_path = "/tmp/test_data_row.txt"
  with open(local_data_path, 'w') as file:
    file.write("sample data")

  # upload the file to Labelbox storage
  file_url = client.upload_file(local_data_path)

  # create the data row with a global key
  task = dataset.create_data_rows([{
    "row_data": file_url,
    "global_key": "<unique_global_key>"
  }])
  task.wait_till_done()
  ```
</CodeGroup>

### Create data row from a local file

You can also create a data row directly from a local file without uploading it to storage first. This creates the file URL and the data row in one step. However, this approach is not recommended because it increases the time data row creation takes.

<CodeGroup>
  ```python Upload local file directly theme={null}
  # path to an existing local file
  local_data_path = "<local_file_path>"
  # create the data row
  task = dataset.create_data_rows([{
    "row_data": local_data_path,
    "global_key": "<unique_global_key>"
  }])
  task.wait_till_done()
  ```
</CodeGroup>

## Append to an existing dataset

You can add data rows to an existing dataset using the methods described above. First, get a dataset, then create the rows.

<CodeGroup>
  ```python Python theme={null}
  # get existing dataset
  dataset = client.get_dataset("<dataset_id>")

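  # example asset URL; replace with your own
  item_url = "https://storage.googleapis.com/labelbox-sample-datasets/Docs/basic.jpg"

  # use a method for data row creation as shown in sections above
  task = dataset.create_data_rows([{
    "row_data": item_url,
    "global_key": "<unique_global_key>"
  }])
  task.wait_till_done()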
  ```
</CodeGroup>

## Export data rows from a dataset

After exporting, you can filter the result to obtain a list of asset URLs, global keys, or data row IDs.

<CodeGroup>
  ```python Export theme={null}
  import labelbox

  # Get a dataset
  dataset = client.get_dataset("<dataset_id>")

  # Start the export task for the dataset
  export_task = dataset.export()
  export_task.wait_till_done()

  # Stream the export using a callback function
  def json_stream_handler(output: labelbox.BufferedJsonConverterOutput):
    print(output.json)

  export_task.get_buffered_stream(stream_type=labelbox.StreamType.RESULT).start(stream_handler=json_stream_handler)

  # Collect all exported data into a list
  export_json = [data_row.json for data_row in export_task.get_buffered_stream()]

  # Extract the data row URLs from the export result
  asset_urls = [item["data_row"]["row_data"] for item in export_json]

  # (Optional) To export all the global keys
  global_keys = [item["data_row"]["global_key"] for item in export_json]

  # (Optional) To export all the data row ids
  data_row_ids = [item["data_row"]["id"] for item in export_json]
  ```
</CodeGroup>
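
Building on the `export_json` list collected above, the sketch below maps each global key to its data row ID. `index_by_global_key` is a hypothetical helper, and the sample rows are placeholders shaped like the fields accessed in the example above (`data_row.id`, `data_row.global_key`, `data_row.row_data`), not real export output.

```python
def index_by_global_key(export_json):
    """Map each global key to its data row ID, skipping rows without a global key."""
    index = {}
    for row in export_json:
        data_row = row["data_row"]
        global_key = data_row.get("global_key")
        if global_key is not None:
            index[global_key] = data_row["id"]
    return index


# Placeholder rows for illustration only; not real export output.
sample_export = [
    {"data_row": {"id": "<data_row_id_1>", "global_key": "<global_key_1>", "row_data": "<asset_url_1>"}},
    {"data_row": {"id": "<data_row_id_2>", "global_key": None, "row_data": "<asset_url_2>"}},
]
print(index_by_global_key(sample_export))
```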

## Export data rows with labels and predictions

For more details, see [Export data rows from Catalog](/reference/export-overview#export-data-rows-from-catalog).

<CodeGroup>
  ```python Export theme={null}
  import labelbox

  # set the export params to include/exclude certain fields
  export_params = {
    "attachments": True,
    "metadata_fields": True,
    "data_row_details": True,
    "project_details": True,
    "performance_details": True,
    "project_ids": ["<project_id_1>", "<project_id_2>"],
    "model_run_ids": ["<model_run_id_1>", "<model_run_id_2>"]
  }

  # set filters
  filters = {
    "last_activity_at": ["2000-01-01 00:00:00", "2050-01-01 00:00:00"],
    "label_created_at": ["2000-01-01 00:00:00", "2050-01-01 00:00:00"]
  }

  # get a dataset
  dataset = client.get_dataset("<dataset_id>")

  export_task = dataset.export(params=export_params, filters=filters)
  export_task.wait_till_done()

  # Stream the export using a callback function
  def json_stream_handler(output: labelbox.BufferedJsonConverterOutput):
    print(output.json)

  export_task.get_buffered_stream(stream_type=labelbox.StreamType.RESULT).start(stream_handler=json_stream_handler)

  # Collect all exported data into a list
  export_json = [data_row.json for data_row in export_task.get_buffered_stream()]
  ```
</CodeGroup>
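
If your date bounds start out as `datetime` objects, a helper like the one below can produce the two-item string lists used in the `filters` dictionary above. This is a sketch: `to_filter_range` is a hypothetical name, and the `YYYY-MM-DD HH:MM:SS` format is inferred from the example values on this page rather than from a format specification.

```python
from datetime import datetime


def to_filter_range(start, end):
    """Format a (start, end) datetime pair as a two-item list of strings."""
    if start > end:
        raise ValueError("start must not be after end")
    fmt = "%Y-%m-%d %H:%M:%S"  # format inferred from the examples on this page
    return [start.strftime(fmt), end.strftime(fmt)]


filters = {
    "last_activity_at": to_filter_range(datetime(2000, 1, 1), datetime(2050, 1, 1)),
}
print(filters)
```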

## Update a dataset

<CodeGroup>
  ```python Python theme={null}
  dataset.update(name="new_dataset_name")
  ```
</CodeGroup>

## Delete a dataset

<Danger>
  **Deleting a dataset cannot be undone**

  This method deletes the dataset along with all data rows in the dataset and any labels made on these data rows. This action cannot be reverted.
</Danger>

<CodeGroup>
  ```python Python theme={null}
  dataset.delete()
  ```
</CodeGroup>

# Dataset attributes

## Get the basics

<CodeGroup>
  ```python Python theme={null}
  # name (str)
  dataset.name

  # description (str)
  dataset.description

  # updated at (datetime)
  dataset.updated_at

  # created at (datetime)
  dataset.created_at

  # created by (relationship to User object)
  user = dataset.created_by()

  # organization (relationship to Organization object)
  organization = dataset.organization()
  ```
</CodeGroup>

## Get the data rows

`data_rows()` is a relationship to the `DataRow` objects in the dataset. It retrieves a paginated collection of data rows. To get data row details directly, especially for larger datasets, Labelbox recommends using exports.

<CodeGroup>
  ```python Python theme={null}
  data_rows = dataset.data_rows()

  # inspect one data row
  next(data_rows)

  # inspect a number of data rows
  for data_row in data_rows:
    print(data_row)

  # for ease of use, you can convert the paginated collection to a list
  list(data_rows)
  ```
</CodeGroup>
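
Converting the whole collection to a list walks through every data row, so for a quick look at a large dataset you may prefer to take only the first few items. The sketch below uses `itertools.islice` from the standard library and assumes the paginated collection behaves like a regular Python iterator, as the `next(data_rows)` call above suggests. `take` is a hypothetical helper, and the generator is a stand-in for `dataset.data_rows()`.

```python
from itertools import islice


def take(iterable, n):
    """Return up to the first n items of any iterable as a list."""
    return list(islice(iterable, n))


# Stand-in for dataset.data_rows(); a real call yields DataRow objects.
fake_rows = ("data_row_%d" % i for i in range(10))
print(take(fake_rows, 3))

# Example usage with a real dataset:
# first_rows = take(dataset.data_rows(), 5)
```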

## Get the number of data rows

The `row_count` is a cached attribute; thus, you must re-fetch the dataset after creating data rows to retrieve the updated value.

<CodeGroup>
  ```python Python theme={null}
  dataset = client.get_dataset("<dataset_id>")
  dataset.row_count
  ```
</CodeGroup>
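
Because the value is cached on the dataset object, one way to read an up-to-date count after creating data rows is to wrap the re-fetch in a small helper. This is a sketch under that assumption; `current_row_count` is a hypothetical name, and `client` is expected to be a Labelbox client as created earlier on this page.

```python
def current_row_count(client, dataset_id):
    """Re-fetch the dataset and return its row_count."""
    return client.get_dataset(dataset_id).row_count


# Example usage, after a data row creation task has finished:
# print(current_row_count(client, "<dataset_id>"))
```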
