Databricks

Learn how to sync your data from Databricks to Labelbox.

You can sync data from Databricks to Labelbox using the built-in integration from Census, a tool that connects data warehouses to operational platforms. Setting up this integration consists of four main steps across Databricks and Labelbox:

  1. Databricks: Structure your data to follow the data requirements.
  2. Databricks: Configure required credentials.
  3. Labelbox: Connect Databricks to Labelbox.
  4. Labelbox: Sync your data from Databricks to Labelbox.

📘

Paid feature

The Census integration is available exclusively to paid customers. For upgrade options, see Plans & pricing.

Structure your data

Each entry in your data must include the following fields for a successful sync:

  • A global key that uniquely identifies each data row. Global keys, also called sync keys, help Labelbox detect new, changed, or duplicate records. They must be unique within your Labelbox workspace.
  • A dataset ID that identifies the target dataset. You can include the dataset ID directly in your source data or specify it as a constant value during the sync setup. To find a dataset ID, open the dataset in Catalog, then use the dataset menu to copy the ID to your clipboard.
  • A row data value that specifies the URL or file path of the data asset.
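
The required fields can be checked with a short script before you run a sync. The following is a minimal sketch and not part of the integration: the column names `global_key`, `dataset_id`, and `row_data` and the function `find_problems` are hypothetical, so adjust them to match your own table. It only detects duplicate global keys within the rows you pass in, not across your whole Labelbox workspace.

```python
from collections import Counter

# Hypothetical column names; rename them to match your source table.
REQUIRED_FIELDS = ("global_key", "dataset_id", "row_data")


def find_problems(rows):
    """Return a list of human-readable problems found in a batch of rows."""
    problems = []

    # Every row needs a non-empty value for each required field.
    for index, row in enumerate(rows):
        for field in REQUIRED_FIELDS:
            if not str(row.get(field) or "").strip():
                problems.append(f"row {index}: missing {field}")

    # Global keys must not repeat within the batch.
    counts = Counter(row["global_key"] for row in rows if row.get("global_key"))
    for key, count in counts.items():
        if count > 1:
            problems.append(f"global key {key!r} appears {count} times")

    return problems
```

An empty list means the batch passed both checks. Any other result lists the rows to fix before syncing.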

You can also include the following optional fields:

  • A Metadata JSON array, with one name/value object per metadata field, that specifies the custom metadata of a data row. Example: [{ "name": "dog", "value": 123 }, { "name": "fox", "value": 123 }]
  • An Attachments JSON array, with one type/value object per attachment, that specifies the attachments associated with the data row. Example:
    [
      {
        "type": "RAW_TEXT",
        "value": "IOWA, Zone 2232, June 2022"
      },
      {
        "type": "IMAGE",
        "value": "https://storage.example.com/samples/attachment.jpeg"
      }
    ]

See this Google Sheet for an example of structured data that includes all required and optional fields.
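
If you generate the optional fields programmatically, serializing them with a JSON library avoids quoting mistakes. The sketch below is illustrative only: the function name `build_optional_fields` and the output keys `metadata` and `attachments` are hypothetical, and it assumes the optional columns hold JSON text shaped like the examples above.

```python
import json


def build_optional_fields(metadata, attachments):
    """Serialize metadata and attachments into JSON text for the optional columns."""
    # Each metadata entry follows the {"name": ..., "value": ...} shape shown above.
    for item in metadata:
        if not {"name", "value"} <= item.keys():
            raise ValueError(f"metadata entry needs 'name' and 'value': {item}")

    # Each attachment follows the {"type": ..., "value": ...} shape shown above.
    for item in attachments:
        if not {"type", "value"} <= item.keys():
            raise ValueError(f"attachment entry needs 'type' and 'value': {item}")

    return {
        "metadata": json.dumps(metadata),
        "attachments": json.dumps(attachments),
    }


fields = build_optional_fields(
    metadata=[{"name": "dog", "value": 123}],
    attachments=[{"type": "RAW_TEXT", "value": "IOWA, Zone 2232, June 2022"}],
)
```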

Configure credentials

Before setting up an integration on Labelbox, collect all required credentials from Databricks.

Access token

Collect an access credential for one of the following authentication types supported by Databricks:

  • Personal access tokens: Authenticate access to resources and APIs at the Databricks workspace level.
  • Service principals: Grant automated tools and scripts API-only access to Databricks resources, providing greater security than personal access tokens.

To learn how to create these credentials, see the Databricks documentation on personal access tokens and service principals.

Connection details

Collect the following credentials for establishing a connection:

  • Hostname
  • Port
  • HTTP Path

To view and collect these credentials on Databricks:

  • For SQL warehouses, navigate to the Connection Details tab.
  • For all-purpose clusters, navigate to the Configuration tab and select Advanced Options > JDBC/ODBC.
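
It can help to keep the three connection values together and check them for common copy-and-paste mistakes before entering them in Labelbox. The sketch below is one way to do that. The class `ConnectionDetails`, its validation rules, the default port of 443, and the sample hostname and HTTP path are all assumptions for illustration, not values taken from your workspace.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectionDetails:
    """Hypothetical holder for the values copied from Databricks."""

    hostname: str
    http_path: str
    port: int = 443  # assumption: the usual HTTPS port; use the value Databricks shows

    def __post_init__(self):
        # The hostname field should contain only the host, not a full URL.
        if "://" in self.hostname or "/" in self.hostname:
            raise ValueError("hostname must not include a scheme or a path")
        if not 1 <= self.port <= 65535:
            raise ValueError("port must be between 1 and 65535")
        if not self.http_path.strip():
            raise ValueError("http_path must not be empty")


# Placeholder values only; copy the real ones from your Databricks workspace.
details = ConnectionDetails(
    hostname="example.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/abc123",
)
```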

Allowed IP addresses

Depending on your Databricks IP access settings, you might need to add the Census IP addresses to your allowlist. To find the IP addresses for your region, see Census IP addresses. To learn more about setting your network policy on Databricks, see the Databricks documentation.
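
To confirm that every required address is covered by your allowlist, you can compare the two lists with Python's standard `ipaddress` module. This is a minimal sketch: the function name `missing_from_allowlist` is hypothetical, and the addresses shown are reserved documentation ranges, not real Census IP addresses.

```python
import ipaddress


def missing_from_allowlist(required_ips, allowlist_cidrs):
    """Return the required IP addresses that no allowlist entry covers."""
    networks = [ipaddress.ip_network(cidr) for cidr in allowlist_cidrs]
    return [
        ip
        for ip in required_ips
        if not any(ipaddress.ip_address(ip) in network for network in networks)
    ]


# Placeholder addresses from documentation ranges; replace with the real lists.
not_covered = missing_from_allowlist(
    required_ips=["203.0.113.10", "198.51.100.7"],
    allowlist_cidrs=["203.0.113.0/24"],
)
```

Any address left in `not_covered` still needs an allowlist entry.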

Add connections

To add an integration between Databricks and Labelbox:

  1. On the Labelbox home page or the Workspace settings > Integrations page, select Sync from a source.
  2. Select Databricks as the data source.
  3. Configure Census settings:
    1. Select between Basic and Advanced sync engine options. See the Census documentation for more information.
    2. Add all required credentials that you configured on Databricks.
  4. Click Continue to start the connection.
  5. (Optional) Test the connection.

After you add the connection, you can check its status under Manage integrations and start syncing data to your datasets.

Sync data

To sync Databricks data to a Labelbox dataset after connecting Databricks to Labelbox:

  1. On the Workspace settings page, select Integrations.
  2. Under Manage integrations, select Sync Integration.
  3. Select an existing dataset or create a new dataset for loading your data.
  4. Under Select a Source, select Select a Warehouse Table. Then, under Connection, select your added Databricks integration.
  5. Keep the default settings of Select a Destination.
  6. Under Select a Sync Behavior, select the data operation type that controls how data rows are imported into Labelbox. The options are Create only, Update only, Update or Create, and Delete.
  7. Map your data according to the data requirements.
  8. (Optional) Run a test sync to verify the sync behavior.
  9. Select a Run Mode to manually trigger the sync or set up a pattern that automates the data sync between your source and Labelbox.

After the sync runs successfully, you can manage and use the synced dataset like any other Labelbox dataset.