Databricks

Learn how to sync your data from Databricks to Labelbox.

You can sync your data from Databricks to Labelbox using the Census data platform built into Labelbox. The setup includes four parts:

  1. On Databricks: Structure your data to follow the data requirements.
  2. On Databricks: Configure required credentials.
  3. On Labelbox: Connect Databricks to Labelbox.
  4. On Labelbox: Sync your data between Databricks and Labelbox.

📘

Census availability

The Census integration feature is available exclusively for paid customers. Free users need to upgrade their subscription to access this feature.
For more information on upgrading your subscription, please see Plans & pricing.

Structure your data

Each entry in your data must have the following values for a successful sync:

  • A global key that uniquely identifies each data row. Global keys (also called sync keys) let Labelbox distinguish between new, changed, and duplicate records. Each global key must be unique within your Labelbox workspace, which generally corresponds to your organization's subscription.
  • A dataset ID that identifies the data row destination. You can specify the dataset ID in the source data or as a constant value when setting up the sync.
    To find a dataset ID, use Catalog to open the dataset and then use the Dataset menu to copy the dataset ID to the Clipboard.
  • A row data value that specifies the location of a data asset.
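
The required values above can be checked before a sync. The following is a minimal sketch in Python, not part of Labelbox or Census. The column names (global_key, dataset_id, row_data) and the helper find_sync_problems are assumptions made for illustration, and the check only covers uniqueness within the rows you pass in, not across your whole workspace.

```python
from collections import Counter


def find_sync_problems(rows, required=("global_key", "row_data")):
    """Return a list of human-readable problems found in the given rows.

    The field names are hypothetical; use the column names of your own table.
    Add "dataset_id" to `required` if the dataset ID comes from the source
    data rather than from a constant set during sync setup.
    """
    problems = []
    for index, row in enumerate(rows):
        for field in required:
            if not row.get(field):
                problems.append(f"row {index}: missing {field}")
    # Global keys must be unique, so report any key that appears more than once.
    counts = Counter(row["global_key"] for row in rows if row.get("global_key"))
    for key, count in counts.items():
        if count > 1:
            problems.append(f"global key {key!r} appears {count} times")
    return problems


rows = [
    {"global_key": "img-001", "row_data": "https://storage.example.com/samples/a.jpeg"},
    {"global_key": "img-001", "row_data": ""},
]
print(find_sync_problems(rows))
```

Running the check on a sample of the source table before the first sync can surface missing values and duplicate keys early.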

You can also include the following additional fields:

  • A Metadata JSON array of objects (dictionaries) that specifies custom metadata to be added to the data row. A size limit applies based on your subscription.
    Example: [{ "name": "dog", "value": 123 }, { "name": "fox", "value": 123 }]
  • An Attachments JSON object that specifies attachments to be associated with the data row.
    Example:
    [  
      {  
        "type": "RAW_TEXT",  
        "value": "IOWA, Zone 2232, June 2022 [Text string]"  
      },
      {
        "type": "IMAGE",
        "value": "https://storage.example.com/samples/attachment.jpeg"
      }
    ]
    

See this Google Sheet for an example of how to structure required and optional values.
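
The optional fields can be prepared as JSON text before they are written to the source table. This is a minimal sketch that assumes the metadata and attachments columns are stored as JSON strings. The helper to_json_column is hypothetical, and the sample values are taken from the examples above.

```python
import json

metadata = [{"name": "dog", "value": 123}, {"name": "fox", "value": 123}]
attachments = [
    {"type": "RAW_TEXT", "value": "IOWA, Zone 2232, June 2022 [Text string]"},
    {"type": "IMAGE", "value": "https://storage.example.com/samples/attachment.jpeg"},
]


def to_json_column(entries, required_keys):
    """Serialize a list of entries to JSON after checking each entry's keys."""
    for entry in entries:
        missing = [key for key in required_keys if key not in entry]
        if missing:
            raise ValueError(f"entry {entry!r} is missing {missing}")
    return json.dumps(entries)


# Each result is a single string that can be stored in one table column.
metadata_json = to_json_column(metadata, ("name", "value"))
attachments_json = to_json_column(attachments, ("type", "value"))
print(metadata_json)
print(attachments_json)
```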

Configure credentials

Before setting up an integration on Labelbox, collect all required credentials from Databricks.

Access token

Databricks supports two types of access tokens for authentication:

  • Personal access tokens: Authenticate access to resources and APIs at the Databricks workspace level.
  • Service principals: Give automated tools and scripts API-only access to Databricks resources, providing greater security than personal access tokens.

To learn how to create these access tokens, see the Databricks documentation on personal access tokens or service principals.

🚧

Service principal limitations

Service principals can't be connected to All Purpose Clusters that are in Single User access mode.

Connection details

The following credentials are required to establish a connection:

  • Hostname
  • Port
  • HTTP Path

To view and collect these credentials on Databricks:

  • For SQL Warehouses, switch to the Connection details tab.
  • For All Purpose Clusters, on the Configuration tab, open the Advanced Options section, and then select the JDBC/ODBC section.
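
Once you have collected the three connection values, a quick sanity check can catch copy-and-paste mistakes before you enter them in Labelbox. The sketch below is only an illustration: the ConnectionDetails class and its checks are assumptions, and the sample hostname, port, and HTTP path are placeholders rather than real values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectionDetails:
    hostname: str
    port: int
    http_path: str

    def problems(self):
        """Return a list of likely mistakes in the collected values."""
        found = []
        # The hostname field usually holds a bare host, not a full URL.
        if not self.hostname or "://" in self.hostname or "/" in self.hostname:
            found.append("hostname should be a bare host name without a scheme or path")
        if not 1 <= self.port <= 65535:
            found.append("port should be between 1 and 65535")
        if not self.http_path.strip():
            found.append("HTTP path should not be empty")
        return found


# Placeholder values for illustration only.
details = ConnectionDetails(
    hostname="dbc-example.cloud.databricks.com",
    port=443,
    http_path="/sql/1.0/warehouses/example-id",
)
print(details.problems())
```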

Allowed IP addresses

Depending on your Databricks IP access settings, you might need to add Census IP addresses to your allowlist. To find the IP addresses for your region, see Census IP addresses. To learn more about setting your network policy on Databricks, see the Databricks documentation.
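
To see which addresses still need to be added, you can compare the Census addresses for your region against your current allowlist. The following sketch uses Python's standard ipaddress module. The function missing_from_allowlist is hypothetical, and the addresses are reserved documentation ranges, not real Census or Databricks values.

```python
import ipaddress


def missing_from_allowlist(required_addresses, allowlist):
    """Return the required addresses that no allowlist entry covers."""
    # Allowlist entries may be single addresses or CIDR ranges.
    networks = [ipaddress.ip_network(entry, strict=False) for entry in allowlist]
    missing = []
    for address in required_addresses:
        ip = ipaddress.ip_address(address)
        if not any(ip in network for network in networks):
            missing.append(address)
    return missing


# Placeholder values: substitute the Census addresses for your region
# and the entries from your own Databricks IP access list.
census_addresses = ["203.0.113.45", "198.51.100.8"]
allowlist = ["203.0.113.0/24", "198.51.100.7"]
print(missing_from_allowlist(census_addresses, allowlist))  # ['198.51.100.8']
```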

Add connections

To add an integration between Databricks and Labelbox:

  1. On the Labelbox home page or the Workspace settings > Integrations page, select Sync from a source.
  2. Select Databricks as the data source.
  3. Configure Census settings, including:
    1. Select between Basic and Advanced sync engine options. See the Census documentation for more information.
    2. Add all required credentials that you configured on Databricks.
  4. Click Continue to start the connection.
  5. (Optional) Test the connection.

You can check the connection status of your configured integrations under Manage integrations and start to sync data to your datasets.

Add permissions for advanced sync engine

If you choose to use the advanced sync engine and haven't created a corresponding Census schema, you need to run the following scripts within the Databricks environment to create the schema and grant permissions based on your access token type:

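For a personal access token (grant to the user who owns the token):

CREATE SCHEMA IF NOT EXISTS CENSUS;
GRANT ALL PRIVILEGES ON SCHEMA CENSUS TO `[email protected]`;

For a service principal (grant to the service principal's client ID):

CREATE SCHEMA IF NOT EXISTS CENSUS;
GRANT ALL PRIVILEGES ON SCHEMA CENSUS TO `service-principal-clientid-guid`;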

Sync data

To sync Databricks data to a Labelbox dataset after connecting Databricks to Labelbox:

  1. On the Workspace settings page, select Integrations.
  2. Under Manage integrations, select Sync Integration.
  3. Select an existing dataset or create a new dataset to load your data into.
  4. Under Select a Source, select Select a Warehouse Table. Then, under Connection, select your added Databricks integration.
  5. Under Select a Destination, keep the default settings.
  6. Under Select a Sync Behavior, select the data operation type that controls how data rows are imported into Labelbox. The options are Create only, Update only, Update or Create, and Delete.
  7. Map your data according to the data requirements.
  8. (Optional) Run a test sync to verify the sync behavior.
  9. Select a Run Mode to manually trigger the sync or set up a pattern that automates the data sync between your source and Labelbox.
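
The sync behaviors in step 6 can be pictured as operations on a table of data rows keyed by global key. The sketch below is a conceptual model inferred from the option names, not the actual Census or Labelbox implementation. The function apply_sync and the sample data are hypothetical, and the real behavior (especially for Delete) may differ, so run a test sync to confirm.

```python
BEHAVIORS = ("Create only", "Update only", "Update or Create", "Delete")


def apply_sync(destination, source, behavior):
    """Return a new destination dict after applying one sync behavior.

    Both dicts map a global key to a row value.
    """
    if behavior not in BEHAVIORS:
        raise ValueError(f"unknown sync behavior: {behavior!r}")
    result = dict(destination)
    for key, value in source.items():
        exists = key in result
        if behavior == "Create only" and not exists:
            result[key] = value  # add new rows, leave existing rows untouched
        elif behavior == "Update only" and exists:
            result[key] = value  # change existing rows, skip unknown keys
        elif behavior == "Update or Create":
            result[key] = value  # upsert: add or overwrite
        elif behavior == "Delete" and exists:
            del result[key]  # remove rows whose keys appear in the source
    return result


destination = {"a": 1, "b": 2}
source = {"b": 20, "c": 30}
for behavior in BEHAVIORS:
    print(behavior, apply_sync(destination, source, behavior))
```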

Once your sync has run successfully, you can manage and use your dataset with synced data like any other Labelbox dataset.