- Databricks: Structure your data to follow the data requirements.
- Databricks: Configure required credentials.
- Labelbox: Connect Databricks to Labelbox.
- Labelbox: Sync your data between Databricks and Labelbox.
Paid feature
The Census integration is available exclusively to paid customers. For upgrade options, see Plans & pricing.
Structure your data
Each entry in your data must include the following fields for a successful sync:
- A global key that uniquely identifies each data row. Global keys, also called sync keys, help Labelbox detect new, changed, or duplicate records. They must be unique within your Labelbox workspace.
- A dataset ID that identifies the target dataset. You can include the dataset ID directly in your source data or specify it as a constant value during the sync setup. To find a dataset ID, open the dataset in Catalog, then use the dataset menu to copy the ID to your clipboard.
- A row data value that specifies the URL or file path of the data asset.
- A Metadata JSON dictionary that specifies the custom metadata of a data row. Example:
[{ "name": "dog", "value": 123 }, { "name": "fox", "value": 123 }]
- An Attachments JSON object that specifies attachments associated with the data row. Example:
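As a rough illustration of the field requirements above, the following Python sketch builds source records and checks them before loading them into a table. The column names (`global_key`, `dataset_id`, `row_data`, `metadata`) and the helper functions are hypothetical; the actual names are whatever you map during sync setup. The sketch treats the dataset ID as optional because it can also be set as a constant value during sync setup, and it leaves out attachments.

```python
import json

# Hypothetical column names; map them to Labelbox fields during sync setup.
REQUIRED_FIELDS = ("global_key", "row_data")


def build_row(global_key, row_data, dataset_id=None, metadata=None):
    """Return one source record, JSON-encoding the optional metadata."""
    row = {"global_key": global_key, "row_data": row_data}
    if dataset_id is not None:
        row["dataset_id"] = dataset_id
    if metadata is not None:
        # Metadata is stored as a JSON string, e.g. a list of name/value pairs.
        row["metadata"] = json.dumps(metadata)
    return row


def validate_rows(rows):
    """Return a list of problems: missing fields and duplicate global keys."""
    errors = []
    seen = set()
    for index, row in enumerate(rows):
        for field in REQUIRED_FIELDS:
            if not row.get(field):
                errors.append(f"row {index}: missing {field}")
        key = row.get("global_key")
        if key:
            if key in seen:
                errors.append(f"row {index}: duplicate global_key {key!r}")
            seen.add(key)
    return errors
```

Note that this check only catches duplicates within one batch. Global keys must also be unique across the whole Labelbox workspace, which a local check like this cannot verify.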
Configure credentials
Before setting up an integration on Labelbox, collect all required credentials from Databricks.
Access token
Collect one of the following types of access tokens supported by Databricks for authentication:
- Personal access tokens: Authenticate access to resources and APIs at the Databricks workspace level.
- Service principals: Grant automated tools and scripts API-only access to Databricks resources, providing greater security than personal access tokens.
Connection details
Collect the following credentials for establishing a connection:
- Hostname
- Port
- HTTP Path
To find these values in Databricks:
- For SQL warehouses, navigate to the Connection Details tab.
- For all-purpose clusters, navigate to the Configuration tab and select Advanced Options > JDBC/ODBC.
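Before entering these values in Labelbox, it can help to keep them together and run a basic sanity check. The sketch below is one possible way to do that in Python; the `DatabricksConnection` class and its checks are hypothetical helpers, not part of any Databricks or Labelbox library, and the default port of 443 is an assumption that you should confirm against your own connection details.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabricksConnection:
    """Hypothetical container for the values collected from Databricks."""

    hostname: str
    http_path: str
    access_token: str
    port: int = 443  # assumed default; confirm in your connection details

    def problems(self):
        """Return a list of obvious mistakes in the collected values."""
        issues = []
        if not self.hostname or "://" in self.hostname:
            issues.append("hostname should be a bare host name without a scheme")
        if not self.http_path:
            issues.append("http_path is empty")
        if not 1 <= self.port <= 65535:
            issues.append("port is out of range")
        if not self.access_token:
            issues.append("access_token is empty")
        return issues
```

An empty list from `problems()` only means the values look well formed. It does not confirm that the connection works; use the optional connection test in Labelbox for that.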
Allowed IP addresses
Depending on your Databricks IP access settings, you might need to add Census IP addresses to your allowlist. To find the IP addresses for your region, see Census IP addresses. To learn more about setting your network policy on Databricks, see the Databricks documentation.
Add connections
To add an integration between Databricks and Labelbox:
- On the Labelbox home page or the Workspace settings > Integrations page, select Sync from a source.
- Select Databricks as the data source.
- Configure Census settings:
  - Select between Basic and Advanced sync engine options. See the Census documentation for more information.
  - Add all required credentials that you configured on Databricks.
- Click Continue to start the connection.
- (Optional) Test the connection.
Sync data
To sync Databricks data to a Labelbox dataset after connecting Databricks to Labelbox:
- On the Workspace settings page, select Integrations.
- Under Manage integrations, select Sync Integration.
- Select an existing dataset or create a new dataset for loading your data.
- Under Select a Source, select Select a Warehouse Table. Then, under Connection, select your added Databricks integration.
- Under Select a Destination, keep the default settings.
- Under Select a Sync Behavior, select the data operation type that controls how data rows are imported into Labelbox. The options are Create only, Update only, Update or Create, and Delete.
- Map your data according to the data requirements.
- (Optional) Run a test sync to verify the sync behavior.
- Select a Run Mode to manually trigger the sync or set up a pattern that automates the data sync between your source and Labelbox.
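To make the sync behavior options easier to compare, the following Python sketch models a destination dataset as a dictionary keyed by global key and applies one sync run to it. The semantics shown here are an assumption inferred from the option names (in particular, that Delete removes destination rows whose global keys appear in the source); the `apply_sync` function and the behavior strings are hypothetical and are not part of Labelbox or Census. Running a test sync is the reliable way to confirm the actual behavior.

```python
def apply_sync(destination, source_rows, behavior):
    """Return a new destination mapping after one simulated sync run.

    destination: dict mapping global key -> row (left unmodified)
    source_rows: list of dicts, each with a "global_key" entry
    behavior: "create_only", "update_only", "update_or_create", or "delete"
    """
    result = dict(destination)
    for row in source_rows:
        key = row["global_key"]
        exists = key in result
        if behavior == "create_only":
            if not exists:
                result[key] = row  # add new rows, leave existing ones alone
        elif behavior == "update_only":
            if exists:
                result[key] = row  # overwrite existing rows, skip new ones
        elif behavior == "update_or_create":
            result[key] = row  # overwrite or add
        elif behavior == "delete":
            result.pop(key, None)  # assumed: remove rows matched by the source
        else:
            raise ValueError(f"unknown sync behavior: {behavior}")
    return result
```

For example, with a destination that already holds key `a` and a source that holds keys `a` and `b`, `create_only` in this model adds `b` and leaves `a` unchanged, while `update_only` rewrites `a` and ignores `b`.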