Learn how to sync your data from Databricks to Labelbox.
You can sync data from Databricks to Labelbox using the built-in integration from Census, a tool that connects data warehouses to operational platforms. Setting up this integration consists of four main steps across Databricks and Labelbox:
Databricks: Structure your data to follow the data requirements.
Databricks: Configure required credentials.
Labelbox: Connect Databricks to Labelbox.
Labelbox: Sync your data from Databricks to Labelbox.
Each entry in your data must include the following fields for a successful sync:
A global key that uniquely identifies each data row. Global keys, also called sync keys, help Labelbox detect new, changed, or duplicate records. They must be unique within your Labelbox workspace.
A dataset ID that identifies the target dataset. You can include the dataset ID directly in your source data or specify it as a constant value during the sync setup. To find a dataset ID, open the dataset in Catalog, then use the dataset menu to copy the ID to your clipboard.
A row data value that specifies the URL or file path of the data asset.
You can also include the following optional fields:
A Metadata JSON value that specifies the custom metadata of a data row, given as a list of name/value entries. Example: [{ "name": "dog", "value": 123 }, { "name": "fox", "value": 123 }]
An Attachments JSON value that specifies the attachments associated with the data row, given as a list of entries that each have a type and a value. Example:
[ { "type": "RAW_TEXT", "value": "IOWA, Zone 2232, June 2022" }, { "type": "IMAGE", "value": "https://storage.example.com/samples/attachment.jpeg" }]
See this Google Sheet for an example of structured data that includes all required and optional fields.
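The data requirements above can be checked before a sync with a short script. The following is a minimal sketch in Python, not part of the Labelbox or Census tooling. The column names (`global_key`, `dataset_id`, `row_data`, `metadata`, `attachments`) and the sample values are assumptions chosen for illustration; use whatever names your table actually has. The check treats the dataset ID as optional in the data, because it can instead be set as a constant during sync setup, and it only detects duplicate global keys within the batch, not across your whole workspace.

```python
import json

# Hypothetical column names; dataset_id is left out of the required list
# because it can be supplied as a constant during sync setup instead.
REQUIRED_FIELDS = ("global_key", "row_data")
JSON_LIST_FIELDS = ("metadata", "attachments")


def validate_rows(rows):
    """Return a list of human-readable problems found in the rows."""
    problems = []
    seen_keys = set()
    for index, row in enumerate(rows):
        # Required fields must be present and non-empty.
        for field in REQUIRED_FIELDS:
            if not row.get(field):
                problems.append(f"row {index}: missing {field}")

        # Global keys must not repeat within this batch.
        key = row.get("global_key")
        if key:
            if key in seen_keys:
                problems.append(f"row {index}: duplicate global_key {key!r}")
            seen_keys.add(key)

        # Optional fields hold JSON text that should parse to a list.
        for field in JSON_LIST_FIELDS:
            value = row.get(field)
            if value is None:
                continue
            try:
                parsed = json.loads(value)
            except json.JSONDecodeError:
                problems.append(f"row {index}: {field} is not valid JSON")
                continue
            if not isinstance(parsed, list):
                problems.append(f"row {index}: {field} should be a JSON list")
    return problems


# Sample rows: the second one reuses a global key and has no row data.
rows = [
    {
        "global_key": "image-0001",
        "dataset_id": "example-dataset-id",
        "row_data": "https://storage.example.com/samples/image-0001.jpeg",
        "metadata": json.dumps([{"name": "dog", "value": 123}]),
        "attachments": json.dumps(
            [{"type": "RAW_TEXT", "value": "IOWA, Zone 2232, June 2022"}]
        ),
    },
    {
        "global_key": "image-0001",
        "dataset_id": "example-dataset-id",
        "row_data": "",
    },
]

problems = validate_rows(rows)
for problem in problems:
    print(problem)
```

Running a check like this against a sample of your table before the first sync can surface empty asset URLs or repeated keys early.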
Depending on your Databricks IP access settings, you might need to add the Census IP addresses to your allowlist. To find the IP addresses for your region, see Census IP addresses. To learn more about setting your network policy on Databricks, see the Databricks documentation.
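If you keep your allowlist as a set of CIDR ranges, you can confirm that it covers the addresses you need with Python's standard `ipaddress` module. This is a minimal sketch: the addresses below are placeholders from the reserved documentation ranges, not real Census addresses, and the `is_allowed` helper is a hypothetical name introduced for illustration.

```python
import ipaddress


def is_allowed(address, allowlist):
    """Return True if the address falls inside any CIDR range in the allowlist."""
    ip = ipaddress.ip_address(address)
    return any(ip in ipaddress.ip_network(entry) for entry in allowlist)


# Placeholder values only; replace them with your own allowlist entries
# and the addresses published for your region.
allowlist = ["203.0.113.0/24", "198.51.100.7/32"]
required_addresses = ["203.0.113.10", "198.51.100.8"]

missing = [a for a in required_addresses if not is_allowed(a, allowlist)]
print("Not covered by the allowlist:", missing)
```

Any address reported as not covered would need to be added to the allowlist before the connection can succeed.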
To sync Databricks data to a Labelbox dataset after connecting Databricks to Labelbox:
On the Workspace settings page, select Integrations.
Under Manage integrations, select Sync Integration.
Select an existing dataset or create a new dataset for loading your data.
Under Select a Source, select Select a Warehouse Table. Then, under Connection, select your added Databricks integration.
Under Select a Destination, keep the default settings.
Under Select a Sync Behavior, select the data operation type that controls how data rows are imported into Labelbox. The options are Create only, Update only, Update or Create, and Delete.
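As a rough mental model of the four sync behaviors, the sketch below applies each one to an in-memory dictionary keyed by global key. It is an illustration under stated assumptions, not how Labelbox or Census implements a sync: the `apply_sync` function and the behavior names are hypothetical, and the Delete case in particular assumes that rows present in the source are removed from the destination.

```python
BEHAVIORS = ("create_only", "update_only", "update_or_create", "delete")


def apply_sync(destination, source, behavior):
    """Return a new destination dict after applying one sync behavior.

    Both arguments map a global key to a row value. This is a simplified
    model for illustration only.
    """
    if behavior not in BEHAVIORS:
        raise ValueError(f"unknown behavior: {behavior}")

    result = dict(destination)  # leave the caller's dict untouched
    for key, row in source.items():
        exists = key in result
        if behavior == "create_only" and not exists:
            result[key] = row          # add new rows, never overwrite
        elif behavior == "update_only" and exists:
            result[key] = row          # overwrite existing rows, never add
        elif behavior == "update_or_create":
            result[key] = row          # add or overwrite
        elif behavior == "delete" and exists:
            del result[key]            # assumed: source rows are removed
    return result


destination = {"a": "old-a", "b": "old-b"}
source = {"b": "new-b", "c": "new-c"}

for behavior in BEHAVIORS:
    print(behavior, apply_sync(destination, source, behavior))
```

In this model the global key decides whether a source row counts as new or existing, which is why each row needs a stable, unique key.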