You can use the Labelbox Databricks Pipeline Creator to integrate Databricks table data into Labelbox. The Pipeline Creator builds Databricks Workflows that import (ingest) your table data into Labelbox datasets.
Pipeline Creator is an app on HuggingFace Spaces. Anyone can use the app; you don't need a HuggingFace account or credentials to use Pipeline Creator.
The Databricks Pipeline Creator integration is currently in beta. Some functionality may change before general availability.
Before proceeding, you need the following details:
- The Server Hostname of your Databricks workspace
- A Databricks Access token
- A Labelbox API key
In addition, it helps to do a bit of planning, as you'll need several details to set up your Databricks integration, including:
- The Databricks cluster you want your integration to use.
- The table and data you want to import into Labelbox.
- How frequently you want your integration to run (so that your Labelbox data stays in sync).
- Whether you're importing data into an existing Labelbox dataset or creating a new one.
To integrate Databricks data into Labelbox, use the Pipeline Creator to set up a Databricks integration:
From your web browser, open the Labelbox Databricks Pipeline Creator.
Select the mode you want your integration to run in.
For best results, we recommend activating the Run in Preview Mode toggle.
Preview mode verifies your integration setup and imports up to 50 rows. (When the preview runs successfully, deactivate the toggle and then deploy your pipeline using the current settings.)
Enter access details and credentials, which include:
- Databricks Server Hostname
- Databricks Access token
- Your Labelbox API key
These details are validated when entered. Error messages indicate validation failure; use message details to troubleshoot and resolve problems.
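It can help to gather these values in one place before opening Pipeline Creator. Below is a minimal pre-flight sketch; the field names are our own illustration, not part of any Labelbox or Databricks API:

```python
# Hypothetical pre-flight check for the details Pipeline Creator asks for.
# The field names below are illustrative, not part of any official API.
REQUIRED_FIELDS = (
    "databricks_server_hostname",
    "databricks_access_token",
    "labelbox_api_key",
)

def missing_fields(config: dict) -> list:
    """Return the names of required fields that are absent or blank."""
    return [f for f in REQUIRED_FIELDS if not config.get(f, "").strip()]

config = {
    "databricks_server_hostname": "dbc-example.cloud.databricks.com",
    "databricks_access_token": "",  # forgot the token
    "labelbox_api_key": "lb-api-key",
}
print(missing_fields(config))  # -> ['databricks_access_token']
```

Pipeline Creator performs its own validation against the live services; a check like this only catches values you forgot to collect.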
Enter Labelbox dataset details.
Here, you choose whether to create a new dataset or to use an existing one.
When you create a new dataset, you need to name it.
You can also select an existing Labelbox dataset from a list of datasets available to your Labelbox API key.
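The create-or-reuse choice can also be made programmatically with the Labelbox Python SDK. A sketch, assuming a `labelbox.Client`-like object whose `get_datasets()` yields datasets with a `name` attribute and whose `create_dataset(name=...)` creates a new one:

```python
def get_or_create_dataset(client, name: str):
    """Reuse an existing Labelbox dataset by name, or create a new one.

    `client` is assumed to behave like labelbox.Client: get_datasets()
    yields objects with a .name attribute, and create_dataset(name=...)
    returns a new dataset.
    """
    for ds in client.get_datasets():
        if ds.name == name:
            return ds
    return client.create_dataset(name=name)

# Usage (requires a real API key and network access):
# import labelbox
# client = labelbox.Client(api_key="YOUR_LABELBOX_API_KEY")
# dataset = get_or_create_dataset(client, "databricks-import")
```

Note that Labelbox dataset names are not required to be unique, so matching by name reuses the first dataset found with that name.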
Select a cluster for your integration workflow to use.
Your selected cluster starts if it isn't already running. This can take several minutes.
Once the cluster starts, a confirmation message appears.
Define how frequently and when your workflow integration runs.
Integrations can run daily, weekly, or monthly. Scheduled times are interpreted by Databricks.
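Under the hood, Databricks job schedules are Quartz cron expressions paired with a time zone (the Jobs API `schedule` object). A sketch of what the daily/weekly/monthly options might translate to, with example 09:00 run times of our own choosing:

```python
# Example Quartz cron expressions for a Databricks job schedule.
# The 09:00 run times are illustrative; adjust to your own cadence.
SCHEDULES = {
    "daily":   "0 0 9 * * ?",    # every day at 09:00
    "weekly":  "0 0 9 ? * MON",  # every Monday at 09:00
    "monthly": "0 0 9 1 * ?",    # the 1st of each month at 09:00
}

def schedule_payload(frequency: str, timezone_id: str = "UTC") -> dict:
    """Build the `schedule` block a Databricks job definition carries."""
    return {
        "quartz_cron_expression": SCHEDULES[frequency],
        "timezone_id": timezone_id,
        "pause_status": "UNPAUSED",
    }

print(schedule_payload("weekly"))
```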
Select the source database and table:
When you do this, the first few rows are displayed as a preview.
Map fields from your source table to your Labelbox dataset.
The row_data field is required; it represents the data asset to be labeled.
You can optionally select a field to use as the Labelbox global_key. If you don't select one, the row_data value is used as the global key.
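The mapping rule can be sketched in terms of the data-row payloads Labelbox accepts (the `build_data_rows` helper here is our own illustration, not part of the Pipeline Creator):

```python
def build_data_rows(records, row_data_field, global_key_field=None):
    """Map source-table rows to Labelbox data-row payloads.

    Mirrors the mapping rule above: row_data is required, and when no
    global_key field is chosen, the row_data value doubles as the
    global key.
    """
    payloads = []
    for rec in records:
        row_data = rec[row_data_field]
        global_key = rec[global_key_field] if global_key_field else row_data
        payloads.append({"row_data": row_data, "global_key": global_key})
    return payloads

rows = [{"url": "https://example.com/a.png", "id": "asset-1"}]
print(build_data_rows(rows, row_data_field="url"))
# -> [{'row_data': 'https://example.com/a.png', 'global_key': 'https://example.com/a.png'}]
```

With the Labelbox SDK, payloads in this shape could then be imported via `dataset.create_data_rows(payloads)` (network access required).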
When finished, select the Deploy Pipeline button.
The deployment result appears below the Deploy Pipeline button. This either confirms successful deployment or displays an error detailing the problem.
To verify your integration, sign in to Databricks and then open Workflows > Jobs.
If you selected preview mode when you created the workflow, your integration is named
PREVIEW_upload_to_labelbox; production workflow names omit the PREVIEW_ prefix.
Open your integration to review the details.
Once your Labelbox Databricks integration has been deployed, use the Databricks app or other tools to maintain it.
For example, you can use the Edit schedule button to change the timing of your integration or to pause it.
To delete the integration, select Delete job from the command menu.
To learn more, see View and manage job runs.
Once your Labelbox Databricks integration has been deployed, you can also run it manually. For example, to run the job using the Databricks app:
- Sign in to the Databricks app and then select Workflows > Jobs.
- Select your integration.
- From the job details view, select the Run Now button displayed to the right of the command menu.
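Among the "other tools" available, the Databricks Jobs API 2.1 exposes a `run-now` endpoint. A sketch that only constructs the request (sending it requires a real workspace, token, and job ID):

```python
def run_now_request(server_hostname: str, access_token: str, job_id: int):
    """Build the pieces of a Databricks Jobs API 2.1 `run-now` call.

    Returns (url, headers, body); send with any HTTP client, e.g.
    requests.post(url, headers=headers, json=body).
    """
    url = f"https://{server_hostname}/api/2.1/jobs/run-now"
    headers = {"Authorization": f"Bearer {access_token}"}
    body = {"job_id": job_id}
    return url, headers, body

url, headers, body = run_now_request("dbc-example.cloud.databricks.com", "TOKEN", 123)
print(url)  # -> https://dbc-example.cloud.databricks.com/api/2.1/jobs/run-now
```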
When your Databricks integration runs, it creates a job run; use the Runs tab for your integration to review status.
The graph at the top shows recent run results and the amount of time taken by each run.
To review general details for a particular run, you can select a bar in the graph or locate the corresponding row in the table below the chart.
Output logs are available for each run and can help troubleshoot any issues. To locate these, you can select:
- Job run ID from the popup displayed when you highlight a run in the graph.
- Start time displayed in the job run table
To access detailed logs for the job, you can select:
- The Logs button displayed in the Compute section of the job details panel.
- Logs in the Spark column of the run table. To learn more, see Troubleshoot and repair job failures.
For complete details, see the Databricks docs.
The Labelbox Databricks integration imports (ingests) assets from Databricks to Labelbox.
- If you store raw data in Databricks (e.g. raw text), the text is imported.
- If you store public URLs in Databricks, the URLs are imported.
- If you store private (non-public) URLs in Databricks, the URLs are imported to Labelbox.
Labelbox needs access to private URLs in order to display and manage assets. This means you need to set up IAM delegated access before Labelbox can use your assets. For help, see the details for your cloud provider: