Databricks

Shows how to use the Labelbox Databricks Pipeline Creator to integrate Databricks with Labelbox and import database table data into the platform.

The Pipeline Creator integrates Databricks table data into Labelbox by creating Databricks Workflows that import (ingest) data into Labelbox datasets.

Pipeline Creator is an app on HuggingFace Spaces. Anyone can use it; you don't need a HuggingFace account or credentials.

If you prefer, you can host your own instance of Pipeline Creator. To learn more, see the source code on GitHub, which is available under an Apache 2.0 license.

🚧

The Databricks Pipeline Creator integration is currently in beta. Some functionality may change before general availability.

Prerequisites

Before proceeding, you need the following details:

  1. The Server Hostname of your Databricks workspace. (Typically, this is a string value similar to https://<instance>.<cloud>.databricks.com)
  2. A Databricks Access token
  3. A Labelbox API key

In addition, it helps to do a bit of planning, as you'll need several details to set up your Databricks integration, including:

  • The Databricks cluster you want your integration to use.
  • The table and data you want to import into Labelbox.
  • How frequently you want your integration to run (so that your Labelbox data stays in sync).
  • Whether or not you're importing data into an existing Labelbox dataset.
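If you want to confirm your credentials up front, you can test them outside the app. The following is a minimal sketch, assuming the databricks-sdk and labelbox Python packages are installed; the hostname, token, and API key values are placeholders to replace with your own.

```python
# Sanity-check the credentials Pipeline Creator will ask for.
# A minimal sketch; the host, token, and API key values are placeholders.
from databricks.sdk import WorkspaceClient
import labelbox as lb

DATABRICKS_HOST = "https://<instance>.<cloud>.databricks.com"  # your Server Hostname
DATABRICKS_TOKEN = "dapi..."                                   # your access token
LABELBOX_API_KEY = "..."                                       # your Labelbox API key

# Verify the Databricks hostname and token by fetching the current user.
w = WorkspaceClient(host=DATABRICKS_HOST, token=DATABRICKS_TOKEN)
print("Databricks user:", w.current_user.me().user_name)

# Verify the Labelbox API key by fetching the organization it belongs to.
client = lb.Client(api_key=LABELBOX_API_KEY)
print("Labelbox organization:", client.get_organization().name)
```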

Set up a Labelbox Databricks integration

To integrate Databricks data into Labelbox, use the Pipeline Creator to set up a Databricks integration:

  1. From your web browser, open the Labelbox Databricks Pipeline Creator.

  2. Select the mode you want your integration to run in.
    For best results, we recommend activating the Run in Preview Mode toggle.

    Preview mode verifies your integration setup and imports up to 50 rows. (When the preview runs successfully, deactivate the toggle and then deploy your pipeline using the current settings.)

  3. Enter access details and credentials, which include:

    • Databricks Server Hostname
    • Databricks Access token
    • Your Labelbox API key
      These details are validated when entered. Error messages indicate validation failure; use message details to troubleshoot and resolve problems.
  4. Enter Labelbox dataset details.
    Here, you choose whether to create a new dataset or use an existing one. When you create a new dataset, you must name it.

    When you use an existing dataset, you select it from a list of datasets available to your Labelbox API key.

  5. Select a cluster for your integration workflow to use.

    Your selected cluster starts if it isn't already running. This can take several minutes.

    The Labelbox Databricks integration runs in your Databricks workspace, in a dedicated cluster.

    Once the cluster starts, a confirmation message appears.

  6. Define how frequently and when your integration workflow runs.
    The data ingestion job runs on the schedule you set here; you can edit the schedule later.

    Integrations can run daily, weekly, or monthly. Times are interpreted by Databricks.

  7. Select the source database and table.

    When you do this, the first few rows of the table are displayed as a preview.

  8. Map fields from your source table to your Labelbox dataset.

    The row_data field is required and represents the data asset to be labeled.
    You can optionally select a field to use as the Labelbox global_key; if you don't, the row_data value is used as the global key. (The sketch after this procedure illustrates this mapping.)

  9. When finished, select the Deploy Pipeline button.
    Labelbox logs confirm that the Labelbox Databricks integration was set up properly.

The deployment result appears below the Deploy Pipeline button. This either confirms successful deployment or displays an error detailing the problem.
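As a rough picture of what the deployed workflow does with this configuration, the sketch below reads a capped number of rows from a source table and shapes them into Labelbox-style data-row payloads. It is a minimal sketch, not the code Pipeline Creator actually generates; it assumes the databricks-sql-connector package, and the http_path, table, and column names are hypothetical placeholders.

```python
# Approximate what the generated workflow does: read source rows and
# shape them into Labelbox data-row payloads. A sketch with placeholder
# names; the real Pipeline Creator job may differ in its details.
from databricks import sql

SERVER_HOSTNAME = "<instance>.<cloud>.databricks.com"
HTTP_PATH = "/sql/1.0/warehouses/<warehouse-id>"  # a running warehouse or cluster
ACCESS_TOKEN = "dapi..."

SOURCE_TABLE = "my_catalog.my_schema.my_table"    # hypothetical source table
ROW_DATA_COLUMN = "image_url"                     # column mapped to row_data
GLOBAL_KEY_COLUMN = "asset_id"                    # column mapped to global_key

with sql.connect(server_hostname=SERVER_HOSTNAME,
                 http_path=HTTP_PATH,
                 access_token=ACCESS_TOKEN) as conn:
    with conn.cursor() as cursor:
        # Preview mode imports at most 50 rows; mirror that cap here.
        cursor.execute(f"SELECT {ROW_DATA_COLUMN}, {GLOBAL_KEY_COLUMN} "
                       f"FROM {SOURCE_TABLE} LIMIT 50")
        rows = cursor.fetchall()

# Build payloads in the shape Labelbox expects for data rows.
# If no global_key column is mapped, row_data doubles as the global key.
payloads = [
    {"row_data": row[0], "global_key": row[1] if row[1] is not None else row[0]}
    for row in rows
]
print(f"Prepared {len(payloads)} data-row payloads; first: {payloads[:1]}")
```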

Verify integration setup

To verify your integration, sign in to Databricks and then open Workflows > Jobs.

If you selected preview mode when you created the workflow, your integration is named PREVIEW_upload_to_labelbox. Production workflows are named upload_to_labelbox.

Open your integration to review the details.

Sanity check the Labelbox Databricks integration from your Databricks workspace
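You can also confirm the job exists programmatically. The following is a minimal sketch using the databricks-sdk package, which reads your credentials from the environment or a Databricks config profile:

```python
# Confirm the deployed workflow exists, without opening the UI.
# A minimal sketch using the databricks-sdk package.
from databricks.sdk import WorkspaceClient

# Reads credentials from the environment (DATABRICKS_HOST / DATABRICKS_TOKEN)
# or from a Databricks config profile.
w = WorkspaceClient()

for name in ("upload_to_labelbox", "PREVIEW_upload_to_labelbox"):
    for job in w.jobs.list(name=name):
        print(f"Found job {job.job_id}: {job.settings.name}")
```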

Update or delete integration

Once your Labelbox Databricks integration has been deployed, use the Databricks app or other tools to maintain it.

For example, you can use the Edit schedule button to change the timing of your integration or to pause it.

To delete the integration, select Delete job from the command menu.

To learn more, see View and manage job runs.
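These maintenance tasks can also be scripted against the Databricks Jobs API. The following is a minimal sketch using the databricks-sdk package; the job ID is a placeholder, and the Quartz cron expression shown reschedules the job to run daily at 06:00 UTC.

```python
# Reschedule or delete the integration job via the Jobs API.
# A sketch; JOB_ID is a placeholder for your integration's job ID.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import JobSettings, CronSchedule

w = WorkspaceClient()  # reads host/token from environment or config
JOB_ID = 123456789     # hypothetical job ID

# Change the schedule: this Quartz expression runs daily at 06:00 UTC.
w.jobs.update(
    job_id=JOB_ID,
    new_settings=JobSettings(
        schedule=CronSchedule(quartz_cron_expression="0 0 6 * * ?",
                              timezone_id="UTC")
    ),
)

# Or delete the integration entirely.
# w.jobs.delete(job_id=JOB_ID)
```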

Run integration manually

Once your Labelbox Databricks integration has been deployed, you can run it manually using the Databricks app or other tools.

For example, you can use the Databricks app to run the job manually:

  1. Sign in to the Databricks app and then select Workflows > Jobs.
  2. Select your integration.
  3. From the job details view, select the Run Now button displayed to the right of the command menu.
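You can also trigger a run from the Jobs API. A minimal sketch with the databricks-sdk package; the job ID is a placeholder:

```python
# Trigger the integration manually and wait for it to finish.
# A sketch; JOB_ID is a placeholder for your integration's job ID.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
JOB_ID = 123456789  # hypothetical job ID

run = w.jobs.run_now(job_id=JOB_ID).result()  # blocks until the run ends
print("Run finished with state:", run.state.result_state)
```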

Logs and error handling

When your Databricks integration runs, it creates a job run; use the Runs tab for your integration to review status.

The graph at the top shows recent run results and the amount of time taken by each run.

To review general details for a particular run, you can select a bar in the graph or locate the corresponding row in the table below the chart.

Output logs are available for each run and can help you troubleshoot issues. To access the detailed logs for a run, select either:

  • the Job run ID in the popup displayed when you highlight a run in the graph, or
  • the Start time displayed in the job run table.

For complete details, see the Databricks docs.
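Run history and output logs can also be retrieved programmatically. The following minimal sketch uses the databricks-sdk package; the job ID is a placeholder, and because output logs attach to task runs rather than the parent job run, the sketch reads them from each run's tasks:

```python
# List recent runs for the integration and fetch each task's output logs.
# A sketch; JOB_ID is a placeholder for your integration's job ID.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
JOB_ID = 123456789  # hypothetical job ID

for run in w.jobs.list_runs(job_id=JOB_ID, limit=5):
    print(f"Run {run.run_id}: {run.state.life_cycle_state} / {run.state.result_state}")
    # Output logs are attached to task runs, not the parent job run.
    full_run = w.jobs.get_run(run_id=run.run_id)
    for task in full_run.tasks or []:
        output = w.jobs.get_run_output(run_id=task.run_id)
        if output.logs:
            print(output.logs)
```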

Public vs Delegated Access URLs

The Labelbox Databricks integration imports (ingests) assets from Databricks to Labelbox.

  • If you store raw data in Databricks (e.g. raw text), the text is imported.
  • If you store public URLs in Databricks, the URLs are imported.
  • If you store private (non-public) URLs in Databricks, the URLs are imported to Labelbox.
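In Labelbox terms, each imported row becomes a data row whose row_data carries either the raw content or the URL. The sketch below hand-creates equivalent data rows with the labelbox Python package; the dataset name, values, and global keys are placeholders:

```python
# Hand-create data rows equivalent to what the integration imports.
# A sketch; the dataset name, values, and global keys are placeholders.
import labelbox as lb

client = lb.Client(api_key="...")
dataset = client.create_dataset(name="databricks-import-example")

task = dataset.create_data_rows([
    # Raw data stored in Databricks is imported as-is (e.g. raw text).
    {"row_data": "Some raw text to be labeled", "global_key": "text-0001"},
    # Public URLs are imported and fetched directly by Labelbox.
    {"row_data": "https://example.com/public/image.jpg", "global_key": "img-0001"},
    # Private URLs are imported too, but need IAM delegated access (below).
    {"row_data": "https://my-bucket.s3.amazonaws.com/private.jpg", "global_key": "img-0002"},
])
task.wait_till_done()
print("Errors:", task.errors)
```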

Labelbox needs access to private URLs in order to display and manage assets. This means you need to set up IAM delegated access before Labelbox can successfully use your assets. For help, see the details for your cloud provider.