Custom model training pipeline

Customize a model training pipeline based on the Labelbox reference implementation.

Labelbox provides code to leverage Vertex AI and other Google Cloud services for running training jobs. If you have a Google Cloud account, you can follow the steps in Set up Google Vertex AutoML service and deploy the model training service into your Google Cloud environment. You can check out our GitHub repo for more instructions.

Design

The out-of-the-box system has the following components:

  1. Coordinator
    a. This is a server that implements the /model and /model_run endpoints described on the Model training service API integration page.
    b. This service runs the different jobs required for training. A sequence of jobs is called a pipeline. When the /model_run endpoint is called, this service launches all stages of the pipeline selected by the modelType posted to it (see the sketch after this list).
    c. This service should only handle input/output; resource-intensive tasks run as jobs on separate machines.
  2. Jobs
    a. These are individual stages in a pipeline that are run by the Coordinator. They are containerized and create artifacts (e.g. the ETL stage produces a training file that is stored on GCS).
    b. Both Vertex AutoML and custom container jobs run as Vertex jobs.
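
To make the division of labor concrete, here is a minimal sketch of how a coordinator of this shape might dispatch a pipeline based on the modelType in a /model_run payload. It is an illustration only, not the repository code: the function and payload field names (run_bounding_box_pipeline, modelRunId) are hypothetical, and Flask stands in for whatever framework the service actually uses.

# Minimal sketch (not the repository code) of dispatching a pipeline
# from a /model_run request based on its modelType.
from flask import Flask, request, jsonify

app = Flask(__name__)

def run_bounding_box_pipeline(payload: dict) -> None:
    # Placeholder: in the real service this launches the ETL, training, and
    # inference jobs on Vertex rather than doing any work itself.
    print("launching bounding box pipeline for", payload.get("modelRunId"))

# Map modelType values to pipeline entry points.
PIPELINES = {"bounding_box": run_bounding_box_pipeline}

@app.route("/model_run", methods=["POST"])
def model_run():
    payload = request.get_json()
    pipeline = PIPELINES.get(payload.get("modelType"))
    if pipeline is None:
        return jsonify({"error": "unknown modelType"}), 400
    pipeline(payload)  # the heavy lifting runs as separate Vertex jobs
    return jsonify({"status": "started"}), 202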

Who is this useful for

The training service automatically provisions resources and launches jobs. This is useful if you do not already have your own automated training infrastructure. Labelbox recommends attempting to use and modify the existing pipelines before creating completely custom ones.

If you already have an automated training infrastructure, you should instead follow the instructions in Model training service API integration.

Instructions: Modify Existing Pipelines

All out-of-the-box tasks are defined by three stages (ETL, train, and inference). The logic for running these stages is contained in a pipeline.

Modify the ETL

Modifying existing ETL logic is as simple as editing the Python scripts under the [jobs/etl] directory. Once you have updated the logic, you can apply the changes by running:

docker-compose build && docker-compose push
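
For orientation, the sketch below shows the kind of transformation such a script performs: converting exported labels into Vertex-style JSONL rows and writing the resulting training file to GCS. The field names, helper names, and schema details are illustrative assumptions, not the repository's exact code.

# Simplified, hypothetical ETL transformation: convert exported labels into
# Vertex-style object detection JSONL rows and upload them to GCS.
# Field and helper names are illustrative; the exact schema may differ.
import json
from google.cloud import storage

def labels_to_jsonl(labels: list[dict]) -> str:
    rows = []
    for label in labels:
        rows.append(json.dumps({
            "imageGcsUri": label["image_uri"],
            "boundingBoxAnnotations": [
                {
                    "displayName": ann["name"],
                    "xMin": ann["left"], "yMin": ann["top"],
                    "xMax": ann["right"], "yMax": ann["bottom"],
                }
                for ann in label["annotations"]
            ],
        }))
    return "\n".join(rows)

def upload_training_file(jsonl: str, bucket_name: str, path: str) -> str:
    # The ETL job's artifact: a training file stored on GCS.
    blob = storage.Client().bucket(bucket_name).blob(path)
    blob.upload_from_string(jsonl)
    return f"gs://{bucket_name}/{path}"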

Modify training logic

All of the training logic is run through Vertex. If you want to add a custom training module, then skip to the next section on this page.

You can modify the training parameters by updating the job definition code in the Coordinator. For example, there are various model types that you can configure; if you want to change the model type to a bounding box model, you can update it here. To apply these changes, run this command (note that this will stop any in-progress jobs):

./deployment/reload_coordinator.sh
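
For reference, the snippet below illustrates the kind of parameters such a job definition controls, using the google-cloud-aiplatform SDK. It is a sketch assuming an AutoML object detection job; the project, dataset ID, and display names are placeholders, and the repository's actual job definitions may differ.

# Illustrative example of the Vertex training parameters a job definition
# configures; project, location, dataset ID, and display names are placeholders.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Switching prediction_type (e.g. to "object_detection" for a bounding box
# model) and adjusting the training budget are typical parameter changes.
job = aiplatform.AutoMLImageTrainingJob(
    display_name="bounding-box-training",
    prediction_type="object_detection",
    model_type="CLOUD",
)

dataset = aiplatform.ImageDataset("1234567890")  # placeholder dataset ID
model = job.run(
    dataset=dataset,
    budget_milli_node_hours=20000,  # 20 node hours
    model_display_name="bounding-box-model",
)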

Modify inference and metrics

All inference jobs first launch a Vertex batch job. Instead of creating the additional overhead of a container for simple post-processing, you should transform the predictions in the inference stage of the pipeline. The inference logic cannot be changed much without a completely custom inference job. You can, however, change prediction thresholds and add additional metrics derived from the batch predictions. For example, if you want to decrease the bounding box threshold, you could update this line in the bounding box inference job.
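
Changes of this kind amount to simple post-processing over the batch prediction output. The sketch below is hypothetical: the field names and threshold constant are assumptions, not the repository's actual identifiers.

# Hypothetical post-processing in the inference stage: keep batch predictions
# above a confidence threshold and derive a simple per-image count metric.
CONFIDENCE_THRESHOLD = 0.4  # lowering this keeps more low-confidence boxes

def filter_predictions(predictions: list[dict]) -> list[dict]:
    return [p for p in predictions if p["confidence"] >= CONFIDENCE_THRESHOLD]

def predictions_per_image(predictions: list[dict]) -> dict[str, int]:
    counts: dict[str, int] = {}
    for p in predictions:
        counts[p["image_id"]] = counts.get(p["image_id"], 0) + 1
    return counts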

To apply changes to the pipeline, you will need to run ./deployment/reload_coordinator.sh (note that this will stop any in-progress jobs).

Labelbox recommends against adding much compute or memory overhead to the coordinator. If you want to add additional metrics or further process the data, consider adding a new job for this purpose.

Instructions: Add Custom Pipelines

This section explains how to extend the model training code with custom jobs and pipelines. We recommend running a pre-built training pipeline before attempting to create your own custom one.

You do not have to build entirely new end-to-end pipelines. If you want to create a custom training job but reuse the existing ETL logic, the same instructions apply.

Implement Jobs

The first step in setting up a custom task is to implement the individual jobs that you want to run. The jobs could be ETL, training, inference, or any arbitrary code that you want to execute as part of your training run.

  1. Create a new directory for defining the job logic and dependencies under the existing [jobs directory].
    a. This should contain a Dockerfile and a Python script with a CLI entrypoint ([bounding box ETL example], [cli entrypoint example]); see the sketch after this list.

  2. Add the job to the [docker-compose file].
    a. If you need information available at run-time, pass it in as a build arg. Environment variables are only used when running containers locally. Secrets should not be baked into the image; use Google Cloud Secret Manager to make them available. We automatically add the Labelbox API key and the service secret to Secret Manager.
    b. Make sure that the image name follows the same convention as the other containers. Doing so will allow you to support parallel deployments.

  3. Add the logic for calling the jobs by creating a new Python file that defines them under the pipeline directory.
    a. Define all of the jobs for that pipeline in the file (e.g. the bounding box ETL job is set up to run an arbitrary container that accepts CLI args). Use this pattern to run the Docker containers that you created in steps 1 and 2.
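
For step 1, the sketch below shows the shape of a CLI entrypoint for a custom job container. The flag names and job logic are hypothetical; use whatever arguments your job actually needs, and read secrets such as the Labelbox API key from Secret Manager rather than the command line.

# Hypothetical CLI entrypoint for a custom job container. The coordinator
# passes arguments like these when it launches the container on Vertex;
# the flag names are illustrative.
import argparse
import json

def run_job(model_run_id: str, gcs_bucket: str) -> None:
    # Replace with your ETL, training, inference, or other job logic.
    print(json.dumps({"model_run_id": model_run_id, "gcs_bucket": gcs_bucket}))

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Custom training pipeline job")
    parser.add_argument("--model-run-id", required=True)
    parser.add_argument("--gcs-bucket", required=True)
    args = parser.parse_args()
    run_job(args.model_run_id, args.gcs_bucket)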

Create Pipeline

Now that you have a set of job objects that you want to run, you just need to add them to a new pipeline.

  1. Define the pipeline
    a. Initialize the stages
    b. Implement a run function that accepts JSON as an argument. This function should run each stage in the pipeline. For reference, this run function is called from here.
    c. Run each stage with self.run_job (which uses the base Pipeline class's run_job method). This will automatically handle and log any errors. See the sketch after this list.
  2. Update the pipeline types and definitions in the config.
    a. Pipeline type
    b. Pipeline instance
    c. Pipeline name type
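
Putting the steps together, here is a minimal, hypothetical sketch of what a custom pipeline can look like. The base Pipeline class and its run_job error handling already exist in the repository; the stubs below are only there to make the example self-contained, and the class and stage names are illustrative.

# Minimal, hypothetical custom pipeline. The Pipeline and ContainerJob stubs
# stand in for classes that already exist in the repository.
class Pipeline:
    def run_job(self, job, json_data: dict):
        # Stand-in for the base class, which also handles and logs errors.
        return job.run(json_data)

class ContainerJob:
    def __init__(self, name: str):
        self.name = name

    def run(self, json_data: dict):
        # Stand-in for launching the job's container on Vertex.
        print(f"running {self.name} for model run {json_data.get('modelRunId')}")

class MyCustomPipeline(Pipeline):
    def __init__(self):
        # Initialize the stages that make up this pipeline.
        self.etl = ContainerJob("etl")
        self.train = ContainerJob("train")
        self.inference = ContainerJob("inference")

    def run(self, json_data: dict):
        # Run each stage in order via run_job.
        for stage in (self.etl, self.train, self.inference):
            self.run_job(stage, json_data)

A new modelType would then map to an instance of this pipeline through the config updates described in step 2.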