LLM data generation

Guide to creating fine-tuning datasets for LLMs.

With Labelbox, you can prepare a dataset of prompts and responses to fine-tune large language models (LLMs). Labelbox supports dataset creation for a variety of fine-tuning tasks including summarization, classification, question-answering, and generation.

Fine-tuning is useful when an LLM needs to learn something specific outside of the data it was trained on. In this case, the model is being fine-tuned to extract product SKUs from reviews.

LLM data generation workflows

When you set up an LLM data generation project in Labelbox, you will be prompted to specify how you intend to use the editor. You can choose from three LLM data generation workflows.

  • Workflow 1: Humans generate prompts and responses: In the editor, the prompt and response fields will be required. This will indicate to your team that they should create a prompt and a response from scratch.
  • Workflow 2: Humans generate prompts: In the editor, only the prompt field will be required. This will indicate to your team that they should create a prompt from scratch.
  • Workflow 3: Humans generate responses to uploaded prompts: In the editor, a previously uploaded prompt will appear. Your team will need to create responses for that prompt.

Specify prompts and/or responses

During the project setup, if you select Humans generate prompts, you will need to specify a prompt for your labelers to reference so they can generate more prompts. Prompts are restricted to free-form text format. You can optionally set a character minimum and maximum for prompt data.

During the project setup, if you select Humans generate prompts and responses, you will need to specify a prompt and a response for your labelers to reference so they can generate more prompts and responses. The supported format types for prompts and responses are listed below.

Import prompts

During the project setup, if you select Humans generate responses to uploaded prompts as your LLM data generation workflow, you will need to create an import file containing links to a set of prompts in text format. Then, upload your import file and send the prompts to the project.

Follow these steps to upload prompt data to Labelbox:

  1. Create the import file containing links to the prompts. See Import text data to learn how to structure your import file.
  2. Upload the import file to Labelbox.
  3. Save the prompts in a batch and send the batch to your project. See Batches for instructions.
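As a minimal sketch of step 1, the import file is newline-delimited JSON with one data row per line. The URLs and global keys below are invented for illustration, and the exact field set is defined in the Import text data guide:

```python
import json

# Hypothetical prompt assets hosted as text files (URLs are placeholders).
prompts = [
    {"row_data": "https://storage.example.com/prompts/prompt-001.txt",
     "global_key": "prompt-001"},
    {"row_data": "https://storage.example.com/prompts/prompt-002.txt",
     "global_key": "prompt-002"},
]

# Write one JSON object per line (NDJSON), the shape used for text imports.
with open("prompt_import.ndjson", "w") as f:
    for row in prompts:
        f.write(json.dumps(row) + "\n")
```

Once the file is uploaded, the resulting data rows can be grouped into a batch and sent to the project as described in step 3.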

📘

Data row size limit

To view the maximum size allowed for a data row, visit our limits page.

Supported prompt formats

If you selected an LLM data generation workflow that involves generating a prompt in the editor, you will need to specify a prompt in the ontology. Each LLM data generation ontology is limited to one prompt.

Feature | Import format | Export format
Prompt - Free-form text | N/A | See payload

Supported response formats

If you selected an LLM data generation workflow that involves generating a response in the editor, you will need to specify a set of responses to use as the ontology. Below are the supported formats you may include when you are specifying responses in your ontology. Responses can be applied at the global level and/or nested within other responses. LLM data generation ontologies support multiple responses.

Feature | Import format | Export format
Response - Text | N/A | See payload
Response - Radio | N/A | See payload
Response - Checklist | N/A | See payload
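Conceptually, an LLM data generation ontology pairs its single prompt with one or more responses, which may nest sub-classifications. The sketch below is an illustrative in-memory representation only; the field names are invented for the example and are not the actual Labelbox ontology schema:

```python
# Illustrative only: these field names are invented for the sketch,
# not the actual Labelbox ontology schema.
ontology = {
    # One free-form text prompt per ontology, with optional character bounds.
    "prompt": {"kind": "free_form_text", "char_min": 10, "char_max": 500},
    # Multiple responses are supported, at the global level or nested.
    "responses": [
        {"name": "summary", "kind": "text", "char_min": 1, "char_max": 1000},
        {"name": "sentiment", "kind": "radio",
         "options": ["positive", "negative"],
         # Radio responses support nested sub-classifications.
         "nested": [{"name": "aspects", "kind": "checklist",
                     "options": ["price", "quality"]}]},
    ],
}
```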

Response - Text

Create a text response by selecting a response type of Text during ontology creation. You can optionally set a character minimum and maximum for text-type responses.
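To illustrate how the optional character minimum and maximum constrain a text response, here is a minimal client-side check. The helper is hypothetical and not part of the Labelbox SDK:

```python
def within_char_bounds(text, char_min=None, char_max=None):
    """Return True if len(text) satisfies the optional min/max bounds.

    Hypothetical helper mirroring the character minimum and maximum
    that can be set on text-type responses during ontology creation.
    """
    n = len(text)
    if char_min is not None and n < char_min:
        return False
    if char_max is not None and n > char_max:
        return False
    return True
```

For example, a response field configured with a minimum of 3 characters would reject the two-character string "hi".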

Response - Radio

Create a radio response by selecting a response type of Radio during ontology creation. Radio responses support nested sub-classifications.

Response - Checklist

Create a checklist response by selecting a response type of Checklist during ontology creation. Checklist responses support nested sub-classifications.