Offline multimodal chat evaluation
Learn how to create offline multimodal chat evaluation projects for ranking and classifying model outputs on conversation text data.
The offline multimodal chat evaluation editor allows you to evaluate generative models by importing existing conversations and adding annotations to model responses. The editor supports various data types, including text, images, videos, audio, and PDFs.
Set up offline multimodal chat evaluation projects
The following steps walk you through how to set up an offline multimodal chat evaluation project on the Labelbox platform. To learn how to set up an offline multimodal chat evaluation project using the SDK, see Multimodal chat evaluation projects.
Step 1: Create a project
- On the Annotate projects page, click the + New project button.
- Select Multimodal chat, and then select Offline multimodal chat.
- Provide a name and an optional description for your project.
Step 2: Add data
- Click the Add data button to select a conversation v2 JSON dataset or create a new dataset. Alternatively, you can import data using the SDK.
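As a rough illustration of the data this step expects, the following Python sketch assembles one conversation with a human prompt and two model responses, then checks that every referenced ID is defined. The field names (`actors`, `messages`, `rootMessageIds`, `childMessageIds`, `modelConfigName`) and the `type` string are assumptions modeled on a conversation v2 layout, and the helpers `build_conversation` and `dangling_references` are hypothetical. Confirm the exact schema against the Labelbox import format documentation before uploading.

```python
import json


def build_conversation(prompt: str, responses: dict[str, str]) -> dict:
    """Build one conversation: a human prompt plus one reply per model.

    Field names are assumptions modeled on a conversation v2 layout;
    confirm them against the import specification before uploading.
    """
    actors = {"user": {"role": "human", "metadata": {"name": "User"}}}
    messages = {}
    child_ids = []
    for index, (model_name, reply) in enumerate(responses.items(), start=1):
        actor_id = f"model-{index}"
        message_id = f"response-{index}"
        actors[actor_id] = {
            "role": "model",
            "metadata": {"modelConfigName": model_name},
        }
        messages[message_id] = {
            "actorId": actor_id,
            "content": [{"type": "text", "content": reply}],
            "childMessageIds": [],
        }
        child_ids.append(message_id)
    # The prompt is the root message; each model response is one of its children.
    messages["prompt-1"] = {
        "actorId": "user",
        "content": [{"type": "text", "content": prompt}],
        "childMessageIds": child_ids,
    }
    return {
        "type": "application/vnd.labelbox.conversational.model-chat-evaluation",
        "version": 2,
        "actors": actors,
        "messages": messages,
        "rootMessageIds": ["prompt-1"],
    }


def dangling_references(conversation: dict) -> list[str]:
    """Return message or actor IDs that are referenced but never defined."""
    messages = conversation["messages"]
    actors = conversation["actors"]
    missing = [mid for mid in conversation["rootMessageIds"] if mid not in messages]
    for message in messages.values():
        if message["actorId"] not in actors:
            missing.append(message["actorId"])
        missing.extend(cid for cid in message["childMessageIds"] if cid not in messages)
    return missing


conversation = build_conversation(
    "What is the capital of France?",
    {"model-a": "Paris.", "model-b": "The capital of France is Paris."},
)
assert dangling_references(conversation) == []
print(json.dumps(conversation, indent=2))
```

Running a check like `dangling_references` before import can catch broken message links locally, which is usually easier to debug than a failed upload.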
Step 3: Set up an ontology
Create an ontology for evaluating model responses.
The editor supports the following options:
Feature | Description | Export format |
---|---|---|
Message ranking | Rank multiple model-generated responses to determine their relative quality or relevance. | Payload |
Message selection | Select single or multiple responses that meet specific criteria. | Payload |
Message step reasoning | Break responses into steps and evaluate the accuracy of each step by marking it as correct, neutral, or incorrect. For incorrect steps, add a rewrite with a justification. | Payload |
Classification - Radio | Select one option from a predefined set. | Payload |
Classification - Checklist | Choose multiple options from a list. | Payload |
Classification - Free text | Add free text annotations. | Payload |
Classification tasks can apply globally to the entire conversation or individually to a message. They can also nest subclassification tasks.
Experimental feature
Message step reasoning is an experimental feature. Currently, you can't import step reasoning labels using the SDK.
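To make the classification options in this step concrete, here is a minimal Python sketch that models classification tasks as plain data, including scope (global or per message) and nested subclassifications, and checks a draft for common mistakes. It does not use the Labelbox SDK. The `Classification` class, the `problems` helper, and the kind and scope strings are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical labels for the three classification types listed in the table.
KINDS_WITH_OPTIONS = {"radio", "checklist"}
ALL_KINDS = KINDS_WITH_OPTIONS | {"free_text"}
SCOPES = {"global", "message"}


@dataclass
class Classification:
    name: str
    kind: str                     # "radio", "checklist", or "free_text"
    scope: str = "global"         # in this sketch, only meaningful on top-level tasks
    options: list = field(default_factory=list)
    children: list = field(default_factory=list)  # nested subclassifications


def problems(task: Classification, path: str = "") -> list:
    """Collect human-readable issues for a task and all of its nested children."""
    here = f"{path}/{task.name}"
    found = []
    if task.kind not in ALL_KINDS:
        found.append(f"{here}: unknown kind {task.kind!r}")
    if task.scope not in SCOPES:
        found.append(f"{here}: unknown scope {task.scope!r}")
    if task.kind in KINDS_WITH_OPTIONS and not task.options:
        found.append(f"{here}: {task.kind} needs at least one option")
    for child in task.children:
        found.extend(problems(child, here))
    return found


# A global radio task with a nested free-text follow-up.
quality = Classification(
    name="overall_quality",
    kind="radio",
    scope="global",
    options=["good", "acceptable", "poor"],
    children=[Classification(name="reason", kind="free_text")],
)

# A per-message checklist that is missing its options.
issues = Classification(name="issues", kind="checklist", scope="message")

print(problems(quality))  # []
print(problems(issues))   # ['/issues: checklist needs at least one option']
```

Drafting the ontology as data first makes it easier to review which tasks apply to the whole conversation and which apply to individual messages before building it in the editor.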
Step 4: Complete annotation tasks
Click the Start labeling button to add annotations to evaluate the responses. Complete all tasks in your workflow.