Offline multimodal chat evaluation

Learn how to create offline multimodal chat evaluation projects for ranking and classifying model outputs on conversation text data.

The offline multimodal chat evaluation editor allows you to evaluate generative models by importing existing conversations and adding annotations to model responses. The editor supports various data types, including text, images, videos, audio, and PDFs.

Set up offline multimodal chat evaluation projects

The following steps walk you through how to set up an offline multimodal chat evaluation project on the Labelbox platform. To learn how to set up an offline multimodal chat evaluation project using the SDK, see Multimodal chat evaluation projects.

Step 1: Create a project

  1. On the Annotate projects page, click the + New project button.

  2. Select Multimodal chat, and then select Offline multimodal chat.

  3. Provide a name and an optional description for your project.

Step 2: Add data

  1. Click the Add data button to select a conversation v2 JSON dataset or create a new dataset. Alternatively, you can import data using the SDK.
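
As a rough illustration of what one conversation might contain before import, the sketch below assembles a prompt with two model responses as a Python dictionary and serializes it to JSON. The helper `build_conversation` and every field name in it (`actors`, `messages`, `rootMessageIds`, `actorId`, `childMessageIds`) are assumptions made for illustration, not the confirmed conversation v2 schema; check the Labelbox import documentation for the exact format.

```python
import json


def build_conversation(prompt, responses):
    """Assemble one prompt with several model responses.

    The layout is hypothetical and only illustrates the idea of actors,
    messages, and parent/child links between a prompt and its responses.
    """
    actors = {"user": {"role": "human"}}
    messages = {}
    child_ids = []
    for i, (model_name, text) in enumerate(responses, start=1):
        actor_id = f"model-{i}"
        message_id = f"response-{i}"
        actors[actor_id] = {"role": "model", "name": model_name}
        messages[message_id] = {
            "actorId": actor_id,
            "content": [{"type": "text", "content": text}],
            "childMessageIds": [],
        }
        child_ids.append(message_id)
    # The prompt is the root message; each response is one of its children.
    messages["prompt-1"] = {
        "actorId": "user",
        "content": [{"type": "text", "content": prompt}],
        "childMessageIds": child_ids,
    }
    return {"actors": actors, "messages": messages, "rootMessageIds": ["prompt-1"]}


conversation = build_conversation(
    "Summarize the attached report.",
    [("model-a", "Short summary."), ("model-b", "Longer summary.")],
)
conversation_json = json.dumps(conversation, indent=2)
```

Keeping responses as children of the prompt makes it easy to see which outputs compete with each other when they are later ranked or selected.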

Step 3: Set up an ontology

Create an ontology for evaluating model responses. The editor supports the following options:

Feature | Description | Export format
--- | --- | ---
Message ranking | Rank multiple model-generated responses to determine their relative quality or relevance. | Payload
Message selection | Select single or multiple responses that meet specific criteria. | Payload
Message step reasoning | Break responses into steps and evaluate the accuracy of each step by selecting from correct, neutral, and incorrect. Add your rewrite with justification for incorrect steps. | Payload
Classification - Radio | Select one option from a predefined set. | Payload
Classification - Checklist | Choose multiple options from a list. | Payload
Classification - Free text | Add free text annotations. | Payload
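
To make the ranking and step reasoning options more concrete, here is a minimal sketch of how a reviewer's answers could be checked for consistency. The names `StepEvaluation`, `validate_ranking`, and `validate_step` are hypothetical and do not reflect the Labelbox export payload. The sketch also assumes that a rewrite and a justification are both required for an incorrect step, which the editor may or may not enforce.

```python
from dataclasses import dataclass
from typing import Optional

# The three verdicts named in the table above.
VERDICTS = {"correct", "neutral", "incorrect"}


@dataclass
class StepEvaluation:
    """One step of a model response and the reviewer's verdict (hypothetical)."""
    text: str
    verdict: str
    rewrite: Optional[str] = None
    justification: Optional[str] = None


def validate_ranking(ranking, response_ids):
    """A ranking is valid when it lists every response exactly once."""
    return sorted(ranking) == sorted(response_ids)


def validate_step(step):
    """Check the verdict; assume incorrect steps need a rewrite and a reason."""
    if step.verdict not in VERDICTS:
        return False
    if step.verdict == "incorrect":
        return bool(step.rewrite) and bool(step.justification)
    return True


ranking_ok = validate_ranking(["response-2", "response-1"], ["response-1", "response-2"])
step_ok = validate_step(
    StepEvaluation(
        text="2 + 2 = 5",
        verdict="incorrect",
        rewrite="2 + 2 = 4",
        justification="Arithmetic error in the sum.",
    )
)
step_missing_rewrite = validate_step(StepEvaluation(text="2 + 2 = 5", verdict="incorrect"))
```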

Classification tasks can apply globally to the entire conversation or individually to a message. They can also nest subclassification tasks.
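
The scope and nesting rules above can be pictured with a small data model. This is only a sketch: the `Classification` class, its `scope` values (`"global"` and `"message"`), and the `list_paths` helper are hypothetical names chosen for illustration, not the structure of a Labelbox ontology.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Classification:
    """A classification task with optional nested subclassifications."""
    name: str
    kind: str  # "radio", "checklist", or "text"
    scope: str = "global"  # "global" for the conversation, "message" for one message
    # Maps each option label to the subclassifications nested under it.
    options: Dict[str, List["Classification"]] = field(default_factory=dict)


def list_paths(task, prefix=""):
    """Return the path of a task followed by the paths of all nested tasks."""
    path = f"{prefix}{task.name}"
    paths = [path]
    for option, children in task.options.items():
        for child in children:
            paths.extend(list_paths(child, prefix=f"{path} > {option} > "))
    return paths


# A message-level radio task whose "Bad" option opens a nested checklist.
quality = Classification(
    name="Response quality",
    kind="radio",
    scope="message",
    options={
        "Good": [],
        "Bad": [
            Classification(
                name="Problems",
                kind="checklist",
                scope="message",
                options={"Inaccurate": [], "Off topic": []},
            )
        ],
    },
)

# A global free-text task that applies to the whole conversation.
tone = Classification(name="Overall tone", kind="text", scope="global")

quality_paths = list_paths(quality)
```

Here `quality_paths` holds the top-level task and the nested checklist reached through the "Bad" option, which mirrors how a subclassification only appears after its parent option is chosen.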

Experimental feature

Message step reasoning is an experimental feature. Currently, you can't import step reasoning labels using the SDK.


Step 4: Complete annotation tasks

Click the Start labeling button to add annotations to evaluate the responses. Complete all tasks in your workflow.