Multimodal chat evaluation

A guide to creating live and offline multimodal chat evaluation projects for ranking and classifying model outputs on conversation data.

The multimodal chat evaluation editor allows you to evaluate generative model responses across multiple data types, including text, images, videos, audio, and PDFs. You can select up to 10 different models, including common foundation models and your custom models, to rate and rank their responses for:

  • Model comparison: Evaluate and compare model configurations for a given use case and decide which one to pick. You can also compare model outputs directly from prompts using predefined criteria to perform a ranking or selection.
  • RLHF (reinforcement learning with human feedback): Create preference data for training a reward model for RLHF based on multiple outputs from a single model or from different models.

The editor has the following modes:

  • Live multimodal chat supports live, multi-turn conversations with models for evaluation. Unlike other editors, your team can type prompts to trigger inference and return model outputs without importing data into a dataset.
  • Offline multimodal chat allows you to import existing conversations for annotating model responses.

Supported annotation types

Feature | Export format
Message ranking | See payload
Message selection | See payload
Classification - Radio (global or message-based) | See payload
Classification - Checklist (global or message-based) | See payload
Classification - Free text | See payload
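
Each of these annotation types has a corresponding export payload that you receive when exporting labels from the project. If you export with the Labelbox Python SDK, a minimal sketch looks roughly like the following; the project ID is a placeholder, and the exact structure of each payload is documented on the linked payload pages:

```python
import labelbox as lb

client = lb.Client(api_key="YOUR_API_KEY")
project = client.get_project("YOUR_PROJECT_ID")  # placeholder project ID

# Export labeled data rows; each exported row carries the annotation payloads
# (ranking, selection, and classification results) created in this project.
export_task = project.export_v2(
    params={"data_row_details": True, "project_details": True}
)
export_task.wait_till_done()

for data_row in export_task.result:
    print(data_row)
```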

Set up multimodal chat evaluation projects

The following steps walk you through how to initialize a multimodal chat evaluation project on the Labelbox UI:

  1. On the Annotate projects page, click the + New project button.
  2. Select Multimodal chat, and then choose either Live multimodal chat or Offline multimodal chat, depending on your task type.
  3. Provide a name and an optional description for your project.
  • For a live multimodal chat project, configure the source of your data by creating a new dataset or choosing an existing one, and then select how many data rows you want to generate.

From there, you can continue the setup following the steps for live projects and offline projects.
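
If you prefer to script project creation with the Labelbox Python SDK, the general shape is sketched below. It uses the generic create_project call with a conversational media type as an assumption; multimodal chat projects may require a dedicated creation method in newer SDK versions, so check the SDK reference for your release:

```python
import os
import labelbox as lb

# Assumes your API key is stored in the LABELBOX_API_KEY environment variable.
client = lb.Client(api_key=os.environ["LABELBOX_API_KEY"])

# Generic project creation; the media type below is an assumption for
# conversation-based projects -- verify against the current SDK reference.
project = client.create_project(
    name="Live multimodal chat evaluation",
    description="Rank and classify model outputs on conversation data",
    media_type=lb.MediaType.Conversational,
)
print(project.uid)
```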

Set up live multimodal chat projects

After initializing a multimodal chat project, use the following steps to set up a live multimodal chat project.

This video shows an example of setting up a live multimodal chat project that selects models, sends prompts, and evaluates responses from common foundation models:

Step 1: Select models

Click the + Add model button to select models for your output evaluation project. Depending on your model selection, you can attach images, videos, and documents (PDF) to your prompt.

Currently, you can choose from the following foundation models integrated by Foundry:

Model | Attachment types
Google Gemini 1.5 Pro | Image, video, and document (PDF)
Google Gemini 1.5 Flash | Image, video, and document (PDF)
Google Gemini Pro | N/A
Llama 3 70B Instruct | N/A
OpenAI GPT 4 | N/A
Claude 3 Opus | Image and document (PDF)
Claude 3 Haiku | Image and document (PDF)
Claude 3.5 Sonnet | Image and document (PDF)
OpenAI GPT-4o | Image and document (PDF)
OpenAI GPT4 Visual | Image and document (PDF)

Step 2: Configure models

Once you have chosen a model, you are prompted to select an existing Model configuration or create a new one. You can either provide a name for the configuration or use the auto-generated one; each model configuration name must be unique. A model configuration stores your model's configuration attributes and can be reused later, which lets you evaluate your configurations by prompting the model directly through the send a message text input.

You can render your content in markdown or text format. Use the markdown/text toggle in the editor to switch the view; this option also formats model output accordingly.

📘 LaTeX support

To add LaTeX formatting, wrap your math expressions in backticks and dollar signs. The editor supports both inline and block LaTeX formatting. For example, to add LaTeX formatting for x = 2, write ```$$x = 2$$```.
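
For instance, assuming inline expressions use single dollar signs and block expressions use double dollar signs (verify the rendering in your own project), you could write:

  • Inline: ```$E = mc^2$```
  • Block: ```$$x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$$```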

Repeat the model configuration process to add between 1 and 10 models. You can use the same model with different parameters if your use case focuses on evaluating a specific model. See limits for your account limits.

Use the ellipsis menu to Edit, Duplicate, or Remove a model selection.

Add attachment

Depending on the model capability, you can attach image, video, text, and PDF files to your prompts. To add one or more attachments, click the paper clip icon at the prompt level and select either Add from a public link or Upload from computer. Enter a valid URL for a supported attachment type or upload a valid local file, click Save, and repeat to add more attachments if necessary.

Step 3: Set up an ontology

Create an ontology for evaluating model responses, like the following example:
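
If you manage ontologies programmatically, a rough sketch with the Labelbox Python SDK is shown below. It only covers classification features with hypothetical names and options; message ranking and message selection tools are configured through the ontology editor in the UI and are not reproduced here:

```python
import labelbox as lb

client = lb.Client(api_key="YOUR_API_KEY")

# Classification features for evaluating responses; names and options are hypothetical.
ontology_builder = lb.OntologyBuilder(
    classifications=[
        lb.Classification(
            class_type=lb.Classification.Type.RADIO,
            name="response_quality",
            scope=lb.Classification.Scope.GLOBAL,  # global classification
            options=[lb.Option(value="good"), lb.Option(value="bad")],
        ),
        lb.Classification(
            class_type=lb.Classification.Type.CHECKLIST,
            name="issues",
            scope=lb.Classification.Scope.INDEX,  # message-based classification
            options=[lb.Option(value="hallucination"), lb.Option(value="formatting")],
        ),
        lb.Classification(
            class_type=lb.Classification.Type.TEXT,
            name="feedback",
        ),
    ]
)

# Media type is an assumption for multimodal chat projects -- check the SDK reference.
ontology = client.create_ontology(
    "Multimodal chat evaluation ontology",
    ontology_builder.asdict(),
    media_type=lb.MediaType.Conversational,
)
print(ontology.uid)
```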

Step 4: Complete model setup

Click the Complete setup button to lock in your selection of models and generate the labeling queue. Once you complete setup, you can no longer alter or remove model selections.

Step 5: Complete annotation tasks

Click the Start labeling button to chat with your configured models and add annotations to evaluate the responses. The editor supports the following annotation options:

  • Message ranking
  • Message selection
  • Radio classification (global or message-based)
  • Checklist classification (global or message-based)
  • Free text classification

You can continue to prompt models after the initial input by sending a new prompt. If you have made a mistake in your prompt or encountered a blocker, you can reset your prompt and the model outputs.

Complete all tasks in your workflow.

Set up offline multimodal chat evaluation projects

After initializing a multimodal chat project, use the following steps to set up an offline multimodal chat project:

  1. Click the Add data button to select a conversation v2 JSON dataset or create a new dataset. (You can also create the dataset programmatically; see the SDK sketch after this list.)
  2. Set up an ontology for evaluating model responses, like the following example:
    Example multimodal chat ontology
  3. Click the Start labeling button to add annotations to evaluate the responses. The editor supports the following annotation options:
    • Message ranking
    • Message selection
    • Radio classification (global or message-based)
    • Checklist classification (global or message-based)
    • Free text classification
  4. Complete all tasks in your workflow.
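
As referenced in step 1, creating the dataset can also be scripted. A minimal sketch with the Labelbox Python SDK, assuming your conversation v2 JSON files are hosted at URLs the platform can access (the URL and global key below are hypothetical placeholders):

```python
import labelbox as lb

client = lb.Client(api_key="YOUR_API_KEY")

# Create a dataset and attach conversation v2 JSON files by URL.
dataset = client.create_dataset(name="offline-multimodal-chat-conversations")
task = dataset.create_data_rows(
    [
        {
            # Hypothetical hosting location of a conversation v2 JSON file.
            "row_data": "https://storage.example.com/conversations/conversation-001.json",
            "global_key": "conversation-001",
        }
    ]
)
task.wait_till_done()
print(task.errors)  # None if all rows were created successfully
```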

Analytics view for annotations

Given the unique nature of the annotations for this editor, there are additional metrics to provide insights into the project overview.

Ranking

Depending on the number of model outputs you have configured, a horizontal ranking graph provides a visual analysis of the LLM (large language model) win rate. For each position, the graph shows how often a model output was chosen; a longer bar means that output was chosen more often in that position.
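
As a rough illustration of how a per-position count and win rate can be derived, the sketch below uses hypothetical ranking annotations where each entry lists model outputs from best to worst:

```python
from collections import Counter

# Hypothetical ranking annotations: each list orders model outputs from rank 1 (best) down.
rankings = [
    ["model_a", "model_b", "model_c"],
    ["model_b", "model_a", "model_c"],
    ["model_a", "model_c", "model_b"],
]

# Count how often each model output lands in each position (1-indexed).
position_counts = Counter(
    (model, position)
    for ranking in rankings
    for position, model in enumerate(ranking, start=1)
)

# Win rate: share of rankings in which a model output was ranked first.
total = len(rankings)
for model in sorted({m for ranking in rankings for m in ranking}):
    win_rate = position_counts[(model, 1)] / total
    print(f"{model}: win rate {win_rate:.0%}")
```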

Selection

You can also use selections with different defined topics to further your use case.

Model variance histograms

This chart shows how often a model wins each position, along with a variance chart that shows how consistently it holds that position compared with other models.
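
As a simple illustration with hypothetical data, a model's average position and the variance around it could be computed like this; a lower variance means the model holds its position more consistently across conversations:

```python
from statistics import mean, pvariance

# Hypothetical positions one model received across labeled conversations.
positions = [1, 1, 2, 1, 3, 1, 2]

print(f"mean position: {mean(positions):.2f}")
print(f"variance:      {pvariance(positions):.2f}")  # lower = more consistent
```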