Live multimodal chat evaluation

A guide to creating a live multimodal chat evaluation project for ranking and classifying model outputs on conversational text data.

When creating a project, select Live multimodal chat


With the live multimodal chat evaluation editor, you can create human tasks and rank data for model comparison or RLHF (reinforcement learning from human feedback). You can also compare model outputs generated directly from the user prompt, using predefined criteria to rank or select them.



This editor solves two important problems that are critical to ensuring responsible AI aligned with human preferences:

  • Model comparison: Evaluate and compare model configurations for a given use case and decide which one to use.
  • RLHF: Create preference data for training a reward model, based on multiple outputs from a single model or from different models (up to 10).
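As an illustration of the RLHF case, a single ranking can be expanded into pairwise preferences, which is a common input for reward-model training. This is a minimal sketch: the function name and the dictionary keys (`prompt`, `chosen`, `rejected`) are assumptions made for illustration and are not the Labelbox export format.

```python
from itertools import combinations


def ranking_to_preference_pairs(prompt, ranked_outputs):
    """Expand one ranking (best output first) into pairwise preferences.

    The keys "prompt", "chosen", and "rejected" are hypothetical names,
    not a Labelbox schema.
    """
    pairs = []
    # combinations() keeps input order, so `better` always outranks `worse`.
    for better, worse in combinations(ranked_outputs, 2):
        pairs.append({"prompt": prompt, "chosen": better, "rejected": worse})
    return pairs


example = ranking_to_preference_pairs(
    "Summarize the article in one sentence.",
    ["output from model B", "output from model A", "output from model C"],
)
print(len(example))  # 3 ranked outputs yield 3 pairs
```

Ranking n outputs yields n * (n - 1) / 2 pairs, so a ranking of 10 outputs produces 45 preference pairs from one prompt.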

Unlike other editors, this one takes full advantage of Foundry: you do not need to import data, because the user types the prompts and Labelbox runs inference on each prompt and returns the model outputs.

Supported annotation types

Feature | Export format
Message ranking | See payload
Message selection | See payload
Classification - Radio (Global or message-based) | See payload
Classification - Checklist (Global or message-based) | See payload
Classification - Free text | See payload

Set up a Live multimodal chat evaluation project

In this version of live multimodal chat evaluation, Labelbox provides inference through Foundry; you can also use your own custom model integration.

Step 1: Choose Live multimodal chat

In the Select project type modal, select Live multimodal chat.


Step 2: Create a Live multimodal chat evaluation project

Provide a name and an optional description for your project. Then configure where your data rows are sourced from (either create a new dataset or choose an existing one) and select how many data rows to generate.


Step 3: Select models

Select the models you want to evaluate.

Note: Depending on your foundation model selection, you can attach images, videos, and documents (PDF) to your prompt.

Currently, you can choose from the following Foundry models:

Model | Attachment types
Google Gemini 1.5 Pro | Image, video, and document (PDF)
Google Gemini 1.5 Flash | Image, video, and document (PDF)
Google Gemini Pro | N/A
Llama 3 70B Instruct | N/A
OpenAI GPT 4 | N/A
Claude 3 Opus | Image and document (PDF)
Claude 3 Haiku | Image and document (PDF)
Claude 3.5 Sonnet | Image and document (PDF)
OpenAI GPT-4o | Image and document (PDF)
OpenAI GPT4 Visual | Image and document (PDF)

For the input, the user creates a text prompt, with or without attachments.
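Because attachment support differs by model, it can help to check a planned prompt against the list above before sending it. The sketch below copies the model names and attachment types from that list into a plain dictionary; the dictionary, the function name, and the short type labels (`image`, `video`, `document`) are assumptions for illustration, not part of any Labelbox API.

```python
# Attachment support per model, transcribed from the list above.
# An empty set corresponds to "N/A" (text-only prompts).
SUPPORTED_ATTACHMENTS = {
    "Google Gemini 1.5 Pro": {"image", "video", "document"},
    "Google Gemini 1.5 Flash": {"image", "video", "document"},
    "Google Gemini Pro": set(),
    "Llama 3 70B Instruct": set(),
    "OpenAI GPT 4": set(),
    "Claude 3 Opus": {"image", "document"},
    "Claude 3 Haiku": {"image", "document"},
    "Claude 3.5 Sonnet": {"image", "document"},
    "OpenAI GPT-4o": {"image", "document"},
    "OpenAI GPT4 Visual": {"image", "document"},
}


def unsupported_attachments(model, attachment_types):
    """Return the requested attachment types the model cannot accept."""
    return sorted(set(attachment_types) - SUPPORTED_ATTACHMENTS[model])


print(unsupported_attachments("Claude 3 Opus", ["image", "video"]))
```

When several models are compared on the same prompt, running this check for each selected model shows which attachments every model in the comparison can accept.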

Select a model from Foundry


Once you have chosen a model, you will be prompted to choose a model configuration or create a new one.
A model configuration stores your model's configuration attributes and can be reused later.

From this view, you can test your model configuration by prompting the model directly in the "send a message" text input.

Repeat this process for each model you want to add. You need at least 1 model and can add up to 10.
You can use the same model with different parameters if your use case focuses on evaluating a specific model.
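The selection rules above (between 1 and 10 entries, with the same model allowed more than once under different parameters) can be sketched as a small check. Everything here is hypothetical: the `ModelSelection` class, the configuration names, and the `temperature` parameter are stand-ins for whatever attributes your model configurations actually store.

```python
from dataclasses import dataclass

MIN_MODELS = 1
MAX_MODELS = 10


@dataclass(frozen=True)
class ModelSelection:
    # Hypothetical fields; real model configurations may store other attributes.
    model: str
    config_name: str
    temperature: float


def validate_selections(selections):
    """Return the number of selections, or raise if it is outside 1 to 10."""
    count = len(selections)
    if not MIN_MODELS <= count <= MAX_MODELS:
        raise ValueError(
            f"Expected {MIN_MODELS} to {MAX_MODELS} model selections, got {count}"
        )
    return count


# The same model appears twice with different parameters.
selections = [
    ModelSelection("Claude 3.5 Sonnet", "sonnet-precise", temperature=0.2),
    ModelSelection("Claude 3.5 Sonnet", "sonnet-creative", temperature=0.9),
]
print(validate_selections(selections))  # prints 2
```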

Use the ellipsis to Edit, Duplicate, or Remove a model selection

Step 4: Set up an ontology

Create an ontology for evaluating each model's responses. Below is an example of an ontology for a Live multimodal chat evaluation project.

Step 5: Complete setup

This step is mandatory and locks in your selection of models; once you click Complete setup, you can no longer alter or remove the model selection.
Note: The labeling queue will not be generated until Complete setup has been clicked.

Live multimodal chat evaluation editor specifics

Toggle markdown view

You can render your content in markdown or text format. Use this toggle in the editor to switch the view. This option also allows you to format model output accordingly.

Add attachment

To add one or more attachments, use the paper clip icon at the prompt level and provide a public URL for each attachment.
Note: Images, videos, and documents (PDF) are supported directly from the URL.

Once you have entered a valid URL for a supported attachment type, click Save. Repeat as needed.
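Before pasting a URL into the editor, a quick local check can catch obvious problems. The sketch below uses only the Python standard library to guess the attachment type from the URL's file extension. It is an assumption-laden illustration: the function name and returned labels are made up, the editor's own validation rules are not documented here, and the check does not verify that the URL is actually publicly reachable.

```python
import mimetypes
from urllib.parse import urlparse


def classify_attachment_url(url):
    """Guess "image", "video", or "document" from a URL, else return None.

    Only the scheme, host, and file extension are inspected; nothing is
    downloaded, so reachability is not checked.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return None
    # Use the path only, so query strings do not hide the extension.
    mime, _ = mimetypes.guess_type(parsed.path)
    if mime is None:
        return None
    if mime.startswith("image/"):
        return "image"
    if mime.startswith("video/"):
        return "video"
    if mime == "application/pdf":
        return "document"
    return None


print(classify_attachment_url("https://example.com/files/report.pdf?v=2"))
```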

Multi-turn prompt, model output

You can continue to prompt models after the initial input by sending a new prompt.

Reset prompt and model outputs

If you have made a mistake in your prompt or encountered a blocker, you can reset your prompt and the model outputs.

Analytics view for annotations

Given the unique nature of the annotations for this editor, there are additional metrics to provide insights into the project overview.

Ranking

Depending on the number of model outputs you have configured, a horizontal ranking graph will appear to provide a visual analysis of LLM (large language model) win rate.

For each position, the graph shows how often a model output was chosen. A longer bar means it was chosen more often in that position.
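The counts behind such a graph can be reproduced from a list of rankings. The following is a minimal sketch with made-up data: each ranking lists model names from best to worst, and the function names are hypothetical. It shows one way to tally positions and compute a win rate, and is not necessarily how Labelbox computes the chart.

```python
from collections import Counter, defaultdict


def position_counts(rankings):
    """Count how often each model lands in each position (1 = best)."""
    counts = defaultdict(Counter)
    for ranking in rankings:
        for position, model in enumerate(ranking, start=1):
            counts[position][model] += 1
    return counts


def win_rate(rankings, model):
    """Fraction of rankings in which `model` is ranked first."""
    if not rankings:
        raise ValueError("at least one ranking is required")
    wins = sum(1 for ranking in rankings if ranking and ranking[0] == model)
    return wins / len(rankings)


# Hypothetical rankings from four labeled conversations.
rankings = [
    ["A", "B", "C"],
    ["B", "A", "C"],
    ["A", "C", "B"],
    ["A", "B", "C"],
]
counts = position_counts(rankings)
print(counts[1]["A"], win_rate(rankings, "A"))  # prints 3 0.75
```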

Selection

You can also set up message selection with different defined topics to suit your use case.

Model variance histograms

This chart shows a model's position in terms of how often it wins, with a variance chart showing how consistent it is in that position against other models.
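One plausible way to quantify this kind of consistency is the mean and variance of the positions a model receives across rankings. The exact statistic behind the chart is not stated here, so treat the sketch below, including its function name and sample data, as an assumption for illustration only.

```python
from statistics import mean, pvariance


def position_stats(rankings):
    """Return {model: (mean position, population variance of position)}."""
    positions = {}
    for ranking in rankings:
        for position, model in enumerate(ranking, start=1):
            positions.setdefault(model, []).append(position)
    return {
        model: (mean(values), pvariance(values))
        for model, values in positions.items()
    }


# Hypothetical rankings, best model first.
sample_rankings = [
    ["A", "B", "C"],
    ["B", "A", "C"],
    ["A", "C", "B"],
    ["A", "B", "C"],
]
stats = position_stats(sample_rankings)
for model, (avg, var) in sorted(stats.items()):
    print(model, avg, var)
```

A lower mean position indicates a model that tends to rank higher, and a lower variance indicates that it holds that position more consistently.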