Live multimodal chat evaluation
A guide to creating a live multimodal chat evaluation project for ranking & classifying model outputs on conversation text data.
When creating a project, select Live multimodal chat.
![](https://files.readme.io/040bb13-Labelbox_2024-06-04_10-36-35.jpg)
With the live multimodal chat evaluation editor, you can create human tasks and rank data for model comparison or RLHF (reinforcement learning from human feedback). You can also compare model outputs directly from the user prompt using predefined criteria to perform a ranking or selection.
This editor solves two important problems that are critical to ensuring responsible AI aligned with human preferences:
- Model comparison: Evaluate and compare model configurations for a given use case and decide which one to pick
- RLHF: Create preference data for training a reward model, based on multiple outputs from a single model or from different models (up to 10)
Because this editor takes full advantage of Foundry, and unlike any other editor, you do not need to import data: the user types the prompts, and Labelbox runs inference on each prompt and returns the model outputs.
Supported annotation types
| Feature | Export format |
| --- | --- |
| Message ranking | See payload |
| Message selection | See payload |
| Classification - Radio (Global or message-based) | See payload |
| Classification - Checklist (Global or message-based) | See payload |
| Classification - Free text | See payload |
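To illustrate how a ranking annotation might be consumed after export, the sketch below parses a hypothetical message-ranking payload. The field names (`annotations`, `ranked_messages`, `message_id`, `order`) are illustrative assumptions, not the exact Labelbox export schema; see the export payload documentation for the authoritative format.

```python
import json

# Hypothetical message-ranking annotation; field names are illustrative
# assumptions, not the exact Labelbox export schema.
export_payload = json.loads("""
{
  "annotations": [
    {
      "name": "response_ranking",
      "type": "message-ranking",
      "value": {
        "ranked_messages": [
          {"message_id": "model-output-2", "order": 1},
          {"message_id": "model-output-1", "order": 2}
        ]
      }
    }
  ]
}
""")

def best_message(payload: dict) -> str:
    """Return the message_id that was ranked first (order == 1)."""
    for annotation in payload["annotations"]:
        if annotation["type"] == "message-ranking":
            ranked = annotation["value"]["ranked_messages"]
            top = min(ranked, key=lambda m: m["order"])
            return top["message_id"]
    raise ValueError("no message-ranking annotation found")

print(best_message(export_payload))  # model-output-2
```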
Set up a Live multimodal chat evaluation project
For this version of the live multimodal chat evaluation, Labelbox offers inferencing capability via Foundry; you can also use a custom model integration.
Step 1: Choose Live multimodal chat
In the Select project type modal, select Live multimodal chat.
Step 2: Create a Live multimodal chat evaluation project
Provide a name and an optional description for your project, configure how your data rows will be sourced (create a new dataset or choose an existing one), then select how many data rows to generate.
![](https://files.readme.io/19b72d1-image.png)
Step 3: Select models
Select the models you would like to use for your evaluation.
Note: Depending on your foundation model selection, you can attach images, videos, and documents (PDF) to your prompt.
Currently, you can choose from the following models in Foundry:

| Model | Attachment types |
| --- | --- |
| Google Gemini 1.5 Pro | Image, video, and document (PDF) |
| Google Gemini 1.5 Flash | Image, video, and document (PDF) |
| Google Gemini Pro | N/A |
| Llama 3 70B Instruct | N/A |
| OpenAI GPT-4 | N/A |
| Claude 3 Opus | N/A |
| Claude 3 Haiku | Image and document (PDF) |
| Claude 3 Sonnet | Image and document (PDF) |
| OpenAI GPT-4o | Image and document (PDF) |
| OpenAI GPT-4 Visual | Image and document (PDF) |
For the input, a user is expected to create a text prompt, optionally with attachments.
![Select a model from Foundry](https://files.readme.io/1285c18-image.png)
Select a model from Foundry
Once you have chosen a model, you will be prompted to choose a Model configuration or create a new one.
A model configuration stores your model's configuration attributes and can be reused at a later stage.
From this view, you can evaluate your model configuration by prompting the model directly via the Send a message text input.
Repeat this process to select at least 1 and up to 10 models.
You can use the same model with different parameters if your use case focuses on evaluating a specific model.
![Use the ellipsis to Edit, Duplicate, or Remove a model selection](https://files.readme.io/8e04c12-image.png)
Use the ellipsis to Edit, Duplicate, or Remove a model selection
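Conceptually, a model configuration pairs a model with a set of inference parameters, and the same model can appear more than once with different parameters. The sketch below represents two configurations of one model as plain dictionaries; the parameter names (`temperature`, `max_tokens`) are illustrative assumptions, not the exact attributes Labelbox stores.

```python
# Illustrative only: parameter names are assumptions, not the exact
# attributes Labelbox stores for a model configuration.
def make_config(name: str, model: str, **params) -> dict:
    return {"name": name, "model": model, "params": params}

configs = [
    make_config("gpt4o-precise", "OpenAI GPT-4o", temperature=0.2, max_tokens=1024),
    make_config("gpt4o-creative", "OpenAI GPT-4o", temperature=0.9, max_tokens=1024),
]

# The same model may be selected repeatedly with different parameters,
# up to the 10-configuration limit; configuration names stay unique.
assert len(configs) <= 10
assert len({c["name"] for c in configs}) == len(configs)
```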
Step 4: Set up an ontology
Create an ontology for evaluating the model responses on each model output. Below is an example of an ontology for a Live multimodal chat evaluation project.
![](https://files.readme.io/75c49ca-image.png)
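As a rough sketch of what such an ontology covers, the dictionary below combines the classification types from the supported-annotations table (radio, checklist, and free text, each global or message-based). It is an illustrative approximation only; the structure and field names are assumptions, not the exact normalized ontology JSON that Labelbox uses.

```python
# Illustrative approximation of an ontology definition; structure and
# field names are assumptions, not the exact Labelbox normalized JSON.
ontology = {
    "classifications": [
        {
            "type": "radio",
            "name": "response_quality",
            "scope": "message",  # message-based rather than global
            "options": ["good", "neutral", "bad"],
        },
        {
            "type": "checklist",
            "name": "issues",
            "scope": "global",
            "options": ["hallucination", "refusal", "formatting"],
        },
        {"type": "text", "name": "free_form_feedback", "scope": "global"},
    ]
}

radio = next(c for c in ontology["classifications"] if c["type"] == "radio")
print(radio["options"])  # ['good', 'neutral', 'bad']
```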
Step 5: Complete setup
This step is mandatory and locks in your selection of models; once you click Complete setup, you can no longer alter or remove the selected models.
Note: The labeling queue will not be generated until Complete setup has been clicked.
![](https://files.readme.io/12fb708-Labelbox_2024-05-24_13-25-001.jpg)
Live multimodal chat evaluation editor specifics
Toggle markdown view
You can render your content in markdown or plain text. Use this toggle in the editor to switch views; model output is formatted accordingly.
![](https://files.readme.io/eee3480-Monosnap_2024-05-24_13-30-21.jpg)
Add attachment
To add one or more attachments, use the paper clip icon at the prompt level and provide a public URL for each attachment.
Note: Images, videos, and documents (PDF) are supported directly from the URL.
![](https://files.readme.io/39da2e4-Labelbox_Editor_Model_evaluation_101_2024-05-29_17-58-331.jpg)
Once you have entered a valid URL for a supported attachment type, click Save, and repeat as needed.
Multi-turn prompts and model outputs
You can continue to prompt models after the initial input by sending a new prompt.
Reset prompt and model outputs
If you have made a mistake in your prompt or encountered a blocker, you can reset your prompt and the model outputs.
Analytics view for annotations
Because this editor produces unique annotation types, the project overview includes additional metrics for insight.
Ranking
Depending on the number of model outputs you have configured, a horizontal ranking graph provides a visual analysis of LLM (large language model) win rate.
The graph shows, for each position, how often a model's output was chosen; a longer bar means the output was chosen more often in that position.
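The per-position counts behind such a graph can be tallied directly from ranking annotations. A minimal sketch, assuming each label is simply a list of model names ordered from best to worst:

```python
from collections import Counter

# Each label is a list of model names ordered best-to-worst (assumed shape).
labels = [
    ["model-a", "model-b", "model-c"],
    ["model-a", "model-c", "model-b"],
    ["model-b", "model-a", "model-c"],
]

def position_counts(labels: list[list[str]]) -> dict[int, Counter]:
    """For each rank position, count how often each model was placed there."""
    counts: dict[int, Counter] = {}
    for ranking in labels:
        for position, model in enumerate(ranking, start=1):
            counts.setdefault(position, Counter())[model] += 1
    return counts

counts = position_counts(labels)
print(counts[1])  # first place: model-a twice, model-b once
```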
Selection
You can also configure a selection with different predefined topics to suit your use case.
Model variance histograms
This chart shows how often a model wins each position, with a variance chart showing how consistently it holds that position against other models.
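The consistency described above can be quantified as the variance of each model's position across labels. A minimal sketch, using the same assumed best-to-worst ranking shape as the win-rate example (a lower mean position is better; a lower variance means more consistent placement):

```python
from statistics import mean, pvariance

# Each label is a list of model names ordered best-to-worst (assumed shape).
labels = [
    ["model-a", "model-b", "model-c"],
    ["model-a", "model-c", "model-b"],
    ["model-b", "model-a", "model-c"],
]

def position_stats(labels):
    """Mean position and population variance per model (lower mean = better)."""
    positions = {}
    for ranking in labels:
        for position, model in enumerate(ranking, start=1):
            positions.setdefault(model, []).append(position)
    return {m: (mean(p), pvariance(p)) for m, p in positions.items()}

for model, (avg, var) in sorted(position_stats(labels).items()):
    print(f"{model}: mean position {avg:.2f}, variance {var:.2f}")
```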