Live multimodal chat evaluation
Learn how to create live multimodal chat evaluation projects for ranking and classifying model outputs through live, multi-turn conversations.
The live multimodal chat evaluation editor allows you to evaluate generative models by comparing their responses in live, multi-turn conversations. You can select up to 10 models, including popular foundation models and custom models, input prompts to trigger model outputs, and add labels to rate, rank, and refine their performance. The editor supports various data types, including text, images, videos, audio, and PDFs.
Set up live multimodal chat evaluation projects
The following steps walk you through how to set up a live multimodal chat evaluation project on the Labelbox platform. To learn how to set up a live multimodal chat evaluation project using the SDK, see Multimodal chat evaluation projects.
Step 1: Create a project
- On the Annotate projects page, click the + New project button.
- Select Multimodal chat, and then select Live multimodal chat.
- Provide a name and an optional description for your project.
- Configure the data source by selecting from:
  - Create a new dataset or Append data to existing dataset, and then specify the number of data rows you want to generate.
  - None of the above to skip generating data rows during project creation and generate them later.
Step 2: Add data
If you didn't generate data rows during project creation or want to add more data rows, select Add data from Catalog or Generate new data.
If you choose to generate new data, select Create a new dataset or Append data to existing dataset, and then specify the name of the dataset and the number of data rows you want to generate.
Step 3: Select models
Click the + Add model button to select models for your output evaluation project. Depending on your model selection, you can attach images, videos, and documents (PDF) to your prompt.
Currently, you can choose from the following foundation models integrated through Foundry:
Model | Supported attachment types |
---|---|
Google Gemini 1.5 Pro | Image, video, and document (PDF) |
Google Gemini 1.5 Flash | Image, video, and document (PDF) |
Google Gemini Pro | N/A |
Llama 3 70B Instruct | N/A |
OpenAI GPT 4 | N/A |
Claude 3 Opus | Image and document (PDF) |
Claude 3 Haiku | Image and document (PDF) |
Claude 3.5 Sonnet | Image and document (PDF) |
OpenAI GPT-4o | Image and document (PDF) |
OpenAI GPT4 Visual | Image and document (PDF) |
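If you prepare prompts in a script before pasting them into the editor, the table above can be mirrored as a simple lookup. The following is a minimal sketch, not part of the Labelbox SDK: `SUPPORTED_ATTACHMENTS` and `can_attach` are hypothetical names, and the values are copied from the table as listed here, so they may lag behind the platform.

```python
# Hypothetical lookup that mirrors the attachment table above.
# Not a Labelbox SDK API; values are copied from the table and may be outdated.
SUPPORTED_ATTACHMENTS = {
    "Google Gemini 1.5 Pro": {"image", "video", "pdf"},
    "Google Gemini 1.5 Flash": {"image", "video", "pdf"},
    "Google Gemini Pro": set(),
    "Llama 3 70B Instruct": set(),
    "OpenAI GPT 4": set(),
    "Claude 3 Opus": {"image", "pdf"},
    "Claude 3 Haiku": {"image", "pdf"},
    "Claude 3.5 Sonnet": {"image", "pdf"},
    "OpenAI GPT-4o": {"image", "pdf"},
    "OpenAI GPT4 Visual": {"image", "pdf"},
}


def can_attach(model: str, attachment_type: str) -> bool:
    """Return True if the table lists attachment_type for the given model."""
    if model not in SUPPORTED_ATTACHMENTS:
        raise KeyError(f"Unknown model: {model}")
    return attachment_type.lower() in SUPPORTED_ATTACHMENTS[model]


# Example: check a planned attachment before writing the prompt.
print(can_attach("Claude 3.5 Sonnet", "image"))
print(can_attach("Llama 3 70B Instruct", "pdf"))
```

A set per model keeps the check a single membership test, and an unknown model name raises an error instead of silently returning `False`.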
Step 4: Configure models
After you choose a model, you are prompted to select an existing model configuration or create a new one. You can either enter a name for the configuration or use the auto-generated one. Each model configuration name must be unique. A model configuration stores the model's configuration attributes so that you can reuse it later. To evaluate a configuration, prompt the model directly using the "send a message" text input.
You can render your content in Markdown or text format. Use the toggle in the editor to switch between the two views. This option also allows you to format model output accordingly.
Markdown editor size limit
When using the Markdown editor, limit the character count to fewer than 6,000 characters.
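If you draft responses outside the editor, a quick length check can catch oversized text before you paste it in. This is a minimal sketch under one assumption: it counts characters with Python's `len`, which may not match exactly how the editor counts them. `MARKDOWN_CHAR_LIMIT` and `fits_markdown_editor` are hypothetical names.

```python
# Limit taken from the note above: stay under 6,000 characters.
MARKDOWN_CHAR_LIMIT = 6000


def fits_markdown_editor(text: str) -> bool:
    """Return True if text is under the documented character limit.

    Assumption: characters are counted as Python code points via len().
    """
    return len(text) < MARKDOWN_CHAR_LIMIT


draft = "## Answer\n\nThe result is **42**."
print(fits_markdown_editor(draft))
```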
LaTeX support
To add LaTeX formatting, wrap your math expressions in backticks and dollar signs. The editor supports both inline and block LaTeX formatting. For example, to render x = 2, enter ```$$x = 2$$```.
Repeat the model configuration process until you have between 1 and 10 models. If your use case focuses on evaluating a specific model, you can add the same model several times with different parameters. For the limits that apply to your account, see limits.
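The two rules stated in this step (1 to 10 models, unique configuration names) can be checked before you build the selection in the UI. The sketch below is illustrative only: `ModelConfig`, `validate_model_configs`, and the `temperature` parameter are hypothetical and are not Labelbox SDK objects, and your account limit may be lower than 10.

```python
from dataclasses import dataclass, field
from typing import Dict, List

MAX_MODELS = 10  # upper bound stated in this guide; account limits may differ


@dataclass
class ModelConfig:
    """Hypothetical record of one model configuration."""
    name: str    # configuration name, must be unique
    model: str   # model to evaluate
    params: Dict[str, float] = field(default_factory=dict)


def validate_model_configs(configs: List[ModelConfig]) -> List[str]:
    """Return a list of problems; an empty list means the selection is valid."""
    problems = []
    if not 1 <= len(configs) <= MAX_MODELS:
        problems.append(f"Expected 1 to {MAX_MODELS} models, got {len(configs)}")
    seen = set()
    for config in configs:
        if config.name in seen:
            problems.append(f"Duplicate configuration name: {config.name}")
        seen.add(config.name)
    return problems


# The same model twice with different (assumed) parameters is allowed,
# as long as the configuration names differ.
configs = [
    ModelConfig("gpt-4o-low-temp", "OpenAI GPT-4o", {"temperature": 0.2}),
    ModelConfig("gpt-4o-high-temp", "OpenAI GPT-4o", {"temperature": 0.9}),
]
print(validate_model_configs(configs))
```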
Add attachment
Depending on the model's capabilities, you can attach image, video, text, and PDF files to your prompts. To add one or more attachments, click the paper clip icon at the prompt level and select either Add from a public link or Upload from computer. After you enter a valid URL for a supported attachment type or upload a valid local file, click Save. Repeat to add more attachments if necessary.
Step 5: Set up an ontology
Create an ontology for evaluating model responses.
The editor supports the following options:
Feature | Description | Export format |
---|---|---|
Message ranking | Rank multiple model-generated responses to determine their relative quality or relevance. | Payload |
Message selection | Select single or multiple responses that meet specific criteria. | Payload |
Message step reasoning | Break responses into steps and evaluate the accuracy of each step by marking it correct, neutral, or incorrect. For incorrect steps, add your rewrite with a justification. Optionally (live editor only), regenerate the rest of the conversation after each incorrect step. | Payload |
Classification - Radio | Select one option from a predefined set. | Payload |
Classification - Checklist | Choose multiple options from a list. | Payload |
Classification - Free text | Add free text annotations. | Payload |
Classification tasks can apply globally to the entire conversation or individually to a message. They can also nest subclassification tasks.
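To plan an ontology before building it in the UI, it can help to write it down as data. The sketch below is a planning aid under stated assumptions, not the Labelbox ontology schema or SDK: the `Task` class, the task type strings, and the `scope` values are hypothetical names derived from the feature table and the sentence above.

```python
from dataclasses import dataclass, field
from typing import List

# Task types derived from the feature table (names are hypothetical).
TASK_TYPES = {
    "message-ranking",
    "message-selection",
    "message-step-reasoning",
    "radio",
    "checklist",
    "free-text",
}
# Classifications apply to the whole conversation or to a single message.
SCOPES = {"global", "message"}


@dataclass
class Task:
    """Hypothetical description of one ontology task."""
    name: str
    task_type: str
    scope: str = "global"
    options: List[str] = field(default_factory=list)
    subtasks: List["Task"] = field(default_factory=list)  # nested subclassifications

    def __post_init__(self):
        if self.task_type not in TASK_TYPES:
            raise ValueError(f"Unknown task type: {self.task_type}")
        if self.scope not in SCOPES:
            raise ValueError(f"Unknown scope: {self.scope}")


# Example plan: one ranking task plus a per-message radio classification
# that nests a free-text subclassification.
ontology = [
    Task("Rank the responses", "message-ranking"),
    Task(
        "Is the response factual?",
        "radio",
        scope="message",
        options=["Yes", "No"],
        subtasks=[Task("Explain your choice", "free-text", scope="message")],
    ),
]
print([task.name for task in ontology])
```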
Message step reasoning best practices
Message step reasoning is an experimental feature. For projects using ontologies with the step reasoning task, ensure that prompts lead to responses that can be easily broken down into clear steps.
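As an illustration of what a step-by-step evaluation records, the sketch below models each step with one of the three verdicts and flags incorrect steps that lack a rewrite or a justification. It assumes you want both for every incorrect step; `StepEvaluation`, `find_problems`, and `first_incorrect` are hypothetical names and do not reflect the export payload format.

```python
from dataclasses import dataclass
from typing import List, Optional

# Verdicts named in the feature table.
VERDICTS = ("correct", "neutral", "incorrect")


@dataclass
class StepEvaluation:
    """One evaluated step of a model response (hypothetical record)."""
    text: str
    verdict: str
    rewrite: Optional[str] = None
    justification: Optional[str] = None


def find_problems(steps: List[StepEvaluation]) -> List[str]:
    """List steps with an unknown verdict or an incomplete correction."""
    problems = []
    for number, step in enumerate(steps, start=1):
        if step.verdict not in VERDICTS:
            problems.append(f"Step {number}: unknown verdict {step.verdict!r}")
        elif step.verdict == "incorrect" and not (step.rewrite and step.justification):
            problems.append(
                f"Step {number}: incorrect step needs a rewrite and a justification"
            )
    return problems


def first_incorrect(steps: List[StepEvaluation]) -> Optional[int]:
    """Return the 1-based number of the first incorrect step, or None.

    This is the point after which the rest of the conversation could be
    regenerated in the live editor.
    """
    for number, step in enumerate(steps, start=1):
        if step.verdict == "incorrect":
            return number
    return None
```

A response whose steps are easy to separate, such as a short arithmetic derivation, maps cleanly onto a list of `StepEvaluation` records; a response written as one dense paragraph does not, which is why prompts should encourage clearly delimited steps.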
Step 6: Complete model setup
Click the Complete setup button to lock in your model selection and generate the labeling queue. After you click it, you can't change or remove the selected models.
Step 7: Complete annotation tasks
Click the Start labeling button to chat with your configured models and add annotations to evaluate the responses. You can continue to prompt models after the initial input by sending a new prompt. If you have made a mistake in your prompt or encountered a blocker, you can reset your prompt and the model outputs.
For each prompt, you can generate additional responses or write your own using the Markdown editor. Each time you submit a written response, the AI critic is automatically enabled to help check for grammar and code errors.
Complete all tasks in your workflow.