Live multimodal chat evaluation

Learn how to create live multimodal chat evaluation projects for ranking and classifying model outputs through live, multi-turn conversations.

The live multimodal chat evaluation editor allows you to evaluate generative models by comparing their responses in live, multi-turn conversations. You can select up to 10 models, including popular foundation models and custom models, input prompts to trigger model outputs, and add labels to rate, rank, and refine their performance. The editor supports various data types, including text, images, videos, audio, and PDFs.

Set up live multimodal chat evaluation projects

The following steps walk you through how to set up a live multimodal chat evaluation project on the Labelbox platform. To learn how to set up a live multimodal chat evaluation project using the SDK, see Multimodal chat evaluation projects.
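If you plan to take the SDK route, the snippet below is a minimal sketch of creating a live multimodal chat evaluation project in Python. It assumes the `create_model_evaluation_project` helper described in the SDK guide linked above; verify the helper name and its parameters against that guide and your SDK version.

```python
import labelbox as lb

client = lb.Client(api_key="<YOUR_API_KEY>")

# Minimal sketch: create a live multimodal chat evaluation project and let the
# platform generate placeholder data rows in a new dataset. The helper name and
# parameters are assumptions based on the SDK guide linked above.
project = client.create_model_evaluation_project(
    name="Live multimodal chat evaluation demo",
    dataset_name="live-chat-eval-dataset",  # assumed parameter: new dataset to generate rows into
    data_row_count=25,                      # assumed parameter: number of data rows to generate
)
```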

Step 1: Create a project

  1. On the Annotate projects page, click the + New project button.
  2. Select Multimodal chat, and then select Live multimodal chat.
  3. Provide a name and an optional description for your project.
  4. Configure the data source by selecting from:
    • Create a new dataset or Append data to existing dataset, and then specify the number of data rows you want to generate.
    • None of the above to skip generating data rows during project creation and generate them later.

Step 2: Add data

If you didn't generate data rows during project creation or want to add more data rows, select Add data from Catalog or Generate new data.

If you choose to generate new data, select Create a new dataset or Append data to existing dataset, and then specify the name of the dataset and the number of data rows you want to generate.
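If you prefer to prepare the target dataset ahead of time with the Python SDK, creating an empty dataset is straightforward; you can then append generated data rows to it from the editor. The dataset name below is a placeholder.

```python
import labelbox as lb

client = lb.Client(api_key="<YOUR_API_KEY>")

# Create an empty dataset that generated data rows can be appended to later.
dataset = client.create_dataset(name="live-chat-eval-dataset")
print(dataset.uid)
```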

Step 3: Select models

Click the + Add model button to select models for your output evaluation project. Depending on your model selection, you can attach images, videos, and documents (PDF) to your prompt.

Currently, you can choose from the following foundation models integrated through Foundry:

| Model | Attachment types |
| --- | --- |
| Google Gemini 1.5 Pro | Image, video, and document (PDF) |
| Google Gemini 1.5 Flash | Image, video, and document (PDF) |
| Google Gemini Pro | N/A |
| Llama 3 70B Instruct | N/A |
| OpenAI GPT 4 | N/A |
| Claude 3 Opus | Image and document (PDF) |
| Claude 3 Haiku | Image and document (PDF) |
| Claude 3.5 Sonnet | Image and document (PDF) |
| OpenAI GPT-4o | Image and document (PDF) |
| OpenAI GPT-4 Visual | Image and document (PDF) |

Step 4: Configure models

Once you choose a model, you are prompted to select an existing Model configuration or create a new one. You can enter a name for the configuration or use the auto-generated one; each model configuration name must be unique. A model configuration stores the model's configuration attributes and can be reused later, which lets you evaluate your configurations by prompting the model directly using the send a message text input.
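Model configurations can also be created programmatically. The sketch below is an assumption based on the SDK guide referenced earlier; the helper names, the Foundry model ID, and the inference parameters are placeholders to verify against that guide.

```python
import labelbox as lb

client = lb.Client(api_key="<YOUR_API_KEY>")
project = client.get_project("<project_id>")  # the project created earlier

# Assumed helpers from the SDK guide referenced earlier; verify the names and
# parameters for your SDK version before use.
model_config = client.create_model_config(
    name="gemini-1-5-pro-temp-0-5",         # configuration names must be unique
    model_id="<foundry_model_id>",          # placeholder: ID of the selected Foundry model
    inference_params={"temperature": 0.5},  # placeholder configuration attributes
)
project.add_model_config(model_config.uid)
```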

You can render your content in Markdown or plain text format. Use the toggle in the editor to switch between the two views; this also controls how model output is formatted.

🚧

Markdown editor size limit

When using the Markdown editor, limit the character count to fewer than 6,000 characters.

📘

LaTeX support

To add LaTeX formatting, wrap your math expressions in backticks and dollar signs. The editor supports both inline and block LaTeX formatting. For example, to add LaTeX formatting for x = 2, enter ```$$x = 2$$```.

Repeat the model configuration process to add between 1 and 10 models. You can use the same model with different parameters if your use case focuses on evaluating a specific model. See limits for your account limits.

Use the ellipsis menu to Edit, Duplicate, or Remove a model selection.

Add attachment

Depending on the model's capabilities, you can attach image, video, text, and PDF files to your prompts. To add one or more attachments, click the paper clip icon at the prompt level and choose either Add from a public link or Upload from computer. After entering a valid URL for a supported attachment type or uploading a valid local file, click Save; repeat to add more attachments as needed.

Step 5: Set up an ontology

Create an ontology for evaluating model responses, like the following example:

The editor supports the following options:

| Feature | Description | Export format |
| --- | --- | --- |
| Message ranking | Rank multiple model-generated responses to determine their relative quality or relevance. | Payload |
| Message selection | Select single or multiple responses that meet specific criteria. | Payload |
| Message step reasoning | Break responses into steps and evaluate the accuracy of each step by selecting from correct, neutral, and incorrect. Add your rewrite with justification for incorrect steps. (Optional, live editor only) Regenerate the rest of the conversation after each incorrect step. | Payload |
| Classification - Radio | Select one option from a predefined set. | Payload |
| Classification - Checklist | Choose multiple options from a list. | Payload |
| Classification - Free text | Add free text annotations. | Payload |

Classification tasks can apply globally to the entire conversation or individually to a message. They can also nest subclassification tasks.
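If you manage ontologies with the Python SDK, the classification tasks above map onto the standard classification types. The sketch below covers only those classification tasks; the feature names and options are hypothetical, and the message ranking, selection, and step reasoning tools are not shown here.

```python
import labelbox as lb

client = lb.Client(api_key="<YOUR_API_KEY>")
project = client.get_project("<project_id>")  # the project created earlier

# Minimal sketch of an ontology covering the classification tasks listed above.
ontology_builder = lb.OntologyBuilder(
    classifications=[
        lb.Classification(
            class_type=lb.Classification.Type.RADIO,
            name="overall_quality",  # hypothetical feature name
            options=[lb.Option(value="good"), lb.Option(value="acceptable"), lb.Option(value="poor")],
        ),
        lb.Classification(
            class_type=lb.Classification.Type.CHECKLIST,
            name="issues",  # hypothetical feature name
            options=[lb.Option(value="hallucination"), lb.Option(value="formatting")],
        ),
        lb.Classification(
            class_type=lb.Classification.Type.TEXT,
            name="free_text_feedback",  # hypothetical feature name
        ),
    ]
)

# media_type is an assumption; depending on the SDK version, chat evaluation
# ontologies may require additional arguments described in the SDK ontology docs.
ontology = client.create_ontology(
    "Live chat evaluation ontology",
    ontology_builder.asdict(),
    media_type=lb.MediaType.Conversational,
)
project.connect_ontology(ontology)
```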

📘

Message step reasoning best practices

Message step reasoning is an experimental feature. For projects using ontologies with the step reasoning task, ensure that prompts lead to responses that can be easily broken down into clear steps.

Step 6: Complete model setup

Click the Complete setup button to lock in your selection of models and generate the labeling queue. Once you complete setup, you can no longer change or remove the selected models.
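For projects created through the SDK, there may be an equivalent programmatic step; the call below is an assumption based on the SDK guide referenced earlier rather than a confirmed API, so verify the method name before relying on it.

```python
import labelbox as lb

client = lb.Client(api_key="<YOUR_API_KEY>")
project = client.get_project("<project_id>")

# Assumed helper for marking model setup complete on an SDK-created project.
project.set_project_model_setup_complete()
```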

Step 7: Complete annotation tasks

Click the Start labeling button to chat with your configured models and add annotations to evaluate the responses. You can continue to prompt models after the initial input by sending a new prompt. If you make a mistake in your prompt or encounter a blocker, you can reset the prompt and the model outputs.

For each prompt, you can generate additional responses or write your own using the Markdown editor. Each time you submit a written response, the AI critic is automatically enabled to help check for grammar and code errors.

Complete all tasks in your workflow.