Learn how to create live multimodal chat evaluation projects for ranking and classifying model outputs in multi-turn conversations.
Model | Attachment types |
---|---|
AWS Nova Lite | Image, video, and document (PDF) |
AWS Nova Micro | Image and document (PDF) |
AWS Nova Pro | Image, video, and document (PDF) |
AWS Nova Sonic Realtime | Audio |
Claude 3.5 Haiku | Image and document (PDF) |
Claude 3.5 Sonnet | Image and document (PDF) |
Claude 3.7 Sonnet | Image, video, and document (PDF) |
Claude 3.7 Sonnet Think | Image, video, and document (PDF) |
Claude 3 Haiku | Image and document (PDF) |
Claude 3 Opus | Image and document (PDF) |
DeepSeek R1 | N/A |
Google Gemini 1.5 Flash | Image, video, and document (PDF) |
Google Gemini 1.5 Pro | Image, video, and document (PDF) |
Google Gemini 2.0 Flash Experimental | Image, video, and document (PDF) |
Google Gemini 2.0 Flash Thinking Mode | Image and document (PDF) |
Google Gemini 2.5 Pro | Image and document (PDF) |
Google Gemini Flash Experimental | Image, video, and document (PDF) |
Google Gemini Pro | N/A |
Google Gemini Pro Experimental | Image, video, and document (PDF) |
Grok | N/A |
Grok 3 | N/A |
Llama 3.1 405b | N/A |
Llama 3.2 | N/A |
Llama 4 Maverick Instruct | N/A |
OpenAI GPT 4 | N/A |
OpenAI GPT 4.1 | Image and document (PDF) |
OpenAI GPT-4o | Image and document (PDF) |
OpenAI GPT-4o mini Transcribe | Audio |
OpenAI GPT-4o Transcribe | Audio |
OpenAI GPT-o1 | Image and document (PDF) |
OpenAI GPT-o1-mini | Image and document (PDF) |
OpenAI GPT-o1-preview | Image and document (PDF) |
OpenAI o3 | Image and document (PDF) |
OpenAI o4-mini | Image and document (PDF) |
Whisper | Audio |
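
The table above can be treated as a lookup when deciding which models to add to a project for a given kind of attachment. The following is a minimal sketch, not part of any Labelbox SDK: the dictionary `SUPPORTED_ATTACHMENTS` and the helper `models_supporting` are hypothetical names, and only a few rows of the table are copied in for illustration.

```python
# Hypothetical lookup built from a few rows of the table above.
# This is an illustration only, not a Labelbox API.
SUPPORTED_ATTACHMENTS = {
    "AWS Nova Lite": {"image", "video", "document"},
    "AWS Nova Sonic Realtime": {"audio"},
    "Claude 3.5 Sonnet": {"image", "document"},
    "DeepSeek R1": set(),  # listed as N/A in the table
    "OpenAI GPT-4o": {"image", "document"},
    "Whisper": {"audio"},
}


def models_supporting(attachment_type: str) -> list[str]:
    """Return the models, sorted by name, that accept the given attachment type."""
    wanted = attachment_type.lower()
    return sorted(
        name for name, kinds in SUPPORTED_ATTACHMENTS.items() if wanted in kinds
    )


print(models_supporting("video"))
print(models_supporting("audio"))
```

Extending the dictionary with the remaining rows would let the same helper answer the question for every model in the table.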
To render x=2, put $$x = 2$$.

Use the ellipsis to Edit, Duplicate, or Remove a model selection.
Feature | Description | Export format |
---|---|---|
Message ranking | Rank multiple model-generated responses to determine their relative quality or relevance. | Payload |
Message selection | Select single or multiple responses that meet specific criteria. | Payload |
Message step reasoning | (Text conversations only, no multimodal support) Evaluate the accuracy of each step broken down from responses and label it as correct, neutral, or incorrect. Provide a justification for incorrect steps and regenerate the conversation from that step. | Payload |
Classification - Radio | Select one option from a predefined set. | Payload |
Classification - Checklist | Choose multiple options from a list. | Payload |
Classification - Free text | Add free text annotations. | Payload |
$$x^2=4$$
When rendering LaTeX inside a code block, also use double dollar signs. If you use a data row-level system prompt and expect Markdown or LaTeX rendering, include the correct LaTeX delimiters in your prompt to match your project settings.
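
As a rough illustration of keeping delimiters consistent in a system prompt, the sketch below wraps expressions in double dollar signs before they are placed into the prompt text. It assumes the project is configured for double-dollar delimiters; `wrap_latex` and the prompt wording are hypothetical and not taken from the product.

```python
def wrap_latex(expression: str) -> str:
    """Wrap a LaTeX expression in double dollar signs.

    Surrounding whitespace and any existing $ delimiters are stripped first,
    so already-wrapped input is not wrapped twice.
    """
    core = expression.strip().strip("$").strip()
    return f"$${core}$$"


# Hypothetical system prompt that tells the model which delimiters to use.
system_prompt = (
    "Answer in Markdown. Write every formula between double dollar signs, "
    f"for example {wrap_latex('x^2 = 4')}."
)

print(system_prompt)
```

Normalizing the delimiters in one helper avoids mixing single-dollar and double-dollar forms across prompts, which could otherwise leave some expressions unrendered.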