Learn how to use the multimodal chat evaluation editor to evaluate generative models across multiple data types, compare model outputs, and refine performance with supported annotation tasks.
Feature | Description | Export format |
---|---|---|
Message ranking | Rank multiple model-generated responses to determine their relative quality or relevance. | Payload |
Message selection | Select single or multiple responses that meet specific criteria. | Payload |
Message step reasoning | (Text conversations only; no multimodal support.) Evaluate each step that a response is broken down into and label it as correct, neutral, or incorrect. Provide a justification for incorrect steps and regenerate the conversation from that step. | Payload |
Prompt rating | Flag prompts that have issues and select an issue category. | Payload |
Fact-checking | Verify the factual information in the response using these rating options: Accurate, Inaccurate, Disputed, Unsupported, Can’t confidently assess, or No factual information. For Accurate, Inaccurate, and Disputed, provide a justification explaining your choice. | Payload |
Classification - Radio | Select one option from a predefined set. | Payload |
Classification - Checklist | Choose multiple options from a list. | Payload |
Classification - Free text | Add free text annotations. | Payload |
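
For orientation, the sketch below shows one way a message ranking annotation might look inside an exported payload. The field names (`feature_name`, `ranked_messages`, `rank`) are illustrative assumptions, not the editor's actual export schema; refer to the export documentation for the exact structure.

```python
# Hypothetical example only: field names are illustrative assumptions,
# not the editor's actual export schema.
example_ranking_payload = {
    "feature_name": "response_ranking",      # assumed: name of the ranking feature
    "ranked_messages": [                     # assumed: one entry per model response
        {"message_id": "msg-a", "rank": 1},  # rank 1 = highest-quality response
        {"message_id": "msg-b", "rank": 2},
    ],
}
```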