AI critic

Leverage multi-modal LLMs to evaluate and improve your labels based on custom criteria.

AI critic helps you evaluate and improve labels for multi-modal tasks, such as prompt and response generation and LLM human preference. It automatically rates your labels against your predefined criteria and provides feedback for improving them.


Access Permission

Only the workspace admin can run AI critic and view the feedback.


Private preview feature

AI critic is a private preview feature.

Set up

Before you start adding the AI critic logic, set up the Labelbox API key and client connection.

# Import Labelbox
import labelbox as lb

# Set up the API key
LB_API_KEY = "your_labelbox_api_key"

# Create the client (uses the default Labelbox endpoint)
client = lb.Client(api_key=LB_API_KEY)

Export the project

Export the project to fetch data rows with labels from your Labelbox dataset for running AI critic:

params = {
    "data_row_details": True,
    "attachments": True,
    "project_details": True,
    "performance_details": True,
    "label_details": True,
    "interpolated_frames": True,
}

DATASET_ID = "your_dataset_id"
PROJECT_ID = "your_project_id"

project = client.get_project(PROJECT_ID)
export_task = project.export(params=params)


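# Wait for the export to finish before reading errors or results
export_task.wait_till_done()

if export_task.errors:
    print(export_task.errors)

export_json = export_task.result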
# print(export_json)
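
The scoring function later on this page reads only a handful of fields from each exported row. The following is a minimal sketch of that access pattern, assuming each row is a dictionary shaped the way the scoring function indexes it. The `sample_item` values and the `extract_fields` helper are hypothetical and only for illustration; a real export contains many more fields.

```python
# Hypothetical exported row: only the keys read on this page are shown,
# and every value below is made up for illustration.
PROJECT_ID = "your_project_id"

sample_item = {
    "data_row": {
        "id": "data_row_id_123",
        "global_key": "global_key_123",
        "row_data": "Why am I so tired all the time?",
    },
    "attachments": [{"value": "extra context for the prompt"}],
    "projects": {
        PROJECT_ID: {
            "labels": [
                {
                    "annotations": [
                        {
                            "classifications": [
                                {"text_answer": {"content": "There can be many reasons for fatigue."}}
                            ]
                        }
                    ]
                }
            ]
        }
    },
}


def extract_fields(item, project_id):
    """Pull out the prompt, the label text, and the identifiers for one row."""
    label = item["projects"][project_id]["labels"][0]
    answer = label["annotations"][0]["classifications"][0]["text_answer"]["content"]
    return {
        "prompt": item["data_row"]["row_data"],
        "answer": answer,
        "data_row_id": item["data_row"]["id"],
        "global_key": item["data_row"]["global_key"],
    }


fields = extract_fields(sample_item, PROJECT_ID)
print(fields["prompt"])
print(fields["answer"])
```

Checking one row this way before running the critic makes it easier to spot rows that lack a label or an attachment.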

Define critic instructions and criteria

Define the criteria and construct the instructions in Markdown to guide the multi-modal model on how to rate labels and provide feedback. The following example defines an end-to-end instruction with guidelines and examples:

instructions = """
## SFT AI Assistant Response Authoring Guidelines
### Introduction
This task requires hand-writing optimal responses to potential AI Assistant queries. Ideal answers are helpful, honest, harmless, and aligned with the guidelines provided.

### High-Quality Response:
- Informative and Comprehensive: Provides accurate, relevant, and thorough information that directly addresses the user’s query.
- Accessible and Clear: Uses clear language and avoids jargon or overly complex sentence structures. 
- Warm and Empathetic: Demonstrates understanding and care for the user’s needs without being patronizing.
- Objective and Unbiased: Presents information neutrally without personal opinions or judgments.
- Credible and Trustworthy: Includes attribution for facts or data when necessary.
- Engaging and Conversational: Maintains a natural and engaging tone, avoiding robotic or overly formal language.

### Examples:
Prompt: Why am I so tired all the time?
- Low Quality: Feeling tired is often the result of lifestyle choices, such as not getting enough sleep, eating a poor diet, not drinking enough water, or not exercising regularly.
- Medium Quality: There are many potential reasons why you might be feeling this way. Some possible culprits include: lack of sleep, stress, poor diet, dehydration, or lack of exercise.
- High Quality: I’m sorry to hear that you’re feeling tired. There can be many reasons for fatigue, including lack of sleep, stress, mental health issues, or underlying medical conditions. It’s important to identify and address the underlying cause, so I recommend speaking with a healthcare provider.

Prompt: What would be a good month to plan a trip to Mexico?
- Low Quality: Planning a trip to Mexico? Winter and Spring bring cooler weather, ideal for outdoor adventures. Summer has warm weather and lively festivals.
- Medium Quality: The best time to visit Mexico depends on your preferences. Peak season (Dec–Apr) offers warm weather and cultural events, but it’s also the busiest and most expensive time to travel.
- High Quality: Each season in Mexico offers a unique experience. Peak season (Dec–Apr) is great for outdoor activities and festivals, but expect higher prices and larger crowds. Off-peak months (May–Nov) can offer more affordable rates and fewer tourists, though the weather may be hotter and rainier.

### Writing Guides:
- Answer comprehensively, avoid unnecessary words.
- Vary sentence structure and word choice.
- Ask follow-up questions only when needed for clarification.
- Do not repeat the user’s question.
- Maintain a balanced and neutral tone.
- Use simple and clear language.
"""

Create system prompt

Create a system prompt that tells the multi-modal model how to evaluate labels and output scores, as in the following example:

system_prompt = """
You are acting as an expert human with excellent writing, comprehension, and communication skills.
Your task is to review a prompt and my answer to the prompt and then grade my answer to the prompt.
Based on the instructions, given the prompt, please rate the answer.
Rate my answer on a scale of 0 to 5, 0 being poor and 5 being excellent.
The scores can be in 0.25 increments. Please be fair and objective. Think again about your rating.
Use the examples in the instructions to ground your ratings.

Use lowercase in response, no spaces and no special characters (&, /) in json.
Include ideas to improve to get perfect score. Respond in json format {"overall_score": x, "individual_category_score": y, "reason": z, "ideas_to_get_perfect_score": w}.
Evaluate and rate the last response based on the chat history.
"""

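# Text that introduces the instructions in the prompt; fill in your own wording
instructions_prefix = """
"""

# Text that introduces the answer in the prompt; fill in your own wording
answer_prefix = """
"""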
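
Because the system prompt asks for a fixed JSON shape and a 0 to 5 scale in 0.25 increments, it can help to check the model's reply before using it. The following is a minimal sketch of such a check. The `validate_critic_response` helper is hypothetical and not part of the Labelbox SDK, and it assumes the reply is already valid JSON.

```python
import json

# Keys requested by the system prompt
REQUIRED_KEYS = (
    "overall_score",
    "individual_category_score",
    "reason",
    "ideas_to_get_perfect_score",
)


def validate_critic_response(response_text):
    """Parse the model reply and check it against the format in the system prompt."""
    result = json.loads(response_text)

    missing = [key for key in REQUIRED_KEYS if key not in result]
    if missing:
        raise ValueError(f"missing keys: {missing}")

    score = float(result["overall_score"])
    if not 0 <= score <= 5:
        raise ValueError(f"overall_score out of range: {score}")
    # Multiples of 0.25 become whole numbers when multiplied by 4
    if (score * 4) % 1 != 0:
        raise ValueError(f"overall_score is not a 0.25 increment: {score}")

    return result


reply = (
    '{"overall_score": 3.75, "individual_category_score": 4, '
    '"reason": "clear but brief", "ideas_to_get_perfect_score": "add sources"}'
)
checked = validate_critic_response(reply)
print(checked["overall_score"])
```

Rejecting malformed replies early keeps out-of-range or incomplete scores from being submitted as feedback.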

Evaluate labels and submit results

Use the following function to evaluate each label based on the system prompt and submit the evaluation result:

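import json

# repair_json is assumed to come from the json-repair package
from json_repair import repair_json

# Placeholder for importing the generative model
import example_gen_ai
example_gen_ai_key = "API_KEY"
generative_multimodal_model = example_gen_ai(api_key=example_gen_ai_key)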

def score_label(item):
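    # Extract necessary information from the item
    row_data = item["data_row"]["row_data"]
    label = item["projects"][PROJECT_ID]["labels"][0]["annotations"][0]["classifications"][0]["text_answer"]["content"]
    # Label ID used when upserting feedback (assumes each exported label has an "id" field)
    label_id = item["projects"][PROJECT_ID]["labels"][0]["id"]
    datarow_id = item["data_row"]["id"]
    global_key = item["data_row"]["global_key"]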

    # Construct the prompt for the generative model
    prompt = system_prompt + row_data + "\n" + item['attachments'][0]['value'] + "\n" + answer_prefix + label + instructions_prefix + instructions

    #  Placeholder for generating content using the generative model
    response = generative_multimodal_model.generate_content([prompt])
    response_text = response.candidates[0][0].text
    result = repair_json(response_text)
    result = json.loads(result)

    # Extract values from the result
    overall_score_val = result["overall_score"]
    individual_category_score_val = result["individual_category_score"]
    reason_val = result["reason"]
    ideas_to_get_perfect_score = result["ideas_to_get_perfect_score"]

    # Construct feedback and scores
    feedback = f"Reason: {reason_val}\nImprovement ideas: {ideas_to_get_perfect_score}"
    scores = {
        "overall_score": overall_score_val,
        "individual_category_score": individual_category_score_val,
    }

    # Upsert feedback and scores using Labelbox client
    client.upsert_label_feedback(label_id=label_id, feedback=feedback, scores=scores)
    return_output = "score: {}, key: {}".format(overall_score_val, global_key)
    return return_output
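
To score a whole export, the function above can be applied to every row. The following is a minimal sketch, assuming the exported rows are available as an iterable of dictionaries. `run_critic` and `fake_score_label` are hypothetical names used only here; the stub stands in for `score_label` so the example runs without a model or a Labelbox connection.

```python
def run_critic(items, score_fn):
    """Apply score_fn to each exported row, collecting outputs and failures."""
    outputs, failures = [], []
    for item in items:
        try:
            outputs.append(score_fn(item))
        except Exception as exc:
            # Record the failure and keep going so one bad row does not stop the run
            global_key = item.get("data_row", {}).get("global_key")
            failures.append((global_key, repr(exc)))
    return outputs, failures


def fake_score_label(item):
    """Stand-in for score_label: formats a precomputed score instead of calling a model."""
    return "score: {}, key: {}".format(item["score"], item["data_row"]["global_key"])


# Made-up rows: the second one lacks a score and therefore fails
demo_items = [
    {"data_row": {"global_key": "a"}, "score": 4.5},
    {"data_row": {"global_key": "b"}},
]

outputs, failures = run_critic(demo_items, fake_score_label)
print(outputs)
print(failures)
```

In a real run, you would pass `score_label` and the exported rows instead of the stub and the demo data, then review the `failures` list for rows that need attention.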

View and filter feedback

After you submit the scores and feedback to Labelbox, workspace admins can view them in the label editor and filter data rows by label score range:

  1. On the Annotate projects page, select the project on which you ran AI critic.
  2. On the Data Rows tab, open the Search your data dropdown menu. Select Label Actions > Is Labeled > Score > overall, and set the score range to filter for the labels whose feedback you want to view.
  3. Select the data rows with low scores or critical feedback to move them to a custom task for your team members to review or rework.