Model-as-a-Judge Scoring

A model-as-a-judge scorer uses a separate judge model to grade the output of the model you are evaluating. Instead of matching strings or running code, you write a prompt that tells an LLM how to assess a response, and the judge returns a label or a score that AI GO! aggregates into metrics.

This page assumes you already know the AI GO! building blocks — a task wires a solver (which produces the output) to one or more scorers (which grade it) and metrics (which aggregate the scores). Here we focus on how the judge scorers work and how to go beyond a one-line prompt.

When to use a judge

Reach for a judge scorer when correctness is not a literal match:

  • The answer is free-form and a string_equals scorer would be too strict.
  • "Good" depends on judgement: helpfulness, faithfulness to a context, tone, safety, or whether a multi-turn conversation stayed consistent.
  • You have a reference answer or a rubric, but many surface forms are acceptable.

Prefer cheaper scorers (string_equals, string_equals_mcqa, python) when the check is exact or deterministic — judges cost an extra model call per sample and introduce the judge's own biases.

The two judge scorers

AI GO! ships two model-as-a-judge scorers, plus a label-only sibling:

| Scorer | Judge returns | Score field | Typical metrics |
| --- | --- | --- | --- |
| model_as_a_judge_classifier | a label from your correct_labels / incorrect_labels | is_correct (bool) | mean (→ accuracy) |
| model_as_a_judge_scorer | a number in [score_min, score_max] | score (float) | mean, min, max |
| labeler_via_model | a label from valid_labels (no correct/incorrect notion) | label (str) | weighted_average, frequency |

Use the classifier for pass/fail judgements, the scorer for graded quality on a numeric scale, and labeler_via_model when you want a categorical label whose "goodness" is not binary (it pairs naturally with weighted_average).
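As a rough illustration of how a classifier label becomes the boolean is_correct field, the Python sketch below maps a judge label through correct_labels and incorrect_labels. It is not AI GO!'s implementation: the function name label_to_is_correct is hypothetical, and raising on a label outside both lists is an assumption about how such a case might be handled.

```python
from typing import Sequence


def label_to_is_correct(
    label: str,
    correct_labels: Sequence[str],
    incorrect_labels: Sequence[str],
) -> bool:
    """Map a judge label to a pass/fail boolean (hypothetical helper)."""
    if label in correct_labels:
        return True
    if label in incorrect_labels:
        return False
    # Assumption: a label outside both lists is treated as an error here.
    raise ValueError(f"unexpected judge label: {label!r}")


if __name__ == "__main__":
    print(label_to_is_correct("correct", ["correct"], ["incorrect"]))
    print(label_to_is_correct("incorrect", ["correct"], ["incorrect"]))
```

A labeler_via_model label has no such mapping, which is why it is paired with metrics that work on the label itself.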

Anatomy of a judge scorer

Every judge scorer shares the same core fields:

  • model_key — the judge model. Often exposed via config_spec so it can be swapped without editing the task.
  • system_prompt — the rubric: what to assess and exactly what to output.
  • user_prompt — the data to grade, built with Jinja. You have access to the dataset sample and the solver's trace.
  • use_structured_outputs: true — strongly recommended. The judge is forced to return a JSON object with its label/score, so parsing is reliable.
  • metrics — how per-sample scores roll up to the evaluation level.
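To see why a structured reply is easier to handle than free prose, the sketch below parses a judge reply that is assumed to be a JSON object with a "score" key and checks it against a numeric range. The JSON shape, the function name parse_judge_score, and the error handling are illustrative assumptions, not the framework's actual schema.

```python
import json


def parse_judge_score(raw: str, score_min: float, score_max: float) -> float:
    """Parse a judge reply assumed to look like {"score": 87} and validate it."""
    payload = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(payload, dict) or "score" not in payload:
        raise ValueError("judge reply has no 'score' field")
    score = payload["score"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError(f"score is not a number: {score!r}")
    if not score_min <= score <= score_max:
        raise ValueError(f"score {score} is outside [{score_min}, {score_max}]")
    return float(score)


if __name__ == "__main__":
    print(parse_judge_score('{"score": 87}', 0, 100))
```

With prose output, the same check would first need a fragile step to find the number inside the text.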

The single most important Jinja helper is trace.get_last_assistant_text(), which returns the model's final reply. sample.<field> reads dataset columns.
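Conceptually, reading the final reply means walking the conversation backwards until an assistant message is found. The Python sketch below shows that idea on a hypothetical Message type; it is not the real trace object, and returning an empty string when there is no assistant message is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Message:
    role: str     # "system", "user", or "assistant"
    content: str


def last_assistant_text(messages: List[Message]) -> str:
    """Return the text of the most recent assistant message (hypothetical helper)."""
    for msg in reversed(messages):
        if msg.role == "assistant":
            return msg.content
    # Assumption: no assistant reply yields an empty string.
    return ""


if __name__ == "__main__":
    conversation = [
        Message("user", "What is the capital of France?"),
        Message("assistant", "Paris."),
    ]
    print(last_assistant_text(conversation))
```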

Tip: When a scorer has more than one metric, give each metric a unique key. Duplicate auto-generated keys cause validation errors.

Basic example: classify and score the same answer

This single-turn task answers a factual question, then grades the answer two ways with the same judge model — a correctness classifier and a 0–100 quality scorer. Both read the final reply with trace.get_last_assistant_text(). Below is just the scorers block; see tasks/qa_basic.yaml for the surrounding solver and dataset wiring.

# tasks/qa_basic.yaml — scorers
# 1) Classifier: discrete correct/incorrect against the ground truth.
- type: "model_as_a_judge_classifier"
  key: "correctness"
  model_key: "CONFIG.JUDGE_MODEL"
  system_prompt: |
    You grade whether a candidate answer is semantically correct given the
    ground-truth answer. Ignore spelling, capitalisation, and extra detail;
    only the factual content matters.

    Respond with exactly one label: 'correct' or 'incorrect'.
  user_prompt: |
    Question: {{ sample.question }}
    Ground truth: {{ sample.gt_answer }}
    Candidate answer: {{ trace.get_last_assistant_text() }}
  correct_labels: ["correct"]
  incorrect_labels: ["incorrect"]
  use_structured_outputs: true
  metrics:
    - type: "mean"
      key: "accuracy"
      field: "is_correct"
      name: "Accuracy"

# 2) Scorer: a graded numeric quality score on the same answer.
- type: "model_as_a_judge_scorer"
  key: "answer_quality"
  model_key: "CONFIG.JUDGE_MODEL"
  system_prompt: >
    You are an expert evaluator. Rate the overall quality of the answer to the
    question on a scale from 0 to 100, considering correctness, clarity, and
    completeness. 100 is an ideal answer, 0 is wrong or unhelpful.
    Return only the numeric score.
  user_prompt: >
    <question>{{ sample.question }}</question>
    <ground_truth>{{ sample.gt_answer }}</ground_truth>
    <answer>{{ trace.get_last_assistant_text() }}</answer>
  score_min: 0
  score_max: 100
  use_structured_outputs: true
  metrics:
    - type: "mean"
      key: "quality_mean"
      field: "score"
      name: "Mean Quality"
    - type: "min"
      key: "quality_min"
      field: "score"
      name: "Min Quality"
    - type: "max"
      key: "quality_max"
      field: "score"
      name: "Max Quality"

The classifier produces a boolean is_correct per sample, which mean turns into accuracy. The scorer produces a numeric score, summarised as mean / min / max.
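The aggregation step can be pictured with plain Python. The sketch below is only an illustration of the arithmetic, using made-up per-sample values; the function names are hypothetical and the framework's own metric code may differ.

```python
from statistics import mean
from typing import Dict, Sequence


def accuracy(is_correct: Sequence[bool]) -> float:
    """Mean of booleans, counting True as 1.0 and False as 0.0."""
    return mean(1.0 if ok else 0.0 for ok in is_correct)


def summarise_scores(scores: Sequence[float]) -> Dict[str, float]:
    """Mean, min, and max of the per-sample numeric scores."""
    return {"mean": mean(scores), "min": min(scores), "max": max(scores)}


if __name__ == "__main__":
    # Illustrative per-sample results, not real evaluation data.
    print(accuracy([True, False, True, True]))
    print(summarise_scores([90.0, 35.0, 80.0, 75.0]))
```

Reporting min and max alongside the mean helps spot a single very bad answer that an average would hide.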

Going further

A judge prompt that only ever reads trace.get_last_assistant_text() leaves most of the trace — and most of the metric machinery — on the table. The advanced task (tasks/conversation_advanced.yaml) grades an entire two-turn conversation.

Judge the whole trace, not just the last reply

The trace is the full conversation, not a single string. In a multi-turn or agentic task you usually need to show the judge everything that happened. The trace object exposes the structure you need from Jinja:

  • trace.turns — the conversation split into user-initiated turns, each with turn.user_message and turn.assistant_messages.
  • trace.function_calls / trace.get_function_call_pairs() — tool calls and their outputs, for agent traces.
  • trace.assistant_messages, trace.user_messages, trace.system_messages.

The solver runs two user turns (a multi_turn_solver, see tasks/conversation_advanced.yaml); what matters here is the scorer's user_prompt, which renders the complete conversation by looping over the turns instead of reading a single reply:

# tasks/conversation_advanced.yaml — scorer
- type: "labeler_via_model"
  key: "conversation_grade"
  model_key: "CONFIG.JUDGE_MODEL"
  system_prompt: |
    You grade a multi-turn conversation between a user and an assistant.
    The second user message adds a constraint or a reframing, so a good
    assistant must carry over context from the first turn and adapt rather
    than ignore it or start from scratch.

    Use the reference description of ideal behaviour as your guide, then
    assign exactly one label:
    - 'EXCELLENT': fully handles both turns and adapts correctly to the
      second message.
    - 'ADEQUATE': mostly correct but misses minor details or adapts only
      partially.
    - 'POOR': ignores the second message, contradicts an earlier turn, or is
      factually wrong.
    - 'REFUSED': declines to answer or returns an empty / off-topic reply.
  user_prompt: |
    Reference (ideal behaviour):
    {{ sample.reference }}

    Full conversation to grade:
    {% for turn in trace.turns %}
    --- Turn {{ loop.index }} ---
    User: {{ turn.user_message.content }}
    {% for msg in turn.assistant_messages %}Assistant: {{ msg.content }}
    {% endfor %}{% endfor %}
  valid_labels: ["EXCELLENT", "ADEQUATE", "POOR", "REFUSED"]
  use_structured_outputs: true
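The loop in the user_prompt above corresponds roughly to the following Python sketch, which groups a flat message list into user-initiated turns and renders them as text. The Message and Turn types are hypothetical stand-ins for the trace structure, and dropping system messages and any assistant message that precedes the first user message is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Message:
    role: str     # "system", "user", or "assistant"
    content: str


@dataclass
class Turn:
    user_message: Message
    assistant_messages: List[Message] = field(default_factory=list)


def split_into_turns(messages: List[Message]) -> List[Turn]:
    """Start a new turn at each user message; attach assistant replies to it."""
    turns: List[Turn] = []
    for msg in messages:
        if msg.role == "user":
            turns.append(Turn(user_message=msg))
        elif msg.role == "assistant" and turns:
            turns[-1].assistant_messages.append(msg)
    return turns


def render_conversation(turns: List[Turn]) -> str:
    """Render every turn, numbering from 1 like Jinja's loop.index."""
    lines: List[str] = []
    for index, turn in enumerate(turns, start=1):
        lines.append(f"--- Turn {index} ---")
        lines.append(f"User: {turn.user_message.content}")
        for msg in turn.assistant_messages:
            lines.append(f"Assistant: {msg.content}")
    return "\n".join(lines)


if __name__ == "__main__":
    trace = [
        Message("user", "Plan a three-day trip to Rome."),
        Message("assistant", "Day 1: the Colosseum."),
        Message("user", "Make it vegetarian-friendly."),
        Message("assistant", "Updated plan with vegetarian restaurants."),
    ]
    print(render_conversation(split_into_turns(trace)))
```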

Graded rubrics with richer metrics

A four-level rubric is not naturally pass/fail, so the conversation scorer uses labeler_via_model and converts its labels into numbers with weighted_average. The frequency metric reports how often each label occurred. Every label in valid_labels must appear in weights; otherwise aggregation fails on the unknown label.

# metrics for the conversation_grade scorer above (nested under that scorer entry)
metrics:
  - type: "weighted_average"
    key: "conversation_quality"
    field: "label"
    name: "Conversation Quality"
    weights:
      EXCELLENT: 1.0
      ADEQUATE: 0.6
      POOR: 0.0
      REFUSED: 0.0
  - type: "frequency"
    key: "grade_distribution"
    field: "label"
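The arithmetic behind these two metrics can be sketched in a few lines of Python. This is an illustration under assumptions, not the framework's code: the function names are hypothetical, the error type is a guess, and whether frequency reports proportions (as here) or raw counts is not stated in this page.

```python
from collections import Counter
from typing import Dict, Mapping, Sequence


def weighted_average(labels: Sequence[str], weights: Mapping[str, float]) -> float:
    """Average the weight of each label; fail on labels missing from weights."""
    if not labels:
        raise ValueError("no labels to aggregate")
    unknown = sorted(set(labels) - set(weights))
    if unknown:
        raise ValueError(f"labels missing from weights: {unknown}")
    return sum(weights[label] for label in labels) / len(labels)


def frequency(labels: Sequence[str]) -> Dict[str, float]:
    """Share of samples that received each label (assumed to be proportions)."""
    if not labels:
        raise ValueError("no labels to aggregate")
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}


if __name__ == "__main__":
    weights = {"EXCELLENT": 1.0, "ADEQUATE": 0.6, "POOR": 0.0, "REFUSED": 0.0}
    # Illustrative labels, not real evaluation data.
    labels = ["EXCELLENT", "ADEQUATE", "POOR", "EXCELLENT", "REFUSED"]
    print(weighted_average(labels, weights))
    print(frequency(labels))
```

Because POOR and REFUSED share a weight of 0.0, the weighted average cannot tell them apart; the label distribution is what separates wrong answers from refusals.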

For a classifier with a ground-truth label column, you can instead use the binary-classification or multiclass-classification metrics to get precision / recall / F1 rather than plain accuracy.
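For reference, the quantities behind precision, recall, and F1 in the binary case can be computed as in the sketch below. It uses the standard definitions on made-up labels; the function name and the choice to return 0.0 when a denominator is zero are assumptions, and the framework's own metrics may handle edge cases differently.

```python
from typing import Dict, Sequence


def binary_prf(
    predicted: Sequence[str],
    actual: Sequence[str],
    positive: str,
) -> Dict[str, float]:
    """Precision, recall, and F1 for one positive label (standard definitions)."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == positive and a == positive)
    fp = sum(1 for p, a in pairs if p == positive and a != positive)
    fn = sum(1 for p, a in pairs if p != positive and a == positive)
    # Assumption: undefined ratios are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Illustrative labels: judge output versus a ground-truth column.
    judge = ["correct", "correct", "incorrect", "correct", "incorrect"]
    truth = ["correct", "incorrect", "incorrect", "correct", "correct"]
    print(binary_prf(judge, truth, positive="correct"))
```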

Tips and pitfalls

  • Enable structured outputs. use_structured_outputs: true forces the judge to emit a JSON object, so the label/score parses reliably instead of being scraped from prose.
  • Give every metric a unique key. Duplicate keys fail validation.
  • Cover every label in weights. weighted_average raises on any label that is not in its weights map.
  • Make the judge swappable. Exposing the judge's model_key via config_spec (CONFIG.JUDGE_MODEL in the examples above) lets you compare judges or run a stronger judge than the model under test.
  • Mind judge bias. Judges can favour longer answers or their own family of models. Keep rubrics specific, provide a reference where possible, and sanity-check judge labels against a few human-labelled samples.
  • Show the judge what it needs. For multi-turn or agent tasks, render the relevant turns or tool calls into the prompt rather than only the last reply.
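One way to sanity-check a judge against human labels is to compute raw agreement together with Cohen's kappa, which discounts agreement expected by chance. The sketch below is a standalone Python illustration on made-up labels; it is not part of AI GO!, and the function names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def agreement_rate(judge: Sequence[str], human: Sequence[str]) -> float:
    """Fraction of samples where the judge and the human gave the same label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(1 for j, h in zip(judge, human) if j == h) / len(judge)


def cohens_kappa(judge: Sequence[str], human: Sequence[str]) -> float:
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    observed = agreement_rate(judge, human)
    n = len(judge)
    judge_counts = Counter(judge)
    human_counts = Counter(human)
    all_labels = set(judge_counts) | set(human_counts)
    expected = sum(judge_counts[l] * human_counts[l] for l in all_labels) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Illustrative labels, not real evaluation data.
    judge = ["correct", "correct", "incorrect", "correct", "incorrect", "incorrect"]
    human = ["correct", "incorrect", "incorrect", "correct", "incorrect", "correct"]
    print(agreement_rate(judge, human))
    print(cohens_kappa(judge, human))
```

A kappa well below the raw agreement rate suggests that much of the apparent agreement comes from one label dominating the sample.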