Model As A Judge Classifier

Scores each sample by having a judge model classify the output into a label, then checking whether that label is one of the configured correct labels. The result is an is_correct score of 1.0 or 0.0. Use this scorer when the pass/fail criterion is too nuanced for exact string matching. For a numerical score rather than binary pass/fail, use the Model Scorer.

Output

is_correct: 1.0 if the judge's predicted label is in correct_labels, 0.0 otherwise.

How It Works

The judge responds with a JSON object containing a prediction label, reasoning, and confidence. The output is considered correct if prediction is in correct_labels. Enable use_structured_outputs if your model supports it for more reliable label extraction.
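The mapping from the judge's reply to the score can be sketched in a few lines of Python. This is an illustrative sketch and not the platform's implementation: the helper name score_judge_response is hypothetical, and the exact JSON key names for reasoning and confidence (and the numeric confidence value) are assumptions based on the description above.

```python
import json


def score_judge_response(raw_response: str, correct_labels: list[str]) -> dict:
    """Hypothetical helper: turn a judge's JSON reply into an is_correct score."""
    # Assumed reply shape: {"prediction": ..., "reasoning": ..., "confidence": ...}
    parsed = json.loads(raw_response)
    prediction = parsed["prediction"]
    # 1.0 if the predicted label is one of the correct labels, 0.0 otherwise.
    is_correct = 1.0 if prediction in correct_labels else 0.0
    return {"is_correct": is_correct, "prediction": prediction}


# Example reply, written by hand for illustration only.
judge_reply = json.dumps({
    "prediction": "correct",
    "reasoning": "Both answers name the same city.",
    "confidence": 0.9,
})

result = score_judge_response(judge_reply, correct_labels=["correct"])
print(result["is_correct"])
```

Under this sketch, any label outside correct_labels scores 0.0, which matches the Output description of is_correct.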

Examples

Example: Trivia QA. The judge classifies the model output as correct or incorrect, which the scorer maps to a 1.0 or 0.0 result.

...
definition:
  ...
  scorers:
    - type: "model_as_a_judge_classifier"
      model_key: "openai$gpt-4-1-nano"
      system_prompt: |
        You are a helpful assistant and rate whether the ground truth answer and the
        candidate answer are semantically the same. Semantically the same means that
        you do not care about small spelling mistakes, capitalization or whether
        additional information is given. This is all still considered correct.
        
        Respond only with 'correct' or 'incorrect' and nothing else.
      user_prompt: |
        Ground Truth Answer: {{ sample.gt_answer }}
        Candidate Answer: {{ solver_output.output }}
      correct_labels:
        - "correct"
      incorrect_labels:
        - "incorrect"
      use_structured_outputs: true

Configuration

Properties

type Literal "model_as_a_judge_classifier" required

The type of the scorer.

model_key Key, TemplateValue required

The model to be used as the judge.

system_prompt string, TemplateValue

The system prompt given to the judge model. The prompt can refer to the following variables dynamically (using {{ }} syntax):

In all scenarios:

sample: Sample attributes (ex: {{ sample.answer }})

If the task has a solver:

model_output: The last model output (ex: {{ model_output }})
messages: The full list of input/output messages (ex: {{ messages[0]['content'] }})
input_prompt: The message contents of the last model output (only for chat completion tasks) (ex: {{ input_prompt }})

Default: You are a helpful assistant and will be used to judge the output of another model.

user_prompt string, TemplateValue required

The user prompt given to the judge model. The prompt can refer to the following variables dynamically (using {{ }} syntax):

In all scenarios:

sample: Sample attributes (ex: {{ sample.answer }})

If the task has a solver:

model_output: The last model output (ex: {{ model_output }})
messages: The full list of input/output messages (ex: {{ messages[0]['content'] }})
input_prompt: The message contents of the last model output (only for chat completion tasks) (ex: {{ input_prompt }})

correct_labels array[string, TemplateValue] required

The list of labels predicted by the judge that are considered correct.

incorrect_labels array[string, TemplateValue] required

The list of labels predicted by the judge that are considered incorrect.

use_structured_outputs boolean

Whether to use structured outputs. It is recommended to enable this if the model supports it.

Default: False

purpose ScorerPurpose

The purpose of this scorer.

score: The scorer is used to score the solver output or the dataset sample.
qa: The scorer is used to do QA over the solver output or the dataset sample.

Default: score

key string

Unique identifier assigned to the entity in AI GO!.

Default: None

display_name string

The display name of the scorer.

Default: None

metrics array[PythonMetricTemplate, BinaryClassificationMetricTemplate, MulticlassClassificationMetricTemplate, MeanMetricTemplate, MaxMetricTemplate, MinMetricTemplate, StdDevMetricTemplate, FrequencyMetricTemplate, RecallMetricTemplate, PrecisionMetricTemplate, F1ScoreMetricTemplate]

The metrics associated with this scorer, which will produce per-task metrics.

Default: None
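
To see how a per-task metric relates to the per-sample scores, the sketch below averages a list of is_correct values. It assumes a mean-style metric simply averages the scores across samples. The function name mean_is_correct and the sample scores are hypothetical, and this is not the platform's metric code.

```python
from statistics import mean


def mean_is_correct(scores: list[float]) -> float:
    """Hypothetical aggregation: average per-sample is_correct scores."""
    # Each score is 1.0 or 0.0, so the mean is the fraction of samples judged correct.
    return mean(scores)


# Made-up per-sample scores for illustration.
is_correct_scores = [1.0, 0.0, 1.0, 1.0]
print(mean_is_correct(is_correct_scores))
```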