Model As A Judge Classifier
Scores each sample by having a judge model classify the output into a label, then checking whether that label matches one of the configured correct labels. The result is an is_correct score of 1.0 or 0.0. Use this when the pass/fail criterion is too nuanced for exact string matching. For a numerical score rather than binary pass/fail, use the Model Scorer.
Output
is_correct: 1.0 if the judge's predicted label is in correct_labels, 0.0 otherwise.
How It Works
The judge responds with a JSON object containing a prediction label,
reasoning, and confidence. The output is considered correct if
prediction is in correct_labels. Enable use_structured_outputs if
your model supports it for more reliable label extraction.
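The mapping from the judge's response to the score can be sketched in Python as follows. This is a minimal illustration, not the scorer's actual implementation: the function name `score_judge_response` and the sample reply are hypothetical. Only the `prediction`, `reasoning`, and `confidence` fields and the `correct_labels` membership check come from the description above.

```python
import json


def score_judge_response(raw_response: str, correct_labels: list[str]) -> float:
    """Hypothetical helper: map a judge's JSON reply to an is_correct score."""
    parsed = json.loads(raw_response)
    # Only the predicted label decides the score; reasoning and confidence
    # are carried along in the reply but not used here.
    prediction = parsed["prediction"]
    return 1.0 if prediction in correct_labels else 0.0


# Illustrative judge reply (made up for this sketch).
judge_reply = '{"prediction": "correct", "reasoning": "Same city.", "confidence": 0.93}'
is_correct = score_judge_response(judge_reply, correct_labels=["correct"])
print(is_correct)  # 1.0
```

Any label outside `correct_labels` yields 0.0 in this sketch, which matches the "0.0 otherwise" rule stated under Output.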
Examples
Example: Trivia QA. The judge classifies the model output as correct or incorrect, which the scorer maps to a 1.0 or 0.0 result.
...
definition:
  ...
  scorers:
    - type: "model_as_a_judge_classifier"
      model_key: "openai$gpt-4-1-nano"
      system_prompt: |
        You are a helpful assistant and rate whether the ground truth answer and the
        candidate answer are semantically the same. Semantically the same means that
        you do not care about small spelling mistakes, capitalization or whether
        additional information is given. This is all still considered correct.
        Respond only with 'correct' or 'incorrect' and nothing else.
      user_prompt: |
        Ground Truth Answer: {{ sample.gt_answer }}
        Candidate Answer: {{ solver_output.output }}
      correct_labels:
        - "correct"
      incorrect_labels:
        - "incorrect"
      use_structured_outputs: true
Configuration
Properties
type Literal "model_as_a_judge_classifier" required
The type of the scorer.
model_key Key, TemplateValue required
The model to be used as the judge.
system_prompt string, TemplateValue
The system prompt given to the judge model. The prompt can refer
to the following variables dynamically (using {{ }} syntax):
In all scenarios:
sample: Sample attributes (ex: {{ sample.answer }})
If the task has a solver:
model_output: The last model output (ex: {{ model_output }})
messages: The full list of input/output messages (ex: {{ messages[0]['content'] }})
input_prompt: The message contents of the last model output (only for chat completion tasks) (ex: {{ input_prompt }})
Default: You are a helpful assistant and will be used to judge the output of another model.
user_prompt string, TemplateValue required
The user prompt given to the judge model. The prompt can refer
to the following variables dynamically (using {{ }} syntax):
In all scenarios:
sample: Sample attributes (ex: {{ sample.answer }})
If the task has a solver:
model_output: The last model output (ex: {{ model_output }})
messages: The full list of input/output messages (ex: {{ messages[0]['content'] }})
input_prompt: The message contents of the last model output (only for chat completion tasks) (ex: {{ input_prompt }})
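To make the variable substitution concrete, here is a small Python sketch of how a user prompt might look once the {{ }} placeholders are filled in. The `render_prompt` function is a simplified, hypothetical stand-in written for this illustration. It is not the template engine the tool actually uses, and it only resolves dotted names such as `sample.gt_answer`, not indexed expressions such as `messages[0]['content']`. The sample values are made up.

```python
import re


def render_prompt(template: str, context: dict) -> str:
    """Simplified stand-in for {{ }} substitution; supports dotted names only."""

    def lookup(match: re.Match) -> str:
        # Walk the context one dotted segment at a time, e.g. sample -> gt_answer.
        value = context
        for part in match.group(1).split("."):
            value = value[part]
        return str(value)

    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}", lookup, template)


user_prompt = (
    "Ground Truth Answer: {{ sample.gt_answer }}\n"
    "Candidate Answer: {{ model_output }}"
)
# Hypothetical sample and solver output, for illustration only.
context = {"sample": {"gt_answer": "Paris"}, "model_output": "paris"}
rendered = render_prompt(user_prompt, context)
print(rendered)
```

With these values the judge would receive "Ground Truth Answer: Paris" on the first line and "Candidate Answer: paris" on the second.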
correct_labels array[string, TemplateValue] required
The list of labels predicted by the judge that are considered correct.
incorrect_labels array[string, TemplateValue] required
The list of labels predicted by the judge that are considered incorrect.
use_structured_outputs boolean
Whether to use structured outputs. It is recommended to enable this if the model supports it.
Default: False
purpose ScorerPurpose
The purpose of this scorer.
score: The scorer is used to score the solver output or the dataset sample.
qa: The scorer is used to do QA over the solver output or the dataset sample.
Default: score
key string
Unique identifier assigned to the entity in AI GO!.
Default: None
display_name string
The display name of the scorer.
Default: None
metrics array[PythonMetricTemplate, BinaryClassificationMetricTemplate, MulticlassClassificationMetricTemplate, MeanMetricTemplate, MaxMetricTemplate, MinMetricTemplate, StdDevMetricTemplate, FrequencyMetricTemplate, RecallMetricTemplate, PrecisionMetricTemplate, F1ScoreMetricTemplate]
The metrics associated with this scorer, which will produce per-task metrics.
Default: None
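As a rough illustration of what a per-task metric over this scorer's output means, the sketch below averages a handful of is_correct scores in plain Python. It shows only the arithmetic: how a metric template such as MeanMetricTemplate is configured or computed internally is not documented here, and the score values are made up.

```python
from statistics import fmean

# Hypothetical is_correct scores for four samples of one task.
is_correct_scores = [1.0, 0.0, 1.0, 1.0]

# Because each score is 1.0 or 0.0, the mean is the fraction of
# samples the judge labelled as correct.
accuracy = fmean(is_correct_scores)
print(accuracy)  # 0.75
```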
