Model Scorer
Scores each sample by having a judge model produce a numeric score, normalized to the [0, 1] range. Use this when the evaluation criterion is subjective or nuanced - such as coherence, groundedness, or helpfulness - and you want a graded score rather than binary pass/fail. For binary pass/fail, use the Model As A Judge Classifier instead.
Output
score: The judge's score, normalized to the [0, 1] range.
Score Normalization
The judge responds with a JSON object containing a score, reasoning,
and confidence. The raw score is validated against [score_min, score_max]
and normalized to [0, 1] before storage.
LLM judges produce more calibrated results with a wider score range. Set
score_min/score_max to 0/100 and describe the scale in your
system_prompt for best results.
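The validation and normalization step described above can be sketched in Python. This is a minimal illustration, not the product's implementation: it assumes a linear min-max mapping and that out-of-range scores are rejected with an error. The function name `normalize_judge_score` is hypothetical.

```python
import json


def normalize_judge_score(raw_response: str,
                          score_min: float = 0.0,
                          score_max: float = 1.0) -> float:
    """Parse a judge reply and map its score linearly onto [0, 1].

    Assumption: the reply is a JSON object with a "score" key, and
    scores outside [score_min, score_max] are treated as invalid.
    """
    if score_max <= score_min:
        raise ValueError("score_max must be greater than score_min")

    payload = json.loads(raw_response)
    raw_score = float(payload["score"])

    # Validate against the configured range before normalizing.
    if not score_min <= raw_score <= score_max:
        raise ValueError(
            f"score {raw_score} is outside [{score_min}, {score_max}]"
        )

    # Linear min-max normalization (assumed mapping).
    return (raw_score - score_min) / (score_max - score_min)


reply = '{"score": 85, "reasoning": "Mostly grounded.", "confidence": 0.9}'
normalized = normalize_judge_score(reply, score_min=0, score_max=100)
```

Under these assumptions, a raw score of 85 on a 0-to-100 scale is stored as roughly 0.85, and a score equal to score_min or score_max maps to 0 or 1 respectively.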
Examples
Example: Assess (Configurable) Dimension of an Answer. The judge rates how well the model's answer aligns with a particular dimension, on a 0-to-100 scale normalized to [0, 1].
...
definition:
  ...
  scorers:
    - type: "model_as_a_judge_scorer"
      model_key: "<< config.judge_model >>"
      system_prompt: >
        You are an evaluator assessing the << config.evaluation_dimension >> of a
        model response to a question.
        On a scale from 0 to 100, assign 100 if the response is fully
        << config.evaluation_dimension >> and 0 if it is not at all.
        Return only the numeric score.
      user_prompt: >
        <context>{{ sample.context }}</context>
        <response>{{ model_output.choices[0].message.content }}</response>
      score_min: 0
      score_max: 100
      metrics:
        - type: "mean"
          field: "score"
          name: "Mean Score"
        - type: "min"
          field: "score"
          name: "Min Score"
        - type: "max"
          field: "score"
          name: "Max Score"
LLM scorers produce more calibrated results when the score range is 0 to 100, rather than 0 to 1.
To use a custom score range, set the score_min and score_max values and describe the range in the system_prompt. Note that the results are always post-processed to the [0, 1] range before metrics are calculated.
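To make the effect of the mean, min, and max metrics on normalized scores concrete, here is a small Python sketch. The raw scores are made-up illustrative values, the helper `summarize_scores` is hypothetical, and linear min-max normalization is assumed.

```python
from statistics import mean


def summarize_scores(raw_scores, score_min=0.0, score_max=1.0):
    """Normalize raw judge scores to [0, 1], then aggregate them.

    Assumption: normalization is linear min-max, and the metrics are
    computed over the normalized "score" field, as the note above states.
    """
    span = score_max - score_min
    normalized = [(s - score_min) / span for s in raw_scores]
    return {
        "Mean Score": mean(normalized),
        "Min Score": min(normalized),
        "Max Score": max(normalized),
    }


# Hypothetical raw scores from three samples on a 0-to-100 scale.
summary = summarize_scores([85, 40, 100], score_min=0, score_max=100)
```

Because aggregation happens after normalization, every reported metric falls in [0, 1] even though the judge was prompted on a 0-to-100 scale.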
Configuration
Properties
type Literal "model_as_a_judge_scorer" required
The type of the scorer.
model_key Key, TemplateValue required
The model to be used as the judge.
system_prompt string, TemplateValue
The system prompt given to the judge model. The prompt can refer
to the following variables dynamically (using {{ }} syntax):
In all scenarios:
sample: Sample attributes (ex: {{ sample.answer }})
If the task has a solver:
model_output: The last model output (ex: {{ model_output }})
messages: The full list of input/output messages (ex: {{ messages[0]['content'] }})
input_prompt: The message contents of the last model output (only for chat completion tasks) (ex: {{ input_prompt }})
Default: You are a helpful assistant and will be used to judge the output of another model.
user_prompt string, TemplateValue required
The user prompt given to the judge model. The prompt can refer
to the following variables dynamically (using {{ }} syntax):
In all scenarios:
sample: Sample attributes (ex: {{ sample.answer }})
If the task has a solver:
model_output: The last model output (ex: {{ model_output }})
messages: The full list of input/output messages (ex: {{ messages[0]['content'] }})
input_prompt: The message contents of the last model output (only for chat completion tasks) (ex: {{ input_prompt }})
score_min number, TemplateValue
The minimum score that the judge model can predict.
Default: 0.0
score_max number, TemplateValue
The maximum score that the judge model can predict.
Default: 1.0
use_structured_outputs boolean
Whether to use structured outputs. It is recommended to enable this if the model supports it.
Default: False
purpose ScorerPurpose
The purpose of this scorer.
score: The scorer is used to score the solver output or the dataset sample.
qa: The scorer is used to do QA over the solver output or the dataset sample.
Default: score
key string
Unique identifier assigned to the entity in AI GO!.
Default: None
display_name string
The display name of the scorer.
Default: None
metrics array[PythonMetricTemplate, BinaryClassificationMetricTemplate, MulticlassClassificationMetricTemplate, MeanMetricTemplate, MaxMetricTemplate, MinMetricTemplate, StdDevMetricTemplate, FrequencyMetricTemplate, RecallMetricTemplate, PrecisionMetricTemplate, F1ScoreMetricTemplate]
The metrics associated with this scorer, which will produce per-task metrics.
Default: None
