Model Scorer

Scores each sample by having a judge model produce a numeric score, normalized to the [0, 1] range. Use this when the evaluation criterion is subjective or nuanced - such as coherence, groundedness, or helpfulness - and you want a graded score rather than binary pass/fail. For binary pass/fail, use the Model As A Judge Classifier instead.

Output

  • score: The judge's score, normalized to the [0, 1] range.

Score Normalization

The judge responds with a JSON object containing a score, reasoning, and confidence. The raw score is validated against [score_min, score_max] and normalized to [0, 1] before storage.
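The validation and normalization step can be sketched in Python as follows. This is a minimal illustration, not the scorer's actual implementation: it assumes linear (min-max) rescaling, assumes the JSON keys are named `score`, `reasoning`, and `confidence`, and assumes an out-of-range score raises an error. The function name `normalize_score` is hypothetical.

```python
import json


def normalize_score(raw_response: str, score_min: float = 0.0, score_max: float = 1.0) -> float:
    """Parse a judge response and rescale its score to [0, 1].

    Assumptions: linear min-max rescaling; out-of-range scores raise an error.
    """
    if score_max <= score_min:
        raise ValueError("score_max must be greater than score_min")

    # Assumed key names: "score", "reasoning", "confidence".
    payload = json.loads(raw_response)
    raw = float(payload["score"])

    # Validate the raw score against the configured range.
    if not score_min <= raw <= score_max:
        raise ValueError(f"score {raw} is outside [{score_min}, {score_max}]")

    return (raw - score_min) / (score_max - score_min)


# Hypothetical judge response on a 0-to-100 scale.
response = '{"score": 85, "reasoning": "Mostly grounded.", "confidence": 0.9}'
print(normalize_score(response, score_min=0, score_max=100))
```

Under these assumptions, a raw score of 85 on a 0-to-100 scale is stored as 0.85.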

LLM judges tend to produce better-calibrated scores when given a wider score range. For best results, set score_min to 0 and score_max to 100, and describe the scale in your system_prompt.

Examples

Example: Assess a (Configurable) Dimension of an Answer. The judge rates how well the model's answer satisfies a configurable dimension on a 0-to-100 scale, which is then normalized to [0, 1].

...
definition:
  ...
  scorers:
    - type: "model_as_a_judge_scorer"
      model_key: "<< config.judge_model >>"
      system_prompt: >
        You are an evaluator assessing the << config.evaluation_dimension >> of a
        model response to a question.
        On a scale from 0 to 100, assign 100 if the response is fully
        << config.evaluation_dimension >> and 0 if it is not at all.
        Return only the numeric score.
      user_prompt: >
        <context>{{ sample.context }}</context>
        <response>{{ model_output.choices[0].message.content }}</response>
      score_min: 0
      score_max: 100
      metrics:
        - type: "mean"
          field: "score"
          name: "Mean Score"
        - type: "min"
          field: "score"
          name: "Min Score"
        - type: "max"
          field: "score"
          name: "Max Score"
💡 LLM scorers produce more calibrated results when the score range is 0 to 100, rather than 0 to 1.

To use a custom score range, set score_min and score_max and describe the range in the system_prompt. Note that the resulting scores are always normalized to the [0, 1] range before metrics are calculated.
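To illustrate why metric values fall in [0, 1] even when the judge uses a 0-to-100 scale, here is a small Python sketch that computes the mean, min, and max metrics from the example configuration over normalized scores. The raw scores and the helper `summarize` are hypothetical, and linear rescaling is an assumption rather than confirmed behavior.

```python
from statistics import mean


def summarize(raw_scores, score_min=0.0, score_max=100.0):
    """Rescale raw judge scores to [0, 1], then aggregate them.

    Assumption: normalization is linear min-max rescaling.
    """
    span = score_max - score_min
    normalized = [(s - score_min) / span for s in raw_scores]
    return {
        "Mean Score": mean(normalized),
        "Min Score": min(normalized),
        "Max Score": max(normalized),
    }


# Hypothetical raw judge scores on a 0-to-100 scale.
print(summarize([50, 75, 100]))
```

With these inputs the metrics are reported on the normalized scale (for example, a mean of 0.75 rather than 75).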

Configuration

Properties


type Literal "model_as_a_judge_scorer" required

The type of the scorer.


model_key Key, TemplateValue required

The model to be used as the judge.


system_prompt string, TemplateValue

The system prompt given to the judge model. The prompt can reference the following variables dynamically (using {{ }} syntax):

In all scenarios:

  1. sample: Sample attributes (ex: {{ sample.answer }})

If the task has a solver:

  1. model_output: The last model output (ex: {{ model_output }})
  2. messages: The full list of input/output messages (ex: {{ messages[0]['content'] }})
  3. input_prompt: The message contents of the last model output (only for chat completion tasks) (ex: {{ input_prompt }})

Default: You are a helpful assistant and will be used to judge the output of another model.

user_prompt string, TemplateValue required

The user prompt given to the judge model. The prompt can reference the following variables dynamically (using {{ }} syntax):

In all scenarios:

  1. sample: Sample attributes (ex: {{ sample.answer }})

If the task has a solver:

  1. model_output: The last model output (ex: {{ model_output }})
  2. messages: The full list of input/output messages (ex: {{ messages[0]['content'] }})
  3. input_prompt: The message contents of the last model output (only for chat completion tasks) (ex: {{ input_prompt }})

score_min number, TemplateValue

The minimum score that the judge model can predict.

Default: 0.0

score_max number, TemplateValue

The maximum score that the judge model can predict.

Default: 1.0

use_structured_outputs boolean

Whether to use structured outputs. It is recommended to enable this if the model supports it.

Default: False

purpose ScorerPurpose

The purpose of this scorer.

  • score: The scorer is used to score the solver output or the dataset sample.
  • qa: The scorer is used to do QA over the solver output or the dataset sample.

Default: score

key string

Unique identifier assigned to the entity in AI GO!.

Default: None

display_name string

The display name of the scorer.

Default: None

metrics array[PythonMetricTemplate, BinaryClassificationMetricTemplate, MulticlassClassificationMetricTemplate, MeanMetricTemplate, MaxMetricTemplate, MinMetricTemplate, StdDevMetricTemplate, FrequencyMetricTemplate, RecallMetricTemplate, PrecisionMetricTemplate, F1ScoreMetricTemplate]

The metrics associated with this scorer, which will produce per-task metrics.

Default: None