RAG Checker

Evaluates retrieval-augmented generation (RAG) outputs for faithfulness and relevance, producing multiple sub-scores. Use this scorer when your pipeline is a RAG system and you want to measure whether the generated answer stays faithful to the retrieved context and whether that context is relevant to the query. For general LLM output evaluation without retrieved context, use the Model Scorer instead.

Output

The scorer produces a subset of the following scores, depending on scores_to_compute. All scores are in the [0, 1] range.

  • recall: Proportion of ground-truth claims covered by the generated answer.
  • precision: Proportion of generated claims that are supported by the ground truth.
  • f1_score: Harmonic mean of recall and precision.
  • claim_recall: Fraction of target claims that are found in the retrieved chunks.
  • context_precision: Fraction of retrieved chunks that are relevant, meaning they contain at least one target claim.
  • context_utilization: Fraction of target claims that are both present in the retrieved chunks and correctly included in the model response.
  • noise_sensitivity_in_relevant: Fraction of incorrect claims in the model response that are supported (entailed) by relevant chunks.
  • noise_sensitivity_in_irrelevant: Fraction of incorrect claims in the model response that are supported (entailed) by irrelevant chunks.
  • hallucination: Fraction of incorrect claims in the model response that are not supported by any retrieved chunk.
  • self_knowledge: Fraction of correct claims in the model response that are not supported by any retrieved chunk.
  • faithfulness: Fraction of claims in the model response that are supported (entailed) by the retrieved context.
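
As an illustration of how the first three scores relate, here is a minimal Python sketch that computes recall, precision, and f1_score from per-claim judgments. The function name overall_scores and the boolean-list inputs are assumptions made for this example and are not part of the scorer's API; in the real scorer the per-claim judgments come from the judge model.

```python
from typing import Dict, List


def _fraction(flags: List[bool]) -> float:
    # Share of True values. Returning 0.0 for an empty list is an
    # assumption of this sketch, not documented scorer behavior.
    return sum(flags) / len(flags) if flags else 0.0


def overall_scores(
    target_covered: List[bool], response_supported: List[bool]
) -> Dict[str, float]:
    # target_covered[i]: is ground-truth claim i covered by the answer?
    # response_supported[j]: is generated claim j supported by the ground truth?
    recall = _fraction(target_covered)
    precision = _fraction(response_supported)
    total = precision + recall
    # Harmonic mean of precision and recall; 0.0 when both are zero.
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return {"recall": recall, "precision": precision, "f1_score": f1}


scores = overall_scores(
    target_covered=[True, True, False, True],
    response_supported=[True, False],
)
print(scores)
```

With three of four ground-truth claims covered and one of two generated claims supported, recall is 0.75, precision is 0.5, and f1_score is approximately 0.6.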

How It Works

Scoring decomposes both the generated answer and the ground truth into atomic claims, then cross-checks each claim against the retrieved context passages. Use scores_to_compute to select only the sub-scores you need.
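
The cross-checking step can be sketched roughly as follows. This is a simplified illustration, not the scorer's implementation: the claims are supplied as pre-split lists (standing in for claim extraction), and naive_entails is a hypothetical substring check standing in for the judge model's entailment decision. Only three scores whose definitions follow directly from the Output list are computed.

```python
from typing import Callable, Dict, List


def naive_entails(chunk: str, claim: str) -> bool:
    # Hypothetical stand-in for the judge model: a case-insensitive
    # substring match instead of a real entailment check.
    return claim.lower() in chunk.lower()


def _fraction(flags: List[bool]) -> float:
    # Share of True values; 0.0 for an empty list is an assumption.
    return sum(flags) / len(flags) if flags else 0.0


def cross_check(
    response_claims: List[str],
    target_claims: List[str],
    chunks: List[str],
    entails: Callable[[str, str], bool] = naive_entails,
) -> Dict[str, float]:
    def supported(claim: str) -> bool:
        # A claim counts as supported if any retrieved chunk entails it.
        return any(entails(chunk, claim) for chunk in chunks)

    return {
        # Response claims supported by the retrieved context.
        "faithfulness": _fraction([supported(c) for c in response_claims]),
        # Target claims found in the retrieved chunks.
        "claim_recall": _fraction([supported(c) for c in target_claims]),
        # Chunks that contain at least one target claim.
        "context_precision": _fraction(
            [any(entails(chunk, t) for t in target_claims) for chunk in chunks]
        ),
    }


result = cross_check(
    response_claims=["Paris is the capital of France", "Paris is in Italy"],
    target_claims=[
        "Paris is the capital of France",
        "Paris has about two million residents",
    ],
    chunks=[
        "Paris is the capital of France. It lies on the Seine.",
        "Berlin is the capital of Germany.",
    ],
)
print(result)
```

In this toy input, one of two response claims, one of two target claims, and one of two chunks pass the check, so each of the three scores comes out to 0.5.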

Configuration

Properties


type Literal "rag_checker" required

The type of the scorer.


query_column string, TemplateValue

The name of the sample column containing the query.

Default: query

target_column string, TemplateValue

The name of the sample column containing the target (ground-truth) answer.

Default: target

judge_model_key Key, TemplateValue required

The registered chat completion model to be used as a claim extractor and checker.


scores_to_compute array[string]

The scores to compute. If set to None, all scores are computed.

Default: None

key string

Unique identifier assigned to the entity in AI GO!.

Default: None

purpose ScorerPurpose

The purpose of this scorer.

  • score: The scorer is used to score the solver output or the dataset sample.
  • qa: The scorer is used to do QA over the solver output or the dataset sample.

Default: score

display_name string

The display name of the scorer.

Default: None

metrics array[PythonMetricTemplate, BinaryClassificationMetricTemplate, MulticlassClassificationMetricTemplate, MeanMetricTemplate, MaxMetricTemplate, MinMetricTemplate, StdDevMetricTemplate, FrequencyMetricTemplate, RecallMetricTemplate, PrecisionMetricTemplate, F1ScoreMetricTemplate]

The metrics associated with this scorer, which will produce per-task metrics.

Default: None
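
Putting the properties above together, a configuration might look like the following sketch. The exact configuration format is not shown in this section, so the Python dictionary form is an assumption, and the judge_model_key value "my-judge-model" is a hypothetical placeholder for a model you have registered. The small check at the end is also illustrative: it only confirms that the requested scores appear in the Output list.

```python
# Score names taken from the Output list of this scorer.
KNOWN_SCORES = {
    "recall",
    "precision",
    "f1_score",
    "claim_recall",
    "context_precision",
    "context_utilization",
    "noise_sensitivity_in_relevant",
    "noise_sensitivity_in_irrelevant",
    "hallucination",
    "self_knowledge",
    "faithfulness",
}

rag_checker_config = {
    "type": "rag_checker",
    "query_column": "query",              # documented default
    "target_column": "target",            # documented default
    "judge_model_key": "my-judge-model",  # hypothetical key, required
    # Omit this entry (or set it to None) to compute every score.
    "scores_to_compute": ["faithfulness", "hallucination", "context_precision"],
    "purpose": "score",                   # documented default
}

# Illustrative sanity check: reject score names not listed under Output.
unknown = set(rag_checker_config["scores_to_compute"]) - KNOWN_SCORES
if unknown:
    raise ValueError(f"Unknown scores requested: {sorted(unknown)}")
```

Restricting scores_to_compute to the sub-scores you need keeps the results focused on the part of the pipeline under review, such as the generator (faithfulness, hallucination) or the retriever (context_precision).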