RAG Checker
Evaluates retrieval-augmented generation (RAG) outputs for faithfulness and relevance against the retrieved context, producing multiple sub-scores. Use this when your pipeline is a RAG system and you want to measure whether the generated answer stays faithful to the retrieved context and whether the retrieved context is relevant to the query. For general LLM output evaluation without retrieved context, use the Model Scorer instead.
Output
The scorer produces a subset of the following scores, depending on
scores_to_compute. All scores are in the [0, 1] range.
recall: Proportion of ground-truth claims covered by the generated answer.
precision: Proportion of generated claims that are supported by the ground-truth.
f1_score: Harmonic mean of recall and precision.
claim_recall: Fraction of target claims that are found in the retrieved chunks.
context_precision: Fraction of retrieved chunks that are relevant, meaning they contain at least one target claim.
context_utilization: Fraction of target claims that are both present in the retrieved chunks and correctly included in the model response.
noise_sensitivity_in_relevant: Fraction of incorrect claims in the model response that are supported (entailed) by relevant chunks.
noise_sensitivity_in_irrelevant: Fraction of incorrect claims in the model response that are supported (entailed) by irrelevant chunks.
hallucination: Fraction of incorrect claims in the model response that are not supported by any retrieved chunk.
self_knowledge: Fraction of correct claims in the model response that are not supported by any retrieved chunk.
faithfulness: Fraction of claims in the model response that are supported (entailed) by the retrieved context.
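As a rough illustration of how recall, precision, and f1_score relate, the sketch below computes them from per-claim verdicts. The function name `answer_scores`, the boolean-list inputs, and the choice to return 0.0 for empty inputs are assumptions made for this example, not the scorer's actual implementation.

```python
from typing import Dict, List


def answer_scores(gt_covered: List[bool], resp_supported: List[bool]) -> Dict[str, float]:
    """Compute recall, precision, and f1_score from claim-level verdicts.

    gt_covered:     one flag per ground-truth claim, True if the answer covers it.
    resp_supported: one flag per generated claim, True if the ground truth supports it.
    """
    # Assumption: an empty claim list yields 0.0 rather than raising an error.
    recall = sum(gt_covered) / len(gt_covered) if gt_covered else 0.0
    precision = sum(resp_supported) / len(resp_supported) if resp_supported else 0.0

    # Harmonic mean of recall and precision, guarded against division by zero.
    total = recall + precision
    f1_score = 2 * recall * precision / total if total else 0.0

    return {"recall": recall, "precision": precision, "f1_score": f1_score}


# Two of four ground-truth claims covered; three of four generated claims supported.
example = answer_scores([True, True, False, False], [True, True, True, False])
```

With these inputs, recall is 0.5 and precision is 0.75, so the harmonic mean sits closer to the lower of the two.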
How It Works
Scoring decomposes both the generated answer and the ground-truth into atomic claims,
then cross-checks each claim against the retrieved context passages. Use
scores_to_compute to select only the sub-scores you need.
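The cross-checking step can be sketched as follows for three of the context-related scores. This is a minimal illustration under stated assumptions: claims are passed in already extracted, `toy_entails` is a hypothetical substring check standing in for the judge model, and `context_scores` is an invented helper name rather than part of the product.

```python
from typing import Callable, Dict, List, Optional


def toy_entails(text: str, claim: str) -> bool:
    # Hypothetical stand-in for the judge model: a claim counts as entailed
    # only if it appears verbatim (case-insensitive) in the text.
    return claim.lower() in text.lower()


def context_scores(
    target_claims: List[str],
    response_claims: List[str],
    chunks: List[str],
    entails: Callable[[str, str], bool] = toy_entails,
    scores_to_compute: Optional[List[str]] = None,
) -> Dict[str, float]:
    def fraction(flags: List[bool]) -> float:
        # Assumption: an empty list scores 0.0.
        return sum(flags) / len(flags) if flags else 0.0

    def in_any_chunk(claim: str) -> bool:
        return any(entails(chunk, claim) for chunk in chunks)

    scores = {
        # Target claims found in the retrieved chunks.
        "claim_recall": fraction([in_any_chunk(c) for c in target_claims]),
        # Chunks that contain at least one target claim.
        "context_precision": fraction(
            [any(entails(chunk, c) for c in target_claims) for chunk in chunks]
        ),
        # Response claims supported by the retrieved context.
        "faithfulness": fraction([in_any_chunk(c) for c in response_claims]),
    }
    if scores_to_compute is None:
        return scores  # None means "compute everything this sketch supports"
    return {name: scores[name] for name in scores_to_compute if name in scores}


chunks = ["Paris is the capital of France.", "Bananas are yellow."]
target_claims = ["paris is the capital of france", "france is in europe"]
response_claims = ["paris is the capital of france", "france has ten regions"]

all_scores = context_scores(target_claims, response_claims, chunks)
only_faithfulness = context_scores(
    target_claims, response_claims, chunks, scores_to_compute=["faithfulness"]
)
```

Passing a list to `scores_to_compute` returns only the named entries, mirroring how the option narrows the output; a real judge model would replace the substring check with an entailment decision.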
Configuration
Properties
type Literal "rag_checker" required
The type of the scorer.
query_column string, TemplateValue
The name of the sample column containing the query.
Default:query
target_column string, TemplateValue
The name of the sample column containing the target.
Default:target
judge_model_key Key, TemplateValue required
The registered chat completion model to be used as a claim extractor and checker.
scores_to_compute array[string]
The scores to compute. If set to None, all scores are computed.
Default:None
key string
Unique identifier assigned to the entity in AI GO!.
Default:None
purpose ScorerPurpose
The purpose of this scorer.
score: The scorer is used to score the solver output or the dataset sample.
qa: The scorer is used to do QA over the solver output or the dataset sample.
Default:score
display_name string
The display name of the scorer.
Default:None
metrics array[PythonMetricTemplate, BinaryClassificationMetricTemplate, MulticlassClassificationMetricTemplate, MeanMetricTemplate, MaxMetricTemplate, MinMetricTemplate, StdDevMetricTemplate, FrequencyMetricTemplate, RecallMetricTemplate, PrecisionMetricTemplate, F1ScoreMetricTemplate]
The metrics associated with this scorer, which will produce per-task metrics.
Default:None
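To show how the properties above fit together, here is a hedged example written as a Python dictionary. The property names come from this section, but every value (including the model key `my-judge-model` and the display name) is a hypothetical placeholder, and the source does not state which file format the configuration is actually written in.

```python
# Property names are taken from the Properties list; all values are placeholders.
rag_checker_config = {
    "type": "rag_checker",                 # required, fixed literal
    "query_column": "query",               # default column name
    "target_column": "target",             # default column name
    "judge_model_key": "my-judge-model",   # required; hypothetical registered model key
    "scores_to_compute": ["faithfulness", "hallucination", "context_precision"],
    "purpose": "score",
    "display_name": "RAG Checker (faithfulness subset)",
}

# Simple sanity check that the two properties marked "required" are present.
REQUIRED = ("type", "judge_model_key")
missing = [name for name in REQUIRED if name not in rag_checker_config]
```

Leaving out `scores_to_compute` (or setting it to None) would request all scores, per the property description.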
