Policies
Policies are currently in Preview. Breaking changes may occur in future releases.
Policies let you define “quality gates” for your AI app and automatically check whether your evaluation results meet them. If you ran multiple evaluations with the same key, only the latest evaluation for that key is considered.
A policy groups one or more rules. Each rule evaluates metrics produced by your evaluations and returns a simple status: PASS or FAIL, with a short explanation.
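The "latest evaluation per key" behavior described above can be pictured with a small Python sketch. This is not the product's implementation: the `EvaluationRun` record, its field names, and the use of a finish timestamp to decide which run is "latest" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class EvaluationRun:
    # Hypothetical record; field names are illustrative, not the product's schema.
    key: str
    finished_at: datetime
    metrics: dict[str, float]


def latest_per_key(runs: list[EvaluationRun]) -> dict[str, EvaluationRun]:
    """Keep only the most recent run for each evaluation key."""
    latest: dict[str, EvaluationRun] = {}
    for run in runs:
        current = latest.get(run.key)
        # Assumption: "latest" means the run with the newest finish time.
        if current is None or run.finished_at > current.finished_at:
            latest[run.key] = run
    return latest


# Illustrative data: "rag_eval" was run twice, "safety_eval" once.
runs = [
    EvaluationRun("rag_eval", datetime(2024, 1, 1), {"hallucination": 0.09}),
    EvaluationRun("rag_eval", datetime(2024, 1, 2), {"hallucination": 0.03}),
    EvaluationRun("safety_eval", datetime(2024, 1, 1), {"faithfulness": 0.91}),
]
selected = latest_per_key(runs)
```

In this sketch only the second `rag_eval` run would feed into policy rules; the earlier run with the same key is ignored.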
Prerequisites
To use policies, you need:
- An AI app with at least one evaluation you can run.
- Tasks that produce the metrics you want to set rules on (for example, accuracy, hallucination, faithfulness).
- The CLI installed and configured (see Command Line Interface).
Quickstart
This quickstart assumes you already have evaluations that produce the metrics you care about.
- Define the policies.
- Set policies for your AI app.
- Run (or re-run) your evaluations.
- Check policy status.
1. Define policies
Create a file called policies.yaml that contains the policies:
policies:
  - key: "rag_quality_gate_v1"
    display_name: "RAG Quality Gate v1"
    description: "Baseline quality thresholds for production RAG systems"
    rules:
      - key: "hallucinations_are_uncommon"
        display_name: "Hallucinations are Uncommon"
        description: "Hallucinations must be below 0.05"
        definition:
          type: "threshold"
          metric: "hallucination"
          operator: "<"
          threshold: 0.05
        scope: "all_latest"
      - key: "faithfulness_is_measured"
        display_name: "Faithfulness is measured"
        description: "Faithfulness metric must be computed."
        definition:
          type: "exists"
          metric: "faithfulness"
        scope: "all_latest"
This file defines a policy named "RAG Quality Gate v1" with 2 rules:
- Hallucinations are Uncommon: passes if at least one hallucination metric value exists and all of the hallucination metric values are below 0.05.
- Faithfulness is measured: passes if at least one faithfulness metric value exists.
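To make the pass conditions of these two rules concrete, here is a minimal Python sketch of how an "exists" rule and a "threshold" rule could be evaluated over the metric values in scope. It is an illustration only: the function names, the explanation strings, the metric values, and every operator other than "<" are assumptions, not the CLI's actual behavior or output.

```python
import operator
from typing import Callable

# Assumed operator set: the documentation only shows "<"; the rest are illustrative.
OPERATORS: dict[str, Callable[[float, float], bool]] = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def check_exists(metric: str, values: list[float]) -> tuple[str, str]:
    """PASS if at least one value of the metric is in scope."""
    if values:
        return "PASS", f"{len(values)} value(s) found for '{metric}'."
    return "FAIL", f"No values found for '{metric}'."


def check_threshold(
    metric: str, values: list[float], op: str, threshold: float
) -> tuple[str, str]:
    """PASS if at least one value exists and every value meets the threshold."""
    if not values:
        return "FAIL", f"No values found for '{metric}'."
    compare = OPERATORS[op]
    failing = [v for v in values if not compare(v, threshold)]
    if failing:
        return "FAIL", (
            f"{len(failing)}/{len(values)} values for '{metric}' "
            f"do not meet the threshold ({threshold})."
        )
    return "PASS", f"All {len(values)} value(s) for '{metric}' meet the threshold."


# Illustrative metric values, not real evaluation results.
hallucination_result = check_threshold("hallucination", [0.09], "<", 0.05)
faithfulness_result = check_exists("faithfulness", [0.91])
```

Note that a threshold rule with no metric values in scope fails in this sketch, matching the "at least one value exists" condition stated above.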
Tip: see the Rule Types and Rule Scope sections below for more details on rule creation.
2. Set policies for your AI app
Set these policies for your AI app by running:
$ lf set policies policies.yaml
3. Run evaluations
Policies check the metrics computed by the latest evaluations per key. If you haven’t run your evaluations yet, you can run them one by one:
$ lf run evaluations/rag_eval.yaml
$ lf run evaluations/safety_eval.yaml
4. View the status of the policies
Once the evaluations finish, you can view the status of your policies by running:
$ lf overview policies
Example output:
[Note] The 'policy' features are experimental. There may be breaking changes in future releases.
On AI app 'my-rag-app'.
Policy 'rag_quality_gate_v1':
- Name: 'RAG Quality Gate v1'
- Overview: 1/2 rules passed
Policy Rules (2 rows)
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Rule Name ┃ Result ┃ Explanation ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Hallucinations are Uncommon │ FAIL │ 1/1 metric values with name 'hallucination' do not meet the threshold (0.05). │
│ Faithfulness is measured │ PASS │ 1 metric value found with name 'faithfulness'. │
└─────────────────────────────┴────────┴────────────────────────────────────────────────────────────────────────────────┘
Tip: if you want JSON output, use lf overview policies --json.
Defining rules
Rule types
Metric exists
Passes if at least one value of the metric has been computed by an evaluation. Only metric values in scope are considered (see Rule Scope). Example:
definition:
  type: "exists"
  metric: "faithfulness"
Metric meets the threshold
Passes if the metric is computed and all metric values meet the threshold. Only metric values in scope are considered (see Rule Scope). Example:
definition:
  type: "threshold"
  metric: "hallucination"
  operator: "<"
  threshold: 0.05
Rule scope
Each rule must define a scope, which controls which metrics the rule considers. Two types of scopes are supported:
1. Metrics from all latest evaluations
Applies the rule to the metrics produced by the latest run of each evaluation (per-key) in the AI app. If you ran multiple evaluations with the same key, only the latest evaluation for that key is considered.
scope: "all_latest"
2. Custom scope
Use this when you want a rule to apply only to specific evaluations/tasks/metrics (for example, to prevent experimental evaluations from affecting policy status, or to apply a threshold only to a particular evaluation / task result). If you ran multiple evaluations with the same key, only the latest evaluation for that key is considered.
You can narrow the scope by filtering on evaluation keys, task specification keys, scorer keys, and/or metric keys. Example:
scope:
  evaluation_keys: ["rag_eval_v1", "safety_eval_v2"]
  task_specification_keys: ["rag_performance"]
  scorer_keys: ["rag_checker"]
  metric_keys: ["claim_recall"]
Export policies
If you want to export the policies for the current AI app, run:
$ lf export policies -o policies.yaml
Tip: if you want JSON output, use lf export policies --json.
Practical tips
- Start with a small number of rules (1–5) for your first policy, then expand.
- Use clear rule names that are easy to read in lf overview policies.
- If multiple evaluations exist in your app, prefer scoping rules to specific evaluation keys until you’re ready to broaden coverage.
- Iterate: policies are easiest to write once you’ve seen the metrics produced by your tasks.
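The scoping tip above can be illustrated with a short Python sketch of how a custom scope might narrow the metric values a rule sees. The `MetricValue` record and the matching logic are hypothetical. In particular, the sketch assumes that the filters are combined with AND and that an omitted filter matches everything, which the documentation does not state.

```python
from dataclasses import dataclass


@dataclass
class MetricValue:
    # Hypothetical record describing where a metric value came from.
    evaluation_key: str
    task_specification_key: str
    scorer_key: str
    metric_key: str
    value: float


def in_scope(mv: MetricValue, scope: dict[str, list[str]]) -> bool:
    """True if the value matches every filter present in the scope.

    Assumption: filters are ANDed, and an omitted filter matches everything.
    """
    checks = {
        "evaluation_keys": mv.evaluation_key,
        "task_specification_keys": mv.task_specification_key,
        "scorer_keys": mv.scorer_key,
        "metric_keys": mv.metric_key,
    }
    return all(
        actual in scope[name] for name, actual in checks.items() if name in scope
    )


# Illustrative values: one from a stable evaluation, one from an experimental one.
values = [
    MetricValue("rag_eval_v1", "rag_performance", "rag_checker", "claim_recall", 0.82),
    MetricValue("experimental_eval", "rag_performance", "rag_checker", "claim_recall", 0.41),
]
scope = {"evaluation_keys": ["rag_eval_v1"], "metric_keys": ["claim_recall"]}
scoped = [mv for mv in values if in_scope(mv, scope)]
```

Here the value from `experimental_eval` is excluded, so under these assumptions a rule using this scope would not be affected by the experimental evaluation.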
