⚠️
Policies are currently in Preview. Breaking changes may occur in future releases.

Policies let you define “quality gates” for your AI app and automatically check whether your evaluation results meet them. If you run multiple evaluations with the same key, only the latest evaluation for that key is considered.

A policy groups one or more rules. Each rule evaluates metrics produced by your evaluations and returns a simple status: PASS or FAIL, with a short explanation.

Prerequisites

To use policies, you need:

An AI app with at least one evaluation you can run.
Tasks that produce the metrics you want to set rules on (for example: accuracy, hallucination, faithfulness).
The CLI installed and configured (see Command Line Interface).

Quickstart

This quickstart assumes you already have evaluations that produce the metrics you care about.

1. Define the policies.
2. Set policies for your AI app.
3. Run (or re-run) your evaluations.
4. Check policy status.

1. Define policies

Create a file called policies.yaml that contains the policies:

policies:
  - key: "rag_quality_gate_v1"
    display_name: "RAG Quality Gate v1"
    description: "Baseline quality thresholds for production RAG systems"
    rules:
      - key: "hallucinations_are_uncommon"
        display_name: "Hallucinations are Uncommon"
        description: "Hallucinations must be below 0.05"
        definition:
          type: "threshold"
          metric: "hallucination"
          operator: "<"
          threshold: 0.05
        scope: "all_latest"
      - key: "faithfulness_is_measured"
        display_name: "Faithfulness is measured"
        description: "Faithfulness metric must be computed."
        definition:
          type: "exists"
          metric: "faithfulness"
        scope: "all_latest"

This file defines one policy, "RAG Quality Gate v1", with two rules:

Hallucinations are Uncommon: passes if at least one hallucination metric value exists and all of the hallucination metric values are below 0.05.
Faithfulness is measured: passes if at least one faithfulness metric value exists.
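
Before you set the policies, it can help to sanity-check the structure of the file. The sketch below is not part of the CLI. It mirrors the YAML above as a plain Python dictionary and checks only the fields used on this page. The function name `find_rule_problems` and the choice of which fields to treat as required are assumptions made for illustration.

```python
# Hypothetical pre-flight check; not part of the lf CLI.
# Assumption: the fields checked here are inferred from the example
# policies.yaml on this page, not from an official schema.

KNOWN_TYPES = {"exists", "threshold"}


def find_rule_problems(policy: dict) -> list[str]:
    """Return human-readable problems found in the rules of one policy."""
    problems = []
    for rule in policy.get("rules", []):
        name = rule.get("key", "<missing key>")
        if "key" not in rule:
            problems.append(f"{name}: rule has no 'key'")
        if "scope" not in rule:
            problems.append(f"{name}: rule has no 'scope'")
        definition = rule.get("definition", {})
        rule_type = definition.get("type")
        if rule_type not in KNOWN_TYPES:
            problems.append(f"{name}: unknown rule type {rule_type!r}")
        if "metric" not in definition:
            problems.append(f"{name}: definition has no 'metric'")
        if rule_type == "threshold":
            for field in ("operator", "threshold"):
                if field not in definition:
                    problems.append(f"{name}: threshold rule has no '{field}'")
    return problems


# The policy from policies.yaml, written as a Python dict.
policy = {
    "key": "rag_quality_gate_v1",
    "rules": [
        {
            "key": "hallucinations_are_uncommon",
            "definition": {"type": "threshold", "metric": "hallucination",
                           "operator": "<", "threshold": 0.05},
            "scope": "all_latest",
        },
        {
            "key": "faithfulness_is_measured",
            "definition": {"type": "exists", "metric": "faithfulness"},
            "scope": "all_latest",
        },
    ],
}

print(find_rule_problems(policy))  # prints []
```

An empty list means the sketch found nothing to flag; it does not guarantee that the CLI will accept the file.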

📘
Tip: see the Rule types and Rule scope sections below for more details on rule creation.

2. Set policies for your AI app

Set these policies for your AI app by running:

$ lf set policies policies.yaml

3. Run evaluations

Policies check the metrics computed by the latest evaluation for each key. If you haven’t run your evaluations yet, run them one at a time:

$ lf run evaluations/rag_eval.yaml
$ lf run evaluations/safety_eval.yaml

4. View the status of the policies

Once the evaluations finish, you can view the status of your policies by running:

$ lf overview policies

Example output:

[Note] The 'policy' features are experimental. There may be breaking changes in future releases.
On AI app 'my-rag-app'.

Policy 'rag_quality_gate_v1':
- Name: 'RAG Quality Gate v1'
- Overview: 1/2 rules passed

                                                Policy Rules (2 rows)                                                 
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Rule Name                   ┃ Result ┃ Explanation                                                                    ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Hallucinations are Uncommon │ FAIL   │ 1/1 metric values with name 'hallucination' do not meet the threshold (0.05).  │
│ Faithfulness is measured    │ PASS   │ 1 metric value found with name 'faithfulness'.                                 │
└─────────────────────────────┴────────┴────────────────────────────────────────────────────────────────────────────────┘

📘
Tip: if you want JSON output, use lf overview policies --json.
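
To use the policy status as a gate in a script, one option is to read the `Overview: N/M rules passed` line from the example output. The sketch below does this on captured text. The helper name `all_rules_passed` is hypothetical. The text format is taken only from the example above and may change while policies are in Preview. The schema of the `--json` output is not shown on this page, so the sketch does not rely on it.

```python
import re

# Pattern based on the "- Overview: 1/2 rules passed" line in the example output.
OVERVIEW_RE = re.compile(r"Overview:\s*(\d+)/(\d+) rules passed")


def all_rules_passed(overview_text: str) -> bool:
    """Return True only if every policy in the text reports all rules passed."""
    matches = OVERVIEW_RE.findall(overview_text)
    if not matches:
        # No overview line found: treat this as a failure, not a silent pass.
        return False
    return all(int(passed) == int(total) for passed, total in matches)


sample = """Policy 'rag_quality_gate_v1':
- Name: 'RAG Quality Gate v1'
- Overview: 1/2 rules passed
"""
print(all_rules_passed(sample))  # prints False
```

Treating a missing overview line as a failure is a deliberate choice here: a gate that passes when it cannot read the status would hide problems.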

Defining rules

Rule types

Metric exists

Passes if at least one value of the metric has been computed by an evaluation. Only metric values in scope are considered (see Rule scope). Example:

definition:
  type: "exists"
  metric: "faithfulness"
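
The pass condition can be sketched in a few lines of Python. This is an illustration of the semantics described above, not the product's implementation. The function name `evaluate_exists`, the shape of `values_in_scope`, and the FAIL message are assumptions; the PASS message is modeled on the example output in the quickstart.

```python
def evaluate_exists(metric: str, values_in_scope: dict[str, list[float]]) -> tuple[str, str]:
    """Return (status, explanation) for an 'exists' rule.

    values_in_scope maps a metric name to the values that are in scope
    for the rule (hypothetical data shape).
    """
    count = len(values_in_scope.get(metric, []))
    if count > 0:
        noun = "value" if count == 1 else "values"
        return "PASS", f"{count} metric {noun} found with name '{metric}'."
    # Assumption: the wording of the FAIL explanation is invented here.
    return "FAIL", f"No metric values found with name '{metric}'."


values_in_scope = {"faithfulness": [0.91], "hallucination": [0.12]}
status, explanation = evaluate_exists("faithfulness", values_in_scope)
print(status, "-", explanation)
```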

Metric meets the threshold

Passes if at least one value of the metric has been computed and every value meets the threshold. Only metric values in scope are considered (see Rule scope). Example:

definition:
  type: "threshold"
  metric: "hallucination"
  operator: "<"
  threshold: 0.05
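
The following Python sketch illustrates the threshold semantics: the rule fails when no values exist, and it fails when any single value misses the threshold. It is not the product's implementation. The function name `evaluate_threshold` is hypothetical, and only the `<` operator appears on this page; the other operators in the table are assumptions.

```python
import operator

# Assumption: only "<" is documented on this page; the rest are guesses.
OPERATORS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}


def evaluate_threshold(metric: str, values: list[float], op: str, threshold: float) -> tuple[str, str]:
    """Return (status, explanation) for a 'threshold' rule."""
    if not values:
        # The metric must be computed at least once for the rule to pass.
        return "FAIL", f"No metric values found with name '{metric}'."
    meets = OPERATORS[op]
    failing = [v for v in values if not meets(v, threshold)]
    if failing:
        return "FAIL", (
            f"{len(failing)}/{len(values)} metric values with name "
            f"'{metric}' do not meet the threshold ({threshold})."
        )
    return "PASS", f"All {len(values)} metric values with name '{metric}' meet the threshold ({threshold})."


status, explanation = evaluate_threshold("hallucination", [0.12], "<", 0.05)
print(status, "-", explanation)
```

Note that with `<`, a value exactly equal to the threshold (0.05) does not pass.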

Rule scope

Each rule must define a scope that determines which metric values the rule considers. Two types of scope are supported:

1. Metrics from all latest evaluations

Applies the rule to the metrics produced by the latest run of each evaluation (per-key) in the AI app. If you ran multiple evaluations with the same key, only the latest evaluation for that key is considered.

scope: "all_latest"
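
The "latest per key" selection can be pictured as follows. This is a minimal sketch, assuming each evaluation run carries a key and a completion time. The record fields (`evaluation_key`, `finished_at`, `metrics`), the sample values, and the function name `latest_run_per_key` are all hypothetical; how the product actually determines the latest run is not described on this page.

```python
from datetime import datetime


def latest_run_per_key(runs: list[dict]) -> dict[str, dict]:
    """Keep only the most recent run for each evaluation key."""
    latest: dict[str, dict] = {}
    for run in runs:
        current = latest.get(run["evaluation_key"])
        if current is None or run["finished_at"] > current["finished_at"]:
            latest[run["evaluation_key"]] = run
    return latest


# Illustrative data only.
runs = [
    {"evaluation_key": "rag_eval_v1", "finished_at": datetime(2025, 1, 1),
     "metrics": {"hallucination": 0.12}},
    {"evaluation_key": "rag_eval_v1", "finished_at": datetime(2025, 1, 5),
     "metrics": {"hallucination": 0.03}},
    {"evaluation_key": "safety_eval_v2", "finished_at": datetime(2025, 1, 3),
     "metrics": {"faithfulness": 0.91}},
]

for key, run in latest_run_per_key(runs).items():
    print(key, run["metrics"])
```

In this example the older `rag_eval_v1` run, with a hallucination value of 0.12, no longer affects any rule.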

2. Custom scope

Use a custom scope when a rule should apply only to specific evaluations, tasks, or metrics, for example to keep experimental evaluations from affecting policy status, or to apply a threshold to one particular evaluation or task result. As with all_latest, only the latest evaluation for each key is considered.

You can narrow the scope by filtering on evaluation keys, task specification keys, scorer keys, and/or metric keys. Example:

scope:
  evaluation_keys: ["rag_eval_v1", "safety_eval_v2"]
  task_specification_keys: ["rag_performance"]
  scorer_keys: ["rag_checker"]
  metric_keys: ["claim_recall"]
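
One way to read a custom scope is as a set of filters applied to each metric value. The sketch below assumes that all listed filters must match (a logical AND), that a value matches a filter when its key is any one of the listed keys, and that an omitted filter places no restriction. These combination rules are assumptions, not documented behavior, and the record fields and the function name `in_scope` are hypothetical.

```python
# Maps each scope filter to the (hypothetical) field it is compared against.
FIELD_FOR_FILTER = {
    "evaluation_keys": "evaluation_key",
    "task_specification_keys": "task_specification_key",
    "scorer_keys": "scorer_key",
    "metric_keys": "metric_key",
}


def in_scope(record: dict, scope: dict) -> bool:
    """Return True if a metric record passes every filter present in the scope."""
    for filter_name, field in FIELD_FOR_FILTER.items():
        allowed = scope.get(filter_name)
        # Assumption: a filter that is not listed does not restrict anything.
        if allowed is not None and record[field] not in allowed:
            return False
    return True


scope = {
    "evaluation_keys": ["rag_eval_v1", "safety_eval_v2"],
    "task_specification_keys": ["rag_performance"],
    "scorer_keys": ["rag_checker"],
    "metric_keys": ["claim_recall"],
}

record = {
    "evaluation_key": "rag_eval_v1",
    "task_specification_key": "rag_performance",
    "scorer_key": "rag_checker",
    "metric_key": "claim_recall",
    "value": 0.87,  # illustrative value
}
print(in_scope(record, scope))  # prints True
```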

Export policies

If you want to export the policies for the current AI app, run:

$ lf export policies -o policies.yaml

📘
Tip: if you want JSON output, use lf export policies --json.

Practical tips

Start with a small number of rules (1–5) for your first policy, then expand.
Use clear rule names that are easy to read in lf overview policies.
If multiple evaluations exist in your app, prefer scoping rules to specific evaluation keys until you’re ready to broaden coverage.
Iterate: policies are easiest to write once you’ve seen the metrics produced by your tasks.