Run Evaluation From AI Atlas

In this tutorial, you will run a harmful content evaluation against an OpenAI model, using lf init --atlas to download the evaluation package from AI Atlas.

Before You Begin

  • You need a live AI GO! deployment.
  • You need a Python environment with the AI GO! CLI installed. Follow the CLI installation page.
  • You need a configured CLI. Follow the CLI configuration steps.
  • You need an OpenAI API key set as OPENAI_API_KEY in your environment.
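A quick pre-flight check can catch a missing key or CLI before you start. The following is a minimal sketch: require_env is a hypothetical helper, not part of the AI GO! CLI, and it only confirms that the variable is exported and non-empty, not that the key is valid.

```shell
# Hypothetical helper: succeeds only if the named variable is exported and non-empty.
require_env() {
  if [ -n "$(printenv "$1")" ]; then
    echo "$1 is set"
  else
    echo "$1 is missing" >&2
    return 1
  fi
}

# Check the API key and that the lf executable is on PATH.
require_env OPENAI_API_KEY || echo "Export OPENAI_API_KEY before continuing."
command -v lf >/dev/null 2>&1 || echo "lf CLI not found on PATH."
```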

Step 1: Create an AI App

Create an AI app to use as a workspace for the evaluation.

  1. Define the app in a YAML file named app.yaml.
display_name: "My App"
key: "my-app"
  2. Create the app and switch to it.
lf app add -f app.yaml
lf switch my-app
  3. Confirm the app is active.
$ lf status
Working on AI app with key 'my-app'.

Step 2: Add the Model Under Test

Define and add the model you want to evaluate. This example uses OpenAI GPT-4.1 Nano.

  1. Define the model in a YAML file named model.yaml.
display_name: "OpenAI GPT-4.1 Nano"
key: "openai-gpt-4-1-nano"
task: "chat_completion"
config:
  connection_type: "custom_connection"
  adapter:
    key: "latticeflow$openai_chat_completion"
  url: "https://api.openai.com/v1/chat/completions"
  api_key: $OPENAI_API_KEY
  model_key: "gpt-4.1-nano"
  2. Add the model.
lf model add -f model.yaml
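If you prefer to create model.yaml from the terminal, a quoted heredoc is one way to do it. This sketch assumes the CLI itself resolves the $OPENAI_API_KEY placeholder when the model is added, as the YAML above suggests. Quoting the EOF delimiter stops the shell from expanding it early, which also protects the $ inside the adapter key.

```shell
# Quoted 'EOF' keeps every $ literal, so the secret is never written to disk
# and the adapter key "latticeflow$openai_chat_completion" is left intact.
cat > model.yaml <<'EOF'
display_name: "OpenAI GPT-4.1 Nano"
key: "openai-gpt-4-1-nano"
task: "chat_completion"
config:
  connection_type: "custom_connection"
  adapter:
    key: "latticeflow$openai_chat_completion"
  url: "https://api.openai.com/v1/chat/completions"
  api_key: $OPENAI_API_KEY
  model_key: "gpt-4.1-nano"
EOF

# Confirm the placeholder survived unexpanded.
grep -n 'api_key' model.yaml
```

With an unquoted delimiter, the shell would substitute the real key into the file and reduce the adapter key to "latticeflow".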

Step 3: Initialize the Evaluation from AI Atlas

Download the harmful_content evaluation package from AI Atlas into your working directory.

lf init --atlas harmful_content

This creates a harmful_content/ directory containing the evaluation definition, datasets, tasks, a config.env file, and a RUN.md with evaluation-specific instructions.
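To confirm the download completed, you can check for the files this tutorial relies on. The sketch below uses a hypothetical check_package helper and only looks for config.env, RUN.md, and run.yaml (the file passed to lf run in Step 5). The package may contain other files that are not checked here.

```shell
# Hypothetical helper: verifies that the files used in this tutorial exist.
check_package() {
  dir="$1"
  for f in config.env RUN.md run.yaml; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      return 1
    fi
  done
  echo "package looks complete"
}

# Only run the check if the package directory is present.
if [ -d harmful_content ]; then
  check_package harmful_content || echo "Re-run: lf init --atlas harmful_content"
fi
```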

Step 4: Configure the Evaluation

Open harmful_content/config.env and set the required values.

# The key of the model under test.
MODEL_KEY="openai-gpt-4-1-nano"

# The key of the model to use as judge.
JUDGE_MODEL_KEY="openai-gpt-4-1-nano"
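Before running, it can help to verify that both keys have a value. This is a minimal sketch with a hypothetical check_config helper. It assumes each setting sits on its own line in the KEY="value" form shown above, and it does not check that the keys match models registered in your AI app.

```shell
# Hypothetical helper: fails if MODEL_KEY or JUDGE_MODEL_KEY is absent or empty.
check_config() {
  file="$1"
  for var in MODEL_KEY JUDGE_MODEL_KEY; do
    # Match VAR= followed by an optional quote and at least one value character.
    if ! grep -Eq "^${var}=\"?[^\"]" "$file"; then
      echo "$var is not set in $file" >&2
      return 1
    fi
  done
  echo "config looks complete"
}

# Only run the check if the config file is present.
if [ -f harmful_content/config.env ]; then
  check_config harmful_content/config.env || echo "Edit harmful_content/config.env first."
fi
```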

Step 5: Run the Evaluation

Run the evaluation, passing config.env via --env.

lf --env harmful_content/config.env run -f harmful_content/run.yaml

You will see output similar to:

On AI app 'my-app'.
[Dataset(key="harmful_content")] Created successfully
[Task(key="harmful_content")] Created successfully
[Evaluation(ID="1")] Created successfully
[Evaluation(ID="1")] Started successfully.
----------------------------------------------------------------------------------
Evaluation overview available at:

http://<your-aigo-url>/ai-apps/.../evaluations

Or in the CLI using:

lf overview eval --id 1

Step 6: Explore Results

  1. Check the evaluation status in the CLI.
lf overview eval --id 1
  2. Open the evaluations page in the UI to see all evaluation runs and aggregate metrics.
  3. Drill into individual model responses and scores via the task result sidebar.