System Tasks

A system task evaluates a system-level property without interacting with a model or dataset. Instead of the usual solver and scorer pipeline, a system task runs a single Python snippet, the compute_evidence_snippet, which probes a system and returns metrics directly.

Use a system task when:

  • You need to verify an infrastructure or operational property (e.g. HTTPS enforcement, DNS configuration, certificate validity, endpoint availability).
  • The check is self-contained — it does not require a dataset of test cases or a model to generate responses.
  • You want to parameterize the check and run it across multiple configurations in a single evaluation.

Task Definition

Set definition.type to "system_task" and provide a compute_evidence_snippet with the Python code that implements the check. Declare any parameters the snippet needs in config_spec.

display_name: "Enforces HTTPS"
key: "enforces-https"
description: >
  Checks whether a given URL enforces HTTPS by redirecting HTTP requests to HTTPS.
tags: ["Security"]
config_spec:
  - type: "string"
    key: "url"
    display_name: "URL"
    description: "The URL to check for HTTPS enforcement."
definition:
  type: "system_task"
  compute_evidence_snippet: !include "./check_https.py"

Configuration parameters and secrets are available inside the snippet via the << config.KEY >> and << secrets.KEY >> placeholder syntax — see Config Specification and Use Secrets.
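To see why the placeholders are written inside string literals, it can help to picture the substitution as plain text replacement that happens before the snippet runs. The sketch below is only a mental model under that assumption: render_snippet, its regular expression, the api_token secret, and the sample values are all hypothetical and do not describe the platform's actual implementation.

```python
import re

# Hypothetical stand-in for the platform's templating step (an assumption,
# not the real implementation): replace << config.KEY >> and << secrets.KEY >>
# with the corresponding values as plain text.
PLACEHOLDER = re.compile(r"<<\s*(config|secrets)\.(\w+)\s*>>")


def render_snippet(source, config, secrets):
    values = {"config": config, "secrets": secrets}

    def substitute(match):
        namespace, key = match.group(1), match.group(2)
        return str(values[namespace][key])

    return PLACEHOLDER.sub(substitute, source)


# Two lines of snippet source that reference one parameter and one secret.
snippet = 'url = "<< config.url >>"\ntoken = "<< secrets.api_token >>"'
rendered = render_snippet(
    snippet,
    config={"url": "example.com"},
    secrets={"api_token": "s3cr3t"},
)
print(rendered)
```

Under this model, a placeholder left outside quotes would be replaced by a bare value and would probably not be valid Python, which is why the examples on this page wrap each placeholder in a string literal.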

See the Tasks CLI reference for the full task specification.

Writing the Evidence Snippet

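The snippet must define a compute_evidence function that returns a dictionary with a top-level "metrics" key, which maps each metric name to its result:

def compute_evidence():
    ...  # probe the system here
    # "value" is any number; "reason" is a human-readable explanation.
    return {"metrics": {"Metric Name": {"value": 1, "reason": "Why the check passed or failed"}}}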

Each metric has a numeric value (typically 0 or 1 for pass/fail checks, but any number is valid) and a reason string. Providing a reason is strongly encouraged, since it makes results interpretable in the UI and in exported evidence. A single snippet can return multiple metrics by adding more entries under "metrics".
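As an illustration of this return shape, the sketch below returns two metrics from one function and then checks the result against the structure described above. The helper names (check_url, validate_evidence), the metric names, and the checks themselves are assumptions made for this example; the platform may apply different or additional validation.

```python
from numbers import Number
from urllib.parse import urlparse


def check_url(url):
    """Return two illustrative metrics derived from the URL string alone."""
    parsed = urlparse(url)
    scheme = parsed.scheme or "missing"
    has_credentials = parsed.username is not None
    return {
        "metrics": {
            "Uses HTTPS Scheme": {
                "value": 1 if parsed.scheme == "https" else 0,
                "reason": f"URL scheme is {scheme}",
            },
            "No Embedded Credentials": {
                "value": 0 if has_credentials else 1,
                "reason": "URL contains user info"
                if has_credentials
                else "URL contains no user info",
            },
        }
    }


def compute_evidence():
    # The placeholder is filled in by the platform before the snippet runs.
    return check_url("<< config.url >>")


def validate_evidence(evidence):
    """Local sanity check: each metric needs a numeric value and a string reason."""
    for name, metric in evidence["metrics"].items():
        if not isinstance(metric["value"], Number):
            raise TypeError(f"{name}: value must be a number")
        if not isinstance(metric.get("reason", ""), str):
            raise TypeError(f"{name}: reason must be a string")
    return True
```

Keeping the logic in a function that takes ordinary arguments, with compute_evidence as a thin wrapper around it, makes the check easy to exercise locally before the placeholders are involved.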

Example

The following snippet checks whether a URL enforces HTTPS by verifying that plain HTTP requests are redirected:

import http.client
from urllib.parse import urlparse


def metric(value, reason):
    # Wrap a single result in the structure compute_evidence must return.
    return {"metrics": {"Enforces HTTPS": {"value": value, "reason": reason}}}


def compute_evidence():
    url = "<< config.url >>"
    # Accept bare hostnames such as "example.com" by adding a scheme first.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname
    # Probe plain HTTP: honor an explicit port only for http:// URLs, else use 80.
    port = (parsed.port if parsed.scheme == "http" else None) or 80
    path = parsed.path or "/"
    if parsed.query:
        path += "?" + parsed.query

    conn = http.client.HTTPConnection(host, port, timeout=10)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        location = resp.getheader("Location", "")
    except ConnectionRefusedError:
        return metric(1, "HTTP connection refused: plain HTTP is not served")
    except (OSError, http.client.HTTPException) as exc:
        # DNS failures, timeouts, and malformed responses are inconclusive, not a pass.
        return metric(0, f"HTTP probe failed: {exc!r}")
    finally:
        conn.close()

    if resp.status in (301, 302, 307, 308) and location.lower().startswith("https://"):
        return metric(1, f"HTTP {resp.status} redirects to {location}")
    return metric(0, f"HTTP returned {resp.status} with no HTTPS redirect")

See the full runnable example for a complete system task with HSTS checks and an evaluation configuration.
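An HSTS check of the kind mentioned above could be reported as a second metric alongside the redirect check. The following is a minimal sketch, not the referenced example: parse_hsts, hsts_metric, and the one-year default threshold are assumptions made here, and the logic only looks at the max-age and includeSubDomains directives of the Strict-Transport-Security header.

```python
def parse_hsts(header):
    """Parse a Strict-Transport-Security header value into its directives."""
    directives = {}
    for part in header.split(";"):
        name, _, value = part.strip().partition("=")
        if name:
            # Directive names are case-insensitive; values may be quoted.
            directives[name.lower()] = value.strip().strip('"')
    return directives


def hsts_metric(header, min_max_age=31536000):
    """Build one metric entry from an HSTS header value (None if absent)."""
    if not header:
        return {"value": 0, "reason": "No Strict-Transport-Security header"}
    directives = parse_hsts(header)
    try:
        max_age = int(directives.get("max-age", ""))
    except ValueError:
        return {"value": 0, "reason": f"Missing or invalid max-age in '{header}'"}
    if max_age < min_max_age:
        return {"value": 0, "reason": f"max-age {max_age} is below {min_max_age}"}
    scope = "with" if "includesubdomains" in directives else "without"
    return {"value": 1, "reason": f"max-age {max_age}, {scope} includeSubDomains"}
```

In a snippet, the header value would come from an HTTPS response, for example resp.getheader("Strict-Transport-Security") on a response obtained through http.client.HTTPSConnection, and the returned entry would be placed under its own name in the "metrics" dictionary.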

📘

Python snippet environment

The snippet runs inside AI GO!'s fixed Python runtime (Python 3.11). Only the libraries listed in Python Snippets are available at execution time.

💡

When to use a system task vs. a benchmark task:

| Scenario | Task Type |
| --- | --- |
| Evaluate a model's responses against a dataset | Benchmark task |
| Evaluate dataset quality without a model | Benchmark task (evaluated_entity_type: dataset) |
| Check an infrastructure or system property | System task |
| Run a self-contained probe that produces metrics directly | System task |

A system task has no solver, no dataset, and no scorers — the compute_evidence function is the entire execution pipeline.