Function Call Coverage

Checks whether an agent made all required function calls during execution, producing an all_required_calls_made score of 1.0 or 0.0. Use this when evaluating tool-using agents where specific function calls must appear in the trace. This scorer checks only that the calls are present; to validate function call arguments, use a Python Scorer instead.

This scorer is an experimental feature and the API is subject to change.

Output

  • all_required_calls_made: 1.0 if all required function calls were made (in the correct order when using in_order mode), 0.0 otherwise.
  • required_calls_coverage: Fraction of required function calls that were made at least once.
  • num_required_calls_made: Number of required function calls that were made at least once.
  • num_required_calls_not_made: Number of required function calls that were never made.
  • num_unrequired_calls: Number of function calls in the trace that do not count toward the required calls. This includes calls to functions that are not in the required list as well as surplus calls to functions that are.
  • num_required_calls_total: Total number of required function calls.
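
How these outputs relate to one another can be illustrated with a short Python sketch. This is not the scorer's implementation: coverage_metrics is a hypothetical helper, and it assumes that the required list contains distinct names, that only the first call to a required function counts as required, and that an empty required list scores 1.0.

```python
def coverage_metrics(required, trace):
    """Illustrative only: compute the any_order outputs for one trace."""
    required_set = set(required)          # assumption: required names are distinct
    made = required_set & set(trace)      # required functions called at least once

    # Assumption: the first call to a required function is "required";
    # every other call in the trace is counted as unrequired.
    seen = set()
    unrequired = 0
    for call in trace:
        if call in required_set and call not in seen:
            seen.add(call)
        else:
            unrequired += 1

    total = len(required_set)
    return {
        "all_required_calls_made": 1.0 if len(made) == total else 0.0,
        # Assumption: an empty required list is treated as fully covered.
        "required_calls_coverage": len(made) / total if total else 1.0,
        "num_required_calls_made": len(made),
        "num_required_calls_not_made": total - len(made),
        "num_unrequired_calls": unrequired,
        "num_required_calls_total": total,
    }
```

Under these assumptions, a trace of ["search", "lookup"] scored against the required calls ["search", "calculator"] gives a coverage of 0.5, with one required call made, one not made, and one unrequired call.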

Modes

Two modes are available via the mode field:

  • any_order: All required calls must appear in the trace, in any order.
  • in_order: All required calls must appear in the trace as a subsequence in the specified order.
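
The difference between the two modes can be sketched in a few lines of Python. The function names here are hypothetical and the code only illustrates the definitions above; it is not the scorer's actual implementation.

```python
def is_subsequence(required, trace):
    """True if the required names appear in trace in order, gaps allowed."""
    remaining = iter(trace)
    # `name in remaining` advances the iterator until it finds a match,
    # so each later name is only searched for after the previous match.
    return all(name in remaining for name in required)


def all_required_calls_made(required, trace, mode="any_order"):
    """Illustrative pass/fail score for one trace."""
    if mode == "in_order":
        passed = is_subsequence(required, trace)
    else:
        # any_order: every required name occurs somewhere in the trace
        passed = set(required) <= set(trace)
    return 1.0 if passed else 0.0
```

With required calls ["search", "calculator"], the trace ["search", "lookup", "calculator"] passes in both modes because other calls are allowed in between, while ["calculator", "search"] passes only in any_order.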

Given function_calls: ["search", "calculator"]:

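| Trace | Mode | all_required_calls_made | num_required_calls_made | num_unrequired_calls |
| --- | --- | --- | --- | --- |
| ["search", "calculator"] | any_order | 1.0 | 2 | 0 |
| ["calculator", "search"] | any_order | 1.0 | 2 | 0 |
| ["calculator", "search"] | in_order | 0.0 | 2 | 0 |
| ["search", "lookup"] | any_order | 0.0 | 1 | 1 |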

Examples

Example: Any Order. Required calls are read from the dataset; the scorer passes if all appear in the trace in any order.

# tasks/task.yaml
...
scorers:
  - type: "function_call_coverage"
    function_calls: "{{ sample.function_calls }}"
    mode: "any_order"

Example: In Order. Required calls are defined statically; the scorer passes only if they appear as a subsequence of the trace in the specified order.

# tasks/task.yaml
...
scorers:
  - type: "function_call_coverage"
    function_calls: '["search", "calculator"]'
    mode: "in_order"

Configuration

Properties


type Literal "function_call_coverage" required

The type of the scorer.


function_calls string, TemplateValue required

Jinja template that produces the list of required function call names.

The required function calls can be:

  1. A hard-coded list (e.g. ["search", "calculator"])
  2. A reference to a sample field (e.g. "{{ sample.function_calls }}")
  3. A value derived from sample data (e.g. "{{ sample.tools | map(attribute='name') | list }}")

sample represents the current row of the dataset (with a field for every dataset column).

The template should produce a JSON list of function call name strings.
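
To check a template's output before running an evaluation, the rendered string can be parsed with Python's standard json module. The sketch below is a minimal validation helper under stated assumptions: parse_required_calls is a hypothetical name, and how the scorer itself parses or rejects malformed output is not documented here.

```python
import json


def parse_required_calls(rendered):
    """Parse rendered template output and check it is a JSON list of strings."""
    value = json.loads(rendered)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
        raise ValueError("expected a JSON list of function call name strings")
    return value
```

JSON requires double-quoted strings, so a rendered value such as ['search'] fails to parse with json.loads, while ["search"] is accepted.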


mode string

any_order: checks only that every required function call was made at least once, in any order.

in_order: additionally checks that the required function calls appear as a subsequence of the trace — i.e. in the specified order, with other function calls allowed in between.

Default: any_order

key string

Unique identifier assigned to the entity in AI GO!.

Default: None

purpose ScorerPurpose

The purpose of this scorer.

  • score: The scorer is used to score the solver output or the dataset sample.
  • qa: The scorer is used to do QA over the solver output or the dataset sample.

Default: score

display_name string

The display name of the scorer.

Default: None

metrics array[PythonMetricTemplate, BinaryClassificationMetricTemplate, MulticlassClassificationMetricTemplate, MeanMetricTemplate, MaxMetricTemplate, MinMetricTemplate, StdDevMetricTemplate, FrequencyMetricTemplate, RecallMetricTemplate, PrecisionMetricTemplate, F1ScoreMetricTemplate]

The metrics associated with this scorer, which will produce per-task metrics.

Default: None