Solver Post-processor

The solver post-processor transforms a single solver output into multiple individually-scored rows. After the solver runs, the post-processor receives the original sample and the full solver output and returns a list of new (sample, solver output) pairs — one per scored row. Each pair then flows through the scorer pipeline independently.

Use the solver post-processor when a single solver output bundles several results that should be viewed and scored separately, for example one response that answers multiple questions.

Task Definition

Add a postprocessor block inside the solver definition. Set type to "python" and provide the Python code via postprocess_snippet:

display_name: "Multi QA Checker"
key: "multi-qa-checker"
description: >
  Evaluates a model that answers multiple questions in a single turn.
  The post-processor splits the combined answer into individual scored rows.
config_spec:
  - type: "model"
    key: "judge_model"
    display_name: "Judge Model"
definition:
  dataset:
    key: "multi-qa"
  solver:
    type: "single_turn_solver"
    input_builder:
      type: "chat_completion"
      input_messages:
        - role: "system"
          content: "You are a helpful assistant."
        - role: "user"
          content: >
            Respond to all of these questions and separate the answers with '---':
            {{ '\n---\n'.join(sample.questions) }}

            Answer with a single word and no explanations.
    postprocessor:
      type: "python"
      postprocess_snippet: !include "./postprocessor.py"
  scorers:
    - type: "string_equals"
      ground_truth: "{{ sample.target }}"

The postprocessor block is supported for all solver types.

See the Tasks CLI reference for the full task specification.

Writing the Postprocess Snippet

The snippet must define a postprocess function with this signature:

def postprocess(
    sample: RawSample,
    solver_output: SolverOutput,
) -> list[tuple[RawSample, SolverOutput]]:
    ...

Both def and async def are supported.

  • sample: the original dataset row as a dictionary.
  • solver_output: the output produced by the solver for that sample (a SolverTrace, SingleSolverOutput, GroupedSolverTrace, or GroupedSolverOutput).
  • Return value: a list of (new_sample, new_solver_output) pairs. Each pair becomes one scored row in the evidence.
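Since both def and async def are accepted, the smallest valid snippet is a pass-through that returns the input pair unchanged. The sketch below illustrates that shape with async def. To keep it runnable outside AI GO!, it defines local stand-in aliases instead of importing RawSample and SolverOutput from latticeflow.core.dtypes; those aliases are an assumption made for this example only.

```python
import asyncio
from typing import Any

# Stand-in aliases so the sketch runs without the platform installed.
# In a real snippet, import RawSample and SolverOutput from latticeflow.core.dtypes.
RawSample = dict[str, Any]
SolverOutput = Any


async def postprocess(
    sample: RawSample,
    solver_output: SolverOutput,
) -> list[tuple[RawSample, SolverOutput]]:
    # Pass-through: one input pair becomes exactly one scored row.
    return [(sample, solver_output)]


if __name__ == "__main__":
    pairs = asyncio.run(postprocess({"sample_id": "japan"}, "raw solver output"))
    print(len(pairs))
```

Returning a list with a single pair leaves the evaluation unchanged, which makes this a convenient starting point before adding any splitting logic.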

Sample IDs

Each returned sample dict may optionally include a "sample_id" key. If you omit it, AI GO! generates unique IDs automatically by appending an index to the original sample ID (e.g. "japan.0", "japan.1"). The original sample ID is always stored under "original_sample_id" in each new sample dict so you can trace results back to their source.
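The ID rule can be illustrated with a small helper. This is a hypothetical sketch of the documented behavior, not the platform's actual implementation: the name assign_sample_ids is invented for illustration, and keeping an explicitly provided "sample_id" unchanged is an assumption.

```python
from typing import Any


def assign_sample_ids(
    original_sample_id: str, new_samples: list[dict[str, Any]]
) -> list[dict[str, Any]]:
    """Mimic the documented ID rule (hypothetical helper, not a platform API)."""
    result = []
    for index, new_sample in enumerate(new_samples):
        row = dict(new_sample)  # copy so the caller's dict is not mutated
        # Omitted IDs are derived from the original ID plus the row index.
        row.setdefault("sample_id", f"{original_sample_id}.{index}")
        # The source ID is always recorded for traceability.
        row["original_sample_id"] = original_sample_id
        result.append(row)
    return result


rows = assign_sample_ids(
    "japan",
    [
        {"question": "What is the capital city of Japan?"},
        {"question": "On which continent is Japan located?"},
    ],
)
print([row["sample_id"] for row in rows])  # ['japan.0', 'japan.1']
```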

Available types

The following types from latticeflow.core.dtypes are available in the snippet:

Type                  Description
RawSample             dict[str, Any] alias for a dataset row
SolverOutput          Union of all solver output types
SolverTrace           Open Responses solver output with a structured Trace
SingleSolverOutput    Legacy solver output (messages + model output)
GroupedSolverTrace    Multiple SolverTrace objects from a grouped solver
GroupedSolverOutput   Multiple SingleSolverOutput objects from a grouped solver

Example

The following snippet handles a solver that answers several questions in one response, separated by ---. It splits the response and reconstructs one SolverTrace per question:

from latticeflow.core.dtypes import ChatCompletionModelOutput
from latticeflow.core.dtypes import ChatCompletionModelOutputChoice
from latticeflow.core.dtypes import ChatCompletionOutputMessage
from latticeflow.core.dtypes import Message
from latticeflow.core.dtypes import MessageRole
from latticeflow.core.dtypes import MessageStatus
from latticeflow.core.dtypes import ModelResponse
from latticeflow.core.dtypes import OutputTextContent
from latticeflow.core.dtypes import RawSample
from latticeflow.core.dtypes import SolverOutput
from latticeflow.core.dtypes import SolverTrace
from latticeflow.core.dtypes import Trace


def postprocess(
    sample: RawSample, solver_output: SolverOutput
) -> list[tuple[RawSample, SolverOutput]]:
    # Split the full model answer into individual answers.
    model_response = solver_output.output.choices[0].message.content
    answers = [answer.strip() for answer in model_response.split("---")]
    if len(answers) != len(sample["questions"]):
        raise ValueError(
            f"Expected {len(sample['questions'])} answers, but got {len(answers)}. "
            f"Answer:\n{model_response}"
        )

    # Construct one (sample, solver output) pair per question.
    postprocessed_outputs = []
    for question, target, answer in zip(
        sample["questions"], sample["targets"], answers
    ):
        trace = SolverTrace(trace=Trace.from_items([]), raw_outputs=[])
        trace.append_user_message(question)
        trace.add_model_response(
            ModelResponse(
                raw_output=ChatCompletionModelOutput(
                    choices=[
                        ChatCompletionModelOutputChoice(
                            message=ChatCompletionOutputMessage(
                                role="assistant", content=answer
                            )
                        )
                    ]
                ),
                items=[
                    Message(
                        id="",
                        status=MessageStatus.completed,
                        role=MessageRole.assistant,
                        content=[OutputTextContent(text=answer, annotations=[])],
                    )
                ],
            )
        )
        # Preserve the raw direct I/O from the original solver call.
        trace.direct_ios = solver_output.direct_ios
        postprocessed_outputs.append(({"question": question, "target": target}, trace))

    return postprocessed_outputs
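The splitting and validation step in this snippet is plain string handling, so it can be exercised on its own before wiring it into a task. The helper below is a sketch of that step; the name split_answers is hypothetical and not part of latticeflow.

```python
def split_answers(model_response: str, expected_count: int) -> list[str]:
    """Split a '---'-separated response and check the number of answers."""
    answers = [answer.strip() for answer in model_response.split("---")]
    if len(answers) != expected_count:
        raise ValueError(
            f"Expected {expected_count} answers, but got {len(answers)}. "
            f"Answer:\n{model_response}"
        )
    return answers


print(split_answers("Tokyo\n---\nAsia", expected_count=2))  # ['Tokyo', 'Asia']

try:
    # A response without the separator yields one answer and is rejected.
    split_answers("Tokyo", expected_count=2)
except ValueError as error:
    print(f"Rejected: {error}")
```

Stripping whitespace after the split matters here because the prompt asks for answers separated by '---' on its own line, so each piece typically carries surrounding newlines.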

Each dataset sample for this example holds a list of questions and a matching list of targets, for example:

{"sample_id": "japan", "questions": ["What is the capital city of Japan?", "On which continent is Japan located?"], "targets": ["Tokyo", "Asia"]}
{"sample_id": "austria", "questions": ["What is the capital city of Austria?", "On which continent is Austria located?"], "targets": ["Vienna", "Europe"]}

With 4 dataset samples and 2 questions each, the evaluation will contain 8 scored rows in the evidence (e.g. japan.0, japan.1, austria.0, austria.1, …). Each row is scored independently by the string_equals scorer against its individual target.
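The row count follows directly from the number of questions per sample. As a quick check, the sketch below expands the two dataset rows shown above into per-question samples using only the standard library. expand_rows is a hypothetical helper, and it covers only the sample half of each returned pair, not the solver output.

```python
import json

DATASET_JSONL = """\
{"sample_id": "japan", "questions": ["What is the capital city of Japan?", "On which continent is Japan located?"], "targets": ["Tokyo", "Asia"]}
{"sample_id": "austria", "questions": ["What is the capital city of Austria?", "On which continent is Austria located?"], "targets": ["Vienna", "Europe"]}
"""


def expand_rows(jsonl_text: str) -> list[dict[str, str]]:
    """Turn each multi-question dataset row into one dict per question."""
    rows = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        sample = json.loads(line)
        for question, target in zip(sample["questions"], sample["targets"]):
            rows.append({"question": question, "target": target})
    return rows


rows = expand_rows(DATASET_JSONL)
print(len(rows))  # 4: two samples with two questions each
```

By the same arithmetic, 4 samples with 2 questions each give the 8 scored rows mentioned above.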

See the full runnable example for the complete task definition, dataset, and evaluation configuration.