Integrate Custom Model

The custom model integration, called custom inference integration, registers a model whose inference logic is implemented directly in Python. The logic is supplied as an inference snippet.

It is the right choice whenever the simple model integration cannot express the system end-to-end. These are the scenarios when this method is preferred:

  • The model is exposed as more than one endpoint or requires multiple chained requests per inference (e.g. create a session, then submit a message, then poll for completion).
  • Inference requires an explicit confirmation step or other custom protocol after the initial request.
  • The model maintains custom state across turns that cannot be achieved with a stateful model adapter.
  • The wire format is too irregular to be expressed by a Jinja-based model adapter.
  • The system is embedded inside a larger application and inference must run through that application's SDK.
📘

Custom inference shifts responsibility for HTTP calls, authentication, retries, error handling, and any provider-specific quirks to the inference snippet. Built-in debugging options and rate limiting still apply.

Follow the steps below to integrate such a model in the CLI or create it in the UI by clicking on Add Model > Add Custom Model and then selecting Custom inference connection type.

Step 1: Describe Model

Start by describing the model so you and your collaborators can identify it clearly: set a unique key to reference the model programmatically, a human-readable display_name, a short description, and the machine learning task the model performs (an ML task describes the model I/O and its interaction properties).

display_name: "OpenAI GPT-4.1 Nano (Custom Inference)"
key: "gpt-4-1-nano-custom-inference"
description: "OpenAI's GPT-4.1 Nano served through a custom inference snippet."
task: "chat_completion"

See the Models CLI reference for the allowed ML tasks, key format, and other properties.

Step 2: Configure Connection

Configure the connection by setting the following fields:

  • Set connection_type to "custom_inference" to invoke the model through an inference snippet instead of a single HTTP request.
  • Set run_inference_snippet to the Python script that implements the inference logic — see Step 2.1 for how to write it.
  • Optionally, set timeout to cap the execution time of a single snippet invocation. Keeping it tight makes evaluations faster and more robust by failing stuck requests early; a good starting point is 2–3x the expected end-to-end latency of one inference call.
config:
  connection_type: "custom_inference"
  run_inference_snippet: !include "./run_inference.py"
  timeout: 15

All fields supported by a custom inference connection are described in the Models CLI reference.

Step 2.1: Build Inference Pipeline

The Python environment used to execute inference snippets is described here.

The inference snippet defines the full inference pipeline as a single top-level Python function with a fixed signature:

def run_inference(body: str, environment: dict[str, Any]) -> str:
    ...

The function's inputs and output are wired into the surrounding model adapter as follows:

  • body — the string produced by the adapter's process_input template. This is the request payload the run_inference function must consume.
  • environment — the environment dictionary configured on the model (see Step 2.2).
  • Return value — the string consumed by the adapter's process_output template. Whatever the function returns is fed straight into process_output for parsing into the canonical model I/O format.

The body of the function implements the inference pipeline, which typically has four stages:

  1. Prepare input — parse body, extract the fields you need, and assemble the payload(s) for your endpoint(s).
  2. Run inference — issue the HTTP request(s), poll for completion, or call into the embedding SDK. For multi-endpoint systems this is where you create sessions, submit messages, and collect intermediate state.
  3. Combine outputs — merge streamed chunks, concatenate multi-part responses, or aggregate per-call results into a single response.
  4. Prepare output — serialize the result into the wire format expected by the model adapter and return it as a string.

Step 2.2: Define Environment

Keep endpoint URLs, model identifiers, credentials, and any other parameters out of the Python source by putting them in the environment dictionary. The dictionary is passed unchanged to the run_inference function, so the same inference snippet stays reusable across deployments and no secrets are committed alongside the code.

config:
  connection_type: "custom_inference"
  environment:
    API_KEY: "sk-..."

Inside the snippet, the values are read from the environment argument:

environment = {"API_KEY": "sk-..."}
📘

Be careful when managing sensitive tokens in YAML files. Reference them either through an environment variable loaded from .env ($API_KEY) or through a server-side secret (<< secrets.API_KEY >>).

environment:
  API_KEY: "sk-…"                   # Exposed as a plain string
  API_KEY: $API_KEY                 # Environment variable from `.env`
  API_KEY: "<< secrets.API_KEY >>"  # Server-side secret

Step 2.3: Production Considerations

The inference snippet runs as-is — behaviors that the simple connection handles automatically must be implemented inside the snippet. At minimum, cover the following:

  • Error handling. Detect failed responses and raise descriptive exceptions so that intermediate failures surface in the evaluation log with enough context to debug.
  • Refusal handling. Detect successful responses that contain a refusal or moderation message and surface them as part of the returned output, rather than masking them as errors, so scorers can classify them.
  • Concurrency. Cap how aggressively the snippet is invoked with rate limits and the maximum number of concurrent requests. For systems with shared state, serialize requests by allowing only one concurrent call.
  • Timeouts. Bound the total runtime of one snippet invocation with the model-level timeout, and keep any internal request timeouts consistent with it.

Step 3: Set Model Adapter

A model adapter translates between the canonical model I/O format and the raw payload the run_inference function consumes on body and returns as a string. Use a built-in adapter (for example latticeflow$openai_chat_completion for any OpenAI-compatible chat completion endpoint) or define your own — see the Model Adapters CLI reference.

config:
  connection_type: "custom_inference"
  adapter:
    key: "latticeflow$identity_chat_completion"

If the inference snippet already produces output in the canonical format — for example, by building the chat completion response object in Python before returning it — pair the model with the identity adapter latticeflow$identity_chat_completion, which passes the body through unchanged. This is the default adapter for custom inference models.

Step 4: Test Model

Testing issues a sample request through the configured inference snippet and adapter to verify that the model is reachable, the snippet runs to completion, and the response is well-formed. To test the model before using it in an evaluation, see Testing models.

Example of a Full Definition
display_name: "OpenAI GPT-4.1 Nano (Custom Inference)"
key: "gpt-4-1-nano-custom-inference"
description: "OpenAI's GPT-4.1 Nano served through a custom inference snippet."
rate_limit: 60
task: "chat_completion"
config:
  connection_type: "custom_inference"
  adapter:
    key: "latticeflow$openai_chat_completion"
  run_inference_snippet: !include "./run_inference.py"
  environment:
    MODEL_ENDPOINT_URL: "https://api.openai.com/v1/chat/completions"
    MODEL_ENDPOINT_API_KEY: $OPENAI_API_KEY
    MODEL_KEY: "gpt-4.1-nano"
  timeout: 15
import json

import httpx


def run_inference(body: str, environment: dict):
    body_dict = json.loads(body)
    body_dict["model"] = environment["MODEL_KEY"]

    response = httpx.post(
        environment["MODEL_ENDPOINT_URL"],
        headers={
            "Authorization": f"Bearer {environment['MODEL_ENDPOINT_API_KEY']}",
            "Content-Type": "application/json",
        },
        content=json.dumps(body_dict).encode(),
        timeout=10.0,
        verify=True,
    )
    response.raise_for_status()
    return response.text