LangSmith / LangGraph

Just want to use the integration? If you only need to plug a LangSmith-hosted LangGraph agent into AI GO!, you can use the ready-made integration directly — see the full integration in our registry and GitHub repo: <REGISTRY_URL> / <GITHUB_URL>. This tutorial walks through how that integration is built so you can adapt it to your own stateful agent.

This tutorial demonstrates how to connect a LangGraph agent deployed on the LangGraph Platform (LangSmith) to AI GO! as a custom-inference model, so it can be evaluated like any other model — while preserving conversation state across turns.

The hard part of integrating a stateful agent is not calling the endpoint once; it is making multi-turn conversations work. The LangGraph Platform keeps conversation state server-side in a thread, and AI GO! drives the conversation one user turn at a time. The integration must keep both sides in sync so that turn 3 remembers what happened in turns 1 and 2.

This pattern applies to any stateful agent runtime — LangGraph, a custom agent server, or a hosted assistant API — where the conversation lives behind a session/thread identifier and each call returns a multi-step trace.

What you will build

A complete custom-inference model integration that:

  1. Forwards each user turn to a LangGraph deployment over HTTP (/runs/wait)
  2. Maintains conversation continuity across turns by round-tripping the LangGraph thread_id
  3. Returns the full per-turn agent trace — tool calls, tool outputs, and the final reply — in AI GO!'s Open Responses format, ready for trace-aware scorers

By the end, you will have a model you can register, test, and point any multi-turn evaluation at.


Step 1: Understand the integration

The multi-turn challenge

AI GO! sends a chat-completion request for every turn. The body contains the whole conversation so far; the last entry is the current user turn:

{
  "messages": [
    { "role": "user", "content": "Hi, can you help me see my orders?" },
    { "role": "assistant", "content": "Sure! What's your email and order ID?",
      "thread_id": "019e3d95-1898-7961-b0ed-536ac8c43757" },
    { "role": "user", "content": "customer@example.com, order ORD-1001" }
  ]
}

LangGraph, however, does not want the whole history replayed — it already has it stored in the thread. It only wants the new user message, plus the thread_id that identifies the conversation.

The mechanism for keeping both sides in sync is passing the thread_id through the conversation itself. AI GO! messages allow extra fields, and any custom field we set on an assistant message is echoed back unchanged on the next turn — this is the supported way to carry custom data between requests. So on each response we attach the thread_id to the assistant message; on the next request we read it back and reuse the same thread. The first turn (no prior assistant message) creates a fresh thread.

Example request bodies (AI GO! → run_inference)

First turn — just the new user message:

{
  "messages": [
    { "role": "user", "content": "Search the catalog for shoes." }
  ]
}

Subsequent turns echo the prior assistant message with its thread_id, so we keep using the same LangGraph thread:

{
  "messages": [
    { "role": "user", "content": "Hello!" },
    { "role": "assistant", "content": "Hello! How can I assist you today?",
      "thread_id": "019e3d95-1898-7961-b0ed-536ac8c43757" },
    { "role": "user", "content": "What is my thread id?" }
  ]
}
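The round trip shown in the two request bodies above can be simulated without any network calls. The following is a minimal sketch, not the integration itself: THREADS is a hypothetical in-memory stand-in for the server-side thread store, and handle_turn is an illustrative name.

```python
import uuid

# Hypothetical in-memory stand-in for the server-side thread store.
THREADS: dict[str, list[str]] = {}


def handle_turn(messages: list[dict]) -> dict:
    """Handle one request and return the assistant message to send back."""
    # Recover the thread_id echoed on the most recent assistant message.
    thread_id = ""
    for msg in reversed(messages):
        if msg["role"] == "assistant":
            thread_id = msg.get("thread_id") or ""
            break

    # First turn: no assistant message yet, so start a fresh thread.
    if not thread_id:
        thread_id = str(uuid.uuid4())

    # Only the newest user message is appended; earlier turns are already stored.
    history = THREADS.setdefault(thread_id, [])
    history.append(messages[-1]["content"])

    return {
        "role": "assistant",
        "content": f"This thread has {len(history)} user turn(s).",
        "thread_id": thread_id,  # attached so the next request can reuse it
    }


# Two turns, driven the way the platform would drive them.
conversation = [{"role": "user", "content": "Hello!"}]
first = handle_turn(conversation)
conversation += [first, {"role": "user", "content": "What is my thread id?"}]
second = handle_turn(conversation)
```

Both turns end up in the same thread because the second request carries the id that the first response attached.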

The three-stage pipeline

A single inference runs as a three-stage pipeline:

ChatCompletionInput  ->  ModelInput  ->  RawModelOutput  ->  OpenResponsesModelOutput
   (from AI GO!)         convert_        query_model        convert_model_output
                         user_input

run_inference ties the stages together: parse the request, convert it to what LangGraph accepts, call the endpoint, and convert the raw response back into the Open Responses format AI GO! expects.

The output shape

LangGraph's /runs/wait returns the full thread state — every message ever exchanged. We keep only the messages produced by the current turn and translate them into Open Responses trace items:

Type                  Description
message               A text message (here, the assistant's reply)
function_call         A tool the agent invoked (name, arguments)
function_call_output  The result of a tool call, linked by call_id

Emitting these items (rather than just the final text) is what lets downstream trace-aware scorers inspect how the agent reached its answer.
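To illustrate why the call_id link matters, here is a minimal sketch of the kind of check a trace-aware scorer might perform. It operates on plain dicts shaped like the items above; pair_tool_calls is a hypothetical helper, not part of AI GO!.

```python
from typing import Optional


def pair_tool_calls(items: list[dict]) -> list[tuple[dict, Optional[dict]]]:
    """Pair each function_call with its function_call_output via call_id."""
    outputs = {
        item["call_id"]: item
        for item in items
        if item["type"] == "function_call_output"
    }
    return [
        (item, outputs.get(item["call_id"]))
        for item in items
        if item["type"] == "function_call"
    ]


trace = [
    {"type": "function_call", "call_id": "call_1", "name": "search_products",
     "arguments": "{\"query\": \"shoes\"}"},
    {"type": "function_call_output", "call_id": "call_1", "output": "[]"},
    {"type": "message", "role": "assistant", "content": "No shoes found."},
]
pairs = pair_tool_calls(trace)
```

A call whose output is None was invoked but never answered, which a scorer could flag. With only the final text, neither the call nor the gap would be visible.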


Step 2: Build the inference handler

The integration lives in a single Python file, run_inference.py, that defines a run_inference(body, environment) function. AI GO! calls it once per turn.

The entry point

run_inference wires the three stages together: it parses and converts the request, queries the LangGraph endpoint, and converts the raw response back into the Open Responses format AI GO! expects:

def run_inference(body: str, environment: dict[str, Any]) -> str:
    model_input = convert_user_input(ChatCompletionInput.model_validate(json.loads(body)))
    response = query_model(model_input, environment)
    model_output = convert_model_output(response)
    return model_output.model_dump_json()

The rest of this step implements each stage in turn.

Model-side types

We model the two intermediate payloads explicitly so the data flow stays legible:

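import json
import uuid
from typing import Any

import httpx
from pydantic import BaseModel

# ChatCompletionInput and the Open Responses types come from AI GO!; their import path is not shown here.


class ModelInput(BaseModel):
    """What the LangGraph endpoint accepts."""
    thread_id: str       # empty on the first turn, reused afterwards
    user_message: str


class RawModelOutput(BaseModel):
    """What /runs/wait returns, plus the thread we used."""
    thread_id: str
    final_state: dict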

Stage 1 — convert_user_input

This is where multi-turn continuity is established. We pull the latest user message and recover the thread_id echoed by the previous assistant turn:

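def convert_user_input(data: ChatCompletionInput) -> ModelInput:
    messages = data.messages
    # The current turn is the most recent user message.
    last_user = next((m for m in reversed(messages) if m.role == "user"), None)
    if last_user is None:
        raise ValueError("Request contains no user message")

    # Recover the thread_id echoed on the most recent assistant message, if any.
    thread_id = ""
    for msg in reversed(messages):
        if msg.role == "assistant":
            thread_id = getattr(msg, "thread_id", "") or ""
            break

    return ModelInput(thread_id=thread_id, user_message=last_user.content)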

On the first turn there is no prior assistant message, so thread_id stays empty — signalling that a new thread must be created.
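The getattr(...) or "" idiom covers three distinct cases: no assistant message yet, an assistant message without the extra field, and a field that is present but null. The sketch below exercises each case. It uses SimpleNamespace objects as stand-ins for the parsed messages, an assumption made only to keep the example dependency-free; latest_thread_id is an illustrative name.

```python
from types import SimpleNamespace as Msg


def latest_thread_id(messages: list) -> str:
    """Return the thread_id on the most recent assistant message, or ""."""
    for msg in reversed(messages):
        if msg.role == "assistant":
            # A missing attribute and None both collapse to the empty string.
            return getattr(msg, "thread_id", "") or ""
    return ""


user = Msg(role="user", content="Hi")
first_turn = latest_thread_id([user])
no_field = latest_thread_id([user, Msg(role="assistant", content="Hello"), user])
null_field = latest_thread_id(
    [user, Msg(role="assistant", content="Hello", thread_id=None), user]
)
echoed = latest_thread_id(
    [user, Msg(role="assistant", content="Hello", thread_id="t-1"), user]
)
```

Note that the scan stops at the most recent assistant message. If that message lacks a thread_id, a new thread is created even when an earlier assistant message carried one.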

Stage 2 — query_model

Create a thread on the first turn, then run the agent and wait for it to finish:

def query_model(model_input: ModelInput, environment: dict[str, Any]) -> RawModelOutput:
    base_url = environment["LANGSMITH_DEPLOY_URL"].rstrip("/")
    api_key = environment["LANGSMITH_API_KEY"]
    assistant_id = environment["LANGGRAPH_ASSISTANT_ID"]
    headers = {"x-api-key": api_key, "Content-Type": "application/json"}

    with httpx.Client(timeout=120) as client:
        thread_id = model_input.thread_id
        if not thread_id:
            create = client.post(f"{base_url}/threads", headers=headers, json={})
            create.raise_for_status()
            thread_id = create.json()["thread_id"]

        run = client.post(
            f"{base_url}/threads/{thread_id}/runs/wait",
            headers=headers,
            json={
                "assistant_id": assistant_id,
                "input": {"messages": [{"role": "user", "content": model_input.user_message}]},
            },
        )
        run.raise_for_status()
        final_state = run.json()

    return RawModelOutput(thread_id=thread_id, final_state=final_state)

We send only the new user message — LangGraph appends it to the thread it already holds.
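query_model reads three values from environment and raises a bare KeyError if one is missing. One possible hardening, sketched here on the assumption that a clearer failure message is wanted, is to validate the keys up front. load_settings and REQUIRED_KEYS are illustrative names, not part of the integration.

```python
from typing import Any

REQUIRED_KEYS = ("LANGSMITH_DEPLOY_URL", "LANGSMITH_API_KEY", "LANGGRAPH_ASSISTANT_ID")


def load_settings(environment: dict[str, Any]) -> dict[str, str]:
    """Validate the environment and return normalized connection settings."""
    missing = [key for key in REQUIRED_KEYS if not environment.get(key)]
    if missing:
        raise ValueError(f"Missing environment values: {', '.join(missing)}")
    return {
        # Strip the trailing slash so URL joins do not produce "//".
        "base_url": str(environment["LANGSMITH_DEPLOY_URL"]).rstrip("/"),
        "api_key": str(environment["LANGSMITH_API_KEY"]),
        "assistant_id": str(environment["LANGGRAPH_ASSISTANT_ID"]),
    }


# Placeholder values only; real ones come from model.yaml and .env.
settings = load_settings({
    "LANGSMITH_DEPLOY_URL": "https://example.invalid/",
    "LANGSMITH_API_KEY": "placeholder-key",
    "LANGGRAPH_ASSISTANT_ID": "placeholder-assistant",
})
```

Failing early with the names of the missing keys tends to make a misconfigured model.yaml easier to diagnose than an error raised in the middle of the HTTP call.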

Example raw model output (agent API → query_model)

/runs/wait returns the final thread state. query_model wraps it with the thread_id we used so convert_model_output can echo it back:

{
  "thread_id": "019e3d95-45a7-7051-9e3a-c42ca0fbe182",
  "final_state": {
    "messages": [
      { "type": "human", "content": "Search the catalog for shoes.",
        "id": "adfea54d-..." },
      { "type": "ai", "content": "", "id": "lc_run--...",
        "tool_calls": [{ "name": "search_products",
                         "args": { "query": "shoes" },
                         "id": "call_DwNd...", "type": "tool_call" }],
        "usage_metadata": { "input_tokens": 177, "output_tokens": 15 } },
      { "type": "tool", "content": "[]", "name": "search_products",
        "tool_call_id": "call_DwNd...", "status": "success" },
      { "type": "ai",
        "content": "I couldn't find any shoes in the catalog. ...",
        "id": "lc_run--...",
        "usage_metadata": { "input_tokens": 201, "output_tokens": 22 } }
    ]
  }
}

Stage 3 — convert_model_output

/runs/wait returns the entire thread, so we keep only the messages produced by this turn (everything after the last human message) and hand them to the converter:

def convert_model_output(raw_model_output: RawModelOutput) -> OpenResponsesModelOutput:
    thread_id = raw_model_output.thread_id
    messages = raw_model_output.final_state.get("messages", [])

    last_human = -1
    for i, msg in enumerate(messages):
        if msg.get("type") == "human":
            last_human = i
    turn_messages = messages[last_human + 1 :] if last_human >= 0 else messages

    return OpenResponsesConverter().build(turn_messages, thread_id=thread_id)
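To see the slicing on a thread that already holds an earlier turn, here is a small sketch on plain dicts. current_turn is a hypothetical helper that scans backwards rather than forwards; it selects the same messages as the loop in convert_model_output.

```python
def current_turn(messages: list[dict]) -> list[dict]:
    """Return the messages that follow the last human message."""
    for index in range(len(messages) - 1, -1, -1):
        if messages[index].get("type") == "human":
            return messages[index + 1:]
    # No human message found: treat the whole state as the current turn.
    return messages


state = [
    {"type": "human", "content": "Hello!"},
    {"type": "ai", "content": "Hello! How can I assist you today?"},
    {"type": "human", "content": "Search the catalog for shoes."},
    {"type": "ai", "content": "", "tool_calls": [{"name": "search_products"}]},
    {"type": "tool", "content": "[]", "name": "search_products"},
    {"type": "ai", "content": "I couldn't find any shoes in the catalog."},
]
turn = current_turn(state)
```

The greeting exchange from the first turn is dropped, and only the tool-calling step, the tool result, and the final reply remain.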

The Open Responses converter

The converter turns LangGraph's per-turn messages into Open Responses items. A single LangGraph ai message can carry several tool calls and/or text, so it expands into multiple items; a tool message becomes one function_call_output:

class OpenResponsesConverter:
    def build(self, messages: list[Any], **kwargs: Any) -> OpenResponsesModelOutput:
        items: list[TraceItem] = []
        num_prompt_tokens = 0
        num_completion_tokens = 0
        for message in messages:
            msg_type = message.get("type")
            if msg_type == "ai":
                for call in message.get("tool_calls") or []:
                    items.append(self.build_function_call(call, **kwargs))
                content = message.get("content") or ""
                if isinstance(content, str) and content.strip():
                    items.append(self.build_assistant_message(message, **kwargs))

                usage = message.get("usage_metadata") or {}
                num_prompt_tokens += usage.get("input_tokens", 0) or 0
                num_completion_tokens += usage.get("output_tokens", 0) or 0
            elif msg_type == "tool":
                items.append(self.build_function_call_output(message, **kwargs))
            else:
                raise ValueError(f"Unhandled message type: `{msg_type}`")

        return OpenResponsesModelOutput(
            items=items,
            usage=self.build_usage(num_prompt_tokens, num_completion_tokens),
        )

The thread_id to round-trip is passed through build(...) as a keyword argument and forwarded to the builders. The per-item builders construct each trace item:

    def build_assistant_message(self, message: dict, **kwargs: Any) -> Message:
        return Message(
            id=str(uuid.uuid4()),
            status=MessageStatus.completed,
            role=MessageRole.assistant,
            content=[OutputTextContent(text=message["content"], annotations=[])],
            # Carried as an extra field so the next turn can reuse the thread.
            thread_id=kwargs.get("thread_id", ""),
        )

    def build_function_call(self, call: dict, **kwargs: Any) -> FunctionCall:
        return FunctionCall(
            id=str(uuid.uuid4()),
            call_id=call["id"],
            name=call["name"],
            arguments=json.dumps(call.get("args") or {}),
            status=FunctionCallStatus.completed,
        )

    def build_function_call_output(self, message: dict, **kwargs: Any) -> FunctionCallOutput:
        return FunctionCallOutput(
            id=str(uuid.uuid4()),
            call_id=message["tool_call_id"],
            output=message.get("content") or "",
            status=FunctionCallOutputStatusEnum.completed,
        )

Example Open Responses output (convert_model_output)

The converted result returned to AI GO! — tool call, tool output, and the final assistant message carrying the thread_id for the next turn:

{
  "items": [
    { "type": "function_call", "id": "5bf9...", "call_id": "call_DwNd...",
      "name": "search_products", "arguments": "{\"query\": \"shoes\"}",
      "status": "completed" },
    { "type": "function_call_output", "id": "9da4...",
      "call_id": "call_DwNd...", "output": "[]", "status": "completed" },
    { "type": "message", "id": "44c2...", "status": "completed",
      "role": "assistant",
      "content": [{ "type": "output_text",
                    "text": "I couldn't find any shoes in the catalog. ...",
                    "annotations": [] }],
      "thread_id": "019e3d95-45a7-7051-9e3a-c42ca0fbe182" }
  ],
  "usage": { "num_prompt_tokens": 378, "num_completion_tokens": 37 }
}
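The usage totals are sums over the ai messages of the current turn. As a quick check against the raw model output shown earlier, here is a sketch on plain dicts with the numbers copied from that example:

```python
# usage_metadata of the two ai messages in the raw model output example.
ai_usage = [
    {"input_tokens": 177, "output_tokens": 15},  # tool-calling step
    {"input_tokens": 201, "output_tokens": 22},  # final reply
]

usage = {
    "num_prompt_tokens": sum(u["input_tokens"] for u in ai_usage),
    "num_completion_tokens": sum(u["output_tokens"] for u in ai_usage),
}
```

Prompt tokens are counted once per model call, so a turn with several tool-calling steps reports more prompt tokens than any single step does.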

Step 3: Wire up the model

Three small files connect the snippet to AI GO!: model.yaml, app.yaml, and .env.

model.yaml

The model uses connection_type: custom_inference with the identity chat-completion adapter (the snippet already returns Open Responses, so no adapter transform is needed). Secrets and endpoint configuration are injected via environment:

display_name: "Customer Support Agent (LangSmith)"
key: "customer-support-agent-langsmith"
description: >-
  A customer-support agent deployed on the LangGraph Platform and served via
  LangSmith.
rate_limit: 15
task: "chat_completion"
config:
  connection_type: "custom_inference"
  adapter:
    key: "latticeflow$identity_chat_completion"
  run_inference_snippet: !include "./run_inference.py"
  environment:
    LANGSMITH_DEPLOY_URL: $LANGSMITH_DEPLOY_URL
    LANGGRAPH_ASSISTANT_ID: $LANGGRAPH_ASSISTANT_ID
    LANGSMITH_API_KEY: "<< secrets.LANGSMITH_API_KEY >>"
  timeout: 120
secrets:
  LANGSMITH_API_KEY: $LANGSMITH_API_KEY

  • !include "./run_inference.py" inlines the snippet at registration time.
  • << secrets.LANGSMITH_API_KEY >> references a server-side secret; the secrets block uploads it from $LANGSMITH_API_KEY in your .env.
  • rate_limit: 15 keeps concurrency low, because external agent endpoints rarely tolerate high request volume. Lower it further if you see timeouts.

app.yaml

The model just needs an app to live in. Rather than defining a bespoke one, reuse the shared playground app used across these guides:

display_name: "Playground App"
key: "playground-app"
description: >
  Shared app for trying out model integrations.

.env

LANGSMITH_DEPLOY_URL=https://<your-deployment>.langgraph.app
LANGGRAPH_ASSISTANT_ID=<your-assistant-id>
LANGSMITH_API_KEY=lsv2_sk_...

Step 4: Register and test

# Create and switch to the shared playground app
lf add app -f app.yaml
lf switch playground-app

# Register the model (uploads the secret and inlines run_inference.py)
lf add model -f model.yaml

# Verify the endpoint is reachable and returns well-formed Open Responses
lf test model customer-support-agent-langsmith

lf test model sends a single "Hello!" turn and shows each pipeline stage. A successful run ends with the parsed Open Responses output:

3. Running inference.
   Status code: 200
4. Transforming model output.
   {"items":[{"id":"...","status":"completed","role":"assistant",
              "content":[{"text":"Hello! How can I assist you today?","annotations":[]}],
              "thread_id":"019e7444-..."}],
    "usage":{"num_completion_tokens":10,"num_prompt_tokens":173}}
   ...
   output="items=[Message(type='message', ..., role=<MessageRole.assistant>,
            content=[OutputTextContent(text='Hello! How can I assist you today?', ...)],
            thread_id='019e7444-...')] usage=ModelUsage(...)"
Successfully tested configuration of model with key 'customer-support-agent-langsmith'.

Two things confirm the integration is correct:

  • The assistant reply parses as a Message with role=assistant, not as a CustomTaskInputMessage, so the output was deserialized into the expected Open Responses type.
  • A thread_id is present on the message — the value that the next turn will read back to continue the same conversation.

Your model is now registered and can be pointed at any multi-turn evaluation.


Adapting this pattern

The integration generalizes to any stateful agent runtime.

Different agent platforms

Swap the HTTP calls in query_model for your platform's API and adjust convert_model_output to read its message shape. Keep the contract identical: send only the new user turn plus a session id, and return only the current turn's trace items.
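One way to keep that contract explicit is a small interface that each platform implements. This is a sketch only: AgentBackend, TurnResult, and EchoBackend are hypothetical names, and EchoBackend stands in for a real HTTP client.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class TurnResult:
    session_id: str                                  # echoed on the next request
    items: list[dict] = field(default_factory=list)  # current turn's trace only


class AgentBackend(Protocol):
    def send(self, session_id: str, user_message: str) -> TurnResult:
        """Send one user turn; an empty session_id means "start a session"."""
        ...


class EchoBackend:
    """Toy backend that keeps per-session history in memory."""

    def __init__(self) -> None:
        self._sessions: dict[str, list[str]] = {}

    def send(self, session_id: str, user_message: str) -> TurnResult:
        session_id = session_id or f"session-{len(self._sessions) + 1}"
        history = self._sessions.setdefault(session_id, [])
        history.append(user_message)
        reply = {"type": "message", "role": "assistant",
                 "content": f"turn {len(history)}: {user_message}"}
        return TurnResult(session_id=session_id, items=[reply])


backend: AgentBackend = EchoBackend()
first = backend.send("", "Hello!")
second = backend.send(first.session_id, "Still there?")
```

With the platform hidden behind send, the rest of the pipeline only needs to carry session_id through the assistant message, as it does with thread_id.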

Single-turn agents

If your agent is stateless, drop the thread_id machinery entirely: convert_user_input returns just the user message, query_model makes one call, and build_assistant_message omits the thread_id extra field.

Carrying state other than a thread id

This echo mechanism works for any opaque state. Whatever you attach to the assistant Message (a session token, a cursor, a serialized memory blob) comes back on the next request — read it in convert_user_input and forward it to your endpoint.
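For state richer than a single id, one option is to serialize it into an opaque string before attaching it. This is a minimal sketch, assuming the extra field accepts arbitrary strings and the state is JSON-serializable; encode_state, decode_state, and the agent_state field name are illustrative.

```python
import base64
import json


def encode_state(state: dict) -> str:
    """Serialize state into a URL-safe opaque string."""
    raw = json.dumps(state, sort_keys=True).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_state(blob: str) -> dict:
    """Inverse of encode_state; an empty blob means "no prior state"."""
    if not blob:
        return {}
    return json.loads(base64.urlsafe_b64decode(blob.encode("ascii")))


# Attach on the response...
assistant_message = {
    "role": "assistant",
    "content": "Page 1 of results.",
    "agent_state": encode_state({"cursor": 20, "session": "abc"}),
}
# ...and read it back on the next request.
restored = decode_state(assistant_message["agent_state"])
```

Because the blob travels through the conversation payload, it is best kept small and free of secrets.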

Plugging into an evaluation

Because the snippet emits full traces (tool calls, outputs, and replies), the model drops straight into trace-aware evaluations — multi-turn solvers, function-call-coverage scorers, or model-as-a-judge scorers over open_responses traces. Point a task_specification at this model key and run lf run -f run.yaml.