LangSmith / LangGraph
Just want to use the integration? If you only need to plug a LangSmith-hosted LangGraph agent into AI GO!, you can use the ready-made integration directly — see the full integration in our registry and GitHub repo:
<REGISTRY_URL>/<GITHUB_URL>. This tutorial walks through how that integration is built so you can adapt it to your own stateful agent.
This tutorial demonstrates how to connect a LangGraph agent deployed on the LangGraph Platform (LangSmith) to AI GO! as a custom-inference model, so it can be evaluated like any other model — while preserving conversation state across turns.
The hard part of integrating a stateful agent is not calling the endpoint once; it is making multi-turn conversations work. The LangGraph Platform keeps conversation state server-side in a thread, and AI GO! drives the conversation one user turn at a time. The integration must keep both sides in sync so that turn 3 remembers what happened in turns 1 and 2.
This pattern applies to any stateful agent runtime — LangGraph, a custom agent server, or a hosted assistant API — where the conversation lives behind a session/thread identifier and each call returns a multi-step trace.
What you will build
A complete custom-inference model integration that:
- Forwards each user turn to a LangGraph deployment over HTTP (/runs/wait)
- Maintains conversation continuity across turns by round-tripping the LangGraph thread_id
- Returns the full per-turn agent trace (tool calls, tool outputs, and the final reply) in AI GO!'s Open Responses format, ready for trace-aware scorers
By the end, you will have a model you can register, test, and point any multi-turn evaluation at.
Step 1: Understand the integration
The multi-turn challenge
AI GO! sends a chat-completion request for every turn. The body contains the whole conversation so far; the last entry is the current user turn:
{
  "messages": [
    { "role": "user", "content": "Hi, can you help me see my orders?" },
    { "role": "assistant", "content": "Sure! What's your email and order ID?",
      "thread_id": "019e3d95-1898-7961-b0ed-536ac8c43757" },
    { "role": "user", "content": "[email protected], order ORD-1001" }
  ]
}

LangGraph, however, does not want the whole history replayed: it already has it stored in the thread. It only wants the new user message, plus the thread_id that identifies the conversation.
The mechanism for keeping both sides in sync is passing the thread_id through the conversation itself. AI GO! messages allow extra fields, and any custom field we set on an assistant message is echoed back unchanged on the next turn — this is the supported way to carry custom data between requests. So on each response we attach the thread_id to the assistant message; on the next request we read it back and reuse the same thread. The first turn (no prior assistant message) creates a fresh thread.
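The round trip can be simulated without any network calls. The sketch below is a toy under stated assumptions: `FakeThreadStore` and `handle_turn` are hypothetical names, the store stands in for the LangGraph deployment, and plain dicts replace AI GO!'s message models. It only illustrates how the thread_id travels on the assistant message.

```python
import uuid


class FakeThreadStore:
    """Hypothetical stand-in for a stateful agent server: one history per thread."""

    def __init__(self) -> None:
        self.threads: dict[str, list[str]] = {}

    def create_thread(self) -> str:
        thread_id = str(uuid.uuid4())
        self.threads[thread_id] = []
        return thread_id

    def run(self, thread_id: str, user_message: str) -> str:
        history = self.threads[thread_id]
        history.append(user_message)
        return f"turn {len(history)}: {user_message}"


def handle_turn(store: FakeThreadStore, messages: list[dict]) -> dict:
    """Handle one turn: reuse the echoed thread_id, or create a thread on turn 1."""
    thread_id = ""
    for msg in reversed(messages):
        if msg["role"] == "assistant":
            thread_id = msg.get("thread_id", "")
            break
    if not thread_id:
        thread_id = store.create_thread()
    # Only the newest user message is forwarded; the store already holds the rest.
    reply = store.run(thread_id, messages[-1]["content"])
    return {"role": "assistant", "content": reply, "thread_id": thread_id}


store = FakeThreadStore()
conversation = [{"role": "user", "content": "Hello!"}]
first = handle_turn(store, conversation)

# The evaluator echoes the assistant message (extra field included) on turn 2.
conversation += [first, {"role": "user", "content": "What is my thread id?"}]
second = handle_turn(store, conversation)
```

Both turns end up on the same thread because the second request carries the assistant message, and with it the thread_id, back to the handler.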
Example request bodies (AI GO! → run_inference)
First turn — just the new user message:
{
  "messages": [
    { "role": "user", "content": "Search the catalog for shoes." }
  ]
}

Subsequent turns echo the prior assistant message with its thread_id, so we keep using the same LangGraph thread:
{
  "messages": [
    { "role": "user", "content": "Hello!" },
    { "role": "assistant", "content": "Hello! How can I assist you today?",
      "thread_id": "019e3d95-1898-7961-b0ed-536ac8c43757" },
    { "role": "user", "content": "What is my thread id?" }
  ]
}

The three-stage pipeline
A single inference runs as a three-stage pipeline:
ChatCompletionInput (from AI GO!)
  -> convert_user_input   -> ModelInput
  -> query_model          -> RawModelOutput
  -> convert_model_output -> OpenResponsesModelOutput
run_inference ties the stages together: parse the request, convert it to what LangGraph accepts, call the endpoint, and convert the raw response back into the Open Responses format AI GO! expects.
The output shape
LangGraph's /runs/wait returns the full thread state — every message ever exchanged. We keep only the messages produced by the current turn and translate them into Open Responses trace items:
| Type | Description |
|---|---|
| message | A text message (here, the assistant's reply) |
| function_call | A tool the agent invoked (name, arguments) |
| function_call_output | The result of a tool call, linked by call_id |
Emitting these items (rather than just the final text) is what lets downstream trace-aware scorers inspect how the agent reached its answer.
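To make the call_id link concrete, here is a minimal sketch of the kind of inspection a trace-aware check could perform. It works on plain dicts shaped like the items in the table above; `pair_tool_calls` is a hypothetical helper for illustration, not part of AI GO!'s scorer API.

```python
def pair_tool_calls(items: list[dict]) -> list[dict]:
    """Join each function_call with its function_call_output via call_id."""
    outputs = {
        item["call_id"]: item["output"]
        for item in items
        if item["type"] == "function_call_output"
    }
    return [
        {
            "name": item["name"],
            "arguments": item["arguments"],
            # None means the trace holds no output for this call.
            "output": outputs.get(item["call_id"]),
        }
        for item in items
        if item["type"] == "function_call"
    ]


trace = [
    {"type": "function_call", "call_id": "call_1", "name": "search_products",
     "arguments": "{\"query\": \"shoes\"}"},
    {"type": "function_call_output", "call_id": "call_1", "output": "[]"},
    {"type": "message", "role": "assistant"},
]
pairs = pair_tool_calls(trace)
```

With only the final text available, none of this would be recoverable; the pairing depends on both item types being present in the trace.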
Step 2: Build the inference handler
The integration lives in a single Python file, run_inference.py, that defines a run_inference(body, environment) function. AI GO! calls it once per turn.
The entry point
run_inference is the function AI GO! calls. It wires the three stages together — parse and convert the request, query the LangGraph endpoint, and convert the raw response back into the Open Responses format AI GO! expects:
import json
from typing import Any


def run_inference(body: str, environment: dict[str, Any]) -> str:
    model_input = convert_user_input(ChatCompletionInput.model_validate(json.loads(body)))
    response = query_model(model_input, environment)
    model_output = convert_model_output(response)
    return model_output.model_dump_json()

The rest of this step implements each stage in turn.
Model-side types
We model the two intermediate payloads explicitly so the data flow stays legible:
from pydantic import BaseModel


class ModelInput(BaseModel):
    """What the LangGraph endpoint accepts."""
    thread_id: str  # empty on the first turn, reused afterwards
    user_message: str


class RawModelOutput(BaseModel):
    """What /runs/wait returns, plus the thread we used."""
    thread_id: str
    final_state: dict

Stage 1: convert_user_input
This is where multi-turn continuity is established. We pull the latest user message and recover the thread_id echoed by the previous assistant turn:
def convert_user_input(data: ChatCompletionInput) -> ModelInput:
    messages = data.messages
    # The newest user message is the only text forwarded to LangGraph.
    last_user = next(m for m in reversed(messages) if m.role == "user")
    thread_id = ""
    # The most recent assistant message carries the echoed thread_id, if any.
    for msg in reversed(messages):
        if msg.role == "assistant":
            thread_id = getattr(msg, "thread_id", "") or ""
            break
    return ModelInput(thread_id=thread_id, user_message=last_user.content)

On the first turn there is no prior assistant message, so thread_id stays empty, signalling that a new thread must be created.
Stage 2: query_model
Create a thread on the first turn, then run the agent and wait for it to finish:
import httpx


def query_model(model_input: ModelInput, environment: dict[str, Any]) -> RawModelOutput:
    base_url = environment["LANGSMITH_DEPLOY_URL"].rstrip("/")
    api_key = environment["LANGSMITH_API_KEY"]
    assistant_id = environment["LANGGRAPH_ASSISTANT_ID"]
    headers = {"x-api-key": api_key, "Content-Type": "application/json"}
    with httpx.Client(timeout=120) as client:
        thread_id = model_input.thread_id
        if not thread_id:
            # First turn: no thread was echoed back, so create one.
            create = client.post(f"{base_url}/threads", headers=headers, json={})
            create.raise_for_status()
            thread_id = create.json()["thread_id"]
        # Run the agent on the thread and block until the run finishes.
        run = client.post(
            f"{base_url}/threads/{thread_id}/runs/wait",
            headers=headers,
            json={
                "assistant_id": assistant_id,
                "input": {"messages": [{"role": "user", "content": model_input.user_message}]},
            },
        )
        run.raise_for_status()
        final_state = run.json()
    return RawModelOutput(thread_id=thread_id, final_state=final_state)

We send only the new user message; LangGraph appends it to the thread it already holds.
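One way to check the "new message only" contract without a live deployment is to build the request in a pure function and assert on it. The sketch below assumes the same endpoint path and body shape used in query_model; `build_run_request` is a hypothetical helper, and the host name is a placeholder.

```python
from typing import Any


def build_run_request(
    base_url: str, thread_id: str, assistant_id: str, user_message: str
) -> tuple[str, dict[str, Any]]:
    """Return the (url, json_body) pair for a blocking run on an existing thread."""
    url = f"{base_url.rstrip('/')}/threads/{thread_id}/runs/wait"
    body = {
        "assistant_id": assistant_id,
        # Only the new turn is sent; earlier turns live in the server-side thread.
        "input": {"messages": [{"role": "user", "content": user_message}]},
    }
    return url, body


url, body = build_run_request(
    "https://example.invalid/", "thread-123", "agent", "order ORD-1001"
)
```

Separating request construction from the HTTP call keeps the network code thin and lets a unit test confirm that no earlier turns leak into the payload.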
Example raw model output (agent API → query_model)
/runs/wait returns the final thread state. query_model wraps it with the thread_id we used so convert_model_output can echo it back:
{
  "thread_id": "019e3d95-45a7-7051-9e3a-c42ca0fbe182",
  "final_state": {
    "messages": [
      { "type": "human", "content": "Search the catalog for shoes.",
        "id": "adfea54d-..." },
      { "type": "ai", "content": "", "id": "lc_run--...",
        "tool_calls": [{ "name": "search_products",
                         "args": { "query": "shoes" },
                         "id": "call_DwNd...", "type": "tool_call" }],
        "usage_metadata": { "input_tokens": 177, "output_tokens": 15 } },
      { "type": "tool", "content": "[]", "name": "search_products",
        "tool_call_id": "call_DwNd...", "status": "success" },
      { "type": "ai",
        "content": "I couldn't find any shoes in the catalog. ...",
        "id": "lc_run--...",
        "usage_metadata": { "input_tokens": 201, "output_tokens": 22 } }
    ]
  }
}

Stage 3: convert_model_output
/runs/wait returns the entire thread, so we keep only the messages produced by this turn (everything after the last human message) and hand them to the converter:
def convert_model_output(raw_model_output: RawModelOutput) -> OpenResponsesModelOutput:
    thread_id = raw_model_output.thread_id
    messages = raw_model_output.final_state.get("messages", [])
    # Find the index of the last human message; this turn's output follows it.
    last_human = -1
    for i, msg in enumerate(messages):
        if msg.get("type") == "human":
            last_human = i
    turn_messages = messages[last_human + 1 :] if last_human >= 0 else messages
    return OpenResponsesConverter().build(turn_messages, thread_id=thread_id)

The Open Responses converter
The converter turns LangGraph's per-turn messages into Open Responses items. A single LangGraph ai message can carry several tool calls and/or text, so it expands into multiple items; a tool message becomes one function_call_output:
class OpenResponsesConverter:
    def build(self, messages: list[Any], **kwargs: Any) -> OpenResponsesModelOutput:
        items: list[TraceItem] = []
        num_prompt_tokens = 0
        num_completion_tokens = 0
        for message in messages:
            msg_type = message.get("type")
            if msg_type == "ai":
                # One ai message may hold several tool calls and/or reply text.
                for call in message.get("tool_calls") or []:
                    items.append(self.build_function_call(call, **kwargs))
                content = message.get("content") or ""
                if isinstance(content, str) and content.strip():
                    items.append(self.build_assistant_message(message, **kwargs))
                # Token usage is summed over every ai message in the turn.
                usage = message.get("usage_metadata") or {}
                num_prompt_tokens += usage.get("input_tokens", 0) or 0
                num_completion_tokens += usage.get("output_tokens", 0) or 0
            elif msg_type == "tool":
                items.append(self.build_function_call_output(message, **kwargs))
            else:
                raise ValueError(f"Unhandled message type: `{msg_type}`")
        return OpenResponsesModelOutput(
            items=items,
            usage=self.build_usage(num_prompt_tokens, num_completion_tokens),
        )

The thread_id to round-trip is passed through build(...) as a keyword argument and forwarded to the builders. The per-item builders construct each trace item:
    # Methods of OpenResponsesConverter; they rely on `import json` and `import uuid` at module level.
    def build_assistant_message(self, message: dict, **kwargs: Any) -> Message:
        return Message(
            id=str(uuid.uuid4()),
            status=MessageStatus.completed,
            role=MessageRole.assistant,
            content=[OutputTextContent(text=message["content"], annotations=[])],
            # Carried as an extra field so the next turn can reuse the thread.
            thread_id=kwargs.get("thread_id", ""),
        )

    def build_function_call(self, call: dict, **kwargs: Any) -> FunctionCall:
        return FunctionCall(
            id=str(uuid.uuid4()),
            call_id=call["id"],
            name=call["name"],
            arguments=json.dumps(call.get("args") or {}),
            status=FunctionCallStatus.completed,
        )

    def build_function_call_output(self, message: dict, **kwargs: Any) -> FunctionCallOutput:
        return FunctionCallOutput(
            id=str(uuid.uuid4()),
            call_id=message["tool_call_id"],
            output=message.get("content") or "",
            status=FunctionCallOutputStatusEnum.completed,
        )

Example Open Responses output (convert_model_output)
The converted result returned to AI GO! — tool call, tool output, and the final assistant message carrying the thread_id for the next turn:
{
  "items": [
    { "type": "function_call", "id": "5bf9...", "call_id": "call_DwNd...",
      "name": "search_products", "arguments": "{\"query\": \"shoes\"}",
      "status": "completed" },
    { "type": "function_call_output", "id": "9da4...",
      "call_id": "call_DwNd...", "output": "[]", "status": "completed" },
    { "type": "message", "id": "44c2...", "status": "completed",
      "role": "assistant",
      "content": [{ "type": "output_text",
                    "text": "I couldn't find any shoes in the catalog. ...",
                    "annotations": [] }],
      "thread_id": "019e3d95-45a7-7051-9e3a-c42ca0fbe182" }
  ],
  "usage": { "num_prompt_tokens": 378, "num_completion_tokens": 37 }
}

Step 3: Wire up the model
Three small files connect the snippet to AI GO!.
model.yaml
The model uses connection_type: custom_inference with the identity chat-completion adapter (the snippet already returns Open Responses, so no adapter transform is needed). Secrets and endpoint configuration are injected via environment:
display_name: "Customer Support Agent (LangSmith)"
key: "customer-support-agent-langsmith"
description: >-
  A customer-support agent deployed on the LangGraph Platform and served via
  LangSmith.
rate_limit: 15
task: "chat_completion"
config:
  connection_type: "custom_inference"
  adapter:
    key: "latticeflow$identity_chat_completion"
  run_inference_snippet: !include "./run_inference.py"
  environment:
    LANGSMITH_DEPLOY_URL: $LANGSMITH_DEPLOY_URL
    LANGGRAPH_ASSISTANT_ID: $LANGGRAPH_ASSISTANT_ID
    LANGSMITH_API_KEY: "<< secrets.LANGSMITH_API_KEY >>"
  timeout: 120
secrets:
  LANGSMITH_API_KEY: $LANGSMITH_API_KEY

- !include "./run_inference.py" inlines the snippet at registration time.
- << secrets.LANGSMITH_API_KEY >> references a server-side secret; the secrets block uploads it from $LANGSMITH_API_KEY in your .env.
- rate_limit: 15 keeps concurrency low, because external agent endpoints rarely tolerate high request volume. Lower it further if you see timeouts.
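Because all three settings arrive through the environment dict, a misconfigured model would otherwise surface as a bare KeyError inside the snippet. A small guard can report every missing value at once. This is a sketch, not part of the published integration: `require_env` is a hypothetical helper, and the key names are the ones from the environment block above.

```python
from typing import Any

REQUIRED_KEYS = ("LANGSMITH_DEPLOY_URL", "LANGSMITH_API_KEY", "LANGGRAPH_ASSISTANT_ID")


def require_env(environment: dict[str, Any]) -> dict[str, str]:
    """Return the required settings, or raise one error naming every missing key."""
    missing = [key for key in REQUIRED_KEYS if not environment.get(key)]
    if missing:
        raise ValueError(f"Missing environment values: {', '.join(missing)}")
    return {key: str(environment[key]) for key in REQUIRED_KEYS}


settings = require_env({
    "LANGSMITH_DEPLOY_URL": "https://example.invalid",
    "LANGSMITH_API_KEY": "placeholder-key",
    "LANGGRAPH_ASSISTANT_ID": "agent",
})
```

Calling such a guard at the top of query_model would turn a forgotten .env entry into a readable message during lf test model rather than a stack trace.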
app.yaml
The model just needs an app to live in. Rather than defining a bespoke one, reuse the shared playground app used across these guides:
display_name: "Playground App"
key: "playground-app"
description: >
  Shared app for trying out model integrations.

.env
LANGSMITH_DEPLOY_URL=https://<your-deployment>.langgraph.app
LANGGRAPH_ASSISTANT_ID=<your-assistant-id>
LANGSMITH_API_KEY=lsv2_sk_...

Step 4: Register and test
# Create and switch to the shared playground app
lf add app -f app.yaml
lf switch playground-app
# Register the model (uploads the secret and inlines run_inference.py)
lf add model -f model.yaml
# Verify the endpoint is reachable and returns well-formed Open Responses
lf test model customer-support-agent-langsmith

lf test model sends a single "Hello!" turn and shows each pipeline stage. A successful run ends with the parsed Open Responses output:
3. Running inference.
Status code: 200
4. Transforming model output.
{"items":[{"id":"...","status":"completed","role":"assistant",
"content":[{"text":"Hello! How can I assist you today?","annotations":[]}],
"thread_id":"019e7444-..."}],
"usage":{"num_completion_tokens":10,"num_prompt_tokens":173}}
...
output="items=[Message(type='message', ..., role=<MessageRole.assistant>,
content=[OutputTextContent(text='Hello! How can I assist you today?', ...)],
thread_id='019e7444-...')] usage=ModelUsage(...)"
Successfully tested configuration of model with key 'customer-support-agent-langsmith'.
Two things confirm the integration is correct:
- The assistant reply parses as a Message with role=assistant (not a CustomTaskInputMessage), so the serialization gotcha is handled.
- A thread_id is present on the message: the value that the next turn will read back to continue the same conversation.
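The same two checks can be scripted against the JSON string that run_inference returns. The sketch below assumes only the field names visible in the output above (items, role, thread_id); `check_output` is a hypothetical helper, and the sample payload is made up for illustration.

```python
import json


def check_output(raw_json: str) -> str:
    """Return the thread_id carried by the last assistant message, or raise."""
    items = json.loads(raw_json)["items"]
    assistant = [
        item for item in items
        if item.get("role") == "assistant" and item.get("thread_id")
    ]
    if not assistant:
        raise ValueError("no assistant message carries a thread_id")
    return assistant[-1]["thread_id"]


sample = json.dumps({
    "items": [{
        "type": "message", "role": "assistant", "status": "completed",
        "content": [{"type": "output_text", "text": "Hello!", "annotations": []}],
        "thread_id": "thread-123",
    }],
    "usage": {"num_prompt_tokens": 1, "num_completion_tokens": 1},
})
thread_id = check_output(sample)
```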
Your model is now registered and can be pointed at any multi-turn evaluation.
Adapting this pattern
The integration generalizes to any stateful agent runtime.
Different agent platforms
Swap the HTTP calls in query_model for your platform's API and adjust convert_model_output to read its message shape. Keep the contract identical: send only the new user turn plus a session id, and return the current turn's trace items.
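That contract can be written down as an interface so the platform-specific part stays isolated. The following is a sketch under assumptions: `SessionBackend`, `EchoBackend`, and `query` are hypothetical names, and the toy backend merely echoes input in place of a real agent API.

```python
from typing import Protocol


class SessionBackend(Protocol):
    """Hypothetical contract: any runtime that keeps state behind a session id."""

    def create_session(self) -> str: ...

    def send_turn(self, session_id: str, user_message: str) -> list[dict]:
        """Send one new user turn; return only the messages this turn produced."""
        ...


class EchoBackend:
    """Toy in-memory backend used only to exercise the contract."""

    def __init__(self) -> None:
        self.sessions: dict[str, int] = {}

    def create_session(self) -> str:
        session_id = f"session-{len(self.sessions) + 1}"
        self.sessions[session_id] = 0
        return session_id

    def send_turn(self, session_id: str, user_message: str) -> list[dict]:
        self.sessions[session_id] += 1  # count the turns seen by this session
        return [{"type": "ai", "content": f"echo: {user_message}"}]


def query(
    backend: SessionBackend, session_id: str, user_message: str
) -> tuple[str, list[dict]]:
    """Create a session on the first turn, then forward only the new message."""
    session_id = session_id or backend.create_session()
    return session_id, backend.send_turn(session_id, user_message)


backend = EchoBackend()
sid, first = query(backend, "", "hi")
sid_again, second = query(backend, sid, "again")
```

Porting to another platform then means writing one class with these two methods and adjusting the output converter to that platform's message shape.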
Single-turn agents
If your agent is stateless, drop the thread_id machinery entirely: convert_user_input returns just the user message, query_model makes one call, and build_assistant_message omits the thread_id extra field.
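A stateless variant might look like the sketch below. It uses plain dicts instead of AI GO!'s message models, and `run_stateless` and `call_agent` are hypothetical names; the lambda stands in for a single HTTP call to the agent.

```python
from typing import Callable


def run_stateless(messages: list[dict], call_agent: Callable[[str], str]) -> dict:
    """Answer the latest user message with a single call and no session state."""
    last_user = next(
        (m for m in reversed(messages) if m["role"] == "user"), None
    )
    if last_user is None:
        raise ValueError("request contains no user message")
    # No thread_id is read from the request and none is attached to the reply.
    return {"role": "assistant", "content": call_agent(last_user["content"])}


reply = run_stateless(
    [{"role": "user", "content": "ping"}],
    call_agent=lambda text: text.upper(),  # hypothetical stand-in for one HTTP call
)
```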
Carrying state other than a thread id
This echo mechanism works for any opaque state. Whatever you attach to the assistant Message (a session token, a cursor, a serialized memory blob) comes back on the next request — read it in convert_user_input and forward it to your endpoint.
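As an illustration with more than one field, the sketch below attaches two state values to the assistant reply and reads them back on the next request. The field names `session_token` and `cursor` and both helper functions are hypothetical, and plain dicts stand in for AI GO!'s message models.

```python
STATE_FIELDS = ("session_token", "cursor")  # hypothetical extra-field names


def recover_state(messages: list[dict]) -> dict:
    """Read the state echoed on the most recent assistant message, if any."""
    for msg in reversed(messages):
        if msg["role"] == "assistant":
            return {key: msg[key] for key in STATE_FIELDS if key in msg}
    return {}  # first turn: nothing has been echoed yet


def attach_state(reply: dict, state: dict) -> dict:
    """Return a copy of the assistant reply with the state fields attached."""
    return {**reply, **{key: state[key] for key in STATE_FIELDS if key in state}}


reply = attach_state(
    {"role": "assistant", "content": "Page 1 of results."},
    {"session_token": "tok-1", "cursor": 20},
)
history = [
    {"role": "user", "content": "List my orders."},
    reply,
    {"role": "user", "content": "Next page."},
]
state = recover_state(history)
```

Keeping the list of state fields in one place makes it harder for the attach and recover sides to drift apart.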
Plugging into an evaluation
Because the snippet emits full traces (tool calls, outputs, and replies), the model drops straight into trace-aware evaluations — multi-turn solvers, function-call-coverage scorers, or model-as-a-judge scorers over open_responses traces. Point a task_specification at this model key and run lf run -f run.yaml.
