Summary
The Llama Stack Client Python SDK (llama-stack-client) is Meta's official Python client for Llama Stack — a production framework for building LLM-powered applications using Llama models. Latest stable release: v0.2.x / 0.6.0 (March 2026), actively maintained by the llamastack organization. This repository has zero instrumentation for any llama-stack-client execution surface — no integration directory, no wrapper, no patcher, no auto_instrument() support.
Llama Stack provides standardized inference, safety, agents, and retrieval APIs that work uniformly across local (Ollama, vLLM, TGI), cloud (Together AI, Fireworks, SambaNova), and Meta's own managed endpoints. Its LlamaStackClient and AsyncLlamaStackClient expose a distinct execution API (client.inference.chat_completion(), client.agents.*, client.responses.create()) that is separate from the OpenAI SDK shape and is not coverable by wrap_openai().
Comparable provider and agent framework integrations in this repo: openai, anthropic, mistral, cohere, huggingface-hub, openai-agents, claude-agent-sdk, google-adk.
What needs to be instrumented
The llama-stack-client package exposes these execution surfaces, none of which are instrumented:
Inference API (highest priority)
| SDK Method |
Description |
Return type |
client.inference.chat_completion(model_id, messages, ...) |
Chat completions with Llama models — supports tool calling, streaming, structured output |
ChatCompletionResponse or AsyncIterator[ChatCompletionStreamChunk] |
client.inference.completion(model_id, content, ...) |
Text completions from a prompt |
CompletionResponse |
client.inference.embeddings(model_id, contents, ...) |
Text embeddings generation |
EmbeddingsResponse |
ChatCompletionResponse includes completion_message (content + tool calls), logprobs, and usage metadata (prompt_tokens, completion_tokens, total_tokens). Streaming yields ChatCompletionStreamChunk objects with delta content.
Responses API (agent execution)
| SDK Method |
Description |
Return type |
client.responses.create(input, model, tools=..., ...) |
Agentic response generation — automatically handles tool calls, knowledge bases, and conversation state |
OpenAIResponseObject |
The Responses API is Llama Stack's high-level agent execution surface introduced in early 2026. It accepts a conversation thread, configured tools (MCP servers, function calls, web search), and returns a completed response after autonomously handling tool invocations in a loop.
Agents API (lower-level agent runs)
| SDK Method |
Description |
Return type |
client.agents.create(agent_config) |
Create an agent with a specific system prompt, tools, and model |
AgentCreateResponse |
client.agents.sessions.create(agent_id, ...) |
Create a session for multi-turn conversations |
Session |
client.agents.turns.create(agent_id, session_id, messages, ...) |
Execute an agent turn, yielding events (tool calls, completions, etc.) |
AgentTurnResponseStreamChunk iterator |
The Agents API provides a lower-level interface for multi-turn agent execution with explicit turn management. Each turn involves an LLM call, optional tool execution, and a final completion.
Async variants
All methods have async equivalents on AsyncLlamaStackClient with identical signatures.
Implementation notes
Distinct client from OpenAI: LlamaStackClient is a Stainless-generated client (like Groq, Mistral, Together) but with a Llama Stack-specific API shape. It is not a subclass of openai.OpenAI and cannot be wrapped with wrap_openai().
Provider-agnostic: Llama Stack routes inference to any configured backend (Ollama, Together, Fireworks, SambaNova, Meta's managed endpoints). The integration should capture the model_id and, where available, the backend provider from response metadata.
Streaming: Both inference.chat_completion() and agents.turns.create() support streaming. The integration must handle streaming span lifecycle (start span on call, accumulate chunks/events, finalize on exhaustion).
Tool call tracing: When agents invoke tools, the turn response stream includes AgentTurnResponseStepStartPayload and AgentTurnResponseStepCompletePayload events for each tool call. These should be captured as child spans of the agent turn span.
Token usage: ChatCompletionResponse includes a usage field with prompt_tokens, completion_tokens, total_tokens — directly usable for span metrics.
No coverage in any instrumentation layer
- No integration directory (
py/src/braintrust/integrations/llama_stack/)
- No wrapper function (e.g.
wrap_llama_stack())
- No patcher in any existing integration
- No nox test session (
test_llama_stack)
- No version entry in
py/src/braintrust/integrations/versioning.py
- No mention in
py/src/braintrust/integrations/__init__.py
- No entry in
[tool.braintrust.matrix] in py/pyproject.toml
A grep for llama_stack, llama-stack, llamastack across py/src/braintrust/ returns zero matches.
Braintrust docs status
not_found — llama-stack-client is not listed on the Braintrust integrations directory or the tracing guide. No auto_instrument() reference and no wrap_llama_stack() function are documented anywhere in Braintrust docs.
Upstream references
Local repo files inspected
py/src/braintrust/integrations/ — no llama_stack/ directory on main
py/src/braintrust/wrappers/ — no Llama Stack wrapper
py/noxfile.py — no test_llama_stack session
py/pyproject.toml [tool.braintrust.matrix] — no llama-stack-client entry
py/src/braintrust/integrations/__init__.py — Llama Stack not listed
py/src/braintrust/integrations/versioning.py — no Llama Stack version matrix
- Full repo grep for
llama_stack, llama-stack, llamastack — zero matches in SDK source
Summary
The Llama Stack Client Python SDK (
llama-stack-client) is Meta's official Python client for Llama Stack — a production framework for building LLM-powered applications using Llama models. Latest stable release: v0.2.x / 0.6.0 (March 2026), actively maintained by the llamastack organization. This repository has zero instrumentation for anyllama-stack-clientexecution surface — no integration directory, no wrapper, no patcher, noauto_instrument()support.Llama Stack provides standardized inference, safety, agents, and retrieval APIs that work uniformly across local (Ollama, vLLM, TGI), cloud (Together AI, Fireworks, SambaNova), and Meta's own managed endpoints. Its
LlamaStackClientandAsyncLlamaStackClientexpose a distinct execution API (client.inference.chat_completion(),client.agents.*,client.responses.create()) that is separate from the OpenAI SDK shape and is not coverable bywrap_openai().Comparable provider and agent framework integrations in this repo:
openai,anthropic,mistral,cohere,huggingface-hub,openai-agents,claude-agent-sdk,google-adk.What needs to be instrumented
The
llama-stack-clientpackage exposes these execution surfaces, none of which are instrumented:Inference API (highest priority)
client.inference.chat_completion(model_id, messages, ...)ChatCompletionResponseorAsyncIterator[ChatCompletionStreamChunk]client.inference.completion(model_id, content, ...)CompletionResponseclient.inference.embeddings(model_id, contents, ...)EmbeddingsResponseChatCompletionResponseincludescompletion_message(content + tool calls),logprobs, and usage metadata (prompt_tokens,completion_tokens,total_tokens). Streaming yieldsChatCompletionStreamChunkobjects with delta content.Responses API (agent execution)
client.responses.create(input, model, tools=..., ...)OpenAIResponseObjectThe Responses API is Llama Stack's high-level agent execution surface introduced in early 2026. It accepts a conversation thread, configured tools (MCP servers, function calls, web search), and returns a completed response after autonomously handling tool invocations in a loop.
Agents API (lower-level agent runs)
client.agents.create(agent_config)AgentCreateResponseclient.agents.sessions.create(agent_id, ...)Sessionclient.agents.turns.create(agent_id, session_id, messages, ...)AgentTurnResponseStreamChunkiteratorThe Agents API provides a lower-level interface for multi-turn agent execution with explicit turn management. Each turn involves an LLM call, optional tool execution, and a final completion.
Async variants
All methods have async equivalents on
AsyncLlamaStackClientwith identical signatures.Implementation notes
Distinct client from OpenAI:
LlamaStackClientis a Stainless-generated client (like Groq, Mistral, Together) but with a Llama Stack-specific API shape. It is not a subclass ofopenai.OpenAIand cannot be wrapped withwrap_openai().Provider-agnostic: Llama Stack routes inference to any configured backend (Ollama, Together, Fireworks, SambaNova, Meta's managed endpoints). The integration should capture the
model_idand, where available, the backend provider from response metadata.Streaming: Both
inference.chat_completion()andagents.turns.create()support streaming. The integration must handle streaming span lifecycle (start span on call, accumulate chunks/events, finalize on exhaustion).Tool call tracing: When agents invoke tools, the turn response stream includes
AgentTurnResponseStepStartPayloadandAgentTurnResponseStepCompletePayloadevents for each tool call. These should be captured as child spans of the agent turn span.Token usage:
ChatCompletionResponseincludes ausagefield withprompt_tokens,completion_tokens,total_tokens— directly usable for span metrics.No coverage in any instrumentation layer
py/src/braintrust/integrations/llama_stack/)wrap_llama_stack())test_llama_stack)py/src/braintrust/integrations/versioning.pypy/src/braintrust/integrations/__init__.py[tool.braintrust.matrix]inpy/pyproject.tomlA grep for
llama_stack,llama-stack,llamastackacrosspy/src/braintrust/returns zero matches.Braintrust docs status
not_found—llama-stack-clientis not listed on the Braintrust integrations directory or the tracing guide. Noauto_instrument()reference and nowrap_llama_stack()function are documented anywhere in Braintrust docs.Upstream references
Local repo files inspected
py/src/braintrust/integrations/— nollama_stack/directory onmainpy/src/braintrust/wrappers/— no Llama Stack wrapperpy/noxfile.py— notest_llama_stacksessionpy/pyproject.toml[tool.braintrust.matrix]— no llama-stack-client entrypy/src/braintrust/integrations/__init__.py— Llama Stack not listedpy/src/braintrust/integrations/versioning.py— no Llama Stack version matrixllama_stack,llama-stack,llamastack— zero matches in SDK source