Skip to content

[bot] Add Llama Stack Client Python SDK integration for inference, agents, and responses execution instrumentation #535

Description

@braintrust-bot

Summary

The Llama Stack Client Python SDK (llama-stack-client) is Meta's official Python client for Llama Stack — a production framework for building LLM-powered applications using Llama models. Latest stable release: v0.2.x / 0.6.0 (March 2026), actively maintained by the llamastack organization. This repository has zero instrumentation for any llama-stack-client execution surface — no integration directory, no wrapper, no patcher, no auto_instrument() support.

Llama Stack provides standardized inference, safety, agents, and retrieval APIs that work uniformly across local (Ollama, vLLM, TGI), cloud (Together AI, Fireworks, SambaNova), and Meta's own managed endpoints. Its LlamaStackClient and AsyncLlamaStackClient expose a distinct execution API (client.inference.chat_completion(), client.agents.*, client.responses.create()) that is separate from the OpenAI SDK shape and is not coverable by wrap_openai().

Comparable provider and agent framework integrations in this repo: openai, anthropic, mistral, cohere, huggingface-hub, openai-agents, claude-agent-sdk, google-adk.

What needs to be instrumented

The llama-stack-client package exposes these execution surfaces, none of which are instrumented:

Inference API (highest priority)

SDK Method Description Return type
client.inference.chat_completion(model_id, messages, ...) Chat completions with Llama models — supports tool calling, streaming, structured output ChatCompletionResponse or AsyncIterator[ChatCompletionStreamChunk]
client.inference.completion(model_id, content, ...) Text completions from a prompt CompletionResponse
client.inference.embeddings(model_id, contents, ...) Text embeddings generation EmbeddingsResponse

ChatCompletionResponse includes completion_message (content + tool calls), logprobs, and usage metadata (prompt_tokens, completion_tokens, total_tokens). Streaming yields ChatCompletionStreamChunk objects with delta content.

Responses API (agent execution)

SDK Method Description Return type
client.responses.create(input, model, tools=..., ...) Agentic response generation — automatically handles tool calls, knowledge bases, and conversation state OpenAIResponseObject

The Responses API is Llama Stack's high-level agent execution surface introduced in early 2026. It accepts a conversation thread, configured tools (MCP servers, function calls, web search), and returns a completed response after autonomously handling tool invocations in a loop.

Agents API (lower-level agent runs)

SDK Method Description Return type
client.agents.create(agent_config) Create an agent with a specific system prompt, tools, and model AgentCreateResponse
client.agents.sessions.create(agent_id, ...) Create a session for multi-turn conversations Session
client.agents.turns.create(agent_id, session_id, messages, ...) Execute an agent turn, yielding events (tool calls, completions, etc.) AgentTurnResponseStreamChunk iterator

The Agents API provides a lower-level interface for multi-turn agent execution with explicit turn management. Each turn involves an LLM call, optional tool execution, and a final completion.

Async variants

All methods have async equivalents on AsyncLlamaStackClient with identical signatures.

Implementation notes

Distinct client from OpenAI: LlamaStackClient is a Stainless-generated client (like Groq, Mistral, Together) but with a Llama Stack-specific API shape. It is not a subclass of openai.OpenAI and cannot be wrapped with wrap_openai().

Provider-agnostic: Llama Stack routes inference to any configured backend (Ollama, Together, Fireworks, SambaNova, Meta's managed endpoints). The integration should capture the model_id and, where available, the backend provider from response metadata.

Streaming: Both inference.chat_completion() and agents.turns.create() support streaming. The integration must handle streaming span lifecycle (start span on call, accumulate chunks/events, finalize on exhaustion).

Tool call tracing: When agents invoke tools, the turn response stream includes AgentTurnResponseStepStartPayload and AgentTurnResponseStepCompletePayload events for each tool call. These should be captured as child spans of the agent turn span.

Token usage: ChatCompletionResponse includes a usage field with prompt_tokens, completion_tokens, total_tokens — directly usable for span metrics.

No coverage in any instrumentation layer

  • No integration directory (py/src/braintrust/integrations/llama_stack/)
  • No wrapper function (e.g. wrap_llama_stack())
  • No patcher in any existing integration
  • No nox test session (test_llama_stack)
  • No version entry in py/src/braintrust/integrations/versioning.py
  • No mention in py/src/braintrust/integrations/__init__.py
  • No entry in [tool.braintrust.matrix] in py/pyproject.toml

A grep for llama_stack, llama-stack, llamastack across py/src/braintrust/ returns zero matches.

Braintrust docs status

not_foundllama-stack-client is not listed on the Braintrust integrations directory or the tracing guide. No auto_instrument() reference and no wrap_llama_stack() function are documented anywhere in Braintrust docs.

Upstream references

Local repo files inspected

  • py/src/braintrust/integrations/ — no llama_stack/ directory on main
  • py/src/braintrust/wrappers/ — no Llama Stack wrapper
  • py/noxfile.py — no test_llama_stack session
  • py/pyproject.toml [tool.braintrust.matrix] — no llama-stack-client entry
  • py/src/braintrust/integrations/__init__.py — Llama Stack not listed
  • py/src/braintrust/integrations/versioning.py — no Llama Stack version matrix
  • Full repo grep for llama_stack, llama-stack, llamastack — zero matches in SDK source

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels
    No fields configured for Feature.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions