Main Features
Server-Based Architecture
- Centralized Server: Llama Stack Server hosts inference, agents, safety, tool runtime, vector I/O, and files
- Remote or Inline Providers: Support for remote APIs (e.g., OpenAI-compatible) and inline providers (e.g., meta-reference, sqlite-vec, localfs)
- Kubernetes Deployment: Deploy via the Llama Stack Operator using LlamaStackDistribution custom resources
Agents
- Agent Creation: Create agents with a model, instructions, and a list of tools
- Client-Side Tools: Define tools with the @client_tool decorator; the client executes tool calls and returns results to the server
- Session Management: Create sessions and run multi-turn conversations with streaming responses
- Streaming: Support for streaming agent responses for real-time display
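The agent and client-tool features above can be sketched in Python. The tool below is an ordinary function whose name, docstring, and type hints describe it to the model; with llama-stack-client you would decorate it with @client_tool. The wiring shown in comments requires a running Llama Stack server, and the identifiers there (LlamaStackClient, Agent, create_session, create_turn) are assumptions based on the llama-stack-client library, not verbatim from this document:

```python
def get_weather(city: str) -> str:
    """Return a short weather summary for the given city.

    :param city: Name of the city to look up.
    :returns: A one-line weather description.
    """
    # A real client-side tool would call an external API; stubbed here
    # so the shape (typed signature plus docstring) is clear.
    return f"Sunny and 22 C in {city}"


# Hedged sketch of wiring the tool into an agent and streaming a turn
# (requires a running server; adjust names to your client version):
#
#   from llama_stack_client import LlamaStackClient, Agent
#
#   client = LlamaStackClient(base_url="http://localhost:8321")
#   agent = Agent(client, model="<model-id>",
#                 instructions="Be concise.", tools=[get_weather])
#   session_id = agent.create_session("demo-session")
#   for event in agent.create_turn(
#       messages=[{"role": "user", "content": "Weather in Paris?"}],
#       session_id=session_id, stream=True,
#   ):
#       ...  # render streamed events for real-time display
```

When the model requests a get_weather call, the client runs the function locally and posts the result back to the server, which continues the turn.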
Configuration and Extensibility
- Stack Configuration: YAML-based configuration for APIs, providers, persistence (e.g., kv_default, sql_default), and models
- Environment Fallbacks: Use ${env.VAR:~default} in configuration files for flexible deployment
- Multiple Distributions: Choose from starter, postgres-demo, meta-reference-gpu, and other distributions
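A configuration along these lines might look as follows. This is an illustrative fragment, not a verbatim distribution file: only the env-fallback syntax, the kv_default/sql_default store names, and the sqlite-vec provider come from the text above; the remaining keys and paths are assumptions:

```yaml
# Illustrative Llama Stack run configuration (sketch)
apis:
  - inference
  - agents
  - vector_io
providers:
  vector_io:
    - provider_id: sqlite-vec
      provider_type: inline::sqlite-vec
storage:
  kv_default:
    db_path: ${env.KV_STORE_PATH:~/tmp/kv_default.db}   # env var with fallback
  sql_default:
    db_path: ${env.SQL_STORE_PATH:~/tmp/sql_default.db}
```

At startup the server resolves each ${env.VAR:~default} reference from the environment, falling back to the default when the variable is unset, so one config file can serve several deployment targets.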
Integration
- Python Client: llama-stack-client for Python 3.12+ with full agent and model APIs
- REST-Friendly: Server exposes APIs for inference, agents, and tool runtime; can be wrapped in FastAPI or other web frameworks for production use
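Because the server speaks plain HTTP, any client can call it directly. A minimal standard-library sketch of building a request for an OpenAI-compatible chat endpoint follows; the endpoint path, port, and payload shape are assumptions to adjust for your deployment:

```python
import json
from urllib import request


def build_chat_request(base_url: str, model: str, prompt: str) -> request.Request:
    """Build a POST request for an OpenAI-compatible chat-completions endpoint.

    The path and JSON body shape follow the common OpenAI-compatible
    convention; verify both against your server's API reference.
    """
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return request.Request(
        f"{base_url}/v1/chat/completions",  # hypothetical path
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Usage (needs a running server):
#   with request.urlopen(build_chat_request(
#           "http://localhost:8321", "<model-id>", "Hello")) as resp:
#       print(json.load(resp))
```

The same request object can be sent with urllib, or the pattern translated to httpx/requests inside a FastAPI wrapper for production use.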