@roostjs/ai
Why agents are classes, how the agentic loop works, and why Cloudflare Workers AI means no external API keys or network calls.
Why Class-Based Agents
Agents could be plain functions: const agent = createAgent({ instructions, tools }).
This is the approach taken by several popular AI SDK libraries. Roost instead makes agents
classes for reasons that become apparent once an application has more than one or two agents.
Class-based agents have inherent identity. When you have a ResearchAssistant,
a SupportAgent, and a BillingConcierge, their names appear in
logs, error messages, and test assertions. Decorators — @Model, @MaxSteps,
@Temperature — can be attached to the class rather than buried in an options
object, making configuration visually prominent. The static fake() and
restore() methods work on the class as a whole, so tests can fake a specific
agent type without affecting other agents in the same test suite.
Conversation memory is also cleaner on instances: create a new agent instance for each independent conversation, and the instance holds that conversation's history. When the conversation ends, discard the instance. No external state management is required for the common case.
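A minimal sketch of what this class-based shape looks like. The decorator names and the 70B model identifier come from this document; the decorator internals, the config storage, and the ResearchAssistant class body are illustrative assumptions, not the library's actual source:

```typescript
type AgentConfig = { model?: string; maxSteps?: number };

// Illustrative: configuration keyed by the class itself, so it belongs to
// the agent type rather than to an options object passed at call time.
const configs = new WeakMap<Function, AgentConfig>();

function Model(model: string) {
  return (target: Function, _ctx?: unknown): void => {
    configs.set(target, { ...configs.get(target), model });
  };
}

function MaxSteps(maxSteps: number) {
  return (target: Function, _ctx?: unknown): void => {
    configs.set(target, { ...configs.get(target), maxSteps });
  };
}

@Model("@cf/meta/llama-3.1-70b-instruct")
@MaxSteps(8)
class ResearchAssistant {
  // One instance per conversation; history lives on the instance and is
  // discarded with it when the conversation ends.
  readonly history: string[] = [];

  static get config(): AgentConfig {
    return configs.get(this) ?? {};
  }
}
```

Because configuration hangs off the class, it is visible at the top of the declaration and shows up under the class's own name in logs and test assertions.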
The Agentic Loop
When agent.prompt(input) is called, it does not simply send a message and
return a response. It enters a loop. The agent sends the current message history to the
model. If the model responds with tool calls, the agent executes those tools and adds
the results to the message history, then sends the updated history back to the model.
This continues until the model responds with text (not tool calls) or until
maxSteps is reached.
This loop is what makes agents genuinely agentic rather than just "LLM with context."
A research agent that needs to search the web, summarize three pages, and synthesize
an answer makes multiple sequential tool calls in a single prompt() invocation.
The caller receives the final synthesized answer without needing to orchestrate the
intermediate steps.
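The loop described above can be sketched as follows. The message shapes, the model function, and the tool registry are stand-ins for illustration; only the loop structure and the maxSteps cutoff mirror the prose:

```typescript
type Message =
  | { role: "system" | "user" | "assistant"; content: string }
  | { role: "tool"; name: string; content: string };

type ModelReply =
  | { text: string }
  | { toolCalls: { name: string; args: string }[] };

type ModelFn = (history: Message[]) => ModelReply;
type Tools = Record<string, (args: string) => string>;

function runLoop(model: ModelFn, tools: Tools, input: string, maxSteps = 5): string {
  const history: Message[] = [{ role: "user", content: input }];
  for (let step = 0; step < maxSteps; step++) {
    const reply = model(history);
    // Text response: the loop is done and the caller gets the final answer.
    if ("text" in reply) return reply.text;
    // Tool calls: execute each tool, append results, and loop again so the
    // model can see the results on the next turn.
    for (const call of reply.toolCalls) {
      history.push({ role: "tool", name: call.name, content: tools[call.name](call.args) });
    }
  }
  throw new Error("maxSteps reached without a final answer");
}
```

Note that the caller invokes runLoop once; however many tool round-trips occur, they stay inside the loop.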
Cloudflare Workers AI: No API Keys, No External Calls
When your Roost agent runs on Cloudflare Workers, inference does not go through an external API. It runs on Cloudflare's own GPU infrastructure via the Workers AI binding. There are no API keys to manage, no OpenAI or Anthropic billing to configure, and no external network requests that could fail, time out, or be rate-limited by a third-party service.
The CloudflareAIProvider calls AIClient.run(model, inputs), which
resolves through the env.AI binding — a Cloudflare binding, not an HTTP
endpoint. The inference happens inside Cloudflare's network, at a data center geographically
close to the executing Worker. This is not a small operational advantage: it means AI
inference inherits Workers' reliability and latency characteristics rather than adding
a new external dependency.
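A sketch of what this looks like in code. The Ai interface below is a structural stand-in for the env.AI binding, and the provider internals are illustrative; the point is the absence of any URL, API key, or HTTP client:

```typescript
// Structural stand-in for Cloudflare's env.AI binding.
interface Ai {
  run(model: string, inputs: Record<string, unknown>): Promise<unknown>;
}

class CloudflareAIProvider {
  constructor(private ai: Ai) {}

  run(model: string, inputs: Record<string, unknown>): Promise<unknown> {
    // No endpoint, no credentials: the call resolves through the binding,
    // inside Cloudflare's network, rather than over HTTP to a third party.
    return this.ai.run(model, inputs);
  }
}
```

In a Worker, the real binding arrives on the env object (`new CloudflareAIProvider(env.AI)`); in tests, any object with a matching run method will do.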
The AIProvider Interface
The AIProvider interface exists to make the AI backend swappable. The default
and built-in provider is CloudflareAIProvider. An application can register
a different provider on an agent class via AgentClass.setProvider(provider).
This extensibility is intended primarily for testing and for teams that want to run a
specific agent through a different backend — not as an invitation to routinely switch
providers per request.
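One plausible shape for this, assuming the interface and setProvider mechanics are roughly as described (the field layout and class bodies below are guesses for illustration):

```typescript
interface AIProvider {
  run(model: string, inputs: Record<string, unknown>): Promise<unknown>;
}

class Agent {
  static provider: AIProvider | undefined;

  // Assigning through `this` creates an own static on whichever subclass
  // setProvider is called on, so one agent class can be redirected to a
  // different backend without touching its siblings.
  static setProvider(provider: AIProvider): void {
    this.provider = provider;
  }
}

class SupportAgent extends Agent {}
class BillingConcierge extends Agent {}
```

This per-class scoping is what lets a test swap the backend for one agent type while the rest of the suite keeps the default.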
Default Model Rationale
The default model is @cf/meta/llama-3.1-8b-instruct. The 8B Llama model
is fast and cheap to run, suitable for most tool-assisted tasks. Agents that need more
capability can override with @cf/meta/llama-3.1-70b-instruct via the
@Model decorator. The default is the smallest model that is generally
useful, not the largest model that might be impressive — smaller models fail faster
and cost less, which is the right default before profiling.
AI Gateway: Routing, Caching, and Fallback
The direct Workers AI binding (env.AI) is the lowest-latency path to inference: the call resolves inside Cloudflare's network without an additional HTTP hop. GatewayAIProvider introduces a deliberate tradeoff — roughly 10 ms of added latency in exchange for three capabilities the binding alone cannot provide.
Observability. Every request through AI Gateway is logged in the Cloudflare dashboard with latency, model, token counts, and error codes. The binding produces no such logs.
Request caching. AI Gateway can cache identical requests at the edge. For applications that send the same prompt repeatedly — search autocomplete, FAQ bots, cached summaries — this eliminates inference cost entirely for cache hits.
Fallback. If the gateway is unreachable or returns an error, GatewayAIProvider calls the direct CloudflareAIProvider automatically. The application never sees the failure.
The decision between the two providers is operational, not architectural. Start with the direct binding during development and switch to the gateway in production when you need logs and caching.
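The fallback behavior described above can be sketched like this. The constructor shape is an assumption; only the fall-through to the direct provider mirrors the prose:

```typescript
interface AIProvider {
  run(model: string, inputs: Record<string, unknown>): Promise<unknown>;
}

class GatewayAIProvider implements AIProvider {
  constructor(private gateway: AIProvider, private direct: AIProvider) {}

  async run(model: string, inputs: Record<string, unknown>): Promise<unknown> {
    try {
      return await this.gateway.run(model, inputs);
    } catch {
      // Gateway unreachable or errored: retry on the direct binding so the
      // application never sees the failure.
      return this.direct.run(model, inputs);
    }
  }
}
```

Because both classes satisfy the same AIProvider interface, the dev/prod switch is one line at provider-registration time.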
Session Affinity and Prefix Caching
Transformer models benefit from KV caching: when the beginning of a prompt matches a previously computed sequence, the attention values for those tokens can be reused rather than recomputed. Cloudflare AI Gateway exposes this through session affinity.
When GatewayAIProvider detects that a request carries conversation history (more than one non-system message), it adds x-session-affinity: true to the outbound request. This header signals the gateway to route the request to the same backend replica that handled the previous turn, maximizing the probability of a cache hit on the shared system prompt and conversation prefix.
The savings scale with the ratio of cached tokens to new tokens. A conversation with a long system prompt and many previous turns gets substantial cache hits; a single-turn request with no history gets none. The header adds no latency when the cache is cold — it is purely additive.
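The detection heuristic is simple enough to sketch directly. The message shape and helper name here are illustrative; the "more than one non-system message" rule and the header name come from the prose above:

```typescript
type Msg = { role: "system" | "user" | "assistant" | "tool"; content: string };

// A request with more than one non-system message is mid-conversation, so
// routing it to the same replica can reuse the cached prefix.
function affinityHeaders(messages: Msg[]): Record<string, string> {
  const nonSystem = messages.filter((m) => m.role !== "system").length;
  return nonSystem > 1 ? { "x-session-affinity": "true" } : {};
}
```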
RAG Architecture
Retrieval-Augmented Generation (RAG) addresses a fundamental constraint of language models: their knowledge is frozen at training time and limited to what fits in the context window.
A RAG pipeline has two phases:
Ingestion. Source documents are split into chunks, each chunk is embedded into a vector, and the vectors are stored in a vector database (Vectorize). This happens offline or as documents are added to the system.
Retrieval. At query time, the user's question is embedded using the same model. The vector database returns the chunks whose embeddings are closest to the question embedding. Those chunks are inserted into the model's context window as retrieved evidence, and the model generates an answer grounded in that evidence.
RAGPipeline implements both phases. ingest() handles the write path; query() handles the read path. The pipeline does not include prompt construction — inserting retrieved chunks into the agent's instructions or prompt is the application's responsibility, which keeps the pipeline composable.
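The two phases can be illustrated end to end with a toy, self-contained pipeline. Everything here is a stand-in: a bag-of-words vector plays the role of the embedding model, an in-memory array plays the role of Vectorize, and the class is not RAGPipeline itself:

```typescript
type Vec = Map<string, number>;

// Toy "embedding": word-frequency vector. A real pipeline would call an
// embedding model here, with the same model used for both phases.
function embed(text: string): Vec {
  const v: Vec = new Map();
  for (const w of text.toLowerCase().split(/\W+/).filter(Boolean)) {
    v.set(w, (v.get(w) ?? 0) + 1);
  }
  return v;
}

function cosine(a: Vec, b: Vec): number {
  let dot = 0, na = 0, nb = 0;
  for (const [w, x] of a) { dot += x * (b.get(w) ?? 0); na += x * x; }
  for (const [, y] of b) nb += y * y;
  return na && nb ? dot / Math.sqrt(na * nb) : 0;
}

class ToyRAGPipeline {
  private store: { text: string; vec: Vec }[] = [];

  // Ingestion (write path): embed each chunk and store the vector.
  ingest(chunks: string[]): void {
    for (const text of chunks) this.store.push({ text, vec: embed(text) });
  }

  // Retrieval (read path): embed the question the same way and return the
  // nearest chunks. Prompt construction stays outside the pipeline.
  query(question: string, topK = 1): string[] {
    const q = embed(question);
    return [...this.store]
      .sort((a, b) => cosine(b.vec, q) - cosine(a.vec, q))
      .slice(0, topK)
      .map((c) => c.text);
  }
}
```

The separation matters: query() returns evidence, and the application decides how that evidence lands in the agent's instructions or prompt.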
Chunking Strategies
How a document is divided into chunks has a larger impact on retrieval quality than the choice of embedding model.
Token size. A chunk must be small enough to fit useful context in the vector's metadata (Vectorize stores metadata alongside vectors) but large enough that the embedding captures meaningful semantic content. The default chunk size of 400 estimated tokens is a reasonable starting point for most prose. Code and structured data may require smaller chunks.
Overlap. TextChunker repeats a fraction of the previous chunk's words at the start of the next chunk (10% by default). This ensures that sentences and ideas that straddle a chunk boundary are represented in at least one chunk, rather than being split across two chunks that each lack the full context.
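A minimal sketch of that overlap scheme, assuming a word-based sliding window (the function name and signature are illustrative, not TextChunker's actual API):

```typescript
// Slide a window of `size` words, advancing by size * (1 - overlap) each
// step so the tail of one chunk is repeated at the head of the next.
function chunkWords(text: string, size = 400, overlap = 0.1): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const step = Math.max(1, Math.floor(size * (1 - overlap)));
  const chunks: string[] = [];
  for (let i = 0; i < words.length; i += step) {
    chunks.push(words.slice(i, i + size).join(" "));
    if (i + size >= words.length) break; // final window reached the end
  }
  return chunks;
}
```

With the defaults, consecutive chunks share roughly 40 words, so a sentence cut at a boundary still appears whole in one of them.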
Semantic boundaries. TextChunker is purely mechanical — it splits on word count regardless of content structure. SemanticChunker first splits on Markdown headings and paragraph breaks, then applies TextChunker only to segments that still exceed the size limit. For documents with clear heading structure (documentation, articles, reports), SemanticChunker produces more coherent chunks because each chunk corresponds to a complete section rather than an arbitrary word window.
Choose TextChunker for unstructured prose, dense text, or content without heading structure. Choose SemanticChunker for Markdown documentation, articles, and any content where headings delineate distinct topics.
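The heading-first strategy can be sketched like this. The function is an illustration of the described behavior, not SemanticChunker's actual implementation, and the fallback split is deliberately simplistic:

```typescript
function wordCount(text: string): number {
  return text.split(/\s+/).filter(Boolean).length;
}

// Split on Markdown heading boundaries first; only segments that still
// exceed maxWords fall back to a mechanical word-window split.
function semanticChunks(markdown: string, maxWords = 400): string[] {
  const segments = markdown.split(/^(?=#{1,6} )/m).filter((s) => s.trim().length > 0);
  return segments.flatMap((segment) => {
    if (wordCount(segment) <= maxWords) return [segment.trim()];
    const words = segment.split(/\s+/).filter(Boolean);
    const out: string[] = [];
    for (let i = 0; i < words.length; i += maxWords) {
      out.push(words.slice(i, i + maxWords).join(" "));
    }
    return out;
  });
}
```

For a well-structured document, most chunks come out of the first branch, each one a complete section under a single heading.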