12-Factor Agents: Patterns of reliable LLM applications
Jul 3, 2025, Video: 12-Factor Agents: Patterns of reliable LLM applications — Dex Horthy, HumanLayer (YouTube)
Opening & Motivation
- An interactive poll of the audience (some have built 10, even 100+ agents) sets the stage for the shared experience that “we’re all building agents.”
- Personal story: Initially decided to build an agent, and to move fast, stacked up various libraries. Quickly reached 70-80% quality, enough to excite the CEO and expand the team.
- Hitting a wall: Trying to push from 70-80% to higher quality led to getting lost in library call stacks, making it hard to trace how prompts were constructed or tools were injected. The only way out was to start over, and sometimes, to realize “this isn’t a problem for an agent at all.”
Reflection Case Study: The Failed DevOps Agent
- Attempted to build a DevOps agent that could read a makefile and execute build commands.
- After continually refining the prompt, down to the “exact sequence of steps,” the realization hit: a 90-second bash script could do the job. Not all problems require an agent.
Field Research & Pattern Insights
- Interviewed over 100 founders and engineers to identify common practices in the industry.
- Conclusion 1: Most “production-grade agents” are not actually “agentic”; they behave more like regular software.
- Conclusion 2: People aren’t rewriting everything. Instead, they are introducing a set of unnamed, undefined, but effective small, modular concepts into their existing systems.
- These are “software engineering fundamentals” that don’t require an AI background to apply.
“12-Factor Agents”: Background & Positioning
- Distilled from frontline practices, these “12 factors for AI agents” were open-sourced as a document and code repository, receiving enthusiastic community feedback (rapid star growth, multiple contributors).
- This isn’t an “anti-framework talk” but rather a “requirements list” for frameworks: to better serve engineers who demand high reliability and rapid iteration speed.
Core Goal & Method
- An invitation to “set aside preconceptions” and rethink how to build “truly reliable” agents from first principles of software engineering.
- This talk won’t cover the 12 factors one by one but will weave together several key themes. Details can be reviewed after the session.
Factor 1 (The Essential Magic): Sentence → JSON
- The most magical capability of LLMs has nothing to do with code, loops, or tools. It’s the ability to robustly structure natural language into JSON.
- Subsequent factors address “what to do with the JSON,” but the “stable generation of JSON” itself can be immediately integrated into existing applications.
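A minimal sketch of this “sentence → JSON” step in Python. The schema, the `TicketRequest` shape, and the `call_llm` placeholder are illustrative assumptions, not from the talk; swap in whatever completion client and schema you actually use.

```python
# Factor 1 sketch: turn a natural-language request into validated JSON.
import json
from dataclasses import dataclass

SCHEMA_HINT = """Reply ONLY with JSON of the form:
{"intent": "create_ticket" | "unknown", "title": string, "priority": "low" | "high"}"""

@dataclass
class TicketRequest:
    intent: str
    title: str
    priority: str

def call_llm(prompt: str) -> str:
    # Placeholder: substitute your provider's chat/completions call here.
    return '{"intent": "create_ticket", "title": "Checkout page 500s", "priority": "high"}'

def parse_request(user_text: str) -> TicketRequest:
    raw = call_llm(f"{SCHEMA_HINT}\n\nUser said: {user_text}")
    data = json.loads(raw)          # fails loudly if the model drifted off-schema
    return TicketRequest(**data)    # unexpected fields also fail loudly

print(parse_request("the checkout page is throwing 500s, this is urgent"))
```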
Factor 4 (The Viewpoint): Demystifying “Tool Use”
- Drawing a parallel to the historical “goto is harmful” debate, the talk proposes that “the abstraction of tool use is harmful.” This doesn’t mean denying agents external capabilities, but opposing the mystical view of tools as “spirits manipulating the environment.”
- The reality is: LLM outputs JSON → it’s handed to deterministic code for execution → (optionally fed back). This is just “JSON + regular code + loops/branches”—nothing mysterious.
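A sketch of that reality in code: the model’s JSON is routed through an ordinary switch to deterministic functions. The tool names and handlers are illustrative assumptions.

```python
# Factor 4 sketch: "tool use" is just JSON dispatched to regular code.
import json

def deploy_frontend(tag: str) -> str:
    return f"frontend deployed at {tag}"

def deploy_backend(tag: str) -> str:
    return f"backend deployed at {tag}"

HANDLERS = {"deploy_frontend": deploy_frontend, "deploy_backend": deploy_backend}

def execute_step(model_output: str) -> str:
    step = json.loads(model_output)            # e.g. {"tool": ..., "args": {...}}
    tool = step["tool"]
    if tool == "done":
        return step.get("message", "done")
    handler = HANDLERS.get(tool)
    if handler is None:
        return f"unknown tool: {tool}"         # becomes context for the next turn
    return handler(**step["args"])             # deterministic code does the work

print(execute_step('{"tool": "deploy_backend", "args": {"tag": "v1.2.3"}}'))
```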
Factor 8 (Thematic Bundle): Own Your Control Flow
- Recalling DAG thinking in software (an if/else is also a graph), tools like Airflow/Prefect break down processes into nodes to gain reliability.
- The “naive agent loop” abstraction: event → prompt → generate “next action” → execute → put the result back into the context → feed it back to the model until “done.”
- The problem: This is unreliable for long processes, especially with ultra-long contexts. Even if APIs support huge contexts, a controlled, concise context is often more reliable.
- Decomposing the agent into 4 controllable parts:
- The prompt that selects the next step (the decider).
- The switch/loop execution that hands the model’s JSON result to deterministic code.
- The context window constructed for the model.
- The loop and exit conditions that drive the whole process.
- When you “own the control flow,” you can flexibly apply strategies like break/switch/summarize/introduce review (LM-as-judge).
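A compact sketch of an “owned” loop showing those four parts together. `call_llm`, the tool runner, and the 20-event trace window are assumptions for illustration only.

```python
# Owned control flow: decider prompt, switch, context construction, loop & exit conditions.
import json

def call_llm(prompt: str) -> str:
    # Placeholder for your model call; must return one JSON step.
    return '{"tool": "done", "message": "nothing to do"}'

def run_tool(step: dict) -> str:
    return f"ran {step['tool']} with {step.get('args', {})}"          # deterministic code

def build_context(goal: str, events: list[dict]) -> str:              # context construction
    trace = "\n".join(f"{e['type']}: {e['data']}" for e in events[-20:])  # keep it dense
    return f"Goal: {goal}\nTrace so far:\n{trace}\nReply with ONE JSON step."

def run_agent(goal: str, max_steps: int = 10) -> str:
    events: list[dict] = []
    for _ in range(max_steps):                                         # loop + exit conditions
        step = json.loads(call_llm(build_context(goal, events)))       # decider prompt
        if step["tool"] == "done":                                     # explicit exit
            return step["message"]
        result = run_tool(step)                                        # the switch/executor
        events.append({"type": "tool_result", "data": result})
    return "gave up: step budget exhausted"                            # you decide what happens

print(run_agent("deploy the latest release"))
```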
Distinguishing and Managing Execution State vs. Business State
- Frameworks often have execution state fields: current step, next step, retry count, etc.
- Business state is the “data shown to the user, message history, pending approvals,” etc.
- The goal: Support start/pause/resume just like a regular service.
- Practical advice: Expose the agent as a REST API or MCP service.
- For a normal request: load the corresponding context → feed it to the LLM.
- For a long-running tool: be able to interrupt the process and serialize the context to a database (because you already own the context).
- When a callback arrives: use a state_id to load the state from the database → append the tool result to the “program trace” → send it back to the LLM. The agent itself is “unaware” of what happened in the background.
- Conclusion: Agents are just software. To build high-quality agents, you must take ownership of the “inner loop.”
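A minimal sketch of the pause/resume flow described above: because you own the context, you can serialize the trace when a long-running tool starts and resume it later via a state_id. The in-memory dict stands in for a real database, and the webhook/callback wiring is assumed rather than shown.

```python
# Pause/resume by owning the serialized context.
import json, uuid

STORE: dict[str, str] = {}          # stand-in for a database table

def pause(events: list[dict]) -> str:
    state_id = str(uuid.uuid4())
    STORE[state_id] = json.dumps(events)        # serialize the whole trace
    return state_id                             # hand this to the slow tool / approval flow

def resume(state_id: str, tool_result: dict) -> list[dict]:
    events = json.loads(STORE.pop(state_id))    # load the trace back
    events.append({"type": "tool_result", "data": tool_result})
    return events                               # feed this straight back into the loop

sid = pause([{"type": "user", "data": "deploy backend v2"}])
print(resume(sid, {"status": "approved"}))
```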
Factor 2: Own Your Prompt (and its testability)
- Off-the-shelf templates can get you a “pretty good” prompt quickly, but to cross the quality threshold, you’ll eventually need to hand-craft every critical token.
- The view: An LLM is a “pure function” (tokens in → tokens out). Quality depends almost entirely on the tokens you feed in.
- The method: Systematically explore through trial and error, tweaking knobs, and running evaluations.
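One way to read “own your prompt”: the prompt is plain code you can version, diff, and test like any other string-building function. The wording and the tiny assertion-style check below are illustrative, not the speaker’s actual prompts.

```python
# Factor 2 sketch: a hand-owned prompt plus a cheap regression guard.
def deploy_decider_prompt(pr_title: str, trace: str) -> str:
    return (
        "You are the deploy coordinator.\n"
        f"Merged PR: {pr_title}\n"
        f"What has happened so far:\n{trace}\n"
        "Decide the single next step and reply with one JSON object."
    )

def test_prompt_mentions_trace():
    p = deploy_decider_prompt("fix checkout bug", "backend deployed")
    assert "backend deployed" in p and "JSON" in p   # guards against silent prompt regressions

test_prompt_mentions_trace()
```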
Ownership of Context Construction (Context Engineering)
- Don’t be constrained by a specific message format. You can compress “what has happened so far” into a single user message or put it in the system prompt, freely modeling and serializing your event stream/threading model.
- The key is density and clarity: every token must be used effectively.
- “Context engineering” in a broad sense = prompt + memory + RAG + history—everything you feed to the model.
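A sketch of owning context construction: collapse your own event/thread model into one dense block of text instead of replaying a long provider-format message list. The event shapes and the XML-ish tags are an illustrative convention, not a standard.

```python
# Serialize "what has happened so far" into a single, dense context block.
def render_context(events: list[dict]) -> str:
    lines = []
    for e in events:
        if e["type"] == "error":
            lines.append(f"<error>{e['summary']}</error>")            # summarized, never a full stack trace
        else:
            lines.append(f"<{e['type']}>{e['data']}</{e['type']}>")
    return "Everything that has happened so far:\n" + "\n".join(lines)

print(render_context([
    {"type": "user", "data": "deploy the new release"},
    {"type": "tool_result", "data": "backend deployed ok"},
    {"type": "error", "summary": "frontend deploy timed out after 60s"},
]))
```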
Failure Recovery and Error Context Handling
- When a model calls the wrong tool or a downstream API fails, don’t “blindly stuff the full error into the context” for a retry.
- In practice:
- When a subsequent tool call succeeds, clean up traces of pending errors.
- Errors should be summarized to avoid polluting the context with long stack traces.
- Be explicit about “what to tell the model” to best help it succeed next time.
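A small sketch of that error hygiene: keep a compact pending-error note instead of raw stack traces, and clear stale errors once a later call succeeds. Truncation stands in for a real summarization step; the event shape matches the context sketch above and is an assumption.

```python
# Error handling without polluting the context.
def record_error(events: list[dict], exc: Exception) -> None:
    events.append({"type": "error", "summary": str(exc)[:200]})   # no full stack trace

def record_success(events: list[dict], result: str) -> None:
    events[:] = [e for e in events if e["type"] != "error"]       # clean up stale errors
    events.append({"type": "tool_result", "data": result})

events: list[dict] = []
record_error(events, RuntimeError("502 from deploy API (attempt 1)"))
record_success(events, "backend deployed ok")
print(events)   # the earlier error no longer pollutes the next prompt
```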
Collaborating with Humans: Pre-empting the “Human vs. Tool” Fork at the Natural Language Layer
- Many systems require a choice between “calling a tool” and “replying to a human” in their very first output.
- Push this choice to the first natural language token (e.g., “I’m done,” “I need to clarify,” “I need a manager to intervene”). Models are better at expressing intent in natural language, which also simplifies subsequent routing.
- This creates a trace and protocol that includes human input and can be extended to outer-loop scenarios where the agent initiates contact (not detailed here).
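A sketch of putting that fork into the model’s first structured field: one schema whose `intent` covers finishing, asking a human, escalating, and calling tools, so routing becomes a plain switch. The intent names are illustrative assumptions.

```python
# Route on an intent-first output schema that includes humans, not just tools.
import json

def route(model_output: str) -> str:
    step = json.loads(model_output)
    intent = step["intent"]
    if intent == "done":
        return f"reply to user: {step['message']}"
    if intent == "request_clarification":
        return f"ask the human: {step['question']}"          # via Slack/email/SMS, etc.
    if intent == "request_approval":
        return f"page a manager about: {step['action']}"
    return f"run tool {intent} with {step.get('args', {})}"  # everything else is a tool call

print(route('{"intent": "request_clarification", "question": "Deploy frontend or backend first?"}'))
```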
Ubiquitous Triggers and Access
- Let users interact with the agent where they are: email, Slack, Discord, SMS, etc., instead of forcing them to open yet another chat tab.
Small, Focused “Micro-agents”
- A tale of two approaches: huge, long-looping agents are often unstable. What works in practice is a “deterministic DAG + short micro-agent rounds (3–10 steps).”
- Real-world example (HumanLayer’s internal deployment bot):
- Most of the CI/CD is deterministic code.
- When a PR is merged and tested in a dev environment, the “release” stage is handed to a model. It proposes “deploy the frontend first.” A human can correct this in natural language: “deploy the backend first,” which is then structured into the next JSON step.
- After the backend deployment is approved and completed, the agent returns to the frontend deployment.
- On success, it goes back to a deterministic process for end-to-end testing. On failure, it hands off to a “rollback agent.”
- The claim: because the context is controlled and responsibilities are clear, this pattern scales to agents with 100 tools and 20 steps.
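A sketch of the micro-agent pattern in the deploy-bot spirit: the pipeline stays deterministic, and only the “decide deploy order” branch is delegated to a short, bounded agent (stubbed here). Function names mirror the story above but are illustrative assumptions.

```python
# Deterministic DAG with one embedded micro-agent decision.
def run_tests() -> None:
    print("running deterministic CI tests")

def deploy(component: str) -> None:
    print(f"deploying {component}")

def deploy_order_microagent(changed_files: list[str]) -> list[str]:
    # In reality: a 3-10 step agent loop with human approval; here a canned decision.
    return ["backend", "frontend"] if any("api/" in f for f in changed_files) else ["frontend"]

def release_pipeline(changed_files: list[str]) -> None:
    run_tests()                                                  # deterministic
    for component in deploy_order_microagent(changed_files):     # the one agentic decision
        deploy(component)                                        # deterministic again
    run_tests()                                                  # deterministic end-to-end check

release_pipeline(["api/server.py", "web/app.tsx"])
```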
Outlook: A Gradual Evolution from “Sprinkling in Models” to “Full Agentification”
- Current state: “Sprinkling” small LLM capabilities into mostly deterministic systems.
- As model capabilities improve, LLMs can take on larger segments, until an entire API or pipeline is driven by an agent.
- Even so, engineering reliability will still be won through “fine-grained engineering of context and control flow.”
- Rule of thumb: Find tasks where the “model is just on the edge of being unstable,” and use engineering to make it stable. That’s how you create “magic that surpasses the competition.”
A View on State: The Agent Should Be “Stateless,” You Manage the State
- Using a “reducer/transducer” joke to emphasize: the agent itself should be stateless. Externalize the state so you have full control over it.
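Read literally, the reducer framing looks something like the sketch below: one pure function from (serialized thread, new event) to the next thread plus the next step, with all state living outside the function. `decide_next_step` is a placeholder for the prompt and model call.

```python
# The agent as a stateless reducer over an externally stored thread.
def decide_next_step(thread: list[dict]) -> dict:
    return {"tool": "done", "message": "ok"}        # placeholder for the LLM decision

def agent_reduce(thread: list[dict], event: dict) -> tuple[list[dict], dict]:
    new_thread = thread + [event]                   # no mutation, no hidden state
    return new_thread, decide_next_step(new_thread)

thread, step = agent_reduce([], {"type": "user", "data": "roll back the last deploy"})
print(step)
```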
Abstractions Are Still Evolving: Frameworks vs. Libraries
- Referencing the historical debate: do we strive for abstraction or allow for some repetition?
- A sense of direction: Instead of an opaque wrapper, provide a scaffoldable, fully self-contained “create 12-factor agent” style starter, similar to shadcn.
Key Takeaways
- Agents are software: Anyone who can write a switch/while loop can get started.
- LLMs are stateless pure functions: The key is to put the “right stuff” into the context to get a “stable output.”
- Own your state and control flow: This is the price of flexibility and reliability.
- Win in human-computer collaboration: Make agents that work effectively with people.
- Leave the hard parts to yourself: Tools should shield you from “non-AI tediousness,” letting you focus on the “real AI challenges” of prompts, flows, and token quality.
Epilogue & Current Work
- The speaker runs a small company where most work and ideas are open-sourced, but they are also building some “important but uninteresting” infrastructure to reduce toil.
- They are advancing the A2 Protocol, hoping to foster industry convergence on “how agents contact humans.”
- A personal passion for automation has led to building numerous agents for finding housing, internal business processes, etc.
- A call to action and thanks: Welcomes discussion after the talk to continue building things together.
Note: The following points are organized strictly according to the video’s narrative sequence. Timestamps are approximate, and the information is primarily sourced from the original video. Where general concepts or case studies are mentioned, external links are provided for verification and further reading.
[00:02–00:34] Opening & “The Shared Experience of Building Agents”
- Audience interaction: How many have built agents? Some have built 10+, even 100+.
- Introducing the theme: Many teams get “stuck” when their agent reaches 70–80% quality—the CEO is thrilled by the demo, the team expands, but further improvement becomes painful.
[00:34–01:08] The “Black Box” Complexity of Frameworks/Libraries
- The problem: To get past the 80% quality mark, you find yourself “reverse-engineering” the source of prompts and tool calls within deep call stacks, making control difficult.
- The common outcome: Starting over or admitting, “This isn’t the right problem for an agent.”
[01:08–01:28] The DevOps Agent Counterexample
- An attempt to build a DevOps agent that could “understand a Makefile and compile a project” kept getting the steps wrong. After adding more and more detail to the prompt, it essentially became “dictating every single step.”
- The takeaway: A 90-second bash script is more reliable for such tasks—not all problems need an agent.
[01:28–02:07] Observations from 100+ Frontline Builders
- Most “agents” in production are not very agentic. They are essentially regular software but incorporate a set of effective and reusable engineering patterns that significantly improve the reliability of LLM applications.
[02:07–02:39] Incremental Adoption, Not a Rewrite
- These patterns are typically small, clear, modular practices that can be directly embedded into existing code without a “big bang rewrite.”
- This isn’t something only those with a “specialized AI background” can do; it’s more about software engineering fundamentals.
[02:22–03:26] The Origin and Positioning of 12-Factor Agents
- Inspired by the spirit of Heroku’s “12-Factor App,” these 12 principles for building reliable LLM applications were compiled and open-sourced, receiving widespread community response (top of HN, rapid star growth). (12factor.net, humanlayer.dev, Hacker News)
- This is not an “anti-framework manifesto” but rather a wishlist for framework features: to better serve developers who demand both high reliability and high iteration speed.
[03:26–03:43] Methodological Stance
- An invitation to “forget preconceptions” and return to first principles: applying what works in software engineering to build “truly reliable” agents.
[03:43–04:23] Factor Example: Structured Output is the Magical Starting Point
- The “magic” of LLMs isn’t about loops, switches, or tools, but about reliably turning a sentence into structured JSON (which is then handled by other factors).
- Structured output / function calling is the engineered form of this paradigm. (OpenAI Platform, OpenAI Cookbook, OpenAI Help Center, Anthropic)
[04:23–05:01] Factor 4: Tool Use is “Harmful” (at a Semantic Level)
- This doesn’t deny the value of giving models access to the outside world. It points out that mystifying “tool use” obscures its true nature: LLM outputs JSON → deterministic code executes it → (optional) result is fed back.
- Therefore, tool use can be orchestrated with standard loops/branches without deifying the “tool.” (The “Considered harmful” phrasing follows the pattern of Dijkstra’s “Go To Statement Considered Harmful” essay title.) (Wikipedia, CWI Homepages)
[05:01–05:40] Own Your Control Flow: From DAGs to the Agent Loop
- Code is inherently a graph. DAG orchestration (Airflow, Prefect, etc.) achieves observability and reliability by breaking processes into nodes. (Apache Airflow, docs.prefect.io, prefect.io)
- The naive agent loop: LLM chooses the next step → executes it → merges the result into the context → continues until “done.”
[05:40–06:18] Limitations of the Naive Loop: Long Contexts & Stability
- “Stuffing everything into the context” often fails for long-running processes. Controlling context size and density is usually more reliable.
[06:18–07:13] Decomposing the Agent into 4 Parts
- Prompt (to choose the step), Switch (to hand the model’s JSON to deterministic code), Context Construction, and Loop & Exit Conditions.
- When you own the control flow, you can insert strategies like “break/summarize/LLM as judge.”
[07:13–08:06] Execution vs. Business State, Pause/Resume & Serialization
- Beyond execution state like “current step, retry count,” you must also manage business state (past messages, user-visible data, pending approvals).
- Best practice: Expose the agent as a standard API (REST) or MCP server. When a long-running tool is called, serialize the context to a database. A callback with a state ID can then resume execution seamlessly. (Model Context Protocol, Anthropic, The Verge)
[08:06–08:37] The Core Tenet
- Agents are fundamentally software: For flexibility and reliability, you must own the inner loop.
[08:37–09:33] Factor 2: Own Your Prompt (Testable and Tunable)
- Beyond a certain quality threshold, you’ll inevitably return to hand-crafting, composing, and evaluating every token in your prompt.
- An LLM is a “pure function”: Tokens in → Tokens out. Reliability hinges on “putting the right tokens in.”
[09:33–10:12] Own Your Context Construction
- Don’t be limited by OpenAI’s “messages” format. You can freely serialize your event/thread model into a single input that precisely tells the model “what has happened so far.”
[10:12–10:55] Error Handling ≠ Blindly Stacking Errors
- When a tool fails, feeding the “error + call” back to the model for a retry is a common but risky pattern that can lead to a “death spiral.”
- Own the context: After a success, clean up historical errors. Retain only condensed, critical signals to avoid polluting the context with stack traces/logs.
[10:55–11:34] Bring Humans into the Loop with a “Contact Human” Tool (HITL)
- Make a clear distinction at the very first step between “tool use vs. human interaction.” Pushing intents like “I need to clarify” or “I need manager approval” to natural language tokens improves the stability and interpretability of the decision.
- The engineering counterpart: Introduce approvals/communication via Slack, Email, SMS, etc. (humanlayer.dev, Y Combinator)
[11:34–12:25] Trigger from Anywhere, Meet Users Where They Are
- Let users interact with the agent directly in their existing workflows (email, Slack, SMS) instead of forcing them to open another “Chat” tab. (humanlayer.dev)
[12:25–13:32] Small, Focused Micro-agents (The Key Practice)
- The effective pattern: Keep the main process as a deterministic DAG and embed small, 3–10 step agent loops at critical branches.
- Real-world use case: HumanLayer’s internal deployment bot. Most of CI/CD is deterministic. After a merge, a model is tasked with “deciding the frontend/backend deployment order.” A human can interject to correct it (“backend first”), and this natural language becomes the next JSON step. The process then returns to a deterministic flow (E2E tests, rollback micro-agent, etc.).
[13:32–14:19] Gradually “Sprinkle In” as Models Evolve
- As the “boundary of reliability” for models expands, you can progressively migrate more steps from deterministic code to agents.
- Rule of thumb: Find a task that is “just within the model’s reach but unstable.” Use engineering (context, control flow, evaluation) to push it inside the stable boundary, creating a differentiated capability.
- Mentions NotebookLM as an example of good engineering at this boundary. (Google NotebookLM, blog.google, The Verge)
[14:19–15:01] State is Your Responsibility: The Agent as a “Stateless Transformer”
- The agent itself should remain stateless. Externalize state to make it observable and persistent, giving you maximum flexibility.
- The industry is still searching for the right abstractions; the old “framework vs. library” debate applies.
[15:01–15:18] The Scaffolding Philosophy: Take-and-Own like shadcn/ui
- Instead of a black-box wrapper, generate a scaffold that developers can take full ownership of (mentions the idea of a “create-12-factor-agent”).
[15:18–16:11] Five Key Takeaways
- Agents are software; LLMs are stateless pure functions—put the “right context” in to get a more reliable output.
- Own your state and control flow to serve flexibility and reliability.
- At the “edge of model capability,” use engineering to turn instability into stability.
- Human-computer collaboration often makes agents stronger.
- Frameworks should solve the “non-AI hard parts,” letting developers focus on prompts, control flow, and token quality.
[16:11–16:48] HumanLayer & A2 (Agent-to-Human) and Closing
- His company focuses on the hard problems of “human-agent collaboration” (approvals, async communication, reachability). Mentions work on an A2 protocol to standardize “how agents contact people.” See their products and docs for more. (humanlayer.dev, GitHub)
Terminology and Case Study Extensions (External Links)
- 12-Factor Agents Collection (HumanLayer): The list of principles and examples. (GitHub, humanlayer.dev)
- 12-Factor App (Original Heroku Methodology): The spiritual predecessor to this methodology. (12factor.net)
- “Considered Harmful” Title Pattern & Dijkstra’s 1968 Paper: Understanding the origin of the “Tool use is harmful” phrasing. (Wikipedia, CWI Homepages)
- DAG Orchestration References (Airflow / Prefect): Traditional abstractions for observable and recoverable workflows. (Apache Airflow, docs.prefect.io)
- Structured Output/Function Calling (OpenAI / Anthropic / Azure Docs): The engineering pattern of LLM outputs JSON → you execute → (optional) feed back. (OpenAI Platform, OpenAI Cookbook, Microsoft Learn, Anthropic)
- MCP (Model Context Protocol): An open protocol for connecting agents to tools and data in a server/client model. (Model Context Protocol, Anthropic, The Verge)
- The Reality and Limits of Long Context (Gemini 1.5/2.x 2M tokens): Official announcements and developer docs, providing context that “ultra-long context is not a silver bullet.” (Google Developers Blog, Google AI for Developers, Google Cloud Storage)
- Human-in-the-Loop (HumanLayer Product Page/Docs/YC Profile): Turning “contact/approval/receipt” into an orchestratable engineering capability. (humanlayer.dev, Y Combinator)