Advanced Context Engineering for Agents

Aug 25, 2025, Video: Advanced Context Engineering for Agents (YouTube)


  • Opening & Background

    • Speaker: Dex (Founder of Human Layer, YC W24).
    • Origin of the term “Context Engineering”: Stemmed from the “12-Factor Agents / Principles of Reliable LLM Applications” mini-manifesto, published April 22, 2025; on June 4, 2025, the title of a talk was changed to “Context Engineering.”
    • Topic: How to use “context engineering” to push the output of coding agents to their maximum usable limit with current model capabilities.
  • Industry Observations & Two Reference Talks

    • Recommendation 1: Sean Grove – “The New Code”
      • Viewpoint: We are doing “vibe coding” (chat-based programming) the wrong way. If you spend two hours chatting with an agent to get code and then discard all the prompts, keeping only the compiled artifact, it’s like committing the .jar file but throwing away the source code.
      • Insight: In a future where “AI writes more code,” the specifications (specs) are the most critical “source code.”
    • Recommendation 2: Stanford Research (100,000 developers, from large corps to startups)
      • Finding: AI programming leads to a significant amount of rework. In complex/legacy (brownfield) scenarios, it can even be counterproductive and slow down progress.
      • Real-world consensus: Good for prototyping, not for large/complex/legacy codebases (at least for now).
  • Personal Shift in Practice: Forced to Adopt “Spec-first Development”

    • Background: Collaborating with “one of the best AI coders,” who would submit 20,000-line Go PRs every few days (including complex system logic like concurrency, shutdown sequences), making line-by-line review nearly impossible.
    • The change: Switched to writing specs first (he still reviews all tests, but no longer reviews the implementation line-by-line).
    • Results: The transition took about 8 weeks, after which productivity skyrocketed. The speaker himself merged 6 PRs in a single day and has barely opened a non-Markdown file in almost two months.
  • Goals (Set by “Forced Reality”)

    • Work effectively in large, complex codebases.
    • Solve complex problems.
    • Zero slop, straight to production quality.
    • Mental alignment across the team.
    • Spend tokens boldly (to maximize context quality).
  • The Naive Approach: “Shouting” Back and Forth with an Agent

    • Symptoms: The agent goes off track, the context is exhausted, constant corrections are needed.
    • A crude improvement: Promptly “restart with a clean context” and add constraints like “don’t try X again.” When you see clear signs of deviation, it’s time to restart.
  • Intentional Compaction

    • The idea: Instead of a plain /compact, write key information into a “progress file” to serve as a “starter pack” for the next round/next agent (see the sketch after this list).
    • What needs compaction:
      • Information lookup/location, understanding system flow, specific edit/operation steps.
      • If an MCP tool returns a giant JSON, it will consume a massive amount of context—it needs to be distilled.
    • Purpose: To transform “unstructured, verbose, noisy” process information into a structured, portable, low-volume “snapshot of intent and current state.”
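
To make the progress-file idea above concrete, here is a minimal sketch; the filename, section headings, and example content are illustrative assumptions, not taken from the talk:

```python
from pathlib import Path

def write_progress_file(path: str, goal: str, state: str, key_files: list[str],
                        next_steps: list[str], do_not_retry: list[str]) -> None:
    """Distill a noisy session into a compact 'starter pack' for the next agent.

    The point is a structured, low-volume snapshot of intent and current state,
    not a replay of the whole transcript.
    """
    doc = [
        "# Progress",
        "## Goal\n" + goal,
        "## Current state\n" + state,
        "## Key files\n" + "\n".join(f"- {f}" for f in key_files),
        "## Next steps\n" + "\n".join(f"- {s}" for s in next_steps),
        "## Do not retry\n" + "\n".join(f"- {d}" for d in do_not_retry),
    ]
    Path(path).write_text("\n\n".join(doc), encoding="utf-8")

# Hypothetical handoff: the next session reads this file instead of the old chat.
write_progress_file(
    "progress.md",
    goal="Fix the shutdown race in the worker pool",
    state="Root cause narrowed to the drain path; draft fix compiles, tests not run yet",
    key_files=["internal/pool/pool.go:142", "internal/pool/pool_test.go"],
    next_steps=["Run the race detector", "Add a regression test for double-close"],
    do_not_retry=["Wrapping Stop() in a mutex (deadlocks the drain path)"],
)
```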
  • Why Does Everything Revolve Around Context?

    • Viewpoint: LLMs are more like pure functions (aside from a few parameters like temperature); the output quality depends almost exclusively on the input context quality (see the sketch after this list).
    • Four elements to optimize: Correctness, Completeness, Volume, and Trajectory.
    • The hierarchy of badness: incorrect information is worse than missing information, which is worse than excess noise.
    • Rule of thumb (from Geoffrey Huntley / Sourcegraph Amp): With a usable window of about 170k tokens, the smaller the share burned on mechanical reading/searching, and the more left for decision-making and implementation, the better the results.
    • Example: Ralph Wiggum as a Software Engineer—the reason “running the same prompt in a loop overnight” works is because the context and loop strategy are well-designed.
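
As a minimal sketch of the “pure function” view (the types below are illustrative, not an actual agent API): each step maps the current context to the next action, so the only levers are the correctness, completeness, volume, and trajectory of what goes into that context.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    tool: str       # e.g. "read_file", "edit", "run_tests", "done"
    argument: str

# The model viewed as a (nearly) pure function: same context in, same action out.
LLM = Callable[[str], Action]

def agent_loop(llm: LLM, context: str) -> str:
    """Each step is fully determined by the context string; there is no hidden
    state to repair later, which is why context quality is the whole game."""
    while True:
        action = llm(context)
        if action.tool == "done":
            return context
        # Whatever a tool call produces gets appended here verbatim, which is exactly
        # how noisy output (giant JSON, dead-end searches) degrades every later step.
        context += f"\n[{action.tool}] {action.argument}"
```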
  • Sub-agents for “Inline Compaction”

    • Common use case: Have a sub-agent focus on search and location (e.g., “where does X happen / how does data flow across components”), offloading the high-cost search/read tasks from the parent agent.
    • Ideal return: A structured list (files, line numbers, paths, cause/flow summary).
    • The challenge: The “telephone game” problem—how to make the parent model accurately convey the “required return format” to the child model to avoid distortion and confusion.
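
A hedged sketch of the dispatch side: the parent spells out the exact return shape inside the sub-agent’s prompt so only a compact, structured list comes back into its window. The prompt wording and JSON shape here are assumptions; actually dispatching the prompt is left to whatever sub-agent/Task tool is in use.

```python
RETURN_FORMAT = (
    "Return ONLY a JSON array, no prose. Each element must be:\n"
    '{"file": "<repo-relative path>", "line": <int>, "why": "<one sentence on its role>"}'
)

def build_search_prompt(question: str) -> str:
    # The sub-agent absorbs the expensive searching and reading; the parent never
    # sees the raw files, only the distilled locations in the agreed format.
    return (
        "You are a read-only research sub-agent. Find where the following happens "
        "and how the data flows across components:\n"
        f"{question}\n\n{RETURN_FORMAT}"
    )

print(build_search_prompt("Where is the user session invalidated on logout?"))
```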
  • Going Further: Frequent and Intentional Compaction (FIC)

    • Goal: Keep context utilization below 40% long-term.
    • Three-stage workflow: Research → Plan → Implement (sketched after this list)
      • Research:
        • Output: A description of “how the system works and where the problem is,” with filenames + line numbers, allowing subsequent steps to perform targeted reading instead of “searching the whole repo.”
        • There are public research prompts and research output structures.
      • Plan:
        • Output: Every change to be made (down to the file/snippet level) + a step-by-step testing and validation plan.
        • The plan should be much shorter than the final code changes but fully cover the intent and validation path.
      • Implement:
        • Update the plan as you implement (checking off completed items, adding new context), continuously maintaining < 40% context usage.
        • There is a public implement prompt.
    • Human in the loop: Mandatory manual review of the “research and plan”. It’s much easier to cut off a wrong trajectory early by reviewing a plan than by reviewing 2000 lines of code.
    • Linear workflow: Progress through Research → Plan → Implement, with each step being verifiable and resettable.
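
The shape of the workflow can be sketched as below; `run_phase`, `human_approves`, `start_fresh_session`, and `utilization` are hypothetical placeholders for whatever agent tooling is in use, and only the phase order, the review gates, and the 40% red line come from the talk.

```python
CONTEXT_RED_LINE = 0.40  # the talk's target: keep context utilization under ~40%

def fic_workflow(task, run_phase, human_approves, start_fresh_session, utilization):
    """Schematic of the Research -> Plan -> Implement loop with its two guardrails.

    run_phase(prompt, *docs) returns a markdown artifact; human_approves,
    start_fresh_session, and utilization are likewise placeholders for whatever
    agent tooling is in use. Only the shape of the loop comes from the talk.
    """
    research = run_phase("research prompt", task)        # how the system works: filenames + line numbers
    if not human_approves(research):                     # cheapest place to stop a bad trajectory
        return None
    plan = run_phase("plan prompt", task, research)      # every change to make + how to validate it
    if not human_approves(plan):
        return None
    while "- [ ]" in plan:                               # unchecked steps remain in the plan
        plan = run_phase("implement prompt", plan)       # write code, check items off, add context
        if utilization() > CONTEXT_RED_LINE:
            start_fresh_session()                        # the updated plan is the only state carried over
    return plan
```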
  • The True Role of Code Review:

    • It’s not just about finding bugs, but about the team’s “mental alignment,” ensuring everyone understands why the system is changing this way.
    • The speaker can’t review 2000 lines of Go every day, but he can definitely review a 200-line implementation plan and course-correct at a much higher level.
  • External Validation & Case Studies

    • BAML Case Study (with YC founder Vaibhav)
      • Challenge: A one-shot fix in a 300k-line Rust codebase.
      • Result: The PR was merged on the spot by the CTO (it was already merged while recording the podcast), validating that this method works in legacy codebases and is “no slop.”
    • Collaboration with a Boundary Language/Compiler CEO
      • Delivered 35,000 lines of changes in 7 hours (partially generated, partially handwritten), equivalent to 1–2 weeks of work.
      • Achieved a major goal: adding WASM support to a language.
    • Key Insight (The Magnification of Errors)
      • One bad line of code is just one bad line.
      • One bad point in a plan can lead to hundreds of lines of bad code.
      • One misunderstanding in the research phase can lead to thousands of lines of bad code.
    • Therefore, effort must be front-loaded: define the right problem + make the agent truly understand how the system works.
    • Process Governance: Strict management of CLAUDE.md and slash commands, which were polished for weeks before changes were allowed.
  • Achievements & Data

    • All stated goals were met (works in large repos/complex problems, production quality, team alignment).
    • “Spend tokens boldly”: A small team burned through a lot of credits in a month, and it was worth every penny (in saved engineering time).
    • The Newcomer Effect: An intern, Sam, delivered 2 PRs on day one and 10 PRs on a single day by day eight—the workflow is replicable and scalable.
  • Outlook

    • Coding agents will likely become commoditized. The real difficulty is the re-engineering of teams and processes (uncomfortable but necessary).
    • We are working with organizations from 6-person YC teams to 1000-person public companies to help them implement this transition.
    • Announcement: A “hyper-engineering” event will be held tomorrow (spots are almost full, but you can self-nominate to try and get a seat); a 90-minute long video will provide more details.
  • Methodology Checklist (Actionable Takeaways)

    • General Principle: Everything is “context engineering.”
    • Strategy Mix:
      • Restart the context promptly (restart as soon as it goes off track).
      • Use a progress file for intentional compaction instead of blind compaction.
      • Use Sub-agents to outsource “heavy reading/retrieval,” so the parent agent only does decision-making/implementation.
      • Follow the FIC workflow (Research/Plan/Implement) + a red line of context < 40%.
      • Human-review the research and plan to cut off errors at the highest leverage point.
      • Measure token usage structure: Minimize tokens spent on “mechanical reading/searching” and reserve the window for “effective instructions/structured knowledge.”
  • One-Sentence Summary

    • The most effective “AI coding” today isn’t about smarter agents, but about smarter context: ensuring the model always receives correct, complete, compact, and actionable input, and embedding team consensus into reviewable artifacts like the “research and plan.”

  • [00:00–00:42] Opening & Terminology Origin

    • Speaker: Dex (Founder of Human Layer, YC W24 batch). Explains the origin of the term “Context Engineering”: wrote the “12-Factor Agents” manifesto on April 22, and on June 4, changed a talk title to “Context Engineering.” (YouTube, GitHub)
  • [00:43–01:31] “What’s Next” & Two Reference Talks

    • Recommends two of his favorite talks of the year, which also have more views than “12-Factor Agents”:
      • Sean Grove – The New Code: Emphasizes that the “spec is the core artifact,” and we shouldn’t treat temporary prompts from conversational “vibe coding” as the output. (AI Native Dev, Reddit)
      • Stanford Research (100k Developers): Large-scale, real-world data shows that AI programming gains are accompanied by significant rework, and in complex/brownfield scenarios, it can even be counterproductive. (YouTube, Class Central)
  • [01:31–03:44] Background, Reality & Goals

    • The consensus among many teams: AI coding is great for prototypes but not well-suited for large repositories/complex systems. For example, Replit’s view is that product managers can use AI for rapid prototyping, but engineers will ultimately rewrite it for production. (As relayed in the video) (YouTube)
    • Dex’s collaboration with a top AI programmer: Frequently received 20,000-line Go PRs, making line-by-line review impossible. Was forced to adopt spec-first. After an ~8-week transition, he now only reviews tests and specs, not every line of code. Productivity soared; Dex mentions merging 6 PRs in one day last Thursday and has barely opened a non-Markdown file in nearly two months. (Personal experience) (YouTube)
    • Defines five goals: must work in large repos, solve complex problems, have no slop, be production-ready, and ensure full team alignment; plus, spend tokens liberally (this is “advanced” context engineering). (YouTube)
  • [04:05–05:25] The Naive & “Restart” Approach

    • Naive usage: Repeatedly talking to an agent until the context is exhausted or you give up.
    • Improvement one: When you notice it’s going off track, reset the context and start over (“try again, but don’t do X”).
    • When to reset: As soon as you see clear signs of it “getting lost/tangling itself up.” (Heuristic) (YouTube)
  • [05:04–06:18] “Intentional Compaction” & The Progress File

    • Instead of using a black-box command like “/compact,” explicitly write key progress to a progress file. Use this file to guide the next session or a sub-agent.
    • Pay attention to context window consumption: Finding files, understanding workflows, edit/work traces, and large JSON returns from MCP tools can drown the context, so they need to be selectively included. (YouTube)
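
A minimal sketch of distilling a verbose MCP tool result before it enters the window (the field names and summary format are illustrative assumptions):

```python
import json

def distill_tool_result(raw_json: str, keep: tuple[str, ...] = ("id", "status", "error")) -> str:
    """Shrink a verbose MCP tool response before it enters the context window.

    Only the handful of values the next decision actually needs should survive,
    not the full payload.
    """
    data = json.loads(raw_json)
    items = data if isinstance(data, list) else [data]
    summary = [{k: item.get(k) for k in keep if k in item} for item in items]
    return f"{len(items)} result(s); distilled: {json.dumps(summary)}"

# A multi-kilobyte tool reply becomes a one-line note the agent can actually use.
print(distill_tool_result('[{"id": 7, "status": "failed", "error": "timeout", "logs": "..."}]'))
```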
  • [06:01–06:54] The Reason: LLMs are “Pure Functions,” and Everything is Context Engineering

    • Besides the model/temperature, the result is almost entirely determined by the quality of the input context. A coding agent is in a constant loop of “picking a tool/making an edit,” and the context determines if the next step is correct.
    • Goal: Optimize the context across four dimensions: correctness, completeness, volume, and trajectory. The worst is incorrect information, followed by missing information, and then noise. (YouTube)
  • [06:54–07:35] Token Budget Intuition & The “Ralph Wiggum” Method

    • Rule of thumb: You have ~170k usable tokens to work with. The less of that budget goes to overhead like searching, reading, and finding (outsource it to sub-agents), the higher the quality of the actual work.
    • The “Ralph Wiggum as a software engineer” method proposed by Geoffrey Huntley: running the same prompt in a loop all night can yield surprisingly good results. If you understand the context window, this is a smart approach. (Geoffrey Huntley)
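
A sketch of that loop, assuming a placeholder `agent-cli` command rather than any real tool; each iteration is a fresh session started from the same carefully written prompt file:

```python
import subprocess
import time

PROMPT_FILE = "PROMPT.md"   # one fixed prompt, carefully written once
AGENT_CMD = ["agent-cli", "--prompt-file", PROMPT_FILE]  # placeholder, not a real CLI

def ralph_wiggum_loop(hours: float = 8.0) -> None:
    """Run the same prompt over and over overnight.

    This only works because each run starts from a fresh, well-designed context
    (the prompt file plus whatever state lives in the repo), not an ever-growing chat.
    """
    deadline = time.time() + hours * 3600
    while time.time() < deadline:
        subprocess.run(AGENT_CMD, check=False)   # each iteration is a clean session
```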
  • [07:35–08:44] “Inline Compaction” with Sub-agents

    • Common task: Locating information or tracing cross-component information flow. The parent agent uses a tool to pass instructions to a sub-agent for the search. The sub-agent returns a refined result (like file and line numbers), which the parent agent uses to proceed without stuffing large chunks of text back into the window.
    • The challenge: The telephone game. You must specify the return format/granularity in the parent-to-child prompt, or it becomes unstable. (YouTube)
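
One way to keep the telephone game in check is to validate the child’s reply against the agreed shape before it reaches the parent’s context; the schema below is an illustrative assumption:

```python
import json
from dataclasses import dataclass

@dataclass
class Location:
    file: str
    line: int
    why: str

def parse_subagent_reply(reply: str) -> list[Location]:
    """Refuse anything that is not in the agreed shape.

    If the child drifted into prose, fail fast and re-ask instead of letting
    unstructured text leak back into the parent's window.
    """
    data = json.loads(reply)  # raises on prose or malformed output
    return [Location(str(d["file"]), int(d["line"]), str(d["why"])) for d in data]

# A well-formed reply the parent can safely keep in context:
reply = '[{"file": "internal/auth/session.go", "line": 88, "why": "session token revoked here"}]'
print(parse_subagent_reply(reply))
```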
  • [08:44–10:22] The Most Effective Method: Frequent, Active Context Management

    • Goal: Context utilization < 40%.
    • Three-stage workflow: Research → Plan → Implement.
      • Research: Produces a research document (with filenames + line numbers that tell subsequent agents where to look).
      • Plan: Lists every change to be made (file/snippet) and defines testing and validation steps. The plan is usually much shorter than the actual code changes.
      • Implement: Write code according to the plan. For long processes, continuously update the plan (marking what’s done, what’s next) to keep the new context clean and orderly.
    • All three sets of prompts are open-sourced (in CLAUDE.md and related /slash commands). (YouTube)
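
A small sketch of treating the plan document itself as the carrier of state during implementation (the checkbox convention and example items are illustrative, loosely echoing the WASM case):

```python
def check_off(plan_markdown: str, step: str) -> str:
    """Mark one plan step as done ('- [ ]' -> '- [x]').

    The plan file, not the chat history, carries the state: a fresh session can
    resume from the first unchecked item with a nearly empty context.
    """
    return plan_markdown.replace(f"- [ ] {step}", f"- [x] {step}", 1)

plan = """## Changes
- [x] Add the wasm32 target to the build matrix
- [ ] Gate filesystem calls behind a feature flag
## Validation
- [ ] Run the test suite against the wasm32 target
"""
print(check_off(plan, "Gate filesystem calls behind a feature flag"))
```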
  • [10:22–10:48] Necessary Human Review & The Linear Process

    • These methods are not magic; they require careful reading and review. Compared to reviewing a 2000-line PR, reviewing a 200-line implementation plan makes it much easier to spot problems early and maintain team alignment. (YouTube)
  • [10:48–12:01] Case Study 1 (BAML/Boundary Ecosystem)

    • A “one-shot” challenge during a podcast: land a fix in a 300,000-line Rust repository in a single attempt. The CTO mistook it for a regular contribution and merged it directly, validating that the method is viable in legacy codebases and produces no slop. (YouTube)
    • Paired with the Boundary CEO for 7 hours and delivered 35,000 lines of code (partially generated, partially handwritten), equivalent to 1–2 weeks of work. This added WASM support to a programming language, proving it can solve complex problems. (YouTube)
  • [12:01–12:56] Key Insight: The “Hierarchical Magnification” of Errors

    • One bad line of code is just “one bad line of code.”
    • One bad snippet in a plan can mushroom into hundreds of lines of bad code.
    • One bad conclusion or misunderstanding in the research can lead to thousands of lines of bad code.
    • Therefore, more effort should be front-loaded: define the right problem → make the agent understand the system → then start writing. The team strictly controls changes to CLAUDE.md and /slash commands and only reviews the research and plan to ensure cognitive alignment. (YouTube)
  • [12:56–13:40] Results Review

    • All goals were met: supports large repos/complex problems, production-grade quality, and team alignment. It did cost a lot in tokens/credits, but this led to significant savings in engineer time. Newcomers ramp up quickly (an intern merged 2 PRs on day one, and 10/day by day 8). The speaker himself mostly just reviews specs, not code. (YouTube)
  • [13:40–14:20] Outlook & Resources

    • Coding agents will be commoditized. The real difficulty is the transformation of teams and workflows (changes in communication and collaboration structures). Human Layer is working with teams from 6-person YC startups to 1000-person public companies to drive this.
    • Announces a “HyperEngineering” in-person event tomorrow (almost full) and provides a link to a longer video/resources (QR code shown on screen). (YouTube, Luma)
  • 12-Factor Agents original repository and philosophy (Dex/Human Layer): Context engineering is a collection of engineering practices to make LLM applications more reliable/maintainable. (GitHub)
  • The New Code (Spec-Driven Development): The trend and key points of treating the spec as a first-class artifact. (AI Native Dev, Reddit)
  • Stanford “100k Developers” Research Talk: In complex/legacy scenarios, AI can lead to rework/slowdowns; gains are more apparent in simple/greenfield tasks. (YouTube, Class Central)
  • The “Ralph Wiggum” Method (Geoffrey Huntley): Once you understand the context window, running the same prompt in a long loop is also an effective engineering strategy. (Geoffrey Huntley)
Tags: AI-Agent