Does AI Actually Boost Developer Productivity?

Jul 23, 2025, Video: Does AI Actually Boost Developer Productivity? (100k Devs Study) — Yegor Denisov-Blanch (Stanford) (YouTube)


  • Opening Background & Viewpoint

    • Mark Zuckerberg claimed at the beginning of the year that AI would replace all of Meta’s mid-level engineers by year-end; the speaker finds this overly optimistic, likely motivated by boosting morale and stock prices.
    • This statement put pressure on CTOs worldwide: many CEOs are now asking their CTOs, “How far along are we?”, while the honest answer is often, “Not very far, and we’re not sure it’s even possible.”
    • The speaker’s judgment: AI won’t completely replace developers in the short term (at least not this year). While AI can certainly boost productivity, it can also decrease it in some situations; it’s not a one-size-fits-all solution.
  • Research Project & Data Overview

    • For the past three years, a Stanford team has been conducting a large-scale study on software engineering productivity, using both time-series and cross-sectional analysis.
    • Time-series: Even for participants joining in 2025, their Git history is integrated to observe long-term trends like the impact of the pandemic and AI.
    • Cross-sectional: The study covers 600+ companies (large enterprises, mid-sized firms, startups), with data from 100k+ software engineers, tens of millions of commits, and billions of lines of code.
    • Data is primarily from private repositories: Compared to public ones, private repos have clearer work boundaries, making them more suitable for measuring team/organizational productivity.
  • “Ghost Engineer” Phenomenon & Team Background

    • Late last year, the team published a controversial finding on “ghost engineers”: about 10% of engineers produce almost nothing (the analysis at the time covered roughly 50,000 engineers), sparking widespread discussion (including a retweet by Elon Musk).
    • Core members: Simon from industry (former CTO of a unicorn who managed ~700 developers and was frustrated at always being the last to know about team problems); the speaker, who has worked on “data-driven software engineering decisions” at Stanford since 2022; and a professor who studies human behavior in digital environments (known for exposing the Cambridge Analytica scandal).
  • This Talk’s Outline

    • Highlight the limitations of existing research on quantifying AI’s boost to developer productivity.
    • Introduce the team’s methodology.
    • Present the main results and how to interpret them through “slicing.”
  • Three Major Limitations of Existing Research

    • Looking only at commit/PR/task counts: Task granularity varies greatly, and “more commits” doesn’t equal “higher productivity.” AI also creates follow-up repair tasks, causing teams to “spin their wheels.”
    • Experimental control studies often use “zero-context greenfield tasks”: AI excels at templated greenfield code, easily outperforming the control group. However, reality is mostly “brownfield” projects with complex dependencies and existing code, making such conclusions poorly generalizable.
    • Surveys have weak predictive power: A comparison of self-assessments and actual measurements for 43 developers found that self-perception of productivity is almost a “coin toss.” The average misjudgment was about 30 percentile points, and only about a third placed themselves in the correct quartile (barely better than the 25% expected by chance). Surveys can measure subjective feelings like morale, but they are not suitable for quantifying productivity or AI’s impact.
  • Methodology: From Expert Review to a Scalable Model

    • Ideal assessment: Have 10–15 experts independently score code quality, maintainability, output, implementation time, etc. The aggregated results can “predict reality” with high inter-expert consistency.
    • The problem: Manual review is slow, expensive, and hard to scale.
    • The team’s approach: Build an automated model that connects directly to Git, analyzes source code changes commit by commit, and quantifies “functional output” across multiple dimensions. It defines “team productivity” as “features delivered over time,” not lines of code or commit counts, and visualizes changes over time.
  • Case Study Results: Output Composition After Introducing AI

    • A team of ~120 people piloted AI in September. “Functional output” was tracked monthly and categorized into four types:

      • New Features (Green)
      • Deletions (Gray)
      • Refactoring (Blue)
      • Rework (Orange)
    • Both “rework” and “refactoring” modify existing code, but rework primarily touches recently written code and is generally more wasteful; refactoring isn’t necessarily wasteful (one possible heuristic for telling them apart is sketched after this list).

    • Observation: After introducing AI, “rework” increased significantly. The rise in code volume and commits created an illusion of being “busier and more productive,” but not all of it was useful.

    • Overall assessment: Gross productivity (in terms of code volume delivered) increased by about 30–40%, but subsequent bug fixes and rework offset some of this gain. The net effect averages around a 15–20% increase (across all industries).
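
The talk doesn’t spell out how the model separates rework from refactoring beyond “rework mostly touches recent code.” As a purely illustrative aside, the sketch below shows one plausible heuristic under that definition, assuming a local Git checkout: blame the lines a commit modifies against the parent revision and treat anything younger than an arbitrary 21-day cutoff as rework. The function names and the threshold are hypothetical, not the Stanford model.

```python
# Illustrative heuristic only (assumed, not the Stanford model): a commit's
# modified lines count as "rework" if the code they replace is younger than a
# cutoff (21 days here, arbitrary); otherwise they count as "refactoring".
import re
import subprocess
from collections import Counter

RECENT_SECONDS = 21 * 24 * 3600  # assumed cutoff for "recent code"

def git(repo, *args):
    return subprocess.run(["git", *args], cwd=repo, check=True,
                          capture_output=True, text=True).stdout

def blame_times(repo, rev, path):
    """Map line number -> author timestamp for `path` at revision `rev`."""
    times, lineno = {}, None
    for line in git(repo, "blame", "--line-porcelain", rev, "--", path).splitlines():
        header = re.match(r"^[0-9a-f]{40} \d+ (\d+)", line)
        if header:
            lineno = int(header.group(1))
        elif line.startswith("author-time ") and lineno is not None:
            times[lineno] = int(line.split()[1])
    return times

def classify_commit(repo, sha):
    """Count the lines `sha` modifies as rework vs refactoring."""
    commit_time = int(git(repo, "show", "-s", "--format=%at", sha).strip())
    diff = git(repo, "show", "--unified=0", "--pretty=format:", sha)
    counts, blame = Counter(), {}
    for line in diff.splitlines():
        if line.startswith("--- a/"):        # file existed before this commit
            blame = blame_times(repo, sha + "^", line[len("--- a/"):])
        elif line.startswith("--- "):        # e.g. /dev/null for brand-new files
            blame = {}
        elif line.startswith("@@"):
            # hunk header "@@ -start,count +start,count @@"; count defaults to 1
            m = re.match(r"@@ -(\d+)(?:,(\d+))?", line)
            start, count = int(m.group(1)), int(m.group(2) or 1)
            for n in range(start, start + count):      # removed/modified lines
                age = commit_time - blame.get(n, 0)
                counts["rework" if age <= RECENT_SECONDS else "refactoring"] += 1
    return counts

if __name__ == "__main__":
    head = git(".", "rev-parse", "HEAD").strip()
    print(head[:10], classify_commit(".", head))
```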

  • Task Complexity × Project Type: Distribution and Range

    • A violin plot shows the “productivity gain distribution” (Y-axis starts from -20%), distinguishing between:

      • Low-complexity vs. High-complexity tasks (Blue vs. Red)
      • Greenfield vs. Brownfield projects (Left plot vs. Right plot)
    • Conclusion 1: AI provides greater benefits for “low-complexity tasks” (supported by data).

    • Conclusion 2: In an enterprise context, the gain distribution for “low-complexity × greenfield” is higher and has a longer tail. For personal or hobby projects starting from scratch, the improvement could be even greater.

    • Conclusion 3: For “high-complexity tasks,” the average benefit is lower, and there’s a higher chance of “negative returns” (decreased productivity). The reasons for this are not yet fully clear.

    • Presented as bars + interquartile range (IQR) lines: Low-complexity gains > High-complexity; Greenfield gains > Brownfield.

  • “Quadrant Snapshot” for Management (Sample: 27 companies, 136 teams)

    • Low-complexity × Greenfield: ≈ 30–40% gain.
    • High-complexity × Greenfield: ≈ 10–15% gain.
    • Low-complexity × Brownfield: ≈ 15–20% gain.
    • High-complexity × Brownfield: ≈ 0–10% gain.
  • Differences by Programming Language Popularity

    • Low-popularity examples: COBOL, Haskell, Elixir, etc. High-popularity examples: Python, Java, JavaScript/TypeScript, etc.

    • Findings:

      • For low-popularity languages, AI’s help is limited even for low-complexity tasks. If it’s only helpful “2 out of 5 times,” developers often stop using it altogether.
      • In “low-popularity × high-complexity” scenarios, AI can even slow down progress (negative returns).
      • The vast majority of actual development happens with high-popularity languages: a common ≈20% gain for low-complexity tasks, and ≈10–15% for high-complexity tasks.
  • Impact of Codebase Size and Context Length (A more theoretical observation)

    • As codebase size increases from 1k to 10M lines of code (plotted on a log scale), the productivity gains from AI generally “decay rapidly with scale”:

      • Reason 1: Context window limitations—a larger window doesn’t always mean better results.
      • Reason 2: Decreased signal-to-noise ratio—a large amount of irrelevant context interferes with the model’s judgment.
      • Reason 3: Larger codebases have more complex dependencies and domain logic, making them harder for models to navigate and reason about.
    • Combined with a study curve on “context length and code task performance”: As context increases from ~1k to 32k tokens, the model’s performance on coding tasks actually drops (example curve shows a drop from ~90% to ~50%).

      • Even if some models claim to have million-token contexts (one model claims 2 million), it doesn’t mean that “stuffing the entire repo in” will make it more accurate. Extending the context beyond 32k is likely to degrade performance further.
  • Summary & Conclusions

    • Overall, AI can boost developer productivity, but “not always, and not for all tasks equally.”
    • Whether to use it and the level of benefit depend on a combination of factors: task complexity, project maturity (greenfield/brownfield), language popularity, codebase size, and context length.
    • Practical advice leans toward “use in most scenarios, but be cautious in sensitive ones.” Gains are more reliable in low-complexity, high-popularity language scenarios, and in greenfield or smaller codebase projects. More caution, evaluation, and measurement are needed for high-complexity, brownfield, large codebase, or niche language scenarios.

[00:00–01:28] Opening & Main Idea

  • Citing the media context from the beginning of the year: the controversy over “using AI to replace a large number of mid-level engineers” (with Meta/“Zuck” as the starting point). The speaker’s personal judgment: in the short term, AI will not “completely replace” developers, but it does improve productivity; at the same time, there are situations where it reduces productivity (it is not a one-size-fits-all solution).

[01:28–02:22] Research Scale & Data Structure

  • A Stanford team has been conducting a large-scale engineering productivity study for three consecutive years:

    • Time-series dimension: New participants’ Git histories are backfilled, allowing observation of long-term trends (e.g., the impact of the pandemic, adoption of AI tools).
    • Cross-sectional dimension: Covers 600+ companies (enterprise/mid-market/startup), 100k+ developers, tens of millions of commits, and billions of lines of code.

[02:22–04:11] Private Repos, “Ghost Engineers,” & The Research Team

  • Primarily private repositories: They better reflect the true output of a team/organization in a closed environment (avoiding the noise of “occasional weekend contributions” in public repos).

  • A controversial finding from late last year: about 10% of engineers were identified as “ghost engineers,” producing almost nothing. This topic was retweeted by Elon Musk and sparked heated debate (archived in media and on social platforms). (Business Insider, IT Pro, X (formerly Twitter))

  • Team Background:

    • Simon: Former CTO of a unicorn company, managed ~700 engineers, focused on “how to detect team anomalies early.”
    • Yegor: Has been focusing on data-driven software engineering decisions at Stanford since 2022.
    • (Note) Professor Kosinski: Researches human behavior in digital environments.
  • Agenda Preview: ① Limitations of mainstream research; ② This study’s methodology; ③ Key results.

[04:28–06:56] Three Limitations of Existing Research

  • (1) Measuring by commit/PR/task counts:

    • Task sizes vary dramatically; more commits ≠ more output.
    • The introduction of AI often leads to “rework”: new tasks created to fix defects in AI-generated code, causing teams to “spin faster but not move forward.”
    • External supplement: The industry recommends multi-dimensional frameworks (like DORA/Four Keys and SPACE) to avoid simplifying productivity to “activity volume” metrics. (Google Cloud, ACM Queue)
  • (2) Poor generalizability of greenfield control experiments:

    • Many vendors/papers have subjects implement small tasks “from scratch with zero context.” AI excels at this kind of template/boilerplate code.
    • However, real-world engineering is mostly brownfield (existing code) with dependencies and domain context, making it difficult to extrapolate these findings directly.
    • External supplement: A controlled experiment with GitHub Copilot showed a ~56% speed increase on standardized, low-context tasks (a small HTTP server problem). (arXiv)
  • (3) Surveys/self-assessments do not equal productivity:

    • A small experiment with 43 people: self-assessments showed a very weak correlation with actual performance; on average, people over- or underestimated themselves by about 30 percentile points.
    • External supplement: SPACE emphasizes five dimensions—Satisfaction, Performance, Activity, Communication/collaboration, and Efficiency—and discourages using a single subjective dimension to measure “productivity.” (ACM Queue)

[07:18–09:05] The Ideal Measurement and a Scalable Alternative

  • Ideal state: Every code change is independently reviewed by 10–15 experts (for quality, maintainability, functionality, effort, etc.). The aggregated results can predict real output well, but this is expensive and not scalable.

  • Practical solution: Build an automated model to replace the review panel:

    • Connects directly to Git, analyzes source code diffs at the commit level.
    • Aggregates this into “functional output” (not lines of code or commit counts), then summarizes by author/commit hash/timestamp at the team and monthly level to visualize trends (a minimal plumbing sketch follows).
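
The scoring model itself is what the study built, so the sketch below (a minimal, assumption-laden illustration, not their implementation) only shows the plumbing this bullet describes: walk the Git history, score each commit, and roll the scores up per author per month. `score_commit` is a trivial placeholder (files touched), not a real “functional output” measure.

```python
# Plumbing sketch only: walk commits, score each one, and aggregate per
# author per month. `score_commit` is a placeholder, not the talk's model.
import subprocess
from collections import defaultdict
from datetime import datetime, timezone

def git(repo, *args):
    return subprocess.run(["git", *args], cwd=repo, check=True,
                          capture_output=True, text=True).stdout

def commits(repo):
    """Yield (sha, author email, commit datetime) for the whole history."""
    for line in git(repo, "log", "--format=%H|%at|%ae").splitlines():
        sha, ts, author = line.split("|", 2)
        yield sha, author, datetime.fromtimestamp(int(ts), tz=timezone.utc)

def score_commit(repo, sha):
    """Placeholder score: number of files touched (swap in a real model here)."""
    files = git(repo, "show", "--name-only", "--pretty=format:", sha).splitlines()
    return sum(1 for f in files if f.strip())

def monthly_output(repo):
    """{(year, month): {author: total score}}, ready for trend charts."""
    table = defaultdict(lambda: defaultdict(float))
    for sha, author, when in commits(repo):
        table[(when.year, when.month)][author] += score_commit(repo, sha)
    return table

if __name__ == "__main__":
    for month, per_author in sorted(monthly_output(".").items()):
        print(month, dict(per_author))
```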

[09:05–10:32] The First Month After Adopting AI

  • A team of ~120 people piloted AI in September:

    • Stacked bar chart: Green = New Features, Gray = Deletions, Blue = Refactoring, Orange = Rework.
    • Immediately after the rollout, rework rose: the team felt it was “writing more,” but not all of it was useful (a plotting sketch of this breakdown follows this list).
  • Rough conclusion: Net productivity increased by about +15%~20% overall; judging by gross delivered code alone would misleadingly inflate the gain, because the rework needed to fix AI-generated defects has to be subtracted.
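
As a quick illustration of the chart described above, here is a minimal matplotlib sketch of the stacked monthly-output view using the four categories and colors from the talk. The numbers are invented placeholders, not the pilot team’s data.

```python
# Stacked monthly-output chart, in the spirit of the talk's figure.
# All numbers below are made up for illustration.
import matplotlib.pyplot as plt

months = ["Jul", "Aug", "Sep", "Oct", "Nov"]      # AI pilot starts in September
output = {                                         # hypothetical output units
    "New features": [52, 55, 60, 63, 61],
    "Deletions":    [6, 5, 7, 8, 7],
    "Refactoring":  [12, 13, 14, 15, 14],
    "Rework":       [8, 7, 16, 18, 17],            # illustrative jump after the pilot
}
colors = {"New features": "green", "Deletions": "gray",
          "Refactoring": "tab:blue", "Rework": "orange"}

bottom = [0] * len(months)
for label, values in output.items():
    plt.bar(months, values, bottom=bottom, label=label, color=colors[label])
    bottom = [b + v for b, v in zip(bottom, values)]

plt.ylabel("Functional output (arbitrary units)")
plt.title("Monthly output by category (illustrative data)")
plt.legend()
plt.show()
```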

[10:32–13:22] Gain Distribution: Task Complexity × Project Type

  • Two violin plots (distributions):

    • Low-complexity tasks show a more significant improvement (data supports the idea that “AI is good at simple boilerplate/local completions”).
    • Greenfield environments (new projects/modules) are better than brownfield (existing/legacy systems).
    • High-complexity tasks are more likely to have negative gains (decreased productivity) in several scenarios.
  • When converted to a “median + interquartile range” bar-and-whisker plot, the conclusion is the same and easier to communicate to management.

[13:22–14:21] Quick Reference Matrix for Decision-Making (Illustrative)

  • Sample: 136 teams, 27 companies.

  • Empirical ranges (for real enterprise projects, not personal side projects):

    • Low-complexity × Greenfield: +30%~40%
    • High-complexity × Greenfield: +10%~15%
    • Low-complexity × Brownfield: +15%~20%
    • High-complexity × Brownfield: 0%~10%

[14:21–15:35] Language Popularity × Complexity

  • Low-popularity languages (e.g., COBOL/Haskell/Elixir):

    • Limited help even in low-complexity scenarios, and even negative gains in high-complexity ones (the models have a weaker command of these languages, training data and examples are scarce, and hallucinations or misleading suggestions are costly).
  • High-popularity languages (Python/Java/JS/TS, etc.):

    • Approximately +20% for low-complexity; approximately +10%~15% for high-complexity (more usable overall).

[15:35–16:19] Codebase Size & Diminishing Marginal Returns

  • As codebase size scales from thousands to millions/tens of millions of lines, the net benefit from AI rapidly declines:

    • Context window limitations (unable to ingest/utilize enough dependencies and cross-module knowledge).
    • Signal-to-noise ratio worsens.
    • Coupling/domain logic becomes more complex.

[16:19–17:18] Long Context ≠ Stable Gains: Performance May Drop as Context Grows

  • Referring to long-context research: in the range of 1k→32k tokens, the position of key information and the length of the context significantly affect performance. Many models show performance degradation or “lost in the middle” issues with very long contexts. (Consistent with the talk’s observation that “longer context isn’t necessarily better.”) (arXiv)
  • At the same time, the industry does have very long context models (e.g., Gemini 1.5 Pro claims a 2 million token context), but “visible ≠ effectively usable”—real-world implementation still requires validating retrieval and inference quality. (Google Developers Blog, Google Cloud)

[17:18–18:02] Summary & Resources

  • Conclusion: AI generally improves developer productivity, but it is uneven, has boundaries, and is jointly affected by:

    • Task complexity, project maturity (greenfield/brownfield), language popularity, codebase size, context length, etc.
  • Portal for the research and more materials: Stanford Software Engineering Productivity. (softwareengineeringproductivity.stanford.edu)

  • Measuring productivity multi-dimensionally (avoiding “commit count = productivity”): DORA Four Keys, SPACE framework. (Google Cloud, ACM Queue)
  • AI’s speed advantage on controlled, low-context tasks (evidence for the greenfield conclusion): GitHub Copilot RCT (~56% faster); enterprise research summary. (arXiv, The GitHub Blog)
  • Media and social media archives of the “ghost engineer” discussion (publicly disclosed by the research group and retweeted by Musk): (Business Insider, IT Pro, X (formerly Twitter))
  • Long context effects and “lost in the middle” (evidence for the cautious stance on context length in the talk): (arXiv)
  • Very long context model announcements (capability claims): Gemini 1.5/2.x official blogs (real-world context for the “longer context” claims). (Google Developers Blog, blog.google)

Note: The points above are primarily based on the video content; external links are for supplementing and verifying related concepts/cases (e.g., metric frameworks, long-context research, public discussion archives, etc.) to facilitate your further reading of original materials.

Tags: AI-Agent