Adventures in Vibe Coding
A review of AI agents in software development
You might remember that about two years ago, I published a scathing criticism of generative AI. I’m happy to report that despite using ChatGPT and Claude 4.8 much more since then, I stand by everything I wrote there, but I feel the need to clarify that I don’t think AI models are entirely useless, just that there are many clear reasons to think they’re a net negative for society. After playing around with them a bit more, I’ve gathered enough data to collect some thoughts.
The first observation is one that, once you notice it, you’ll see everywhere: despite seemingly possessing the entire breadth of human knowledge (far more than a university’s worth of PhDs), AIs just aren’t clever. Most humans who are widely read, even ones who wouldn’t consider themselves particularly smart, still have a knack for connecting seemingly unrelated concepts to generate new ideas. Noticing that AI simply does not do this shatters the illusion pretty decisively.
I noticed this first when I was working on a Pokemon ROM hack where I wanted to make all the Pokemon from both FireRed and LeafGreen accessible in both versions; naturally, I decided to accomplish that by stitching together the random encounter rates from both games. As I was working, I decided to ask a very high level math question (which doesn’t need to make sense to you):
What’s the category-theoretic interpretation of summing and normalizing two probability distributions like we’re doing right now?
This is a fairly high-level math question; even my probability professors didn’t know enough category theory to give a satisfactory answer to this (fairly niche) question. But ChatGPT, on its lowest setting, immediately directed me towards several graduate-level papers on category-theoretic interpretations of probability theory (like Giry monads, Markov categories, and physics, topology, logic, and computation), right before helping me edit scripts for my little Pokemon game.
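For reference, the operation behind that question is tiny. Here is a minimal sketch in Rust, assuming each game’s encounter table is a list of (species, rate) pairs; the function name merge_encounters and the input layout are hypothetical and not taken from the actual ROM hack scripts.

```rust
use std::collections::BTreeMap;

/// Sum the encounter rates of two tables species by species, then rescale
/// so the merged rates add up to 1 again.
/// Assumption: at least one rate is positive, so the total is nonzero.
fn merge_encounters(
    fire_red: &[(&str, f64)],
    leaf_green: &[(&str, f64)],
) -> BTreeMap<String, f64> {
    let mut merged: BTreeMap<String, f64> = BTreeMap::new();
    for &(species, rate) in fire_red.iter().chain(leaf_green) {
        // A species that appears in both games gets both rates added together.
        *merged.entry(species.to_string()).or_insert(0.0) += rate;
    }
    let total: f64 = merged.values().sum();
    merged
        .into_iter()
        .map(|(species, rate)| (species, rate / total))
        .collect()
}
```

With made-up tables where FireRed has Oddish and Pidgey at 0.5 each and LeafGreen has Bellsprout and Pidgey at 0.5 each, the merged table comes out to Oddish 0.25, Pidgey 0.5, and Bellsprout 0.25. When both inputs already sum to 1, this is simply the 50/50 mixture of the two distributions.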
If you’re getting whiplash, you know what I was feeling in that moment. How is it that some “intelligence” with such an encyclopedic knowledge of the deepest subjects in the world is so content to just do whatever you ask? Any human who was that smart would probably not bother to write a Pokemon ROM hack and instead just act out the plot of Limitless: you know, become president, take over the world, restructure the entire economy, or some other Rick Sanchez-style project beyond the comprehension of us mere mortals. Instead, it just gravitates towards whatever is in front of it. This isn’t intelligence so much as free association.
But I’m getting off topic. My point here isn’t another full breakdown of what’s going on with these Lovecraftian-monsters-in-a-bottle, it’s to discuss how well they do at software engineering. Here are some of my takeaways.
They Write Way, Way Faster Than A Human
There’s a famous video from 2011 of IBM’s Watson competing on Jeopardy. The thing that struck me most wasn’t that it could speak coherently and answer questions correctly, but that it could hit the button faster than humanly possible. It always got the first crack at any question, and since most of them could reasonably be solved by something like a Google search or database lookup, it could mop up a lot of the questions by just shutting the humans out of the game.
This is basically how AI coding agents work — they can write hundreds, if not thousands of lines of code per second across many different files. This is their primary value-add. Even if a smart human clocked into work knowing exactly what changes they wanted to make to which files, they’d lose out on the race against an oblivious coworker with access to an AI agent. It’s just the nature of the game.
They Sometimes Struggle To Write Idiomatically
I’ve historically had a lot of trouble getting over the learning curve with Rust, so I asked an AI to write a Rust program (which ended up becoming Blue Raven) for me to use as a first draft, then refactor. Even as a complete newcomer to the language, I could tell immediately that while it was putting together a functional program, it wasn’t anything close to what a seasoned Rust developer would write. For example:
use std::fs;
use std::path::PathBuf;

fn find_windows_mount() -> Option<PathBuf> {
    // Common mount points to probe
    let candidates = ["/mnt/windows", "/mnt/Windows", "/media/windows", "/windows"];
    for c in &candidates {
        let p = PathBuf::from(c).join("Windows/System32/config/SYSTEM");
        if p.exists() {
            return Some(PathBuf::from(c));
        }
    }
    // Check /media/<user>/* mounts
    if let Ok(entries) = fs::read_dir("/media") {
        for user in entries.flatten() {
            if let Ok(mounts) = fs::read_dir(user.path()) {
                for mount in mounts.flatten() {
                    let p = mount.path().join("Windows/System32/config/SYSTEM");
                    if p.exists() {
                        return Some(mount.path());
                    }
                }
            }
        }
    }
    None
}
Rust is technically an imperative language, but it draws heavily on functional programming: for example, what would be a for loop in most languages tends to be a map in Rust, and it’s generally considered cleaner to make things expressions rather than statements; a much more idiomatic version of the above code would look like
use std::fs;
use std::path::PathBuf;

fn find_mount() -> Option<PathBuf> {
    // Common mount points to probe
    let paths = ["/mnt/windows", "/mnt/Windows", "/media/windows", "/windows"];
    if let Some(path) = paths
        .into_iter()
        .map(PathBuf::from)
        .find(|base| base.join("Windows/System32/config/SYSTEM").exists())
    {
        return Some(path);
    }
    // Check /media/<user>/* mounts
    fs::read_dir("/media")
        .ok()?
        .flatten()
        // An unreadable user directory contributes no entries instead of aborting
        .flat_map(|user| fs::read_dir(user.path()).into_iter().flatten().flatten())
        .find_map(|mount| {
            let path = mount.path();
            path.join("Windows/System32/config/SYSTEM")
                .exists()
                .then_some(path)
        })
}
It might not be clear to non-Rust programmers why the bottom code is preferable to the top, but it should be obvious even to a layperson that these pieces of code look very different.
I can only speculate that this discrepancy might result from Rust being unique as an imperative language that uses these iterator functions everywhere; even Python often prefers comprehensions to maps. I also have to imagine that C/C++ dominates the training data for “imperative language that works directly with memory”, so AI might end up in kind of a C/C++ “mindset” just because it’s overfitted. (C/C++ is also much closer to how every other language works.) Or maybe the real reason is just that there’s a sea of novice Rust code out there that’s written with for loops; who knows. The point is, it’s a bit surprising that it writes Rust like this considering how syntactically distinct Rust is and how well these agents know the broader package ecosystem.
AI agents also don’t seem to understand that we should (within reason) go out of our way to make data immutable, or use blocks { } to declutter namespaces; the main function it wrote was stuffed full of code like
let mut w = ...;
let mut x = function1(w);
let mut y = function2(x);
let mut z = function3(y);
function4(&mut z);
function5(&z);
and w, x, and y are never used again. If this is the kind of code you’d consider pulling out into a function, you might as well at least scope it like
let z = {
    // maybe keep and rename w -> a and y -> b for clarity,
    // and drop x because we don't need it
    let a = ...;
    let b = function2(function1(a));
    function3(b)
};
function4(&z); // refactored to not need a mutable reference
function5(&z);
just to properly encapsulate all the data floating around and reduce cognitive load.
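For a version of the same pattern that actually compiles, here is a minimal sketch. The function summarize and its word-counting task are invented purely for illustration and have nothing to do with Blue Raven; the point is only that the mutable scratch variable lives and dies inside the block.

```rust
/// Report how many distinct words `text` contains and which sorts first.
fn summarize(text: &str) -> String {
    // The block is an expression: `words` is mutable only inside it,
    // and only its final value escapes as the immutable `sorted`.
    let sorted = {
        let mut words: Vec<&str> = text.split_whitespace().collect();
        words.sort_unstable();
        words.dedup();
        words
    };
    // From here on nothing can accidentally mutate the list,
    // and the name `words` is no longer in scope.
    format!(
        "{} unique words, first: {}",
        sorted.len(),
        sorted.first().copied().unwrap_or("-")
    )
}
```

Calling summarize("b a b c") returns "3 unique words, first: a". Trying to push onto sorted after the block is a compile error, which is exactly the kind of guard rail the scoping is meant to buy.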
AI-generated Rust quality will probably improve as Rust’s share of all-code-ever-written increases, or maybe as AI engineers prune down the Rust training set to only the highest-quality, most idiomatic examples. If you ask it point-blank “refactor this to be more idiomatic”, it’ll usually do a much better job, so clearly it possesses the skills but just can’t use them out of the box.
It’s Good At Finding Needles In Haystacks
Not only is Claude a grep god, it just doesn’t get mentally tired when searching through long files. I’d expect it to do well when I point it at a particular file or paste a massive, nasty stack trace, because both of those come down to string search. The surprising part is that I can describe something purely qualitative (a webpage looking kind of janky) and it’ll search through tens of thousands of lines to add the one missing line of CSS that fixes the issue. Beyond raw stamina, it seems to have a really solid understanding of how parts of the codebase manifest as program behavior.
It’s Saved Me A Lot Of Time Debugging
I have a pretty bespoke tech stack on my computer: I’m dual booting Pop!_OS with Windows on a Dell computer that nominally has decent Debian/Ubuntu drivers but can be a little flaky in practice. I’m writing software in very different languages and frameworks, and inevitably an apt package or a driver starts acting weird.
Being unable to solve a really persistent error with Poetry was actually the thing that first brought me to ChatGPT: despite completely failing to resolve my issue (I ended up just switching to uv, and have never looked back), I stuck with AI because my results with Google search were even worse. It’s a common opinion that Google search has sucked for a while now, and jamming AI garbage into it hasn’t helped. If I’m going to be forced to speak to an AI anyways, I might as well do it as a first-class user on the actual website rather than having it shoved down my throat.
It is refreshingly good at debugging complicated bash garbage and reasoning about my installation. The primary benefit is that I can write longer, natural-language queries complete with all the necessary context. I’m not sure if this is an improvement over the quality of Google search we enjoyed in 2018, but until it inevitably rm -rf /’s me, all I need it to do is work.
They’re Generally Bad At Structuring Programs
If you read the initial vibe-coded commit to Blue Raven, you’ll notice that it’s overwhelmingly just a monolith of code directly in main: over 400 lines, compared to my refactor, which sits at only 150. Reading code is a skill (and something that I discovered I was genuinely bad at when I first started working at BitSight), and it sorely hurts when the code wasn’t written by a human with even the intention of being read by another human. Even if AI is trained on code that is written by humans for humans, it’s still ultimately just a facsimile — and that lack of intentionality shows.
And the pain doesn’t just come from the lack of concern for code reuse, maintainability, or unit testing; I think the problem is actually that AI doesn’t suffer from the same cognitive burnout that humans do. Resilience is nice when we need to dive into a one-million-line haystack to find a needling little bug, but it’s a drawback when AI just stacks complicated code on top of complicated code with no attention to cognitive load or encapsulation. Our tiny human brains subconsciously break code down into pieces we can understand as we write it, and that provides at least a little relief when we read it later. But Claude doesn’t get exhausted when it just dumps everything into main, so that’s what we end up having to read.
One way to mitigate these nasty monoliths is by working with an overarching framework or a core library: Claude’s Django code, on a macro level, tends to look much nicer not just because there’s plenty of Django code in all of these training sets, but because Claude inherits all the architecture, encapsulation, and coupling from a human who actually thought about it. With the ambient context of working within a huge framework, not only will the agent know where to look for existing code, it’ll know to look for existing code in the first place.
But not every program fits neatly into Django (and definitely shouldn’t try to). Less structured programs need some tight handholding; while an AI can read most of a project into its context window before making a change, it often doesn’t, and won’t stop to check for abstractions or functions that already exist. In that case, you’re gonna have to prompt it to look for code to reuse or be aware of, and compact the conversation beforehand if you think any necessary information will otherwise fall out of context.
Even with all these tricks and structure, the AI inevitably loses the forest for the trees, and we end up with a lot of duplicate code. The difference is that good habits yield only several functions’ worth of duplicated code, and bad habits yield entire files’ worth.
Hallucinations Are Here To Stay
Even with retrieval-augmented generation, mistakes and hallucinations are inevitable with probabilistic models. Their mental model of the world is remarkably wide and detailed, but it just isn’t deep enough to prevent serious issues where knowledge does matter. Anyone who has used these models can remember times when one answered a question and then said “Wait, that’s wrong. The real answer is…”, sometimes multiple times.
If the introduction hadn’t given it away, I’m still very skeptical of AI’s actual reasoning abilities, even with reports that an OpenAI model solved an Erdős problem. As impressive as that sounds, the solution seems to just be connecting a remarkably short bridge from existing human research to the problem statement:
The argument relies crucially on ideas that may, at least in retrospect, be attributed to Ellenberg-Venkatesh, Golod-Shafarevich, and Hajir-Maire-Ramakrishna.
which indicates to me that this was more about discovering that the question had essentially already been solved, unbeknownst to the original authors. Admittedly I haven’t done any math research, but I don’t think this qualifies as the kind of novel stuff that will truly put humans in the back seat. Rather, I take this result as a confirmation of my hunch that many smaller open conjectures have already been basically solved by researchers in completely unrelated fields who just don’t know about those conjectures.
Even so, these kinds of searches are still a huge leap forward for research: math is an unimaginably wide field, and it’s inevitable that obscure results in one field can resolve open problems in a different, seemingly unrelated field. I don’t think this is the quantum leap that it’s being sold as, since AI is very, very far from doing deep research that seriously pushes the bounds of what was already known. Still, the ability to do this kind of “shallow” research, which just connects existing research to the conjectures it solves, will likely lead to a lot of relatively low-hanging fruit being mopped up over the coming years.
The Biggest Problem: Its Subsidies Are Running Out
Much like how Uber got much more expensive after the investor cash started drying up, AI agents are suddenly becoming much more expensive as investors are increasingly unwilling to subsidize their ridiculous costs. After GitHub revamped Copilot’s pricing model so users were paying closer to what these models actually cost to run, users were blowing through most of their monthly budget in under a day. Ed Zitron, probably the leading critic of the AI industry, has repeatedly warned that the economics of AI don’t make sense: the costs are all massively subsidized by investors who are simply not earning their money back. Mathematically, the music has to stop sooner or later.
The difference between AI and Uber is that Uber was a drop-in replacement for an existing industry that was noticeably sclerotic thanks to regulatory capture schemes like taxi medallions. The decision to burn billions in investor cash was a calculated risk; by the time users had to pay the real costs, the taxi industry had largely been dislodged.
The problem is that the same play, dislodging the incumbents (here, human developers) before the real costs come due, is not feasible with current technology. Barring a second Attention Is All You Need-level leap forward, AI just isn’t at the level where it can replace talented developers. It’s still obviously going to reduce overall demand for programmers, since code monkey junior developers might find it hard to compete with a code monkey that can write code hundreds of times faster, maybe even with the updated pricing.
Conclusion: Better Luck Next Time?
I don’t think the AI revolution is here, but Sam Altman and Dario Amodei’s little test run has permanently altered society. Most of the supposedly AI-driven software layoffs were actually just market corrections from overhiring during COVID. There will almost certainly be more fluctuations as AI gets more expensive, computer science programs see lower enrollment, and tech debt from mounds of inscrutable AI slop needs to be refactored, but I’d be surprised if the economics ever approached the golden days of 2012.
I imagine that the future of “agentic coding” (depending on how long and harsh the inevitable AI winter will be) will involve a lot more tools to reason about code: pre-commit hooks, human-written unit tests, maybe even stricter compilers and type checkers. That will obviously not stop bugs, but the more guard rails we can put around black-box AI code (because the breakneck pace of development that managers are demanding means people are definitely not reviewing all this garbage), the better we can reason about it, and the more easily we can catch and prevent issues. One thing I’ve been wondering about is having multiple AI code reviewers, all with different settings, read the same code for redundancy. AI review alone is inferior to human code review, but information theorists (or anyone who has studied TCP) know that you can make very reliable machines from unreliable parts.
It’s also unclear how deeply agentic research will influence math; while I don’t think Claude is going to be solving millennium problems anytime soon, it’s clearly shaken most of the researchers.
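The “reliable machines from unreliable parts” idea can be made concrete with a little arithmetic. The sketch below is a rough model under one strong assumption: reviewers err independently of each other, which AI reviewers sharing similar training data probably do not. The function names and all the rates used afterwards are illustrative, not measurements.

```rust
/// Binomial coefficient "n choose k", computed in floating point.
fn binomial(n: u32, k: u32) -> f64 {
    (0..k).fold(1.0, |acc, i| acc * f64::from(n - i) / f64::from(i + 1))
}

/// Probability that at least `k` of `n` independent reviewers raise a flag,
/// when each one flags with probability `q`.
/// Pass the per-reviewer detection rate as `q` to get the combined catch
/// rate, or the per-reviewer false-positive rate to get the combined
/// false-alarm rate.
fn at_least_k_flag(n: u32, k: u32, q: f64) -> f64 {
    (k..=n)
        .map(|i| binomial(n, i) * q.powi(i as i32) * (1.0 - q).powi((n - i) as i32))
        .sum()
}
```

Suppose, hypothetically, each reviewer catches a real bug 70% of the time and wrongly flags clean code 10% of the time. Blocking a commit when any of three reviewers objects gives at_least_k_flag(3, 1, 0.7) = 0.973 for real bugs, but at_least_k_flag(3, 1, 0.1) = 0.271 for clean code, so false alarms nearly triple. Requiring two of three to agree gives 0.784 and 0.028 instead. Redundancy helps, but the voting rule decides which kind of error it reduces.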
The next big AI revolution — which any researcher could stumble on — will probably be the one that actually lives up to the hype. Until then, my only advice is to beware the damage that AI garbage is inflicting on society, and don’t let the salesmen take you for a ride.