From prompt and pray to prompt engineering

07 Apr, 2026

Yesterday the LLM sliced through a problem like butter. Today it can't follow a three-line instruction without going sideways. You press regenerate, rephrase the prompt, add qualifications — "no, I meant this, not that" — the context window slowly fills with noise, further degrading the model's output. A roulette wheel of code: "maybe the machine will do what I ask for, maybe it won't, but I have no real control except pressing this lever harder".

This is addictive in the clinical sense: incredible highs coupled to an unreliable and unpredictable system, with real stakes attached — finishing a project, keeping a deadline. Frustration compounds because you barely glance at the diffs, which always look good when quickly glanced at (that's literally what the models are trained for), while underneath, something went wrong three tool calls ago when the model picked up the wrong documentation or chose a design pattern that doesn't belong in your codebase. You'll never see this because you're staring at scrolling lines of diffs, not how the model actually behaves. You will come out of the session without experiencing any growth.

(btw, this is on substack, if you want to feed the engagement machine).

If all you know is the rat race, then electronic rats that go really fast are a very compelling thing. But racing towards an elusive reward doesn't really make you better at anything.

Reframing agent failure as engineering opportunity

I reframed my reaction to these moments by turning the failure of a model into an opportunity to learn prompt engineering, to learn how models behave, to do science. A model failing at something I expected it to handle is not a random misfortune. There are a finite number of real reasons:

I didn't prompt it right.
The model genuinely can't do it reliably.
I was unlucky, because models behave probabilistically.
There is another issue at play — a tool might be broken, model quality might have degraded because the provider served a quantized version or changed something in their own harness.

Each of these is diagnosable. Each teaches you something. if the model can't tackle a task, that means I have to break it down into smaller, more tractable problems. All of these are engineering.

Think about it the way you'd think about a compile error. There is no need getting angry: "but yesterday you compiled my code just fine! WTF is going on? Let me add some more things here and replace int with string and maybe then it'll work." Instead, read the error message carefully; think back to where you started: maybe it's the wrong include; maybe you genuinely found a bug in the compiler. But most probably, you didn't have a clear understanding of the problem, of the programming language, of the technologies at play: you are missing knowledge and have to do some research.

Working with the early models taught me this. GPT-3.5 was like programming in assembly — I had ~4k tokens, and each one really counted. I could watch the model's behavior shift when I swapped a single term. It taught me that LLMs are just machines, and that what looks like an alien intelligence hard at work is really just a big string-to-string transformer we can reason about. When you picture a LLM as a sort of compiler, debugging becomes the obvious response to failure. Dreaming up agent roles and names and further ways to burn tokens is fun and feels productive, but it's cosplaying engineering.

Addressing non-deterministic: the spray test

Failure due to probabilistic model behavior is easily addressed. If something happens that I didn't expect, I regenerate the same output using the same prompt and see how far the model "sprays" in its behavior. A good setup should lead to the same output most of the time.

Studying prompts this eats tokens and time, but the learnings are invaluable. It shows you whether a model always does the same thing (say, properly use a FoobarWrapper without having to be prompted explicitly), or veers sideways because certain words can be interpreted one way or the other. High variance means your prompt is unclear, underspecified: something in there prevents the model from grasping what you're asking, and how it veers off track is usually telling. Consistent failure means the model can't handle your request properly (but maybe the next model will! AGI is around the corner, in case you forgot), or the combination of harness, tools, and codebase quality needs rethinking. Either way, you now have something concrete to work with instead of a vague sense that the machine is being difficult.

I run these experiments constantly, just to see how different the models behave and fail. That sounds perverse, but it's how you build an actual mental model of what these systems do instead of relying on vibes.

One wrong word changes everything

The 15 words you put in at the start determine everything: choose them as an engineer.

I had a prompt — "build a smalltalk environment for agents" — that produced wildly different results on each regeneration (see if you can spot which word caused it!)

The model doesn't know if I meant Smalltalk the programming language or smalltalk the social chitchat. That single word forks the entire generation into two completely different projects. It seems like a funny example, but the exact same thing happens when you generate source code — it's just less visible.

This was more immediately visible with early models and WordPress. WordPress uses C-style %s syntax in its SQL query templates, while most PHP database libraries use ? placeholders. The models would be writing PHP, writing WordPress, doing a SQL query, and the moment it hit %s, it had a high change of switching to writing C. One token made the model flip from one area of the latent space to another contiguous yet erroneous one. Surfing probabilistic waves of code but ending up on the wrong beach.

This is why I use very short prompts and only carry a minimal set of skills and tools. Every word is load-bearing. Every tool generates load-bearing words. If I can't trace how my prompt shaped the output, then I'm not being careful enough. I only create reusable skills when I'm tired of typing the same stuff over and over, meaning I know something is foolproof. Everything else, I want to understand the effect of each word, because that's where the engineering lives.

Diffs are compiled output, not source

Looking at code diffs an agent produces is like reviewing someone's compiled assembly. It's interesting, even fun (at times), but you don't get to understand what they meant, nor does it result in very reusable material. With models, the generated code is the compiled artifact. Generated code conveys some sort of meaning, sure, but it lacks the actual meat. Your prompt, plus the context the model encountered — the documentation it read, the tools it called, the design decisions it locked in during the first few inferences — that's the source (models are auto-regressive, which is why I count the non-code generating behaviour of the model as "source").

When something fails, it makes sense to go back to the actual session transcript (you have a full trace in ~/.claude/projects or ~/.codex/sessions): Which tools did it call? Which decisions did it make? When did that problematic file get created? What matters is tracking how the model interprets your input into something that turns to crap down the road. There will be a point where you can tell "here's where it started." Maybe it pulled in the wrong documentation. Maybe it chose a design pattern your codebase doesn't use, but nothing steered it away. The prompt and the model's subsequent behavior is what is worth studying, reviewing, archiving, documenting and sharing with colleagues.

Because a lot of information is encrypted in thinking traces, seriously blurring the narrative and filling transcripts with usually boring and misleading verbiage, I have agents write diaries of what they did while working — the decisions, the dead ends, the course corrections, the knowledge gained. Raw sessions are insufferable to read, but through a diary you get not just the prompt, but how the model ended up interpreting it. The diary is generated after all the work is done, so you get an "interpretation" of how whatever was created — including the nonsensical excursions — relates back to the words you actually typed.

Turning failures into benchmarks

Every model failure is a reusable test case. I had a project last summer I just couldn't refactor with Anthropic models with an LLM without significant manual intervention. I spent days trying different approaches. I gave up and kept the branch (because it was a side project, I wasn't too invested in seeing the refactor through). When GPT-5 came out, I restored benchmark/YYYY-MM-DD/task-that-failed and it one-shot it. I knew there was something very special there, even if it might have been lucky fluke, because I never got close to such a result in dozens of previous attempts. The second day of GPT-5 I was telling everybody what a gamechanger it was, while the zeitgeist on X was how much of a failed launch it was.

This is only possible if you've characterized your failures well enough to replay them. Otherwise you're evaluating new models on vibes, parroting whatever the consensus on Twitter happens to be that week.

Leaving anxiety behind and being an engineer again

The result of this reframing is most felt in my mental health. Pressing the magic blackbox lever, delegating thinking to the model and living in uncertainty ("Will this take one hour or one day? Will I come back next week to "wtf happened here"?) is anxiety-inducing. Reframing failure into engineering opportunities turned the lack of agency, the sense of being at the mercy of a probabilistic system back into engineering. "Will this take one hour or one day? WTF happened here?" becomes "ok, this didn't work, we have a problem to solve." And problems you can solve. In fact, you probably became an engineer because you love solving problems.

There is value in being a turtle and searching for the expression of simplicity underneath all the complexity we throw at these systems. The engineers who thrive won't be the ones pressing the lever fastest or running the most agents in parallel. They'll be the ones who slow down enough to understand what the 15 words in their prompt actually do, who read the transcript instead of glancing at the diff, and who treat every failure as a chance to get better at a craft that barely exists yet.

Next time your agent fails, take a step back. Read the error message. Do some real engineering.