the scapegoat dev

LLMs will fundamentally change software engineering

[Image: an AI drawing of a weird cyborg kid in front of an egg-shaped computer, in black and white ink. It's very sci-fi and quite melancholic in some way. The egg computer is also a tree.]

Heard about Large Language Models like ChatGPT4, Bing, GPT3? I'm sure you have.

One side of the hype around these technologies that I come across pretty often is the claim that they are bad for some $reason ("they are stochastic parrots", "they create bullshit", "they can't reason", "they make up facts", "they might replace junior developers, but they will not replace senior developers"). While technically true, this misses a much bigger point: if you are in the business of writing software, these things work.

In fact, they work so well that I think we are on track to see a fundamental shift in how software is built. This is going to have a drastic impact on pretty much everything. The irony of the situation is that programming is probably one of the jobs that can most easily be replaced by these technologies. This article is mostly directed at programmers among my readers: we are at a pivotal moment where we programmers need to understand and appropriate these technologies, on our own terms, before capitalism does it for us.

I have written about how I am convinced that LLMs are a paradigm shift for the pragmatic programmer, and I have started giving more detailed insight into what I have been using these models and the applications around them for. I have also written about my budding ethical stance around this technology, and won't get into that aspect in this post. [1]

I think that we are at ground zero of a tremendous revolution in the way we build software. We don't know how to work with these tools yet: we just discovered alien tractor technology. Many critics try to use it as if it was a plain old gardening rake, dismissing it because it plowed through their flowerbed.

I hope to share some of the insights I've gained into what programming with LLMs actually does for me. I have discovered that developing a practice, a methodology, a workflow is key to intellectual work, be it software or writing or music. Because programming is closely intertwined with productive teamwork (especially in our capitalist context), this practice has to be shared. Programming is about coordinating the work of individuals to create common artifacts, and success is determined by how well we coordinate.

I think that programming with LLMs is about to create a radical shift in software architecture, system architecture, programming practices, communication patterns and organizational structures. These are exciting times, because we are the people who are in position to shape what programming is going to be like in the future.

My background is in systems programming

I consider myself a programmer first. I've wanted to press buttons to make machines do things since I was 5, and one of my autistic interests has been computers ever since. Even my music, drawing, and writing are inextricably bound to these machines of wonder. I have written millions of lines [2] of code and barely a day passes by when I don't push a few commits. All that to say that I'm intensely interested in things that work, and that I have tried many many things. I care enough about coding that if LLMs weren't useful, I wouldn't be using them every day. [3]

I have been programming heavily with Copilot since it was in beta, and have been using ChatGPT for every real-world problem I could think of, beginning the week after it was accessible to the public. I write mostly "boring glue code," and have done so in what I call "systems programming," i.e. building systems (operating systems, embedded systems, distributed systems, logistics, supply chain). Another definition could be "writing memcpy in many different complicated ways." [4]

Copilot, my uncanny autocomplete

I love "rustic" programming languages: my favourites are PHP, JavaScript, Java, Go (and, for reasons that are beyond the scope of this article, Common Lisp [5]). For me, "rustic" means that what you see is what the author intended, without necessarily a lot of polish: languages that make their context explicit. It's the dinner table built by your grandfather, upgraded by your mother, and now inherited by you: it is a bit clunky and the paintjob is flaking off, but it has been doing its job for nearly a century without failing. These are languages that Copilot does a superb job with. The patterns it should use are often very "flat," and are often present in the code around it. A symbol's meaning is not influenced by hidden characteristics, abstractions, or module systems. It is not the "complexity" or abstraction level of a language that makes for inferior results; rather, it is that it becomes more difficult to infer from the training corpus what the completion of your code should look like.

This is what allows Copilot to do a superb job. (I tried to use it for Haskell and Rust in the summer of 2022 and haven't tried since; Copilot has made some impressive progress since then, so my take here might already be completely out of date.) I have grown used to Copilot uncannily inferring what I am trying to do after writing the first couple of words. In fact, I have noticed how my physical muscle memory has changed during the programming process. If my internet is down and I (God forbid!) have to type in code myself, I have to undergo a mental switch. At first I'll write 3 words and expect 10 lines of code to be scaffolded out. It takes me a few seconds to realize my magical friend Frank the copilot has gone AWOL.

Stochastic parrots, valid code?

One of the common criticisms of Large Language Models is that they often output wrong code. This is true (ChatGPT4 significantly raises the bar, but it's not too difficult to get it to output wrong code)! Leaving it at that, however, is I think not looking carefully enough. Quickly written wrong code that can be easily corrected is just another name for good code.

Much of my programming consists of writing trivial ideas in longform. I will start with "I need to copy this data from here to here" (which is why I refer to it as "memcpy programming"), and then spell it out in HTTP call this, promise that, SQS event here, batch job there. Just writing the comment "call the HTTP api /api/products and send the result to our SNS topic /products" is enough for Copilot to basically complete the entire thing.

Because I am not a rocket scientist, the methods are going to be something like (in fake Javascript):

   .then((products) => {
     if (validateProducts(products)) {
       // publish the products to the SNS topic here
     } else { fail("invalid products") }
   })

If you let Copilot autocomplete after writing the comment above, it might very well come up with:

   .then((products) => {
     if (validator.checkProducts(products)) {
       const sns = new SNS(this.topic)
     } else { throw Exception("invalid products") }
   })

Getting hung up on the fact that it used validator.checkProducts and not validateProducts is, I think, missing the point. The actual working benefit is that I now usually spend 10 minutes on something that would have taken me 2 h.
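To make this concrete, here is a minimal sketch of what the corrected snippet could look like once the small mistakes are fixed. The names are assumptions for illustration only: validateProducts is a made-up validator, and fetchJson and publish are hypothetical helpers passed in as parameters so the example runs without a real HTTP API or SNS client.

```javascript
// Hypothetical validator: every product needs a string id and a string name.
function validateProducts(products) {
  return (
    Array.isArray(products) &&
    products.every((p) => p && typeof p.id === "string" && typeof p.name === "string")
  );
}

// Copy products from the HTTP API to a topic ("memcpy programming").
// fetchJson and publish are injected (assumed helpers); in real code they
// would wrap an HTTP client and an SNS client.
async function syncProducts({ fetchJson, publish }) {
  const products = await fetchJson("/api/products");
  if (!validateProducts(products)) {
    throw new Error("invalid products");
  }
  await publish("/products", JSON.stringify(products));
  return products.length;
}
```

Injecting the two side-effecting functions keeps the glue code trivial to test, which matters when most of it is tab-completed rather than typed.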

I think the implications of that fact stretch much further than "Well, now we have just replaced code monkeys." I think that being able to write tedious code at that speed closes so many feedback loops that it leads to emergent effects that change the way we build software.

Aside #1: Helping Copilot output valid code

Although Copilot is very eager to discover your codebase, it needs to see what style you like. It needs to see what APIs and helper methods you already have at your disposal, and which packages to import.

The solution, if you would like to tab-complete 90ish% of the code you intend to write? Visit a couple of files you want Copilot to "learn", or write one example of what it should generate. If you want Copilot to be fluent at using an obscure library, just go browse its source code, visit a few examples, then go back to frantically tab-complete your way to a working application.

The same technique works for ChatGPT. You want to have a decent output for ChatGPT? Just copy paste a ton of example code upfront. Copy paste your class definitions, the comments to your class definition, maybe some DDL, some example CSV. Paste it often, paste it every time it diverges. Take its code, correct it, paste it back. Paste the documentation page. Paste the entire StackOverflow thread. Paste paste paste, context context context is what is needed here.
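The "paste paste paste" routine is easy to script. The following is a minimal sketch, assuming a made-up buildPrompt helper and an arbitrary "###" section layout (neither comes from a real tool): it glues labeled context snippets in front of the actual task.

```javascript
// Build one prompt from a task description plus labeled context snippets
// (class definitions, DDL, example CSV, documentation excerpts...).
// The "### label" layout is an arbitrary choice for this sketch.
function buildPrompt(task, snippets) {
  const context = snippets
    .map(({ label, text }) => `### ${label}\n${text.trim()}`)
    .join("\n\n");
  return `${context}\n\n### Task\n${task.trim()}\n`;
}

// Hypothetical usage: the DDL and the CSV are invented examples.
const prompt = buildPrompt("Write a PHP class that parses this CSV into Product rows.", [
  { label: "DDL", text: "CREATE TABLE products (id INT, name TEXT);" },
  { label: "Example CSV", text: "id,name\n1,Widget\n" },
]);
console.log(prompt);
```

Putting the context first and the task last is just one convention; the point is that the same snippets can be pasted again, unchanged, every time the conversation diverges.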

Methodology shift #1: writing documentation

This is the first change in methodology we have encountered. More than API documentation for humans, we need to write code (and tools!) that make the API discoverable and "understandable" for LLMs. In most cases, the two go hand in hand. Writing clear and concise comments is what gives the LLM the context it has seen in its training set, thus helping it to infer the correct answer. Clear and concise APIs, with meaningful names, allow us to efficiently (the fewer tokens the better) convey our intent to our tools.

I think we will start seeing practices that will expose documentation as both human readable, and easily "machine parseable" (short, concise, with few-shot examples for concrete tasks). Instead of formal languages such as UML, ODL, SOAP, Swagger, JSON Schema, we will go back to simple, no-nonsense READMEs that give a short overview and a couple of usage examples. This will work not because simple is better (there is a reason we keep reinventing the wheel with these things). It works because the README encodes our "human" intent very well, and the source code embodies what we intend the system to do in excruciating detail, and an LLM can combine the two to generate either more "formal" code for the machine to interpret, or less detailed text for humans to interpret.

I don't think we have realized that the most effective way to communicate with humans is now also an efficient way to communicate with machines.

This of course works both ways. Large Language Models are uncannily effective at transforming crappy comments into well-written, articulate paragraphs. They can generate 5 interesting examples in the time it takes me to say "please." They can update the existing documentation to match the updated API after refactoring in a matter of seconds (and it is only cumbersome because it currently requires copy-pasting to and from ChatGPT). (Copilot Labs is experimenting with "brushes", but since I use IntelliJ and not VS Code, I haven't really used them much.)

They can update code to match documentation changes in the same way, or generate valid code out of concise, well-written documentation. Heck, you can literally curl a few endpoints in your terminal, paste the entire thing without any editing into ChatGPT. You can then ask it to create an API library with mocks, unit tests, examples and documentation. This will more often than not output something better than I would have written.

I don't bother manually writing any of this data munching / API wrapping / result validating code anymore. I had to build a server-to-server integration with Google Tag Manager recently. I literally copy-pasted the webpage into a simple 3-line prompt and can now generate PHP classes, TypeScript interfaces, event log parsers, and SQL serialization with a simple shell command.

What do we do now that well-written documentation is both fundamental and almost free? Do we become writers? Do we become editors?

Personally, I think so. There is no excuse to not write stellar documentation (or tab-complete your way to it). Documentation style and documentation quality will become just as automatable and lintable as how many spaces we put in front of our curly braces, and code comments will never be out of date again (why wouldn't your IDE flag documentation that doesn't match the behaviour of the code?). [6]

And if we become editors, does that mean that learning to work as a programmer is now, from the get go, about learning to read, criticize and correct code? These are the skills that we have painfully turned into practice over the last 30 or 40 years since the explosion of professional software engineering, and are often reserved to the "senior" caste, juniors being busy banging their heads against the compiler or something. But now being a junior means de facto becoming a critical reader. Code reviewing is the new programming. [7]

Methodology shift #2: Whiteboarding and rubberducking

One thing using LLMs teaches you is that software architecture is about pattern matching, and those patterns are fairly simple. The thing is that by default, ChatGPT for example is akin to a "design interview bullshitter." It will confidently use a lot of clever words, draw the right diagrams on the whiteboard, and be utterly incompetent at making an actually worthwhile point. Asking ChatGPT how to design a certain application will result in eerily similar heaps of teflon-coated, reality-proof platitudes (ChatGPT4 already sets the bar much higher here...).

It is easy to say the right things, it is easy to look up what an effective event-driven architecture looks like, but it is much more difficult to figure out what exactly needs to be done, what is easy, what is difficult, what works well, and what fails in a real-world scenario. But, using the technique shown above, once you start asking ChatGPT how to build out a concrete application by sketching out potential APIs, fleshing out an infrastructure, deciding which protocols to use, you often get a lot of plausible looking, concrete code.

I find generating "plausible" code useful on its own. I don't need to trust the code to be correct: the overall structure and vibe gives me a sense of how this thing will work, what is problematic, and what is clever. It has a tremendous amount of quality code to rely upon to find interesting patterns and well-named classes; I can guide it when I think of something. I basically have a pretty "intelligent" rubber duck at my disposal. [8] Brainstorming with ChatGPT feels very much like sitting in front of the whiteboard with a colleague and just imagining things, except that you often end up with reasonably close to working code by the end. While I have never done a "three-way" whiteboarding/rubberduck session with a human and ChatGPT, I think that this will become a regular practice for some.

Methodology shift #3: Build more prototypes

ChatGPT is great for generating prototypes, big and small. Ask it about a topic and it will not only answer, but usually provide a fully runnable example in the language of your choice. It might or might not execute, but that's a heck of a lot of really boring nonsense that I don't have to type. Once the LLM has typed out a toy example, I can remodel that simple example into many more things:

LLMs lower the cost of exploration to almost nil.

I recently wanted to write a plugin for OBS that would stop the recording if I didn't close a modal within 1 minute. I had never programmed OBS before, but in three hours I was able to do the following:

I was able to try two dead-ends (through no fault of the LLM's own, honestly) and end up with a robust running tool that I will continue expanding. I hate writing UIs, I hate fighting obscure APIs I don't know. I used to only write tools when something was getting so humongously irritating that I just couldn't take it anymore. I have a problematic habit I am actively trying to fight: building personal tools as if they were meant for production at-scale (meaning: after 3 days doing "professional" software engineering, I burn out and the tool ends up in the ditch, full of promise yet unfinished.)

What this means for professional programming is that you can now write code, write a lot of code, write an insane amount of code, and just throw it away. No one will fault you for having a conversation with ChatGPT that generates 5000 lines of code, and then closing the tab. But, the fact is, you wrote 5000 lines of code and decided they were not worth it. When was the last time you did that?

What if it became standard practice to draw up a concise, well-written description of the problem at hand (see the previous section), and then ask the LLM for a microservice architecture sketch in Go, a synchronous multithreaded sketch in Rust, a TypeScript Deno version, potentially a lambda? What if you had it generate Terraform for AWS, but also for Azure and GCP? What if an architecture proposal now only gets to the review stage if you try at least A, B, C, D before settling on one, instead of endlessly debating with your colleagues about A vs B? We all know we have biases. We all hold strong opinions that can only be supported by evidence with an n of 1. I am usually more convinced when I see an actual code sketch.

We used to deride "writing code at the speed of mouth," but that is now a reality.

Methodology shift #4: Build more tools

Which brings me to the next practice that I think will be incredibly beneficial. We all know that tools are important, that effective tools are challenging to create, and that management doesn't care or understand the need for tools. LLMs allow us to build tools at the said "speed of mouth." I know that I can now spend 30 to 45 minutes talking to ChatGPT and get a pretty solid tool done. This would probably have taken me 4 to 5 h programming previously, which means that it would have to be split over 2-3 work days (factoring in meetings, code review, lunch breaks, interruptions). Which usually means that the tool won't get built.

A list of tools I built in the last 3 months:

I can't express how fundamentally different programming feels now that I can build 2 quality tools per day, for every single itch I want to scratch.

Aside #2: Fundamentals will probably be just as critical

I believe that abstraction is primarily learnt by having seen and played with enough concrete use cases that higher structures get formed. Abstractions are a two-edged sword, as an ill-fitting abstraction will cause constant friction. While LLMs allow us to work at a "fuzzy abstraction" level (the abstraction is not ready to be formalized, but it is forming and shaping the words used to describe a problem) as well as quickly explore concrete implementations, "controlling" an LLM is best done by having a solid understanding of the problem in mind.

One very real downside I have experienced with conversational LLMs is the temptation to keep chatting and hacking at a problem in the hopes that the model will "get it" at some point. This is exacerbated when you don't have a proper grasp of the problem you are trying to solve. Your focus shifts from having a productive conversation to stumbling over hallucinations that lead you on wild goose chases. In those cases, I found myself closing my internet connection altogether. Instead, I stuck with offline documentation and a book until I got a better sense of what I was dealing with.

With "real" knowledge in hand, asking the right questions of the LLM leads to a much more productive session. Anybody deriding "prompt engineering" as a ridiculous discipline hasn't spent enough time trying to write effective prompts.

Aside #3: Avoid unproductive chat sessions

I hope that future conversational agents will be able to flag when these unproductive spirals happen. Currently, ChatGPT will just continue to engage, but I could see it at some point stopping, and pointing to proper tutorials and resources. Another option would be to ask for more details. The progress that ChatGPT4 seems to have made (I have only used it seriously for 2 days, so I can't really form a proper opinion) is quite impressive. One thing this field has taught me is not to make assumptions about the capabilities of the next generation. It might have the same fundamental issues the current models have, but it might just become "good enough" that for all intents and purposes, it doesn't matter.

Methodology shift #5: Continuous code review

I think that a major focus of tooling is going to be on "continuous code review." A model can watch you build your software, infer your intents and the structure of your thinking, and provide feedback on your approach. It can flag typos, security mistakes, and non-idiomatic uses of APIs. I was impressed by ChatGPT4 correcting my code and doing some pretty effective factoring of the problem into interfaces and functions by itself. In fact, I would say that in the small, ChatGPT4 is a much better programmer than I am. It knows many more idioms, it doesn't forget security issues, and it can spit out unit tests at the speed of token sampling.

People who think that these models will lead to a proliferation of bottom of the barrel stackoverflow code riddled with security mistakes are missing how quickly these models have become better at what they do. I think it is because it is easy to forget just how much good code can now be found online. Great code is bound to be more widely disseminated in its training corpus, and said corpus is most certainly closely scrutinized and tweaked. The jump from ChatGPT3.5 to ChatGPT4 in terms of software architecture "rhetoric" makes that abundantly clear.

I use a series of prompts that ask the model to give its feedback regarding security issues, edge cases I've missed, and documentation that is unclear. As of now, I have to manually copy and paste this into ChatGPT, provide missing context, and refine its answers. This is however a matter of engineering. The model itself already does an impressive job. This is infinitely better than most code reviews I've received over my career, and it is instantaneous, and comes with examples to reproduce the issues found and fix suggestions.

It is now basically easier to write complex, fuzzy linters that can check for domain-specific, codebase-specific patterns, because such linters consist of a couple of (well-informed) natural-language prompts. It is probably faster to instruct the LLM-linter to "check that singletons are only used when dealing with customer data" than to properly configure curly-brace behaviour in editorconfig. It won't catch every case, but it will catch enough to be worth its while.
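As a rough illustration of such an LLM-linter, here is a minimal sketch. Everything in it is an assumption made for the example: the answer format, the second rule, and askModel, a hypothetical async function that takes a prompt and returns the model's text (it would wrap whatever LLM API you use).

```javascript
// Natural-language lint rules. The first is the example from the text,
// the second is made up for illustration.
const rules = [
  "Check that singletons are only used when dealing with customer data.",
  "Check that every public function has a doc comment.",
];

// Turn the rules and a source file into one review prompt.
function buildLintPrompt(rules, filename, source) {
  const numbered = rules.map((r, i) => `${i + 1}. ${r}`).join("\n");
  return [
    "Review the file below against these rules.",
    "Answer with one line per violation in the form",
    "<rule number>: <line number>: <explanation>, or with OK if there are none.",
    "",
    numbered,
    "",
    `File: ${filename}`,
    source,
  ].join("\n");
}

// Parse the model's answer into findings; lines that do not match are ignored.
function parseFindings(answer) {
  return answer
    .split("\n")
    .map((line) => line.trim().match(/^(\d+):\s*(\d+):\s*(.+)$/))
    .filter(Boolean)
    .map(([, rule, line, message]) => ({ rule: Number(rule), line: Number(line), message }));
}

// askModel is a hypothetical (prompt) => Promise<string>, injected by the caller.
async function lintFile(askModel, filename, source) {
  const answer = await askModel(buildLintPrompt(rules, filename, source));
  return parseFindings(answer);
}
```

Since the model's answer is free text, the parser is deliberately forgiving: anything it cannot read is dropped rather than treated as an error, which fits a linter that is expected to be fuzzy anyway.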

Methodology shift #6: Cognitive impact of using LLMs

This might be the most subtle shift, but I believe it's also the most profound change that using LLMs has brought to me. After a day of being focused on getting "the tedious stuff" right (using the right API, checking the right errors, properly calling API X, implementing the unit tests and mocks for Y, writing an API wrapper for Z), my brain would be entirely gone, consumed by the minutiae and intensity required to not slip up. I would spend the evening playing video games or watching Netflix to recover.

Since extensively using Copilot and ChatGPT, this cognitive exhaustion is pretty much gone. 6pm strikes and I feel like I spent the day chatting with a buddy, yet 5 PRs have been merged, the unit tests have been written, two tools have been improved and the code has shipped.

This allows me to make significant progress on my open-source projects. I know that I'll be able to get one non-trivial thing done before dinner, and maybe one or two more after. Where I would previously spend 3 h on Saturday trying to get an AWS lambda running, wondering "why the hell am I spending my Saturday this way", I will now happily close one or two tickets, and spend the rest of the day with my family and doing chores.

Doing more "high-level" thinking

I am a firm believer in letting things rest, and doing "shower-thought-driven" software engineering. I believe it is necessary to think deeply about something, try to build a few prototypes, and then let it settle. The real insights (and bug solving) happen when taking a walk, when my brain is well-rested, after exercising or in the proverbial shower. Whereas I would previously maybe be able to get 30 minutes to 1h of this "free" time per day, if at all, I now have 3 to 4h of "free" thinking time per day, because I can fold the tedious stuff that used to take 4h into 1 or 2h (I am averaging here, but writing something like an API wrapper legitimately costs me 10 minutes instead of 2 days nowadays).

This means that I can think about what needs to be done and what doesn't. It means I can spend some time trying out alternative approaches (build more prototypes, as detailed above). I can spend more time talking to stakeholders and figuring out exactly what we need. I can think about the tools that the team itself needs. I can work on pedagogical material. I can make reports pretty. It is not that these things couldn't be done before, but they have become so insanely cheap that there is no good reason not to do them.


I wish more senior developers had spent the time to properly assess these technologies with an open mind, instead of reacting defensively with knee-jerk reactions. These technologies are scary, because they make it clear that programmers are going to be the first in line to be replaced by machines.

All the "improvements" listed above are going to be exploited by the way our industry is structured (corporations are already AIs, in some way) to squeeze even more soul-crushing productivity to the benefit of a very few individuals. We have an opportunity right now to reckon with how we deal with these emerging powers. I am trying to work out which direction I want to take, besides writing about what I think is valuable engineering insight: taking these technologies out of the moat of Silicon Valley companies; using the "improved" productivity before companies catch up to figure out how to organize; leveraging LLMs to build better open-source; using these technologies to build tools for people, not business.

  1. Furthermore, while I might slip and anthropomorphize LLMs, I do so in the way that I would say that "the compiler thinks." As far as I am concerned, LLMs are relatively straightforward pieces of code trained on a shitton of data, and ChatGPT is not more alive or reasoning or feeling than /bin/ls and my S3 bucket of production logs. It doesn't know, it doesn't remember, it has no intent, it has no interactions with the world. I want us to talk about LLMs as what they are: probabilistic models of language that predict the next token based on a given context, after having been trained on exactly that. The fact that this formulation has led to such things as ChatGPT blows my mind, but I'm not interested in discussing irrefutable speculation. I just want to write code, man...

  2. There's a whole thing about programming vs software engineering that I never fully understood. Surely software engineering is about building good software, and programming is, well, building good software too. When I draw diagrams I am programming. When I write documents I am programming. When I talk with colleagues in front of a whiteboard we are programming. When I read and write a lot of software, mindfully, I do a lot of software engineering, because I want my software to work today, tomorrow, in ten years (which means measuring, testing, benchmarking, breaking, stress-testing, documenting). I want it to work even when 30 people work on it under the pressure of deadlines (which means designing, refactoring, testing, building development workflows, writing documentation, communicating, understanding team structures and business goals and legacy and individual cognitive styles). I call it programming because the only tangible result that we can actually all agree on is the resulting artifacts (source code and documentation). I love working on and reading legacy codebases, and you can read most of the "more abstract" things in the source code itself (on the easier side: team dysfunction, badly aligned goals, insufficient communication; on the harder side: good onboarding, individual contributor growth, great business alignment). That's why I tend to bring up coding so much. You can talk beautifully about all kinds of concepts, but what really matters is the code that comes out. To me, good code is synonymous with good engineering (good code can be: code that solves a business problem, where often no code is actually the best code; code that is elegant (some programmers like making elegant things, most like using elegant things); code that is fun; etc...).

  3. If this comes across as braggy, one thing I learned over the last year after discovering I'm autistic is to not care too much about how I come across. I think I am a pretty terrible, careless coder; I love sharing everything I know: I want everybody to find the love I do when I use computers. I am disappointed that I will never get to be a jack of all trades, because there is just too much cool stuff out there. I will leave mastery to people who have a more focused curiosity.

  4. This seems to be another loaded term where people fight over definitions. I am grouping embedded, some web, distributed, operating systems and some database development under the systems programming umbrella. It seems like a very broad umbrella, but from a programming perspective, I see it as the programming of queues, resource ownership, resource initialization and teardown, concurrency, locking; the programming of protocols, data serialization, storage, bandwidth, throughput, latency; the programming of state machines, coroutines, schedulers, threads, drivers and APIs. When we define "systems programming" from the programming perspective (what code we write), we discover many parallels which are "generative," i.e. they generate insight and ideas through the use of code. This is in contrast to defining "systems programming" as say, the programming of operating systems. This shuts down insight. It causes generations of web developers to reinvent the wheel each time they want to manage minified assets instead of using a resource pool. It leads generations of embedded developers to reject better tooling and effective testing and deployment practices. It leads to a smorgasbord of poorly designed embedded UIs and slow web applications and unpleasant concurrency abstractions and operating systems that think everything is an integer.

  5. You might wonder why I group Common Lisp along with PHP. What makes Common Lisp special as a language is that it allows you to build your own language whenever you need to, however you need to. You can refine it interactively. If your goal is to write code with a clear structure and explicit context, Common Lisp is an excellent tool. Instead of shoe-horning your communication into a ready-made, rigid framework, you build the structure to fit your communication. As we will see, this is what allows LLMs to do a good job. Also, take a look at the Common Lisp spec and tell me it's not... clunky with a flaky paintjob.

  6. You might realize by browsing geppetto's repository that it's just as easy to... not write documentation. Entirely free it is not, that is for sure. Another reason here is that I don't want people using this tool just yet. I have put significantly more effort into glazed's documentation, including building an entire HelpSystem inspired by Mathematica's stellar documentation.

  7. I think people worried about junior programmers becoming obsolete should rather be worried about junior programmers replacing seniors. It's easy to believe that your mind stays younger as you grow older. However, hanging around Discord with passionate 15-year-olds quickly shows me how calcified I have become. I might think that my elegant take on frontend frameworks and component-based CSS, my ideally crafted git workflow and observability practices are the embodiment of craft. In the meantime, the kids think I could just as well write COBOL while they merge their 30th pull request of the day. This will enable their fully automated CI/CD to deploy the newly created artifacts. Every random repository that gets shared on TikTok gets 300 (3000? 30000?) stars within a few days. While me writing open source back in 1998 meant maybe getting a single patch merged because I knew someone on IRC who cared, today, it means growing up fast AF and becoming an agile tech lead quickly.

  8. As someone who needs to ramble and ramble and ramble at somebody in order to clarify my thoughts, this is genuine magic. One downside is that the tool's "personality" changes over time. It latches onto novel concepts, and heavily redacts or tones down more opinionated statements, often in response to prompting attacks and in order to reduce the amount of hallucinations and other side-effects. This is the inverse of what you want when brainstorming, where far-out ideas are what spark joy. As a work-around, I often prompt the worst models and paste their confabulations into the more reasonable big brother, in order to disrupt the process.