LLMs: a paradigm shift for the pragmatic programmer
I love thinking hard about architecture and then writing code that gets shit done. If there is a problem, there is a way to solve it, and having effective tools and knowing how to use them is what gets you there. Being a "scholar" and a "practitioner" go hand in hand, because our field is young and moves at the speed of the internet. Different paradigms shape your way of framing a problem, and not becoming too attached to any one of them is the key to using them effectively.
Most importantly, having a wide array of tools at your disposal is what allows you to be pragmatic. And pragmatically, large language models represent the biggest paradigm shift in programming that I have personally experienced. It feels an order of magnitude more life-changing than discovering Common Lisp, and I'm only 3 months into using these things intensely.
In this series of blog posts, I will present a high-level overview of the different areas of my practice that I found to be deeply affected by this evolving technology. I won't go into concrete details, and plan to dedicate a full post to each technique presented here, with copious examples and exercises.
In this first post, I will cover:
- why knowing the fundamentals of software development is key
- how language models can be used to do "machine-augmented" test-driven development
- how language models are game-changers for fuzz-testing
Future sections will cover:
- debugging, monitoring and observability
- building tools and user-centered development
- changes to software architecture
- writing documentation and technical strategy
- searching for documentation and learning
- personal productivity and cognitive impact
Fundamentals are key
I think that using large language models needs to be approached as a mix of software development workflow, tooling and architectural paradigm.
To quickly and naively recap: a language model is something that models the probability of a token given a certain context. In simpler words, a language model has been trained to "predict" what the most probable word is given a certain amount of other text, based on a metric (usually, match the token appearing in a gargantuan amount of data scraped from the internet, further refined using different techniques).
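To make the "predict the next token from context" idea concrete, here is a toy bigram model. Counting word pairs is a crude stand-in for what a neural model learns, but the quantity computed at the end, P(token | context), is the same one a real language model approximates:

```python
from collections import Counter, defaultdict

def train_bigram(corpus: str):
    """Count which token follows which: a crude stand-in for training."""
    counts = defaultdict(Counter)
    tokens = corpus.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_token_probs(counts, context: str):
    """P(token | context): the distribution a language model approximates."""
    followers = counts[context]
    total = sum(followers.values())
    return {tok: n / total for tok, n in followers.items()}

counts = train_bigram("the cat sat on the mat and the cat ran")
# In this corpus, "the" is followed by "cat" twice and "mat" once,
# so the model assigns "cat" a probability of 2/3 after "the".
```

A real LLM conditions on thousands of tokens of context rather than a single word, and uses a neural network rather than a lookup table, but the contract is the same: context in, next-token distribution out.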
There is a tremendous amount of research in this field, with new approaches and techniques coming out every day. Feel free to delve into the details... I personally don't have much practical experience doing deeper work, but I have a reasonable grasp of the fundamentals of deep learning, and try to read 2 or 3 papers a week.
The large language models we have these days (GPT-3 and ChatGPT being the ones I use personally) are called "foundation" models. They have been trained on a pretty indiscriminate corpus, have an absolutely mind-blowing number of parameters, and suffer from ludicrous training costs.[1] Large language models seem to exhibit emergent effects[2] as they grow in parameter count. This allows them to function as "in-context learners",[3] which means that they can be further "trained" by inputting a short prompt. They can be used to write Terraform just as they can be used to create comedy skits or translate from French to Farsi. It's all in the prompt.
Concretely, this means that the context you provide to the large language model before phrasing your actual request is of extreme importance. This field is called prompt engineering and, while often derided, it is fascinating and the key to wielding these tools effectively for daily development. Tools like GitHub Copilot do a tremendous amount of context work without using human language, and developers need to learn to do the same.
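As one concrete context-setting technique, here is a minimal few-shot prompt builder. The `Input:`/`Output:` shape is just one common convention, not something the models require; the point is that the worked examples establish a pattern the model then continues:

```python
def build_prompt(examples, query):
    """Assemble a few-shot prompt: worked (input, output) examples set the
    pattern, then the query is phrased in exactly the same shape so the
    model completes it the same way."""
    shots = "\n\n".join(f"Input: {inp}\nOutput: {out}" for inp, out in examples)
    return f"{shots}\n\nInput: {query}\nOutput:"
```

Calling `build_prompt([("chat", "cat"), ("chien", "dog")], "cheval")` yields a prompt ending in `Input: cheval\nOutput:`, which nudges the model toward producing another French-to-English translation rather than, say, a definition.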
A consequence of this probabilistic machine learning approach, compared to, say, a symbolic approach, is that large language models by definition "make plausible shit up." The emergent effects that seem akin to understanding (the finer points of that research and debate are way out of my league, and not that interesting to me as a practitioner) break down fairly quickly.
This means that these models will (more often than not) output code that looks plausible but doesn't work: code that mixes multiple patterns, or uses APIs that don't exist (or exist in another language).
Furthermore, these models are usually autoregressive, which means that they take their own output into consideration when predicting further tokens—this means that once they start veering off-script, they will continue doing so if not corrected. A conversational UX as offered by ChatGPT is a tremendous asset. However, a more straightforward use of the GPT3 API requires considerable attention to the prompt to avoid oblique answers.
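A toy greedy decoding loop makes this feedback visible. The lookup table below is a stand-in for a trained network; the structural point is that each predicted token is fed back in as context, so an early wrong turn keeps steering all later output:

```python
# A lookup table stands in for a trained network: it maps the most
# recent token to a next-token probability distribution.
TOY_MODEL = {
    "I": {"like": 0.9, "hate": 0.1},
    "like": {"code": 1.0},
    "hate": {"bugs": 1.0},
    "code": {".": 1.0},
}

def generate(context, steps):
    """Greedy autoregressive decoding: each predicted token is appended to
    the running context and conditions the next prediction."""
    out = list(context)
    for _ in range(steps):
        probs = TOY_MODEL.get(out[-1], {"<eos>": 1.0})
        out.append(max(probs, key=probs.get))  # pick the most likely token
    return out

# generate(["I"], 3) follows the chain I -> like -> code -> .
```

If the first step had sampled "hate" instead of "like", every subsequent token would be conditioned on that choice; that is the off-script drift the conversational correction loop of ChatGPT helps rein in.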
This means that a programmer cannot just take the code generated by an LLM and expect it to work—careful review is needed. In fact, I found code generated by an LLM to be much more difficult to review than code written by a human. It will look highly plausible yet be off in strange, alien ways. A human would never submit code for review that uses APIs that don't exist, or that assumes that just because a named SQL parameter style uses "%s" as its syntax, it will behave like Python string formatting.
I had the best results when asking the LLM to generate code in a domain that I know extremely well, because I am able to spot subtle mistakes at a glance. When asked to generate code in domains I know less well, I enter what I call a "hack at it" loop. The model is very adept at promising that "just one more prompt" will make the code work, encouraging you to hack at some misguided piece of code without actually looking up proper information in references or tutorials written by humans.
This means that fundamentals continue to be absolutely crucial, if not more critical than they already were. I found my use of LLMs to be most efficient in a domain I know extremely well, not just at the level of "this code will work or not," but in a way where I can intuit what the LLM will "understand" and what will derail it. Fundamentals are what allow you to write the complex and critical piece of code yourself while leaving the tedious parts to the model; they are what allow you to redirect the model to the right pattern ("use a fluent constructor API", "use a map reduce", "write a factory with Spring-style dependency injection", "use a WITH query"); they are what allow you to know which context to give the LLM, which piece of code to throw at it, which files to visit with Copilot, which comments to write to get a proper answer.
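To illustrate what "redirect the model to the right pattern" means in practice, here is a minimal sketch of the first pattern named above, a fluent constructor API. All names are illustrative; the shape (each method returns `self` so calls chain) is what you would name in the prompt:

```python
class QueryBuilder:
    """Minimal fluent-constructor sketch: each method returns self,
    so calls chain into a single readable expression."""

    def __init__(self):
        self._table = None
        self._wheres = []

    def table(self, name):
        self._table = name
        return self

    def where(self, condition):
        self._wheres.append(condition)
        return self

    def build(self):
        sql = f"SELECT * FROM {self._table}"
        if self._wheres:
            sql += " WHERE " + " AND ".join(self._wheres)
        return sql
```

Knowing the pattern by name lets a five-word instruction ("use a fluent constructor API") replace a paragraph of description, and lets you spot immediately when the model's output deviates from it.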
Testing and observability
Many articles have been written about Copilot and ChatGPT being the harbingers of low-quality, copy-pasted and error-riddled code, mindlessly inserted into the critical sections of our pristine codebases. I found that heavy use of LLMs actually improved the quality of my code, by shifting their use from generating "problem solving" code to generating "solution verification" code. Language models are extremely adept at repeating code patterns they are prompted with, and at applying them against the immense amount of code they have seen online.
Machine-augmented Test-Driven Development
While we might picture ourselves as brilliant scientists and engineers, building elegant solutions to unique problems, most of the code we produce has been written a million times before. This is especially true of unit tests and the edge cases they cover. Using an LLM, I am now not only able to write an order of magnitude more tests than before, but the model often reminds me of edge cases I hadn't considered.
Here is a workflow that I call "machine-augmented TDD" that I found works extremely well.
- Think hard about the problem and develop the first sketch of a valid API. Don't write any code yet; an interface is enough
- Paste the interface into ChatGPT and ask for a list of edge cases, downsides, upsides. Rewrite the API if necessary
- Paste the suggested edge cases back into your editor, and write out a unit test or two by hand, so that the LLM can learn your style (which unit testing library, scaffolding patterns, documentation and naming style, etc...)
- Documentation is free, and ChatGPT is proficient at writing clear, if a bit stiff, documentation, so write some! I found that many unit tests suffer from poorly written documentation. This makes it really challenging to come back to a test suite and understand which edge cases were being covered
- Autocomplete as many unit tests as seem necessary. You can usually write at the speed of tab-complete, so no reason to be conservative
- If the boilerplate becomes overwhelming, think about how the tests can be refactored, write the API necessary for the refactor (for example, `createTable(columns, rows)`), paste all the unit tests back into ChatGPT and ask it to refactor them (literally, "refactor using this interface")
- Ask it to write the actual `createTable` function. LLMs are highly effective at generating test data, so again, don't skimp on how clever you can be
- Take a break, go on a little walk. When you come back, turn off ChatGPT, and sit with your code for a moment. Is everything correct? Are there more tests to be written? Would your colleagues approve of this? Should we actually use a different interface altogether? Is the documentation adequate? Using an LLM frees up time to actually do some big picture thinking, use it!
- Once you are satisfied with your test suite, implement the actual function
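To ground the steps above, here is what the hypothetical `createTable` helper from the list and a style-setting hand-written test might look like (rendered in Python as `create_table`; names, library choices and the assertion style are all illustrative):

```python
def create_table(columns, rows):
    """Render a header and data rows as an aligned plain-text table,
    padding every cell to the widest value in its column."""
    table = [columns] + rows
    widths = [max(len(str(row[i])) for row in table) for i in range(len(columns))]
    return "\n".join(
        " | ".join(str(cell).ljust(w) for cell, w in zip(row, widths))
        for row in table
    )

# One hand-written test establishes naming, scaffolding and assertion
# style, so that autocompleted tests imitate it.
def test_create_table_pads_cells_to_widest_value():
    out = create_table(["name", "id"], [["ada", "1"], ["grace", "42"]])
    assert out.splitlines() == ["name  | id", "ada   | 1 ", "grace | 42"]
```

From this seed, the tab-complete loop tends to keep the same test-name phrasing and literal-table assertions, which is exactly the style transfer the workflow relies on.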
Here is an example where I generated a test scaffold for a function that converts an integer column index to an Excel-style alphabetic column index. I did switch from 1-indexed to 0-indexed midway without realizing, so that threw the whole thing off a bit, but I'm letting it stay as is, as an example of what happens when things are a bit misaligned.
(I am not able to embed the iframe here on bearblog).
VIDEO: Machine-Augmented TDD with ChatGPT and Copilot
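For reference, a minimal hand-written sketch of that conversion, assuming the 0-indexed convention (0 → "A", 25 → "Z", 26 → "AA"). The subtlety the tests need to pin down is that this is not plain base-26: there is no zero digit, hence the decrement on the quotient:

```python
def column_index_to_alpha(index: int) -> str:
    """Convert a 0-based column index to an Excel-style column name."""
    if index < 0:
        raise ValueError("index must be non-negative")
    letters = []
    while True:
        index, rem = divmod(index, 26)
        letters.append(chr(ord("A") + rem))  # least significant letter first
        if index == 0:
            break
        index -= 1  # bijective base-26: no zero digit, so shift the quotient
    return "".join(reversed(letters))
```

The off-by-one around "Z" → "AA" (and "ZZ" → "AAA") is exactly the kind of edge case the LLM will happily enumerate test inputs for, and exactly where the 1-indexed/0-indexed mixup in the video bites.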
Generating fuzzing scaffolds
Fuzzing is a technique that I have only recently started applying, but have found applicable to a wide variety of problems. It can of course be used to identify security issues, but with proper fuzzing harnesses, fuzzing can be used to do model checking of state machines, check for data quality problems, find concurrency issues and much more.
Fuzzers are most effective when driven through proper harnesses. A harness is often a wrapper layer that drives the logic you want to test with a grammar that can be easily mutated through common fuzzing heuristics. You can also provide and configure custom heuristics, which can allow the fuzzing engine to explore the more problematic parts of your code more quickly (there is some black magic to this, and sticking with standard heuristics is often effective enough).
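A hand-rolled sketch of the idea, with Python's `random` standing in for a real fuzzing engine (which would mutate inputs using coverage feedback rather than sample blindly). The run-length codec is just an illustrative target; the harness is the loop that generates inputs and checks an invariant, here the encode/decode round-trip:

```python
import random

def rle_encode(s):
    """Run-length encode a string into (char, count) pairs."""
    pairs = []
    for ch in s:
        if pairs and pairs[-1][0] == ch:
            pairs[-1] = (ch, pairs[-1][1] + 1)
        else:
            pairs.append((ch, 1))
    return pairs

def rle_decode(pairs):
    """Invert rle_encode."""
    return "".join(ch * count for ch, count in pairs)

def harness(seed, iterations=500):
    """Generate random inputs from a tiny 'grammar' (strings over {a, b})
    and check the round-trip invariant on each one."""
    rng = random.Random(seed)
    for _ in range(iterations):
        s = "".join(rng.choice("ab") for _ in range(rng.randrange(20)))
        assert rle_decode(rle_encode(s)) == s, f"round-trip failed on {s!r}"
```

A real engine such as libFuzzer or AFL would own the input-generation half of this loop; the part the LLM is good at drafting, and the part that churns as the code evolves, is the wrapper that maps raw fuzzer bytes onto your API and states the invariant.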
Writing these harnesses is tedious but very structured. Furthermore, they need to be adapted every time your code changes, which tends to be more expensive than just refactoring unit tests.
One strength of large language models is their ability to translate between semi-structured representations. They are able to recognize patterns of abstraction in the input and transform them into a different representation, sometimes climbing up or down the abstraction ladder. This makes them extremely effective at working with grammars, code generation and semi-formal languages. I find this code sometimes fun, but often very taxing to write: walking ASTs, getting a parser to behave correctly, and especially maintaining this code later on is time-consuming and error-prone. Ad-hoc DSLs typically get reified and are not modified with the agility necessary to leverage fuzzing during exploratory programming.
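As a small taste of the AST-walking code in question, here is a sketch using Python's standard `ast` module. It is trivial on purpose; real harness code does dozens of such traversals, each with its own node types and edge cases, which is what makes it so tedious to maintain by hand:

```python
import ast

def function_names(source: str):
    """Parse Python source and walk its AST, collecting the names of all
    top-level and nested function definitions in traversal order."""
    tree = ast.parse(source)
    return [
        node.name
        for node in ast.walk(tree)
        if isinstance(node, ast.FunctionDef)
    ]
```

An LLM can draft this kind of traversal, or rewrite it when the target grammar changes, at the "speed of mouth" described below, which is what keeps an ad-hoc DSL malleable enough to fuzz during exploration.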
Being able to rewrite a fuzzing scaffold at the "speed of mouth" is a life-changing experience. The workflow itself is very similar to the TDD workflow presented in the previous section.
Notes on the videos
These videos are obviously toy examples and were recorded pretty quickly. Large language models make it very tempting to work at full speed and get a full unit test suite and fuzz test suite done in 20 minutes. However, it is easy to be fooled by the ease of chatting with the machine. What I recommend is using the time saved by not typing to take a short break. Then come back, turn off ChatGPT and take a close look at the code itself. I will elaborate later on the cognitive shifts I had to make to work effectively with these tools.
What's coming next
I pictured this article as a quick overview of what makes LLMs so incredibly effective in my daily workflow. It turns out that even with just a single example and a quick write-up of the individual points I want to present, this exposition will be a whole series on its own. After covering the additional topics from the introduction, I plan on doing an in-depth analysis of each technique, using different languages, presenting different strategies, showing the ugly side when possible and providing a copious list of exercises.
See you then!
[1] I won't do any criticism of LLMs in this article, but there is a lot to be said here. I wrote an initial post thinking through my ethics around large language models and plan to write many more.↩
[2] Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., Chi, E. H., Hashimoto, T., Vinyals, O., Liang, P., Dean, J., & Fedus, W. (2022). Emergent Abilities of Large Language Models. Transactions on Machine Learning Research. ISSN: 2835-8856. OpenReview↩
[3] Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., & Amodei, D. (2020). Language Models are Few-Shot Learners. arXiv. https://doi.org/10.48550/ARXIV.2005.14165↩