Summary
- Code quality is likely still important, but is automatable.
- High level software design remains human territory with AI assists.
- Focus shifts away from code to domain models and specifications.
- Specifications could replace code as the single source of truth.
- Rigour and engineering practices remain crucial.
- Hedge against AI’s risks by building systems that can revert to human engineering.
This week, I participated in Thoughtworks’ Future of Software Engineering retreat in Engelberg, Switzerland. It was an invite-only event with about 60 industry leaders, hosted by Martin Fowler. The event used an open space format, where all sessions took the form of group debates, mostly focused on agentic development. Here are some of my observations on the changing role of software design.
The retreat’s participants
The participants had widely varied backgrounds, and their use of agentic development was equally varied. Some were using it for small projects and proofs of concept, others for parts of their production environment, and a few were using it on mission-critical systems and highly regulated environments.
The participants’ styles ranged from humans programming with small changes delegated to the LLM, through LLM pair programming, automated generation with human verification, and automated generation with automated verification, to full dark factory pipelines. (In manufacturing, a dark factory is a fully automated facility where humans are not allowed.) In the dark factory style, human oversight is almost exclusively limited to improving the quality of the pipelines. Regardless of style, many agreed on the value of harness engineering as a way to guarantee outcomes.
Low level code quality
A recurring question was how much we should still care about software design at the granularity of code. These are concerns such as naming, code duplication, dead code, mixing responsibilities, modularity, coupling, and cohesion. I can roughly group the participants’ opinions into two opposing arguments:
- These concerns exist for human readability and understandability of the code, and impact the cost of change. Agents significantly lower the cost of change, and have no need for such design principles.
- Agents benefit from well-structured and well-designed code.
The people in the second camp offered some compelling arguments.
- LLMs have been trained on human languages, and have some limitations that are similar to human programmers. Clarity matters as much to LLMs as to us.
- As with humans, code quality doesn’t matter much on short-lived projects, but has a huge impact on the long term maintainability and changeability of the code. Camp 1 is therefore not looking far enough ahead.
- LLMs’ context windows are limited. Accidental complexity in the code increases the amount of context the LLM needs.
- Complexity increases the token cost.
- Humans may need to debug the code when the LLM can’t. The humans will be unfamiliar with the code, so good design will be crucial.
- And perhaps the most interesting observation: LLMs reinforce bad code. They mimic the style and patterns of the existing code when generating new code.
I want to make a speculative side argument here: As we improve the quality of generated code, the number of production issues should go down in relative terms. However, the amount of production code will increase exponentially, so the issues will still go up in absolute terms. The issues that are easily solvable could be solved with agents. What’s left for the humans will be the most complex issues with the most dramatic outcomes. (See Bainbridge’s Ironies of Automation.)
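The relative-versus-absolute point can be made concrete with a few lines of arithmetic. This is a minimal sketch with purely illustrative numbers (the function name, the code sizes, and the defect rates are all assumptions, not measurements):

```python
def production_issues(lines_of_code: int, issues_per_kloc: float) -> float:
    """Absolute number of issues for a codebase of a given size and defect rate."""
    return lines_of_code / 1000 * issues_per_kloc


# Purely illustrative numbers, not measurements:
# the defect rate halves while the amount of code grows tenfold.
before = production_issues(lines_of_code=1_000_000, issues_per_kloc=2.0)
after = production_issues(lines_of_code=10_000_000, issues_per_kloc=1.0)

print(before, after)  # prints: 2000.0 10000.0
```

In this made-up scenario the rate of issues is cut in half, yet there are five times as many issues to deal with.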
So my money is on code design quality:
- If camp 1 is wrong, they will have to catch up on quality, and will need to spend a lot to repair their existing code.
- If camp 2 is wrong, they can easily remove their quality infrastructure at low cost.
- If camp 1 is right, they will have saved money but won’t have anything else in terms of competitive advantage compared to camp 2.
- If camp 2 is right, they will save money and they will have a major advantage.
Achieving code quality
The good news is that people seemed to feel that using agents to improve code quality was working well. The following setup seems to work particularly well:
- Generate code in one pipeline, and verify only whether it works. Commit these changes atomically.
- In a second pipeline, do static analysis on the code, using traditional deterministic tools. Configure them to their most strict settings, even if that produces false positives.
- Feed each code smell from the second pipeline to a third pipeline. An agent assesses whether the recommendation is valid, and whether it is worth fixing.
- The final pipeline refactors the offending code, one smell at a time, and does atomic commits for each fix.
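The triage and refactoring steps above could be wired together roughly as follows. This is a minimal sketch, not anyone's actual pipeline: the `Smell` type and every callable passed in (`is_worth_fixing`, `refactor`, `tests_pass`, `commit`, `revert`) are hypothetical stand-ins for an agent, a test runner, and version control.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Smell:
    """One finding reported by a deterministic static-analysis tool."""
    file: str
    rule: str
    message: str


def run_quality_pipelines(
    smells: Iterable[Smell],
    is_worth_fixing: Callable[[Smell], bool],
    refactor: Callable[[Smell], None],
    tests_pass: Callable[[], bool],
    commit: Callable[[str], None],
    revert: Callable[[], None],
) -> List[Smell]:
    """Triage each smell, refactor one at a time, commit only if tests stay green.

    Returns the smells that were fixed and committed.
    """
    fixed: List[Smell] = []
    for smell in smells:
        # Third pipeline: an agent decides whether the finding is valid
        # and worth fixing (hypothetical callable).
        if not is_worth_fixing(smell):
            continue
        # Final pipeline: refactor exactly one smell ...
        refactor(smell)
        # ... and keep the change only if behaviour is unchanged.
        if tests_pass():
            commit(f"refactor: fix {smell.rule} in {smell.file}")
            fixed.append(smell)
        else:
            revert()
    return fixed


if __name__ == "__main__":
    # Stubbed usage: no real agent, test runner, or repository involved.
    findings = [
        Smell("billing.py", "dead-code", "unused function"),
        Smell("billing.py", "naming", "single-letter variable"),
    ]
    commits: List[str] = []
    run_quality_pipelines(
        findings,
        is_worth_fixing=lambda s: s.rule != "naming",
        refactor=lambda s: None,
        tests_pass=lambda: True,
        commit=commits.append,
        revert=lambda: None,
    )
    print(commits)
```

Keeping one smell per commit means a failing test suite points at exactly one refactoring, which can be reverted without losing the others.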
Agentic refactoring seems to have the same constraint as human refactoring: you need high test coverage to ensure that each refactor changes only the form of the system, and not its behaviour. The driver for these multiple pipelines is that generating high quality code in one step is harder than splitting up the process. (Newer models seem better optimised for code quality, so perhaps this is a temporary constraint.) Note that the process resembles TDD: write a working implementation, and only then refactor. Or in the words of the late great Joe Armstrong: “Make it work, then make it beautiful, then if you really, really have to, make it fast. Ninety percent of the time, if you make it beautiful, it will already be fast. So really, just make it beautiful!”
High level software design
If code quality is automatable, what about the high level design? These are some of the essential qualities in non-trivial, critical systems:
- A clear domain model, expressed in all the artefacts including the code, the tests, the documentation, the user interface, and the conversations between business stakeholders and engineers.
- An unambiguous, well-defined, and agreed-upon Ubiquitous Language, inspired by the domain language, that captures all concepts, jargon, and metaphors.
- Bounded Contexts that put a boundary around the language and model, while pragmatically serving the needs of the system designers.
- A continuous process of updating language, models, and boundaries, as the domain evolves and our insight into the domain improves.
Let’s work our way through these backwards.
It’s fairly self-evident to me that organisations will always want evolvability, as it’s essential to being competitive. Model drift is the growing distance between the (mental and documented) domain models, and how the code actually works. This increases the cognitive load on engineers significantly and increases the cost of change. Obviously, there isn’t much data on long term evolvability with agentic development, as the LLM space is young and changing rapidly.
Bounded Contexts seem useful for the same reason as the modularity argument before: a smaller scope means a smaller context window and a smaller token cost. With a documented definition of that scope, agents can be instructed to warn when a feature request violates the scope.
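As a rough illustration of what a documented scope could look like as data, here is a minimal sketch. It assumes that a feature request has already been reduced to a set of domain concepts; in practice an agent would make that judgement from the scope document itself. The `BoundedContext` type, the `scope_violations` function, and the Billing example are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set


@dataclass(frozen=True)
class BoundedContext:
    """A named context and the (lowercase) concepts its language owns."""
    name: str
    concepts: FrozenSet[str]


def scope_violations(context: BoundedContext, requested_concepts: Set[str]) -> Set[str]:
    """Return the requested concepts that fall outside the context's documented scope."""
    return {c for c in requested_concepts if c.lower() not in context.concepts}


# Hypothetical context and feature request.
billing = BoundedContext("Billing", frozenset({"invoice", "payment", "refund"}))
violations = scope_violations(billing, {"Invoice", "Shipment"})
if violations:
    print(f"Warning: outside the {billing.name} context: {sorted(violations)}")
```

The value is less in the check itself than in having the scope written down in a form that both humans and agents can read.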
Defining boundaries remains an art: too large and it becomes hard to make changes; too small and all changes affect multiple Bounded Contexts simultaneously. Not coincidentally, the art of defining Bounded Contexts depends heavily on the quality of your modelling and the design of your Ubiquitous Language.
But does language matter? Again there’s the argument that the machine doesn’t care. All names in your code could be random IDs, and it wouldn’t affect the compiler, and perhaps it wouldn’t affect the LLM either. But the argument in favour is that language is not just for code and agents. Human conversation, specifications, diagrams, user interfaces, APIs, and MCPs all need to share that language. Some people at the event argued that we could create programming languages and inter-agent interfacing languages that are not human readable. The counterargument is that LLMs are trained on massive amounts of human language, and there is not enough training data for these hypothetical new languages. LLMs simulate human reasoning.
Do we still need domain models? In my own experiments, feeding the agent a domain model first, before one-shotting an application, yielded much better results than just prompting my requirements. Without a model, the agent regularly made dubious choices for the code structure and database schemas. The code worked, but it was harder for me to understand and it degraded database performance. Participants at the event reported that they iterated over domain models with an agent, and only instructed it to write code after these had been settled. Some would build a mix of text descriptions and text-based diagrams (using tools like PlantUML, Mermaid…) and feed these to the agent.
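To give an idea of what a text-based model might look like, here is a minimal sketch that renders a tiny domain model as a Mermaid class diagram, which can then be pasted into a prompt. The `Entity` type, the `to_mermaid` function, and the order/customer example are hypothetical and not taken from any participant's setup.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Entity:
    """A domain concept with the attributes we care about in the model."""
    name: str
    attributes: List[str] = field(default_factory=list)


def to_mermaid(entities: List[Entity], relations: List[Tuple[str, str, str]]) -> str:
    """Render entities and (source, target, label) relations as Mermaid class diagram text."""
    lines = ["classDiagram"]
    for entity in entities:
        if entity.attributes:
            lines.append(f"    class {entity.name} {{")
            lines.extend(f"        {attr}" for attr in entity.attributes)
            lines.append("    }")
        else:
            # Entities without attributes are declared without a body.
            lines.append(f"    class {entity.name}")
    for source, target, label in relations:
        lines.append(f"    {source} --> {target} : {label}")
    return "\n".join(lines)


# Hypothetical model: an order placed by a customer.
model = to_mermaid(
    [Entity("Order", ["orderId", "status"]), Entity("Customer", ["name"])],
    [("Order", "Customer", "placed by")],
)
print(model)
```

Because the diagram is plain text, it can live in version control next to the code and be reviewed and iterated on like any other artefact.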
Formal modelling is on the rise as well: it has long been a small niche in software engineering, as it has a steeper learning curve, but agents can create and explain these models for us, and then use them to verify their outputs.
Specifications or code?
One of the most divisive questions at the event was whether code is still ground truth. Again I can roughly separate two camps:
- Treat the code as the single source of truth. Iterate over code with atomic changes. Verify it using agents as well as traditional methods like code reviews, automated tests, acceptance tests.
- Specifications are the single source of truth. The code is not for humans anymore. Iterate over specifications instead, and verify whether the application satisfies them.
The first camp keeps the way software engineering has been done for decades, and extends it with agents. The second camp represents a radical shift. Software engineering becomes specification engineering.
Participants reported different styles of doing specification engineering. Many used agents in some way to assist writing specifications, but there was a split over whether the specifications are only text, only executable specifications, or a mix (which undermines the goal of having a single source of truth). In any case, the quality and rigour of the specifications are critical.
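For readers unfamiliar with executable specifications, here is a minimal sketch of the idea: the business rule is captured as concrete, human-readable examples, and any implementation (generated or hand-written) is checked against them. The discount rule, the `Example` type, and both functions are hypothetical; real setups would more likely use an established testing or BDD framework.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Example:
    """One concrete example of a business rule: given inputs, expect an outcome."""
    description: str
    order_total: float
    is_member: bool
    expected_discount: float


# Hypothetical specification for a hypothetical discount rule.
DISCOUNT_SPEC: List[Example] = [
    Example("non-members get no discount", 200.0, False, 0.0),
    Example("members get 10% on orders of 100 or more", 200.0, True, 20.0),
    Example("members get nothing below the threshold", 99.0, True, 0.0),
]


def verify(spec: List[Example], implementation: Callable[[float, bool], float]) -> List[str]:
    """Run every example against the implementation; return descriptions of failures."""
    failures = []
    for ex in spec:
        actual = implementation(ex.order_total, ex.is_member)
        if abs(actual - ex.expected_discount) > 1e-9:
            failures.append(f"{ex.description}: expected {ex.expected_discount}, got {actual}")
    return failures


def generated_discount(order_total: float, is_member: bool) -> float:
    """Stand-in for agent-generated code that could be patched or regenerated."""
    return order_total * 0.10 if is_member and order_total >= 100 else 0.0


print(verify(DISCOUNT_SPEC, generated_discount))  # prints: []
```

The verification is deterministic even when the code generation is not: the same examples produce the same verdict for any given implementation.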
An interesting question then is whether the code becomes disposable. Some people would iterate on the specifications and have the agent patch the code (and then perhaps use automated code quality pipelines like I described above). Others would iterate over the specifications and the code, and then throw the code away as soon as the agent had trouble patching it. Then they would simply regenerate the code from the latest specifications. The debate was about whether throwing away code made the whole thing less deterministic, and whether the token cost of frequent regeneration was higher or lower than the cost of patching code.
Placing your bets
As a long-time advocate of automated tests, I feel a strong attraction to Spec-Driven Development with executable specs. I’m an engineer and I like deterministic systems, and if we let non-deterministic agents loose on our code, at least the verification of the code should predictably find faults. The specifications need to be (sufficiently) human readable so we can verify them. And therefore they need clear language, models, and boundaries, whether the LLM cares about those or not. I’m less convinced that we can throw away the code and regenerate it at will. It seems that regeneration not only has its own cost, but it could also increase the cost of verifying functional and non-functional requirements (such as security and performance) of each regeneration. I’m also not sure whether investing in low level code quality will matter much in its own right, apart from token cost. Still, I’ll bet on it, for the reasons below.
People running dark factories reported that presently, these are very hard to set up and require highly skilled engineers and an established high-quality engineering culture. Someone expressed the concern that many organisations will want to set up dark factories without having these skills and culture, and risk high costs or even major accidents. (I’ve often made a similar argument about Domain-Driven Design. If you have no tests, code quality, agility, engineering process, and communication channels with the domain experts, slapping DDD onto it will not give you all it promises. But even bad DDD is not as risky as pushing unverified code to production.) Make sure you cover your fundamentals of quality and design first. As one participant, Sam Ruby, wrote about the event: “rigor doesn’t vanish when the agent writes the code — it migrates. Upstream into specifications, down into test suites treated as first-class artifacts, into type systems and constraints, into tiering code by how much damage a mistake could do.”
But there’s another reason why I would stick to building systems that have high test coverage, sound models and boundaries, and great code quality. The externalities of AI are huge:
- threatening demands on energy and water;
- data centres also impact communities through noise, air pollution, and land usage;
- the exploitation of labour in developing countries to train AI;
- worries of declining cognition and critical thinking skills in frequent AI users;
- AI companies burning funds that they will need to recoup somehow;
- the risk of an economic bubble imploding.
AI may become too expensive for all but the wealthiest companies, and even if that doesn’t happen, these are valid arguments for regulating AI. I want to be able to revert from agentic code generation to human engineering, and I want to set up my systems in ways that make this easy. There’s no law that says every technology must keep progressing, and we may finally be meeting its limits.
I’d like to thank Thoughtworks for the invitation. I deeply enjoyed debating AI in terms of engineering, design, ethics, and philosophy, with an exceptional and fun group of experts.
(No AI was used for writing this article, only for improving readability and identifying issues with the content.)