How to Fix a Bug: Tests, Hypotheses, Timeboxes

Here’s roughly how I fixed bugs early on in my career:

Browse around in the code.
Try stuff.
See if it works.

Below is my preferred way of doing it since about 2012:

Step 1. Pair / Ensemble

Find one or more people to collaborate with. Linus’s Law, “Given enough eyeballs, all bugs are shallow”, was formulated at least 25 years ago, and yet we still do so much work in isolation.

Step 2. Test

Write a failing test (or multiple) that proves the existence of the bug. This test will make sure you don’t fix the wrong thing, or a perceived bug. The test will help to ensure that the bug doesn’t return later. Bugs are the result of some misunderstanding about the system, and fixing a bug doesn’t imply that everybody’s misunderstanding is fixed. But if the test fails again in the future, at least you’ll know something’s up.
Besides a descriptive name, add the bug’s ticket number in the test name (or use a custom annotation). This will help keep track of the discussions about the bug.
Commit and push the test. Yes, this will break the build. I think this is perfectly fine: the code was already broken, it was just invisible. I get that breaking the build will be controversial in some organisations. In that case, if your tooling allows it, make sure the failing test shows as a warning but don’t let it block deployment.

If writing a test is hard or impossible in your environment:

Push a test that simply fails with an explanation of the bug and why it’s hard to test.
Once you’re confident the bug is fixed, remove the fail command and (depending on your tooling) mark the test as “not implemented” or “skipped”.
Invest in making it easier to test.

Step 3. Form Hypotheses

Brainstorm as many hypotheses about what causes the bug as you can think of. Having more people in the group helps, as groups simply have more knowledge and ideas than individuals.
As with all brainstorming, avoid debating the hypotheses. Just try to list as many as you can, including the ones you’re skeptical about.
In fact, I even recommend coming up with some hypotheses that are very unlikely to be correct. They trigger more ideas, and occasionally, the bug’s cause is the most unexpected one.
Definitely resist the urge to jump into the code and start fixing things. If you do need to look at the code to form hypotheses, don’t try to fix it right away.

Step 4. Prioritise and Timebox

Reorder your list starting with the most probable hypotheses.
Add an estimate to each hypothesis. The estimate is not the time you need to fix the bug, but the time you need to falsify the hypothesis.
Now move the hypotheses that are both highly probable and have a short estimate to the top of the list. If something is quite improbable but only takes a minute to verify, you may also want to put it near the top.
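A whiteboard is enough for this, but if it helps to see the ordering spelled out, here is a small Python sketch. The scoring rule (probability divided by minutes to falsify) is my assumption for illustration, not a fixed part of the process, and the hypotheses and numbers are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    description: str
    probability: float         # rough gut-feel estimate between 0 and 1
    minutes_to_falsify: float  # time needed to falsify it, not to fix the bug

    @property
    def score(self) -> float:
        # Assumed heuristic: likely and cheap-to-check hypotheses score highest.
        return self.probability / self.minutes_to_falsify


def prioritise(hypotheses: list[Hypothesis]) -> list[Hypothesis]:
    """Return the hypotheses ordered from highest to lowest score."""
    return sorted(hypotheses, key=lambda h: h.score, reverse=True)


# Invented example list, in the order the group brainstormed it.
HYPOTHESES = [
    Hypothesis("Cache returns a stale entry", 0.50, 30),
    Hypothesis("Timezone conversion is off by one hour", 0.30, 5),
    Hypothesis("Feature flag is disabled in production", 0.05, 1),
    Hypothesis("Race condition in the worker pool", 0.15, 120),
]

ORDERED = prioritise(HYPOTHESES)

for h in ORDERED:
    print(f"{h.score:.4f}  {h.minutes_to_falsify:>5.0f} min  {h.description}")
```

With these numbers, the unlikely feature-flag hypothesis lands near the top because it takes only a minute to check, while the most probable one (the cache) drops to third place because it is slow to falsify.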

Step 5. Timebox and Falsify

Set a timer.
Try to prove as fast as possible that the first hypothesis is not the cause of the bug.
To falsify, you can either write another test, or make the minimum effort needed to make the original test pass.
Stop as soon as you have your evidence.
If your timer runs out before you’re done, decide whether to add another timebox or move on to the next hypothesis.
Keep in mind that bugs can have multiple interacting causes, so proving or disproving one hypothesis doesn’t guarantee you falsified the others.
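To illustrate falsifying with another test, here is a sketch in Python's unittest. Suppose one hypothesis is “the price parser misreads the input”. The parser `parse_price_cents` is a hypothetical stand-in for the suspected code. If this quick test passes, the hypothesis is falsified for these inputs, and you can stop and move on to the next one.

```python
import unittest


def parse_price_cents(text: str) -> int:
    """Hypothetical parser under suspicion: turns '9.95' into 995 cents."""
    units, _, cents = text.partition(".")
    # Pad or cut the fractional part to exactly two digits.
    return int(units) * 100 + int(cents.ljust(2, "0")[:2])


class HypothesisParserMisreadsPriceTest(unittest.TestCase):
    def test_parser_reads_the_prices_from_the_bug_report(self):
        # A passing test is evidence that the parser is NOT the cause.
        self.assertEqual(parse_price_cents("9.95"), 995)
        self.assertEqual(parse_price_cents("9.5"), 950)
        self.assertEqual(parse_price_cents("9"), 900)
```

The test is deliberately small: the goal within the timebox is evidence, not a complete test suite for the parser.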

Step 6. Fix

Now that you’re confident about the cause of the bug, fix it properly.
Write more tests to help guide your fix.

One of my programming adages is

“When in doubt, write a test.”

And of course there’s

“Doubt is the origin of wisdom.”

— René Descartes

The Engineer’s Duty

Does this process take more time?

“There is never enough time to do it right, but there is always enough time to do it over.”

— John W. Bergman

That said, I find that for complicated bugs, this process avoids a lot of diving down the wrong rabbit holes, chasing wild geese, and doggedly rewriting entire chunks of code with your head in the sand. Instead of living out these animal metaphors, you can be rational about bugs. The group mind identifies better hypotheses; priorities and timeboxes keep you from getting into a flow state on the wrong hypothesis.

If people in your organisation don’t follow this sort of process, or don’t write tests, I have a hypothesis on why that is: Most organisations implicitly believe that the engineer’s duty is to deliver features. Fixing bugs is seen as a distraction from delivery, so you need to get it over with as fast as possible.

However, there’s a healthier way of looking at it: The engineer’s duty is building a system that works, that keeps working over time, and that stays evolvable and understandable over time. Basically, don’t just deliver features, but make it easy to deliver high-quality features. In the same vein, tests reduce bugs and regressions, more tests lead to more testability, and more testability makes it easier to avoid more bugs. Jumping onto this self-reinforcing feedback loop may be tricky, but I promise it speeds up as you go along.