Vol. III · No. 24

The Grammar of the Watchdog

In every microcontroller datasheet I have read in the last fourteen years, the watchdog timer is described in the same flat, declarative voice the chip vendor reserves for peripherals nobody is excited about. The watchdog is described, in order, by its clock source, its prescaler, its reload register, and the precise sequence of register writes required to feed it. There is rarely a paragraph about why you would want one. There is never a paragraph about what kind of engineer you become when you start writing firmware that takes the watchdog seriously.

I have been thinking, this week, about the second of those missing paragraphs.

What the peripheral does, in eight lines

A watchdog timer is a counter that decrements on every clock tick. When it reaches zero, it resets the device. To prevent the reset, your firmware has to write a specific value, in a specific order, to a specific register, at intervals shorter than the timeout. The act of writing to the register is called, by long convention, "kicking" or "feeding" the dog. If you stop kicking it, it bites. Whatever the device was doing in the moment it was bitten — a transaction with a sensor, a packet half-assembled in a buffer, a state machine three transitions deep — is lost. The device starts over.
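
As a rough illustration, the mechanism can be modeled on a host machine in a few lines of C. This is a sketch of the behavior described above, not a driver for any real peripheral: the wdt_model_* names are invented for the example, and a real chip would replace the struct with its own registers and feed sequence.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Host-side model of a down-counting watchdog (hypothetical names). */
typedef struct {
    uint32_t reload;   /* value loaded into the counter by a kick */
    uint32_t counter;  /* current count; the device resets at zero */
    bool     reset;    /* latched once the watchdog has bitten */
} wdt_model_t;

static void wdt_model_init(wdt_model_t *w, uint32_t reload_ticks)
{
    assert(reload_ticks > 0);
    w->reload = reload_ticks;
    w->counter = reload_ticks;
    w->reset = false;
}

/* One watchdog clock tick: decrement, and bite on reaching zero. */
static void wdt_model_tick(wdt_model_t *w)
{
    if (w->reset) {
        return;
    }
    if (--w->counter == 0) {
        w->reset = true;
    }
}

/* Kicking the dog: reload the counter before it reaches zero. */
static void wdt_model_kick(wdt_model_t *w)
{
    if (!w->reset) {
        w->counter = w->reload;
    }
}
```

With a reload value of 8, seven ticks leave the model un-reset, a kick restores the full count, and eight ticks without a kick latch the reset flag.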

This is, considered in the most generous possible light, a brutal mechanism. It is also, in my view, one of the two or three most useful peripherals on any embedded chip.

The wrong way to use it

The first watchdog timer I ever wrote code for, in 2012, was on an STM32F103. I configured it, in the bring-up phase, to fire after eight seconds. Then, having other things to do, I added a single line at the top of main() that started a hardware timer, with a one-second period, whose interrupt handler kicked the dog. I did not think about the watchdog again for six weeks. The interrupt handler ran. The dog stayed fed. The product shipped.

What I had built, in retrospect, was an extraordinarily expensive heartbeat indicator. The timer interrupt ran on its own hardware clock, independent of anything the application was doing. If the application crashed, or hung in a deadlock, or sat for thirty seconds in a retry loop reading from a sensor that had stopped responding, the hardware timer would continue to tick, and the interrupt would continue to fire, and the watchdog would continue to be fed. The device would not reset. It would simply sit there, ticking, not doing the thing it was supposed to do, while the watchdog, dutifully fed, behaved as though everything was fine.

I will spare you the description of the field failure that, eight months later, taught me that this was the wrong way to use a watchdog. The lesson was: a watchdog that kicks itself is not a watchdog. It is a fan club.
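
The difference between the two arrangements can be seen in a small host-side simulation. This is a sketch under stated assumptions: the eight-second timeout and one-second timer period come from the account above, while simulate_reset, kick_source_t, and the idea of a main loop that stops at a chosen second are invented for the illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define WDT_TIMEOUT_S 8  /* from the text: the watchdog fires after eight seconds */

typedef enum { KICK_FROM_TIMER_ISR, KICK_FROM_MAIN_LOOP } kick_source_t;

/*
 * Simulate run_s seconds in one-second steps (the timer period in the text).
 * The main loop stops making progress at second hang_at_s.
 * Returns true if the watchdog would have reset the device.
 */
static bool simulate_reset(kick_source_t src, int hang_at_s, int run_s)
{
    assert(hang_at_s >= 0 && run_s >= 0);
    int remaining_s = WDT_TIMEOUT_S;

    for (int t = 0; t < run_s; t++) {
        bool main_loop_alive = (t < hang_at_s);
        /* The timer interrupt fires every second whether or not the
           application is alive; a main-loop kick needs a live main loop. */
        bool kicked = (src == KICK_FROM_TIMER_ISR) ? true : main_loop_alive;

        if (kicked) {
            remaining_s = WDT_TIMEOUT_S;
        } else if (--remaining_s == 0) {
            return true;
        }
    }
    return false;
}
```

For a hang at second 5 in a 60-second run, simulate_reset(KICK_FROM_TIMER_ISR, 5, 60) returns false, because the dog is still being fed by the interrupt, while simulate_reset(KICK_FROM_MAIN_LOOP, 5, 60) returns true.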

The right way is harder

The discipline of using a watchdog correctly is, properly speaking, the discipline of arranging your firmware so that the act of kicking the dog is co-located with the proof that the system is making progress. The dog should be kicked from the main loop, or from the scheduler, or from some other place in the code whose continued execution is evidence that all of the things that need to happen are happening. If a single subsystem hangs — the sensor reader, the network stack, the UI task — the kick should stop, and the device should reset.

This sounds straightforward. In practice, it requires you to have a model of what "making progress" means for your device, and to make that model explicit in code. On a single-task system, the model can be as simple as: the main loop has executed at least once in the last N milliseconds. On a multi-task system, the model has to be more elaborate. Each task registers with a small bookkeeping module. Each task reports, on each iteration, that it is alive. The bookkeeping module kicks the watchdog only when every registered task has reported within its expected window.
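
For the single-task case, the progress model might be written down as something like the following. It is a minimal sketch: loop_progress_t and its functions are hypothetical names, and the millisecond tick is assumed to be a free-running 32-bit counter that eventually wraps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t window_ms;     /* N: the longest acceptable gap between passes */
    uint32_t last_pass_ms;  /* tick count at the end of the latest pass */
} loop_progress_t;

static void loop_progress_init(loop_progress_t *p, uint32_t window_ms,
                               uint32_t now_ms)
{
    assert(window_ms > 0);
    p->window_ms = window_ms;
    p->last_pass_ms = now_ms;
}

/* Call at the end of each main-loop pass. */
static void loop_progress_mark(loop_progress_t *p, uint32_t now_ms)
{
    p->last_pass_ms = now_ms;
}

/* True if the main loop completed a pass within the last window_ms.
   Unsigned subtraction stays correct when the tick counter wraps. */
static bool loop_progress_ok(const loop_progress_t *p, uint32_t now_ms)
{
    return (uint32_t)(now_ms - p->last_pass_ms) <= p->window_ms;
}
```

The one design choice worth noting is the subtraction: comparing elapsed time rather than absolute timestamps means the check does not misfire when the tick counter rolls over.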

I have, in the years since the F103, written perhaps a dozen of these bookkeeping modules. They are usually under two hundred lines of C. They are, in my experience, the most carefully reviewed two hundred lines of C in the entire codebase, because everyone on the team understands that a bug in the bookkeeping module is the kind of bug that takes a fleet of devices off the network for an afternoon.

A watchdog is a contract between you and the chip. The chip promises to reset the device if it stops making progress. You promise to define, in writing and in code, what progress means.

The shape of a good kick

Let me describe, in some detail, the shape of a good watchdog kick on a small RTOS-based system. The bookkeeping module exposes two functions. The first is wd_register(task_id, deadline_ms), which a task calls at startup to register itself with the watchdog and to declare the maximum interval at which it expects to check in. The second is wd_feed(task_id), which a task calls at the end of every iteration of its main loop. The bookkeeping module maintains, for each registered task, a timestamp of the last wd_feed call. A separate, low-priority task — the watchdog task — wakes up every fifty milliseconds, scans the table, and confirms that no task has missed its deadline. Only then does it kick the hardware watchdog.

The asymmetry is important. The watchdog task is the only piece of code in the system that touches the watchdog register. Every other task has to convince the watchdog task that it is alive. If any single task hangs, the watchdog task notices, refuses to kick the dog, and the device resets within the next hardware watchdog interval.
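
A bookkeeping module of this shape might look roughly like the sketch below. Only wd_register and wd_feed are named in the description above; the table size, wd_check_and_kick, and the wd_now_ms and wd_hw_kick hooks are assumptions made for the example, stubbed so that it builds on a host rather than on any particular RTOS or chip.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WD_MAX_TASKS 8u  /* assumed table size */

typedef struct {
    bool     registered;
    uint32_t deadline_ms;   /* maximum allowed gap between wd_feed calls */
    uint32_t last_feed_ms;  /* timestamp of the most recent wd_feed */
} wd_entry_t;

static wd_entry_t wd_table[WD_MAX_TASKS];

/* Platform hooks (hypothetical), stubbed for a host build. On a target,
   wd_now_ms would read the RTOS tick and wd_hw_kick would write the
   chip's watchdog register. */
static uint32_t wd_fake_now_ms;  /* stand-in for the tick count */
static uint32_t wd_hw_kicks;     /* counts kicks of the "hardware" dog */
static uint32_t wd_now_ms(void) { return wd_fake_now_ms; }
static void wd_hw_kick(void) { wd_hw_kicks++; }

/* Called once by each task at startup. */
void wd_register(uint8_t task_id, uint32_t deadline_ms)
{
    assert(task_id < WD_MAX_TASKS && deadline_ms > 0);
    wd_table[task_id] = (wd_entry_t){ true, deadline_ms, wd_now_ms() };
}

/* Called by each task at the end of every iteration of its loop. */
void wd_feed(uint8_t task_id)
{
    assert(task_id < WD_MAX_TASKS && wd_table[task_id].registered);
    wd_table[task_id].last_feed_ms = wd_now_ms();
}

/* Body of the watchdog task, run every fifty milliseconds.
   Returns true if the hardware watchdog was kicked. */
bool wd_check_and_kick(void)
{
    const uint32_t now = wd_now_ms();

    for (uint8_t i = 0; i < WD_MAX_TASKS; i++) {
        const wd_entry_t *e = &wd_table[i];
        /* Wrap-safe elapsed time since this task last checked in. */
        if (e->registered && (uint32_t)(now - e->last_feed_ms) > e->deadline_ms) {
            return false;  /* a task missed its deadline: withhold the kick */
        }
    }
    wd_hw_kick();  /* the only place the hardware watchdog is touched */
    return true;
}
```

On a real multi-task system the table would also need whatever protection the platform requires for data shared between tasks; that is left out here to keep the shape of the check visible.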

This pattern has a name, in some embedded literature: a "soft watchdog" or "task-level watchdog." I have seen it implemented well perhaps three times and badly perhaps fifteen. The most common failure mode is that the watchdog task itself becomes the heartbeat — a high-priority loop that wakes up, kicks the dog, and goes back to sleep, with no actual check on the other tasks. This is, in the precise sense, the F103 mistake again, dressed up in a slightly more elaborate idiom. It is not safer because it is more elaborate. It is, in some ways, less safe, because it gives the engineer the false impression that they have done something thoughtful.

What the watchdog teaches

The reason I keep coming back to this peripheral, in essays and in workshops and in the late-evening conversations I have with younger engineers who are trying to figure out what kind of practitioners to be, is that the watchdog forces a particular intellectual honesty on the firmware. To use it correctly, you have to write down, explicitly, your theory of what your device does and how you would know if it had stopped doing it. Most embedded codebases do not, in any clear way, contain that theory. The watchdog, when taken seriously, forces it onto the page.

It forces you to admit, for instance, that you do not actually know whether your sensor task is supposed to wake up every ten milliseconds or every fifty. You have always meant to measure it. The deadline you write down in the call to wd_register is the first time you have ever committed to an answer.

It forces you to admit that the network task, the one that sometimes blocks on a DNS lookup, does not have a hard upper bound on its iteration time. You will have to give it one. You will have to add a timeout to the lookup, or move it to a separate task, or admit that the device must tolerate the absence of DNS for some defined period.

It forces you to think about what should happen if the device resets unexpectedly. Will the device come up in the same state, or will it lose the contents of the in-memory buffers it has been accumulating? Have you arranged for the persistent state to be written to flash often enough? Is the boot sequence fast enough that the user will not notice?

These are not questions about the watchdog. They are questions about whether your firmware has, in the strict sense, a design. The watchdog is the peripheral that forces you to answer them.

A small case for the unloved peripheral

The chips I am writing for, today, have many more peripherals than the F103 did. They have dual-core architectures, hardware security modules, neural-network accelerators, sophisticated DMA engines. Each of these peripherals has its own chapter in the reference manual, its own application note, its own ecosystem of libraries and tutorials and Stack Exchange threads. The watchdog has, on a modern chip, the same eight pages it has always had. It does the same thing it has always done.

This is, I think, part of why it is interesting. The watchdog has not been improved. There is no new generation of watchdog timers, with machine learning, that adaptively predict the right kick interval. There is just a counter, and a register, and a reset line, and the discipline you bring to using them. The discipline is the entire peripheral. Everything else is bookkeeping.

The board on my bench, this morning, has been running for forty-one hours. The watchdog has been kicked, by my count, somewhere on the order of twelve million times. Every one of those kicks was a small assertion that everything the firmware was supposed to be doing was, in fact, being done. The forty-second hour begins in a few minutes. I am going to make a coffee, and then I am going to add another task to the bookkeeping table, because I noticed, late last night, that the LED task does not currently register itself with the watchdog. That is not a serious bug. But it is, in a small way, a hole in the device's theory of itself, and I would prefer not to ship the device with a hole.

A note on the prose. The reporting in this magazine is one writer’s — mistakes are her own, and corrections are welcome. Where code appears, it has been tested on at least the hardware named. Where photographs appear, they are the author’s unless captioned otherwise.