A Grammar of Interrupts
Every embedded program is, at heart, a small theatre, and the dramatis personae are interrupts. Some are loud — a button press, a packet arriving on Ethernet — and arrive with the kind of fanfare that lets you reason about them clearly. Some are quiet, almost subliminal — a millisecond tick, a DMA-complete flag, the rising edge of a clock signal you forgot existed — and these are the ones that, when something goes wrong, will take three days of your life to find.
I have been thinking about interrupts more than usual this month, because I have been debugging a small SPI driver for an STM32H7 that, under load, was occasionally returning data from the previous transaction instead of the current one. The bug was the kind of thing you cannot reproduce on demand: it happened once in maybe ten thousand transactions, only under sustained traffic, and only when the system clock was running at its full 480 megahertz. It took me four days to find. The root cause, when I finally found it, was a single line of code that I had written, deliberately, eight months ago, for what had at the time seemed like a very good reason.
This essay is not, primarily, about that bug. It is about the grammar of the language in which the bug was written.
The simplest sentence
The simplest sentence in the language of interrupts is the one that says: when this thing happens, run this code. On an ARM Cortex-M, it looks like this:
```c
void EXTI0_IRQHandler(void) {
    // do the thing
    EXTI->PR1 = EXTI_PR1_PR0; // clear the pending bit
}
```

A pin goes high, the NVIC notices, the processor saves its current context, jumps to the handler, runs the handler, restores its context, and goes back to whatever it was doing. The whole thing takes, on a 480-megahertz part, about a hundred nanoseconds of overhead, before you get to whatever the handler actually does. It is, as embedded code goes, very nearly the simplest possible operation.
What is hidden in this simplicity is everything. The "save its current context" step is itself a small drama, in which a particular set of registers is stacked, in a particular order, with particular alignment, onto a particular stack. The "jump to the handler" step depends on a vector table that you, at some point in your build, arranged to be at a particular address. The "go back to whatever it was doing" step depends on the assumption that nothing else has, in the meantime, corrupted the state the original code was relying on.
Each of these assumptions is, in production code, somebody's responsibility. The startup file is responsible for the vector table. The linker script is responsible for the stack. The original developer is responsible for the assumption that the handler will not, somehow, end up running on a different core than the code it was preempting. The chip is responsible for actually doing the right thing, every clock cycle, hundreds of millions of times a second, for years.
When the chip does not do the right thing — which is, mercifully, almost never — the failure is spectacular and quick. When the assumptions are wrong, the failure is slow and intermittent. The four-day bug I mentioned earlier was an assumption failure, not a chip failure.
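For concreteness, the "save its current context" step can be written down as a data structure. What follows is a minimal sketch, assuming the basic exception frame of an ARMv7-M core with no floating-point context active; the name basic_exception_frame_t is my own and is not defined by CMSIS.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name; CMSIS does not define this struct. It mirrors the
 * eight words the core pushes on exception entry when no floating-point
 * context is active, lowest address first. */
typedef struct {
    uint32_t r0;
    uint32_t r1;
    uint32_t r2;
    uint32_t r3;
    uint32_t r12;
    uint32_t lr;    /* link register of the interrupted code */
    uint32_t pc;    /* address to resume at after the handler returns */
    uint32_t xpsr;  /* program status register */
} basic_exception_frame_t;

/* Eight 32-bit words: the handler starts 32 bytes further down the stack. */
_Static_assert(sizeof(basic_exception_frame_t) == 32, "basic frame is 8 words");
_Static_assert(offsetof(basic_exception_frame_t, pc) == 24, "pc is the 7th word");
```

The sketch leaves out two complications: the core may insert a padding word to keep the stack 8-byte aligned, and an active floating-point context makes the frame considerably larger.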
The hidden tense
There is a tense in the grammar of interrupts that has no equivalent in spoken language. It is the tense of "this is happening, and you are not allowed to know exactly when." A typical user-space program, written for a desktop operating system, runs in a kind of placid linear time: the next line of code runs after the current line, more or less, and the only thing that can interrupt that linearity is the operating system, which is mostly invisible. An interrupt service routine breaks this contract. At any moment between two ordinary instructions, an interrupt can fire, and ten different lines of code, in a completely different file, can run, modify memory, and then disappear, leaving the main program none the wiser.
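The effect can be simulated on a desktop machine by spelling out the separate load, modify, and store steps that a statement like counter++ turns into, and calling a stand-in for the handler between them. This is only a sketch: fake_isr, increment_with_preemption, and run_demo are hypothetical names, and the "interrupt" is an ordinary function call placed at the worst possible moment.

```c
#include <stdint.h>

volatile uint32_t counter = 0;

/* Stand-in for an interrupt handler that also bumps the counter. */
void fake_isr(void) {
    counter = counter + 1;
}

/* counter++ written out as the three steps the CPU performs.
 * If preempt is non-zero, the "interrupt" fires after the load
 * and before the store. */
void increment_with_preemption(int preempt) {
    uint32_t tmp = counter;  /* load */
    if (preempt) {
        fake_isr();          /* handler runs and updates counter */
    }
    tmp = tmp + 1;           /* modify the stale copy */
    counter = tmp;           /* store, overwriting the handler's update */
}

/* Returns the counter value after one main-line increment and one
 * handler increment. */
uint32_t run_demo(int preempt) {
    counter = 0;
    increment_with_preemption(preempt);
    if (!preempt) {
        fake_isr();          /* handler runs after the increment instead */
    }
    return counter;
}
```

Both calls perform two increments, yet run_demo(0) returns 2 and run_demo(1) returns 1. The handler's update is lost, and nothing in the main-line source shows where it went.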
This is the hidden tense. The bug I was chasing lived in this tense.
The SPI driver, in its main body, did something like this:
```c
#include <stdint.h>
#include <string.h>
#include "stm32h7xx.h"  // device header providing SPI1 and SPI_CR1_SPE

volatile uint8_t spi_buf[64];
volatile size_t spi_pos = 0;

void spi_start_transaction(const uint8_t* data, size_t len) {
    if (len > sizeof spi_buf) return;   // refuse anything that would overflow spi_buf
    memcpy((void*)spi_buf, data, len);  // stage the outgoing bytes (not atomic)
    spi_pos = 0;
    SPI1->CR1 |= SPI_CR1_SPE;           // enable the peripheral
    // ... and so on
}
```

The SPI peripheral, on completion, would fire an interrupt that drained spi_buf into a result buffer and reset spi_pos. Under normal load, this was fine. Under sustained load, when transactions were arriving faster than my main code could initiate them, the next call to spi_start_transaction could begin, with its memcpy, before the previous transaction's interrupt handler had finished running. The two would race for control of spi_buf. Sometimes one won. Sometimes the other. The bug appeared as occasional stale data.
The fix, when I found it, was four lines: a critical section, implemented with __disable_irq() and __enable_irq(), around the memcpy and the subsequent peripheral configuration. The fix was easy. The diagnosis was hard, because the symptom — stale data — looked like a peripheral problem, not a software-timing problem. I spent two days on the oscilloscope, looking at the SPI bus, before I thought to look at the firmware.
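Here is a sketch of what such a critical section might look like. It is not the driver's actual code. One refinement over a bare __disable_irq() and __enable_irq() pair is to save and restore the previous interrupt mask, so the section still behaves when it is entered with interrupts already disabled. To keep the example runnable off-target, the CMSIS intrinsics __get_PRIMASK(), __disable_irq(), and __set_PRIMASK() are replaced by hypothetical sim_* stand-ins that operate on a plain variable; stage_tx and tx_buf are likewise illustrative names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simulated PRIMASK: 0 = interrupts enabled, 1 = interrupts masked.
 * On target, these three helpers would be the CMSIS intrinsics
 * __get_PRIMASK(), __disable_irq() and __set_PRIMASK(). */
static uint32_t sim_primask = 0;
uint32_t sim_get_primask(void)       { return sim_primask; }
void     sim_disable_irq(void)       { sim_primask = 1; }
void     sim_set_primask(uint32_t m) { sim_primask = m; }

uint8_t tx_buf[64];

/* Copies data into the shared buffer with interrupts masked.
 * Returns 0 on success, -1 if the data does not fit. */
int stage_tx(const uint8_t *data, size_t len) {
    if (len > sizeof tx_buf) {
        return -1;
    }
    uint32_t saved = sim_get_primask(); /* remember the caller's mask */
    sim_disable_irq();                  /* enter critical section */
    memcpy(tx_buf, data, len);
    /* peripheral configuration would go here, still inside the section */
    sim_set_primask(saved);             /* restore, do not blindly enable */
    return 0;
}
```

Restoring the saved mask matters most in the situation this essay is about: a function that may one day be called from a context its author never anticipated.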
What I should have written eight months ago
The line of code I had written, deliberately, eight months ago, was a comment:
```c
// SPI buf is single-producer; protected by convention.
```

The comment was, when I wrote it, true. There was, at the time, exactly one place in the code that called spi_start_transaction. The convention I was relying on was that this single caller was not itself called from an interrupt context, and therefore there could be no race. The comment was, in my mind, a substitute for a lock: a documentation of the invariant that the caller had to maintain.
Eight months later, somebody else (me, in fact, but a version of me with no memory of writing the original comment) had added a second caller. The second caller was an interrupt handler, for an entirely different peripheral. The invariant had been broken silently. The comment was still there. It was still telling the truth about a world that no longer existed.
This is, I think, the most fundamental difficulty of working with interrupts. The invariants you rely on are global, and the code that maintains them is local. A reader who looks at spi_start_transaction in isolation cannot tell whether the convention is being followed. A reader who looks at the new interrupt handler in isolation cannot tell that the convention exists. The only way to know is to read the entire codebase, every time, before making any change. Nobody does this. Nobody can.
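One partial remedy is to turn the comment into a check that fails loudly at the call site. What follows is a minimal sketch, assuming the convention is "never call this from an interrupt handler". On a Cortex-M the IPSR register reads zero in thread mode and holds the active exception number inside a handler, and CMSIS exposes it as __get_IPSR(). Here a hypothetical sim_ipsr variable stands in for that register so the example runs on a desktop, and in_interrupt_context and start_transaction_checked are illustrative names.

```c
#include <stdint.h>

/* Stand-in for the IPSR register: 0 in thread mode, otherwise the number
 * of the exception being serviced. On target, read __get_IPSR() instead. */
uint32_t sim_ipsr = 0;

int in_interrupt_context(void) {
    return sim_ipsr != 0;
}

/* Refuses to run when the single-producer convention is violated.
 * Returns 0 if the transaction was started, -1 if called from a handler. */
int start_transaction_checked(void) {
    if (in_interrupt_context()) {
        return -1;  /* on target: assert, log, or trap here */
    }
    /* ... stage the buffer and enable the peripheral ... */
    return 0;
}
```

A check like this does not give the reader the global view, but it moves the invariant out of a comment and into code that a second caller cannot silently ignore.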
A small humility
I do not have a grand prescription, here, for how to write safer interrupt-driven code. The discipline of disabling interrupts around critical sections is well understood; the discipline of using lock-free data structures is well understood; the discipline of preferring DMA to interrupt-per-byte is well understood. The literature is good. The textbooks are clear. The problem is not knowledge. The problem is that the world, on the inside of an embedded program, is one in which the order of execution is not, in general, knowable from the source code, and in which the assumptions that the source code is making are, in general, written down only in comments that may or may not be up to date.
The grammar of interrupts is a grammar in which the simplest-looking sentences are the ones most likely to be lying to you. The handler that "just sets a flag" is the handler that, six months from now, will run between the two lines of a counter update you had assumed was atomic and produce a value that violates an invariant your entire system depends on. The peripheral that "always finishes before the next transaction" is the peripheral that, under sustained load, occasionally does not. The buffer that is "protected by convention" is the buffer whose protection, the moment a new developer joins the team, ceases to exist.
I have been writing embedded firmware for fourteen years, and the lesson I keep relearning, in the small hours of debug sessions like the one this week, is that interrupts are humbling. They are humbling because they punish, exactly, the kind of fluent overconfidence that other languages reward. The fluent embedded engineer is the engineer who has, at some point, been bitten badly enough by an interrupt race that they will, for the rest of their career, write critical sections defensively, comment them obsessively, and treat every comment of the form "protected by convention" as a TODO.
I am, I think, finally one of those engineers. It only took fourteen years.