The Bug That Wasn't a Memory Bug

engineering · behind-the-scenes · ai

May 26, 2026 · Coby Randquist

We usually write about HOA governance here. Today is a little different — a look behind the curtain at the kind of problem we wrestle with so you don't have to think about it.

Because here's the thing about building software you're meant to trust with your community's documents: most of the work is invisible. When it works, you upload a file, ask a question, and get a cited answer. You never see the machinery. This is a story about a day the machinery broke in an interesting way — and about how easy it is, even for people who do this for a living, to confidently fix the wrong thing.

What happens when you upload a document

When you upload a document to SayWhat, it doesn't just sit in a folder. We read it. In plain terms: we open the file, figure out what kind of document it is (a set of CC&Rs reads differently from a page of meeting minutes), slice it into small searchable passages, and index every passage so that later, when someone asks "can I install solar panels?", the AI can find the exact paragraph that answers them instead of guessing.

That whole sequence — read, identify, slice, index — is what we call ingestion. It usually finishes in well under a minute, even for a hundred-page document. You upload, you wait a moment, you're ready to ask questions.
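For readers who think in code, the four steps can be sketched in a few lines of Python. This is a toy illustration and not our production pipeline: every name in it (read_text, identify, slice_passages, ingest, Passage) is hypothetical, the "reader" only decodes bytes, and the "index" is a simple word lookup rather than anything an AI system would really search.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Passage:
    doc_type: str
    position: int
    text: str


def read_text(raw: bytes) -> str:
    # Stand-in for real PDF or Word extraction: here we only decode bytes.
    return raw.decode("utf-8", errors="replace")


def identify(text: str) -> str:
    # Toy classifier; a real one would look at far more than one keyword.
    lowered = text.lower()
    if "covenants" in lowered:
        return "ccr"
    if "minutes" in lowered:
        return "minutes"
    return "other"


def slice_passages(text: str, size: int = 200) -> list[str]:
    # Fixed-size slices with no overlap, so the start position always advances.
    return [text[i:i + size] for i in range(0, len(text), size)]


def ingest(raw: bytes) -> dict[str, list[Passage]]:
    text = read_text(raw)
    doc_type = identify(text)
    index: dict[str, list[Passage]] = {}
    for position, chunk in enumerate(slice_passages(text)):
        passage = Passage(doc_type, position, chunk)
        # Toy inverted index: lowercase word -> passages containing it.
        for word in set(chunk.lower().split()):
            index.setdefault(word, []).append(passage)
    return index
```

Looking up a word such as "solar" in the returned dictionary gives back the passages that mention it, which is the shape of the idea even though the real thing is considerably more involved.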

One day, for one particular document, it stopped finishing at all.

The symptom

A community uploaded a 77-page architectural-guidelines PDF — nothing exotic, the kind of document we process constantly. And the server doing the ingestion got shut down. Not once. Three times in an hour, every time someone tried that same file.

Each time, the pattern was identical: the server would climb to the edge of its memory budget and then get forcibly killed.

What's a memory budget — and an "OOM kill"?

Every program running on a server is given a slice of memory to work with, like a desk of a fixed size. If a program tries to pile on more than its desk can hold, the operating system steps in and force-quits it — not to be cruel, but to protect everything else running on that machine from being crowded out. Engineers call this an "OOM kill," short for out-of-memory. The telltale sign is a program that grows and grows and then dies abruptly, mid-task.

So the evidence in front of us was unambiguous: a biggish document, a program that swelled up to its memory limit, and a force-quit, every single time. If you've built this kind of software before, you've seen this movie. The diagnosis writes itself: we're holding too much in memory at once.

That diagnosis was reasonable. It fit every fact we had. And it was wrong.

The confident wrong turn

We did what the evidence told us to do: we made the pipeline use less memory.

This is worth being honest about, because it's the human part of the story. We didn't cut corners. We re-engineered the document pipeline to process things in small batches instead of all at once, added a safety valve that refuses documents above a certain size, and built better tools for spotting a stuck document in the future. All of it was real, durable improvement to how the system works.
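A rough sketch of what those two changes look like in Python, assuming nothing about our real code: the names (check_size, batched, DocumentTooLarge) and the limits (500 pages, batches of 16) are made up for illustration.

```python
from __future__ import annotations

from typing import Iterable, Iterator

MAX_PAGES = 500   # hypothetical safety-valve threshold
BATCH_SIZE = 16   # hypothetical batch size


class DocumentTooLarge(ValueError):
    """Raised when a document exceeds the size the pipeline will accept."""


def check_size(page_count: int, limit: int = MAX_PAGES) -> None:
    # The safety valve: refuse oversized documents before doing any work.
    if page_count > limit:
        raise DocumentTooLarge(f"{page_count} pages exceeds the limit of {limit}")


def batched(items: Iterable[str], size: int = BATCH_SIZE) -> Iterator[list[str]]:
    # Hold at most `size` items in memory at a time instead of the whole list.
    batch: list[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch
```

Batching only bounds memory for whatever consumes the passages. It does nothing if the step producing them never stops.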

It just didn't fix the problem. We'd added durability — solid, genuine durability — but not the kind of durability this particular problem needed. We shipped those changes, re-uploaded the troublesome file to confirm the win... and the server fell over again. Same document, same spot, same shutdown.

It's a strange feeling to fix something carefully and watch it not matter.

The clue we'd left ourselves

Here's where one of those "future-proofing" changes quietly paid off — just not in the way we'd planned.

One of the improvements was a kind of running diary: every stage of ingestion now announces "starting" the moment it begins and "done" the moment it finishes. We added it so that if a document ever got stuck again, we'd be able to see exactly which step it died on instead of staring at silence.

The very next day, that diary told us the answer. The log read: read the file — done. Identify the document — done. Slice it into passages — starting... and then nothing. The program died while slicing the document into passages. Every improvement we'd made to the later steps — the batching, the safety valve — lived after the point where things were actually breaking. We'd reinforced the wrong room.
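The diary idea is small enough to sketch. This is a minimal version using Python's standard logging module, with a hypothetical helper named stage and a hypothetical logger named "ingest". Our actual log format is not shown here.

```python
import logging
from contextlib import contextmanager

log = logging.getLogger("ingest")  # hypothetical logger name


@contextmanager
def stage(name: str, doc_id: str):
    # Announce the stage before any work happens, so the line exists
    # even if the process is killed partway through.
    log.info("%s: starting (doc=%s)", name, doc_id)
    yield
    # Only reached if the stage body finished without raising.
    log.info("%s: done (doc=%s)", name, doc_id)
```

Each step is then wrapped as `with stage("slice", doc_id): ...`. A "starting" line with no matching "done" line marks the step where things went wrong, whether the step raised an error or the whole process was killed.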

The twist

So we finally did the thing we should have done first. We stopped re-engineering and just watched — ran the troublesome document through the slicing step on a test machine with a memory meter attached, to see exactly where the memory was going.

The result was almost funny. The slicer was supposed to turn the document into roughly a hundred passages. It had produced twenty-one — and was holding ten gigabytes of memory, more than a thousand times what it should need. It wasn't slicing a huge document into a few big pieces. It was producing the same tiny scrap of text, over and over, hundreds of thousands of times a second, and piling every copy onto its desk until the desk collapsed.

This wasn't a memory problem at all. It was a getting-stuck problem wearing a memory problem's clothes.
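The measurement itself can be very small. Here is one way to do it in Python with the standard tracemalloc module. This is an assumption about tooling, not a description of the meter we used, and the helper name measure is hypothetical. Note that tracemalloc only counts memory allocated by Python itself, so it is a rough gauge rather than the operating system's view.

```python
import tracemalloc
from typing import Callable, Iterable, Tuple


def measure(produce: Callable[[], Iterable[str]],
            max_items: int = 10_000) -> Tuple[int, int]:
    """Run `produce` and return (items_seen, peak_bytes_allocated)."""
    tracemalloc.start()
    kept = []
    try:
        for item in produce():
            kept.append(item)
            if len(kept) >= max_items:
                break  # stop before a runaway producer exhausts memory
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return len(kept), peak
```

The useful number is often the first one. If a step that should yield about a hundred items hits the cap instead, the problem is the count, not the size of each item.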

The cause was a single instruction in the slicing code that decides where to start the next passage. Think of it as a bookmark. Normally, after cutting one passage, the bookmark moves forward to where the next one should begin. But on this specific document — which happened to have a long stretch of text with almost no punctuation, like a sparse cover page — the bookmark calculated a new position that landed behind where it already was, and then quietly snapped back to exactly where it started. So it cut the same little scrap, moved the bookmark, ended up back at the start, cut the same scrap again, forever.

For the curious: the entire bug was one line that set the next starting point without checking that it actually moved forward. The fix was four lines: if the next position wouldn't advance the bookmark, nudge it forward instead. After that, the same document went from ten gigabytes and a crash to thirty megabytes and a clean finish in fourteen seconds.
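Here is a small reconstruction of that class of bug in Python. It is not our slicer: the function name, the parameters, the overlap scheme, and the rule of cutting at the last full stop are all assumptions chosen to reproduce the same failure. The `guard` flag switches the fix on and off, and `max_passages` exists only so the broken version stops instead of eating memory.

```python
def slice_passages(text, size=200, overlap=50, guard=True, max_passages=10_000):
    passages = []
    start = 0  # the "bookmark"
    while start < len(text):
        end = min(start + size, len(text))
        if end < len(text):
            # Prefer to cut at the last sentence end inside the window.
            cut = text.rfind(". ", start, end)
            if cut != -1:
                end = cut + 1  # keep the full stop, drop the space
        passages.append(text[start:end])
        if end >= len(text):
            break
        # Step back so neighbouring passages overlap. If the passage was
        # shorter than the overlap, this lands behind the bookmark and the
        # clamp snaps it back to exactly where it started.
        next_start = max(end - overlap, start)
        if guard and next_start <= start:
            next_start = end  # the fix: always move forward
        start = next_start
        if len(passages) >= max_passages:
            raise RuntimeError("slicer is not advancing")
    return passages
```

With a short title followed by a long unpunctuated stretch, such as `"Design Guidelines. " + "word " * 200`, the only sentence break in the first window sits near the start. With `guard=False` the bookmark never leaves position zero and the same scrap is appended until the cap trips. With `guard=True` the same input finishes in a handful of passages.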

A bookmark that forgot how to move forward. That was it. That was the thing that knocked a server offline three times.

One bug, two different disguises

The detail I find most instructive is that the exact same bug looked like two completely different problems depending on which machine it ran on.

On one server, with a strict memory limit, the runaway program hit its ceiling fast and got cleanly force-quit — a textbook out-of-memory shutdown. That's the version that convinced us it was a memory problem.

On another server, things looked entirely different, because that machine had a buffer called swap.

What's swap?

When a machine runs low on memory, some systems can spill the overflow onto the hard drive as a temporary cushion — a practice called "swapping." It keeps things from crashing outright, but the hard drive is far slower than memory, so the machine grinds nearly to a halt instead of failing cleanly.

On that second server, the buffer meant to protect us actually worked against us: instead of a clean, obvious shutdown, the machine just slowed to a crawl and went unresponsive, leaving none of the usual fingerprints. The safety cushion erased the one piece of evidence that would have pointed us at the truth fastest. Same bug. One machine shouted "out of memory!"; the other just went quiet.

What we took away from it

A few lessons, the kind that outlast any single bug:

A symptom is not a diagnosis. "It ran out of memory" describes what we saw, not why. We treated the symptom as the cause and reached for the obvious remedy. The question that would have cracked it on day one — how many passages is this document actually producing before it dies? — we didn't ask until day two. The answer had been sitting in plain sight the whole time.

Watch before you rebuild. The thing that finally solved it wasn't a clever redesign. It was a small, cheap measurement — attaching a memory meter and looking. It cost half an hour. We did it last, because we were so sure we already knew the answer. Confidence is exactly what hides the cheap experiment from you.

The boring safety net is worth it. That little running diary we added "just in case" was what redirected us to the real problem. Unglamorous preparation quietly does the heavy lifting when something goes wrong.
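The first lesson translates into a very cheap check: count the passages as they are produced and stop if the count becomes impossible for a document of that length. The sketch below is one hypothetical way to write it in Python. The names capped and RunawaySlicer, and the assumption that every passage advances at least 50 characters, are illustrative and not taken from our code.

```python
from typing import Iterable, Iterator


class RunawaySlicer(RuntimeError):
    """Raised when a slicer yields more passages than the text could hold."""


def capped(passages: Iterable[str], text_length: int,
           min_stride: int = 50) -> Iterator[str]:
    # If every passage advances at least `min_stride` characters, a document
    # of `text_length` characters cannot yield more passages than this.
    limit = text_length // min_stride + 1
    for count, passage in enumerate(passages, start=1):
        if count > limit:
            raise RunawaySlicer(
                f"{count} passages from {text_length} characters (limit {limit})"
            )
        yield passage
```

A check like this turns a slow climb toward the memory limit into an immediate, clearly labeled error that names the real problem.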

Why we're telling you this

We could have written this up internally and moved on. We're sharing it because we think the way a company handles the unglamorous, invisible parts of its work tells you something real about whether to trust it.

We're trusted with the documents that govern people's homes. That responsibility means chasing a single misbehaving file across two servers and a day and a half until we understand exactly what went wrong, not just patching until the symptom disappears. It also means being honest, including with ourselves, about the times we confidently fixed the wrong thing first.

A bookmark forgot to move forward. We found it, we fixed it, and we're a little more careful now about the diagnoses that feel a little too comfortable.

If you'd like a document platform built by people who care about this layer of the work, let's talk.
