A wake word model is a small classifier. It listens to every second of audio in the room, computes a probability that the last few hundred milliseconds were the trained phrase, and stays silent unless that probability crosses a threshold. The threshold is the line between I am being summoned and the kettle is on. Tuning it well is the whole problem.
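The loop above can be sketched in a few lines. The frame size, window length, and threshold here are illustrative assumptions, not the real app's configuration, and `score_window` stands in for the trained model:

```python
from collections import deque

# Illustrative numbers, not the production configuration.
THRESHOLD = 0.85   # the line between "summoned" and "kettle"
WINDOW_MS = 400    # roughly "the last few hundred milliseconds"
FRAME_MS = 80      # audio chunk fed to the model per step

def listen(frames, score_window):
    """frames: iterable of audio chunks; score_window: model returning P(phrase).

    Stays silent on every frame whose window scores below THRESHOLD;
    yields the confidence whenever a window crosses it.
    """
    window = deque(maxlen=WINDOW_MS // FRAME_MS)
    for frame in frames:
        window.append(frame)
        p = score_window(list(window))
        if p >= THRESHOLD:
            yield p
```

A real deployment would also debounce consecutive fires from the same utterance, since one spoken phrase stays inside the sliding window for several steps.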
The wake word for the phone app and the office wall is Mr Graves — the formal address. The casual Harold is what the speakers in the rest of the house listen for, and is also what the household says in passing at the breakfast table; a wake word listening for Harold would fire on every conversational mention. Mr Graves lives in a quieter zone of the language. It is rarely said by accident.
The model runs on-device — no microphone audio leaves the phone or the wall — and was trained from a few thousand audio samples in a small Colab notebook that ran for an hour and twenty minutes one afternoon in May.
The first version did not work. A naïve trainer reported ninety-percent test accuracy and looked excellent on the validation curves. The moment it was deployed to a real Pixel in a real room, it fired every two to three seconds on ambient noise — typing, footsteps, the dishwasher, conversation in the next room, music from the kitchen. The test split lied. Four thousand carefully balanced test samples are nothing like the variety of background sound a wake word actually meets in production.
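The gap shows up as soon as the metric changes. A minimal sketch of scoring a checkpoint by false fires per hour on wake-word-free ambient audio, rather than by split accuracy; the function name and hop size are assumptions, not the notebook's actual API:

```python
def false_fires_per_hour(scores, hop_s, threshold):
    """scores: per-hop P(wake word) over audio known to contain no wake word.

    A run of consecutive above-threshold hops counts as a single fire,
    since one burst of noise produces several high frames in a row.
    """
    hours = len(scores) * hop_s / 3600.0
    fires, armed = 0, True
    for p in scores:
        if p >= threshold and armed:
            fires += 1
            armed = False
        elif p < threshold:
            armed = True
    return fires / hours
```

A checkpoint with excellent split accuracy can still rack up a fire every few seconds here, which is the v1 failure mode expressed as one number.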
The second version mirrored what the open-source community had already learned about training these models. The fix was not novel. It was a three-stage learning-rate schedule, a hard-negative mining loop that re-sampled the training set with the model's own worst false positives, and a validation pass against a wide, varied audio corpus that scored each checkpoint by false fires per hour rather than by accuracy. The model that won was a percentile ensemble of three near-final checkpoints.
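The mining step can be sketched briefly. Everything here is a placeholder for illustration: `model` is any callable returning P(wake word) for a clip, `fit` is the training routine, and the pool sizes are made up:

```python
import heapq

def mine_hard_negatives(model, negative_pool, k=500):
    """Return the k negatives the model scores highest: its worst false positives."""
    scored = ((model(clip), clip) for clip in negative_pool)
    return [clip for _, clip in heapq.nlargest(k, scored, key=lambda t: t[0])]

def train_with_mining(model, fit, train_set, negative_pool, rounds=3):
    """Each round: fit, find the clips the model is most wrong about,
    fold them back into the training set as labeled negatives."""
    data = list(train_set)
    for _ in range(rounds):
        model = fit(model, data)
        data += [(clip, 0) for clip in mine_hard_negatives(model, negative_pool)]
    return model
```

The point of the loop is that the model chooses its own curriculum: whatever it currently confuses with the phrase is exactly what it trains against next.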
When deployed to the wall, it scored thirteen out of thirteen times the phrase was spoken in real conditions: across rooms, across distances, across normal household noise. Threshold zero point eight five. The confidences on those thirteen fires clustered around a single value comfortably above it, which is the right shape: tightly grouped over the line, not scraping its underside.
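That shape can be checked mechanically. A small sketch; the healthy-shape criteria (spread, scraping margin) are my own assumptions, not something the notebook computes:

```python
from statistics import median, pstdev

def fire_shape(confidences, threshold, scrape_margin=0.02):
    """Summarize where real-world fires land relative to the threshold.

    Healthy shape: median comfortably above the line, low spread,
    and no fire sitting within scrape_margin of the threshold.
    """
    return {
        "median": median(confidences),
        "spread": pstdev(confidences),
        "scraping": sum(c < threshold + scrape_margin for c in confidences),
    }
```

A nonzero `scraping` count is the early warning: fires that barely clear the line today are misses the first time someone speaks from the next room.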
The four Atom Echo speakers in the rest of the house use a different wake word — Hey Harold — trained on the same methodology by a different family of model. Same idea, same threshold-tuning patience, different syllables. If you live with custom wake words for long enough, you start to develop opinions about which two-syllable English phrases are easiest to disambiguate from the surrounding language, and which you say so often in casual conversation that the model spends half its time learning to ignore you.
The public repository ships the self-driving Colab notebook that trained the v2 model — about ninety minutes wallclock on a paid Colab L4 — together with the v1 post-mortem and the six saga-specific gotchas (a Piper-TTS install order that has to be exactly right, an editable openWakeWord install path that breaks idempotently, the four others). Anyone training their own wake word can take the notebook, point it at a different two-syllable phrase, and avoid the bugs we paid for.