
What I Learned at the Acoustic Keystroke Classification Talk at SCaLE 23x

security · ai · open-source · homelab

Your keyboard is leaking data every time you type. Not through software, not through a network connection. Through sound. That was the premise of David von Thenen's talk at SCaLE 23x, titled "The Sound of Your Secrets," and the demo made it concrete.

The attack surface here is acoustic. Different keyboards, switch types, and even individual typing habits produce measurable differences in keystroke sound. Those differences are consistent enough that a machine learning model can learn to distinguish them. From there, the model can attempt to reconstruct what was typed.

Figure: acoustic keystroke classification pipeline — WAV audio converted to mel spectrogram images (64 mel bins, 64 frames, hop length 255), fed to a CNN classifier, with single-keyboard and multi-keyboard training paths, a correction pass, and an attack surface of ambient microphone sources.

The pipeline David walked through starts with raw audio. The dataset is built by recording multiple WAV samples of each key, not just one. Multiple instances of the same keystroke give the model enough variation to generalize across different typing force and timing. Those WAV files get converted into mel spectrogram images, a format that captures frequency content over time as a grid of pixels. The parameters matter: 64 mel bins, 64 time frames, hop length of 255. Those choices control the resolution and the shape of the input the model sees.
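That preprocessing step can be sketched without an audio library. The following is a minimal numpy-only version (libraries like librosa would normally do this); the 64 mel bins, 64 frames, and hop length 255 come from the talk, while the sample rate, FFT size, and the synthetic input are assumptions for illustration:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters spaced evenly on the mel scale.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):                 # rising edge of the triangle
            fb[i, k] = (k - l) / (c - l)
        for k in range(c, r):                 # falling edge
            fb[i, k] = (r - k) / (r - c)
    return fb

def mel_spectrogram(y, sr=44100, n_fft=1024, hop=255, n_mels=64, n_frames=64):
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(y) - n_fft, hop):      # hop length 255, per the talk
        seg = y[start:start + n_fft] * window
        frames.append(np.abs(np.fft.rfft(seg)) ** 2)  # power spectrum per frame
    S = np.array(frames).T                            # (freq bins, time frames)
    M = np.log(mel_filterbank(sr, n_fft, n_mels) @ S + 1e-10)
    if M.shape[1] < n_frames:                         # pad or crop to a fixed image size
        M = np.pad(M, ((0, 0), (0, n_frames - M.shape[1])))
    return M[:, :n_frames]

y = np.sin(2 * np.pi * 440 * np.arange(22050) / 44100)  # half a second of tone as stand-in audio
M = mel_spectrogram(y)
print(M.shape)  # (64, 64): one fixed-size "image" per keystroke
```

The fixed 64×64 shape is the point: every keystroke recording, regardless of duration, becomes an identically sized image the CNN can consume.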

A CNN classifier runs on those spectrogram images. The first training phase covers a single keyboard, building a model tied to one keyboard's acoustic fingerprint. In the demo, this hit 100% accuracy. The second phase extends this to multiple keyboards simultaneously, where the model has to distinguish both which keyboard a keystroke came from and which key it was. That is a harder problem, and accuracy dropped to around 43% before correction. The multi-keyboard setup also requires labeled training data from each keyboard in the set.
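The talk didn't detail the network architecture, so the following is only an illustrative forward pass in plain numpy, with assumed dimensions (8 conv filters, 40 key classes, random weights), to show how a 64×64 spectrogram image becomes a probability distribution over keys:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, kernels):
    # Valid 2-D convolution of one image against a bank of kernels.
    kh, kw = kernels.shape[1:]
    H, W = x.shape
    out = np.zeros((len(kernels), H - kh + 1, W - kw + 1))
    for n, k in enumerate(kernels):
        for i in range(H - kh + 1):
            for j in range(W - kw + 1):
                out[n, i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def max_pool(x, s=2):
    c, h, w = x.shape
    return x[:, :h // s * s, :w // s * s].reshape(c, h // s, s, w // s, s).max(axis=(2, 4))

def forward(spec, kernels, W_out):
    h = np.maximum(conv2d(spec, kernels), 0)  # conv + ReLU
    h = max_pool(h)                           # downsample feature maps
    logits = h.reshape(-1) @ W_out            # flatten + linear head
    e = np.exp(logits - logits.max())
    return e / e.sum()                        # softmax over key classes

spec = rng.standard_normal((64, 64))          # one 64x64 mel spectrogram "image"
kernels = rng.standard_normal((8, 3, 3)) * 0.1
W_out = rng.standard_normal((8 * 31 * 31, 40)) * 0.01  # 40 assumed key classes
probs = forward(spec, kernels, W_out)
```

In the multi-keyboard phase, the output space is effectively (keyboard, key) pairs rather than keys alone, which is part of why accuracy drops.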

The correction phase handles the cases where inference gets it wrong. The classifier produces top-k candidates per keystroke position rather than a single hard prediction. Those candidates get enumerated across a word segment and filtered against a spellchecker, keeping combinations that form valid words. David described this in terms of how a masked language model works: given partial or ambiguous input, score the most probable completion. The implementation uses an external spellcheck API rather than a language model, but the intuition is the same. Improbable character sequences get pruned; plausible words survive.
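The enumerate-and-filter step fits in a few lines. The talk used an external spellcheck API; a plain dictionary set stands in here, and the per-position candidate lists are made up for illustration:

```python
from itertools import product

def correct_word(candidates, dictionary):
    """candidates: one list of top-k character guesses per keystroke position.
    Enumerate every combination and keep those that form valid words."""
    valid = []
    for combo in product(*candidates):
        word = "".join(combo)
        if word in dictionary:
            valid.append(word)
    return valid

# Hypothetical top-3 classifier guesses for a five-keystroke segment
candidates = [
    ["h", "j", "n"],
    ["e", "r", "w"],
    ["l", "k", "o"],
    ["l", "k", "p"],
    ["o", "i", "p"],
]
dictionary = {"hello", "world", "jerky"}
print(correct_word(candidates, dictionary))  # → ['hello']
```

With top-3 candidates per position, a five-key segment has 3^5 = 243 combinations; the dictionary filter collapses that to the handful that are real words, which is where the post-correction accuracy recovery comes from.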

The GPU requirement for training is real. David specified H100 or better, which is not homelab hardware. He provided pre-trained models for people who wanted to run inference without training from scratch. The point of the workshop was to show the technique, not just the compute bill.

The threat model is the part I keep thinking about. A phone sitting on a desk during a call, a shared office with ambient mics, a video call where your microphone is live during typing, a smart speaker within range of your desk. None of those require physical access. None require compromising your machine. The attack works from any microphone that can hear your keyboard. David also pointed out PIN pads as a particularly tractable target. A numeric PIN pad has ten digit keys and a few function keys. The key space is small and the inputs are high-value. A model trained on a specific PIN pad model can narrow down what was entered with fewer samples than a full keyboard requires.
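The PIN pad point is easy to quantify. If a per-pad model narrows each press to its top-k digit candidates, the attacker enumerates k^4 PINs instead of the full 10^4 (the top-k framing here is my extrapolation, not a figure from the talk):

```python
# Brute-force space for a 4-digit PIN vs. the space after acoustic narrowing.
pin_length = 4
full_space = 10 ** pin_length               # 10000 blind guesses
for k in (1, 2, 3):
    narrowed = k ** pin_length              # candidates left after top-k per press
    print(f"top-{k} per press: {narrowed} candidates vs {full_space}")
```

Even a sloppy model that only gets each digit into its top 3 cuts the search from 10,000 to 81.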

Defense is not simple. David covered acoustic masking, where you introduce noise to obscure the keystroke signatures. Keyboard choice matters too, because different switch types have meaningfully different acoustic profiles. Membrane keyboards are harder to classify than mechanical switches with distinct click events. Typing behavior, cadence, and force also affect detectability. There is no single fix, but there are layers that raise the cost of a successful attack.
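Acoustic masking amounts to mixing noise into whatever a nearby microphone would capture. A sketch of the idea, assuming a target signal-to-noise ratio as the tuning knob (the specific approach is mine, not from the talk):

```python
import numpy as np

def add_masking_noise(y, snr_db):
    """Mix white noise into a recording at a target signal-to-noise ratio (dB)."""
    rng = np.random.default_rng(0)
    sig_power = np.mean(y ** 2)
    noise_power = sig_power / (10 ** (snr_db / 10))
    noise = rng.standard_normal(len(y)) * np.sqrt(noise_power)
    return y + noise

# Stand-in "keystroke" audio: half a second of tone at 44.1 kHz
y = np.sin(2 * np.pi * 440 * np.arange(22050) / 44100)
masked = add_masking_noise(y, snr_db=0)  # 0 dB: noise as loud as the signal
```

At 0 dB SNR the spectrogram features the classifier depends on are substantially buried; the trade-off is that the masking noise has to be tolerable to the person typing.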

The project is on GitHub at github.com/davidvonthenen/2026-scale-23x-keystroke, including pre-built models and all four demo stages.

