Hands-free dictation has had a quiet failure mode for years: it transcribes whoever happens to be loud enough. The colleague at the next desk. A podcast on your speakers. Your kid asking what's for dinner. They all end up in the buffer.
Voice Fingerprint, shipping in SpeechButton 2.12, fixes that. The app builds a small, on-device model of how you sound, then drops every speech segment that doesn't match — before it ever reaches the transcriber. The result: hands-free that actually works in shared spaces, and push-to-talk that doesn't accidentally pick up your neighbour mid-pause.
Three rooms where it matters
The café scenario
You're working from a coffee shop. Hands-free is on so you can think and talk while you scroll a Figma file. Two metres away, someone is on a video call with their team — loud, energetic, exactly the kind of voice a microphone loves.
Without Voice Fingerprint, that conversation lands in your transcript. Half-sentences from a stranger get stitched into your notes. You catch them later and waste five minutes deleting fragments you didn't write.
With it on, the gate compares each speech segment to your profile. Their voice scores low and gets dropped silently. Yours scores high and goes through. Your transcript stays yours.
The home office scenario
Hands-free is convenient at home until it isn't. Your partner takes a phone call in the next room. A kid asks something from the doorway. The dishwasher beeps. The microphone catches all of it.
Voice Fingerprint splits each push-to-talk recording at the silences and checks every utterance independently. If you talk, then the room talks, then you talk again, only the middle bit is dropped — your two passes go through unchanged. The same logic runs in hands-free between voice activations.
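The splitting step can be pictured with a crude energy-based sketch — this is illustrative only (threshold and frame values are made up, and SpeechButton's actual voice-activity detection is surely more sophisticated):

```python
import numpy as np

def split_at_silences(samples, sr=16000, frame_ms=30,
                      energy_thresh=1e-4, min_silence_frames=10):
    """Break one recording into utterances wherever a run of quiet
    frames occurs, so each utterance can be checked independently.
    Parameter values are illustrative, not SpeechButton's settings."""
    frame = int(sr * frame_ms / 1000)
    n = len(samples) // frame
    energies = [float(np.mean(samples[i*frame:(i+1)*frame] ** 2)) for i in range(n)]
    utterances, start, quiet = [], None, 0
    for i, e in enumerate(energies):
        if e >= energy_thresh:
            if start is None:
                start = i          # a new utterance begins
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= min_silence_frames:
                # enough silence: close the current utterance
                utterances.append(samples[start*frame:(i - quiet + 1)*frame])
                start, quiet = None, 0
    if start is not None:
        utterances.append(samples[start*frame:n*frame])
    return utterances
```

Each returned utterance is then fingerprinted and scored on its own, which is why a stranger's interjection between your two sentences can be dropped without touching either of them.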
The background-audio scenario
YouTube is playing on your second monitor. A podcast is on the kitchen speakers. You're in a Zoom meeting on mute, listening, and want to dictate a quick follow-up.
Pre-recorded speech has a different acoustic signature from your live voice — different room reverb, different microphone path, different compression. The fingerprint picks up on those differences. A presenter on a podcast doesn't pass for you, even if their cadence is similar. You speak; the rest stays out.
How the gate works
Each speech segment goes through a small extra step before transcription:
```mermaid
flowchart LR
    A["Speech segment"] --> B["WeSpeaker<br/>ResNet-34"]
    B -->|"256-dim<br/>fingerprint"| C{"Cosine vs<br/>your profile"}
    C -->|"≥ 0.55<br/>(it's you)"| D["✅ Transcribe"]
    C -->|"< 0.55<br/>(not you)"| E["🚫 Drop"]
    style A fill:#fefce8,stroke:#eab308,color:#713f12
    style B fill:#eff6ff,stroke:#2563eb,color:#1e3a5f
    style C fill:#faf5ff,stroke:#9333ea,color:#581c87
    style D fill:#f0fdf4,stroke:#16a34a,color:#14532d
    style E fill:#fef2f2,stroke:#ef4444,color:#7f1d1d
```
The model is WeSpeaker ResNet-34, an open-source neural network trained to verify speakers. It produces a 256-number vector — a "fingerprint" — for any second or more of speech. Two recordings from the same person produce vectors that point in nearly the same direction; recordings from different people point elsewhere.
SpeechButton compares your live segment to a stored profile using cosine similarity. On clean audio, your own voice typically scores between 0.65 and 0.85; someone else lands between 0.10 and 0.40. The default threshold sits at 0.55 — high enough to reject most strangers, low enough that a slight cold or a busy room doesn't lock you out of your own app.
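The decision itself is a one-liner once the embedding exists. A minimal NumPy sketch of the cosine gate (function name is my own; 0.55 is the default threshold described above):

```python
import numpy as np

def is_owner(embedding, profile, threshold=0.55):
    """Compare a live segment's 256-dim embedding to the stored profile.

    Cosine similarity ignores vector length and measures only direction,
    which is what makes same-speaker recordings score high even when
    loudness varies. Returns (decision, score)."""
    score = float(np.dot(embedding, profile) /
                  (np.linalg.norm(embedding) * np.linalg.norm(profile)))
    return score >= threshold, score
```

With the ranges from the paragraph above, your own voice (roughly 0.65–0.85) clears the 0.55 bar comfortably, while a stranger (roughly 0.10–0.40) falls well short.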
The whole pipeline — embedding extraction, similarity check, profile lookup — runs locally inside SpeechButton. There is no server in the loop.
Why the fingerprint is per-device
If you've ever recorded yourself on AirPods and then on a MacBook microphone, you'll know they sound different. Bluetooth codecs add their own colour. A built-in mic captures your voice at close range very differently from a far-field room mic across the desk. USB headsets compress dynamics. To a speaker-verification model, those differences are large.
A naive design averages all your recordings into a single profile. That fails one of two ways: either you set a tight threshold and the app rejects you the moment you switch from MacBook mic to AirPods, or you set a loose threshold and start letting strangers through.
SpeechButton sidesteps the trade-off by storing a separate fingerprint for every input device. Switch from your MacBook to AirPods, and the app starts a fresh five-recording bootstrap on the AirPods side. The MacBook profile stays exactly where you left it — when you plug the laptop directly into the desk dock tomorrow morning, recognition picks up where it left off, without retraining.
The profile file holds one entry per device, keyed by the macOS device name your system already uses.
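A lookup against that per-device file might look like the sketch below. The field names (`recordings`, `embedding`) and the helper are my own guesses at the schema, not the actual file format:

```python
import numpy as np

def profile_for_device(profiles: dict, device_name: str):
    """Fetch the fingerprint enrolled for the current input device.

    `profiles` is the parsed speaker_profiles.json, keyed by the macOS
    device name. A missing or under-enrolled entry means this device
    hasn't finished its five-recording bootstrap yet, so the gate
    can't run for it. Schema is hypothetical."""
    entry = profiles.get(device_name)
    if entry is None or entry.get("recordings", 0) < 5:
        return None  # still enrolling on this device
    return np.asarray(entry["embedding"], dtype=np.float32)
```

Because each device keys its own entry, enrolling your AirPods never disturbs the MacBook profile sitting beside it.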
Calibration takes ten seconds
There is no "training mode". You don't read a passage. You don't talk to a setup wizard.
Flip Settings → Voice Fingerprint → Only respond to my voice. The model downloads once (~25 MB) if you haven't used it before. The next five times you record anything — a normal hotkey press, a hands-free chunk, a quick note — the app accepts the audio and adds it to your profile. The Settings panel shows the count: 2/5 enrolling…, then Enrolled (5 recordings).
From the sixth recording on, the gate is live. After that, every accepted segment slowly nudges the profile forward — five percent per recording — so it adapts to a head cold, a new microphone batch, the way your voice changes over months. One bad recording can't drift your profile, but a new normal eventually does.
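That "five percent per recording" is an exponential moving average. A sketch of the update, assuming the profile is kept unit-length so cosine scores stay comparable (the renormalization step is my assumption):

```python
import numpy as np

def update_profile(profile, new_embedding, alpha=0.05):
    """Nudge the stored fingerprint 5% toward each accepted segment,
    then re-normalize. One outlier barely moves the profile; a
    sustained shift (new mic, changed voice) eventually becomes
    the new normal."""
    updated = (1 - alpha) * profile + alpha * new_embedding
    return updated / np.linalg.norm(updated)
```

The geometry does the hedging for you: a single odd recording shifts the vector's direction only slightly, while hundreds of consistent recordings pull it all the way over.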
Switched microphones? You'll see the count restart for that device. Want to start over completely? Reset voice profile wipes the file and begins fresh.
What never leaves your Mac
Voice biometrics in cloud services are a quiet privacy problem. Your voice is identifying in a way text isn't — once it's enrolled in a remote service, it's enrolled. We didn't want to ship that. So we didn't.
🔒 Voice Fingerprint is fully on-device.
- The WeSpeaker model runs locally inside SpeechButton.
- Your fingerprint never leaves your Mac. There is no upload step.
- Raw audio is not stored. Only the averaged 256-number vector is kept.
- The file is plain JSON at ~/Library/Application Support/SpeechButton/models/speaker_profiles.json — about 1 KB per device. You can read it. You can delete it.
- Disabling Voice Fingerprint stops the gate immediately. Resetting the profile wipes the file.
Privacy claims need teeth. Here are the technical limits we're willing to put in print: no part of the speaker-gate pipeline opens a network socket. The only network call related to this feature is the one-time download of the model file from a public release. After that, the feature works fully offline. If your Mac never goes online again, Voice Fingerprint keeps working.
What it doesn't claim
Voice biometrics aren't magic. Some honest limits:
Try it
Voice Fingerprint ships in SpeechButton 2.12 — already in the production build. Update or install:
Update SpeechButton
Free 5 min/day · macOS 15+ · Apple Silicon
Download for macOS

After updating: Settings → Voice Fingerprint → toggle on. The model downloads once, then works offline.
Already running 2.12? You'll find it under Settings → Voice Fingerprint in the SpeechButton menu bar app.
Voice fingerprinting in SpeechButton uses the open-source WeSpeaker speaker-verification model and the sherpa-onnx runtime, both released under Apache-2.0. Default model file: wespeaker_en_voxceleb_resnet34.onnx.