Configuration Guide

Everything you need to set up hotkeys, channels, transforms, and routing pipelines.

Quick Start

Config location: ~/.config/speechbutton/config.toml

GitHub repo: github.com/speechbutton/config — starter config with examples

Config is hot-reloaded — changes take effect immediately, no restart needed.

git clone https://github.com/speechbutton/config ~/.config/speechbutton

Hotkeys & Channels

The first [[hotkey]] block without a channel field is the default channel. It activates when you hold the hotkey alone.

Channels are optional. They activate when you press a secondary key (1-9, a-z) while holding the main hotkey. Each channel can have its own output type, transform, and destination.

# Default channel — hold RightCommand, speak, release
[[hotkey]]
key = "RightCommand"
paste = "accessibility"      # output type; alternatives: file, webhook, exec (each its own field)
output_format = "text"        # text | json

# Channel 1 — hold RightCommand, tap 1, speak
[[hotkey]]
key = "RightCommand"
channel = "1"
paste = "accessibility"
transform = "~/.config/speechbutton/scripts/transform_claude.py ~/.config/speechbutton/prompts/translate-es.md"

key — which hotkey triggers this channel (see Available Hotkey Names)

channel — secondary key (1-9, a-z) to activate this channel; omit for default (string in quotes)

paste — paste text at cursor via accessibility API: paste = "accessibility"

file — append text to a file: file = "~/path/to/file.txt"

webhook — POST JSON to a URL: webhook = "http://..."

exec — pipe text to a command: exec = "command"

language — global setting in [global]; language code or "auto" for auto-detection

output_format — "text" for plain text, "json" for structured data

transform — path to a transform script; prompt is passed as an argument: transform = "script.py prompt.md"

transform and exec paths use ~/ expansion

Output Types

Each hotkey channel can send transcribed text to one of four output types.

Paste

Types text at your cursor using accessibility APIs. Falls back to clipboard paste if accessibility is unavailable.

[[hotkey]]
key = "RightCommand"
paste = "accessibility"

File

Appends transcribed text to a file. Creates the file if it does not exist.

[[hotkey]]
key = "RightCommand"
channel = "2"
file = "~/Documents/voice-notes.md"

Webhook

Sends an HTTP POST with a JSON payload to a URL. Useful for integrating with APIs, Zapier, Make, or custom backends.

[[hotkey]]
key = "RightCommand"
channel = "4"
webhook = "https://api.example.com/voice"
output_format = "json"

Exec

Pipes the transcribed text (or JSON) to a shell command via stdin. The command runs in a shell, so pipes and redirects work.

[[hotkey]]
key = "RightCommand"
channel = "3"
exec = "~/.config/speechbutton/scripts/send-to-slack.py"

Transform Pipeline

Transforms process text between speech recognition and output. They run as external scripts that read from stdin and write to stdout.

Audio → Transcribe → [Transform] → Output

stdin/stdout contract: the transform script receives the transcribed text on stdin, processes it, and writes the result to stdout.

Exit codes: exit code 0 means success (stdout is used as output). Any non-zero exit code aborts the pipeline — the output is discarded and an error is logged.
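This contract can be exercised directly from Python's subprocess module, the same way SpeechButton drives a transform script. A minimal sketch — the inline uppercase command is a stand-in for a real script path:

```python
#!/usr/bin/env python3
# Exercise the transform contract: feed text to a command's stdin,
# capture stdout, and check the exit code.
import subprocess
import sys

def run_transform(argv, text):
    """Pipe text into a transform command; return (exit_code, stdout)."""
    proc = subprocess.run(argv, input=text, capture_output=True, text=True)
    return proc.returncode, proc.stdout

# Inline stand-in for a transform script (uppercases its input).
upper = [sys.executable, "-c",
         "import sys; sys.stdout.write(sys.stdin.read().upper())"]

code, out = run_transform(upper, "hello world")
assert code == 0   # exit 0: stdout becomes the channel's output
print(out)         # HELLO WORLD

# A failing transform (non-zero exit) aborts the pipeline:
code, _ = run_transform([sys.executable, "-c", "import sys; sys.exit(3)"], "x")
assert code == 3   # output discarded, error logged
```

Any transform script you write — bash, Python, or otherwise — only needs to satisfy this stdin/stdout/exit-code behavior.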

Built-in Scripts

SpeechButton ships with two built-in AI transform scripts:

  • scripts/transform_claude.py — transforms text via Claude API (uses OAuth from ~/.claude/.credentials.json)
  • scripts/transform_openai.py — transforms text via OpenAI API (uses OAuth from ~/.codex/auth.json)

Prompt files live in ~/.config/speechbutton/prompts/

Custom Transform Examples

#!/bin/bash
# scripts/uppercase.sh — convert text to uppercase
tr '[:lower:]' '[:upper:]'

#!/usr/bin/env python3
# scripts/fix_grammar.py — AI grammar fix
import sys, os, anthropic

text = sys.stdin.read()
client = anthropic.Anthropic()
msg = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": f"Fix grammar and punctuation. Return ONLY the corrected text:\n\n{text}"}]
)
print(msg.content[0].text)

JSON Format

When output_format = "json", the transcription is wrapped in a JSON object. This is the payload sent to webhooks, exec commands, file output, and transform scripts.

{
  "text": "The transcribed speech content",
  "lang": "en",
  "model": "parakeet-tdt-0.6b-v3-int8",
  "duration_ms": 3420,
  "source": "ptt",
  "device": "MacBook Pro Microphone",
  "timestamp": "2026-04-02T10:30:00Z"
}

text — the transcribed speech content

lang — detected or configured language code (ISO 639-1)

model — speech recognition model used

duration_ms — audio duration in milliseconds

source — "ptt" (push-to-talk) or "vad" (voice activity detection)

device — name of the audio input device used

timestamp — ISO 8601 timestamp of the recording
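A script on the receiving end can parse this payload with any JSON library. A Python sketch using the fields above — in a real exec or transform target the payload would come from json.load(sys.stdin); here the documented example is inlined for illustration:

```python
#!/usr/bin/env python3
# Sketch of an exec script consuming the JSON payload
# (output_format = "json"). A real script would read it with
# json.load(sys.stdin).
import json

def summarize(payload: dict) -> str:
    """One-line summary built from the documented payload fields."""
    secs = payload["duration_ms"] / 1000
    return f"[{payload['lang']} {secs:.1f}s {payload['source']}] {payload['text']}"

sample = json.loads("""{
  "text": "The transcribed speech content",
  "lang": "en",
  "model": "parakeet-tdt-0.6b-v3-int8",
  "duration_ms": 3420,
  "source": "ptt",
  "device": "MacBook Pro Microphone",
  "timestamp": "2026-04-02T10:30:00Z"
}""")
print(summarize(sample))  # [en 3.4s ptt] The transcribed speech content
```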

Voice Activity Detection (VAD)

VAD uses the Silero V4 model to detect speech pauses. When enabled, text is sent as you speak — each pause triggers a transcription chunk. You do not have to release the hotkey to get output.

[vad]
enabled = true
chunk_silence_sec = 0.55   # seconds of silence before sending a chunk

chunk_silence_sec — how long to wait after you stop speaking before triggering a chunk. Lower values (0.3) make it more responsive but may split mid-sentence. Higher values (1.5) wait longer for natural pauses. Default: 0.55.
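The chunking rule can be illustrated with a small simulation — a sketch of the behavior described above, not SpeechButton's actual VAD code. The frame size (50 ms) is an assumption for the simulation; a chunk fires once consecutive silence after speech reaches chunk_silence_sec:

```python
# Sketch of the VAD chunking rule (illustration only): emit a chunk
# once chunk_silence_sec of continuous silence follows speech.

def chunk_boundaries(frames, frame_sec=0.05, chunk_silence_sec=0.55):
    """frames: booleans, True = speech detected in that frame.
    Returns the frame indices at which a chunk would be emitted."""
    # Consecutive silent frames needed to fire (round avoids float drift).
    needed = round(chunk_silence_sec / frame_sec)
    boundaries = []
    silent = 0
    in_speech = False
    for i, is_speech in enumerate(frames):
        if is_speech:
            in_speech = True
            silent = 0
        elif in_speech:
            silent += 1
            if silent >= needed:
                boundaries.append(i)
                in_speech = False
                silent = 0
    return boundaries

# 0.5 s of speech followed by 0.6 s of silence → one chunk
frames = [True] * 10 + [False] * 12
print(chunk_boundaries(frames))  # [20]
```

Lowering chunk_silence_sec moves the boundary earlier (more responsive); raising it waits longer for a natural pause.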

Push-to-Talk Chunking

When VAD is disabled, push-to-talk mode is used. Audio is captured while the hotkey is held and transcribed when the key is released. The entire recording is processed as a single chunk.

# Push-to-talk (VAD off) — hold to record, release to transcribe
[vad]
enabled = false

This is the default mode. Best for short dictation (a sentence or two). For longer dictation, enable VAD so you get progressive output.

Auto-Send

When enabled, SpeechButton presses Enter after pasting transcribed text. Useful for chat apps, terminals, and AI assistants where you want to send the message immediately.

[global]
auto_send = true
send_delay_sec = 3.0    # with VAD: silence before auto-send (seconds)

auto_send — when true, presses Enter after each paste

send_delay_sec — with VAD enabled, waits this long after the last speech chunk before pressing Enter. This lets you pause between thoughts without triggering a premature send. Only applies when both VAD and auto_send are enabled.

Speech Recognition

SpeechButton runs speech recognition locally using the Apple Neural Engine. All models run on-device with zero network connections.

| Model | Languages | Speed | Best for |
|---|---|---|---|
| parakeet-tdt-0.6b-v3-int8 | 25+ | Fastest | English, common European languages |
| ggml-large-v3-turbo-q5_0.bin | 100+ | Fast | All languages, best accuracy |

Language auto-detect: set language = "auto" and SpeechButton will detect the language automatically. To lock to a specific language, use an ISO 639-1 code like "en", "es", "ja".

Device Settings

Control which microphone SpeechButton uses and how it behaves.

[global]
input_device = ""       # empty = system default, or specify by name

# Keep iPhone mic always-on (no 300ms wake delay)
[[device_rule]]
match = "iPhone"
keep_hot = true

input_device — empty string "" uses system default, or specify a device name like "MacBook Pro Microphone" or "iPhone"

[[device_rule]] — per-device overrides. match is a substring match against the device name.

keep_hot — when true, keeps the audio stream open even when not recording. Eliminates the ~300ms wake-up delay when you start talking. Uses slightly more power. Recommended for external microphones like iPhone via Continuity.

Writing Custom Prompts

Prompt files tell the AI transform what to do with your transcribed text. They are plain Markdown files stored in the ~/.config/speechbutton/prompts/ directory.

Location: ~/.config/speechbutton/prompts/

Format: plain .md files. The entire file content is used as the system prompt.

Key rule: always end with "Return ONLY the result" — otherwise the AI may include explanations or formatting in its response.

# prompts/translate-es.md

Translate the following text from any language to Spanish.
Preserve the original tone and meaning.
Do not add explanations or commentary.

Return ONLY the translated text.

# prompts/fix-grammar.md

Fix grammar, spelling, and punctuation in the following text.
Preserve the original meaning and tone.
Do not change technical terms, proper nouns, or code.
If the text is already correct, return it unchanged.

Return ONLY the corrected text.

Writing Custom Scripts

Custom scripts run as transform or exec targets. They receive text on stdin and write the result to stdout. Any language works — bash, Python, Node, Ruby, etc.

Location: ~/.config/speechbutton/scripts/

Contract: read from stdin, write to stdout, exit 0 on success

Permissions: scripts must be executable (chmod +x)

Bash: uppercase

#!/bin/bash
# scripts/uppercase.sh
tr '[:lower:]' '[:upper:]'

Python: send to Slack

#!/usr/bin/env python3
# scripts/send-to-slack.py
import sys, os, json, urllib.request

text = sys.stdin.read().strip()
webhook_url = os.environ["SLACK_WEBHOOK_URL"]
payload = json.dumps({"text": text}).encode()

req = urllib.request.Request(webhook_url, data=payload,
    headers={"Content-Type": "application/json"})
urllib.request.urlopen(req)
print(text)  # pass through for chaining

Curl: webhook one-liner

#!/bin/bash
# scripts/webhook.sh
TEXT=$(cat)
curl -s -X POST "https://api.example.com/voice" \
  -H "Content-Type: application/json" \
  -d "{\"text\": \"$TEXT\"}"
echo "$TEXT"
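Note that the shell interpolation above produces invalid JSON if the transcription contains double quotes, backslashes, or newlines. A sketch of a safer approach: build the body in Python, where json.dumps handles the escaping (a real script would then POST the body with urllib.request, as in the Slack example above):

```python
#!/usr/bin/env python3
# Build the webhook body with json.dumps so quotes, backslashes, and
# newlines in the transcribed text are escaped correctly — the naive
# shell interpolation in webhook.sh would break on such input.
import json

def build_payload(text: str) -> bytes:
    """JSON body for the webhook POST, matching {"text": ...}."""
    return json.dumps({"text": text}).encode()

# Text that would break the shell one-liner:
tricky = 'he said "hello"\nnew line'
body = build_payload(tricky)
print(body.decode())  # quotes and the newline arrive escaped; still valid JSON
```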

Example Configurations

Complete working examples. Copy any of these into your config.toml.

1. Default — paste raw text

# Hold RightCommand, speak, release → text appears at cursor
[[hotkey]]
key = "RightCommand"
paste = "accessibility"

2. Channel 1 — translate and paste

# Hold RightCommand, tap 1, speak → translated text pasted
[[hotkey]]
key = "RightCommand"
channel = "1"
paste = "accessibility"
transform = "~/.config/speechbutton/scripts/transform_claude.py ~/.config/speechbutton/prompts/translate-es.md"

3. Channel 2 — notes to file

# Hold RightCommand, tap 2, speak → text appended to file
[[hotkey]]
key = "RightCommand"
channel = "2"
output_format = "text"
file = "~/Documents/voice-notes.md"

4. Channel 3 — Slack via exec

# Hold RightCommand, tap 3, speak → message sent to Slack
[[hotkey]]
key = "RightCommand"
channel = "3"
exec = "~/.config/speechbutton/scripts/send-to-slack.py"

5. Channel 4 — webhook to API

# Hold RightCommand, tap 4, speak → JSON posted to API
[[hotkey]]
key = "RightCommand"
channel = "4"
output_format = "json"
webhook = "https://api.example.com/voice"

6. Channel 5 — Claude Code agent

# Hold RightCommand, tap 5, speak → text piped to Claude Code
[[hotkey]]
key = "RightCommand"
channel = "5"
transform = "~/.config/speechbutton/scripts/transform_claude.py ~/.config/speechbutton/prompts/code-command.md"
exec = "claude -p --bare"

[vad]
enabled = true
chunk_silence_sec = 0.55

[global]
language = "auto"
auto_send = true
send_delay_sec = 3.0

Available Hotkey Names

Use these names in the key field of a [[hotkey]] block.

| Key Name | Physical Key | Notes |
|---|---|---|
| RightCommand | Right ⌘ | Default. Recommended. |
| LeftCommand | Left ⌘ | May conflict with system shortcuts |
| RightOption | Right ⌥ | |
| LeftOption | Left ⌥ | |
| RightControl | Right ⌃ | |
| LeftControl | Left ⌃ | |
| RightShift | Right ⇧ | |
| LeftShift | Left ⇧ | May interfere with typing |
| CapsLock | Caps Lock | Requires remapping in System Settings |

Diagnostics

When something goes wrong, check the logs and test your transforms manually.

Log file

# View live logs
tail -f ~/.config/speechbutton/logs/speechbutton.log

Check transform errors

Transform errors appear in the log with TRANSFORM_ERROR. Common issues: missing OAuth credentials, script not executable, non-zero exit code.

Test a transform manually

# Pipe text into your transform and check the output
echo "hello world this is a test" | ~/.config/speechbutton/scripts/uppercase.sh

# Test an AI transform with a prompt
echo "helo wrld ths is a tst" | ~/.config/speechbutton/scripts/transform_claude.py \
  ~/.config/speechbutton/prompts/fix-grammar.md

# Check exit code
echo $?  # should be 0

AI Agent Setup

SpeechButton's config.toml is designed to be readable and writable by AI agents. Give your AI agent the CLAUDE.md file and it can configure SpeechButton for you — adding channels, setting up transforms, and tuning settings via natural language.

CLAUDE.md on GitHub: github.com/speechbutton/config/blob/main/CLAUDE.md

This file contains the full configuration spec in a format optimized for AI agents. It describes every field, valid values, and examples.

Give your agent the config spec

# Show the agent-readable config documentation
cat ~/.config/speechbutton/CLAUDE.md

Then ask your agent things like: "Add a channel that translates to Japanese and pastes" or "Set up a Slack webhook on channel 3". The agent reads CLAUDE.md, understands the config format, and edits config.toml directly. Changes take effect immediately thanks to hot-reload.

Ready to start?

Download SpeechButton and start configuring your voice pipelines.

Download SpeechButton