Voice AI

Build a Private AI Voice Assistant with OpenClaw + Whisper

Your voice, transcribed locally by Whisper, acted on by OpenClaw. No Alexa, no Google, no Siri. Zero audio leaves your machine.

12 min read · April 6, 2026 · Privacy-first

Why Add a Voice Layer to Your AI Agent?

Text is fast. But voice is hands-free. There are moments — cooking, driving, working out — where you want to ask your AI something and typing isn't an option.

The commercial options (Alexa, Google Assistant, Siri) all have the same problem: they're listening servers in your home sending audio to corporate data centers. Every query you make goes somewhere. That tradeoff was acceptable when assistants were dumb. It's not acceptable now that they're powerful.

OpenClaw already handles the agent layer — memory, tools, cron jobs, multi-channel messaging. What it's missing out of the box is a voice input layer. That's what we're adding today, using OpenAI's Whisper model running entirely on your hardware.

Audio stays local
Zero cloud transmission
~200ms latency
On Apple Silicon M2+
Works offline
Whisper runs on-device

The Architecture

The pipeline is simple and each component is swappable:

[Microphone / Audio File]
        ↓
[Whisper (local STT)]  ←── runs on your CPU/GPU
        ↓
[Transcribed Text]
        ↓
[OpenClaw Agent]       ←── memory + tools + skills
        ↓
[Text Response]
        ↓
[TTS (local or API)]   ←── optional: piper / macOS say
        ↓
[Speaker Output]

The only component that touches a cloud is the OpenClaw agent itself — and that's your choice. You can run OpenClaw with a cloud model (Claude, GPT-4, Gemini) or swap it to run fully locally with Ollama for 100% private operation.

Install Whisper Locally

We're using faster-whisper — a CTranslate2-based reimplementation of Whisper that runs up to 4x faster than the reference implementation at the same accuracy, and faster still with int8 quantization.

# Install faster-whisper
pip install faster-whisper

# Or with uv (recommended):
uv pip install faster-whisper

Test it immediately with a quick Python snippet:

test_whisper.py
from faster_whisper import WhisperModel

# Use "base" for speed, "small" for accuracy, "large-v3" for best
model = WhisperModel("base", device="cpu", compute_type="int8")

segments, info = model.transcribe("your_audio.wav", beam_size=5)

for segment in segments:
    print(f"[{segment.start:.2f}s → {segment.end:.2f}s] {segment.text}")

Model size guide for 2026 hardware:

Model      Size      Speed (M2)         Best For
tiny       75 MB     ~80ms/utterance    Wake word / low-power
base       145 MB    ~150ms/utterance   Daily driver (recommended)
small      466 MB    ~250ms/utterance   Accents, mixed language
large-v3   1.5 GB    ~600ms/utterance   Max accuracy, M3 Pro+
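If the same script runs on different machines (say a MacBook and a Raspberry Pi), hard-coding the model name gets awkward. A small helper that maps a use case from the table above to `WhisperModel(...)` kwargs keeps the tradeoff in one place; the use-case names and the helper itself are our own convention, not part of faster-whisper:

```python
def whisper_config(use_case: str = "daily", has_gpu: bool = False) -> dict:
    """Suggest WhisperModel(...) kwargs based on the tradeoff table above."""
    model = {
        "wake_word": "tiny",         # lowest latency, low-power devices
        "daily": "base",             # recommended default
        "accents": "small",          # accents / mixed-language audio
        "max_accuracy": "large-v3",  # best quality, needs beefier hardware
    }.get(use_case, "base")          # safe fallback
    if has_gpu:
        # On NVIDIA GPUs, float16 is the usual compute type for faster-whisper.
        return {"model_size_or_path": model, "device": "cuda", "compute_type": "float16"}
    return {"model_size_or_path": model, "device": "cpu", "compute_type": "int8"}

# Usage (downloads the model on first run):
# model = WhisperModel(**whisper_config("daily"))
print(whisper_config("daily"))
```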

Capture Audio on Any Device

For desktop (macOS/Linux), use sounddevice to capture from your microphone with a push-to-talk hotkey:

record_and_transcribe.py
import sounddevice as sd
import numpy as np
import scipy.io.wavfile as wavfile
from faster_whisper import WhisperModel
import tempfile, os

SAMPLE_RATE = 16000
DURATION = 8  # seconds to record

model = WhisperModel("base", device="cpu", compute_type="int8")

def record_and_transcribe():
    print(f"🎙️  Recording for {DURATION} seconds...")
    audio = sd.rec(
        int(DURATION * SAMPLE_RATE),
        samplerate=SAMPLE_RATE,
        channels=1,
        dtype="float32"
    )
    sd.wait()
    
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        wavfile.write(f.name, SAMPLE_RATE, audio)
        segments, _ = model.transcribe(f.name)
        text = " ".join(s.text for s in segments).strip()
        os.unlink(f.name)
    
    return text

if __name__ == "__main__":
    result = record_and_transcribe()
    print(f"\n📝 Transcribed: {result}")
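The fixed 8-second window above usually captures a few seconds of trailing silence, which Whisper then has to chew through. A cheap energy-based trim before transcription helps; this helper is our own sketch (the 0.01 RMS threshold is an assumption to tune), not part of sounddevice or faster-whisper:

```python
import numpy as np

def trim_silence(audio: np.ndarray, sample_rate: int = 16000,
                 frame_ms: int = 30, threshold: float = 0.01) -> np.ndarray:
    """Drop leading/trailing frames whose RMS energy falls below `threshold`.

    Expects mono float32 samples in [-1, 1], as sd.rec() returns
    (a (N, 1) array is flattened). The 0.01 threshold is a guess;
    tune it for your microphone and room.
    """
    audio = np.asarray(audio, dtype=np.float32).ravel()
    frame = int(sample_rate * frame_ms / 1000)
    n = len(audio) // frame
    if n == 0:
        return audio
    rms = np.sqrt(np.mean(audio[: n * frame].reshape(n, frame) ** 2, axis=1))
    voiced = np.flatnonzero(rms > threshold)
    if voiced.size == 0:
        return audio[:0]  # nothing but silence
    return audio[voiced[0] * frame : (voiced[-1] + 1) * frame]

# Demo: a loud constant segment stands in for speech between two silences.
sr = 16000
clip = np.concatenate([np.zeros(sr, dtype=np.float32),
                       np.full(sr, 0.3, dtype=np.float32),
                       np.zeros(sr, dtype=np.float32)])
print(len(trim_silence(clip, sr)) < len(clip))  # True
```

Call it on the recorded array before writing the WAV in `record_and_transcribe()`; for anything more robust, a real VAD such as Silero or webrtcvad is the usual choice.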

Wire Whisper into OpenClaw

OpenClaw exposes an HTTP API on your local machine. Once you have a transcription, posting it to OpenClaw is a single HTTP call:

send_to_openclaw.py
import requests

OPENCLAW_URL = "http://localhost:3000"  # default OpenClaw port
OPENCLAW_TOKEN = "your-gateway-token"   # from openclaw gateway status

def ask_openclaw(text: str) -> str:
    """Send transcribed text to OpenClaw and get agent response."""
    resp = requests.post(
        f"{OPENCLAW_URL}/api/session/message",
        json={"message": text},
        headers={"Authorization": f"Bearer {OPENCLAW_TOKEN}"}
    )
    resp.raise_for_status()
    return resp.json().get("reply", "")

# Full pipeline
transcribed = record_and_transcribe()
print(f"You said: {transcribed}")

reply = ask_openclaw(transcribed)
print(f"OpenClaw: {reply}")

This is the minimal loop. Your voice goes in, OpenClaw's full agent — with memory, tools, cron jobs, and all installed skills — fires on it, and the response comes back as text.
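One gap in the minimal loop: Whisper returns empty or punctuation-only text for silent or noisy recordings, and posting that to the agent wastes a call. A small guard is worth adding before `ask_openclaw` (the helper name is ours):

```python
import re

def clean_transcription(text: str) -> str:
    """Normalize Whisper output before sending it to the agent:
    collapse whitespace and drop transcriptions that are effectively empty."""
    text = re.sub(r"\s+", " ", text).strip()
    # Whisper sometimes emits lone punctuation for near-silent audio.
    if not re.search(r"\w", text):
        return ""
    return text

print(repr(clean_transcription("  turn on   the lights  ")))  # 'turn on the lights'
print(repr(clean_transcription(" . ")))                       # ''
```

Only call `ask_openclaw` when the cleaned result is non-empty; otherwise prompt the user to repeat.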

Text-to-Speech Replies

Reading responses on screen defeats the purpose. Add TTS to close the loop. On macOS, the built-in say command is surprisingly good and requires zero setup:

import subprocess

def speak(text: str, voice: str = "Samantha"):
    """Speak text using macOS built-in TTS."""
    # Remove markdown formatting
    clean = text.replace("**", "").replace("*", "").replace("`", "")
    subprocess.run(["say", "-v", voice, clean])

# Full voice loop
transcribed = record_and_transcribe()
reply = ask_openclaw(transcribed)
speak(reply)
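The chained `.replace()` calls above miss links, headings, and code fences, all of which sound terrible read aloud. A regex pass covers more cases; this is a rough sketch of ours, not an exhaustive markdown parser:

```python
import re

def strip_markdown(text: str) -> str:
    """Reduce common markdown to speakable plain text."""
    text = re.sub(r"```.*?```", " code omitted ", text, flags=re.DOTALL)  # fenced code
    text = re.sub(r"`([^`]*)`", r"\1", text)               # inline code
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)   # links -> link text
    text = re.sub(r"[*_#>]+", "", text)                    # emphasis, headings, quotes
    return re.sub(r"\s+", " ", text).strip()

print(strip_markdown("**Done.** See [the guide](https://example.com) for `details`."))
# Done. See the guide for details.
```

Swap it in for the cleanup line inside `speak()`.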

For Linux or higher quality voices, use Piper TTS — a fast, local, neural TTS model that sounds significantly better than festival:

# Install piper
pip install piper-tts

# Download a voice model (en_US-lessac-medium is excellent)
python -m piper --download-dir ./voices en_US-lessac-medium

# Use in Python
from piper import PiperVoice
import wave, io

voice = PiperVoice.load("./voices/en_US-lessac-medium.onnx")

def speak_piper(text: str):
    with io.BytesIO() as wav_io:
        with wave.open(wav_io, "wb") as wav_file:
            voice.synthesize(text, wav_file)
        # Play the audio
        import sounddevice as sd
        import soundfile as sf
        wav_io.seek(0)
        data, rate = sf.read(wav_io)
        sd.play(data, rate)
        sd.wait()

Wake Word Detection (Optional)

Push-to-talk is fine for most use cases. But if you want hands-free triggering, add openWakeWord — a lightweight, open-source wake word model that runs on CPU in under 5ms:

pip install openwakeword

from openwakeword.model import Model
import pyaudio, numpy as np

oww_model = Model(wakeword_models=["hey_jarvis"])  # or custom

def listen_for_wake_word():
    p = pyaudio.PyAudio()
    stream = p.open(rate=16000, channels=1, format=pyaudio.paInt16, input=True)
    
    print("👂 Listening for wake word...")
    while True:
        audio_chunk = np.frombuffer(stream.read(1280), dtype=np.int16)
        oww_model.predict(audio_chunk)
        
        for model_name, scores in oww_model.prediction_buffer.items():
            if scores[-1] > 0.5:  # confidence threshold
                print(f"🎙️ Wake word detected!")
                stream.stop_stream()
                return  # trigger your recording loop
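One caveat with the loop above: confidence scores typically stay above the threshold for several consecutive frames, so a single utterance can trigger multiple times. A small cooldown gate (our addition, not part of openWakeWord) debounces it:

```python
class WakeWordGate:
    """Trigger once when a score crosses `threshold`, then ignore further
    detections for `cooldown` frames (~cooldown * 80ms at 1280-sample
    chunks and 16 kHz)."""

    def __init__(self, threshold: float = 0.5, cooldown: int = 25):
        self.threshold = threshold
        self.cooldown = cooldown
        self._remaining = 0

    def update(self, score: float) -> bool:
        if self._remaining > 0:
            self._remaining -= 1   # still cooling down, suppress retriggers
            return False
        if score > self.threshold:
            self._remaining = self.cooldown
            return True
        return False

gate = WakeWordGate(cooldown=3)
fires = [gate.update(s) for s in [0.1, 0.9, 0.95, 0.9, 0.9, 0.2, 0.8]]
print(fires)  # [False, True, False, False, False, False, True]
```

Inside the listening loop, replace the raw threshold check with `if gate.update(scores[-1]): ...`.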
⚠️ Wake Word Privacy Note

openWakeWord runs 100% on-device — it never sends audio anywhere. The model listens for acoustic patterns only, not semantic content. Still, if this concerns you, push-to-talk via a hotkey (e.g., ⌥ Space) is the more privacy-preserving option.

Mobile: Telegram Voice Notes (Zero Setup)

If you already have OpenClaw connected to Telegram (and you should — it's in the setup guide), you get voice input on mobile for free. OpenClaw's Telegram integration automatically transcribes voice notes using the OpenAI Whisper API.

Just send a voice note to your bot and it responds like a normal text message. This uses the Whisper API (cloud) rather than local Whisper, so you're trusting Telegram and OpenAI with individual voice notes — not an always-on microphone in your home.

For fully local mobile voice input, you'd need a companion app — which is on the roadmap for the OpenClaw mobile client. For now, Telegram voice notes are the pragmatic path.

Privacy & Data Flow: What Goes Where

Component                      Where your data goes
Microphone audio               Never leaves your machine (local Whisper)
Whisper transcription          Processed locally → text sent to OpenClaw
OpenClaw agent (cloud model)   ⚠️ Text sent to your chosen API (Anthropic/OpenAI/etc.)
OpenClaw agent (Ollama)        Stays 100% local
TTS (macOS say / Piper)        Runs locally, audio stays on device

The ⚠️ flag marks cloud model usage — but that's true of any AI assistant. The difference is you're sending text queries, not continuous audio. Your voice never touches a third-party server.

Next Steps & Extensions

Once the basic pipeline works, there's a lot you can bolt on:

  • Add a homelab dashboard — voice-control your smart home via OpenClaw skills
  • Schedule voice reminders using OpenClaw's cron system
  • Build a kitchen assistant that reads recipes aloud and answers follow-ups
  • Connect to your calendar via the Google Workspace skill
  • Run the whole stack on a Raspberry Pi 5 for a dedicated home assistant device

If you're running this on a VPS or remote server, check the VPS hosting guide — a cheap $5/month box can run Whisper base + OpenClaw + Piper without breaking a sweat.

The Big Picture

Alexa and Google Assistant are voice-activated search boxes with corporate backends. What you're building here is an actual agent — one that remembers context, executes multi-step tasks, schedules future actions, and has access to your real data. The voice layer is just the input method. The capability underneath is orders of magnitude richer. And it's yours.

Want to reduce the API cost further? See how to calculate your OpenClaw costs and which API providers are cheapest in 2026.
