# Voice Interface Research - ElevenLabs + Whisper

## Overview
Goal: Add voice capabilities for hands-free conversations while driving. You speak, I listen via Whisper (speech-to-text), I respond via ElevenLabs (text-to-speech).

---

## Option 1: ElevenLabs (TTS) + OpenAI Whisper API (STT)

### ElevenLabs Text-to-Speech (My Voice)
**Good news:** OpenClaw already has a skill called `sag` for ElevenLabs TTS integration!

**Pricing:**
- **Free tier:** 10k credits/month (~10 min audio)
- **Starter:** $5/month - 30k credits (~30 min)
- **Creator:** $11 for the first month (50% off), then $22/month - 100k credits (~100 min)
- **Pro:** $99/month - 500k credits (~500 min)

**Cost per extra minute:**
- Starter: ~$0.30/min
- Creator: ~$0.24/min  
- Pro: ~$0.18/min

### OpenAI Whisper Speech-to-Text (Your Voice)
**Pricing:** $0.006 per minute ($0.36/hour)

**Example usage:**
- 30 min drive conversation: ~$0.18
- Daily 30 min commute: ~$5.40/month
- Heavy use (2 hrs/day): ~$21.60/month
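
The figures above follow from multiplying minutes by the per-minute rate. A small sketch for trying other usage patterns, assuming the $0.006/min rate quoted here and a 30-day month (`whisper_cost` is an illustrative helper, not part of any API):

```python
# Rate quoted in this document; re-check current pricing before relying on it.
WHISPER_USD_PER_MIN = 0.006


def whisper_cost(minutes_per_day: float, days: int = 30) -> float:
    """Estimate Whisper API cost in USD for `days` days of transcription."""
    return round(minutes_per_day * days * WHISPER_USD_PER_MIN, 2)


if __name__ == "__main__":
    for label, minutes in [("30 min/day", 30), ("2 hrs/day", 120)]:
        print(f"{label}: ${whisper_cost(minutes):.2f}/month")
```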

---

## Option 2: Self-Hosted Whisper (Cheaper)

Since you're on a VPS, you could run Whisper locally:

**Options:**
1. **OpenAI Whisper** (open source) - Free but needs GPU/CPU resources
2. **whisper.cpp** - Optimized C++ version, runs well on CPU
3. **whisper.api** - Self-hosted API wrapper

**Requirements:**
- Your VPS needs decent CPU or a small GPU
- whisper.cpp runs on CPU with reasonable latency for short utterances
- No per-minute costs, just server resources

**Trade-off:** More setup, but zero ongoing API costs for STT.
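
A minimal sketch of the first option, assuming the open-source `openai-whisper` Python package and `ffmpeg` are installed on the VPS. The `transcribe_file` helper and the `base` model default are illustrative choices; a smaller or larger model may suit your CPU better.

```python
def transcribe_file(path: str, model=None, model_name: str = "base") -> str:
    """Transcribe an audio file and return the recognized text.

    Pass an already-loaded `model` to avoid reloading weights on every call.
    """
    if model is None:
        # Imported lazily so this module still loads where the package is absent.
        import whisper  # pip install openai-whisper (also requires ffmpeg)

        model = whisper.load_model(model_name)
    result = model.transcribe(path)
    return result["text"].strip()


# Example (hypothetical file name):
#   print(transcribe_file("voice_note.ogg"))
```

In a long-running service it is usually worth loading the model once at startup and passing it in, since loading the weights is the slow part.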

---

## Recommended Setup

### Phase 1 (Start Here): Managed APIs
- **ElevenLabs:** Creator plan ($11/month) = ~100 min included + ~$0.24/min overage
- **Whisper API:** Pay-as-you-go at $0.006/min

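**Estimated monthly cost** (plan fee at the $11 rate plus Whisper API usage over 30 days; assumes spoken replies stay within the ~100 included TTS minutes, otherwise add ~$0.24 per extra minute):
- Light use (30 min/day): $11 + $5.40 = **$16.40/month**
- Moderate (1 hr/day): $11 + $10.80 = **$21.80/month**
- Heavy (2 hrs/day): $11 + $21.60 = **$32.60/month**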
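
The estimates above add only the plan fee and STT usage. A rough sketch that also prices TTS overage, using the rates quoted in this document as constants (re-check them against current pricing; `monthly_cost` and its parameters are illustrative names):

```python
PLAN_USD = 11.00             # Creator plan fee used in the estimates above
INCLUDED_TTS_MIN = 100       # approximate TTS minutes included in the plan
TTS_OVERAGE_USD_PER_MIN = 0.24
STT_USD_PER_MIN = 0.006      # Whisper API rate


def monthly_cost(stt_min_per_day: float, tts_min_per_day: float, days: int = 30) -> float:
    """Estimate total monthly cost in USD.

    stt_min_per_day: minutes of audio sent to Whisper each day.
    tts_min_per_day: minutes of speech generated by ElevenLabs each day.
    """
    stt_cost = stt_min_per_day * days * STT_USD_PER_MIN
    extra_tts_min = max(0.0, tts_min_per_day * days - INCLUDED_TTS_MIN)
    overage_cost = extra_tts_min * TTS_OVERAGE_USD_PER_MIN
    return round(PLAN_USD + stt_cost + overage_cost, 2)


if __name__ == "__main__":
    # 30 min/day of listening, 3 min/day of spoken replies (within the plan).
    print(monthly_cost(30, 3))
    # Same listening time, 10 min/day of spoken replies (200 min of overage).
    print(monthly_cost(30, 10))
```

Spoken replies can dominate the bill once they exceed the included minutes, so it is worth tracking TTS minutes separately from total conversation time.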

### Phase 2 (If usage grows): Hybrid
Keep ElevenLabs for TTS, self-host Whisper on your VPS for STT.

---

## OpenClaw Integration

The `sag` skill already exists for ElevenLabs TTS. For voice input, we'd need:

1. **Your phone** records audio and sends it as a Telegram voice message
2. **OpenClaw** receives the voice message and transcribes it via Whisper
3. **I process** the text and generate a response
4. **ElevenLabs (sag)** converts my response to audio
5. **You receive** the voice reply via Telegram

**OR** if you want real-time conversation without Telegram as an intermediary:
- Custom mobile app that streams audio to your VPS
- WebSocket connection for bidirectional audio
- More complex but seamless driving experience
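
The Telegram flow in steps 1-5 could be wired together roughly as follows. This is a sketch only: `handle_voice_message` and its three injected callables are hypothetical names standing in for whatever OpenClaw, Whisper, and the `sag` skill actually expose.

```python
from typing import Callable

FALLBACK_REPLY = "Sorry, I didn't catch that. Could you say it again?"


def handle_voice_message(
    audio: bytes,
    transcribe: Callable[[bytes], str],   # step 2: speech-to-text
    respond: Callable[[str], str],        # step 3: generate a text reply
    synthesize: Callable[[str], bytes],   # step 4: text-to-speech
) -> bytes:
    """Turn one incoming voice message into one outgoing voice reply."""
    text = transcribe(audio)
    if not text.strip():
        # Nothing recognizable was said; ask the speaker to repeat.
        return synthesize(FALLBACK_REPLY)
    return synthesize(respond(text))


if __name__ == "__main__":
    # Stub implementations so the flow can be exercised without any API keys.
    reply_audio = handle_voice_message(
        b"fake-audio",
        transcribe=lambda audio: "what's on my calendar today",
        respond=lambda text: f"You asked: {text}",
        synthesize=lambda reply: reply.encode("utf-8"),
    )
    print(reply_audio)
```

Passing the three stages in as functions keeps the handler unchanged if STT later moves from the Whisper API to a self-hosted model.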

---

## Next Steps (for tomorrow)

1. **Get ElevenLabs API key** - Sign up, pick plan (recommend starting with Creator)
2. **Install `sag` skill** in OpenClaw - `npx openclaw skills add sag`
3. **Configure voice** - Pick a voice ID you like
4. **Test TTS** - Make sure audio playback works on your setup
5. **Decide on STT approach:**
   - Quick start: OpenAI Whisper API ($0.006/min)
   - Cost-optimized: Self-host whisper.cpp on the VPS
6. **Test voice conversation** via Telegram voice messages
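
For steps 1, 3, and 4, a direct call to the ElevenLabs REST API is one way to confirm that the key and voice ID work before involving OpenClaw. This sketch is independent of the `sag` skill. The environment variable names, the output file name, and the `model_id` default are assumptions to adjust.

```python
import json
import os
import urllib.request

API_BASE = "https://api.elevenlabs.io/v1/text-to-speech"


def build_tts_request(text: str, voice_id: str, api_key: str,
                      model_id: str = "eleven_multilingual_v2") -> urllib.request.Request:
    """Build (but do not send) a text-to-speech request for the given voice."""
    body = json.dumps({"text": text, "model_id": model_id}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/{voice_id}",
        data=body,
        headers={
            "xi-api-key": api_key,
            "Content-Type": "application/json",
            "Accept": "audio/mpeg",
        },
        method="POST",
    )


def synthesize(text: str, voice_id: str, api_key: str) -> bytes:
    """Send the request and return the audio bytes from the response."""
    request = build_tts_request(text, voice_id, api_key)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


if __name__ == "__main__":
    # Hypothetical variable names; set them in your shell before running.
    key = os.environ.get("ELEVENLABS_API_KEY")
    voice = os.environ.get("ELEVENLABS_VOICE_ID")
    if key and voice:
        with open("tts_test.mp3", "wb") as f:
            f.write(synthesize("Voice check, one two three.", voice, key))
        print("Wrote tts_test.mp3")
    else:
        print("Set ELEVENLABS_API_KEY and ELEVENLABS_VOICE_ID to run the test.")
```

Each test call consumes credits from the plan, so a short phrase is enough for checking playback.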

---

## Key Questions for Tomorrow

1. Do you want to start with Telegram voice messages, or build a custom mobile app?
2. Prefer managed APIs for simplicity or self-hosted for cost savings?
3. Any preference on voice style for ElevenLabs? (Professional, warm, energetic, etc.)
