The problem
Many tasks are easier or more natural by voice: hands-free use, accessibility, or quick queries. Building voice-first agents that understand intent, call tools, and respond in real time is non-trivial, and existing solutions are often tied to proprietary stacks.
The solution
Voice Agents is a hobby project exploring voice-in, voice-out AI agents: speech-to-text, LLM or agent logic, and text-to-speech (or structured responses) in a modular pipeline you can extend or self-host.
Without Voice Agents
Text-only interfaces or vendor-locked solutions, with no clear path to customize or own the pipeline.
With Voice Agents
Modular voice pipeline: STT → agent/LLM → TTS or actions; adaptable to your stack and use cases.
What it does
- Voice input – Speech-to-text integration for user utterances.
- Agent/LLM layer – Intent handling, tool use, and response generation.
- Voice or structured output – Text-to-speech or API responses for downstream use.
- Extensible design – Swap STT/TTS/LLM providers and add new skills.
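The pipeline above can be sketched as three swappable stages behind small interfaces. This is a minimal illustration, not the repository's actual API: the class names (`SpeechToText`, `AgentLLM`, `TextToSpeech`, `VoicePipeline`) and the stub providers are hypothetical, standing in for real STT/LLM/TTS integrations.

```python
from abc import ABC, abstractmethod


class SpeechToText(ABC):
    """Transcribes audio bytes to text; swap in any STT provider."""
    @abstractmethod
    def transcribe(self, audio: bytes) -> str: ...


class AgentLLM(ABC):
    """Turns a user utterance into a reply (or a tool/action call)."""
    @abstractmethod
    def respond(self, text: str) -> str: ...


class TextToSpeech(ABC):
    """Synthesizes a reply back to audio; swap in any TTS provider."""
    @abstractmethod
    def synthesize(self, text: str) -> bytes: ...


class VoicePipeline:
    """Wires STT -> agent/LLM -> TTS; each stage is replaceable."""
    def __init__(self, stt: SpeechToText, agent: AgentLLM, tts: TextToSpeech):
        self.stt, self.agent, self.tts = stt, agent, tts

    def handle(self, audio: bytes) -> bytes:
        text = self.stt.transcribe(audio)   # voice in
        reply = self.agent.respond(text)    # intent / tool use / LLM logic
        return self.tts.synthesize(reply)   # voice (or structured) out


# Stub providers for demonstration; real ones would call external services.
class EchoSTT(SpeechToText):
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class PrefixAgent(AgentLLM):
    def respond(self, text: str) -> str:
        return f"You said: {text}"


class BytesTTS(TextToSpeech):
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


pipeline = VoicePipeline(EchoSTT(), PrefixAgent(), BytesTTS())
result = pipeline.handle(b"hello")  # b"You said: hello"
```

Swapping a provider means implementing one interface and passing the new instance to `VoicePipeline`; the other stages are untouched.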
Tech stack
See the repository for the current stack (e.g. Python with FastAPI, or a separate frontend and backend). Designed to plug in common STT, TTS, and LLM providers.