The problem

Many tasks are easier or more natural when done by voice—hands-free use, accessibility, or quick queries. Building voice-first agents that understand intent, call tools, and respond in real time is non-trivial and often tied to proprietary stacks.

The solution

Voice Agents is a hobby project exploring voice-in, voice-out AI agents: speech-to-text, LLM or agent logic, and text-to-speech (or structured responses) in a modular pipeline you can extend or self-host.

Without Voice Agents

Text-only interfaces or locked-in vendor solutions; no clear path to customize or own the pipeline.

With Voice Agents

Modular voice pipeline: STT → agent/LLM → TTS or actions; adaptable to your stack and use cases.
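As a minimal sketch of that flow, the three stages can be chained as plain functions. All names and stub behaviors below are illustrative, not the project's actual API:

```python
# Minimal sketch of the STT -> agent/LLM -> TTS pipeline.
# All functions are illustrative stubs, not the project's actual API.

def speech_to_text(audio: bytes) -> str:
    """Stub STT: a real provider would transcribe the audio."""
    return "what's the weather today"

def run_agent(utterance: str) -> str:
    """Stub agent/LLM layer: intent handling and response generation."""
    if "weather" in utterance:
        return "It's sunny and 22 degrees."
    return "Sorry, I didn't catch that."

def text_to_speech(text: str) -> bytes:
    """Stub TTS: a real provider would synthesize audio from text."""
    return text.encode("utf-8")

def handle_turn(audio: bytes) -> bytes:
    """One voice turn: transcribe, think, speak."""
    text = speech_to_text(audio)
    reply = run_agent(text)
    return text_to_speech(reply)

print(handle_turn(b"...").decode("utf-8"))  # It's sunny and 22 degrees.
```

Each stage can also return structured data instead of audio, which is where the "or actions" branch fits in.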

What it does

  • Voice input – Speech-to-text integration for user utterances.
  • Agent/LLM layer – Intent handling, tool use, and response generation.
  • Voice or structured output – Text-to-speech or API responses for downstream use.
  • Extensible design – Swap STT/TTS/LLM providers and add new skills.
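One way the provider-swapping described above could work is structural typing via `typing.Protocol`: any object with the right methods is a valid provider. The interfaces and echo providers here are hypothetical, for illustration only:

```python
# Sketch of swappable STT/TTS providers via typing.Protocol.
# Interface and class names are hypothetical, not the project's real API.
from typing import Protocol

class STTProvider(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class TTSProvider(Protocol):
    def synthesize(self, text: str) -> bytes: ...

class EchoSTT:
    """Toy provider: treats the audio bytes as their own transcript."""
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")

class EchoTTS:
    """Toy provider: 'speaks' by encoding the text back to bytes."""
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")

class VoicePipeline:
    """Wires any STT/TTS pair together; swap providers without code changes."""
    def __init__(self, stt: STTProvider, tts: TTSProvider):
        self.stt = stt
        self.tts = tts

    def respond(self, audio: bytes) -> bytes:
        text = self.stt.transcribe(audio)
        reply = f"You said: {text}"  # agent/LLM layer would slot in here
        return self.tts.synthesize(reply)

pipeline = VoicePipeline(EchoSTT(), EchoTTS())
print(pipeline.respond(b"hello").decode("utf-8"))  # You said: hello
```

Swapping in a real provider then only requires a class with matching `transcribe`/`synthesize` methods; no inheritance or pipeline changes are needed.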

Tech stack

See the repository for the current stack (e.g. Python and FastAPI, or a separate frontend and backend). The pipeline is designed to plug into common STT/TTS and LLM providers.

Next steps

  • Demo (TBD) – live or recorded walkthrough.
  • Support for more STT/TTS providers and LLM backends.
  • Prebuilt “skills” (e.g. calendar, search, smart home) and docs for adding your own.