Voice & Speech Workflows
Set up voice-enabled agents — from basic speech input to fully autonomous phone agents and real-time human agent assistance.
Overview
The fifthelement.ai platform provides four levels of voice capability, the first three each building on the previous:
| Level | What It Does | Use Case |
|---|---|---|
| Push To Talk (ASR) | Users press a Mic button and speak; the agent receives the transcribed text | Chat agents with optional voice input |
| Push To Talk + TTS (ASR + TTS) | Full voice loop — user speaks, agent speaks back | Voice-first chat experiences, accessibility |
| Realtime Voice | Continuous bidirectional voice over phone/SIP/websocket | Autonomous phone agents, IVR replacements |
| AI Listens | Real-time transcription and suggestions for human agents on calls | Human agent co-pilot during live calls |
Quick Start: Adding Voice to an Agent
1. Enable Speech-to-Text (Push To Talk)

- Open your agent and navigate to Agent Settings > Audio and Speech Settings
- Enable "Speech to Text"
- Select an ASR model from the dropdown
- Save — a Mic button now appears in the chat interface
Users press the Mic button, speak their message, and the ASR model transcribes it to text for the agent to process.
2. Add Text-to-Speech (Voice Responses)
- In the same Audio and Speech Settings section, enable "Text to Speech"
- Select a TTS model from the dropdown
- Save
Now when a user sends a voice message, the agent's response is automatically played back as speech.
Message Read Aloud
Optionally enable "Message Read Aloud" (requires TTS to be enabled). This adds a Speaker button next to every agent response, letting users listen to any message on demand — even for text-typed conversations.
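The dependency between these toggles can be sketched as a small validation check. This is an illustrative model only: the platform is configured through the UI, and the class and field names below (`AudioSpeechSettings`, `speech_to_text`, and so on) are hypothetical, not a platform API.

```python
from dataclasses import dataclass


@dataclass
class AudioSpeechSettings:
    """Hypothetical mirror of the Audio and Speech Settings toggles."""

    speech_to_text: bool = False      # "Speech to Text" (Push To Talk)
    text_to_speech: bool = False      # "Text to Speech"
    message_read_aloud: bool = False  # "Message Read Aloud"

    def problems(self) -> list[str]:
        """Return human-readable problems; an empty list means the combination is valid."""
        issues = []
        if self.message_read_aloud and not self.text_to_speech:
            issues.append("Message Read Aloud requires Text to Speech to be enabled")
        return issues


valid = AudioSpeechSettings(speech_to_text=True, text_to_speech=True, message_read_aloud=True)
invalid = AudioSpeechSettings(speech_to_text=True, message_read_aloud=True)
```

Here `valid.problems()` is empty, while `invalid.problems()` reports the missing Text to Speech setting.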
3. Set Up Realtime Voice (Phone/SIP)
For agents that need to handle live phone calls:
- Navigate to Agent Settings > Audio and Speech Settings
- Enable "Realtime Voice Processing"
- Select a Realtime Voice Model
- Configure either SIP or Websocket processing:
SIP (phone calls):
| Field | What to Enter |
|---|---|
| SIP Numbers | Your phone number(s) |
| SIP Username | Auth username |
| SIP Password | Auth password |
| Allowed IP Addresses | CIDR notation (leave empty to allow all) |
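To make the Allowed IP Addresses field concrete, the following sketch shows how CIDR matching works in general, using Python's standard `ipaddress` module. How the platform evaluates the list internally is not documented here; the function name `is_ip_allowed` and the sample addresses are assumptions for illustration.

```python
import ipaddress


def is_ip_allowed(client_ip: str, allowed_cidrs: list[str]) -> bool:
    """Return True if client_ip falls inside any allowed CIDR range.

    An empty list allows every address, matching the
    "leave empty to allow all" rule in the table above.
    """
    if not allowed_cidrs:
        return True
    address = ipaddress.ip_address(client_ip)
    # strict=False tolerates entries with host bits set, e.g. "203.0.113.7/24".
    networks = (ipaddress.ip_network(cidr, strict=False) for cidr in allowed_cidrs)
    return any(address in network for network in networks)
```

A single host can be listed with a `/32` suffix (for example `198.51.100.1/32`), while `203.0.113.0/24` covers the 256 addresses from `203.0.113.0` to `203.0.113.255`.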
Websocket (browser/app integrations):
| Field | What to Enter |
|---|---|
| Agent Identifier | Unique ID used in the `to` field of websocket messages |
- Save — the agent is now ready to handle live voice calls
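For the Websocket option, a client needs to place the Agent Identifier in the `to` field of each message. The sketch below assumes messages are JSON objects; the rest of the message schema is not covered in this guide, so the `event` field and the helper name `address_to_agent` are placeholders rather than documented parts of the protocol.

```python
import json


def address_to_agent(payload: dict, agent_identifier: str) -> str:
    """Serialize a message with its `to` field set to the Agent Identifier."""
    if not agent_identifier:
        raise ValueError("agent_identifier must be a non-empty string")
    # Copy the payload so the caller's dict is not modified.
    message = {**payload, "to": agent_identifier}
    return json.dumps(message)


# "event" is a placeholder field for illustration, not a documented one.
frame = address_to_agent({"event": "start"}, "support-line-agent")
```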
4. Enable AI Listens (Human Agent Co-Pilot)
AI Listens provides real-time transcription and AI-powered suggestions for human agents during live phone calls. It does not speak to the caller — it assists your human team.
- In Audio and Speech Settings, select a Custom Realtime Model
- Enable AI Listens
- Configure phone numbers (with a `+` country code prefix; they must not overlap with SIP numbers)
- Set up your telephony provider (e.g., Twilio) with the provided Webhook URL (HTTP POST)
- Choose a channel option: `inbound` (caller only), `outbound` (agent only), or `both_tracks`
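The constraints in these steps can be checked before entering values in the UI. The sketch below is a pre-flight check, not platform code: the function name is hypothetical, and the digits-only, 15-digit-maximum number pattern is an assumption (E.164 style) beyond the documented requirement of a `+` country code prefix.

```python
import re

# Assumed format: "+" then country code and number, digits only (E.164 style).
PHONE_PATTERN = re.compile(r"^\+[1-9]\d{1,14}$")
CHANNEL_OPTIONS = {"inbound", "outbound", "both_tracks"}


def check_ai_listens_config(ai_listens_numbers, sip_numbers, channel):
    """Return a list of problems found; an empty list means the values look consistent."""
    problems = []
    for number in ai_listens_numbers:
        if not PHONE_PATTERN.match(number):
            problems.append(f"{number}: missing + country code prefix or not digits-only")
    # AI Listens numbers must not overlap with the agent's SIP numbers.
    overlap = sorted(set(ai_listens_numbers) & set(sip_numbers))
    if overlap:
        problems.append(f"overlaps with SIP numbers: {', '.join(overlap)}")
    if channel not in CHANNEL_OPTIONS:
        problems.append(f"unknown channel option: {channel}")
    return problems
```

For example, `check_ai_listens_config(["+14155550100"], ["+14155550199"], "both_tracks")` returns an empty list, while reusing a SIP number or passing a channel other than the three listed options produces one problem entry each.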
Access the live dashboard via the three-dot menu on the agent card > AI Listens.
Demo Feature
AI Listens is currently available as a demo feature for feedback gathering. It provides passive monitoring only — no TTS/speaking.
Model Selection Guide
| Voice Feature | Model Type Needed | Where to Configure |
|---|---|---|
| Push To Talk (speech input) | ASR Model | Settings > AI Models > ASR Models |
| Voice responses | TTS Model | Settings > AI Models > TTS Models |
| Phone/SIP agent | Realtime Voice Model | Settings > AI Models > Realtime Voice |
| AI Listens | Custom Realtime Model | Settings > AI Models > Realtime Voice |
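The table above can also be expressed as a lookup, which is handy when planning which models to set up for a combination of features. The dictionary keys and the function name are hypothetical labels chosen for this sketch; the model types and settings paths are copied from the table.

```python
# Feature -> (model type, settings location), copied from the table above.
MODEL_REQUIREMENTS = {
    "push_to_talk": ("ASR Model", "Settings > AI Models > ASR Models"),
    "voice_responses": ("TTS Model", "Settings > AI Models > TTS Models"),
    "phone_sip_agent": ("Realtime Voice Model", "Settings > AI Models > Realtime Voice"),
    "ai_listens": ("Custom Realtime Model", "Settings > AI Models > Realtime Voice"),
}


def models_to_configure(features):
    """Return the distinct (model type, location) pairs needed, in the order requested."""
    needed = []
    for feature in features:
        requirement = MODEL_REQUIREMENTS[feature]  # KeyError for an unknown feature key
        if requirement not in needed:
            needed.append(requirement)
    return needed
```

Calling `models_to_configure(["push_to_talk", "voice_responses"])` yields the ASR and TTS entries, which is the pair needed for a full Push To Talk voice loop.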
LiveKit Inference
If you want a simplified setup, LiveKit Inference provides a unified gateway for ASR, TTS, and LLM models using a single credential. Configure it under Settings > Credentials with a LiveKit Inference API key.
Choosing the Right Voice Architecture
Simple FAQ / Support Agent
- Enable: ASR only (Push To Talk)
- Why: Users can optionally speak queries, but the agent responds in text. Low cost, simple setup.
Accessibility-First Agent
- Enable: ASR + TTS + Message Read Aloud
- Why: Full voice loop for users who prefer or need audio interaction.
Autonomous Phone Agent (IVR Replacement)
- Enable: Realtime Voice with SIP
- Why: Continuous voice conversation over phone lines. No chat interface needed.
Human Agent Support
- Enable: AI Listens
- Why: Your human agents get real-time transcription and suggested responses during live calls without the AI speaking to the customer.
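The four recommendations above amount to a simple scenario-to-features mapping. The sketch below encodes it for reference; the enum and function names are hypothetical, and the feature labels are taken from the "Enable" lines in this section.

```python
from enum import Enum


class Scenario(Enum):
    FAQ_SUPPORT = "Simple FAQ / Support Agent"
    ACCESSIBILITY = "Accessibility-First Agent"
    PHONE_AGENT = "Autonomous Phone Agent (IVR Replacement)"
    HUMAN_SUPPORT = "Human Agent Support"


# Features to enable per scenario, as recommended in this section.
FEATURES_BY_SCENARIO = {
    Scenario.FAQ_SUPPORT: ["ASR"],
    Scenario.ACCESSIBILITY: ["ASR", "TTS", "Message Read Aloud"],
    Scenario.PHONE_AGENT: ["Realtime Voice with SIP"],
    Scenario.HUMAN_SUPPORT: ["AI Listens"],
}


def features_for(scenario: Scenario) -> list[str]:
    """Return a copy of the feature list recommended for the given scenario."""
    return list(FEATURES_BY_SCENARIO[scenario])
```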
Related Topics
- ASR Models — Configure speech-to-text models
- TTS Models — Configure text-to-speech models
- Realtime Voice Models — Configure phone/SIP/websocket voice and AI Listens
- Agent Builder — Advanced Configuration — Audio and Speech settings in the agent builder