Voice & Speech Workflows¶

Set up voice-enabled agents — from basic speech input to fully autonomous phone agents and real-time human agent assistance.

Overview¶

The fifthelement.ai platform provides four levels of voice capability, ranging from optional speech input in chat to live assistance for human agents on calls:

Level	What It Does	Use Case
Push To Talk (ASR)	Users press a Mic button and speak; the agent receives the transcribed text	Chat agents with optional voice input
Push To Talk + TTS (ASR + TTS)	Full voice loop — user speaks, agent speaks back	Voice-first chat experiences, accessibility
Realtime Voice	Continuous bidirectional voice over phone/SIP/websocket	Autonomous phone agents, IVR replacements
AI Listens	Real-time transcription and suggestions for human agents on calls	Human agent co-pilot during live calls

Quick Start: Adding Voice to an Agent¶

1. Enable Speech-to-Text (Push To Talk)¶


Open your agent and navigate to Agent Settings > Audio and Speech Settings
Enable "Speech to Text"
Select an ASR model from the dropdown
Save — a Mic button now appears in the chat interface

Users press the Mic button, speak their message, and the ASR model transcribes it to text for the agent to process.

2. Add Text-to-Speech (Voice Responses)¶

In the same Audio and Speech Settings section, enable "Text to Speech"
Select a TTS model from the dropdown
Save

Now when a user sends a voice message, the agent's response is automatically played back as speech.

Message Read Aloud¶

Optionally enable "Message Read Aloud" (requires TTS to be enabled). This adds a Speaker button next to every agent response, letting users listen to any message on demand — even for text-typed conversations.
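
The dependency between these toggles can be sketched as a small validation routine. The `AudioSpeechSettings` class and its field names are hypothetical: they mirror the UI toggles described above and are not part of any published platform API.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class AudioSpeechSettings:
    """Hypothetical mirror of the Audio and Speech Settings toggles."""

    speech_to_text: bool = False
    text_to_speech: bool = False
    message_read_aloud: bool = False

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the combination is valid."""
        problems = []
        # Message Read Aloud requires TTS to be enabled.
        if self.message_read_aloud and not self.text_to_speech:
            problems.append("Message Read Aloud requires Text to Speech")
        return problems


settings = AudioSpeechSettings(speech_to_text=True, message_read_aloud=True)
print(settings.validate())
```

Running a check like this before saving a configuration surfaces the missing TTS toggle early, instead of leaving a Speaker button that has no model behind it.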

3. Set Up Realtime Voice (Phone/SIP)¶

For agents that need to handle live phone calls:

Navigate to Agent Settings > Audio and Speech Settings
Enable "Realtime Voice Processing"
Select a Realtime Voice Model
Configure either SIP or Websocket processing:

SIP (phone calls):

Field	What to Enter
SIP Numbers	Your phone number(s)
SIP Username	Auth username
SIP Password	Auth password
Allowed IP Addresses	CIDR notation (leave empty to allow all)

Websocket (browser/app integrations):

Field	What to Enter
Agent Identifier	Unique ID used in the `to` field of websocket messages

Save — the agent is now ready to handle live voice calls
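
To illustrate how a CIDR allowlist such as the Allowed IP Addresses field behaves, here is a minimal sketch using Python's standard `ipaddress` module. The function name `is_ip_allowed` is illustrative, and the assumption that the platform evaluates the list this way (any matching network allows the caller, an empty list allows everyone) is inferred from the field description rather than from platform internals.

```python
from __future__ import annotations

import ipaddress


def is_ip_allowed(source_ip: str, allowed_cidrs: list[str]) -> bool:
    """Check a source IP against a CIDR allowlist; an empty list allows all."""
    if not allowed_cidrs:
        return True
    address = ipaddress.ip_address(source_ip)
    # strict=False accepts entries such as 203.0.113.7/24 that have host bits set.
    networks = [ipaddress.ip_network(cidr, strict=False) for cidr in allowed_cidrs]
    return any(address in network for network in networks)


print(is_ip_allowed("203.0.113.10", ["203.0.113.0/24"]))  # inside the range
print(is_ip_allowed("198.51.100.5", ["203.0.113.0/24"]))  # outside the range
print(is_ip_allowed("198.51.100.5", []))                  # empty list allows all
```

A single host can be listed as a `/32` entry, for example `203.0.113.10/32`.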

4. Enable AI Listens (Human Agent Co-Pilot)¶

AI Listens provides real-time transcription and AI-powered suggestions for human agents during live phone calls. It does not speak to the caller — it assists your human team.

In Audio and Speech Settings, select a Custom Realtime Model
Enable AI Listens
Configure phone numbers (with + country code prefix, must not overlap with SIP numbers)
Set up your telephony provider (e.g., Twilio) with the provided Webhook URL (HTTP POST)
Choose a channel option: inbound (caller only), outbound (agent only), or both_tracks
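
The constraints in these steps can be checked before saving. The sketch below is illustrative only: `check_ai_listens_config` is a hypothetical helper, and it assumes phone numbers follow the E.164 shape (a `+`, a country code, and at most 15 digits in total). The channel option names are taken from the step above.

```python
import re

# Assumption: numbers use the E.164 shape, "+" followed by 2 to 15 digits.
E164_PATTERN = re.compile(r"^\+[1-9]\d{1,14}$")
CHANNEL_OPTIONS = {"inbound", "outbound", "both_tracks"}


def check_ai_listens_config(numbers, sip_numbers, channel):
    """Return a list of problems found in a proposed AI Listens configuration."""
    problems = []
    for number in numbers:
        if not E164_PATTERN.match(number):
            problems.append(f"{number}: expected + followed by country code and digits")
    # AI Listens numbers must not overlap with SIP numbers.
    for number in sorted(set(numbers) & set(sip_numbers)):
        problems.append(f"{number}: overlaps with a SIP number")
    if channel not in CHANNEL_OPTIONS:
        problems.append(f"{channel}: unknown channel option")
    return problems


print(check_ai_listens_config(["+14155550100"], ["+14155550199"], "both_tracks"))
```

An empty list means the numbers are well formed, do not collide with the SIP configuration, and the channel option is one of the three listed.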

Access the live dashboard via the three-dot menu on the agent card > AI Listens.

Demo Feature

AI Listens is currently available as a demo feature for feedback gathering. It provides passive monitoring only — no TTS/speaking.

Model Selection Guide¶

Voice Feature	Model Type Needed	Where to Configure
Push To Talk (speech input)	ASR Model	Settings > AI Models > ASR Models
Voice responses	TTS Model	Settings > AI Models > TTS Models
Phone/SIP agent	Realtime Voice Model	Settings > AI Models > Realtime Voice
AI Listens	Custom Realtime Model	Settings > AI Models > Realtime Voice

LiveKit Inference

If you want a simplified setup, LiveKit Inference provides a unified gateway for ASR, TTS, and LLM models using a single credential. Configure it under Settings > Credentials with a LiveKit Inference API key.

Choosing the Right Voice Architecture¶

Simple FAQ / Support Agent¶

Enable: ASR only (Push To Talk)
Why: Users can optionally speak queries, but the agent responds in text. Low cost, simple setup.

Accessibility-First Agent¶

Enable: ASR + TTS + Message Read Aloud
Why: Full voice loop for users who prefer or need audio interaction.

Autonomous Phone Agent (IVR Replacement)¶

Enable: Realtime Voice with SIP
Why: Continuous voice conversation over phone lines. No chat interface needed.

Human Agent Support¶

Enable: AI Listens
Why: Your human agents get real-time transcription and suggested responses during live calls without the AI speaking to the customer.
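
The four scenarios above reduce to a short decision procedure. The following sketch is a simplification for illustration: the function `choose_voice_features` and its three yes/no parameters are hypothetical, and real deployments may combine features differently.

```python
from __future__ import annotations


def choose_voice_features(over_phone: bool, ai_speaks: bool, spoken_replies: bool) -> list[str]:
    """Map three yes/no requirements to the features listed in this guide."""
    if over_phone:
        # On live calls, either the AI talks to the caller or it assists a human agent.
        return ["Realtime Voice with SIP"] if ai_speaks else ["AI Listens"]
    if spoken_replies:
        # Full voice loop in the chat interface.
        return ["ASR", "TTS", "Message Read Aloud"]
    # Optional voice input, text responses.
    return ["ASR"]


print(choose_voice_features(over_phone=False, ai_speaks=True, spoken_replies=False))
print(choose_voice_features(over_phone=True, ai_speaks=False, spoken_replies=False))
```

The first call corresponds to the simple FAQ agent and the second to human agent support.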

ASR Models — Configure speech-to-text models
TTS Models — Configure text-to-speech models
Realtime Voice Models — Configure phone/SIP/websocket voice and AI Listens
Agent Builder — Advanced Configuration — Audio and Speech settings in the agent builder