
Voice & Speech Workflows

Set up voice-enabled agents — from basic speech input to fully autonomous phone agents and real-time human agent assistance.


Overview

The fifthelement.ai platform provides three levels of voice capability, each building on the previous, plus AI Listens, a passive co-pilot mode for human agents:

| Level | What It Does | Use Case |
| --- | --- | --- |
| Push To Talk (ASR) | Users speak into a mic button, agent reads text | Chat agents with optional voice input |
| Push To Talk + TTS (ASR + TTS) | Full voice loop: user speaks, agent speaks back | Voice-first chat experiences, accessibility |
| Realtime Voice | Continuous bidirectional voice over phone/SIP/websocket | Autonomous phone agents, IVR replacements |
| AI Listens | Real-time transcription and suggestions for human agents on calls | Human agent co-pilot during live calls |

Quick Start: Adding Voice to an Agent

1. Enable Speech-to-Text (Push To Talk)

  1. Open your agent and navigate to Agent Settings > Audio and Speech Settings
  2. Enable "Speech to Text"
  3. Select an ASR model from the dropdown
  4. Save — a Mic button now appears in the chat interface

Users press the Mic button, speak their message, and the ASR model transcribes it to text for the agent to process.


2. Add Text-to-Speech (Voice Responses)

  1. In the same Audio and Speech Settings section, enable "Text to Speech"
  2. Select a TTS model from the dropdown
  3. Save

Now when a user sends a voice message, the agent's response is automatically played back as speech.

Message Read Aloud

Optionally enable "Message Read Aloud" (requires TTS to be enabled). This adds a Speaker button next to every agent response, letting users listen to any message on demand — even for text-typed conversations.


3. Set Up Realtime Voice (Phone/SIP)

For agents that need to handle live phone calls:

  1. Navigate to Agent Settings > Audio and Speech Settings
  2. Enable "Realtime Voice Processing"
  3. Select a Realtime Voice Model
  4. Configure either SIP or Websocket processing:

SIP (phone calls):

| Field | What to Enter |
| --- | --- |
| SIP Numbers | Your phone number(s) |
| SIP Username | Auth username |
| SIP Password | Auth password |
| Allowed IP Addresses | CIDR notation (leave empty to allow all) |
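
Before saving the Allowed IP Addresses field, it can help to check that your CIDR entries cover the addresses your SIP provider actually calls from. The sketch below uses Python's standard `ipaddress` module. The function name `is_allowed` is hypothetical, and the empty-list behavior simply mirrors the "leave empty to allow all" note above; it is not a description of how the platform evaluates the field internally.

```python
import ipaddress


def is_allowed(source_ip: str, allowed_cidrs: list[str]) -> bool:
    """Return True if source_ip falls inside any of the CIDR blocks.

    Assumption: an empty list allows every address, matching the
    "leave empty to allow all" note for the field.
    """
    if not allowed_cidrs:
        return True
    addr = ipaddress.ip_address(source_ip)
    # strict=False accepts entries such as "203.0.113.7/24" whose host bits are set.
    networks = (ipaddress.ip_network(cidr, strict=False) for cidr in allowed_cidrs)
    return any(addr in net for net in networks)


# Example addresses come from the reserved documentation ranges.
print(is_allowed("203.0.113.7", ["203.0.113.0/24"]))
print(is_allowed("198.51.100.1", ["203.0.113.0/24"]))
```

A malformed entry raises `ValueError`, which makes this a quick way to catch typos before pasting values into the settings form.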

Websocket (browser/app integrations):

| Field | What to Enter |
| --- | --- |
| Agent Identifier | Unique ID used in the "to" field of websocket messages |

  5. Save. The agent is now ready to handle live voice calls.
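
To illustrate how the Agent Identifier is used, here is a minimal sketch of a client building a websocket message addressed to an agent. Only the "to" field is taken from this guide. The `payload` key, the `build_message` helper, and the example identifier are placeholders, and the platform's real message schema may differ.

```python
import json


def build_message(agent_identifier: str, payload: dict) -> str:
    """Serialize a message addressed to the agent with the given identifier."""
    if not agent_identifier:
        raise ValueError("agent_identifier must not be empty")
    # "to" is the only field named in this guide; "payload" is a placeholder.
    return json.dumps({"to": agent_identifier, "payload": payload})


message = build_message("support-line-1", {"type": "ping"})
print(message)
```

The resulting string would then be sent over whichever websocket client your application already uses.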

4. Enable AI Listens (Human Agent Co-Pilot)

AI Listens provides real-time transcription and AI-powered suggestions for human agents during live phone calls. It does not speak to the caller — it assists your human team.

  1. In Audio and Speech Settings, select a Custom Realtime Model
  2. Enable AI Listens
  3. Configure phone numbers. Each must include the + country code prefix and must not overlap with your SIP numbers
  4. Set up your telephony provider (e.g., Twilio) with the provided Webhook URL (HTTP POST)
  5. Choose a channel option: inbound (caller only), outbound (agent only), or both_tracks
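
The constraints in the steps above can be checked before you enter values in the settings form. The following sketch assumes the "+ country code" format corresponds to E.164 (a plus sign followed by up to 15 digits), which this guide does not state explicitly. The function name `check_ai_listens_config` is hypothetical.

```python
import re

# Assumption: numbers follow E.164, i.e. "+" then 2 to 15 digits, no leading zero.
E164 = re.compile(r"\+[1-9]\d{1,14}")
CHANNELS = {"inbound", "outbound", "both_tracks"}


def check_ai_listens_config(numbers, sip_numbers, channel):
    """Return a list of problems found; an empty list means the values look valid."""
    problems = []
    for number in numbers:
        if not E164.fullmatch(number):
            problems.append(f"{number!r} is missing the + country code prefix or is malformed")
    overlap = sorted(set(numbers) & set(sip_numbers))
    if overlap:
        problems.append(f"numbers also configured for SIP: {', '.join(overlap)}")
    if channel not in CHANNELS:
        problems.append(f"unknown channel option {channel!r}")
    return problems


print(check_ai_listens_config(["+14155550100"], ["+14155550199"], "both_tracks"))
```

Returning a list of problems rather than raising on the first one makes it easier to fix every issue in a single pass.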

Access the live dashboard via the three-dot menu on the agent card > AI Listens.

Demo Feature

AI Listens is currently available as a demo feature for feedback gathering. It provides passive monitoring only — no TTS/speaking.


Model Selection Guide

| Voice Feature | Model Type Needed | Where to Configure |
| --- | --- | --- |
| Push To Talk (speech input) | ASR Model | Settings > AI Models > ASR Models |
| Voice responses | TTS Model | Settings > AI Models > TTS Models |
| Phone/SIP agent | Realtime Voice Model | Settings > AI Models > Realtime Voice |
| AI Listens | Custom Realtime Model | Settings > AI Models > Realtime Voice |

LiveKit Inference

If you want a simplified setup, LiveKit Inference provides a unified gateway for ASR, TTS, and LLM models using a single credential. Configure it under Settings > Credentials with a LiveKit Inference API key.


Choosing the Right Voice Architecture

Simple FAQ / Support Agent

  • Enable: ASR only (Push To Talk)
  • Why: Users can optionally speak queries, but the agent responds in text. Low cost, simple setup.

Accessibility-First Agent

  • Enable: ASR + TTS + Message Read Aloud
  • Why: Full voice loop for users who prefer or need audio interaction.

Autonomous Phone Agent (IVR Replacement)

  • Enable: Realtime Voice with SIP
  • Why: Continuous voice conversation over phone lines. No chat interface needed.

Human Agent Support

  • Enable: AI Listens
  • Why: Your human agents get real-time transcription and suggested responses during live calls without the AI speaking to the customer.
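
The four scenarios above can be summarized as a small lookup, which may be handy if you provision many agents and want a consistent checklist. This is only an illustrative sketch: the scenario keys, the `VoiceSetup` class, and the `recommend` function are made-up names, and the values restate the recommendations and model types listed in this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceSetup:
    features: tuple[str, ...]     # settings to enable
    model_types: tuple[str, ...]  # model types to configure under Settings > AI Models


# Hypothetical scenario keys; values restate the guidance in this section.
SETUPS = {
    "faq_support": VoiceSetup(("Speech to Text",), ("ASR",)),
    "accessibility": VoiceSetup(
        ("Speech to Text", "Text to Speech", "Message Read Aloud"), ("ASR", "TTS")
    ),
    "phone_agent": VoiceSetup(("Realtime Voice Processing with SIP",), ("Realtime Voice",)),
    "human_agent_support": VoiceSetup(("AI Listens",), ("Custom Realtime",)),
}


def recommend(scenario: str) -> VoiceSetup:
    """Look up the recommended setup for a scenario key."""
    try:
        return SETUPS[scenario]
    except KeyError:
        raise ValueError(f"unknown scenario {scenario!r}; expected one of {sorted(SETUPS)}") from None


print(recommend("accessibility").features)
```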