ASR Models

Configure Automatic Speech Recognition (speech-to-text) models to enable voice input for your agents.


Overview

ASR (Automatic Speech Recognition) models convert spoken language into text, allowing users to interact with your agents using voice. When ASR is enabled, a microphone button appears next to the text input area in the chat interface for push-to-talk voice input.

Microphone permission is requested only when the user actually tries to use the voice feature, reducing friction for users who don't need voice input.
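
The deferred permission request can be pictured with a minimal Python sketch. The class name `LazyMicPermission` and the `prompt_user` callback are hypothetical stand-ins, not part of the product; the sketch only illustrates that nothing is requested until the first voice attempt.

```python
from typing import Callable, Optional


class LazyMicPermission:
    """Hypothetical helper: asks for microphone access only on first use."""

    def __init__(self, prompt_user: Callable[[], bool]) -> None:
        self._prompt_user = prompt_user       # stand-in for the real permission prompt
        self._granted: Optional[bool] = None  # None means "never asked"

    def ensure(self) -> bool:
        # Ask once, on the first voice attempt, and remember the answer.
        if self._granted is None:
            self._granted = self._prompt_user()
        return self._granted


prompts_shown = []


def fake_prompt() -> bool:
    # Records that a prompt appeared and simulates the user allowing access.
    prompts_shown.append("mic")
    return True


permission = LazyMicPermission(fake_prompt)
typed_only = len(prompts_shown)   # a user who only types never sees a prompt
first_use = permission.ensure()   # the first mic press triggers the prompt
second_use = permission.ensure()  # later presses reuse the stored answer
print(typed_only, first_use, second_use, len(prompts_shown))  # 0 True True 1
```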

ASR Models are managed under Settings > AI Models > ASR Models.


Enabling ASR on an Agent

To enable speech-to-text for an agent:

  1. Navigate to Agent Settings > Audio and Speech Settings
  2. Enable the "Speech to Text" toggle
  3. Select an ASR model from the configured models in your workspace
  4. Click "Save"

How It Works — Push To Talk

Once ASR is enabled on an agent:

  1. A Mic button appears next to the text input area in the chat interface
  2. Users press and hold the Mic button while speaking their message
  3. The ASR model transcribes the spoken input into text
  4. The agent processes the transcribed text and generates a response

Push To Talk

Voice input is push-to-talk: users press and hold the Mic button while speaking. This gives users explicit control over when the agent is listening.
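
As a rough illustration of the press-and-hold behavior, here is a minimal Python sketch. `PushToTalkSession` and `fake_asr` are hypothetical names introduced for this example, and the real chat client may work differently; the point is only that audio is captured while the button is held and transcribed on release.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class PushToTalkSession:
    """Hypothetical model of one mic button: records only while held."""

    transcribe: Callable[[bytes], str]  # stand-in for the configured ASR model
    _recording: bool = False
    _chunks: List[bytes] = field(default_factory=list)

    def press(self) -> None:
        # Button pressed: start a fresh recording.
        self._recording = True
        self._chunks = []

    def on_audio(self, chunk: bytes) -> None:
        # Audio that arrives while the button is not held is discarded.
        if self._recording:
            self._chunks.append(chunk)

    def release(self) -> Optional[str]:
        # Button released: stop recording and transcribe what was captured.
        if not self._recording:
            return None
        self._recording = False
        audio = b"".join(self._chunks)
        return self.transcribe(audio) if audio else None


def fake_asr(audio: bytes) -> str:
    # Placeholder transcriber so the sketch runs without a real provider.
    return f"<{len(audio)} bytes transcribed>"


session = PushToTalkSession(transcribe=fake_asr)
session.on_audio(b"ignored")  # button not held yet, so this is dropped
session.press()
session.on_audio(b"hello ")
session.on_audio(b"agent")
text = session.release()
print(text)  # <11 bytes transcribed>
```

Because nothing is buffered outside the press/release window, the agent only ever receives audio the user deliberately recorded.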


ASR with TTS

When ASR is enabled alongside Text-to-Speech (TTS), the agent supports a full voice conversation flow:

  1. User speaks → ASR converts speech to text
  2. Agent processes the transcribed text
  3. Agent responds → TTS converts the response to speech and plays it back automatically

This creates a voice-first interaction experience: the user still holds the Mic button to speak, but replies are played back without any extra action.
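
The three steps above can be sketched as a simple function chain. This is an illustration under stated assumptions, not the platform's implementation: `voice_turn` and the three `fake_*` stages are hypothetical placeholders for the configured ASR model, the agent, and the TTS model.

```python
from typing import Callable


def voice_turn(
    audio: bytes,
    asr: Callable[[bytes], str],
    agent: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> bytes:
    """Run one spoken turn: speech in, speech out."""
    transcript = asr(audio)    # 1. speech to text
    reply = agent(transcript)  # 2. agent processes the transcript
    return tts(reply)          # 3. text back to speech for playback


# Placeholder stages so the sketch runs without any real provider.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_agent(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


spoken_reply = voice_turn(b"book a room", fake_asr, fake_agent, fake_tts)
print(spoken_reply)  # b'You said: book a room'
```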


Supported Providers

ASR models are configured at the workspace level under Settings > AI Models > ASR Models. The available models depend on the providers configured for your workspace.

Provider                          | Models / Notes
OpenAI                            | Latest OpenAI STT models
Azure OpenAI                      | Azure-hosted OpenAI STT
Azure AI Speech                   | Azure AI Speech Services
Google Cloud                      | Google Cloud Speech-to-Text
AssemblyAI                        | Supports Speech Model selection, including Universal Streaming Multilingual (Beta) for improved accuracy across diverse languages
Sarvam                            | Saaras v3, optimized for Indian languages
ElevenLabs                        | ElevenLabs speech-to-text
Speechmatics                      | Speechmatics ASR
Groq                              | Groq-hosted STT models
LiveKit Inference                 | Access various STT providers via a unified gateway using simple model ID strings. Requires a LiveKit Inference API credential under Settings > Credentials
OpenAI Whisper (Deprecated)       | Legacy OpenAI Whisper
Azure OpenAI Whisper (Deprecated) | Legacy Azure-hosted Whisper

Model Selection

Choose your ASR model based on language support, accuracy requirements, and latency needs. Some models perform better with specific accents or in noisy environments, so test candidates with audio that resembles your users' real conditions. For multilingual use cases, consider AssemblyAI's Universal Streaming Multilingual (Beta) speech model; for Indian languages, consider Sarvam.
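
The language guidance above could be encoded as a small rule of thumb. The sketch below is purely illustrative: `suggest_provider` and `INDIAN_LANGUAGE_CODES` are hypothetical, the language list is an assumption rather than any provider's documented coverage, and accuracy and latency still need to be compared by hand.

```python
from typing import Iterable, Optional

# Assumption for illustration only: a few ISO 639-1 codes for Indian languages.
# Check the provider's documentation for the languages it actually supports.
INDIAN_LANGUAGE_CODES = {"hi", "bn", "ta", "te", "mr", "gu", "kn", "ml", "pa"}


def suggest_provider(languages: Iterable[str]) -> Optional[str]:
    """Hypothetical rule of thumb mirroring the guidance above."""
    codes = {code.lower() for code in languages}
    if not codes:
        return None
    if codes <= INDIAN_LANGUAGE_CODES:
        return "Sarvam"      # every requested language is in the Indian set
    if len(codes) > 1:
        return "AssemblyAI"  # mixed or multilingual audience
    return None              # single other language: no specific suggestion


print(suggest_provider(["hi", "ta"]))  # Sarvam
print(suggest_provider(["en", "es"]))  # AssemblyAI
print(suggest_provider(["en"]))        # None
```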