Speech languages reference
This reference lists the languages supported for text-to-speech and speech recognition.
Text-to-speech languages
Infobip supports 77 distinct languages or dialects across 52 base languages for text-to-speech conversion, with 960 voices in total. Voices are available in three types:
- Standard
- Neural
- Generative
The following table describes the characteristics of each voice type and the use cases it suits best.
| | Standard | Neural | Generative |
|---|---|---|---|
| What it is | Speech generated by traditional synthesis methods that predate deep learning. Uses either recorded audio samples assembled into phrases (concatenative) or mathematical models processed through signal-processing algorithms (parametric). | Speech synthesized by deep neural networks trained on large datasets of human speech. Predicts prosody (intonation, rhythm, stress) and synthesizes voice simultaneously, producing more natural-sounding output. | Speech produced by large-scale generative AI models built on billion-parameter transformer architectures similar to those behind large language models (LLMs). |
| How it sounds | Clear and intelligible, but can sound noticeably synthetic, particularly in intonation, rhythm, and pauses. | More natural and fluid than standard voices, with smoother intonation and better handling of accents and emphasis. Some synthetic characteristics may still be perceptible. | The most natural and human-like quality available, close to indistinguishable from a real person. Produces conversational nuances such as natural pauses, emotional tone, and contextual emphasis. |
| Default voice | Each language includes a default voice, applied automatically when no voice name is specified. | No default voice. A voice name must always be specified. | No default voice. A voice name must always be specified. |
| Best suited for | High-volume, short, transactional prompts where cost efficiency is the priority and a synthetic tone is acceptable. Examples: OTP codes, routing announcements. | Most production voice scenarios where the voice represents your brand to customers. | Premium and conversational experiences where voice quality directly impacts the outcome. Examples: AI voice agents, contact center waiting strategies, brand-critical IVR, or any scenario where a robotic tone would undermine trust. |
| Simple rule of thumb | "Is this a short, automated message where a synthetic tone is acceptable?" | "Does this voice represent my brand to a customer?" | "Is this a conversation or experience where sounding human truly matters?" |
| Pricing | No charge | Charged per character | Charged per character (rate varies by provider) |
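The default-voice rule in the table can be checked before a request is sent. The following is a minimal sketch, not part of any Infobip SDK: the function names `requires_voice_name` and `check_voice_selection` are hypothetical, and the only rule encoded is the one stated in the table (standard voices have a per-language default, neural and generative voices do not).

```python
from typing import Optional

# Voice types named in the table. Only standard voices have a per-language default.
KNOWN_VOICE_TYPES = {"standard", "neural", "generative"}
VOICE_TYPES_WITH_DEFAULT = {"standard"}


def requires_voice_name(voice_type: str) -> bool:
    """Return True when the voice type has no default voice (hypothetical helper)."""
    kind = voice_type.strip().lower()
    if kind not in KNOWN_VOICE_TYPES:
        raise ValueError(f"unknown voice type: {voice_type!r}")
    return kind not in VOICE_TYPES_WITH_DEFAULT


def check_voice_selection(voice_type: str, voice_name: Optional[str]) -> None:
    """Raise ValueError if a voice name is required but missing."""
    if requires_voice_name(voice_type) and not voice_name:
        raise ValueError(f"{voice_type} voices have no default; specify a voice name")
```

Calling `check_voice_selection("standard", None)` passes silently, while `check_voice_selection("neural", None)` raises a `ValueError`.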
SSML support
Speech Synthesis Markup Language (SSML) lets you fine-tune text-to-speech synthesis. With SSML, you can add inflection, emphasis, pauses, and other speech characteristics to synthesized speech, so the output matches your requirements and sounds more natural to your users.
Infobip offers SSML support only for:
- Google standard and generative voices
- Amazon Polly standard voices
The following SSML tags are supported across Voice API products:
- <speak>: Identifies SSML-enhanced text
- <break>: Adds a pause
- <say-as>: Controls how special types of words are spoken
- <p>: Adds a pause between paragraphs
- <s>: Adds a pause between sentences
- <emphasis>: Emphasizes words
- <sub>: Substitutes pronunciation for acronyms and abbreviations
- <phoneme>: Specifies phonetic pronunciation
- <prosody>: Controls volume, speaking rate, and pitch
- <lang>: Switches pronunciation rules for a specific language segment. Quality depends on the selected voice, language pair, and TTS provider.
Infobip does not support other provider-specific SSML tags such as par, audio, or seq.
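One way to catch unsupported tags before submitting text is a local pre-check against the supported tag list. The sketch below is an illustration only: the helper name `unsupported_ssml_tags` is hypothetical, the tag set is copied from this page, and the check assumes the SSML is well-formed XML without namespaces (malformed input raises `xml.etree.ElementTree.ParseError`).

```python
import xml.etree.ElementTree as ET
from typing import List

# Tags listed as supported on this page.
SUPPORTED_TAGS = {
    "speak", "break", "say-as", "p", "s",
    "emphasis", "sub", "phoneme", "prosody", "lang",
}


def unsupported_ssml_tags(ssml: str) -> List[str]:
    """Return the sorted names of tags in `ssml` that are not in SUPPORTED_TAGS."""
    root = ET.fromstring(ssml)  # raises ParseError if the markup is malformed
    found = {element.tag for element in root.iter()}
    return sorted(found - SUPPORTED_TAGS)


if __name__ == "__main__":
    sample = '<speak>Hello<audio src="clip.mp3"/><break time="1s"/></speak>'
    print(unsupported_ssml_tags(sample))
```

An empty list means every tag in the text appears in the supported list. It does not guarantee that the selected voice or provider accepts the markup.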
The following example shows SSML-enhanced text in a Voice Message API request:
"text": "<speak>Hello. This is a test message sent from <emphasis level=\"strong\">Infobip Voice Message API</emphasis>. Your confirmation code is <say-as interpret-as=\"spell-out\">12345</say-as>.</speak>"
For the complete syntax of these SSML tags, see the official documentation of the respective provider (Google or Amazon Polly).
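The backslash-escaped quotes in the request example above come from embedding SSML attributes inside a JSON string. A minimal sketch of producing that escaping with a JSON serializer rather than by hand follows. Only the `text` field is taken from the example. The rest of the request body is omitted here and should be taken from the Voice Message API reference.

```python
import json

# SSML text from the example above, written with plain (unescaped) quotes.
ssml = (
    "<speak>Hello. This is a test message sent from "
    '<emphasis level="strong">Infobip Voice Message API</emphasis>. '
    "Your confirmation code is "
    '<say-as interpret-as="spell-out">12345</say-as>.</speak>'
)

# json.dumps escapes the inner double quotes, so the result is valid JSON.
fragment = json.dumps({"text": ssml})

if __name__ == "__main__":
    print(fragment)
```

Serializing the payload this way avoids missed or doubled backslashes when the SSML contains several attributes.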
For neural and generative voices, SSML markup counts toward the total number of characters submitted for synthesis and is therefore included in character-based charges.
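To see how much markup adds to the submitted character count, you can compare the full SSML string with its spoken text. This is a rough sketch under one assumption: that each character in the submitted string counts once. The exact counting rules (for example, whitespace handling) are not stated on this page, and the function names are hypothetical.

```python
import re

_TAG_PATTERN = re.compile(r"<[^>]+>")


def submitted_characters(ssml: str) -> int:
    """Characters submitted for synthesis, markup included (assumed 1 per character)."""
    return len(ssml)


def spoken_characters(ssml: str) -> int:
    """Characters left after stripping SSML tags, for comparison only."""
    return len(_TAG_PATTERN.sub("", ssml))


if __name__ == "__main__":
    sample = '<speak>Hi <break time="1s"/>there</speak>'
    overhead = submitted_characters(sample) - spoken_characters(sample)
    print(submitted_characters(sample), spoken_characters(sample), overhead)
```

In this sample the markup accounts for most of the submitted characters, so heavily tagged short prompts can cost noticeably more than their spoken length suggests.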
Available voices by language
Select a language to view the available voices, including voice name, gender, provider, and whether the voice is the default for that language.
Speech recognition languages
Infobip integrates with Google, Microsoft, and Deepgram (Flux model) for speech recognition. Microsoft Azure is used by default unless you specify a different provider in your transcription request.
The following table lists the language code to use when selecting a specific language in the speech-to-text request.