Speech languages reference

This reference lists the supported languages for text-to-speech and speech recognition.

Text-to-speech languages

Infobip supports 77 distinct languages or dialects across 52 base languages, with a total of 960 voices for text-to-speech conversion. Voices are available in three types:

Standard
Neural
Generative

The following table describes the characteristics of each voice type and the use cases each one suits best.

	Standard	Neural	Generative
What it is	Speech generated by traditional synthesis methods that predate deep learning. Uses either recorded audio samples assembled into phrases (concatenative) or mathematical models processed through signal-processing algorithms (parametric).	Speech synthesized by deep neural networks trained on large datasets of human speech. Predicts prosody (intonation, rhythm, stress) and synthesizes voice simultaneously, producing more natural-sounding output.	Speech produced by large-scale generative AI models built on billion-parameter transformer architectures similar to those behind large language models (LLMs).
How it sounds	Clear and intelligible, but can sound noticeably synthetic, particularly in intonation, rhythm, and pauses.	More natural and fluid than standard voices, with smoother intonation and better handling of accents and emphasis. Some synthetic characteristics may still be perceptible.	The most natural and human-like quality available, close to indistinguishable from a real person. Produces conversational nuances such as natural pauses, emotional tone, and contextual emphasis.
Default voice	Each language includes a default voice, applied automatically when no voice name is specified.	No default voice. A voice name must always be specified.	No default voice. A voice name must always be specified.
Best suited for	High-volume, short, transactional prompts where cost efficiency is the priority and a synthetic tone is acceptable. Examples: OTP codes, routing announcements.	Most production voice scenarios where the voice represents your brand to customers.	Premium and conversational experiences where voice quality directly impacts the outcome. Examples: AI voice agents, contact center waiting strategies, brand-critical IVR, or any scenario where a robotic tone would undermine trust.
Simple rule of thumb	"Is this a short, automated message where a synthetic tone is acceptable?"	"Does this voice represent my brand to a customer?"	"Is this a conversation or experience where sounding human truly matters?"
Pricing	No charge	Charged per character	Charged per character (rate varies by provider)
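
The default-voice rule in the table above can be sketched as a small pre-flight check. This is a minimal illustration only: `resolve_voice` is a hypothetical helper, not part of any Infobip SDK, and the voice name in the usage note is a placeholder.

```python
from typing import Optional

# Voice types named in the table above; only "standard" has a per-language default.
VOICE_TYPES = {"standard", "neural", "generative"}
TYPES_WITH_DEFAULT = {"standard"}


def resolve_voice(voice_type: str, voice_name: Optional[str] = None) -> str:
    """Return the voice name to use, or "default" when the language default applies.

    Hypothetical helper for illustration; not part of any Infobip SDK.
    """
    if voice_type not in VOICE_TYPES:
        raise ValueError(f"unknown voice type: {voice_type!r}")
    if voice_name:
        return voice_name
    if voice_type in TYPES_WITH_DEFAULT:
        return "default"
    # Neural and generative voices have no default, so a name is mandatory.
    raise ValueError(f"{voice_type} voices have no default; specify a voice name")
```

For example, `resolve_voice("standard")` falls back to the language default, while `resolve_voice("neural")` raises an error until a voice name such as the placeholder `"ExampleVoice"` is supplied.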

SSML support

Speech Synthesis Markup Language (SSML) lets you fine-tune text-to-speech synthesis. With SSML, you can add natural-sounding inflections, emphasis, pauses, and other speech characteristics to synthesized speech, so the output matches your requirements and sounds more lifelike to your users.

IMPORTANT

Infobip only offers SSML support for:

Google standard and generative voices
Amazon Polly standard voices

The following SSML tags are supported across Voice API products:

<speak> -- Identifies SSML-enhanced text
<break> -- Adds a pause
<say-as> -- Controls how special types of words are spoken
<p> -- Adds a pause between paragraphs
<s> -- Adds a pause between sentences
<emphasis> -- Emphasizes words
<sub> -- Substitutes pronunciation for acronyms and abbreviations
<phoneme> -- Specifies phonetic pronunciation
<prosody> -- Controls volume, speaking rate, and pitch
<lang> -- Switches pronunciation rules for a specific language segment. Quality depends on the selected voice, language pair, and TTS provider.

Infobip does not support other provider-specific SSML tags such as <par>, <audio>, or <seq>.
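
Before sending a request, it can help to check that an SSML string uses only the tags listed above. The following is a minimal sketch using Python's standard XML parser; `unsupported_tags` is a hypothetical helper name, and the check assumes the SSML is well-formed XML without namespace prefixes.

```python
import xml.etree.ElementTree as ET
from typing import Set

# Tags listed as supported in this reference.
SUPPORTED_TAGS = {
    "speak", "break", "say-as", "p", "s",
    "emphasis", "sub", "phoneme", "prosody", "lang",
}


def unsupported_tags(ssml: str) -> Set[str]:
    """Return the tag names used in `ssml` that are not in SUPPORTED_TAGS.

    Raises xml.etree.ElementTree.ParseError if `ssml` is not well-formed XML.
    """
    root = ET.fromstring(ssml)
    # iter() walks the root element and all of its descendants.
    return {element.tag for element in root.iter()} - SUPPORTED_TAGS
```

An empty result means every tag in the string appears in the supported list; a result such as `{"audio"}` points at the tag to remove.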

The following example shows SSML-enhanced text in a Voice Message API request:

"text": "<speak>Hello. This is a test message sent from <emphasis level=\"strong\">Infobip Voice Message API</emphasis>. Your confirmation code is <say-as interpret-as=\"spell-out\">12345</say-as>.</speak>"
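
The backslash-escaped quotes in the example above are ordinary JSON string escaping, so they do not need to be typed by hand. A minimal sketch, assuming the SSML is placed in the `text` field as shown; any other request fields are outside the scope of this example:

```python
import json

# SSML written naturally, with plain double quotes around attribute values.
ssml = (
    "<speak>Hello. Your confirmation code is "
    '<say-as interpret-as="spell-out">12345</say-as>.</speak>'
)

# json.dumps escapes the inner double quotes as \" automatically.
body_fragment = json.dumps({"text": ssml})
print(body_fragment)
```

Building the payload with a JSON serializer avoids the most common error here: an unescaped quote inside an SSML attribute that terminates the JSON string early.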

For the complete syntax of these SSML tags, see the official SSML documentation of the relevant provider (Google or Amazon Polly).

NOTE

When you use neural or generative voices, SSML markup counts toward the total number of characters submitted for synthesis, so it is included in per-character charges.
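
To see how much markup adds to the character count, compare the length of the full SSML string with the length of the text alone. This is a rough sketch: it assumes the billable count is simply the length of the submitted string, which may differ from how a given provider counts characters, and `char_counts` is a hypothetical helper name.

```python
import re
from typing import Tuple


def char_counts(ssml: str) -> Tuple[int, int]:
    """Return (total characters including markup, characters of text only)."""
    total = len(ssml)
    # Strip anything that looks like a tag to approximate the text-only length.
    text_only = len(re.sub(r"<[^>]+>", "", ssml))
    return total, text_only


total, text_only = char_counts('<speak>Hi <break time="1s"/>there</speak>')
print(f"submitted: {total} characters, of which text: {text_only}")
```

The difference between the two numbers is the overhead that tags contribute, which can be significant for short prompts with heavy markup.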

Available voices by language

Select a language to view the available voices, including voice name, gender, provider, and whether the voice is the default for that language.

Speech recognition languages

Infobip integrates with Google, Microsoft, and Deepgram (Flux model) for speech recognition. Microsoft Azure is used by default unless you specify a different provider in your transcription request.

The following table lists the language codes to use when selecting a specific language in a speech-to-text request.