2026-05-27·8 min read·By Petar

Multilingual Text to Speech API: 33 Languages with One REST Endpoint (2026)

How to use a multilingual text-to-speech API to generate audio in 33 languages with a single REST call. Covers app localization, global content pipelines, language learning, and pricing.

Building a product for a global audience means generating audio in the language your users actually speak. A multilingual TTS API lets you do that with one integration instead of juggling language-specific services.

This guide covers how the Audexum TTS API handles 33 languages, what switching between them looks like in code, and which use cases benefit most from multilingual synthesis.

Why Multilingual TTS Is Hard to Get Right

Most TTS providers support English well and treat everything else as an afterthought. Common problems:

Accent bleeding: a model trained mostly on English carries English pronunciation into other languages and misreads non-Latin scripts
Missing languages: support is listed on the homepage, but only a handful of voices actually work
No per-request language switching: you need a separate API key or endpoint per locale
Phoneme gaps: languages like Arabic, Japanese, and Hindi require specific grapheme-to-phoneme rules that cheap models skip

A production-ready multilingual API needs native models for each language, not a single model with a language flag.

Audexum's Approach

Audexum uses dedicated voice models per language rather than a single multilingual model. This keeps pronunciation accurate for non-Latin scripts (Japanese, Korean, Arabic) and for tonal languages without requiring any phoneme hints from the caller.

The API accepts a language parameter alongside voice_id. If you pass a voice trained on a specific language, you do not need to set language separately — it is inferred. For edge cases where the text mixes scripts, passing an explicit language code ensures correct tokenization.
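As a rough illustration of the mixed-script case, the sketch below inspects the text and only attaches an explicit language code when more than one script is present. The helpers scripts_in and build_payload are hypothetical, the script detection is a heuristic based on Unicode character names, and the JSON field name "language" is assumed from the parameter name mentioned above.

```python
import unicodedata


def scripts_in(text: str) -> set:
    """Return a rough set of script names for the letters in `text`."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            # Unicode names usually start with the script, for example
            # "LATIN SMALL LETTER A" or "HIRAGANA LETTER KO".
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def build_payload(text: str, voice_id: str, language: str) -> dict:
    """Build a request body, adding `language` only for mixed-script text."""
    payload = {"text": text, "voice_id": voice_id}
    if len(scripts_in(text)) > 1:
        # Assumed field name; check the API reference before relying on it.
        payload["language"] = language
    return payload


mixed = build_payload("こんにちは、Audexumへようこそ。", "ja_female_01", "ja")
plain = build_payload("Willkommen bei Audexum.", "de_female_01", "de")
```

Here mixed carries an explicit language because the sentence combines Hiragana with a Latin brand name, while plain relies on the voice alone.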

Supported Languages

Audexum supports 33 languages across 43 voices. The 20 most commonly used locales:

Language	BCP-47 Code	Voices Available	Script
English (American)	en-US	6	Latin
English (British)	en-GB	4	Latin
Spanish	es	4	Latin
French	fr	3	Latin
German	de	3	Latin
Italian	it	2	Latin
Portuguese	pt	2	Latin
Arabic	ar	2	Arabic
Hindi	hi	2	Devanagari
Japanese	ja	3	Kanji/Kana
Korean	ko	2	Hangul
Bulgarian	bg	2	Cyrillic
Russian	ru	2	Cyrillic
Polish	pl	1	Latin
Dutch	nl	1	Latin
Turkish	tr	1	Latin
Swedish	sv	1	Latin
Romanian	ro	1	Latin
Vietnamese	vi	1	Latin
Indonesian	id	1	Latin

The full list of 33 languages is available via the /api/v1/voices endpoint.
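When a user's locale is more specific than the codes in the voice list, a fallback from the regional tag to the base language is usually enough. The following is a minimal sketch, assuming the voices endpoint returns a list of objects with voice_id and language keys holding BCP-47 codes like those in the table. SAMPLE_VOICES is made-up data, the ID en_gb_female_01 is hypothetical, and voices_for_locale is not part of the API.

```python
# Hypothetical sample of what /api/v1/voices might return.
SAMPLE_VOICES = [
    {"voice_id": "en_us_female_01", "language": "en-US"},
    {"voice_id": "en_gb_female_01", "language": "en-GB"},
    {"voice_id": "es_female_01", "language": "es"},
    {"voice_id": "ja_female_01", "language": "ja"},
]


def voices_for_locale(voices: list, locale: str) -> list:
    """Return voice IDs for `locale`, falling back to the base language."""
    wanted = locale.lower()
    base = wanted.split("-")[0]

    # Prefer an exact match such as "en-GB".
    exact = [v["voice_id"] for v in voices if v["language"].lower() == wanted]
    if exact:
        return exact

    # Otherwise accept any voice that shares the base language ("es-MX" -> "es").
    return [
        v["voice_id"]
        for v in voices
        if v["language"].lower().split("-")[0] == base
    ]
```

An empty result means no voice covers the language at all, which is the point at which an application would fall back to a default voice or skip audio.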

Code Examples

List voices by language

python

import requests

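API_KEY = "sk_live_abc123xyz"  # placeholder; substitute your own key

response = requests.get(
    "https://audexum.com/api/v1/voices",
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,  # avoid hanging indefinitely on a stalled connection
)
response.raise_for_status()  # surface auth or server errors before parsing
voices = response.json()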

# Group by language
by_language = {}
for v in voices:
    by_language.setdefault(v["language"], []).append(v["voice_id"])

for lang, ids in sorted(by_language.items()):
    print(f"{lang:10s}: {', '.join(ids)}")

Synthesize in a specific language

python

import requests

API_KEY = "sk_live_abc123xyz"
API_URL = "https://audexum.com/api/v1/tts"

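def synthesize(text: str, voice_id: str, output_file: str) -> None:
    """Synthesize `text` with the given voice and write the audio to disk."""
    response = requests.post(
        API_URL,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        json={"text": text, "voice_id": voice_id},
        timeout=60,  # do not wait forever on a stalled request
    )
    # Raise on 4xx/5xx so an error body is never saved as audio
    response.raise_for_status()
    with open(output_file, "wb") as f:
        f.write(response.content)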

# Japanese
synthesize(
    text="こんにちは、Audexumへようこそ。",
    voice_id="ja_female_01",
    output_file="welcome_ja.wav",
)

# Arabic
synthesize(
    text="مرحبًا بك في Audexum.",
    voice_id="ar_female_01",
    output_file="welcome_ar.wav",
)

# German
synthesize(
    text="Willkommen bei Audexum.",
    voice_id="de_female_01",
    output_file="welcome_de.wav",
)

Batch localization pipeline

This pattern is useful when you maintain a string table and need to generate audio assets for every locale:

python

import requests
from pathlib import Path

API_KEY = "sk_live_abc123xyz"
API_URL = "https://audexum.com/api/v1/tts"

STRINGS = {
    "en_us_female_01": "Your order has been confirmed.",
    "es_female_01":    "Tu pedido ha sido confirmado.",
    "fr_female_01":    "Votre commande a été confirmée.",
    "de_female_01":    "Ihre Bestellung wurde bestätigt.",
    "ja_female_01":    "ご注文が確定しました。",
    "ko_female_01":    "주문이 확정되었습니다.",
    "ar_female_01":    "تم تأكيد طلبك.",
    "hi_female_01":    "आपका ऑर्डर कन्फर्म हो गया है।",
}

output_dir = Path("audio_assets")
output_dir.mkdir(exist_ok=True)

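for voice_id, text in STRINGS.items():
    # "ja_female_01" -> "ja". Regional variants such as "en_us_..." and
    # "en_gb_..." would both map to "en", so keep a longer prefix if you
    # ship more than one variant of a language.
    lang_code = voice_id.split("_")[0]
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"text": text, "voice_id": voice_id},
        timeout=60,  # do not wait forever on a stalled request
    )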
    if response.ok:
        out = output_dir / f"order_confirmed_{lang_code}.wav"
        out.write_bytes(response.content)
        print(f"Wrote {out}")
    else:
        print(f"Failed {voice_id}: {response.status_code}")

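Running this against 8 locales produces 8 WAV files in under 10 seconds. The eight strings total 188 characters when counted as Unicode code points, so the run costs roughly 190 characters of your quota if billing is per character.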
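Before running a larger string table, it can help to estimate how much quota the job will use. The sketch below assumes one credit per Unicode code point, which is an assumption about billing rather than a documented rule, and estimate_credits is a hypothetical helper.

```python
def estimate_credits(strings: dict) -> int:
    """Sum the length of every string, counting Unicode code points."""
    return sum(len(text) for text in strings.values())


sample = {
    "en_us_female_01": "Your order has been confirmed.",  # 30 characters
    "ja_female_01": "ご注文が確定しました。",  # 11 characters
}
total = estimate_credits(sample)  # 41 under the per-code-point assumption
```

Note how the Japanese string costs about a third of the English one for the same message, so per-locale costs can differ noticeably within one table.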

Use Cases

App Localization

Mobile and web apps increasingly use voice as a UI layer — onboarding narration, error messages read aloud, accessibility mode. Generating audio in a user's locale at build time (rather than at runtime) eliminates latency and works offline.

The batch pattern above fits this case: you maintain a translation file, run the synthesis job as part of your CI pipeline, and ship the audio assets alongside your app bundle.
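In CI it is wasteful to re-synthesize strings that have not changed since the last build. One possible approach, sketched here under the assumption that each voice_id maps to a single string as in the batch script, is to keep a small JSON manifest of content hashes and only send new or changed entries to the API. The manifest file and the helper names are illustrative, not part of Audexum.

```python
import hashlib
import json
from pathlib import Path


def content_key(voice_id: str, text: str) -> str:
    """Hash voice and text together so a change to either invalidates the asset."""
    return hashlib.sha256(f"{voice_id}\n{text}".encode("utf-8")).hexdigest()


def stale_entries(strings: dict, manifest_path: Path) -> dict:
    """Return the voice_id -> text pairs that are new or changed since the last run."""
    manifest = {}
    if manifest_path.exists():
        manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    return {
        voice_id: text
        for voice_id, text in strings.items()
        if manifest.get(voice_id) != content_key(voice_id, text)
    }


def save_manifest(strings: dict, manifest_path: Path) -> None:
    """Record hashes for everything that was synthesized successfully."""
    manifest = {vid: content_key(vid, text) for vid, text in strings.items()}
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
```

A build step would call stale_entries first, synthesize only what it returns, and then call save_manifest with the full table once every request has succeeded.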

Global Content Pipelines

Podcasts, YouTube videos, and e-learning courses are expensive to record in multiple languages. TTS-generated audio for translated scripts cuts production cost significantly while keeping quality acceptable for non-flagship content.

A typical workflow: translate the script with a translation API, synthesize with Audexum, mix with the original music bed. The result is publishable content at a fraction of studio recording cost.

Language Learning Apps

Language learning requires clear, accurate pronunciation — exactly what native-model TTS provides. Generating audio on demand (rather than pre-recording a fixed word list) lets you cover arbitrary vocabulary and sentence construction without a recording studio.

Audexum's phoneme accuracy for non-Latin scripts (Arabic, Japanese, Korean, Devanagari) makes it viable for vocabulary drill audio where mispronunciation would actively harm the learner.

Voice Assistants and Chatbots

Bots serving international audiences need to respond in the user's language. Once the bot has detected that language, it only has to look up the matching voice_id and pass it in the same TTS call; no separate endpoint or API key is involved.
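For a web-based bot, one simple language signal is the browser's Accept-Language header. The sketch below parses that header, ranks the tags by their q-values, and picks the first language that has a voice. The VOICE_BY_LANGUAGE table is a hypothetical mapping built from voice IDs used earlier in this guide, and the choice of Accept-Language as the detection method is an assumption; a chat platform might expose the user's locale differently.

```python
# Hypothetical mapping from base language to a preferred voice.
VOICE_BY_LANGUAGE = {
    "en": "en_us_female_01",
    "de": "de_female_01",
    "ja": "ja_female_01",
}


def parse_accept_language(header: str) -> list:
    """Return the language tags in an Accept-Language header, best first."""
    ranked = []
    for position, part in enumerate(header.split(",")):
        tag, _, params = part.strip().partition(";")
        tag = tag.strip().lower()
        if not tag:
            continue
        quality = 1.0  # tags without a q-value default to 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        if quality > 0:  # q=0 means "not acceptable"
            ranked.append((-quality, position, tag))
    # Sort by descending quality, then by original order.
    return [tag for _, _, tag in sorted(ranked)]


def pick_voice(header: str, fallback: str = "en_us_female_01") -> str:
    """Choose a voice_id for the first supported language in the header."""
    for tag in parse_accept_language(header):
        base = tag.split("-")[0]
        if base in VOICE_BY_LANGUAGE:
            return VOICE_BY_LANGUAGE[base]
    return fallback
```

The returned voice_id can then be passed straight into the synthesis request for that user.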

Pricing

All 33 languages are available on every plan — there is no language surcharge.

Plan	Credits/mo	Price	Effective price per 1M credits
Free	30,000	€0	—
Starter	250,000	€4	€16
Pro	1,200,000	€12	€10
Scale	4,000,000	€30	€7.50
PAYG	Unlimited	€20/1M	€20

For comparison, ElevenLabs charges $11+/1M chars and OpenAI TTS charges $15/1M; note that these figures are in US dollars while Audexum prices are in euros. Audexum plan subscribers get effective rates from €7.50/1M (Scale) to €16/1M (Starter), with a PAYG list rate of €20/1M for one-off purchases. Audexum is also the only provider with an ongoing free tier and unified dictation credits.
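The effective rates in the table follow from dividing each plan's price by its monthly credits. The sketch below reproduces that arithmetic and compares options for a given monthly volume. It assumes one credit per character, PAYG billed pro rata with no minimum purchase, and no rollover of unused credits; none of these billing details are confirmed here, and the function names are illustrative.

```python
# Plan name -> (monthly credits, monthly price in EUR), taken from the table.
PLANS = {
    "Free": (30_000, 0.0),
    "Starter": (250_000, 4.0),
    "Pro": (1_200_000, 12.0),
    "Scale": (4_000_000, 30.0),
}
PAYG_PER_MILLION = 20.0


def effective_rate(credits: int, price_eur: float) -> float:
    """Price per one million credits for a plan."""
    return price_eur * 1_000_000 / credits


def cheapest_option(monthly_chars: int) -> tuple:
    """Return (name, monthly cost) of the cheapest way to cover the volume."""
    # PAYG always covers the volume; cost scales with usage (assumed pro rata).
    options = {"PAYG": PAYG_PER_MILLION * monthly_chars / 1_000_000}
    for name, (credits, price) in PLANS.items():
        if monthly_chars <= credits:
            options[name] = price
    return min(options.items(), key=lambda item: item[1])


starter_rate = effective_rate(*PLANS["Starter"])  # 16.0
scale_rate = effective_rate(*PLANS["Scale"])  # 7.5
```

Under these assumptions, a workload of one million characters per month is cheapest on Pro, while a workload of 100,000 characters is cheaper on PAYG than on Starter.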

The free tier (30K credits/month, no card required) is enough to test synthesis in every language you need before committing to a paid plan.

Getting Started

1. Sign up at audexum.com/signup (no credit card required for the free tier).
2. Copy your sk_live_ key from the dashboard.
3. Call /api/v1/voices to get the current voice list and pick your target voice_id values.
4. Plug those IDs into the batch script above.

The full API reference, including phoneme override syntax for edge cases, is at audexum.com/docs.

Related: Text to speech API Python tutorial
Related: Cheapest text-to-speech API in 2026
Related: TTS API for Discord bots

By Petar, founder of Audexum. Building multilingual TTS that actually handles non-Latin scripts correctly.