Multilingual Text to Speech API: 33 Languages with One REST Endpoint (2026)
How to use a multilingual text-to-speech API to generate audio in 33 languages with a single REST call. Covers app localization, global content pipelines, language learning, and pricing.
Building a product for a global audience means generating audio in the language your users actually speak. A multilingual TTS API lets you do that with one integration instead of juggling language-specific services.
This guide covers how the Audexum TTS API handles 33 languages, what switching between them looks like in code, and which use cases benefit most from multilingual synthesis.
Why Multilingual TTS Is Hard to Get Right
Most TTS providers support English well and treat everything else as an afterthought. Common problems:
- Accent bleeding: the model was trained on English and mispronounces text written in non-Latin scripts
- Missing languages: support is listed on the homepage, but only a handful of voices actually work
- No per-request language switching: you need a separate API key or endpoint per locale
- Phoneme gaps: languages like Arabic, Japanese, and Hindi require specific grapheme-to-phoneme rules that cheap models skip
A production-ready multilingual API needs native models for each language, not a single model with a language flag.
Audexum's Approach
Audexum uses dedicated voice models per language rather than a single multilingual model. This means pronunciation stays accurate for languages written in non-Latin scripts (Japanese, Korean, Arabic) and for tonal languages such as Vietnamese, without requiring any phoneme hints from the caller.
The API accepts a language parameter alongside voice_id. If you pass a voice trained on a specific language, you do not need to set language separately — it is inferred. For edge cases where the text mixes scripts, passing an explicit language code ensures correct tokenization.
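As a rough illustration of the inference rule above, the sketch below builds the JSON body for a TTS request and includes language only when the caller supplies one. The helper name build_tts_payload is hypothetical, and it is an assumption that language takes the BCP-47 codes listed in the Supported Languages table (for example ja); check the API reference for the accepted values.

```python
from typing import Optional


def build_tts_payload(text: str, voice_id: str, language: Optional[str] = None) -> dict:
    """Build the JSON body for a TTS request (hypothetical helper)."""
    payload = {"text": text, "voice_id": voice_id}
    # Send `language` only when the caller sets it; otherwise the API is
    # described as inferring the language from the voice.
    if language is not None:
        payload["language"] = language  # assumed to be a BCP-47 code such as "ja"
    return payload


# Single-language text: rely on inference from the voice.
plain = build_tts_payload("こんにちは。", "ja_female_01")

# Mixed-script text: pin the language explicitly.
mixed = build_tts_payload("Audexumへようこそ。Welcome!", "ja_female_01", language="ja")
```

The resulting dictionary can be passed as the json argument of requests.post, in the same way as the examples later in this guide.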
Supported Languages
Audexum supports 33 languages across 43 voices. The 20 most commonly used:
| Language | BCP-47 Code | Voices Available | Script |
|---|---|---|---|
| English (American) | en-US | 6 | Latin |
| English (British) | en-GB | 4 | Latin |
| Spanish | es | 4 | Latin |
| French | fr | 3 | Latin |
| German | de | 3 | Latin |
| Italian | it | 2 | Latin |
| Portuguese | pt | 2 | Latin |
| Arabic | ar | 2 | Arabic |
| Hindi | hi | 2 | Devanagari |
| Japanese | ja | 3 | Kanji and kana |
| Korean | ko | 2 | Hangul |
| Bulgarian | bg | 2 | Cyrillic |
| Russian | ru | 2 | Cyrillic |
| Polish | pl | 1 | Latin |
| Dutch | nl | 1 | Latin |
| Turkish | tr | 1 | Latin |
| Swedish | sv | 1 | Latin |
| Romanian | ro | 1 | Latin |
| Vietnamese | vi | 1 | Latin |
| Indonesian | id | 1 | Latin |
The full list of 33 languages is available via the /api/v1/voices endpoint.
Code Examples
List voices by language
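```python
import requests

API_KEY = "sk_live_abc123xyz"  # placeholder; use your own key

# Fetch every voice available to this account
response = requests.get(
    "https://audexum.com/api/v1/voices",
    headers={"Authorization": f"Bearer {API_KEY}"},
)
response.raise_for_status()  # fail early instead of iterating over an error body
voices = response.json()

# Group voice IDs by language
by_language = {}
for v in voices:
    by_language.setdefault(v["language"], []).append(v["voice_id"])

for lang, ids in sorted(by_language.items()):
    print(f"{lang:10s}: {', '.join(ids)}")
```
Synthesize in a specific language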
```python
import requests

API_KEY = "sk_live_abc123xyz"
API_URL = "https://audexum.com/api/v1/tts"


def synthesize(text: str, voice_id: str, output_file: str):
    response = requests.post(
        API_URL,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        json={"text": text, "voice_id": voice_id},
    )
    response.raise_for_status()
    with open(output_file, "wb") as f:
        f.write(response.content)


# Japanese
synthesize(
    text="こんにちは、Audexumへようこそ。",
    voice_id="ja_female_01",
    output_file="welcome_ja.wav",
)

# Arabic
synthesize(
    text="مرحبًا بك في Audexum.",
    voice_id="ar_female_01",
    output_file="welcome_ar.wav",
)

# German
synthesize(
    text="Willkommen bei Audexum.",
    voice_id="de_female_01",
    output_file="welcome_de.wav",
)
```
Batch localization pipeline
This pattern is useful when you maintain a string table and need to generate audio assets for every locale:
```python
import requests
from pathlib import Path

API_KEY = "sk_live_abc123xyz"
API_URL = "https://audexum.com/api/v1/tts"

# One entry per locale: voice_id -> translated string
STRINGS = {
    "en_us_female_01": "Your order has been confirmed.",
    "es_female_01": "Tu pedido ha sido confirmado.",
    "fr_female_01": "Votre commande a été confirmée.",
    "de_female_01": "Ihre Bestellung wurde bestätigt.",
    "ja_female_01": "ご注文が確定しました。",
    "ko_female_01": "주문이 확정되었습니다.",
    "ar_female_01": "تم تأكيد طلبك.",
    "hi_female_01": "आपका ऑर्डर कन्फर्म हो गया है।",
}

output_dir = Path("audio_assets")
output_dir.mkdir(exist_ok=True)

for voice_id, text in STRINGS.items():
    # The voice ID starts with the language code, e.g. "ja_female_01" -> "ja"
    lang_code = voice_id.split("_")[0]
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"text": text, "voice_id": voice_id},
    )
    if response.ok:
        out = output_dir / f"order_confirmed_{lang_code}.wav"
        out.write_bytes(response.content)
        print(f"Wrote {out}")
    else:
        print(f"Failed {voice_id}: {response.status_code}")
```
Running this against 8 locales produces 8 WAV files in under 10 seconds. The eight strings total about 190 characters, so the run uses only a small part of your quota.
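To estimate quota use before running a batch, one option is to sum the string lengths up front. This is a minimal sketch that assumes the quota is counted in Unicode code points, which is what Python's len returns; this guide does not say how Audexum counts characters, so treat the result as an estimate. The function name estimate_characters is illustrative.

```python
def estimate_characters(strings: dict) -> int:
    """Sum the length of every text in a voice_id -> text table."""
    # len() counts Unicode code points; the actual billing unit is an assumption.
    return sum(len(text) for text in strings.values())


sample = {
    "en_us_female_01": "Your order has been confirmed.",
    "es_female_01": "Tu pedido ha sido confirmado.",
}
print(estimate_characters(sample))
```

Passing the full STRINGS table from the batch script gives the total for the whole run.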
Use Cases
App Localization
Mobile and web apps increasingly use voice as a UI layer — onboarding narration, error messages read aloud, accessibility mode. Generating audio in a user's locale at build time (rather than at runtime) eliminates latency and works offline.
The batch pattern above fits this case: you maintain a translation file, run the synthesis job as part of your CI pipeline, and ship the audio assets alongside your app bundle.
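One way to keep such a CI job cheap is to re-synthesize only the strings whose text changed since the last run. The sketch below is an assumption about how you might structure that and is not part of the Audexum API: it keeps a JSON manifest of content hashes next to the audio assets. The manifest file and all function names are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def content_hash(voice_id: str, text: str) -> str:
    """Hash voice and text together so a change to either triggers a rebuild."""
    return hashlib.sha256(f"{voice_id}\n{text}".encode("utf-8")).hexdigest()


def strings_to_rebuild(strings: dict, manifest_path: Path) -> dict:
    """Return the voice_id -> text entries whose hash differs from the manifest."""
    old = {}
    if manifest_path.exists():
        old = json.loads(manifest_path.read_text(encoding="utf-8"))
    return {
        voice_id: text
        for voice_id, text in strings.items()
        if old.get(voice_id) != content_hash(voice_id, text)
    }


def save_manifest(strings: dict, manifest_path: Path) -> None:
    """Record hashes for the full string table after a successful run."""
    hashes = {voice_id: content_hash(voice_id, text) for voice_id, text in strings.items()}
    manifest_path.write_text(json.dumps(hashes, indent=2), encoding="utf-8")
```

In a pipeline you would call strings_to_rebuild first, synthesize only the returned entries, and call save_manifest with the complete table once every request has succeeded.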
Global Content Pipelines
Podcasts, YouTube videos, and e-learning courses are expensive to record in multiple languages. TTS-generated audio for translated scripts cuts production cost significantly while keeping quality acceptable for non-flagship content.
A typical workflow: translate the script with a translation API, synthesize with Audexum, mix with the original music bed. The result is publishable content at a fraction of studio recording cost.
Language Learning Apps
Language learning requires clear, accurate pronunciation — exactly what native-model TTS provides. Generating audio on demand (rather than pre-recording a fixed word list) lets you cover arbitrary vocabulary and sentence construction without a recording studio.
Audexum's phoneme accuracy for languages written in non-Latin scripts (Arabic, Japanese, Korean, Hindi) makes it viable for vocabulary drill audio, where mispronunciation would actively harm the learner.
Voice Assistants and Chatbots
Bots serving international audiences need to respond in the user's language. Once the bot has detected that language, it only needs to select the matching voice_id for each request; the rest of the TTS call stays the same.
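The voice selection step can be sketched as a small lookup. Language detection itself is out of scope here; the sketch assumes you already have a locale tag such as es-MX from your detection step. The voice IDs are the ones used in this guide's examples, while the mapping, the fallback voice, and the pick_voice name are illustrative choices.

```python
# Voice IDs taken from the examples in this guide; the mapping is illustrative.
VOICE_BY_LANGUAGE = {
    "en": "en_us_female_01",
    "es": "es_female_01",
    "fr": "fr_female_01",
    "de": "de_female_01",
    "ja": "ja_female_01",
    "ko": "ko_female_01",
    "ar": "ar_female_01",
    "hi": "hi_female_01",
}
DEFAULT_VOICE = "en_us_female_01"  # assumed fallback for unmapped languages


def pick_voice(locale: str) -> str:
    """Map a detected locale such as "es-MX" or "ja_JP" to a voice_id."""
    # Keep only the primary language subtag: "es-MX" -> "es".
    primary = locale.replace("_", "-").split("-")[0].lower()
    return VOICE_BY_LANGUAGE.get(primary, DEFAULT_VOICE)
```

Falling back to a default voice keeps the bot responsive for unmapped languages, though you may prefer to reply in text only when no matching voice exists.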
Pricing
All 33 languages are available on every plan — there is no language surcharge.
| Plan | Characters/mo | Price | Cost per 1M chars |
|---|---|---|---|
| Free | 10,000 | €0 | — |
| Starter | 100,000 | €4 | €40 |
| Pro | 500,000 | €12 | €24 |
| Scale | 2,000,000 | €30 | €15 |
| PAYG | Unlimited | €8/1M | €8 |
For comparison, ElevenLabs charges approximately $11/1M chars (multilingual add-on required on some plans) and OpenAI TTS charges $15/1M chars. Audexum's PAYG rate of €8/1M chars undercuts both, and Audexum is the only one of the three that combines an ongoing free tier with bundled dictation.
The free tier (10K chars/month, no card required) is enough to test synthesis in every language you need before committing to a paid plan.
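To compare the options in the pricing table for a given monthly volume, the arithmetic can be sketched as follows. The plan figures are copied from the table; the rest is an assumption: a plan qualifies only if its quota covers the whole volume (no overage billing), and PAYG scales linearly with no minimum charge. The cheapest_option name is illustrative.

```python
# Plan data copied from the pricing table: (name, characters per month, price in EUR).
PLANS = [
    ("Free", 10_000, 0.0),
    ("Starter", 100_000, 4.0),
    ("Pro", 500_000, 12.0),
    ("Scale", 2_000_000, 30.0),
]
PAYG_EUR_PER_MILLION = 8.0


def cheapest_option(chars_per_month: int) -> tuple:
    """Return (name, monthly cost in EUR) of the cheapest option covering the volume."""
    # Assumption: no overage billing on plans, no minimum charge on PAYG.
    options = [(name, price) for name, quota, price in PLANS if quota >= chars_per_month]
    options.append(("PAYG", chars_per_month / 1_000_000 * PAYG_EUR_PER_MILLION))
    return min(options, key=lambda option: option[1])


print(cheapest_option(5_000))
print(cheapest_option(300_000))
```

Under these assumptions PAYG comes out cheapest for any volume above the free tier, because its €8 per 1M rate is below every plan's effective rate in the table. Check the current terms before relying on that result.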
Getting Started
- Sign up at audexum.com/signup (no credit card required for the free tier)
- Copy your sk_live_ key from the dashboard
- Call /api/v1/voices to get the current voice list and pick your target voice_id values
- Plug those IDs into the batch script above
The full API reference, including phoneme override syntax for edge cases, is at audexum.com/docs.
By Petar, founder of Audexum. Building multilingual TTS that actually handles non-Latin scripts correctly.