By Petar · 8 min read

Multilingual Text to Speech API: 33 Languages with One REST Endpoint (2026)

How to use a multilingual text-to-speech API to generate audio in 33 languages with a single REST call. Covers app localization, global content pipelines, language learning, and pricing.


Building a product for a global audience means generating audio in the language your users actually speak. A multilingual TTS API lets you do that with one integration instead of juggling language-specific services.

This guide covers how the Audexum TTS API handles 33 languages, what switching between them looks like in code, and which use cases benefit most from multilingual synthesis.

Why Multilingual TTS Is Hard to Get Right

Most TTS providers support English well and treat everything else as an afterthought. Common problems:

  • Accent bleeding: a model trained mostly on English mispronounces other languages, especially those written in non-Latin scripts
  • Missing languages: support is listed on the homepage, but only a handful of voices actually work
  • No per-request language switching: you need a separate API key or endpoint per locale
  • Phoneme gaps: languages like Arabic, Japanese, and Hindi require specific grapheme-to-phoneme rules that cheap models skip

A production-ready multilingual API needs native models for each language, not a single model with a language flag.

Audexum's Approach

Audexum uses dedicated voice models per language rather than a single multilingual model. This means pronunciation is accurate for non-Latin scripts (Japanese, Korean, Arabic) and tonal languages without requiring any phoneme hints from the caller.

The API accepts a language parameter alongside voice_id. If you pass a voice trained on a specific language, you do not need to set language separately — it is inferred. For edge cases where the text mixes scripts, passing an explicit language code ensures correct tokenization.

Supported Languages

Audexum supports 33 languages across 43 voices. The 20 most commonly used:

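| Language | BCP-47 Code | Voices Available | Script |
|---|---|---|---|
| English (American) | en-US | 6 | Latin |
| English (British) | en-GB | 4 | Latin |
| Spanish | es | 4 | Latin |
| French | fr | 3 | Latin |
| German | de | 3 | Latin |
| Italian | it | 2 | Latin |
| Portuguese | pt | 2 | Latin |
| Arabic | ar | 2 | Arabic |
| Hindi | hi | 2 | Devanagari |
| Japanese | ja | 3 | CJK |
| Korean | ko | 2 | Hangul |
| Bulgarian | bg | 2 | Cyrillic |
| Russian | ru | 2 | Cyrillic |
| Polish | pl | 1 | Latin |
| Dutch | nl | 1 | Latin |
| Turkish | tr | 1 | Latin |
| Swedish | sv | 1 | Latin |
| Romanian | ro | 1 | Latin |
| Vietnamese | vi | 1 | Latin |
| Indonesian | id | 1 | Latin |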

The full list of 33 languages is available via the /api/v1/voices endpoint.

Code Examples

List voices by language

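```python
import requests

API_KEY = "sk_live_abc123xyz"  # placeholder; use your own key

response = requests.get(
    "https://audexum.com/api/v1/voices",
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,  # avoid hanging forever on a stalled connection
)
response.raise_for_status()  # fail on 4xx/5xx instead of parsing an error body
voices = response.json()

# Group voice IDs by language
by_language = {}
for v in voices:
    by_language.setdefault(v["language"], []).append(v["voice_id"])

for lang, ids in sorted(by_language.items()):
    print(f"{lang:10s}: {', '.join(ids)}")
```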

Synthesize in a specific language

```python
import requests

API_KEY = "sk_live_abc123xyz"  # placeholder; use your own key
API_URL = "https://audexum.com/api/v1/tts"

def synthesize(text: str, voice_id: str, output_file: str):
    """Synthesize `text` with the given voice and write the audio to `output_file`."""
    response = requests.post(
        API_URL,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        json={"text": text, "voice_id": voice_id},
        timeout=60,  # avoid hanging forever on a stalled connection
    )
    response.raise_for_status()
    # The response body is the raw audio
    with open(output_file, "wb") as f:
        f.write(response.content)

# Japanese
synthesize(
    text="こんにちは、Audexumへようこそ。",
    voice_id="ja_female_01",
    output_file="welcome_ja.wav",
)

# Arabic
synthesize(
    text="مرحبًا بك في Audexum.",
    voice_id="ar_female_01",
    output_file="welcome_ar.wav",
)

# German
synthesize(
    text="Willkommen bei Audexum.",
    voice_id="de_female_01",
    output_file="welcome_de.wav",
)
```

Batch localization pipeline

This pattern is useful when you maintain a string table and need to generate audio assets for every locale:

```python
import requests
from pathlib import Path

API_KEY = "sk_live_abc123xyz"  # placeholder; use your own key
API_URL = "https://audexum.com/api/v1/tts"

# One entry per locale: voice_id -> translated string
STRINGS = {
    "en_us_female_01": "Your order has been confirmed.",
    "es_female_01":    "Tu pedido ha sido confirmado.",
    "fr_female_01":    "Votre commande a été confirmée.",
    "de_female_01":    "Ihre Bestellung wurde bestätigt.",
    "ja_female_01":    "ご注文が確定しました。",
    "ko_female_01":    "주문이 확정되었습니다.",
    "ar_female_01":    "تم تأكيد طلبك.",
    "hi_female_01":    "आपका ऑर्डर कन्फर्म हो गया है।",
}

output_dir = Path("audio_assets")
output_dir.mkdir(exist_ok=True)

for voice_id, text in STRINGS.items():
    # "ja_female_01" -> "ja"; two voices sharing a prefix would overwrite each other
    lang_code = voice_id.split("_")[0]
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"text": text, "voice_id": voice_id},
        timeout=60,  # avoid hanging forever on a stalled connection
    )
    if response.ok:
        out = output_dir / f"order_confirmed_{lang_code}.wav"
        out.write_bytes(response.content)
        print(f"Wrote {out}")
    else:
        # Keep going so one failed locale does not block the rest
        print(f"Failed {voice_id}: {response.status_code}")
```

Running this against 8 locales produces 8 WAV files in under 10 seconds and costs roughly 190 characters of your quota, the combined length of the eight strings.
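To estimate quota usage before running a batch, you can sum the string lengths locally. This is a rough sketch: `quota_characters` is a hypothetical helper, and it assumes the quota counts one character per Unicode code point, which this article does not confirm.

```python
def quota_characters(strings: dict[str, str]) -> int:
    """Total length of all strings, counting one character per Unicode code point."""
    return sum(len(text) for text in strings.values())


# A two-locale excerpt of a voice_id -> text table
sample = {
    "en_us_female_01": "Your order has been confirmed.",
    "de_female_01": "Ihre Bestellung wurde bestätigt.",
}
print(f"Estimated quota usage: {quota_characters(sample)} characters")
```

Passing the full `STRINGS` table from the batch script gives the estimate for the whole run.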

Use Cases

App Localization

Mobile and web apps increasingly use voice as a UI layer — onboarding narration, error messages read aloud, accessibility mode. Generating audio in a user's locale at build time (rather than at runtime) eliminates latency and works offline.

The batch pattern above fits this case: you maintain a translation file, run the synthesis job as part of your CI pipeline, and ship the audio assets alongside your app bundle.
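In CI, it usually makes sense to re-synthesize only the strings that changed since the last build. The sketch below keeps a small JSON manifest of content hashes next to the audio assets. The manifest file and the helper names (`fingerprint`, `stale_entries`, `save_manifest`) are assumptions for illustration, not Audexum features.

```python
import hashlib
import json
from pathlib import Path


def fingerprint(voice_id: str, text: str) -> str:
    """Stable hash of the inputs that determine the audio output."""
    return hashlib.sha256(f"{voice_id}\n{text}".encode("utf-8")).hexdigest()


def stale_entries(strings: dict[str, str], manifest_path: Path) -> dict[str, str]:
    """Return only the voice_id -> text pairs that are new or changed."""
    old = {}
    if manifest_path.exists():
        old = json.loads(manifest_path.read_text(encoding="utf-8"))
    return {
        voice_id: text
        for voice_id, text in strings.items()
        if old.get(voice_id) != fingerprint(voice_id, text)
    }


def save_manifest(strings: dict[str, str], manifest_path: Path) -> None:
    """Record current fingerprints; call this after a successful synthesis run."""
    manifest = {v: fingerprint(v, t) for v, t in strings.items()}
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
```

A build step would call `stale_entries` first, synthesize only what it returns, and then call `save_manifest` so the next run skips unchanged strings and spends no quota on them.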

Global Content Pipelines

Podcasts, YouTube videos, and e-learning courses are expensive to record in multiple languages. TTS-generated audio for translated scripts cuts production cost significantly while keeping quality acceptable for non-flagship content.

A typical workflow: translate the script with a translation API, synthesize with Audexum, mix with the original music bed. The result is publishable content at a fraction of studio recording cost.
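Translated scripts for podcasts or courses can run to many thousands of characters, so a pipeline may need to split them before synthesis. This article does not state a per-request limit, so `MAX_CHARS` below is an assumed placeholder and `split_script` is a hypothetical helper that packs whole sentences into chunks.

```python
import re

MAX_CHARS = 2000  # assumed placeholder; check the API reference for the real limit

# Split after Latin punctuation followed by whitespace, or directly after
# CJK, Hindi, and Arabic sentence marks (which are often not followed by a space).
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+|(?<=[。！？।؟])\s*")


def split_script(script: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept as its own oversized chunk.
    Sentences are rejoined with a space, which is harmless for most languages
    but may be unwanted in Japanese or Chinese text.
    """
    sentences = [s.strip() for s in SENTENCE_BOUNDARY.split(script) if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then go through the same TTS call as any other string, with the resulting audio files concatenated in order before mixing.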

Language Learning Apps

Language learning requires clear, accurate pronunciation — exactly what native-model TTS provides. Generating audio on demand (rather than pre-recording a fixed word list) lets you cover arbitrary vocabulary and sentence construction without a recording studio.

Audexum's phoneme accuracy for languages written in non-Latin scripts (Arabic, Japanese, Korean, Hindi) makes it viable for vocabulary drill audio where mispronunciation would actively harm the learner.

Voice Assistants and Chatbots

Bots serving international audiences need to respond in the user's language. Once you have detected that language, switching voices only requires passing the matching voice_id in the TTS call; nothing else in the request changes.
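A minimal sketch of that lookup follows. Language detection itself is out of scope here; the function assumes you already have a BCP-47 tag such as `ja-JP` from the client or a detection library. The voice IDs are the ones used in this article's examples, and the English fallback is an assumption.

```python
# Voice IDs taken from the examples in this article; confirm them via /api/v1/voices.
VOICE_BY_LANGUAGE = {
    "en": "en_us_female_01",
    "es": "es_female_01",
    "fr": "fr_female_01",
    "de": "de_female_01",
    "ja": "ja_female_01",
    "ko": "ko_female_01",
    "ar": "ar_female_01",
    "hi": "hi_female_01",
}
DEFAULT_VOICE = "en_us_female_01"  # assumed fallback for unsupported languages


def pick_voice(language_tag: str) -> str:
    """Map a tag such as 'ja-JP', 'ja_JP', or 'DE' to a voice_id."""
    primary = language_tag.replace("_", "-").split("-")[0].lower()
    return VOICE_BY_LANGUAGE.get(primary, DEFAULT_VOICE)


print(pick_voice("ja-JP"))
```

The returned ID goes straight into the `voice_id` field of the TTS request.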

Pricing

All 33 languages are available on every plan — there is no language surcharge.

| Plan | Characters/mo | Price | Cost per 1M chars |
|---|---|---|---|
| Free | 10,000 | €0 | |
| Starter | 100,000 | €4 | €40 |
| Pro | 500,000 | €12 | €24 |
| Scale | 2,000,000 | €30 | €15 |
| PAYG | Unlimited | €8/1M | €8 |

For comparison, ElevenLabs charges approximately $11/1M chars (multilingual add-on required on some plans) and OpenAI TTS charges $15/1M chars. Audexum's PAYG rate of €8/1M chars undercuts both, and Audexum is the only one of the three with an ongoing free tier and bundled dictation.

The free tier (10K chars/month, no card required) is enough to test synthesis in every language you need before committing to a paid plan.
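To compare options for a given monthly volume, you can turn the pricing table into a small calculator. The figures below are copied from the table. The sketch assumes plans have no overage pricing (a plan only qualifies if its quota covers the volume) and that PAYG has no minimum charge, neither of which this article confirms.

```python
# (monthly character quota, monthly price in EUR), copied from the pricing table
PLANS = {
    "Free": (10_000, 0.0),
    "Starter": (100_000, 4.0),
    "Pro": (500_000, 12.0),
    "Scale": (2_000_000, 30.0),
}
PAYG_EUR_PER_MILLION = 8.0


def monthly_cost_options(chars: int) -> dict[str, float]:
    """Monthly cost in EUR for every option whose quota covers `chars`."""
    options = {
        name: price
        for name, (quota, price) in PLANS.items()
        if chars <= quota
    }
    options["PAYG"] = round(chars * PAYG_EUR_PER_MILLION / 1_000_000, 2)
    return options


def cheapest(chars: int) -> str:
    """Name of the lowest-cost option for the given monthly volume."""
    options = monthly_cost_options(chars)
    return min(options, key=options.get)


for volume in (5_000, 50_000, 1_000_000):
    print(volume, monthly_cost_options(volume), "->", cheapest(volume))
```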

Getting Started

  • Sign up at audexum.com/signup — no credit card required for the free tier
  • Copy your sk_live_ key from the dashboard
  • Call /api/v1/voices to get the current voice list and pick your target voice_id values
  • Plug those IDs into the batch script above

The full API reference, including phoneme override syntax for edge cases, is at audexum.com/docs.


By Petar, founder of Audexum. Building multilingual TTS that actually handles non-Latin scripts correctly.
