[{"data":1,"prerenderedAt":443},["ShallowReactive",2],{"article-on-prem-speech-synthesis":3},{"id":4,"title":5,"alt":6,"author":7,"body":11,"description":427,"extension":428,"field":429,"img":430,"imgCredit":431,"meta":434,"navigation":435,"path":436,"published":437,"seo":438,"stem":439,"topics":440,"updated":441,"__hash__":442},"articles/articles/on-prem-speech-synthesis.md","Modern on-prem speech synthesis","A woman wearing a headset in a colourful space",{"name":8,"title":9,"website":10},"Giorgio Sidari","Knowledge & SW Engineer","https://www.sidari.it",{"type":12,"value":13,"toc":405},"minimark",[14,19,33,74,77,81,86,91,94,98,101,108,112,119,123,127,135,158,162,165,177,180,184,187,191,198,201,204,243,246,253,256,259,263,267,270,324,328,334,339,353,373,376,381,385],[15,16,18],"h2",{"id":17},"your-words-stay-in-house","Your words stay in-house",[20,21,22,23,32],"p",{},"Most businesses evaluating ",[24,25,31],"mark",{"dataAos":26,"dataAosDelay":27,"className":28},"highlight-text","250",[29,30],"aos-init","aos-animate","TTS"," (Text-to-speech, or Voice Synthesis) share most of these requirements:",[34,35,36,44,50,56,62,68],"ul",{},[37,38,39,43],"li",{},[40,41,42],"strong",{},"On-premises",": text never leaves the machine, no API calls, no third-party data exposure",[37,45,46,49],{},[40,47,48],{},"Quality",": natural enough to actually listen to, not the robotic cadence of classic TTS",[37,51,52,55],{},[40,53,54],{},"Multilingual",": able to speak the native language of employees",[37,57,58,61],{},[40,59,60],{},"Streaming /  batch",": realtime speech for interactive use, or file output for longer content",[37,63,64,67],{},[40,65,66],{},"Permissive license",": usable commercially without negotiation",[37,69,70,73],{},[40,71,72],{},"Forecastable cost",": infrastructure spend is fixed and predictable, with no exposure to vendor pricing changes",[20,75,76],{},"This post is a report of how well TTS technologies I reviewed and work with deal with these 
requirements.",[15,78,80],{"id":79},"adoption-drivers","Adoption drivers",[82,83,85],"h3",{"id":84},"privacy-by-architecture","Privacy by architecture",[20,87,90],{"className":88},[89],"lead","\nWith a self-hosted TTS service, the text being synthesized never leaves your infrastructure.\n",[20,92,93],{},"No API call to OpenAI, Google, or ElevenLabs. No terms-of-service clause about training data. No audit trail at a third party. For businesses handling sensitive documents, customer communications, legal content, or regulated data, this removes an entire category of risk.",[82,95,97],{"id":96},"cost-that-scales-with-hardware-not-usage","Cost that scales with hardware, not usage",[20,99,100],{},"Cloud TTS is priced per character. At volume (automated reports, customer notifications, content libraries, call center prompts) the bill compounds fast.",[20,102,103,107],{},[24,104,106],{"dataAos":26,"dataAosDelay":27,"className":105},[29,30],"Self-hosted TTS has a fixed infrastructure cost."," Once the GPU is running, synthesizing one sentence or ten thousand costs the same.",[82,109,111],{"id":110},"reliability-and-control","Reliability and control",[20,113,114,115],{},"No rate limits, no upstream outages, no deprecation notices. ",[24,116,118],{"dataAos":26,"dataAosDelay":27,"className":117},[29,30],"The model you deploy is the model you keep.",[15,120,122],{"id":121},"what-this-unlocks","What this unlocks",[82,124,126],{"id":125},"a-consistent-brand-voice-across-all-languages-and-markets","A consistent brand voice across all languages and markets",[20,128,129,130,134],{},"In many modern TTS models, voice character and language are decoupled. ",[24,131,133],{"dataAos":26,"dataAosDelay":27,"className":132},[29,30],"The same narrator reads your content in French, Italian, English, or Japanese",", with consistent timbre throughout. 
No per-market voice casting, no coordination across regional teams.",[136,137,138,139,138,147],"figure",{},"\n  ",[140,141],"img",{"src":142,"alt":143,"className":144},"/assets/img/blog/multilingual-speech.jpg","Multilingual speech synthesis",[145,146],"img-fluid","rounded",[148,149,150,151,138],"figcaption",{},"\n    Photo:\n    ",[152,153,157],"a",{"href":154,"rel":155},"https://unsplash.com/@tonybear2",[156],"noopener","\n      zhendong wang\n    ",[82,159,161],{"id":160},"tone-on-demand","Tone on demand",[20,163,164],{},"Voices accept a plain-language instruction alongside the text to read.",[166,167,169,170,173,174,176],"blockquote",{"className":168},[166],"\n\"Speak slowly and warmly, as if explaining to someone unfamiliar with the topic.\"",[171,172],"br",{},"\n\"Fast-paced and confident, like a professional anchor.\"",[171,175],{},"\n\"Calm and reassuring, with deliberate pauses.\"\n",[20,178,179],{},"Calm for customer support, authoritative for compliance notices, energetic for product announcements. No re-recording, no voice actor availability to manage.",[82,181,183],{"id":182},"suitable-for-regulated-and-air-gapped-environments","Suitable for regulated and air-gapped environments",[20,185,186],{},"Once model weights are downloaded, the system has no external dependencies. 
Suitable for finance, healthcare, defense, or any context where data cannot leave a controlled environment.",[15,188,190],{"id":189},"common-questions","Common questions",[166,192,194,195,197],{"className":193},[166],"\n\"How many channels can run in parallel?\"",[171,196],{},"\n\"What are the hardware requirements?\"\n",[20,199,200],{},"Exact throughput of concurrent streams on a single GPU depends on the model, available VRAM, and average text length.",[20,202,203],{},"What shapes the number:",[34,205,206,225,231,237],{},[37,207,208,211,212,218,219,224],{},[40,209,210],{},"Model size:"," the lighter ",[152,213,217],{"href":214,"rel":215},"https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",[216],"nofollow","Qwen3-TTS"," (1.7B parameters) fits more concurrent requests in the same VRAM budget than the heavier ",[152,220,223],{"href":221,"rel":222},"https://huggingface.co/mistralai/Voxtral-4B-TTS-2603",[216],"Voxtral"," (4B).",[37,226,227,230],{},[40,228,229],{},"Batching:"," the serving layer uses continuous batching, so requests share GPU time rather than queuing one at a time.",[37,232,233,236],{},[40,234,235],{},"Text length:"," short utterances (notifications, prompts) allow more parallelism than long-form documents.",[37,238,239,242],{},[40,240,241],{},"Streaming vs. batch:"," streaming holds a GPU slot for the duration of synthesis; batch mode frees it immediately.",[20,244,245],{},"For a call center or notification system generating short utterances, a single high-end GPU can handle tens of concurrent streams. For a narration pipeline producing long audio files, throughput is lower but latency per job stays predictable.",[136,247,138,248],{},[140,249],{"src":250,"alt":251,"className":252},"/assets/img/blog/voxtral-vllm.png","Voxtral running on vLLM-Omni",[145,146],[20,254,255],{},"On a mid-range GPU, one voice channel plays in realtime. Running several in parallel is not yet viable on this hardware (RTX 4060 Ti 16GB). 
The utilization data suggests a modest hardware upgrade would support 16 or more simultaneous channels.",[20,257,258],{},"Qwen3-TTS is lighter than Voxtral and can presumably run more instances in parallel. A transient limitation in the vLLM-Omni library prevents testing streaming with Qwen at the moment.",[15,260,262],{"id":261},"comparison","Comparison",[82,264,266],{"id":265},"tts-space","TTS space",[20,268,269],{},"Several TTS systems were considered before focusing on Qwen and Voxtral:",[34,271,272,282,292,302,318],{},[37,273,274,281],{},[40,275,276],{},[152,277,280],{"href":278,"rel":279},"https://github.com/OpenBMB/VoxCPM/",[216],"VoxCPM",": Multilingual (30+ languages), fast and under Apache 2.0 license. Voice is defined through a prompt. Demos are very promising; I will put it to the test soon.",[37,283,284,291],{},[40,285,286],{},[152,287,290],{"href":288,"rel":289},"https://github.com/OHF-Voice/piper1-gpl",[216],"Piper",": Fast, CPU-only, many voices across many languages. Quality is noticeably lower, closer to classic TTS. Good for resource-constrained deployments (edge, IoT, embedded). Its GPL-3.0 license is unusable in most business contexts, because it forces derivative work to be open-sourced.",[37,293,294,301],{},[40,295,296],{},[152,297,300],{"href":298,"rel":299},"https://github.com/myshell-ai/MeloTTS",[216],"MeloTTS",": Multilingual (EN, ZH, JP, KR, ES, FR), fast on CPU, MIT license. 
Decent quality but no instruction-following and limited voice variety.",[37,303,304,317],{},[40,305,306,311,312],{},[152,307,310],{"href":308,"rel":309},"https://github.com/HumeAI/tada",[216],"TADA"," and ",[152,313,316],{"href":314,"rel":315},"https://github.com/ysharma3501/LuxTTS",[216],"LuxTTS",": Both are voice-cloning systems that require a reference audio file to define the speaker.",[37,319,320,323],{},[40,321,322],{},"Cloud TTS (OpenAI, ElevenLabs, Google, Azure)",": Highest quality ceiling, lowest setup cost, but all the privacy and billing trade-offs described above.",[82,325,327],{"id":326},"license-qwen3-tts-vs-voxtral","License: Qwen3-TTS vs. Voxtral",[20,329,330],{},[24,331,333],{"dataAos":26,"dataAosDelay":27,"className":332},[29,30],"Qwen3-TTS is the commercially viable choice. Voxtral is acceptable for internal use, with caution.",[20,335,336,338],{},[40,337,217],{},": Apache-2.0 throughout (code and weights). No restrictions on commercial or internal use beyond standard attribution.",[20,340,341,347,348,352],{},[40,342,343],{},[152,344,223],{"href":345,"rel":346},"https://mistral.ai/news/voxtral",[216],": CC BY-NC 4.0 on the model weights, inherited from voice reference training data (EARS, CML-TTS, IndicVoices-R, Arabic Natural Audio datasets). ",[349,350,351],"em",{},"NC = NonCommercial."," Three practical scenarios:",[34,354,355,361,367],{},[37,356,357,360],{},[40,358,359],{},"Pure internal use"," (internal tools, employee-facing apps, R&D with no connection to a product sold to customers): generally permitted under CC BY-NC 4.0.",[37,362,363,366],{},[40,364,365],{},"Internal use tied to a commercial activity"," (e.g. 
generating voice content that feeds into a product, even if not directly sold): ambiguous; a commercial license from Mistral AI is recommended.",[37,368,369,372],{},[40,370,371],{},"External product or service:"," requires a commercial license from Mistral AI.",[20,374,375],{},"For unrestricted commercial deployment, Qwen3-TTS (Apache-2.0) is the simpler choice. For strictly internal, non-commercial use, Voxtral is also available.",[20,377,378],{},[349,379,380],{},"Confirm with your legal team for your specific scenario. Source: Mistral AI (Le Chat, April 2026).",[15,382,384],{"id":383},"available-voices","Available voices",[34,386,387,396],{},[37,388,389,395],{},[40,390,391],{},[152,392,394],{"href":221,"rel":393},[216],"Voxtral-4B-TTS-2603",": 20 preset voices across 9 languages (English, French, German, Spanish, Italian, Portuguese, Dutch, Arabic, Hindi). Streaming (chunked PCM) supported.",[37,397,398,404],{},[40,399,400],{},[152,401,403],{"href":214,"rel":402},[216],"Qwen3-TTS-12Hz-1.7B-CustomVoice",": 9 preset voices, synthesizes any language regardless of the speaker's native language. 
Strong emotional range and instruction-following.",{"title":406,"searchDepth":407,"depth":407,"links":408},"",2,[409,410,416,421,422,426],{"id":17,"depth":407,"text":18},{"id":79,"depth":407,"text":80,"children":411},[412,414,415],{"id":84,"depth":413,"text":85},3,{"id":96,"depth":413,"text":97},{"id":110,"depth":413,"text":111},{"id":121,"depth":407,"text":122,"children":417},[418,419,420],{"id":125,"depth":413,"text":126},{"id":160,"depth":413,"text":161},{"id":182,"depth":413,"text":183},{"id":189,"depth":407,"text":190},{"id":261,"depth":407,"text":262,"children":423},[424,425],{"id":265,"depth":413,"text":266},{"id":326,"depth":413,"text":327},{"id":383,"depth":407,"text":384},"Self-hosted, multilingual text-to-speech options to avoid data leaks and unpredictable cloud bills.","md","Voice technology","/assets/img/blog/synthetic-voices.jpg",{"name":432,"url":433},"Prateek Gautam","https://unsplash.com/@pgauti",{},true,"/articles/on-prem-speech-synthesis","2026-04-16",{"title":5,"description":427},"articles/on-prem-speech-synthesis","Speech synthesis, Privacy, Infrastructure, AI, TTS",null,"ozS3ISyT16EDYdgUTXvDocG6MOq3wjXad6U2odp7TWc",1776344173549]