TADA: Speech generation through text-acoustic synchronization

by smusamashah on 3/11/2026, 5:42 AM with 27 comments

by microtherion on 3/11/2026, 12:15 PM

To me, the speech sounds impressively expressive, but there is something off about the audio quality that I can't quite put my finger on.

The "Anger Speech" has an obvious lisp (maybe an homage to Elmer Fudd?). But I hear a similar, more subtle, speech impediment in the "Adoration Speech". The "Fearful Speech" might have a slight warble to it. And the "Long Speech" is difficult to evaluate because the speaker has vocal fry to an extent that I find annoying.

by mpalmer on 3/11/2026, 12:19 PM

"Long speech" is a faithful synthesis of a fairly irritating modern American English speech pattern.

by earthnail on 3/11/2026, 12:07 PM

I don’t understand the approach.

> TADA takes a different path. Instead of compressing audio into fewer fixed-rate frames of discrete audio tokens, we align audio representations directly to text tokens — one continuous acoustic vector per text token. This creates a single, synchronized stream where text and speech move in lockstep through the language model.

So basically just concatenating the audio vectors without compression or discretization?

I haven’t read the full paper yet (I know, I should before commenting), but this explanation puzzles me.
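For what it's worth, the quoted description can be sketched in a few lines. Everything here is an assumption for illustration (the shapes, the additive fusion, the stand-in embedding table); it is not the paper's actual mechanism, just one way "one continuous acoustic vector per text token" could produce a single synchronized stream rather than a concatenation of separate audio tokens:

```python
import numpy as np

rng = np.random.default_rng(0)

T, D = 5, 8                                # 5 text tokens, model width 8 (assumed)
vocab = 100
text_ids = rng.integers(0, vocab, size=T)  # stand-in text token ids
acoustic = rng.standard_normal((T, D))     # one continuous acoustic vector per token

embed = rng.standard_normal((vocab, D))    # stand-in text embedding table
stream = embed[text_ids] + acoustic        # shape (T, D): text and speech in lockstep

# Contrast with a fixed-rate codec: at, say, 50 frames/s, one second of
# audio costs 50 tokens regardless of text length. Here the sequence
# length always equals the number of text tokens.
print(stream.shape)
```

So on this reading it is not concatenation of an audio token stream onto the text; each position in the language model's input carries both the text token and its aligned acoustic vector.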

by ilaksh on 3/11/2026, 1:23 PM

Okay, so they say text continuation only, without fine-tuning. I assume that means we can't use it as a replacement for TTS in an AI agent chat, because it won't work without enough context?

Could you maybe trick it into thinking it was continuing a sample for an assistant use case if the sample was generic enough?

I appreciate them being honest about it though because otherwise I might spend two days trying to make it work.

by kavalg on 3/11/2026, 1:25 PM

MIT license; supported languages beyond English: ar, ch, de, es, fr, it, ja, pl, pt.

https://huggingface.co/HumeAI/tada-3b-ml

https://github.com/HumeAI/tada

by tcbrah on 3/11/2026, 12:56 PM

The 0.09 RTF is wild, but I wonder how much of that speed advantage disappears once you need voice cloning or fine-grained prosody control. I use Cartesia Sonic for TTS in a video pipeline, and the thing that actually matters for content creation isn't raw speed; it's whether you can get consistent emotional delivery across 50+ scenes without it drifting. The 1:1 text-acoustic alignment should help with hallucinations for sure, but does it handle things like mid-sentence pauses or emphasis on specific words? That's where most open-source TTS falls apart, IMO.

by qinqiang201 on 3/11/2026, 8:53 AM

Could it run on a MacBook, or only on a GPU device?

by OutOfHere on 3/11/2026, 7:41 AM

Will this run on CPU? (as opposed to GPU)
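Both questions likely come down to the standard PyTorch device fallback. Assuming the released checkpoint loads as an ordinary PyTorch model (the repo's actual loading API is not shown here, and whether its code paths support MPS or CPU is untested), the usual selection chain would be:

```python
import torch

# Standard fallback: CUDA on an NVIDIA GPU, MPS (Metal) on an
# Apple-silicon MacBook, otherwise CPU. CPU inference will work for a
# typical PyTorch model but is usually far slower than real time for
# a 3B-parameter model.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

print(device)
```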