"Long speech" is a faithful synthesis of a fairly irritating modern American English speech pattern.
I don’t understand the approach.
> TADA takes a different path. Instead of compressing audio into fewer fixed-rate frames of discrete audio tokens, we align audio representations directly to text tokens — one continuous acoustic vector per text token. This creates a single, synchronized stream where text and speech move in lockstep through the language model.
So basically just concatenating the audio vectors without compression or discretization?
I haven’t read the full paper yet (I know, I should before commenting), but this explanation puzzles me.
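For what it's worth, here is how I read the quoted description, as a toy sketch rather than anything taken from the paper. The assumptions are mine: `pool_frames_per_token` is a hypothetical helper, the per-token frame counts stand in for whatever alignment the real model uses, and mean pooling stands in for its actual encoder. The point is only that the sequence ends up as long as the text (one continuous vector per token) instead of as long as the audio at some fixed frame rate.

```python
from typing import List

Vector = List[float]


def pool_frames_per_token(frames: List[Vector], frames_per_token: List[int]) -> List[Vector]:
    """Collapse fixed-rate frames into one continuous vector per text token.

    Mean pooling is an assumption for illustration, not the paper's method.
    """
    if sum(frames_per_token) != len(frames):
        raise ValueError("alignment must cover every frame exactly once")
    pooled: List[Vector] = []
    start = 0
    for count in frames_per_token:
        if count <= 0:
            raise ValueError("each token needs at least one frame")
        chunk = frames[start:start + count]
        dim = len(chunk[0])
        # Average each feature dimension over the frames assigned to this token.
        pooled.append([sum(frame[d] for frame in chunk) / count for d in range(dim)])
        start += count
    return pooled


# Toy data: 6 fixed-rate frames of 2-dim features, aligned to 3 text tokens.
tokens = ["Hello", ",", " world"]
frames = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0], [8.0, 9.0], [10.0, 11.0]]
alignment = [3, 1, 2]  # hypothetical number of frames per token

acoustic = pool_frames_per_token(frames, alignment)
stream = list(zip(tokens, acoustic))  # text and acoustic vectors in lockstep

print(len(frames), "frames ->", len(acoustic), "vectors for", len(tokens), "tokens")
```

So under this reading it is not plain concatenation of raw audio vectors: the audio is still summarized, just per text token and into continuous values instead of per fixed time step and into discrete codes.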
Okay, so they say it only does text continuation without fine-tuning. I assume that means we can't use it as a replacement for TTS in an AI agent chat, because it won't work without enough context?
Could you maybe trick it into thinking it was continuing a sample for an assistant use case if the sample was generic enough?
I appreciate them being honest about it though because otherwise I might spend two days trying to make it work.
MIT license, supported languages beyond English: ar, ch, de, es, fr, it, ja, pl, pt.
The 0.09 RTF is wild, but I wonder how much of that speed advantage disappears once you need voice cloning or fine-grained prosody control. I use Cartesia Sonic for TTS in a video pipeline, and the thing that actually matters for content creation isn't raw speed; it's whether you can get consistent emotional delivery across 50+ scenes without it drifting. The 1:1 text-acoustic alignment should help with hallucinations for sure, but does it handle things like mid-sentence pauses or emphasis on specific words? That's where most open-source TTS falls apart, IMO.
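For anyone unsure what the RTF number buys you: real-time factor is synthesis time divided by audio duration, so lower is faster. A back-of-the-envelope sketch follows; the 20-second scene length is a made-up figure for illustration, and the 0.09 is just the headline number, which says nothing about the hardware it was measured on.

```python
def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """RTF = synthesis time / audio duration, so synthesis time = RTF * duration."""
    if audio_seconds < 0 or rtf < 0:
        raise ValueError("duration and RTF must be non-negative")
    return rtf * audio_seconds


RTF = 0.09            # headline figure quoted in the thread
scene_seconds = 20.0  # hypothetical average scene length
num_scenes = 50       # the "50+ scenes" case from the comment above

total_audio = scene_seconds * num_scenes
total_compute = synthesis_seconds(total_audio, RTF)
speedup = 1.0 / RTF   # how many times faster than real time

print(f"{total_audio:.0f} s of audio -> roughly {total_compute:.0f} s of synthesis")
print(f"about {speedup:.1f}x faster than real time")
```

At that rate raw generation time is unlikely to be the bottleneck for a 50-scene pipeline, which is why consistency across scenes matters more than speed.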
Could it run on a MacBook? Or just on a GPU device?
Will this run on CPU? (as opposed to GPU)
To me, the speech sounds impressively expressive, but there is something off about the audio quality that I can't quite put my finger on.
The "Anger Speech" has an obvious lisp (maybe a homage to Elmer Fudd?), and I hear a similar but more subtle speech impediment in the "Adoration Speech". The "Fearful Speech" might have a slight warble to it. And the "Long Speech" is difficult to evaluate because the speaker has vocal fry to an extent that I find annoying.