You can run it for free here: https://huggingface.co/spaces/ResembleAI/Chatterbox
Chatterbox is fantastic.
I created an API wrapper that also makes installation easier (Dockerized as well) https://github.com/travisvn/chatterbox-tts-api/
Best voice cloning option available locally by far, in my experience.
> Every audio file generated by Chatterbox includes Resemble AI's Perth (Perceptual Threshold) Watermarker - imperceptible neural watermarks that survive MP3 compression, audio editing, and common manipulations while maintaining nearly 100% detection accuracy.
Am I misunderstanding, or can you trivially disable the watermark by simply commenting out the call to the apply_watermark function in tts.py? https://github.com/resemble-ai/chatterbox/blob/master/src/ch...
I thought the point of this sort of watermark was that it was embedded somehow in the model weights, so that it couldn't easily be separated out. If you're going to release an open-source model that adds a watermark as a separate post-processing step, then why bother with the watermark at all?
Silly question, what’s the lowest spec hardware this will run on?
The emotional exaggeration is interesting, though I don't think I've come across anything quite so versatile and easy to "sculpt" as ElevenLabs and its ability to generate a voice on the basis of a description of how you want the voice to sound. SparkTTS allows some additional parameters, and its project on GitHub has placeholders in its code that indicate the model might be refined for more fine-grained emotional control. As it is, I've had some success with it and other models by trying to influence prosody and tonality with some heavy-handed cues in the text, which can then be used with VC to get closer to desired results, but it's a much more cumbersome process than Eleven.
I've found it excellent with really common accents but with other accents (that are pretty common too) it can easily get stuck picking a different accent. For instance, several Scottish recordings ended up Australian, and likewise a fairly mild Yorkshire accent.
Are these things good enough to narrate a book convincingly or does the voice lose coherence after a few paragraphs being spoken?
Just a regular reminder to tell your friends and family to be extra skeptical about phone conversations.
It’s becoming much more likely that the friend who desperately needs a gift card to Walmart isn’t the friend at all. :(
What is the current state of the art for open source multilingual TTS? I have found Kokoro to be great for English as well, but am still searching for a good solution for French, Japanese, German...
Example implementation with sample inference code + voice cloning example:
https://github.com/basetenlabs/truss-examples/tree/main/chat...
Still working on streaming
I just tested it out locally, really excellent quality, the server was easy to set up and well documented.
I'd love to get to real-time generation if that's in the pipeline? Would like to use it along with Home Assistant.
Interesting demo. A few observations, having uploaded a snippet of my own voice, and testing with some of my own text:
- the output had some of the qualities of my voice, but wasn't super similar. (Then again, the fact it could even do this from such a tiny snippet was impressive)
- increasing "CFG/pace" (whatever CFG is) even a little bit often just breaks down into total gibberish
- it was very inconsistent whether it would come out with a kind of British accent or an American one. (My accent is Australian...)
- the emotional exaggeration was interesting, but it seemed to vary a lot exactly what kind of emotion would come out
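On "whatever CFG is": it most likely stands for classifier-free guidance, though that is an assumption, since the demo does not spell it out. In that technique the model makes one prediction with the conditioning (here, presumably the voice prompt) and one without, and the weight controls how far the output is pushed from the unconditioned prediction towards, or past, the conditioned one. A minimal sketch of the blending step, with a hypothetical function name and toy numbers, not Chatterbox's actual code:

```python
def apply_cfg(cond: list[float], uncond: list[float], weight: float) -> list[float]:
    """Blend conditional and unconditional predictions, classifier-free-guidance style.

    weight = 0 returns the unconditional prediction, weight = 1 the conditional
    one, and weight > 1 extrapolates beyond the conditional prediction.
    """
    return [u + weight * (c - u) for c, u in zip(cond, uncond)]


# Toy scores for two candidate tokens (illustrative values only).
conditional = [1.0, 2.0]
unconditional = [0.0, 1.0]

for w in (0.0, 1.0, 3.0):
    print(w, apply_cfg(conditional, unconditional, w))
```

With a weight above 1 the result lies outside the range spanned by the two predictions, which is one plausible reason a high setting degrades into gibberish.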
They should put the meaning of "TTS" in the readme somewhere, probably near the top. Or their website.
Does anyone know of an open-source TTS like this that can also encode speech to do voice conversion alongside TTS? i.e. a model that would take speech as input and convert it to one of the pretrained TTS voices.
Anyone know how this compares to Kokoro? I've found Kokoro very useful for generating audiobooks but it almost always pronounces words with paired vowels incorrectly. Daisy becomes die-zee, leave becomes lay-ve, etc.
> the emotion intensity control is killer. actual param you can tune per line.
> and the perth watermarking baked into every output, that’s the part most people are sleeping on. survives mp3, editing, even resampling. no plugin, no postprocess.
> also noticed the chatterboxtoolkitui floating in the org, with audiobook mode and batch voice conversion already wired in.
is it a banger??? yes ig so, a full setup ready for indies shipping voicefirst products right now.
It's only for English sadly
Has anyone developed a way to annotate the input to provide emotional context?
In the past I've used different samples from the same speaker for this.
I’d sign up for a service that calls a pharmacy on my behalf to refill prescriptions. In certain situations, pharmacies will not list prescriptions on their websites, even though they have the prescriptions on file, which forces the customer to call by phone — a frustrating process.
I do feel bad for pharmacists, their job is challenging in so many ways.
Anyone know a good free open source speech to text? Looking for something for my laptop, which is running Fedora KDE Plasma.
How do you set the voice?
On the Huggingface demo, there seems to be no option for it.
It has a female voice. Any way to set it to a male voice?
I love Chatterbox, it's my favourite. While the generation speed is quick, I wonder what performance optimization I could try on my 3090 to improve throughput. It's not quite enough for realtime.
The voice cloning is okay, not as good as ElevenLabs. There's a Rick (from Rick and Morty) voice example, and the generated audio sounds muffled and low quality. I appreciate that it's open source though.
definitely worse than the new ElevenLabs model (v3). that model is really good
in my experience, TTS has been a "pick two" situation:
- fast / cheap to run
- can clone voices
- sounds super realistic
from what I can tell, Chatterbox is the first that apparently lets you pick 3! (have not tried it myself yet, this is just what I can deduce)
Fun stuff... I don't know how or why, but connecting Bluetooth while on this site made all of the audio clips play at once (Firefox, Linux). Not the best listening experience.
I always have issues with TTS models that do not allow you to send large chunks of text. Seems this one does not resolve this either. Always has a limit of like 2-3 sentences.
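The usual workaround is to split the text yourself and synthesize it piece by piece. A minimal sketch, assuming a naive punctuation-based sentence split and an arbitrary 300-character limit (both are illustrative choices, not limits documented by any of the models discussed here):

```python
import re


def chunk_text(text: str, max_chars: int = 300) -> list[str]:
    """Group whole sentences into chunks of at most max_chars characters."""
    # Naive split: break after '.', '!' or '?' when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars or not current:
            # Either it fits, or a single over-long sentence starts its own chunk.
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for piece in chunk_text("One. Two! Three? Four.", max_chars=10):
        print(repr(piece))
```

Each chunk would then go through the TTS call separately and the resulting audio would be concatenated. Abbreviations such as "Dr." will trip up this split, so a proper sentence tokenizer is the better choice for book-length text.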
Here's an open-source serving implementation: https://lightning.ai/bhimrajyadav/studios/build-a-production...
Also, a deployable model: https://lightning.ai/bhimrajyadav/ai-hub/temp_01jwr0adpqf055...
There are only English voices, even in the paid version. Using them in other languages results in an accent.
Looks good! What is the difference between the open-source version and the priced version?
How does one train a TTS model with an LLM backbone? Practically, how does this work?
How would I install this alongside librechat or ollama using docker?
Chatterbox CLI https://pypi.org/project/voice-forge/
How does it perform on multi-lingual tasks?
Watermarking is easily disabled in the code. I am wondering when they will release model weights with embedded watermarking.
There’s been surprisingly little advancement in TTS after a rapid leap forward three years ago or so.
There’s ElevenLabs, which is quite good but not incredible and very expensive.
Everything else... all the big AI companies... have TTS systems that are kinda meh.
Everything else in AI has advanced in leaps and bounds, but TTS remains deep in the uncanny valley.
What is the latency?
for this, what does it take to support another language?
wow! 200ms, very good!
> Supported Lanugage
> Currenlty only English.
meh
very cherry picked
Another TTS that only supports English. This really irritates me.
Previously, on Hacker News:
https://news.ycombinator.com/item?id=44120204
https://news.ycombinator.com/item?id=44144155
https://news.ycombinator.com/item?id=44195105
https://news.ycombinator.com/item?id=44230867
https://news.ycombinator.com/item?id=44172134
It took me ages to understand what TTS means!
Demos here: https://resemble-ai.github.io/chatterbox_demopage/ (not mine)
This is a good release if they're not too cherry picked!
I say this every time it comes up, and it's not as sexy to work on, but in my experiments voice AI is really held back by transcription, not TTS. Unless that's changed recently.