Show HN: Kitten TTS – 25MB CPU-Only, Open-Source TTS Model

by divamgupta on 8/6/2025, 5:04 AM with 343 comments

Kitten TTS is an open-source series of tiny and expressive text-to-speech models for on-device applications. We are excited to launch a preview of our smallest model, which is less than 25 MB. This model has 15M parameters.

This release supports English text-to-speech applications in eight voices: four male and four female. The model is quantized to int8 + fp16 and runs on ONNX. It is designed to run literally anywhere, e.g. Raspberry Pi, low-end smartphones, wearables, browsers, etc. No GPU required!
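
A quick sanity check on those numbers (back-of-envelope arithmetic only, not the actual file layout): 15M parameters stored at one byte each in int8 come to about 14.3 MiB, which is how a mixed int8 + fp16 model can land under 25 MB:

```python
# Back-of-envelope size estimate for a 15M-parameter model
# quantized to a mix of int8 and fp16 (illustrative only).
PARAMS = 15_000_000

def size_mib(params: int, bytes_per_param: float) -> float:
    """Approximate storage in MiB for the given precision."""
    return params * bytes_per_param / (1024 ** 2)

print(f"all int8: {size_mib(PARAMS, 1.0):.1f} MiB")   # ~14.3 MiB
print(f"all fp16: {size_mib(PARAMS, 2.0):.1f} MiB")   # ~28.6 MiB
```

The real file sits between the two bounds, since only some tensors stay in fp16.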

We're releasing this to give early users a sense of the latency and voices that will be available in our next release (hopefully next week). We'd love your feedback! Just FYI, this model is an early checkpoint trained on less than 10% of our total data.

We started working on this because existing expressive OSS models require big GPUs to run on-device, and the cloud alternatives are too expensive for high-frequency use. We think there's a need for frontier open-source models that are tiny enough to run on edge devices!

by mlboss on 8/6/2025, 2:00 AM

Reddit post with generated audio sample: https://www.reddit.com/r/LocalLLaMA/comments/1mhyzp7/kitten_...

by nine_k on 8/6/2025, 1:42 AM

I hope this is the future. Offline, small ML models, running inference on ubiquitous, inexpensive hardware. Models that are easy to integrate into other things, into devices and apps, and even to drive from other models maybe.

by peanut_merchant on 8/6/2025, 2:55 PM

I ran some quick benchmarks.

Ubuntu 24, Razer Blade 16, Intel Core i9-14900HX

  Performance Results:

  Initial Latency: ~315ms for short text

  Audio Generation Speed (seconds of audio per second of processing):
  - Short text (12 chars): 3.35x realtime
  - Medium text (100 chars): 5.34x realtime
  - Long text (225 chars): 5.46x realtime
  - Very Long text (306 chars): 5.50x realtime

  Findings:
  - Model loads in ~710ms
  - Generates audio at ~5x realtime speed (excluding initial latency)
  - Performance is consistent across different voices (4.63x - 5.28x realtime)
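
For reference, the "x realtime" figures above are just audio duration divided by wall-clock synthesis time; a minimal way to compute them from your own timings:

```python
def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio produced per second of processing."""
    return audio_seconds / wall_seconds

# e.g. 10 s of audio synthesized in 2 s of CPU time:
print(realtime_factor(10.0, 2.0))  # 5.0, i.e. "5x realtime"
```

Anything above 1.0 means synthesis keeps ahead of playback, which is the bar for streaming use.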

by blopker on 8/6/2025, 2:00 AM

Web version: https://clowerweb.github.io/kitten-tts-web-demo/

It sounds ok, but impressive for the size.

by MutedEstate45 on 8/6/2025, 3:58 AM

The headline feature isn’t the 25 MB footprint alone. It’s that KittenTTS is Apache-2.0. That combo means you can embed a fully offline voice in Pi Zero-class hardware or even battery-powered toys without worrying about GPUs, cloud calls, or restrictive licenses. In one stroke it turns voice everywhere from a hardware/licensing problem into a packaging problem. Quality tweaks can come later; unlocking that deployment tier is the real game-changer.

by antisol on 8/6/2025, 7:15 AM

  System Requirements
  Works literally everywhere
Haha, on one of my machines my python version is too old, and the package/dependencies don't want to install.

On another machine the python version is too new, and the package/dependencies don't want to install.

by klipklop on 8/6/2025, 6:52 AM

I tried it. Not bad for the size (of the model) and speed. Once you install the massive number of libraries and things it needs, though, we are a far cry from 25MB. Cool project nonetheless.

by keyle on 8/6/2025, 2:58 AM

I don't mind the size in MB so much, given that it's pure CPU, nor the quality; what I do mind is the latency. I hope it's fast.

Aside: are there any voice-to-text models that run fully offline, without training?

I will be very impressed when we can have a conversation with an AI at a natural rate and not "probe, space, response".

by sandreas on 8/6/2025, 3:32 AM

Cool.

While I think this is indeed impressive and has a specific use case (e.g. in the embedded sector), I'm not totally convinced that the quality is good enough to replace bigger models.

With fish-speech[1] and f5-tts[2] there are at least 2 open source models pushing the quality limits of offline text-to-speech. I tested F5-TTS with an old NVidia 1660 (6GB VRAM) and it worked ok-ish, so running it on a little more modern hardware will not cost you a fortune and produce MUCH higher quality with multi-language and zero-shot support.

For Android there is SherpaTTS[3], which plays pretty well with most TTS Applications.

1: https://github.com/fishaudio/fish-speech

2: https://github.com/SWivid/F5-TTS

3: https://github.com/woheller69/ttsengine

by wkat4242 on 8/6/2025, 2:26 AM

Hmm, the quality is not so impressive. I'm looking for a really natural-sounding model. I'm not very happy with piper/kokoro, and XTTS was a bit complex to set up.

For STT, whisper is really amazing, but I miss a good TTS. And I don't mind throwing GPU power at it. Anyway, this isn't it either; it sounds worse than kokoro.

by dr_kiszonka on 8/6/2025, 8:48 AM

Microsoft's and some of Google's TTS models make the simplest mistakes. For instance, they sometimes read "i.e." as "for example." This is a problem if you have low vision and use TTS for, say, proofreading your emails.

Why does it happen? I'm genuinely curious.
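
One likely culprit (an assumption, not a confirmed explanation of any vendor's pipeline) is the text-normalization front end: many TTS systems expand abbreviations through a lookup table before phonemization, and a single sloppy table entry silently maps "i.e." to the wrong expansion. A toy sketch with a hypothetical table:

```python
# Hypothetical abbreviation table; one wrong entry here is exactly
# the kind of bug that makes an engine read "i.e." as "for example".
ABBREVIATIONS = {
    "i.e.": "that is",
    "e.g.": "for example",
    "etc.": "et cetera",
}

def normalize(text: str) -> str:
    """Expand known abbreviations before phonemization."""
    for abbr, expansion in ABBREVIATIONS.items():
        text = text.replace(abbr, expansion)
    return text

print(normalize("Use a venv, i.e. an isolated environment."))
# -> "Use a venv, that is an isolated environment."
```

Because the substitution happens before audio is generated, the listener has no way to tell the source text was different, which is exactly the proofreading hazard described above.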

by toisanji on 8/6/2025, 1:49 AM

Wow, amazing and good work, I hope to see more amazing models running on CPUs!

by vahid4m on 8/6/2025, 4:07 AM

amazing! can't wait to integrate it into https://desktop.with.audio . I'm already using KokorosTTS without a GPU, and it works fairly well on Apple Silicon.

Foundational tools like this open up the possibility of one-time-payment or even free tools.

by ricardobeat on 8/6/2025, 9:33 AM

The samples featured elsewhere seem to be from a larger model?

After testing this locally, it still sounds quite mechanical, and fails catastrophically for simple phrases with numbers ("easy as 1-2-3"). If the 80M model can improve on this and keep the expressiveness seen in the reddit post, that looks promising.
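
Failures on "easy as 1-2-3" usually trace back to the same normalization step: digits have to be verbalized before the acoustic model sees them, or the model is left guessing at symbols it was barely trained on. A minimal illustration (not KittenTTS's actual front end):

```python
DIGITS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

def verbalize_digits(text: str) -> str:
    """Spell out each digit; hyphens between words become spaces (crude)."""
    out = []
    for ch in text:
        if ch in DIGITS:
            out.append(DIGITS[ch])
        elif ch == "-" and out:
            out.append(" ")  # treat hyphen as a pause/separator
        else:
            out.append(ch)
    return "".join(out)

print(verbalize_digits("easy as 1-2-3"))  # -> "easy as one two three"
```

A production front end also has to handle years, ordinals, currency, and phone numbers differently, which is why number reading is a perennial weak spot in small models.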

by onair4you on 8/6/2025, 1:59 AM

Okay, lots of detailed information and example code, great. But skimming through, I didn't see any audio samples to judge the quality?

by dang on 8/6/2025, 5:12 AM

Most of these comments were originally posted to a different thread (https://news.ycombinator.com/item?id=44806543). I've moved them hither because on HN we always prefer to give the project creators credit for their work.

(It does, however, explain why many of these comments are older than the thread they are now children of.)

by spapas82 on 8/6/2025, 12:05 PM

This is great for English, but is there something similar for other languages? Could this be trained somehow to support other languages?

by rishav_sharan on 8/6/2025, 9:56 AM

Question for the experts here: what would be a SOTA TTS that can run on an average laptop (32GB RAM, 4GB VRAM)? I just want to attach a TTS to my SLM output and get the highest possible voice quality / human-likeness.

by maxloh on 8/6/2025, 3:53 AM

Hi. Will the training and fine-tuning code also be released?

It would be great if the training data were released too!

by pkaye on 8/6/2025, 2:11 AM

Where does the training data for the models come from? Is there an openly available dataset that people use?

by babycommando on 8/6/2025, 6:12 AM

Someone please port this to ONNX so we don't need to do all this ass tooling

by victorbjorklund on 8/6/2025, 6:15 AM

It is not the best TTS, but it is freaking amazing that it can be done by such a small model, and it is good enough for so many use cases.

by RobKohr on 8/6/2025, 2:16 AM

What's a good one in reverse: speech to text?

by akx on 8/6/2025, 2:30 PM

This is a fun model for circuit-bending, because the voice style vectors are pretty small.

For instance, try adding `np.random.shuffle(ref_s[0])` after the line `ref_s = self.voices[voice]`...

EDIT: be careful with your system volume settings if you do this.
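
The trick works because the style embedding is a short flat vector, so permuting it yields a new (usually broken, occasionally very loud) voice. The idea in plain Python, with a made-up 8-dimensional vector standing in for `ref_s`:

```python
import random

random.seed(42)  # reproducible chaos

# Stand-in for a voice style embedding (real ones are larger).
style = [0.12, -0.40, 0.77, 0.05, -0.91, 0.33, 0.48, -0.22]

bent = style[:]        # keep the original voice intact
random.shuffle(bent)   # same values, new order -> new "voice"

assert sorted(bent) == sorted(style)  # still a permutation
print(bent)
```

Since no value leaves the vector's original range, the model stays numerically stable; the output is merely a voice nobody trained for.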

by junon on 8/6/2025, 9:25 AM

This feels different. This feels like a genuinely monumental release. Holy cow.

Very well done. The quality is excellent and the technical parameters are, simply, unbelievable. Makes me want to try to embed this on a board just to see if it's possible.

by killerstorm on 8/6/2025, 8:30 AM

I'm curious why smallish TTS models have metallic voice quality.

The pronunciation sounds about right, and I thought that was the hard part; the model does it well. But voice timbre should be simpler to fix? Like, a simple FIR might improve it?
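
On the FIR idea: a fixed filter can reshape timbre, but the metallic quality of small vocoders is mostly time-varying phase and harmonic error, which a static filter can't undo. For illustration, here is a dependency-free moving-average FIR (a crude low-pass), the simplest instance of what such post-filtering would look like:

```python
def fir_filter(signal, taps):
    """Convolve signal with FIR taps (zero-padded at the start)."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * signal[n - k]
        out.append(acc)
    return out

# 4-tap moving average: dulls harsh highs but can't fix vocoder buzz.
taps = [0.25, 0.25, 0.25, 0.25]
print(fir_filter([1.0, 1.0, 1.0, 1.0, 1.0], taps))
# -> [0.25, 0.5, 0.75, 1.0, 1.0]; settles at 1.0 once the filter fills
```

This is why post-EQ makes a metallic voice duller rather than cleaner: the artifact lives in the fine structure, not the spectral envelope.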

by gunalx on 8/6/2025, 5:01 PM

Would love to see something like this trained for multilingual purposes. It seems to be roughly the same tier as piper, but a bit faster.

by tecleandor on 8/6/2025, 9:47 AM

Not bad for the size (with my very limited knowledge of this field) !

In a couple of tests, the "Male 2" voice sounds reasonable, but I've found it has problems with some groups of words, especially when played with little context. I think it's short sentences.

For example, if you try to do just "Hey gang!", it will sound something like "Chay yang". But if you add an additional sentence after that, it will sound a bit different (but still weird).

by C-Loftus on 8/6/2025, 12:36 PM

Awesome work! Often, in the TTS space, human-similarity is given way too much emphasis at the expense of user access. Frankly, as long as a voice is clear, once you listen to it for a while the brain filters out most of the quirks you would perceive on a first pass. That's why many blind folks are still perfectly fine using espeak-ng. The other properties, like speed of generation and size, make it worth it.

I've been using a custom AI audiobook generation program [0] with piper for quite a while now and am very excited to look at integrating Kitten. Historically, piper has been the only good option for a free CPU-only local model, so I am super happy to see more competition in the space. Easy installation is a big deal, since piper has historically had issues with that (hence why I had to add auto-installation support in [0]).

[0] https://github.com/C-Loftus/QuickPiperAudiobook

by binary132 on 8/6/2025, 11:14 AM

I’m new to TTS models but is this something I can plug into my own engine like with LLMs, or does it require the Python stack it ships with?

by wewewedxfgdf on 8/6/2025, 4:19 AM

Chrome does TTS too.

https://codepen.io/logicalmadboy/pen/RwpqMRV

by zelphirkalt on 8/6/2025, 10:15 AM

What I am still looking for is a way to clone a voice locally. I have OK hardware; for example, I can run Mistral Small 3.1, or whatever it is called, locally. Premade voices can be interesting too, but I am looking for a custom voice. Perhaps by providing audio and the corresponding transcript to the model, training it, and then giving it a new text and letting it speak that.

by the_arun on 8/6/2025, 2:38 PM

I like the direction we are heading. Build models that can run on CPUs, and AI can become even more mainstream.

by imprezagx2 on 8/6/2025, 11:23 AM

BEAT THIS! The Commodore 64 had the same feature, called SAM, a speech synthesizer that speaks English and Polish, in 48 kB of RAM.

BEAT THIS!

by mayli on 8/6/2025, 1:46 AM

Is this English only?

by butz on 8/6/2025, 2:43 PM

How does one build a similar model, but for different languages? I was under the impression that, being open source, there would be some instructions on how to build everything on your own.

by wewewedxfgdf on 8/6/2025, 2:11 AM

`say` is only 193K on macOS

  ls -lah /usr/bin/say
  -rwxr-xr-x  1 root  wheel   193K 15 Nov  2024 /usr/bin/say
Usage:

  M1-Mac-mini ~ % say "hello world this is the kitten TTS model speaking"

by dirkc on 8/6/2025, 12:16 PM

Have you considered adding some 'rendered' examples of what the model sounds like?

I'm curious, but right now I don't want to install the package and run some code.

by MrGilbert on 8/6/2025, 11:01 AM

A localized version of this, and I could finally build my tiny Amazon Echo replacement. I would love to see all speech synthesis performed on a local device.

by indigodaddy on 8/6/2025, 5:37 AM

Can coqui run CPU-only?

by mrfakename on 8/6/2025, 8:16 PM

Cool, it looks like this model is pretty similar to StyleTTS 2? Would it be possible to confirm?

by yunusabd on 8/6/2025, 3:53 PM

Impressive, might use this for https://hnup.date

by pjcodes on 8/6/2025, 8:39 PM

This looks pretty awesome. I will definitely give it a try and let you know the results.

by mg on 8/6/2025, 5:38 AM

Good TTS feels like it is something that should be natively built into every consumer device. So the user can decide if they want to read or listen to the text at hand.

I'm surprised that phone manufacturers do not include good TTS models in their browser APIs for example. So that websites can build good audio interfaces.

I for one would love to build a text editor that the user can operate completely via audio. Text input might already be feasible via the "speak to type" feature both Android and iOS offer.

But there seems to be no good way to output spoken text without doing round-trips to a server and generating the audio there.

The interface I would like would offer a way to talk to write and then commands like "Ok editor, read the last paragraph" or "Ok editor, delete the last sentence".

It could be cool to do writing this way while walking. Just with a headset connected to a phone that sits in one's pocket.

by thedangler on 8/6/2025, 1:26 PM

Elixir folks: how would I use this with Elixir? I'm new to Elixir and could use this in about 15 days.

by righthand on 8/6/2025, 5:24 AM

The sample rate does more than change the quality.

by marcobambini on 8/7/2025, 5:41 AM

Is there any way to get a .gguf version?

by bashkiddie on 8/6/2025, 10:34 AM

TL;DR: If you are interested in TTS, you should explore alternatives

I tried to use it...

Its Python venv has grown to 6 GB in size. The demo sentence

> "This high quality TTS model works without a GPU"

works, but it takes 3s to render the audio. The audio sounds like a voice in a tin can.

I tried to have a news article read aloud and failed with

> [E:onnxruntime:, sequential_executor.cc:572 ExecuteKernel] Non-zero status code returned while running Expand node. Name:'/bert/Expand' > Status Message: invalid expand shape

If you are interested in TTS, you should explore alternatives
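
The Expand-node failure on a full article suggests the exported ONNX graph has a bounded input length. A common workaround (an assumption here, not a documented fix for this model) is to split long text into sentence-sized chunks and synthesize each one separately:

```python
import re

def split_sentences(text: str, max_chars: int = 300):
    """Naive sentence splitter that also caps chunk length."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks

# Each chunk would then be fed to the model and the audio concatenated.
print(split_sentences("First sentence. Second one! Third?", max_chars=20))
# -> ['First sentence.', 'Second one! Third?']
```

Chunking also keeps initial latency low, since playback can start as soon as the first chunk renders.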

by GaggiX on 8/6/2025, 1:41 AM

https://huggingface.co/KittenML/kitten-tts-nano-0.1

https://github.com/KittenML/KittenTTS

Those are the model and GitHub pages; this blog post looks very much AI generated.

by Perz1val on 8/6/2025, 7:58 AM

Is the name a joke on "If the emperor had a tts device"? It's funny

by tapper on 8/6/2025, 7:54 AM

I am blind and use NVDA with a synth. How is this news? I don't get it! My synth is called Eloquence and is 4089KB.

by OrangeMusic on 8/7/2025, 1:03 PM

It's just so annoying and idiotic that there aren't a few samples on the home page. Did it not occur to you that it's the very first thing people would want to hear?

by alexwang123 on 8/7/2025, 6:14 AM

This is really great.

by BenGosub on 8/6/2025, 8:48 AM

I wonder what it would take to extend it with a custom voice.

by ghm2180 on 8/7/2025, 12:10 PM

Just amazing

by akrymski on 8/7/2025, 2:33 PM

Now if only we could get LLMs to this sort of size! I don't know much about how TTS works under the hood; why is it so much easier?

by mattfrommars on 8/6/2025, 4:21 PM

Can this work on an Intel NPU unit?

by countfeng on 8/6/2025, 7:34 AM

Very good model, thanks for open-sourcing it.

by andai on 8/6/2025, 2:36 AM

Can you run it in reverse for speech recognition?

by anthk on 8/6/2025, 10:23 AM

An Atom N270 running flite with a good voice (slt) vs this... would it be fast enough to play a MUD? Flite is almost realtime-fast...

by android521 on 8/6/2025, 6:24 AM

It would be great if there were TypeScript support in the future.

by system2 on 8/6/2025, 7:17 PM

One thing GitHub projects never have: a few-second demo.

by moomoo11 on 8/6/2025, 6:20 PM

Are there any speech-to-text models (the opposite direction) that I can load in a mobile app?

by 77pt77 on 8/6/2025, 5:28 PM

How does this compare to say piper-tts?

I ask because their models are pretty small, some sound awesome, and there is no dependency hell like I'm seeing here.

Example: https://rhasspy.github.io/piper-samples/#en_US-ryan-high

by alexnewman on 8/6/2025, 10:19 AM

I'm so confused about how the model is actually made. Either it isn't in the code, or this stuff is way simpler than I thought. It seems to use a fancy library from Japan; I'm not sure how much of it is just that.

by yahoozoo on 8/6/2025, 10:02 AM

Is there a paper describing the architecture of the model?

by Piraty on 8/7/2025, 1:30 PM

25M? lol. The venv is 6.9G.

by glietu on 8/6/2025, 4:07 AM

Kudos guys!

by jainilprajapati on 8/6/2025, 3:53 AM

♥

by m00dy on 8/6/2025, 4:29 PM

I think one of the female voices belongs to Elizabeth Warren.

by khanan on 8/6/2025, 6:48 AM

"please join our DISCORD!"...