Grey market fast-follow via distillation seems like an inevitable feature of the near to medium future.
I've previously doubted that the N-1 or N-2 open weight models will ever be attractive to end users, especially power users. But it now seems that user preferences will be yet another saturated benchmark, that even the N-2 models will fully satisfy.
Heck, even my own preferences may be getting saturated already. Opus 4.5 was a very legible jump from 4.1. But 4.6? Apparently better, but it hasn't changed my workflows or the types of problems / questions I put to it.
It's poetic - the greatest theft in human history followed by the greatest comeuppance.
No end-user on planet earth will suffer a single qualm at the notion that their bargain-basement Chinese AI provider 'stole' from American big tech.
Let's not miss that MiniMax M2.5 [1] is also available today in their Chat UI [2].
I've got subs for both and whilst GLM is better at coding, I end up using MiniMax a lot more as my general purpose fast workhorse thanks to its speed and excellent tool calling support.
GLM-4.7-Flash was the first local coding model that I felt was intelligent enough to be useful. It feels something like Claude 4.5 Haiku at a parameter size where other coding models are still getting into loops and making bewilderingly stupid tool calls. It also has very clear reasoning traces that feel like Claude, which does result in the ability to inspect its reasoning to figure out why it made certain decisions.
So far I haven't managed to get comparably good results out of any other local model including Devstral 2 Small and the more recent Qwen-Coder-Next.
It's looking like we'll have Chinese OSS to thank for being able to host our own intelligence, free from the whims of proprietary megacorps.
I know it doesn't make financial sense to self-host given how cheap OSS inference APIs are now, but it's comforting not being beholden to anyone or requiring a persistent internet connection for on-premise intelligence.
Didn't expect to go back to macOS but they're basically the only feasible consumer option for running large models locally.
Been using GLM-4.7 for a couple weeks now. Anecdotally, it’s comparable to Sonnet, but requires a little more instruction and clarity to get things right. For bigger, complex changes I still use Anthropic’s family, but for very concise and well-defined smaller tasks the price of GLM-4.7 is hard to beat.
It's live on openrouter now.
In my personal benchmark it's bad. So far the benchmark has been a really good indicator of instruction following and agentic behaviour in general.
To those who are curious: the benchmark is just the model's ability to follow a custom tool-calling format. I ask it to do coding tasks using chat.md [1] + MCPs. And so far it's just not able to follow it at all.
The benchmarks are impressive, but it's comparing to last generation models (Opus 4.5 and GPT-5.2). The competitor models are new, but they would have easily had enough time to re-run the benchmarks and update the press release by now.
Although it doesn't really matter much. All of the open weights models lately come with impressive benchmarks but then don't perform as well as expected in actual use. There's clearly some benchmaxxing going on.
The inherent problem with evaluating coding performance of models remains: most day-to-day coding tasks are open-ended/partially-spec'd, and as such there is huge uncertainty on how the "right" solution looks.
It's very hard to rank models' solutions on such problems, which is why they rarely appear in benchmarks (I'd be glad to stand corrected).
Even Opus 4.5 coding a C compiler from scratch - jaw-dropping as it is - doesn't tell the whole story. Most of my tasks are not that well spec'd.
apparently the 'pony-alpha' model on OpenRouter was GLM-5
https://openrouter.ai/openrouter/pony-alpha
z.ai tweet:
I got fed up with GLM-4.7 after using it for a few weeks; it was slow through z.ai and not as good as the benchmarks led me to believe (esp. with regards to instruction following), but I'm willing to give it another try.
There is a well-known CLI tool for JSON processing called jq. I have just asked GLM-4.7 for the name of jq's built-in function to convert a string to lowercase. It is called ascii_downcase() according to the manual:
https://jqlang.org/manual/#ascii_downcase-ascii_upcase
However, GLM-4.7 insists that it is called ascii_down().
I tried to correct it and gave the exact version number, but still, after a long internal monologue, this is its final word:
"In standard jq version 1.7, the function is named ascii_down, not ascii_downcase.
If you are receiving an error that ascii_down is not defined, please verify your version with jq --version. It is possible you are using a different binary (like gojq) or a version older than 1."
GLM-5 gives me the correct answer, ascii_downcase, but I can only get this in the chat window. Via the API I get HTTP status 429 - too many requests.
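For anyone who wants to check locally, a quick sanity test (a rough sketch via Python's subprocess, assuming jq >= 1.5 is on PATH; ascii_downcase is the documented builtin, ascii_down is not):

    import subprocess

    # ascii_downcase is the documented builtin: jq prints "hello" and exits 0.
    ok = subprocess.run(["jq", "-n", '"HELLO" | ascii_downcase'],
                        capture_output=True, text=True)
    print(ok.stdout.strip(), ok.returncode)   # "hello" 0

    # ascii_down is not defined, so jq reports an error and exits non-zero.
    bad = subprocess.run(["jq", "-n", '"HELLO" | ascii_down'],
                         capture_output=True, text=True)
    print(bad.returncode != 0)                # True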
What I haven't seen discussed anywhere so far is how big a lead Anthropic seems to have in intelligence per output token, e.g. if you look at [1].
We already know that intelligence scales with the log of tokens used for reasoning, but Anthropic seems to have much more powerful non-reasoning models than its competitors.
I read somewhere that they have a policy of not advancing capabilities too much, so could it be that they are sandbagging and releasing models with artificially capped reasoning to be at a similar level to their competitors?
How do you read this?
Been playing with it in opencode for a bit and pretty impressed so far. Certainly more of an incremental improvement than a big-bang change, but it does seem a good bit better than 4.7, which in turn was a modest but real improvement over 4.6.
Certainly seems to remember things better and is more stable on long running tasks.
If you're tired of cross-referencing the cherry-picked benchmarks, here's the geometric mean of SWE-bench Verified & HLE-tools:
Claude Opus 4.6: 65.5%
GLM-5: 62.6%
GPT-5.2: 60.3%
Gemini 3 Pro: 59.1%
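(For clarity, the combined score is just the square root of the product of the two benchmark scores; a minimal sketch below, with placeholder per-benchmark numbers since only the combined figures are listed above.)

    import math

    # Geometric mean of two benchmark scores (both as fractions in [0, 1]).
    # The inputs below are placeholders, not the actual per-benchmark results.
    def combined(swe_bench_verified: float, hle_tools: float) -> float:
        return math.sqrt(swe_bench_verified * hle_tools)

    print(f"{combined(0.72, 0.60) * 100:.1f}%")   # -> 65.7%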
Interesting timing — GLM-4.7 was already impressive for local use on 24GB+ setups. Curious to see when the distilled/quantized versions of GLM-5 drop. The gap between what you can run via API vs locally keeps shrinking. I've been tracking which models actually run well at each RAM tier, and the Chinese models (Qwen, DeepSeek, GLM) are dominating the local inference space right now.
It might be impressive on benchmarks, but there's just no way for them to break through the noise from the frontier models. At these prices they're just hemorrhaging money. I can't see a path forward for the smaller companies in this space.
So that was pony alpha (1). Now what's Aurora Alpha?
What is truly amazing here is the fact that they trained this entirely on Huawei Ascend chips, per reporting [1]. Hence we can conclude the Chinese semiconductor-to-model tech stack is only ~3 months behind the US, considering Opus 4.5 released in November. (Excluding the lithography equipment here, as SMIC still uses older ASML DUV machines.) This is huge, especially since just a few months ago it was reported that DeepSeek were not using Huawei chips due to technical issues [2].
US attempts to contain Chinese AI tech totally failed. Not only that, they cost Nvidia possibly trillions of dollars of exports over the next decade, as the Chinese govt called the American bluff and now actively disallows imports of Nvidia chips as a direct result of past sanctions [3]. At a time when the Trump admin is trying to do whatever it can to reduce the US trade imbalance with China.
[1] https://tech.yahoo.com/ai/articles/chinas-ai-startup-zhipu-r...
[2] https://www.techradar.com/pro/chaos-at-deepseek-as-r2-launch...
[3] https://www.reuters.com/world/china/chinas-customs-agents-to...
I kinda feel this benchmarking thing with Chinese models is like university Olympiads: they specifically study for those, but when the time comes for real-world work they seriously lag behind.
I've been using GLM 4.7 with opencode.
It is for sure not as good, but the generous limits mean that for a price I can afford I can use it all day, and that is a game changer for me.
I can't use this model yet as they are slowly rolling it out but I'm excited to try it.
Here is the pricing per M tokens. https://docs.z.ai/guides/overview/pricing
Why is GLM 5 more expensive than GLM 4.7 even when using sparse attention?
There is also a GLM 5-code model.
Wut? Wasn't GLM 4.7 released just a few weeks ago?
I wonder if I will be able to use it with my coding plan. Paid just 9 USD for 3 months.
I'd say that they're super confident about the GLM-5 release, since they're directly comparing it with Opus 4.5 and don't mention Sonnet 4.5 at all.
I am still waiting to see if they'll launch a GLM-5 Air series, which would run on consumer hardware.
GLM 5 beats Kimi on SWE bench and Terminal bench. If it's anywhere near Kimi in price, this looks great.
Edit: Input tokens are twice as expensive. That might be a deal breaker.
Let's hope they release it to huggingface soon.
I tried their keyboard switch demo prompt and adapted it to create a 2D, WebGL-less version using CSS and SVG, and it seems to work nicely; it thinks for a very long time, however. https://chat.z.ai/c/ff035b96-5093-4408-9231-d5ef8dab7261
Really impressive benchmarks. It was commonly stated that open source models were lagging 6 months behind state of the art, but they are likely even closer now.
744B params is ~1.5TB VRAM (FP16). Even at int4, you need ~372GB just to load the weights (MoE sparsity saves FLOPs, not VRAM capacity). That's not a workstation, that's a rack with 5x H100s or a cluster of 8x RTX 6000 Adas.
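The back-of-envelope math, as a rough sketch (weights only, ignoring KV cache and activations):

    # Weight memory for a 744B-parameter model. MoE sparsity reduces FLOPs per
    # token, but every expert still has to sit somewhere, so total params count.
    params = 744e9

    fp16_gb = params * 2 / 1e9     # 2 bytes per param
    int4_gb = params * 0.5 / 1e9   # 0.5 bytes per param

    print(f"FP16: ~{fp16_gb:.0f} GB, int4: ~{int4_gb:.0f} GB")
    # -> FP16: ~1488 GB, int4: ~372 GB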
The only real use cases here are strict data sovereignty (can't use US APIs) or using it as a teacher for distillation. Otherwise, the ROI on self-hosting is nonexistent.
Also, the disconnect between SOTA on Terminal bench and ~30% on Humanity's Last Exam suggests it overfitted on agent logs rather than learning deep reasoning.
While GLM-5 seems impressive, this release also included lots of new cool stuff!
> GLM-5 can turn text or source materials directly into .docx, .pdf, and .xlsx files—PRDs, lesson plans, exams, spreadsheets, financial reports, run sheets, menus, and more.
A new type of model has joined the series, GLM-5-Coder.
GLM-5 was trained on Huawei Ascend. Last time, when DeepSeek tried to use this chip, it flopped and they went back to Nvidia. This time it seems like a success.
Looks like they also released their own agentic IDE, https://zcode.z.ai
I don’t know if anyone else knows this, but Z.ai also released new tools besides the Chat! There’s Zread (https://zread.ai), OCR (seems new? https://ocr.z.ai), GLM-Image gen https://image.z.ai and Voice cloning https://audio.z.ai
If you go to chat.z.ai, there is a new toggle in the prompt field, you can now toggle between chat/agentic. It is only visible when you switch to GLM-5.
Very fascinating stuff!
Bought some API credits and ran it through opencode (model was "GLM 5").
Pretty impressed, it did good work. Good reasoning skills and tool use. Even in "unfamiliar" programming languages: I had it connect to my running MOO and refactor and rewrite some MOO (dynamic typed OO scripting language) verbs by MCP. It made basically no mistakes with the programming language despite it being my own bespoke language & runtime with syntactical and runtime additions of my own (lambdas, new types, for comprehensions, etc). It reasoned everything through by looking at the API surface and example code. No serious mistakes and tested its work and fixed as it went.
Its initial analysis phase found leftover/sloppy work that Codex/GPT 5.3 left behind in a session yesterday.
Cost me $1.50 USD in token credits to do it, but z.AI offers a coding plan which is absolutely worth it if this is the caliber of model they're offering.
I could absolutely see combining the z.AI coding plan with a $20 Codex plan such that you switch back and forth between GPT 5.3 and GLM 5 depending on task complexity or intricacy. GPT 5.3 would only be necessary for really nitty gritty analysis. And since you can use both in opencode, you could start a session by establishing context and analysis in Codex and then having GLM do the grunt work.
Thanks z.AI!
I paid for the $30 plan. It's useful to me via OpenCode as a cheap backend for CLI/Agentic workflows.
I also want to try it with Wiggam Loop to test whether they can together build production-level code if guided via prompts and a PRD. Let's see!
They increased their prices substantially
How do you use GLM-5? Last time I tried GLM models the most basic system engineering tasks were not allowed (like SSH)
Blog post and hugging face link are out.
See related thread: https://news.ycombinator.com/item?id=46977210
Soft launch? I can't find a blog post on their website.
I am using it with Claude Code and so far so good. Can't tell if it's as good as Opus 4.6 or not yet
I predict a new speculative market will emerge where adherents buy and sell vibe-coded companies.
Betting on whether they can actually perform their sold behaviors.
Passing around code repositories for years without ever trying to run them, factory sealed.
It feels like Anthropic's models from 6 months ago. I mean, it's great progress in the open weight world, but I don't have time to use anything less than the very best for the coding I do. At the same time, if Anthropic and OpenAI disappeared tomorrow, I could survive with GLM-5.
It looks like this requires 1.5TB of VRAM? Did I get that wrong? What would be the least unreasonable way you host this without quantizing it?
Do we know if it has vision? That is lacking from 4.7; you need to use an MCP for it.
Maybe it is just the HN effect, but it is really slow.
Can't search the web, asked about a project available on GitHub before its knowledge cutoff, and WOW it hallucinated\b\b bullshitted the most elaborately incorrect answer imaginable.
Immediately deemed irrelevant to me, personally.
I asked chat.z.ai with GLM 5 "How do I start coding with z.ai?" and got this in the answer...
> Z.ai (Personalized Video)
> If you literally meant the website z.ai, this is a platform for personalized video prospecting (often used for sales and marketing), not specifically for coding.
It will be tough to run on our 4x H200 node… I wish they stayed around the 350B range. MLA will reduce KV cache usage but I don’t think the reduction will be significant enough.
Why don't they publish ARC-AGI results? Too expensive?
The benchmarks and pricing made me realize how good Kimi 2.5 is. I'm an Opus 4.6 person, but wow, it's almost 5x cheaper.
Why are we not comparing to Opus 4.6 and GPT-5.3 Codex...
Honestly these companies are so hard to take seriously with these release details. If it's an open source model and you're only comparing open source - cool.
If you're not top in your segment, maybe show how your token cost and output speed more than make up for that.
Purposely showing prior-gen models in your release comparison immediately discredits you in my eyes.
5.0 flash with native sub-agents released to huggingface.... one can wish right :)
I wish China would start copying Demis' biotech models as well soon.
Rumour says that this model is exclusively trained on Huawei chips.
I hope Cerebras offers this soon. Working with GLM-4.7 from Cerebras was a major boost compared with other models.
we're seeing so many LLM releases that they can't even keep their benchmark comparisons updated
Is this a lot cheaper to run (on their service or rented GPUs) than Claude or ChatGPT?
- Meh, I asked what happened to Virginia Giuffre and it told me that she's alive and well, living with her husband and children in Australia.
- I pointed out that she died in 2025, and then it told me that my question was a prank, with a gaslighting tone, because that date is 11 months into the future.
- It never tried to search the internet for updated knowledge even though the toggle was ON.
- All other AI competitors get this right.
afaiu this will also be an open weight release (soon?)
> Scaling is still one of the most important ways to improve the intelligence efficiency of Artificial General Intelligence (AGI)
Claiming that LLMs are anywhere near AGI is enough to let me know I shouldn't waste my time looking at the rest of the page or any of their projects.
Just tried it; it's practically the same as GLM-4.7. It isn't as "wide" as Claude or Codex, so even on a simple prompt it misses out on one important detail: instead of investigating fully before starting, it ploughs ahead with the next best thing it thinks you asked for.
Submitted url could be blog post: https://z.ai/blog/glm-5
GLM5 is showing very disappointing general problem solving abilities
Whoa, I think GPT-5.3-Codex was a disappointment, but GLM-5 is definitely the future!
How do you get a domain like z.ai?
The number of times in the past year that a competitor's benchmarks said something was close to Claude and it was actually remotely close in practice: 0.
Ask the chat what happened in Tiananmen Square in 1989 and it immediately gets stuck. Chinese moderation is the worst; evil government.
I occasionally see z.ai mentioned and then I remember that I had to block their email since they spammed me with an unsolicited ad. Since then I'm very skeptical of using them.
Pelican generated via OpenRouter: https://gist.github.com/simonw/cc4ca7815ae82562e89a9fdd99f07...
Solid bird, not a great bicycle frame.