Show HN: Eyot, A programming language where the GPU is just another thread

by steeleduncan on 3/8/2026, 11:04 AM, with 18 comments

by teleforce on 3/9/2026, 11:19 PM

Perhaps any new language targeting GPU acceleration should consider the tile-based concepts and primitives recently supported by major GPU vendors, including Nvidia [1],[2],[3],[4].

For more generic GPU targets there's TRITON [5],[6].
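The core idea behind the tile-based model in these projects is that the programmer expresses work in small fixed-size blocks, and the compiler maps each block onto the GPU's tile/tensor hardware. A minimal sketch of that decomposition in plain Python (illustrative only; real tile languages like Triton or Tilus express each block product as a single hardware tile operation):

```python
# Tiled matrix multiply: the triple-nested outer loops walk over
# tile x tile blocks, which is the unit of work a tile-level GPU
# language would hand to the hardware as one operation.
def tiled_matmul(a, b, tile=2):
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # one block product: c[i0:i0+tile, j0:j0+tile] +=
                #   a[i0:i0+tile, k0:k0+tile] @ b[k0:k0+tile, j0:j0+tile]
                for i in range(i0, i0 + tile):
                    for j in range(j0, j0 + tile):
                        for k in range(k0, k0 + tile):
                            c[i][j] += a[i][k] * b[k][j]
    return c

a = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0],
     [9.0, 8.0, 7.0, 6.0],
     [5.0, 4.0, 3.0, 2.0]]
b = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
assert tiled_matmul(a, b) == a  # multiplying by identity is a no-op
```

The point of the tile abstraction is that the blocking, shared-memory staging, and tensor-core dispatch happen per tile, not per scalar, so the compiler has a natural unit to optimize.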

[1] NVIDIA CUDA 13.1 Powers Next-Gen GPU Programming with NVIDIA CUDA Tile and Performance Gains:

https://developer.nvidia.com/blog/nvidia-cuda-13-1-powers-ne...

[2] Nvidia Tilus: A Tile-Level GPU Kernel Programming Language:

https://github.com/NVIDIA/tilus

[3] Simplify GPU Programming with NVIDIA CUDA Tile in Python:

https://developer.nvidia.com/blog/simplify-gpu-programming-w...

[4] Tile Language:

https://github.com/tile-ai/tilelang

[5] Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations:

https://dl.acm.org/doi/10.1145/3315508.3329973

[6] Triton:

https://github.com/triton-lang/triton

by MeteorMarc on 3/8/2026, 1:24 PM

That is fun: it blends C-style block markers (curly braces) with Python-style line separation (newlines). No objection.

by shubhamintech on 3/8/2026, 11:56 PM

The latency point matters more than it looks, imo. GPU work isn't just async CPU work at a different speed; the cost model is completely different. In LLM inference, the hard scheduling problem is batching non-uniform requests, where prompt lengths and generation lengths vary, and treating that like normal thread scheduling leads to terrible utilization. Would be curious whether Eyot has anything to say about non-uniform work units.
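The utilization problem with non-uniform work units can be shown with a toy model (all numbers made up): if a static batch must run until its longest request finishes, short requests leave slots idle, whereas refilling slots as requests complete ("continuous batching", as in modern inference servers) keeps them busy.

```python
import heapq

def static_steps(gens, slots):
    """Static batching: each batch of `slots` requests runs until its
    longest member finishes, so short requests idle in their slots."""
    steps = 0
    for i in range(0, len(gens), slots):
        steps += max(gens[i:i + slots])  # whole batch waits for the longest
    return steps

def continuous_steps(gens, slots):
    """Continuous batching: a slot is refilled with the next request as
    soon as its current request completes."""
    finish = [0] * slots          # per-slot finish times
    heapq.heapify(finish)
    for g in gens:
        t = heapq.heappop(finish)  # earliest-free slot takes the request
        heapq.heappush(finish, t + g)
    return max(finish)             # wall-clock steps until all done

# Eight requests, mostly short, two long (hypothetical generation lengths)
gens = [1, 1, 1, 8, 1, 1, 1, 8]
print(static_steps(gens, slots=4))      # 16 steps
print(continuous_steps(gens, slots=4))  # 10 steps
```

With a plain thread-pool mental model both schedules look equivalent, which is exactly why the cost model difference matters.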

by sourcegrift on 3/8/2026, 12:55 PM

Don't mean to be a Rust fanatic or whatever, but does anyone know of anything similar for Rust?

by LorenDB on 3/8/2026, 12:36 PM

This reminds me that I'd love to see SYCL get more love. Right now, of the computer hardware manufacturers, it seems only Intel is putting any effort into it.

by CyberDildonics on 3/8/2026, 4:13 PM

Every time someone does something with threading and makes it a language feature, it always seems like it could just be done with stock C++.

Whatever this is doing could be wrapped up in another language.

Either way, it's arguable whether that is even a good idea, since dealing with a regular thread in the same memory space, getting data to and from the GPU, and doing computations on the GPU are all completely separate concerns with different latency characteristics.
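The latency-characteristics point can be made concrete with a toy cost model (all numbers invented for illustration): offloading pays a fixed transfer cost before any compute happens, so below some problem size the "GPU as just another thread" abstraction hides a net loss.

```python
# Hypothetical latencies, in microseconds; real values depend on the
# bus, the device, and the kernel, but the shape of the tradeoff holds.
TRANSFER_US  = 50.0    # fixed host<->device round-trip overhead
GPU_PER_ELEM = 0.001   # per-element cost once data is on the GPU
CPU_PER_ELEM = 0.01    # per-element cost on a regular CPU thread

def gpu_cost(n):
    return TRANSFER_US + GPU_PER_ELEM * n

def cpu_cost(n):
    return CPU_PER_ELEM * n

# Crossover: below ~5,600 elements (under these made-up numbers),
# dispatching to the "GPU thread" is slower than just computing locally.
for n in (1_000, 10_000, 100_000):
    print(n, cpu_cost(n), gpu_cost(n))
```

A uniform thread abstraction makes both dispatches look identical at the call site, which is precisely the concern: the abstraction erases the fixed transfer cost from the programmer's view.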