How We Optimize LLM Inference for an AI Coding Assistant

by knes on 11/25/2024, 7:27 PM with 5 comments

by mmoskal on 12/1/2024, 4:01 PM

> The academic literature has recently caught on to this approach as well and we want to highlight the excellent papers on Sarathi (link 1, link 2) and DeepSpeed-FastGen (link). The academic literature calls the technique “chunked prefill”.

The Sarathi paper is more than a year old and is implemented in both vllm and trtllm. There doesn't seem to be much else of substance in the article. Edit: to be fair, it seems the article was written some time ago and only "updated" two weeks ago, so maybe it was more novel at the time.
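For anyone who wants to try it, chunked prefill is exposed as an engine option in vllm. A minimal sketch (the model name and token budget below are placeholder values, and flag names may differ across vllm versions, so check your version's EngineArgs):

```python
# Minimal sketch: enabling chunked prefill in vLLM. Model and budget values
# are placeholders, not recommendations.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # any model you have access to
    enable_chunked_prefill=True,               # split long prompts into chunks
    max_num_batched_tokens=2048,               # per-step token budget shared by
                                               # prefill chunks and decode tokens
)

params = SamplingParams(max_tokens=64)
print(llm.generate(["def quicksort(arr):"], params)[0].outputs[0].text)
```

The point of `max_num_batched_tokens` is that prefill chunks and decode tokens draw from the same per-step budget, which is exactly the mixing the Sarathi paper describes.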

by Taylor_OD on 12/1/2024, 7:00 PM

This is great. Increased context does seem like the biggest current issue with Copilot. It will take a lot of work to understand an entire codebase, but if we could get to understanding 2-5 files and how they interact... coding assistants would be a lot more useful.

by mellosouls on 12/1/2024, 7:45 PM

This product looks interesting but very expensive ($60/month) compared to class leaders like Claude.

Given that I think those alternatives are extremely cheap, that's not necessarily a problem - but it would be useful to understand more about how the results compare in real-world use, and what exactly is being provided. It seems to be saying this is a direct competitor to other LLMs, enhanced with contextual magic, i.e. it doesn't sit on top of a third-party LLM endpoint like, say, Cursor does.

But if it is three times the price of services like Cursor, Claude, or ChatGPT, it really needs to give evidence for that value in its outreach efforts, rather than just saying "this is how context makes things better".

by joaquincabezas on 12/1/2024, 6:48 PM

Balancing the compute-bound (prefill) and memory-bound (decode) phases is a fine art. Luckily there are big wins available if you can tune the balance to your use case (coding assistants, in this case), but it is generally a lonely journey. Good to see you partnered with Colfax International.
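To make the balancing act concrete, here's a toy token-budget scheduler in plain Python (my own illustration, not Augment's or any real engine's scheduler; all names and numbers are made up): each step packs the pending memory-bound decode tokens first, then fills the remaining budget with compute-bound prefill chunks.

```python
from collections import deque

CHUNK = 256         # max prefill tokens taken from one prompt per step (tunable)
TOKEN_BUDGET = 512  # total tokens the forward pass handles per step

def schedule_step(prefill_queue: deque, decode_batch: list) -> list:
    """One scheduling step: decodes first (1 token each, memory-bound),
    then prefill chunks (compute-bound) until the token budget is spent."""
    batch = [("decode", req["id"], 1) for req in decode_batch]
    budget = TOKEN_BUDGET - len(batch)
    while prefill_queue and budget > 0:
        req = prefill_queue[0]
        take = min(CHUNK, req["remaining"], budget)
        batch.append(("prefill", req["id"], take))
        req["remaining"] -= take
        budget -= take
        if req["remaining"] == 0:
            decode_batch.append(prefill_queue.popleft())  # prompt done; starts decoding
        else:
            break  # partial chunk: resume this prompt on the next step
    return batch

# Two prompts of 700 and 200 tokens; watch prefill chunks give way to decodes.
prefills = deque([{"id": 0, "remaining": 700}, {"id": 1, "remaining": 200}])
decodes: list = []
for step in range(4):
    print(step, schedule_step(prefills, decodes))
```

The knob is essentially CHUNK versus TOKEN_BUDGET: bigger chunks keep the GPU compute-saturated, smaller ones keep per-token decode latency steady, and where you set that dial depends on your workload.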

by ucefkh on 12/1/2024, 6:00 PM

Well, a bigger context is important for understanding the codebase as a whole and predicting the next piece of code or decision.

Pretty good job.