This is a weekend hack that I'd like to further develop as it's working surprisingly well.
Using MCTS, we can explore a space of possible verified programs with an LLM. We check the partial programs at each step, and so steer towards programs that pass the verifier.
https://github.com/namin/llm-verified-with-monte-carlo-tree-...
Anyone who's game able to explain what this actually does?
The main loop looks simple:
def generate_complete(text, montecarlo):
text = llm.generate(text, 1)[0]
score = score_func(text)
if score is not None:
if score < 0:
return None
else:
if can_be_solution(text, min_lines, check_fun):
montecarlo.solution = text
return text
else:
return generate_complete(text, montecarlo)
...but, the heart of it (1) just basically checks if the dafny syntax is valid by posting to https://dafny.livecode.ch/checkHow is 'syntax is valid' a valid scoring mechanism here?
If you look at the output examples (2), all I can see if generating functions and lemmas; this is equivalent to generating a function a bunch of tests for it.
I'm not sure I see what value the MCTS is bringing here.
Anyone get this and care to explain?
[1] - https://github.com/namin/llm-verified-with-monte-carlo-tree-...
[2] - https://github.com/namin/llm-verified-with-monte-carlo-tree-...
I love that the code is so short and understandable. Could be turned into a Pearl.
Could this be combined with something like llama.cpp's constraint-based grammar (https://github.com/ggerganov/llama.cpp/blob/master/grammars/...) to always enforce syntactically correct code output?
Here is a dumb question: how does it distinguish between a partial solution that is just missing some characters/lines and nonsense?
Nice! I had a similar idea to use MCTS to explore various AST generations through an incremental AST parser[0], so I could generate runnable code via LLM at a higher success rate than what we were seeing initially with them. I never got around to it, but this makes me curious to circle back around.
[0] There are few parsers that consider "failed, but could run if the next symbol is valid" and "failed and no way to continue" as separate statuses so you can make progress by enumerating (hopefully intelligently so) the space of valid states. Surprisingly TreeSitter wasn't one of them though, for that it's just "fail".