Taming LLMs: Using Executable Oracles to Prevent Bad Code

(john.regehr.org)

18 points | by mad44 3 hours ago

4 comments

dktoao an hour ago
"Our goal should be to give an LLM coding agent zero degrees of freedom"
Wouldn't that just be called inventing a new language, with all the overhead of the languages we already have? Are we getting to the point where getting LLMs to be productive and also write good code is going to require so much overhead and so many additional procedures and tools that we might as well write the code ourselves? Hmmm...
- virgilp 19 minutes ago
  Actually, no. We always needed good checks - that's why you have techniques like automated canary analysis, extensive testing, checking for coverage - these are forms of "executable oracles". If you wanted to do continuous deployment, you had to be very thorough in your validation.
  LLMs just take this to the extreme. You can no longer rely on human code reviews (well you can but you give away all the LLM advantages) so then if you take out "human judgement" *from validation*[1], you have to resort to very sophisticated automated validation. This is it - it's not about "inventing a new language", it's about being much more thorough (and innovative, and efficient) in the validation process.
  [1] never from design or specification - you shouldn't outsource that to AI. I don't think we're close to an AI that can do that even moderately effectively without human help.
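To make "executable oracle" concrete, here is a minimal sketch of one common form, differential testing: a candidate implementation is compared against a trusted reference on random inputs. Everything named here (`reference_sort`, `candidate_sort`, `check_against_oracle`) is hypothetical and only for illustration; it is not taken from the linked article.

```python
import random
from typing import Callable, List, Optional

SortFn = Callable[[List[int]], List[int]]


def reference_sort(xs: List[int]) -> List[int]:
    """Trusted oracle (assumption: the built-in sort is our ground truth)."""
    return sorted(xs)


def candidate_sort(xs: List[int]) -> List[int]:
    """Stand-in for machine-generated code under test: a plain insertion sort."""
    out = list(xs)
    for i in range(1, len(out)):
        j = i
        # Swap the element leftwards until it is in order.
        while j > 0 and out[j - 1] > out[j]:
            out[j - 1], out[j] = out[j], out[j - 1]
            j -= 1
    return out


def check_against_oracle(
    candidate: SortFn, oracle: SortFn, trials: int = 500, seed: int = 0
) -> Optional[List[int]]:
    """Return the first input where candidate and oracle disagree, else None."""
    rng = random.Random(seed)  # fixed seed so failures are reproducible
    for _ in range(trials):
        xs = [rng.randint(-50, 50) for _ in range(rng.randint(0, 20))]
        # Pass copies so neither function can mutate the other's input.
        if candidate(list(xs)) != oracle(list(xs)):
            return xs
    return None


if __name__ == "__main__":
    counterexample = check_against_oracle(candidate_sort, reference_sort)
    print("no disagreement found" if counterexample is None else counterexample)
```

The point of the sketch is that no human reads `candidate_sort`; the check either finds a counterexample or it does not. A passing run only shows agreement on the sampled inputs, not correctness in general.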
- seanw444 25 minutes ago
  Yeah, precision LLM coding is kind of an oxymoron. English language -> codebase is essentially lossily-compressed logic by definition. The less lossy the compression becomes, the more you probably approach re-inventing programming languages. Which then means that in order to use LLMs to code, you're accepting some degree of imprecision.
voxaai 34 minutes ago
ran into this with creative generation. for code, formal constraints work great. but when the quality criteria can't be typed ("feels right for this audience", "sounds like infrastructure, not a toy"), constraints made things worse. what worked was competing generators with different objectives, then ranking the outputs against the brief. the variance from competition was more useful than the precision from constraints.
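A rough sketch of the "competing generators, then rank against the brief" idea, under heavy assumptions: the two generators are hard-coded stand-ins for model calls with different objectives, and the scorer is a toy keyword count standing in for whatever judge (human or model) would actually be used. All names here are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Generator = Callable[[str], str]


def generate_plain(audience: str) -> str:
    """Hypothetical generator tuned for a sober, infrastructure-like tone."""
    return f"Reliable infrastructure for {audience}"


def generate_playful(audience: str) -> str:
    """Hypothetical generator tuned for a light, playful tone."""
    return f"A fun little toy for {audience}!"


def score_against_brief(candidate: str, wanted: List[str], unwanted: List[str]) -> int:
    """Toy judge: +1 per wanted keyword present, -1 per unwanted keyword present."""
    text = candidate.lower()
    return sum(w in text for w in wanted) - sum(w in text for w in unwanted)


def rank_candidates(
    audience: str,
    generators: Dict[str, Generator],
    wanted: List[str],
    unwanted: List[str],
) -> List[Tuple[int, str, str]]:
    """Run every generator once and return (score, name, text), best first."""
    scored = []
    for name, generate in generators.items():
        text = generate(audience)
        scored.append((score_against_brief(text, wanted, unwanted), name, text))
    return sorted(scored, key=lambda item: item[0], reverse=True)


if __name__ == "__main__":
    ranking = rank_candidates(
        "platform teams",
        {"plain": generate_plain, "playful": generate_playful},
        wanted=["infrastructure", "reliable"],
        unwanted=["toy", "fun"],
    )
    for score, name, text in ranking:
        print(score, name, text)
```

The design choice worth noting is that the brief enters only at ranking time, so the generators stay free to diverge; a keyword count is far too crude for real creative work and is used here only to keep the example runnable.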