"And what matters is if it works."
This is a comment about @kabir2023answers, following a theme in my research. \@NektariosAI is replying to \@GaryMarcus, who said: "the study still confirms something I (and others) have been saying: people mistake the grammaticality etc of LLMs for truth."
\@NektariosAI via Twitter on Aug 10, 2023
I understand. But when it comes to coding, if it's not true, it most likely won't work. And what matters is if it works. Only a bad programmer will accept the answer without testing it. You may need a few rounds of prompting to get to the right answer and often it knows how to correct itself. It will also suggest other more efficient approaches.
I discuss the evaluation of general-purpose search results in my dissertation, largely in Ch. 4, "Extending searching", which as a whole gets at a key position I hold on evaluating the performance of LLMs: we have to extend our observations and analysis well beyond the raw outputs of these systems. In that chapter I discuss "Running workable code" and "Decoupling performance from search", and note that not all qualities desirable in code are as easily testable as simply 'running it' (for my professional data engineer participants, at least, the processes and practices of the organization were the final evaluator):
But workable in the moment isn’t enough. There is other information that isn’t immediately testable by running it through a compiler or interpreter and seeing if it “works”. This is particularly the case for non-functional properties (normally including things like security, reliability, and scalability but could also include considerations of harm both in the implementation and use of the code (Widder et al., 2022, p. 8)).
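To make that concrete, here is a minimal Python sketch (hypothetical code, not drawn from the dissertation or any participant): a lookup function that passes a quick run-it-and-see functional check yet carries a SQL injection flaw, a non-functional failure that simply executing the code never surfaces.

```python
import sqlite3

def find_user(conn: sqlite3.Connection, username: str):
    # Builds the query by string interpolation: this "works" for ordinary
    # inputs, so a quick run-it-and-see check passes.
    query = f"SELECT id, username FROM users WHERE username = '{username}'"
    return conn.execute(query).fetchall()

# A functional smoke test: create a table, insert a row, look it up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
conn.execute("INSERT INTO users (username) VALUES ('alice')")
assert find_user(conn, "alice") == [(1, "alice")]  # passes: it "works"

# But the non-functional property (security) fails silently: a crafted
# input returns every row via SQL injection, and no ordinary run of the
# code reveals that.
assert find_user(conn, "' OR '1'='1") == [(1, "alice")]  # injection also "works"
```

Both asserts pass, so the code "works" in the tweet's sense; only evaluation against non-functional criteria (here, security) catches the problem, which is exactly the kind of observation that sits well beyond the raw output.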