TL;DR: ChatGPT-4 now seems able to adequately address this specific challenge, though this doesn’t tell us much. (Also, a reminder that your custom instructions may limit your performance.)


    Melanie Mitchell proposed this challenge to claims re GPT-4 “understanding” on Twitter in May:

    Here we have a toothpick, a bowl of pudding, a full glass of water, and a marshmallow. Please tell me how to stack them onto each other in a stable manner.

    I played with dozens of prompts to nudge towards a solution (see a portion of the conversations here (link to a tweet with a screenshot)) but, as she noted, “the physical intuitions are….well, not robust”. I proposed this classroom exercise and largely forgot about it:

    @danielsgriffin via Twitter on May 16, 2023

    Classroom exercise: Develop, justify, then test 10 prompts to attempt adding to this question to reliably get a solution: “Here we have a toothpick, a bowl of pudding, a full glass of water, and a marshmallow. Please tell me how to stack them onto each other in a stable manner.”

    I was intrigued mostly with how different prompts, like search queries, can produce vastly different responses. (I wasn’t focused on proving understanding or lack thereof.)
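
    One way to run the classroom exercise in code is sketched below. It is a rough harness, not a tested evaluation method: the names `pass_rates`, `ask`, and `judge` are hypothetical, `ask` stands in for whatever tool you are querying (no real API is called here), and `judge` stands in for a human or scripted check of whether a response describes a stable stack.

```python
from typing import Callable, Dict, Iterable

# The challenge prompt quoted above.
BASE_PROMPT = (
    "Here we have a toothpick, a bowl of pudding, a full glass of water, "
    "and a marshmallow. Please tell me how to stack them onto each other "
    "in a stable manner."
)


def pass_rates(
    additions: Iterable[str],
    ask: Callable[[str], str],
    judge: Callable[[str], bool],
    trials: int = 5,
) -> Dict[str, float]:
    """For each prompt addition, return the fraction of trials that passed `judge`."""
    rates: Dict[str, float] = {}
    for addition in additions:
        # An empty addition leaves the raw challenge prompt unchanged.
        prompt = f"{BASE_PROMPT} {addition}".strip()
        passed = sum(1 for _ in range(trials) if judge(ask(prompt)))
        rates[addition] = passed / trials
    return rates


if __name__ == "__main__":
    # Stand-in model and judge so the sketch runs without any external service.
    def fake_ask(prompt: str) -> str:
        return "Put the glass at the bottom." if "weight" in prompt else "Balance the glass on the marshmallow."

    def fake_judge(response: str) -> bool:
        return "bottom" in response

    additions = ["", "Think about which items can bear weight."]
    for addition, rate in pass_rates(additions, fake_ask, fake_judge).items():
        print(repr(addition), rate)
```

    Repeating each prompt several times is the point of the "reliably" in the exercise: a single good or bad response says little about how a given addition behaves.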

    Today I tried the raw challenge prompt again (after failing with a custom instruction, below) and ChatGPT-4 produced much better—seemingly adequate—results (link to conversation with OpenAI’s ChatGPT).

    Note: Both Google’s Bard (link to a tweet with a screenshot; I only looked at the top draft) and Anthropic’s Claude 2 (link to a tweet with a screenshot) struggled in my brief test of the same prompt. (Google’s Bard now provides a sharing interface for generative search responses; see the interaction here.)

    Aside on custom instructions: I picked it back up because I wanted to try “custom instructions”. I haven’t looked much at the new custom instructions in OpenAI’s ChatGPT, introduced in late July (something similar, the ability to edit an AI Profile, was released in Perplexity AI in early June; Phind also has an “Answer Profile”), but a recent tweet from Jeremy Howard pushed me to try this old problem again. He’d recently shared his custom instructions and then later showed how ChatGPT’s performance was much better for a few questions than the performance recently elicited and critiqued in a preprint.

    The custom instructions from Jeremy Howard that I tried:

    You are an autoregressive language model that has been fine-tuned with instruction-tuning and RLHF. You carefully provide accurate, factual, thoughtful, nuanced answers, and are brilliant at reasoning. If you think there might not be a correct answer, you say so.

    Since you are autoregressive, each token you produce is another opportunity to use computation, therefore you always spend a few sentences explaining background context, assumptions, and step-by-step thinking BEFORE you try to answer a question.

    Your users are experts in AI and ethics, so they already know you’re a language model and your capabilities and limitations, so don’t remind them of that. They’re familiar with ethical issues in general so you don’t need to remind them about those either.

    Don’t be verbose in your answers, but do provide details and examples where it might help the explanation. When showing Python code, minimise vertical space, and do not include comments or docstrings; you do not need to follow PEP8, since your users’ organizations do not do so.

    The performance on the challenge with the custom instructions (link to conversation with OpenAI’s ChatGPT) was not much better than I’d previously found. I tried the raw challenge with the prompt only one time. (Note that OpenAI provides some disclosure about the use of custom instructions: “This conversation may reflect the link creator’s Custom Instructions, which aren’t shared and can meaningfully change how the model responds.”)

    Your custom instructions, just like your prompts, can sometimes limit the effective performance of your use of a generative AI tool.

    (I tried Google’s Bard w/ the custom instructions, manually inserted, and the results were not good (I only looked at the top draft): “…Poke the toothpick through the bottom of the bowl of pudding and the top of the glass of water…”)
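
    For tools without a custom-instructions setting, "manually inserted" can be as simple as prepending the instructions to the prompt. The sketch below builds both variants so they can be pasted in side by side. The function name `build_variants` is hypothetical, and plain prepending is an assumption: how ChatGPT itself applies custom instructions is not described here and may differ.

```python
from typing import Dict

# First sentence of the custom instructions quoted above, shortened for illustration.
CUSTOM_INSTRUCTIONS = (
    "You are an autoregressive language model that has been fine-tuned "
    "with instruction-tuning and RLHF."
)

CHALLENGE = (
    "Here we have a toothpick, a bowl of pudding, a full glass of water, "
    "and a marshmallow. Please tell me how to stack them onto each other "
    "in a stable manner."
)


def build_variants(custom_instructions: str, prompt: str) -> Dict[str, str]:
    """Return the raw prompt and a copy with the instructions prepended."""
    return {
        "raw": prompt,
        # Assumption: a blank line between instructions and prompt is enough separation.
        "with_instructions": f"{custom_instructions.strip()}\n\n{prompt}",
    }


if __name__ == "__main__":
    for name, text in build_variants(CUSTOM_INSTRUCTIONS, CHALLENGE).items():
        print(f"--- {name} ---")
        print(text)
```

    Keeping the raw prompt alongside the modified one makes it easier to notice when the instructions are what is limiting the response.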

    Aside on novelty

    I recognize that perhaps it is no longer a novel problem for the system. I tried dozens of prompts back in May in the ChatGPT interface, including critiquing responses. Curiously, I found a reference to the same problem, credited to Mitchell, in an essay from Douglas Hofstadter: “Is there an “I” in AI?” (This is the only reference to the essay I could find? Hosted at Berryville Institute of Machine Learning.)

    In response to an early draft of this essay, Melanie did something very clever. She gave the latest incarnation of ChatGPT, which uses GPT-4, this prompt: “Here we have a toothpick, a bowl of pudding, a full glass of water, and a marshmallow. Please tell me how to stack them onto each other in a stable manner.” The system’s response was deeply revelatory and absolutely hilarious. ChatGPT-4 said that the bowl of pudding should be used as the base, then the toothpick should be stuck (vertically) into the pudding. (It added the proviso that the pudding should be thick.) Then it said that you should “balance” (its word) the marshmallow on top of the toothpick. (It allowed the possibility that to do so, you might “make a small hole in the bottom of the marshmallow” for the toothpick to fit into.) Finally, it said that you should “carefully balance” the full glass of water on top of the marshmallow (which itself was balanced on top of the toothpick stuck into the pudding). An unlikely story, to say the least.

    (I should also note the distinction between ChatGPT’s use of GPT-4 and the GPT-4 used in the paper Mitchell is critiquing.)

    Addendum 2023-08-10 13:29:30 -0700

    As I went to share this post on Twitter (shared below) I saw that Gary Marcus tweeted at length in response to Howard’s tweet, discussing “what is wrong with the culture of AI today”. I’m not particularly interested in his larger discussion as I am not looking for evidence of “understanding” or “reasoning”. I’m interested in how people actually use these tools. So I only want to comment on two bits of his response.

    1. “The fact that RLHF and possibly other undisclosed components of GPT-4 appear to be regularly updated is not discussed.” This is what I was after in my Aside on novelty, above. Perhaps this should be discussed more often.

    2. “We need science, not anecdotal data.” This reminded me of a frustration I’ve had with conversations about how to evaluate the performance of these systems and their usefulness for different downstream tasks. Again, I’m not focused on some of the abstractions that Marcus and others are. I want to know if users are hurting themselves or others as they imagine, talk about, and use the tools. I want to know how we can know that. I want to know how we can shift the design and use of the tools to encourage particular downstream effects. Someone considering picking up or prohibiting a particular tool will have very different considerations based on their specific situation. We need situated observations of tool use, not just system transparency or systematic evaluations.

    This tweet from Marcus, one other tweet from Dan Roy, and the moment of sharing itself, prompted further reflections and shaped how I prefaced this post:

    @danielsgriffin via Twitter on Aug 10, 2023

    As I share this I’m reflecting on how search results & results-of-search are different.

    There is a difference between atomic text (even “systematic”) & how various people might take it up.

    Even laughably wrong results may still aid a user. Let’s see?