Prompts are not Lipschitz. There are no “small” changes to prompts. Seemingly minor tweaks can yield shocking jolts in model behavior. Any change in a prompt-based method requires a complete rerun of evaluation, both automatic and human. For now, this is the way.
Hora, A. (2021, May). Googling for software development: What developers search for and what they find. 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR). https://doi.org/10.1109/msr52588.2021.00044 [hora2021googling]
Lurie, E., & Mulligan, D. K. (2021). Searching for representation: A sociotechnical audit of googling for members of U.S. Congress. https://arxiv.org/abs/2109.07012 [lurie2021searching_facctrec]
Trielli, D., & Diakopoulos, N. (2018). Defining the role of user input bias in personalized platforms. Paper presented at the Algorithmic Personalization and News (APEN18) workshop at the International AAAI Conference on Web and Social Media (ICWSM). https://www.academia.edu/37432632/Defining_the_Role_of_User_Input_Bias_in_Personalized_Platforms [trielli2018defining]
Tripodi, F. (2018). Searching for alternative facts: Analyzing scriptural inference in conservative news practices. Data & Society. https://datasociety.net/output/searching-for-alternative-facts/ [tripodi2018searching]