When we scrutinize our students’ and colleagues’ research work to catch errors, offer clarifications, and suggest other ways to improve their work, we are informally conducting author-assistive peer review. Author-assistive review is almost always *scoreless*, as scores serve no purpose even for work being prepared for publication review.
Alas, the social norm of offering author-assistive review only to those close to us, and reviewing most everyone else’s work through publication review, exacerbates the disadvantages faced by underrepresented groups and other outsiders.
[ . . . ]
We can address those unintended harms by making ourselves at least as available for scoreless author-assistive peer review as we are for publication review.
Tags: peer review
Tags: social search request
Tags: end-user-comparison
Anyone got leads on good LLM libraries that can be installed cleanly on Python (on macOS but ideally Linux and Windows too) using “pip install X” from PyPI, without needing a compiler setup?
I’m looking for the quickest and simplest way to call a language model from Python.
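One pip-only route (my sketch, not an answer from the original thread): the `openai` package ships as a pure-Python wheel, so `pip install openai` works without a compiler toolchain on macOS, Linux, and Windows. The snippet assumes an `OPENAI_API_KEY` environment variable and the pre-1.0 `openai` interface.

```python
# pip install openai  -- pure Python wheel, no compiler needed
import os

import openai  # pre-1.0 interface; releases >= 1.0 use a client object instead

openai.api_key = os.environ["OPENAI_API_KEY"]  # assumes the key is set in the environment

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)
print(response["choices"][0]["message"]["content"])
```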
Tags: social search request
Prompt engineering is a relatively new discipline for developing and optimizing prompts to efficiently use language models (LMs) for a wide variety of applications and research topics. Prompt engineering skills help to better understand the capabilities and limitations of large language models (LLMs).
Researchers use prompt engineering to improve the capacity of LLMs on a wide range of common and complex tasks such as question answering and arithmetic reasoning. Developers use prompt engineering to design robust and effective prompting techniques that interface with LLMs and other tools.
Prompt engineering is not just about designing and developing prompts. It encompasses a wide range of skills and techniques that are useful for interacting and developing with LLMs. It’s an important skill to interface, build with, and understand capabilities of LLMs. You can use prompt engineering to improve safety of LLMs and build new capabilities like augmenting LLMs with domain knowledge and external tools.
Motivated by the high interest in developing with LLMs, we have created this new prompt engineering guide that contains all the latest papers, learning guides, models, lectures, references, new LLM capabilities, and tools related to prompt engineering.
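To make “designing and developing prompts” slightly more concrete, here is a minimal few-shot prompt assembled in Python. It is only an illustration: the sentiment examples are made up, and `call_llm` is a hypothetical stand-in for whatever model interface you use.

```python
# Minimal few-shot prompt: show the model the task format with worked examples,
# then append the new input. The examples below are invented for illustration.
EXAMPLES = [
    ("The battery died after an hour.", "negative"),
    ("Setup took two minutes and it just worked.", "positive"),
]

def build_prompt(new_review: str) -> str:
    lines = ["Classify each review as positive or negative.", ""]
    for text, label in EXAMPLES:
        lines.append(f"Review: {text}\nSentiment: {label}\n")
    lines.append(f"Review: {new_review}\nSentiment:")
    return "\n".join(lines)

prompt = build_prompt("The manual is confusing but support was helpful.")
print(prompt)
# call_llm(prompt) would go here; call_llm is a placeholder for an actual model call.
```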
Human communication will almost always go astray unless real energy is expended.
Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2023). ReAct: Synergizing reasoning and acting in language models. http://arxiv.org/abs/2210.03629
Tags: prompt engineering
github.com/explodinggradients/ragas:
Ragas measures your pipeline’s performance against different dimensions
Faithfulness: measures the information consistency of the generated answer against the given context. Any claims made in the answer that cannot be deduced from the context are penalized.
Context Relevancy: measures how relevant retrieved contexts are to the question. Ideally, the context should only contain information necessary to answer the question. The presence of redundant information in the context is penalized.
Context Recall: measures the recall of the retrieved context, using the annotated answer as ground truth. The annotated answer is taken as a proxy for the ground-truth context.
Answer Relevancy: refers to the degree to which a response directly addresses and is appropriate for a given question or context. This does not take the factuality of the answer into consideration but rather penalizes the presence of redundant information or incomplete answers given a question.
Aspect Critiques: Designed to judge the submission against defined aspects like harmlessness, correctness, etc. You can also define your own aspect and validate the submission against your desired aspect. The output of aspect critiques is always binary.
ragas is mentioned in CSUIRx & in LLM frameworks
HT: Aaron Tay
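For a sense of how these metrics get invoked, here is a minimal sketch based on my reading of the README at the time (the column names and imports may have changed since): ragas evaluates a Hugging Face `Dataset` whose rows hold the question, the retrieved contexts, the generated answer, and an annotated ground truth.

```python
# Sketch of a ragas evaluation run, assuming the 2023-era API described in the README.
# The dataset schema and metric imports reflect my reading of the docs at the time.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_recall, faithfulness

eval_data = Dataset.from_dict({
    "question": ["What does RAG stand for?"],
    "contexts": [["RAG stands for retrieval-augmented generation."]],
    "answer": ["Retrieval-augmented generation."],
    "ground_truths": [["RAG stands for retrieval-augmented generation."]],
})

result = evaluate(eval_data, metrics=[faithfulness, answer_relevancy, context_recall])
print(result)  # per-metric scores for the pipeline's outputs
```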
I looked back at my comments on the OWASP. One concern I had there was:
“inadequate informing” (wc?), where the information generated is accurate but inadequate given the situation-and-user.
It doesn’t seem that these metrics directly engage with that, though aspect critiques could include it. I think this concern plays more into what the ‘ground truth context’ is and how flexible these pipelines are for wildly different users asking the same strings of questions but hoping for and needing different responses. Perhaps I’m pondering something more like old-fashioned user relevance, which may be newly salient with generated responses.
Tags: RAG
Search Smart suggests the best databases for your purpose based on a comprehensive comparison of most of the popular English academic databases. Search Smart tests the critical functionalities databases offer. Thereby, we uncover the capabilities and limitations of search systems that are not reported anywhere else. Search Smart aims to provide the best – i.e., most accurate, up-to-date, and comprehensive – information possible on search systems’ functionalities.
Researchers use Search Smart as a decision tool to select the system/database that fits best.
Librarians use Search Smart for giving search advice and for procurement decisions.
Search providers use Search Smart for benchmarking and improvement of their offerings.
We defined a generic testing procedure that works across a diverse set of academic search systems - all with distinct coverages, functionalities, and features. Thus, while other testing methods are available, we chose the best common denominator across a heterogeneous landscape of databases. This way, we can test substantially more databases than existing database overviews do.
We test the specific functionalities search systems have or claim to have. Here we follow a routine called “metamorphic testing”, a way of testing hard-to-test systems such as artificial intelligence or databases. A group of researchers titled their 2020 IEEE article “Metamorphic Testing: Testing the Untestable”. Using this logic, we can test databases and systems that do not provide direct access to their internals.
Metamorphic testing is always done from the perspective of the user. It investigates how well a system performs, not at some theoretical level, but in practice - how well can the user search with a system? Do the results add up? What are the limitations of certain functionalities?
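To make the metamorphic-testing idea concrete, here is a toy sketch of my own (not Search Smart’s actual procedure): one classic metamorphic relation for search is that narrowing a query with an extra AND term should never surface results the broader query missed, and that relation can be checked entirely from the outside, with no access to the system’s internals. The tiny in-memory “search system” below is a placeholder.

```python
# Toy metamorphic test for a search interface (my illustration, not Search Smart's code).
# Metamorphic relation: results for "X AND Y" should be a subset of the results for "X".

CORPUS = {
    "d1": "systematic review of search systems",
    "d2": "metamorphic testing of databases",
    "d3": "search systems and metamorphic testing",
}

def run_search(query: str) -> set[str]:
    """Stand-in for a real database query: every AND-ed term must appear in the record."""
    terms = [t.lower() for t in query.split(" AND ")]
    return {doc_id for doc_id, text in CORPUS.items() if all(t in text for t in terms)}

def satisfies_and_relation(base_query: str, extra_term: str) -> bool:
    broad = run_search(base_query)
    narrow = run_search(f"{base_query} AND {extra_term}")
    return narrow <= broad  # the narrowed result set must be contained in the broad one

print(satisfies_and_relation("search", "metamorphic"))  # True for a well-behaved system
```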
Goldenfein, J., & Griffin, D. (2022). Google scholar – platforming the scholarly economy. Internet Policy Review, 11(3), 117. https://doi.org/10.14763/2022.3.1671
Gusenbauer, M., & Haddaway, N. R. (2019). Which academic search systems are suitable for systematic reviews or meta-analyses? Evaluating retrieval qualities of Google Scholar, PubMed and 26 other resources. Research Synthesis Methods. https://doi.org/10.1002/jrsm.1378
Segura, S., Towey, D., Zhou, Z. Q., & Chen, T. Y. (2020). Metamorphic testing: Testing the untestable. IEEE Software, 37(3), 46–53. https://doi.org/10.1109/MS.2018.2875968
Arawjo, I., Vaithilingam, P., Swoopes, C., Wattenberg, M., & Glassman, E. (2023). ChainForge. https://www.chainforge.ai/.
Guendelman, S., Pleasants, E., Cheshire, C., & Kong, A. (2022). Exploring Google searches for out-of-clinic medication abortion in the United States during 2020: Infodemiology approach using multiple samples. JMIR Infodemiology, 2(1), e33184. https://doi.org/10.2196/33184
Lurie, E., & Mulligan, D. K. (2021). Searching for representation: A sociotechnical audit of googling for members of U.S. Congress. https://arxiv.org/abs/2109.07012
Mejova, Y., Gracyk, T., & Robertson, R. (2022). Googling for abortion: Search engine mediation of abortion accessibility in the United States. JQD, 2. https://doi.org/10.51685/jqd.2022.007
Mustafaraj, E., Lurie, E., & Devine, C. (2020). The case for voter-centered audits of search engines during political elections. FAT* ’20.
Noble, S. U. (2018). Algorithms of oppression: How search engines reinforce racism. New York University Press. https://nyupress.org/9781479837243/algorithms-of-oppression/
Sundin, O., Lewandowski, D., & Haider, J. (2021). Whose relevance? Web search engines as multisided relevance machines. Journal of the Association for Information Science and Technology. https://doi.org/10.1002/asi.24570
Urman, A., & Makhortykh, M. (2022). “Foreign beauties want to meet you”: The sexualization of women in Google’s organic and sponsored text search results. New Media & Society, 0(0), 14614448221099536. https://doi.org/10.1177/14614448221099536
Urman, A., Makhortykh, M., & Ulloa, R. (2022). Auditing the representation of migrants in image web search results. Humanit Soc Sci Commun, 9(1), 5. https://doi.org/10.1057/s41599-022-01144-1
Urman, A., Makhortykh, M., Ulloa, R., & Kulshrestha, J. (2022). Where the earth is flat and 9/11 is an inside job: A comparative algorithm audit of conspiratorial information in web search results. Telematics and Informatics, 72, 101860. https://doi.org/10.1016/j.tele.2022.101860
Zade, H., Wack, M., Zhang, Y., Starbird, K., Calo, R., Young, J., & West, J. D. (2022). Auditing Google’s search headlines as a potential gateway to misleading content. JOTS, 1(4). https://doi.org/10.54501/jots.v1i4.72
Beane, M. (2017). Operating in the shadows: The productive deviance needed to make robotic surgery work [PhD thesis, MIT]. http://hdl.handle.net/1721.1/113956
Note: “Foundation model” is a broader term than large language model (LLM); LLMs are one kind of foundation model.
…as industry-led advances in AI continue to reach new heights, we believe that a vibrant and diverse research ecosystem remains essential to realizing the promise of AI to benefit people and society while mitigating risks. Accelerate Foundation Models Research (AFMR) is a research grant program through which we will make leading foundation models hosted by Microsoft Azure more accessible to the academic research community via Microsoft Azure AI services.
Potential research topics
Align AI systems with human goals and preferences
(e.g., enable robustness, sustainability, transparency, trustfulness, develop evaluation approaches)
- How should we evaluate foundation models?
- How might we mitigate the risks and potential harms of foundation models such as bias, unfairness, manipulation, and misinformation?
- How might we enable continual learning and adaptation, informed by human feedback?
- How might we ensure that the outputs of foundation models are faithful to real-world evidence, experimental findings, and other explicit knowledge?
Advance beneficial applications of AI
(e.g., increase human ingenuity, creativity and productivity, decrease AI digital divide)
- How might we advance the study of the social and environmental impacts of foundation models?
- How might we foster ethical, responsible, and transparent use of foundation models across domains and applications?
- How might we study and address the social and psychological effects of large language models on human behavior, cognition, and emotion?
- How can we develop AI technologies that are inclusive of everyone on the planet?
- How might foundation models be used to enhance the creative process?
Accelerate scientific discovery in the natural and life sciences
(e.g., advanced knowledge discovery, causal understanding, generation of multi-scale multi-modal scientific data)
- How might foundation models accelerate knowledge discovery, hypothesis generation and analysis workflows in natural and life sciences?
- How might foundation models be used to transform scientific data interpretation and experimental data synthesis?
- Which new scientific datasets are needed to train, fine-tune, and evaluate foundation models in natural and life sciences?
- How might foundation models be used to make scientific data more discoverable, interoperable, and reusable?
Hoffmann, A. L. (2021). Terms of inclusion: Data, discourse, violence. New Media & Society, 23(12), 3539–3556. https://doi.org/10.1177/1461444820958725
Tags: CFP
Probably one of the best things I’ve done since ChatGPT/Copilot came out is create a “column” on the right side of my screen for them.
I’ve caught myself having questions that I normally wouldn’t bother Googling, but since the friction is so low, I’ll ask Copilot.
Tags: found-queries
https://techpolicy.press/choosing-our-words-carefully/
This episode features two segments. In the first, Rebecca Rand speaks with Alina Leidinger, a researcher at the Institute for Logic, Language and Computation at the University of Amsterdam, about her research, with coauthor Richard Rogers, into which stereotypes are moderated and under-moderated in search engine autocompletion. In the second segment, Justin Hendrix speaks with Associated Press investigative journalist Garance Burke about a new chapter in the AP Stylebook offering guidance on how to report on artificial intelligence.
HTT: Alina Leidinger (website, Twitter)
The paper in question: Leidinger & Rogers (2023)
abstract:
Warning: This paper contains content that may be offensive or upsetting.
Language technologies that perpetuate stereotypes actively cement social hierarchies. This study enquires into the moderation of stereotypes in autocompletion results by Google, DuckDuckGo and Yahoo! We investigate the moderation of derogatory stereotypes for social groups, examining the content and sentiment of the autocompletions. We thereby demonstrate which categories are highly moderated (i.e., sexual orientation, religious affiliation, political groups and communities or peoples) and which less so (age and gender), both overall and per engine. We found that under-moderated categories contain results with negative sentiment and derogatory stereotypes. We also identify distinctive moderation strategies per engine, with Google and DuckDuckGo moderating greatly and Yahoo! being more permissive. The research has implications for both moderation of stereotypes in commercial autocompletion tools, as well as large language models in NLP, particularly the question of the content deserving of moderation.
Leidinger, A., & Rogers, R. (2023). Which stereotypes are moderated and under-moderated in search engine autocompletion? Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, 1049–1061. https://doi.org/10.1145/3593013.3594062
Tags: to-look-at, search-autocomplete, artificial intelligence
Tags: local-search
I understand. But when it comes to coding, if it’s not true, it most likely won’t work. And what matters is if it works. Only a bad programmer will accept the answer without testing it. You may need a few rounds of prompting to get to the right answer and often it knows how to correct itself. It will also suggest other more efficient approaches.
Kabir, S., Udo-Imeh, D. N., Kou, B., & Zhang, T. (2023). Who answers it better? An in-depth analysis of ChatGPT and Stack Overflow answers to software engineering questions. http://arxiv.org/abs/2308.02312
Widder, D. G., Nafus, D., Dabbish, L., & Herbsleb, J. D. (2022, June). Limits and possibilities for “ethical AI” in open source: A study of deepfakes. Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency. https://davidwidder.me/files/widder-ossdeepfakes-facct22.pdf
Prompts are not Lipschitz. There are no “small” changes to prompts. Seemingly minor tweaks can yield shocking jolts in model behavior. Any change in a prompt-based method requires a complete rerun of evaluation, both automatic and human. For now, this is the way.
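As an illustration of what that implies for evaluation loops (my sketch, not from the quoted post): even a one-word prompt tweak gets the full evaluation run, because its effect cannot be assumed to be small. The model call and scorer below are trivial placeholders for a real LLM call and a real metric.

```python
# "No small changes": every prompt variant is re-scored over the whole evaluation set.
# call_llm() and score_output() are placeholders, not real model or metric code.

def call_llm(prompt: str) -> str:
    return "placeholder output"  # stand-in for an actual model call

def score_output(output: str, reference: str) -> float:
    return float(output.strip() == reference.strip())  # stand-in for a real metric

EVAL_CASES = [
    {"input": "passage one ...", "reference": "placeholder output"},
    {"input": "passage two ...", "reference": "placeholder output"},
]

def evaluate_prompt(prompt: str) -> float:
    scores = [
        score_output(call_llm(f"{prompt}\n\n{case['input']}"), case["reference"])
        for case in EVAL_CASES
    ]
    return sum(scores) / len(scores)

# A seemingly trivial wording tweak still triggers a complete re-run.
for prompt in ("Summarize the passage in one sentence.",
               "Summarize the passage in a single sentence."):
    print(prompt, evaluate_prompt(prompt))
```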
Hora, A. (2021, May). Googling for software development: What developers search for and what they find. 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR). https://doi.org/10.1109/msr52588.2021.00044
Lurie, E., & Mulligan, D. K. (2021). Searching for representation: A sociotechnical audit of googling for members of U.S. Congress. https://arxiv.org/abs/2109.07012
Trielli, D., & Diakopoulos, N. (2018). Defining the role of user input bias in personalized platforms. Paper presented at the Algorithmic Personalization and News (APEN18) workshop at the International AAAI Conference on Web and Social Media (ICWSM). https://www.academia.edu/37432632/Defining_the_Role_of_User_Input_Bias_in_Personalized_Platforms
Tripodi, F. (2018). Searching for alternative facts: Analyzing scriptural inference in conservative news practices. Data & Society. https://datasociety.net/output/searching-for-alternative-facts/
Tags: prompt engineering
Keyword search is dead. Ask full questions in your own words and get the high-relevance results that you actually need.
🔍 Top retrieval, summarization, & grounded generation
😵💫 Eliminates hallucinations
🧑🏽💻 Built for developers
⏩ Set up in 5 mins
vectara.com