Companion repo for “Evaluating Verifiability in Generative Search Engines”.
Converting HTML to text
To extract text from HTML pages, we first used single-filez to download the cited webpages and their associated assets (e.g., CSS and images). Then, we used the Chrome DOM Distiller to extract the “readable” portion of each page (the view that appears when you use “Reader Mode” in the Chrome browser). Finally, we used Trafilatura to extract the text from the DOM-distilled HTML.
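As a rough illustration of the final step only, the sketch below passes an already DOM-distilled HTML string to Trafilatura's `extract` function. It is not the script used for the paper: the function name `distilled_html_to_text`, the helper class `_TextCollector`, and the sample HTML are hypothetical. So that the sketch still runs without Trafilatura installed, it falls back to a crude standard-library tag stripper, which is a stand-in for illustration and not part of the pipeline described above.

```python
from html.parser import HTMLParser

try:
    import trafilatura  # third-party: pip install trafilatura
except ImportError:  # keep the sketch runnable without the dependency
    trafilatura = None


class _TextCollector(HTMLParser):
    """Hypothetical stand-in extractor: keeps text outside <script>/<style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Record non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def distilled_html_to_text(html: str) -> str:
    """Return plain text for an already DOM-distilled HTML string."""
    if trafilatura is not None:
        # trafilatura.extract returns None when it finds no usable content.
        text = trafilatura.extract(html)
        if text:
            return text
    # Fallback (assumption, not the original method): strip tags with stdlib.
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return "\n".join(collector.chunks)


# Hypothetical sample input standing in for DOM-distilled HTML.
sample = (
    "<html><head><title>Example</title><style>p {color: red;}</style></head>"
    "<body><article><p>Generative search engines should cite their sources.</p>"
    "</article></body></html>"
)
print(distilled_html_to_text(sample))
```

The downloading and DOM-distillation steps are left out because they run as separate tools; in this sketch their output is simply assumed to be available as a string.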
Liu, N. F., Zhang, T., & Liang, P. (2023). Evaluating verifiability in generative search engines. https://doi.org/10.48550/arXiv.2304.09848 [liu2023evaluating]