“Converting HTML to text” in Liu et al. (2023)

Liu et al. (2023)

Companion repo for “Evaluating Verifiability in Generative Search Engines”.
Converting HTML to text

To extract text from HTML pages, we first used single-filez to download cited webpages and their associated assets (e.g., CSS and images). Then, we used the Chrome DOM Distiller to extract the “readable” portion of the page (this is the view that appears when you use “Reader Mode” in the Chrome browser). Finally, we used Trafilatura to extract the text from the DOM-distilled HTML.
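
The final step might look roughly like the sketch below. It is only an illustration, not the authors' code: it assumes the DOM-distilled HTML is already available as a string, the function names `distilled_html_to_text` and `fallback_text` are hypothetical, and the standard-library fallback (used only when Trafilatura is not installed) is a crude stand-in added here, not part of the pipeline described above. The download and DOM Distiller steps are not shown.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Crude stdlib stand-in: collects text found outside <script>/<style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not script or style content.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def fallback_text(html: str) -> str:
    """Hypothetical helper: join visible text chunks, one per line."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def distilled_html_to_text(html: str) -> str:
    """Hypothetical wrapper: extract text from (already distilled) HTML.

    Uses trafilatura.extract when the library is installed; otherwise
    falls back to the crude stdlib collector above (an assumption of
    this sketch, not something the source describes).
    """
    try:
        import trafilatura
    except ImportError:
        return fallback_text(html)
    text = trafilatura.extract(html)  # returns None if nothing is extracted
    return text if text is not None else ""


if __name__ == "__main__":
    sample = "<html><body><h1>Title</h1><p>Some readable text.</p></body></html>"
    print(distilled_html_to_text(sample))
```

With Trafilatura installed, the output depends on its own extraction heuristics, so very short pages may yield little or no text.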

Converted to markdown with euangoddard.github.io/clipboard2markdown/.

References

Liu, N. F., Zhang, T., & Liang, P. (2023). Evaluating verifiability in generative search engines. https://doi.org/10.48550/arXiv.2304.09848 [liu2023evaluating]