“Converting HTML to text” in Liu et al. (2023)

    December 4th, 2023
    Liu et al. (2023)

    Companion repo for “Evaluating Verifiability in Generative Search Engines”.

    Converting HTML to text


    To extract text from HTML pages, we first used single-filez to download cited webpages and their associated assets (e.g., CSS and images). Then, we used the Chrome DOM Distiller to extract the “readable” portion of the page (this is the view that appears when you use “Reader Mode” in the Chrome browser). Finally, we used Trafilatura to extract the text from the DOM-distilled HTML.
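
The last two steps can be sketched in Python as follows. This is a minimal illustration, not the authors' actual code: it assumes the page has already been downloaded and loaded as an HTML string, and it treats the DOM Distiller step as a caller-supplied placeholder (`distill`), since that tool runs outside Python. The names `html_to_text`, `distill`, and `trafilatura_extract` are hypothetical; only `trafilatura.extract` is a real library call.

```python
from typing import Callable


def trafilatura_extract(html: str) -> str:
    """Final step: extract plain text from (already distilled) HTML."""
    # Third-party dependency, imported lazily: pip install trafilatura
    import trafilatura

    # trafilatura.extract returns None when it finds no usable content.
    text = trafilatura.extract(html)
    return text or ""


def html_to_text(
    html: str,
    distill: Callable[[str], str],
    extract: Callable[[str], str] = trafilatura_extract,
) -> str:
    """Run the distill -> extract steps on one saved page.

    `distill` is a placeholder (an assumption of this sketch) for whatever
    wrapper you use around the Chrome DOM Distiller; it should take the raw
    page HTML and return the "readable" HTML.
    """
    readable_html = distill(html)
    return extract(readable_html).strip()
```

With a real distiller wrapper in hand, usage would look like `html_to_text(page_html, distill=my_distiller)`. Passing the steps in as functions keeps the sketch independent of how the distiller is actually invoked.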

    Converted to markdown with euangoddard.github.io/clipboard2markdown/.

    References

    Liu, N. F., Zhang, T., & Liang, P. (2023). Evaluating verifiability in generative search engines. https://doi.org/10.48550/arXiv.2304.09848 [liu2023evaluating]