Craw4LLM: Efficient Web Crawling for LLM Pretraining Paper • 2502.13347 • Published Feb 19, 2025 • 30
Harnessing Webpage UIs for Text-Rich Visual Understanding Paper • 2410.13824 • Published Oct 17, 2024 • 30
MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels Paper • 2405.07526 • Published May 13, 2024 • 21