The case for specialized pre-training: ultra-fast foundation models for dedicated tasks (Aug 4, 2024)
Announcing Finance Commons and the Bad Data Toolbox: Pioneering Open Data and Advanced Document Processing (Jul 19, 2024)
Post-OCR-Correction: 1 billion words dataset of automated OCR correction by LLM (Apr 26, 2024)
Releasing Youtube-Commons: a massive open corpus for conversational and multimodal data (Apr 18, 2024)
Releasing Common Corpus: the largest public domain dataset for training LLMs (Mar 20, 2024)