MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources
Paper: arXiv 2509.25531
'an LLM is only as good as the dataset it was trained on' - Sun Tzu