Datasets - a kaizuberbuehler Collection

kaizuberbuehler 's Collections

Reasoning, Thinking, RL and Test-Time Scaling

Vision Language Models

Foundation Models

Synthetic Data and Self-Improvement

LM Prompt Engineering

LM Capabilities and Scaling

LM Architectures

Code Generation

EXL2 Quantized Models

Datasets

updated Sep 26, 2025

Getting it Right: Improving Spatial Consistency in Text-to-Image Models

Paper • 2404.01197 • Published Apr 1, 2024 • 31
CosmicMan: A Text-to-Image Foundation Model for Humans

Paper • 2404.01294 • Published Apr 1, 2024 • 17
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus

Paper • 2406.08707 • Published Jun 13, 2024 • 17
DataComp-LM: In search of the next generation of training sets for language models

Paper • 2406.11794 • Published Jun 17, 2024 • 55
XLand-100B: A Large-Scale Multi-Task Dataset for In-Context Reinforcement Learning

Paper • 2406.08973 • Published Jun 13, 2024 • 89
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

Paper • 2406.08418 • Published Jun 12, 2024 • 32
GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices

Paper • 2406.08451 • Published Jun 12, 2024 • 26
argilla/magpie-ultra-v0.1

Viewer • Updated Nov 26, 2024 • 50k • 320 • 221
HuggingFaceFW/fineweb

Viewer • Updated Jul 11, 2025 • 52.5B • 163k • 2.68k
wikimedia/wikipedia

Viewer • Updated Jan 9, 2024 • 61.6M • 70.9k • 1.14k
HuggingFaceTB/cosmopedia

Viewer • Updated Aug 12, 2024 • 31.1M • 12.7k • 674
bigcode/the-stack

Viewer • Updated Apr 13, 2023 • 546M • 20.8k • 960
teknium/OpenHermes-2.5

Viewer • Updated Apr 15, 2024 • 1M • 7.52k • 797
roneneldan/TinyStories

Viewer • Updated Aug 12, 2024 • 2.14M • 75.5k • 906
Vezora/Open-Critic-GPT

Viewer • Updated Jul 28, 2024 • 55.1k • 72 • 96
HuggingFaceFW/fineweb-edu

Viewer • Updated Jul 11, 2025 • 3.5B • 223k • 965
arcee-ai/The-Tome

Viewer • Updated Aug 15, 2024 • 1.75M • 125 • 104
mlabonne/FineTome-100k

Viewer • Updated Jul 29, 2024 • 100k • 6.25k • 262
InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning

Paper • 2409.12568 • Published Sep 19, 2024 • 50
RedPajama: an Open Dataset for Training Large Language Models

Paper • 2411.12372 • Published Nov 19, 2024 • 57
BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions

Paper • 2411.07461 • Published Nov 12, 2024 • 23
OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models

Paper • 2411.04905 • Published Nov 7, 2024 • 127
URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics

Paper • 2501.04686 • Published Jan 8, 2025 • 53
open-r1/OpenR1-Math-220k

Viewer • Updated Feb 18, 2025 • 450k • 10.7k • 714
WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training

Paper • 2501.18511 • Published Jan 30, 2025 • 20
MAGA: MAssive Genre-Audience Reformulation to Pretraining Corpus Expansion

Paper • 2502.04235 • Published Feb 6, 2025 • 23
Hephaestus: Improving Fundamental Agent Capabilities of Large Language Models through Continual Pre-Training

Paper • 2502.06589 • Published Feb 10, 2025 • 21
CoSER: Coordinating LLM-Based Persona Simulation of Established Roles

Paper • 2502.09082 • Published Feb 13, 2025 • 30
EgoLife: Towards Egocentric Life Assistant

Paper • 2503.03803 • Published Mar 5, 2025 • 46
KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding

Paper • 2503.02951 • Published Mar 4, 2025 • 33
VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search

Paper • 2503.10582 • Published Mar 13, 2025 • 24
ReFeed: Multi-dimensional Summarization Refinement with Reflective Reasoning on Feedback

Paper • 2503.21332 • Published Mar 27, 2025 • 23
OpenCodeReasoning: Advancing Data Distillation for Competitive Coding

Paper • 2504.01943 • Published Apr 2, 2025 • 15
MegaMath: Pushing the Limits of Open Math Corpora

Paper • 2504.02807 • Published Apr 3, 2025 • 35
CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training

Paper • 2504.13161 • Published Apr 17, 2025 • 93
DataDecide: How to Predict Best Pretraining Data with Small Experiments

Paper • 2504.11393 • Published Apr 15, 2025 • 18
LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs

Paper • 2504.14655 • Published Apr 20, 2025 • 21