kaizuberbuehler's Collections: Datasets
Getting it Right: Improving Spatial Consistency in Text-to-Image Models
Paper
• 2404.01197
• Published
• 31
CosmicMan: A Text-to-Image Foundation Model for Humans
Paper
• 2404.01294
• Published
• 17
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus
Paper
• 2406.08707
• Published
• 17
DataComp-LM: In search of the next generation of training sets for language models
Paper
• 2406.11794
• Published
• 55
XLand-100B: A Large-Scale Multi-Task Dataset for In-Context Reinforcement Learning
Paper
• 2406.08973
• Published
• 89
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
Paper
• 2406.08418
• Published
• 32
GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices
Paper
• 2406.08451
• Published
• 26
argilla/magpie-ultra-v0.1
Viewer
• Updated
• 50k • 320
• 221
Viewer
• Updated
• 52.5B • 163k
• 2.68k
Viewer
• Updated
• 61.6M • 70.9k
• 1.14k
Viewer
• Updated
• 31.1M • 12.7k
• 674
Viewer
• Updated
• 546M • 20.8k
• 960
Viewer
• Updated
• 1M • 7.52k
• 797
Viewer
• Updated
• 2.14M • 75.5k
• 906
Viewer
• Updated
• 55.1k • 72
• 96
HuggingFaceFW/fineweb-edu
Viewer
• Updated
• 3.5B • 223k
• 965
Viewer
• Updated
• 1.75M • 125
• 104
Viewer
• Updated
• 100k • 6.25k
• 262
InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning
Paper
• 2409.12568
• Published
• 50
RedPajama: an Open Dataset for Training Large Language Models
Paper
• 2411.12372
• Published
• 57
BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions
Paper
• 2411.07461
• Published
• 23
OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models
Paper
• 2411.04905
• Published
• 127
URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics
Paper
• 2501.04686
• Published
• 53
Viewer
• Updated
• 450k • 10.7k
• 714
WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training
Paper
• 2501.18511
• Published
• 20
MAGA: MAssive Genre-Audience Reformulation to Pretraining Corpus Expansion
Paper
• 2502.04235
• Published
• 23
Hephaestus: Improving Fundamental Agent Capabilities of Large Language Models through Continual Pre-Training
Paper
• 2502.06589
• Published
• 21
CoSER: Coordinating LLM-Based Persona Simulation of Established Roles
Paper
• 2502.09082
• Published
• 30
EgoLife: Towards Egocentric Life Assistant
Paper
• 2503.03803
• Published
• 46
KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding
Paper
• 2503.02951
• Published
• 33
VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search
Paper
• 2503.10582
• Published
• 24
ReFeed: Multi-dimensional Summarization Refinement with Reflective Reasoning on Feedback
Paper
• 2503.21332
• Published
• 23
OpenCodeReasoning: Advancing Data Distillation for Competitive Coding
Paper
• 2504.01943
• Published
• 15
MegaMath: Pushing the Limits of Open Math Corpora
Paper
• 2504.02807
• Published
• 35
CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training
Paper
• 2504.13161
• Published
• 93
DataDecide: How to Predict Best Pretraining Data with Small Experiments
Paper
• 2504.11393
• Published
• 18
LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs
Paper
• 2504.14655
• Published
• 21