- Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers
  Paper • 2506.23918 • Published • 89
- LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale
  Paper • 2504.16030 • Published • 36
- Time Blindness: Why Video-Language Models Can't See What Humans Can?
  Paper • 2505.24867 • Published • 80
- GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
  Paper • 2507.01006 • Published • 240
Collections including paper arxiv:2508.10104
- End-to-End Vision Tokenizer Tuning
  Paper • 2505.10562 • Published • 22
- Global and Local Entailment Learning for Natural World Imagery
  Paper • 2506.21476 • Published • 1
- DINOv3
  Paper • 2508.10104 • Published • 285
- Reasoning Vectors: Transferring Chain-of-Thought Capabilities via Task Arithmetic
  Paper • 2509.01363 • Published • 58

- ReZero: Enhancing LLM search ability by trying one-more-time
  Paper • 2504.11001 • Published • 16
- FonTS: Text Rendering with Typography and Style Controls
  Paper • 2412.00136 • Published • 1
- GenEx: Generating an Explorable World
  Paper • 2412.09624 • Published • 97
- Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
  Paper • 2412.13663 • Published • 158

- SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
  Paper • 2502.02737 • Published • 249
- A Survey of Context Engineering for Large Language Models
  Paper • 2507.13334 • Published • 259
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
  Paper • 2501.12948 • Published • 429
- MiniMax-01: Scaling Foundation Models with Lightning Attention
  Paper • 2501.08313 • Published • 301

- FAN: Fourier Analysis Networks
  Paper • 2410.02675 • Published • 28
- Tensor Product Attention Is All You Need
  Paper • 2501.06425 • Published • 90
- Scalable-Softmax Is Superior for Attention
  Paper • 2501.19399 • Published • 22
- EQ-VAE: Equivariance Regularized Latent Space for Improved Generative Image Modeling
  Paper • 2502.09509 • Published • 8

- Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models
  Paper • 2506.05176 • Published • 74
- Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning
  Paper • 2506.04207 • Published • 48
- MiMo-VL Technical Report
  Paper • 2506.03569 • Published • 80
- UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
  Paper • 2506.03147 • Published • 58

- microsoft/bitnet-b1.58-2B-4T
  Text Generation • 0.8B • Updated • 7.7k • 1.22k
- M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models
  Paper • 2504.10449 • Published • 15
- nvidia/Llama-3.1-Nemotron-8B-UltraLong-2M-Instruct
  Text Generation • 8B • Updated • 140 • 15
- ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
  Paper • 2504.11536 • Published • 63

- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
  Paper • 2408.03314 • Published • 63
- TAG: A Decentralized Framework for Multi-Agent Hierarchical Reinforcement Learning
  Paper • 2502.15425 • Published • 9
- EgoLife: Towards Egocentric Life Assistant
  Paper • 2503.03803 • Published • 46
- Visual-RFT: Visual Reinforcement Fine-Tuning
  Paper • 2503.01785 • Published • 85

- Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs
  Paper • 2501.18585 • Published • 61
- RWKV-7 "Goose" with Expressive Dynamic State Evolution
  Paper • 2503.14456 • Published • 153
- DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning
  Paper • 2503.15265 • Published • 46
- Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
  Paper • 2503.15558 • Published • 50

- What matters when building vision-language models?
  Paper • 2405.02246 • Published • 103
- An Introduction to Vision-Language Modeling
  Paper • 2405.17247 • Published • 90
- DeMamba: AI-Generated Video Detection on Million-Scale GenVideo Benchmark
  Paper • 2405.19707 • Published • 8
- Scaling Up Your Kernels: Large Kernel Design in ConvNets towards Universal Representations
  Paper • 2410.08049 • Published • 8