Jaehyun Jun
btjhjeon
AI & ML interests
Multimodal
Recent Activity
updated a collection 3 days ago: Multimodal Benchmarks
updated a collection 3 days ago: Multimodal Reasoning
updated a collection 3 days ago: Multimodal Reasoning
Organizations
Multimodal Agent
- Gemini Robotics: Bringing AI into the Physical World
  Paper • 2503.20020 • Published • 29
- Magma: A Foundation Model for Multimodal AI Agents
  Paper • 2502.13130 • Published • 58
- LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
  Paper • 2311.05437 • Published • 51
- OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
  Paper • 2410.23218 • Published • 49
Multimodal Reasoning
- InfiR : Crafting Effective Small Language Models and Multimodal Small Language Models in Reasoning
  Paper • 2502.11573 • Published • 9
- Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking
  Paper • 2502.02339 • Published • 22
- video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
  Paper • 2502.11775 • Published • 9
- Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search
  Paper • 2412.18319 • Published • 39
Multimodal Alignment
- MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models
  Paper • 2410.17637 • Published • 36
- Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
  Paper • 2411.10442 • Published • 86
- Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
  Paper • 2411.18203 • Published • 41
- Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
  Paper • 2411.14432 • Published • 25
Multimodal LLM
- DocLLM: A layout-aware generative language model for multimodal document understanding
  Paper • 2401.00908 • Published • 189
- COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training
  Paper • 2401.00849 • Published • 17
- LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
  Paper • 2311.05437 • Published • 51
- LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing
  Paper • 2311.00571 • Published • 43
LLM context length
Multimodal Benchmarks
- Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model
  Paper • 2407.07053 • Published • 47
- LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models
  Paper • 2407.12772 • Published • 35
- VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models
  Paper • 2407.11691 • Published • 15
- MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models
  Paper • 2408.02718 • Published • 62
Code Agent
- CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging
  Paper • 2502.05664 • Published • 24
- AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation
  Paper • 2312.13010 • Published • 6
- HyperAgent: Generalist Software Engineering Agents to Solve Coding Tasks at Scale
  Paper • 2409.16299 • Published • 12
- Vibe Coding vs. Agentic Coding: Fundamentals and Practical Implications of Agentic AI
  Paper • 2505.19443 • Published • 15
Multimodal System
- MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding
  Paper • 2503.13964 • Published • 20
- RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training
  Paper • 2510.06710 • Published • 38
- VIDEOP2R: Video Understanding from Perception to Reasoning
  Paper • 2511.11113 • Published • 111
Multimodal Analysis
- Analyzing The Language of Visual Tokens
  Paper • 2411.05001 • Published • 24
- Large Multi-modal Models Can Interpret Features in Large Multi-modal Models
  Paper • 2411.14982 • Published • 19
- Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration
  Paper • 2411.17686 • Published • 20
- On the Limitations of Vision-Language Models in Understanding Image Transforms
  Paper • 2503.09837 • Published • 10
PEFT
LLM
Multimodal Dataset
- SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers
  Paper • 2407.09413 • Published • 11
- MAVIS: Mathematical Visual Instruction Tuning
  Paper • 2407.08739 • Published • 33
- Kvasir-VQA: A Text-Image Pair GI Tract Dataset
  Paper • 2409.01437 • Published • 71
- MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct
  Paper • 2409.05840 • Published • 49
Code Reasoning
- S*: Test Time Scaling for Code Generation
  Paper • 2502.14382 • Published • 63
- o1-Coder: an o1 Replication for Coding
  Paper • 2412.00154 • Published • 44
- Competitive Programming with Large Reasoning Models
  Paper • 2502.06807 • Published • 68
- SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
  Paper • 2502.18449 • Published • 75