Collections
Discover the best community collections!

Collections including paper arxiv:2503.02495

- Union of Experts: Adapting Hierarchical Routing to Equivalently Decomposed Transformer
  Paper • 2503.02495 • Published • 9
- Nemotron-Flash: Towards Latency-Optimal Hybrid Small Language Models
  Paper • 2511.18890 • Published • 29
- World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models
  Paper • 2511.22787 • Published • 8

- On Domain-Specific Post-Training for Multimodal Large Language Models
  Paper • 2411.19930 • Published • 29
- START: Self-taught Reasoner with Tools
  Paper • 2503.04625 • Published • 113
- Union of Experts: Adapting Hierarchical Routing to Equivalently Decomposed Transformer
  Paper • 2503.02495 • Published • 9
- Fine-Tuning Small Language Models for Domain-Specific AI: An Edge AI Perspective
  Paper • 2503.01933 • Published • 13

- Tensor Product Attention Is All You Need
  Paper • 2501.06425 • Published • 90
- TransMLA: Multi-head Latent Attention Is All You Need
  Paper • 2502.07864 • Published • 58
- Union of Experts: Adapting Hierarchical Routing to Equivalently Decomposed Transformer
  Paper • 2503.02495 • Published • 9
- BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs
  Paper • 2504.18415 • Published • 47

- CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data
  Paper • 2404.15653 • Published • 29
- MoDE: CLIP Data Experts via Clustering
  Paper • 2404.16030 • Published • 15
- MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning
  Paper • 2405.12130 • Published • 50
- Reducing Transformer Key-Value Cache Size with Cross-Layer Attention
  Paper • 2405.12981 • Published • 33