Daily Model Scout Report — 2026-04-28
Window scanned: 2026-04-21 → 2026-04-28 (last 7 days), image-text-to-text filter on the HF model index plus targeted searches across Qwen, InternVL, Florence, PaliGemma, SmolVLM/Idefics, LLaVA, Phi-Vision, Molmo/Moondream, MiniCPM-V, DeepSeek-VL, GLM-4V, Step-VL, Youtu-VL, Cosmos-Reason, DINOv3, fashion/garment-tagged repos, and trending-by-likes sweeps.
Current production baseline (3,500-sample hard eval, weighted score)
| Rank | Model | Weighted Score |
|---|---|---|
| 1 | qwen3-vl-8b-sft+grpo | 0.9131 |
| 2 | qwen3-vl-2b-sft-grpo-v9 | 0.8948 |
| 3 | qwen3-vl-8b-sft-grpo-nvfp4 | 0.8945 |
| 4 | qwen3-vl-8b-instruct-base | 0.8751 |
| 5 | qwen35-2b-base | 0.8437 |
The Qwen3-VL family dominates. SFT+GRPO adds ~+0.038 over the Qwen3-VL-8B-Instruct base.
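The quoted lift can be checked directly against the leaderboard; a minimal sanity check using the two scores from the table above:

```python
# Scores copied from the leaderboard table above.
scores = {
    "qwen3-vl-8b-sft+grpo": 0.9131,       # current best (SFT + GRPO)
    "qwen3-vl-8b-instruct-base": 0.8751,  # same model, no post-training
}

# Lift attributable to the SFT + GRPO recipe on the 8B base.
delta = scores["qwen3-vl-8b-sft+grpo"] - scores["qwen3-vl-8b-instruct-base"]
print(f"SFT+GRPO lift: +{delta:.4f}")  # +0.0380
```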
High relevance — benchmark immediately
1. Qwen/Qwen3.6-27B (and Qwen/Qwen3.6-27B-FP8)
- Released: 2026-04-21 — top of HF trending (951 likes on the base repo / 158 on the FP8 repo, 1.25M combined downloads in 7 days)
- Architecture: Dense 27B VLM, 64 layers mixed Gated DeltaNet + Gated Attention, native 262K context, integrated vision encoder
- License: Apache 2.0 (matches our current usage)
- Reported VLM benchmarks: MMMU 82.9, MMMU-Pro 75.8, RealWorldQA 84.1, OCRBench 89.4, CC-OCR 81.2, RefCOCO 92.5, CountBench 97.8
- Why this could beat our 0.9131 baseline:
- Same family as our best model (Qwen3-VL-8B), so our SFT + GRPO + GTPO recipe should drop in unchanged.
- 3.4× the active parameters of our current best — likely meaningful headroom on the fields we still struggle on (the Qwen3-VL-8B Instruct base is already 0.8751; a much larger same-family base should lift the ceiling).
- OCRBench 89.4 is directly relevant to our weakest field — brand recognition (~70% on Qwen3-VL-8B SFT+GRPO at the 100-sample eval) is largely a logo/text-on-garment OCR problem.
- Hardware fit: ~54 GB BF16 / ~27 GB FP8 — fits comfortably on the RTX PRO 6000 98 GB. FP8 variant is published officially; near-identical metrics per the card.
- Caveat: Inference will be 2–3× slower than 8B. Worth pairing FP8 / NVFP4 quant with the SFT+GRPO recipe from day one.
- Action: Run the full SFT → eval-on-3.5k pipeline against Qwen3.6-27B-FP8 as a base. Target: beat 0.9131 weighted overall.
- Links: Qwen3.6-27B · Qwen3.6-27B-FP8
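The hardware-fit figures above are weights-only estimates (2 bytes/param for BF16, 1 byte/param for FP8; activations, KV cache, and CUDA overhead come on top). A back-of-envelope sketch:

```python
# Weights-only VRAM estimate for a dense model.
# 1e9 params * bytes_per_param bytes, expressed in GB (1e9 bytes).
def weight_vram_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * bytes_per_param

print(weight_vram_gb(27, 2.0))  # BF16 -> 54.0 GB
print(weight_vram_gb(27, 1.0))  # FP8  -> 27.0 GB
```

Both fit within the 98 GB card, with room left for the KV cache at long context.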
Medium relevance — worth watching / probe before committing
2. addpty/Youtu-VL-4B-Instruct
- Released: 2026-04-28 (today, likes/dl still 0 — very fresh)
- Architecture: Tencent Youtu — novel VLUAS (Vision-Language Unified Autoregressive Supervision) paradigm; 4B params built on Youtu-LLM
- Why interesting: The card explicitly lists image classification and fine-grained visual tasks as the design target, with a learned visual codebook giving vision tokens equal autoregressive status. A 4B model purpose-built for classification is precisely the size/task slot between our 2B and 8B variants.
- Concern: Custom `youtu-vl` license — not Apache. Needs a legal/licensing review before any production path.
- Action: Read the technical report; if the license is workable for Denali-AI, run a zero-shot 100-sample probe before committing to a full SFT run.
- Link: Youtu-VL-4B-Instruct
3. keithnull/Qwen3.6-VL-REAP-26B-A3B-GGUF
- Released: 2026-04-27 (GGUF wrapper of atbender/Qwen3.6-VL-REAP-26B-A3B, 2026-04-19)
- Architecture: REAP-pruned Qwen3.6-VL MoE — 27B total / ~3B active per token, 25% of routed experts removed (256 → 192 per layer); vision encoder and all 40 MoE layers preserved.
- Why interesting: Active-param footprint is the same as our 2B sweet spot, but with the larger Qwen3.6-VL knowledge base behind it. Could deliver 27B-class quality at ~3B-class latency.
- Concerns: (a) GGUF format doesn't fit our HF/vLLM training pipeline — we'd want the BF16 / W4A16 base from atbender instead. (b) Quality regression from pruning is unmeasured on classification-style tasks.
- Action: If/when we decide to invest in Qwen3.6-VL, A/B the REAP-pruned base vs the dense 27B on a 100-sample probe before allocating SFT compute.
- Link: Qwen3.6-VL-REAP-26B-A3B
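The pruning figures on the card are internally consistent; a quick check of the 25% claim from the per-layer expert counts:

```python
# REAP expert pruning per MoE layer, per the model card: 256 -> 192.
experts_before, experts_after = 256, 192
pruned_frac = 1 - experts_after / experts_before
print(f"{pruned_frac:.0%} of routed experts removed per layer")  # 25%
```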
Low relevance — note and move on
- vrfai/Cosmos-Reason2-2B-NVFP4 (2026-04-24) — Community NVFP4 quant of NVIDIA's Cosmos-Reason2 2B. Cosmos-Reason is tuned for embodied/spatial physical reasoning — not a strong fit for garment attribute JSON extraction. Skip.
- zenless-lab/vit_*_dinov3.lvd1689m (2026-04-27) — DINOv3 ViT ports in multiple sizes (small/base/large/huge+). Pure vision encoders with no language head — not directly useful as a VLM base, but a candidate vision-tower swap experiment for a future custom architecture.
- Numerous Qwen3.6-27B quants (unsloth GGUF, AWQ-INT4, MLX 4/8-bit, NVFP4, GPTQ-Pro, AutoRound, PrismaQuant, etc.) — all derivatives of the Qwen3.6-27B base above. Worth pulling once we pick a target deployment quant for the 27B; no novel architectural value on their own.
- No new releases this week for: InternVL4, Florence-3, PaliGemma3, SmolVLM3, Idefics4, LLaVA-NeXT, Phi-5-Vision, Molmo2, Moondream3, GLM-4.5V, Hunyuan-VL, Skywork-VL, Apple FastVLM successors, Pixtral updates.
Recommendation
One concrete next experiment: SFT + GRPO Qwen/Qwen3.6-27B-FP8 on the existing apparel-capture-8k-train dataset (now 7,672 rows after the 4-21 capture removal), eval on the 3,500-sample hard set, target > 0.9131. The OCRBench score makes brand recognition the field most likely to move.
If compute is tight, do a zero-shot 100-sample probe of both Qwen3.6-27B-FP8 and Youtu-VL-4B-Instruct first, then allocate a full SFT run only to the winner.
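The probe-then-commit flow above can be sketched as a small scoring harness. Field names, weights, and the `predict` callable are illustrative placeholders, not our actual eval config or inference backend:

```python
from typing import Callable

# Hypothetical per-field weights; the real 3.5k eval uses our weighted rubric.
FIELD_WEIGHTS = {"category": 0.4, "color": 0.3, "brand": 0.3}

def weighted_score(pred: dict, gold: dict) -> float:
    # Exact-match per field, weighted by field importance.
    return sum(w * (pred.get(f) == gold.get(f)) for f, w in FIELD_WEIGHTS.items())

def probe(predict: Callable[[dict], dict], samples: list[dict]) -> float:
    # Mean weighted score over the (e.g. 100-sample) probe set.
    scores = [weighted_score(predict(s), s["gold"]) for s in samples]
    return sum(scores) / len(scores)
```

Run `probe(...)` once per candidate base (zero-shot) and allocate the full SFT run only to the higher-scoring one.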