Daily Model Scout Report — 2026-04-28


Window scanned: 2026-04-21 → 2026-04-28 (last 7 days), image-text-to-text filter on the HF model index plus targeted searches across Qwen, InternVL, Florence, PaliGemma, SmolVLM/Idefics, LLaVA, Phi-Vision, Molmo/Moondream, MiniCPM-V, DeepSeek-VL, GLM-4V, Step-VL, Youtu-VL, Cosmos-Reason, DINOv3, fashion/garment-tagged repos, and trending-by-likes sweeps.

Current production baseline (3,500-sample hard eval, weighted score)

| Rank | Model | Weighted Score |
|------|-------|----------------|
| 1 | qwen3-vl-8b-sft+grpo | 0.9131 |
| 2 | qwen3-vl-2b-sft-grpo-v9 | 0.8948 |
| 3 | qwen3-vl-8b-sft-grpo-nvfp4 | 0.8945 |
| 4 | qwen3-vl-8b-instruct-base | 0.8751 |
| 5 | qwen35-2b-base | 0.8437 |

The Qwen3-VL family dominates. SFT+GRPO adds ~0.038 weighted score over the Qwen3-VL-8B-Instruct base (0.8751 → 0.9131).
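For context, a minimal sketch of how a per-field weighted score like the one above can be combined — the field names and weights here are illustrative placeholders, not the production eval config:

```python
# Minimal weighted-score sketch. Field names and weights are hypothetical
# placeholders; the real eval runs on the internal 3,500-sample hard set.
FIELD_WEIGHTS = {
    "category": 0.30,
    "color":    0.25,
    "material": 0.20,
    "brand":    0.25,  # weakest field per the notes below (~70%)
}

def weighted_score(per_field_accuracy: dict[str, float]) -> float:
    """Combine per-field accuracies into a single score in [0, 1]."""
    total = sum(FIELD_WEIGHTS.values())
    return sum(w * per_field_accuracy[f] for f, w in FIELD_WEIGHTS.items()) / total

# Example with hypothetical per-field accuracies:
print(weighted_score({"category": 0.96, "color": 0.93, "material": 0.90, "brand": 0.70}))
# → 0.8755
```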


High relevance — benchmark immediately

1. Qwen/Qwen3.6-27B (and Qwen/Qwen3.6-27B-FP8)

  • Released: 2026-04-21 — top of HF trending (951 / 158 likes on the base / FP8 repos, 1.25M combined downloads in 7 days)
  • Architecture: Dense 27B VLM, 64 layers mixed Gated DeltaNet + Gated Attention, native 262K context, integrated vision encoder
  • License: Apache 2.0 (matches our current usage)
  • Reported VLM benchmarks: MMMU 82.9, MMMU-Pro 75.8, RealWorldQA 84.1, OCRBench 89.4, CC-OCR 81.2, RefCOCO 92.5, CountBench 97.8
  • Why this could beat our 0.9131 baseline:
    • Same family as our best model (Qwen3-VL-8B), so our SFT + GRPO recipe should drop in unchanged.
    • 3.4× the active parameters of our current best — likely meaningful headroom on the fields we still struggle on (the Qwen3-VL-8B Instruct base is already 0.8751; a much larger same-family base should lift the ceiling).
    • OCRBench 89.4 is directly relevant to our weakest field — brand recognition (~70% on Qwen3-VL-8B SFT+GRPO at the 100-sample eval) is largely a logo/text-on-garment OCR problem.
  • Hardware fit: ~54 GB BF16 / ~27 GB FP8 — fits comfortably on the RTX PRO 6000's 96 GB. FP8 variant is published officially; near-identical metrics per the card.
  • Caveat: Inference will be 2–3× slower than 8B. Worth pairing FP8 / NVFP4 quant with the SFT+GRPO recipe from day one.
  • Action: Run the full SFT → eval-on-3.5k pipeline against Qwen3.6-27B-FP8 as a base (smoke-test sketch after this item). Target: beat 0.9131 weighted overall.
  • Links: Qwen3.6-27B · Qwen3.6-27B-FP8
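The hardware-fit numbers above are straight bytes-per-param arithmetic (27B × 2 B ≈ 54 GB BF16; × 1 B ≈ 27 GB FP8). A minimal vLLM smoke-test sketch for the FP8 checkpoint, assuming the repo serves like earlier Qwen VL releases — the image URL, context length, and memory settings are illustrative:

```python
# Hedged vLLM smoke test for the FP8 checkpoint. Assumes Qwen3.6-27B-FP8
# serves like earlier Qwen VL models; all settings below are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3.6-27B-FP8",   # ~27 GB of weights at 1 byte/param
    max_model_len=32768,            # far below the advertised 262K context
    gpu_memory_utilization=0.90,    # leave headroom on the 96 GB card
)

params = SamplingParams(temperature=0.0, max_tokens=256)
out = llm.chat(
    [{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/garment.jpg"}},  # placeholder image
            {"type": "text", "text": "Return the garment attributes as JSON."},
        ],
    }],
    sampling_params=params,
)
print(out[0].outputs[0].text)
```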

Medium relevance — worth watching / probe before committing

2. addpty/Youtu-VL-4B-Instruct

  • Released: 2026-04-28 (today, likes/dl still 0 — very fresh)
  • Architecture: Tencent Youtu — novel VLUAS (Vision-Language Unified Autoregressive Supervision) paradigm; 4B params built on Youtu-LLM
  • Why interesting: The card explicitly lists image classification and fine-grained visual tasks as the design target, with a learned visual codebook giving vision tokens equal autoregressive status. A 4B model purpose-built for classification is precisely the size/task slot between our 2B and 8B variants.
  • Concern: Custom youtu-vl license — not Apache. Needs a legal/licensing review before any production path.
  • Action: Read the technical report; if the license is workable for Denali-AI, run a zero-shot 100-sample probe (sketch after this item) before committing to a full SFT run.
  • Link: Youtu-VL-4B-Instruct
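A hedged sketch of what that 100-sample probe could look like — the processor call pattern assumes the usual HF remote-code convention and may need adjusting to Youtu-VL's actual API; the prompt, field list, and `eval_samples` slice are placeholders for the internal harness:

```python
# Zero-shot probe sketch. Model ID is from the card above; everything else
# (prompt, fields, eval_samples) is a placeholder for the internal harness.
import json
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "addpty/Youtu-VL-4B-Instruct"
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

PROMPT = "Return the garment attributes as JSON."

@torch.inference_mode()
def predict(image) -> dict:
    inputs = processor(images=image, text=PROMPT, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    text = processor.batch_decode(
        out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )[0]
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return {}  # malformed JSON counts as a miss on every field

def run_probe(eval_samples, fields=("category", "color", "material", "brand")):
    """Exact-match accuracy per field over the first 100 eval samples."""
    hits = {f: 0 for f in fields}
    for sample in eval_samples[:100]:
        pred = predict(sample["image"])
        for f in fields:
            hits[f] += int(pred.get(f) == sample[f])
    return {f: n / 100 for f, n in hits.items()}
```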

3. keithnull/Qwen3.6-VL-REAP-26B-A3B-GGUF

  • Released: 2026-04-27 (GGUF wrapper of atbender/Qwen3.6-VL-REAP-26B-A3B, 2026-04-19)
  • Architecture: REAP-pruned Qwen3.6-VL MoE — 27B total / ~3B active per token, 25% of routed experts removed (256 → 192 per layer); vision encoder + all 40 MoE layers preserved.
  • Why interesting: The active-param footprint (~3B) is close to our 2B sweet spot, but with the larger Qwen3.6-VL knowledge base behind it. Could deliver 27B-class quality at ~3B-class latency.
  • Concerns: (a) GGUF format doesn't fit our HF/vLLM training pipeline — we'd want the BF16 / W4A16 base from atbender instead. (b) Quality regression from pruning is unmeasured on classification-style tasks.
  • Action: If/when we decide to invest in Qwen3.6-VL, A/B the REAP-pruned base vs the dense 27B on a 100-sample probe before allocating SFT compute.
  • Link: Qwen3.6-VL-REAP-26B-A3B

Low relevance — note and move on

  • vrfai/Cosmos-Reason2-2B-NVFP4 (2026-04-24) — Community NVFP4 quant of NVIDIA's Cosmos-Reason2 2B. Cosmos-Reason is tuned for embodied/spatial physical reasoning — not a strong fit for garment attribute JSON extraction. Skip.
  • zenless-lab/vit_*_dinov3.lvd1689m (2026-04-27) — DINOv3 ViT ports in multiple sizes (small/base/large/huge+). Pure vision encoders, no language head — not directly useful as a VLM base, but could be a candidate vision-tower swap experiment for a future custom architecture.
  • Numerous Qwen3.6-27B quants (unsloth GGUF, AWQ-INT4, MLX 4/8-bit, NVFP4, GPTQ-Pro, AutoRound, PrismaQuant, etc.) — all derivatives of the Qwen3.6-27B base above. Worth pulling once we pick a target deployment quant for the 27B; no novel architectural value on their own.
  • No new releases this week for: InternVL4, Florence-3, PaliGemma3, SmolVLM3, Idefics4, LLaVA-NeXT, Phi-5-Vision, Molmo2, Moondream3, GLM-4.5V, Hunyuan-VL, Skywork-VL, Apple FastVLM successors, Pixtral updates.

Recommendation

One concrete next experiment: SFT + GRPO on Qwen/Qwen3.6-27B-FP8 using the existing apparel-capture-8k-train dataset (now 7,672 rows after the 04-21 capture removal); eval on the 3,500-sample hard set, target > 0.9131. The 89.4 OCRBench score makes brand recognition the field most likely to move.

If compute is tight, do a zero-shot 100-sample probe of both Qwen3.6-27B-FP8 and Youtu-VL-4B-Instruct first, then allocate a full SFT run only to the winner.
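If the winner is the 27B, the GRPO stage could look like the minimal TRL sketch below — assuming TRL's GRPOTrainer handles this checkpoint as it does earlier Qwen models and a standard string-prompt dataset format; the reward function, dataset ID, and column names are placeholders for the internal recipe (GRPO on BF16 weights, quantizing to FP8/NVFP4 afterwards):

```python
# Hedged GRPO sketch via TRL. Dataset ID and columns ("prompt", "labels") and
# the reward function are placeholders; train in BF16, quantize afterwards.
import json
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def json_field_reward(completions, **kwargs):
    """Reward = fraction of gold fields reproduced exactly; 0 on broken JSON.
    Extra dataset columns (here a per-row `labels` dict) arrive via kwargs."""
    rewards = []
    for completion, gold in zip(completions, kwargs["labels"]):
        try:
            pred = json.loads(completion)
            rewards.append(sum(pred.get(k) == v for k, v in gold.items()) / len(gold))
        except json.JSONDecodeError:
            rewards.append(0.0)
    return rewards

train = load_dataset("denali/apparel-capture-8k-train", split="train")  # placeholder ID

trainer = GRPOTrainer(
    model="Qwen/Qwen3.6-27B",  # BF16 base for training; FP8 is a serving quant
    reward_funcs=json_field_reward,
    args=GRPOConfig(output_dir="qwen36-27b-sft-grpo", num_generations=8),
    train_dataset=train,
)
trainer.train()
```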
