Daily Model Scout Report - 2026-04-14
#12 - opened by msudharsanan
Scouting VLMs released or trending in the last ~7 days that could beat or complement our current best garment classifier (qwen3-vl-8b-sft+grpo @ 0.9131 weighted on the 3,500-sample hard eval set).
All benchmarks referenced below are from the model providers' reports, not our hard eval; any claim of "may beat ours" requires a run through our 9-field JSON pipeline to confirm.
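For reference, a minimal sketch of how a per-field weighted score like 0.9131 gets computed over the 9-field JSON output. The field names and weights below are illustrative placeholders, not our actual config:

```python
# Illustrative field weights for the 9-field garment schema (assumed values,
# not our production configuration).
FIELD_WEIGHTS = {
    "category": 2.0, "color": 1.0, "pattern": 1.5, "material": 1.0,
    "sleeve": 1.0, "closure": 1.5, "neckline": 1.5, "fit": 1.0, "brand": 1.5,
}

def weighted_score(samples):
    """samples: list of (prediction: dict, gold: dict) over the 9 JSON fields.

    Returns earned weight / total possible weight across all samples.
    """
    total_weight = sum(FIELD_WEIGHTS.values()) * len(samples)
    earned = 0.0
    for pred, gold in samples:
        for field, weight in FIELD_WEIGHTS.items():
            if pred.get(field) == gold.get(field):
                earned += weight
    return earned / total_weight
```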
Tier 1 - Benchmark immediately (High)
1. Google Gemma 4 family - released 2026-04-02 (12 days ago)
- Variants: `google/gemma-4-E2B`, `-E4B`, `-26B-A4B` (MoE, 8/128 experts), `-31B` (dense)
- License: Apache 2.0
- Modality: Text + Image (+ audio on E-series), 256K context, variable image token budgets (70-1120), function calling
- Why it may beat ours:
- Brand-new multimodal base we have not touched. Prior Gemma/Granite family members (Granite4-Vision-SFT) scored 0.88 on our 100-sample eval, above our best Qwen3.5-2B SFT. A fresh, stronger base from the same lineage plausibly lifts that further after SFT+GRPO.
- Native function calling makes strict JSON schema adherence cheaper to train, which is our hardest failure mode (closure, neckline, brand).
- MoE: `26B-A4B` activates ~4B params per token, so it fits comfortably on the RTX PRO 6000 98GB and may match 8B-dense inference cost.
- Fit for our stack: E4B (~4.5B effective) and 26B-A4B are the two obvious candidates. E4B replaces our 2B tier; 26B-A4B competes with Qwen3-VL-8B on inference cost while giving more capacity.
- Link: https://huggingface.co/google/gemma-4-E4B · https://huggingface.co/google/gemma-4-26B-A4B-it
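Since strict JSON-schema adherence is the failure mode we care about, here is a minimal sketch of the kind of pre-scoring schema gate this would feed. Field names and allowed values are illustrative assumptions, not our production schema:

```python
import json

# Required fields of the (illustrative) 9-field garment schema.
REQUIRED_FIELDS = {"category", "color", "pattern", "material", "sleeve",
                   "closure", "neckline", "fit", "brand"}
# Closed vocabularies for the two fields most prone to adherence failures
# (assumed values for illustration).
ALLOWED = {
    "closure": {"zipper", "buttons", "none", "drawstring", "unknown"},
    "neckline": {"crew", "v-neck", "collar", "hooded", "unknown"},
}

def schema_failures(raw_output: str) -> list:
    """Return a list of failure reasons; an empty list means the output passes."""
    try:
        obj = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(obj, dict):
        return ["top level is not an object"]
    failures = [f"missing field: {f}" for f in REQUIRED_FIELDS - obj.keys()]
    failures += [f"extra field: {f}" for f in obj.keys() - REQUIRED_FIELDS]
    for field, allowed in ALLOWED.items():
        if field in obj and obj[field] not in allowed:
            failures.append(f"bad value for {field}: {obj[field]!r}")
    return failures
```

A model with native function calling should hit the empty-failure-list path far more often with less training effort, which is the whole appeal here.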
2. Tencent Penguin-VL-8B / 2B - released 2026-03 (trending this week)
- Sizes: 2B, 8B (+ `Penguin-Encoder` standalone)
- License: Apache 2.0
- Architecture: novel; the vision encoder is initialized from a text LLM (Qwen3-0.6B) with bidirectional attention + 2D-RoPE, replacing CLIP/SigLIP. The language backbone is Qwen3-8B.
- Why it may beat ours: Author-reported head-to-head vs Qwen3-VL-8B shows Penguin-VL ahead on most image + reasoning benchmarks:
- InfoVQA 86.8 vs 83.1, ChartQA 90.5 vs 89.6, AI2D 86.1 vs 85.7, RealWorldQA 75.8 vs 71.5, MathVista 77.4 vs 77.2, NextQA 85.4 vs 82.3
- Qwen3-VL wins on OCRBench (896 vs 852) and DocVQA (tied). Our hard samples are garment-attribute reasoning, not OCR-heavy, so the Penguin trade actually favors us.
- Fit for our stack: Drop-in 8B replacement for the SFT+GRPO pipeline; same backbone lineage (Qwen3-8B) means our existing LoRA recipe should port with minimal tuning.
- Link: https://huggingface.co/tencent/Penguin-VL-8B · https://huggingface.co/tencent/Penguin-VL-2B
Tier 2 - Worth watching (Medium)
3. Qwen3-VL FP8 variants - updated 2025-11-26
- Full-collection FP8 pass dropped for 4B / 8B / 30B-A3B / 32B / 235B Instruct and Thinking variants.
- Why watch: Our NVFP4 quant of `qwen3-vl-8b-sft+grpo` loses ~1.9 points (0.9131 → 0.8945). Official FP8 weights from Qwen may close that gap with less calibration work than NVFP4 required.
- Action: Quick A/B of `Qwen3-VL-8B-Instruct-FP8` vs our NVFP4 on the same hard eval slice.
- Link: https://huggingface.co/collections/Qwen/qwen3-vl
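To make the FP8-vs-NVFP4 A/B more than an eyeball comparison, a paired bootstrap over per-sample scores is cheap. This is a sketch, not an existing harness; it assumes you have per-sample weighted scores from each artifact, aligned by sample id:

```python
import random

def paired_delta_ci(scores_a, scores_b, n_boot=2000, seed=0):
    """Mean score delta (b - a) with a 95% bootstrap CI over paired samples.

    scores_a, scores_b: per-sample scores from the two quant artifacts,
    in the same sample order.
    """
    assert len(scores_a) == len(scores_b)
    deltas = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(deltas)
    rng = random.Random(seed)
    # Resample the paired deltas with replacement and collect bootstrap means.
    boot_means = sorted(
        sum(rng.choice(deltas) for _ in range(n)) / n for _ in range(n_boot)
    )
    mean = sum(deltas) / n
    return mean, (boot_means[int(0.025 * n_boot)], boot_means[int(0.975 * n_boot)])
```

If the CI on (FP8 - NVFP4) excludes zero, the winner is clear; if it straddles zero, the two artifacts are interchangeable and calibration cost decides.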
4. LiquidAI LFM2.5-VL-450M - updated 2025-11-28, trending now
- Architecture: LFM2.5-350M LM + SigLIP2 NaFlex vision encoder; native 512×512 tiling
- License: lfm1.0 (check commercial terms before shipping)
- Why watch: Claims to beat SmolVLM2-500M across MMBench/MMStar/MMMU. Only viable if we ever need an edge/browser-tier model below our 2B floor β today our 2B models at 0.89 would almost certainly outperform, so this is for roadmap, not replacement.
- Link: https://huggingface.co/LiquidAI/LFM2.5-VL-450M
5. Qwen3-VL-Embedding-8B - recent addition to the Qwen3-VL collection
- Why watch: Not a generator, but could power garment retrieval / nearest-neighbor de-duplication / hard-sample mining for the next training round.
- Link: https://huggingface.co/Qwen/Qwen3-VL-Embedding-8B
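As a sketch of the de-duplication use case: brute-force cosine pairs over embedding vectors (assumed here to be plain float lists pulled from the embedding model; function names are ours, not part of any library):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length float vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def near_duplicates(embeddings, threshold=0.95):
    """Return index pairs (i, j), i < j, with cosine similarity >= threshold.

    O(n^2) is fine at our 3,500-sample scale; switch to an ANN index (e.g.
    FAISS) if this ever runs over the full training pool.
    """
    pairs = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```

The same similarity scores, inverted, also give a cheap hard-sample miner: items far from every labeled neighbor are candidates for the next annotation round.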
Tier 3 - Tangentially relevant (Low)
| Model | Why skip |
|---|---|
| tencent/HY-Embodied-0.5 (2026-04-09) | 4B MoT specialized for robotics/VLA. Not a general VLM; garment JSON extraction is out of distribution. |
| LGAI-EXAONE/EXAONE-4.5-33B | Non-commercial license (EXAONE AI Model License 1.2-NC); blocker for Denali production use. |
| baidu/Qianfan-OCR, opendatalab/MinerU2.5-Pro, datalab-to/chandra-ocr-2, echo840/MonkeyOCR-pro-3B | OCR-specialist models. Our hard fields (pattern, closure, neckline, brand) are visual-reasoning, not text-extraction; OCR specialists historically underperform here (see Phi-4-Multimodal at 0.46). |
| Countless Gemma-4-*-CRACK, *-Aggressive, *-abliterated forks | Community uncensoring forks; irrelevant for classification, and they may hurt JSON-schema adherence. |
Recommended next actions
- Spin up a Gemma 4 E4B SFT run on our existing ORR dataset using the proven SFT → GRPO → GTPO recipe. Target: match or beat `qwen3-vl-2b-sft-grpo-v9` (0.8948) at roughly the same active-parameter budget.
- Baseline Penguin-VL-8B zero-shot on the 3,500-sample hard eval before committing any training time. If zero-shot is already ≥ Qwen3-VL-8B-Instruct base (0.78 territory on the 100-sample eval), queue a full SFT+GRPO cycle.
- Swap in official Qwen3-VL-8B-Instruct-FP8 as the quant baseline and compare against our NVFP4 artifact.
- Defer: LFM2.5-VL-450M (sub-2B is not a current deployment target), EXAONE-4.5 (license), HY-Embodied (wrong domain), OCR specialists.
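The Penguin-VL gating rule above, written out so it can sit in the eval harness; the 0.78 anchor is the base-model number quoted in this report, and the function name is ours:

```python
# Zero-shot score of Qwen3-VL-8B-Instruct base on the 100-sample eval,
# as quoted in this report.
QWEN3_VL_8B_BASE = 0.78

def should_queue_sft_grpo(zero_shot_score, baseline=QWEN3_VL_8B_BASE):
    """Commit a full SFT+GRPO cycle only if zero-shot already matches the base."""
    return zero_shot_score >= baseline
```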
Generated by the hf-model-scout skill. Eval anchor: qwen3-vl-8b-sft+grpo _overall.weighted_score = 0.9131 on our 3,500-sample hard eval.