Daily Model Scout Report - 2026-04-14
#12 - opened by msudharsanan
Scouting VLMs released or trending in the last ~7 days that could beat or complement our current best garment classifier (qwen3-vl-8b-sft+grpo @ 0.9131 weighted on the 3,500-sample hard eval set).
All benchmarks referenced below are from the model providers' reports, not our hard eval; any claim of "may beat ours" requires a run through our 9-field JSON pipeline to confirm.
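For reference, a minimal sketch of how a per-field weighted score like 0.9131 gets computed over the 9-field JSON output. The field names and weights below are illustrative placeholders, not our actual config:

```python
# Illustrative field weights for the 9-field garment schema (assumed values,
# not our production configuration).
FIELD_WEIGHTS = {
    "category": 2.0, "color": 1.0, "pattern": 1.5, "material": 1.0,
    "sleeve": 1.0, "closure": 1.5, "neckline": 1.5, "fit": 1.0, "brand": 1.5,
}

def weighted_score(samples):
    """samples: list of (prediction: dict, gold: dict) over the 9 JSON fields.

    Returns earned weight / total possible weight across all samples.
    """
    total_weight = sum(FIELD_WEIGHTS.values()) * len(samples)
    earned = 0.0
    for pred, gold in samples:
        for field, weight in FIELD_WEIGHTS.items():
            if pred.get(field) == gold.get(field):
                earned += weight
    return earned / total_weight
```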
Tier 1 - Benchmark immediately (High)
1. Google Gemma 4 family - released 2026-04-02 (12 days ago)
- Variants: `google/gemma-4-E2B`, `-E4B`, `-26B-A4B` (MoE, 8/128 experts), `-31B` (dense)
- License: Apache 2.0
- Modality: Text + Image (+ audio on E-series), 256K context, variable image token budgets (70-1120), function calling
- Why it may beat ours:
- Brand-new multimodal base we have not touched. Prior Gemma/Granite family members (Granite4-Vision-SFT) scored 0.88 on our 100-sample eval, above our best Qwen3.5-2B SFT. A fresh, stronger base from the same lineage plausibly lifts that further after SFT+GRPO.
- Native function calling makes strict JSON schema adherence cheaper to train, which is our hardest failure mode (closure, neckline, brand).
- MoE: `26B-A4B` activates ~4B params per token, so it fits comfortably on the RTX PRO 6000 98GB and may match 8B-dense inference cost.
- Fit for our stack: E4B (~4.5B effective) and 26B-A4B are the two obvious candidates. E4B replaces our 2B tier; 26B-A4B competes with Qwen3-VL-8B on inference cost while giving more capacity.
- Link: https://huggingface.co/google/gemma-4-E4B · https://huggingface.co/google/gemma-4-26B-A4B-it
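Since strict JSON-schema adherence is the failure mode we care about, here is a minimal sketch of the kind of pre-scoring schema gate this would feed. Field names and allowed values are illustrative assumptions, not our production schema:

```python
import json

# Required fields of the (illustrative) 9-field garment schema.
REQUIRED_FIELDS = {"category", "color", "pattern", "material", "sleeve",
                   "closure", "neckline", "fit", "brand"}
# Closed vocabularies for the two fields most prone to adherence failures
# (assumed values for illustration).
ALLOWED = {
    "closure": {"zipper", "buttons", "none", "drawstring", "unknown"},
    "neckline": {"crew", "v-neck", "collar", "hooded", "unknown"},
}

def schema_failures(raw_output: str) -> list:
    """Return a list of failure reasons; an empty list means the output passes."""
    try:
        obj = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(obj, dict):
        return ["top level is not an object"]
    failures = [f"missing field: {f}" for f in REQUIRED_FIELDS - obj.keys()]
    failures += [f"extra field: {f}" for f in obj.keys() - REQUIRED_FIELDS]
    for field, allowed in ALLOWED.items():
        if field in obj and obj[field] not in allowed:
            failures.append(f"bad value for {field}: {obj[field]!r}")
    return failures
```

A model with native function calling should hit the empty-failure-list path far more often with less training effort, which is the whole appeal here.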
2. Tencent Penguin-VL-8B / 2B - released 2026-03 (trending this week)
- Sizes: 2B, 8B (+ `Penguin-Encoder` standalone)
- License: Apache 2.0
- Architecture: novel; the vision encoder is initialized from a text LLM (Qwen3-0.6B) with bidirectional attention + 2D-RoPE, replacing CLIP/SigLIP. The language backbone is Qwen3-8B.
- Why it may beat ours: Author-reported head-to-head vs Qwen3-VL-8B shows Penguin-VL ahead on most image + reasoning benchmarks:
- InfoVQA 86.8 vs 83.1, ChartQA 90.5 vs 89.6, AI2D 86.1 vs 85.7, RealWorldQA 75.8 vs 71.5, MathVista 77.4 vs 77.2, NextQA 85.4 vs 82.3
- Qwen3-VL wins on OCRBench (896 vs 852) and DocVQA (tied). Our hard samples are garment-attribute reasoning, not OCR-heavy, so the Penguin trade actually favors us.
- Fit for our stack: Drop-in 8B replacement for the SFT+GRPO pipeline; same backbone lineage (Qwen3-8B) means our existing LoRA recipe should port with minimal tuning.
- Link: https://huggingface.co/tencent/Penguin-VL-8B · https://huggingface.co/tencent/Penguin-VL-2B
Tier 2 - Worth watching (Medium)
3. Qwen3-VL FP8 variants - updated 2025-11-26
- Full-collection FP8 pass dropped for 4B / 8B / 30B-A3B / 32B / 235B Instruct and Thinking variants.
- Why watch: Our NVFP4 quant of `qwen3-vl-8b-sft+grpo` loses ~1.9 points (0.9131 → 0.8945). Official FP8 weights from Qwen may close that gap with less calibration work than NVFP4 required.
- Action: Quick A/B of `Qwen3-VL-8B-Instruct-FP8` vs our NVFP4 on the same hard eval slice.
- Link: https://huggingface.co/collections/Qwen/qwen3-vl
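To make the FP8-vs-NVFP4 A/B more than an eyeball comparison, a paired bootstrap over per-sample scores is cheap. This is a sketch, not an existing harness; it assumes you have per-sample weighted scores from each artifact, aligned by sample id:

```python
import random

def paired_delta_ci(scores_a, scores_b, n_boot=2000, seed=0):
    """Mean score delta (b - a) with a 95% bootstrap CI over paired samples.

    scores_a, scores_b: per-sample scores from the two quant artifacts,
    in the same sample order.
    """
    assert len(scores_a) == len(scores_b)
    deltas = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(deltas)
    rng = random.Random(seed)
    # Resample the paired deltas with replacement and collect bootstrap means.
    boot_means = sorted(
        sum(rng.choice(deltas) for _ in range(n)) / n for _ in range(n_boot)
    )
    mean = sum(deltas) / n
    return mean, (boot_means[int(0.025 * n_boot)], boot_means[int(0.975 * n_boot)])
```

If the CI on (FP8 - NVFP4) excludes zero, the winner is clear; if it straddles zero, the two artifacts are interchangeable and calibration cost decides.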
4. LiquidAI LFM2.5-VL-450M - updated 2025-11-28, trending now
- Architecture: LFM2.5-350M LM + SigLIP2 NaFlex vision encoder; native 512×512 tiling
- License: lfm1.0 (check commercial terms before shipping)
- Why watch: Claims to beat SmolVLM2-500M across MMBench/MMStar/MMMU. Only viable if we ever need an edge/browser-tier model below our 2B floor β today our 2B models at 0.89 would almost certainly outperform, so this is for roadmap, not replacement.
- Link: https://huggingface.co/LiquidAI/LFM2.5-VL-450M
5. Qwen3-VL-Embedding-8B - recent addition to the Qwen3-VL collection
- Why watch: Not a generator, but could power garment retrieval / nearest-neighbor de-duplication / hard-sample mining for the next training round.
- Link: https://huggingface.co/Qwen/Qwen3-VL-Embedding-8B
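As a sketch of the de-duplication use case: brute-force cosine pairs over embedding vectors (assumed here to be plain float lists pulled from the embedding model; function names are ours, not part of any library):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length float vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def near_duplicates(embeddings, threshold=0.95):
    """Return index pairs (i, j), i < j, with cosine similarity >= threshold.

    O(n^2) is fine at our 3,500-sample scale; switch to an ANN index (e.g.
    FAISS) if this ever runs over the full training pool.
    """
    pairs = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```

The same similarity scores, inverted, also give a cheap hard-sample miner: items far from every labeled neighbor are candidates for the next annotation round.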
Tier 3 - Tangentially relevant (Low)
| Model | Why skip |
|---|---|
| tencent/HY-Embodied-0.5 (2026-04-09) | 4B MoT specialized for robotics/VLA. Not a general VLM; garment JSON extraction is out of distribution. |
| LGAI-EXAONE/EXAONE-4.5-33B | Non-commercial license (EXAONE AI Model License 1.2-NC); blocker for Denali production use. |
| baidu/Qianfan-OCR, opendatalab/MinerU2.5-Pro, datalab-to/chandra-ocr-2, echo840/MonkeyOCR-pro-3B | OCR-specialist models. Our hard fields (pattern, closure, neckline, brand) are visual-reasoning, not text-extraction; OCR specialists historically underperform here (see Phi-4-Multimodal at 0.46). |
| Countless Gemma-4-*-CRACK, *-Aggressive, *-abliterated forks | Community uncensoring forks; irrelevant for classification, and they may hurt JSON-schema adherence. |
Recommended next actions
- Spin up a Gemma 4 E4B SFT run on our existing ORR dataset using the proven SFT → GRPO → GTPO recipe. Target: match or beat `qwen3-vl-2b-sft-grpo-v9` (0.8948) at roughly the same active-parameter budget.
- Baseline Penguin-VL-8B zero-shot on the 3,500-sample hard eval before committing any training time. If zero-shot is already ≥ Qwen3-VL-8B-Instruct base (0.78 territory on the 100-sample eval), queue a full SFT+GRPO cycle.
- Swap in official Qwen3-VL-8B-Instruct-FP8 as the quant baseline and compare against our NVFP4 artifact.
- Defer: LFM2.5-VL-450M (sub-2B is not a current deployment target), EXAONE-4.5 (license), HY-Embodied (wrong domain), OCR specialists.
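The Penguin-VL gating rule above, written out so it can sit in the eval harness; the 0.78 anchor is the base-model number quoted in this report, and the function name is ours:

```python
# Zero-shot score of Qwen3-VL-8B-Instruct base on the 100-sample eval,
# as quoted in this report.
QWEN3_VL_8B_BASE = 0.78

def should_queue_sft_grpo(zero_shot_score, baseline=QWEN3_VL_8B_BASE):
    """Commit a full SFT+GRPO cycle only if zero-shot already matches the base."""
    return zero_shot_score >= baseline
```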
Generated by the hf-model-scout skill. Eval anchor: qwen3-vl-8b-sft+grpo _overall.weighted_score = 0.9131 on our 3,500-sample hard eval.