Daily Model Scout Report — 2026-04-14

Scouting VLMs released or trending in the last ~7 days that could beat or complement our current best garment classifier (qwen3-vl-8b-sft+grpo @ 0.9131 weighted on the 3,500-sample hard eval set).

All benchmarks referenced below are from the model providers' reports, not our hard eval — any claim of "may beat ours" requires a run through our 9-field JSON pipeline to confirm (scoring sketch below).

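That confirmation boils down to per-field exact match rolled up into a weighted average. A minimal sketch of the scoring, with loud caveats: only pattern, closure, neckline, and brand are actually named in this report; the other five field names and all weights below are placeholders, not our eval config.

```python
import json

# Hypothetical 9-field weight table. Only pattern/closure/neckline/brand
# appear in this report; the remaining fields and all weights are
# placeholders for illustration, not the real eval config.
FIELD_WEIGHTS = {
    "category": 1.0, "color": 1.0, "material": 1.0, "sleeve": 1.0, "fit": 1.0,
    "pattern": 1.5, "closure": 1.5, "neckline": 1.5, "brand": 1.5,
}

def weighted_score(pred_json: str, gold: dict) -> float:
    """Score one sample: exact match per field; unparseable JSON scores 0."""
    try:
        pred = json.loads(pred_json)
    except json.JSONDecodeError:
        return 0.0  # strict-schema failures count as total misses
    total = sum(FIELD_WEIGHTS.values())
    hits = sum(w for f, w in FIELD_WEIGHTS.items()
               if str(pred.get(f, "")).strip().lower() == str(gold[f]).strip().lower())
    return hits / total
```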

Tier 1 — Benchmark immediately (High)

1. Google Gemma 4 family — released 2026-04-02 (12 days ago)

  • Variants: google/gemma-4-E2B, E4B, 26B-A4B (MoE, 8/128 experts), 31B (dense)
  • License: Apache 2.0
  • Modality: Text + Image (+ audio on E-series), 256K context, variable image token budgets (70→1120), function calling
  • Why it may beat ours:
    • Brand-new multimodal base we have not touched. Prior Gemma/Granite family members (Granite4-Vision-SFT) scored 0.88 on our 100-sample eval — above our best Qwen3.5-2B SFT. A fresh, stronger base from the same lineage plausibly lifts that further after SFT+GRPO.
    • Native function calling makes strict JSON schema adherence cheaper to train, which is our hardest failure mode (closure, neckline, brand); a tool-spec sketch follows this item's links.
    • MoE 26B-A4B activates ~4B params per token → fits comfortably on the RTX PRO 6000 98GB and may match the inference cost of an 8B dense model.
  • Fit for our stack: E4B (~4.5B effective) and 26B-A4B are the two obvious candidates. E4B replaces our 2B tier; 26B-A4B competes with Qwen3-VL-8B on inference cost while giving more capacity.
  • Link: https://huggingface.co/google/gemma-4-E4B · https://huggingface.co/google/gemma-4-26B-A4B-it
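
On the function-calling point above: constraining decoding to the 9-field output could look roughly like the tool spec below. This is a sketch in the generic OpenAI-style tool format, not Gemma 4's documented API, and every enum value is illustrative.

```python
# Hypothetical tool spec for the garment schema. Field names beyond
# pattern/closure/neckline/brand and all enum values are placeholders.
GARMENT_TOOL = {
    "name": "classify_garment",
    "description": "Return one value for every garment attribute field.",
    "parameters": {
        "type": "object",
        "properties": {
            "pattern":  {"type": "string", "enum": ["solid", "striped", "floral", "other"]},
            "closure":  {"type": "string", "enum": ["zipper", "buttons", "pullover", "other"]},
            "neckline": {"type": "string", "enum": ["crew", "v-neck", "collared", "other"]},
            "brand":    {"type": "string"},
            # five more fields in the real schema
        },
        "required": ["pattern", "closure", "neckline", "brand"],
    },
}
```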

2. Tencent Penguin-VL-8B / 2B — released 2026-03 (trending this week)

  • Sizes: 2B, 8B (+ Penguin-Encoder standalone)
  • License: Apache 2.0
  • Architecture: Novel — vision encoder is initialized from a text LLM (Qwen3-0.6B) with bidirectional attention + 2D-RoPE, replacing CLIP/SigLIP. Language backbone is Qwen3-8B.
  • Why it may beat ours: Author-reported head-to-head vs Qwen3-VL-8B shows Penguin-VL ahead on most image + reasoning benchmarks:
    • InfoVQA 86.8 vs 83.1, ChartQA 90.5 vs 89.6, AI2D 86.1 vs 85.7, RealWorldQA 75.8 vs 71.5, MathVista 77.4 vs 77.2, NextQA 85.4 vs 82.3
    • Qwen3-VL wins on OCRBench (896 vs 852) and ties on DocVQA. Our hard samples are garment-attribute reasoning, not OCR-heavy, so this trade-off actually favors us.
  • Fit for our stack: Drop-in 8B replacement for the SFT+GRPO pipeline; same backbone lineage (Qwen3-8B) means our existing LoRA recipe should port with minimal tuning (config sketch below).
  • Link: https://huggingface.co/tencent/Penguin-VL-8B · https://huggingface.co/tencent/Penguin-VL-2B
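
Because the language backbone is Qwen3-8B, the existing LoRA targets should map over directly. A minimal sketch of the port, assuming the checkpoint loads through the standard transformers/peft auto classes with remote code; rank, alpha, and dropout below are generic defaults, not our tuned recipe.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq

# Assumes the repo ships remote-code support for the standard auto class.
model = AutoModelForVision2Seq.from_pretrained(
    "tencent/Penguin-VL-8B", trust_remote_code=True
)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,  # generic defaults, not our recipe
    # Qwen3-style projection names; these should exist unchanged since the
    # language backbone is Qwen3-8B.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```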

Tier 2 — Worth watching (Medium)

3. Qwen3-VL FP8 variants — updated 2025-11-26

  • Full-collection FP8 pass dropped for 4B / 8B / 30B-A3B / 32B / 235B Instruct and Thinking variants.
  • Why watch: Our NVFP4 quant of qwen3-vl-8b-sft+grpo loses ~1.9 points (0.9131 → 0.8945). Official FP8 weights from Qwen may close that gap with less calibration work than NVFP4 required.
  • Action: Quick A/B of Qwen3-VL-8B-Instruct-FP8 vs our NVFP4 on the same hard eval slice (harness sketch below).
  • Link: https://huggingface.co/collections/Qwen/qwen3-vl
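
The A/B itself can stay tiny if both quants sit behind OpenAI-compatible endpoints (e.g. served by vLLM). A sketch under that assumption; URLs, ports, and the NVFP4 model id are placeholders, and weighted_score is the scoring sketch from the top of this report.

```python
from openai import OpenAI

# Placeholder endpoints/ids; assumes both artifacts serve an
# OpenAI-compatible API (e.g. via vLLM).
CANDIDATES = {
    "fp8":   ("http://localhost:8001/v1", "Qwen/Qwen3-VL-8B-Instruct-FP8"),
    "nvfp4": ("http://localhost:8002/v1", "local/qwen3-vl-8b-sft-grpo-nvfp4"),
}

def score_endpoint(base_url: str, model_id: str, samples) -> float:
    """Mean weighted score over the hard-eval slice for one endpoint."""
    client = OpenAI(base_url=base_url, api_key="unused")
    scores = []
    for s in samples:  # each sample carries image_url, prompt, gold dict
        resp = client.chat.completions.create(
            model=model_id, temperature=0.0,
            messages=[{"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": s["image_url"]}},
                {"type": "text", "text": s["prompt"]},
            ]}],
        )
        # weighted_score: see the scoring sketch at the top of this report
        scores.append(weighted_score(resp.choices[0].message.content, s["gold"]))
    return sum(scores) / len(scores)
```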

4. LiquidAI LFM2.5-VL-450M — updated 2025-11-28, trending now

  • Architecture: LFM2.5-350M LM + SigLIP2 NaFlex vision encoder; native 512×512 tiling
  • License: lfm1.0 (check commercial terms before shipping)
  • Why watch: Claims to beat SmolVLM2-500M across MMBench/MMStar/MMMU. Only viable if we ever need an edge/browser-tier model below our 2B floor — today our 2B models at 0.89 would almost certainly outperform, so this is for roadmap, not replacement.
  • Link: https://huggingface.co/LiquidAI/LFM2.5-VL-450M

5. Qwen3-VL-Embedding-8B — recent addition to Qwen3-VL collection


Tier 3 — Tangentially relevant (Low)

| Model | Why skip |
| --- | --- |
| tencent/HY-Embodied-0.5 (2026-04-09) | 4B MoT specialized for robotics/VLA. Not a general VLM; garment JSON extraction is out of distribution. |
| LGAI-EXAONE/EXAONE-4.5-33B | Non-commercial license (EXAONE AI Model License 1.2-NC) — blocker for Denali production use. |
| baidu/Qianfan-OCR, opendatalab/MinerU2.5-Pro, datalab-to/chandra-ocr-2, echo840/MonkeyOCR-pro-3B | OCR-specialist models. Our hard fields (pattern, closure, neckline, brand) are visual-reasoning, not text-extraction — OCR specialists historically underperform here (see Phi-4-Multimodal at 0.46). |
| Countless Gemma-4-*-CRACK, *-Aggressive, *-abliterated forks | Community uncensoring forks — irrelevant for classification, may hurt JSON-schema adherence. |

Recommended next actions

  1. Spin up a Gemma 4 E4B SFT run on our existing ORR dataset using the proven SFT→GRPO→GTPO recipe. Target: match or beat qwen3-vl-2b-sft-grpo-v9 (0.8948) at roughly the same active parameter budget.
  2. Baseline Penguin-VL-8B zero-shot on the 3,500-sample hard eval before committing any training time. If zero-shot is already ≥ Qwen3-VL-8B-Instruct base (0.78 territory on the 100-sample eval), queue a full SFT+GRPO cycle; a baseline harness sketch follows this list.
  3. Swap in official Qwen3-VL-8B-Instruct-FP8 as the quant baseline and compare against our NVFP4 artifact.
  4. Defer: LFM2.5-VL-450M (sub-2B is not a current deployment target), EXAONE-4.5 (license), HY-Embodied (wrong domain), OCR specialists.
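
For action 2, the zero-shot baseline needs nothing beyond stock transformers, assuming Penguin-VL follows the common chat-template pattern; the processor call below may need adjusting to the model card's actual usage snippet.

```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

MODEL_ID = "tencent/Penguin-VL-8B"
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto",
    trust_remote_code=True,
)

def predict(image, prompt: str) -> str:
    """Greedy zero-shot prediction for one garment image."""
    messages = [{"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": prompt},
    ]}]
    inputs = processor.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt",
    ).to(model.device)
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    # Decode only the newly generated tokens, not the prompt.
    return processor.decode(out[0, inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
```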

Generated by the hf-model-scout skill. Eval anchor: qwen3-vl-8b-sft+grpo _overall.weighted_score = 0.9131 on our 3,500-sample hard eval.
