The German Commons - 154 Billion Tokens of Openly Licensed Text for German Language Models Paper • 2510.13996 • Published Oct 15 • 8
CST5: Data Augmentation for Code-Switched Semantic Parsing Paper • 2211.07514 • Published Nov 14, 2022 • 1
MT5 release Collection The MT5 release follows the T5 family, but is pretrained on multilingual data. The update UMT5 models are pretrained on an updated corpus. • 10 items • Updated Jul 10 • 22
view article Article Introducing the Polish ASR Leaderboard (PAL) and Benchmark Intended Grouping of Open Speech (BIGOS) Corpora Jul 10, 2024 • 4
The African Languages Lab: A Collaborative Approach to Advancing Low-Resource African NLP Paper • 2510.05644 • Published Oct 7 • 23
H-Net Collection The family of hierarchical networks (H-Nets) from https://arxiv.org/abs/2507.07955 • 8 items • Updated Jul 11 • 20
OmniGEC Collection This is a collection of multilingual silver-standard datasets and models for the task of Grammatical Error Correction (GEC). • 9 items • Updated Sep 19 • 8
view article Article Welcome Gemma 3: Google's all new multimodal, multilingual, long context open LLM +2 Mar 12 • 473
Gemma 3 Collection All versions of Google's new multimodal models including QAT in 1B, 4B, 12B, and 27B sizes. In GGUF, dynamic 4-bit and 16-bit formats. • 55 items • Updated 5 days ago • 96
MT Quality Estimation Collection Models for reference-free quality estimation of machine translation • 10 items • Updated Jan 29 • 4
GTE models Collection General Text Embedding Models Released by Tongyi Lab of Alibaba Group • 21 items • Updated Jan 21 • 32
OWLS: Scaling Laws for Speech Recognition and Translation Collection 🦉 A suite of Whisper-style models from 250M to 18B parameters. Trained on up to 360K hours of data. 16k sampling rate. • 8 items • Updated May 3 • 7
view article Article From Llasa to Llasagna 🍕: Finetuning LLaSA to generates Italian speech and other languages Feb 11 • 33
NeMo Curator - Classifier Models Collection Classifier models that can be used in NeMo Curator for labelling/filtering datasets. • 11 items • Updated 3 days ago • 24
Ukrainian Text-to-Speech datasets Collection Five voices: Mykyta, Oleksa, Lada, Kateryna or Tetiana • 6 items • Updated Feb 26 • 4
Crimean Tatar Text-to-Speech datasets Collection Three voices: Abibullah, Sevil, or Arslan • 4 items • Updated May 27 • 2