# nl-nli-klein
Small Dutch Natural Language Inference model — 46M parameters, for local-first LLM workflows that can't afford a cloud round-trip.
nl-nli-klein ("klein" = small in Dutch) is a 3-class sequence-pair classifier that predicts whether a premise entails, contradicts, or is neutral with respect to a hypothesis. Natively in Dutch, ships pre-fine-tuned so you don't have to.
Before you use this model, read the Limitations section carefully. The 0.826 SICK-NL benchmark reflects performance on SICK-style paraphrase pairs; the model has real, reproducible failure modes on temporal, numeric, and hypernym reasoning that we document explicitly.
It is the throughput tier of the LokaalHub Dutch NLI family. For higher accuracy at a larger size, see LokaalHub/nl-nli-middel.
## At a glance

| | |
|---|---|
| Base model | DTAI-KULeuven/robbertje-1-gb-bort (RobBERTje-BORT, distilled from RobBERT) |
| Parameters | 46M |
| Disk size | 177.8 MB (fp32) |
| Language | Dutch (nl) |
| Task | 3-class NLI (entailment / neutral / contradiction) |
| License | Apache-2.0 |
| Training data | SICK-NL (MIT) + e-SNLI-NL (Apache-2.0) |
| SICK-NL test accuracy | 0.826 |
| Macro-F1 | 0.820 |
| CPU throughput | ~184 pairs/sec (single M1 P-core, fp32, batch 32) |
## What to use this for
Any Dutch LLM workflow where a tiny local classifier makes the whole pipeline better — especially high-volume, low-latency scenarios where a cloud LLM call would be wasteful.
### Primary use cases

- Groundedness / hallucination check for Dutch RAG. For every sentence in an LLM-generated answer, check whether it is entailed by the retrieved context. If `P(entailment) < threshold`, flag or drop the sentence. Cheap enough to run on every generated sentence, local-only (no context leaves the device). See the first sketch after this list.
- Zero-shot classification via the NLI-as-classifier pattern. Frame any classification task as entailment: premise = the document, hypothesis = "Dit document gaat over boekhouding" (or whatever label you want to check). `P(entailment)` becomes the class score. Lets you define new label sets without retraining (second sketch below).
- Summary-faithfulness scoring. For each sentence in a summary, check it against the source document. Contradictions reveal hallucinations; neutrals reveal fabrication.
- Paraphrase / duplicate detection. Two sentences entailing each other in both directions is a strong paraphrase signal — useful for deduplicating knowledge bases, matching FAQ entries, or clustering support tickets (third sketch below).
- Simple contradiction detection in user input. For moderation or QA pipelines where you want to flag when a user's statement contradicts a known fact.
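As a concrete illustration of the groundedness check, here is a minimal sketch. The helper names and the 0.5 threshold are illustrative choices, not shipped code — tune the threshold on your own data. The model and tokenizer are loaded exactly as in the Quick start below.

```python
# Sketch: per-sentence groundedness gate for Dutch RAG output.
# Helper names and the 0.5 threshold are illustrative, not shipped code.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("LokaalHub/nl-nli-klein")
model = AutoModelForSequenceClassification.from_pretrained("LokaalHub/nl-nli-klein")
model.eval()

def entailment_prob(premise: str, hypothesis: str) -> float:
    """P(entailment) that `premise` entails `hypothesis`."""
    inputs = tokenizer(premise, hypothesis, truncation="only_first",
                       max_length=256, return_tensors="pt")
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(-1).squeeze()
    return probs[0].item()  # index 0 = entailment in this model's id2label

def grounded_sentences(context: str, sentences: list[str],
                       threshold: float = 0.5) -> list[str]:
    # Keep only the answer sentences that the retrieved context entails.
    return [s for s in sentences if entailment_prob(context, s) >= threshold]
```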
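The NLI-as-classifier pattern looks roughly like this, reusing `entailment_prob` from the sketch above. The hypothesis template and the candidate labels are arbitrary examples, not a fixed API:

```python
# Sketch: zero-shot classification via NLI (reuses entailment_prob above).
# The hypothesis template and candidate labels are arbitrary examples.
def zero_shot_scores(document: str, candidate_labels: list[str]) -> dict[str, float]:
    return {label: entailment_prob(document, f"Dit document gaat over {label}.")
            for label in candidate_labels}

scores = zero_shot_scores("De btw-aangifte moet voor 30 april worden ingediend.",
                          ["boekhouding", "sport", "gezondheid"])
print(max(scores, key=scores.get), scores)
```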
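And the bidirectional-entailment paraphrase signal, again as a sketch with an illustrative threshold:

```python
# Sketch: paraphrase signal = entailment in both directions.
def is_paraphrase(a: str, b: str, threshold: float = 0.5) -> bool:
    return (entailment_prob(a, b) >= threshold
            and entailment_prob(b, a) >= threshold)
```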
### What makes it suitable for these
- Local and private. Runs fully on-device. No data leaves the laptop / VM / cluster. Critical for GDPR-sensitive Dutch workloads (healthcare, legal, HR, government).
- Fast. ~184 pairs/sec on a single M1 CPU core. You can score every sentence of a streaming LLM output without adding perceptible latency.
- Small. 178 MB fits easily alongside an agent runtime, a local vector DB, and a lightweight generation model.
- Cheap. Zero marginal cost per inference. A cloud-based NLI-as-a-service at 1000 RAG queries/day × 10 sentences × 8 chunks = 80k pair-checks/day would cost real money; this model costs zero.
Read the Limitations section before committing — this is the klein (fastest/smallest) tier, and it has concrete failure modes on temporal/numeric/hypernym reasoning.
## Quick start

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("LokaalHub/nl-nli-klein")
model = AutoModelForSequenceClassification.from_pretrained("LokaalHub/nl-nli-klein")
model.eval()

premise = "Het contract gaat in op 1 mei 2026."
hypothesis = "Het contract start in mei."

# Premise-side truncation: long premises get cut, the hypothesis is preserved.
inputs = tokenizer(premise, hypothesis, truncation="only_first",
                   max_length=256, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

probs = logits.softmax(-1).squeeze().tolist()
label_id = int(torch.argmax(logits, dim=-1))
labels = model.config.id2label  # {0: "entailment", 1: "neutral", 2: "contradiction"}
print(f"prediction: {labels[label_id]} "
      f"probs: {dict(zip(labels.values(), [round(p, 3) for p in probs]))}")
```
Do NOT use the `pipeline("zero-shot-classification")` wrapper for benchmarking — it rewraps inputs with a template ("This example is {}.") and collapses to 2 classes. For 3-class NLI use raw `AutoModelForSequenceClassification` + argmax as shown above.
## Benchmark — SICK-NL test (4,906 pairs)
### Head-to-head vs. zero-shot multilingual NLI models

Every row in this table was measured by us on CPU, fp32, under an identical preprocessing protocol (see Reproducibility below); literature baselines appear in the next subsection. Competitor numbers are zero-shot — no SICK-NL exposure in training. nl-nli-klein is fine-tuned on a mix of SICK-NL train + e-SNLI-NL.
| Model | Params (M) | Disk (MB) | Accuracy | Macro-F1 | F1 entail / neutral / contradict | Throughput (pairs/s, CPU) | Dutch NLI in training? |
|---|---|---|---|---|---|---|---|
| LokaalHub/nl-nli-klein (ours) | 46 | 178 | 0.826 | 0.820 | 0.77 / 0.85 / 0.84 | 163 | Yes |
| MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7 | 278 | 551 | 0.682 | 0.677 | 0.79 / 0.67 / 0.56 | 17 | Yes (MT) |
| MoritzLaurer/mDeBERTa-v3-base-mnli-xnli | 278 | 552 | 0.591 | 0.592 | 0.80 / 0.48 / 0.49 | 17 | No |
| MoritzLaurer/multilingual-MiniLMv2-L6-mnli-xnli | 100 | 429 | 0.508 | 0.507 | 0.67 / 0.42 / 0.43 | 143 | No |
| joeddav/xlm-roberta-large-xnli | 560 | 2145 | 0.592 | 0.588 | 0.76 / 0.49 / 0.51 | 8 | No |
Pinned revisions: see competitor_revisions.json at project root.
Bottom line: nl-nli-klein at 46M parameters beats every zero-shot multilingual NLI model tested, including 6× larger mDeBERTa-2mil7 (by 14 points of accuracy) and 12× larger xlm-roberta-large-xnli (by 23 points). At 163 pairs/s on a single CPU core, it's also 10× faster than mDeBERTa-2mil7 and 20× faster than XLM-R-large.
### vs. published Dutch NLI baselines (apples-to-apples)

How do we compare to Dutch-specific models fine-tuned on SICK-NL train? Literature numbers are from Wijnholds & Moortgat (EACL 2021, Table 3).

| Model | Params | Training setup | SICK-NL accuracy |
|---|---|---|---|
| mBERT | 178M | SICK-NL train | 0.845 |
| BERTje | 109M | SICK-NL train | 0.839 |
| nl-nli-klein (shipped, this repo) | 46M | SICK-NL + e-SNLI-NL | 0.826 |
| RobBERT v1 | 124M | SICK-NL train | 0.820 |
| nl-nli-klein (strict) | 46M | SICK-NL train only | 0.804 |
Two klein rows because the training setup differs:

- nl-nli-klein (shipped) — the 80k curated data mix used for this checkpoint. Beats RobBERT v1 at 37% of its parameter count and lands within 1.3 points of BERTje at 42% of its size.
- nl-nli-klein (strict) — SICK-NL train only (4,439 examples × 3 epochs), directly comparable to the literature baselines. Shows the honest size-performance tradeoff at identical training: 46M loses ~3.5 points to BERTje (109M) under matched fine-tuning.
## Training

### Data mix

Only permissively licensed sources — so this Apache-2.0 release is defensible.
| Source | License | Role | Size (unique) | Effective weight |
|---|---|---|---|---|
| `maximedb/sick_nl` | MIT | gold anchor | 4,439 train | 18% |
| `GroNLP/ik-nlp-22_transqe` | Apache-2.0 | volume | 549,367 train | 82% |
Excluded on purpose: MoritzLaurer/multilingual-NLI-26lang-2mil7 NL subsets. Its upstream sources include ANLI (CC-BY-NC-4.0 — non-commercial) and FEVER (CC-BY-SA-3.0 — share-alike), which are incompatible with this model's Apache-2.0 release.
### Method
Standard 3-class sequence-pair classification fine-tune on RobBERTje-BORT. 80,000 training rows sampled with replacement from the weighted pool, 3 epochs, AdamW (LR 3e-5, weight decay 0.01, linear schedule with 10% warmup), batch size 32 (16 × 2 gradient accumulation), fp32, seed 42. Training capped at 40 minutes wall-clock on an M1 Pro with MPS.
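For reference, the hyperparameters above map onto Hugging Face `TrainingArguments` roughly as follows. This is a sketch — the actual run is driven by this repo's config.yaml, so details may differ:

```python
# Sketch: the hyperparameters above expressed as HF TrainingArguments.
# The real run is driven by config.yaml; treat this as an approximation.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="nl-nli-klein",
    num_train_epochs=3,
    learning_rate=3e-5,
    weight_decay=0.01,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,  # effective batch size 32
    seed=42,
)
```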
The project ships with an in-house autoresearch_agent.py that iterates on config.yaml via Claude to drive further hyperparameter search — this klein v1 checkpoint was trained from the default config, with autoresearch reserved for future tier runs.
Inspired by Andrej Karpathy's minimal-loop approach to ML research — small, readable code, fast iteration, measured decisions.
### Note on SICK-NL label conventions
The maximedb/sick_nl dataset card on HuggingFace states integer labels as {0: entailment, 1: neutral, 2: contradiction}. However, inspection of the dataset's own entailment_AB column reveals the raw label integers are inverted from that claim — rows with label=0 have entailment_AB="A_contradicts_B". The class distribution confirms this (14.5% of test rows have label=0, matching SICK's documented ~14% contradiction rate).
This model was trained with the raw labels remapped (label = 2 - label) so our id2label — {0: entailment, 1: neutral, 2: contradiction} — matches both the standard NLI convention and the semantic content of entailment_AB. Benchmarks against multilingual NLI competitors were run against the same remapped gold labels for fairness. All klein accuracy numbers in this card are on semantically-correct NLI — if the model outputs "entailment" with high probability, the pair is genuinely entailed.
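A minimal sketch of the remap, assuming the raw integer column is named `label` as on the Hub:

```python
# Sketch: remap raw maximedb/sick_nl integers to this model's convention.
# Assumes the integer column is named "label".
from datasets import load_dataset

sick_nl = load_dataset("maximedb/sick_nl")
sick_nl = sick_nl.map(lambda ex: {"label": 2 - ex["label"]})
# After remapping: 0 = entailment, 1 = neutral, 2 = contradiction.
```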
## Reproducibility (evaluation protocol)

To reproduce the numbers above, follow these rules exactly:

- Test set: SICK-NL test split (4,906 examples) of `maximedb/sick_nl`. No filtering.
- Canonical label order: `{0: entailment, 1: neutral, 2: contradiction}`. For any competitor, remap via its `config.id2label` before scoring (some models, e.g. `joeddav/xlm-roberta-large-xnli`, invert the order).
- Per-model tokenizer: `AutoTokenizer.from_pretrained(id, revision=<sha>)`. Never share tokenizers across models.
- Tokenisation: `tokenizer(premise, hypothesis, truncation='only_first', max_length=256, padding='max_length')`.
- Precision: fp32 for all models. mDeBERTa does not support fp16.
- Model loading: `AutoModelForSequenceClassification` + raw logits + argmax. Never `pipeline("zero-shot-classification")`.
- Device: CPU for accuracy (MPS has known numeric issues with DeBERTa-v2's disentangled attention).
- For DeBERTa-v2 competitors, forward `token_type_ids` alongside `input_ids` / `attention_mask` — they're required for pair classification.
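Put together, the protocol looks roughly like the sketch below. The column names `sentence_A` / `sentence_B`, the gold-label remap (`2 - label`, per the note above), and the assumption that competitors expose meaningful `id2label` names are all assumptions consistent with this card, not guaranteed specifics:

```python
# Sketch of the evaluation protocol above. Column names and the
# gold-label remap (2 - label) follow the SICK-NL notes earlier.
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CANONICAL = ["entailment", "neutral", "contradiction"]

def sick_nl_accuracy(model_id: str, revision: str | None = None) -> float:
    tok = AutoTokenizer.from_pretrained(model_id, revision=revision)
    mdl = AutoModelForSequenceClassification.from_pretrained(model_id, revision=revision)
    mdl.eval()
    # Map this model's own label order onto the canonical one.
    remap = [CANONICAL.index(mdl.config.id2label[i].lower())
             for i in range(mdl.config.num_labels)]
    test = load_dataset("maximedb/sick_nl", split="test")
    correct = 0
    for ex in test:
        enc = tok(ex["sentence_A"], ex["sentence_B"], truncation="only_first",
                  max_length=256, padding="max_length", return_tensors="pt")
        with torch.no_grad():
            # **enc forwards token_type_ids when the tokenizer produces them.
            pred = int(mdl(**enc).logits.argmax(-1))
        correct += int(remap[pred] == 2 - ex["label"])
    return correct / len(test)
```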
### Hypothesis-only bias check
We explicitly measured accuracy on SICK-NL test with the premise blanked out. Result: 0.570 — exactly the majority-class baseline for SICK-NL (56.87% of test examples are neutral). Scoring at, not above, this baseline from the hypothesis alone demonstrates no shortcut learning: the model genuinely requires the premise to make predictions. A model with real hypothesis-only bias would exceed ~0.62 on this check.
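A sketch of the probe, reusing the Quick-start `model` and `tokenizer`. Replacing the premise with an empty string is one reasonable implementation of "premise blanked out"; the exact blanking used for the reported number is an assumption here:

```python
# Sketch: hypothesis-only probe — premise replaced by an empty string.
# Reuses `model` and `tokenizer` from the Quick start.
import torch
from datasets import load_dataset

def hypothesis_only_accuracy() -> float:
    test = load_dataset("maximedb/sick_nl", split="test")
    correct = 0
    for ex in test:
        enc = tokenizer("", ex["sentence_B"], truncation="only_first",
                        max_length=256, return_tensors="pt")
        with torch.no_grad():
            pred = int(model(**enc).logits.argmax(-1))
        correct += int(pred == 2 - ex["label"])  # remapped gold label
    return correct / len(test)
```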
## Limitations
Important — please read before using. The benchmark number (0.826 on SICK-NL) reflects the model's behaviour on the distribution it was trained on (SICK-style paraphrase pairs + machine-translated e-SNLI). It does not mean the model performs general-purpose Dutch NLI at 82% accuracy on arbitrary text. Spot-checks on handwritten Dutch pairs revealed specific, reproducible failure modes you should be aware of.
### What the model handles well
- Lexical-overlap entailment — removing or adding modifiers while keeping the main content. "Een hond rent in het park" → "Een hond rent" = entailment ✓
- Direct negation as contradiction — negation of a main verb or subject. "Een hond rent in het park" → "Er is geen hond in het park" = contradiction ✓
- Clearly unrelated pairs as neutral — when premise and hypothesis share no logical connection.
- Quantifier contradictions — "Niemand zit op de bank" → "Er zit iemand op de bank" = contradiction ✓
### What the model does poorly
These are real failures from our own testing. Do not rely on this model for:
- Temporal / numeric reasoning. "Het contract gaat in op 1 mei 2026" → "Het contract start in mei" — the model predicted `neutral` (should be entailment). "Het contract start in maart" — the model predicted `neutral` (should be contradiction). Any task requiring reasoning over dates, numbers, or quantities is out of scope.
- Hyponym / hypernym chains. "Twee mannen spelen voetbal" → "Mensen spelen een sport" — the model predicted `contradiction` (should be entailment: men → people, football → sport). Category reasoning is brittle.
- Nuanced neutrality. "Een man loopt op straat" → "Een man werkt bij een bank" — the model predicted `entailment` (should be neutral). When topics are loosely related, the model tends toward entailment or contradiction rather than neutral.
- Semantic opposites without syntactic negation. "Het meisje lacht" / "Het meisje huilt" — the model predicted `contradiction` confidently, but without niet / geen the gold label is typically neutral in the SICK convention.
### When to pick klein vs. something else
klein is the throughput tier — 46M parameters, 178 MB, 163 pairs/sec on a single CPU core. It's the right choice when:
- You need high-volume Dutch NLI at very low latency (every RAG chunk, every streaming LLM sentence).
- Your inputs look like SICK-style paraphrase pairs or SNLI-style simple entailment — lexical-overlap rewrites, direct negations, straightforward contradictions.
Pick something else when:
- Your inputs involve temporal, numeric, or categorical reasoning — use a larger Dutch LLM (Fietje / GEITje / Claude) with explicit prompting, not this classifier.
- You need confidence-calibrated predictions on hard-NLI corpora — klein's softmax is not calibrated for ANLI/WANLI-style adversarial pairs.
- You need broader linguistic coverage (Flemish, Frisian, code-mixed). klein is trained Dutch-only.
### Other limitations
- SICK-NL is MT + manual-correction of the English SICK, not native Dutch re-annotation. The model and the benchmark inherit some translation-style artefacts, including a preference for lexical-overlap heuristics over deep semantic reasoning.
- e-SNLI-NL is fully machine-translated from English e-SNLI via OPUS-MT. Some hypothesis-only-bias risk is inherent in the training data; we measured it explicitly (0.570, at the SICK-NL majority-class baseline of 0.569 — so no evidence of shortcut learning on the benchmark, but adversarial prompts may exploit bias that the benchmark doesn't surface).
- Max input length is 256 tokens with premise-side truncation. Longer premises get truncated while the hypothesis is preserved.
- Dutch only. No evaluation on Flemish, Frisian, or code-mixed text.
- 3-class task only. If you need "weak entailment" or confidence stratification, use the raw softmax probabilities as an ordinal signal — but see the calibration caveat above.
## Ethical considerations
- Outputs are probabilities over 3 classes; no claims about the world, only about logical relationships between two Dutch texts.
- For production hallucination-detection or groundedness decisions, combine with retrieval-quality checks and human-in-the-loop review for consequential outcomes.
- Runs fully locally; no data leaves the device at inference time.
## Related models

- `LokaalHub/nl-nli-middel` — higher-accuracy tier (124M params, RobBERT-2023-base). Coming next.
- `LokaalHub/nl-lokaal-klein` and `nl-lokaal-middel` — sibling Dutch PII NER models.
## Citation

```bibtex
@misc{nl_nli_klein_2026,
  title  = {nl-nli-klein: A Tiny Dutch Natural Language Inference Model},
  author = {LokaalHub},
  year   = {2026},
  url    = {https://huggingface.co/LokaalHub/nl-nli-klein}
}
```
Please also cite the base model and data sources:
```bibtex
@inproceedings{wijnholds2021sicknl,
  title     = {{SICK-NL}: A Dataset for {Dutch} Natural Language Inference},
  author    = {Wijnholds, Gijs and Moortgat, Michael},
  booktitle = {Proceedings of EACL 2021},
  year      = {2021}
}

@inproceedings{delobelle2020robbert,
  title     = {{RobBERT}: a {Dutch} {RoBERTa}-based Language Model},
  author    = {Delobelle, Pieter and Winters, Thomas and Berendt, Bettina},
  booktitle = {Findings of EMNLP 2020},
  year      = {2020}
}

@inproceedings{camburu2018esnli,
  title     = {e-{SNLI}: Natural Language Inference with Natural Language Explanations},
  author    = {Camburu, Oana-Maria and Rockt\"{a}schel, Tim and Lukasiewicz, Thomas and Blunsom, Phil},
  booktitle = {NeurIPS 2018},
  year      = {2018}
}
```
Built in the Netherlands. Trained 2026-04-21.