# vioBERT-v3: Arabic Medical BERT

The first Arabic domain-adapted BERT model purpose-built for the medical domain.

vioBERT-v3 is produced by continuing masked-language-model pre-training on Shifaa, a curated corpus of 1.17 million Arabic medical documents spanning health consultations, drug references, patient-education materials, and encyclopaedic clinical articles across 16 medical specialties.
## Key Results
| Task | vioBERT-v3 vs MARBERTv2 |
|---|---|
| Medical PPL | 82.7% reduction |
| Fill-mask Top-5 | +15.6 pp |
| Medical NER (F1) | +0.93 pp (surpasses BioBERT's +0.62 on English) |
| 39-class Classification (F1) | +1.62 pp |
| 5-class Classification (F1) | +0.97 pp |
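For context, masked-LM perplexity is the exponential of the mean per-token negative log-likelihood, and the headline figure is a simple ratio between the two models' perplexities. A minimal arithmetic sketch (the numbers below are illustrative, not the paper's raw measurements):

```python
import math

def perplexity(mean_nll: float) -> float:
    """Perplexity is the exponential of the mean per-token negative log-likelihood."""
    return math.exp(mean_nll)

def reduction_pct(base_ppl: float, adapted_ppl: float) -> float:
    """Relative perplexity reduction, in percent, of the adapted model vs. the base."""
    return 100.0 * (base_ppl - adapted_ppl) / base_ppl

# Illustrative only: an 82.7% reduction means the adapted model's perplexity
# is about 0.173x the base model's on the same held-out medical text.
ratio = 1.0 - 82.7 / 100.0
```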
## Usage

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="Vionex-digital/vioBERT-v3")
# "The patient is suffering from [MASK] in the chest"
results = fill_mask("المريض يعاني من [MASK] في الصدر")
for r in results:
    print(f"{r['token_str']}: {r['score']:.4f}")
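The pipeline returns a list of candidate dicts with `token_str` and `score` fields. As an illustration, a small post-processing helper (`confident_fills` is our hypothetical name, not a library function) can keep only high-confidence fills; the sample scores below are mocked, not model output:

```python
def confident_fills(results, min_score=0.05, top_k=3):
    """Keep at most top_k fill-mask candidates whose probability clears
    a confidence floor; `results` mirrors the pipeline's output schema."""
    kept = [r for r in results if r["score"] >= min_score]
    kept.sort(key=lambda r: r["score"], reverse=True)
    return kept[:top_k]

# Mocked pipeline output for the prompt above (scores are invented):
sample = [
    {"token_str": "ألم", "score": 0.41},   # "pain"
    {"token_str": "ضيق", "score": 0.18},   # "tightness"
    {"token_str": "حكة", "score": 0.02},   # "itching" — below the floor
]
```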
## Pre-training Details

- Base Model: MARBERTv2 (UBC-NLP/MARBERTv2, pre-trained on ~1B Arabic tweets)
- Corpus: Shifaa — 1.17M Arabic medical documents
- Strategy: Whole-word masking (all WordPiece sub-tokens of a word are masked together, respecting Arabic's clitic-rich morphology)
- Steps: 22,000 (early-stopped via composite improvement score)
- Masking: 15% probability
- Optimizer: AdamW, lr=5e-5, weight decay 0.01
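Whole-word masking can be sketched in a few lines: WordPiece continuation pieces (the `##`-prefixed tokens) are grouped with the piece that opens their word, and each whole word is then masked as a unit. This is a simplified stand-alone sketch, not the collator used in training (Hugging Face's `DataCollatorForWholeWordMask` implements the production version):

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]", rng=None):
    """Mask whole words in a WordPiece token sequence: a '##' continuation
    piece always shares its fate with the piece that starts its word."""
    rng = rng or random.Random(0)
    # Group token indices into whole words.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    masked = list(tokens)
    for word in words:
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = mask_token
    return masked
```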
## Training Data: The Shifaa Corpus
| Source | Documents | Avg. tokens/doc |
|---|---|---|
| Health consultations (questions) | 393,219 | 108 |
| Health consultations (answers) | 777,954 | 259 |
| Medical encyclopaedia | ~5,000 | 412 |
| **Total** | **1,176,173** | **189** |
## Evaluation

Evaluated across five orthogonal axes, with 42 experimental configurations and 3 random seeds each:
- Intrinsic: Perplexity + Fill-mask accuracy
- Linear Probing: Frozen encoder classification
- Full Fine-tuning: 5-class and 39-class medical text classification
- NER: Arabic medical named entity recognition
- MCQ: Medical question answering (5 difficulty levels)
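Linear probing trains only a softmax head on top of frozen encoder outputs. A self-contained NumPy sketch of that protocol, with synthetic features standing in for frozen [CLS] embeddings (the real evaluation extracts those from the model):

```python
import numpy as np

def train_linear_probe(X, y, n_classes, lr=0.5, epochs=300):
    """Softmax regression on frozen features X of shape (n, d): in linear
    probing the encoder is never updated, only this head's W and b."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                          # one-hot labels
    for _ in range(epochs):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)             # softmax probabilities
        G = (P - Y) / n                               # cross-entropy gradient
        W -= lr * (X.T @ G)
        b -= lr * G.sum(axis=0)
    return W, b

def probe_accuracy(W, b, X, y):
    return float((np.argmax(X @ W + b, axis=1) == y).mean())
```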
## Citation

```bibtex
@techreport{zaghloul2026viobert,
  title={From Tweets to Treatment: Domain-Adaptive Pre-Training for Arabic Medical {NLP}},
  author={Zaghloul, Yousef and Khaled, Abdallah},
  year={2026},
  institution={Vionex Digital Solutions}
}
```
## Limitations
- Clinical deployment requires bias auditing across dialects and demographics
- vioBERT is a research tool, not a diagnostic system
- Performance on clinical reasoning tasks (e.g., complex MCQ) remains at statistical parity with the base model
## Ethics
The Shifaa corpus is constructed entirely from publicly available medical text; no patient records or protected health information were used.
Developed by Vionex Digital Solutions