# vioBERT-v3: Arabic Medical BERT

The first Arabic domain-adapted BERT model purpose-built for the medical domain.

vioBERT-v3 is produced by continuing masked-language-model pre-training on Shifaa, a curated corpus of 1.17 million Arabic medical documents spanning health consultations, drug references, patient-education materials, and encyclopaedic clinical articles across 16 medical specialties.
## Key Results
| Task | vioBERT-v3 vs MARBERTv2 |
|---|---|
| Medical PPL | 82.7% reduction |
| Fill-mask Top-5 | +15.6 pp |
| Medical NER (F1) | +0.93 pp (surpasses BioBERT's +0.62 on English) |
| 39-class Classification (F1) | +1.62 pp |
| 5-class Classification (F1) | +0.97 pp |
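For context, masked-LM perplexity is the exponential of the mean per-token negative log-likelihood, and the headline figure is a simple ratio between the two models' perplexities. A minimal arithmetic sketch (the numbers below are illustrative, not the paper's raw measurements):

```python
import math

def perplexity(mean_nll: float) -> float:
    """Perplexity is the exponential of the mean per-token negative log-likelihood."""
    return math.exp(mean_nll)

def reduction_pct(base_ppl: float, adapted_ppl: float) -> float:
    """Relative perplexity reduction, in percent, of the adapted model vs. the base."""
    return 100.0 * (base_ppl - adapted_ppl) / base_ppl

# Illustrative only: an 82.7% reduction means the adapted model's perplexity
# is about 0.173x the base model's on the same held-out medical text.
ratio = 1.0 - 82.7 / 100.0
```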
## Usage

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="Vionex-digital/vioBERT-v3")
# "The patient is suffering from [MASK] in the chest"
results = fill_mask("المريض يعاني من [MASK] في الصدر")
for r in results:
    print(f"{r['token_str']}: {r['score']:.4f}")
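The pipeline returns a list of candidate dicts with `token_str` and `score` fields. As an illustration, a small post-processing helper (`confident_fills` is our hypothetical name, not a library function) can keep only high-confidence fills; the sample scores below are mocked, not model output:

```python
def confident_fills(results, min_score=0.05, top_k=3):
    """Keep at most top_k fill-mask candidates whose probability clears
    a confidence floor; `results` mirrors the pipeline's output schema."""
    kept = [r for r in results if r["score"] >= min_score]
    kept.sort(key=lambda r: r["score"], reverse=True)
    return kept[:top_k]

# Mocked pipeline output for the prompt above (scores are invented):
sample = [
    {"token_str": "ألم", "score": 0.41},   # "pain"
    {"token_str": "ضيق", "score": 0.18},   # "tightness"
    {"token_str": "حكة", "score": 0.02},   # "itching" — below the floor
]
```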
## Pre-training Details

- Base Model: MARBERTv2 (UBC-NLP/MARBERTv2, pre-trained on ~1B Arabic tweets)
- Corpus: Shifaa — 1.17M Arabic medical documents
- Strategy: Whole-word masking (all WordPiece sub-tokens of a word are masked together, respecting Arabic's clitic-rich morphology)
- Steps: 22,000 (early-stopped via composite improvement score)
- Masking: 15% probability
- Optimizer: AdamW, lr=5e-5, weight decay 0.01
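Whole-word masking can be sketched in a few lines: WordPiece continuation pieces (the `##`-prefixed tokens) are grouped with the piece that opens their word, and each whole word is then masked as a unit. This is a simplified stand-alone sketch, not the collator used in training (Hugging Face's `DataCollatorForWholeWordMask` implements the production version):

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]", rng=None):
    """Mask whole words in a WordPiece token sequence: a '##' continuation
    piece always shares its fate with the piece that starts its word."""
    rng = rng or random.Random(0)
    # Group token indices into whole words.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    masked = list(tokens)
    for word in words:
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = mask_token
    return masked
```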
## Training Data: The Shifaa Corpus
| Source | Documents | Avg. tokens/doc |
|---|---|---|
| Health consultations (questions) | 393,219 | 108 |
| Health consultations (answers) | 777,954 | 259 |
| Medical encyclopaedia | ~5,000 | 412 |
| **Total** | **1,176,173** | **189** |
## Evaluation

Evaluated across five orthogonal axes, with 42 experimental configurations and 3 random seeds each:
- Intrinsic: Perplexity + Fill-mask accuracy
- Linear Probing: Frozen encoder classification
- Full Fine-tuning: 5-class and 39-class medical text classification
- NER: Arabic medical named entity recognition
- MCQ: Medical question answering (5 difficulty levels)
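Linear probing trains only a softmax head on top of frozen encoder outputs. A self-contained NumPy sketch of that protocol, with synthetic features standing in for frozen [CLS] embeddings (the real evaluation extracts those from the model):

```python
import numpy as np

def train_linear_probe(X, y, n_classes, lr=0.5, epochs=300):
    """Softmax regression on frozen features X of shape (n, d): in linear
    probing the encoder is never updated, only this head's W and b."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                          # one-hot labels
    for _ in range(epochs):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)             # softmax probabilities
        G = (P - Y) / n                               # cross-entropy gradient
        W -= lr * (X.T @ G)
        b -= lr * G.sum(axis=0)
    return W, b

def probe_accuracy(W, b, X, y):
    return float((np.argmax(X @ W + b, axis=1) == y).mean())
```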
## Citation

```bibtex
@techreport{zaghloul2026viobert,
  title={From Tweets to Treatment: Domain-Adaptive Pre-Training for Arabic Medical {NLP}},
  author={Zaghloul, Yousef and Khaled, Abdallah},
  year={2026},
  institution={Vionex Digital Solutions}
}
```
## Limitations
- Clinical deployment requires bias auditing across dialects and demographics
- vioBERT is a research tool, not a diagnostic system
- Performance on clinical reasoning tasks (e.g., complex MCQ) remains at statistical parity with the base model
## Ethics
The Shifaa corpus is constructed entirely from publicly available medical text; no patient records or protected health information were used.
Developed by Vionex Digital Solutions