---
language:
- lb
license: cc-by-sa-4.0
library_name: transformers
pipeline_tag: fill-mask
tags:
- modernbert
- encoder
- luxembourgish
- multilingual
- masked-language-modeling
---

# LTZ E1 (mini)

A ModernBERT-based masked language model pretrained on Luxembourgish, following the [Ettin recipe](https://huggingface.co/jhu-clsp/ettin-encoder-68m).

## Model Details

- **Architecture:** ModernBERT (encoder)
- **Size:** mini
- **Vocabulary:** 50,368 tokens (BPE, GPTNeoXTokenizerFast)
- **Context length:** 1,024 tokens
- **Language:** Luxembourgish (`lb`/`ltz`)
- **License:** CC BY-SA 4.0

## Usage

Requires `transformers>=4.48.0`.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("instilux/ltz-e1-mini")
model = AutoModelForMaskedLM.from_pretrained("instilux/ltz-e1-mini")

inputs = tokenizer("Wéi spéit [MASK] et?", return_tensors="pt")

# Column index of the [MASK] token within the (single) input sequence
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

with torch.no_grad():
    outputs = model(**inputs)

# Top 5 candidates for the masked position; scores are raw logits, not probabilities
top_tokens = outputs.logits[0, mask_pos].topk(5)
for token_id, score in zip(top_tokens.indices[0], top_tokens.values[0]):
    token = tokenizer.decode(token_id)
    print(f"{token:15s} {score:.3f}")
```

## Tokenizer Notes

The tokenizer is BPE-based (`GPTNeoXTokenizerFast`) and uses BERT-style special tokens (`[CLS]`, `[SEP]`, `[MASK]`, `[PAD]`). A `[CLS]` token is prepended automatically (`add_bos_token: true`).

## Citation

If you use this model in your work, please cite the following paper (preprint, accepted to ACL 2026 Findings):
@misc{plum2026ltzglueluxembourgishgenerallanguage,
  title={ltzGLUE: Luxembourgish General Language Understanding Evaluation},
  author={Alistair Plum and Felicia Körner and Anne-Marie Lutgen and Laura Bernardy and Fred Philippy and Emilia Milano and Nils Rehlinger and Cédric Lothritz and Tharindu Ranasinghe and Barbara Plank and Christoph Purschke},
  year={2026},
  eprint={2604.17976},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.17976},
}