# nllb200-formosan-en
Repo: FormosonBankDemos/nllb200-formosan-en
Base model: facebook/nllb-200-distilled-600M
Task: Bidirectional machine translation between 15 Formosan languages and English (eng_Latn).
This model is NLLB-200 (distilled, 600M) fine-tuned on a curated FormosanBank parallel corpus for Formosan ↔ English translation. It supports both directions with a single checkpoint.
## 1. Languages and codes
You must use NLLB language codes for `src_lang` and for the target language code when setting `forced_bos_token_id`.
| Language | Code |
|---|---|
| English | eng_Latn |
| Amis | ami_Latn |
| Bunun | bnn_Latn |
| Kavalan | ckv_Latn |
| Rukai | dru_Latn |
| Paiwan | pwn_Latn |
| Puyuma | pyu_Latn |
| Thao | ssf_Latn |
| Saaroa | sxr_Latn |
| Sakizaya | szy_Latn |
| Tao / Yami | tao_Latn |
| Atayal | tay_Latn |
| Seediq | trv_Latn |
| Tsou | tsu_Latn |
| Kanakanavu | xnb_Latn |
| Saisiyat | xsy_Latn |
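In code, the table above can be kept as a small lookup dict. The `LANG_CODES` dict and `nllb_code` helper below are illustrative names, not part of the model repo:

```python
# NLLB codes for the 15 Formosan languages plus English, mirroring the table above.
# LANG_CODES and nllb_code are illustrative helpers, not shipped with the model.
LANG_CODES = {
    "English": "eng_Latn",
    "Amis": "ami_Latn",
    "Bunun": "bnn_Latn",
    "Kavalan": "ckv_Latn",
    "Rukai": "dru_Latn",
    "Paiwan": "pwn_Latn",
    "Puyuma": "pyu_Latn",
    "Thao": "ssf_Latn",
    "Saaroa": "sxr_Latn",
    "Sakizaya": "szy_Latn",
    "Tao / Yami": "tao_Latn",
    "Atayal": "tay_Latn",
    "Seediq": "trv_Latn",
    "Tsou": "tsu_Latn",
    "Kanakanavu": "xnb_Latn",
    "Saisiyat": "xsy_Latn",
}

def nllb_code(language: str) -> str:
    """Return the NLLB code for a language name, with a clear error for unsupported ones."""
    try:
        return LANG_CODES[language]
    except KeyError:
        raise ValueError(f"Unsupported language: {language!r}") from None
```

Failing fast on an unsupported name is preferable to passing an unknown code through to the tokenizer, where it would silently map to the unknown token.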
## 2. How to use the model

### 2.1 Formosan → English (pipeline API)
```python
import torch
from transformers import pipeline

model_id = "FormosonBankDemos/nllb200-formosan-en"

translator = pipeline(
    task="translation",
    model=model_id,
    tokenizer=model_id,
    src_lang="ami_Latn",  # source language code
    tgt_lang="eng_Latn",  # target language code
    device=0 if torch.cuda.is_available() else -1,
)

text = "Pa'araw cingra to demak nira."
print(translator(text)[0]["translation_text"])
# e.g. "He revealed what he was doing."
```
Change `src_lang` to any other Formosan code (e.g. `bnn_Latn`, `pwn_Latn`) to translate that language → English.
### 2.2 English → Formosan (manual `generate` with forced BOS)
```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "FormosonBankDemos/nllb200-formosan-en"

tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
model.to("cuda" if torch.cuda.is_available() else "cpu")

sentence = "He revealed what he was doing."

# Encode with English as the source language
inputs = tokenizer(sentence, return_tensors="pt").to(model.device)

# Choose the Formosan target language code and force it as the first generated
# token. Do not also pass it as decoder_start_token_id: NLLB starts decoding
# from its default start token, and forced_bos_token_id alone selects the
# target language.
tgt_code = "ami_Latn"  # e.g. Amis
forced_bos = tokenizer.convert_tokens_to_ids(tgt_code)

outputs = model.generate(
    **inputs,
    forced_bos_token_id=forced_bos,
    max_new_tokens=48,
    num_beams=4,
    no_repeat_ngram_size=3,
    repetition_penalty=1.2,
    length_penalty=1.05,
    early_stopping=True,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```
### 2.3 Helper function for any direction
```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "FormosonBankDemos/nllb200-formosan-en"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
model.to("cuda" if torch.cuda.is_available() else "cpu")

def translate(text: str, src_code: str, tgt_code: str, max_new_tokens: int = 64) -> str:
    tokenizer.src_lang = src_code
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    forced_bos = tokenizer.convert_tokens_to_ids(tgt_code)
    outputs = model.generate(
        **inputs,
        forced_bos_token_id=forced_bos,
        max_new_tokens=max_new_tokens,
        num_beams=4,
        no_repeat_ngram_size=3,
        repetition_penalty=1.1,
        length_penalty=1.0,
        early_stopping=True,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

# Examples:
print(translate("Pa'araw cingra to demak nira.", "ami_Latn", "eng_Latn"))  # Amis → English
print(translate("Thank you all.", "eng_Latn", "xsy_Latn"))                 # English → Saisiyat
```
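The helper above translates one sentence per call. For larger files, the tokenizer accepts a list of strings (with `padding=True`) and `generate` runs on the whole batch, so it is useful to split the input into fixed-size batches first. The `batched` function below is a generic, model-independent sketch of that splitting step:

```python
from typing import Iterable, Iterator, List

def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive fixed-size batches from an iterable of sentences.

    The final batch may be shorter when the input length is not a
    multiple of batch_size.
    """
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch
```

Each yielded batch can then be tokenized and translated in one `generate` call, which is substantially faster on GPU than looping sentence by sentence.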
## 3. Evaluation
Evaluation uses a held-out split of the FormosanBank English Parallel Corpus, scored with standard MT metrics (BLEU, chrF2, TER).
### 3.1 Global metrics (all languages combined)
On 7,207 sentence pairs in each direction:
**Formosan → English** (`*_Latn` → `eng_Latn`)
- BLEU: 42.31
- chrF2: 49.51
- TER: 62.29

**English → Formosan** (`eng_Latn` → `*_Latn`)
- BLEU: 9.53
- chrF2: 32.06
- TER: 86.59
English → Formosan is substantially harder; outputs should be treated as drafts for human post-editing, not final translations.
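The chrF2 figures above come from standard scoring tools (e.g. sacreBLEU). As a rough illustration of what the metric measures, the toy function below computes a simplified character n-gram F2 score; it is not the official implementation and will not reproduce the reported numbers exactly:

```python
from collections import Counter

def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: average character n-gram precision and recall,
    combined as an F-beta score (beta=2 weights recall, as in chrF2).

    A toy sketch for intuition only, not the sacreBLEU implementation.
    """
    def ngrams(text: str, n: int) -> Counter:
        return Counter(text[i:i + n] for i in range(len(text) - n + 1))

    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = ngrams(hypothesis, n), ngrams(reference, n)
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        if sum(hyp.values()) > 0:
            precisions.append(overlap / sum(hyp.values()))
        if sum(ref.values()) > 0:
            recalls.append(overlap / sum(ref.values()))
    if not precisions or not recalls:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)
```

Because chrF operates on character n-grams rather than whole words, it gives partial credit for morphologically close outputs, which is why the English → Formosan chrF2 (32.06) looks far less bleak than the corresponding BLEU (9.53).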
### 3.2 Per-language metrics

**Formosan → English** (`*_Latn` → `eng_Latn`)
| Language (code) | Samples | BLEU | chrF2 | TER |
|---|---|---|---|---|
| Amis (ami_Latn) | 909 | 45.94 | 52.28 | 57.75 |
| Bunun (bnn_Latn) | 856 | 44.89 | 51.71 | 59.58 |
| Kavalan (ckv_Latn) | 367 | 26.56 | 39.31 | 75.03 |
| Rukai (dru_Latn) | 980 | 37.85 | 45.13 | 67.03 |
| Paiwan (pwn_Latn) | 566 | 47.45 | 52.75 | 56.13 |
| Puyuma (pyu_Latn) | 509 | 54.81 | 59.73 | 51.96 |
| Thao (ssf_Latn) | 138 | 50.26 | 54.98 | 53.44 |
| Saaroa (sxr_Latn) | 139 | 50.81 | 54.52 | 50.79 |
| Sakizaya (szy_Latn) | 224 | 44.88 | 53.68 | 61.94 |
| Tao / Yami (tao_Latn) | 155 | 29.63 | 36.84 | 77.37 |
| Atayal (tay_Latn) | 882 | 48.05 | 53.83 | 58.35 |
| Seediq (trv_Latn) | 668 | 42.71 | 50.09 | 61.31 |
| Tsou (tsu_Latn) | 223 | 22.77 | 32.86 | 86.81 |
| Kanakanavu (xnb_Latn) | 329 | 29.54 | 41.84 | 71.46 |
| Saisiyat (xsy_Latn) | 262 | 36.63 | 47.33 | 67.06 |
**English → Formosan** (`eng_Latn` → `*_Latn`)
| Language (code) | Samples | BLEU | chrF2 | TER |
|---|---|---|---|---|
| Amis (ami_Latn) | 909 | 11.33 | 33.68 | 79.53 |
| Bunun (bnn_Latn) | 856 | 5.33 | 31.59 | 91.83 |
| Kavalan (ckv_Latn) | 367 | 15.83 | 39.99 | 84.32 |
| Rukai (dru_Latn) | 980 | 3.06 | 25.58 | 102.19 |
| Paiwan (pwn_Latn) | 566 | 9.49 | 32.89 | 82.45 |
| Puyuma (pyu_Latn) | 509 | 12.62 | 35.47 | 80.41 |
| Thao (ssf_Latn) | 138 | 20.30 | 44.67 | 71.17 |
| Saaroa (sxr_Latn) | 139 | 15.15 | 42.96 | 79.97 |
| Sakizaya (szy_Latn) | 224 | 13.77 | 39.07 | 80.72 |
| Tao / Yami (tao_Latn) | 155 | 10.64 | 32.57 | 85.80 |
| Atayal (tay_Latn) | 882 | 4.26 | 23.54 | 92.42 |
| Seediq (trv_Latn) | 668 | 8.60 | 29.24 | 85.70 |
| Tsou (tsu_Latn) | 223 | 5.80 | 25.77 | 92.72 |
| Kanakanavu (xnb_Latn) | 329 | 12.95 | 40.75 | 78.61 |
| Saisiyat (xsy_Latn) | 262 | 17.84 | 39.26 | 78.86 |
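The global figures in 3.1 are corpus-level scores, not averages of the per-language rows, but the sample counts can be sanity-checked against them: the per-language test sets sum to the 7,207 pairs evaluated per direction. A small sketch (the `SAMPLES` dict is transcribed from the tables above):

```python
# Per-language test-set sizes, transcribed from the per-language tables above.
SAMPLES = {
    "ami_Latn": 909, "bnn_Latn": 856, "ckv_Latn": 367, "dru_Latn": 980,
    "pwn_Latn": 566, "pyu_Latn": 509, "ssf_Latn": 138, "sxr_Latn": 139,
    "szy_Latn": 224, "tao_Latn": 155, "tay_Latn": 882, "trv_Latn": 668,
    "tsu_Latn": 223, "xnb_Latn": 329, "xsy_Latn": 262,
}

total = sum(SAMPLES.values())
print(total)  # 7207, matching the global evaluation size in section 3.1
```

Note also the imbalance: the four largest test sets (Rukai, Amis, Atayal, Bunun) account for roughly half of the pairs, so the corpus-level scores are dominated by the better-resourced languages.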
## 4. Intended use & limitations

**Intended use**
- Research, teaching, and prototyping for Formosan ↔ English MT.
- Assisting linguists and community members with draft translations and exploration of bilingual text.
**Limitations**
- Outputs can be incorrect, ungrammatical, or culturally inappropriate, especially in the English → Formosan direction.
- Not suitable for legal, medical, or other high-stakes applications.
- Any community-facing or production use should involve human review by fluent speakers.
## 5. Citation
If you use this model, please cite:
```bibtex
@misc{nllb200-formosan-en,
  title        = {nllb200-formosan-en: NLLB-200 fine-tuned on 15 Formosan languages and English},
  author       = {FormosanBank / contributors},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/FormosonBankDemos/nllb200-formosan-en}}
}
```