nllb200-formosan-en

Repo: FormosonBankDemos/nllb200-formosan-en
Base model: facebook/nllb-200-distilled-600M
Task: Bidirectional machine translation between 15 Formosan languages and English (eng_Latn).

This model fine-tunes NLLB-200 (distilled 600M) on a curated FormosanBank parallel corpus for Formosan ↔ English. It supports both directions with a single checkpoint.


1. Languages and codes

Both src_lang and the target language token passed as forced_bos_token_id must be NLLB-style language codes from the table below (a quick way to check that a code is in the tokenizer vocabulary is sketched after the table).

| Language | Code |
|---|---|
| English | eng_Latn |
| Amis | ami_Latn |
| Bunun | bnn_Latn |
| Kavalan | ckv_Latn |
| Rukai | dru_Latn |
| Paiwan | pwn_Latn |
| Puyuma | pyu_Latn |
| Thao | ssf_Latn |
| Saaroa | sxr_Latn |
| Sakizaya | szy_Latn |
| Tao / Yami | tao_Latn |
| Atayal | tay_Latn |
| Seediq | trv_Latn |
| Tsou | tsu_Latn |
| Kanakanavu | xnb_Latn |
| Saisiyat | xsy_Latn |
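All of these codes should resolve to real tokens in the fine-tuned tokenizer. A quick sanity check (a minimal sketch, assuming only the transformers library and the repo id above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("FormosonBankDemos/nllb200-formosan-en")

codes = [
    "eng_Latn", "ami_Latn", "bnn_Latn", "ckv_Latn", "dru_Latn", "pwn_Latn",
    "pyu_Latn", "ssf_Latn", "sxr_Latn", "szy_Latn", "tao_Latn", "tay_Latn",
    "trv_Latn", "tsu_Latn", "xnb_Latn", "xsy_Latn",
]

for code in codes:
    # convert_tokens_to_ids returns the unk id for tokens missing from the vocabulary
    assert tokenizer.convert_tokens_to_ids(code) != tokenizer.unk_token_id, f"{code} not in vocabulary"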

2. How to use the model

2.1 Formosan → English (pipeline API)

import torch
from transformers import pipeline

model_id = "FormosonBankDemos/nllb200-formosan-en"

translator = pipeline(
    task="translation",
    model=model_id,
    tokenizer=model_id,
    src_lang="ami_Latn",   # source language code
    tgt_lang="eng_Latn",   # target language code
    device=0 if torch.cuda.is_available() else -1,
)

text = "Pa'araw cingra to demak nira."
print(translator(text)[0]["translation_text"])
# e.g. "He revealed what he was doing."

Change src_lang to any other Formosan code (e.g. bnn_Latn, pwn_Latn) to translate from that language into English.
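The pipeline also accepts a list of sentences, which is typically faster than calling it once per string. A minimal sketch (batch_size and max_length are standard pipeline arguments; the values are only examples):

sentences = [
    "Pa'araw cingra to demak nira.",
    # add further Amis sentences here
]

for result in translator(sentences, batch_size=8, max_length=128):
    print(result["translation_text"])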


2.2 English → Formosan (manual generate with forced BOS)

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_id = "FormosonBankDemos/nllb200-formosan-en"

tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
model.to("cuda" if torch.cuda.is_available() else "cpu")

sentence = "He revealed what he was doing."

# Encode with English as source
inputs = tokenizer(sentence, return_tensors="pt").to(model.device)

# Choose the Formosan target language code
tgt_code = "ami_Latn"  # e.g. Amis

forced_bos = tokenizer.convert_tokens_to_ids(tgt_code)

outputs = model.generate(
    **inputs,
    forced_bos_token_id=forced_bos,     # force the target language token as the first generated token
    decoder_start_token_id=forced_bos,  # also start decoding from the target language token
    max_new_tokens=48,
    num_beams=4,
    no_repeat_ngram_size=3,
    repetition_penalty=1.2,
    length_penalty=1.05,
    early_stopping=True,
)

print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
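Because the encoded source does not depend on the target language, the same inputs can be reused to generate into several Formosan languages. A minimal sketch continuing from the code above (the list of target codes is just an example):

# Reuse the English-encoded inputs for several target languages
for tgt_code in ["ami_Latn", "pwn_Latn", "tay_Latn"]:
    forced_bos = tokenizer.convert_tokens_to_ids(tgt_code)
    outputs = model.generate(
        **inputs,
        forced_bos_token_id=forced_bos,
        decoder_start_token_id=forced_bos,
        max_new_tokens=48,
        num_beams=4,
    )
    print(tgt_code, tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])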

2.3 Helper function for any direction

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_id = "FormosonBankDemos/nllb200-formosan-en"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
model.to("cuda" if torch.cuda.is_available() else "cpu")

def translate(text: str, src_code: str, tgt_code: str, max_new_tokens: int = 64) -> str:
    # Tell the tokenizer which language token to prepend to the source text
    tokenizer.src_lang = src_code
    inputs = tokenizer(text, return_tensors="pt").to(model.device)

    forced_bos = tokenizer.convert_tokens_to_ids(tgt_code)
    outputs = model.generate(
        **inputs,
        forced_bos_token_id=forced_bos,
        decoder_start_token_id=forced_bos,
        max_new_tokens=max_new_tokens,
        num_beams=4,
        no_repeat_ngram_size=3,
        repetition_penalty=1.1,
        length_penalty=1.0,
        early_stopping=True,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

# Examples:
print(translate("Pa'araw cingra to demak nira.", "ami_Latn", "eng_Latn"))  # Amis β†’ English
print(translate("Thank you all.", "eng_Latn", "xsy_Latn"))                 # English β†’ Saisiyat

3. Evaluation

Evaluation uses a held-out portion of the FormosanBank English Parallel Corpus and standard MT metrics (BLEU, chrF2, TER).

3.1 Global metrics (all languages combined)

On 7,207 sentence pairs in each direction:

  • Formosan → English (*_Latn → eng_Latn)

    • BLEU: 42.31
    • chrF2: 49.51
    • TER: 62.29

  • English → Formosan (eng_Latn → *_Latn)

    • BLEU: 9.53
    • chrF2: 32.06
    • TER: 86.59

English → Formosan is substantially harder; outputs should be treated as drafts for human post-editing, not final translations.
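The evaluation script itself is not reproduced here, but BLEU, chrF2, and TER of this kind can be computed with sacrebleu (a minimal sketch, assuming parallel lists of system outputs and reference translations; sacrebleu's default CHRF settings correspond to chrF2):

from sacrebleu.metrics import BLEU, CHRF, TER

hypotheses = ["He revealed what he was doing."]    # system outputs, one string per segment
references = [["He revealed what he was doing."]]  # one list of references per reference stream

print("BLEU :", BLEU().corpus_score(hypotheses, references).score)
print("chrF2:", CHRF().corpus_score(hypotheses, references).score)  # defaults: char_order=6, beta=2
print("TER  :", TER().corpus_score(hypotheses, references).score)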

3.2 Per-language metrics

Formosan → English (*_Latn → eng_Latn)

| Language (code) | Samples | BLEU | chrF2 | TER |
|---|---|---|---|---|
| Amis (ami_Latn) | 909 | 45.94 | 52.28 | 57.75 |
| Bunun (bnn_Latn) | 856 | 44.89 | 51.71 | 59.58 |
| Kavalan (ckv_Latn) | 367 | 26.56 | 39.31 | 75.03 |
| Rukai (dru_Latn) | 980 | 37.85 | 45.13 | 67.03 |
| Paiwan (pwn_Latn) | 566 | 47.45 | 52.75 | 56.13 |
| Puyuma (pyu_Latn) | 509 | 54.81 | 59.73 | 51.96 |
| Thao (ssf_Latn) | 138 | 50.26 | 54.98 | 53.44 |
| Saaroa (sxr_Latn) | 139 | 50.81 | 54.52 | 50.79 |
| Sakizaya (szy_Latn) | 224 | 44.88 | 53.68 | 61.94 |
| Tao / Yami (tao_Latn) | 155 | 29.63 | 36.84 | 77.37 |
| Atayal (tay_Latn) | 882 | 48.05 | 53.83 | 58.35 |
| Seediq (trv_Latn) | 668 | 42.71 | 50.09 | 61.31 |
| Tsou (tsu_Latn) | 223 | 22.77 | 32.86 | 86.81 |
| Kanakanavu (xnb_Latn) | 329 | 29.54 | 41.84 | 71.46 |
| Saisiyat (xsy_Latn) | 262 | 36.63 | 47.33 | 67.06 |

English → Formosan (eng_Latn → *_Latn)

| Language (code) | Samples | BLEU | chrF2 | TER |
|---|---|---|---|---|
| Amis (ami_Latn) | 909 | 11.33 | 33.68 | 79.53 |
| Bunun (bnn_Latn) | 856 | 5.33 | 31.59 | 91.83 |
| Kavalan (ckv_Latn) | 367 | 15.83 | 39.99 | 84.32 |
| Rukai (dru_Latn) | 980 | 3.06 | 25.58 | 102.19 |
| Paiwan (pwn_Latn) | 566 | 9.49 | 32.89 | 82.45 |
| Puyuma (pyu_Latn) | 509 | 12.62 | 35.47 | 80.41 |
| Thao (ssf_Latn) | 138 | 20.30 | 44.67 | 71.17 |
| Saaroa (sxr_Latn) | 139 | 15.15 | 42.96 | 79.97 |
| Sakizaya (szy_Latn) | 224 | 13.77 | 39.07 | 80.72 |
| Tao / Yami (tao_Latn) | 155 | 10.64 | 32.57 | 85.80 |
| Atayal (tay_Latn) | 882 | 4.26 | 23.54 | 92.42 |
| Seediq (trv_Latn) | 668 | 8.60 | 29.24 | 85.70 |
| Tsou (tsu_Latn) | 223 | 5.80 | 25.77 | 92.72 |
| Kanakanavu (xnb_Latn) | 329 | 12.95 | 40.75 | 78.61 |
| Saisiyat (xsy_Latn) | 262 | 17.84 | 39.26 | 78.86 |

4. Intended use & limitations

Intended use

  • Research, teaching, and prototyping for Formosan ↔ English MT.
  • Assisting linguists and community members with draft translations and exploration of bilingual text.

Limitations

  • Outputs can be incorrect, ungrammatical, or culturally inappropriate, especially in the English → Formosan direction.
  • Not suitable for legal, medical, or other high-stakes applications.
  • Any community-facing or production use should involve human review by fluent speakers.

5. Citation

If you use this model, please cite:

@misc{nllb200-formosan-en,
  title  = {nllb200-formosan-en: NLLB-200 fine-tuned on 15 Formosan languages and English},
  author = {FormosanBank / contributors},
  year   = {2025},
  howpublished = {\url{https://huggingface.co/FormosonBankDemos/nllb200-formosan-en}}
}