OpenThai-NER

A fine-tuned version of Pavarissy/phayathaibert-thainer for Thai Named Entity Recognition (NER), trained on the JonusNattapong/OpenThai-NER dataset.

Model Details

  • Base model: Pavarissy/phayathaibert-thainer
  • Task: Token Classification (NER)
  • Language: Thai (th)
  • License: cc-by-3.0
  • Model size: ~0.3B parameters (safetensors, F32)

Intended Use

This model is intended for:

  • Extracting named entities from Thai text (people, organizations, locations, etc. depending on your label schema)
  • Downstream tasks like document indexing, entity-based search, analytics, and data labeling acceleration

Limitations / Known Issues

  • eval_loss is NaN:
    Common causes:

    • numerical instability from fp16 mixed precision
    • incorrect handling of the label mask (-100) in the loss computation
    • invalid label ids in a batch (out of range)
    • occasional overflow when logits become extreme

    To fix it, try disabling fp16, validating the labels, and ensuring your data collator pads labels with -100 (see the sketch after this list and the Reproducibility Notes below).
  • Domain shift: Performance may drop on domains/styles not present in training data (e.g., very informal slang, OCR noise, code-mixed text).

  • Entity boundary ambiguity: Thai tokenization and spacing can cause boundary errors (span too long/short), especially with uncommon names.
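A minimal sketch of the label check mentioned above, assuming a tokenized dataset whose examples carry a "labels" field (check_labels and tokenized_dataset are illustrative names, not part of this repo):

def check_labels(dataset, num_labels):
    # Every label id must be -100 (ignored by the loss) or a valid class id
    for i, example in enumerate(dataset):
        for label in example["labels"]:
            assert label == -100 or 0 <= label < num_labels, \
                f"example {i} has invalid label id {label}"

check_labels(tokenized_dataset["train"], model.config.num_labels)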

Evaluation

Final evaluation (epoch 3)

  • Precision: 0.8565
  • Recall: 0.8778
  • F1: 0.8670
  • Accuracy: 0.9565
  • Evaluation runtime: 29.82 s

Training summary

Epoch   Training Loss   Validation Loss   Precision   Recall     F1         Accuracy
1       0.369000        NaN               0.787043    0.824356   0.805268   0.932532
2       0.237600        NaN               0.841745    0.855728   0.848679   0.949934
3       0.195900        NaN               0.856493    0.877835   0.867033   0.956475

The validation loss being NaN in every epoch, while precision, recall, F1, and accuracy improve steadily, points to a loss-computation or logging issue rather than a diverging model: a genuinely unstable run would not show such a consistent metric trend.
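To confirm this locally, check the loss on a single collated evaluation batch (a sketch; batch is a placeholder for one batch containing input_ids, attention_mask, and labels):

import torch

with torch.no_grad():
    out = model(**batch)      # labels are present, so out.loss is computed
print(torch.isnan(out.loss))  # tensor(True) means the loss itself is NaN, not just the log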

How to Use

Transformers Pipeline

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

repo_id = "JonusNattapong/OpenThai-NER"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForTokenClassification.from_pretrained(repo_id)

ner = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
text = "กรุงเทพมหานครเป็นเมืองหลวงของประเทศไทย และบริษัท ABC จำกัด ตั้งอยู่ที่บางนา"
print(ner(text))
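With aggregation_strategy="simple", subword tokens are merged into entity spans; each result is a dict with entity_group, score, word, start, and end keys.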

Manual inference

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

repo_id = "JonusNattapong/OpenThai-NER"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForTokenClassification.from_pretrained(repo_id)

text = "นายสมชายทำงานที่กระทรวงการคลัง"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Map each token to its highest-scoring label name
pred_ids = outputs.logits.argmax(-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[i] for i in pred_ids]

print(list(zip(tokens, labels)))
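Note that these are token-level predictions over subword tokens (including any special tokens the tokenizer adds); to recover entity spans you still need to merge consecutive B-/I- tags, assuming the usual BIO scheme, which the pipeline's aggregation_strategy handles for you.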

Training Details

  • Epochs: 3
  • Learning Rate: 2e-5
  • Batch Size: 16
  • Framework: Hugging Face Transformers Trainer

Reproducibility Notes

If you want fully stable loss:

  • Set fp16=False (or bf16=True if available)
  • Validate label ids are within [0, num_labels-1] or -100
  • Ensure the data collator pads labels with -100
  • Try gradient clipping (e.g., max_grad_norm=1.0)
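A minimal sketch of these settings with the Trainer, assuming train_dataset and eval_dataset are already tokenized with aligned labels (those names, like output_dir, are placeholders):

from transformers import (
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

args = TrainingArguments(
    output_dir="openthai-ner",
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    fp16=False,         # avoid fp16 numerical instability (or bf16=True if supported)
    max_grad_norm=1.0,  # gradient clipping
)

# Pads inputs normally and pads labels with -100 so padding is ignored by the loss
collator = DataCollatorForTokenClassification(tokenizer)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=collator,
)
trainer.train()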

Citation

If you use this model or the dataset in your research, please cite:

@dataset{thainer_model_2025,
  title={Thai Named Entity Recognition},
  author={Nattapong Tapachoom},
  year={2025},
  howpublished={https://github.com/JonusNattapong/Natural-Language-Processing}
}

Repo: JonusNattapong/OpenThai-NER
Base model: Pavarissy/phayathaibert-thainer
