PhoBERT Tourism Sentiment Classifier

Fine-tuned PhoBERT model for Vietnamese tourism comment sentiment analysis.

Model Description

This model classifies Vietnamese tourism comments into 3 sentiment categories:

positive (Tích cực)
neutral (Trung lập)
negative (Tiêu cực)

Training Data

Language: Vietnamese only
Dataset: 11,371 Vietnamese tourism comments from social media
Sources: Google Maps, TikTok, YouTube, Facebook
Split: 80% train (9,096), 20% validation (2,275)
Quality: Filtered for meaningful content (comments with fewer than 10 words removed)
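
The filter and split figures above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: `has_min_words` and `split_sizes` are hypothetical helpers, counting words by whitespace is an assumption (the card does not say how words were counted), and rounding the validation share up is an assumption that happens to match the reported sizes.

```python
import math

MIN_WORDS = 10            # threshold stated in the card
TOTAL_COMMENTS = 11_371   # dataset size stated in the card
VAL_FRACTION = 0.20       # validation share stated in the card


def has_min_words(comment: str, min_words: int = MIN_WORDS) -> bool:
    """Hypothetical filter: keep comments with at least `min_words` whitespace-separated tokens."""
    return len(comment.split()) >= min_words


def split_sizes(total: int, val_fraction: float) -> tuple[int, int]:
    """Return (train, validation) sizes, rounding the validation share up (assumption)."""
    n_val = math.ceil(total * val_fraction)
    return total - n_val, n_val


train_size, val_size = split_sizes(TOTAL_COMMENTS, VAL_FRACTION)
print(train_size, val_size)  # 9096 2275
```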

Sentiment Distribution

Sentiment	Count	Percentage
Positive	7,669	67.4%
Neutral	1,894	16.7%
Negative	1,808	15.9%
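
The percentages in the table follow directly from the counts. The sketch below recomputes them and also derives the accuracy of a trivial always-predict-positive baseline on the full dataset, which gives some context for the accuracy reported later. The baseline is an illustration added here, not a figure the card reports.

```python
# Class counts copied from the table above.
counts = {"positive": 7_669, "neutral": 1_894, "negative": 1_808}
total = sum(counts.values())

# Share of each class, in percent.
shares = {label: 100 * n / total for label, n in counts.items()}
for label, share in shares.items():
    print(f"{label:<9}{counts[label]:>6}  {share:.1f}%")

# Accuracy obtained by always predicting the most frequent class.
majority_label = max(counts, key=counts.get)
majority_baseline = shares[majority_label]
print(f"Always predicting '{majority_label}' scores {majority_baseline:.1f}% accuracy")
```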

Performance

Scores are measured on the 2,275-comment validation split at the best epoch (epoch 3).

Metric	Score
Accuracy	87.25%
F1 Macro	79.59%
F1 Weighted	87.26%

Per-Class Performance

Sentiment	Precision	Recall	F1-Score	Support
Positive	94.36%	94.85%	94.60%	1,534
Neutral	64.24%	79.16%	70.92%	379
Negative	86.47%	63.54%	73.25%	362

Confusion Matrix

                  Predicted
              pos    neu    neg
Actual  pos  1455     63     16
        neu    59    300     20
        neg    28    104    230
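
Every figure in the Performance and Per-Class Performance tables can be derived from this matrix. The sketch below does so in plain Python; `per_class_metrics` is a hypothetical helper name, and it assumes every class is predicted at least once (true here), so no division by zero occurs.

```python
labels = ["positive", "neutral", "negative"]
# Rows are actual classes, columns are predicted classes, as in the matrix above.
confusion = [
    [1455, 63, 16],
    [59, 300, 20],
    [28, 104, 230],
]


def per_class_metrics(matrix):
    """Return {class_index: (precision, recall, f1, support)} for a square confusion matrix."""
    metrics = {}
    for i in range(len(matrix)):
        true_pos = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column sum: everything predicted as class i
        support = sum(matrix[i])                   # row sum: everything that really is class i
        precision = true_pos / predicted
        recall = true_pos / support
        f1 = 2 * precision * recall / (precision + recall)
        metrics[i] = (precision, recall, f1, support)
    return metrics


metrics = per_class_metrics(confusion)
total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[i][i] for i in range(len(labels))) / total
f1_macro = sum(m[2] for m in metrics.values()) / len(metrics)
f1_weighted = sum(m[2] * m[3] for m in metrics.values()) / total

for i, (precision, recall, f1, support) in metrics.items():
    print(f"{labels[i]:<9} P={precision:.2%} R={recall:.2%} F1={f1:.2%} n={support}")
print(f"accuracy={accuracy:.2%} f1_macro={f1_macro:.2%} f1_weighted={f1_weighted:.2%}")
```

The neutral column shows where most errors go: 104 of the 362 negative comments are predicted as neutral, which explains the low negative recall and the low neutral precision.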

Usage

import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

# Define the model architecture (it must match the one used to produce the checkpoint)
class PhoBERTSentimentClassifier(nn.Module):
    def __init__(self, n_classes=3, dropout=0.3):
        super().__init__()
        self.phobert = AutoModel.from_pretrained('vinai/phobert-base')
        self.dropout = nn.Dropout(dropout)
        # Linear head: hidden_size (768 for phobert-base) -> n_classes
        self.classifier = nn.Linear(self.phobert.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.phobert(input_ids=input_ids, attention_mask=attention_mask)
        # Pooled representation of the first token; same tensor as outputs[1]
        pooled_output = outputs.pooler_output
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)
        return logits

# Load model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = AutoTokenizer.from_pretrained('vinai/phobert-base')

model = PhoBERTSentimentClassifier(n_classes=3)
checkpoint = torch.load('phobert_sentiment_best_model.pt', map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])
model = model.to(device)
model.eval()

# Predict
text = "Bãi biển đẹp quá, tôi rất thích!"
encoding = tokenizer(
    text,
    add_special_tokens=True,
    max_length=256,
    padding='max_length',
    truncation=True,
    return_attention_mask=True,
    return_tensors='pt'
)

input_ids = encoding['input_ids'].to(device)
attention_mask = encoding['attention_mask'].to(device)

with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    probs = torch.softmax(outputs, dim=1)
    confidence, predicted = torch.max(probs, dim=1)

sentiments = ['positive', 'neutral', 'negative']
print(f"Sentiment: {sentiments[predicted.item()]} ({confidence.item():.4f})")
# Output: Sentiment: positive (0.9965)
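
The last step of the snippet, turning three logits into a label and a confidence, can be isolated as shown below. This is a dependency-free sketch: `softmax`, `decode_prediction`, and `DISPLAY_NAMES` are hypothetical names, the example logits are made up, and the label order is taken from the `sentiments` list in the snippet above.

```python
import math

# Index order follows the `sentiments` list used in the usage snippet.
LABELS = ["positive", "neutral", "negative"]
# Vietnamese display names as listed in the model description.
DISPLAY_NAMES = {"positive": "Tích cực", "neutral": "Trung lập", "negative": "Tiêu cực"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_prediction(logits):
    """Return (label, Vietnamese display name, confidence) for one row of logits."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    label = LABELS[best]
    return label, DISPLAY_NAMES[label], probs[best]


# Made-up logits for illustration; real values come from the model's forward pass.
label, display, confidence = decode_prediction([4.1, -1.2, -2.3])
print(f"Sentiment: {label} / {display} ({confidence:.4f})")
```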

Training Details

Base Model: vinai/phobert-base
Architecture: PhoBERT + Dropout (0.3) + Linear (768 → 3)
Parameters: ~135M (base) + 2,307 (classifier)
Training Time: ~20-25 minutes on a CUDA GPU
Epochs: 5 (best validation accuracy and F1 macro at epoch 3)
Batch Size: 16
Learning Rate: 2e-5
Optimizer: AdamW with linear warmup
Loss: CrossEntropyLoss
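
The numbers above determine the size of the classifier head and the number of optimizer steps. The sketch below works them out and shows one plausible shape for the learning-rate schedule. The warmup fraction, the linear decay after warmup, and keeping the last partial batch are all assumptions; the card only states "linear warmup".

```python
import math

HIDDEN_SIZE, N_CLASSES = 768, 3
TRAIN_SIZE, BATCH_SIZE, EPOCHS = 9_096, 16, 5
PEAK_LR = 2e-5
WARMUP_FRACTION = 0.1  # assumption: the card does not state the warmup length

classifier_params = HIDDEN_SIZE * N_CLASSES + N_CLASSES  # weights + biases
steps_per_epoch = math.ceil(TRAIN_SIZE / BATCH_SIZE)     # assumption: last partial batch is kept
total_steps = steps_per_epoch * EPOCHS
warmup_steps = int(WARMUP_FRACTION * total_steps)


def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then linear decay to zero (the decay is an assumption)."""
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    return PEAK_LR * max(0, total_steps - step) / max(1, total_steps - warmup_steps)


print(classifier_params, steps_per_epoch, total_steps, warmup_steps)
```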

Training Progress

Epoch	Train Loss	Train Acc	Val Loss	Val Acc	F1 Macro
1	0.6369	73.76%	0.4033	84.44%	75.33%
2	0.3697	86.77%	0.3681	86.59%	78.88%
3	0.2565	91.60%	0.3863	87.25%	79.59% ⭐
4	0.1857	94.49%	0.5379	87.12%	79.47%
5	0.1369	96.19%	0.5483	86.73%	79.00%
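
The starred row can be recovered by selecting the epoch with the highest validation accuracy (or F1 macro). The sketch below does that from the table and also reports the lowest-validation-loss epoch and the train/validation accuracy gap. Which criterion was actually used to save the best checkpoint is an assumption; the card only marks epoch 3 as best.

```python
from collections import namedtuple

Epoch = namedtuple("Epoch", "epoch train_loss train_acc val_loss val_acc f1_macro")

# Values copied from the table above.
history = [
    Epoch(1, 0.6369, 73.76, 0.4033, 84.44, 75.33),
    Epoch(2, 0.3697, 86.77, 0.3681, 86.59, 78.88),
    Epoch(3, 0.2565, 91.60, 0.3863, 87.25, 79.59),
    Epoch(4, 0.1857, 94.49, 0.5379, 87.12, 79.47),
    Epoch(5, 0.1369, 96.19, 0.5483, 86.73, 79.00),
]

best_by_acc = max(history, key=lambda e: e.val_acc)
best_by_f1 = max(history, key=lambda e: e.f1_macro)
best_by_loss = min(history, key=lambda e: e.val_loss)

# A widening gap between train and validation accuracy is a common sign of overfitting.
acc_gap = {e.epoch: round(e.train_acc - e.val_acc, 2) for e in history}

print(best_by_acc.epoch, best_by_f1.epoch, best_by_loss.epoch)
print(acc_gap)
```

Validation loss is lowest at epoch 2 while accuracy and F1 macro peak at epoch 3; after that, training accuracy keeps rising as validation metrics decline, which is consistent with overfitting.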

Features

✅ High Accuracy: 87.25% overall accuracy on the validation split
✅ Excellent for Positive: 94.6% F1 for positive sentiment
✅ All Three Classes Covered: F1 of 70.9% (neutral) and 73.3% (negative) despite class imbalance
✅ Fast Inference: ~50ms per comment on GPU
✅ Production Ready: Used in a real-world tourism monitoring system

Limitations

Vietnamese text only (not multilingual)
Trained on tourism domain (may not generalize to other domains)
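Noticeably lower performance on the neutral and negative classes (F1 of 70.92% and 73.25%, versus 94.60% for positive), likely due to class imbalance; 104 of 362 negative validation comments were predicted as neutral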
Requires GPU for optimal inference speed
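
One common way to counter the imbalance mentioned above is to weight the loss by inverse class frequency. The card reports a plain CrossEntropyLoss, so the sketch below is a possible mitigation, not what was done for this model. It uses the full-dataset counts (the training-split counts are not given) and the "balanced" heuristic N / (K * n_c), both of which are assumptions.

```python
# Class counts from the Sentiment Distribution table (full dataset, not the training split).
counts = {"positive": 7_669, "neutral": 1_894, "negative": 1_808}
total = sum(counts.values())
n_classes = len(counts)

# "Balanced" heuristic: weight_c = N / (K * n_c), so rarer classes get larger weights.
class_weights = {label: total / (n_classes * n) for label, n in counts.items()}

for label, weight in class_weights.items():
    print(f"{label:<9}{weight:.4f}")
```

With PyTorch, these values could be passed in label order as a tensor to the `weight` argument of `torch.nn.CrossEntropyLoss`.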

Use Cases

Tourism review sentiment analysis
Social media monitoring for tourism destinations
Customer feedback analysis for hotels/attractions
Tourism demand analysis
Automated content moderation

Citation

@misc{phobert-tourism-sentiment,
  author = {Strawberry0604},
  title = {PhoBERT Tourism Sentiment Classifier},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Strawberry0604/phobert-tourism-sentiment}}
}

Contact

Repository: tourism-data-monitor
HuggingFace: @Strawberry0604

Model Card Authors

Strawberry0604

Model Card Contact

For questions and feedback, please open an issue in the GitHub repository.
