PhoBERT Tourism Sentiment Classifier
A fine-tuned PhoBERT model for sentiment analysis of Vietnamese tourism comments.
Model Description
This model classifies Vietnamese tourism comments into 3 sentiment categories:
- positive (Tích cực)
- neutral (Trung lập)
- negative (Tiêu cực)
Training Data
- Language: Vietnamese only
- Dataset: 11,371 Vietnamese tourism comments from social media
- Sources: Google Maps, TikTok, YouTube, Facebook
- Split: 80% train (9,096), 20% validation (2,275)
- Quality: Filtered for meaningful content (min 10 words)
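The filtering and split described above can be sketched as follows. This is a minimal illustration, not the author's preprocessing code: counting words by whitespace, shuffling with a fixed seed, and the function name `filter_and_split` are all assumptions. Rounding the validation size up reproduces the 9,096 / 2,275 counts listed above.

```python
import math
import random


def filter_and_split(comments, min_words=10, val_fraction=0.2, seed=42):
    """Drop short comments, then shuffle and split into (train, validation).

    Assumptions: words are counted by whitespace, and the split is a plain
    random split (the card does not say whether it was stratified).
    """
    kept = [c for c in comments if len(c.split()) >= min_words]
    rng = random.Random(seed)
    rng.shuffle(kept)
    # Round the validation size up, so 11,371 comments give 2,275 / 9,096.
    n_val = math.ceil(len(kept) * val_fraction)
    return kept[n_val:], kept[:n_val]
```

For a dataset with 11,371 comments that pass the filter, this returns 9,096 training and 2,275 validation comments.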
Sentiment Distribution
| Sentiment | Count | Percentage |
|---|---|---|
| Positive | 7,669 | 67.4% |
| Neutral | 1,894 | 16.7% |
| Negative | 1,808 | 15.9% |
Performance
| Metric | Score |
|---|---|
| Accuracy | 87.25% |
| F1 Macro | 79.59% |
| F1 Weighted | 87.26% |
Per-Class Performance
| Sentiment | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Positive | 94.36% | 94.85% | 94.60% | 1,534 |
| Neutral | 64.24% | 79.16% | 70.92% | 379 |
| Negative | 86.47% | 63.54% | 73.25% | 362 |
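As a consistency check, the macro and weighted F1 scores reported under Performance can be recomputed from the per-class F1 and support columns above. The sketch below uses only the numbers in the table; the function names are illustrative.

```python
# Per-class F1 (%) and support, copied from the table above.
PER_CLASS = {
    "positive": (94.60, 1534),
    "neutral": (70.92, 379),
    "negative": (73.25, 362),
}


def macro_f1(scores):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1 for f1, _ in scores.values()) / len(scores)


def weighted_f1(scores):
    """Mean of the per-class F1 scores, weighted by class support."""
    total = sum(support for _, support in scores.values())
    return sum(f1 * support for f1, support in scores.values()) / total


print(f"F1 macro:    {macro_f1(PER_CLASS):.2f}%")     # F1 macro:    79.59%
print(f"F1 weighted: {weighted_f1(PER_CLASS):.2f}%")  # F1 weighted: 87.26%
```

The gap between the two scores reflects the class imbalance: the weighted score is dominated by the positive class, while the macro score gives neutral and negative equal weight.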
Confusion Matrix
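Rows are actual labels; columns are predicted labels.
| Actual \ Predicted | Positive | Neutral | Negative |
|---|---|---|---|
| Positive | 1,455 | 63 | 16 |
| Neutral | 59 | 300 | 20 |
| Negative | 28 | 104 | 230 |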
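The per-class precision, recall, and F1 values, as well as the overall accuracy, follow directly from the confusion matrix. A short sketch that derives them, using only the counts above (the helper names are illustrative):

```python
LABELS = ["positive", "neutral", "negative"]

# Rows are actual labels, columns are predicted labels.
CONFUSION = [
    [1455, 63, 16],
    [59, 300, 20],
    [28, 104, 230],
]


def per_class_metrics(matrix):
    """Return a list of (precision, recall, f1) tuples, one per class."""
    metrics = []
    for i in range(len(matrix)):
        true_positive = matrix[i][i]
        predicted_as_i = sum(row[i] for row in matrix)  # column sum
        actually_i = sum(matrix[i])                     # row sum
        precision = true_positive / predicted_as_i
        recall = true_positive / actually_i
        f1 = 2 * precision * recall / (precision + recall)
        metrics.append((precision, recall, f1))
    return metrics


def accuracy(matrix):
    """Share of samples on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)


for label, (p, r, f1) in zip(LABELS, per_class_metrics(CONFUSION)):
    print(f"{label:<8} precision={p:.2%} recall={r:.2%} f1={f1:.2%}")
print(f"accuracy={accuracy(CONFUSION):.2%}")
```

The largest off-diagonal cell is the 104 negative comments predicted as neutral, which accounts for both the low negative recall and the low neutral precision.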
Usage
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

# Define the model architecture: PhoBERT encoder + dropout + linear classification head
class PhoBERTSentimentClassifier(nn.Module):
    def __init__(self, n_classes=3, dropout=0.3):
        super().__init__()
        self.phobert = AutoModel.from_pretrained('vinai/phobert-base')
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(self.phobert.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.phobert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs[1]  # pooled representation of the first token
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)
        return logits

# Load the tokenizer and the fine-tuned weights
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = AutoTokenizer.from_pretrained('vinai/phobert-base')
model = PhoBERTSentimentClassifier(n_classes=3)
checkpoint = torch.load('phobert_sentiment_best_model.pt', map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])
model = model.to(device)
model.eval()  # disable dropout for inference

# Predict
text = "Bãi biển đẹp quá, tôi rất thích!"
encoding = tokenizer(
    text,
    add_special_tokens=True,
    max_length=256,
    padding='max_length',
    truncation=True,
    return_attention_mask=True,
    return_tensors='pt'
)
input_ids = encoding['input_ids'].to(device)
attention_mask = encoding['attention_mask'].to(device)

with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    probs = torch.softmax(outputs, dim=1)
    confidence, predicted = torch.max(probs, dim=1)

# Label order must match the class indices used during training
sentiments = ['positive', 'neutral', 'negative']
print(f"Sentiment: {sentiments[predicted.item()]} ({confidence.item():.4f})")
# Output: Sentiment: positive (0.9965)
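The post-processing at the end of the usage snippet (softmax, argmax, label lookup) can be illustrated without PyTorch. The sketch below is a hypothetical helper, not part of the released code: `decode_logits` and the `min_confidence` threshold of 0.6 are assumptions, added to show one way of flagging uncertain predictions for manual review.

```python
import math

# Index order taken from the usage snippet above.
SENTIMENTS = ["positive", "neutral", "negative"]


def decode_logits(logits, min_confidence=0.6):
    """Turn one row of logits into (label, confidence, needs_review).

    `min_confidence` is an illustrative threshold, not a value from the
    model card; predictions below it are flagged for manual review.
    """
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return SENTIMENTS[best], probs[best], probs[best] < min_confidence


label, confidence, needs_review = decode_logits([4.0, 0.5, -1.0])
print(label, round(confidence, 4), needs_review)
```

A threshold like this may be worth tuning on the validation set, since most errors in the confusion matrix involve the neutral class.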
Training Details
- Base Model: vinai/phobert-base
- Architecture: PhoBERT + Dropout (0.3) + Linear (768 → 3)
- Parameters: ~135M (base) + 2,307 (classifier)
- Training Time: ~20-25 minutes on a CUDA GPU
- Epochs: 5 (best checkpoint at epoch 3, which has the highest validation accuracy and F1 macro)
- Batch Size: 16
- Learning Rate: 2e-5
- Optimizer: AdamW with linear warmup
- Loss: CrossEntropyLoss
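Some of the numbers in the list above can be derived, and the learning-rate schedule can be sketched in plain Python. The card only states "linear warmup"; the 10% warmup fraction, the linear decay to zero afterwards, and keeping the final partial batch are assumptions made here for illustration. The resulting shape matches what `get_linear_schedule_with_warmup` in Hugging Face Transformers produces.

```python
import math

HIDDEN_SIZE = 768
N_CLASSES = 3
TRAIN_SIZE = 9096
BATCH_SIZE = 16
EPOCHS = 5
PEAK_LR = 2e-5
WARMUP_FRACTION = 0.1  # assumption: not stated in the model card

# Linear head: one weight per (hidden unit, class) plus one bias per class.
classifier_params = HIDDEN_SIZE * N_CLASSES + N_CLASSES  # 2307

# Assumes the last, smaller batch of each epoch is kept.
steps_per_epoch = math.ceil(TRAIN_SIZE / BATCH_SIZE)  # 569
total_steps = steps_per_epoch * EPOCHS                # 2845
warmup_steps = int(total_steps * WARMUP_FRACTION)     # 284


def learning_rate(step):
    """Linear warmup to PEAK_LR, then linear decay to zero (assumed)."""
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return PEAK_LR * remaining / max(1, total_steps - warmup_steps)


for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{learning_rate(step):.2e}")
```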
Training Progress
| Epoch | Train Loss | Train Acc | Val Loss | Val Acc | F1 Macro |
|---|---|---|---|---|---|
| 1 | 0.6369 | 73.76% | 0.4033 | 84.44% | 75.33% |
| 2 | 0.3697 | 86.77% | 0.3681 | 86.59% | 78.88% |
| 3 | 0.2565 | 91.60% | 0.3863 | 87.25% | 79.59% ⭐ |
| 4 | 0.1857 | 94.49% | 0.5379 | 87.12% | 79.47% |
| 5 | 0.1369 | 96.19% | 0.5483 | 86.73% | 79.00% |
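The table shows validation loss bottoming out at epoch 2 while validation accuracy and F1 macro peak at epoch 3; from epoch 4 onward training accuracy keeps rising as validation loss climbs, a typical sign of overfitting. The sketch below shows how the choice of selection metric changes the chosen checkpoint. Which metric the author actually used is not stated; `best_epoch` is an illustrative helper.

```python
# Validation columns copied from the table above.
HISTORY = [
    {"epoch": 1, "val_loss": 0.4033, "val_acc": 84.44, "f1_macro": 75.33},
    {"epoch": 2, "val_loss": 0.3681, "val_acc": 86.59, "f1_macro": 78.88},
    {"epoch": 3, "val_loss": 0.3863, "val_acc": 87.25, "f1_macro": 79.59},
    {"epoch": 4, "val_loss": 0.5379, "val_acc": 87.12, "f1_macro": 79.47},
    {"epoch": 5, "val_loss": 0.5483, "val_acc": 86.73, "f1_macro": 79.00},
]


def best_epoch(history, metric="f1_macro", higher_is_better=True):
    """Return the epoch number with the best value of `metric`."""
    pick = max if higher_is_better else min
    return pick(history, key=lambda row: row[metric])["epoch"]


print(best_epoch(HISTORY, "f1_macro"))                          # 3
print(best_epoch(HISTORY, "val_acc"))                           # 3
print(best_epoch(HISTORY, "val_loss", higher_is_better=False))  # 2
```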
Features
- ✅ High Accuracy: 87.25% overall accuracy on the validation set
- ✅ Excellent for Positive: 94.60% F1 for positive sentiment
- ✅ Covers All Three Classes: 79.59% F1 macro, with neutral (70.92% F1) and negative (73.25% F1) trailing positive
- ✅ Fast Inference: ~50 ms per comment on GPU
- ✅ Production Ready: Used in a real-world tourism monitoring system
Limitations
- Vietnamese text only (not multilingual)
- Trained on tourism domain (may not generalize to other domains)
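- Lower performance on the neutral (70.92% F1) and negative (73.25% F1) classes than on positive (94.60% F1), likely due to data imbalance; 104 of 362 negative validation comments were predicted as neutral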
- Requires GPU for optimal inference speed
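One common mitigation for the class imbalance noted above is to weight the loss by inverse class frequency. The model card reports plain CrossEntropyLoss, so this is a suggestion rather than a description of how the model was trained. The sketch computes "balanced" weights from the dataset counts; in practice they would be computed on the training split only and passed to the `weight` argument of `torch.nn.CrossEntropyLoss`.

```python
# Class counts from the Sentiment Distribution table (full dataset).
CLASS_COUNTS = {"positive": 7669, "neutral": 1894, "negative": 1808}


def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (n_classes * class_count).

    A class with exactly the average frequency gets weight 1.0; rarer
    classes get proportionally larger weights.
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


weights = balanced_class_weights(CLASS_COUNTS)
for label, weight in weights.items():
    print(f"{label:<8} {weight:.3f}")
```

With these counts, neutral and negative examples would each contribute roughly four times as much to the loss as a positive example.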
Use Cases
- Tourism review sentiment analysis
- Social media monitoring for tourism destinations
- Customer feedback analysis for hotels/attractions
- Tourism demand analysis
- Automated content moderation
Citation
@misc{phobert-tourism-sentiment,
author = {Strawberry0604},
title = {PhoBERT Tourism Sentiment Classifier},
year = {2025},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/Strawberry0604/phobert-tourism-sentiment}}
}
Contact
- Repository: tourism-data-monitor
- HuggingFace: @Strawberry0604
Model Card Authors
Strawberry0604
Model Card Contact
For questions and feedback, please open an issue in the GitHub repository.