PhoBERT Tourism Sentiment Classifier

Fine-tuned PhoBERT model for Vietnamese tourism comment sentiment analysis.

Model Description

This model classifies Vietnamese tourism comments into 3 sentiment categories:

  • positive (Tích cực)
  • neutral (Trung lập)
  • negative (Tiêu cực)

Training Data

  • Language: Vietnamese only
  • Dataset: 11,371 Vietnamese tourism comments from social media
  • Sources: Google Maps, TikTok, YouTube, Facebook
  • Split: 80% train (9,096), 20% validation (2,275)
  • Quality: Filtered for meaningful content (min 10 words)
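The filtering and split described above can be sketched roughly as follows. This is a minimal illustration, not the author's pipeline: counting words by whitespace, the shuffle seed, and the helper names `is_meaningful` and `split_train_val` are all assumptions, and the card does not say whether the split was stratified by sentiment.

```python
import random

MIN_WORDS = 10        # from the card: comments need at least 10 words
TRAIN_FRACTION = 0.8  # from the card: 80% train, 20% validation


def is_meaningful(comment: str, min_words: int = MIN_WORDS) -> bool:
    """Keep a comment only if it has at least `min_words` whitespace-separated words.

    ASSUMPTION: the card does not say how words were counted.
    """
    return len(comment.split()) >= min_words


def split_train_val(items, train_fraction=TRAIN_FRACTION, seed=42):
    """Shuffle a copy of `items` and cut it into (train, validation) lists.

    ASSUMPTION: plain random split with a hypothetical seed, no stratification.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)  # rounds down
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder items with the same count as the filtered dataset (11,371).
comments = [f"comment {i}" for i in range(11_371)]
train, val = split_train_val(comments)
print(len(train), len(val))  # 9096 2275
```

Rounding the training size down reproduces the 9,096 / 2,275 split reported above.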

Sentiment Distribution

Sentiment   Count   Percentage
Positive    7,669   67.4%
Neutral     1,894   16.7%
Negative    1,808   15.9%
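To put the imbalance in context, the short sketch below recomputes the class shares from the counts and shows what a trivial majority-class predictor would score. The counts come from the table. The baseline is only an illustration and assumes an evaluation set with the same class proportions.

```python
# Class counts copied from the sentiment distribution table.
counts = {"positive": 7_669, "neutral": 1_894, "negative": 1_808}
total = sum(counts.values())

# Share of each class in the full dataset.
shares = {label: n / total for label, n in counts.items()}
for label, share in shares.items():
    print(f"{label:<8} {counts[label]:>5}  {share:.1%}")

# A model that always predicts the most frequent class.
majority_label = max(counts, key=counts.get)
print(f"Always predicting '{majority_label}' scores about "
      f"{shares[majority_label]:.1%} accuracy")
```

Accuracy should therefore be read against a baseline of roughly 67%, which is why macro F1 is also worth checking.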

Performance

Metric        Score
Accuracy      87.25%
F1 Macro      79.59%
F1 Weighted   87.26%

Per-Class Performance

Sentiment   Precision   Recall   F1-Score   Support
Positive    94.36%      94.85%   94.60%     1,534
Neutral     64.24%      79.16%   70.92%     379
Negative    86.47%      63.54%   73.25%     362

Confusion Matrix

              Predicted
              pos  neu  neg
Actual  pos  1455  63   16
        neu    59  300  20
        neg    28  104  230
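The reported metrics can be re-derived from the confusion matrix. The sketch below is a plain-Python check using only the counts shown above, with rows as actual classes and columns as predicted classes.

```python
labels = ["positive", "neutral", "negative"]
# Rows are actual classes, columns are predicted classes (values from the card).
cm = [
    [1455, 63, 16],
    [59, 300, 20],
    [28, 104, 230],
]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(3)) / total

f1_scores = {}
for i, label in enumerate(labels):
    tp = cm[i][i]
    predicted = sum(cm[r][i] for r in range(3))  # column sum
    actual = sum(cm[i])                          # row sum (support)
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    f1_scores[label] = f1
    print(f"{label:<8} P={precision:.2%} R={recall:.2%} "
          f"F1={f1:.2%} support={actual}")

macro_f1 = sum(f1_scores.values()) / len(f1_scores)
print(f"accuracy={accuracy:.2%} macro_f1={macro_f1:.2%}")
```

The matrix also shows where most errors come from: 104 of the 362 negative comments (about 29%) are predicted as neutral, which accounts for the low negative recall.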

Usage

import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

# Define model architecture: PhoBERT encoder + dropout + linear classification head
class PhoBERTSentimentClassifier(nn.Module):
    def __init__(self, n_classes=3, dropout=0.3):
        super().__init__()
        self.phobert = AutoModel.from_pretrained('vinai/phobert-base')
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(self.phobert.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.phobert(input_ids=input_ids, attention_mask=attention_mask)
        # Pooled representation of the first token; equivalent to outputs[1]
        pooled_output = outputs.pooler_output
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)  # shape: (batch_size, n_classes)
        return logits

# Load model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = AutoTokenizer.from_pretrained('vinai/phobert-base')

model = PhoBERTSentimentClassifier(n_classes=3)
checkpoint = torch.load('phobert_sentiment_best_model.pt', map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])
model = model.to(device)
model.eval()

# Predict (sample text means: "The beach is so beautiful, I really like it!")
text = "Bãi biển đẹp quá, tôi rất thích!"
encoding = tokenizer(
    text,
    add_special_tokens=True,
    max_length=256,
    padding='max_length',
    truncation=True,
    return_attention_mask=True,
    return_tensors='pt'
)

input_ids = encoding['input_ids'].to(device)
attention_mask = encoding['attention_mask'].to(device)

with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    probs = torch.softmax(outputs, dim=1)
    confidence, predicted = torch.max(probs, dim=1)

sentiments = ['positive', 'neutral', 'negative']
print(f"Sentiment: {sentiments[predicted.item()]} ({confidence.item():.4f})")
# Output: Sentiment: positive (0.9965)
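The last step of the usage code, turning logits into a label and a confidence, can be looked at on its own. The sketch below redoes that step in plain Python without PyTorch. The logits are made-up values for illustration only, and `softmax` and `decode` are hypothetical helper names.

```python
import math

# Same index order as the `sentiments` list in the usage code.
SENTIMENTS = ["positive", "neutral", "negative"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Return (label, confidence) for one row of logits."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return SENTIMENTS[best], probs[best]


# Made-up logits for illustration only; real values come from the model.
label, confidence = decode([4.0, -1.0, -2.0])
print(f"Sentiment: {label} ({confidence:.4f})")
```

The confidence is the softmax probability of the winning class, so it reflects how far that class's logit is from the others rather than a calibrated likelihood of being correct.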

Training Details

  • Base Model: vinai/phobert-base
  • Architecture: PhoBERT + Dropout (0.3) + Linear (768 → 3)
  • Parameters: ~135M (base) + 2,307 (classifier)
  • Training Time: ~20-25 minutes on a CUDA GPU
  • Epochs: 5 (best at epoch 3)
  • Batch Size: 16
  • Learning Rate: 2e-5
  • Optimizer: AdamW with linear warmup
  • Loss: CrossEntropyLoss
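The hyperparameters above imply a certain number of optimizer steps, and a linear-warmup schedule can be sketched from them. The card states neither the warmup length nor what happens after warmup, so the 10% warmup ratio and the linear decay to zero below are assumptions, as is keeping the last partial batch of each epoch.

```python
import math

TRAIN_SIZE = 9_096   # from the card
BATCH_SIZE = 16      # from the card
EPOCHS = 5           # from the card
PEAK_LR = 2e-5       # from the card
WARMUP_RATIO = 0.1   # ASSUMPTION: warmup length is not stated in the card

# ASSUMPTION: the final partial batch of each epoch is kept.
steps_per_epoch = math.ceil(TRAIN_SIZE / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = int(total_steps * WARMUP_RATIO)


def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then linear decay to zero.

    ASSUMPTION: the decay phase is not described in the card.
    """
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return PEAK_LR * remaining / max(1, total_steps - warmup_steps)


# Weights plus biases of the Linear(768 -> 3) classification head.
classifier_params = 768 * 3 + 3

print(steps_per_epoch, total_steps, warmup_steps, classifier_params)
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{learning_rate(step):.2e}")
```

Under these assumptions training runs for 569 steps per epoch and 2,845 steps overall, and the head's parameter count matches the 2,307 listed above.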

Training Progress

Epoch   Train Loss   Train Acc   Val Loss   Val Acc   F1 Macro
1       0.6369       73.76%      0.4033     84.44%    75.33%
2       0.3697       86.77%      0.3681     86.59%    78.88%
3       0.2565       91.60%      0.3863     87.25%    79.59%
4       0.1857       94.49%      0.5379     87.12%    79.47%
5       0.1369       96.19%      0.5483     86.73%    79.00%
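The choice of epoch 3 as the best checkpoint can be checked against the table. The sketch below copies the table values and picks the best epoch under three criteria. The card does not say which one was used for checkpoint selection, so this is only a consistency check.

```python
# (epoch, train_loss, train_acc, val_loss, val_acc, f1_macro), copied from the table
history = [
    (1, 0.6369, 73.76, 0.4033, 84.44, 75.33),
    (2, 0.3697, 86.77, 0.3681, 86.59, 78.88),
    (3, 0.2565, 91.60, 0.3863, 87.25, 79.59),
    (4, 0.1857, 94.49, 0.5379, 87.12, 79.47),
    (5, 0.1369, 96.19, 0.5483, 86.73, 79.00),
]

best_by_acc = max(history, key=lambda row: row[4])
best_by_f1 = max(history, key=lambda row: row[5])
lowest_val_loss = min(history, key=lambda row: row[3])

print("best val accuracy: epoch", best_by_acc[0])      # epoch 3
print("best F1 macro:     epoch", best_by_f1[0])       # epoch 3
print("lowest val loss:   epoch", lowest_val_loss[0])  # epoch 2

# Gap between train and validation accuracy, a rough overfitting signal.
for epoch, _, train_acc, _, val_acc, _ in history:
    print(f"epoch {epoch}: train - val accuracy = {train_acc - val_acc:+.2f} points")
```

Validation accuracy and macro F1 both peak at epoch 3, while validation loss is lowest at epoch 2 and rises sharply from epoch 4 as the train/validation gap widens, which is consistent with overfitting in the later epochs.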

Features

  • High Accuracy: 87.25% overall accuracy on the validation set
  • Excellent for Positive: 94.60% F1 for positive sentiment
  • Coverage of All Classes: F1 above 70% for each sentiment class
  • Fast Inference: ~50ms per comment on GPU
  • Production Ready: Used in a real-world tourism monitoring system

Limitations

  • Vietnamese text only (not multilingual)
  • Trained on tourism domain (may not generalize to other domains)
  • Lower performance on neutral and negative classes (F1 of 70.92% and 73.25% vs. 94.60% for positive), attributed to data imbalance
  • Requires GPU for optimal inference speed

Use Cases

  • Tourism review sentiment analysis
  • Social media monitoring for tourism destinations
  • Customer feedback analysis for hotels/attractions
  • Tourism demand analysis
  • Automated content moderation

Citation

@misc{phobert-tourism-sentiment,
  author = {Strawberry0604},
  title = {PhoBERT Tourism Sentiment Classifier},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Strawberry0604/phobert-tourism-sentiment}}
}

Contact

Model Card Authors

Strawberry0604

Model Card Contact

For questions and feedback, please open an issue in the GitHub repository.
