# XLM-RoBERTa-Base for ESS Variable Classification
Fine-tuned XLM-RoBERTa-Base model for classifying European Social Survey variables into 19 subject categories.
## Model Description
This model is a fine-tuned version of FacebookAI/xlm-roberta-base on the ESS variable classification dataset. It classifies survey questions and variables into predefined subject categories.
- Base Model: XLM-RoBERTa-Base (125M parameters)
- Task: Multi-class text classification (19 categories)
- Language: English
- Dataset: European Social Survey variables
## Performance

Evaluated on the held-out test set:
- Accuracy: 0.8381
- Precision (weighted): 0.7858
- Recall (weighted): 0.8381
- F1-Score (weighted): 0.7959
- Test samples: 105
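
The weighted metrics above can be reproduced with scikit-learn. A minimal sketch, assuming `y_true` and `y_pred` hold the integer label ids for the test set (the values below are placeholders):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Placeholder label ids; in practice these come from the test split and model output
y_true = [11, 2, 3]   # ground-truth category codes
y_pred = [11, 2, 12]  # predicted category codes

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print(f"Accuracy: {accuracy:.4f}  Precision: {precision:.4f}  "
      f"Recall: {recall:.4f}  F1: {f1:.4f}")
```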
## Intended Use
This model is designed to automatically classify survey variables and questions from social science research into subject categories. It can be used for:
- Organizing large survey datasets
- Automating metadata generation
- Subject classification of research questions
- Data cataloging and discovery
## Training Data
The model was trained on the benjaminBeuster/ess_classification dataset, which contains survey variables extracted from European Social Survey DDI XML files.
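
To inspect the data yourself, it can be pulled from the Hub with the `datasets` library. A minimal sketch; the split and column names are assumptions and may differ:

```python
from datasets import load_dataset

# Load the ESS variable classification dataset from the Hugging Face Hub
dataset = load_dataset("benjaminBeuster/ess_classification")

# Inspect the available splits and one example record
print(dataset)
print(dataset["train"][0])  # assumes a "train" split exists
```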
## Label Mapping
The model predicts one of 19 subject categories:
| Code | Category |
|---|---|
| 0 | DEMOGRAPHY (POPULATION, VITAL STATISTICS, AND CENSUSES) |
| 1 | ECONOMICS |
| 2 | EDUCATION |
| 3 | HEALTH |
| 4 | HISTORY |
| 5 | HOUSING AND LAND USE |
| 6 | LABOUR AND EMPLOYMENT |
| 7 | LAW, CRIME AND LEGAL SYSTEMS |
| 8 | MEDIA, COMMUNICATION AND LANGUAGE |
| 9 | NATURAL ENVIRONMENT |
| 10 | OTHER |
| 11 | POLITICS |
| 12 | PSYCHOLOGY |
| 13 | SCIENCE AND TECHNOLOGY |
| 14 | SOCIAL STRATIFICATION AND GROUPINGS |
| 15 | SOCIAL WELFARE POLICY AND SYSTEMS |
| 16 | SOCIETY AND CULTURE |
| 17 | TRADE, INDUSTRY AND MARKETS |
| 18 | TRANSPORT AND TRAVEL |
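
The same mapping ships with the model configuration, so class ids can be resolved to category names without retyping the table. A short sketch, assuming the config's `id2label` matches the table above:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("benjaminBeuster/xlm-roberta-base-ess-classification")

# id2label maps integer class ids to category names; label2id is the inverse
print(config.id2label[11])        # expected: "POLITICS" (per the table above)
print(config.label2id["HEALTH"])  # expected: 3
```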
## Usage

### Basic Classification
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Load model and tokenizer
model_name = "benjaminBeuster/xlm-roberta-base-ess-classification"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Create classification pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Classify a survey question
text = "Trust in country's parliament. Using this card, please tell me on a score of 0-10 how much you personally trust each of the institutions I read out."
result = classifier(text)
print(f"Category: {result[0]['label']}")
print(f"Confidence: {result[0]['score']:.4f}")
```
### Batch Classification
```python
# Classify multiple survey questions in one call
questions = [
    "How often pray apart from at religious services",
    "Highest level of education completed",
    "Trust in politicians"
]

results = classifier(questions)
for question, result in zip(questions, results):
    print(f"{question[:50]}: {result['label']} ({result['score']:.2f})")
```
### Manual Prediction
```python
import torch

# Reuses `model`, `tokenizer`, and `text` from the basic example above

# Tokenize input
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class = torch.argmax(predictions, dim=-1).item()

# Get label name
label = model.config.id2label[predicted_class]
confidence = predictions[0][predicted_class].item()
print(f"Predicted: {label} (confidence: {confidence:.4f})")
```
## Training Procedure

### Training Hyperparameters
- Learning rate: 2e-05
- Batch size: 8
- Epochs: 5
- Weight decay: 0.01
- Warmup ratio: 0.1
- Max sequence length: 256
- Optimizer: AdamW
- LR scheduler: Linear with warmup
### Training Details

The model was fine-tuned using the Hugging Face Transformers library with the following setup (a minimal sketch follows the list):
- Early stopping with patience of 2 epochs
- Evaluation on validation set after each epoch
- Best model selection based on validation loss
- Mixed precision training (fp16/bf16 where supported)
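
The setup above can be sketched with the `Trainer` API. This is a minimal reconstruction rather than the exact training script; `train_dataset` and `eval_dataset` are assumed to be pre-tokenized splits of the dataset described earlier:

```python
from transformers import (
    AutoModelForSequenceClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "FacebookAI/xlm-roberta-base", num_labels=19
)

args = TrainingArguments(
    output_dir="xlm-roberta-base-ess-classification",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    eval_strategy="epoch",         # `evaluation_strategy` in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,   # best checkpoint selected on validation loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    fp16=True,                     # mixed precision where the hardware supports it
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,   # assumed pre-tokenized dataset splits
    eval_dataset=eval_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```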
## Limitations and Bias
- The model is trained on a relatively small dataset (50 samples), which may limit generalization
- Performance may vary on survey questions outside the European Social Survey domain
- The model may inherit biases present in the training data
- English-language surveys are the primary focus, though the base model supports 100 languages
## Citation

If you use this model, please cite:
```bibtex
@misc{xlm-roberta-ess-classifier,
  author = {Benjamin Beuster},
  title = {XLM-RoBERTa-Base for ESS Variable Classification},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/benjaminBeuster/xlm-roberta-base-ess-classification}
}
```
## Model Card Authors

Benjamin Beuster

## Model Card Contact

For questions or feedback, please open an issue on the model repository.