XLM-RoBERTa-Base for ESS Variable Classification

Fine-tuned XLM-RoBERTa-Base model for classifying European Social Survey variables into 19 subject categories.

Model Description

This model is a fine-tuned version of FacebookAI/xlm-roberta-base on the ESS variable classification dataset. It classifies survey questions and variables into predefined subject categories.

  • Base Model: XLM-RoBERTa-Base (~270M parameters)
  • Task: Multi-class text classification (19 categories)
  • Language: English
  • Dataset: European Social Survey variables

Performance

Evaluated on the test set:

  • Accuracy: 0.8381
  • Precision (weighted): 0.7858
  • Recall (weighted): 0.8381
  • F1-Score (weighted): 0.7959
  • Test samples: 105
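
For reference, weighted metrics like these can be computed with scikit-learn; a minimal sketch (y_true and y_pred are placeholder label ids, not the actual evaluation data):

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Placeholder predictions; substitute real test labels and model outputs
y_true = [11, 2, 16, 3]
y_pred = [11, 2, 16, 11]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print(f"Accuracy: {accuracy:.4f}  Precision: {precision:.4f}  "
      f"Recall: {recall:.4f}  F1: {f1:.4f}")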

Intended Use

This model is designed to automatically classify survey variables and questions from social science research into subject categories. It can be used for:

  • Organizing large survey datasets
  • Automating metadata generation
  • Subject classification of research questions
  • Data cataloging and discovery
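
As an illustration of the cataloging use case, a minimal sketch that attaches a predicted subject category to each variable label (the variables list is hypothetical; see the Usage section below for pipeline details):

from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="benjaminBeuster/xlm-roberta-base-ess-classification",
)

# Hypothetical variable labels to be cataloged
variables = [
    "Feeling of safety walking alone in local area after dark",
    "Net household income, all sources",
]

# Build simple metadata records with the predicted subject category
metadata = [
    {"variable": v, "subject": r["label"], "score": round(r["score"], 3)}
    for v, r in zip(variables, classifier(variables))
]
print(metadata)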

Training Data

The model was trained on the benjaminBeuster/ess_classification dataset, which contains survey variables extracted from European Social Survey DDI XML files.
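
For context, a rough sketch of how variable texts might be pulled from a DDI Codebook XML file before classification (the element and namespace names assume DDI Codebook 2.5 and are illustrative; adjust them to your files):

import xml.etree.ElementTree as ET

# DDI Codebook 2.5 namespace (an assumption; check your XML files)
NS = {"ddi": "ddi:codebook:2_5"}

tree = ET.parse("ess_round.xml")  # hypothetical file name
texts = []
for var in tree.getroot().iter("{ddi:codebook:2_5}var"):
    labl = var.find("ddi:labl", NS)              # short variable label
    qstn = var.find("ddi:qstn/ddi:qstnLit", NS)  # literal question text
    parts = [e.text.strip() for e in (labl, qstn) if e is not None and e.text]
    if parts:
        texts.append(". ".join(parts))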

Label Mapping

The model predicts one of 19 subject categories:

Code  Category
0     DEMOGRAPHY (POPULATION, VITAL STATISTICS, AND CENSUSES)
1     ECONOMICS
2     EDUCATION
3     HEALTH
4     HISTORY
5     HOUSING AND LAND USE
6     LABOUR AND EMPLOYMENT
7     LAW, CRIME AND LEGAL SYSTEMS
8     MEDIA, COMMUNICATION AND LANGUAGE
9     NATURAL ENVIRONMENT
10    OTHER
11    POLITICS
12    PSYCHOLOGY
13    SCIENCE AND TECHNOLOGY
14    SOCIAL STRATIFICATION AND GROUPINGS
15    SOCIAL WELFARE POLICY AND SYSTEMS
16    SOCIETY AND CULTURE
17    TRADE, INDUSTRY AND MARKETS
18    TRANSPORT AND TRAVEL
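
The mapping is also stored in the model configuration, so label names can be read programmatically (assuming the config's id2label mirrors the table above):

from transformers import AutoConfig

# Load only the configuration, without the model weights
config = AutoConfig.from_pretrained("benjaminBeuster/xlm-roberta-base-ess-classification")

print(config.id2label[11])        # POLITICS
print(config.label2id["HEALTH"])  # 3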

Usage

Basic Classification

from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Load model and tokenizer
model_name = "benjaminBeuster/xlm-roberta-base-ess-classification"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Create classification pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Classify a survey question
text = "Trust in country's parliament. Using this card, please tell me on a score of 0-10 how much you personally trust each of the institutions I read out."
result = classifier(text)

print(f"Category: {result[0]['label']}")
print(f"Confidence: {result[0]['score']:.4f}")

Batch Classification

# Classify multiple questions
questions = [
    "How often pray apart from at religious services",
    "Highest level of education completed",
    "Trust in politicians"
]

results = classifier(questions)
for question, result in zip(questions, results):
    print(f"{question[:50]}: {result['label']} ({result['score']:.2f})")

Manual Prediction

import torch

# Reuses `model`, `tokenizer`, and `text` from the examples above

# Tokenize input
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=-1).item()

# Get label name
label = model.config.id2label[predicted_class]
confidence = predictions[0][predicted_class].item()

print(f"Predicted: {label} (confidence: {confidence:.4f})")

Training Procedure

Training Hyperparameters

  • Learning rate: 2e-05
  • Batch size: 8
  • Epochs: 5
  • Weight decay: 0.01
  • Warmup ratio: 0.1
  • Max sequence length: 256
  • Optimizer: AdamW
  • LR scheduler: Linear with warmup
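
For reference, a minimal sketch of these settings expressed as transformers.TrainingArguments (the original training script is not published with this card; `eval_strategy` is named `evaluation_strategy` in older Transformers releases, and the max sequence length of 256 is applied at tokenization time):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="xlm-roberta-base-ess-classification",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=5,
    weight_decay=0.01,
    warmup_ratio=0.1,                   # linear schedule with 10% warmup
    lr_scheduler_type="linear",
    eval_strategy="epoch",              # evaluate after each epoch
    save_strategy="epoch",
    load_best_model_at_end=True,        # best checkpoint by validation loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    fp16=True,                          # mixed precision; enable only on supported GPUs
)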

Training Details

The model was fine-tuned using the Hugging Face Transformers library with the following setup:

  • Early stopping with patience of 2 epochs
  • Evaluation on validation set after each epoch
  • Best model selection based on validation loss
  • Mixed precision training (fp16/bf16 where supported)
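
Put together, a sketch of the Trainer setup with early stopping (train_dataset and val_dataset stand in for tokenized dataset splits; training_args is the sketch above):

from transformers import Trainer, EarlyStoppingCallback

trainer = Trainer(
    model=model,                  # the sequence classification model
    args=training_args,
    train_dataset=train_dataset,  # hypothetical tokenized splits
    eval_dataset=val_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()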

Limitations and Bias

  • The model is trained on a relatively small dataset (50 samples), which may limit generalization
  • Performance may vary on survey questions outside the European Social Survey domain
  • The model may inherit biases present in the training data
  • English-language surveys are the primary focus, though the base model supports 100 languages

Citation

If you use this model, please cite:

@misc{xlm-roberta-ess-classifier,
  author = {Benjamin Beuster},
  title = {XLM-RoBERTa-Base for ESS Variable Classification},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/benjaminBeuster/xlm-roberta-base-ess-classification}
}

Model Card Authors

Benjamin Beuster

Model Card Contact

For questions or feedback, please open an issue on the model repository.
