π·οΈ lfm2.5-1.2b-unesco-tagger-v3
Fine-tuned model for extracting UNESCO Thesaurus keywords from documents.
Model Version: v3 (600 training examples) Performance: F1=0.361, Precision=0.364, Recall=0.368 Training Data: unesco-data-ai/unesco-thesaurus-sft
π Model Description
This model is fine-tuned from LiquidAI/LFM2.5-1.2B-Instruct to automatically tag documents with keywords from the UNESCO Thesaurus.
β¨ Use Cases:
- π Document classification and indexing
- ποΈ Metadata extraction for digital libraries
- π Knowledge organization and discovery
- ποΈ Automated tagging for UNESCO/UNESDOC documents
π Usage
Basic Example
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
"unesco-data-ai/lfm2.5-1.2b-unesco-tagger-v3",
trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
"LiquidAI/LFM2.5-1.2B-Instruct",
trust_remote_code=True
)
# Prepare input
text = """The UNESCO Recommendation on the Ethics of Artificial Intelligence
addresses ethical issues related to AI systems throughout their life cycle,
including research, design, development, deployment, and use."""
prompt = f"Extract UNESCO Thesaurus keywords from this text:\n\n{text}"
messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
outputs = model.generate(inputs, max_new_tokens=128, temperature=0.1, do_sample=True)
response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
print(response)
# Output: ["Artificial intelligence", "Ethics of science", "Human rights", ...]
π Input Format
Prompt template:
Extract UNESCO Thesaurus keywords from this text:
{document_text}
Output format: JSON array of keywords
["Keyword1", "Keyword2", "Keyword3"]
π Real-World Example
π Document: UNESCO Recommendation on the Ethics of Artificial Intelligence (44 pages, ~103,000 characters)
βοΈ Method: Document chunked into 41 segments, keywords extracted and ranked by frequency
π― Keywords extracted:
{
"keywords": [
"Artificial intelligence",
"Ethics of science",
"Computer applications",
"Human rights",
"Computer science",
"Automation",
"Cognition",
"Access to information",
"Transparency",
"Evaluation"
]
}
π Handling Long Documents
For documents longer than ~3000 characters:
- βοΈ Chunk the document with overlapping segments (e.g., 3000 chars with 500 overlap)
- π Process each chunk separately
- π Aggregate results by keyword frequency
- π Return top N keywords (e.g., top 10)
Example implementation:
from collections import Counter
def tag_long_document(text, model, tokenizer, chunk_size=3000, overlap=500, max_keywords=10):
# Split into chunks
chunks = []
start = 0
while start < len(text):
end = min(start + chunk_size, len(text))
chunks.append(text[start:end])
start = end - overlap
if start >= len(text) - overlap:
break
# Extract keywords from each chunk
all_keywords = []
for chunk in chunks:
keywords = extract_keywords(chunk, model, tokenizer) # Your extraction function
all_keywords.extend(keywords)
# Rank by frequency
counts = Counter(kw.lower() for kw in all_keywords)
top = [kw for kw, _ in counts.most_common(max_keywords)]
return top
β Validation
For best results, validate extracted keywords against the UNESCO Thesaurus vocabulary:
π Training Details
Dataset Statistics
- Total Examples: 600 (480 train / 60 validation / 60 test)
- Data Source: Ollama synthetic generation (lfm2.5-thinking model)
- Quality: 99.7% generation success rate
- Average Text Length: 272 words per example
- Average Keywords: 5.5 per example
- Dataset: unesco-data-ai/unesco-thesaurus-sft
Training Configuration
| Parameter | Value |
|---|---|
| π§ Base model | LiquidAI/LFM2.5-1.2B-Instruct |
| π Method | Supervised Fine-Tuning (SFT) |
| π οΈ Library | TRL 0.27.1 |
| β‘ Hardware | HuggingFace Jobs (a10g-large GPU, 22GB VRAM) |
| β±οΈ Training Time | ~1.5 hours |
| π Epochs | 3 |
| π― Batch Size | 1 (gradient accumulation: 16) |
| π Learning Rate | 2e-5 |
| π Max Sequence Length | 1024 tokens |
π¦ Framework Versions
- TRL: 0.27.1
- Transformers: 5.0.0
- PyTorch: 2.10.0
- Datasets: 4.5.0
- Tokenizers: 0.22.2
π Performance
Evaluated on 10 test examples:
| Metric | Score | Interpretation |
|---|---|---|
| F1 Score | 0.361 | Moderate performance; functional but needs improvement |
| Precision | 0.364 | 36% of predicted keywords are correct |
| Recall | 0.368 | 37% of ground truth keywords are captured |
| Valid Predictions | 10/10 | Model generates valid JSON output 100% of the time |
Strengths
- β 100% valid predictions: Always produces well-formed JSON output
- β Semantic understanding: Captures general domain correctly
- β Reasonable keyword count: Generates 3-8 keywords appropriately
- β No hallucinations: All predicted keywords are valid UNESCO terms
Known Issues
- β οΈ Specificity gap: Often predicts general terms instead of specific ones
- β οΈ Moderate recall: Misses ~63% of ground truth keywords
- β οΈ Some duplicates: Occasionally repeats keywords
π§ Graph-Based Post-Processing
This model can be enhanced with graph-based refinement to remove parent-child redundancy:
# Tag with graph-based post-processing
python scripts/tag_document.py \
--file document.pdf \
--use-graph \
--refine-strategy balanced \
--max-keywords 10
Benefits:
- Removes redundant parent-child terms (e.g., keeps "Child psychology" and removes "Educational psychology")
- Preserves sibling terms (e.g., keeps both "Educational psychology" and "Educational philosophy")
- Improves precision by focusing on specific terms
β οΈ Limitations
- Training data averaged 272 words per example (synthetic data)
- Works best with chunk sizes of 2000-4000 characters
- May generate keywords not in the UNESCO Thesaurus (validation recommended)
- Optimized for English text only
- Production use recommended after additional training (target F1 > 0.50)
π Citation
@misc{unesco-tagger-v3-2026,
title = {UNESCO Thesaurus Keyword Tagger v3},
author = {UNESCO Data & AI},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/unesco-data-ai/lfm2.5-1.2b-unesco-tagger-v3}}
}
π Resources
- Model: unesco-data-ai/lfm2.5-1.2b-unesco-tagger-v3
- Dataset: unesco-data-ai/unesco-thesaurus-sft
- Training Logs: unesco-data-ai/trackio
- Base Model: LiquidAI/LFM2.5-1.2B-Instruct
π License
See LICENSE for details.
- Downloads last month
- 1
Model tree for unesco-data-ai/lfm2.5-1.2b-unesco-tagger-v3
Base model
LiquidAI/LFM2.5-1.2B-BaseDataset used to train unesco-data-ai/lfm2.5-1.2b-unesco-tagger-v3
Evaluation results
- F1 Score on UNESCO Thesaurus SFTtest set self-reported0.361
- Precision on UNESCO Thesaurus SFTtest set self-reported0.364
- Recall on UNESCO Thesaurus SFTtest set self-reported0.368