🏷️ lfm2.5-1.2b-unesco-tagger-v3

Fine-tuned model for extracting UNESCO Thesaurus keywords from documents.

Model Version: v3 (600 training examples)
Performance: F1=0.361, Precision=0.364, Recall=0.368
Training Data: unesco-data-ai/unesco-thesaurus-sft

📋 Model Description

This model is fine-tuned from LiquidAI/LFM2.5-1.2B-Instruct to automatically tag documents with keywords from the UNESCO Thesaurus.

✨ Use Cases:

  • 📚 Document classification and indexing
  • 🗂️ Metadata extraction for digital libraries
  • 🔍 Knowledge organization and discovery
  • 🏛️ Automated tagging for UNESCO/UNESDOC documents

🚀 Usage

Basic Example

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "unesco-data-ai/lfm2.5-1.2b-unesco-tagger-v3",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "LiquidAI/LFM2.5-1.2B-Instruct",
    trust_remote_code=True
)

# Prepare input
text = """The UNESCO Recommendation on the Ethics of Artificial Intelligence
addresses ethical issues related to AI systems throughout their life cycle,
including research, design, development, deployment, and use."""

prompt = f"Extract UNESCO Thesaurus keywords from this text:\n\n{text}"
messages = [{"role": "user", "content": prompt}]

inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
outputs = model.generate(inputs, max_new_tokens=128, temperature=0.1, do_sample=True)
response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)

print(response)
# Output: ["Artificial intelligence", "Ethics of science", "Human rights", ...]

πŸ“ Input Format

Prompt template:

Extract UNESCO Thesaurus keywords from this text:

{document_text}

Output format: JSON array of keywords

["Keyword1", "Keyword2", "Keyword3"]
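Small models occasionally wrap the array in extra text, so it is worth parsing the output defensively. A minimal sketch (the `parse_keywords` helper is an illustrative name, not part of the model's API):

```python
import json
import re

def parse_keywords(response: str) -> list[str]:
    """Parse a JSON array of keywords from raw model output.

    Tries a direct json.loads first, then falls back to locating the
    first [...] span if the response contains surrounding text.
    """
    try:
        parsed = json.loads(response)
        if isinstance(parsed, list):
            return [str(kw).strip() for kw in parsed]
    except json.JSONDecodeError:
        pass
    match = re.search(r"\[.*?\]", response, re.DOTALL)
    if match:
        try:
            parsed = json.loads(match.group(0))
            if isinstance(parsed, list):
                return [str(kw).strip() for kw in parsed]
        except json.JSONDecodeError:
            pass
    return []  # no well-formed array found
```

Returning an empty list on failure keeps downstream aggregation code simple when a chunk produces malformed output.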

🌍 Real-World Example

📄 Document: UNESCO Recommendation on the Ethics of Artificial Intelligence (44 pages, ~103,000 characters)

βš™οΈ Method: Document chunked into 41 segments, keywords extracted and ranked by frequency

🎯 Keywords extracted:

{
  "keywords": [
    "Artificial intelligence",
    "Ethics of science",
    "Computer applications",
    "Human rights",
    "Computer science",
    "Automation",
    "Cognition",
    "Access to information",
    "Transparency",
    "Evaluation"
  ]
}

📖 Handling Long Documents

For documents longer than ~3000 characters:

  1. ✂️ Chunk the document into overlapping segments (e.g., 3000 chars with 500 overlap)
  2. 🔄 Process each chunk separately
  3. 📊 Aggregate results by keyword frequency
  4. 🏆 Return the top N keywords (e.g., top 10)

Example implementation:

from collections import Counter

def tag_long_document(text, model, tokenizer, chunk_size=3000, overlap=500, max_keywords=10):
    # Split into chunks
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        start = end - overlap
        if start >= len(text) - overlap:
            break

    # Extract keywords from each chunk
    all_keywords = []
    for chunk in chunks:
        keywords = extract_keywords(chunk, model, tokenizer)  # Your extraction function
        all_keywords.extend(keywords)

    # Rank by frequency
    counts = Counter(kw.lower() for kw in all_keywords)
    top = [kw for kw, _ in counts.most_common(max_keywords)]
    return top

✅ Validation

For best results, validate extracted keywords against the UNESCO Thesaurus vocabulary, discarding any predicted term that is not in the controlled vocabulary.
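A minimal filter sketch, assuming the thesaurus preferred labels have already been loaded into a Python set (e.g. from the thesaurus SKOS export; `validate_keywords` is an illustrative name, not a published API):

```python
def validate_keywords(keywords, vocabulary):
    """Keep only keywords present in the controlled vocabulary.

    `vocabulary` is assumed to be an iterable of UNESCO Thesaurus
    preferred labels; matching is case-insensitive.
    """
    vocab_lower = {term.lower() for term in vocabulary}
    return [kw for kw in keywords if kw.lower() in vocab_lower]
```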

🎓 Training Details

Dataset Statistics

  • Total Examples: 600 (480 train / 60 validation / 60 test)
  • Data Source: Ollama synthetic generation (lfm2.5-thinking model)
  • Quality: 99.7% generation success rate
  • Average Text Length: 272 words per example
  • Average Keywords: 5.5 per example
  • Dataset: unesco-data-ai/unesco-thesaurus-sft

Training Configuration

  • 🧠 Base model: LiquidAI/LFM2.5-1.2B-Instruct
  • 📚 Method: Supervised Fine-Tuning (SFT)
  • 🛠️ Library: TRL 0.27.1
  • ⚡ Hardware: HuggingFace Jobs (a10g-large GPU, 22GB VRAM)
  • ⏱️ Training Time: ~1.5 hours
  • 📊 Epochs: 3
  • 🎯 Batch Size: 1 (gradient accumulation: 16)
  • 📈 Learning Rate: 2e-5
  • 📝 Max Sequence Length: 1024 tokens

📦 Framework Versions

  • TRL: 0.27.1
  • Transformers: 5.0.0
  • PyTorch: 2.10.0
  • Datasets: 4.5.0
  • Tokenizers: 0.22.2

📊 Performance

Evaluated on 10 test examples:

  • F1 Score: 0.361 (moderate performance; functional but needs improvement)
  • Precision: 0.364 (36% of predicted keywords are correct)
  • Recall: 0.368 (37% of ground truth keywords are captured)
  • Valid Predictions: 10/10 (well-formed JSON output 100% of the time)
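For reference, set-based keyword metrics of this kind are typically computed per example as follows; this is a sketch of the standard computation, as the exact evaluation script is not included in this card:

```python
def keyword_prf(predicted, gold):
    """Set-based precision, recall, and F1 for one example (case-insensitive)."""
    pred = {kw.lower() for kw in predicted}
    ref = {kw.lower() for kw in gold}
    tp = len(pred & ref)  # keywords present in both sets
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```

Corpus-level scores are then averaged over the test examples.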

Strengths

  • ✅ 100% valid predictions: Always produces well-formed JSON output
  • ✅ Semantic understanding: Captures the general domain correctly
  • ✅ Reasonable keyword count: Generates 3-8 keywords appropriately
  • ✅ No hallucinations: All predicted keywords are valid UNESCO terms

Known Issues

  • ⚠️ Specificity gap: Often predicts general terms instead of specific ones
  • ⚠️ Moderate recall: Misses ~63% of ground truth keywords
  • ⚠️ Some duplicates: Occasionally repeats keywords
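The duplicate issue is easy to mitigate with a case-insensitive de-duplication pass that preserves first-seen order, e.g.:

```python
def dedupe_keywords(keywords):
    """Remove case-insensitive duplicates, keeping first-occurrence order."""
    seen = set()
    result = []
    for kw in keywords:
        key = kw.lower()
        if key not in seen:
            seen.add(key)
            result.append(kw)
    return result
```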

🔧 Graph-Based Post-Processing

This model can be enhanced with graph-based refinement to remove parent-child redundancy:

# Tag with graph-based post-processing
python scripts/tag_document.py \
  --file document.pdf \
  --use-graph \
  --refine-strategy balanced \
  --max-keywords 10

Benefits:

  • Removes redundant parent-child terms (e.g., keeps "Child psychology" and removes "Educational psychology")
  • Preserves sibling terms (e.g., keeps both "Educational psychology" and "Educational philosophy")
  • Improves precision by focusing on specific terms
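The parent-child filtering can be sketched as a walk over broader-term relations. Here `broader` is a hypothetical mapping from each term to its parent terms in the thesaurus hierarchy; the internals of the actual script are not shown in this card:

```python
def remove_parent_terms(keywords, broader):
    """Drop any keyword that is an ancestor of another predicted keyword.

    `broader` maps a term to a list of its broader (parent) terms,
    as in a SKOS skos:broader hierarchy.
    """
    def ancestors(term):
        out = set()
        stack = list(broader.get(term, []))
        while stack:
            parent = stack.pop()
            if parent not in out:
                out.add(parent)
                stack.extend(broader.get(parent, []))
        return out

    kept = set(keywords)
    redundant = set()
    for kw in keywords:
        redundant |= ancestors(kw) & kept  # parents shadowed by a more specific term
    return [kw for kw in keywords if kw not in redundant]
```

Sibling terms share a parent but are not ancestors of each other, so both survive the filter, matching the behavior described above.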

⚠️ Limitations

  • Training data averaged 272 words per example (synthetic data)
  • Works best with chunk sizes of 2000-4000 characters
  • May generate keywords not in the UNESCO Thesaurus (validation recommended)
  • Optimized for English text only
  • Production use recommended after additional training (target F1 > 0.50)

📖 Citation

@misc{unesco-tagger-v3-2026,
    title = {UNESCO Thesaurus Keyword Tagger v3},
    author = {UNESCO Data \& AI},
    year = {2026},
    publisher = {Hugging Face},
    howpublished = {\url{https://huggingface.co/unesco-data-ai/lfm2.5-1.2b-unesco-tagger-v3}}
}

🔗 Resources

📜 License

See LICENSE for details.
