🏷️ lfm2.5-1.2b-unesco-tagger-v3

Fine-tuned model for extracting UNESCO Thesaurus keywords from documents.

Model Version: v3 (600 training examples)
Performance: F1=0.361, Precision=0.364, Recall=0.368
Training Data: unesco-data-ai/unesco-thesaurus-sft

📋 Model Description

This model is fine-tuned from LiquidAI/LFM2.5-1.2B-Instruct to automatically tag documents with keywords from the UNESCO Thesaurus.

✨ Use Cases:

  • 📚 Document classification and indexing
  • 🗂️ Metadata extraction for digital libraries
  • 🔍 Knowledge organization and discovery
  • 🏛️ Automated tagging for UNESCO/UNESDOC documents

🚀 Usage

Basic Example

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "unesco-data-ai/lfm2.5-1.2b-unesco-tagger-v3",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "LiquidAI/LFM2.5-1.2B-Instruct",
    trust_remote_code=True
)

# Prepare input
text = """The UNESCO Recommendation on the Ethics of Artificial Intelligence
addresses ethical issues related to AI systems throughout their life cycle,
including research, design, development, deployment, and use."""

prompt = f"Extract UNESCO Thesaurus keywords from this text:\n\n{text}"
messages = [{"role": "user", "content": prompt}]

inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
outputs = model.generate(inputs, max_new_tokens=128, temperature=0.1, do_sample=True)
response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)

print(response)
# Output: ["Artificial intelligence", "Ethics of science", "Human rights", ...]

πŸ“ Input Format

Prompt template:

Extract UNESCO Thesaurus keywords from this text:

{document_text}

Output format: JSON array of keywords

["Keyword1", "Keyword2", "Keyword3"]
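Small models occasionally wrap the array in extra text, so it is worth parsing the output defensively. A minimal sketch (the `parse_keywords` helper is an illustrative name, not part of the model's API):

```python
import json
import re

def parse_keywords(response: str) -> list[str]:
    """Parse a JSON array of keywords from raw model output.

    Tries a direct json.loads first, then falls back to locating the
    first [...] span if the response contains surrounding text.
    """
    try:
        parsed = json.loads(response)
        if isinstance(parsed, list):
            return [str(kw).strip() for kw in parsed]
    except json.JSONDecodeError:
        pass
    match = re.search(r"\[.*?\]", response, re.DOTALL)
    if match:
        try:
            parsed = json.loads(match.group(0))
            if isinstance(parsed, list):
                return [str(kw).strip() for kw in parsed]
        except json.JSONDecodeError:
            pass
    return []  # no well-formed array found
```

Returning an empty list on failure keeps downstream aggregation code simple when a chunk produces malformed output.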

🌍 Real-World Example

📄 Document: UNESCO Recommendation on the Ethics of Artificial Intelligence (44 pages, ~103,000 characters)

βš™οΈ Method: Document chunked into 41 segments, keywords extracted and ranked by frequency

🎯 Keywords extracted:

{
  "keywords": [
    "Artificial intelligence",
    "Ethics of science",
    "Computer applications",
    "Human rights",
    "Computer science",
    "Automation",
    "Cognition",
    "Access to information",
    "Transparency",
    "Evaluation"
  ]
}

📖 Handling Long Documents

For documents longer than ~3000 characters:

  1. ✂️ Chunk the document into overlapping segments (e.g., 3000 chars with 500 overlap)
  2. 🔄 Process each chunk separately
  3. 📊 Aggregate results by keyword frequency
  4. 🏆 Return the top N keywords (e.g., top 10)

Example implementation:

from collections import Counter

def tag_long_document(text, model, tokenizer, chunk_size=3000, overlap=500, max_keywords=10):
    # Split into chunks
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        start = end - overlap
        if start >= len(text) - overlap:
            break

    # Extract keywords from each chunk
    all_keywords = []
    for chunk in chunks:
        keywords = extract_keywords(chunk, model, tokenizer)  # Your extraction function
        all_keywords.extend(keywords)

    # Rank by frequency
    counts = Counter(kw.lower() for kw in all_keywords)
    top = [kw for kw, _ in counts.most_common(max_keywords)]
    return top

✅ Validation

For best results, validate extracted keywords against the UNESCO Thesaurus vocabulary, discarding any predicted term that is not in the controlled vocabulary.
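A minimal filter sketch, assuming the thesaurus preferred labels have already been loaded into a Python set (e.g. from the thesaurus SKOS export; `validate_keywords` is an illustrative name, not a published API):

```python
def validate_keywords(keywords, vocabulary):
    """Keep only keywords present in the controlled vocabulary.

    `vocabulary` is assumed to be an iterable of UNESCO Thesaurus
    preferred labels; matching is case-insensitive.
    """
    vocab_lower = {term.lower() for term in vocabulary}
    return [kw for kw in keywords if kw.lower() in vocab_lower]
```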

🎓 Training Details

Dataset Statistics

  • Total Examples: 600 (480 train / 60 validation / 60 test)
  • Data Source: Ollama synthetic generation (lfm2.5-thinking model)
  • Quality: 99.7% generation success rate
  • Average Text Length: 272 words per example
  • Average Keywords: 5.5 per example
  • Dataset: unesco-data-ai/unesco-thesaurus-sft

Training Configuration

  • 🧠 Base model: LiquidAI/LFM2.5-1.2B-Instruct
  • 📚 Method: Supervised Fine-Tuning (SFT)
  • 🛠️ Library: TRL 0.27.1
  • ⚡ Hardware: HuggingFace Jobs (a10g-large GPU, 22GB VRAM)
  • ⏱️ Training Time: ~1.5 hours
  • 📊 Epochs: 3
  • 🎯 Batch Size: 1 (gradient accumulation: 16)
  • 📈 Learning Rate: 2e-5
  • 📝 Max Sequence Length: 1024 tokens

📦 Framework Versions

  • TRL: 0.27.1
  • Transformers: 5.0.0
  • PyTorch: 2.10.0
  • Datasets: 4.5.0
  • Tokenizers: 0.22.2

📊 Performance

Evaluated on 10 test examples:

  • F1 Score: 0.361 (moderate performance; functional but needs improvement)
  • Precision: 0.364 (36% of predicted keywords are correct)
  • Recall: 0.368 (37% of ground truth keywords are captured)
  • Valid Predictions: 10/10 (well-formed JSON output 100% of the time)
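For reference, set-based keyword metrics of this kind are typically computed per example as follows; this is a sketch of the standard computation, as the exact evaluation script is not included in this card:

```python
def keyword_prf(predicted, gold):
    """Set-based precision, recall, and F1 for one example (case-insensitive)."""
    pred = {kw.lower() for kw in predicted}
    ref = {kw.lower() for kw in gold}
    tp = len(pred & ref)  # keywords present in both sets
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```

Corpus-level scores are then averaged over the test examples.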

Strengths

  • ✅ 100% valid predictions: Always produces well-formed JSON output
  • ✅ Semantic understanding: Captures the general domain correctly
  • ✅ Reasonable keyword count: Generates 3-8 keywords appropriately
  • ✅ No hallucinations: All predicted keywords are valid UNESCO terms

Known Issues

  • ⚠️ Specificity gap: Often predicts general terms instead of specific ones
  • ⚠️ Moderate recall: Misses ~63% of ground truth keywords
  • ⚠️ Some duplicates: Occasionally repeats keywords
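The duplicate issue is easy to mitigate with a case-insensitive de-duplication pass that preserves first-seen order, e.g.:

```python
def dedupe_keywords(keywords):
    """Remove case-insensitive duplicates, keeping first-occurrence order."""
    seen = set()
    result = []
    for kw in keywords:
        key = kw.lower()
        if key not in seen:
            seen.add(key)
            result.append(kw)
    return result
```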

🔧 Graph-Based Post-Processing

This model can be enhanced with graph-based refinement to remove parent-child redundancy:

# Tag with graph-based post-processing
python scripts/tag_document.py \
  --file document.pdf \
  --use-graph \
  --refine-strategy balanced \
  --max-keywords 10

Benefits:

  • Removes redundant parent-child terms (e.g., keeps "Child psychology" and removes "Educational psychology")
  • Preserves sibling terms (e.g., keeps both "Educational psychology" and "Educational philosophy")
  • Improves precision by focusing on specific terms
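The parent-child filtering can be sketched as a walk over broader-term relations. Here `broader` is a hypothetical mapping from each term to its parent terms in the thesaurus hierarchy; the internals of the actual script are not shown in this card:

```python
def remove_parent_terms(keywords, broader):
    """Drop any keyword that is an ancestor of another predicted keyword.

    `broader` maps a term to a list of its broader (parent) terms,
    as in a SKOS skos:broader hierarchy.
    """
    def ancestors(term):
        out = set()
        stack = list(broader.get(term, []))
        while stack:
            parent = stack.pop()
            if parent not in out:
                out.add(parent)
                stack.extend(broader.get(parent, []))
        return out

    kept = set(keywords)
    redundant = set()
    for kw in keywords:
        redundant |= ancestors(kw) & kept  # parents shadowed by a more specific term
    return [kw for kw in keywords if kw not in redundant]
```

Sibling terms share a parent but are not ancestors of each other, so both survive the filter, matching the behavior described above.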

⚠️ Limitations

  • Training data averaged 272 words per example (synthetic data)
  • Works best with chunk sizes of 2000-4000 characters
  • May generate keywords not in the UNESCO Thesaurus (validation recommended)
  • Optimized for English text only
  • Production use recommended after additional training (target F1 > 0.50)

📖 Citation

@misc{unesco-tagger-v3-2026,
    title = {UNESCO Thesaurus Keyword Tagger v3},
    author = {UNESCO Data \& AI},
    year = {2026},
    publisher = {Hugging Face},
    howpublished = {\url{https://huggingface.co/unesco-data-ai/lfm2.5-1.2b-unesco-tagger-v3}}
}

🔗 Resources

📜 License

See LICENSE for details.
