# KazEmbed-V5: Kazakh Embedding Model for RAG

🏆 **Best base-size embedding model for Kazakh-language retrieval tasks**

## Model Description

KazEmbed-V5 is a fine-tuned version of multilingual-e5-base optimized for Kazakh-language retrieval and RAG (Retrieval-Augmented Generation) applications.
## Key Features
- 🇰🇿 **Specialized for Kazakh:** Fine-tuned on 61,255 Kazakh text pairs
- 📈 **+2.1% MRR improvement** over multilingual-e5-base
- ⚡ **Efficient:** 278M parameters (base-size model)
- 🔍 **RAG-optimized:** Trained specifically for retrieval tasks
## Benchmark Results
| Model | Hits@1 | Hits@5 | MRR | Params |
|---|---|---|---|---|
| KazEmbed-V5 (Ours) | 72% | 96% | 0.835 | 278M |
| multilingual-e5-base | 72% | 96% | 0.818 | 278M |
| multilingual-e5-large | 85% | 99% | 0.909 | 560M |
| paraphrase-mpnet-v2 | 53% | 80% | 0.648 | 278M |
| LaBSE | 48% | 73% | 0.601 | 471M |
*Evaluated on the KazQAD test set with TF-IDF hard negatives (100 candidate passages per query).*
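For reference, the sketch below shows how Hits@k and MRR are typically computed in this setup; this is an illustrative reconstruction, not the official evaluation script. Each query is scored against its 100 candidates, and the 1-based rank of the gold passage determines the metrics.

```python
import numpy as np

def retrieval_metrics(gold_ranks, ks=(1, 5)):
    """gold_ranks: 1-based rank of the gold passage among the candidates, per query."""
    ranks = np.asarray(gold_ranks, dtype=float)
    metrics = {f"Hits@{k}": float(np.mean(ranks <= k)) for k in ks}
    metrics["MRR"] = float(np.mean(1.0 / ranks))  # mean reciprocal rank
    return metrics

# Example: gold passage ranked 1st, 3rd, and 2nd for three queries
print(retrieval_metrics([1, 3, 2]))  # -> Hits@1 ≈ 0.33, Hits@5 = 1.0, MRR ≈ 0.61
```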
## Usage

### Installation

```bash
pip install sentence-transformers
```
### Basic Usage

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('YOUR_USERNAME/kazembed-v5')

# For queries (questions), prepend the "query: " prefix
query = "query: Қазақстанның астанасы қай қала?"  # "Which city is the capital of Kazakhstan?"
query_embedding = model.encode(query)

# For passages (documents), prepend the "passage: " prefix
passage = "passage: Астана — Қазақстан Республикасының астанасы."  # "Astana is the capital of the Republic of Kazakhstan."
passage_embedding = model.encode(passage)

# Calculate cosine similarity between query and passage
similarity = cosine_similarity([query_embedding], [passage_embedding])[0][0]
print(f"Similarity: {similarity:.4f}")
```
### For RAG Applications

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('YOUR_USERNAME/kazembed-v5')

# Your document corpus
documents = [
    "Астана — Қазақстан Республикасының астанасы.",
    "Алматы — Қазақстанның ең үлкен қаласы.",
    "Қазақстан — Орталық Азиядағы мемлекет.",
]

# Encode documents once and store the vectors (e.g., in a vector DB).
# Normalizing makes the dot product below equal to cosine similarity.
doc_embeddings = model.encode(
    ["passage: " + doc for doc in documents],
    normalize_embeddings=True,
)

# Encode the query with the "query: " prefix
query = "Қазақстанның астанасы қай қала?"
query_embedding = model.encode("query: " + query, normalize_embeddings=True)

# Rank documents by similarity and pick the best match
similarities = np.dot(doc_embeddings, query_embedding)
best_idx = np.argmax(similarities)
print(f"Best match: {documents[best_idx]}")
```
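For larger corpora, the "encode once, store" step above is usually backed by a vector index rather than a raw NumPy array. Below is a minimal sketch using FAISS as one example store; FAISS is our assumption here (it requires `pip install faiss-cpu`), and any vector database works. With normalized embeddings, an inner-product index returns cosine similarities.

```python
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('YOUR_USERNAME/kazembed-v5')
documents = [
    "Астана — Қазақстан Республикасының астанасы.",
    "Алматы — Қазақстанның ең үлкен қаласы.",
]

# Build an exact inner-product index over normalized passage embeddings
doc_embeddings = model.encode(["passage: " + d for d in documents],
                              normalize_embeddings=True)
index = faiss.IndexFlatIP(doc_embeddings.shape[1])
index.add(doc_embeddings)

# Encode the query as a one-element batch; search returns one row per query
query_embedding = model.encode(["query: Қазақстанның астанасы қай қала?"],
                               normalize_embeddings=True)
scores, ids = index.search(query_embedding, k=2)
print(documents[ids[0][0]], scores[0][0])
```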
## Training Details

### Training Data
| Dataset | Pairs | Description |
|---|---|---|
| KazQAD | 6,640 | Question-Context pairs |
| KazQAD-Retrieval | 44,615 | Title-Text pairs |
| Powerful-Kazakh-Dialogue | 10,000 | User-Assistant pairs |
| Total | 61,255 | Retrieval-focused pairs |
### Training Configuration

- **Base Model:** intfloat/multilingual-e5-base
- **Epochs:** 2
- **Batch Size:** 16
- **Learning Rate:** 1e-5
- **Loss:** MultipleNegativesRankingLoss
- **Hardware:** NVIDIA GPU
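The configuration above maps onto the sentence-transformers `fit` API roughly as follows. This is a hedged sketch rather than the authors' training script: the pair construction and prefix handling shown here are illustrative assumptions.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('intfloat/multilingual-e5-base')

# Each training example is a (query, positive passage) pair; the other
# passages in a batch serve as in-batch negatives for the ranking loss.
train_examples = [
    InputExample(texts=["query: Қазақстанның астанасы қай қала?",
                        "passage: Астана — Қазақстан Республикасының астанасы."]),
    # ... 61,255 pairs total
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=2,
    optimizer_params={"lr": 1e-5},
)
```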
### Training Strategy

We found that:
- Retrieval-only data works best (no NLI/STS data)
- Two epochs are optimal (one epoch underfits, three overfit)
- A larger batch size (16) provides more in-batch negatives: with MultipleNegativesRankingLoss, each query is contrasted against the 15 other passages in its batch
## Limitations
- Optimized for Kazakh; performance on other languages may vary
- Best for retrieval tasks; may not be optimal for semantic similarity
- Requires `query:` and `passage:` prefixes for best results
## Citation

```bibtex
@misc{kazembed2024,
  title={KazEmbed-V5: A Fine-tuned Embedding Model for Kazakh Language Retrieval},
  author={Your Name},
  year={2024},
  howpublished={HuggingFace Hub}
}
```
## Acknowledgments

- Base model: intfloat/multilingual-e5-base
- Training data: ISSAI for the KazQAD dataset