KazEmbed-V5: Kazakh Embedding Model for RAG

🏆 The strongest base-size embedding model for Kazakh retrieval in our benchmarks (see below)

Model Description

KazEmbed-V5 is a fine-tuned version of multilingual-e5-base optimized for Kazakh language retrieval and RAG (Retrieval-Augmented Generation) applications.

Key Features

  • 🇰🇿 Specialized for Kazakh: Fine-tuned on 61,255 Kazakh text pairs
  • 📈 +2.1% MRR improvement over multilingual-e5-base
  • ⚡ Efficient: 278M parameters (base-size model)
  • 🔍 RAG-optimized: Trained specifically for retrieval tasks

Benchmark Results

| Model | Hits@1 | Hits@5 | MRR | Params |
|---|---|---|---|---|
| KazEmbed-V5 (Ours) | 72% | 96% | 0.835 | 278M |
| multilingual-e5-base | 72% | 96% | 0.818 | 278M |
| multilingual-e5-large | 85% | 99% | 0.909 | 560M |
| paraphrase-mpnet-v2 | 53% | 80% | 0.648 | 278M |
| LaBSE | 48% | 73% | 0.601 | 471M |

Evaluated on the KazQAD test set with TF-IDF hard negatives (100 candidates per query).
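
For reference, here is a minimal sketch of how these metrics can be computed. The candidate construction is assumed to follow the protocol above (one gold passage plus TF-IDF hard negatives per query); the function and variable names are illustrative, not the actual evaluation script.

import numpy as np

def evaluate(model, queries, candidate_lists, gold_indices, k=5):
    """Compute Hits@1, Hits@k, and MRR from the rank of the gold passage."""
    hits1 = hitsk = mrr = 0.0
    for query, candidates, gold in zip(queries, candidate_lists, gold_indices):
        q_emb = model.encode("query: " + query)
        c_embs = model.encode(["passage: " + c for c in candidates])
        scores = c_embs @ q_emb                  # higher score = better candidate
        ranking = np.argsort(-scores)            # candidate indices, best first
        rank = int(np.argwhere(ranking == gold)[0, 0]) + 1  # 1-based rank of gold
        hits1 += rank == 1
        hitsk += rank <= k
        mrr += 1.0 / rank
    n = len(queries)
    return hits1 / n, hitsk / n, mrr / n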

Usage

Installation

pip install sentence-transformers

Basic Usage

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('Nurlykhan/kazembed-v5')

# For queries (questions)
query = "query: Қазақстанның астанасы қай қала?"
query_embedding = model.encode(query)

# For passages (documents)
passage = "passage: Астана — Қазақстан Республикасының астанасы."
passage_embedding = model.encode(passage)

# Calculate similarity
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity([query_embedding], [passage_embedding])[0][0]
print(f"Similarity: {similarity:.4f}")

For RAG Applications

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('Nurlykhan/kazembed-v5')

# Your document corpus
documents = [
    "Астана — Қазақстан Республикасының астанасы.",
    "Алматы — Қазақстанның ең үлкен қаласы.",
    "Қазақстан — Орталық Азиядағы мемлекет.",
]

# Encode documents (do once, store in vector DB)
doc_embeddings = model.encode(["passage: " + doc for doc in documents])

# Query
query = "Қазақстанның астанасы қай қала?"
query_embedding = model.encode("query: " + query)

# Find most similar
similarities = np.dot(doc_embeddings, query_embedding)
best_idx = np.argmax(similarities)
print(f"Best match: {documents[best_idx]}")

Training Details

Training Data

| Dataset | Pairs | Description |
|---|---|---|
| KazQAD | 6,640 | Question-Context pairs |
| KazQAD-Retrieval | 44,615 | Title-Text pairs |
| Powerful-Kazakh-Dialogue | 10,000 | User-Assistant pairs |
| Total | 61,255 | Retrieval-focused pairs |

Training Configuration

  • Base Model: intfloat/multilingual-e5-base
  • Epochs: 2
  • Batch Size: 16
  • Learning Rate: 1e-5
  • Loss: MultipleNegativesRankingLoss
  • Hardware: NVIDIA GPU
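
A minimal sketch reproducing this configuration with the classic sentence-transformers fit API; pairs stands in for the (question, context) training data and is a placeholder, not the actual training script.

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("intfloat/multilingual-e5-base")

# `pairs` is a placeholder for the (question, context) training pairs
train_examples = [
    InputExample(texts=["query: " + q, "passage: " + p]) for q, p in pairs
]
loader = DataLoader(train_examples, shuffle=True, batch_size=16)
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(loader, loss)],
    epochs=2,
    optimizer_params={"lr": 1e-5},
)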

Training Strategy

We found that:

  1. Retrieval-only data works best (mixing in NLI/STS data did not help)
  2. 2 epochs is optimal (1 epoch underfits, 3 overfit)
  3. A larger batch size (16) provides more in-batch negatives (see the sketch below)
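
To make point 3 concrete: MultipleNegativesRankingLoss scores every query against every passage in the batch and treats the diagonal as the gold pairing, so each query in a batch of 16 sees 15 free negatives. A toy illustration of the objective follows (random embeddings, not the library internals; the scale of 20 matches the loss's default in sentence-transformers).

import torch
import torch.nn.functional as F

B, dim = 16, 768                              # batch size and embedding size
q = F.normalize(torch.randn(B, dim), dim=1)   # query embeddings (random stand-ins)
p = F.normalize(torch.randn(B, dim), dim=1)   # positive passage embeddings
scores = q @ p.T * 20.0                       # B x B cosine scores, scaled
labels = torch.arange(B)                      # gold passage for query i is passage i
loss = F.cross_entropy(scores, labels)        # each row: 1 positive vs 15 in-batch negatives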

Limitations

  • Optimized for Kazakh; performance on other languages may vary
  • Best for retrieval tasks; may not be optimal for semantic similarity
  • Requires query: and passage: prefixes for best results

Citation

@misc{kazembed2024,
  title={KazEmbed-V5: A Fine-tuned Embedding Model for Kazakh Language Retrieval},
  author={Your Name},
  year={2024},
  howpublished={Hugging Face Hub}
}
