Purpose-built Multilingual Embeddings for Northeast Indian Languages
10 languages • 201k parallel pairs • 768 dimensions • Built on LaBSE
Semantic search, Retrieval and RAG for low-resource Northeast Indian languages.
Highlights
- Supports 10 Northeast Indian languages
- Optimized for Semantic Search, Retrieval and RAG
- Up to roughly 7× higher R@1 than raw LaBSE on low-resource languages (for example, Garo: 13.2 → 90.8)
- Built on sentence-transformers/LaBSE
- Trained on 201,738 balanced English ↔ Northeast language parallel pairs
- Released under the CC-BY-4.0 license
What is NE-Embed?
NE-Embed is a multilingual sentence embedding model designed for semantic understanding across Northeast Indian languages. It is optimized for semantic search, dense retrieval, Retrieval-Augmented Generation (RAG), and cross-lingual information retrieval, where general-purpose multilingual embedding models often perform poorly.
The model is fine-tuned from LaBSE using 201,738 balanced English↔Northeast language parallel pairs spanning 10 languages. It substantially improves retrieval quality for several low-resource languages—including Garo, Khasi, Nyishi, Pnar, and Kokborok—while maintaining strong multilingual alignment.
Why NE-Embed?
General-purpose multilingual embedding models are trained on hundreds of languages, but many Northeast Indian languages receive little or no representation during training. As a result, semantically similar sentences are often mapped far apart, leading to poor retrieval performance.
NE-Embed addresses this gap through targeted contrastive fine-tuning on balanced parallel data, producing embeddings that better capture semantic similarity for low-resource Northeast Indian languages while preserving multilingual compatibility.
Supported Languages
| Code | Language | Script | Tier | Training Pairs |
|---|---|---|---|---|
| asm | Assamese | Bengali | ✅ Supported | 25,000 |
| brx | Bodo | Devanagari | ✅ Supported | 25,000 |
| grt | Garo | Latin | ✅ Supported | 25,000 |
| kha | Khasi | Latin | ✅ Supported | 25,000 |
| lus | Mizo | Latin | ✅ Supported | 25,000 |
| mni | Meitei | Meitei Mayek | ✅ Supported | 25,000 |
| njz | Nyishi | Latin | ✅ Supported | 25,000 |
| trp | Kokborok | Latin | ⚠️ Limited | 12,545 |
| pbv | Pnar | Latin | ⚠️ Limited | 6,034 |
| nag | Nagamese | Latin | ⚠️ Limited | 1,996 |
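Supported = strong retrieval performance. Limited = the model has coverage but quality is lower; use with caution in production. Note that Meitei is listed as Supported even though its R@1 after fine-tuning (34.2) is the lowest in the Performance table.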
Performance
Evaluated on 500 samples per language. R@1 = Recall@1 in % (higher is better). CLRI = Cross-Language Retrieval Interference (lower is better).
| Language | R@1 (Base) | R@1 (NE-Embed) | CLRI (Base) | CLRI (NE-Embed) |
|---|---|---|---|---|
| Assamese | 95.6 | 97.4 | 1.8% | 4.6% |
| Bodo | 55.8 | 99.8 | 61.0% | 3.0% |
| Garo | 13.2 | 90.8 | 88.8% | 3.0% |
| Khasi | 28.6 | 95.6 | 65.0% | 3.4% |
| Mizo | 46.6 | 91.8 | 58.4% | 9.4% |
| Meitei | 13.6 | 34.2 | 90.8% | 19.8% |
| Nyishi | 10.2 | 75.0 | 71.0% | 17.4% |
| Pnar | 27.2 | 86.2 | 79.6% | 8.0% |
| Kokborok | 26.4 | 71.6 | 63.8% | 11.8% |
| Nagamese | 77.0 | 88.0 | 17.8% | 8.4% |
Base = raw LaBSE evaluated zero-shot. Fine-tuning lowers CLRI for every language except Assamese, where it rises from 1.8% to 4.6%; the reductions reflect genuine cross-lingual confusion fixed by fine-tuning.
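The card does not spell out the evaluation protocol, so the following is only a sketch of how an R@1 figure like those in the table could be computed. It assumes R@1 means standard Recall@1 over parallel pairs: a query counts as a hit when its own translation is the top-ranked candidate. The function name `recall_at_1` and the toy vectors are illustrative and not part of the model's tooling.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product; equals cosine similarity for L2-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))


def recall_at_1(query_embs: Sequence[Vector], target_embs: Sequence[Vector]) -> float:
    """Fraction of queries whose best-scoring target is their own pair.

    Assumes query_embs[i] and target_embs[i] embed translations of each
    other and that every vector is L2-normalized.
    """
    if len(query_embs) != len(target_embs) or len(query_embs) == 0:
        raise ValueError("need two non-empty, equally sized embedding lists")
    hits = 0
    for i, query in enumerate(query_embs):
        scores = [dot(query, target) for target in target_embs]
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += int(best == i)
    return hits / len(query_embs)


# Toy 2-d vectors standing in for real 768-d embeddings (hypothetical data).
queries = [[1.0, 0.0], [0.0, 1.0], [0.8, -0.6]]
targets = [[0.8, 0.6], [0.0, 1.0], [1.0, 0.0]]
print(recall_at_1(queries, targets))
```

The function accepts plain lists as well as the arrays returned by `model.encode(..., normalize_embeddings=True)`.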
Quick Start
```python
from sentence_transformers import SentenceTransformer

# Downloads the model from the Hugging Face Hub on first use
model = SentenceTransformer("MWirelabs/ne-embed")

sentences = [
    "Where is the nearest hospital?",             # English
    "Ngi lah ia shong ha ki shnong baroh",        # Khasi
    "Pilakchin an·senganiko man·na am·tokenga.",  # Garo
]

# L2-normalize so that the dot product equals cosine similarity
embeddings = model.encode(sentences, normalize_embeddings=True)

# Pairwise similarity matrix of shape [3, 3]
similarities = model.similarity(embeddings, embeddings)
print(similarities)
```
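For retrieval, the same embeddings are typically used to rank a corpus against a query. A minimal sketch of that step, assuming cosine similarity as the ranking score; the `search` helper and its toy inputs are hypothetical and written in plain Python so they run without downloading the model:

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(
    query_emb: Vector,
    corpus_embs: Sequence[Vector],
    corpus_texts: Sequence[str],
    top_k: int = 3,
) -> List[Tuple[float, str]]:
    """Return the top_k (score, text) pairs, best match first."""
    scored = [
        (cosine(query_emb, emb), text)
        for emb, text in zip(corpus_embs, corpus_texts)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy 2-d vectors standing in for real embeddings (hypothetical data).
results = search(
    [1.0, 0.0],
    [[0.0, 1.0], [1.0, 0.0], [0.7, 0.7]],
    ["doc a", "doc b", "doc c"],
    top_k=2,
)
print(results)
```

With the real model, `corpus_embs` would come from `model.encode(corpus_texts)` and `query_emb` from `model.encode(query)`, so an English query can be matched against Khasi or Garo documents.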
Recommended for RAG / Hybrid Retrieval
```python
# Hybrid: NE-Embed dense + BM25 char 3-gram sparse
score = 0.7 * ne_embed_score + 0.3 * bm25_score
```
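The one-line formula leaves open how the two scores are produced and put on a common scale. The following is a minimal sketch, assuming a plain BM25 over lowercase character 3-grams, min-max normalization of the BM25 scores, and dense cosine scores that are already in a comparable range. The helper names (`char_ngrams`, `bm25_scores`, `hybrid_scores`) and the `k1`/`b` values are illustrative choices, not settings taken from this card.

```python
import math
from collections import Counter
from typing import List, Sequence


def char_ngrams(text: str, n: int = 3) -> List[str]:
    """Overlapping lowercase character n-grams of a string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def bm25_scores(query: str, docs: Sequence[str],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """BM25 score of each document for the query, over char 3-grams."""
    if not docs:
        return []
    doc_terms = [Counter(char_ngrams(d)) for d in docs]
    doc_lens = [sum(c.values()) for c in doc_terms]
    avg_len = sum(doc_lens) / len(docs) or 1.0
    # Document frequency: iterating a Counter yields each term once per doc.
    df = Counter(term for c in doc_terms for term in c)
    query_terms = set(char_ngrams(query))  # each query n-gram counted once
    scores = []
    for terms, length in zip(doc_terms, doc_lens):
        score = 0.0
        for term in query_terms:
            tf = terms.get(term, 0)
            if tf == 0:
                continue
            idf = math.log(1 + (len(docs) - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * length / avg_len))
        scores.append(score)
    return scores


def min_max(values: Sequence[float]) -> List[float]:
    """Rescale to [0, 1]; all zeros when every value is identical."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_scores(dense: Sequence[float], sparse: Sequence[float],
                  dense_weight: float = 0.7) -> List[float]:
    """Weighted sum of dense scores and min-max normalized sparse scores."""
    sparse_norm = min_max(sparse)
    return [dense_weight * d + (1 - dense_weight) * s
            for d, s in zip(dense, sparse_norm)]


docs = ["hospital near the market", "school bus timetable", "the hospital is closed"]
dense = [0.2, 0.9, 0.8]  # stand-in cosine scores; real ones come from NE-Embed
print(hybrid_scores(dense, bm25_scores("hospital", docs)))
```

Character n-grams avoid the need for a language-specific tokenizer, which is presumably why the card pairs them with BM25 for low-resource languages.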
Training
- Base model: `sentence-transformers/LaBSE`
- Loss: `MultipleNegativesRankingLoss`
- Data: 201,738 English↔NE language parallel pairs, capped at 25k per language to prevent Assamese attractor bias
- Epochs: 3 · Batch size: 64 · Max seq length: 128
- Hardware: 1× NVIDIA A40 (48 GB) · Training time: ~1.3 hours
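To illustrate the objective named in the list above, here is a plain-Python sketch of what a multiple-negatives ranking loss computes: for each English sentence, its own translation is the positive and every other translation in the batch serves as a negative. This is not the training code used for NE-Embed. The `scale` of 20 is an assumption chosen to mirror the sentence-transformers default, and `mnr_loss` is a hypothetical name.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mnr_loss(anchors: Sequence[Vector], positives: Sequence[Vector],
             scale: float = 20.0) -> float:
    """Mean cross-entropy with in-batch negatives.

    positives[i] is the correct match for anchors[i]; all other
    positives in the batch are treated as negatives for that anchor.
    """
    if len(anchors) != len(positives) or len(anchors) == 0:
        raise ValueError("need two non-empty, equally sized batches")
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)


# Toy 2-d batch (hypothetical data): aligned pairs vs. deliberately swapped pairs.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]
print(mnr_loss(anchors, aligned), mnr_loss(anchors, swapped))
```

The loss is small when each anchor is closest to its own translation and large when pairs are mismatched, which is the signal that pulls parallel sentences together during fine-tuning.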
Intended Uses
- Semantic search
- Dense retrieval
- RAG
- Cross-lingual retrieval
- Clustering
Citation
```bibtex
@misc{mwirelabs2026neembed,
  title        = {NE-Embed: Multilingual Text Embeddings for Northeast Indian Languages},
  author       = {MWire Labs},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/MWirelabs/ne-embed}},
  note         = {CC-BY-4.0}
}
```
Built with ♥ in Shillong, Meghalaya · MWire Labs · Part of the NE-Stack
NE-LID · NE-BERT · NE-Embed · Kren · Aganbo · Klam