Highlights

Supports 10 Northeast Indian languages
Optimized for semantic search, retrieval, and RAG
Up to 7× higher retrieval accuracy than raw LaBSE on low-resource languages
Built on sentence-transformers/LaBSE
Trained on 201,738 balanced English ↔ Northeast language parallel pairs
Released under the CC-BY-4.0 license

What is NE-Embed?

NE-Embed is a multilingual sentence embedding model designed for semantic understanding across Northeast Indian languages. It is optimized for semantic search, dense retrieval, Retrieval-Augmented Generation (RAG), and cross-lingual information retrieval, where general-purpose multilingual embedding models often perform poorly.

The model is fine-tuned from LaBSE on 201,738 balanced English↔Northeast language parallel pairs spanning 10 languages. It substantially improves retrieval quality for several low-resource languages, including Garo, Khasi, Nyishi, Pnar, and Kokborok, while maintaining strong multilingual alignment.

Why NE-Embed?

General-purpose multilingual embedding models are trained on hundreds of languages, but many Northeast Indian languages receive little or no representation during training. As a result, semantically similar sentences are often mapped far apart, leading to poor retrieval performance.

NE-Embed addresses this gap through targeted contrastive fine-tuning on balanced parallel data, producing embeddings that better capture semantic similarity for low-resource Northeast Indian languages while preserving multilingual compatibility.

Supported Languages

| Code | Language | Script | Tier | Training Pairs |
|---|---|---|---|---|
| `asm` | Assamese | Bengali | ✅ Supported | 25,000 |
| `brx` | Bodo | Devanagari | ✅ Supported | 25,000 |
| `grt` | Garo | Latin | ✅ Supported | 25,000 |
| `kha` | Khasi | Latin | ✅ Supported | 25,000 |
| `lus` | Mizo | Latin | ✅ Supported | 25,000 |
| `mni` | Meitei | Meitei Mayek | ✅ Supported | 25,000 |
| `njz` | Nyishi | Latin | ✅ Supported | 25,000 |
| `trp` | Kokborok | Latin | ⚠️ Limited | 12,545 |
| `pbv` | Pnar | Latin | ⚠️ Limited | 6,034 |
| `nag` | Nagamese | Latin | ⚠️ Limited | 1,996 |

Supported = strong retrieval performance. Limited = model has coverage but quality is lower; use with caution in production.

Performance

Evaluated on 500 samples per language. R@1 = Recall@1 in percent (higher is better). CLRI = Cross-Language Retrieval Interference (lower is better).

| Language | R@1 % (Base) | R@1 % (NE-Embed) | CLRI (Base) | CLRI (NE-Embed) |
|---|---|---|---|---|
| Assamese | 95.6 | 97.4 | 1.8% | 4.6% |
| Bodo | 55.8 | 99.8 | 61.0% | 3.0% |
| Garo | 13.2 | 90.8 | 88.8% | 3.0% |
| Khasi | 28.6 | 95.6 | 65.0% | 3.4% |
| Mizo | 46.6 | 91.8 | 58.4% | 9.4% |
| Meitei | 13.6 | 34.2 | 90.8% | 19.8% |
| Nyishi | 10.2 | 75.0 | 71.0% | 17.4% |
| Pnar | 27.2 | 86.2 | 79.6% | 8.0% |
| Kokborok | 26.4 | 71.6 | 63.8% | 11.8% |
| Nagamese | 77.0 | 88.0 | 17.8% | 8.4% |

Base = raw LaBSE, zero-shot. CLRI falls for every language except Assamese, where it rises from 1.8% to 4.6%. The reductions reflect genuine cross-lingual confusion that fine-tuning removed.
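
For readers who want to run a similar check on their own parallel data, Recall@1 can be sketched in a few lines of NumPy. This is a minimal illustration, not the NE-Embed evaluation code: the function name `recall_at_1` and the toy vectors are hypothetical, and it assumes row i of the query matrix is the translation of row i of the target matrix. The retrieval direction and candidate pool behind the reported numbers are not stated in this card.

```python
import numpy as np

def recall_at_1(query_emb: np.ndarray, target_emb: np.ndarray) -> float:
    """Fraction of queries whose highest-scoring target is the aligned one.

    Assumes row i of query_emb is the translation of row i of target_emb.
    """
    # L2-normalize so the dot product equals cosine similarity
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    t = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)
    sims = q @ t.T                # shape: (n_queries, n_targets)
    best = sims.argmax(axis=1)    # index of the nearest target per query
    return float((best == np.arange(len(q))).mean())

# Toy vectors made up for illustration (not model outputs).
# The third query points towards the second target, so it counts as a miss.
queries = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 1.0, 0.1]])
targets = np.array([[0.9, 0.1, 0.0],
                    [0.1, 0.9, 0.0],
                    [0.0, 0.0, 1.0]])

score = recall_at_1(queries, targets)
print(f"Recall@1 = {score:.3f}")
```

With real data, the two matrices would come from `model.encode(...)` applied to an English list and its aligned translations, in the same order.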

Quick Start

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("MWirelabs/ne-embed")

sentences = [
    "Where is the nearest hospital?",                          # English
    "Ngi lah ia shong ha ki shnong baroh",     # Khasi
    "Pilakchin an·senganiko man·na am·tokenga.",               # Garo
]

embeddings = model.encode(sentences, normalize_embeddings=True)
similarities = model.similarity(embeddings, embeddings)
print(similarities)

Recommended for RAG / Hybrid Retrieval

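# Hybrid: NE-Embed dense + BM25 char 3-gram sparse
# ne_embed_score and bm25_score are per-document scores for one query;
# BM25 scores are unbounded, so bring both to a comparable range before mixing
score = 0.7 * ne_embed_score + 0.3 * bm25_score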
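
The weighted sum above can be made concrete with a small, dependency-free sketch. The 0.7 / 0.3 weights and the character 3-gram BM25 come from the recommendation above. Everything else is an assumption for illustration: the helper names (`char_ngrams`, `bm25_scores`, `min_max`, `hybrid_scores`), the BM25 parameters `k1=1.5` and `b=0.75`, the min-max normalization step, and the placeholder dense scores, which stand in for NE-Embed cosine similarities.

```python
import math
from collections import Counter

def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(max(len(text) - n + 1, 1))]

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Okapi BM25 score of each document for the query, over char 3-grams."""
    tokenized = [char_ngrams(d) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    # Document frequency: in how many documents each n-gram appears
    df = Counter(g for toks in tokenized for g in set(toks))
    n_docs = len(docs)
    query_grams = set(char_ngrams(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        total = 0.0
        for g in query_grams:
            if g not in tf:
                continue
            idf = math.log(1 + (n_docs - df[g] + 0.5) / (df[g] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            total += idf * tf[g] * (k1 + 1) / (tf[g] + length_norm)
        scores.append(total)
    return scores

def min_max(xs: list[float]) -> list[float]:
    """Rescale scores to [0, 1]; all-equal inputs map to 0."""
    lo, hi = min(xs), max(xs)
    return [0.0 if hi == lo else (x - lo) / (hi - lo) for x in xs]

def hybrid_scores(dense: list[float], sparse: list[float], w_dense: float = 0.7) -> list[float]:
    """Weighted sum of normalized dense and sparse scores."""
    d, s = min_max(dense), min_max(sparse)
    return [w_dense * a + (1 - w_dense) * c for a, c in zip(d, s)]

docs = ["nearest hospital in town", "school opening hours", "bus timetable"]
query = "hospital nearby"

# Placeholder values standing in for NE-Embed cosine similarities
dense = [0.82, 0.31, 0.15]
sparse = bm25_scores(query, docs)
hybrid = hybrid_scores(dense, sparse)
ranking = sorted(range(len(docs)), key=lambda i: hybrid[i], reverse=True)
print([docs[i] for i in ranking])
```

Character n-grams avoid the need for a language-specific tokenizer, which is one plausible reason to prefer them for languages without mature tokenization tools.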

Training

Base model: sentence-transformers/LaBSE
Loss: MultipleNegativesRankingLoss
Data: 201,738 English↔NE language parallel pairs, capped at 25k per language to prevent Assamese attractor bias
Epochs: 3 · Batch size: 64 · Max seq length: 128
Hardware: 1× NVIDIA A40 (48 GB) · Training time: ~1.3 hours
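
To show what the loss optimizes, here is a NumPy sketch of the in-batch-negatives objective behind MultipleNegativesRankingLoss. For each English sentence, its paired translation is the positive and every other translation in the batch acts as a negative, so a batch of 64 gives each anchor 63 negatives. This is an illustration of the technique, not the training script: the function name `mnr_loss` and the random toy data are hypothetical, and `scale=20.0` mirrors the sentence-transformers default, which this card does not confirm was used.

```python
import numpy as np

def mnr_loss(anchor: np.ndarray, positive: np.ndarray, scale: float = 20.0) -> float:
    """Mean cross-entropy where pair (i, i) is the positive and every
    other row of `positive` serves as a negative for anchor i."""
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    p = positive / np.linalg.norm(positive, axis=1, keepdims=True)
    logits = scale * (a @ p.T)                            # scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct "class" for row i is column i
    return float(-np.mean(np.diag(log_probs)))

# Toy data: random vectors standing in for sentence embeddings
rng = np.random.default_rng(0)
english = rng.normal(size=(8, 16))
aligned = english + 0.01 * rng.normal(size=(8, 16))  # near-copies: well-aligned pairs
shuffled = aligned[::-1].copy()                      # every pair mismatched

loss_aligned = mnr_loss(english, aligned)
loss_shuffled = mnr_loss(english, shuffled)
print(f"aligned: {loss_aligned:.4f}  shuffled: {loss_shuffled:.4f}")
```

The loss is low when each sentence is closest to its own translation and high when pairs are mismatched, which is the behavior fine-tuning pushes the encoder towards.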

Intended Uses

Semantic search
Dense retrieval
RAG
Cross-lingual retrieval
Clustering

Citation

@misc{mwirelabs2026neembed,
  title        = {NE-Embed: Multilingual Text Embeddings for Northeast Indian Languages},
  author       = {MWire Labs},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/MWirelabs/ne-embed}},
  note         = {CC-BY-4.0}
}

Built with ♥ in Shillong, Meghalaya · MWire Labs · Part of the NE-Stack

NE-LID · NE-BERT · NE-Embed · Kren · Aganbo · Klam

Model size: 0.5B params · Tensor type: F32 · Format: Safetensors
