Sionic AI

comsat-embed-ko-8b-preview

comsat-embed-ko-8b-preview is a decoder-based embedding model developed by Sionic AI and optimized for Korean semantic retrieval. Trained on over 1M Korean examples, it encodes queries and documents into vectors so that the most relevant documents can be found by similarity. It targets real-world information retrieval scenarios such as document search, question answering, knowledge base retrieval, and enterprise semantic search, where accurate semantic matching of Korean text is essential.

Highlights

Korean-specialized: trained on 1M+ Korean examples and tuned for Korean search; reaches the highest average NDCG@10 (0.7930) among the compared models on the 9-subset MTEB Korean retrieval benchmark.
Long context: handles inputs up to 8,192 tokens, which suits long-document retrieval.
Instruction-aware queries: queries are encoded with a task-instruction prompt to improve retrieval quality; documents need no prefix.
High-dimensional embeddings: 4096-dimensional vectors, last-token pooled and L2-normalized, intended to be compared with cosine similarity.
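
Because the embeddings are L2-normalized, a plain dot product between a query vector and a document vector already equals their cosine similarity. The following is a minimal sketch of that identity in pure Python; the 4-dimensional vectors are made-up stand-ins for the model's 4096-dimensional output, and the helper names are illustrative rather than part of any library.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed directly from raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors (assumed values, not real model output).
query_vec = [0.2, -1.0, 0.5, 3.0]
doc_vec = [0.1, -0.8, 0.7, 2.5]

q_unit = l2_normalize(query_vec)
d_unit = l2_normalize(doc_vec)

# After normalization, the dot product is the cosine similarity.
assert math.isclose(dot(q_unit, d_unit), cosine(query_vec, doc_vec), rel_tol=1e-9)
print(round(dot(q_unit, d_unit), 4))
```

This is why the usage snippets can score query-document pairs with a single matrix product once the embeddings are normalized.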

Usage

First, install the Sentence Transformers library:

pip install -U sentence-transformers

Sentence Transformers Usage

⚠️ Queries must be encoded with the query prompt; documents are encoded without any prefix. (Skipping the query prompt slightly degrades retrieval quality.)

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sionic-ai/comsat-embed-ko-8b-preview")

queries  = ["한국의 수도는 어디인가?"]
passages = ["대한민국의 수도는 서울특별시이다."]

# Option 1) pass the query prompt explicitly (query only; documents get no prefix)
q_emb = model.encode(queries,  prompt_name="query", normalize_embeddings=True)
d_emb = model.encode(passages,                      normalize_embeddings=True)

# Option 2) sentence-transformers 5.x helper API (equivalent result)
# q_emb = model.encode_query(queries)
# d_emb = model.encode_document(passages)

scores = q_emb @ d_emb.T   # cosine similarity
print(scores)
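
With more than one passage per query, the score matrix is usually turned into a ranking. The sketch below shows one way to do that in plain Python; `top_k` is a hypothetical helper, and `toy_scores` is a made-up matrix of the shape that `scores.tolist()` would produce for 2 queries and 4 passages.

```python
def top_k(scores, k=3):
    """Return, for each query row, the k best (passage_index, score) pairs."""
    ranked = []
    for row in scores:
        # Sort passage indices by descending similarity.
        order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        ranked.append([(i, row[i]) for i in order[:k]])
    return ranked


# Hypothetical similarity matrix: 2 queries x 4 passages (values are invented).
toy_scores = [
    [0.12, 0.81, 0.40, 0.05],
    [0.66, 0.10, 0.09, 0.70],
]

for q_idx, hits in enumerate(top_k(toy_scores, k=2)):
    print(q_idx, hits)
```

For large corpora a vector index is normally used instead of a full sort; this sketch only illustrates how the scores map to a ranked list.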

Transformers Usage

# Requires transformers>=4.51.0

import torch
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def last_token_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'

# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'

queries = [
    get_detailed_instruct(task, '한국의 수도는 어디인가?'),
    get_detailed_instruct(task, '광합성은 어떻게 일어나는가?')
]
# No need to add instruction for retrieval documents
documents = [
    "대한민국의 수도는 서울특별시이다.",
    "광합성은 식물이 빛 에너지를 이용해 이산화탄소와 물로 포도당을 합성하는 과정이다."
]
input_texts = queries + documents

tokenizer = AutoTokenizer.from_pretrained('sionic-ai/comsat-embed-ko-8b-preview', padding_side='left')
model = AutoModel.from_pretrained('sionic-ai/comsat-embed-ko-8b-preview')

# We recommend enabling flash_attention_2 for better acceleration and memory saving.
# model = AutoModel.from_pretrained('sionic-ai/comsat-embed-ko-8b-preview', attn_implementation="flash_attention_2", torch_dtype=torch.bfloat16).cuda()

max_length = 8192

# Tokenize the input texts
batch_dict = tokenizer(
    input_texts,
    padding=True,
    truncation=True,
    max_length=max_length,
    return_tensors="pt",
)
batch_dict = batch_dict.to(model.device)

# Inference only: disable gradient tracking to save memory
with torch.no_grad():
    outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T)
print(scores.tolist())
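
The `last_token_pool` function above selects, for each sequence, the hidden state of its final real (non-padding) token. The same idea can be sketched without torch on nested lists. This is an illustrative re-implementation, not the model's code: `last_real_token`, `toy_hidden`, and the masks are invented for the example, and it assumes each mask marks one contiguous run of real tokens.

```python
def last_real_token(hidden_states, attention_mask):
    """Pick, per sequence, the hidden vector at the last position whose mask is 1."""
    pooled = []
    for states, mask in zip(hidden_states, attention_mask):
        last_idx = max(i for i, m in enumerate(mask) if m == 1)
        pooled.append(states[last_idx])
    return pooled


# Two sequences of length 4 with 2-dimensional hidden states (made-up numbers).
toy_hidden = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
    [[4.0, 4.0], [5.0, 5.0], [6.0, 6.0], [7.0, 7.0]],
]

# Left padding: real tokens sit at the end, so the final position is always chosen.
left_mask = [[0, 1, 1, 1], [1, 1, 1, 1]]
# Right padding: the first sequence's last real token is at index 2.
right_mask = [[1, 1, 1, 0], [1, 1, 1, 1]]

print(last_real_token(toy_hidden, left_mask))
print(last_real_token(toy_hidden, right_mask))
```

With left padding the selected position is the same for every sequence, which is why the torch version can take the fast path `last_hidden_states[:, -1]`.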

Korean Retrieval Benchmark

LawIRKo: A Korean legal-domain retrieval dataset for finding statutes and precedents relevant to legal queries.
SQuADKorV1Retrieval: A Korean Wikipedia passage retrieval dataset based on Korean SQuAD v1.
AutoRAGRetrieval: A Korean document retrieval dataset constructed by parsing PDFs from five domains: finance, public, medical, legal, and commerce.
Ko-StrategyQA: A Korean open-domain question answering (ODQA) multi-hop retrieval dataset, translated from StrategyQA.
PublicHealthQA: A retrieval dataset focused on medical and public health domains in Korean.
BelebeleRetrieval: A Korean document retrieval dataset based on FLORES-200.
MultiLongDocRetrieval: A long-document retrieval dataset covering various domains in Korean.
MIRACLRetrieval: A Korean document retrieval dataset based on Wikipedia.
MrTidyRetrieval: A Wikipedia-based Korean document retrieval dataset.

Performance (MTEB Korean Retrieval, NDCG@10)

All scores are NDCG@10 on the full corpus, measured with the standard MTEB evaluation pipeline. For multilingual tasks the Korean subset is used (MLDR=ko, MIRACL/MrTidy=ko, Belebele=kor-kor).
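
For readers unfamiliar with the metric, NDCG@10 rewards rankings that place relevant documents near the top of the first 10 results. The sketch below is a simplified single-query version, assuming linear gain (relevance divided by log2 of rank + 1); the reported numbers come from the MTEB pipeline itself, not from this code, and `ndcg_at_k` and `toy_relevance` are illustrative names.

```python
import math


def ndcg_at_k(ranked_doc_ids, relevance, k=10):
    """NDCG@k with linear gain: DCG = sum(rel_i / log2(i + 1)) over ranks i = 1..k."""
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1)
    )
    # Ideal DCG: the best achievable ordering of the judged documents.
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical judgments: two relevant documents for one query.
toy_relevance = {"d1": 1, "d2": 1}

perfect = ndcg_at_k(["d1", "d2", "d9"], toy_relevance)  # both relevant docs ranked first
partial = ndcg_at_k(["d9", "d1", "d7"], toy_relevance)  # one relevant doc at rank 2, one missed
print(perfect, round(partial, 4))
```

A benchmark score is the mean of this per-query value over all queries in a subset.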

Model	Avg	MIRACL	MrTidy	MLDR	AutoRAG	Ko-StrategyQA	PublicHealthQA	Belebele	SQuADKorV1	LawIRKo
comsat-embed-ko-8b-preview	0.7930	0.6964	0.6253	0.5183	0.8518	0.8394	0.8871	0.9853	0.9168	0.8164
Qwen/Qwen3-Embedding-8B	0.7825	0.6783	0.6187	0.5036	0.8276	0.8363	0.8721	0.9828	0.9063	0.8171
Qwen/Qwen3-Embedding-4B	0.7718	0.6803	0.6076	0.4895	0.8431	0.8270	0.8693	0.9479	0.9044	0.7769
upstage/solar-embedding-1-large	0.7674	0.6703	0.5766	0.3850	0.8833	0.8366	0.8787	0.9684	0.9521	0.7557
microsoft/harrier-oss-v1-27b	0.7669	0.6653	0.5306	0.4073	0.8176	0.8361	0.8971	0.9538	0.9204	0.8737
dragonkue/snowflake-arctic-embed-l-v2.0-ko	0.7636	0.6685	0.5712	0.4150	0.9093	0.8050	0.8337	0.9518	0.9447	0.7735
codefuse-ai/F2LLM-v2-8B	0.7621	0.6311	0.6162	0.3950	0.7678	0.8371	0.9332	0.9509	0.8874	0.8405
nlpai-lab/KURE-v1	0.7603	0.6816	0.5909	0.4521	0.8708	0.7999	0.8193	0.9502	0.9357	0.7426
telepix/PIXIE-Rune-v1.5	0.7602	0.6393	0.5492	0.4340	0.8927	0.8064	0.8426	0.9617	0.9457	0.7705
nvidia/llama-nemotron-embed-vl-1b-v2	0.7579	0.6975	0.5998	0.3704	0.8773	0.8084	0.8223	0.9584	0.9360	0.7513
dragonkue/BGE-m3-ko	0.7534	0.6833	0.6099	0.3784	0.8738	0.7959	0.8155	0.9503	0.9414	0.7322
BAAI/bge-m3	0.7508	0.7015	0.6471	0.4273	0.8301	0.7941	0.8041	0.9316	0.9038	0.7174
intfloat/multilingual-e5-large	0.7333	0.6649	0.6421	0.2708	0.8134	0.8035	0.8253	0.9450	0.9056	0.7293
nlpai-lab/KoE5	0.7329	0.6235	0.5841	0.2942	0.8434	0.8001	0.8351	0.9425	0.8980	0.7756

Avg is the mean over the 9 subsets (higher is better). Reproduction: evaluated with the MTEB retrieval pipeline (NDCG@10, full corpus); the query prompt is applied to queries only (documents get no prefix).
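
As a quick sanity check, the Avg column can be recomputed from the per-subset scores. The snippet below does this for the comsat-embed-ko-8b-preview row, with the values copied from the table above; the dictionary name is arbitrary.

```python
# Per-subset NDCG@10 scores copied from the comsat-embed-ko-8b-preview row.
comsat_scores = {
    "MIRACL": 0.6964,
    "MrTidy": 0.6253,
    "MLDR": 0.5183,
    "AutoRAG": 0.8518,
    "Ko-StrategyQA": 0.8394,
    "PublicHealthQA": 0.8871,
    "Belebele": 0.9853,
    "SQuADKorV1": 0.9168,
    "LawIRKo": 0.8164,
}

# Unweighted mean over the 9 subsets.
avg = sum(comsat_scores.values()) / len(comsat_scores)
print(f"{avg:.4f}")  # prints 0.7930
```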

License

Model weights: cc-by-nc-4.0 (non-commercial use).

Model size: 8B params
Tensor type: BF16 (Safetensors)

Model tree for sionic-ai/comsat-embed-ko-8b-preview

Base model: Qwen/Qwen3-8B-Base
Finetuned: Qwen/Qwen3-Embedding-8B
Finetuned: this model