Sionic AI

comsat-embed-ko-8b-preview

comsat-embed-ko-8b-preview is a decoder-based embedding model developed by Sionic AI and optimized for Korean semantic retrieval. Trained on over 1M Korean retrieval-oriented examples, it encodes queries and documents into vectors so that the most relevant documents can be found by similarity. It targets real-world information retrieval scenarios such as document search, question answering, knowledge base retrieval, and enterprise semantic search, where accurate semantic matching of Korean text is essential.

Highlights

  • Korean-specialized: trained on 1M+ Korean examples and tuned for Korean search; achieves the highest average NDCG@10 (0.7930) among the compared models on the 9-subset MTEB Korean retrieval benchmark.
  • Long context: handles inputs up to 8,192 tokens, well suited to long-document retrieval.
  • Instruction-aware queries: queries are encoded with a task-instruction prompt to improve retrieval quality; documents need no prefix.
  • High-dimensional embeddings: 4096-dimensional vectors, last-token pooled and L2-normalized, and compared by cosine similarity.
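
Because the embeddings are L2-normalized, cosine similarity reduces to a plain dot product, which is why the usage examples score with a matrix product. A minimal pure-Python sketch with made-up 3-dimensional vectors (real embeddings are 4096-dimensional; the helper names are illustrative and not part of any library):

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors standing in for a query and a document embedding (assumed values).
query_vec = [3.0, 4.0, 0.0]
doc_vec = [4.0, 3.0, 0.0]

score_cosine = cosine(query_vec, doc_vec)
score_dot = dot(l2_normalize(query_vec), l2_normalize(doc_vec))

# Both values agree up to floating-point rounding.
print(round(score_cosine, 4), round(score_dot, 4))
```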

Usage

First, install the Sentence Transformers library:

pip install -U sentence-transformers

Sentence Transformers Usage

⚠️ Queries must be encoded with the query prompt; documents are encoded without any prefix. (Skipping the query prompt slightly degrades retrieval quality.)

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sionic-ai/comsat-embed-ko-8b-preview")

queries  = ["한국의 수도는 어디인가?"]  # "What is the capital of Korea?"
passages = ["대한민국의 수도는 서울특별시이다."]  # "The capital of the Republic of Korea is Seoul Special City."

# Option 1) pass the query prompt explicitly (query only; documents get no prefix)
q_emb = model.encode(queries,  prompt_name="query", normalize_embeddings=True)
d_emb = model.encode(passages,                      normalize_embeddings=True)

# Option 2) sentence-transformers 5.x helper API (equivalent result)
# q_emb = model.encode_query(queries)
# d_emb = model.encode_document(passages)

scores = q_emb @ d_emb.T   # cosine similarity
print(scores)
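
With several passages, retrieval comes down to sorting each query's row of scores. A small sketch, assuming `scores` is a queries-by-passages nested list of similarity values shaped like the matrix printed above; the numbers and the `top_k` helper are invented for illustration:

```python
def top_k(score_row, k):
    """Return (passage_index, score) pairs for the k highest scores in one row."""
    ranked = sorted(enumerate(score_row), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical similarity scores: 2 queries x 4 passages.
scores = [
    [0.12, 0.81, 0.33, 0.47],
    [0.64, 0.05, 0.71, 0.29],
]

# One ranked list of hits per query.
results = [top_k(row, k=2) for row in scores]
for query_index, hits in enumerate(results):
    print(query_index, hits)
```

If the scores come from the Sentence Transformers example as a NumPy array, `scores.tolist()` yields the same nested-list shape.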

Transformers Usage

# Requires transformers>=4.51.0

import torch
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def last_token_pool(last_hidden_states: Tensor,
                    attention_mask: Tensor) -> Tensor:
    # With left padding, every sequence ends at the final position.
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        # With right padding, index each row at its last non-padding token.
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'

# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'

queries = [
    get_detailed_instruct(task, '한국의 수도는 어디인가?'),  # "What is the capital of Korea?"
    get_detailed_instruct(task, '광합성은 어떻게 일어나는가?')  # "How does photosynthesis occur?"
]
# No need to add instruction for retrieval documents
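documents = [
    # "The capital of the Republic of Korea is Seoul Special City."
    "대한민국의 수도는 서울특별시이다.",
    # "Photosynthesis is the process by which plants use light energy to synthesize glucose from carbon dioxide and water."
    "광합성은 식물이 빛 에너지를 이용해 이산화탄소와 물로 포도당을 합성하는 과정이다."
]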
input_texts = queries + documents

tokenizer = AutoTokenizer.from_pretrained('sionic-ai/comsat-embed-ko-8b-preview', padding_side='left')
model = AutoModel.from_pretrained('sionic-ai/comsat-embed-ko-8b-preview')

# We recommend enabling flash_attention_2 for better acceleration and memory saving.
# model = AutoModel.from_pretrained('sionic-ai/comsat-embed-ko-8b-preview', attn_implementation="flash_attention_2", torch_dtype=torch.bfloat16).cuda()

max_length = 8192

# Tokenize the input texts
batch_dict = tokenizer(
    input_texts,
    padding=True,
    truncation=True,
    max_length=max_length,
    return_tensors="pt",
)
batch_dict.to(model.device)
with torch.no_grad():  # inference only, so skip gradient tracking
    outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T)
print(scores.tolist())
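
`last_token_pool` has two branches because the position of the final real token depends on the padding side. The toy sketch below mirrors that index logic with plain Python lists instead of tensors. The token labels and the helpers `last_token_index` and `pooled_tokens` are illustrative only, and the check is done per row rather than per batch, which gives the same result for contiguous masks:

```python
def last_token_index(mask_row):
    """Index of the last real (non-padding) token in one attention-mask row."""
    if mask_row[-1] == 1:
        # Left padding (or no padding): the final position holds a real token.
        return len(mask_row) - 1
    # Right padding: real tokens come first, so count them and subtract one.
    return sum(mask_row) - 1

def pooled_tokens(batch):
    """Pick the token at the last real position of every sequence in the batch."""
    return [row[last_token_index(mask)]
            for row, mask in zip(batch["tokens"], batch["mask"])]

# Two sequences with 2 and 4 real tokens, padded to length 4.
left_padded = {
    "tokens": [["<pad>", "<pad>", "a1", "a2"], ["b1", "b2", "b3", "b4"]],
    "mask": [[0, 0, 1, 1], [1, 1, 1, 1]],
}
right_padded = {
    "tokens": [["a1", "a2", "<pad>", "<pad>"], ["b1", "b2", "b3", "b4"]],
    "mask": [[1, 1, 0, 0], [1, 1, 1, 1]],
}

left_result = pooled_tokens(left_padded)
right_result = pooled_tokens(right_padded)

# Either padding side selects the same final real token per sequence.
print(left_result, right_result)
```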

Korean Retrieval Benchmark

  • LawIRKo: A Korean legal-domain retrieval dataset for finding statutes and precedents relevant to legal queries.
  • SQuADKorV1Retrieval: A Korean Wikipedia passage retrieval dataset based on Korean SQuAD v1.
  • AutoRAGRetrieval: A Korean document retrieval dataset constructed by parsing PDFs from five domains: finance, public, medical, legal, and commerce.
  • Ko-StrategyQA: A Korean open-domain question answering (ODQA) multi-hop retrieval dataset, translated from StrategyQA.
  • PublicHealthQA: A retrieval dataset focused on medical and public health domains in Korean.
  • BelebeleRetrieval: A Korean document retrieval dataset based on FLORES-200.
  • MultiLongDocRetrieval: A long-document retrieval dataset covering various domains in Korean.
  • MIRACLRetrieval: A Korean document retrieval dataset based on Wikipedia.
  • MrTidyRetrieval: A Wikipedia-based Korean document retrieval dataset.

Performance (MTEB Korean Retrieval, NDCG@10)

All scores are NDCG@10 on the full corpus, measured with the standard MTEB evaluation pipeline. For multilingual tasks the Korean subset is used (MLDR=ko, MIRACL/MrTidy=ko, Belebele=kor-kor).
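
For readers unfamiliar with the metric, NDCG@10 rewards placing relevant documents near the top of the first ten results. The sketch below uses the common linear-gain form, DCG = sum of rel_i / log2(i + 1) over ranks i = 1..k, which is an assumption about the exact variant; the MTEB pipeline remains the authoritative implementation. The relevance labels are invented, and the input is assumed to hold labels for the whole ranked candidate list:

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k ranked relevance labels."""
    # enumerate() is 0-based, so rank + 2 gives log2(i + 1) for 1-based rank i.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical rankings: 1 = relevant, 0 = not relevant, listed in ranked order.
perfect_ranking = [1, 1, 0, 0, 0]
shifted_ranking = [0, 1, 1, 0, 0]

perfect_score = ndcg_at_k(perfect_ranking)
shifted_score = ndcg_at_k(shifted_ranking)

# The shifted ranking scores lower because its relevant hits sit further down.
print(perfect_score, shifted_score)
```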

| Model | Avg | MIRACL | MrTidy | MLDR | AutoRAG | Ko-StrategyQA | PublicHealthQA | Belebele | SQuADKorV1 | LawIRKo |
|---|---|---|---|---|---|---|---|---|---|---|
| comsat-embed-ko-8b-preview | 0.7930 | 0.6964 | 0.6253 | 0.5183 | 0.8518 | 0.8394 | 0.8871 | 0.9853 | 0.9168 | 0.8164 |
| Qwen/Qwen3-Embedding-8B | 0.7825 | 0.6783 | 0.6187 | 0.5036 | 0.8276 | 0.8363 | 0.8721 | 0.9828 | 0.9063 | 0.8171 |
| Qwen/Qwen3-Embedding-4B | 0.7718 | 0.6803 | 0.6076 | 0.4895 | 0.8431 | 0.8270 | 0.8693 | 0.9479 | 0.9044 | 0.7769 |
| upstage/solar-embedding-1-large | 0.7674 | 0.6703 | 0.5766 | 0.3850 | 0.8833 | 0.8366 | 0.8787 | 0.9684 | 0.9521 | 0.7557 |
| microsoft/harrier-oss-v1-27b | 0.7669 | 0.6653 | 0.5306 | 0.4073 | 0.8176 | 0.8361 | 0.8971 | 0.9538 | 0.9204 | 0.8737 |
| dragonkue/snowflake-arctic-embed-l-v2.0-ko | 0.7636 | 0.6685 | 0.5712 | 0.4150 | 0.9093 | 0.8050 | 0.8337 | 0.9518 | 0.9447 | 0.7735 |
| codefuse-ai/F2LLM-v2-8B | 0.7621 | 0.6311 | 0.6162 | 0.3950 | 0.7678 | 0.8371 | 0.9332 | 0.9509 | 0.8874 | 0.8405 |
| nlpai-lab/KURE-v1 | 0.7603 | 0.6816 | 0.5909 | 0.4521 | 0.8708 | 0.7999 | 0.8193 | 0.9502 | 0.9357 | 0.7426 |
| telepix/PIXIE-Rune-v1.5 | 0.7602 | 0.6393 | 0.5492 | 0.4340 | 0.8927 | 0.8064 | 0.8426 | 0.9617 | 0.9457 | 0.7705 |
| nvidia/llama-nemotron-embed-vl-1b-v2 | 0.7579 | 0.6975 | 0.5998 | 0.3704 | 0.8773 | 0.8084 | 0.8223 | 0.9584 | 0.9360 | 0.7513 |
| dragonkue/BGE-m3-ko | 0.7534 | 0.6833 | 0.6099 | 0.3784 | 0.8738 | 0.7959 | 0.8155 | 0.9503 | 0.9414 | 0.7322 |
| BAAI/bge-m3 | 0.7508 | 0.7015 | 0.6471 | 0.4273 | 0.8301 | 0.7941 | 0.8041 | 0.9316 | 0.9038 | 0.7174 |
| intfloat/multilingual-e5-large | 0.7333 | 0.6649 | 0.6421 | 0.2708 | 0.8134 | 0.8035 | 0.8253 | 0.9450 | 0.9056 | 0.7293 |
| nlpai-lab/KoE5 | 0.7329 | 0.6235 | 0.5841 | 0.2942 | 0.8434 | 0.8001 | 0.8351 | 0.9425 | 0.8980 | 0.7756 |

Avg is the mean over the 9 subsets (higher is better). Reproduction: evaluated with the MTEB retrieval pipeline (NDCG@10, full corpus); the query prompt is applied to queries only (documents get no prefix).
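
The Avg column can be checked directly from the per-task scores. As a quick sanity check, the snippet below recomputes it for the comsat-embed-ko-8b-preview row, with the values copied from the table above:

```python
# Per-task NDCG@10 for comsat-embed-ko-8b-preview, copied from the table.
task_scores = {
    "MIRACL": 0.6964,
    "MrTidy": 0.6253,
    "MLDR": 0.5183,
    "AutoRAG": 0.8518,
    "Ko-StrategyQA": 0.8394,
    "PublicHealthQA": 0.8871,
    "Belebele": 0.9853,
    "SQuADKorV1": 0.9168,
    "LawIRKo": 0.8164,
}

# Unweighted mean over the 9 subsets.
average = sum(task_scores.values()) / len(task_scores)
print(f"{average:.4f}")
```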

License

  • Model weights: cc-by-nc-4.0 (non-commercial use).