Automatically add EOS via Tokenizer, add Sentence Transformers snippet

#2
by tomaarsen - opened

Hello!

Congratulations on the model releases!

Pull Request overview

  • Automatically add an EOS token at the end of each tokenized input (see the quick check below)
  • Add Sentence Transformers snippets
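
As a quick check of the first item (a minimal sketch; it just loads the tokenizer from this PR revision and inspects the last token id):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("codefuse-ai/F2LLM-0.6B", revision="refs/pr/2")

# With this PR, the EOS token is appended automatically, no extra arguments needed
input_ids = tokenizer("What is F2LLM used for?")["input_ids"]
print(input_ids[-1] == tokenizer.eos_token_id)
# True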

Details

You can use the following snippets; note the 'revision' argument, which loads the model directly from this PR:

With Sentence Transformers

To encode text using F2LLM with the Sentence Transformers library:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("codefuse-ai/F2LLM-0.6B", model_kwargs={"torch_dtype": "bfloat16"}, revision="refs/pr/2")

# Some sample query and documents
query = "What is F2LLM used for?"
documents = [
    'We present F2LLM, a family of fully open embedding LLMs that achieve a strong balance between model size, training data, and embedding performance.',
    'Model checkpoints, training datasets, and training code are released, positioning F2LLM as a strong, reproducible, and budget-friendly baseline for future research in text embedding models.',
    'F2LLM is a model for computing text embeddings that can be used for various NLP tasks such as information retrieval, semantic search, and text classification.'
]

# Encode the query and documents separately; the encode_query method applies the query prompt
query_embedding = model.encode_query(query)
document_embeddings = model.encode_document(documents)
print(query_embedding.shape, document_embeddings.shape)
# (1024,) (3, 1024)

# Compute cosine similarity between the query and documents
similarity = model.similarity(query_embedding, document_embeddings)
print(similarity)
# tensor([[0.5132, 0.5376, 0.8017]])
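
The query prompt that encode_query applies is stored on the model itself. If you want to inspect it (a small sketch; it assumes the prompt is registered under the standard "query" key, matching the prompt string shown in the Transformers snippet below):

print(model.prompts["query"])
# Instruct: Given a web search query, retrieve relevant passages that answer the query
# Query: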

With Transformers

Or directly with the Transformers library:

from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F


model_path = "codefuse-ai/F2LLM-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_path, revision="refs/pr/2")
model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map={'': 0}, revision="refs/pr/2")

query = "What is F2LLM used for?"
query_prompt = "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery:"
documents = [
    'We present F2LLM, a family of fully open embedding LLMs that achieve a strong balance between model size, training data, and embedding performance.',
    'Model checkpoints, training datasets, and training code are released, positioning F2LLM as a strong, reproducible, and budget-friendly baseline for future research in text embedding models.',
    'F2LLM is a model for computing text embeddings that can be used for various NLP tasks such as information retrieval, semantic search, and text classification.'
]

def encode(sentences):
    batch_size = len(sentences)
    # Tokenize with padding; thanks to the tokenizer change, EOS is appended automatically
    tokenized_inputs = tokenizer(sentences, padding=True, return_tensors='pt').to(model.device)
    last_hidden_state = model(**tokenized_inputs).last_hidden_state
    # Last-token pooling: the EOS sits at the last non-padding position
    # (attention_mask.sum(dim=1) - 1 assumes right-side padding)
    eos_positions = tokenized_inputs.attention_mask.sum(dim=1) - 1
    embeddings = last_hidden_state[torch.arange(batch_size, device=model.device), eos_positions]
    # L2-normalize so a plain dot product equals cosine similarity
    embeddings = F.normalize(embeddings, p=2, dim=1)
    return embeddings

# Encode the query and documents
query_embedding = encode([query_prompt + query])
document_embeddings = encode(documents)
print(query_embedding.shape, document_embeddings.shape)
# torch.Size([1, 1024]) torch.Size([3, 1024])

# Compute cosine similarity between the query and documents
similarity = query_embedding @ document_embeddings.T
print(similarity)
# tensor([[0.5039, 0.5312, 0.7930]], device='cuda:0', dtype=torch.bfloat16,
#        grad_fn=<MmBackward0>)

The change to the tokenizer means that the EOS token is included automatically, which both simplifies the Transformers code and enables integration with Sentence Transformers and related libraries. Note that there's a small difference in outputs between Sentence Transformers and pure Transformers: it is caused by bf16 and should disappear if you use fp32. I wasn't able to remove this discrepancy after quite a while of trying, but either way, the results from both Sentence Transformers and Transformers remain close to the fp32 outputs.
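
To check the fp32 behaviour yourself, here is a minimal sketch; the only change from the snippets above is dropping the bfloat16 override, so the weights load in the default fp32:

from sentence_transformers import SentenceTransformer

# Omitting the torch_dtype override loads the weights in fp32
model = SentenceTransformer("codefuse-ai/F2LLM-0.6B", revision="refs/pr/2")

query_embedding = model.encode_query("What is F2LLM used for?")
document_embeddings = model.encode_document([
    'F2LLM is a model for computing text embeddings that can be used for various NLP tasks such as information retrieval, semantic search, and text classification.'
])
print(model.similarity(query_embedding, document_embeddings))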

I also added snippets for usage with Sentence Transformers to the README.

cc @Geralt-Targaryen

  • Tom Aarsen
tomaarsen changed pull request status to open
