---
license: apache-2.0
language:
- code
library_name: peft
tags:
- llm2vec
- mntp
- decoder-only
- pre-training
- qwen2.5-coder
---

## 📖 Are Decoder-Only Large Language Models the Silver Bullet for Code Search?

This model is an official artifact from our research paper: **"[Are Decoder-Only Large Language Models the Silver Bullet for Code Search?](https://arxiv.org/abs/2410.22240)"**.

In this work, we conduct a large-scale systematic evaluation of decoder-only Large Language Models for the task of code search and present a set of effective fine-tuning and optimization strategies.

For complete details on all our experiments, to reproduce the full training/evaluation pipeline, or to use other models from the paper, please visit our official GitHub repository:

➡️ **[GitHub: Georgepitt/DecoderLLMs-CodeSearch](https://github.com/Georgepitt/DecoderLLMs-CodeSearch)**

---

## Model Card: Qwen2.5-Coder-3B-Instruct - MNTP Pre-trained Model

### 📜 Model Description

This is a PEFT adapter for the **`Qwen/Qwen2.5-Coder-3B-Instruct`** model, pre-trained with the **Masked Next Token Prediction (MNTP)** objective from the [llm2vec](https://github.com/McGill-NLP/llm2vec) framework.

**Important Note on its Role**:
This model is **not intended for direct downstream task evaluation**. Instead, it serves as a crucial **foundational prerequisite** for our supervised fine-tuned (SupCon) models. The MNTP pre-training enables the decoder-only model to learn bidirectional representations, which is an essential step before applying supervised contrastive learning.
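To make this pipeline concrete, the sketch below shows the usual `llm2vec` two-stage adapter pattern: the MNTP adapter is applied and merged into the base weights first, and a SupCon adapter is then loaded on top. The SupCon adapter id used here is a **placeholder**, not a real repository name; the actual SupCon models are listed in the GitHub repository above.

```python
import torch
from transformers import AutoModel, AutoTokenizer, AutoConfig
from peft import PeftModel

base_model_id = "Qwen/Qwen2.5-Coder-3B-Instruct"
mntp_adapter_id = "SYSUSELab/DCS-Qwen2.5-Coder-3B-It-MNTP"
supcon_adapter_id = "<your-supcon-adapter>"  # placeholder: pick the matching SupCon repo

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
config = AutoConfig.from_pretrained(base_model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(base_model_id, trust_remote_code=True, config=config,
                                  torch_dtype=torch.bfloat16, device_map="auto")

# Stage 1: apply the MNTP adapter and merge it into the base weights.
model = PeftModel.from_pretrained(model, mntp_adapter_id)
model = model.merge_and_unload()

# Stage 2: load the SupCon adapter on top of the MNTP-adapted weights.
model = PeftModel.from_pretrained(model, supcon_adapter_id)
```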

### 🚀 How to Use

#### Standalone Use (for Base Embeddings)

You can use this MNTP model on its own to generate code or text embeddings.

```python
import torch
from transformers import AutoTokenizer, AutoModel, AutoConfig
from peft import PeftModel
from llm2vec import LLM2Vec

base_model_id = "Qwen/Qwen2.5-Coder-3B-Instruct"
mntp_model_id = "SYSUSELab/DCS-Qwen2.5-Coder-3B-It-MNTP"

# Load the base decoder-only model and apply the MNTP PEFT adapter on top of it.
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
config = AutoConfig.from_pretrained(base_model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(base_model_id, trust_remote_code=True, config=config,
                                  torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, mntp_model_id)

# Wrap the adapted model with llm2vec to obtain mean-pooled embeddings.
l2v = LLM2Vec(model, tokenizer, pooling_mode="mean", max_length=512)
embeddings = l2v.encode(["def hello_world():\n    print('Hello, World!')"])
print("Embedding from MNTP model:", embeddings.shape)
```
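
Once `l2v` is built as above, the embeddings can be used directly for code search: encode a natural-language query and a set of candidate code snippets, then rank the snippets by cosine similarity. This is a minimal illustration (the query and snippets are invented for the example), assuming `l2v.encode` returns torch tensors as in the `llm2vec` usage examples.

```python
import torch
import torch.nn.functional as F

query = ["read a file line by line"]
snippets = [
    "def read_lines(path):\n    with open(path) as f:\n        return f.readlines()",
    "def add(a, b):\n    return a + b",
]

# Encode the query and the candidate code snippets with the same model.
q_emb = l2v.encode(query)      # shape: (1, hidden_dim)
c_emb = l2v.encode(snippets)   # shape: (len(snippets), hidden_dim)

# Cosine similarity = dot product of L2-normalized embeddings.
scores = F.normalize(q_emb, dim=-1) @ F.normalize(c_emb, dim=-1).T
best = torch.argmax(scores, dim=-1).item()
print("Best match:", snippets[best])
```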

### ⚙️ Training Methodology

This model was pre-trained using the **MNTP** objective as described in the `llm2vec` paper. If you wish to train your own MNTP model from scratch, please refer to the instructions in the `Fine-tuning/Fine-tuning_method/MNTP/` directory of our GitHub repository.
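
For intuition only, here is a simplified sketch of what the MNTP objective computes, following the description in the `llm2vec` paper: a fraction of input tokens is masked, and each masked token at position *i* is predicted from the model's logits at position *i-1*, which is where a causal decoder places its prediction for that token. This is not the training script from the repository; the masking ratio and other details are illustrative.

```python
import torch
import torch.nn.functional as F

def mntp_loss(model, input_ids, mask_token_id, mask_prob=0.2):
    """Simplified MNTP loss. `model` is a causal LM with an LM head
    (e.g., loaded via AutoModelForCausalLM); `input_ids` is (batch, seq)."""
    labels = input_ids.clone()

    # Randomly mask a fraction of tokens (never the first position,
    # since it has no preceding token to predict it from).
    mask = torch.rand(input_ids.shape, device=input_ids.device) < mask_prob
    mask[:, 0] = False
    masked_inputs = input_ids.masked_fill(mask, mask_token_id)

    logits = model(input_ids=masked_inputs).logits  # (batch, seq, vocab)

    # A causal decoder predicts token i from position i-1, so align
    # logits at positions 0..seq-2 with labels at positions 1..seq-1.
    pred_logits = logits[:, :-1, :]
    target_ids = labels[:, 1:]
    masked_positions = mask[:, 1:]

    # Cross-entropy only over the masked positions.
    return F.cross_entropy(pred_logits[masked_positions], target_ids[masked_positions])
```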

### 📄 Citation

If you use this model, please cite both our paper and the foundational work of `llm2vec`.

```bibtex
@article{chen2024decoder,
  title={Are Decoder-Only Large Language Models the Silver Bullet for Code Search?},
  author={Chen, Yuxuan and Liu, Mingwei and Ou, Guangsheng and Li, Anji and Dai, Dekun and Wang, Yanlin and Zheng, Zibin},
  journal={arXiv preprint arXiv:2410.22240},
  year={2024}
}

@article{behnamghader2024llm2vec,
  title={LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders},
  author={BehnamGhader, Parishad and Adlakha, Vaibhav and Mosbach, Marius and Bahdanau, Dzmitry and Chapados, Nicolas and Reddy, Siva},
  journal={arXiv preprint arXiv:2404.05961},
  year={2024}
}
```