---
license: apache-2.0
language:
- code
library_name: peft
tags:
- llm2vec
- mntp
- decoder-only
- pre-training
- qwen2.5-coder
---

## 📖 Are Decoder-Only Large Language Models the Silver Bullet for Code Search?

This model is an official artifact from our research paper: **"[Are Decoder-Only Large Language Models the Silver Bullet for Code Search?](https://arxiv.org/abs/2410.22240)"**.

In this work, we conduct a large-scale systematic evaluation of decoder-only Large Language Models for the task of code search and present a set of effective fine-tuning and optimization strategies.

For complete details on all our experiments, to reproduce the full training/evaluation pipeline, or to use other models from the paper, please visit our official GitHub repository:

➡️ **[GitHub: Georgepitt/DecoderLLMs-CodeSearch](https://github.com/Georgepitt/DecoderLLMs-CodeSearch)**

---

## Model Card: Qwen2.5-Coder-3B-Instruct - MNTP Pre-trained Model

### 📜 Model Description

This is a PEFT adapter for the **`Qwen/Qwen2.5-Coder-3B-Instruct`** model, pre-trained with the **Masked Next Token Prediction (MNTP)** objective from the [llm2vec](https://github.com/McGill-NLP/llm2vec) framework.

**Important Note on its Role**:
This model is **not intended for direct downstream task evaluation**. Instead, it serves as a crucial **foundational prerequisite** for our supervised fine-tuned (SupCon) models. The MNTP pre-training enables the decoder-only model to learn bidirectional representations, which is an essential step before applying supervised contrastive learning.
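To make this pipeline concrete, the sketch below shows the usual `llm2vec` two-stage adapter pattern: the MNTP adapter is applied and merged into the base weights first, and a SupCon adapter is then loaded on top. The SupCon adapter id used here is a **placeholder**, not a real repository name; the actual SupCon models are listed in the GitHub repository above.

```python
import torch
from transformers import AutoModel, AutoTokenizer, AutoConfig
from peft import PeftModel

base_model_id = "Qwen/Qwen2.5-Coder-3B-Instruct"
mntp_adapter_id = "SYSUSELab/DCS-Qwen2.5-Coder-3B-It-MNTP"
supcon_adapter_id = "<your-supcon-adapter>"  # placeholder: pick the matching SupCon repo

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
config = AutoConfig.from_pretrained(base_model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(base_model_id, trust_remote_code=True, config=config,
                                  torch_dtype=torch.bfloat16, device_map="auto")

# Stage 1: apply the MNTP adapter and merge it into the base weights.
model = PeftModel.from_pretrained(model, mntp_adapter_id)
model = model.merge_and_unload()

# Stage 2: load the SupCon adapter on top of the MNTP-adapted weights.
model = PeftModel.from_pretrained(model, supcon_adapter_id)
```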

### 🚀 How to Use

#### Standalone Use (for Base Embeddings)

You can use this MNTP model on its own to generate code or text embeddings.

```python
import torch
from transformers import AutoTokenizer, AutoModel, AutoConfig
from peft import PeftModel
from llm2vec import LLM2Vec

base_model_id = "Qwen/Qwen2.5-Coder-3B-Instruct"
mntp_model_id = "SYSUSELab/DCS-Qwen2.5-Coder-3B-It-MNTP"

# Load the base decoder-only model and apply the MNTP PEFT adapter on top of it.
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
config = AutoConfig.from_pretrained(base_model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(base_model_id, trust_remote_code=True, config=config,
                                  torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, mntp_model_id)

# Wrap the adapted model with llm2vec to obtain mean-pooled embeddings.
l2v = LLM2Vec(model, tokenizer, pooling_mode="mean", max_length=512)
embeddings = l2v.encode(["def hello_world():\n    print('Hello, World!')"])
print("Embedding from MNTP model:", embeddings.shape)
```
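
Once `l2v` is built as above, the embeddings can be used directly for code search: encode a natural-language query and a set of candidate code snippets, then rank the snippets by cosine similarity. This is a minimal illustration (the query and snippets are invented for the example), assuming `l2v.encode` returns torch tensors as in the `llm2vec` usage examples.

```python
import torch
import torch.nn.functional as F

query = ["read a file line by line"]
snippets = [
    "def read_lines(path):\n    with open(path) as f:\n        return f.readlines()",
    "def add(a, b):\n    return a + b",
]

# Encode the query and the candidate code snippets with the same model.
q_emb = l2v.encode(query)      # shape: (1, hidden_dim)
c_emb = l2v.encode(snippets)   # shape: (len(snippets), hidden_dim)

# Cosine similarity = dot product of L2-normalized embeddings.
scores = F.normalize(q_emb, dim=-1) @ F.normalize(c_emb, dim=-1).T
best = torch.argmax(scores, dim=-1).item()
print("Best match:", snippets[best])
```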

### ⚙️ Training Methodology

This model was pre-trained using the **MNTP** objective as described in the `llm2vec` paper. If you wish to train your own MNTP model from scratch, please refer to the instructions in the `Fine-tuning/Fine-tuning_method/MNTP/` directory of our GitHub repository.
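
For intuition only, here is a simplified sketch of what the MNTP objective computes, following the description in the `llm2vec` paper: a fraction of input tokens is masked, and each masked token at position *i* is predicted from the model's logits at position *i-1*, which is where a causal decoder places its prediction for that token. This is not the training script from the repository; the masking ratio and other details are illustrative.

```python
import torch
import torch.nn.functional as F

def mntp_loss(model, input_ids, mask_token_id, mask_prob=0.2):
    """Simplified MNTP loss. `model` is a causal LM with an LM head
    (e.g., loaded via AutoModelForCausalLM); `input_ids` is (batch, seq)."""
    labels = input_ids.clone()

    # Randomly mask a fraction of tokens (never the first position,
    # since it has no preceding token to predict it from).
    mask = torch.rand(input_ids.shape, device=input_ids.device) < mask_prob
    mask[:, 0] = False
    masked_inputs = input_ids.masked_fill(mask, mask_token_id)

    logits = model(input_ids=masked_inputs).logits  # (batch, seq, vocab)

    # A causal decoder predicts token i from position i-1, so align
    # logits at positions 0..seq-2 with labels at positions 1..seq-1.
    pred_logits = logits[:, :-1, :]
    target_ids = labels[:, 1:]
    masked_positions = mask[:, 1:]

    # Cross-entropy only over the masked positions.
    return F.cross_entropy(pred_logits[masked_positions], target_ids[masked_positions])
```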

### 📄 Citation

If you use this model, please cite both our paper and the foundational work of `llm2vec`.

```bibtex
@article{chen2024decoder,
  title={Are Decoder-Only Large Language Models the Silver Bullet for Code Search?},
  author={Chen, Yuxuan and Liu, Mingwei and Ou, Guangsheng and Li, Anji and Dai, Dekun and Wang, Yanlin and Zheng, Zibin},
  journal={arXiv preprint arXiv:2410.22240},
  year={2024}
}

@article{behnamghader2024llm2vec,
  title={LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders},
  author={BehnamGhader, Parishad and Adlakha, Vaibhav and Mosbach, Marius and Bahdanau, Dzmitry and Chapados, Nicolas and Reddy, Siva},
  journal={arXiv preprint arXiv:2404.05961},
  year={2024}
}
```