---
license: apache-2.0
language:
- code
library_name: peft
tags:
- llm2vec
- mntp
- decoder-only
- pre-training
- qwen2.5-coder
---
## Are Decoder-Only Large Language Models the Silver Bullet for Code Search?
This model is an official artifact from our research paper: **"[Are Decoder-Only Large Language Models the Silver Bullet for Code Search?](https://arxiv.org/abs/2410.22240)"**.
In this work, we conduct a large-scale systematic evaluation of decoder-only Large Language Models for the task of code search and present a set of effective fine-tuning and optimization strategies.
For complete details on all our experiments, to reproduce the full training/evaluation pipeline, or to use other models from the paper, please visit our official GitHub repository:
**[GitHub: Georgepitt/DecoderLLMs-CodeSearch](https://github.com/Georgepitt/DecoderLLMs-CodeSearch)**
---
## Model Card: Qwen2.5-Coder-3B-Instruct - MNTP Pre-trained Model
### Model Description
This is a PEFT adapter for the **`Qwen/Qwen2.5-Coder-3B-Instruct`** model, pre-trained with the **Masked Next Token Prediction (MNTP)** objective from the [llm2vec](https://github.com/McGill-NLP/llm2vec) framework.
**Important Note on its Role**:
This model is **not intended for direct downstream task evaluation**. Instead, it is the **foundational prerequisite** for our supervised contrastive (SupCon) fine-tuned models: MNTP pre-training enables the decoder-only model to learn bidirectional representations, an essential step before supervised contrastive learning is applied.
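As a rough illustration of that pipeline, a SupCon adapter is stacked on top of the MNTP-adapted weights. The sketch below follows the usual `llm2vec` loading pattern; the SupCon repository id is a hypothetical placeholder, so please refer to our GitHub repository for the actual checkpoints and scripts.
```python
import torch
from transformers import AutoModel, AutoTokenizer
from peft import PeftModel

base_model_id = "Qwen/Qwen2.5-Coder-3B-Instruct"
mntp_model_id = "SYSUSELab/DCS-Qwen2.5-Coder-3B-It-MNTP"
supcon_model_id = "SYSUSELab/DCS-Qwen2.5-Coder-3B-It-SupCon"  # hypothetical placeholder id

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModel.from_pretrained(base_model_id, torch_dtype=torch.bfloat16, device_map="auto")

# Apply the MNTP adapter and fold its weights into the base model ...
model = PeftModel.from_pretrained(model, mntp_model_id)
model = model.merge_and_unload()

# ... then load the SupCon adapter on top of the MNTP-adapted weights.
model = PeftModel.from_pretrained(model, supcon_model_id)
```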
### How to Use
#### Standalone Use (for Base Embeddings)
You can also use this MNTP model on its own to produce code or text embeddings:
```python
import torch
from transformers import AutoTokenizer, AutoModel, AutoConfig
from peft import PeftModel
from llm2vec import LLM2Vec

base_model_id = "Qwen/Qwen2.5-Coder-3B-Instruct"
mntp_model_id = "SYSUSELab/DCS-Qwen2.5-Coder-3B-It-MNTP"

# Load the base model and apply the MNTP PEFT adapter on top of it.
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
config = AutoConfig.from_pretrained(base_model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    base_model_id,
    trust_remote_code=True,
    config=config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, mntp_model_id)

# Wrap the model with LLM2Vec to obtain mean-pooled sequence embeddings.
l2v = LLM2Vec(model, tokenizer, pooling_mode="mean", max_length=512)

embeddings = l2v.encode(["def hello_world():\n    print('Hello, World!')"])
print("Embedding from MNTP model:", embeddings.shape)
```
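Once the encoder is loaded, code search reduces to ranking candidate snippets by the similarity of their embeddings to a query embedding. A minimal sketch, reusing the `l2v` object from the example above (any query/instruction formatting used in the paper is omitted here for brevity):
```python
import torch
import torch.nn.functional as F

query = ["return the square of a number"]
candidates = [
    "def square(x):\n    return x * x",
    "def read_file(path):\n    return open(path).read()",
]

# Embed the query and the candidates with the same MNTP-adapted encoder.
q_emb = l2v.encode(query)       # shape: (1, hidden_dim)
c_emb = l2v.encode(candidates)  # shape: (len(candidates), hidden_dim)

# Rank candidates by cosine similarity to the query.
scores = F.cosine_similarity(q_emb, c_emb)
best = torch.argmax(scores).item()
print("Best match:", candidates[best])
```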
### Training Methodology
This model was pre-trained using the **MNTP** objective as described in the `llm2vec` paper. If you wish to train your own MNTP model from scratch, please refer to the instructions in the `Fine-tuning/Fine-tuning_method/MNTP/` directory of our GitHub repository.
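For intuition, MNTP masks a fraction of the input tokens and, following `llm2vec`, computes the loss for a masked token at position *i* from the model's logits at position *i−1*, matching the next-token parameterization of a decoder-only head. Below is a minimal sketch of that loss term only, not our training script; it assumes `labels` hold the original token ids at masked positions and `-100` everywhere else.
```python
import torch
import torch.nn.functional as F

def mntp_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Masked Next Token Prediction loss (sketch).

    logits: (batch, seq_len, vocab) produced from the masked input.
    labels: (batch, seq_len); original ids at masked positions, -100 elsewhere.
    """
    # The prediction for a masked token at position i comes from the logits
    # at position i - 1, so shift logits and labels by one step.
    shifted_logits = logits[:, :-1, :]
    shifted_labels = labels[:, 1:]
    return F.cross_entropy(
        shifted_logits.reshape(-1, shifted_logits.size(-1)),
        shifted_labels.reshape(-1),
        ignore_index=-100,
    )
```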
### Citation
If you use this model, please cite both our paper and the foundational work of `llm2vec`.
```bibtex
@article{chen2024decoder,
title={Are Decoder-Only Large Language Models the Silver Bullet for Code Search?},
author={Chen, Yuxuan and Liu, Mingwei and Ou, Guangsheng and Li, Anji and Dai, Dekun and Wang, Yanlin and Zheng, Zibin},
journal={arXiv preprint arXiv:2410.22240},
year={2024}
}
@article{behnamghader2024llm2vec,
title={LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders},
author={BehnamGhader, Parishad and Adlakha, Vaibhav and Mosbach, Marius and Bahdanau, Dzmitry and Chapados, Nicolas and Reddy, Siva},
journal={arXiv preprint arXiv:2404.05961},
year={2024}
}
```