---
license: mit
language:
- en
inference: true
base_model:
- microsoft/codebert-base-mlm
pipeline_tag: fill-mask
tags:
- fill-mask
- smart-contract
- web3
- software-engineering
- embedding
- codebert
library_name: transformers
---
# SmartBERT V2 CodeBERT

## Overview
SmartBERT V2 CodeBERT is a pre-trained model, initialized with **[CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm)**, designed to effectively encode **smart contract** function-level code into embeddings.
- **Training Data:** Trained on **16,000** smart contracts.
- **Hardware:** Utilized 2 Nvidia A100 80G GPUs.
- **Training Duration:** More than 10 hours.
- **Evaluation Data:** Evaluated on **4,000** smart contracts.
## Preprocessing
All newline (`\n`) and tab (`\t`) characters in the function code were replaced with a single space to ensure consistency in the input data format.
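A minimal sketch of this normalization step (the helper name `normalize_code` is illustrative and not part of the released tooling):

```python
def normalize_code(code: str) -> str:
    """Replace newline and tab characters with single spaces."""
    return code.replace("\n", " ").replace("\t", " ")

print(normalize_code("function totalSupply()\n\texternal view returns (uint256);"))
```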
## Base Model
- **Base Model**: [CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm)
## Training Setup
```python
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,              # directory for checkpoints and logs
    overwrite_output_dir=True,
    num_train_epochs=20,
    per_device_train_batch_size=64,
    save_steps=10000,
    save_total_limit=2,                 # keep only the two most recent checkpoints
    evaluation_strategy="steps",
    eval_steps=10000,
    resume_from_checkpoint=checkpoint   # path to an existing checkpoint, if resuming
)
```
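For context, here is a hedged sketch of how such arguments could be wired into a `Trainer` for masked-language-model training. The dataset variables and the 15% masking probability are assumptions for illustration, not confirmed details of the original run:

```python
from transformers import (
    RobertaTokenizer,
    RobertaForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
)

tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base-mlm")
model = RobertaForMaskedLM.from_pretrained("microsoft/codebert-base-mlm")

# Standard MLM collator; the 15% masking ratio is an assumption.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=training_args,           # the TrainingArguments shown above
    data_collator=data_collator,
    train_dataset=train_dataset,  # tokenized smart-contract functions (placeholder)
    eval_dataset=eval_dataset,    # held-out functions (placeholder)
)
trainer.train()
```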
## How to Use
To train and deploy the SmartBERT V2 model for Web API services, please refer to our GitHub repository: [web3se-lab/SmartBERT](https://github.com/web3se-lab/SmartBERT).
Or load the model directly with `transformers` to extract function-level embeddings:
```python
import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("web3se/SmartBERT-v2")
model = RobertaModel.from_pretrained("web3se/SmartBERT-v2")

code = "function totalSupply() external view returns (uint256);"

inputs = tokenizer(
    code,
    return_tensors="pt",
    truncation=True,
    max_length=512
)

with torch.no_grad():
    outputs = model(**inputs)

# Option 1: CLS embedding
cls_embedding = outputs.last_hidden_state[:, 0, :]

# Option 2: Mean pooling (often better for code)
mean_embedding = outputs.last_hidden_state.mean(dim=1)
```
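Since the model is published with the fill-mask pipeline tag, it can also be queried through the `pipeline` API. The Solidity snippet below is an illustrative example:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="web3se/SmartBERT-v2")

# Predict the masked token in a Solidity function signature.
results = fill_mask("function totalSupply() external view returns (<mask>);")
for r in results:
    print(r["token_str"], round(r["score"], 4))
```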