---
license: mit
language:
- en
inference: true
base_model:
- microsoft/codebert-base-mlm
pipeline_tag: fill-mask
tags:
- fill-mask
- smart-contract
- web3
- software-engineering
- embedding
- codebert
library_name: transformers
---

# SmartBERT V2 CodeBERT

![SmartBERT](./framework.png)

## Overview

SmartBERT V2 CodeBERT is a pre-trained model, initialized with **[CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm)**, designed to effectively encode **Smart Contract** function-level code into embeddings.

- **Training Data:** 16,000 smart contracts.
- **Hardware:** Two Nvidia A100 80GB GPUs.
- **Training Duration:** More than 10 hours.
- **Evaluation Data:** 4,000 smart contracts.

## Preprocessing

All newline (`\n`) and tab (`\t`) characters in the function code were replaced with a single space to ensure consistency in the input data format.
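
A minimal sketch of this normalization (the helper name `normalize_code` is ours, not part of the released code):

```python
def normalize_code(code: str) -> str:
    """Replace newline and tab characters with a single space."""
    return code.replace("\n", " ").replace("\t", " ")

code = "function totalSupply()\nexternal view\nreturns (uint256);"
print(normalize_code(code))
# function totalSupply() external view returns (uint256);
```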

## Base Model

- **Base Model**: [CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm)

## Training Setup

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,            # directory for checkpoints and logs
    overwrite_output_dir=True,
    num_train_epochs=20,
    per_device_train_batch_size=64,
    save_steps=10000,
    save_total_limit=2,
    evaluation_strategy="steps",      # renamed to `eval_strategy` in newer transformers
    eval_steps=10000,
    resume_from_checkpoint=checkpoint # path to a previous checkpoint, or None
)
```
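
These arguments would typically be paired with a `Trainer` and a masked-language-modeling collator; a hedged sketch of that setup (the `train_dataset` and `eval_dataset` variables are illustrative placeholders, not the released training code):

```python
from transformers import (
    RobertaForMaskedLM,
    RobertaTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
)

tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base-mlm")
model = RobertaForMaskedLM.from_pretrained("microsoft/codebert-base-mlm")

# Standard MLM objective: randomly mask 15% of input tokens.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=training_args,           # the TrainingArguments defined above
    data_collator=data_collator,
    train_dataset=train_dataset,  # tokenized smart-contract functions (placeholder)
    eval_dataset=eval_dataset,    # held-out contracts (placeholder)
)
trainer.train()
```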

## How to Use

To train and deploy the SmartBERT V2 model for Web API services, please refer to our GitHub repository: [web3se-lab/SmartBERT](https://github.com/web3se-lab/SmartBERT).

Or extract embeddings directly with the Transformers library:

```python
import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("web3se/SmartBERT-v2")
model = RobertaModel.from_pretrained("web3se/SmartBERT-v2")
model.eval()

code = "function totalSupply() external view returns (uint256);"

inputs = tokenizer(
    code,
    return_tensors="pt",
    truncation=True,
    max_length=512
)

with torch.no_grad():
    outputs = model(**inputs)

# Option 1: embedding of the <s> (CLS) token
cls_embedding = outputs.last_hidden_state[:, 0, :]

# Option 2: mean pooling over all tokens (often better for code);
# for batched, padded inputs, mask out padding tokens before averaging
mean_embedding = outputs.last_hidden_state.mean(dim=1)
```
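
Since the model is trained with a masked-language-modeling objective, it can also be queried through the `fill-mask` pipeline; a brief sketch:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="web3se/SmartBERT-v2")

# RoBERTa-style models use <mask> as the mask token.
masked = "function totalSupply() external view returns (<mask>);"
for pred in fill_mask(masked, top_k=3):
    print(pred["token_str"], round(pred["score"], 4))
```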