anarlavrenov committed on
Commit ed00d52 · verified · 1 Parent(s): a43de7a

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+logo.png filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,176 @@
![logo](logo.png)
**LIME-1B Model Card**

---

> **Note**: This model serves as proof that a single individual, without any team or institutional backing, can develop an LLM that demonstrates competitive results.

---

# LIME-1B

LIME-1B is a 1B-parameter, decoder-only Transformer language model trained from scratch on English web data and then instruction-tuned on a curated mixture of assistant-style datasets with and without retrieval context. It is designed as a **compact, practical base model** for:

- Building RAG systems (context + question → answer)
- Assistant-style Q&A and task completion
- Summarization, explanation, and rewriting tasks in English

> ⚠️ LIME-1B is **not** RLHF/DPO-aligned and does **not** have tool use or multi-turn chat training baked in. It is an instruction-tuned LM, not a fully aligned assistant like ChatGPT.

---

## 1. Model architecture

LIME-1B follows a modern GPT-style decoder-only Transformer with several quality-oriented design choices:

| Component               | Value                                      |
|-------------------------|--------------------------------------------|
| Architecture            | Decoder-only Transformer                   |
| Parameters              | 1.0B                                       |
| Layers (decoder blocks) | 32                                         |
| d_model                 | 1536                                       |
| FFN dimension (d_ff)    | 6144                                       |
| Attention heads         | 24                                         |
| Vocabulary size         | 50,000                                     |
| Max sequence length     | 512 tokens                                 |
| Positional encoding     | Sinusoidal                                 |
| Norm                    | `RMSNorm`                                  |
| FFN                     | SiLU MLP                                   |
| Attention               | FlashAttention                             |
| Embedding tying         | Output head tied to input embedding        |
| Precision (training)    | Mixed fp32/bf16 (autocast) + grad clipping |

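The numbers in the table support a quick back-of-the-envelope parameter count. The sketch below is illustrative only: it assumes a gated (LLaMA-style) SiLU MLP whose hidden width is roughly two-thirds of `d_ff`, rounded up to a multiple of 256 (the `multiple_of` value in this repo's `config.json`); the exact `SiLUFeedForward` internals live in the author's training library and may differ.

```python
# Rough parameter count from the table above (assumption-laden sketch).
d_model, d_ff, n_layers, vocab, multiple_of = 1536, 6144, 32, 50_000, 256

# Assumed gated SiLU MLP: hidden ≈ 2/3 * d_ff, rounded up to a multiple of 256.
hidden = ((2 * d_ff // 3) + multiple_of - 1) // multiple_of * multiple_of  # 4096

embed = vocab * d_model                 # input embedding, tied with the output head
attn_per_layer = 4 * d_model * d_model  # Q, K, V, O projections
ffn_per_layer = 3 * d_model * hidden    # gate, up, down projections

total = embed + n_layers * (attn_per_layer + ffn_per_layer)
print(f"~{total / 1e9:.2f}B parameters")  # ~0.98B, consistent with the advertised 1.0B
```
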
## 2. Training data

### 2.1 Pretraining

The base model is pretrained as a standard causal language model on English web data:

- **Corpus**: FineWeb-Edu (CC-MAIN-2025-05 split)
- **Language filter**: English-only subset
- **Objective**: next-token prediction (causal LM)
- **Token budget**: 20B tokens
- **Context length**: 512 tokens

### 2.2 Instruction fine-tuning (SFT)

After pretraining, the model is fine-tuned on a **unified instruction schema** (a formatting sketch follows the data mixture below):

```text
[context (optional)] <user> instruction_text <assistant> response_text <eos>
```

**SFT Data Mixture** (~97k examples total):
- [projecte-aina/RAG_Multilingual](https://huggingface.co/datasets/projecte-aina/RAG_Multilingual)
- [databricks/databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k)
- [HuggingFaceH4/no_robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots)
- [CohereLabs/aya_dataset](https://huggingface.co/datasets/CohereLabs/aya_dataset)
- [yahma/alpaca-cleaned](https://huggingface.co/datasets/yahma/alpaca-cleaned)

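As a concrete illustration of this schema, the sketch below renders one Dolly-style record (instruction, optional context, response) into a single training string. The `format_example` helper and the field names are hypothetical; the actual SFT preprocessing code is not included in this repository.

```python
# Hypothetical formatter: renders one SFT record into the unified schema
# "[context (optional)] <user> instruction <assistant> response <eos>".
def format_example(instruction: str, response: str, context: str = "") -> str:
    parts = []
    if context.strip():
        parts.append(context.strip())
    parts.append(f"<user> {instruction.strip()}")
    parts.append(f"<assistant> {response.strip()} <eos>")
    return " ".join(parts)


print(format_example(
    instruction="What is retrieval-augmented generation?",
    context="Retrieval-augmented generation (RAG) grounds model outputs in retrieved text.",
    response="RAG pairs a retriever that fetches relevant passages with a generator "
             "that conditions its answer on them.",
))
```
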
## 3. Training details

### Hardware
- **GPUs**: 8 × NVIDIA A100 80GB (data parallel)
- **Precision**: bfloat16 with gradient clipping (max_norm = 1.0)

### Pretraining

**Objective**: Cross-entropy loss on next-token prediction

**Optimizer**: AdamW
- β₁ = 0.9
- β₂ = 0.95
- Weight decay applied to non-norm/non-bias parameters

**Learning Rate Schedule**:
- Peak LR: ~5e-4
- Polynomial decay to 5e-6
- Warmup: ~5% of total steps

### Instruction fine-tuning (SFT)

**Objective**: Cross-entropy loss on next-token prediction

**Optimizer**: AdamW
- β₁ = 0.9
- β₂ = 0.95
- Weight decay applied to non-norm/non-bias parameters

**Learning Rate Schedule**:
- Peak LR: 8e-5
- Polynomial decay to 1e-5
- Warmup: 10% of total steps

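Both stages use the same schedule shape: a warmup phase followed by polynomial decay down to a floor LR. The sketch below shows one way to express such a schedule with PyTorch's `LambdaLR`; the step counts are placeholders and the decay power is assumed to be 1.0 (the actual power and weight-decay value are not documented here).

```python
# Illustrative warmup + polynomial-decay schedule (not the exact training code).
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

peak_lr, final_lr = 5e-4, 5e-6          # pretraining values; SFT uses 8e-5 -> 1e-5
total_steps = 100_000                   # placeholder
warmup_steps = int(0.05 * total_steps)  # ~5% warmup for pretraining (10% for SFT)
power = 1.0                             # assumed decay power

params = [torch.nn.Parameter(torch.zeros(1))]  # stand-in for model.parameters()
optimizer = AdamW(params, lr=peak_lr, betas=(0.9, 0.95), weight_decay=0.1)  # wd value is a placeholder

def lr_lambda(step: int) -> float:
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    decayed = (peak_lr - final_lr) * (1.0 - progress) ** power + final_lr
    return decayed / peak_lr  # LambdaLR multiplies the base (peak) LR by this factor

scheduler = LambdaLR(optimizer, lr_lambda)
```
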
## 4. Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "anarlavrenov/LIME-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,  # the model class is registered via auto_map in this repo
)

# Special-token ids used by the SFT schema.
user_id = tokenizer.convert_tokens_to_ids("<user>")
assistant_id = tokenizer.convert_tokens_to_ids("<assistant>")


def clean_text(text):
    # Minimal normalization: collapse whitespace.
    return " ".join(text.split())


def build_inference_prompt(context, question):
    # Build token ids in the SFT schema: [context] <user> question <assistant>
    context_txt = clean_text(context) if context is not None else ""
    question_txt = clean_text(question)

    context_ids = tokenizer.encode(context_txt) if context_txt else []
    question_ids = tokenizer.encode(question_txt)

    ids = []
    if context_ids:
        ids.extend(context_ids)
    ids.append(user_id)
    ids.extend(question_ids)
    ids.append(assistant_id)

    return torch.tensor([ids], dtype=torch.long)


# Example usage
context = None  # optionally pass a retrieved passage here
question = "Write five questions for a Data Scientist interview."
input_ids = build_inference_prompt(context, question).to(model.device)

outputs = model.generate(
    input_ids=input_ids,
    max_new_tokens=256,
    do_sample=True,
    top_p=0.9,
    temperature=0.5,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# 1. Can you tell us about your experience with data analysis and modeling?
# 2. How do you approach data cleaning and preprocessing?
# 3. How do you approach data visualization and storytelling?
# 4. Can you walk us through a time when you used data to solve a problem?
# 5. How do you approach the ethical considerations of data science and machine learning?
```

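For RAG-style prompting, pass the retrieved passage as `context`; it is placed before the `<user>` token exactly as in the SFT schema. A short sketch reusing the helpers above (the passage text is just an example):

```python
# RAG-style prompt: retrieved passage + question.
context = (
    "LIME-1B was pretrained on 20B tokens of FineWeb-Edu and then "
    "instruction-tuned on a mixture of assistant-style datasets."
)
question = "How many tokens was LIME-1B pretrained on?"

input_ids = build_inference_prompt(context, question).to(model.device)
outputs = model.generate(
    input_ids=input_ids,
    max_new_tokens=64,
    do_sample=False,  # greedy decoding tends to be safer for extractive answers
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
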
If you use LIME-1B in academic work or public products, please consider citing the model and the underlying datasets (FineWeb-Edu, Dolly, No Robots, Aya, Alpaca, RAG_Multilingual, etc.) according to their respective licenses and documentation.

## 5. Citation

```bibtex
@misc{lime1b2025,
  title        = {LIME-1B: A 1B-parameter English Causal Language Model},
  author       = {Anar Lavrenov},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/anarlavrenov/LIME-1B}}
}
```
config.json ADDED
@@ -0,0 +1,27 @@
{
  "architectures": [
    "LIMEForCausalLM"
  ],
  "auto_map": {
    "AutoConfig": "configuration_lime.LIMEConfig",
    "AutoModelForCausalLM": "modeling_lime.LIMEForCausalLM"
  },
  "d_model": 1536,
  "dff": 6144,
  "dropout_rate": 0.0,
  "dtype": "float32",
  "eos_token_id": 1,
  "is_decoder": true,
  "max_position_embeddings": 512,
  "model_type": "lime",
  "multiple_of": 256,
  "num_decoder_layers": 32,
  "num_encoder_layers": 0,
  "num_heads": 24,
  "pad_token_id": 0,
  "transformers_version": "4.57.3",
  "use_cache": false,
  "use_encoder": false,
  "use_flash": true,
  "vocab_size": 50000
}
configuration_lime.py ADDED
@@ -0,0 +1,48 @@
from transformers import PretrainedConfig


class LIMEConfig(PretrainedConfig):
    model_type = "lime"

    def __init__(
        self,
        vocab_size=50000,
        d_model=1536,
        num_encoder_layers=0,
        num_decoder_layers=32,
        num_heads=24,
        dff=6144,
        dropout_rate=0.0,
        max_position_embeddings=512,
        pad_token_id=0,
        eos_token_id=1,
        use_encoder=False,
        use_flash=True,
        multiple_of=256,
        **kwargs
    ):
        super().__init__(
            pad_token_id=pad_token_id,
            eos_token_id=eos_token_id,
            **kwargs
        )

        self.vocab_size = vocab_size
        self.d_model = d_model
        self.num_encoder_layers = num_encoder_layers
        self.num_decoder_layers = num_decoder_layers
        self.num_heads = num_heads
        self.dff = dff
        self.dropout_rate = dropout_rate
        self.max_position_embeddings = max_position_embeddings
        self.pad_token_id = pad_token_id
        self.eos_token_id = eos_token_id
        self.use_encoder = use_encoder
        self.use_flash = use_flash
        self.multiple_of = multiple_of

        # For Transformers library.
        self.is_decoder = True
        self.is_encoder_decoder = False
        self.tie_word_embeddings = True
        self.use_cache = False
logo.png ADDED

Git LFS Details

  • SHA256: e1c90f071aec48e1b36fdf6dfa6ee7ccd1904a36874d757f9843ae93a8b3cb44
  • Pointer size: 132 Bytes
  • Size of remote file: 2.89 MB
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:86a8bb74eac1976913c500149defc4a2f43c24b7f534260843db7727ddf69634
size 3937660880
modeling_lime.py ADDED
@@ -0,0 +1,120 @@
import torch
from torch import nn
from transformers import PreTrainedModel
from transformers.modeling_outputs import CausalLMOutputWithPast
from typing import Optional, Tuple, Union
from ukraine.research.transformer.transformer import Transformer
from ukraine.research.transformer.layers import SiLUFeedForward
from ukraine.research.transformer.masking import generate_square_subsequent_mask
from src.configuration_lime import LIMEConfig


def make_ff(config: LIMEConfig):
    return SiLUFeedForward(
        d_model=config.d_model,
        dff=config.dff,
        multiple_of=config.multiple_of
    )


def make_norm(config: LIMEConfig):
    return nn.RMSNorm(config.d_model)


class LIMEForCausalLM(PreTrainedModel):
    config_class = LIMEConfig
    base_model_prefix = "lime"
    _tied_weights_keys = ["transformer.output_fc.weight"]

    def __init__(self, config: LIMEConfig):
        super().__init__(config)
        self.config = config

        self.transformer = Transformer(
            num_encoder_layers=config.num_encoder_layers,
            num_decoder_layers=config.num_decoder_layers,
            d_model=config.d_model,
            num_heads=config.num_heads,
            input_vocab_size=config.vocab_size,
            target_vocab_size=config.vocab_size,
            dropout_rate=config.dropout_rate,
            ff_factory=lambda: make_ff(config),
            norm_factory=lambda: make_norm(config),
            pad_token_id=config.pad_token_id,
            use_encoder=config.use_encoder,
            use_flash=config.use_flash
        )

        self.post_init()

    # For transformers library
    def get_input_embeddings(self):
        return self.transformer.decoder.embedding

    def set_input_embeddings(self, value):
        self.transformer.decoder.embedding = value

    def get_output_embeddings(self):
        return self.transformer.output_fc

    def set_output_embeddings(self, new_embeddings):
        self.transformer.output_fc = new_embeddings

    def _tie_weights(self):
        if self.config.tie_word_embeddings:
            self._tie_or_clone_weights(
                self.transformer.output_fc,
                self.get_input_embeddings()
            )

    def forward(
        self,
        input_ids: torch.LongTensor,
        attention_mask: Optional[torch.Tensor] = None,
        labels: Optional[torch.LongTensor] = None,
        return_dict: Optional[bool] = None,
        **kwargs
    ) -> Union[Tuple, CausalLMOutputWithPast]:

        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        batch_size, seq_len = input_ids.shape
        device = input_ids.device

        tgt_mask = generate_square_subsequent_mask(seq_len, device)

        # If we are planning to train the model.
        if labels is not None:
            tgt_key_padding_mask = input_ids.eq(self.config.pad_token_id)
        # For inference we do not need it.
        else:
            tgt_key_padding_mask = None

        logits, _ = self.transformer(
            src=input_ids,
            tgt_mask=tgt_mask,
            tgt_key_padding_mask=tgt_key_padding_mask
        )

        loss = None
        if labels is not None:
            shift_logits = logits[:, :-1, :].contiguous()
            shift_labels = labels[:, 1:].contiguous()
            # This ignore index was used during SFT training.
            criterion = nn.CrossEntropyLoss(ignore_index=-100)
            loss = criterion(
                shift_logits.reshape(-1, self.config.vocab_size),
                shift_labels.reshape(-1)
            )

        if not return_dict:
            output = (logits,)
            return ((loss,) + output) if loss is not None else output

        return CausalLMOutputWithPast(
            loss=loss,
            logits=logits,
            past_key_values=None,
            hidden_states=None,
            attentions=None
        )
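Note that `forward` shifts logits and labels internally and the loss uses `ignore_index=-100`, so training labels are typically just the input ids with padding positions replaced by -100. A minimal sketch of that label preparation (illustrative; the actual training pipeline is not part of this repository):

```python
import torch

pad_token_id = 0  # from config.json
input_ids = torch.tensor([[2, 117, 5023, 3, 911, 1, 0, 0]])  # toy ids, right-padded

labels = input_ids.clone()
labels[labels == pad_token_id] = -100  # matches ignore_index in the loss above
# model(input_ids=input_ids, labels=labels) then applies the one-position shift itself.
```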
special_tokens_map.json ADDED
@@ -0,0 +1,20 @@
{
  "additional_special_tokens": [
    "<user>",
    "<assistant>"
  ],
  "eos_token": {
    "content": "<eos>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<pad>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,46 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "<pad>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<eos>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "<user>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "<assistant>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "additional_special_tokens": [
    "<user>",
    "<assistant>"
  ],
  "clean_up_tokenization_spaces": false,
  "eos_token": "<eos>",
  "extra_special_tokens": {},
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": "<pad>",
  "tokenizer_class": "PreTrainedTokenizerFast"
}
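The `added_tokens_decoder` table above fixes the ids the prompt schema relies on (0 = `<pad>`, 1 = `<eos>`, 2 = `<user>`, 3 = `<assistant>`), matching `pad_token_id` and `eos_token_id` in `config.json`. A quick way to confirm the mapping at runtime:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("anarlavrenov/LIME-1B")
for t in ["<pad>", "<eos>", "<user>", "<assistant>"]:
    print(t, tok.convert_tokens_to_ids(t))
# Expected, per this file: <pad> 0, <eos> 1, <user> 2, <assistant> 3
```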