anarlavrenov committed on
Commit ed00d52 · verified · 1 Parent(s): a43de7a

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+logo.png filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,176 @@
![logo](logo.png)
**LIME-1B Model Card**

---

> **Note**: This model serves as proof that a single individual, without any team or institutional backing, can develop an LLM that demonstrates competitive results.

---

# LIME-1B

LIME-1B is a 1B-parameter, decoder-only Transformer language model trained from scratch on English web data and then instruction-tuned on a curated mixture of assistant-style datasets with and without retrieval context. It is designed as a **compact, practical base model** for:

- Building RAG systems (context + question → answer)
- Assistant-style Q&A and task completion
- Summarization, explanation, and rewriting tasks in English

> ⚠️ LIME-1B is **not** RLHF/DPO-aligned and does **not** have tool use or multi-turn chat training baked in. It is an instruction-tuned LM, not a fully aligned assistant like ChatGPT.

---

## 1. Model architecture

LIME-1B follows a modern GPT-style decoder-only Transformer with several quality-oriented design choices:

| Component               | Value                                      |
|-------------------------|--------------------------------------------|
| Architecture            | Decoder-only Transformer                   |
| Parameters              | 1.0B                                       |
| Layers (decoder blocks) | 32                                         |
| d_model                 | 1536                                       |
| FFN dimension (d_ff)    | 6144                                       |
| Attention heads         | 24                                         |
| Vocabulary size         | 50,000                                     |
| Max sequence length     | 512 tokens                                 |
| Positional encoding     | Sinusoidal                                 |
| Norm                    | `RMSNorm`                                  |
| FFN                     | SiLU MLP                                   |
| Attention               | FlashAttention                             |
| Embedding tying         | Output head tied to input embedding        |
| Precision (training)    | Mixed fp32/bf16 (autocast) + grad clipping |

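The numbers in the table support a quick back-of-the-envelope parameter count. The sketch below is illustrative only: it assumes a gated (LLaMA-style) SiLU MLP whose hidden width is roughly two-thirds of `d_ff`, rounded up to a multiple of 256 (the `multiple_of` value in this repo's `config.json`); the exact `SiLUFeedForward` internals live in the author's training library and may differ.

```python
# Rough parameter count from the table above (assumption-laden sketch).
d_model, d_ff, n_layers, vocab, multiple_of = 1536, 6144, 32, 50_000, 256

# Assumed gated SiLU MLP: hidden ≈ 2/3 * d_ff, rounded up to a multiple of 256.
hidden = ((2 * d_ff // 3) + multiple_of - 1) // multiple_of * multiple_of  # 4096

embed = vocab * d_model                 # input embedding, tied with the output head
attn_per_layer = 4 * d_model * d_model  # Q, K, V, O projections
ffn_per_layer = 3 * d_model * hidden    # gate, up, down projections

total = embed + n_layers * (attn_per_layer + ffn_per_layer)
print(f"~{total / 1e9:.2f}B parameters")  # ~0.98B, consistent with the advertised 1.0B
```
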
## 2. Training data

### 2.1 Pretraining

The base model is pretrained as a standard causal language model on English web data:

- **Corpus**: FineWeb-Edu (CC-MAIN-2025-05 split)
- **Language filter**: English-only subset
- **Objective**: next-token prediction (causal LM)
- **Token budget**: 20B tokens
- **Context length**: 512 tokens

### 2.2 Instruction fine-tuning (SFT)

After pretraining, the model is fine-tuned on a **unified instruction schema** (a formatting sketch follows the data mixture below):

```text
[context (optional)] <user> instruction_text <assistant> response_text <eos>
```

**SFT Data Mixture** (~97k examples total):
- [projecte-aina/RAG_Multilingual](https://huggingface.co/datasets/projecte-aina/RAG_Multilingual)
- [databricks/databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k)
- [HuggingFaceH4/no_robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots)
- [CohereLabs/aya_dataset](https://huggingface.co/datasets/CohereLabs/aya_dataset)
- [yahma/alpaca-cleaned](https://huggingface.co/datasets/yahma/alpaca-cleaned)

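As a concrete illustration of this schema, the sketch below renders one Dolly-style record (instruction, optional context, response) into a single training string. The `format_example` helper and the field names are hypothetical; the actual SFT preprocessing code is not included in this repository.

```python
# Hypothetical formatter: renders one SFT record into the unified schema
# "[context (optional)] <user> instruction <assistant> response <eos>".
def format_example(instruction: str, response: str, context: str = "") -> str:
    parts = []
    if context.strip():
        parts.append(context.strip())
    parts.append(f"<user> {instruction.strip()}")
    parts.append(f"<assistant> {response.strip()} <eos>")
    return " ".join(parts)


print(format_example(
    instruction="What is retrieval-augmented generation?",
    context="Retrieval-augmented generation (RAG) grounds model outputs in retrieved text.",
    response="RAG pairs a retriever that fetches relevant passages with a generator "
             "that conditions its answer on them.",
))
```
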
## 3. Training details

### Hardware
- **GPUs**: 8 × NVIDIA A100 80GB (data parallel)
- **Precision**: bfloat16 with gradient clipping (max_norm = 1.0)

### Pretraining

**Objective**: Cross-entropy loss on next-token prediction

**Optimizer**: AdamW
- β₁ = 0.9
- β₂ = 0.95
- Weight decay applied to non-norm/non-bias parameters

**Learning Rate Schedule**:
- Peak LR: ~5e-4
- Polynomial decay to 5e-6
- Warmup: ~5% of total steps

### Instruction fine-tuning (SFT)

**Objective**: Cross-entropy loss on next-token prediction

**Optimizer**: AdamW
- β₁ = 0.9
- β₂ = 0.95
- Weight decay applied to non-norm/non-bias parameters

**Learning Rate Schedule**:
- Peak LR: 8e-5
- Polynomial decay to 1e-5
- Warmup: 10% of total steps

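Both stages use the same schedule shape: a warmup phase followed by polynomial decay down to a floor LR. The sketch below shows one way to express such a schedule with PyTorch's `LambdaLR`; the step counts are placeholders and the decay power is assumed to be 1.0 (the actual power and weight-decay value are not documented here).

```python
# Illustrative warmup + polynomial-decay schedule (not the exact training code).
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

peak_lr, final_lr = 5e-4, 5e-6          # pretraining values; SFT uses 8e-5 -> 1e-5
total_steps = 100_000                   # placeholder
warmup_steps = int(0.05 * total_steps)  # ~5% warmup for pretraining (10% for SFT)
power = 1.0                             # assumed decay power

params = [torch.nn.Parameter(torch.zeros(1))]  # stand-in for model.parameters()
optimizer = AdamW(params, lr=peak_lr, betas=(0.9, 0.95), weight_decay=0.1)  # wd value is a placeholder

def lr_lambda(step: int) -> float:
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    decayed = (peak_lr - final_lr) * (1.0 - progress) ** power + final_lr
    return decayed / peak_lr  # LambdaLR multiplies the base (peak) LR by this factor

scheduler = LambdaLR(optimizer, lr_lambda)
```
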
## 4. Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "anarlavrenov/LIME-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,  # the model class is registered via auto_map in this repo
)

# Special-token ids used by the SFT schema.
user_id = tokenizer.convert_tokens_to_ids("<user>")
assistant_id = tokenizer.convert_tokens_to_ids("<assistant>")


def clean_text(text):
    # Minimal normalization: collapse whitespace.
    return " ".join(text.split())


def build_inference_prompt(context, question):
    # Build token ids in the SFT schema: [context] <user> question <assistant>
    context_txt = clean_text(context) if context is not None else ""
    question_txt = clean_text(question)

    context_ids = tokenizer.encode(context_txt) if context_txt else []
    question_ids = tokenizer.encode(question_txt)

    ids = []
    if context_ids:
        ids.extend(context_ids)
    ids.append(user_id)
    ids.extend(question_ids)
    ids.append(assistant_id)

    return torch.tensor([ids], dtype=torch.long)


# Example usage
context = None  # optionally pass a retrieved passage here
question = "Write five questions for a Data Scientist interview."
input_ids = build_inference_prompt(context, question).to(model.device)

outputs = model.generate(
    input_ids=input_ids,
    max_new_tokens=256,
    do_sample=True,
    top_p=0.9,
    temperature=0.5,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# 1. Can you tell us about your experience with data analysis and modeling?
# 2. How do you approach data cleaning and preprocessing?
# 3. How do you approach data visualization and storytelling?
# 4. Can you walk us through a time when you used data to solve a problem?
# 5. How do you approach the ethical considerations of data science and machine learning?
```

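For RAG-style prompting, pass the retrieved passage as `context`; it is placed before the `<user>` token exactly as in the SFT schema. A short sketch reusing the helpers above (the passage text is just an example):

```python
# RAG-style prompt: retrieved passage + question.
context = (
    "LIME-1B was pretrained on 20B tokens of FineWeb-Edu and then "
    "instruction-tuned on a mixture of assistant-style datasets."
)
question = "How many tokens was LIME-1B pretrained on?"

input_ids = build_inference_prompt(context, question).to(model.device)
outputs = model.generate(
    input_ids=input_ids,
    max_new_tokens=64,
    do_sample=False,  # greedy decoding tends to be safer for extractive answers
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
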
If you use LIME-1B in academic work or public products, please consider citing the model and the underlying datasets (FineWeb-Edu, Dolly, No Robots, Aya, Alpaca, RAG_Multilingual, etc.) according to their respective licenses and documentation.

## 5. Citation

```bibtex
@misc{lime1b2025,
  title        = {LIME-1B: A 1B-parameter English Causal Language Model},
  author       = {Anar Lavrenov},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/anarlavrenov/LIME-1B}}
}
```
config.json ADDED
@@ -0,0 +1,27 @@
{
  "architectures": [
    "LIMEForCausalLM"
  ],
  "auto_map": {
    "AutoConfig": "configuration_lime.LIMEConfig",
    "AutoModelForCausalLM": "modeling_lime.LIMEForCausalLM"
  },
  "d_model": 1536,
  "dff": 6144,
  "dropout_rate": 0.0,
  "dtype": "float32",
  "eos_token_id": 1,
  "is_decoder": true,
  "max_position_embeddings": 512,
  "model_type": "lime",
  "multiple_of": 256,
  "num_decoder_layers": 32,
  "num_encoder_layers": 0,
  "num_heads": 24,
  "pad_token_id": 0,
  "transformers_version": "4.57.3",
  "use_cache": false,
  "use_encoder": false,
  "use_flash": true,
  "vocab_size": 50000
}
configuration_lime.py ADDED
@@ -0,0 +1,48 @@
from transformers import PretrainedConfig


class LIMEConfig(PretrainedConfig):
    model_type = "lime"

    def __init__(
        self,
        vocab_size=50000,
        d_model=1536,
        num_encoder_layers=0,
        num_decoder_layers=32,
        num_heads=24,
        dff=6144,
        dropout_rate=0.0,
        max_position_embeddings=512,
        pad_token_id=0,
        eos_token_id=1,
        use_encoder=False,
        use_flash=True,
        multiple_of=256,
        **kwargs
    ):
        super().__init__(
            pad_token_id=pad_token_id,
            eos_token_id=eos_token_id,
            **kwargs
        )

        self.vocab_size = vocab_size
        self.d_model = d_model
        self.num_encoder_layers = num_encoder_layers
        self.num_decoder_layers = num_decoder_layers
        self.num_heads = num_heads
        self.dff = dff
        self.dropout_rate = dropout_rate
        self.max_position_embeddings = max_position_embeddings
        self.pad_token_id = pad_token_id
        self.eos_token_id = eos_token_id
        self.use_encoder = use_encoder
        self.use_flash = use_flash
        self.multiple_of = multiple_of

        # For Transformers library.
        self.is_decoder = True
        self.is_encoder_decoder = False
        self.tie_word_embeddings = True
        self.use_cache = False
logo.png ADDED

Git LFS Details

  • SHA256: e1c90f071aec48e1b36fdf6dfa6ee7ccd1904a36874d757f9843ae93a8b3cb44
  • Pointer size: 132 Bytes
  • Size of remote file: 2.89 MB
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:86a8bb74eac1976913c500149defc4a2f43c24b7f534260843db7727ddf69634
size 3937660880
modeling_lime.py ADDED
@@ -0,0 +1,120 @@
import torch
from torch import nn
from transformers import PreTrainedModel
from transformers.modeling_outputs import CausalLMOutputWithPast
from typing import Optional, Tuple, Union
from ukraine.research.transformer.transformer import Transformer
from ukraine.research.transformer.layers import SiLUFeedForward
from ukraine.research.transformer.masking import generate_square_subsequent_mask
from src.configuration_lime import LIMEConfig


def make_ff(config: LIMEConfig):
    return SiLUFeedForward(
        d_model=config.d_model,
        dff=config.dff,
        multiple_of=config.multiple_of
    )


def make_norm(config: LIMEConfig):
    return nn.RMSNorm(config.d_model)


class LIMEForCausalLM(PreTrainedModel):
    config_class = LIMEConfig
    base_model_prefix = "lime"
    _tied_weights_keys = ["transformer.output_fc.weight"]

    def __init__(self, config: LIMEConfig):
        super().__init__(config)
        self.config = config

        self.transformer = Transformer(
            num_encoder_layers=config.num_encoder_layers,
            num_decoder_layers=config.num_decoder_layers,
            d_model=config.d_model,
            num_heads=config.num_heads,
            input_vocab_size=config.vocab_size,
            target_vocab_size=config.vocab_size,
            dropout_rate=config.dropout_rate,
            ff_factory=lambda: make_ff(config),
            norm_factory=lambda: make_norm(config),
            pad_token_id=config.pad_token_id,
            use_encoder=config.use_encoder,
            use_flash=config.use_flash
        )

        self.post_init()

    # For transformers library
    def get_input_embeddings(self):
        return self.transformer.decoder.embedding

    def set_input_embeddings(self, value):
        self.transformer.decoder.embedding = value

    def get_output_embeddings(self):
        return self.transformer.output_fc

    def set_output_embeddings(self, new_embeddings):
        self.transformer.output_fc = new_embeddings

    def _tie_weights(self):
        if self.config.tie_word_embeddings:
            self._tie_or_clone_weights(
                self.transformer.output_fc,
                self.get_input_embeddings()
            )

    def forward(
        self,
        input_ids: torch.LongTensor,
        attention_mask: Optional[torch.Tensor] = None,
        labels: Optional[torch.LongTensor] = None,
        return_dict: Optional[bool] = None,
        **kwargs
    ) -> Union[Tuple, CausalLMOutputWithPast]:

        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        batch_size, seq_len = input_ids.shape
        device = input_ids.device

        tgt_mask = generate_square_subsequent_mask(seq_len, device)

        # If we are planning to train the model.
        if labels is not None:
            tgt_key_padding_mask = input_ids.eq(self.config.pad_token_id)
        # For inference we do not need it.
        else:
            tgt_key_padding_mask = None

        logits, _ = self.transformer(
            src=input_ids,
            tgt_mask=tgt_mask,
            tgt_key_padding_mask=tgt_key_padding_mask
        )

        loss = None
        if labels is not None:
            shift_logits = logits[:, :-1, :].contiguous()
            shift_labels = labels[:, 1:].contiguous()
            # This ignore index was used during SFT training.
            criterion = nn.CrossEntropyLoss(ignore_index=-100)
            loss = criterion(
                shift_logits.reshape(-1, self.config.vocab_size),
                shift_labels.reshape(-1)
            )

        if not return_dict:
            output = (logits,)
            return ((loss,) + output) if loss is not None else output

        return CausalLMOutputWithPast(
            loss=loss,
            logits=logits,
            past_key_values=None,
            hidden_states=None,
            attentions=None
        )
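Note that `forward` shifts logits and labels internally and the loss uses `ignore_index=-100`, so training labels are typically just the input ids with padding positions replaced by -100. A minimal sketch of that label preparation (illustrative; the actual training pipeline is not part of this repository):

```python
import torch

pad_token_id = 0  # from config.json
input_ids = torch.tensor([[2, 117, 5023, 3, 911, 1, 0, 0]])  # toy ids, right-padded

labels = input_ids.clone()
labels[labels == pad_token_id] = -100  # matches ignore_index in the loss above
# model(input_ids=input_ids, labels=labels) then applies the one-position shift itself.
```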
special_tokens_map.json ADDED
@@ -0,0 +1,20 @@
{
  "additional_special_tokens": [
    "<user>",
    "<assistant>"
  ],
  "eos_token": {
    "content": "<eos>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<pad>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,46 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "<pad>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<eos>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "<user>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "<assistant>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "additional_special_tokens": [
    "<user>",
    "<assistant>"
  ],
  "clean_up_tokenization_spaces": false,
  "eos_token": "<eos>",
  "extra_special_tokens": {},
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": "<pad>",
  "tokenizer_class": "PreTrainedTokenizerFast"
}
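The `added_tokens_decoder` table above fixes the ids the prompt schema relies on (0 = `<pad>`, 1 = `<eos>`, 2 = `<user>`, 3 = `<assistant>`), matching `pad_token_id` and `eos_token_id` in `config.json`. A quick way to confirm the mapping at runtime:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("anarlavrenov/LIME-1B")
for t in ["<pad>", "<eos>", "<user>", "<assistant>"]:
    print(t, tok.convert_tokens_to_ids(t))
# Expected, per this file: <pad> 0, <eos> 1, <user> 2, <assistant> 3
```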