Note: This model serves as proof that a single individual, without any team or institutional backing, can develop an LLM that demonstrates competitive results.
LIME-1B
LIME-1B is a 1B-parameter, decoder-only Transformer language model trained from scratch on English web data and then instruction-tuned on a curated mixture of assistant-style datasets with and without retrieval context. It is designed as a compact, practical base model for:
- Building RAG systems (context + question → answer)
- Assistant-style Q&A and task completion
- Summarization, explanation, and rewriting tasks in English
⚠️ LIME-1B is not RLHF/DPO-aligned and does not have tool use or multi-turn chat training baked in. It is an instruction-tuned LM, not a fully aligned assistant like ChatGPT.
1. Model architecture
LIME-1B is a decoder-only Transformer with several quality-oriented design choices:
| Component | Value |
|---|---|
| Architecture | Decoder-only Transformer |
| Parameters | 1.0B |
| Layers (decoder blocks) | 32 |
| d_model | 1536 |
| FFN dimension (d_ff) | 6144 |
| Attention heads | 24 |
| Vocabulary size | 50,000 |
| Max sequence length | 512 tokens |
| Positional encoding | Sinusoidal |
| Norm | RMSNorm |
| FFN | SiLU MLP |
| Attention | FlashAttention |
| Tying of embeddings | Output head tied to embedding |
| Precision (training) | Mixed fp32/bf16 (autocast) + grad clipping |
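For readers who want to map the table onto code, here is a minimal configuration sketch. The field names are illustrative only and are not the model's actual config class:

```python
from dataclasses import dataclass

# Illustrative configuration mirroring the table above; field names are
# hypothetical and do not correspond to the released config class.
@dataclass
class LimeConfig:
    n_layers: int = 32           # decoder blocks
    d_model: int = 1536          # hidden size
    d_ff: int = 6144             # SiLU MLP inner dimension
    n_heads: int = 24            # attention heads (head_dim = 1536 / 24 = 64)
    vocab_size: int = 50_000
    max_seq_len: int = 512       # sinusoidal positional encoding up to this length
    tie_embeddings: bool = True  # output head shares weights with the token embedding
```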
2. Training data
2.1 Pretraining
The base model is pretrained as a standard causal language model on English web data:
- Corpus: FineWeb-Edu (CC-MAIN-2025-05 split)
- Language filter: English-only subset
- Objective: next-token prediction (causal LM)
- Token budget: 20B tokens
- Context length: 512 tokens
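As a rough illustration of the next-token prediction objective listed above (not the actual training code), the inputs are shifted by one position and scored with cross-entropy:

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Standard next-token prediction loss.

    logits:    (batch, seq_len, vocab_size) model outputs
    input_ids: (batch, seq_len) token ids, used as both inputs and targets
    """
    # Predict token t+1 from positions <= t: drop the last logit and the first target.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = input_ids[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
```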
2.2 Instruction fine-tuning (SFT)
After pretraining, the model is fine-tuned on a unified instruction schema:
[context (optional)] <user> instruction_text <assistant> response_text <eos>
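A minimal sketch of how a training example could be assembled under this schema; the exact separators and whitespace handling in the released pipeline may differ:

```python
def format_sft_example(instruction: str, response: str, context: str | None = None,
                       eos_token: str = "<eos>") -> str:
    # Optional retrieval context is prepended before the <user> turn, mirroring the
    # schema above; exact whitespace and token handling are assumptions.
    prefix = context if context is not None else ""
    return f"{prefix}<user>{instruction}<assistant>{response}{eos_token}"
```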
SFT Data Mixture (~97k examples total):
- projecte-aina/RAG_Multilingual
- databricks/databricks-dolly-15k
- HuggingFaceH4/no_robots
- CohereLabs/aya_dataset
- yahma/alpaca-cleaned
Training Details
Hardware
- GPUs: 8 × NVIDIA A100 80GB (data parallel)
- Precision: bfloat16 with gradient clipping (max_norm = 1.0)
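A hedged sketch of what a bf16 autocast training step with gradient clipping typically looks like in PyTorch; this is not the project's actual training loop, and `model`, `optimizer`, and `batch` are placeholders (a HF-style output with a `.loss` attribute is assumed):

```python
import torch

def train_step(model, optimizer, batch, max_norm: float = 1.0):
    optimizer.zero_grad(set_to_none=True)
    # Forward pass under bf16 autocast; parameters stay in fp32.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(**batch).loss
    loss.backward()
    # Clip the global gradient norm to 1.0, matching the setup above.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item()
```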
Pretraining
Objective: Cross-entropy loss on next-token prediction
Optimizer: AdamW
- β₁ = 0.9
- β₂ = 0.95
- Weight decay applied to non-norm/non-bias parameters
Learning Rate Schedule:
- Peak LR: ~5e-4
- Polynomial decay to 5e-6
- Warmup: ~5% of total steps
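A small sketch of a warmup-then-polynomial-decay schedule with these endpoints; the decay power and exact step counts are assumptions:

```python
def lr_at_step(step: int, total_steps: int, peak_lr: float = 5e-4,
               final_lr: float = 5e-6, warmup_frac: float = 0.05,
               power: float = 1.0) -> float:
    # Linear warmup to the peak LR, then polynomial decay down to the final LR.
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        return peak_lr * (step + 1) / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + (peak_lr - final_lr) * (1.0 - progress) ** power
```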
Instruction fine-tuning (SFT)
Objective: Cross-entropy loss on next-token prediction
Optimizer: AdamW
- β₁ = 0.9
- β₂ = 0.95
- Weight decay applied to non-norm/non-bias parameters
Learning Rate Schedule:
- Peak LR: 8e-5
- Polynomial decay to 1e-5
- Warmup: 10% of total steps
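The "weight decay applied to non-norm/non-bias parameters" detail is typically implemented with two optimizer parameter groups. A sketch follows; the parameter-name matching and the decay value are assumptions, not the project's exact code:

```python
import torch

def build_adamw(model, lr: float, weight_decay: float = 0.1):
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Biases and norm weights are excluded from weight decay.
        if param.ndim < 2 or "norm" in name.lower() or name.endswith(".bias"):
            no_decay.append(param)
        else:
            decay.append(param)
    return torch.optim.AdamW(
        [{"params": decay, "weight_decay": weight_decay},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=lr, betas=(0.9, 0.95),
    )
```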
3. Evaluation Benchmarks
Charts comparing LIME-1B against other models across eight standard evaluation tasks can be viewed here:
Usage
# Example usage
# pip install -U transformers torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "anarlavrenov/lime-1b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

def build_prompt(question):
    uid = "<user>"
    aid = "<assistant>"
    return uid + question + aid

question = "Write five questions for a Data Scientist interview."
prompt = build_prompt(question)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
input_length = inputs["input_ids"].shape[1]

outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    num_beams=4,
    early_stopping=True,
    repetition_penalty=1.15,
    no_repeat_ngram_size=3,
    min_new_tokens=16,
    do_sample=False,
    top_p=None,
    temperature=None,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

generated_tokens = outputs[0][input_length:]
output = tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(output)
# 1. Can you tell us about your experience with data analysis and modeling?
# 2. How do you approach data cleaning and preprocessing?
# 3. How do you approach data visualization and storytelling?
# 4. Can you walk us through a time when you used data to solve a problem?
# 5. How do you approach the ethical considerations of data science and machine learning?
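Since the model was also instruction-tuned with retrieval context, a RAG-style prompt can be built by prepending the retrieved passage before the `<user>` turn. The sketch below reuses `tokenizer` and `model` from the example above; the exact context formatting is an assumption:

```python
def build_rag_prompt(context, question):
    # Retrieved passage goes before the <user> turn, matching the SFT schema.
    return context + "<user>" + question + "<assistant>"

rag_prompt = build_rag_prompt(
    "LIME-1B is a 1B-parameter decoder-only Transformer trained on English web data.",
    "How many parameters does this model have?",
)
rag_inputs = tokenizer(rag_prompt, return_tensors="pt").to(model.device)
rag_outputs = model.generate(
    **rag_inputs,
    max_new_tokens=64,
    num_beams=4,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(rag_outputs[0][rag_inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```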
If you use LIME-1B in academic work or public products, please consider citing the model and the underlying datasets according to their respective licenses and documentation.
Anar Lavrenov
Feel free to reach out with questions or feedback about LIME-1B!
Citation
@misc{lime1b2025,
  title = {LIME-1B: A 1B-parameter English Causal Language Model},
  author = {Anar Lavrenov},
  year = {2025},
  howpublished = {\url{https://huggingface.co/anarlavrenov/LIME-1B}}
}