Upload README.md with huggingface_hub
README.md
ADDED
@@ -0,0 +1,199 @@
---
license: mit
language:
- en
tags:
- cofrgenet
- baseline
- transformer
- language-model
- experiment
datasets:
- HuggingFaceFW/fineweb-edu
model-index:
- name: pair3-baseline-7b
  results:
  - task:
      type: text-generation
      name: Language Modeling
    dataset:
      name: WikiText-2
      type: wikitext
      split: test
    metrics:
    - name: Perplexity (step 10K, best generalization)
      type: perplexity
      value: 39.52
    - name: Perplexity (step 95K, final)
      type: perplexity
      value: 2952578.70
  - task:
      type: text-generation
      name: LAMBADA
    dataset:
      name: LAMBADA
      type: lambada
      split: test
    metrics:
    - name: Accuracy (step 10K, best generalization)
      type: accuracy
      value: 15.89
    - name: Accuracy (step 95K, final)
      type: accuracy
      value: 6.31
---

# Pair 3 Baseline: 7.5B Standard Transformer

## What This Model Is (And Isn't)

**This is NOT a general-purpose language model.** This is the **baseline control** in a paired experiment comparing standard Transformer FFN layers against CoFrGeNet-F's continued fraction FFN replacement ([arXiv:2601.21766](https://arxiv.org/abs/2601.21766)).

This model exists solely to provide an apples-to-apples comparison target. It is a 7.5B-parameter model trained on 50B tokens, which leaves it heavily overparameterized for the data budget (the Chinchilla-optimal budget for this model size would be ~150B tokens). As a result, the final checkpoint is **catastrophically overfit** (train loss 0.008, WikiText-2 PPL 2.95M). This is by design: both the baseline and its CoFrGeNet-F counterpart face the same data constraint, which keeps the comparison fair.

## Checkpoints

Two checkpoints are provided:

| Checkpoint | Step | Tokens Seen | Purpose |
|-----------|------|-------------|---------|
| `step_010000.safetensors` | 10,000 / 95,367 | 5.2B / 50B | **Best generalizing model** (val loss near its minimum) |
| `step_095367.safetensors` | 95,367 / 95,367 | 50B / 50B | **Final checkpoint** (for head-to-head comparison with CoFrGeNet-F) |

### Why Two Checkpoints?

The best *language model* and the best *experiment endpoint* are different checkpoints:

- **Step 10K** saw only about 10% of the data but has the best benchmark scores and a validation loss of 2.95, within 0.01 of the minimum recorded in the validation loss trajectory (2.94 at step 8K). This is the point before overfitting erodes generalization. If you want to actually *use* this as an LLM, use this checkpoint.
- **Step 95K** completed the full training run. It memorized the training data (train PPL 1.0) but lost almost all generalization (WikiText-2 PPL 2.95M). For the CoFrGeNet-F comparison, we evaluate both models at the same step count on the same data, so this is the experiment endpoint.

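
The token counts in the checkpoint table follow from step count times batch size. Below is a minimal sketch of that arithmetic, using the batch size and token budget stated under Training Details. The decomposition into 8 GPUs x 64 sequences x 1,024 tokens is inferred from the hardware, micro-batch, and context-length figures; the source does not spell it out.

```python
# Assumed decomposition of the 524,288-token batch:
# 8 GPUs x micro_batch 64 sequences per GPU x 1,024-token context.
TOKENS_PER_STEP = 8 * 64 * 1024
TOTAL_TOKENS = 50_000_000_000  # nominal 50B-token budget


def tokens_seen(step: int) -> int:
    """Tokens consumed after `step` optimizer updates."""
    return step * TOKENS_PER_STEP


def total_steps(total_tokens: int = TOTAL_TOKENS) -> int:
    """Number of updates needed for one pass over the token budget."""
    return round(total_tokens / TOKENS_PER_STEP)


for step in (10_000, total_steps()):
    seen = tokens_seen(step)
    print(f"step {step:,}: {seen / 1e9:.1f}B tokens "
          f"({100 * seen / TOTAL_TOKENS:.1f}% of budget)")
```

Under these assumptions step 10,000 corresponds to roughly 5.2B tokens, matching the table.
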
## Evaluation Results

| Metric | Step 10K (Best LLM) | Step 20K | Step 95K (Final) |
|--------|---------------------|----------|-----------------|
| **WikiText-2 PPL** | **39.52** | 52.21 | 2,952,579 |
| **WikiText-103 PPL** | **39.52** | 52.21 | 2,952,579 |
| **LAMBADA PPL** | **51.48** | 76.88 | 5,240,843 |
| **LAMBADA Acc** | **15.89%** | 13.12% | 6.31% |
| Throughput | 29,561 tok/s | 26,799 tok/s | 55,693 tok/s |
| Gen Speed | 104.47 ms/tok | 88.06 ms/tok | 47.19 ms/tok |

Evaluated with `scripts/04_evaluate.py` on a single NVIDIA B200 GPU using stride-512 sliding window perplexity. WikiText-2 and WikiText-103 share the same test split, which is consistent with the identical values in those two rows.

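
For readers unfamiliar with strided evaluation, here is a minimal sketch of how a stride-512 sliding window over a 1,024-token context can be laid out. It is an illustration of the general technique, not the contents of `scripts/04_evaluate.py`; the function names and the choice to score only tokens not covered by an earlier window are assumptions.

```python
import math


def sliding_windows(n_tokens, max_len=1024, stride=512):
    """Return (begin, end, n_scored) for each evaluation window."""
    windows = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + max_len, n_tokens)
        # Only tokens past the previous window's end are scored;
        # the earlier part of the window serves purely as context.
        windows.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return windows


def perplexity(total_nll, n_scored):
    """Perplexity from a summed negative log-likelihood in nats."""
    return math.exp(total_nll / n_scored)


for begin, end, n_scored in sliding_windows(2048):
    print(f"window [{begin}, {end}) scores {n_scored} tokens")
```

In a real evaluation, each window would be fed to the model, the loss on the last `n_scored` positions summed, and the total passed to `perplexity` together with the total number of scored tokens.
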
### Context: Pair 1 (450M) vs Pair 3 (7.5B)

| Model | Params | WikiText-2 PPL | LAMBADA Acc |
|-------|--------|---------------|-------------|
| Pair 1 Baseline (final) | 450M | 23.69 | 26.88% |
| **Pair 3 Baseline (step 10K)** | **7.5B** | **39.52** | **15.89%** |
| **Pair 3 Baseline (final)** | **7.5B** | **2,952,579** | **6.31%** |

The 7.5B model at step 10K underperforms the 450M model's final checkpoint on benchmarks. This is expected: the 450M model completed a full 50B-token run at a much healthier tokens-per-parameter ratio (111 tok/param vs 6.7 tok/param for the full 7.5B run), while the 7.5B model at step 10K has seen only 5.2B tokens (about 0.7 tok/param), not enough for a model this large to learn effectively. This is consistent with the Chinchilla scaling results: a smaller model with adequate data beats a larger model with insufficient data.

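
The ratios quoted above can be reproduced with a few lines of arithmetic. This sketch assumes the common rule of thumb of about 20 training tokens per parameter as the Chinchilla-optimal budget, which is what the ~150B figure for a 7.5B model implies; the parameter counts are the rounded values used in this README.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20  # rule-of-thumb assumption


def tokens_per_param(tokens: float, params: float) -> float:
    """Training tokens seen per model parameter."""
    return tokens / params


def chinchilla_tokens(params: float) -> float:
    """Approximate compute-optimal token budget for a model size."""
    return params * CHINCHILLA_TOKENS_PER_PARAM


runs = {
    "Pair 1 baseline (450M, 50B tokens)": (50e9, 450e6),
    "Pair 3 baseline (7.5B, 50B tokens)": (50e9, 7.5e9),
    "Pair 3 baseline at step 10K (5.2B tokens)": (5.2e9, 7.5e9),
}
for name, (tokens, params) in runs.items():
    print(f"{name}: {tokens_per_param(tokens, params):.1f} tok/param")

print(f"Chinchilla budget for 7.5B params: {chinchilla_tokens(7.5e9) / 1e9:.0f}B tokens")
```
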
## Architecture

Standard GPT-2-style Transformer with pre-norm (LayerNorm before attention and FFN).

| Parameter | Value |
|-----------|-------|
| Layers | 36 |
| Hidden dim | 4096 |
| Attention heads | 32 |
| Head dim | 128 |
| FFN inner dim | 16,384 (4x hidden) |
| Vocab size | 50,257 (GPT-2 tokenizer) |
| Context length | 1,024 |
| Total parameters | 7,458,103,296 |
| Weight tying | Yes (lm_head = tok_emb) |

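
As a sanity check on the table, the stated total can be reproduced from the other rows. The sketch below relies on assumptions the README does not state: no bias terms in the linear layers, scale-only norm weights, learned positional embeddings, and a single shared matrix for `tok_emb` and `lm_head`. This combination happens to match the listed total exactly, but the model definition in the project repo is the authority on the actual layout.

```python
def count_params(n_layer, d_model, d_ffn, vocab, n_ctx):
    """Parameter count under the assumptions named in the lead-in."""
    attn = 4 * d_model * d_model   # Q, K, V and output projections, no biases
    ffn = 2 * d_model * d_ffn      # up and down projections, no biases
    norms = 2 * d_model            # two scale-only norms per block
    per_block = attn + ffn + norms

    tok_emb = vocab * d_model      # shared with lm_head (weight tying)
    pos_emb = n_ctx * d_model      # learned positions (assumed)
    final_norm = d_model

    return n_layer * per_block + tok_emb + pos_emb + final_norm


total = count_params(n_layer=36, d_model=4096, d_ffn=16_384,
                     vocab=50_257, n_ctx=1_024)
print(f"{total:,}")
```
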
## Training Details

| Setting | Value |
|---------|-------|
| **Dataset** | FineWeb-Edu 50BT (educational web text) |
| **Tokenizer** | GPT-2 (tiktoken, `gpt2` encoding) |
| **Hardware** | 8x NVIDIA B200 (179 GB each) |
| **Parallelism** | FSDP FULL_SHARD |
| **Precision** | bfloat16 |
| **Optimizer** | AdamW (fused), beta1=0.9, beta2=0.95 |
| **Learning rate** | 3e-4 peak, cosine decay to 0 |
| **Warmup** | 2,000 steps |
| **Weight decay** | 0.1 (2D weight tensors only) |
| **Gradient clipping** | 1.0 max norm |
| **Batch size** | 524,288 tokens/update (micro_batch=64, no grad accumulation) |
| **Total steps** | 95,367 (1 epoch over 50B tokens) |
| **Throughput** | ~132,800 tok/s |
| **Wall time** | ~5.5 days |
| **torch.compile** | Disabled (dtype mismatch crash at 7B+ scale) |

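
The learning-rate rows of the table can be read as a warmup-then-cosine schedule. The sketch below shows one common way to write it. The linear shape of the warmup and the function name `lr_at` are assumptions; the table gives only the peak value, the warmup length, and the decay target.

```python
import math

PEAK_LR = 3e-4
WARMUP_STEPS = 2_000
TOTAL_STEPS = 95_367


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step."""
    if step < WARMUP_STEPS:
        # Linear warmup (assumed shape) up to the peak value.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # Cosine decay from the peak down to 0 at the final step.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


for step in (0, 1_000, 2_000, 10_000, 50_000, TOTAL_STEPS):
    print(f"step {step:>6,}: lr = {lr_at(step):.2e}")
```
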
### Validation Loss Trajectory

The model's best validation loss occurred early in training (2.94 at step 8K in the table below). After ~step 10K, val loss increases monotonically while train loss continues dropping: classic overfitting from an overparameterized model on limited data.

| Step | Train Loss | Val Loss | Val PPL |
|------|-----------|----------|---------|
| 5,000 | ~3.0 | 3.01 | 20.3 |
| 8,000 | ~2.9 | 2.94 | 18.8 |
| **10,000** | **~2.9** | **2.95** | **19.0** |
| 20,000 | ~1.2 | 3.05 | 21.2 |
| 40,000 | ~0.4 | 3.33 | 27.8 |
| 60,000 | ~0.04 | 7.13 | 1,251 |
| 80,000 | ~0.01 | 11.60 | 109,013 |
| 95,367 | 0.008 | ~12.0 | ~163,000 |

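
The Val PPL column is the exponential of the Val Loss column, assuming the loss is a mean cross-entropy in nats. A short sketch of that conversion follows, with the losses copied from the table. Because the table shows losses rounded to two decimals, the recomputed values differ slightly from the listed ones, most visibly at the large losses.

```python
import math

# (step -> validation loss), copied from the table above
VAL_LOSS = {
    5_000: 3.01,
    8_000: 2.94,
    10_000: 2.95,
    20_000: 3.05,
    40_000: 3.33,
    60_000: 7.13,
    80_000: 11.60,
}


def val_ppl(loss: float) -> float:
    """Perplexity for a mean cross-entropy loss in nats."""
    return math.exp(loss)


for step, loss in VAL_LOSS.items():
    print(f"step {step:>6,}: loss {loss:5.2f} -> PPL {val_ppl(loss):,.1f}")

best_step = min(VAL_LOSS, key=VAL_LOSS.get)
print(f"lowest tabulated val loss at step {best_step:,}")
```
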
## The CoFrGeNet-F Experiment

This model is one half of **Pair 3** in a series of experiments testing IBM Research's CoFrGeNet-F architecture ([arXiv:2601.21766](https://arxiv.org/abs/2601.21766)). CoFrGeNet-F replaces standard FFN layers with continued fraction networks, a function approximator that the paper presents as matching the expressiveness of a standard FFN with fewer parameters.

### Experiment Design

Each "pair" trains a standard Transformer baseline and a CoFrGeNet-F model on identical data with identical training hyperparameters. The FFN layer is the intended difference; in Pair 3 the CoFrGeNet-F model also uses a wider hidden dimension and more attention heads (4608d, 36h vs 4096d, 32h), as the table shows.

| | Baseline (this model) | CoFrGeNet-F |
|-|----------------------|-------------|
| **Params** | 7.5B | ~4.8B (35% fewer) |
| **Architecture** | 36L, 4096d, 32h, standard FFN | 36L, 4608d, 36h, Cffn (L=3, d=5) |
| **Data** | 50B tokens FineWeb-Edu | 50B tokens FineWeb-Edu |
| **LR / Schedule** | 3e-4, cosine to 0 | 3e-4, cosine to 0 |
| **Batch size** | 524,288 tokens | 524,288 tokens |

The IBM paper reports that CoFrGeNet-F's advantage emerges only at GPT-2 XL scale (~1B+ params). Pair 3 tests well beyond that scale: the 7.5B baseline is 5x the size of GPT-2 XL (1.5B parameters). Results for the CoFrGeNet-F counterpart will be published at [`cahlen/pair3-cofrgenet-5b`](https://huggingface.co/cahlen/pair3-cofrgenet-5b) when training completes.

### Prior Pairs

| Pair | Baseline | CoFrGeNet-F | Result |
|------|----------|-------------|--------|
| Pair 1 | 450M, WikiText-2 PPL 23.69 | 410M, PPL 56.61 | Baseline wins |
| **Pair 3** | **7.5B (this model)** | **4.8B (training next)** | **Pending** |

## Usage

```python
import torch
from safetensors.torch import load_file

# The model definition lives in the project repo. Clone it first and make
# sure `src` is importable (for example, by running from the repo root):
#   git clone https://github.com/cahlen/cofrgenet-f
from src.baseline.config import BaselineConfig
from src.baseline.model import BaselineTransformer

# Load the best-generalization checkpoint
state_dict = load_file("step_010000.safetensors")

config = BaselineConfig(n_layer=36, n_embd=4096, n_head=32)
model = BaselineTransformer(config)

# strict=False for weight tying (lm_head shares tok_emb). Print the
# report so that any other key mismatch does not pass silently.
result = model.load_state_dict(state_dict, strict=False)
print("missing keys:", result.missing_keys)
print("unexpected keys:", result.unexpected_keys)
model.eval()
```

## Project Links

- **GitHub:** [cahlen/cofrgenet-f](https://github.com/cahlen/cofrgenet-f)
- **HuggingFace (all models):** [cahlen/cofrgenet-f](https://huggingface.co/cahlen/cofrgenet-f)
- **CoFrGeNet-F paper:** [arXiv:2601.21766](https://arxiv.org/abs/2601.21766)
- **Project Wiki:** [GitHub Wiki](https://github.com/cahlen/cofrgenet-f/wiki)

## License

MIT