cahlen committed
Commit dfae758 (verified)
1 Parent(s): 3197925

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +199 -0
README.md ADDED
@@ -0,0 +1,199 @@
1
+ ---
2
+ license: mit
3
+ language:
4
+ - en
5
+ tags:
6
+ - cofrgenet
7
+ - baseline
8
+ - transformer
9
+ - language-model
10
+ - experiment
11
+ datasets:
12
+ - HuggingFaceFW/fineweb-edu
13
+ model-index:
14
+ - name: pair3-baseline-7b
15
+ results:
16
+ - task:
17
+ type: text-generation
18
+ name: Language Modeling
19
+ dataset:
20
+ name: WikiText-2
21
+ type: wikitext
22
+ split: test
23
+ metrics:
24
+ - name: Perplexity (step 10K, best generalization)
25
+ type: perplexity
26
+ value: 39.52
27
+ - name: Perplexity (step 95K, final)
28
+ type: perplexity
29
+ value: 2952578.70
30
+ - task:
31
+ type: text-generation
32
+ name: LAMBADA
33
+ dataset:
34
+ name: LAMBADA
35
+ type: lambada
36
+ split: test
37
+ metrics:
38
+ - name: Accuracy (step 10K, best generalization)
39
+ type: accuracy
40
+ value: 15.89
41
+ - name: Accuracy (step 95K, final)
42
+ type: accuracy
43
+ value: 6.31
44
+ ---
45
+
46
+ # Pair 3 Baseline: 7.5B Standard Transformer
47
+
48
+ ## What This Model Is (And Isn't)
49
+
50
+ **This is NOT a general-purpose language model.** This is the **baseline control** in a paired experiment comparing standard Transformer FFN layers against CoFrGeNet-F's continued fraction FFN replacement ([arXiv:2601.21766](https://arxiv.org/abs/2601.21766)).
51
+
52
+ This model exists solely to provide an apples-to-apples comparison target. It is a 7.5B-parameter model trained on 50B tokens, about 6.7 tokens per parameter, so it is heavily overparameterized for its data budget (the Chinchilla-optimal budget for this model size would be ~150B tokens). As a result, the final checkpoint is **catastrophically overfit** (train loss 0.008, WikiText-2 PPL 2.95M). This is by design: both the baseline and its CoFrGeNet-F counterpart face the same data constraint, which keeps the comparison fair.
53
+
54
+ ## Checkpoints
55
+
56
+ Two checkpoints are provided:
57
+
58
+ | Checkpoint | Step | Tokens Seen | Purpose |
59
+ |-----------|------|-------------|---------|
60
+ | `step_010000.safetensors` | 10,000 / 95,367 | 5.2B / 50B | **Best generalizing model** (val loss at or near its minimum) |
61
+ | `step_095367.safetensors` | 95,367 / 95,367 | 50B / 50B | **Final checkpoint** (for head-to-head comparison with CoFrGeNet-F) |
62
+
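The "Tokens Seen" column follows from the batch settings listed under Training Details. A minimal sketch of that arithmetic, assuming `micro_batch=64` means 64 sequences per GPU (an inference from the numbers, since 64 x 1,024 x 8 matches the stated 524,288 tokens per update):

```python
# Values taken from the Training Details table in this README.
CONTEXT_LENGTH = 1_024          # tokens per sequence
MICRO_BATCH = 64                # sequences per GPU per step (assumed per-GPU)
NUM_GPUS = 8
TOTAL_TOKENS = 50_000_000_000   # nominal 50B-token budget

tokens_per_step = MICRO_BATCH * CONTEXT_LENGTH * NUM_GPUS
total_steps = TOTAL_TOKENS // tokens_per_step  # whole optimizer steps in the budget


def tokens_seen(step: int) -> int:
    """Tokens consumed after `step` optimizer updates."""
    return step * tokens_per_step


print(tokens_per_step)                       # tokens per update
print(total_steps)                           # steps in one pass over the data
print(f"{tokens_seen(10_000) / 1e9:.2f}B")   # tokens seen at step 10K
```

Under these assumptions, step 10,000 corresponds to roughly 5.2B tokens and the full run to 95,367 steps, matching the table.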
63
+ ### Why Two Checkpoints?
64
+
65
+ The best *language model* and the best *experiment endpoint* are different checkpoints:
66
+
67
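+ - **Step 10K** saw only about 10% of the data (5.2B of 50B tokens) and has the best benchmark scores of the evaluated checkpoints. Its validation loss (2.95) is within 0.01 of the lowest value in the trajectory table (2.94 at step 8,000). This is the point before overfitting erodes generalization. If you want to actually *use* this as an LLM, use this checkpoint.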
68
+ - **Step 95K** completed the full training run. It memorized the training data (train PPL 1.0) but lost all generalization (WikiText-2 PPL 2.95M). For the CoFrGeNet-F comparison, we evaluate both models at the same step count on the same data — so this is the experiment endpoint.
69
+
70
+ ## Evaluation Results
71
+
72
+ | Metric | Step 10K (Best LLM) | Step 20K | Step 95K (Final) |
73
+ |--------|---------------------|----------|-----------------|
74
+ | **WikiText-2 PPL** | **39.52** | 52.21 | 2,952,579 |
75
+ | **WikiText-103 PPL** | **39.52** | 52.21 | 2,952,579 |
76
+ | **LAMBADA PPL** | **51.48** | 76.88 | 5,240,843 |
77
+ | **LAMBADA Acc** | **15.89%** | 13.12% | 6.31% |
78
+ | Throughput | 29,561 tok/s | 26,799 tok/s | 55,693 tok/s |
79
+ | Gen Speed | 104.47 ms/tok | 88.06 ms/tok | 47.19 ms/tok |
80
+
81
+ Evaluated with `scripts/04_evaluate.py` on a single NVIDIA B200 GPU using stride-512 sliding window perplexity.
82
+
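`scripts/04_evaluate.py` is not reproduced here. The following is only a sketch of the common strided sliding-window scheme, assuming a 1,024-token window (the model's context length) advanced 512 tokens at a time, with each token scored exactly once. `token_log_probs` is a hypothetical callback standing in for the model: it takes a window of token IDs and returns one log-probability per token in that window.

```python
import math
from typing import Callable, Sequence


def sliding_window_perplexity(
    token_ids: Sequence[int],
    token_log_probs: Callable[[Sequence[int]], Sequence[float]],
    max_len: int = 1024,
    stride: int = 512,
) -> float:
    """Perplexity over a non-empty token sequence using overlapping windows."""
    total_nll = 0.0
    n_scored = 0
    prev_end = 0
    for begin in range(0, len(token_ids), stride):
        end = min(begin + max_len, len(token_ids))
        window = token_ids[begin:end]
        n_new = end - prev_end  # tokens not already scored by an earlier window
        log_probs = token_log_probs(window)
        # Only the trailing n_new tokens count; the rest serve as context.
        total_nll -= sum(log_probs[len(window) - n_new:])
        n_scored += n_new
        prev_end = end
        if end == len(token_ids):
            break
    return math.exp(total_nll / n_scored)


# Sanity check with a stand-in "model" that is uniform over the GPT-2 vocabulary:
# its perplexity should equal the vocabulary size.
VOCAB = 50_257
uniform = lambda window: [-math.log(VOCAB)] * len(window)
ppl = sliding_window_perplexity(list(range(1300)), uniform)
```

With a real model, the overlap means every token after the first window is predicted with at least 512 tokens of context, which is why strided evaluation usually gives lower perplexity than disjoint chunks.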
83
+ ### Context: Pair 1 (450M) vs Pair 3 (7.5B)
84
+
85
+ | Model | Params | WikiText-2 PPL | LAMBADA Acc |
86
+ |-------|--------|---------------|-------------|
87
+ | Pair 1 Baseline (final) | 450M | 23.69 | 26.88% |
88
+ | **Pair 3 Baseline (step 10K)** | **7.5B** | **39.52** | **15.89%** |
89
+ | **Pair 3 Baseline (final)** | **7.5B** | **2,952,579** | **6.31%** |
90
+
91
+ The 7.5B model at step 10K underperforms the 450M model's final checkpoint on benchmarks. This is expected: the 450M model completed a full 50B-token run at a much healthier tokens-per-parameter ratio (111 tok/param, versus 6.7 tok/param for the full 7.5B run), while the 7.5B model at step 10K has seen only 5.2B tokens (about 0.7 tok/param), not enough for a model this large to learn effectively. This is consistent with the Chinchilla scaling results: a smaller model with adequate data beats a larger model with insufficient data.
92
+
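The ratios quoted above can be reproduced with a few lines of arithmetic. This sketch uses the rounded parameter counts from the table and assumes the commonly cited Chinchilla rule of thumb of about 20 training tokens per parameter, which is an approximation rather than an exact law:

```python
CHINCHILLA_TOKENS_PER_PARAM = 20  # rule-of-thumb value (assumption)


def tokens_per_param(tokens: float, params: float) -> float:
    """Training tokens seen per model parameter."""
    return tokens / params


pair1_final = tokens_per_param(50e9, 450e6)       # 450M model, full run
pair3_final = tokens_per_param(50e9, 7.5e9)       # 7.5B model, full run
pair3_step10k = tokens_per_param(5.24e9, 7.5e9)   # 7.5B model at step 10K

# Token budget the rule of thumb would suggest for a 7.5B model.
chinchilla_budget = CHINCHILLA_TOKENS_PER_PARAM * 7.5e9

print(f"{pair1_final:.1f} {pair3_final:.1f} {pair3_step10k:.1f}")
print(f"{chinchilla_budget / 1e9:.0f}B tokens")
```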
93
+ ## Architecture
94
+
95
+ Standard GPT-2-style Transformer with pre-norm (LayerNorm before attention and FFN).
96
+
97
+ | Parameter | Value |
98
+ |-----------|-------|
99
+ | Layers | 36 |
100
+ | Hidden dim | 4096 |
101
+ | Attention heads | 32 |
102
+ | Head dim | 128 |
103
+ | FFN inner dim | 16,384 (4x hidden) |
104
+ | Vocab size | 50,257 (GPT-2 tokenizer) |
105
+ | Context length | 1,024 |
106
+ | Total parameters | 7,458,103,296 |
107
+ | Weight tying | Yes (lm_head = tok_emb) |
108
+
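As a cross-check on the table, the total can be rebuilt from the listed dimensions. The sketch below assumes bias-free linear layers, weight-only LayerNorms, and learned absolute position embeddings; these details are not stated in this README and were chosen because they reproduce the listed total exactly.

```python
def gpt2_style_param_count(
    n_layer: int, d_model: int, d_ffn: int, vocab_size: int, context_len: int
) -> int:
    """Parameter count under the assumptions named above (not confirmed by the repo)."""
    attn = 4 * d_model * d_model       # Q, K, V, and output projections, no biases
    ffn = 2 * d_model * d_ffn          # up and down projections, no biases
    norms = 2 * d_model                # two weight-only LayerNorms per block
    per_layer = attn + ffn + norms
    tok_emb = vocab_size * d_model     # counted once: lm_head shares these weights
    pos_emb = context_len * d_model    # learned position embeddings (assumed)
    final_norm = d_model
    return n_layer * per_layer + tok_emb + pos_emb + final_norm


total = gpt2_style_param_count(
    n_layer=36, d_model=4096, d_ffn=16_384, vocab_size=50_257, context_len=1_024
)
print(f"{total:,}")
```

Because of weight tying, the embedding matrix is counted once; an untied output head would add another 50,257 x 4,096 parameters.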
109
+ ## Training Details
110
+
111
+ | Setting | Value |
112
+ |---------|-------|
113
+ | **Dataset** | FineWeb-Edu 50BT (educational web text) |
114
+ | **Tokenizer** | GPT-2 (tiktoken, `gpt2` encoding) |
115
+ | **Hardware** | 8x NVIDIA B200 (179 GB each) |
116
+ | **Parallelism** | FSDP FULL_SHARD |
117
+ | **Precision** | bfloat16 |
118
+ | **Optimizer** | AdamW (fused), beta1=0.9, beta2=0.95 |
119
+ | **Learning rate** | 3e-4 peak, cosine decay to 0 |
120
+ | **Warmup** | 2,000 steps |
121
+ | **Weight decay** | 0.1 (2D weight tensors only) |
122
+ | **Gradient clipping** | 1.0 max norm |
123
+ | **Batch size** | 524,288 tokens/update (micro_batch=64, no grad accumulation) |
124
+ | **Total steps** | 95,367 (1 epoch over 50B tokens) |
125
+ | **Throughput** | ~132,800 tok/s |
126
+ | **Wall time** | ~5.5 days |
127
+ | **torch.compile** | Disabled (dtype mismatch crash at 7B+ scale) |
128
+
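The learning-rate settings in the table describe a warmup followed by cosine decay. A minimal sketch of such a schedule, assuming the warmup is linear from 0 and the cosine phase spans the remaining steps (the table gives only the peak, the warmup length, and the final value):

```python
import math

PEAK_LR = 3e-4
WARMUP_STEPS = 2_000
TOTAL_STEPS = 95_367


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step."""
    if step < WARMUP_STEPS:
        # Linear warmup from 0 to the peak (assumed shape).
        return PEAK_LR * step / WARMUP_STEPS
    # Cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


for step in (0, 1_000, 2_000, 10_000, 95_367):
    print(step, f"{lr_at(step):.2e}")
```

Under this shape, the step-10K checkpoint is taken while the learning rate is still close to its peak.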
129
+ ### Validation Loss Trajectory
130
+
131
+ The model's best validation loss occurred early in training, between steps 8K and 10K. After that, val loss rises at every logged step while train loss keeps dropping: classic overfitting from an overparameterized model on limited data.
132
+
133
+ | Step | Train Loss | Val Loss | Val PPL |
134
+ |------|-----------|----------|---------|
135
+ | 5,000 | ~3.0 | 3.01 | 20.3 |
136
+ | 8,000 | ~2.9 | 2.94 | 18.8 |
137
+ | **10,000** | **~2.9** | **2.95** | **19.0** |
138
+ | 20,000 | ~1.2 | 3.05 | 21.2 |
139
+ | 40,000 | ~0.4 | 3.33 | 27.8 |
140
+ | 60,000 | ~0.04 | 7.13 | 1,251 |
141
+ | 80,000 | ~0.01 | 11.60 | 109,013 |
142
+ | 95,367 | 0.008 | ~12.0 | ~163,000 |
143
+
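The Val PPL column is the exponential of the per-token validation loss. The sketch below recomputes it from the rounded losses in the table, so the last digit can differ slightly from the printed values, and picks the step with the lowest loss. With the table as printed, that minimum falls at step 8,000, 0.01 below the released step-10,000 checkpoint.

```python
import math

# Validation losses copied from the table above (the approximate final row is omitted).
val_loss_by_step = {
    5_000: 3.01,
    8_000: 2.94,
    10_000: 2.95,
    20_000: 3.05,
    40_000: 3.33,
    60_000: 7.13,
    80_000: 11.60,
}


def perplexity(loss: float) -> float:
    """Perplexity for a mean per-token cross-entropy loss in nats."""
    return math.exp(loss)


best_step = min(val_loss_by_step, key=val_loss_by_step.get)

for step, loss in val_loss_by_step.items():
    print(f"{step:>6}  loss={loss:.2f}  ppl={perplexity(loss):,.1f}")
print("lowest val loss at step", best_step)
```

The same relation explains the train perplexity of about 1.0 at the final step: `perplexity(0.008)` is only slightly above 1.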
144
+ ## The CoFrGeNet-F Experiment
145
+
146
+ This model is one half of **Pair 3** in a series of experiments testing IBM Research's CoFrGeNet-F architecture ([arXiv:2601.21766](https://arxiv.org/abs/2601.21766)). CoFrGeNet-F replaces standard FFN layers with continued fraction networks, a function approximator intended to reach comparable expressiveness with fewer parameters.
147
+
148
+ ### Experiment Design
149
+
150
+ Each "pair" trains a standard Transformer baseline and a CoFrGeNet-F model on identical data with identical hyperparameters. The only difference is the FFN layer.
151
+
152
+ | | Baseline (this model) | CoFrGeNet-F |
153
+ |-|----------------------|-------------|
154
+ | **Params** | 7.5B | ~4.8B (35% fewer) |
155
+ | **Architecture** | 36L, 4096d, 32h, standard FFN | 36L, 4608d, 36h, Cffn (L=3, d=5) |
156
+ | **Data** | 50B tokens FineWeb-Edu | 50B tokens FineWeb-Edu |
157
+ | **LR / Schedule** | 3e-4, cosine to 0 | 3e-4, cosine to 0 |
158
+ | **Batch size** | 524,288 tokens | 524,288 tokens |
159
+
160
+ The IBM paper showed CoFrGeNet-F's advantage only emerges at GPT-2 XL scale (~1.5B params). Pair 3's 7.5B baseline is about 5x that size. Results for the CoFrGeNet-F counterpart will be published at [`cahlen/pair3-cofrgenet-5b`](https://huggingface.co/cahlen/pair3-cofrgenet-5b) when training completes.
161
+
162
+ ### Prior Pairs
163
+
164
+ | Pair | Baseline | CoFrGeNet-F | Result |
165
+ |------|----------|-------------|--------|
166
+ | Pair 1 | 450M, WikiText-2 PPL 23.69 | 410M, PPL 56.61 | Baseline wins |
167
+ | **Pair 3** | **7.5B (this model)** | **4.8B (training next)** | **Pending** |
168
+
169
+ ## Usage
170
+
171
+ ```python
172
+ from safetensors.torch import load_file
173
+ import torch
174
+
175
+ # Load the best-generalization checkpoint
176
+ state_dict = load_file("step_010000.safetensors")
177
+
178
+ # You'll need the model definition from the project repo
179
+ # git clone https://github.com/cahlen/cofrgenet-f
180
+ # Then:
181
+ from src.baseline.config import BaselineConfig
182
+ from src.baseline.model import BaselineTransformer
183
+
184
+ config = BaselineConfig(n_layer=36, n_embd=4096, n_head=32)
185
+ model = BaselineTransformer(config)
186
+ model.load_state_dict(state_dict, strict=False) # strict=False for weight tying
187
+ model.eval()
188
+ ```
189
+
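Because `strict=False` suppresses every key mismatch, not only the tied head, it may be worth checking what was actually skipped. A minimal sketch, assuming `BaselineTransformer` is a `torch.nn.Module` (so `load_state_dict` returns an object with `missing_keys` and `unexpected_keys`) and that the tied output head is stored under the hypothetical key `lm_head.weight`; the real key name in the project repo may differ.

```python
from typing import Iterable, List

# Hypothetical key name for the tied output head; check the project repo.
TIED_KEYS = ("lm_head.weight",)


def unexplained_keys(
    missing: Iterable[str],
    unexpected: Iterable[str],
    tied: Iterable[str] = TIED_KEYS,
) -> List[str]:
    """Return key mismatches that weight tying does not account for."""
    tied_set = set(tied)
    problems = [f"missing: {key}" for key in missing if key not in tied_set]
    problems += [f"unexpected: {key}" for key in unexpected]
    return problems


# Intended use with the snippet above (not executed here):
#   result = model.load_state_dict(state_dict, strict=False)
#   assert not unexplained_keys(result.missing_keys, result.unexpected_keys)
print(unexplained_keys(["lm_head.weight"], []))
```

An empty list means the only skipped entry was the tied head; anything else suggests the config passed to `BaselineConfig` does not match the checkpoint.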
190
+ ## Project Links
191
+
192
+ - **GitHub:** [cahlen/cofrgenet-f](https://github.com/cahlen/cofrgenet-f)
193
+ - **HuggingFace (all models):** [cahlen/cofrgenet-f](https://huggingface.co/cahlen/cofrgenet-f)
194
+ - **CoFrGeNet-F paper:** [arXiv:2601.21766](https://arxiv.org/abs/2601.21766)
195
+ - **Project Wiki:** [GitHub Wiki](https://github.com/cahlen/cofrgenet-f/wiki)
196
+
197
+ ## License
198
+
199
+ MIT