+---
+license: mit
+language:
+  - en
+tags:
+  - cofrgenet
+  - baseline
+  - transformer
+  - language-model
+  - experiment
+datasets:
+  - HuggingFaceFW/fineweb-edu
+model-index:
+  - name: pair3-baseline-7b
+    results:
+      - task:
+          type: text-generation
+          name: Language Modeling
+        dataset:
+          name: WikiText-2
+          type: wikitext
+          split: test
+        metrics:
+          - name: Perplexity (step 10K, best generalization)
+            type: perplexity
+            value: 39.52
+          - name: Perplexity (step 95K, final)
+            type: perplexity
+            value: 2952578.70
+      - task:
+          type: text-generation
+          name: LAMBADA
+        dataset:
+          name: LAMBADA
+          type: lambada
+          split: test
+        metrics:
+          - name: Accuracy (step 10K, best generalization)
+            type: accuracy
+            value: 15.89
+          - name: Accuracy (step 95K, final)
+            type: accuracy
+            value: 6.31
+---
+# Pair 3 Baseline: 7.5B Standard Transformer
+## What This Model Is (And Isn't)
+**This is NOT a general-purpose language model.** This is the **baseline control** in a paired experiment comparing standard Transformer FFN layers against CoFrGeNet-F's continued fraction FFN replacement ([arXiv:2601.21766](https://arxiv.org/abs/2601.21766)).
+This model exists solely to provide an apples-to-apples comparison target. It was trained on 50B tokens with a 7.5B parameter model — massively overparameterized for the data budget (Chinchilla optimal would be ~150B tokens for this model size). As a result, the final checkpoint is **catastrophically overfit** (train loss 0.008, WikiText-2 PPL 2.95M). This is by design — both the baseline and its CoFrGeNet-F counterpart face the same data constraint, making the comparison fair.
+## Checkpoints
+Two checkpoints are provided:
+| Checkpoint | Step | Tokens Seen | Purpose |
+|-----------|------|-------------|---------|
+| `step_010000.safetensors` | 10,000 / 95,367 | 5.2B / 50B | **Best generalizing model** (val loss 2.95, near the run minimum) |
+| `step_095367.safetensors` | 95,367 / 95,367 | 50B / 50B | **Final checkpoint** (for head-to-head comparison with CoFrGeNet-F) |
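The "Tokens Seen" column follows from the step count and the batch size. As a rough sketch, assuming the 524,288-token batch listed under Training Details is the product of micro-batch size, context length, and GPU count (the names `TOKENS_PER_STEP` and `tokens_seen` are illustrative and not taken from the project code):

```python
# Assumption: one optimizer step consumes micro_batch x context x GPUs tokens.
MICRO_BATCH = 64        # sequences per GPU per step
CONTEXT_LENGTH = 1_024  # tokens per sequence
N_GPUS = 8

TOKENS_PER_STEP = MICRO_BATCH * CONTEXT_LENGTH * N_GPUS  # 524,288


def tokens_seen(step: int) -> int:
    """Total training tokens consumed after `step` optimizer steps."""
    return step * TOKENS_PER_STEP


# Number of whole steps that fit into a 50B-token budget.
TOTAL_STEPS = 50_000_000_000 // TOKENS_PER_STEP

print(f"step 10,000 -> {tokens_seen(10_000):,} tokens")
print(f"total steps -> {TOTAL_STEPS:,}")
```

Under these assumptions, step 10,000 corresponds to about 5.24B tokens, and the 50B-token budget yields 95,367 whole steps, which matches the table.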
+### Why Two Checkpoints?
+The best *language model* and the best *experiment endpoint* are different checkpoints:
+- **Step 10K** saw only about 10% of the data but sits at the validation-loss minimum (2.95 at step 10K, versus 2.94 at step 8K in the trajectory table) and has the best benchmark scores of the evaluated checkpoints. This is the point before overfitting erodes generalization. If you want to actually *use* this as an LLM, use this checkpoint.
+- **Step 95K** completed the full training run. It memorized the training data (train PPL 1.0) but lost all generalization (WikiText-2 PPL 2.95M). For the CoFrGeNet-F comparison, we evaluate both models at the same step count on the same data — so this is the experiment endpoint.
+## Evaluation Results
+| Metric | Step 10K (Best LLM) | Step 20K | Step 95K (Final) |
+|--------|---------------------|----------|-----------------|
+| **WikiText-2 PPL** | **39.52** | 52.21 | 2,952,579 |
+| **WikiText-103 PPL** | **39.52** | 52.21 | 2,952,579 |
+| **LAMBADA PPL** | **51.48** | 76.88 | 5,240,843 |
+| **LAMBADA Acc** | **15.89%** | 13.12% | 6.31% |
+| Throughput | 29,561 tok/s | 26,799 tok/s | 55,693 tok/s |
+| Gen Speed | 104.47 ms/tok | 88.06 ms/tok | 47.19 ms/tok |
+Evaluated with `scripts/04_evaluate.py` on a single NVIDIA B200 GPU using stride-512 sliding window perplexity.
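The evaluation script is not reproduced in this card. The following is a minimal sketch of how stride-512 sliding-window perplexity is commonly computed, assuming windows of the model's 1,024-token context length. The names `sliding_windows`, `sliding_window_ppl`, and the `window_nll` callback are hypothetical and may not match `scripts/04_evaluate.py`.

```python
import math
from typing import Callable, Iterator, Sequence, Tuple


def sliding_windows(
    n_tokens: int, context: int = 1_024, stride: int = 512
) -> Iterator[Tuple[int, int, int]]:
    """Yield (begin, end, n_scored) for each window.

    The model reads tokens[begin:end]; only the last n_scored tokens are
    scored, so every token is counted exactly once across all windows.
    """
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context, n_tokens)
        yield begin, end, end - prev_end
        prev_end = end
        if end == n_tokens:
            break


def sliding_window_ppl(
    tokens: Sequence[int],
    window_nll: Callable[[Sequence[int], int], float],
    context: int = 1_024,
    stride: int = 512,
) -> float:
    """Perplexity from summed negative log-likelihoods over all windows.

    `window_nll(window, n_scored)` is a hypothetical callback that returns
    the summed NLL (in nats) of the last `n_scored` tokens of `window`.
    """
    total_nll, total_scored = 0.0, 0
    for begin, end, n_scored in sliding_windows(len(tokens), context, stride):
        total_nll += window_nll(tokens[begin:end], n_scored)
        total_scored += n_scored
    if total_scored == 0:
        raise ValueError("no tokens to score")
    return math.exp(total_nll / total_scored)
```

With a 512-token stride and a 1,024-token context, every scored token after the first window has at least 512 tokens of preceding context, at the cost of roughly twice the forward passes of non-overlapping evaluation.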
+### Context: Pair 1 (450M) vs Pair 3 (7.5B)
+| Model | Params | WikiText-2 PPL | LAMBADA Acc |
+|-------|--------|---------------|-------------|
+| Pair 1 Baseline (final) | 450M | 23.69 | 26.88% |
+| **Pair 3 Baseline (step 10K)** | **7.5B** | **39.52** | **15.89%** |
+| **Pair 3 Baseline (final)** | **7.5B** | **2,952,579** | **6.31%** |
+The 7.5B model at step 10K underperforms the 450M model's final checkpoint on benchmarks. This is expected: the 450M model completed a full 50B-token run at a much healthier tokens-per-parameter ratio (111 tok/param, versus 6.7 tok/param for the full 7.5B run), while the 7.5B model at step 10K has seen only 5.2B tokens (about 0.7 tok/param), not enough for a model this large to learn effectively. This is consistent with the Chinchilla scaling results: a smaller model with adequate data beats a larger model with insufficient data.
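The ratios quoted above can be reproduced with a few lines of arithmetic. This sketch assumes the common rule of thumb of roughly 20 training tokens per parameter for Chinchilla-optimal training; the constant and function names are illustrative.

```python
# Assumption: ~20 tokens per parameter as the Chinchilla rule of thumb.
CHINCHILLA_TOKENS_PER_PARAM = 20


def tokens_per_param(tokens: float, params: float) -> float:
    """Training tokens seen per model parameter."""
    return tokens / params


def chinchilla_tokens(params: int) -> int:
    """Approximate compute-optimal token budget for a given model size."""
    return CHINCHILLA_TOKENS_PER_PARAM * params


print(round(tokens_per_param(50e9, 450e6)))      # Pair 1 baseline, full run
print(round(tokens_per_param(50e9, 7.5e9), 1))   # Pair 3 baseline, full run
print(round(tokens_per_param(5.2e9, 7.5e9), 1))  # Pair 3 baseline at step 10K
print(f"{chinchilla_tokens(7_500_000_000):,}")   # rule-of-thumb budget for 7.5B
```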
+## Architecture
+Standard GPT-2-style Transformer with pre-norm (LayerNorm before attention and FFN).
+| Parameter | Value |
+|-----------|-------|
+| Layers | 36 |
+| Hidden dim | 4096 |
+| Attention heads | 32 |
+| Head dim | 128 |
+| FFN inner dim | 16,384 (4x hidden) |
+| Vocab size | 50,257 (GPT-2 tokenizer) |
+| Context length | 1,024 |
+| Total parameters | 7,458,103,296 |
+| Weight tying | Yes (lm_head = tok_emb) |
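As a sanity check, the total in the table can be reproduced from the other rows. The sketch below rests on assumptions the card does not state: learned absolute position embeddings, bias-free linear layers, and LayerNorms with a scale vector only. These were chosen because they reproduce the listed total exactly; the actual model definition may differ in detail.

```python
def count_params(
    n_layer: int = 36,
    d_model: int = 4_096,
    d_ff: int = 16_384,
    vocab: int = 50_257,
    n_ctx: int = 1_024,
) -> int:
    """Parameter count for a GPT-2-style decoder under the stated assumptions."""
    tok_emb = vocab * d_model     # shared with lm_head via weight tying
    pos_emb = n_ctx * d_model     # assumed: learned absolute positions
    attn = 4 * d_model * d_model  # Q, K, V, and output projections, no biases (assumed)
    ffn = 2 * d_model * d_ff      # up and down projections, no biases (assumed)
    norms = 2 * d_model           # two LayerNorm scale vectors per block (assumed no bias)
    final_norm = d_model          # final LayerNorm scale vector
    return tok_emb + pos_emb + n_layer * (attn + ffn + norms) + final_norm


print(f"{count_params():,}")  # 7,458,103,296
```

About 97% of the parameters sit in the 36 Transformer blocks, and two thirds of each block is the FFN, which is the component CoFrGeNet-F replaces.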
+## Training Details
+| Setting | Value |
+|---------|-------|
+| **Dataset** | FineWeb-Edu 50BT (educational web text) |
+| **Tokenizer** | GPT-2 (tiktoken, `gpt2` encoding) |
+| **Hardware** | 8x NVIDIA B200 (179 GB each) |
+| **Parallelism** | FSDP FULL_SHARD |
+| **Precision** | bfloat16 |
+| **Optimizer** | AdamW (fused), beta1=0.9, beta2=0.95 |
+| **Learning rate** | 3e-4 peak, cosine decay to 0 |
+| **Warmup** | 2,000 steps |
+| **Weight decay** | 0.1 (2D weight tensors only) |
+| **Gradient clipping** | 1.0 max norm |
+| **Batch size** | 524,288 tokens/update (micro_batch=64, no grad accumulation) |
+| **Total steps** | 95,367 (1 epoch over 50B tokens) |
+| **Throughput** | ~132,800 tok/s |
+| **Wall time** | ~5.5 days |
+| **torch.compile** | Disabled (dtype mismatch crash at 7B+ scale) |
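The learning-rate settings in the table can be sketched as a function of the step. The warmup shape is not specified in this card, so linear warmup is an assumption, and `lr_at` is an illustrative name rather than a function from the training code.

```python
import math

PEAK_LR = 3e-4
WARMUP_STEPS = 2_000
TOTAL_STEPS = 95_367


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step.

    Assumed linear warmup from 0 to PEAK_LR over WARMUP_STEPS, then
    cosine decay from PEAK_LR to 0 at TOTAL_STEPS.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


for step in (0, 1_000, 2_000, 10_000, 50_000, 95_367):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```

Under this schedule the step-10K checkpoint is taken while the learning rate is still close to its peak.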
+### Validation Loss Trajectory
+The model's best validation loss occurred early in training. After ~step 10K, val loss monotonically increases while train loss continues dropping — classic overfitting from an overparameterized model on limited data.
+| Step | Train Loss | Val Loss | Val PPL |
+|------|-----------|----------|---------|
+| 5,000 | ~3.0 | 3.01 | 20.3 |
+| 8,000 | ~2.9 | 2.94 | 18.8 |
+| **10,000** | **~2.9** | **2.95** | **19.0** |
+| 20,000 | ~1.2 | 3.05 | 21.2 |
+| 40,000 | ~0.4 | 3.33 | 27.8 |
+| 60,000 | ~0.04 | 7.13 | 1,251 |
+| 80,000 | ~0.01 | 11.60 | 109,013 |
+| 95,367 | 0.008 | ~12.0 | ~163,000 |
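The Val PPL column is the exponential of the validation loss, assuming the loss is mean cross-entropy in nats per token. A small helper (the name `loss_to_ppl` is illustrative) makes the conversion explicit:

```python
import math


def loss_to_ppl(loss: float) -> float:
    """Convert mean cross-entropy loss (nats per token) to perplexity."""
    return math.exp(loss)


for step, val_loss in [(5_000, 3.01), (20_000, 3.05), (40_000, 3.33), (95_367, 12.0)]:
    print(f"step {step:>6}: val loss {val_loss:5.2f} -> PPL {loss_to_ppl(val_loss):,.1f}")
```

Small differences from the table (for example at steps 8,000 and 10,000) are expected, because the table's losses are rounded to two decimals before display. The same conversion applied to the final train loss of 0.008 gives a train perplexity of about 1.008.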
+## The CoFrGeNet-F Experiment
+This model is one half of **Pair 3** in a series of experiments testing IBM Research's CoFrGeNet-F architecture ([arXiv:2601.21766](https://arxiv.org/abs/2601.21766)). CoFrGeNet-F replaces standard FFN layers with continued fraction networks, a function approximator that the paper reports can match a standard FFN's expressiveness with fewer parameters.
+### Experiment Design
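+Each "pair" trains a standard Transformer baseline and a CoFrGeNet-F model on identical data with identical training hyperparameters. The intended difference is the FFN layer; in Pair 3 the CoFrGeNet-F model also uses a wider hidden dimension and more attention heads (4608d, 36h vs 4096d, 32h), as the table shows.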
+| | Baseline (this model) | CoFrGeNet-F |
+|-|----------------------|-------------|
+| **Params** | 7.5B | ~4.8B (35% fewer) |
+| **Architecture** | 36L, 4096d, 32h, standard FFN | 36L, 4608d, 36h, Cffn (L=3, d=5) |
+| **Data** | 50B tokens FineWeb-Edu | 50B tokens FineWeb-Edu |
+| **LR / Schedule** | 3e-4, cosine to 0 | 3e-4, cosine to 0 |
+| **Batch size** | 524,288 tokens | 524,288 tokens |
+The IBM paper showed CoFrGeNet-F's advantage only emerges at GPT-2 XL scale (~1.5B params). At 7.5B parameters, the Pair 3 baseline is about 5x that scale. Results for the CoFrGeNet-F counterpart will be published at [`cahlen/pair3-cofrgenet-5b`](https://huggingface.co/cahlen/pair3-cofrgenet-5b) when training completes.
+### Prior Pairs
+| Pair | Baseline | CoFrGeNet-F | Result |
+|------|----------|-------------|--------|
+| Pair 1 | 450M, WikiText-2 PPL 23.69 | 410M, PPL 56.61 | Baseline wins |
+| **Pair 3** | **7.5B (this model)** | **4.8B (training next)** | **Pending** |
+## Usage
+```python
+from safetensors.torch import load_file
+import torch
+# Load the best-generalization checkpoint
+state_dict = load_file("step_010000.safetensors")
+# You'll need the model definition from the project repo
+# git clone https://github.com/cahlen/cofrgenet-f
+# Then:
+from src.baseline.config import BaselineConfig
+from src.baseline.model import BaselineTransformer
+config = BaselineConfig(n_layer=36, n_embd=4096, n_head=32)
+model = BaselineTransformer(config)
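+# strict=False because lm_head shares its weight with tok_emb (weight tying)
+result = model.load_state_dict(state_dict, strict=False)
+# strict=False skips mismatched keys silently, so inspect what was skipped
+print("missing:", result.missing_keys, "unexpected:", result.unexpected_keys)
+model.eval()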
+```
+## Project Links
+- **GitHub:** [cahlen/cofrgenet-f](https://github.com/cahlen/cofrgenet-f)
+- **HuggingFace (all models):** [cahlen/cofrgenet-f](https://huggingface.co/cahlen/cofrgenet-f)
+- **CoFrGeNet-F paper:** [arXiv:2601.21766](https://arxiv.org/abs/2601.21766)
+- **Project Wiki:** [GitHub Wiki](https://github.com/cahlen/cofrgenet-f/wiki)
+## License
+MIT