---
license: mit
language:
- en
tags:
- cofrgenet
- baseline
- transformer
- language-model
- experiment
datasets:
- HuggingFaceFW/fineweb-edu
model-index:
- name: pair3-baseline-7b
  results:
  - task:
      type: text-generation
      name: Language Modeling
    dataset:
      name: WikiText-2
      type: wikitext
      split: test
    metrics:
    - name: Perplexity (step 10K, best generalization)
      type: perplexity
      value: 39.52
    - name: Perplexity (step 95K, final)
      type: perplexity
      value: 2952578.70
  - task:
      type: text-generation
      name: LAMBADA
    dataset:
      name: LAMBADA
      type: lambada
      split: test
    metrics:
    - name: Accuracy (step 10K, best generalization)
      type: accuracy
      value: 15.89
    - name: Accuracy (step 95K, final)
      type: accuracy
      value: 6.31
---

# Pair 3 Baseline: 7.5B Standard Transformer

## What This Model Is (And Isn't)

**This is NOT a general-purpose language model.** This is the **baseline control** in a paired experiment comparing standard Transformer FFN layers against CoFrGeNet-F's continued fraction FFN replacement ([arXiv:2601.21766](https://arxiv.org/abs/2601.21766)).

This model exists solely to provide an apples-to-apples comparison target. It has 7.5B parameters and was trained on 50B tokens, which makes it massively overparameterized for the data budget (Chinchilla-optimal would be ~150B tokens for this model size). As a result, the final checkpoint is **catastrophically overfit** (train loss 0.008, WikiText-2 PPL 2.95M). This is by design: both the baseline and its CoFrGeNet-F counterpart face the same data constraint, which keeps the comparison fair.

## Checkpoints

Two checkpoints are provided:

| Checkpoint | Step | Tokens Seen | Purpose |
|-----------|------|-------------|---------|
| `step_010000.safetensors` | 10,000 / 95,367 | 5.2B / 50B | **Best generalizing model** (near-minimum val loss) |
| `step_095367.safetensors` | 95,367 / 95,367 | 50B / 50B | **Final checkpoint** (for head-to-head comparison with CoFrGeNet-F) |

### Why Two Checkpoints?

The best *language model* and the best *experiment endpoint* are different checkpoints:

- **Step 10K** saw only about 10% of the data but has a validation loss of 2.95, within 0.01 of the lowest logged value (2.94 at step 8K), and the best benchmark scores of the evaluated checkpoints. This is the point before overfitting erodes generalization. If you want to actually *use* this as an LLM, use this checkpoint.
- **Step 95K** completed the full training run. It memorized the training data (train PPL ≈ 1.0) but lost all generalization (WikiText-2 PPL 2.95M). For the CoFrGeNet-F comparison, we evaluate both models at the same step count on the same data, so this is the experiment endpoint.

## Evaluation Results

| Metric | Step 10K (Best LLM) | Step 20K | Step 95K (Final) |
|--------|---------------------|----------|-----------------|
| **WikiText-2 PPL** | **39.52** | 52.21 | 2,952,579 |
| **WikiText-103 PPL** | **39.52** | 52.21 | 2,952,579 |
| **LAMBADA PPL** | **51.48** | 76.88 | 5,240,843 |
| **LAMBADA Acc** | **15.89%** | 13.12% | 6.31% |
| Throughput | 29,561 tok/s | 26,799 tok/s | 55,693 tok/s |
| Gen Speed | 104.47 ms/tok | 88.06 ms/tok | 47.19 ms/tok |

Evaluated with `scripts/04_evaluate.py` on a single NVIDIA B200 GPU using stride-512 sliding window perplexity. The WikiText-2 and WikiText-103 rows are identical because the two datasets share the same test split.

### Context: Pair 1 (450M) vs Pair 3 (7.5B)

| Model | Params | WikiText-2 PPL | LAMBADA Acc |
|-------|--------|---------------|-------------|
| Pair 1 Baseline (final) | 450M | 23.69 | 26.88% |
| **Pair 3 Baseline (step 10K)** | **7.5B** | **39.52** | **15.89%** |
| **Pair 3 Baseline (final)** | **7.5B** | **2,952,579** | **6.31%** |

The 7.5B model at step 10K underperforms the 450M model's final checkpoint on benchmarks. This is expected: the 450M model completed a full 50B-token run at a much healthier tokens-per-parameter ratio (111 tok/param, versus 6.7 tok/param for the 7.5B model's full run), while the 7.5B model at step 10K has seen only 5.2B tokens (about 0.7 tok/param), which is not enough for a model this large to learn effectively.

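The tokens-per-parameter figures quoted in this section follow from simple division. A minimal sketch, using the rounded parameter and token counts from this card and assuming the common rule-of-thumb reading of the Chinchilla result as roughly 20 training tokens per parameter (the constant and helper names are illustrative and not part of the project code):

```python
# Rounded figures taken from this model card.
TRAIN_TOKENS = 50e9          # full FineWeb-Edu budget
TOKENS_AT_STEP_10K = 5.2e9   # tokens seen by the step-10K checkpoint
PARAMS = {"pair1_baseline": 450e6, "pair3_baseline": 7.5e9}

# Assumption: rule-of-thumb of ~20 training tokens per parameter.
CHINCHILLA_TOK_PER_PARAM = 20


def tokens_per_param(tokens: float, params: float) -> float:
    """Training tokens seen per model parameter."""
    return tokens / params


pair1_full = tokens_per_param(TRAIN_TOKENS, PARAMS["pair1_baseline"])
pair3_full = tokens_per_param(TRAIN_TOKENS, PARAMS["pair3_baseline"])
pair3_10k = tokens_per_param(TOKENS_AT_STEP_10K, PARAMS["pair3_baseline"])
chinchilla_tokens = CHINCHILLA_TOK_PER_PARAM * PARAMS["pair3_baseline"]

print(f"Pair 1, full run: {pair1_full:.1f} tok/param")
print(f"Pair 3, full run: {pair3_full:.1f} tok/param")
print(f"Pair 3, step 10K: {pair3_10k:.2f} tok/param")
print(f"Chinchilla-style budget for 7.5B: {chinchilla_tokens / 1e9:.0f}B tokens")
```
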
This illustrates the Chinchilla scaling law: a smaller model with adequate data beats a larger model with insufficient data.

## Architecture

Standard GPT-2-style Transformer with pre-norm (LayerNorm before attention and FFN).

| Parameter | Value |
|-----------|-------|
| Layers | 36 |
| Hidden dim | 4096 |
| Attention heads | 32 |
| Head dim | 128 |
| FFN inner dim | 16,384 (4x hidden) |
| Vocab size | 50,257 (GPT-2 tokenizer) |
| Context length | 1,024 |
| Total parameters | 7,458,103,296 |
| Weight tying | Yes (lm_head = tok_emb) |

## Training Details

| Setting | Value |
|---------|-------|
| **Dataset** | FineWeb-Edu 50BT (educational web text) |
| **Tokenizer** | GPT-2 (tiktoken, `gpt2` encoding) |
| **Hardware** | 8x NVIDIA B200 (179 GB each) |
| **Parallelism** | FSDP FULL_SHARD |
| **Precision** | bfloat16 |
| **Optimizer** | AdamW (fused), beta1=0.9, beta2=0.95 |
| **Learning rate** | 3e-4 peak, cosine decay to 0 |
| **Warmup** | 2,000 steps |
| **Weight decay** | 0.1 (2D weight tensors only) |
| **Gradient clipping** | 1.0 max norm |
| **Batch size** | 524,288 tokens/update (micro_batch=64, no grad accumulation) |
| **Total steps** | 95,367 (1 epoch over 50B tokens) |
| **Throughput** | ~132,800 tok/s |
| **Wall time** | ~5.5 days |
| **torch.compile** | Disabled (dtype mismatch crash at 7B+ scale) |

### Validation Loss Trajectory

The model's best validation loss occurred early in training. After ~step 10K, val loss increases monotonically while train loss continues dropping: classic overfitting from an overparameterized model on limited data.


| Step | Train Loss | Val Loss | Val PPL |
|------|-----------|----------|---------|
| 5,000 | ~3.0 | 3.01 | 20.3 |
| 8,000 | ~2.9 | 2.94 | 18.8 |
| **10,000** | **~2.9** | **2.95** | **19.0** |
| 20,000 | ~1.2 | 3.05 | 21.2 |
| 40,000 | ~0.4 | 3.33 | 27.8 |
| 60,000 | ~0.04 | 7.13 | 1,251 |
| 80,000 | ~0.01 | 11.60 | 109,013 |
| 95,367 | 0.008 | ~12.0 | ~163,000 |

## The CoFrGeNet-F Experiment

This model is one half of **Pair 3** in a series of experiments testing IBM Research's CoFrGeNet-F architecture ([arXiv:2601.21766](https://arxiv.org/abs/2601.21766)). CoFrGeNet-F replaces standard FFN layers with continued fraction networks, a function approximator that the paper presents as matching the expressiveness of a standard FFN with fewer parameters.

### Experiment Design

Each "pair" trains a standard Transformer baseline and a CoFrGeNet-F model on identical data with identical hyperparameters. The FFN layer is the component under test; in Pair 3 the CoFrGeNet-F model also uses a wider hidden dimension and more attention heads, as listed in the table.

| | Baseline (this model) | CoFrGeNet-F |
|-|----------------------|-------------|
| **Params** | 7.5B | ~4.8B (35% fewer) |
| **Architecture** | 36L, 4096d, 32h, standard FFN | 36L, 4608d, 36h, Cffn (L=3, d=5) |
| **Data** | 50B tokens FineWeb-Edu | 50B tokens FineWeb-Edu |
| **LR / Schedule** | 3e-4, cosine to 0 | 3e-4, cosine to 0 |
| **Batch size** | 524,288 tokens | 524,288 tokens |

The IBM paper showed that CoFrGeNet-F's advantage emerges only at GPT-2 XL scale (~1B+ params). Pair 3 tests at roughly 7.5x that scale. Results for the CoFrGeNet-F counterpart will be published at [`cahlen/pair3-cofrgenet-5b`](https://huggingface.co/cahlen/pair3-cofrgenet-5b) when training completes.

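The baseline's parameter count can be cross-checked against its architecture (36 layers, hidden dim 4096, FFN inner dim 16,384, vocab 50,257, context 1,024). The sketch below is a back-of-the-envelope count, not the project's code. It assumes bias-free linear layers, LayerNorm with a scale vector only, learned absolute position embeddings, and an output head tied to the token embedding. Only the weight tying is stated in this card; the other choices are assumptions, though under them the count matches the reported total of 7,458,103,296.

```python
def gpt2_style_param_count(
    n_layer: int = 36,
    d_model: int = 4096,
    d_ffn: int = 16_384,
    vocab_size: int = 50_257,
    context_len: int = 1_024,
) -> int:
    """Count parameters of a pre-norm GPT-2-style decoder.

    Assumptions (not confirmed by the model card): no bias terms in linear
    layers, LayerNorm with a scale vector only, learned position embeddings.
    The lm_head is tied to the token embedding, so it adds no parameters.
    """
    attention = 4 * d_model * d_model   # Q, K, V and output projections
    ffn = 2 * d_model * d_ffn           # up- and down-projection
    layer_norms = 2 * d_model           # one before attention, one before FFN
    per_layer = attention + ffn + layer_norms

    embeddings = vocab_size * d_model + context_len * d_model
    final_norm = d_model
    return n_layer * per_layer + embeddings + final_norm


total = gpt2_style_param_count()
print(f"{total:,}")  # 7,458,103,296
```

The count covers the baseline only. The CoFrGeNet-F `Cffn` layer is parameterized differently, and this card does not give enough detail to count it the same way.
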
### Prior Pairs

| Pair | Baseline | CoFrGeNet-F | Result |
|------|----------|-------------|--------|
| Pair 1 | 450M, WikiText-2 PPL 23.69 | 410M, PPL 56.61 | Baseline wins |
| **Pair 3** | **7.5B (this model)** | **4.8B (training next)** | **Pending** |

## Usage

```python
from safetensors.torch import load_file
import torch

# Load the best-generalization checkpoint
state_dict = load_file("step_010000.safetensors")

# You'll need the model definition from the project repo
# git clone https://github.com/cahlen/cofrgenet-f
# Then:
from src.baseline.config import BaselineConfig
from src.baseline.model import BaselineTransformer

config = BaselineConfig(n_layer=36, n_embd=4096, n_head=32)
model = BaselineTransformer(config)
model.load_state_dict(state_dict, strict=False)  # strict=False for weight tying
model.eval()
```

## Project Links

- **GitHub:** [cahlen/cofrgenet-f](https://github.com/cahlen/cofrgenet-f)
- **HuggingFace (all models):** [cahlen/cofrgenet-f](https://huggingface.co/cahlen/cofrgenet-f)
- **CoFrGeNet-F paper:** [arXiv:2601.21766](https://arxiv.org/abs/2601.21766)
- **Project Wiki:** [GitHub Wiki](https://github.com/cahlen/cofrgenet-f/wiki)

## License

MIT