---
license: mit
language:
  - en
tags:
  - cofrgenet
  - baseline
  - transformer
  - language-model
  - experiment
datasets:
  - HuggingFaceFW/fineweb-edu
model-index:
  - name: pair3-baseline-7b
    results:
      - task:
          type: text-generation
          name: Language Modeling
        dataset:
          name: WikiText-2
          type: wikitext
          split: test
        metrics:
          - name: Perplexity (step 10K, best generalization)
            type: perplexity
            value: 39.52
          - name: Perplexity (step 95K, final)
            type: perplexity
            value: 2952578.70
      - task:
          type: text-generation
          name: LAMBADA
        dataset:
          name: LAMBADA
          type: lambada
          split: test
        metrics:
          - name: Accuracy (step 10K, best generalization)
            type: accuracy
            value: 15.89
          - name: Accuracy (step 95K, final)
            type: accuracy
            value: 6.31
---

# Pair 3 Baseline: 7.5B Standard Transformer

## What This Model Is (And Isn't)

**This is NOT a general-purpose language model.** This is the **baseline control** in a paired experiment comparing standard Transformer FFN layers against CoFrGeNet-F's continued fraction FFN replacement ([arXiv:2601.21766](https://arxiv.org/abs/2601.21766)).

This model exists solely to provide an apples-to-apples comparison target. It has 7.5B parameters but was trained on only 50B tokens, about 6.7 tokens per parameter, which leaves it heavily overparameterized for its data budget (a Chinchilla-optimal run at this size would use roughly 150B tokens). As a result, the final checkpoint is **catastrophically overfit** (train loss 0.008, WikiText-2 PPL 2.95M). This is by design: the baseline and its CoFrGeNet-F counterpart face the same data constraint, which keeps the comparison fair.

## Checkpoints

Two checkpoints are provided:

| Checkpoint | Step | Tokens Seen | Purpose |
|-----------|------|-------------|---------|
| `step_010000.safetensors` | 10,000 / 95,367 | 5.2B / 50B | **Best generalizing model** (lowest val loss) |
| `step_095367.safetensors` | 95,367 / 95,367 | 50B / 50B | **Final checkpoint** (for head-to-head comparison with CoFrGeNet-F) |
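
The "Tokens Seen" column follows from the step count and the 524,288-token batch listed under Training Details. Below is a minimal sketch of that arithmetic. It assumes `micro_batch=64` means 64 sequences per GPU at the full 1,024-token context; that reading is an assumption, although it does reproduce the stated batch size on 8 GPUs.

```python
# Assumption: micro_batch=64 is per GPU and every sequence fills the
# 1,024-token context. 8 * 64 * 1024 matches the stated 524,288 tokens/update.
N_GPUS = 8
MICRO_BATCH = 64
CONTEXT_LEN = 1024

TOKENS_PER_STEP = N_GPUS * MICRO_BATCH * CONTEXT_LEN


def tokens_seen(step: int) -> int:
    """Total training tokens consumed after `step` optimizer updates."""
    return step * TOKENS_PER_STEP


for step in (10_000, 95_367):
    print(f"step {step:>6,}: {tokens_seen(step) / 1e9:.2f}B tokens")
```

Under these assumptions the two released checkpoints land at roughly 5.24B and 50.00B tokens, in line with the table.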

### Why Two Checkpoints?

The best *language model* and the best *experiment endpoint* are different checkpoints:

- **Step 10K** saw only about 10% of the data but sits at the bottom of the validation curve (val loss 2.95, versus a logged minimum of 2.94 at step 8,000) and has the best benchmark scores of the evaluated checkpoints. This is the point just before overfitting erodes generalization. If you want to actually *use* this as an LLM, use this checkpoint.
- **Step 95K** completed the full training run. It memorized the training data (train PPL 1.0) but lost all generalization (WikiText-2 PPL 2.95M). For the CoFrGeNet-F comparison, we evaluate both models at the same step count on the same data — so this is the experiment endpoint.

## Evaluation Results

| Metric | Step 10K (Best LLM) | Step 20K | Step 95K (Final) |
|--------|---------------------|----------|-----------------|
| **WikiText-2 PPL** | **39.52** | 52.21 | 2,952,579 |
| **WikiText-103 PPL** | **39.52** | 52.21 | 2,952,579 |
| **LAMBADA PPL** | **51.48** | 76.88 | 5,240,843 |
| **LAMBADA Acc** | **15.89%** | 13.12% | 6.31% |
| Throughput | 29,561 tok/s | 26,799 tok/s | 55,693 tok/s |
| Gen Speed | 104.47 ms/tok | 88.06 ms/tok | 47.19 ms/tok |

Evaluated with `scripts/04_evaluate.py` on a single NVIDIA B200 GPU using stride-512 sliding window perplexity.
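
For readers unfamiliar with strided evaluation, here is a rough sketch of how stride-512 sliding-window perplexity is usually computed. It is not the code in `scripts/04_evaluate.py`. The function names and the `nll_fn` callback (which stands in for a model forward pass and returns one negative log-likelihood per token) are hypothetical, and the sketch glosses over the one-token shift between inputs and targets.

```python
import math
from typing import Callable, Iterator, List, Sequence, Tuple

CONTEXT_LEN = 1024  # model context length from the Architecture table
STRIDE = 512        # stride quoted above


def sliding_windows(
    n_tokens: int, context: int = CONTEXT_LEN, stride: int = STRIDE
) -> Iterator[Tuple[int, int, int]]:
    """Yield (begin, end, n_scored) for each window.

    tokens[begin:end] is the model input; only its last n_scored tokens
    count toward the loss, so every token is scored exactly once.
    """
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context, n_tokens)
        yield begin, end, end - prev_end
        prev_end = end
        if end == n_tokens:
            break


def sliding_window_ppl(
    tokens: Sequence[int],
    nll_fn: Callable[[Sequence[int]], List[float]],  # hypothetical model hook
    context: int = CONTEXT_LEN,
    stride: int = STRIDE,
) -> float:
    """Perplexity = exp(mean NLL over all scored tokens)."""
    if len(tokens) == 0:
        raise ValueError("need at least one token")
    total_nll = 0.0
    total_scored = 0
    for begin, end, n_scored in sliding_windows(len(tokens), context, stride):
        window_nll = nll_fn(tokens[begin:end])
        total_nll += sum(window_nll[-n_scored:])  # earlier tokens are context only
        total_scored += n_scored
    return math.exp(total_nll / total_scored)


def uniform_nll(window: Sequence[int]) -> List[float]:
    """Toy stand-in: a model that spreads probability evenly over the vocab."""
    return [math.log(50_257)] * len(window)


print(sliding_window_ppl(list(range(3000)), uniform_nll))
```

With the toy uniform model the result equals the vocabulary size up to floating-point error, which is a quick sanity check that no token is double-counted or dropped.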

### Context: Pair 1 (450M) vs Pair 3 (7.5B)

| Model | Params | WikiText-2 PPL | LAMBADA Acc |
|-------|--------|---------------|-------------|
| Pair 1 Baseline (final) | 450M | 23.69 | 26.88% |
| **Pair 3 Baseline (step 10K)** | **7.5B** | **39.52** | **15.89%** |
| **Pair 3 Baseline (final)** | **7.5B** | **2,952,579** | **6.31%** |

The 7.5B model at step 10K underperforms the 450M model's final checkpoint on both benchmarks. This is expected. The 450M model completed a full 50B-token run at a much healthier tokens-per-parameter ratio (111 tok/param, versus 6.7 tok/param for the full 7.5B run), and at step 10K the 7.5B model has seen only 5.2B tokens (about 0.7 tok/param), far too few for a model this large to learn effectively. This is the pattern the Chinchilla scaling results predict: a smaller model with adequate data beats a larger model with insufficient data.
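
The tokens-per-parameter figures can be reproduced in a few lines. This sketch uses the rounded parameter and token counts quoted in this card, and it takes "Chinchilla optimal" to mean the common rule of thumb of about 20 tokens per parameter, which is an approximation rather than an exact constant.

```python
# Assumption: ~20 tokens per parameter as the Chinchilla rule of thumb.
CHINCHILLA_TOKENS_PER_PARAM = 20


def tokens_per_param(tokens: float, params: float) -> float:
    """Training tokens divided by model parameters."""
    return tokens / params


# (training tokens, parameters), rounded values from this card
runs = {
    "Pair 1 baseline, full run": (50e9, 450e6),
    "Pair 3 baseline, full run": (50e9, 7.5e9),
    "Pair 3 baseline, step 10K": (5.2e9, 7.5e9),
}

for name, (tokens, params) in runs.items():
    print(f"{name}: {tokens_per_param(tokens, params):.1f} tok/param")

chinchilla_tokens = CHINCHILLA_TOKENS_PER_PARAM * 7.5e9
print(f"Rule-of-thumb token budget for 7.5B params: {chinchilla_tokens / 1e9:.0f}B")
```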

## Architecture

Standard GPT-2-style Transformer with pre-norm (LayerNorm before attention and FFN).

| Parameter | Value |
|-----------|-------|
| Layers | 36 |
| Hidden dim | 4096 |
| Attention heads | 32 |
| Head dim | 128 |
| FFN inner dim | 16,384 (4x hidden) |
| Vocab size | 50,257 (GPT-2 tokenizer) |
| Context length | 1,024 |
| Total parameters | 7,458,103,296 |
| Weight tying | Yes (lm_head = tok_emb) |
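
The total parameter count can be reconstructed from the table. The sketch below assumes bias-free linear layers, scale-only LayerNorms, and learned absolute position embeddings. These details are inferred because they reproduce the stated total exactly, not read from the model code, so treat them as assumptions.

```python
N_LAYER = 36
D_MODEL = 4096
D_FFN = 16_384
VOCAB = 50_257
CONTEXT = 1_024


def count_params() -> int:
    """Parameter count under the assumptions stated above."""
    tok_emb = VOCAB * D_MODEL      # shared with lm_head (weight tying), counted once
    pos_emb = CONTEXT * D_MODEL    # assumed: learned absolute positions, GPT-2 style
    attn = 4 * D_MODEL * D_MODEL   # Q, K, V and output projections, no biases (assumed)
    ffn = 2 * D_MODEL * D_FFN      # up and down projections, no biases (assumed)
    norms = 2 * D_MODEL            # two LayerNorms per block, scale only (assumed)
    final_norm = D_MODEL           # final LayerNorm, scale only (assumed)
    return tok_emb + pos_emb + N_LAYER * (attn + ffn + norms) + final_norm


print(f"{count_params():,}")
```

About 97% of the parameters sit in the 36 Transformer blocks, and two thirds of each block is FFN weight, which is the part CoFrGeNet-F replaces.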

## Training Details

| Setting | Value |
|---------|-------|
| **Dataset** | FineWeb-Edu 50BT (educational web text) |
| **Tokenizer** | GPT-2 (tiktoken, `gpt2` encoding) |
| **Hardware** | 8x NVIDIA B200 (179 GB each) |
| **Parallelism** | FSDP FULL_SHARD |
| **Precision** | bfloat16 |
| **Optimizer** | AdamW (fused), beta1=0.9, beta2=0.95 |
| **Learning rate** | 3e-4 peak, cosine decay to 0 |
| **Warmup** | 2,000 steps |
| **Weight decay** | 0.1 (2D weight tensors only) |
| **Gradient clipping** | 1.0 max norm |
| **Batch size** | 524,288 tokens/update (micro_batch=64, no grad accumulation) |
| **Total steps** | 95,367 (1 epoch over 50B tokens) |
| **Throughput** | ~132,800 tok/s |
| **Wall time** | ~5.5 days |
| **torch.compile** | Disabled (dtype mismatch crash at 7B+ scale) |
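
The learning-rate settings above can be sketched as a single function of the step. The table does not specify the warmup shape or whether the cosine spans the whole run or only the post-warmup steps, so the linear warmup and post-warmup cosine used here are assumptions, and `lr_at` is a hypothetical name.

```python
import math

PEAK_LR = 3e-4
WARMUP_STEPS = 2_000
TOTAL_STEPS = 95_367


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step.

    Assumed shape: linear warmup from 0 to PEAK_LR, then cosine decay
    to 0 over the remaining steps.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)  # clamp in case step exceeds TOTAL_STEPS
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


for step in (0, 1_000, 2_000, 10_000, 50_000, 95_367):
    print(f"step {step:>6,}: lr = {lr_at(step):.2e}")
```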

### Validation Loss Trajectory

Validation loss bottomed out early in training, at 2.94 to 2.95 between steps 8,000 and 10,000. After roughly step 10K it rises monotonically while train loss keeps falling: classic overfitting from an overparameterized model on limited data.

| Step | Train Loss | Val Loss | Val PPL |
|------|-----------|----------|---------|
| 5,000 | ~3.0 | 3.01 | 20.3 |
| 8,000 | ~2.9 | 2.94 | 18.8 |
| **10,000** | **~2.9** | **2.95** | **19.0** |
| 20,000 | ~1.2 | 3.05 | 21.2 |
| 40,000 | ~0.4 | 3.33 | 27.8 |
| 60,000 | ~0.04 | 7.13 | 1,251 |
| 80,000 | ~0.01 | 11.60 | 109,013 |
| 95,367 | 0.008 | ~12.0 | ~163,000 |
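
The Val PPL column is the exponential of the Val Loss column, assuming the loss is mean per-token cross-entropy in nats. A quick check against the rows above (values copied from the table) shows agreement within about 1%, with the small gaps presumably coming from the loss being rounded to two decimals:

```python
import math

# (step, val loss, reported val PPL), copied from the table above
ROWS = [
    (5_000, 3.01, 20.3),
    (8_000, 2.94, 18.8),
    (10_000, 2.95, 19.0),
    (20_000, 3.05, 21.2),
    (40_000, 3.33, 27.8),
    (60_000, 7.13, 1_251),
    (80_000, 11.60, 109_013),
]


def perplexity(loss: float) -> float:
    """Perplexity for a mean per-token cross-entropy loss in nats."""
    return math.exp(loss)


for step, loss, reported in ROWS:
    computed = perplexity(loss)
    rel_err = abs(computed - reported) / reported
    print(f"step {step:>6,}: exp({loss}) = {computed:,.1f} "
          f"(reported {reported:,}, off by {rel_err:.2%})")
```

The same relation explains the "train PPL 1.0" figure for the final checkpoint: `perplexity(0.008)` is about 1.008.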

## The CoFrGeNet-F Experiment

This model is one half of **Pair 3** in a series of experiments testing IBM Research's CoFrGeNet-F architecture ([arXiv:2601.21766](https://arxiv.org/abs/2601.21766)). CoFrGeNet-F replaces standard FFN layers with continued fraction networks, a function approximator designed to reach the same expressiveness with fewer parameters.

### Experiment Design

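Each "pair" trains a standard Transformer baseline and a CoFrGeNet-F model on identical data with identical training hyperparameters. The intended difference is the FFN layer. In Pair 3 the CoFrGeNet-F model also uses a wider hidden dimension and more attention heads (4608d, 36h versus 4096d, 32h), as listed in the table below.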

| | Baseline (this model) | CoFrGeNet-F |
|-|----------------------|-------------|
| **Params** | 7.5B | ~4.8B (35% fewer) |
| **Architecture** | 36L, 4096d, 32h, standard FFN | 36L, 4608d, 36h, Cffn (L=3, d=5) |
| **Data** | 50B tokens FineWeb-Edu | 50B tokens FineWeb-Edu |
| **LR / Schedule** | 3e-4, cosine to 0 | 3e-4, cosine to 0 |
| **Batch size** | 524,288 tokens | 524,288 tokens |

The IBM paper showed CoFrGeNet-F's advantage only emerges at GPT-2 XL scale (~1.5B params). The Pair 3 baseline is about 5x that size. Results for the CoFrGeNet-F counterpart will be published at [`cahlen/pair3-cofrgenet-5b`](https://huggingface.co/cahlen/pair3-cofrgenet-5b) when training completes.

### Prior Pairs

| Pair | Baseline | CoFrGeNet-F | Result |
|------|----------|-------------|--------|
| Pair 1 | 450M, WikiText-2 PPL 23.69 | 410M, PPL 56.61 | Baseline wins |
| **Pair 3** | **7.5B (this model)** | **4.8B (training next)** | **Pending** |

## Usage

```python
from safetensors.torch import load_file
import torch

# Load the best-generalization checkpoint
state_dict = load_file("step_010000.safetensors")

# You'll need the model definition from the project repo
# git clone https://github.com/cahlen/cofrgenet-f
# Then:
from src.baseline.config import BaselineConfig
from src.baseline.model import BaselineTransformer

config = BaselineConfig(n_layer=36, n_embd=4096, n_head=32)
model = BaselineTransformer(config)
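# strict=False tolerates the tied lm_head/tok_emb weight being stored only once
result = model.load_state_dict(state_dict, strict=False)
# Sanity check: anything beyond the tied weight here points to a config mismatch
print("missing:", result.missing_keys, "unexpected:", result.unexpected_keys)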
model.eval()
```

## Project Links

- **GitHub:** [cahlen/cofrgenet-f](https://github.com/cahlen/cofrgenet-f)
- **HuggingFace (all models):** [cahlen/cofrgenet-f](https://huggingface.co/cahlen/cofrgenet-f)
- **CoFrGeNet-F paper:** [arXiv:2601.21766](https://arxiv.org/abs/2601.21766)
- **Project Wiki:** [GitHub Wiki](https://github.com/cahlen/cofrgenet-f/wiki)

## License

MIT