# PolyPythias

This model is part of the **PolyPythias** suite, an extension of the [Pythia](https://github.com/EleutherAI/pythia) project providing 45 additional training runs across 5 model sizes with 9 different random seeds each. These models enable systematic study of training stability and reproducibility in language models.

## Paper

**[PolyPythias: Stability and Outliers across Fifty Language Model Pre-Training Runs](https://arxiv.org/abs/2503.09543)**

Oskar van der Wal, Pietro Lesci, Max Muller-Eberstein, Naomi Saphra, Hailey Schoelkopf, Willem Zuidema, and Stella Biderman. *ICLR 2025*.

## Model Details

| Size | Parameters | Layers | Model Dim | Heads | Original Model |
|------|------------|--------|-----------|-------|----------------|
| 14M  | 14M        | 6      | 128       | 4     | [pythia-14m](https://huggingface.co/EleutherAI/pythia-14m) |
| 31M  | 31M        | 6      | 256       | 8     | [pythia-31m](https://huggingface.co/EleutherAI/pythia-31m) |
| 70M  | 70M        | 6      | 512       | 8     | [pythia-70m](https://huggingface.co/EleutherAI/pythia-70m) |
| 160M | 160M       | 12     | 768       | 12    | [pythia-160m](https://huggingface.co/EleutherAI/pythia-160m) |
| 410M | 410M       | 24     | 1024      | 16    | [pythia-410m](https://huggingface.co/EleutherAI/pythia-410m) |

All models were trained on 300B tokens from [The Pile](https://pile.eleuther.ai/).

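As a quick sanity check, the architecture numbers above can be read off a model's configuration. The snippet below is an illustrative sketch; the repository name and the expected values are just one row from the table:

```python
from transformers import AutoConfig

# Illustrative sanity check: compare a seed's config against the table above.
config = AutoConfig.from_pretrained("EleutherAI/pythia-70m-seed3")
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)
# For the 70M models this should print: 6 512 8
```
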
## Naming Convention

- **`pythia-{size}m`** - Original Pythia model (seed 1234)
- **`pythia-{size}m-seed{1-9}`** - PolyPythias variants with different random seeds
- **`pythia-160m-data-seed{1-3}`** - 160M models with only data ordering varied (weight init fixed)
- **`pythia-160m-weight-seed{1-3}`** - 160M models with only weight initialization varied (data order fixed)

The decoupled seed variants (data-seed and weight-seed) allow researchers to separately study the effects of data ordering vs. weight initialization.

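Putting the convention together, the 45 seeded repository names follow a simple pattern. The loop below is only an illustration of that pattern; it does not include the decoupled data-/weight-seed variants:

```python
# Enumerate the 45 seeded PolyPythias repositories (5 sizes x 9 seeds).
sizes = ["14m", "31m", "70m", "160m", "410m"]
repos = [f"EleutherAI/pythia-{size}-seed{seed}" for size in sizes for seed in range(1, 10)]
print(len(repos))   # 45
print(repos[0])     # EleutherAI/pythia-14m-seed1
```
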
## Quick Start

```python
from transformers import GPTNeoXForCausalLM, AutoTokenizer

# Load the final checkpoint
model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-70m-seed3")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m-seed3")

# Generate text
inputs = tokenizer("The quick brown fox", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```

## Available Checkpoints

Each model provides **154 intermediate checkpoints** saved as Git branches:

| Checkpoint | Training Tokens | Description |
|------------|-----------------|-------------|
| `step0` | 0 | Initialization (before training) |
| `step1`, `step2`, `step4`, ..., `step512` | 2M - 1B | 10 log-spaced early checkpoints |
| `step1000`, `step2000`, ..., `step143000` | 2B - 300B | 143 evenly-spaced checkpoints |

To load a specific checkpoint:

```python
model = GPTNeoXForCausalLM.from_pretrained(
    "EleutherAI/pythia-70m-seed3",
    revision="step50000",  # Any checkpoint step
)
```

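To enumerate the checkpoint branches of a repository programmatically, one option is `huggingface_hub`'s `list_repo_refs`. This is a sketch, not part of the official instructions; the repository name is just an example:

```python
from huggingface_hub import list_repo_refs

# List all checkpoint branches (step0, step1, ..., step143000) for one model.
refs = list_repo_refs("EleutherAI/pythia-70m-seed3")
steps = sorted(b.name for b in refs.branches if b.name.startswith("step"))
print(len(steps), steps[:5])
```
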
## Training Data

All models were trained on The Pile using pre-shuffled data orderings. The shuffled index files for each seed are available at:

**[EleutherAI/pile-preshuffled-seeds](https://huggingface.co/datasets/EleutherAI/pile-preshuffled-seeds)**

This dataset contains `.idx` files for seeds 0-9, which are used with `MMapIndexedDataset` to load the memory-mapped Pile data in the correct order for each seed.

### Reproducing Training Data Order

To reproduce the exact data ordering used for a specific seed:

1. Download the Pile dataset and tokenize it using the Pythia tokenizer
2. Download the corresponding seed folder from `pile-preshuffled-seeds`:

   ```python
   # Using huggingface_hub (Python), not a shell command
   from huggingface_hub import snapshot_download

   snapshot_download(
       repo_id="EleutherAI/pile-preshuffled-seeds",
       repo_type="dataset",
       allow_patterns="seed3/*",  # Download only seed3
       local_dir="./pile-seeds",
   )
   ```
3. Use the `.idx` files with GPT-NeoX's `MMapIndexedDataset`:

   ```python
   # The exact import path depends on the repository layout; in GPT-NeoX the
   # class lives in megatron/data/indexed_dataset.py.
   from megatron.data.indexed_dataset import MMapIndexedDataset

   # path_prefix is the shared prefix of the tokenized .bin file and the
   # downloaded seed's .idx file (without the extension).
   dataset = MMapIndexedDataset(path_prefix, skip_warmup=True)
   ```

For complete training reproduction instructions, see the [Pythia GitHub repository](https://github.com/EleutherAI/pythia).

## All PolyPythias Models

The complete collection is available at: [EleutherAI/polypythias](https://huggingface.co/collections/EleutherAI/polypythias)

### 14M Parameter Models
- [pythia-14m-seed1](https://huggingface.co/EleutherAI/pythia-14m-seed1) through [pythia-14m-seed9](https://huggingface.co/EleutherAI/pythia-14m-seed9)

### 31M Parameter Models
- [pythia-31m-seed1](https://huggingface.co/EleutherAI/pythia-31m-seed1) through [pythia-31m-seed9](https://huggingface.co/EleutherAI/pythia-31m-seed9)

### 70M Parameter Models
- [pythia-70m-seed1](https://huggingface.co/EleutherAI/pythia-70m-seed1) through [pythia-70m-seed9](https://huggingface.co/EleutherAI/pythia-70m-seed9)

### 160M Parameter Models
- [pythia-160m-seed1](https://huggingface.co/EleutherAI/pythia-160m-seed1) through [pythia-160m-seed9](https://huggingface.co/EleutherAI/pythia-160m-seed9)
- [pythia-160m-data-seed1](https://huggingface.co/EleutherAI/pythia-160m-data-seed1) through [pythia-160m-data-seed3](https://huggingface.co/EleutherAI/pythia-160m-data-seed3)
- [pythia-160m-weight-seed1](https://huggingface.co/EleutherAI/pythia-160m-weight-seed1) through [pythia-160m-weight-seed3](https://huggingface.co/EleutherAI/pythia-160m-weight-seed3)

### 410M Parameter Models
- [pythia-410m-seed1](https://huggingface.co/EleutherAI/pythia-410m-seed1) through [pythia-410m-seed9](https://huggingface.co/EleutherAI/pythia-410m-seed9)

## Evaluation Results

Evaluation results for all models are available in the [polypythias-evals](https://huggingface.co/datasets/EleutherAI/polypythias-evals) dataset.

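The internal layout of that dataset is not documented here. As a minimal, illustrative way to fetch the files for local inspection (assuming only that the repository id above is correct), you could snapshot it with `huggingface_hub`:

```python
from huggingface_hub import snapshot_download

# Download the evaluation files for local inspection.
local_path = snapshot_download(
    repo_id="EleutherAI/polypythias-evals",
    repo_type="dataset",
)
print(local_path)
```
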
## Limitations

These models are released for research purposes only. They are **not** intended for deployment in production systems.

- **Not instruction-tuned**: These are base language models that predict the next token; they will not follow instructions the way chat assistants such as ChatGPT do
- **May generate harmful content**: The Pile contains diverse internet text that includes biased, offensive, and factually incorrect content
- **English only**: Models were trained primarily on English text
- **No safety filtering**: Outputs are not filtered for safety or accuracy

## License

Apache 2.0

## Contact

For questions about these models, please use:
- [EleutherAI Discord](https://discord.gg/eleutherai) - #release-discussion channel
- [GitHub Issues](https://github.com/EleutherAI/pythia/issues)

## Citation

If you use these models, please cite:

```bibtex
@inproceedings{vanderwal2025polypythias,
  title={PolyPythias: Stability and Outliers across Fifty Language Model Pre-Training Runs},
  author={van der Wal, Oskar and Lesci, Pietro and Muller-Eberstein, Max and Saphra, Naomi and Schoelkopf, Hailey and Zuidema, Willem and Biderman, Stella},
  booktitle={International Conference on Learning Representations},
  year={2025},
  url={https://arxiv.org/abs/2503.09543}
}
```