---
language:
- en
library_name: nanogpt
license: mit
tags:
- chess
- game-playing
- transformer
- gpt-2
- nanogpt
- strategic-reasoning
datasets:
- adamkarvonen/chess_games
model-index:
- name: ChessGPT-2
results:
- task:
type: text-generation
name: Chess Move Prediction
dataset:
type: adamkarvonen/chess_games
name: Chess Games Dataset
metrics:
- type: validation_loss
value: 0.2578
name: Best Validation Loss (large-16)
---
# ChessGPT-2
## Model Description
ChessGPT-2 is a series of transformer language models specifically trained on chess game data, demonstrating that language models can learn complex strategic reasoning through chess gameplay. This repository presents **large-16 (200M parameters)** as our best model.
The large-16 model is a 200-million-parameter GPT-2-style model trained on engine-generated chess games, capable of high-quality move prediction, strategic analysis, and chess reasoning.
## Model Details
### large-16 (Primary Model)
- **Model Type**: Autoregressive Transformer Language Model (GPT-2 architecture)
- **Parameters**: ~200 million
- **Architecture**:
- Layers: 16
- Attention Heads: 16
- Embedding Dimension: 1024
- Context Length: 1023 tokens
- Vocabulary Size: 32 tokens (chess-specific vocabulary)
- **Training Framework**: NanoGPT (PyTorch)
- **Precision**: Mixed precision training (bfloat16/float16)
## Training Data
All models were trained on datasets from [@adamkarvonen/chess_games](https://huggingface.co/datasets/adamkarvonen/chess_games):
### Primary Dataset: Stockfish Games
- **Dataset**: `stockfish_dataset_blocks.zip`
- **Description**: 4.5GB of games generated with Stockfish at ELO 3200 playing White against Stockfish opponents rated ELO 1300-3200 as Black
- **Format**: PGN (Portable Game Notation) games converted into 1024-character blocks
- **Tokenization**: Each block begins with a ";" delimiter (e.g., ";1.e4 e5 2.Nf3...")
- **Data Split**: 99% training, 1% validation (random split with seed 2357)
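The block format above lends itself to a very small tokenizer. A 32-token vocabulary suggests character-level encoding; the sketch below assumes that (the actual character set and `stoi`/`itos` construction in the repo may differ):

```python
# Hypothetical character-level tokenizer for ";"-delimited game blocks.
# The real 32-character vocabulary is fixed by the training corpus; here we
# derive one from a sample block purely for illustration.
def build_vocab(corpus: str):
    chars = sorted(set(corpus))
    stoi = {ch: i for i, ch in enumerate(chars)}   # char -> token id
    itos = {i: ch for ch, i in stoi.items()}       # token id -> char
    return stoi, itos

def encode(text: str, stoi) -> list:
    return [stoi[c] for c in text]

def decode(ids, itos) -> str:
    return "".join(itos[i] for i in ids)

block = ";1.e4 e5 2.Nf3 Nc6 3.Bb5 a6"
stoi, itos = build_vocab(block)
assert decode(encode(block, stoi), itos) == block  # lossless round-trip
```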
## Training Configuration
### large-16 Training Settings
- **Batch Size**: 32 (micro-batch)
- **Gradient Accumulation**: 4 steps (effective batch size: 128)
- **Learning Rate**: 3e-4 with cosine decay to 3e-5
- **Warmup**: 2000 iterations
- **Max Iterations**: 600,000
- **Optimizer**: AdamW (β₁=0.9, β₂=0.95)
- **Dropout**: 0.0 (no dropout for pretraining)
- **Training Hardware**: RTX 3090/4090 GPUs with distributed training support
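The learning-rate settings describe a warmup-plus-cosine schedule; a nanoGPT-style `get_lr` helper sketching it is below (treating the 600,000 max iterations as the decay horizon is an assumption):

```python
import math

max_lr, min_lr = 3e-4, 3e-5          # peak LR and cosine floor from the settings above
warmup_iters, decay_iters = 2000, 600_000

def get_lr(it: int) -> float:
    if it < warmup_iters:             # linear warmup to the peak LR
        return max_lr * (it + 1) / warmup_iters
    if it > decay_iters:              # past the horizon: hold the floor
        return min_lr
    ratio = (it - warmup_iters) / (decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))  # decays 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)
```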
## Usage
### Loading the Model
```python
import torch

from model import GPT, GPTConfig  # nanoGPT model definition

# Load large-16 configuration
config = GPTConfig(
    block_size=1023,
    n_layer=16,
    n_head=16,
    n_embd=1024,
    dropout=0.0,
    bias=False,
    vocab_size=32,
)

# Initialize the model and load the trained weights
model = GPT(config)
checkpoint = torch.load('ckpt.pt', map_location='cpu')
model.load_state_dict(checkpoint['model'])
model.eval()

# Move to GPU for inference (recommended)
if torch.cuda.is_available():
    model = model.cuda()

# Generate chess moves (requires proper tokenization)
prompt = ";1.d4 Nf6 2.c4 e6 3.Nc3 Bb4"
# ... tokenization and generation code ...
```
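The elided generation step is, at its core, an autoregressive sampling loop. The helper below is a dependency-free sketch of the sampling primitive (`sample_next` is a hypothetical name, not part of nanoGPT; in practice the logits come from the model's forward pass, and nanoGPT's `model.generate` wraps this loop):

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None):
    """Sample one token id from raw logits (hypothetical helper)."""
    if top_k is not None:
        # keep only the top_k highest logits
        cutoff = sorted(logits, reverse=True)[top_k - 1]
        logits = [l if l >= cutoff else float("-inf") for l in logits]
    m = max(logits)
    exps = [math.exp((l - m) / temperature) for l in logits]  # stable softmax
    r = random.random() * sum(exps)
    acc = 0.0
    for i, e in enumerate(exps):
        acc += e
        if r < acc:
            return i
    return len(exps) - 1

# top_k=1 degenerates to greedy decoding: always the argmax token
assert sample_next([0.2, 3.1, -1.0], top_k=1) == 1
```

With the real model, each step appends the sampled id to the context and re-runs the forward pass until the desired number of moves has been produced.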
### Input Format
All models expect properly tokenized chess games:
- Must start with ";" delimiter
- Standard PGN algebraic notation
- 1024-character blocks for optimal performance
## Performance Characteristics
The large-16 model demonstrates:
- **Chess Reasoning**: Strong grasp of tactical and strategic patterns
- **Long-Term Planning**: Coherent multi-move plans across full games
- **Pattern Recognition**: Generalization across diverse chess positions
- **Substantial Scale**: 202.5M parameters in a 2.3GB checkpoint
- **Largest Architecture in the Series**: 16 layers, 16 heads, 1024 embedding dimension
- **Best Validation Loss**: 0.2578, the lowest across all seven variants
## Model Series & Ablation Studies
This repository represents extensive research into scaling transformer models for chess. Our complete series includes:
### Parameter Scaling Ablations
| Model Variant | Parameters | Layers | Heads | Embedding | Model Size | Val Loss | Key Characteristics |
|---------------|------------|--------|-------|-----------|------------|----------|-------------------|
| **small-8** | 25.7M | 8 | 8 | 512 | 294MB | 0.2944 | Compact baseline |
| **small-16** | 50.9M | 16 | 8 | 512 | 582MB | 0.2725 | Depth scaling study |
| **small-24** | 76.1M | 24 | 8 | 512 | 871MB | 0.2628 | Deep narrow model |
| **small-36** | 113.8M | 36 | 8 | 512 | 1.3GB | 0.2583 | Maximum depth |
| **medium-12** | 85.8M | 12 | 12 | 768 | 982MB | 0.2652 | Balanced medium |
| **medium-16** | 114.1M | 16 | 12 | 768 | 1.3GB | 0.2608 | Deeper medium |
| **large-16** | 202.5M | 16 | 16 | 1024 | 2.3GB | 0.2578 | **Primary model** |
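The parameter counts in the table can be sanity-checked with the standard GPT-2 estimate (a back-of-the-envelope formula assuming tied embeddings and no bias terms, not code from the repo):

```python
# ~12 * n_layer * n_embd^2 covers the attention + MLP weights per block;
# (vocab_size + block_size) * n_embd covers token + position embeddings.
def approx_params(n_layer: int, n_embd: int,
                  vocab_size: int = 32, block_size: int = 1023) -> int:
    return 12 * n_layer * n_embd**2 + (vocab_size + block_size) * n_embd

for name, (layers, embd) in {
    "small-8": (8, 512), "small-36": (36, 512),
    "medium-16": (16, 768), "large-16": (16, 1024),
}.items():
    print(f"{name}: {approx_params(layers, embd) / 1e6:.1f}M")
```

The results land within rounding distance of the table (layer norms and other minor terms account for the small residual on large-16).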
### Dataset Comparison Studies
| Model | Dataset | Source | Size | Characteristics |
|-------|---------|--------|------|-----------------|
| **All Stockfish Models** | Stockfish | Engine games | 4.5GB | Optimal play patterns |
| **Lichess Model** | Lichess | Human games | 6GB | Human decision patterns |
### Key Research Findings
1. **Depth vs Width Trade-offs**: Small models (512 emb, 8 heads) scale from 25.7M→113.8M parameters purely through depth (8→36 layers)
2. **Clear Performance Scaling**: Validation loss improves consistently with depth: 0.2944 (8-layer) → 0.2583 (36-layer)
3. **Architecture Variations**: Medium models explore width scaling (768 emb, 12 heads) vs small models' depth scaling
4. **Parameter Efficiency**: small-36 (113.8M) achieves similar parameter count to medium-16 (114.1M) via different architectures
5. **No Overfitting**: All models trained to 600k iterations show continued learning potential
6. **Dataset Impact**: Significant behavioral differences between engine vs. human training data
## Evaluation Metrics
Models should be evaluated on:
- **Move Legality**: Percentage of generated moves that are legal
- **Game Continuation**: Quality and coherence of extended game sequences
- **Tactical Recognition**: Ability to identify tactical patterns and combinations
- **Strategic Understanding**: Long-term positional planning and evaluation
- **Opening Knowledge**: Familiarity with established opening theory
- **Endgame Technique**: Performance in simplified positions
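The move-legality metric can be approximated cheaply at the syntax level. The sketch below scores how many tokens in a generated block are well-formed SAN (`san_wellformed_rate` is a hypothetical helper; full legality checking additionally requires a rules engine such as the third-party python-chess package):

```python
import re

# Matches castling or a SAN move such as e4, Nf3, exd5, Qxd5+, e8=Q#
SAN = re.compile(r"^(O-O(-O)?|[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](=[QRBN])?)[+#]?$")

def san_wellformed_rate(game: str) -> float:
    """Fraction of move tokens that are syntactically valid SAN."""
    text = re.sub(r"\d+\.", " ", game.lstrip(";"))  # strip "1." move numbers
    moves = text.split()
    if not moves:
        return 0.0
    return sum(bool(SAN.match(m)) for m in moves) / len(moves)

assert san_wellformed_rate(";1.e4 e5 2.Nf3 Nc6") == 1.0
assert san_wellformed_rate(";1.e4 zz9") == 0.5
```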
## Intended Use
### Primary Use Cases
- **Chess Analysis**: High-quality position evaluation and move suggestion
- **Research**: Studying emergent reasoning in language models
- **Education**: Chess learning and pattern recognition tools
- **AI Development**: Baseline for chess AI systems
### Limitations
- Specialized for chess gameplay only
- Limited to standard chess rules and notation
- Requires proper tokenization format
- GPU recommended for practical inference
- May not generalize beyond chess domain
## Alternative Model Variants
### For Different Use Cases:
- **Fast Inference**: Use **small-8** for minimal resource requirements
- **Depth vs Width**: Compare **small-16/24/36** for layer depth ablations
- **Balanced Performance**: Use **medium-12** or **medium-16** for mid-range applications
- **Maximum Performance**: Use **large-16** for best overall results
- **Human Behavior Studies**: Use **lichess** model for human-like gameplay patterns
### Computational Requirements:
- **Small Models (8-36 layers)**: CPU inference possible, GPU recommended
- **Medium Models**: GPU recommended for practical use
- **Large Model**: Single high-end GPU required
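The checkpoint sizes quoted throughout are consistent with float32 weights plus AdamW's two moment buffers (~3 tensors per parameter at 4 bytes each). This is an inference from the numbers, not a statement about the repo's actual checkpoint format:

```python
def ckpt_size_gb(params_millions: float, tensors_per_param: int = 3) -> float:
    # 4 bytes per float32 value; weights + AdamW exp_avg + exp_avg_sq
    return params_millions * 1e6 * 4 * tensors_per_param / 1024**3

print(f"{ckpt_size_gb(202.5):.1f} GB")  # ~2.3 GB, matching the large-16 entry
```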
## Technical Implementation
### Model Architecture
Based on GPT-2 with chess-specific adaptations:
- **Vocabulary**: Reduced to 32 chess-specific tokens
- **Context**: Optimized for 1023-token chess game sequences
- **Training**: Custom data loading for chess game blocks
- **Framework**: Built on NanoGPT for simplicity and efficiency
### Training Insights
- **Convergence**: Smooth training curves across all scales
- **Memory Efficiency**: Optimized for multi-GPU training
- **Data Processing**: Custom tokenization preserving chess structure
- **Evaluation**: Chess-specific validation metrics
## Ethical Considerations
- Models trained exclusively on chess data pose minimal ethical risks
- No personal data or sensitive information in training datasets
- Intended for educational, research, and recreational purposes
- Computational requirements may limit accessibility
- Models do not generalize beyond chess domain
## Citation
If you use ChessGPT in your research, please cite:
```bibtex
@misc{chessgpt,
title={ChessGPT-2},
author={[Your Name]},
year={2024},
howpublished={Hugging Face Model Hub},
url={https://huggingface.co/[your-username]/chessgpt-2}
}
@dataset{chess_games_dataset,
title={Chess Games Dataset},
author={Adam Karvonen},
year={2024},
url={https://huggingface.co/datasets/adamkarvonen/chess_games}
}
```
## References
- **NanoGPT**: [karpathy/nanoGPT](https://github.com/karpathy/nanoGPT)
- **Chess Dataset**: [@adamkarvonen/chess_games](https://huggingface.co/datasets/adamkarvonen/chess_games)
- **GPT-2 Paper**: [Radford et al., 2019](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
- **Scaling Laws**: [Kaplan et al., 2020](https://arxiv.org/abs/2001.08361)
## License
MIT
---