| | --- |
| | language: |
| | - en |
| | library_name: nanogpt |
| | license: mit |
| | tags: |
| | - chess |
| | - game-playing |
| | - transformer |
| | - gpt-2 |
| | - nanogpt |
| | - strategic-reasoning |
| | datasets: |
| | - adamkarvonen/chess_games |
| | model-index: |
| | - name: ChessGPT-2 |
| |   results: |
| |   - task: |
| |       type: text-generation |
| |       name: Chess Move Prediction |
| |     dataset: |
| |       type: adamkarvonen/chess_games |
| |       name: Chess Games Dataset |
| |     metrics: |
| |     - type: validation_loss |
| |       value: 0.2578 |
| |       name: Best Validation Loss (large-16) |
| | --- |
| | |
| | # ChessGPT-2 |
| |
|
| | ## Model Description |
| |
|
| | ChessGPT-2 is a series of transformer language models trained only on chess game transcripts, showing that a language model can learn strategic chess play from next-token prediction on game data. This repository presents **large-16 (~200M parameters)**, the model with the lowest validation loss in the series. |
| |
|
| | The large-16 model is a GPT-2-style network with about 200 million parameters, trained on engine-generated chess games for move prediction, strategic analysis, and chess reasoning. |
| |
|
| | ## Model Details |
| |
|
| | ### large-16 (Primary Model) |
| | - **Model Type**: Autoregressive Transformer Language Model (GPT-2 architecture) |
| | - **Parameters**: ~200 million |
| | - **Architecture**: |
| |   - Layers: 16 |
| |   - Attention Heads: 16 |
| |   - Embedding Dimension: 1024 |
| |   - Context Length: 1023 tokens |
| |   - Vocabulary Size: 32 tokens (chess-specific vocabulary) |
| | - **Training Framework**: NanoGPT (PyTorch) |
| | - **Precision**: Mixed precision training (bfloat16/float16) |
| |
|
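The parameter count can be checked by hand from the architecture listed above. The sketch below assumes the standard GPT-2 block layout used by nanoGPT (attention plus MLP weights totalling 12·n_embd² per layer), `bias=False`, and an output head that shares its weights with the token embedding. The function name `count_params` is illustrative and not part of nanoGPT.

```python
def count_params(n_layer, n_embd, vocab_size=32, block_size=1023):
    """Approximate parameter count for a bias-free GPT-2 style model."""
    per_layer = 12 * n_embd**2        # attention (4*d^2) + MLP (8*d^2) weights
    per_layer += 2 * n_embd           # two LayerNorm weight vectors per block
    token_emb = vocab_size * n_embd   # shared with the output head (weight tying)
    pos_emb = block_size * n_embd     # learned position embeddings
    final_ln = n_embd                 # final LayerNorm weight
    return n_layer * per_layer + token_emb + pos_emb + final_ln


print(f"large-16: {count_params(16, 1024) / 1e6:.1f}M")
print(f"small-8:  {count_params(8, 512) / 1e6:.1f}M")
```

Under these assumptions the estimate for large-16 comes out at roughly 202.4M, within 0.1M of the 202.5M reported in this card, and the small-8 estimate rounds to the reported 25.7M.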
| | ## Training Data |
| |
|
| | All models were trained on datasets from [@adamkarvonen/chess_games](https://huggingface.co/datasets/adamkarvonen/chess_games): |
| |
|
| | ### Primary Dataset: Stockfish Games |
| | - **Dataset**: `stockfish_dataset_blocks.zip` |
| | - **Description**: 4.5GB of games in which Stockfish at Elo 3200 plays White against Stockfish at Elo levels from 1300 to 3200 playing Black |
| | - **Format**: PGN (Portable Game Notation) game transcripts split into 1024-character blocks |
| | - **Tokenization**: Each block begins with a ";" delimiter (e.g., ";1.e4 e5 2.Nf3...") |
| | - **Data Split**: 99% training, 1% validation (random split with seed 2357) |
| |
|
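One way to read the numbers above is that each 1024-character block yields 1023 input tokens and 1023 shifted targets, which would explain the 1023-token context length. The sketch below illustrates that reading together with a seeded 99%/1% split. It assumes one token per character, and `split_blocks` and `to_training_pair` are hypothetical helpers; the original pipeline may shuffle differently, so the same seed will not reproduce the exact same split.

```python
import random


def split_blocks(blocks, val_fraction=0.01, seed=2357):
    """Shuffle blocks with a fixed seed and hold out a validation fraction."""
    rng = random.Random(seed)
    shuffled = list(blocks)
    rng.shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def to_training_pair(token_ids):
    """Next-token prediction: inputs drop the last token, targets drop the first."""
    return token_ids[:-1], token_ids[1:]


# Toy data: 200 fake blocks of 1024 characters, one token per character.
blocks = [";" + "x" * 1023 for _ in range(200)]
train, val = split_blocks(blocks)
x, y = to_training_pair([ord(c) for c in blocks[0]])
print(len(train), len(val), len(x), len(y))  # 198 2 1023 1023
```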
| | ## Training Configuration |
| |
|
| | ### large-16 Training Settings |
| | - **Batch Size**: 32 (micro-batch) |
| | - **Gradient Accumulation**: 4 steps (effective batch size: 128) |
| | - **Learning Rate**: 3e-4 with cosine decay to 3e-5 |
| | - **Warmup**: 2000 iterations |
| | - **Max Iterations**: 600,000 |
| | - **Optimizer**: AdamW (β₁=0.9, β₂=0.95) |
| | - **Dropout**: 0.0 (no dropout for pretraining) |
| | - **Training Hardware**: RTX 3090/4090 GPUs with distributed training support |
| |
|
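The learning-rate settings above describe a linear warmup followed by cosine decay. The following is a minimal sketch of such a schedule using the values from this card. It assumes the decay ends at the final iteration (600,000), which the card does not state explicitly, and the exact warmup formula in the training script may differ slightly.

```python
import math

LEARNING_RATE = 3e-4
MIN_LR = 3e-5
WARMUP_ITERS = 2000
MAX_ITERS = 600_000  # assumed to also be the end of the cosine decay


def get_lr(it):
    """Linear warmup to LEARNING_RATE, then cosine decay down to MIN_LR."""
    if it < WARMUP_ITERS:
        return LEARNING_RATE * (it + 1) / (WARMUP_ITERS + 1)
    if it >= MAX_ITERS:
        return MIN_LR
    progress = (it - WARMUP_ITERS) / (MAX_ITERS - WARMUP_ITERS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return MIN_LR + coeff * (LEARNING_RATE - MIN_LR)


for it in (0, 2000, 301_000, 600_000):
    print(it, f"{get_lr(it):.2e}")

# Tokens processed per optimizer step at the effective batch size of 128,
# assuming every sequence fills the 1023-token context.
tokens_per_iter = 32 * 4 * 1023
print(tokens_per_iter)  # 130944
```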
| | ## Usage |
| |
|
| | ### Loading the Model |
| |
|
| | ```python |
| | import torch |
| | from model import GPT, GPTConfig |
| | |
| | # Load large-16 configuration |
| | config = GPTConfig( |
| |     block_size=1023, |
| |     n_layer=16, |
| |     n_head=16, |
| |     n_embd=1024, |
| |     dropout=0.0, |
| |     bias=False, |
| |     vocab_size=32, |
| | ) |
| | |
| | # Initialize and load model |
| | model = GPT(config) |
| | checkpoint = torch.load('ckpt.pt', map_location='cpu') |
| | state_dict = checkpoint['model'] |
| | # Checkpoints saved from a torch.compile'd model prefix every key with "_orig_mod." |
| | state_dict = {k.removeprefix('_orig_mod.'): v for k, v in state_dict.items()} |
| | model.load_state_dict(state_dict) |
| | model.eval() |
| | |
| | # For GPU inference (recommended) |
| | if torch.cuda.is_available(): |
| |     model = model.cuda() |
| | |
| | # Generate chess moves (requires proper tokenization) |
| | prompt = ";1.d4 Nf6 2.c4 e6 3.Nc3 Bb4" |
| | # ... tokenization and generation code ... |
| | ``` |
| |
|
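The tokenization step left open in the snippet above could look like the following sketch. It assumes a character-level tokenizer and uses a hypothetical 32-character vocabulary chosen to cover PGN movetext. The real character-to-id mapping should be taken from the dataset's own metadata (nanoGPT-style character datasets typically store it in a `meta.pkl` file) rather than from this list.

```python
# Hypothetical 32-character vocabulary covering PGN movetext; the ids the
# model was actually trained with must come from the dataset's metadata.
CHARS = sorted(" #+-.0123456789;=BKNOQRabcdefghx")
assert len(CHARS) == 32

STOI = {ch: i for i, ch in enumerate(CHARS)}
ITOS = {i: ch for ch, i in STOI.items()}


def encode(text):
    """Map each character to its integer id (raises KeyError on unknown characters)."""
    return [STOI[ch] for ch in text]


def decode(ids):
    """Map integer ids back to a string."""
    return "".join(ITOS[i] for i in ids)


prompt = ";1.d4 Nf6 2.c4 e6 3.Nc3 Bb4"
ids = encode(prompt)
print(len(ids), ids[:4])
assert decode(ids) == prompt
```

With PyTorch available, `ids` can be wrapped in a tensor of shape `(1, T)` and passed to nanoGPT's `model.generate(idx, max_new_tokens, temperature=..., top_k=...)`; the returned ids are then turned back into moves with `decode`.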
| | ### Input Format |
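| | All models expect chess games in the same format as the training data: |
| | - Must start with the ";" delimiter |
| | - Moves in standard algebraic notation, as used in PGN |
| | - Blocks of up to 1024 characters, matching the training format (the model's context holds 1023 tokens) |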
| |
|
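The format rules above can be checked before a prompt is sent to the model. The sketch below is one possible pre-flight check. The allowed character set is an assumed 32-character vocabulary, not one confirmed by this card, and `check_prompt` is a hypothetical helper.

```python
ALLOWED = set(" #+-.0123456789;=BKNOQRabcdefghx")  # assumed vocabulary
MAX_CONTEXT = 1023


def check_prompt(prompt):
    """Return a list of problems; an empty list means the prompt looks usable."""
    problems = []
    if not prompt.startswith(";"):
        problems.append('missing leading ";" delimiter')
    unknown = sorted(set(prompt) - ALLOWED)
    if unknown:
        problems.append(f"characters outside the vocabulary: {unknown}")
    if len(prompt) > MAX_CONTEXT:
        problems.append(
            f"prompt has {len(prompt)} characters, context holds {MAX_CONTEXT}"
        )
    return problems


print(check_prompt(";1.d4 Nf6 2.c4 e6"))  # []
print(check_prompt("1. d4 Nf6!"))
```

Annotation symbols such as "!" and comments are common in PGN files but fall outside this assumed vocabulary, so they would need to be stripped first.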
| | ## Performance Characteristics |
| |
|
| | The large-16 model, which has the lowest validation loss in the series (0.2578), demonstrates: |
| | - **Chess Reasoning**: Understanding of tactical and strategic patterns |
| | - **Planning**: Long-term game planning capabilities |
| | - **Pattern Recognition**: Recognition of patterns across diverse chess positions |
| | - **Scale**: 202.5M parameters in a 2.3GB checkpoint |
| | - **Architecture**: 16 layers, 16 heads, 1024 embedding dimension |
| | - **Playing Strength**: Potential for expert-level chess understanding; playing strength is not benchmarked in this card |
| |
|
| | ## Model Series & Ablation Studies |
| |
|
| | This repository represents extensive research into scaling transformer models for chess. Our complete series includes: |
| |
|
| | ### Parameter Scaling Ablations |
| |
|
| | | Model Variant | Parameters | Layers | Heads | Embedding | Model Size | Val Loss | Key Characteristics | |
| | |---------------|------------|--------|-------|-----------|------------|----------|-------------------| |
| | | **small-8** | 25.7M | 8 | 8 | 512 | 294MB | 0.2944 | Compact baseline | |
| | | **small-16** | 50.9M | 16 | 8 | 512 | 582MB | 0.2725 | Depth scaling study | |
| | | **small-24** | 76.1M | 24 | 8 | 512 | 871MB | 0.2628 | Deep narrow model | |
| | | **small-36** | 113.8M | 36 | 8 | 512 | 1.3GB | 0.2583 | Maximum depth | |
| | | **medium-12** | 85.8M | 12 | 12 | 768 | 982MB | 0.2652 | Balanced medium | |
| | | **medium-16** | 114.1M | 16 | 12 | 768 | 1.3GB | 0.2608 | Deeper medium | |
| | | **large-16** | 202.5M | 16 | 16 | 1024 | 2.3GB | 0.2578 | **Primary model** | |
| |
|
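The model sizes in the table are roughly three times what the weights alone would occupy in fp32 (4 bytes per parameter). One explanation, offered here as an assumption rather than something this card states, is that the files are nanoGPT training checkpoints, which store the optimizer state alongside the weights. With AdamW's two moment buffers that comes to about 12 bytes per parameter, as the sketch below works out.

```python
BYTES_PER_PARAM = 4 * 3  # fp32 weight + two fp32 AdamW moment buffers (assumed)

reported = {  # model: (parameters in millions, size string from the table)
    "small-8": (25.7, "294MB"),
    "medium-12": (85.8, "982MB"),
    "large-16": (202.5, "2.3GB"),
}


def estimate_mib(params_millions):
    """Estimated checkpoint size in MiB under the 12-bytes-per-parameter assumption."""
    return params_millions * 1e6 * BYTES_PER_PARAM / 2**20


for name, (params, size) in reported.items():
    print(f"{name}: ~{estimate_mib(params):.0f} MiB estimated, {size} reported")
```

If this reading is right, a weights-only export of large-16 would need about 0.8GB in fp32.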
| | ### Dataset Comparison Studies |
| |
|
| | | Model | Dataset | Source | Size | Characteristics | |
| | |-------|---------|--------|------|-----------------| |
| | | **All Stockfish Models** | Stockfish | Engine games | 4.5GB | Optimal play patterns | |
| | | **Lichess Model** | Lichess | Human games | 6GB | Human decision patterns | |
| |
|
| | ### Key Research Findings |
| |
|
| | 1. **Depth Scaling**: The small models (512 emb, 8 heads) scale from 25.7M to 113.8M parameters purely through depth (8→36 layers) |
| | 2. **Clear Performance Scaling**: Validation loss improves consistently with depth: 0.2944 (8 layers) → 0.2725 (16) → 0.2628 (24) → 0.2583 (36) |
| | 3. **Architecture Variations**: The medium models (768 emb, 12 heads) add width, in contrast to the small models' depth-only scaling |
| | 4. **Parameter Efficiency**: small-36 (113.8M) and medium-16 (114.1M) have nearly equal parameter counts but different shapes; the deeper small-36 reaches the lower validation loss (0.2583 vs 0.2608) |
| | 5. **No Overfitting**: All models were trained for 600k iterations and show continued learning potential at the end of training |
| | 6. **Dataset Impact**: Significant behavioral differences between engine vs. human training data |
| |
|
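The depth-scaling numbers reported for the small models can be turned into a rough measure of diminishing returns. The sketch below uses only the layer counts and validation losses from the ablation table; the helper name `loss_drop_per_layer` is illustrative.

```python
small = [  # (layers, validation loss) for small-8, small-16, small-24, small-36
    (8, 0.2944),
    (16, 0.2725),
    (24, 0.2628),
    (36, 0.2583),
]


def loss_drop_per_layer(rows):
    """Validation-loss reduction per added layer between consecutive models."""
    drops = []
    for (l0, v0), (l1, v1) in zip(rows, rows[1:]):
        drops.append((l0, l1, (v0 - v1) / (l1 - l0)))
    return drops


for l0, l1, drop in loss_drop_per_layer(small):
    print(f"{l0:>2} -> {l1:>2} layers: {drop:.5f} loss per added layer")
```

Each step buys less than the one before it, so the loss still falls with depth but at a shrinking rate per layer.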
| | ## Evaluation Metrics |
| |
|
| | Models should be evaluated on: |
| | - **Move Legality**: Percentage of generated moves that are legal |
| | - **Game Continuation**: Quality and coherence of extended game sequences |
| | - **Tactical Recognition**: Ability to identify tactical patterns and combinations |
| | - **Strategic Understanding**: Long-term positional planning and evaluation |
| | - **Opening Knowledge**: Familiarity with established opening theory |
| | - **Endgame Technique**: Performance in simplified positions |
| |
|
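As a starting point for the move-legality metric, generated transcripts first have to be split into individual moves. The sketch below does that and computes a pass rate against a pluggable checker. The `looks_like_san` function is only a syntactic placeholder and does not test real legality. A real check needs a rules engine that tracks the board; python-chess's `Board.parse_san`, which raises a `ValueError` for moves it cannot play, is one option. All helper names here are hypothetical.

```python
import re

MOVE_NUMBER = re.compile(r"^\d+\.")
SAN_SHAPE = re.compile(
    r"^(O-O(-O)?|[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](=[QRBN])?)[+#]?$"
)


def split_moves(transcript):
    """Turn ';1.e4 e5 2.Nf3' into ['e4', 'e5', 'Nf3']."""
    moves = []
    for token in transcript.lstrip(";").split():
        san = MOVE_NUMBER.sub("", token)
        if san:
            moves.append(san)
    return moves


def looks_like_san(history, san):
    """Placeholder checker: tests notation shape only, NOT legality on the board."""
    return SAN_SHAPE.match(san) is not None


def legal_move_rate(transcripts, is_legal):
    """Fraction of attempted moves accepted by is_legal(previous_moves, san).

    Counting stops at the first rejected move in each game, because later
    moves have no well-defined position to be judged against.
    """
    total = legal = 0
    for transcript in transcripts:
        history = []
        for san in split_moves(transcript):
            total += 1
            if not is_legal(history, san):
                break
            legal += 1
            history.append(san)
    return legal / total if total else 0.0


games = [";1.d4 Nf6 2.c4 e6", ";1.e4 Zz9 2.Nf3"]
print(split_moves(games[0]))
print(f"{legal_move_rate(games, looks_like_san):.3f}")
```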
| | ## Intended Use |
| |
|
| | ### Primary Use Cases |
| | - **Chess Analysis**: High-quality position evaluation and move suggestion |
| | - **Research**: Studying emergent reasoning in language models |
| | - **Education**: Chess learning and pattern recognition tools |
| | - **AI Development**: Baseline for chess AI systems |
| |
|
| | ### Limitations |
| | - Specialized for chess gameplay only |
| | - Limited to standard chess rules and notation |
| | - Requires proper tokenization format |
| | - GPU recommended for practical inference |
| | - May not generalize beyond chess domain |
| |
|
| | ## Alternative Model Variants |
| |
|
| | ### For Different Use Cases: |
| | - **Fast Inference**: Use **small-8** for minimal resource requirements |
| | - **Depth vs Width**: Compare **small-16/24/36** for layer depth ablations |
| | - **Balanced Performance**: Use **medium-12** or **medium-16** for mid-range applications |
| | - **Maximum Performance**: Use **large-16** for best overall results |
| | - **Human Behavior Studies**: Use **lichess** model for human-like gameplay patterns |
| |
|
| | ### Computational Requirements: |
| | - **Small Models (8-36 layers)**: CPU inference possible, GPU recommended |
| | - **Medium Models**: GPU recommended for practical use |
| | - **Large Model**: Single high-end GPU required |
| |
|
| | ## Technical Implementation |
| |
|
| | ### Model Architecture |
| | Based on GPT-2 with chess-specific adaptations: |
| | - **Vocabulary**: Reduced to 32 chess-specific tokens |
| | - **Context**: Optimized for 1023-token chess game sequences |
| | - **Training**: Custom data loading for chess game blocks |
| | - **Framework**: Built on NanoGPT for simplicity and efficiency |
| |
|
| | ### Training Insights |
| | - **Convergence**: Smooth training curves across all scales |
| | - **Memory Efficiency**: Optimized for multi-GPU training |
| | - **Data Processing**: Custom tokenization preserving chess structure |
| | - **Evaluation**: Chess-specific validation metrics |
| |
|
| | ## Ethical Considerations |
| |
|
| | - Models trained exclusively on chess data pose minimal ethical risks |
| | - No personal data or sensitive information in training datasets |
| | - Intended for educational, research, and recreational purposes |
| | - Computational requirements may limit accessibility |
| | - Models do not generalize beyond chess domain |
| |
|
| | ## Citation |
| |
|
| | If you use ChessGPT-2 in your research, please cite: |
| |
|
| | ```bibtex |
| | @misc{chessgpt, |
| | title={ChessGPT-2}, |
| | author={[Your Name]}, |
| | year={2024}, |
| | howpublished={Hugging Face Model Hub}, |
| | url={https://huggingface.co/[your-username]/chessgpt-2} |
| | } |
| | |
| | @dataset{chess_games_dataset, |
| | title={Chess Games Dataset}, |
| | author={Adam Karvonen}, |
| | year={2024}, |
| | url={https://huggingface.co/datasets/adamkarvonen/chess_games} |
| | } |
| | ``` |
| |
|
| | ## References |
| |
|
| | - **NanoGPT**: [karpathy/nanoGPT](https://github.com/karpathy/nanoGPT) |
| | - **Chess Dataset**: [@adamkarvonen/chess_games](https://huggingface.co/datasets/adamkarvonen/chess_games) |
| | - **GPT-2 Paper**: [Radford et al., 2019](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) |
| | - **Scaling Laws**: [Kaplan et al., 2020](https://arxiv.org/abs/2001.08361) |
| |
|
| | ## License |
| |
|
| | MIT |
| |
|
| | --- |