---
language:
- en
library_name: nanogpt
license: mit
tags:
- chess
- game-playing
- transformer
- gpt-2
- nanogpt
- strategic-reasoning
datasets:
- adamkarvonen/chess_games
model-index:
- name: ChessGPT-2
results:
- task:
type: text-generation
name: Chess Move Prediction
dataset:
type: adamkarvonen/chess_games
name: Chess Games Dataset
metrics:
- type: validation_loss
value: 0.2578
name: Best Validation Loss (large-16)
---
# ChessGPT-2
## Model Description
ChessGPT-2 is a series of transformer language models specifically trained on chess game data, demonstrating that language models can learn complex strategic reasoning through chess gameplay. This repository presents **large-16 (200M parameters)** as our best model.
The large-16 model is a 200-million-parameter GPT-2-style model trained on engine-generated chess games, capable of high-quality move prediction, strategic analysis, and chess reasoning.
## Model Details
### large-16 (Primary Model)
- **Model Type**: Autoregressive Transformer Language Model (GPT-2 architecture)
- **Parameters**: ~200 million
- **Architecture**:
- Layers: 16
- Attention Heads: 16
- Embedding Dimension: 1024
- Context Length: 1023 tokens
- Vocabulary Size: 32 tokens (chess-specific vocabulary)
- **Training Framework**: NanoGPT (PyTorch)
- **Precision**: Mixed precision training (bfloat16/float16)
## Training Data
All models were trained on datasets from [@adamkarvonen/chess_games](https://huggingface.co/datasets/adamkarvonen/chess_games):
### Primary Dataset: Stockfish Games
- **Dataset**: `stockfish_dataset_blocks.zip`
- **Description**: 4.5GB of games generated with Stockfish at ELO 3200 playing White against Stockfish opponents rated ELO 1300-3200 as Black
- **Format**: PGN (Portable Game Notation) games converted into 1024-character blocks
- **Tokenization**: Each block begins with a ";" delimiter (e.g., ";1.e4 e5 2.Nf3...")
- **Data Split**: 99% training, 1% validation (random split with seed 2357)
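The block format above lends itself to a very small tokenizer. A 32-token vocabulary suggests character-level encoding; the sketch below assumes that (the actual character set and `stoi`/`itos` construction in the repo may differ):

```python
# Hypothetical character-level tokenizer for ";"-delimited game blocks.
# The real 32-character vocabulary is fixed by the training corpus; here we
# derive one from a sample block purely for illustration.
def build_vocab(corpus: str):
    chars = sorted(set(corpus))
    stoi = {ch: i for i, ch in enumerate(chars)}   # char -> token id
    itos = {i: ch for ch, i in stoi.items()}       # token id -> char
    return stoi, itos

def encode(text: str, stoi) -> list:
    return [stoi[c] for c in text]

def decode(ids, itos) -> str:
    return "".join(itos[i] for i in ids)

block = ";1.e4 e5 2.Nf3 Nc6 3.Bb5 a6"
stoi, itos = build_vocab(block)
assert decode(encode(block, stoi), itos) == block  # lossless round-trip
```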
## Training Configuration
### large-16 Training Settings
- **Batch Size**: 32 (micro-batch)
- **Gradient Accumulation**: 4 steps (effective batch size: 128)
- **Learning Rate**: 3e-4 with cosine decay to 3e-5
- **Warmup**: 2000 iterations
- **Max Iterations**: 600,000
- **Optimizer**: AdamW (β₁=0.9, β₂=0.95)
- **Dropout**: 0.0 (no dropout for pretraining)
- **Training Hardware**: RTX 3090/4090 GPUs with distributed training support
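The learning-rate settings describe a warmup-plus-cosine schedule; a nanoGPT-style `get_lr` helper sketching it is below (treating the 600,000 max iterations as the decay horizon is an assumption):

```python
import math

max_lr, min_lr = 3e-4, 3e-5          # peak LR and cosine floor from the settings above
warmup_iters, decay_iters = 2000, 600_000

def get_lr(it: int) -> float:
    if it < warmup_iters:             # linear warmup to the peak LR
        return max_lr * (it + 1) / warmup_iters
    if it > decay_iters:              # past the horizon: hold the floor
        return min_lr
    ratio = (it - warmup_iters) / (decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))  # decays 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)
```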
## Usage
### Loading the Model
```python
import torch

from model import GPT, GPTConfig  # nanoGPT model definition

# Load large-16 configuration
config = GPTConfig(
    block_size=1023,
    n_layer=16,
    n_head=16,
    n_embd=1024,
    dropout=0.0,
    bias=False,
    vocab_size=32,
)

# Initialize the model and load the trained weights
model = GPT(config)
checkpoint = torch.load('ckpt.pt', map_location='cpu')
model.load_state_dict(checkpoint['model'])
model.eval()

# Move to GPU for inference (recommended)
if torch.cuda.is_available():
    model = model.cuda()

# Generate chess moves (requires proper tokenization)
prompt = ";1.d4 Nf6 2.c4 e6 3.Nc3 Bb4"
# ... tokenization and generation code ...
```
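The elided generation step is, at its core, an autoregressive sampling loop. The helper below is a dependency-free sketch of the sampling primitive (`sample_next` is a hypothetical name, not part of nanoGPT; in practice the logits come from the model's forward pass, and nanoGPT's `model.generate` wraps this loop):

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None):
    """Sample one token id from raw logits (hypothetical helper)."""
    if top_k is not None:
        # keep only the top_k highest logits
        cutoff = sorted(logits, reverse=True)[top_k - 1]
        logits = [l if l >= cutoff else float("-inf") for l in logits]
    m = max(logits)
    exps = [math.exp((l - m) / temperature) for l in logits]  # stable softmax
    r = random.random() * sum(exps)
    acc = 0.0
    for i, e in enumerate(exps):
        acc += e
        if r < acc:
            return i
    return len(exps) - 1

# top_k=1 degenerates to greedy decoding: always the argmax token
assert sample_next([0.2, 3.1, -1.0], top_k=1) == 1
```

With the real model, each step appends the sampled id to the context and re-runs the forward pass until the desired number of moves has been produced.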
### Input Format
All models expect properly tokenized chess games:
- Must start with ";" delimiter
- Standard PGN algebraic notation
- 1024-character blocks for optimal performance
## Performance Characteristics
The large-16 model demonstrates:
- **Chess Reasoning**: Strong grasp of tactical and strategic patterns
- **Long-Term Planning**: Coherent multi-move plans across full games
- **Pattern Recognition**: Generalization across diverse chess positions
- **Substantial Scale**: 202.5M parameters in a 2.3GB checkpoint
- **Largest Architecture in the Series**: 16 layers, 16 heads, 1024 embedding dimension
- **Best Validation Loss**: 0.2578, the lowest across all seven variants
## Model Series & Ablation Studies
This repository represents extensive research into scaling transformer models for chess. Our complete series includes:
### Parameter Scaling Ablations
| Model Variant | Parameters | Layers | Heads | Embedding | Model Size | Val Loss | Key Characteristics |
|---------------|------------|--------|-------|-----------|------------|----------|-------------------|
| **small-8** | 25.7M | 8 | 8 | 512 | 294MB | 0.2944 | Compact baseline |
| **small-16** | 50.9M | 16 | 8 | 512 | 582MB | 0.2725 | Depth scaling study |
| **small-24** | 76.1M | 24 | 8 | 512 | 871MB | 0.2628 | Deep narrow model |
| **small-36** | 113.8M | 36 | 8 | 512 | 1.3GB | 0.2583 | Maximum depth |
| **medium-12** | 85.8M | 12 | 12 | 768 | 982MB | 0.2652 | Balanced medium |
| **medium-16** | 114.1M | 16 | 12 | 768 | 1.3GB | 0.2608 | Deeper medium |
| **large-16** | 202.5M | 16 | 16 | 1024 | 2.3GB | 0.2578 | **Primary model** |
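The parameter counts in the table can be sanity-checked with the standard GPT-2 estimate (a back-of-the-envelope formula assuming tied embeddings and no bias terms, not code from the repo):

```python
# ~12 * n_layer * n_embd^2 covers the attention + MLP weights per block;
# (vocab_size + block_size) * n_embd covers token + position embeddings.
def approx_params(n_layer: int, n_embd: int,
                  vocab_size: int = 32, block_size: int = 1023) -> int:
    return 12 * n_layer * n_embd**2 + (vocab_size + block_size) * n_embd

for name, (layers, embd) in {
    "small-8": (8, 512), "small-36": (36, 512),
    "medium-16": (16, 768), "large-16": (16, 1024),
}.items():
    print(f"{name}: {approx_params(layers, embd) / 1e6:.1f}M")
```

The results land within rounding distance of the table (layer norms and other minor terms account for the small residual on large-16).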
### Dataset Comparison Studies
| Model | Dataset | Source | Size | Characteristics |
|-------|---------|--------|------|-----------------|
| **All Stockfish Models** | Stockfish | Engine games | 4.5GB | Optimal play patterns |
| **Lichess Model** | Lichess | Human games | 6GB | Human decision patterns |
### Key Research Findings
1. **Depth vs Width Trade-offs**: Small models (512 emb, 8 heads) scale from 25.7M→113.8M parameters purely through depth (8→36 layers)
2. **Clear Performance Scaling**: Validation loss improves consistently with depth: 0.2944 (8-layer) → 0.2583 (36-layer)
3. **Architecture Variations**: Medium models explore width scaling (768 emb, 12 heads) vs small models' depth scaling
4. **Parameter Efficiency**: small-36 (113.8M) achieves similar parameter count to medium-16 (114.1M) via different architectures
5. **No Overfitting**: All models trained to 600k iterations show continued learning potential
6. **Dataset Impact**: Significant behavioral differences between engine vs. human training data
## Evaluation Metrics
Models should be evaluated on:
- **Move Legality**: Percentage of generated moves that are legal
- **Game Continuation**: Quality and coherence of extended game sequences
- **Tactical Recognition**: Ability to identify tactical patterns and combinations
- **Strategic Understanding**: Long-term positional planning and evaluation
- **Opening Knowledge**: Familiarity with established opening theory
- **Endgame Technique**: Performance in simplified positions
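The move-legality metric can be approximated cheaply at the syntax level. The sketch below scores how many tokens in a generated block are well-formed SAN (`san_wellformed_rate` is a hypothetical helper; full legality checking additionally requires a rules engine such as the third-party python-chess package):

```python
import re

# Matches castling or a SAN move such as e4, Nf3, exd5, Qxd5+, e8=Q#
SAN = re.compile(r"^(O-O(-O)?|[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](=[QRBN])?)[+#]?$")

def san_wellformed_rate(game: str) -> float:
    """Fraction of move tokens that are syntactically valid SAN."""
    text = re.sub(r"\d+\.", " ", game.lstrip(";"))  # strip "1." move numbers
    moves = text.split()
    if not moves:
        return 0.0
    return sum(bool(SAN.match(m)) for m in moves) / len(moves)

assert san_wellformed_rate(";1.e4 e5 2.Nf3 Nc6") == 1.0
assert san_wellformed_rate(";1.e4 zz9") == 0.5
```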
## Intended Use
### Primary Use Cases
- **Chess Analysis**: High-quality position evaluation and move suggestion
- **Research**: Studying emergent reasoning in language models
- **Education**: Chess learning and pattern recognition tools
- **AI Development**: Baseline for chess AI systems
### Limitations
- Specialized for chess gameplay only
- Limited to standard chess rules and notation
- Requires proper tokenization format
- GPU recommended for practical inference
- May not generalize beyond chess domain
## Alternative Model Variants
### For Different Use Cases:
- **Fast Inference**: Use **small-8** for minimal resource requirements
- **Depth vs Width**: Compare **small-16/24/36** for layer depth ablations
- **Balanced Performance**: Use **medium-12** or **medium-16** for mid-range applications
- **Maximum Performance**: Use **large-16** for best overall results
- **Human Behavior Studies**: Use **lichess** model for human-like gameplay patterns
### Computational Requirements:
- **Small Models (8-36 layers)**: CPU inference possible, GPU recommended
- **Medium Models**: GPU recommended for practical use
- **Large Model**: Single high-end GPU required
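The checkpoint sizes quoted throughout are consistent with float32 weights plus AdamW's two moment buffers (~3 tensors per parameter at 4 bytes each). This is an inference from the numbers, not a statement about the repo's actual checkpoint format:

```python
def ckpt_size_gb(params_millions: float, tensors_per_param: int = 3) -> float:
    # 4 bytes per float32 value; weights + AdamW exp_avg + exp_avg_sq
    return params_millions * 1e6 * 4 * tensors_per_param / 1024**3

print(f"{ckpt_size_gb(202.5):.1f} GB")  # ~2.3 GB, matching the large-16 entry
```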
## Technical Implementation
### Model Architecture
Based on GPT-2 with chess-specific adaptations:
- **Vocabulary**: Reduced to 32 chess-specific tokens
- **Context**: Optimized for 1023-token chess game sequences
- **Training**: Custom data loading for chess game blocks
- **Framework**: Built on NanoGPT for simplicity and efficiency
### Training Insights
- **Convergence**: Smooth training curves across all scales
- **Memory Efficiency**: Optimized for multi-GPU training
- **Data Processing**: Custom tokenization preserving chess structure
- **Evaluation**: Chess-specific validation metrics
## Ethical Considerations
- Models trained exclusively on chess data pose minimal ethical risks
- No personal data or sensitive information in training datasets
- Intended for educational, research, and recreational purposes
- Computational requirements may limit accessibility
- Models do not generalize beyond chess domain
## Citation
If you use ChessGPT in your research, please cite:
```bibtex
@misc{chessgpt,
title={ChessGPT-2},
author={[Your Name]},
year={2024},
howpublished={Hugging Face Model Hub},
url={https://huggingface.co/[your-username]/chessgpt-2}
}
@dataset{chess_games_dataset,
title={Chess Games Dataset},
author={Adam Karvonen},
year={2024},
url={https://huggingface.co/datasets/adamkarvonen/chess_games}
}
```
## References
- **NanoGPT**: [karpathy/nanoGPT](https://github.com/karpathy/nanoGPT)
- **Chess Dataset**: [@adamkarvonen/chess_games](https://huggingface.co/datasets/adamkarvonen/chess_games)
- **GPT-2 Paper**: [Radford et al., 2019](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
- **Scaling Laws**: [Kaplan et al., 2020](https://arxiv.org/abs/2001.08361)
## License
MIT
---