---
language: en
tags:
- evolutionary-strategy
- cma-es
- gymnasium
- cartpole
- optimization
library_name: custom
datasets:
- gymnasium/CartPole-v1
metrics:
- mean_episode_length
model-index:
- name: CartPole
  results:
  - task:
      type: optimization
      name: CartPole-v1
    dataset:
      name: gymnasium/CartPole-v1
      type: gymnasium
    metrics:
    - type: mean_episode_length
      value: 500
      name: Mean Episode Length
license: mit
pipeline_tag: reinforcement-learning
---

# CartPole Solution

This model solves the Gymnasium CartPole-v1 environment with a linear policy whose weights were optimized using CMA-ES (Covariance Matrix Adaptation Evolution Strategy).

### Training Convergence

*Figure: Mean fitness (episode length) per generation during training. The policy reaches the maximum episode length of 500 steps within 3 generations.*

## Model Details

### Model Description

This is a linear policy for the CartPole-v1 environment. It:
- Maps the 4-dimensional state to scores for the 2 discrete actions with a single 4x2 weight matrix, then picks the higher-scoring action
- Reaches the maximum episode length (500/500 steps) consistently
- Was optimized with CMA-ES and converged within 3 generations
- Demonstrates sample-efficient learning on the CartPole balancing task

```python
def get_action(self, observation):
    # Score both actions with one matrix product, then act greedily.
    observation = np.array(observation, dtype=np.float32)
    action_scores = np.dot(observation, self.weights)  # shape (2,)
    return int(np.argmax(action_scores))
```

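
To make the mapping concrete, the sketch below applies the same scoring rule to a single observation. The weight values are hypothetical and chosen only for illustration (they are not the trained weights), and `get_action` is written here as a free function rather than a method.

```python
import numpy as np

# Hypothetical 4x2 weights: one row per state variable, one column per action.
# Column 0 scores "push left", column 1 scores "push right".
weights = np.array(
    [
        [0.0, 0.0],    # cart position
        [0.0, 0.0],    # cart velocity
        [-1.0, 1.0],   # pole angle
        [-1.0, 1.0],   # pole angular velocity
    ],
    dtype=np.float32,
)


def get_action(observation, weights):
    """Return the index of the action with the highest linear score."""
    observation = np.asarray(observation, dtype=np.float32)
    action_scores = observation @ weights  # shape (2,)
    return int(np.argmax(action_scores))


# Pole leaning right and falling further right -> push right (action 1).
print(get_action([0.0, 0.0, 0.05, 0.2], weights))    # 1
# Pole leaning left and falling further left -> push left (action 0).
print(get_action([0.0, 0.0, -0.05, -0.2], weights))  # 0
```

With these illustrative weights the policy ignores the cart's position and velocity and reacts only to the pole, which is enough to show how the argmax turns two linear scores into a discrete action.
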
- **Developed by:** Niladri Das
- **Model type:** Linear Policy
- **Language:** Python
- **License:** MIT
- **Finetuned from model:** No (trained from scratch)

### Model Sources

- **Hugging Face:** https://huggingface.co/harpertoken/harpertoken-cartpole

## Uses

### Direct Use

The model is intended for:
1. Solving the CartPole-v1 environment from Gymnasium
2. Demonstrating CMA-ES optimization on an RL task
3. Serving as a baseline for comparison with other algorithms
4. Teaching evolutionary strategies

### Out-of-Scope Use

The model should not be used for:
1. Control tasks more complex than CartPole
2. Real-world robotics applications
3. Tasks that require non-linear policies
4. Partially observable environments

## Bias, Risks, and Limitations

### Technical Limitations
- Trained and evaluated only on CartPole-v1; the solution is environment-specific
- Requires the full 4-dimensional state observation
- Linear policy architecture
- No transfer learning capability

### Performance Limitations
- May not handle significant variations of the environment
- Does not adapt to changing dynamics
- Limited by the capacity of a linear policy
- Requires precise state information

### Recommendations

Users should:
1. Use the model only with the CartPole-v1 environment
2. Ensure the full state is observable
3. Understand the limitations of linear policies
4. Consider more expressive architectures for other tasks
5. Validate performance in their own setup

## How to Get Started with the Model

### Method 1: Using the CMAESAgent Class

```python
from model import CMAESAgent

# Load the model
agent = CMAESAgent.from_pretrained("harpertoken/harpertoken-cartpole")

# Evaluate
mean_reward, std_reward = agent.evaluate(num_episodes=5)
print(f"Mean reward: {mean_reward:.2f} ± {std_reward:.2f}")
```

### Method 2: Manual Implementation

```python
import gymnasium as gym
import numpy as np

# Load model weights (4x2 matrix: 4 state variables x 2 actions)
weights = np.load('model_weights.npy')

# Create environment
env = gym.make('CartPole-v1')


def get_action(observation):
    """Return the action with the highest linear score."""
    logits = observation @ weights
    return int(np.argmax(logits))


# Run inference for a single episode
observation, _ = env.reset()
while True:
    action = get_action(observation)
    observation, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break
env.close()
```

## Training Details

### Training Data

- **Environment:** Gymnasium CartPole-v1
- **State Space:** 4-dimensional continuous (cart position, cart velocity, pole angle, pole angular velocity)
- **Action Space:** Discrete with 2 actions (push left, push right)
- **Reward:** +1 for each step, up to a maximum of 500 per episode
- **Episode Termination:** Pole angle beyond ±12°, cart position beyond ±2.4, or 500 steps reached (truncation)
- **Training Approach:** Direct environment interaction (no pre-collected dataset)

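
The episode-ending rule can be written as a small predicate. This is a sketch rather than Gymnasium's own code: the function name `episode_over` and the constant names are hypothetical, and the thresholds are the ones Gymnasium documents for CartPole-v1 (a pole angle beyond ±12° or a cart position beyond ±2.4 terminates the episode, and the step limit of 500 truncates it).

```python
import math

# Thresholds documented by Gymnasium for CartPole-v1 (constant names are illustrative).
MAX_POLE_ANGLE_RAD = math.radians(12)  # about 0.2095 rad
MAX_CART_POSITION = 2.4
MAX_EPISODE_STEPS = 500


def episode_over(state, step_count):
    """Return (terminated, truncated) for `state` after `step_count` steps.

    `state` is (cart position, cart velocity, pole angle in radians,
    pole angular velocity).
    """
    x, _x_dot, theta, _theta_dot = state
    terminated = abs(x) > MAX_CART_POSITION or abs(theta) > MAX_POLE_ANGLE_RAD
    truncated = step_count >= MAX_EPISODE_STEPS
    return terminated, truncated


print(episode_over((0.0, 0.0, 0.05, 0.0), step_count=10))   # (False, False)
print(episode_over((0.0, 0.0, 0.25, 0.0), step_count=10))   # (True, False)
print(episode_over((0.1, 0.0, 0.01, 0.0), step_count=500))  # (False, True)
```

Keeping termination (failure) separate from truncation (time limit) mirrors the two flags returned by `env.step`, and only the truncated case corresponds to the full 500-step score.
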
### Training Procedure

#### Training Hyperparameters

- **Algorithm:** CMA-ES
- **Population size:** 16
- **Number of generations:** 100 (maximum; fitness converged by generation 3)
- **Initial step size:** 0.5
- **Parameters:** 8 (4x2 weight matrix)
- **Training regime:** Single precision (fp32)

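
To show how these hyperparameters fit together, here is a minimal sketch of a population-based search over the 8 policy weights. It is deliberately not CMA-ES: it samples from a fixed isotropic Gaussian with a decaying step size and performs no covariance adaptation. The `fitness` function is a hypothetical toy stand-in for an episode rollout, and `NUM_ELITES`, the decay factor, and `TARGET` are assumptions. Only the population size (16), the initial step size (0.5), and the parameter count (8) come from the list above.

```python
import numpy as np

POPULATION_SIZE = 16   # from the hyperparameter list
INITIAL_SIGMA = 0.5    # from the hyperparameter list
NUM_PARAMS = 8         # flattened 4x2 weight matrix
NUM_ELITES = 4         # assumption: top candidates averaged each generation

# Hypothetical target that gives the toy fitness a known optimum.
TARGET = np.linspace(-1.0, 1.0, NUM_PARAMS)


def fitness(params):
    """Toy stand-in for episode length: higher is better, maximum is 0."""
    return -float(np.sum((params - TARGET) ** 2))


def simple_es(num_generations=30, seed=0):
    rng = np.random.default_rng(seed)
    mean = np.zeros(NUM_PARAMS)
    sigma = INITIAL_SIGMA
    best_params, best_fitness = mean.copy(), fitness(mean)
    history = [best_fitness]
    for _ in range(num_generations):
        # Sample a population around the current mean.
        noise = rng.standard_normal((POPULATION_SIZE, NUM_PARAMS))
        population = mean + sigma * noise
        scores = np.array([fitness(p) for p in population])
        # Move the mean to the average of the best candidates.
        elite_idx = np.argsort(scores)[-NUM_ELITES:]
        mean = population[elite_idx].mean(axis=0)
        sigma *= 0.95  # assumption: simple geometric step-size decay
        # Track the best candidate seen so far.
        top = int(np.argmax(scores))
        if scores[top] > best_fitness:
            best_params, best_fitness = population[top].copy(), float(scores[top])
        history.append(best_fitness)
    return best_params, best_fitness, history


best_params, best_fitness, history = simple_es()
weights = best_params.reshape(4, 2)  # flat search vector -> 4x2 policy matrix
print(weights.shape, round(best_fitness, 4))
```

In a real run, `fitness` would roll out the policy built from `params.reshape(4, 2)` in CartPole-v1 and return the episode length, and a CMA-ES implementation would additionally adapt the covariance matrix and step size from the ranked samples.
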
#### Hardware Requirements

- **CPU:** A single core is sufficient
- **Memory:** <100 MB RAM
- **GPU:** Not required
- **Training time:** ~5 minutes on a standard CPU

### Evaluation

#### Testing Data & Metrics

- **Environment:** Same as training (CartPole-v1)
- **Episodes:** 100 test episodes
- **Metrics:** Episode length, success rate

#### Results

- **Average Episode Length:** 500.0 ± 0.0
- **Success Rate:** 100%
- **Convergence:** Achieved in 3 generations
- **Final Population Mean:** 500.00
- **Best Performance:** 500/500 consistently

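
The summary statistics can be computed from a list of per-episode lengths as sketched below. The helper name `summarize` is hypothetical, and the definition of success (an episode that runs for the full 500 steps) is an assumption, since the card does not define it. The input is synthetic and simply mirrors the reported outcome.

```python
import statistics


def summarize(episode_lengths, max_steps=500):
    """Return (mean, population std, success rate) for a list of episode lengths."""
    mean_length = statistics.fmean(episode_lengths)
    std_length = statistics.pstdev(episode_lengths)
    # Assumption: an episode counts as a success when it lasts the full `max_steps`.
    successes = sum(length >= max_steps for length in episode_lengths)
    return mean_length, std_length, successes / len(episode_lengths)


# Synthetic input matching the reported outcome: 100 episodes of 500 steps each.
mean_length, std_length, success_rate = summarize([500] * 100)
print(f"{mean_length:.1f} ± {std_length:.1f}, success rate {success_rate:.0%}")
# 500.0 ± 0.0, success rate 100%
```

A standard deviation of exactly 0.0 is only possible when every episode hits the step cap, so the two reported numbers carry the same information here.
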
## Implementation Details

The implementation uses a straightforward linear policy:

```python
import gymnasium as gym
import numpy as np


class CMAESAgent:
    def __init__(self, env_name):
        self.env = gym.make(env_name)
        self.observation_space = self.env.observation_space.shape[0]  # 4 for CartPole
        self.action_space = self.env.action_space.n  # 2 for CartPole
        self.num_params = self.observation_space * self.action_space  # 8 total parameters
        self.weights = None  # must hold a 4x2 array before get_action is called

    def get_action(self, observation):
        # Score each action linearly and pick the best one.
        observation = np.array(observation, dtype=np.float32)
        action_scores = np.dot(observation, self.weights)
        return int(np.argmax(action_scores))
```

That an 8-parameter linear policy reaches the 500-step cap shows that a linear decision rule over the state variables is sufficient to balance the pole in CartPole-v1.

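
Because there are only two actions, the argmax over two linear scores reduces to the sign of a single linear function of the state: action 1 is chosen exactly when `observation @ (weights[:, 1] - weights[:, 0]) > 0`. The sketch below checks this equivalence on synthetic data. The weights and observations are random illustrative values, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative values only (not the trained weights). Small integers and
# quarter-steps keep the float arithmetic exact, so the comparison is not
# affected by rounding.
weights = rng.integers(-5, 6, size=(4, 2)).astype(np.float64)
observations = rng.integers(-8, 9, size=(1000, 4)) / 4.0

# Policy as written in the model: argmax over the two action scores.
argmax_actions = np.argmax(observations @ weights, axis=1)

# Equivalent single hyperplane: choose action 1 when the score difference is positive.
direction = weights[:, 1] - weights[:, 0]
sign_actions = (observations @ direction > 0).astype(int)

print(np.array_equal(argmax_actions, sign_actions))  # True
```

Seen this way, the 8 weights define one separating hyperplane in the 4-dimensional state space, so the policy has effectively 4 degrees of freedom.
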
## Environmental Impact

- **Training time:** ~5 minutes
- **Hardware:** Standard CPU
- **Energy consumption:** Negligible (<0.001 kWh)
- **CO2 emissions:** Minimal (<0.001 kg)

## Citation

**BibTeX:**
```bibtex
@misc{das2025cartpole,
  author = {Niladri Das},
  title = {CartPole Solution},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {https://huggingface.co/harpertoken/harpertoken-cartpole}
}