# ActionNet – Autonomous RC Car Driving Model

A lightweight classification CNN that drives a small RC car by predicting discrete motor actions from raw camera frames. Trained through imitation learning: a human drives the car while the system records frames and commands, then the model learns to replicate that behavior.
Part of the OpenBot PC Server Project.
## Model Description
ActionNet classifies a single 66×200 RGB camera image into one of 9 discrete driving actions. It replaces the traditional regression approach (predicting continuous steering angles) because keyboard-driven training data contains only a handful of unique command pairs. Classification with cross-entropy loss handles this much better than mean-squared-error regression, which tends to average everything toward zero.

**Input:** 66×200×3 RGB image (cropped from 800×600, top 40% removed)
**Output:** probability distribution over 9 driving actions
### The 9 Actions
| Index | Action | Left Motor | Right Motor | Description |
|---|---|---|---|---|
| 0 | STOP | 0 | 0 | Both motors off |
| 1 | FORWARD | +70 | +70 | Straight ahead |
| 2 | BACKWARD | -70 | -70 | Straight reverse |
| 3 | TURN LEFT | -49 | +49 | Pivot left (in place) |
| 4 | TURN RIGHT | +49 | -49 | Pivot right (in place) |
| 5 | FORWARD+LEFT | +21 | +70 | Arc forward-left |
| 6 | FORWARD+RIGHT | +70 | +21 | Arc forward-right |
| 7 | BACKWARD+LEFT | -21 | -70 | Arc backward-left |
| 8 | BACKWARD+RIGHT | -70 | -21 | Arc backward-right |
Motor values are shown at speed=70 and scale proportionally with the speed setting.
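The table and the speed scaling can be sketched in plain Python. This is a minimal sketch inferred from the speed=70 column (49/70 = 0.7 for pivots, 21/70 = 0.3 for the inner wheel on an arc); the project's actual `action_to_command()` lives in `model.py`.

```python
def action_to_command(action: int, speed: int = 70) -> tuple[int, int]:
    """Map an action index to (left, right) motor values, scaled by speed.

    Sketch only: ratios 0.7 and 0.3 are inferred from the speed=70 table.
    """
    t = round(speed * 0.7)   # pivot turn magnitude (49 at speed=70)
    a = round(speed * 0.3)   # inner wheel on an arc (21 at speed=70)
    table = {
        0: (0, 0),            # STOP
        1: (speed, speed),    # FORWARD
        2: (-speed, -speed),  # BACKWARD
        3: (-t, t),           # TURN LEFT (pivot in place)
        4: (t, -t),           # TURN RIGHT (pivot in place)
        5: (a, speed),        # FORWARD+LEFT arc
        6: (speed, a),        # FORWARD+RIGHT arc
        7: (-a, -speed),      # BACKWARD+LEFT arc
        8: (-speed, -a),      # BACKWARD+RIGHT arc
    }
    return table[action]
```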
## Architecture
The convolutional backbone is based on NVIDIA's PilotNet (from the "End to End Learning for Self-Driving Cars" paper), modified with batch normalization, ELU activations, and a classification head.
```
Layer                           Output Shape        Parameters
──────────────────────────────────────────────────────────────
Input                           (B, 3, 66, 200)          —
Conv2d(3→24, 5×5, stride=2)     (B, 24, 31, 98)       1,824
BatchNorm2d(24)                                          48
ELU                                                       —
Conv2d(24→36, 5×5, stride=2)    (B, 36, 14, 47)      21,636
BatchNorm2d(36)                                          72
ELU                                                       —
Conv2d(36→48, 5×5, stride=2)    (B, 48, 5, 22)       43,248
BatchNorm2d(48)                                          96
ELU                                                       —
Conv2d(48→64, 3×3, stride=1)    (B, 64, 3, 20)       27,712
BatchNorm2d(64)                                         128
ELU                                                       —
Conv2d(64→64, 3×3, stride=1)    (B, 64, 1, 18)       36,928
BatchNorm2d(64)                                         128
ELU                                                       —
Dropout2d(0.15)                                           —
Flatten                         (B, 1152)                 —
Dropout(0.35)                                             —
Linear(1152→64)                 (B, 64)              73,792
ELU                                                       —
Dropout(0.35)                                             —
Linear(64→9)                    (B, 9)                  585
──────────────────────────────────────────────────────────────
```
Total trainable parameters: ~206,000 (the per-layer counts above sum to 206,197)
Model file size: ~1–2 MB (.pth)
### Design Decisions
- **BatchNorm after every conv layer** – stabilizes training and allows higher learning rates without divergence
- **ELU instead of ReLU** – avoids dead neurons and produces smoother gradients, which matters when the model is small
- **Spatial Dropout2d (15%)** – drops entire feature maps instead of individual pixels, forcing the network to spread information across channels
- **Two-layer classification head with 35% dropout** – the bottleneck at 64 units forces compression and fights overfitting on small datasets
- **Kaiming initialization** – all conv and linear layers use He initialization (fan-out mode), which pairs well with ELU activations
- **Label smoothing (0.2)** – prevents the model from becoming overconfident on exact training labels. A STOP frame labeled as [1.0, 0.0, 0.0, ...] becomes [0.82, 0.02, 0.02, ...], which improves generalization
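Putting the layer table and design notes together, the network can be sketched in PyTorch as below. This is a sketch matching the documented architecture, not the project's `model.py`, which is the reference implementation.

```python
import torch
import torch.nn as nn

class ActionNet(nn.Module):
    """Sketch of ActionNet per the layer table above (66x200x3 in, 9 logits out)."""

    def __init__(self, num_actions: int = 9):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 24, 5, stride=2), nn.BatchNorm2d(24), nn.ELU(),
            nn.Conv2d(24, 36, 5, stride=2), nn.BatchNorm2d(36), nn.ELU(),
            nn.Conv2d(36, 48, 5, stride=2), nn.BatchNorm2d(48), nn.ELU(),
            nn.Conv2d(48, 64, 3), nn.BatchNorm2d(64), nn.ELU(),
            nn.Conv2d(64, 64, 3), nn.BatchNorm2d(64), nn.ELU(),
            nn.Dropout2d(0.15),                     # spatial dropout on feature maps
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                           # 64 * 1 * 18 = 1152 features
            nn.Dropout(0.35),
            nn.Linear(1152, 64), nn.ELU(),
            nn.Dropout(0.35),
            nn.Linear(64, num_actions),
        )
        # Kaiming (He) initialization, fan-out mode, for conv and linear layers
        for m in self.modules():
            if isinstance(m, (nn.Conv2d, nn.Linear)):
                nn.init.kaiming_normal_(m.weight, mode="fan_out")
                nn.init.zeros_(m.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))
```

Instantiated this way, the parameter count matches the table's per-layer sum (206,197).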
## Preprocessing
The full pipeline from raw camera frame to model input:
```
Raw 800×600 BGR frame from ESP32-CAM
        │
        ▼
Crop top 40% of the image
(removes ceiling, sky, and upper walls)
        │
        ▼
Convert BGR → RGB
        │
        ▼
Resize to 200×66 pixels
(using INTER_AREA interpolation)
        │
        ▼
ToTensor → normalize to [0, 1] float32
        │
        ▼
Final shape: [batch, 3, 66, 200]
```
The `crop_and_resize()` function in `trainer.py` performs this transformation. The exact same function is called during both training and inference (in `autopilot.py`) to guarantee consistency.
Why crop the top 40%? Because the camera is mounted on a low car pointing forward. The top portion of every frame shows ceiling, walls, or sky β none of which help the model decide where to steer. Removing it reduces noise and lets the model focus on the ground, obstacles, and path ahead.
## Training Configuration
| Parameter | Value | Notes |
|---|---|---|
| Optimizer | AdamW | weight_decay=5e-3 (decoupled weight decay) |
| Learning Rate | 0.001 | Peak rate, with OneCycleLR schedule |
| LR Schedule | OneCycleLR | 10% warmup, cosine anneal, div_factor=10 |
| Loss Function | CrossEntropyLoss | label_smoothing=0.2 |
| Batch Size | 32 | Fits comfortably in CPU memory |
| Gradient Clipping | max_norm=1.0 | Prevents gradient explosions |
| Early Stopping | 30 epochs patience | Monitored by validation accuracy |
| Class Balancing | WeightedRandomSampler | Inverse-frequency weights per class |
| Train/Val Split | 80% / 20% | Random split |
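The table above can be wired up in PyTorch roughly as follows. This is a sketch of the documented configuration; `epochs` and `steps_per_epoch` are illustrative placeholders, and the stand-in model substitutes for ActionNet.

```python
import torch
from torch import nn, optim

model = nn.Linear(10, 9)            # stand-in for ActionNet
epochs, steps_per_epoch = 100, 75   # placeholders; depend on your dataset

optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=5e-3)
scheduler = optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-3,                    # peak learning rate
    total_steps=epochs * steps_per_epoch,
    pct_start=0.1,                  # 10% warmup
    anneal_strategy="cos",          # cosine anneal after the peak
    div_factor=10,                  # start at max_lr / 10
)
criterion = nn.CrossEntropyLoss(label_smoothing=0.2)

# Per training step (sketch):
#   loss.backward()
#   nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()
```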
## Data Augmentation
Applied on-the-fly during training:
| Augmentation | Probability | Details |
|---|---|---|
| Horizontal flip | 50% | Action labels are mirrored (LEFT↔RIGHT) |
| Random shadow | 50% | Vertical band at random brightness (30–70%) |
| Random brightness | 50% | HSV V-channel scaled 0.6–1.4× |
| Gaussian blur | 30% | Kernel 3Γ3 or 5Γ5 |
| Random translation | 40% | Shift Β±10% in X and Y |
| Random erasing | 50% | Rectangular cutout on tensor |
The horizontal flip augmentation automatically swaps left/right action labels using a predefined mirror table, so the model never sees contradictory labels.
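The mirror table follows directly from the action indices: the pivots swap (3↔4), the arcs swap (5↔6, 7↔8), and the symmetric actions map to themselves. A minimal sketch of the flip-with-relabel step (the project keeps its own table alongside the augmentation code):

```python
import numpy as np

# Action indices per the table above; symmetric actions map to themselves.
MIRROR = {0: 0, 1: 1, 2: 2, 3: 4, 4: 3, 5: 6, 6: 5, 7: 8, 8: 7}

def hflip_sample(image: np.ndarray, action: int) -> tuple[np.ndarray, int]:
    """Flip an HxWxC image left-right and mirror its action label."""
    return image[:, ::-1].copy(), MIRROR[action]
```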
## Inference
At runtime, the autopilot module:
- Reads the latest camera frame from the MJPEG stream
- Runs `crop_and_resize()` and converts the result to a tensor
- Forward pass through ActionNet produces 9 logits
- Applies softmax and picks the action with the highest probability
- Uses a 3-frame majority vote to smooth out flickering predictions
- Maps the smoothed action to (left, right) motor commands at the configured speed
- Sends the command to the ESP8266 over WebSocket
The inference loop runs at 10 FPS on a typical laptop CPU. No GPU required.
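The majority-vote smoothing can be sketched with a small rolling window. This is a sketch only: the tie-breaking rule (here, falling back to the newest prediction when no strict majority exists) is an assumption, and the project's `autopilot.py` defines the actual behavior.

```python
from collections import Counter, deque

class MajorityVote:
    """Smooth per-frame action predictions with a rolling majority vote."""

    def __init__(self, window: int = 3):
        self.history = deque(maxlen=window)

    def update(self, action: int) -> int:
        self.history.append(action)
        best, count = Counter(self.history).most_common(1)[0]
        # Require a strict majority; otherwise pass the newest prediction
        # through (assumed tie-break, not confirmed from autopilot.py).
        return best if count > len(self.history) // 2 else action
```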
## Hardware Requirements
This model is designed for a specific hardware setup:
| Component | Role |
|---|---|
| ESP32-CAM (OV2640) | Streams 800×600 MJPEG video over HTTP |
| ESP8266 (NodeMCU) | Receives motor commands over WebSocket, drives L298N |
| L298N Motor Driver | Controls 2 DC gear motors (differential drive) |
| SG90 Servo (optional) | Camera pan |
| PC (any laptop/desktop) | Runs the server, training, and inference |
The PC does all the heavy lifting. The microcontrollers are just I/O: one for video, one for motors. Total hardware cost is around $25–30 USD.
## How to Use This Model
### Quickstart
```python
import torch
import torch.nn.functional as F
from torchvision import transforms
from model import ActionNet, action_to_command

# Load
device = torch.device("cpu")
model = ActionNet().to(device)
checkpoint = torch.load("trained_models/autopilot.pth", map_location=device)
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()

# Prepare a 66x200 RGB image as a tensor
transform = transforms.ToTensor()
img_tensor = transform(your_66x200_rgb_image).unsqueeze(0).to(device)

# Predict
with torch.no_grad():
    logits = model(img_tensor)
    probs = F.softmax(logits, dim=1)
    action = torch.argmax(probs, dim=1).item()
    confidence = probs[0, action].item()

# Convert to motor command
left, right = action_to_command(action, speed=70)
print(f"Action: {action}, Motors: L={left} R={right}, Confidence: {confidence:.1%}")
```
### Within the Full System
The model is used automatically by the autopilot module. Start the server, record some training data through the dashboard, train from the dashboard, then click "Start Autopilot."
See the full README for step-by-step instructions including hardware assembly, firmware upload, and data collection.
## Training Your Own Model
- Assemble the hardware (ESP8266 + ESP32-CAM + motors)
- Flash firmware to both microcontrollers
- Start the PC server: `python app.py`
- Drive the car manually while recording data
- Click "Train" in the dashboard, or trigger training through the API
- The best checkpoint saves automatically to `trained_models/autopilot.pth`
Training runs on CPU. A dataset of 3,000 frames trains in under 5 minutes on a modern laptop. GPU is supported if available but not required.
## Limitations
- The model only knows what it has seen. If you train it in one room, it won't generalize to a different room without additional data.
- Keyboard inputs produce jerky, discrete commands. A joystick or gamepad would produce smoother training data.
- The 40% top-crop assumes the camera is mounted pointing roughly forward and slightly down. If your camera angle is very different, adjust the crop ratio in `trainer.py`.
- Performance depends heavily on lighting conditions matching between training and inference.
- The model has no notion of obstacles, goals, or maps. It purely replicates the visual patterns it was trained on.
## Citation
If you use this project in your work, a mention is appreciated but not required:
OpenBot PC Server Project – Autonomous RC Car with Imitation Learning
https://github.com/YOUR_USERNAME/openbot-pc-server-project
## License
MIT License: use it, modify it, ship it.