# ActionNet – Autonomous RC Car Driving Model

A lightweight classification CNN that drives a small RC car by predicting discrete motor actions from raw camera frames. Trained through imitation learning: a human drives the car while the system records frames and commands, then the model learns to replicate that behavior.
Part of the OpenBot PC Server Project.
## Model Description
ActionNet classifies a single 66×200 RGB camera image into one of 9 discrete driving actions. It replaces the traditional regression approach (predicting continuous steering angles) because keyboard-driven training data contains only a handful of unique command pairs. Classification with cross-entropy loss handles this much better than mean-squared-error regression, which tends to average everything toward zero.

**Input:** 66×200×3 RGB image (cropped from 800×600, top 40% removed)
**Output:** probability distribution over 9 driving actions
### The 9 Actions
| Index | Action | Left Motor | Right Motor | Description |
|---|---|---|---|---|
| 0 | STOP | 0 | 0 | Both motors off |
| 1 | FORWARD | +70 | +70 | Straight ahead |
| 2 | BACKWARD | -70 | -70 | Straight reverse |
| 3 | TURN LEFT | -49 | +49 | Pivot left (in place) |
| 4 | TURN RIGHT | +49 | -49 | Pivot right (in place) |
| 5 | FORWARD+LEFT | +21 | +70 | Arc forward-left |
| 6 | FORWARD+RIGHT | +70 | +21 | Arc forward-right |
| 7 | BACKWARD+LEFT | -21 | -70 | Arc backward-left |
| 8 | BACKWARD+RIGHT | -70 | -21 | Arc backward-right |
Motor values are shown at speed=70 and scale proportionally with the speed setting.
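The table and the speed scaling can be sketched in plain Python. This is a minimal sketch inferred from the speed=70 column (49/70 = 0.7 for pivots, 21/70 = 0.3 for the inner wheel on an arc); the project's actual `action_to_command()` lives in `model.py`.

```python
def action_to_command(action: int, speed: int = 70) -> tuple[int, int]:
    """Map an action index to (left, right) motor values, scaled by speed.

    Sketch only: ratios 0.7 and 0.3 are inferred from the speed=70 table.
    """
    t = round(speed * 0.7)   # pivot turn magnitude (49 at speed=70)
    a = round(speed * 0.3)   # inner wheel on an arc (21 at speed=70)
    table = {
        0: (0, 0),            # STOP
        1: (speed, speed),    # FORWARD
        2: (-speed, -speed),  # BACKWARD
        3: (-t, t),           # TURN LEFT (pivot in place)
        4: (t, -t),           # TURN RIGHT (pivot in place)
        5: (a, speed),        # FORWARD+LEFT arc
        6: (speed, a),        # FORWARD+RIGHT arc
        7: (-a, -speed),      # BACKWARD+LEFT arc
        8: (-speed, -a),      # BACKWARD+RIGHT arc
    }
    return table[action]
```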
## Architecture
The convolutional backbone is based on NVIDIA's PilotNet (from the "End to End Learning for Self-Driving Cars" paper), modified with batch normalization, ELU activations, and a classification head.
```
Layer                           Output Shape        Parameters
──────────────────────────────────────────────────────────────
Input                           (B, 3, 66, 200)          —
Conv2d(3→24, 5×5, stride=2)     (B, 24, 31, 98)       1,824
BatchNorm2d(24)                                          48
ELU                                                       —
Conv2d(24→36, 5×5, stride=2)    (B, 36, 14, 47)      21,636
BatchNorm2d(36)                                          72
ELU                                                       —
Conv2d(36→48, 5×5, stride=2)    (B, 48, 5, 22)       43,248
BatchNorm2d(48)                                          96
ELU                                                       —
Conv2d(48→64, 3×3, stride=1)    (B, 64, 3, 20)       27,712
BatchNorm2d(64)                                         128
ELU                                                       —
Conv2d(64→64, 3×3, stride=1)    (B, 64, 1, 18)       36,928
BatchNorm2d(64)                                         128
ELU                                                       —
Dropout2d(0.15)                                           —
Flatten                         (B, 1152)                 —
Dropout(0.35)                                             —
Linear(1152→64)                 (B, 64)              73,792
ELU                                                       —
Dropout(0.35)                                             —
Linear(64→9)                    (B, 9)                  585
──────────────────────────────────────────────────────────────
```
Total trainable parameters: ~206,000 (the per-layer counts above sum to 206,197)
Model file size: ~1–2 MB (.pth)
### Design Decisions
- **BatchNorm after every conv layer** – stabilizes training and allows higher learning rates without divergence
- **ELU instead of ReLU** – avoids dead neurons and produces smoother gradients, which matters when the model is small
- **Spatial Dropout2d (15%)** – drops entire feature maps instead of individual pixels, forcing the network to spread information across channels
- **Two-layer classification head with 35% dropout** – the bottleneck at 64 units forces compression and fights overfitting on small datasets
- **Kaiming initialization** – all conv and linear layers use He initialization (fan-out mode), which pairs well with ELU activations
- **Label smoothing (0.2)** – prevents the model from becoming overconfident on exact training labels. A STOP frame labeled as [1.0, 0.0, 0.0, ...] becomes [0.82, 0.02, 0.02, ...], which improves generalization
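Putting the layer table and design notes together, the network can be sketched in PyTorch as below. This is a sketch matching the documented architecture, not the project's `model.py`, which is the reference implementation.

```python
import torch
import torch.nn as nn

class ActionNet(nn.Module):
    """Sketch of ActionNet per the layer table above (66x200x3 in, 9 logits out)."""

    def __init__(self, num_actions: int = 9):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 24, 5, stride=2), nn.BatchNorm2d(24), nn.ELU(),
            nn.Conv2d(24, 36, 5, stride=2), nn.BatchNorm2d(36), nn.ELU(),
            nn.Conv2d(36, 48, 5, stride=2), nn.BatchNorm2d(48), nn.ELU(),
            nn.Conv2d(48, 64, 3), nn.BatchNorm2d(64), nn.ELU(),
            nn.Conv2d(64, 64, 3), nn.BatchNorm2d(64), nn.ELU(),
            nn.Dropout2d(0.15),                     # spatial dropout on feature maps
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                           # 64 * 1 * 18 = 1152 features
            nn.Dropout(0.35),
            nn.Linear(1152, 64), nn.ELU(),
            nn.Dropout(0.35),
            nn.Linear(64, num_actions),
        )
        # Kaiming (He) initialization, fan-out mode, for conv and linear layers
        for m in self.modules():
            if isinstance(m, (nn.Conv2d, nn.Linear)):
                nn.init.kaiming_normal_(m.weight, mode="fan_out")
                nn.init.zeros_(m.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))
```

Instantiated this way, the parameter count matches the table's per-layer sum (206,197).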
## Preprocessing
The full pipeline from raw camera frame to model input:
```
Raw 800×600 BGR frame from ESP32-CAM
        │
        ▼
Crop top 40% of the image
(removes ceiling, sky, and upper walls)
        │
        ▼
Convert BGR → RGB
        │
        ▼
Resize to 200×66 pixels
(using INTER_AREA interpolation)
        │
        ▼
ToTensor → normalize to [0, 1] float32
        │
        ▼
Final shape: [batch, 3, 66, 200]
```
The `crop_and_resize()` function in `trainer.py` performs this transformation. The exact same function is called during both training and inference (in `autopilot.py`) to guarantee consistency.
Why crop the top 40%? Because the camera is mounted on a low car pointing forward. The top portion of every frame shows ceiling, walls, or sky β none of which help the model decide where to steer. Removing it reduces noise and lets the model focus on the ground, obstacles, and path ahead.
## Training Configuration
| Parameter | Value | Notes |
|---|---|---|
| Optimizer | AdamW | weight_decay=5e-3 (decoupled weight decay) |
| Learning Rate | 0.001 | Peak rate, with OneCycleLR schedule |
| LR Schedule | OneCycleLR | 10% warmup, cosine anneal, div_factor=10 |
| Loss Function | CrossEntropyLoss | label_smoothing=0.2 |
| Batch Size | 32 | Fits comfortably in CPU memory |
| Gradient Clipping | max_norm=1.0 | Prevents gradient explosions |
| Early Stopping | 30 epochs patience | Monitored by validation accuracy |
| Class Balancing | WeightedRandomSampler | Inverse-frequency weights per class |
| Train/Val Split | 80% / 20% | Random split |
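The table above can be wired up in PyTorch roughly as follows. This is a sketch of the documented configuration; `epochs` and `steps_per_epoch` are illustrative placeholders, and the stand-in model substitutes for ActionNet.

```python
import torch
from torch import nn, optim

model = nn.Linear(10, 9)            # stand-in for ActionNet
epochs, steps_per_epoch = 100, 75   # placeholders; depend on your dataset

optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=5e-3)
scheduler = optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-3,                    # peak learning rate
    total_steps=epochs * steps_per_epoch,
    pct_start=0.1,                  # 10% warmup
    anneal_strategy="cos",          # cosine anneal after the peak
    div_factor=10,                  # start at max_lr / 10
)
criterion = nn.CrossEntropyLoss(label_smoothing=0.2)

# Per training step (sketch):
#   loss.backward()
#   nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()
```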
## Data Augmentation
Applied on-the-fly during training:
| Augmentation | Probability | Details |
|---|---|---|
| Horizontal flip | 50% | Action labels are mirrored (LEFT↔RIGHT) |
| Random shadow | 50% | Vertical band at random brightness (30–70%) |
| Random brightness | 50% | HSV V-channel scaled 0.6–1.4× |
| Gaussian blur | 30% | Kernel 3Γ3 or 5Γ5 |
| Random translation | 40% | Shift Β±10% in X and Y |
| Random erasing | 50% | Rectangular cutout on tensor |
The horizontal flip augmentation automatically swaps left/right action labels using a predefined mirror table, so the model never sees contradictory labels.
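The mirror table follows directly from the action indices: the pivots swap (3↔4), the arcs swap (5↔6, 7↔8), and the symmetric actions map to themselves. A minimal sketch of the flip-with-relabel step (the project keeps its own table alongside the augmentation code):

```python
import numpy as np

# Action indices per the table above; symmetric actions map to themselves.
MIRROR = {0: 0, 1: 1, 2: 2, 3: 4, 4: 3, 5: 6, 6: 5, 7: 8, 8: 7}

def hflip_sample(image: np.ndarray, action: int) -> tuple[np.ndarray, int]:
    """Flip an HxWxC image left-right and mirror its action label."""
    return image[:, ::-1].copy(), MIRROR[action]
```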
## Inference
At runtime, the autopilot module:
- Reads the latest camera frame from the MJPEG stream
- Runs `crop_and_resize()` and converts the result to a tensor
- Forward pass through ActionNet produces 9 logits
- Applies softmax and picks the action with the highest probability
- Uses a 3-frame majority vote to smooth out flickering predictions
- Maps the smoothed action to (left, right) motor commands at the configured speed
- Sends the command to the ESP8266 over WebSocket
The inference loop runs at 10 FPS on a typical laptop CPU. No GPU required.
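The majority-vote smoothing can be sketched with a small rolling window. This is a sketch only: the tie-breaking rule (here, falling back to the newest prediction when no strict majority exists) is an assumption, and the project's `autopilot.py` defines the actual behavior.

```python
from collections import Counter, deque

class MajorityVote:
    """Smooth per-frame action predictions with a rolling majority vote."""

    def __init__(self, window: int = 3):
        self.history = deque(maxlen=window)

    def update(self, action: int) -> int:
        self.history.append(action)
        best, count = Counter(self.history).most_common(1)[0]
        # Require a strict majority; otherwise pass the newest prediction
        # through (assumed tie-break, not confirmed from autopilot.py).
        return best if count > len(self.history) // 2 else action
```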
## Hardware Requirements
This model is designed for a specific hardware setup:
| Component | Role |
|---|---|
| ESP32-CAM (OV2640) | Streams 800×600 MJPEG video over HTTP |
| ESP8266 (NodeMCU) | Receives motor commands over WebSocket, drives L298N |
| L298N Motor Driver | Controls 2 DC gear motors (differential drive) |
| SG90 Servo (optional) | Camera pan |
| PC (any laptop/desktop) | Runs the server, training, and inference |
The PC does all the heavy lifting. The microcontrollers are just I/O: one for video, one for motors. Total hardware cost is around $25–30 USD.
## How to Use This Model
### Quickstart
```python
import torch
import torch.nn.functional as F
from torchvision import transforms
from model import ActionNet, action_to_command

# Load
device = torch.device("cpu")
model = ActionNet().to(device)
checkpoint = torch.load("trained_models/autopilot.pth", map_location=device)
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()

# Prepare a 66x200 RGB image as a tensor
transform = transforms.ToTensor()
img_tensor = transform(your_66x200_rgb_image).unsqueeze(0).to(device)

# Predict
with torch.no_grad():
    logits = model(img_tensor)
    probs = F.softmax(logits, dim=1)
    action = torch.argmax(probs, dim=1).item()
    confidence = probs[0, action].item()

# Convert to motor command
left, right = action_to_command(action, speed=70)
print(f"Action: {action}, Motors: L={left} R={right}, Confidence: {confidence:.1%}")
```
### Within the Full System
The model is used automatically by the autopilot module. Start the server, record some training data through the dashboard, train from the dashboard, then click "Start Autopilot."
See the full README for step-by-step instructions including hardware assembly, firmware upload, and data collection.
## Training Your Own Model
- Assemble the hardware (ESP8266 + ESP32-CAM + motors)
- Flash firmware to both microcontrollers
- Start the PC server: `python app.py`
- Drive the car manually while recording data
- Click "Train" in the dashboard, or trigger training through the API
- The best checkpoint saves automatically to `trained_models/autopilot.pth`
Training runs on CPU. A dataset of 3,000 frames trains in under 5 minutes on a modern laptop. GPU is supported if available but not required.
## Limitations
- The model only knows what it has seen. If you train it in one room, it won't generalize to a different room without additional data.
- Keyboard inputs produce jerky, discrete commands. A joystick or gamepad would produce smoother training data.
- The 40% top-crop assumes the camera is mounted pointing roughly forward and slightly down. If your camera angle is very different, adjust the crop ratio in `trainer.py`.
- Performance depends heavily on lighting conditions matching between training and inference.
- The model has no notion of obstacles, goals, or maps. It purely replicates the visual patterns it was trained on.
## Citation
If you use this project in your work, a mention is appreciated but not required:
OpenBot PC Server Project – Autonomous RC Car with Imitation Learning
https://github.com/YOUR_USERNAME/openbot-pc-server-project
## License
MIT License: use it, modify it, ship it.