ActionNet — Autonomous RC Car Driving Model

A lightweight classification CNN that drives a small RC car by predicting discrete motor actions from raw camera frames. Trained through imitation learning — a human drives the car while the system records frames and commands, then the model learns to replicate that behavior.

Part of the OpenBot PC Server Project.


Model Description

ActionNet classifies a single 66×200 RGB camera image into one of 9 discrete driving actions. It replaces the traditional regression approach (predicting continuous steering angles) because keyboard-driven training data only contains a handful of unique command pairs. Classification with cross-entropy loss handles this much better than mean-squared-error regression, which tends to average everything toward zero.

Input:  66×200×3 RGB image (cropped from 800×600, top 40% removed)
Output: probability distribution over 9 driving actions

The 9 Actions

Index  Action          Left Motor  Right Motor  Description
─────────────────────────────────────────────────────────────
0      STOP                  0          0       Both motors off
1      FORWARD             +70        +70       Straight ahead
2      BACKWARD            -70        -70       Straight reverse
3      TURN LEFT           -49        +49       Pivot left (in place)
4      TURN RIGHT          +49        -49       Pivot right (in place)
5      FORWARD+LEFT        +21        +70       Arc forward-left
6      FORWARD+RIGHT       +70        +21       Arc forward-right
7      BACKWARD+LEFT       -21        -70       Arc backward-left
8      BACKWARD+RIGHT      -70        -21       Arc backward-right

Motor values are shown at speed=70 and scale proportionally with the speed setting.
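Since the motor values scale proportionally with speed, the table reduces to a lookup of unit ratios. A hypothetical sketch (the real action_to_command lives in model.py; the ratios below are back-derived from the speed=70 values and are an assumption):

```python
# Unit (left, right) motor ratios per action index, derived from the
# speed=70 column above: 70 -> 1.0, 49 -> 0.7, 21 -> 0.3.
ACTIONS = {
    0: (0.0, 0.0),    # STOP
    1: (1.0, 1.0),    # FORWARD
    2: (-1.0, -1.0),  # BACKWARD
    3: (-0.7, 0.7),   # TURN LEFT (pivot)
    4: (0.7, -0.7),   # TURN RIGHT (pivot)
    5: (0.3, 1.0),    # FORWARD+LEFT (arc)
    6: (1.0, 0.3),    # FORWARD+RIGHT (arc)
    7: (-0.3, -1.0),  # BACKWARD+LEFT (arc)
    8: (-1.0, -0.3),  # BACKWARD+RIGHT (arc)
}

def action_to_command(action: int, speed: int = 70) -> tuple[int, int]:
    """Scale the unit ratios by the configured speed setting."""
    left, right = ACTIONS[action]
    return round(left * speed), round(right * speed)

print(action_to_command(5))  # (21, 70)
```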


Architecture

The convolutional backbone is based on NVIDIA's PilotNet (from the "End to End Learning for Self-Driving Cars" paper), modified with batch normalization, ELU activations, and a classification head.

Layer                          Output Shape      Parameters
─────────────────────────────────────────────────────────────
Input                          (B, 3, 66, 200)   —

Conv2d(3→24, 5×5, stride=2)    (B, 24, 31, 98)   1,824
BatchNorm2d(24)                                  48
ELU                                              —

Conv2d(24→36, 5×5, stride=2)   (B, 36, 14, 47)   21,636
BatchNorm2d(36)                                  72
ELU                                              —

Conv2d(36→48, 5×5, stride=2)   (B, 48, 5, 22)    43,248
BatchNorm2d(48)                                  96
ELU                                              —

Conv2d(48→64, 3×3, stride=1)   (B, 64, 3, 20)    27,712
BatchNorm2d(64)                                  128
ELU                                              —

Conv2d(64→64, 3×3, stride=1)   (B, 64, 1, 18)    36,928
BatchNorm2d(64)                                  128
ELU                                              —

Dropout2d(0.15)                                  —

Flatten                        (B, 1152)         —
Dropout(0.35)                                    —
Linear(1152→64)                (B, 64)           73,792
ELU                                              —
Dropout(0.35)                                    —
Linear(64→9)                   (B, 9)            585

─────────────────────────────────────────────────────────────
Total trainable parameters:    ~206,000
Model file size:               ~1–2 MB (.pth)
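The per-layer counts in the table follow directly from the kernel sizes and channel widths, and can be checked with a few lines of arithmetic:

```python
def conv_params(c_in, c_out, k):
    """Conv2d: c_out * c_in * k * k weights plus one bias per output channel."""
    return c_out * c_in * k * k + c_out

def bn_params(c):
    """BatchNorm2d learns a scale and a shift per channel."""
    return 2 * c

def linear_params(n_in, n_out):
    """Linear: weight matrix plus one bias per output unit."""
    return n_in * n_out + n_out

total = (
    conv_params(3, 24, 5) + bn_params(24) +
    conv_params(24, 36, 5) + bn_params(36) +
    conv_params(36, 48, 5) + bn_params(48) +
    conv_params(48, 64, 3) + bn_params(64) +
    conv_params(64, 64, 3) + bn_params(64) +
    linear_params(64 * 1 * 18, 64) +   # flatten: 64 channels x 1 x 18 = 1152
    linear_params(64, 9)
)
print(total)  # 206197
```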

Design Decisions

  • BatchNorm after every conv layer β€” stabilizes training and allows higher learning rates without divergence
  • ELU instead of ReLU β€” avoids dead neurons and produces smoother gradients, which matters when the model is small
  • Spatial Dropout2d (15%) β€” drops entire feature maps instead of individual pixels, forcing the network to spread information across channels
  • Two-layer classification head with 35% dropout β€” the bottleneck at 64 units forces compression and fights overfitting on small datasets
  • Kaiming initialization β€” all conv and linear layers use He initialization (fan-out mode), which pairs well with ELU activations
  • Label smoothing (0.2) β€” prevents the model from becoming overconfident on exact training labels. A STOP frame labeled as [1.0, 0.0, 0.0, ...] becomes [0.82, 0.02, 0.02, ...], which improves generalization

Preprocessing

The full pipeline from raw camera frame to model input:

Raw 800×600 BGR frame from ESP32-CAM
             │
             ▼
    Crop top 40% of the image
    (removes ceiling, sky, and upper walls)
             │
             ▼
    Convert BGR → RGB
             │
             ▼
    Resize to 200×66 pixels
    (using INTER_AREA interpolation)
             │
             ▼
    ToTensor → normalize to [0, 1] float32
             │
             ▼
    Final shape: [batch, 3, 66, 200]

The crop_and_resize() function in trainer.py performs this transformation. The exact same function is called during both training and inference (in autopilot.py) to guarantee consistency.

Why crop the top 40%? Because the camera is mounted on a low car pointing forward. The top portion of every frame shows ceiling, walls, or sky — none of which help the model decide where to steer. Removing it reduces noise and lets the model focus on the ground, obstacles, and path ahead.
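The pipeline above can be sketched in a few lines. This is a dependency-light stand-in: the real crop_and_resize() in trainer.py uses cv2.resize with INTER_AREA, which is replaced here by naive nearest-neighbor index sampling so the example needs only numpy:

```python
import numpy as np

def crop_and_resize(frame_bgr: np.ndarray, crop_top: float = 0.4,
                    out_w: int = 200, out_h: int = 66) -> np.ndarray:
    """Sketch of the preprocessing pipeline (cv2.INTER_AREA in the real code)."""
    h = frame_bgr.shape[0]
    cropped = frame_bgr[int(h * crop_top):]   # drop the top 40% of rows
    rgb = cropped[:, :, ::-1]                 # BGR -> RGB channel flip
    ys = np.arange(out_h) * rgb.shape[0] // out_h
    xs = np.arange(out_w) * rgb.shape[1] // out_w
    resized = rgb[ys][:, xs]                  # nearest-neighbor resize
    return resized.astype(np.float32) / 255.0 # normalize to [0, 1]

frame = np.zeros((600, 800, 3), dtype=np.uint8)  # a raw 800x600 frame
print(crop_and_resize(frame).shape)  # (66, 200, 3)
```

Transposing to [3, 66, 200] and adding the batch dimension is left to ToTensor/unsqueeze, as shown in the Quickstart.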


Training Configuration

Parameter          Value                  Notes
───────────────────────────────────────────────────────────────────────
Optimizer          AdamW                  weight_decay=5e-3 for L2 regularization
Learning Rate      0.001                  Peak rate, with OneCycleLR schedule
LR Schedule        OneCycleLR             10% warmup, cosine anneal, div_factor=10
Loss Function      CrossEntropyLoss       label_smoothing=0.2
Batch Size         32                     Fits comfortably in CPU memory
Gradient Clipping  max_norm=1.0           Prevents gradient explosions
Early Stopping     30 epochs patience     Monitored by validation accuracy
Class Balancing    WeightedRandomSampler  Inverse-frequency weights per class
Train/Val Split    80% / 20%              Random split
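The table maps to a PyTorch setup along these lines. This is a sketch, not trainer.py itself: the epoch count, dataset size, and class counts are assumed values, and a plain Linear layer stands in for ActionNet:

```python
import torch
from torch import nn
from torch.utils.data import WeightedRandomSampler

model = nn.Linear(10, 9)  # stand-in for ActionNet
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=5e-3)
steps_per_epoch, epochs = 100, 50  # assumed
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-3, total_steps=steps_per_epoch * epochs,
    pct_start=0.1, div_factor=10)  # 10% warmup, cosine anneal by default
criterion = nn.CrossEntropyLoss(label_smoothing=0.2)

# Class balancing: inverse-frequency weight per sample, so rare
# actions are drawn more often by the sampler.
class_counts = torch.tensor([500., 2000., 100., 200., 200., 300., 300., 50., 50.])
labels = torch.randint(0, 9, (1000,))
weights = (1.0 / class_counts)[labels]
sampler = WeightedRandomSampler(weights, num_samples=len(weights),
                                replacement=True)

# One training step with gradient clipping:
x, y = torch.randn(32, 10), torch.randint(0, 9, (32,))
loss = criterion(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
scheduler.step()
```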

Data Augmentation

Applied on-the-fly during training:

Augmentation        Probability  Details
──────────────────────────────────────────────────────────
Horizontal flip     50%          Action labels are mirrored (LEFT ↔ RIGHT)
Random shadow       50%          Vertical band at random brightness (30–70%)
Random brightness   50%          HSV V-channel scaled 0.6–1.4×
Gaussian blur       30%          Kernel 3×3 or 5×5
Random translation  40%          Shift ±10% in X and Y
Random erasing      50%          Rectangular cutout on tensor

The horizontal flip augmentation automatically swaps left/right action labels using a predefined mirror table, so the model never sees contradictory labels.
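The mirror table pairs each action with its left/right counterpart; straight and stop actions map to themselves. A sketch (the actual table lives in trainer.py; this reconstruction follows the action indices defined above):

```python
# Hypothetical mirror table keyed by action index:
# STOP/FORWARD/BACKWARD are symmetric; LEFT <-> RIGHT pairs swap.
MIRROR = {0: 0, 1: 1, 2: 2, 3: 4, 4: 3, 5: 6, 6: 5, 7: 8, 8: 7}

def flip_example(image_rows, action):
    """Flip each pixel row left-to-right and mirror the action label."""
    flipped = [row[::-1] for row in image_rows]
    return flipped, MIRROR[action]

_, label = flip_example([[1, 2, 3]], 5)  # FORWARD+LEFT
print(label)  # 6 (FORWARD+RIGHT)
```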


Inference

At runtime, the autopilot module:

  1. Reads the latest camera frame from the MJPEG stream
  2. Runs crop_and_resize() → converts to tensor
  3. Forward pass through ActionNet → gets 9 logits
  4. Applies softmax → picks the action with highest probability
  5. Uses a 3-frame majority vote to smooth out flickering predictions
  5. Uses a 3-frame majority vote to smooth out flickering predictions
  6. Maps the smoothed action to (left, right) motor commands at the configured speed
  7. Sends the command to the ESP8266 over WebSocket

The inference loop runs at 10 FPS on a typical laptop CPU. No GPU required.
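The 3-frame majority vote from step 5 could look like this (a minimal sketch; autopilot.py's actual smoothing may be implemented differently):

```python
from collections import Counter, deque

class MajorityVote:
    """Return the most common action over the last n frames."""
    def __init__(self, n: int = 3):
        self.history = deque(maxlen=n)  # oldest frame drops off automatically

    def update(self, action: int) -> int:
        self.history.append(action)
        return Counter(self.history).most_common(1)[0][0]

vote = MajorityVote(3)
# A single-frame flicker to TURN RIGHT (4) is suppressed:
print([vote.update(a) for a in [1, 1, 4, 1]])  # [1, 1, 1, 1]
```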


Hardware Requirements

This model is designed for a specific hardware setup:

Component                Role
──────────────────────────────────────────────────────────
ESP32-CAM (OV2640)       Streams 800×600 MJPEG video over HTTP
ESP8266 (NodeMCU)        Receives motor commands over WebSocket, drives L298N
L298N Motor Driver       Controls 2 DC gear motors (differential drive)
SG90 Servo (optional)    Camera pan
PC (any laptop/desktop)  Runs the server, training, and inference
The PC does all the heavy lifting. The microcontrollers are just I/O — one for video, one for motors. Total hardware cost is around $25–30 USD.


How to Use This Model

Quickstart

import torch
import torch.nn.functional as F
from torchvision import transforms
from model import ActionNet, action_to_command

# Load
device = torch.device("cpu")
model = ActionNet().to(device)
checkpoint = torch.load("trained_models/autopilot.pth", map_location=device)
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()

# Prepare a 66x200 RGB image as tensor
transform = transforms.ToTensor()
img_tensor = transform(your_66x200_rgb_image).unsqueeze(0).to(device)

# Predict
with torch.no_grad():
    logits = model(img_tensor)
    probs = F.softmax(logits, dim=1)
    action = torch.argmax(probs, dim=1).item()
    confidence = probs[0, action].item()

# Convert to motor command
left, right = action_to_command(action, speed=70)
print(f"Action: {action}, Motors: L={left} R={right}, Confidence: {confidence:.1%}")

Within the Full System

The model is used automatically by the autopilot module. Start the server, record some training data through the dashboard, train from the dashboard, then click "Start Autopilot."

See the full README for step-by-step instructions including hardware assembly, firmware upload, and data collection.


Training Your Own Model

  1. Assemble the hardware (ESP8266 + ESP32-CAM + motors)
  2. Flash firmware to both microcontrollers
  3. Start the PC server: python app.py
  4. Drive the car manually while recording data
  5. Click "Train" in the dashboard, or start training through the API
  6. The best checkpoint saves automatically to trained_models/autopilot.pth

Training runs on CPU. A dataset of 3,000 frames trains in under 5 minutes on a modern laptop. GPU is supported if available but not required.


Limitations

  • The model only knows what it has seen. If you train it in one room, it won't generalize to a different room without additional data.
  • Keyboard inputs produce jerky, discrete commands. A joystick or gamepad would produce smoother training data.
  • The 40% top-crop assumes the camera is mounted pointing roughly forward and slightly down. If your camera angle is very different, adjust the crop ratio in trainer.py.
  • Performance depends heavily on lighting conditions matching between training and inference.
  • The model has no notion of obstacles, goals, or maps. It purely replicates the visual patterns it was trained on.

Citation

If you use this project in your work, a mention is appreciated but not required:

OpenBot PC Server Project — Autonomous RC Car with Imitation Learning
https://github.com/YOUR_USERNAME/openbot-pc-server-project

License

MIT License — use it, modify it, ship it.
