BayesRL

non-profit

Research artifacts for variational / Bayesian approaches to reinforcement learning, centered on parameter-space exploration for RLVR.

Our current release accompanies the paper "Parameter Exploration for RLVR via Variational Learning", which introduces Perturbed Parameter Policy Optimization (3PO): a family of exploration strategies for Reinforcement Learning with Verifiable Rewards (RLVR). Rather than relying only on action-space heuristics (temperature, clipping, entropy bonuses), 3PO samples model weights from an approximate posterior learned with the variational optimizer IVON, turning the amount of weight noise into an explicit control lever for exploration.
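As a rough illustration of sampling weights from a diagonal Gaussian posterior, here is a minimal NumPy sketch. It is not taken from the c3po codebase. The function name, the toy values, and the variance formula `1 / (ess * (hess + weight_decay))` are assumptions based on how IVON-style optimizers are commonly described; the actual release may parameterize the noise differently.

```python
import numpy as np

def sample_perturbed_weights(mean, hess, ess, weight_decay, rng):
    """Draw one weight sample from a diagonal Gaussian N(mean, diag(sigma^2)).

    Assumption (not confirmed by the source): sigma^2 = 1 / (ess * (hess + weight_decay)).
    Under this form, a smaller `ess` yields more weight noise.
    """
    sigma = 1.0 / np.sqrt(ess * (hess + weight_decay))
    return mean + sigma * rng.standard_normal(mean.shape)

rng = np.random.default_rng(0)
mean = np.zeros(4)                       # hypothetical posterior mean
hess = np.array([1.0, 4.0, 9.0, 99.0])   # hypothetical diagonal Hessian estimate
theta = sample_perturbed_weights(mean, hess, ess=100.0, weight_decay=1.0, rng=rng)
```

In this sketch, `ess` plays the role of the exploration lever: lowering it widens the posterior and increases the perturbation applied to every weight, while coordinates with a large Hessian estimate are perturbed less.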

📦 Code: insait-institute/c3po

The 3PO family

| Variant | Brief Method Description |
|---|---|
| B3PO | One weight perturbation from the IVON posterior per gradient step, synced to the rollout engine. |
| M3PO | M Monte-Carlo perturbations per step; rollouts and advantages computed per sample, gradients averaged. |
| C3PO | Each GRPO group of G rollouts is split across N independent perturbations (G/N each); advantages are computed over the full, more-diverse group with a Seq-MIS importance-sampling correction. |
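
The group split used by the C3PO variant described above can be sketched as follows. This is a toy illustration under stated assumptions, not the released implementation: the helper names and the reward values are made up, the advantage is the usual GRPO-style standardization of rewards within a group, and the Seq-MIS importance-sampling correction is left out because its exact form is not given here.

```python
import numpy as np

def assign_perturbations(group_size, num_perturbations):
    """Map each of G rollouts to one of N perturbations, G/N rollouts each."""
    if group_size % num_perturbations != 0:
        raise ValueError("group_size must be divisible by num_perturbations")
    per_sample = group_size // num_perturbations
    return np.repeat(np.arange(num_perturbations), per_sample)

def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards over the full group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

G, N = 8, 4
owner = assign_perturbations(G, N)  # index of the perturbation that generated each rollout
rewards = np.array([1, 0, 0, 0, 1, 1, 0, 1], dtype=float)  # made-up verifier rewards
adv = group_advantages(rewards)     # computed over all G rollouts, not per perturbation
```

The point of the sketch is that `owner` only decides which perturbed weights generate each rollout, while the baseline in `group_advantages` is shared by the whole group of G rollouts.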

Collections

  • 3PO Models — Olmo-3 and Qwen2.5-Math 7B/8B checkpoints trained on DAPO-Math-17k with B3PO, M3PO, and C3PO (plus the M3PO+ and decoupled-MC ablations).
  • Warm-started Checkpoints — Olmo-3, Qwen2.5-Math, and Llama-3.1 base models SFT'd with IVON on the Nemotron Post-Training Dataset. IVON learns a posterior (mean + diagonal Hessian) that seeds the 3PO RL runs.

Models & data

Citation

@misc{venkatkrishna2026parameter,
      title={Parameter Exploration for RLVR via Variational Learning},
      author={Vatsal Venkatkrishna and Nico Daheim and Iryna Gurevych},
      year={2026},
}
