More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment
Paper • 2504.02193 • Published • 1
KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance
WiT: Waypoint Diffusion Transformers via Trajectory Conflict Navigation