Sneha7 committed
Commit 96640a6 · verified · 1 Parent(s): e4c07fc

Update README.md

Files changed (1)
  1. README.md +12 -34
README.md CHANGED
@@ -1,36 +1,14 @@
- # 🤝 GRPO Demo with Phi-2 — Helpfulness Reward
-
- This Space demonstrates **GRPO (Group Relative Policy Optimization)** —
- the RL method used in DeepSeek-R1 — applied to **helpfulness alignment**.
-
- ## 🚀 What happens here
-
- 1. You enter a prompt
- 2. Phi-2 generates an answer
- 3. A **helpfulness reward function** scores it:
- detail
- clear reasoning
- helpful intent
- no refusals
- 4. GRPO performs a policy gradient update:
- \[
- L = -R \log \pi + \beta \,\mathrm{KL}(\pi \,\|\, \pi_{\text{ref}})
- \]
- 5. The model improves step by step
- 6. You see a **reward curve** over time
-
- ## 🧩 Files
-
- `policy.py` — loads Phi-2
- `reward_fn.py` — helpfulness scoring
- `grpo_train.py` — GRPO update
- `app.py` — UI
- `requirements.txt`
-
  ---

- You can replace the reward with:
- - toxicity
- - factuality
- - math correctness
- - chain-of-thought quality
 
+ ---
+ title: Phi2 Helpfulness Grpo Demo
+ emoji: 🐨
+ colorFrom: pink
+ colorTo: indigo
+ sdk: gradio
+ sdk_version: 6.0.2
+ app_file: app.py
+ pinned: false
+ license: mit
+ short_description: phi2-helpfulness-grpo-demo
  ---

+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
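
For context on what this commit removes: the old README describes a **helpfulness reward function** (`reward_fn.py`) that scores detail, clear reasoning, helpful intent, and the absence of refusals. That file is not part of this diff, so the following is only a minimal sketch of what such a heuristic scorer could look like; the function name, regexes, and 0.25 weights are illustrative assumptions, not the repository's actual code.

```python
import re

def helpfulness_reward(response: str) -> float:
    """Hypothetical heuristic scorer in the spirit of the old README's
    criteria: detail, clear reasoning, helpful intent, no refusals.
    Returns a reward in [0, 1]."""
    score = 0.0
    # Detail: longer answers earn partial credit, capped at 0.25.
    score += min(len(response.split()) / 200.0, 1.0) * 0.25
    # Clear reasoning: reward discourse markers such as "because" or "step".
    if re.search(r"\b(because|therefore|step|first|then)\b", response, re.I):
        score += 0.25
    # Helpful intent: reward direct, user-facing phrasing.
    if re.search(r"\b(you can|here is|here's|to do this)\b", response, re.I):
        score += 0.25
    # No refusals: full credit only without boilerplate refusal openers.
    if not re.search(r"\b(i can't|i cannot|as an ai)\b", response, re.I):
        score += 0.25
    return score
```

Swapping in one of the alternative rewards the old README mentions (toxicity, factuality, math correctness, chain-of-thought quality) would just mean returning a different scalar from this function.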
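Similarly, step 4 of the removed README states the objective \( L = -R \log \pi + \beta \,\mathrm{KL}(\pi \,\|\, \pi_{\text{ref}}) \): a reward-weighted policy-gradient term plus a KL penalty that keeps the policy close to a frozen reference model. Since `grpo_train.py` is also absent from this diff, here is a minimal single-sample PyTorch sketch of that loss; the β and learning-rate values are assumptions, and a full GRPO implementation would additionally normalize rewards across a *group* of sampled completions (the "group relative" part), which this sketch omits.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

beta = 0.1  # assumed KL weight
tok = AutoTokenizer.from_pretrained("microsoft/phi-2")
policy = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
ref = AutoModelForCausalLM.from_pretrained("microsoft/phi-2").eval()
opt = torch.optim.AdamW(policy.parameters(), lr=1e-6)

def grpo_step(prompt: str, completion: str, reward: float) -> float:
    ids = tok(prompt + completion, return_tensors="pt").input_ids
    logits = policy(ids).logits[:, :-1]           # position t predicts token t+1
    with torch.no_grad():
        ref_logits = ref(ids).logits[:, :-1]      # frozen reference distribution
    logp = F.log_softmax(logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    targets = ids[:, 1:].unsqueeze(-1)
    tok_logp = logp.gather(-1, targets).squeeze(-1)   # log pi of sampled tokens
    # KL(pi || pi_ref) summed over the vocabulary, averaged over positions.
    kl = (logp.exp() * (logp - ref_logp)).sum(-1).mean()
    # L = -R * log pi + beta * KL, per the old README's formula.
    # (A real implementation would mask prompt tokens and batch a group of samples.)
    loss = -reward * tok_logp.mean() + beta * kl
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```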