Sneha7 committed
Commit 96640a6 · verified · 1 Parent(s): e4c07fc

Update README.md

Files changed (1)
  1. README.md +12 -34
README.md CHANGED
@@ -1,36 +1,14 @@
- # 🤝 GRPO Demo with Phi-2 — Helpfulness Reward
-
- This Space demonstrates **GRPO (Group Relative Policy Optimization)** —
- the RL method used in DeepSeek-R1 — applied to **helpfulness alignment**.
-
- ## 🚀 What happens here
-
- 1. You enter a prompt
- 2. Phi-2 generates an answer
- 3. A **helpfulness reward function** scores it:
- detail
- clear reasoning
- helpful intent
- no refusals
- 4. GRPO performs a policy gradient update:
- \[
- L = -R \log \pi + \beta \,\mathrm{KL}(\pi \,\|\, \pi_{\text{ref}})
- \]
- 5. The model improves step by step
- 6. You see a **reward curve** over time
-
- ## 🧩 Files
-
- `policy.py` — loads Phi-2
- `reward_fn.py` — helpfulness scoring
- `grpo_train.py` — GRPO update
- `app.py` — UI
- `requirements.txt`
-
  ---

- You can replace the reward with:
- - toxicity
- - factuality
- - math correctness
- - chain-of-thought quality
 
+ ---
+ title: Phi2 Helpfulness Grpo Demo
+ emoji: 🐨
+ colorFrom: pink
+ colorTo: indigo
+ sdk: gradio
+ sdk_version: 6.0.2
+ app_file: app.py
+ pinned: false
+ license: mit
+ short_description: phi2-helpfulness-grpo-demo
  ---

+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
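
For context on what this commit removes: the old README describes a **helpfulness reward function** (`reward_fn.py`) that scores detail, clear reasoning, helpful intent, and the absence of refusals. That file is not part of this diff, so the following is only a minimal sketch of what such a heuristic scorer could look like; the function name, regexes, and 0.25 weights are illustrative assumptions, not the repository's actual code.

```python
import re

def helpfulness_reward(response: str) -> float:
    """Hypothetical heuristic scorer in the spirit of the old README's
    criteria: detail, clear reasoning, helpful intent, no refusals.
    Returns a reward in [0, 1]."""
    score = 0.0
    # Detail: longer answers earn partial credit, capped at 0.25.
    score += min(len(response.split()) / 200.0, 1.0) * 0.25
    # Clear reasoning: reward discourse markers such as "because" or "step".
    if re.search(r"\b(because|therefore|step|first|then)\b", response, re.I):
        score += 0.25
    # Helpful intent: reward direct, user-facing phrasing.
    if re.search(r"\b(you can|here is|here's|to do this)\b", response, re.I):
        score += 0.25
    # No refusals: full credit only without boilerplate refusal openers.
    if not re.search(r"\b(i can't|i cannot|as an ai)\b", response, re.I):
        score += 0.25
    return score
```

Swapping in one of the alternative rewards the old README mentions (toxicity, factuality, math correctness, chain-of-thought quality) would just mean returning a different scalar from this function.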
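Similarly, step 4 of the removed README states the objective \( L = -R \log \pi + \beta \,\mathrm{KL}(\pi \,\|\, \pi_{\text{ref}}) \): a reward-weighted policy-gradient term plus a KL penalty that keeps the policy close to a frozen reference model. Since `grpo_train.py` is also absent from this diff, here is a minimal single-sample PyTorch sketch of that loss; the β and learning-rate values are assumptions, and a full GRPO implementation would additionally normalize rewards across a *group* of sampled completions (the "group relative" part), which this sketch omits.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

beta = 0.1  # assumed KL weight
tok = AutoTokenizer.from_pretrained("microsoft/phi-2")
policy = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
ref = AutoModelForCausalLM.from_pretrained("microsoft/phi-2").eval()
opt = torch.optim.AdamW(policy.parameters(), lr=1e-6)

def grpo_step(prompt: str, completion: str, reward: float) -> float:
    ids = tok(prompt + completion, return_tensors="pt").input_ids
    logits = policy(ids).logits[:, :-1]           # position t predicts token t+1
    with torch.no_grad():
        ref_logits = ref(ids).logits[:, :-1]      # frozen reference distribution
    logp = F.log_softmax(logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    targets = ids[:, 1:].unsqueeze(-1)
    tok_logp = logp.gather(-1, targets).squeeze(-1)   # log pi of sampled tokens
    # KL(pi || pi_ref) summed over the vocabulary, averaged over positions.
    kl = (logp.exp() * (logp - ref_logp)).sum(-1).mean()
    # L = -R * log pi + beta * KL, per the old README's formula.
    # (A real implementation would mask prompt tokens and batch a group of samples.)
    loss = -reward * tok_logp.mean() + beta * kl
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```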