bpHigh Claude Opus 4.7 (1M context) committed on
Commit 8d80d79 · 1 Parent(s): 4d2df85

Attribute hand-curated Round-1 tasks to Finch + list ALL 119 tasks in openenv.yaml

The Round-1 hand-curated xlsx tasks (task_1..task_10) were also drawn
from the Finch dataset; the prior README/dashboard/openenv.yaml language
made it sound like they were a separate source. Fixed:

- README task inventory table: split 'xlsx' row into two Finch sub-rows
(hand-curated Round 1 + stratified Round 2 pull) instead of treating
'Hand-curated' as if it were a non-Finch origin. Footer note now
states explicitly: 'All 60 xlsx tasks come from Finch (FinWorkBench).'

- Dashboard task inventory card label: '.xlsx (Finch + curated)' →
'.xlsx (Finch - 10 hand-curated + 50 stratified)'.

- openenv.yaml metadata.data_sources entry for Finch now reads
tasks: 60 (was 50) with breakdown: '10 hand-curated (Round 1) +
50 stratified pull (Round 2)'.

Also: enumerate ALL 119 tasks in openenv.yaml, not just the 32 (10 hand
+ 22 eval) it had before. File grew from 412 → 1,607 lines but is
fully comprehensive: every task has its family, primary_tag,
difficulty, task_type, max_steps, split, origin, and family-aware
grader description. The 87 train tasks were previously elided with a
'see manifest.jsonl' comment; now they're first-class entries.
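
A minimal sketch of how the per-task completeness claim could be checked, assuming each entry under tasks: parses into a plain dict with the field names listed above. `REQUIRED_KEYS` and `missing_fields` are hypothetical names used for illustration; they are not part of this repo, and the `primary_tag` value in the sample entry is a placeholder.

```python
# Hypothetical check: every task entry carries the fields named in the commit message.
REQUIRED_KEYS = {
    "id", "name", "family", "primary_tag", "difficulty",
    "task_type", "max_steps", "split", "origin", "grader",
}


def missing_fields(task: dict) -> set:
    """Return the names of required fields that are absent from one task entry."""
    missing = REQUIRED_KEYS - set(task)
    grader = task.get("grader")
    # The grader block is expected to carry a human-readable description.
    if isinstance(grader, dict) and "description" not in grader:
        missing.add("grader.description")
    return missing


# Illustrative entry modelled on a Round-1 task (primary_tag is a placeholder).
example = {
    "id": "task_1",
    "name": "Count Plants in Spreadsheet",
    "family": "xlsx",
    "primary_tag": "placeholder",
    "difficulty": "easy",
    "task_type": "QA",
    "max_steps": 15,
    "split": "train",
    "origin": "finch_hand_curated",
    "grader": {"type": "programmatic", "description": "QA (xlsx) grader"},
}
print(missing_fields(example))
```

Running the helper over every entry and collecting the non-empty results would flag any task that slipped through without, say, a split or origin.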

Sections within the tasks: list, in order:
- Hand-curated Finch (10) - split: train, origin: finch_hand_curated
- Train Finch (40) - split: train, origin: finch
- Train OSWorld (17) - split: train, origin: osworld
- Train PPTArena (30) - split: train, origin: pptarena
- Eval Finch (10) - split: eval, origin: finch
- Eval OSWorld (4) - split: eval, origin: osworld
- Eval PPTArena (8) - split: eval, origin: pptarena
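
The section order above can be sketched as a run-length check over the task list. This assumes each task dict exposes `split` and `origin` exactly as listed; `EXPECTED_SECTIONS` and `section_runs` are hypothetical helper names, not code from the repo.

```python
from itertools import groupby

# (split, origin, count) in the order the sections are said to appear.
EXPECTED_SECTIONS = [
    ("train", "finch_hand_curated", 10),
    ("train", "finch", 40),
    ("train", "osworld", 17),
    ("train", "pptarena", 30),
    ("eval", "finch", 10),
    ("eval", "osworld", 4),
    ("eval", "pptarena", 8),
]


def section_runs(tasks):
    """Collapse consecutive tasks sharing (split, origin) into (split, origin, count)."""
    runs = []
    for (split, origin), group in groupby(tasks, key=lambda t: (t["split"], t["origin"])):
        runs.append((split, origin, sum(1 for _ in group)))
    return runs
```

Because `groupby` only merges adjacent items, a task placed in the wrong section shows up as an extra run, so comparing `section_runs(tasks)` against `EXPECTED_SECTIONS` checks both ordering and per-section counts.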

Validated: yaml.safe_load parses cleanly; counts match expectations
(train=97, eval=22, xlsx=60, docx=21, pptx=38, total=119).
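
A sketch of what such a validation pass might look like, assuming openenv.yaml has a top-level tasks: list whose entries carry `split` and `family`. The `summarize` and `load_tasks` helpers and the `EXPECTED` dict are hypothetical; only `yaml.safe_load` and the counts themselves come from the message above.

```python
from collections import Counter

# Counts quoted in the commit message.
EXPECTED = {"train": 97, "eval": 22, "xlsx": 60, "docx": 21, "pptx": 38, "total": 119}


def summarize(tasks):
    """Count tasks per split and per family, plus the overall total."""
    counts = Counter()
    for task in tasks:
        counts[task["split"]] += 1
        counts[task["family"]] += 1
    counts["total"] = len(tasks)
    return dict(counts)


def load_tasks(path="openenv.yaml"):
    """Parse the YAML file and return its tasks list (requires PyYAML)."""
    import yaml  # imported lazily so summarize() works without PyYAML installed

    with open(path, encoding="utf-8") as handle:
        return yaml.safe_load(handle)["tasks"]
```

With the file on disk, `summarize(load_tasks()) == EXPECTED` would be the one-line form of the check described above.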

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (3)
  1. README.md +6 -4
  2. openenv.yaml +1259 -64
  3. server/app.py +1 -1
README.md CHANGED
@@ -128,14 +128,16 @@ and `sft_loss_curve.png`. Both runs above were generated this way.
 
  | Family | Source | Train | Eval | Total | What it tests |
  |---|---|---|---|---|---|
- | `xlsx` | Hand-curated (Round 1) | 10 | 0 | 10 | Diverse Finch tasks (QA + MODIFY) |
- | `xlsx` | [Finch](https://huggingface.co/datasets/FinWorkBench/Finch) | 40 | 10 | 50 | Stratified across 7 task-type tags |
+ | `xlsx` | [Finch](https://huggingface.co/datasets/FinWorkBench/Finch) - hand-curated (Round 1) | 10 | 0 | 10 | Diverse Finch tasks hand-picked for the original submission (QA + MODIFY mix) |
+ | `xlsx` | [Finch](https://huggingface.co/datasets/FinWorkBench/Finch) - stratified pull (Round 2) | 40 | 10 | 50 | Stratified across 7 task-type tags |
  | `docx` | [OSWorld-Verified](https://github.com/xlang-ai/OSWorld) (libreoffice_writer) | 17 | 4 | 21 | 16 distinct evaluator functions ported from `desktop_env/evaluators/metrics/docs.py` |
  | `pptx` | [PPTArena](https://github.com/michaelofengend/PPTArena) | 30 | 8 | 38 | 16 distinct edit_types, including singletons (transitions, animations, A/V) |
  | **Total** | | **97** | **22** | **119** | |
 
- The 22-task eval set is stratified - at least 1 task per tag bucket - so the
- benchmark isn't biased toward one task type.
+ All 60 xlsx tasks come from **Finch (FinWorkBench)** - the 10 hand-curated
+ Round-1 picks plus the 50 stratified Round-2 pull. The 22-task eval set is
+ stratified (at least 1 task per tag bucket per family) so the benchmark
+ isn't biased toward one task type.
 
  ---
 
openenv.yaml CHANGED
@@ -6,16 +6,15 @@ app: server.app:app
6
  port: 8000
7
 
8
  # Cross-format RL environment for office-document tasks.
9
- # Sources: FinWorkBench/Finch (xlsx) + OSWorld-Verified (docx) + PPTArena (pptx).
10
  #
11
- # 10 hand-curated xlsx + 50 Finch xlsx + 21 OSWorld docx + 38 PPTArena pptx
 
 
12
  # = 119 total tasks · 97 train + 22 eval.
13
  #
14
- # Round-1 baseline hand-curated tasks (task_1..task_10) are listed below
15
- # explicitly. The 22-task eval split (10 xlsx + 4 docx + 8 pptx) is also
16
- # enumerated for benchmark reproducibility. The remaining 87 training-only
17
- # tasks are loaded at runtime from data/manifest.jsonl - listing them all
18
- # here would balloon this file.
19
 
20
  metadata:
21
  total_tasks: 119
@@ -30,20 +29,23 @@ metadata:
30
  - name: Finch (FinWorkBench)
31
  url: https://huggingface.co/datasets/FinWorkBench/Finch
32
  family: xlsx
33
- tasks: 50
 
34
  - name: OSWorld-Verified (libreoffice_writer)
35
  url: https://github.com/xlang-ai/OSWorld
36
  family: docx
37
  tasks: 21
 
38
  - name: PPTArena
39
  url: https://github.com/michaelofengend/PPTArena
40
  family: pptx
41
  tasks: 38
 
42
  manifest_path: data/manifest.jsonl
43
 
44
  tasks:
45
 
46
- # ── Hand-curated Round-1 tasks (xlsx) ─────────────────────────
47
 
48
  - id: task_1
49
  name: 'Count Plants in Spreadsheet'
@@ -52,6 +54,8 @@ tasks:
52
  difficulty: easy
53
  task_type: QA
54
  max_steps: 15
 
 
55
  grader:
56
  type: programmatic
57
  description: "QA (xlsx) - extract numbers from agent's text answer, compare against reference value. 80% numeric match (5% tolerance) + 20% keyword overlap. Score 0.001-0.999."
@@ -63,6 +67,8 @@ tasks:
63
  difficulty: easy
64
  task_type: QA
65
  max_steps: 15
 
 
66
  grader:
67
  type: programmatic
68
  description: "QA (xlsx) - extract numbers from agent's text answer, compare against reference value. 80% numeric match (5% tolerance) + 20% keyword overlap. Score 0.001-0.999."
@@ -74,6 +80,8 @@ tasks:
74
  difficulty: medium
75
  task_type: QA
76
  max_steps: 15
 
 
77
  grader:
78
  type: programmatic
79
  description: "QA (xlsx) - extract numbers from agent's text answer, compare against reference value. 80% numeric match (5% tolerance) + 20% keyword overlap. Score 0.001-0.999."
@@ -85,6 +93,8 @@ tasks:
85
  difficulty: medium
86
  task_type: MODIFY
87
  max_steps: 15
 
 
88
  grader:
89
  type: programmatic
90
  description: 'MODIFY (xlsx) - 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
@@ -96,6 +106,8 @@ tasks:
96
  difficulty: hard
97
  task_type: MODIFY
98
  max_steps: 15
 
 
99
  grader:
100
  type: programmatic
101
  description: 'MODIFY (xlsx) - 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
@@ -107,6 +119,8 @@ tasks:
107
  difficulty: medium
108
  task_type: MODIFY
109
  max_steps: 15
 
 
110
  grader:
111
  type: programmatic
112
  description: 'MODIFY (xlsx) - 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
@@ -118,6 +132,8 @@ tasks:
118
  difficulty: medium
119
  task_type: MODIFY
120
  max_steps: 15
 
 
121
  grader:
122
  type: programmatic
123
  description: 'MODIFY (xlsx) - 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
@@ -129,6 +145,8 @@ tasks:
129
  difficulty: hard
130
  task_type: MODIFY
131
  max_steps: 15
 
 
132
  grader:
133
  type: programmatic
134
  description: 'MODIFY (xlsx) - 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
@@ -140,6 +158,8 @@ tasks:
140
  difficulty: hard
141
  task_type: MODIFY
142
  max_steps: 15
 
 
143
  grader:
144
  type: programmatic
145
  description: 'MODIFY (xlsx) - 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
@@ -151,11 +171,1150 @@ tasks:
151
  difficulty: medium
152
  task_type: MODIFY
153
  max_steps: 15
 
 
154
  grader:
155
  type: programmatic
156
  description: 'MODIFY (xlsx) - 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
157
 
158
- # ── Eval split - xlsx (10 tasks from Finch) ──────────────────
159
 
160
  - id: finch_10
161
  name: 'Calculation: Per the headers and established formula logic, populate form'
@@ -164,110 +1323,130 @@ tasks:
164
  difficulty: medium
165
  task_type: MODIFY
166
  max_steps: 15
 
 
167
  grader:
168
  type: programmatic
169
  description: 'MODIFY (xlsx) - 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
170
 
171
- - id: finch_112
172
- name: 'Cross-sheet/file Retrieval: For each record, use the Frequency to place the Rent amount '
173
  family: xlsx
174
- primary_tag: 'Cross-sheet/file Retrieval'
175
- difficulty: easy
176
  task_type: MODIFY
177
  max_steps: 15
 
 
178
  grader:
179
  type: programmatic
180
  description: 'MODIFY (xlsx) - 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
181
 
182
- - id: finch_122
183
- name: 'Summary / Visualization: Create a new sheet named “Exp by Fun Gen Support Chart5” and'
184
  family: xlsx
185
- primary_tag: 'Summary / Visualization'
186
- difficulty: easy
187
  task_type: MODIFY
188
  max_steps: 15
 
 
189
  grader:
190
  type: programmatic
191
  description: 'MODIFY (xlsx) - 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
192
 
193
- - id: finch_14
194
- name: 'Financial Modeling: Suppose we need to hold a 0.5-year AA(2) municipal investmen'
195
  family: xlsx
196
- primary_tag: 'Financial Modeling'
197
- difficulty: hard
198
  task_type: MODIFY
199
  max_steps: 15
 
 
200
  grader:
201
  type: programmatic
202
  description: 'MODIFY (xlsx) - 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
203
 
204
- - id: finch_154
205
- name: 'Data Entry / Import: Complete the missing Interreg co-financing data in the FR fi'
206
  family: xlsx
207
- primary_tag: 'Data Entry / Import'
208
  difficulty: medium
209
  task_type: MODIFY
210
  max_steps: 15
 
 
211
  grader:
212
  type: programmatic
213
  description: 'MODIFY (xlsx) - 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
214
 
215
- - id: finch_158
216
- name: 'Validation / Review: Audit the workbook and correct the formula errors in place s'
217
  family: xlsx
218
- primary_tag: 'Validation / Review'
219
- difficulty: hard
220
  task_type: MODIFY
221
  max_steps: 15
 
 
222
  grader:
223
  type: programmatic
224
  description: 'MODIFY (xlsx) - 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
225
 
226
- - id: finch_168
227
- name: 'Structuring / Formatting: Insert blank rows between adjacent tables in the workbook to'
228
  family: xlsx
229
- primary_tag: 'Structuring / Formatting'
230
- difficulty: medium
231
  task_type: MODIFY
232
  max_steps: 15
 
 
233
  grader:
234
  type: programmatic
235
  description: 'MODIFY (xlsx) - 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
236
 
237
- - id: finch_35
238
- name: 'Calculation: Summarize the volume and dollar imbalances that exist betwee'
239
  family: xlsx
240
- primary_tag: 'Calculation'
241
  difficulty: medium
242
  task_type: MODIFY
243
  max_steps: 15
 
 
244
  grader:
245
  type: programmatic
246
  description: 'MODIFY (xlsx) - 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
247
 
248
- - id: finch_38
249
- name: 'Calculation: Using the discount rate assumptions in the table and each Sh'
250
  family: xlsx
251
- primary_tag: 'Calculation'
252
- difficulty: medium
253
  task_type: MODIFY
254
  max_steps: 15
 
 
255
  grader:
256
  type: programmatic
257
  description: 'MODIFY (xlsx) - 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
258
 
259
- - id: finch_59
260
- name: 'Structuring / Formatting: Update the TOTAL PHYSICAL GAS tab to mirror the layout on TO'
261
  family: xlsx
262
  primary_tag: 'Structuring / Formatting'
263
  difficulty: medium
264
  task_type: MODIFY
265
  max_steps: 15
 
 
266
  grader:
267
  type: programmatic
268
  description: 'MODIFY (xlsx) - 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
269
 
270
- # ── Eval split - docx (4 tasks from OSWorld-Verified writer) ─
271
 
272
  - id: osworld_0810415c
273
  name: 'compare_line_spacing: Make the line spacing of first two paragraph into double lin'
@@ -276,9 +1455,11 @@ tasks:
276
  difficulty: medium
277
  task_type: MODIFY
278
  max_steps: 15
 
 
279
  grader:
280
  type: programmatic
281
- description: 'MODIFY (docx) - 3-layer: validity gate (python-docx parse) + 40% paragraph diff + 60% per-task OSWorld evaluator (compare_docx_files / check_tabstops / has_page_numbers_in_footers / etc.). Score 0.001-0.999.'
282
 
283
  - id: osworld_0a0faba3
284
  name: 'check_tabstops: I would like to make the first three words of the sentence l'
@@ -287,9 +1468,11 @@ tasks:
287
  difficulty: medium
288
  task_type: MODIFY
289
  max_steps: 15
 
 
290
  grader:
291
  type: programmatic
292
- description: 'MODIFY (docx) - 3-layer: validity gate (python-docx parse) + 40% paragraph diff + 60% per-task OSWorld evaluator (compare_docx_files / check_tabstops / has_page_numbers_in_footers / etc.). Score 0.001-0.999.'
293
 
294
  - id: osworld_0b17a146
295
  name: 'compare_docx_files: Help me change the 2 in "H2O" to a subscript.'
@@ -298,9 +1481,11 @@ tasks:
298
  difficulty: medium
299
  task_type: MODIFY
300
  max_steps: 15
 
 
301
  grader:
302
  type: programmatic
303
- description: 'MODIFY (docx) - 3-layer: validity gate (python-docx parse) + 40% paragraph diff + 60% per-task OSWorld evaluator (compare_docx_files / check_tabstops / has_page_numbers_in_footers / etc.). Score 0.001-0.999.'
304
 
305
  - id: osworld_66399b0d
306
  name: 'compare_docx_tables: Could you help me insert a 7(columns)*5(rows) empty table at'
@@ -309,11 +1494,13 @@ tasks:
309
  difficulty: medium
310
  task_type: MODIFY
311
  max_steps: 15
 
 
312
  grader:
313
  type: programmatic
314
- description: 'MODIFY (docx) - 3-layer: validity gate (python-docx parse) + 40% paragraph diff + 60% per-task OSWorld evaluator (compare_docx_files / check_tabstops / has_page_numbers_in_footers / etc.). Score 0.001-0.999.'
315
 
316
- # ── Eval split - pptx (8 tasks from PPTArena) ────────────────
317
 
318
  - id: pptarena_case_26_match_slide_colors_to_theme
319
  name: 'Theme & Background: Case 26: Match Slide Colors to Theme'
@@ -322,9 +1509,11 @@ tasks:
322
  difficulty: medium
323
  task_type: MODIFY
324
  max_steps: 15
 
 
325
  grader:
326
  type: programmatic
327
- description: 'MODIFY (pptx) - 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% position + 20% size). Score 0.001-0.999.'
328
 
329
  - id: pptarena_case_32_arrange_image_and_text
330
  name: 'Images & Pictures: Case 32: Arrange Image and Text'
@@ -333,9 +1522,11 @@ tasks:
333
  difficulty: medium
334
  task_type: MODIFY
335
  max_steps: 15
 
 
336
  grader:
337
  type: programmatic
338
- description: 'MODIFY (pptx) - 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% position + 20% size). Score 0.001-0.999.'
339
 
340
  - id: pptarena_case_35_structural_fix
341
  name: 'Text & Typography: Case 35: Structural Fix'
@@ -344,9 +1535,11 @@ tasks:
344
  difficulty: medium
345
  task_type: MODIFY
346
  max_steps: 15
 
 
347
  grader:
348
  type: programmatic
349
- description: 'MODIFY (pptx) - 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% position + 20% size). Score 0.001-0.999.'
350
 
351
  - id: pptarena_case_36_add_speaker_notes
352
  name: 'Slide/Section Management & Footers: Case 36: Add Speaker Notes'
@@ -355,9 +1548,11 @@ tasks:
355
  difficulty: medium
356
  task_type: MODIFY
357
  max_steps: 15
 
 
358
  grader:
359
  type: programmatic
360
- description: 'MODIFY (pptx) - 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% position + 20% size). Score 0.001-0.999.'
361
 
362
  - id: pptarena_case_40_hindu_center_titles
363
  name: 'Text & Typography: Case 40: Hindu Center Titles'
@@ -366,9 +1561,11 @@ tasks:
366
  difficulty: medium
367
  task_type: MODIFY
368
  max_steps: 15
 
 
369
  grader:
370
  type: programmatic
371
- description: 'MODIFY (pptx) - 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% position + 20% size). Score 0.001-0.999.'
372
 
373
  - id: pptarena_case_49_normalize_thousand_separators
374
  name: 'Tables: Case 49: Normalize Thousand Separators'
@@ -377,9 +1574,11 @@ tasks:
377
  difficulty: medium
378
  task_type: MODIFY
379
  max_steps: 15
 
 
380
  grader:
381
  type: programmatic
382
- description: 'MODIFY (pptx) - 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% position + 20% size). Score 0.001-0.999.'
383
 
384
  - id: pptarena_case_60_fix_text_placement
385
  name: 'Alignment, Distribution & Z-order: Case 60: Fix Text Placement'
@@ -388,9 +1587,11 @@ tasks:
388
  difficulty: medium
389
  task_type: MODIFY
390
  max_steps: 15
 
 
391
  grader:
392
  type: programmatic
393
- description: 'MODIFY (pptx) - 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% position + 20% size). Score 0.001-0.999.'
394
 
395
  - id: pptarena_case_7_update_quarter_two_data_b
396
  name: 'Charts: Case 7: Update Quarter Two Data (B)'
@@ -399,14 +1600,8 @@ tasks:
399
  difficulty: medium
400
  task_type: MODIFY
401
  max_steps: 15
 
 
402
  grader:
403
  type: programmatic
404
- description: 'MODIFY (pptx) - 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% position + 20% size). Score 0.001-0.999.'
405
-
406
- # ── Train-only tasks (not enumerated; total 87) ───────────────────
407
- # 40 Finch xlsx · 17 OSWorld docx · 30 PPTArena pptx
408
- # Loaded at runtime from data/manifest.jsonl. Use the Python API:
409
- #
410
- # from tasks import TASKS, split_ids
411
- # train_ids = split_ids('train')
412
- # eval_ids = split_ids('eval')
 
6
  port: 8000
7
 
8
  # Cross-format RL environment for office-document tasks.
 
9
  #
10
+ # 60 xlsx (Finch - 10 hand-curated Round-1 + 50 stratified Round-2 pull) ·
11
+ # 21 docx (OSWorld-Verified libreoffice_writer subset) ·
12
+ # 38 pptx (PPTArena evaluation_pairs_refined.json subset)
13
  # = 119 total tasks · 97 train + 22 eval.
14
  #
15
+ # All 119 tasks are enumerated below. Round-1 hand-curated tasks have IDs
16
+ # task_1..task_10; Round-2 stratified pull uses finch_*; OSWorld uses
17
+ # osworld_<uuid8>; PPTArena uses pptarena_<slug>.
 
 
18
 
19
  metadata:
20
  total_tasks: 119
 
29
  - name: Finch (FinWorkBench)
30
  url: https://huggingface.co/datasets/FinWorkBench/Finch
31
  family: xlsx
32
+ tasks: 60
33
+ breakdown: "10 hand-curated (Round 1) + 50 stratified pull (Round 2)"
34
  - name: OSWorld-Verified (libreoffice_writer)
35
  url: https://github.com/xlang-ai/OSWorld
36
  family: docx
37
  tasks: 21
38
+ breakdown: "21 strict-docx (skipping 1 .odt + 1 .pdf input)"
39
  - name: PPTArena
40
  url: https://github.com/michaelofengend/PPTArena
41
  family: pptx
42
  tasks: 38
43
+ breakdown: "38 stratified across 16 edit_types incl. all 5 long-tail singletons"
44
  manifest_path: data/manifest.jsonl
45
 
46
  tasks:
47
 
48
+ # ── Hand-curated Finch tasks (Round 1) ────────────────────────
49
 
50
  - id: task_1
51
  name: 'Count Plants in Spreadsheet'
 
54
  difficulty: easy
55
  task_type: QA
56
  max_steps: 15
57
+ split: train
58
+ origin: finch_hand_curated
59
  grader:
60
  type: programmatic
61
  description: "QA (xlsx) - extract numbers from agent's text answer, compare against reference value. 80% numeric match (5% tolerance) + 20% keyword overlap. Score 0.001-0.999."
 
67
  difficulty: easy
68
  task_type: QA
69
  max_steps: 15
70
+ split: train
71
+ origin: finch_hand_curated
72
  grader:
73
  type: programmatic
74
  description: "QA (xlsx) - extract numbers from agent's text answer, compare against reference value. 80% numeric match (5% tolerance) + 20% keyword overlap. Score 0.001-0.999."
 
80
  difficulty: medium
81
  task_type: QA
82
  max_steps: 15
83
+ split: train
84
+ origin: finch_hand_curated
85
  grader:
86
  type: programmatic
87
  description: "QA (xlsx) - extract numbers from agent's text answer, compare against reference value. 80% numeric match (5% tolerance) + 20% keyword overlap. Score 0.001-0.999."
 
93
  difficulty: medium
94
  task_type: MODIFY
95
  max_steps: 15
96
+ split: train
97
+ origin: finch_hand_curated
98
  grader:
99
  type: programmatic
100
  description: 'MODIFY (xlsx) - 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
 
106
  difficulty: hard
107
  task_type: MODIFY
108
  max_steps: 15
109
+ split: train
110
+ origin: finch_hand_curated
111
  grader:
112
  type: programmatic
113
  description: 'MODIFY (xlsx) - 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
 
119
  difficulty: medium
120
  task_type: MODIFY
121
  max_steps: 15
122
+ split: train
123
+ origin: finch_hand_curated
124
  grader:
125
  type: programmatic
126
  description: 'MODIFY (xlsx) - 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
 
132
  difficulty: medium
133
  task_type: MODIFY
134
  max_steps: 15
135
+ split: train
136
+ origin: finch_hand_curated
137
  grader:
138
  type: programmatic
139
  description: 'MODIFY (xlsx) - 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
 
145
  difficulty: hard
146
  task_type: MODIFY
147
  max_steps: 15
148
+ split: train
149
+ origin: finch_hand_curated
150
  grader:
151
  type: programmatic
152
  description: 'MODIFY (xlsx) - 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
 
158
  difficulty: hard
159
  task_type: MODIFY
160
  max_steps: 15
161
+ split: train
162
+ origin: finch_hand_curated
163
  grader:
164
  type: programmatic
165
  description: 'MODIFY (xlsx) - 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
 
171
  difficulty: medium
172
  task_type: MODIFY
173
  max_steps: 15
174
+ split: train
175
+ origin: finch_hand_curated
176
  grader:
177
  type: programmatic
178
  description: 'MODIFY (xlsx) - 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
179
 
180
+ # ── Train split - Finch xlsx (40 tasks, stratified Round-2 pull) ─
181
+
182
+ - id: finch_6
183
+ name: 'Calculation: Please write a structured economic analysis report based on '
184
+ family: xlsx
185
+ primary_tag: 'Calculation'
186
+ difficulty: medium
187
+ task_type: MODIFY
188
+ max_steps: 15
189
+ split: train
190
+ origin: finch
191
+ grader:
192
+ type: programmatic
193
+ description: 'MODIFY (xlsx) - 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
194
+
195
+ - id: finch_15
196
+ name: 'Structuring / Formatting: Translate all Chinese text in this Excel workbook (including'
197
+ family: xlsx
198
+ primary_tag: 'Structuring / Formatting'
199
+ difficulty: medium
200
+ task_type: MODIFY
201
+ max_steps: 15
202
+ split: train
203
+ origin: finch
204
+ grader:
205
+ type: programmatic
206
+ description: 'MODIFY (xlsx) - 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
207
+
208
+ - id: finch_22
209
+ name: 'Validation / Review: Please review the pivot table on the Replacement Cost sheet '
210
+ family: xlsx
211
+ primary_tag: 'Validation / Review'
212
+ difficulty: hard
213
+ task_type: MODIFY
214
+ max_steps: 15
215
+ split: train
216
+ origin: finch
217
+ grader:
218
+ type: programmatic
219
+ description: 'MODIFY (xlsx) - 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
220
+
221
+ - id: finch_32
222
+ name: 'Data Entry / Import: Please prepare a summary of all groups and staffing as of Ma'
223
+ family: xlsx
224
+ primary_tag: 'Data Entry / Import'
225
+ difficulty: medium
226
+ task_type: MODIFY
227
+ max_steps: 15
228
+ split: train
229
+ origin: finch
230
+ grader:
231
+ type: programmatic
232
+ description: 'MODIFY (xlsx) - 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
233
+
234
+ - id: finch_33
235
+ name: 'Structuring / Formatting: Gather Enron North America’s Mid Year 2001 performance acros'
236
+ family: xlsx
237
+ primary_tag: 'Structuring / Formatting'
238
+ difficulty: medium
239
+ task_type: MODIFY
240
+ max_steps: 15
241
+ split: train
242
+ origin: finch
243
+ grader:
244
+ type: programmatic
245
+ description: 'MODIFY (xlsx) - 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
246
+
247
+ - id: finch_47
248
+ name: 'Calculation: Complete the Income Statement (Purchase method) by calculati'
249
+ family: xlsx
250
+ primary_tag: 'Calculation'
251
+ difficulty: medium
252
+ task_type: MODIFY
253
+ max_steps: 15
254
+ split: train
255
+ origin: finch
256
+ grader:
257
+ type: programmatic
258
+ description: 'MODIFY (xlsx) - 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
259
+
260
+ - id: finch_55
261
+ name: 'Summary / Visualization: On the correl_graph sheet, create a time-series line chart c'
262
+ family: xlsx
263
+ primary_tag: 'Summary / Visualization'
264
+ difficulty: easy
265
+ task_type: MODIFY
266
+ max_steps: 15
267
+ split: train
268
+ origin: finch
269
+ grader:
270
+ type: programmatic
271
+ description: 'MODIFY (xlsx) - 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
272
+
273
+ - id: finch_62
274
+ name: "Calculation: For EDF MAN, clear the 'Line of Credit Covering Initial Marg"
275
+ family: xlsx
276
+ primary_tag: 'Calculation'
277
+ difficulty: medium
278
+ task_type: MODIFY
279
+ max_steps: 15
280
+ split: train
281
+ origin: finch
282
+ grader:
283
+ type: programmatic
284
+ description: 'MODIFY (xlsx) - 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
285
+
286
+ - id: finch_63
287
+ name: 'Cross-sheet/file Retrieval: Using RepIS-Qtrly as the base, please create the RepIS-Annua'
288
+ family: xlsx
289
+ primary_tag: 'Cross-sheet/file Retrieval'
290
+ difficulty: easy
291
+ task_type: MODIFY
292
+ max_steps: 15
293
+ split: train
294
+ origin: finch
295
+ grader:
296
+ type: programmatic
297
+ description: 'MODIFY (xlsx) - 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
298
+
299
+ - id: finch_65
300
+ name: 'Validation / Review: Review the Inv & WC Value Adj summary tab and add the missin'
301
+ family: xlsx
302
+ primary_tag: 'Validation / Review'
303
+ difficulty: hard
304
+ task_type: MODIFY
305
+ max_steps: 15
306
+ split: train
307
+ origin: finch
308
+ grader:
309
+ type: programmatic
310
+ description: 'MODIFY (xlsx) - 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
311
+
312
+ - id: finch_66
313
+ name: 'Calculation: Calculate the Interest Payment fpr enron and fill the corren'
314
+ family: xlsx
315
+ primary_tag: 'Calculation'
316
+ difficulty: medium
317
+ task_type: MODIFY
318
+ max_steps: 15
319
+ split: train
320
+ origin: finch
321
+ grader:
322
+ type: programmatic
323
+ description: 'MODIFY (xlsx) - 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
324
+
325
+ - id: finch_68
326
+ name: 'Cross-sheet/file Retrieval: Complete the Summary worksheet by entering the missing data '
327
+ family: xlsx
328
+ primary_tag: 'Cross-sheet/file Retrieval'
329
+ difficulty: easy
330
+ task_type: MODIFY
331
+ max_steps: 15
332
+ split: train
333
+ origin: finch
334
+ grader:
335
+ type: programmatic
336
+ description: 'MODIFY (xlsx) - 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
337
+
+  - id: finch_76
+    name: 'Structuring / Formatting: Reformat the table by bolding the titles and inserting row b'
+    family: xlsx
+    primary_tag: 'Structuring / Formatting'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: finch
+    grader:
+      type: programmatic
+      description: 'MODIFY (xlsx) — 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
+
+  - id: finch_77
+    name: 'Calculation: Calculate the headcount for each of the three groups in the '
+    family: xlsx
+    primary_tag: 'Calculation'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: finch
+    grader:
+      type: programmatic
+      description: 'MODIFY (xlsx) — 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
+
+  - id: finch_78
+    name: 'Cross-sheet/file Retrieval: Review the summary tab against each of the individual sheets'
+    family: xlsx
+    primary_tag: 'Cross-sheet/file Retrieval'
+    difficulty: easy
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: finch
+    grader:
+      type: programmatic
+      description: 'MODIFY (xlsx) — 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
+
+  - id: finch_82
+    name: "Structuring / Formatting: On the 'simplecorr' sheet, create a table whose column heade"
+    family: xlsx
+    primary_tag: 'Structuring / Formatting'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: finch
+    grader:
+      type: programmatic
+      description: 'MODIFY (xlsx) — 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
+
+  - id: finch_86
+    name: 'Data Entry / Import: Complete the asset allocation schedule using the provided as'
+    family: xlsx
+    primary_tag: 'Data Entry / Import'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: finch
+    grader:
+      type: programmatic
+      description: 'MODIFY (xlsx) — 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
+
+  - id: finch_90
+    name: 'Structuring / Formatting: Add a top border to all values in the Summary tab that are c'
+    family: xlsx
+    primary_tag: 'Structuring / Formatting'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: finch
+    grader:
+      type: programmatic
+      description: 'MODIFY (xlsx) — 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
+
+  - id: finch_93
+    name: 'Calculation: Complete both the Flat and Peak tables by using the provided'
+    family: xlsx
+    primary_tag: 'Calculation'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: finch
+    grader:
+      type: programmatic
+      description: 'MODIFY (xlsx) — 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
+
+  - id: finch_98
+    name: 'Data Entry / Import: Use publicly available market/financial data to populate She'
+    family: xlsx
+    primary_tag: 'Data Entry / Import'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: finch
+    grader:
+      type: programmatic
+      description: 'MODIFY (xlsx) — 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
+
+  - id: finch_99
+    name: 'Structuring / Formatting: Based on the Canada – Non-Commercial roster, prepare a headc'
+    family: xlsx
+    primary_tag: 'Structuring / Formatting'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: finch
+    grader:
+      type: programmatic
+      description: 'MODIFY (xlsx) — 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
+
+  - id: finch_105
+    name: 'Data Entry / Import: Add the 2/11/2000 column on the Feb 00 tab by mirroring the '
+    family: xlsx
+    primary_tag: 'Data Entry / Import'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: finch
+    grader:
+      type: programmatic
+      description: 'MODIFY (xlsx) — 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
+
+  - id: finch_109
+    name: 'Calculation: Calculate the total FTE percentage by region and by business'
+    family: xlsx
+    primary_tag: 'Calculation'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: finch
+    grader:
+      type: programmatic
+      description: 'MODIFY (xlsx) — 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
+
+  - id: finch_124
+    name: 'Cross-sheet/file Retrieval: Complete the content in the summary sheet based on other spr'
+    family: xlsx
+    primary_tag: 'Cross-sheet/file Retrieval'
+    difficulty: easy
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: finch
+    grader:
+      type: programmatic
+      description: 'MODIFY (xlsx) — 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
+
+  - id: finch_125
+    name: 'Calculation: You are given an Excel table (Figure 1.19) showing, for IDA-'
+    family: xlsx
+    primary_tag: 'Calculation'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: finch
+    grader:
+      type: programmatic
+      description: 'MODIFY (xlsx) — 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
+
+  - id: finch_128
+    name: 'Summary / Visualization: Prepare a stacked area chart titled "Existing and Proposed D'
+    family: xlsx
+    primary_tag: 'Summary / Visualization'
+    difficulty: easy
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: finch
+    grader:
+      type: programmatic
+      description: 'MODIFY (xlsx) — 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
+
+  - id: finch_129
+    name: 'Summary / Visualization: Create a stacked area chart titled “Rolling 55 Day Payables '
+    family: xlsx
+    primary_tag: 'Summary / Visualization'
+    difficulty: easy
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: finch
+    grader:
+      type: programmatic
+      description: 'MODIFY (xlsx) — 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
+
+  - id: finch_133
+    name: 'Validation / Review: Audit the consolidated 2002 plan workbook and correct the fo'
+    family: xlsx
+    primary_tag: 'Validation / Review'
+    difficulty: hard
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: finch
+    grader:
+      type: programmatic
+      description: 'MODIFY (xlsx) — 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
+
+  - id: finch_138
+    name: 'Structuring / Formatting: Add a new worksheet titled “P&C” and build a Property & Casu'
+    family: xlsx
+    primary_tag: 'Structuring / Formatting'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: finch
+    grader:
+      type: programmatic
+      description: 'MODIFY (xlsx) — 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
+
+  - id: finch_139
+    name: 'Financial Modeling: Using the Cleburne Plant Damage Sensitivities, evaluate the '
+    family: xlsx
+    primary_tag: 'Financial Modeling'
+    difficulty: hard
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: finch
+    grader:
+      type: programmatic
+      description: 'MODIFY (xlsx) — 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
+
+  - id: finch_142
+    name: 'Calculation: Under the assumptions of Scenario 1, calculate and populate—'
+    family: xlsx
+    primary_tag: 'Calculation'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: finch
+    grader:
+      type: programmatic
+      description: 'MODIFY (xlsx) — 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
+
+  - id: finch_143
+    name: 'Calculation: Apply Scenario 2 to calculate the current positions, the 30-'
+    family: xlsx
+    primary_tag: 'Calculation'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: finch
+    grader:
+      type: programmatic
+      description: 'MODIFY (xlsx) — 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
+
+  - id: finch_144
+    name: 'Calculation: Using the daily Crude Oil and Natural Gas prices recorded in'
+    family: xlsx
+    primary_tag: 'Calculation'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: finch
+    grader:
+      type: programmatic
+      description: 'MODIFY (xlsx) — 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
+
+  - id: finch_147
+    name: 'Data Entry / Import: Fill in the cells highlighted with a blue background, and th'
+    family: xlsx
+    primary_tag: 'Data Entry / Import'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: finch
+    grader:
+      type: programmatic
+      description: 'MODIFY (xlsx) — 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
+
+  - id: finch_153
+    name: 'Structuring / Formatting: On the correlation sheet, add derived columns from the BSCTM'
+    family: xlsx
+    primary_tag: 'Structuring / Formatting'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: finch
+    grader:
+      type: programmatic
+      description: 'MODIFY (xlsx) — 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
+
+  - id: finch_155
+    name: 'Validation / Review: Revise the data of 2002 allocation in HR sheet to reflect th'
+    family: xlsx
+    primary_tag: 'Validation / Review'
+    difficulty: hard
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: finch
+    grader:
+      type: programmatic
+      description: 'MODIFY (xlsx) — 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
+
+  - id: finch_159
+    name: 'Structuring / Formatting: Complete the orange-highlighted cells on the Timing Tracking'
+    family: xlsx
+    primary_tag: 'Structuring / Formatting'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: finch
+    grader:
+      type: programmatic
+      description: 'MODIFY (xlsx) — 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
+
+  - id: finch_166
+    name: 'Calculation: Finalize the Position Sensitivities for Gas (in US$) by calc'
+    family: xlsx
+    primary_tag: 'Calculation'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: finch
+    grader:
+      type: programmatic
+      description: 'MODIFY (xlsx) — 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
+
+  - id: finch_167
+    name: 'Financial Modeling: Based on the assumptions in the table, build out a complete '
+    family: xlsx
+    primary_tag: 'Financial Modeling'
+    difficulty: hard
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: finch
+    grader:
+      type: programmatic
+      description: 'MODIFY (xlsx) — 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
+
+  - id: finch_170
+    name: 'Calculation: According to the specifications in the Strips sheet, aggrega'
+    family: xlsx
+    primary_tag: 'Calculation'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: finch
+    grader:
+      type: programmatic
+      description: 'MODIFY (xlsx) — 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
+
+  # ── Train split — OSWorld docx (17 tasks) ────────────────────
+
+  - id: osworld_0e47de2a
+    name: 'has_page_numbers_in_footers: Add page number for every page at the bottom left'
+    family: docx
+    primary_tag: 'has_page_numbers_in_footers'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: osworld
+    grader:
+      type: programmatic
+      description: 'MODIFY (docx) — 3-layer: validity gate (python-docx parse) + 40% paragraph diff + 60% per-task OSWorld evaluator. Score 0.001-0.999.'
+
+  - id: osworld_0e763496
+    name: 'compare_font_names: Change the font to "Times New Roman" throughout the text.'
+    family: docx
+    primary_tag: 'compare_font_names'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: osworld
+    grader:
+      type: programmatic
+      description: 'MODIFY (docx) — 3-layer: validity gate (python-docx parse) + 40% paragraph diff + 60% per-task OSWorld evaluator. Score 0.001-0.999.'
+
+  - id: osworld_3ef2b351
+    name: 'is_first_line_centered: Help me center align the heading in LibreOffice.'
+    family: docx
+    primary_tag: 'is_first_line_centered'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: osworld
+    grader:
+      type: programmatic
+      description: 'MODIFY (docx) — 3-layer: validity gate (python-docx parse) + 40% paragraph diff + 60% per-task OSWorld evaluator. Score 0.001-0.999.'
+
+  - id: osworld_6ada715d
+    name: 'compare_docx_images: Copy the screenshot 1.png from the desktop to where my curso'
+    family: docx
+    primary_tag: 'compare_docx_images'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: osworld
+    grader:
+      type: programmatic
+      description: 'MODIFY (docx) — 3-layer: validity gate (python-docx parse) + 40% paragraph diff + 60% per-task OSWorld evaluator. Score 0.001-0.999.'
+
+  - id: osworld_6f81754e
+    name: 'compare_unique_train_records: A certain railway company in Hong Kong uses a signaling syst'
+    family: docx
+    primary_tag: 'compare_unique_train_records'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: osworld
+    grader:
+      type: programmatic
+      description: 'MODIFY (docx) — 3-layer: validity gate (python-docx parse) + 40% paragraph diff + 60% per-task OSWorld evaluator. Score 0.001-0.999.'
+
+  - id: osworld_72b810ef
+    name: "evaluate_strike_through_last_paragraph: I am peer-reviewing my friend's course outline. I think the "
+    family: docx
+    primary_tag: 'evaluate_strike_through_last_paragraph'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: osworld
+    grader:
+      type: programmatic
+      description: 'MODIFY (docx) — 3-layer: validity gate (python-docx parse) + 40% paragraph diff + 60% per-task OSWorld evaluator. Score 0.001-0.999.'
+
+  - id: osworld_8472fece
+    name: 'evaluate_colored_words_in_tables: I am writing a word list for a dyslexic kid. To ease things '
+    family: docx
+    primary_tag: 'evaluate_colored_words_in_tables'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: osworld
+    grader:
+      type: programmatic
+      description: 'MODIFY (docx) — 3-layer: validity gate (python-docx parse) + 40% paragraph diff + 60% per-task OSWorld evaluator. Score 0.001-0.999.'
+
+  - id: osworld_88fe4b2d
+    name: 'compare_docx_files: I am making a guideline for students of my course and would '
+    family: docx
+    primary_tag: 'compare_docx_files'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: osworld
+    grader:
+      type: programmatic
+      description: 'MODIFY (docx) — 3-layer: validity gate (python-docx parse) + 40% paragraph diff + 60% per-task OSWorld evaluator. Score 0.001-0.999.'
+
+  - id: osworld_936321ce
+    name: 'compare_docx_tables: Could you help me convert the text seperated by commas to a '
+    family: docx
+    primary_tag: 'compare_docx_tables'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: osworld
+    grader:
+      type: programmatic
+      description: 'MODIFY (docx) — 3-layer: validity gate (python-docx parse) + 40% paragraph diff + 60% per-task OSWorld evaluator. Score 0.001-0.999.'
+
+  - id: osworld_adf5e2c3
+    name: 'compare_docx_files: Help me adding "Steinberg, F. M., Bearden, M. M., & Keen, C.'
+    family: docx
+    primary_tag: 'compare_docx_files'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: osworld
+    grader:
+      type: programmatic
+      description: 'MODIFY (docx) — 3-layer: validity gate (python-docx parse) + 40% paragraph diff + 60% per-task OSWorld evaluator. Score 0.001-0.999.'
+
+  - id: osworld_b21acd93
+    name: 'compare_line_spacing: I have been practicing professional writing lately. Now I am'
+    family: docx
+    primary_tag: 'compare_line_spacing'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: osworld
+    grader:
+      type: programmatic
+      description: 'MODIFY (docx) — 3-layer: validity gate (python-docx parse) + 40% paragraph diff + 60% per-task OSWorld evaluator. Score 0.001-0.999.'
+
+  - id: osworld_bb8ccc78
+    name: 'infeasible: Share this document with my team and let us edit it together'
+    family: docx
+    primary_tag: 'infeasible'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: osworld
+    grader:
+      type: programmatic
+      description: 'MODIFY (docx) — 3-layer: validity gate (python-docx parse) + 40% paragraph diff + 60% per-task OSWorld evaluator. Score 0.001-0.999.'
+
+  - id: osworld_d53ff5ee
+    name: 'compare_docx_files: I am currently engaged in text processing and require assist'
+    family: docx
+    primary_tag: 'compare_docx_files'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: osworld
+    grader:
+      type: programmatic
+      description: 'MODIFY (docx) — 3-layer: validity gate (python-docx parse) + 40% paragraph diff + 60% per-task OSWorld evaluator. Score 0.001-0.999.'
+
+  - id: osworld_e246f6d8
+    name: 'check_italic_font_size_14: I found Italic font very hard to discern from the normal tex'
+    family: docx
+    primary_tag: 'check_italic_font_size_14'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: osworld
+    grader:
+      type: programmatic
+      description: 'MODIFY (docx) — 3-layer: validity gate (python-docx parse) + 40% paragraph diff + 60% per-task OSWorld evaluator. Score 0.001-0.999.'
+
+  - id: osworld_e528b65e
+    name: 'compare_docx_files: Please help me make the first letter of each word to upperca'
+    family: docx
+    primary_tag: 'compare_docx_files'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: osworld
+    grader:
+      type: programmatic
+      description: 'MODIFY (docx) — 3-layer: validity gate (python-docx parse) + 40% paragraph diff + 60% per-task OSWorld evaluator. Score 0.001-0.999.'
+
+  - id: osworld_ecc2413d
+    name: 'contains_page_break: Hey, can you throw in a blank page right after this one?'
+    family: docx
+    primary_tag: 'contains_page_break'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: osworld
+    grader:
+      type: programmatic
+      description: 'MODIFY (docx) — 3-layer: validity gate (python-docx parse) + 40% paragraph diff + 60% per-task OSWorld evaluator. Score 0.001-0.999.'
+
+  - id: osworld_f178a4a9
+    name: 'find_default_font: Make Times New Roman the default Font'
+    family: docx
+    primary_tag: 'find_default_font'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: osworld
+    grader:
+      type: programmatic
+      description: 'MODIFY (docx) — 3-layer: validity gate (python-docx parse) + 40% paragraph diff + 60% per-task OSWorld evaluator. Score 0.001-0.999.'
+
+  # ── Train split — PPTArena pptx (30 tasks) ───────────────────
+
+  - id: pptarena_case_100_animation_canonicalization_bullet_sequencing
+    name: 'Object Animations: Case 100: Animation Canonicalization & Bullet Sequencin'
+    family: pptx
+    primary_tag: 'Object Animations'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: pptarena
+    grader:
+      type: programmatic
+      description: 'MODIFY (pptx) — 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% pos + 20% size). Score 0.001-0.999.'
+
+  - id: pptarena_case_13_italicize_subheadings_d
+    name: 'Text & Typography: Case 13: Italicize Subheadings (D)'
+    family: pptx
+    primary_tag: 'Text & Typography'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: pptarena
+    grader:
+      type: programmatic
+      description: 'MODIFY (pptx) — 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% pos + 20% size). Score 0.001-0.999.'
+
+  - id: pptarena_case_16_curate_multi_panel_photo_layout
+    name: 'Alignment, Distribution & Z-order: Case 16: Curate Multi-Panel Photo Layout'
+    family: pptx
+    primary_tag: 'Alignment, Distribution & Z-order'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: pptarena
+    grader:
+      type: programmatic
+      description: 'MODIFY (pptx) — 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% pos + 20% size). Score 0.001-0.999.'
+
+  - id: pptarena_case_17_build_ensemble_category_boards
+    name: 'Shapes & Drawing: Case 17: Build Ensemble Category Boards'
+    family: pptx
+    primary_tag: 'Shapes & Drawing'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: pptarena
+    grader:
+      type: programmatic
+      description: 'MODIFY (pptx) — 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% pos + 20% size). Score 0.001-0.999.'
+
+  - id: pptarena_case_23_add_thank_you_slide
+    name: 'Slide Layout & Placeholders: Case 23: Add Thank You Slide'
+    family: pptx
+    primary_tag: 'Slide Layout & Placeholders'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: pptarena
+    grader:
+      type: programmatic
+      description: 'MODIFY (pptx) — 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% pos + 20% size). Score 0.001-0.999.'
+
+  - id: pptarena_case_27_correct_images
+    name: 'Images & Pictures: Case 27: Correct Images'
+    family: pptx
+    primary_tag: 'Images & Pictures'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: pptarena
+    grader:
+      type: programmatic
+      description: 'MODIFY (pptx) — 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% pos + 20% size). Score 0.001-0.999.'
+
+  - id: pptarena_case_29_convert_bar_chart_to_pie_chart
+    name: 'Charts: Case 29: Convert Bar Chart to Pie Chart'
+    family: pptx
+    primary_tag: 'Charts'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: pptarena
+    grader:
+      type: programmatic
+      description: 'MODIFY (pptx) — 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% pos + 20% size). Score 0.001-0.999.'
+
+  - id: pptarena_case_31_fix_text_overflow
+    name: 'Text & Typography: Case 31: Fix Text Overflow'
+    family: pptx
+    primary_tag: 'Text & Typography'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: pptarena
+    grader:
+      type: programmatic
+      description: 'MODIFY (pptx) — 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% pos + 20% size). Score 0.001-0.999.'
+
+  - id: pptarena_case_37_add_transitions
+    name: 'Slide Transitions: Case 37: Add Transitions'
+    family: pptx
+    primary_tag: 'Slide Transitions'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: pptarena
+    grader:
+      type: programmatic
+      description: 'MODIFY (pptx) — 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% pos + 20% size). Score 0.001-0.999.'
+
+  - id: pptarena_case_38_flip_theme_scheme
+    name: 'Theme & Background: Case 38: Flip Theme Scheme'
+    family: pptx
+    primary_tag: 'Theme & Background'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: pptarena
+    grader:
+      type: programmatic
+      description: 'MODIFY (pptx) — 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% pos + 20% size). Score 0.001-0.999.'
+
+  - id: pptarena_case_43_transitionary_slides
+    name: 'Slide/Section Management & Footers: Case 43: Transitionary Slides'
+    family: pptx
+    primary_tag: 'Slide/Section Management & Footers'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: pptarena
+    grader:
+      type: programmatic
+      description: 'MODIFY (pptx) — 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% pos + 20% size). Score 0.001-0.999.'
+
+  - id: pptarena_case_51_currency_symbol_swap_eurusd
+    name: 'Tables: Case 51: Currency Symbol Swap (EUR→USD)'
+    family: pptx
+    primary_tag: 'Tables'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: pptarena
+    grader:
+      type: programmatic
+      description: 'MODIFY (pptx) — 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% pos + 20% size). Score 0.001-0.999.'
+
+  - id: pptarena_case_58_bullets_normalize_levels
+    name: 'Text & Typography: Case 58: Bullets Normalize Levels'
+    family: pptx
+    primary_tag: 'Text & Typography'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: pptarena
+    grader:
+      type: programmatic
+      description: 'MODIFY (pptx) — 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% pos + 20% size). Score 0.001-0.999.'
+
+  - id: pptarena_case_59_convert_hyper_link
+    name: 'Hyperlinks & Action Settings: Case 59: Convert Hyper Link'
+    family: pptx
+    primary_tag: 'Hyperlinks & Action Settings'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: pptarena
+    grader:
+      type: programmatic
+      description: 'MODIFY (pptx) — 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% pos + 20% size). Score 0.001-0.999.'
+
+  - id: pptarena_case_61_sort_by_score_and_crop_image_169
+    name: 'Tables: Case 61: Sort By Score And Crop Image 169'
+    family: pptx
+    primary_tag: 'Tables'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: pptarena
+    grader:
+      type: programmatic
+      description: 'MODIFY (pptx) — 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% pos + 20% size). Score 0.001-0.999.'
+
+  - id: pptarena_case_63_spatial_constraint_layout
+    name: 'Alignment, Distribution & Z-order: Case 63: Spatial Constraint Layout'
+    family: pptx
+    primary_tag: 'Alignment, Distribution & Z-order'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: pptarena
+    grader:
+      type: programmatic
+      description: 'MODIFY (pptx) — 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% pos + 20% size). Score 0.001-0.999.'
+
+  - id: pptarena_case_67_wcag_accessibility_master_cleanup
+    name: 'Accessibility & Semantics: Case 67: WCAG Accessibility & Master Cleanup'
+    family: pptx
+    primary_tag: 'Accessibility & Semantics'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: pptarena
+    grader:
+      type: programmatic
+      description: 'MODIFY (pptx) — 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% pos + 20% size). Score 0.001-0.999.'
+
+  - id: pptarena_case_68_swimlane_flow_process_canonicalization
+    name: 'SmartArt & Diagrams: Case 68: Swimlane Flow Process Canonicalization'
+    family: pptx
+    primary_tag: 'SmartArt & Diagrams'
+    difficulty: medium
+    task_type: MODIFY
+    max_steps: 15
+    split: train
+    origin: pptarena
+    grader:
+      type: programmatic
+      description: 'MODIFY (pptx) — 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% pos + 20% size). Score 0.001-0.999.'
+
1161
+ - id: pptarena_case_73_dynamic_data_label_placement
1162
+ name: 'Charts: Case 73: Dynamic Data Label Placement'
1163
+ family: pptx
1164
+ primary_tag: 'Charts'
1165
+ difficulty: medium
1166
+ task_type: MODIFY
1167
+ max_steps: 15
1168
+ split: train
1169
+ origin: pptarena
1170
+ grader:
1171
+ type: programmatic
1172
+ description: 'MODIFY (pptx) — 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% pos + 20% size). Score 0.001-0.999.'
1173
+
1174
+ - id: pptarena_case_76_aesthetic_slide_makeover
1175
+ name: 'Theme & Background: Case 76: Aesthetic Slide Makeover'
1176
+ family: pptx
1177
+ primary_tag: 'Theme & Background'
1178
+ difficulty: medium
1179
+ task_type: MODIFY
1180
+ max_steps: 15
1181
+ split: train
1182
+ origin: pptarena
1183
+ grader:
1184
+ type: programmatic
1185
+ description: 'MODIFY (pptx) — 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% pos + 20% size). Score 0.001-0.999.'
1186
+
1187
+ - id: pptarena_case_81_add_company_logo
1188
+ name: 'Images & Pictures: Case 81: Add Company Logo'
1189
+ family: pptx
1190
+ primary_tag: 'Images & Pictures'
1191
+ difficulty: medium
1192
+ task_type: MODIFY
1193
+ max_steps: 15
1194
+ split: train
1195
+ origin: pptarena
1196
+ grader:
1197
+ type: programmatic
1198
+ description: 'MODIFY (pptx) — 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% pos + 20% size). Score 0.001-0.999.'
1199
+
1200
+ - id: pptarena_case_84_add_progress_bar
1201
+ name: 'Shapes & Drawing: Case 84: Add Progress Bar'
1202
+ family: pptx
1203
+ primary_tag: 'Shapes & Drawing'
1204
+ difficulty: medium
1205
+ task_type: MODIFY
1206
+ max_steps: 15
1207
+ split: train
1208
+ origin: pptarena
1209
+ grader:
1210
+ type: programmatic
1211
+ description: 'MODIFY (pptx) — 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% pos + 20% size). Score 0.001-0.999.'
1212
+
1213
+ - id: pptarena_case_85_arabic_translate_ltr
1214
+ name: 'Text & Typography: Case 85: Arabic Translate LTR'
1215
+ family: pptx
1216
+ primary_tag: 'Text & Typography'
1217
+ difficulty: medium
1218
+ task_type: MODIFY
1219
+ max_steps: 15
1220
+ split: train
1221
+ origin: pptarena
1222
+ grader:
1223
+ type: programmatic
1224
+ description: 'MODIFY (pptx) — 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% pos + 20% size). Score 0.001-0.999.'
1225
+
1226
+ - id: pptarena_case_87_merge_near_duplicate_slides
1227
+ name: 'Slide/Section Management & Footers: Case 87: Merge Near-Duplicate Slides'
1228
+ family: pptx
1229
+ primary_tag: 'Slide/Section Management & Footers'
1230
+ difficulty: medium
1231
+ task_type: MODIFY
1232
+ max_steps: 15
1233
+ split: train
1234
+ origin: pptarena
1235
+ grader:
1236
+ type: programmatic
1237
+ description: 'MODIFY (pptx) — 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% pos + 20% size). Score 0.001-0.999.'
1238
+
1239
+ - id: pptarena_case_90_screenshot_to_editable_text_ub_title_slide
1240
+ name: 'Slide Layout & Placeholders: Case 90: Screenshot-to-Editable Text (UB Title Slide)'
1241
+ family: pptx
1242
+ primary_tag: 'Slide Layout & Placeholders'
1243
+ difficulty: medium
1244
+ task_type: MODIFY
1245
+ max_steps: 15
1246
+ split: train
1247
+ origin: pptarena
1248
+ grader:
1249
+ type: programmatic
1250
+ description: 'MODIFY (pptx) — 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% pos + 20% size). Score 0.001-0.999.'
1251
+
1252
+ - id: pptarena_case_91_add_qr_code
1253
+ name: 'Images & Pictures: Case 91: Add QR Code'
1254
+ family: pptx
1255
+ primary_tag: 'Images & Pictures'
1256
+ difficulty: medium
1257
+ task_type: MODIFY
1258
+ max_steps: 15
1259
+ split: train
1260
+ origin: pptarena
1261
+ grader:
1262
+ type: programmatic
1263
+ description: 'MODIFY (pptx) — 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% pos + 20% size). Score 0.001-0.999.'
1264
+
1265
+ - id: pptarena_case_93_multi_edit_cascade_copernicus_climate_highlights
1266
+ name: 'Charts: Case 93: Multi-Edit Cascade (Copernicus Climate Highlig'
1267
+ family: pptx
1268
+ primary_tag: 'Charts'
1269
+ difficulty: medium
1270
+ task_type: MODIFY
1271
+ max_steps: 15
1272
+ split: train
1273
+ origin: pptarena
1274
+ grader:
1275
+ type: programmatic
1276
+ description: 'MODIFY (pptx) — 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% pos + 20% size). Score 0.001-0.999.'
1277
+
1278
+ - id: pptarena_case_95_master_layout_rebind
1279
+ name: 'Template & Master-Level Edits: Case 95: Master & Layout Rebind'
1280
+ family: pptx
1281
+ primary_tag: 'Template & Master-Level Edits'
1282
+ difficulty: medium
1283
+ task_type: MODIFY
1284
+ max_steps: 15
1285
+ split: train
1286
+ origin: pptarena
1287
+ grader:
1288
+ type: programmatic
1289
+ description: 'MODIFY (pptx) — 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% pos + 20% size). Score 0.001-0.999.'
1290
+
1291
+ - id: pptarena_case_98_cross_slide_conditional_formatting_status_timeline
1292
+ name: 'SmartArt & Diagrams: Case 98: Cross-Slide Conditional Formatting (Status → T'
1293
+ family: pptx
1294
+ primary_tag: 'SmartArt & Diagrams'
1295
+ difficulty: medium
1296
+ task_type: MODIFY
1297
+ max_steps: 15
1298
+ split: train
1299
+ origin: pptarena
1300
+ grader:
1301
+ type: programmatic
1302
+ description: 'MODIFY (pptx) — 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% pos + 20% size). Score 0.001-0.999.'
1303
+
1304
+ - id: pptarena_case_99_embed_configure_video_playback
1305
+ name: 'Audio & Video: Case 99: Embed & Configure Video Playback'
1306
+ family: pptx
1307
+ primary_tag: 'Audio & Video'
1308
+ difficulty: medium
1309
+ task_type: MODIFY
1310
+ max_steps: 15
1311
+ split: train
1312
+ origin: pptarena
1313
+ grader:
1314
+ type: programmatic
1315
+ description: 'MODIFY (pptx) — 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% pos + 20% size). Score 0.001-0.999.'
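The pptx grader description repeated in these entries can be read as a small weighted formula. The sketch below is one possible reading, not the repository's grader: the function names are hypothetical, every sub-score is assumed to already lie in [0, 1], a failed validity gate is assumed to yield the floor score, and a deck with no shapes is assumed to contribute 0 to the shape average.

```python
from typing import Mapping, Sequence


def clamp_score(raw: float, lo: float = 0.001, hi: float = 0.999) -> float:
    """Clamp a raw score into the 0.001-0.999 range quoted in the description."""
    return max(lo, min(hi, raw))


def shape_composite(sub: Mapping[str, float]) -> float:
    """Per-shape composite: 40% text + 20% style + 20% pos + 20% size."""
    return (0.4 * sub["text"] + 0.2 * sub["style"]
            + 0.2 * sub["pos"] + 0.2 * sub["size"])


def pptx_score(parsed_ok: bool,
               slide_count_score: float,
               shapes: Sequence[Mapping[str, float]]) -> float:
    """Layer 1: validity gate. Layer 2: 20% slide-count + 80% average shape composite."""
    if not parsed_ok:
        # Assumption: a file that fails to parse gets the minimum score.
        return clamp_score(0.0)
    # Assumption: no shapes means a shape average of 0.
    avg = sum(shape_composite(s) for s in shapes) / len(shapes) if shapes else 0.0
    return clamp_score(0.2 * slide_count_score + 0.8 * avg)


perfect = {"text": 1.0, "style": 1.0, "pos": 1.0, "size": 1.0}
half_text = {"text": 0.5, "style": 1.0, "pos": 1.0, "size": 1.0}
print(pptx_score(True, 1.0, [perfect]))
print(pptx_score(True, 1.0, [half_text]))
```

Under this reading, text fidelity dominates: halving the text sub-score of a single-shape deck costs 0.16 of the final score, while halving any one of the other three sub-scores costs 0.08.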
1316
+
1317
+ # ── Eval split — Finch xlsx (10 tasks) ───────────────────────
1318
 
1319
  - id: finch_10
1320
  name: 'Calculation: Per the headers and established formula logic, populate form'
 
1323
  difficulty: medium
1324
  task_type: MODIFY
1325
  max_steps: 15
1326
+ split: eval
1327
+ origin: finch
1328
  grader:
1329
  type: programmatic
1330
  description: 'MODIFY (xlsx) — 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
1331
 
1332
+ - id: finch_14
1333
+ name: 'Financial Modeling: Suppose we need to hold a 0.5-year AA(2) municipal investmen'
1334
  family: xlsx
1335
+ primary_tag: 'Financial Modeling'
1336
+ difficulty: hard
1337
  task_type: MODIFY
1338
  max_steps: 15
1339
+ split: eval
1340
+ origin: finch
1341
  grader:
1342
  type: programmatic
1343
  description: 'MODIFY (xlsx) — 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
1344
 
1345
+ - id: finch_35
1346
+ name: 'Calculation: Summarize the volume and dollar imbalances that exist betwee'
1347
  family: xlsx
1348
+ primary_tag: 'Calculation'
1349
+ difficulty: medium
1350
  task_type: MODIFY
1351
  max_steps: 15
1352
+ split: eval
1353
+ origin: finch
1354
  grader:
1355
  type: programmatic
1356
  description: 'MODIFY (xlsx) — 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
1357
 
1358
+ - id: finch_38
1359
+ name: 'Calculation: Using the discount rate assumptions in the table and each Sh'
1360
  family: xlsx
1361
+ primary_tag: 'Calculation'
1362
+ difficulty: medium
1363
  task_type: MODIFY
1364
  max_steps: 15
1365
+ split: eval
1366
+ origin: finch
1367
  grader:
1368
  type: programmatic
1369
  description: 'MODIFY (xlsx) — 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
1370
 
1371
+ - id: finch_59
1372
+ name: 'Structuring / Formatting: Update the TOTAL PHYSICAL GAS tab to mirror the layout on TO'
1373
  family: xlsx
1374
+ primary_tag: 'Structuring / Formatting'
1375
  difficulty: medium
1376
  task_type: MODIFY
1377
  max_steps: 15
1378
+ split: eval
1379
+ origin: finch
1380
  grader:
1381
  type: programmatic
1382
  description: 'MODIFY (xlsx) — 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
1383
 
1384
+ - id: finch_112
1385
+ name: 'Cross-sheet/file Retrieval: For each record, use the Frequency to place the Rent amount '
1386
  family: xlsx
1387
+ primary_tag: 'Cross-sheet/file Retrieval'
1388
+ difficulty: easy
1389
  task_type: MODIFY
1390
  max_steps: 15
1391
+ split: eval
1392
+ origin: finch
1393
  grader:
1394
  type: programmatic
1395
  description: 'MODIFY (xlsx) — 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
1396
 
1397
+ - id: finch_122
1398
+ name: 'Summary / Visualization: Create a new sheet named “Exp by Fun Gen Support Chart5” and'
1399
  family: xlsx
1400
+ primary_tag: 'Summary / Visualization'
1401
+ difficulty: easy
1402
  task_type: MODIFY
1403
  max_steps: 15
1404
+ split: eval
1405
+ origin: finch
1406
  grader:
1407
  type: programmatic
1408
  description: 'MODIFY (xlsx) — 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
1409
 
1410
+ - id: finch_154
1411
+ name: 'Data Entry / Import: Complete the missing Interreg co-financing data in the FR fi'
1412
  family: xlsx
1413
+ primary_tag: 'Data Entry / Import'
1414
  difficulty: medium
1415
  task_type: MODIFY
1416
  max_steps: 15
1417
+ split: eval
1418
+ origin: finch
1419
  grader:
1420
  type: programmatic
1421
  description: 'MODIFY (xlsx) — 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
1422
 
1423
+ - id: finch_158
1424
+ name: 'Validation / Review: Audit the workbook and correct the formula errors in place s'
1425
  family: xlsx
1426
+ primary_tag: 'Validation / Review'
1427
+ difficulty: hard
1428
  task_type: MODIFY
1429
  max_steps: 15
1430
+ split: eval
1431
+ origin: finch
1432
  grader:
1433
  type: programmatic
1434
  description: 'MODIFY (xlsx) — 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
1435
 
1436
+ - id: finch_168
1437
+ name: 'Structuring / Formatting: Insert blank rows between adjacent tables in the workbook to'
1438
  family: xlsx
1439
  primary_tag: 'Structuring / Formatting'
1440
  difficulty: medium
1441
  task_type: MODIFY
1442
  max_steps: 15
1443
+ split: eval
1444
+ origin: finch
1445
  grader:
1446
  type: programmatic
1447
  description: 'MODIFY (xlsx) — 30% sheet-name match + 70% cell-level diff against gold reference (2% numeric tolerance). Score 0.001-0.999.'
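The xlsx description combines a sheet-name match with a cell-level diff under a 2% numeric tolerance. A minimal sketch of how such a score could be computed follows. It makes several assumptions not stated in the source: workbooks are represented as plain dicts (`{sheet: {cell_ref: value}}`) rather than loaded from files, the tolerance is relative to the gold value, non-numeric cells must match exactly, and both sub-scores are fractions over the gold reference. All names are hypothetical.

```python
from numbers import Real
from typing import Any, Dict

Workbook = Dict[str, Dict[str, Any]]  # hypothetical stand-in for a parsed .xlsx


def is_number(value: Any) -> bool:
    """True for ints and floats, but not bools (which Python treats as ints)."""
    return isinstance(value, Real) and not isinstance(value, bool)


def cells_match(got: Any, gold: Any, rel_tol: float = 0.02) -> bool:
    """Numeric cells match within 2% of the gold value; others must be equal."""
    if is_number(got) and is_number(gold):
        return abs(got - gold) <= rel_tol * abs(gold)
    return got == gold


def xlsx_score(output: Workbook, gold: Workbook) -> float:
    """30% sheet-name match + 70% cell-level diff, clamped to 0.001-0.999."""
    names = set(gold)
    name_score = len(names & set(output)) / len(names) if names else 1.0
    total = matched = 0
    for sheet, gold_cells in gold.items():
        out_cells = output.get(sheet, {})
        for ref, gold_val in gold_cells.items():
            total += 1
            if ref in out_cells and cells_match(out_cells[ref], gold_val):
                matched += 1
    cell_score = matched / total if total else 1.0
    return max(0.001, min(0.999, 0.3 * name_score + 0.7 * cell_score))


gold = {"Summary": {"A1": 100.0, "B1": "Total"}}
close = {"Summary": {"A1": 101.5, "B1": "Total"}}   # 1.5% off: within tolerance
far = {"Summary": {"A1": 103.0, "B1": "Total"}}     # 3% off: outside tolerance
print(xlsx_score(close, gold), xlsx_score(far, gold))
```

With a relative tolerance, a gold value of exactly 0 only matches an output of exactly 0. A real grader might add an absolute floor to soften that case; the description does not say either way.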
1448
 
1449
+ # ── Eval split — OSWorld docx (4 tasks) ──────────────────────
1450
 
1451
  - id: osworld_0810415c
1452
  name: 'compare_line_spacing: Make the line spacing of first two paragraph into double lin'
 
1455
  difficulty: medium
1456
  task_type: MODIFY
1457
  max_steps: 15
1458
+ split: eval
1459
+ origin: osworld
1460
  grader:
1461
  type: programmatic
1462
+ description: 'MODIFY (docx) — 3-layer: validity gate (python-docx parse) + 40% paragraph diff + 60% per-task OSWorld evaluator. Score 0.001-0.999.'
1463
 
1464
  - id: osworld_0a0faba3
1465
  name: 'check_tabstops: I would like to make the first three words of the sentence l'
 
1468
  difficulty: medium
1469
  task_type: MODIFY
1470
  max_steps: 15
1471
+ split: eval
1472
+ origin: osworld
1473
  grader:
1474
  type: programmatic
1475
+ description: 'MODIFY (docx) — 3-layer: validity gate (python-docx parse) + 40% paragraph diff + 60% per-task OSWorld evaluator. Score 0.001-0.999.'
1476
 
1477
  - id: osworld_0b17a146
1478
  name: 'compare_docx_files: Help me change the 2 in "H2O" to a subscript.'
 
1481
  difficulty: medium
1482
  task_type: MODIFY
1483
  max_steps: 15
1484
+ split: eval
1485
+ origin: osworld
1486
  grader:
1487
  type: programmatic
1488
+ description: 'MODIFY (docx) — 3-layer: validity gate (python-docx parse) + 40% paragraph diff + 60% per-task OSWorld evaluator. Score 0.001-0.999.'
1489
 
1490
  - id: osworld_66399b0d
1491
  name: 'compare_docx_tables: Could you help me insert a 7(columns)*5(rows) empty table at'
 
1494
  difficulty: medium
1495
  task_type: MODIFY
1496
  max_steps: 15
1497
+ split: eval
1498
+ origin: osworld
1499
  grader:
1500
  type: programmatic
1501
+ description: 'MODIFY (docx) — 3-layer: validity gate (python-docx parse) + 40% paragraph diff + 60% per-task OSWorld evaluator. Score 0.001-0.999.'
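For the docx family the description names three layers: a validity gate, a paragraph diff, and a per-task OSWorld evaluator. The sketch below illustrates only how the 40/60 weighting might combine them. It assumes the paragraph diff is a sequence-similarity ratio over paragraph texts (here computed with the standard library's `difflib.SequenceMatcher`, which the source does not mention), that the evaluator result arrives as a float in [0, 1], and that a failed parse yields the floor score. Function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Sequence


def paragraph_similarity(out_paras: Sequence[str], gold_paras: Sequence[str]) -> float:
    """Similarity in [0, 1] between two paragraph lists, compared paragraph by paragraph."""
    return SequenceMatcher(None, list(out_paras), list(gold_paras)).ratio()


def docx_score(parsed_ok: bool,
               out_paras: Sequence[str],
               gold_paras: Sequence[str],
               evaluator_score: float) -> float:
    """Validity gate, then 40% paragraph diff + 60% per-task evaluator, clamped."""
    if not parsed_ok:
        # Assumption: an unparseable document gets the minimum score.
        return 0.001
    raw = 0.4 * paragraph_similarity(out_paras, gold_paras) + 0.6 * evaluator_score
    return max(0.001, min(0.999, raw))


gold_paras = ["Water is H2O.", "It boils at 100 C."]
edited = ["Water is H2O.", "It freezes at 0 C."]
print(docx_score(True, gold_paras, gold_paras, 1.0))
print(docx_score(True, edited, gold_paras, 0.0))
```

Because the evaluator carries 60% of the weight, a document whose text matches the gold reference exactly but fails the task-specific check would score at most 0.4 under this reading.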
1502
 
1503
+ # ── Eval split — PPTArena pptx (8 tasks) ─────────────────────
1504
 
1505
  - id: pptarena_case_26_match_slide_colors_to_theme
1506
  name: 'Theme & Background: Case 26: Match Slide Colors to Theme'
 
1509
  difficulty: medium
1510
  task_type: MODIFY
1511
  max_steps: 15
1512
+ split: eval
1513
+ origin: pptarena
1514
  grader:
1515
  type: programmatic
1516
+ description: 'MODIFY (pptx) — 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% pos + 20% size). Score 0.001-0.999.'
1517
 
1518
  - id: pptarena_case_32_arrange_image_and_text
1519
  name: 'Images & Pictures: Case 32: Arrange Image and Text'
 
1522
  difficulty: medium
1523
  task_type: MODIFY
1524
  max_steps: 15
1525
+ split: eval
1526
+ origin: pptarena
1527
  grader:
1528
  type: programmatic
1529
+ description: 'MODIFY (pptx) — 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% pos + 20% size). Score 0.001-0.999.'
1530
 
1531
  - id: pptarena_case_35_structural_fix
1532
  name: 'Text & Typography: Case 35: Structural Fix'
 
1535
  difficulty: medium
1536
  task_type: MODIFY
1537
  max_steps: 15
1538
+ split: eval
1539
+ origin: pptarena
1540
  grader:
1541
  type: programmatic
1542
+ description: 'MODIFY (pptx) — 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% pos + 20% size). Score 0.001-0.999.'
1543
 
1544
  - id: pptarena_case_36_add_speaker_notes
1545
  name: 'Slide/Section Management & Footers: Case 36: Add Speaker Notes'
 
1548
  difficulty: medium
1549
  task_type: MODIFY
1550
  max_steps: 15
1551
+ split: eval
1552
+ origin: pptarena
1553
  grader:
1554
  type: programmatic
1555
+ description: 'MODIFY (pptx) — 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% pos + 20% size). Score 0.001-0.999.'
1556
 
1557
  - id: pptarena_case_40_hindu_center_titles
1558
  name: 'Text & Typography: Case 40: Hindu Center Titles'
 
1561
  difficulty: medium
1562
  task_type: MODIFY
1563
  max_steps: 15
1564
+ split: eval
1565
+ origin: pptarena
1566
  grader:
1567
  type: programmatic
1568
+ description: 'MODIFY (pptx) — 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% pos + 20% size). Score 0.001-0.999.'
1569
 
1570
  - id: pptarena_case_49_normalize_thousand_separators
1571
  name: 'Tables: Case 49: Normalize Thousand Separators'
 
1574
  difficulty: medium
1575
  task_type: MODIFY
1576
  max_steps: 15
1577
+ split: eval
1578
+ origin: pptarena
1579
  grader:
1580
  type: programmatic
1581
+ description: 'MODIFY (pptx) — 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% pos + 20% size). Score 0.001-0.999.'
1582
 
1583
  - id: pptarena_case_60_fix_text_placement
1584
  name: 'Alignment, Distribution & Z-order: Case 60: Fix Text Placement'
 
1587
  difficulty: medium
1588
  task_type: MODIFY
1589
  max_steps: 15
1590
+ split: eval
1591
+ origin: pptarena
1592
  grader:
1593
  type: programmatic
1594
+ description: 'MODIFY (pptx) — 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% pos + 20% size). Score 0.001-0.999.'
1595
 
1596
  - id: pptarena_case_7_update_quarter_two_data_b
1597
  name: 'Charts: Case 7: Update Quarter Two Data (B)'
 
1600
  difficulty: medium
1601
  task_type: MODIFY
1602
  max_steps: 15
1603
+ split: eval
1604
+ origin: pptarena
1605
  grader:
1606
  type: programmatic
1607
+ description: 'MODIFY (pptx) — 2-layer: validity gate (python-pptx parse) + 20% slide-count + 80% avg per-shape composite (40% text + 20% style + 20% pos + 20% size). Score 0.001-0.999.'
 
 
 
 
 
 
 
 
server/app.py CHANGED
@@ -637,7 +637,7 @@ def build_dashboard() -> gr.Blocks:
637
  with gr.Row():
638
  for label, n in [
639
  ("total tasks", total_tasks),
640
- (".xlsx (Finch + curated)", counts.get("xlsx", 0)),
641
  (".docx (OSWorld-Verified)", counts.get("docx", 0)),
642
  (".pptx (PPTArena)", counts.get("pptx", 0)),
643
  ]:
 
637
  with gr.Row():
638
  for label, n in [
639
  ("total tasks", total_tasks),
640
+ (".xlsx (Finch — 10 hand-curated + 50 stratified)", counts.get("xlsx", 0)),
641
  (".docx (OSWorld-Verified)", counts.get("docx", 0)),
642
  (".pptx (PPTArena)", counts.get("pptx", 0)),
643
  ]:
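The dashboard loop above reads per-family totals from `counts` and an overall `total_tasks`, but the diff does not show how they are built. One plausible derivation, offered only as an assumption, is a `collections.Counter` over the `family` field of each task entry. The helper name `count_by` and the short card labels are hypothetical. The four sample entries reuse task ids from the YAML diff, and their `family`, `split`, and `origin` values follow the source's task groupings.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def count_by(tasks: Iterable[Dict[str, str]], key: str) -> Counter:
    """Count task entries by one of their string fields (family, split, origin)."""
    return Counter(task[key] for task in tasks)


# A tiny sample of task entries, trimmed to the fields used here.
tasks: List[Dict[str, str]] = [
    {"id": "finch_10", "family": "xlsx", "split": "eval", "origin": "finch"},
    {"id": "osworld_0810415c", "family": "docx", "split": "eval", "origin": "osworld"},
    {"id": "pptarena_case_63_spatial_constraint_layout",
     "family": "pptx", "split": "train", "origin": "pptarena"},
    {"id": "pptarena_case_7_update_quarter_two_data_b",
     "family": "pptx", "split": "eval", "origin": "pptarena"},
]

counts = count_by(tasks, "family")
total_tasks = len(tasks)

# Same (label, n) shape the dashboard loop iterates over; labels shortened here.
cards: List[Tuple[str, int]] = [
    ("total tasks", total_tasks),
    (".xlsx", counts.get("xlsx", 0)),
    (".docx", counts.get("docx", 0)),
    (".pptx", counts.get("pptx", 0)),
]
for label, n in cards:
    print(f"{label}: {n}")
```

The same helper applied to `"split"` or `"origin"` would give the train/eval and per-source totals that the commit message reports as validated.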