About the evaluation

by qinchuanhui - opened Apr 13, 2025

Apr 13, 2025

Thanks for your great work.

However, I was unable to reproduce your evaluation scores on LCB. Could you please tell me which evaluation framework you used and how it was configured (e.g., backend engine version, prompt, generation config)?

Thank you very much.

ryanmarten

Bespoke Labs org Apr 16, 2025

We originally used the SkyT1 evaluation code to produce the scores reported here.
We have since revised our evaluation setup to run each evaluation across multiple seeds and average the results, which produces much more reliable scores.
We have released our improved evaluation code and recommend using it for future experiments.
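
For readers who want a feel for the multi-seed averaging mentioned above, here is a minimal sketch. It is not the released evaluation code: `run_eval` is a hypothetical callable that runs one evaluation with a given seed and returns a score, and `fake_eval` is a stand-in that only simulates seed-to-seed noise.

```python
import random
import statistics
from typing import Callable, Sequence, Tuple


def average_over_seeds(
    run_eval: Callable[[int], float], seeds: Sequence[int]
) -> Tuple[float, float]:
    """Run one evaluation per seed; return (mean score, standard error)."""
    scores = [run_eval(seed) for seed in seeds]
    mean = statistics.mean(scores)
    # Standard error of the mean; undefined for a single run, so report 0.0.
    if len(scores) > 1:
        stderr = statistics.stdev(scores) / len(scores) ** 0.5
    else:
        stderr = 0.0
    return mean, stderr


def fake_eval(seed: int) -> float:
    """Hypothetical stand-in for a real benchmark run (assumption, not a real API)."""
    rng = random.Random(seed)
    return 0.5 + rng.uniform(-0.05, 0.05)


if __name__ == "__main__":
    mean, stderr = average_over_seeds(fake_eval, seeds=[0, 1, 2, 3, 4])
    print(f"mean={mean:.3f} stderr={stderr:.3f}")
```

Reporting the standard error alongside the mean makes it easier to judge whether a gap between two reproductions is larger than the seed-to-seed noise.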

qinchuanhui

Apr 16, 2025

Got it! Thanks for your response.

qinchuanhui changed discussion status to closed Apr 16, 2025
