About the evaluation

#1
by qinchuanhui - opened

Thanks for your great work.

However, I was unable to reproduce your evaluation scores on LCB. Could you please tell me which evaluation framework you used and how it was configured (e.g., backend engine version, prompt, generation configs)?

Thank you very much.

Bespoke Labs org

We originally used the SkyT1 evaluation code to produce the scores reported here.
We later revised our evaluation setup to run each evaluation across multiple seeds and average the results, which produces much more reliable scores.
We have released this improved evaluation code and recommend using it for future experiments.
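
For readers unfamiliar with the idea, here is a minimal sketch of averaging a score over several seeds. It is only an illustration and not the released evaluation code: `average_over_seeds` and `fake_evaluate` are hypothetical names, and `fake_evaluate` returns a made-up number in place of a real benchmark run.

```python
import random
import statistics
from typing import Callable, Sequence, Tuple


def average_over_seeds(
    evaluate_once: Callable[[int], float], seeds: Sequence[int]
) -> Tuple[float, float]:
    """Run one evaluation per seed and return (mean, sample std) of the scores."""
    scores = [evaluate_once(seed) for seed in seeds]
    mean = statistics.mean(scores)
    # The sample standard deviation needs at least two scores.
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std


def fake_evaluate(seed: int) -> float:
    """Hypothetical stand-in for a real benchmark run; returns a synthetic score."""
    rng = random.Random(seed)
    return 0.5 + rng.uniform(-0.05, 0.05)


mean, std = average_over_seeds(fake_evaluate, seeds=[0, 1, 2, 3, 4])
print(f"mean={mean:.3f} std={std:.3f}")
```

Reporting the spread alongside the mean makes it easier to tell whether a gap between two reported scores is larger than the seed-to-seed noise.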

Got it! Thanks for your response.

qinchuanhui changed discussion status to closed
