About the evaluation
#1 by qinchuanhui - opened
Thanks for your great work.
However, I was unable to reproduce your evaluation scores on LCB. Could you please tell me which evaluation framework you used and how it was configured (e.g., backend engine version, prompt, generation configs)?
Thank you very much.
We originally used the SkyT1 evaluation code to produce the scores reported here.
We later revised our evaluation setup to run multiple evaluations with different seeds and average the results, which produces much more reliable scores.
We have released our improved evaluation code and recommend using it for future experiments.
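For illustration only, the idea of averaging over seeds can be sketched as below. This is not the released evaluation code: `average_over_seeds` and `fake_evaluate` are hypothetical names, and `fake_evaluate` is a stand-in for one full benchmark run that just returns a noisy number.

```python
import random
import statistics
from typing import Callable, Sequence, Tuple


def average_over_seeds(
    evaluate: Callable[[int], float], seeds: Sequence[int]
) -> Tuple[float, float]:
    """Run `evaluate` once per seed and return (mean, sample std dev)."""
    if not seeds:
        raise ValueError("at least one seed is required")
    scores = [evaluate(seed) for seed in seeds]
    mean = statistics.mean(scores)
    # The sample standard deviation needs at least two runs.
    spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, spread


def fake_evaluate(seed: int) -> float:
    """Hypothetical stand-in for one benchmark run (assumption, not real data)."""
    rng = random.Random(seed)
    return 0.5 + rng.uniform(-0.05, 0.05)


if __name__ == "__main__":
    mean, spread = average_over_seeds(fake_evaluate, seeds=[0, 1, 2, 3])
    print(f"mean={mean:.3f} std={spread:.3f}")
```

Reporting the spread alongside the mean makes it easier to tell whether a gap between a reproduced score and a reported one is within run-to-run noise.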
Got it! Thanks for your response.
qinchuanhui changed discussion status to closed