SciEval-Leaderboard / Large Language Model Scientific Capability.csv
Model,Type,Parameters,Knowl. Und.,Code Gen.,Symbolic Reason.,Hypoth. Gen.,Overall
Claude 4.5 Sonnet,Close,,60.67,21.73,40.36,56.10,44.72
Claude4-1-Opus,Close,,60.87,25.32,38.69,29.47,38.58
GPT-4o,Close,,60.84,17.67,32.09,33.04,35.91
GPT-5,Close,,74.05,29.21,39.91,45.67,47.21
GPT-o3,Close,,76.05,25.26,38.14,34.14,43.40
Gemini-2.5-Flash,Close,,50.46,18.28,32.07,40.86,35.42
Gemini-2.5-Pro,Close,,59.34,24.77,34.96,50.73,42.45
Grok-2-vision-1212,Close,,50.14,20.60,28.21,49.63,37.14
Ling-flash-2.0,Open,100B,53.39,25.60,37.98,50.29,41.81
Seed1.6-vision,Close,,65.78,21.49,39.24,45.00,42.88
DeepSeek-R1,Open,685B,45.17,0.06,20.00,49.73,28.74
GLM-4.5V,Open,106B,52.78,3.24,13.43,42.23,27.92
InternS1,Open,241B,66.14,17.08,31.62,37.45,38.07
Kimi-k2,Open,1040B,62.49,20.86,38.59,42.28,41.06
Llama 4 Maverick,Open,400B,57.22,18.26,38.97,38.31,38.19
Qwen3-VL-235B-A22B,Open,235B,65.98,18.00,49.93,40.62,43.63
Qwen3-Max,Open,1000B,63.14,43.97,41.04,42.12,47.57
GPT-5.1,Close,,69.23,25.63,32.44,41.45,42.19
Gemini-3-Pro,Close,,66.06,29.57,45.19,61.51,50.58