AgentIF-OneDay: A Task-level Instruction-Following Benchmark for General AI Agents in Daily Scenarios Paper • 2601.20613 • Published 11 days ago • 10
On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models Paper • 2512.07783 • Published Dec 8, 2025 • 38
P1: Mastering Physics Olympiads with Reinforcement Learning Paper • 2511.13612 • Published Nov 17, 2025 • 134
Lumine: An Open Recipe for Building Generalist Agents in 3D Open Worlds Paper • 2511.08892 • Published Nov 12, 2025 • 209
xbench: Tracking Agents Productivity Scaling with Profession-Aligned Real-World Evaluations Paper • 2506.13651 • Published Jun 16, 2025 • 8
MMaDA: Multimodal Large Diffusion Language Models Paper • 2505.15809 • Published May 21, 2025 • 97
Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis Paper • 2505.13227 • Published May 19, 2025 • 45
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models Paper • 2504.15279 • Published Apr 21, 2025 • 78