Imagination Helps Visual Reasoning, But Not Yet in Latent Space Paper • 2602.22766 • Published Feb 26 • 44
FileGram: Grounding Agent Personalization in File-System Behavioral Traces Paper • 2604.04901 • Published 7 days ago • 39
Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding Paper • 2604.05015 • Published 7 days ago • 227
ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces Paper • 2604.05172 • Published 7 days ago • 22
How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings Paper • 2604.04323 • Published 7 days ago • 37
ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation Paper • 2604.03922 • Published 8 days ago • 51
Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents Paper • 2604.06132 • Published 6 days ago • 110
ClawArena: Benchmarking AI Agents in Evolving Information Environments Paper • 2604.04202 • Published 8 days ago • 33
ClawKeeper: Comprehensive Safety Protection for OpenClaw Agents Through Skills, Plugins, and Watchers Paper • 2603.24414 • Published 18 days ago • 182
WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG Paper • 2603.23497 • Published 19 days ago • 91
Omni-WorldBench: Towards a Comprehensive Interaction-Centric Evaluation for World Models Paper • 2603.22212 • Published 20 days ago • 126
MetaClaw: Just Talk -- An Agent That Meta-Learns and Evolves in the Wild Paper • 2603.17187 • Published 26 days ago • 137
HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning Paper • 2603.17024 • Published 26 days ago • 109
Stroke of Surprise: Progressive Semantic Illusions in Vector Sketching Paper • 2602.12280 • Published Feb 12 • 34
Thinking in Uncertainty: Mitigating Hallucinations in MLRMs with Latent Entropy-Aware Decoding Paper • 2603.13366 • Published Mar 9 • 94
Beyond Language Modeling: An Exploration of Multimodal Pretraining Paper • 2603.03276 • Published Mar 3 • 103
MOSSBench: Is Your Multimodal Language Model Oversensitive to Safe Queries? Paper • 2406.17806 • Published Jun 22, 2024 • 2