CoffeeBench: Benchmarking Long-Horizon LLM Agents in Heterogeneous Multi-Agent Economies Paper • 2606.16613 • Published 14 days ago • 8
Mitigating Reward Hacking in RLHF via Advantage Sign Robustness Paper • 2604.02986 • Published Apr 3 • 3
Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests Paper • 2606.07379 • Published 24 days ago • 5
How Can I Publish My LLM Benchmark Without Giving the True Answers Away? Paper • 2505.18102 • Published May 23, 2025 • 2