MultiVerse: A Multi-Turn Conversation Benchmark for Evaluating Large Vision and Language Models Paper • 2510.16641 • Published Oct 18 • 4
FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games Paper • 2509.01052 • Published Sep 1 • 21
ChartCap: Mitigating Hallucination of Dense Chart Captioning Paper • 2508.03164 • Published Aug 5 • 6
Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games Paper • 2506.03610 • Published Jun 4 • 9
Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates Paper • 2505.22943 • Published May 28 • 3