Rethinking Chain-of-Thought Reasoning for Videos Paper • 2512.09616 • Published 25 days ago • 17
facebook/dinov3-vitl16-pretrain-lvd1689m Image Feature Extraction • 0.3B • Updated Aug 19, 2025 • 313k • 103
Making Long-Context Language Models Better Multi-Hop Reasoners Paper • 2408.03246 • Published Aug 6, 2024
Fine-grained Spatiotemporal Grounding on Egocentric Videos Paper • 2508.00518 • Published Aug 1, 2025 • 4
Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors Paper • 2505.24625 • Published May 30, 2025 • 9
Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding Paper • 2412.00493 • Published Nov 30, 2024 • 17
AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning Paper • 2412.03248 • Published Dec 4, 2024 • 26