OpenGVLab

community

https://github.com/opengvlab


AI & ML interests

Computer Vision

Recent Activity

heroding77 authored a paper 19 days ago

OpenMobile: Building Open Mobile Agents with Task and Trajectory Synthesis

heroding77 authored a paper 19 days ago

Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale

yangxue authored a paper about 1 month ago

Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding


Papers

RIVER: A Real-Time Interaction Benchmark for Video LLMs

InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision


yuezhengrong authored a paper 3 days ago

What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion

Paper • 2605.07915 • Published 9 days ago • 8

yuezhengrong submitted a paper to Daily Papers 5 days ago

What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion

Paper • 2605.07915 • Published 9 days ago • 8

yuezhengrong authored 9 papers 6 days ago

LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents

Paper • 2503.10200 • Published Mar 13, 2025

VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception

Paper • 2509.21100 • Published Sep 25, 2025 • 1

UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation

Paper • 2510.10575 • Published Oct 12, 2025 • 2

Beyond Textual CoT: Interleaved Text-Image Chains with Deep Confidence Reasoning for Image Editing

Paper • 2510.08157 • Published Oct 9, 2025

VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning

Paper • 2511.19524 • Published Nov 24, 2025

Continuous-Time Distribution Matching for Few-Step Diffusion Distillation

Paper • 2605.06376 • Published 10 days ago • 25

prithivMLmods posted an update 15 days ago

Post

5087

Multimodal-Edge Demo, a node-based inference canvas demo, is now live on Spaces. It features node-based Transformers for fast inference across 10+ edge-device multimodal models on the Hub, all within a single space. The series includes models from Qwen3.5, Qwen3-VL, Gemma 4, and the LFM 2.5 VL model series, with support for reasoning and grounding tasks.

🤗 Demo: prithivMLmods/Multimodal-Edge-Node
🔗 GitHub: https://github.com/PRITHIVSAKTHIUR/Multimodal-Edge-Node
✅ Multimodal Apps Collections: https://huggingface.co/collections/prithivMLmods/hall-of-multimodal-apps

🤗 > To learn more, visit the app page or the respective model pages.

heroding77 authored 2 papers 19 days ago

OpenMobile: Building Open Mobile Agents with Task and Trajectory Synthesis

Paper • 2604.15093 • Published about 1 month ago • 28

Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale

Paper • 2603.25040 • Published Mar 26 • 132

prithivMLmods posted an update 23 days ago

Post

1873

A collection of various compression schemes for Qwen3.6, and for version 1 of the abliterated dense models, is now available on the Hub. Check it out via the links below. 👇

🔗 Qwen3.6-MoE: https://huggingface.co/collections/prithivMLmods/qwen36-35b-a3b-compressions
🔗 Qwen3.6-27B Compressions: https://huggingface.co/collections/prithivMLmods/qwen36-27b-compressions

🤗 > To learn more, visit the app page or the respective model pages.

prithivMLmods posted an update 28 days ago

Post

4186

HY-World-2.0 — A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds is now available on Spaces, and it works both as native Gradio components and in Gradio server mode.

> HY-World-2.0-Demo: prithivMLmods/HY-World-2.0-Demo
> HY-World-2.0 [Server Mode]: prithivMLmods/HY-World-2.0-Demo
> Featuring 3D reconstruction and Gaussian splats with the Rerun viewer, along with camera poses, depth maps, and surface normals.
> In Server Mode, Gradio is served via FastAPI, with FastAPI remaining the top-level server.
> Model: tencent/HY-World-2.0
> GitHub: https://github.com/PRITHIVSAKTHIUR/HY-World-2.0-Demo

🤗 To learn more, visit the app page or the respective model pages.

prithivMLmods posted an update about 1 month ago

Post

6211

A new comparator on Spaces showcases Standard FLUX.2 Decoder vs. FLUX.2 Small Decoder. The Small Decoder is ~1.4× faster, uses ~1.4× less VRAM, and maintains near-identical image quality. It has ~28M parameters with narrower channels [96, 192, 384, 384] vs. [128, 256, 512, 512], and the demo supports sequence generation by running both decoders simultaneously and comparing the results side by side.

🤗 Comparator: https://huggingface.co/spaces/prithivMLmods/Flux.2-4B-Decoder-Comparator
🔗 FLUX.2-small-decoder: black-forest-labs/FLUX.2-small-decoder
🔗 GitHub: https://github.com/PRITHIVSAKTHIUR/Flux.2-4B-Encoder-Comparator
🚁 Collection: https://huggingface.co/collections/prithivMLmods/image-generation-apps-collection

🤗 > App built on the Gradio SDK. To learn more, visit the app page or the respective model pages.
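
The channel widths quoted in the post above allow a rough sanity check on the reported savings. The sketch below is a back-of-the-envelope calculation, not the decoder's actual cost model: it assumes (hypothetically) that each stage is dominated by convolutions whose cost grows with the product of input and output channels, which the post does not state. The function names are illustrative only.

```python
# Channel widths per decoder stage, as quoted in the post.
STANDARD_CHANNELS = [128, 256, 512, 512]
SMALL_CHANNELS = [96, 192, 384, 384]


def width_ratios(small, standard):
    """Per-stage ratio of small-decoder width to standard-decoder width."""
    return [s / t for s, t in zip(small, standard)]


def conv_cost_ratios(small, standard):
    """Per-stage cost ratio under the ASSUMPTION that cost ~ C_in * C_out,
    i.e. roughly the square of the channel width for a same-width conv."""
    return [(s / t) ** 2 for s, t in zip(small, standard)]


ratios = width_ratios(SMALL_CHANNELS, STANDARD_CHANNELS)
costs = conv_cost_ratios(SMALL_CHANNELS, STANDARD_CHANNELS)

print("width ratios:", ratios)        # each stage is 0.75
print("assumed cost ratios:", costs)  # each stage is 0.5625
```

Under this assumption every stage is 0.75× as wide and would cost about 0.56× as much, an upper bound of roughly 1.78× on the speedup. The reported ~1.4× is lower, which would be consistent with part of the runtime and memory not scaling with the square of the width, though the post gives no breakdown.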

prithivMLmods posted an update about 1 month ago

Post

4236

A collection of various compression schemes for Gemma 4, and for version 1 of the abliterated dense models, is now available on the Hub. Check it out via the links below. 👇

🔗 Gemma 4 Compression(s): https://huggingface.co/collections/prithivMLmods/gemma-4-compressions
🔗 Gemma 4 Uncensored [MAX] + Compression(s) [β]: https://huggingface.co/collections/prithivMLmods/gemma-4-uncensored-max-compressions
🔗 Gemma 4 Compression(s), MoE: https://huggingface.co/collections/prithivMLmods/gemma-4-compressions-moe
🔗 Gemma-4 F32 GGUF: https://huggingface.co/collections/prithivMLmods/gemma-4-f32-gguf

🤗 > To learn more, visit the app page or the respective model pages.

prithivMLmods posted an update about 1 month ago

Post

2329

Now the demo for image detection based on SAM3 and Gemma-4 (*Filter) is available on Spaces, using full-fledged Transformers inference with multimodal reasoning for processed images. It also supports video segmentation (mask), video segmentation (annotation), and image click segmentation.

🤗 Demo Space: prithivMLmods/SAM3-Gemma4-CUDA
🥽 SAM3: facebook/sam3
🔗 gemma-4-E2B-it: google/gemma-4-E2B-it

To learn more, visit the app page or the respective model pages.

1 reply

yangxue authored a paper about 1 month ago

Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

Paper • 2604.05015 • Published Apr 6 • 235


Team members 118
