AV LLMs
A collection of Audio, Video and Visual LLMs.
-
Text-to-Speech • Updated • 488 -
OpenVoice
🤗 1.13k • Generate speech in a cloned voice from a short audio sample
-
dataautogpt3/ProteusV0.3
Text-to-Image • Updated • 37.8k • 96 -
ByteDance/SDXL-Lightning
Text-to-Image • Updated • 211k • 2.15k -
openai/whisper-large-v3
Automatic Speech Recognition • 2B • Updated • 4.94M • 5.68k -
stabilityai/TripoSR
Image-to-3D • Updated • 199k • 613 -
Efficient-Large-Model/VILA-7b
Text Generation • 7B • Updated • 594 • 27 -
google/paligemma-3b-pt-896
Image-Text-to-Text • 3B • Updated • 715 • 124 -
microsoft/Phi-3-vision-128k-instruct
Text Generation • Updated • 201k • 971 -
stabilityai/stable-audio-open-1.0
Text-to-Audio • Updated • 21.3k • 1.46k -
OpenVLA: An Open-Source Vision-Language-Action Model
Paper • 2406.09246 • Published • 47 -
aiola/whisper-medusa-v1
Updated • 134 • 178 -
merve/idefics3llama-vqav2
Updated • 8 -
black-forest-labs/FLUX.1-schnell
Text-to-Image • Updated • 689k • 4.87k -
Llama3.1 S V0.2 Checkpoint 2024 08 20
😻 115 • Convert text to audio and vice versa
-
gpt-omni/mini-omni
Text-to-Speech • Updated • 438 -
fishaudio/fish-speech-1.4
Text-to-Speech • Updated • 947 • 457 -
Tonic's GOT OCR
📲 179 • GOT-OCR (from UCAS, Beijing)
-
stepfun-ai/GOT-OCR2_0
Image-Text-to-Text • Updated • 126k • 1.54k -
apple/coreml-sam2-large
Mask Generation • Updated • 87 • 34 -
coreml-projects/sam-2-studio
Updated • 28 -
mistralai/Pixtral-12B-2409
Updated • 4.58k • 689 -
allenai/Molmo-72B-0924
Image-Text-to-Text • 73B • Updated • 4.38k • 298 -
openai/whisper-large-v3-turbo
Automatic Speech Recognition • 0.8B • Updated • 7.01M • 3.01k -
Revai/reverb-asr
Automatic Speech Recognition • Updated • 6 • 93 -
GOT Online
💬 360 • Extract text from images using various OCR modes
-
facebook/vfusion3d
Image-to-3D • 0.5B • Updated • 25 • 65 -
facebook/cotracker
Updated • 503 • 36 -
rhymes-ai/Aria
Image-Text-to-Text • 25B • Updated • 100k • 637 -
SWivid/F5-TTS
Text-to-Speech • Updated • 482k • 1.17k -
Ichigo Llama3.1 S Instruct
🏢 64 • Generate text from audio recordings
-
kyutai/moshiko-mlx-q4
Updated • 665 • 29 -
kyutai/moshiko-mlx-q8
Updated • 4.05k • 5 -
Open VLM Video Leaderboard
🌎 134 • VLMEvalKit evaluation results on video understanding benchmarks
-
jimmycarter/LibreFLUX
Text-to-Image • Updated • 37 • 173 -
microsoft/OmniParser
Image-Text-to-Text • Updated • 267 • 1.71k -
Aya Models
🌍 337 • Interact with the Aya family of models.
-
CohereLabs/aya-expanse-32b
Text Generation • 32B • Updated • 12.5k • 293 -
stabilityai/stable-diffusion-3.5-medium
Text-to-Image • Updated • 507k • 935 -
OuteAI/OuteTTS-0.1-350M
Text-to-Speech • Updated • 86 • 302 -
vidore/colpali
Visual Document Retrieval • Updated • 2.32k • 477 -
vidore/colpali-v1.2
Visual Document Retrieval • Updated • 29.8k • 112 -
si-pbc/hertz-dev
Audio-to-Audio • Updated • 215 -
Talk To Ultravox
⚡ 38 • Talk to Fixie.ai's Ultravox with WebRTC ⚡️
-
LLaVA-o1: Let Vision Language Models Reason Step-by-Step
Paper • 2411.10440 • Published • 131 -
Xkev/Llama-3.2V-11B-cot
Image-Text-to-Text • 11B • Updated • 1.64k • 158 -
google/paligemma-3b-pt-224
Image-Text-to-Text • Updated • 138k • 442 -
apple/coreml-mobileclip
Updated • 987 • 53 -
InstantX/InstantIR
Image-to-Image • Updated • 6 • 180 -
InstantIR
🖼 85 • Diffusion-based image restoration model
-
Flux IP Adapter
🖼 170 • Prompt with images in FLUX.1 [dev]
-
Image Preferences - Argilla annotation space
🖼 39 • A community project to create an image preferences dataset.
-
fishaudio/fish-speech-1.5
Text-to-Speech • Updated • 6.34k • 743 -
meta-llama/Llama-3.3-70B-Instruct
Text Generation • 71B • Updated • 868k • 2.76k -
Paligemma2 Vqav2
🐨 48 • PaliGemma2 LoRA fine-tuned on VQAv2
-
VisionZip: Longer is Better but Not Necessary in Vision Language Models
Paper • 2412.04467 • Published • 118 -
fancyfeast/llama-joycaption-alpha-two-hf-llava
Updated • 316k • 208 -
taohu/mask
Updated • 5 -
[MASK] is All You Need
Paper • 2412.06787 • Published • 2 -
Open VLM Leaderboard
🌎 1.01k • VLMEvalKit Evaluation Results Collection
-
microsoft/LLM2CLIP-Llama3.2-1B-EVA02-L-14-336
Zero-Shot Image Classification • Updated • 11 -
LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation
Paper • 2411.04997 • Published • 39 -
Generative Powers of Ten
Paper • 2312.02149 • Published • 8 -
StoryStar
💬 24 • Fantasy story generator
-
GoodiesHere/Apollo-LMMs-Apollo-7B-t32
Video-Text-to-Text • Updated • 43 • 57 -
Apollo: An Exploration of Video Understanding in Large Multimodal Models
Paper • 2412.10360 • Published • 147 -
Qwen/Qwen2-VL-7B-Instruct
Image-Text-to-Text • 8B • Updated • 3.38M • 1.27k -
XiaoduoAILab/Xmodel_VLM
Text Generation • 2B • Updated • 174 • 13 -
nvidia/Cosmos-1.0-Diffusion-14B-Text2World
Updated • 526 • 60 -
nvidia/Cosmos-1.0-Autoregressive-12B
Updated • 1 • 30 -
nvidia/Cosmos-1.0-Autoregressive-13B-Video2World
Updated • 6 • 32 -
nvidia/Cosmos-1.0-Diffusion-7B-Text2World
Text-to-Video • Updated • 3.71k • 233 -
nvidia/Cosmos-1.0-Diffusion-14B-Video2World
Updated • 9 • 57 -
Stable Point-Aware 3D
⚡ 467 • Generate 3D models from images
-
hexgrad/Kokoro-82M
Text-to-Speech • Updated • 9.88M • 6.14k -
Kokoro TTS
❤ 3.31k • Upgraded to v1.0!
-
openbmb/MiniCPM-o-2_6
Any-to-Any • 9B • Updated • 289k • 1.29k -
TTS Spaces Arena
🤗 483 • Blind vote on HF TTS models!
-
google/paligemma2-10b-pt-896
Image-Text-to-Text • Updated • 198 • 32 -
NovaSky-AI/Sky-T1-32B-Preview
Text Generation • 33B • Updated • 81 • 548 -
MiniMaxAI/MiniMax-VL-01
Image-Text-to-Text • Updated • 153k • 283 -
SmolVLM
📊 66 • Generate descriptions from images and text prompts
-
HKUSTAudio/Llasa-3B
Text-to-Speech • 4B • Updated • 175 • 526 -
HuggingFaceTB/SmolVLM-500M-Instruct
Image-Text-to-Text • 0.5B • Updated • 177k • 193 -
deepseek-ai/Janus-Pro-7B
Any-to-Any • Updated • 32.6k • 3.6k -
Kokoro TTS Zero
🎴 308 • ✨[With v1.0.0] Accelerated TTS on Kokoro-82M
-
kyutai/hibiki-2b-mlx-bf16
Translation • Updated • 13 • 22 -
kyutai/hibiki-2b-pytorch-bf16
Translation • Updated • 144 • 61 -
ARTPARK-IISc/Vaani
Viewer • Updated • 22.6M • 10.2k • 117 -
Zyphra/Zonos-v0.1-hybrid
Text-to-Speech • Updated • 2.46k • 1.11k -
Zyphra/Zonos-v0.1-transformer
Text-to-Speech • Updated • 9.32k • 431 -
microsoft/OmniParser-v2.0
Updated • 1.17k • 1.33k -
Paligemma2 Mix
🌖 97 • Generate text and segment images using PaliGemma 2
-
google/paligemma2-3b-mix-448
Image-Text-to-Text • Updated • 6.62k • 58 -
google/paligemma2-3b-mix-224
Image-Text-to-Text • 3B • Updated • 41k • 52 -
google/paligemma2-28b-mix-224
Image-Text-to-Text • 28B • Updated • 86 • 5 -
google/paligemma2-28b-mix-448
Image-Text-to-Text • Updated • 390 • 28 -
google/paligemma2-10b-mix-224
Image-Text-to-Text • 10B • Updated • 593 • 11 -
google/paligemma2-10b-mix-448
Image-Text-to-Text • Updated • 1.29k • 36 -
stepfun-ai/stepvideo-t2v
Text-to-Video • Updated • 53 • 477 -
stepfun-ai/stepvideo-t2v-turbo
Updated • 98 -
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
Paper • 2502.10248 • Published • 57 -
HuggingFaceTB/SmolVLM2-2.2B-Instruct
Image-Text-to-Text • Updated • 238k • 317 -
nvidia/canary-1b
Automatic Speech Recognition • Updated • 2.64k • 457 -
Wan-AI/Wan2.1-I2V-14B-720P
Image-to-Video • Updated • 21.8k • 577 -
fastrtc/kokoro-onnx
Updated • 13 -
Fastphone
🐠 2 • Download and run a Hugging Face app
-
microsoft/Phi-4-multimodal-instruct
Automatic Speech Recognition • 6B • Updated • 395k • 1.6k -
microsoft/Magma-8B
Robotics • 9B • Updated • 728 • 415 -
Magma UI
📚 47 • Magma-8B model for UI Agents
-
Di♪♪Rhythm
🎶 687 • Blazingly Fast and Embarrassingly Simple Song Generation
-
DiffRhythm: Blazingly Fast and Embarrassingly Simple End-to-End Full-Length Song Generation with Latent Diffusion
Paper • 2503.01183 • Published • 29 -
ASLP-lab/DiffRhythm-vae
Updated • 42 -
ASLP-lab/DiffRhythm-base
Updated • 64 • 171 -
Large Language Diffusion Models
Paper • 2502.09992 • Published • 127 -
GSAI-ML/LLaDA-8B-Instruct
Text Generation • Updated • 519k • 358 -
unsloth/gemma-3-12b-pt
Image-Text-to-Text • 12B • Updated • 171 • 5 -
google/gemma-3-27b-it
Image-Text-to-Text • 27B • Updated • 625k • 1.97k -
sesame/csm-1b
Text-to-Speech • Updated • 187k • 2.37k -
unsloth/gemma-3-27b-it-GGUF
Image-Text-to-Text • 27B • Updated • 26.8k • 199 -
docling-project/SmolDocling-256M-preview
Image-Text-to-Text • Updated • 30.2k • 1.61k -
starvector/starvector-8b-im2svg
Text Generation • Updated • 3.34k • 550 -
starvector/starvector-1b-im2svg
Text Generation • 1B • Updated • 30.1k • 190 -
Tokenize Image as a Set
Paper • 2503.16425 • Published • 16 -
kyutai/moshika-vis-pytorch-bf16
Updated • 58 -
kyutai/Babillage
Viewer • Updated • 465k • 241 • 13 -
ByteDance/InfiniteYou
Text-to-Image • Updated • 987 • 642 -
InfiniteYou: Flexible Photo Recrafting While Preserving Your Identity
Paper • 2503.16418 • Published • 36 -
openfree/flux-chatgpt-ghibli-lora
Text-to-Image • Updated • 836 • 511 -
Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources
Paper • 2504.00595 • Published • 37 -
weizhiwang/Open-Qwen2VL
Image-Text-to-Text • Updated • 27 • 21 -
ostris/Flex.1-alpha-Redux
Text-to-Image • Updated • 164 • 117 -
unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit
Image-Text-to-Text • 112B • Updated • 684 • 80 -
unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-8bit
Image-Text-to-Text • 109B • Updated • 30 • 9 -
SmolVLM: Redefining small and efficient multimodal models
Paper • 2504.05299 • Published • 207 -
canopylabs/3b-hi-ft-research_release
Text-to-Speech • 3B • Updated • 520 • 25 -
canopylabs/3b-es_it-ft-research_release
Text-to-Speech • 3B • Updated • 1.52k • 18 -
nvidia/C-RADIOv2-g
Image Feature Extraction • 1B • Updated • 71 • 12 -
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Paper • 2504.10479 • Published • 309 -
OpenGVLab/InternVL3-1B
Image-Text-to-Text • 0.9B • Updated • 117k • 84 -
OpenGVLab/InternVL3-78B
Image-Text-to-Text • Updated • 42.4k • 234 -
InteractVLM: 3D Interaction Reasoning from 2D Foundational Models
Paper • 2504.05303 • Published • 5 -
Dia 1.6B
👯 1.78k • Generate realistic dialogue from a script, using Dia!
-
nari-labs/Dia-1.6B
Text-to-Speech • Updated • 42.5k • 2.86k -
Describe Anything: Detailed Localized Image and Video Captioning
Paper • 2504.16072 • Published • 64 -
nvidia/DAM-3B-Self-Contained
Image-Text-to-Text • Updated • 676 • 24 -
nvidia/DAM-3B-Video
Image-Text-to-Text • Updated • 529 • 58 -
nvidia/DAM-3B
Image-Text-to-Text • Updated • 22.2k • 130 -
Qwen/Qwen2.5-Omni-3B
Any-to-Any • Updated • 940k • 335 -
MMaDA: Multimodal Large Diffusion Language Models
Paper • 2505.15809 • Published • 98 -
One RL to See Them All: Visual Triple Unified Reinforcement Learning
Paper • 2505.18129 • Published • 62 -
PlayDiffusion
🎨 120 • Generate modified audio from text and voice
-
lerobot/smolvla_base
Robotics • Updated • 31.6k • 378 -
stockmark/Stockmark-2-VL-100B-beta
Image-Text-to-Text • 96B • Updated • 19 • 22 -
Qwen/Qwen2.5-Omni-7B
Any-to-Any • Updated • 663k • 1.89k -
Qwen2.5-Omni Technical Report
Paper • 2503.20215 • Published • 172 -
Chatterbox TTS
🍿 1.73k • Expressive zero-shot TTS
-
ResembleAI/chatterbox
Text-to-Speech • Updated • 2.23M • 1.58k -
ByteDance/Dolphin
Image-Text-to-Text • Updated • 294 • 515 -
nanonets/Nanonets-OCR-s
Image-Text-to-Text • 4B • Updated • 173k • 1.59k -
Nanonets Ocr S
👁 35 • https://nanonets.com/research/nanonets-ocr-s/
-
calcuis/cosmos-predict2-gguf
Text-to-Image • 14B • Updated • 4.48k • 32 -
Arrexel/pattern-diffusion
Text-to-Image • Updated • 20 • 111 -
numind/NuMarkdown-8B-Thinking
Image-to-Text • Updated • 92.7k • 452 -
Qwen/Qwen-Image
Text-to-Image • Updated • 172k • 2.49k -
rednote-hilab/dots.ocr
Image-Text-to-Text • 3B • Updated • 196k • 1.3k -
Runware/Qwen-Image-Edit
Image-to-Image • Updated • 339 • 17 -
Qwen Image Edit
✒ 825 • Edit images based on your written instructions
-
Qwen/Qwen-Image-Edit
Image-to-Image • Updated • 63.3k • 2.39k -
zju-community/matchanything_eloftr
16.1M • Updated • 7.79k • 84 -
MatchAnything
🏢 256 • Find similar images and match them across collections
-
microsoft/VibeVoice-1.5B
Text-to-Speech • 3B • Updated • 255k • 2.37k -
bytedance-research/USO
Text-to-Image • Updated • 228 • 191 -
FastVLM WebGPU
🍎 446 • Real-time video captioning powered by FastVLM
-
onnx-community/FastVLM-0.5B-ONNX
Image-Text-to-Text • Updated • 381 • 108 -
apple/FastVLM-0.5B
Text Generation • 0.8B • Updated • 14.6k • 391 -
Qwen/Qwen3-Omni-30B-A3B-Instruct
Any-to-Any • 35B • Updated • 992k • 922 -
smolagents/SmolVLM2-2.2B-Instruct-Agentic-GUI
Image-Text-to-Text • 2B • Updated • 99 • 65 -
facebook/sam2.1-hiera-large
Mask Generation • 0.2B • Updated • 83k • 137 -
PaddlePaddle/PaddleOCR-VL
Image-Text-to-Text • 1.0B • Updated • 10.3k • 1.6k -
PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
Paper • 2510.14528 • Published • 124 -
PaddleOCR-VL Online Demo
📈 243 • Extract text, tables, formulas, and charts from images
-
nanonets/Nanonets-OCR2-3B
Image-Text-to-Text • 4B • Updated • 367k • 505 -
nanonets/Nanonets-OCR2-1.5B-exp
Image-Text-to-Text • 2B • Updated • 935 • 48 -
deepseek-ai/DeepSeek-OCR
Image-Text-to-Text • 3B • Updated • 2.86M • 3.23k -
lightonai/LightOnOCR-1B-1025
Image-to-Text • Updated • 167k • 249 -
Qwen Image Edit Camera Control
🎬 2.22k • Fast 4-step inference with Qwen Image Edit 2509
-
Qwen Image Edit Camera Control
🎬 35 • Fast 4-step inference with Qwen Image Edit 2509
-
depth-anything/DA3NESTED-GIANT-LARGE
Depth Estimation • 2B • Updated • 106k • 45 -
microsoft/Fara-7B
Image-Text-to-Text • Updated • 18.6k • 494 -
tencent/HunyuanOCR
Image-Text-to-Text • 1.0B • Updated • 190k • 749 -
ApsaraStackMaaS/EvoQwen2.5-VL-Retriever-7B-v1
Visual Document Retrieval • 8B • Updated • 36 • 17 -
apple/starflow
Updated • 283 -
microsoft/VibeVoice-Realtime-0.5B
Text-to-Speech • 1B • Updated • 839k • 1.22k -
ServiceNow-AI/Apriel-1.5-15b-Thinker
Image-Text-to-Text • Updated • 202 • 468 -
zai-org/GLM-4.6V-Flash
Image-Text-to-Text • 10B • Updated • 42.6k • 601 -
Qwen-Image-Layered: Towards Inherent Editability via Layer Decomposition
Paper • 2512.15603 • Published • 69 -
Qwen/Qwen-Image-Edit-2511
Image-to-Image • Updated • 204k • 983 -
microsoft/TRELLIS.2-4B
Image-to-3D • Updated • 843 -
inclusionAI/TwinFlow-Z-Image-Turbo
Text-to-Image • Updated • 19 • 213 -
Phr00t/Qwen-Image-Edit-Rapid-AIO
Text-to-Image • Updated • 2.01k -
LiquidAI/LFM2.5-Audio-1.5B
Audio-to-Audio • 1B • Updated • 743 • 397 -
LiquidAI/LFM2.5-VL-1.6B
Image-Text-to-Text • 2B • Updated • 111k • 283 -
black-forest-labs/FLUX.2-klein-9B
Image-to-Image • Updated • 141k • 720 -
black-forest-labs/FLUX.2-dev
Image-to-Image • Updated • 208k • 1.65k -
nvidia/personaplex-7b-v1
Audio-to-Audio • Updated • 456k • 2.49k -
rootsautomation/GutenOCR-3B
Image-Text-to-Text • 4B • Updated • 420 • 26 -
rootsautomation/GutenOCR-7B
Image-Text-to-Text • 8B • Updated • 148 • 25 -
microsoft/VibeVoice-ASR
Automatic Speech Recognition • 9B • Updated • 519k • 1.13k -
Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign
Text-to-Speech • 2B • Updated • 469k • 339 -
Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
Text-to-Speech • 2B • Updated • 1.59M • 1.47k -
Qwen3-TTS Demo
🎙 1.92k • Generate speech audio from text with custom or cloned voices
-
moonshotai/Kimi-K2.5
Image-Text-to-Text • 1.1T • Updated • 1.88M • 2.78k -
ibm-granite/granite-vision-3.3-2b-chart2csv-preview
Image-Text-to-Text • 3B • Updated • 1.23k • 16 -
Qwen3-ASR Demo
🎙 136 • Transcribe audio to text with multi-language timestamps
-
Qwen/Qwen3-ASR-1.7B
Automatic Speech Recognition • 2B • Updated • 2.02M • 800 -
Qwen/Qwen3.5-35B-A3B
Image-Text-to-Text • 36B • Updated • 3.38M • 1.42k -
Robot Learning: A Tutorial
Paper • 2510.12403 • Published • 130 -
YatharthS/LuxTTS
Text-to-Speech • Updated • 7.85k • 191 -
LuxTTS
🚀 106 • Space for LuxTTS: a 150x realtime voice cloning TTS model
-
google/gemma-4-31B
Image-Text-to-Text • 33B • Updated • 379k • 366 -
google/gemma-4-31B-it
Image-Text-to-Text • 33B • Updated • 9.12M • 2.61k -
google/gemma-4-E2B-it
Any-to-Any • 5B • Updated • 3.39M • 602 -
google/gemma-4-E4B-it
Any-to-Any • 8B • Updated • 5.66M • 980 -
Lightricks/LTX-Video
Image-to-Video • Updated • 387k • 2.17k -
unsloth/Qwen3.6-27B-GGUF
Image-Text-to-Text • 27B • Updated • 1.47M • 663 -
Qwen/Qwen3.6-27B
Image-Text-to-Text • 28B • Updated • 2.45M • 1.26k -
Qwen/Qwen3.6-35B-A3B
Image-Text-to-Text • 36B • Updated • 3.86M • 1.74k -
facebook/sapiens2
Updated • 130 -
google/gemma-4-E4B-it-assistant
Any-to-Any • 78.8M • Updated • 51.8k • 77