Multimodal Papers
From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations (arXiv:2401.01885)
Media2Face: Co-speech Facial Animation Generation With Multi-Modality Guidance (arXiv:2401.15687)
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action (arXiv:2312.17172)
MouSi: Poly-Visual-Expert Vision-Language Models (arXiv:2401.17221)
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models (arXiv:2401.15947)
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities (arXiv:2308.12966)
Loop Copilot: Conducting AI Ensembles for Music Generation and Iterative Editing (arXiv:2310.12404)
Multimodal Foundation Models: From Specialists to General-Purpose Assistants (arXiv:2309.10020)
Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants (arXiv:2310.00653)
Sequence to Sequence Learning with Neural Networks (arXiv:1409.3215)
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models (arXiv:2402.05935)
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents (arXiv:2311.05437)
Qwen Technical Report (arXiv:2309.16609)
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models (arXiv:2311.07919)
Lumos: Empowering Multimodal LLMs with Scene Text Recognition (arXiv:2402.08017)
World Model on Million-Length Video And Language With RingAttention (arXiv:2402.08268)
A Touch, Vision, and Language Dataset for Multimodal Alignment (arXiv:2402.13232)
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment (arXiv:2403.05135)