SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing
Abstract
SAMA factorizes instruction-guided video editing into semantic anchoring and motion modeling, achieving precise edits with preserved motion through sparse anchor-frame prediction and pre-training on motion-centric restoration tasks.
Current instruction-guided video editing models struggle to balance precise semantic modification with faithful motion preservation. Existing approaches mitigate this by injecting explicit external priors (e.g., VLM features or structural conditions), but this reliance severely bottlenecks robustness and generalization. To overcome this limitation, we present SAMA (factorized Semantic Anchoring and Motion Alignment), a framework that factorizes video editing into semantic anchoring and motion modeling. First, we introduce Semantic Anchoring, which establishes a reliable visual anchor by jointly predicting semantic tokens and video latents at sparse anchor frames, enabling purely instruction-aware structural planning. Second, Motion Alignment pre-trains the same backbone on motion-centric video restoration pretext tasks (cube inpainting, speed perturbation, and tube shuffle), so the model internalizes temporal dynamics directly from raw videos. SAMA is optimized with a two-stage pipeline: a factorized pre-training stage that learns inherent semantic-motion representations without paired video-instruction editing data, followed by supervised fine-tuning on paired editing data. Remarkably, the factorized pre-training alone already yields strong zero-shot video editing ability, validating the proposed factorization. SAMA achieves state-of-the-art performance among open-source models and is competitive with leading commercial systems (e.g., Kling-Omni). Code, models, and datasets will be released.
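For intuition, here is a minimal sketch of the three motion-centric restoration pretext tasks the abstract names (cube inpainting, speed perturbation, and tube shuffle), written against a raw video tensor. All shapes, corruption ratios, and function names are illustrative assumptions; the abstract does not specify the actual corruption parameters, and this is not the paper's implementation.

```python
# Minimal sketch of the three motion-centric restoration pretext tasks
# (cube inpainting, speed perturbation, tube shuffle). Shapes, ratios,
# and function names are illustrative assumptions, not SAMA's actual code.
import torch


def cube_inpainting(video: torch.Tensor, cube=(4, 32, 32), ratio=0.3) -> torch.Tensor:
    """Zero out random spatio-temporal cubes; the model restores them."""
    T, _, H, W = video.shape
    out = video.clone()
    t_c, h_c, w_c = cube  # assumes T >= t_c, H >= h_c, W >= w_c
    n_cubes = int(ratio * (T // t_c) * (H // h_c) * (W // w_c))
    for _ in range(n_cubes):
        t0 = torch.randint(0, T - t_c + 1, (1,)).item()
        h0 = torch.randint(0, H - h_c + 1, (1,)).item()
        w0 = torch.randint(0, W - w_c + 1, (1,)).item()
        out[t0:t0 + t_c, :, h0:h0 + h_c, w0:w0 + w_c] = 0.0
    return out


def speed_perturbation(video: torch.Tensor, factor: int = 2) -> torch.Tensor:
    """Play the clip at the wrong speed; the model restores the true rate."""
    T = video.shape[0]
    fast = video[torch.arange(0, T, factor)]          # subsample frames
    return fast.repeat_interleave(factor, dim=0)[:T]  # pad back to T frames


def tube_shuffle(video: torch.Tensor, tube=(32, 32)) -> torch.Tensor:
    """Permute frames independently inside each spatial tube, breaking
    local temporal order that the model must re-align."""
    T, _, H, W = video.shape
    out = video.clone()
    h_t, w_t = tube
    for h0 in range(0, H - h_t + 1, h_t):
        for w0 in range(0, W - w_t + 1, w_t):
            perm = torch.randperm(T)
            out[:, :, h0:h0 + h_t, w0:w0 + w_t] = \
                video[perm, :, h0:h0 + h_t, w0:w0 + w_t]
    return out


if __name__ == "__main__":
    clip = torch.randn(16, 3, 128, 128)   # toy (T, C, H, W) clip
    corrupted = tube_shuffle(clip)        # apply one pretext corruption
    # A restoration backbone would be trained to map `corrupted` -> `clip`,
    # forcing it to internalize temporal dynamics from raw video alone.
```

Presumably one corruption is sampled per training clip, with the clean video as the reconstruction target; the abstract does not state how the tasks are mixed.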
Community
SAMA factorizes instruction-guided video editing into semantic anchoring and motion alignment, improving edit precision while preserving the source video's temporal dynamics.
Interesting breakdown of this paper on arXivLens: https://arxivlens.com/PaperView/Details/sama-factorized-semantic-anchoring-and-motion-alignment-for-instruction-guided-video-editing-8052-de200caf
Covers the executive summary, detailed methodology, and practical applications.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Omni-Video 2: Scaling MLLM-Conditioned Diffusion for Unified Video Generation and Editing (2026)
- Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance (2026)
- DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing (2026)
- NOVA: Sparse Control, Dense Synthesis for Pair-Free Video Editing (2026)
- Tele-Omni: a Unified Multimodal Framework for Video Generation and Editing (2026)
- ViFeEdit: A Video-Free Tuner of Your Video Diffusion Transformer (2026)
- PISCO: Precise Video Instance Insertion with Sparse Control (2026)