# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

InfiniteTalk is a talking video generator that creates realistic talking head videos with accurate lip-sync. It supports two modes:

- **Image-to-Video**: Transform static portraits into talking videos using audio input
- **Video Dubbing**: Re-sync existing videos with new audio while maintaining natural movements

Built on the Wan2.1 diffusion model with specialized audio conditioning for photorealistic results.

## Architecture

### Core Components

**Main Application** (`app.py`)
- Gradio interface with ZeroGPU support via the `@spaces.GPU(duration=180)` decorator
- Two-tab interface: Image-to-Video and Video Dubbing
- Lazy model loading on first inference to minimize startup time
- Global `ModelManager` and `GPUManager` instances for resource management (lazy-loading pattern sketched below)
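A minimal, self-contained sketch of the lazy-loading pattern behind the `@spaces.GPU` decorator; `_get_pipeline` and `generate` are illustrative placeholders, not the actual functions in `app.py`:

```python
import spaces  # HuggingFace Spaces SDK (provides the ZeroGPU decorator)

_PIPELINE = None  # module-level cache so the expensive load happens once per worker


def _get_pipeline():
    """Lazy loader: the first call pays the download/load cost, later calls reuse the cache."""
    global _PIPELINE
    if _PIPELINE is None:
        _PIPELINE = object()  # placeholder; app.py builds the real InfiniteTalk pipeline here
    return _PIPELINE


@spaces.GPU(duration=180)  # ZeroGPU attaches a GPU only for the decorated call, for up to 180 s
def generate(image_path: str, audio_path: str, resolution: str) -> str:
    pipeline = _get_pipeline()
    # ... run inference with `pipeline` and write the resulting MP4 ...
    return "output.mp4"
```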
**Model Pipeline** (`wan/multitalk.py`)
- `InfiniteTalkPipeline`: Main generation pipeline using the Wan2.1-I2V-14B model
- Supports two resolutions: 480p (640x640) and 720p (960x960)
- Uses diffusion-based generation with audio conditioning
- Implements chunked processing for long videos to manage memory

**Audio Processing** (`src/audio_analysis/wav2vec2.py`)
- Custom `Wav2Vec2Model` extending HuggingFace's implementation
- Extracts audio embeddings with temporal interpolation via `linear_interpolation`
- Processes audio at 16kHz with loudness normalization (pyloudnorm)
- Stacks hidden states from all encoder layers for a rich audio representation (extraction sketched below)
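A sketch of that extraction path using the stock `transformers` classes for illustration; the repo uses its own `Wav2Vec2Model` subclass, and the -23 LUFS target here is an assumption:

```python
import librosa
import pyloudnorm as pyln
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model  # the repo subclasses this model

audio, sr = librosa.load("speech.wav", sr=16000)  # Wav2Vec2 expects 16 kHz mono audio
meter = pyln.Meter(sr)
loudness = meter.integrated_loudness(audio)
audio = pyln.normalize.loudness(audio, loudness, -23.0)  # target LUFS is an assumption

model_id = "TencentGameMate/chinese-wav2vec2-base"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_id)
encoder = Wav2Vec2Model.from_pretrained(model_id)

inputs = feature_extractor(audio, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    embeddings = encoder(inputs.input_values, output_hidden_states=True)
# embeddings.hidden_states[1:] are the per-layer outputs stacked later (see "Audio Embedding
# Format" below); the custom subclass also interpolates features in time to match video frames.
```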
**Model Management** (`utils/model_loader.py`)
- `ModelManager`: Handles lazy loading and caching of models from the HuggingFace Hub
- Downloads three model types:
  - Wan2.1-I2V-14B: Main video generation model (Kijai/WanVideo_comfy)
  - InfiniteTalk weights: Specialized talking head weights (MeiGen-AI/InfiniteTalk)
  - Wav2Vec2: Audio encoder (TencentGameMate/chinese-wav2vec2-base)
- Models cached in `HF_HOME` or `/data/.huggingface` (download sketch below)
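Roughly what the download step looks like with `huggingface_hub`; the repo IDs and cache location come from this document, while the helper name and the use of `snapshot_download` are illustrative assumptions:

```python
import os
from huggingface_hub import snapshot_download

CACHE_DIR = os.environ.get("HF_HOME", "/data/.huggingface")


def download_all_weights():
    """Fetch (or reuse from cache) the three model repos used by the pipeline."""
    wan_dir = snapshot_download("Kijai/WanVideo_comfy", cache_dir=CACHE_DIR)
    infinitetalk_dir = snapshot_download("MeiGen-AI/InfiniteTalk", cache_dir=CACHE_DIR)
    wav2vec_dir = snapshot_download("TencentGameMate/chinese-wav2vec2-base", cache_dir=CACHE_DIR)
    return wan_dir, infinitetalk_dir, wav2vec_dir
```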
**GPU Management** (`utils/gpu_manager.py`)
- `GPUManager`: Monitors memory usage and performs cleanup
- Calculates ZeroGPU duration based on video length and resolution
- Memory estimation: ~20GB base + 0.8GB/s (480p) or 1.5GB/s (720p) of video
- Recommends chunking for videos requiring >50GB memory (estimator written out below)
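The estimate written out as code; the constants mirror the bullets above, but the real `GPUManager` may compute this differently:

```python
def estimate_memory_gb(video_seconds: float, resolution: str) -> float:
    """Rough VRAM estimate: ~20 GB of weights plus a per-second cost that scales with resolution."""
    per_second_gb = 0.8 if resolution == "480p" else 1.5
    return 20.0 + video_seconds * per_second_gb


# Anything estimated above ~50 GB should fall back to chunked processing.
print(estimate_memory_gb(30, "720p"))         # 65.0
print(estimate_memory_gb(30, "720p") > 50.0)  # True -> chunk
```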
**Configuration** (`wan/configs/__init__.py`)
- `WAN_CONFIGS`: Model configurations for different tasks (t2v, i2v, infinitetalk)
- `SIZE_CONFIGS`: Resolution mappings (infinitetalk-480: 640x640, infinitetalk-720: 960x960)
- `SUPPORTED_SIZES`: Valid resolution options per model type (shape illustrated below)
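An illustrative shape for the two size mappings; the keys and resolutions come from this document, the exact container types are assumptions:

```python
SIZE_CONFIGS = {
    "infinitetalk-480": (640, 640),  # width x height for the 480p preset
    "infinitetalk-720": (960, 960),  # width x height for the 720p preset
}

SUPPORTED_SIZES = {
    "infinitetalk": ("infinitetalk-480", "infinitetalk-720"),
}
```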
### Data Flow

1. **Audio Processing**: Audio file → librosa load → loudness normalization → Wav2Vec2 feature extraction → audio embeddings (shape: [seq_len, batch, dim])
2. **Input Processing**: Image/video → PIL/cache_video → frame extraction → resize and center crop to target resolution
3. **Generation**: InfiniteTalk pipeline combines visual input + audio embeddings → diffusion sampling → video tensor
4. **Output**: Video tensor → save_video_ffmpeg with audio track → MP4 file

### Key Design Patterns

- **Lazy Loading**: Models only loaded on first inference to reduce cold start time
- **Memory Management**: Aggressive cleanup with `torch.cuda.empty_cache()` and `gc.collect()` after generation (see below)
- **ZeroGPU Integration**: `@spaces.GPU` decorator with calculated duration based on video length
- **Offloading**: Models can be offloaded to CPU between forward passes to save VRAM
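The cleanup pattern referenced above, as a standalone helper; this is a sketch of what `GPUManager` likely wraps, not its actual code:

```python
import gc

import torch


def cleanup_gpu_memory():
    """Release cached CUDA blocks and collect Python garbage after a generation run."""
    gc.collect()  # drop Python references to large tensors first
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # return cached allocator blocks to the driver
```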
## Development Commands

### Docker Build and Run

```bash
# Build Docker image
docker build -t infinitetalk .

# Run locally
docker run -p 7860:7860 --gpus all infinitetalk
```

### Python Environment

```bash
# Install dependencies (requires PyTorch 2.5.1+ for xfuser compatibility)
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
pip install flash-attn==2.7.4.post1 --no-build-isolation  # Optional, may fail on some systems
pip install -r requirements.txt

# Run application
python app.py
```

### System Dependencies

Required packages (see `packages.txt`):

- ffmpeg (video processing)
- build-essential (compilation)
- libsndfile1 (audio I/O)
- git (model downloads)
## Important Implementation Details

### Resolution Handling

- User selects "480p" or "720p" in the UI
- Internally mapped to `infinitetalk-480` (640x640) or `infinitetalk-720` (960x960)
- `sample_shift` parameter: 7 for 480p, 11 for 720p (controls diffusion sampling); see the mapping below
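The same mapping collected in one place; the values come from this document, but the dict itself is illustrative rather than the repo's code:

```python
RESOLUTION_PRESETS = {
    # UI label -> (internal size key, (width, height), sample_shift)
    "480p": ("infinitetalk-480", (640, 640), 7),
    "720p": ("infinitetalk-720", (960, 960), 11),
}

size_key, (width, height), sample_shift = RESOLUTION_PRESETS["720p"]
```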
### Audio Embedding Format

Audio embeddings must be saved as `.pt` files in the format expected by the pipeline:

```python
import torch
from einops import rearrange

# `embeddings` is the Wav2Vec2 forward output obtained with output_hidden_states=True
audio_embeddings = torch.stack(embeddings.hidden_states[1:], dim=1).squeeze(0)
audio_embeddings = rearrange(audio_embeddings, "b s d -> s b d")  # Shape: [seq_len, batch, dim]
torch.save(audio_embeddings, emb_path)
```
### Pipeline Input Format

The `generate_infinitetalk` method expects:

```python
input_clip = {
    "prompt": "",  # Empty for talking head
    "cond_video": image_or_video_path,
    "cond_audio": {"person1": embedding_path},
    "video_audio": audio_wav_path,
}
```
### ZeroGPU Duration Calculation

```python
base_time = 60  # Model loading overhead (seconds)
processing_rate = 2.5 if resolution == "480p" else 3.5  # Seconds of compute per second of video
duration = int((base_time + video_duration * processing_rate) * 1.2)  # 20% safety margin
duration = min(duration, 300)  # Cap at 300s for free tier
```
### Memory Optimization

- Use `offload_model=True` in the pipeline to offload between forwards
- Enable VRAM management for low-memory scenarios: `pipeline.enable_vram_management()`
- Flash-attention (if available) reduces memory usage significantly; see the availability check below
- Chunked processing for videos >15s (480p) or >10s (720p)
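A common way to check whether flash-attention is importable before picking an attention backend; this is an illustrative check, not necessarily how this repo selects its backend:

```python
# Fall back to standard attention when flash-attn was not installed or failed to build.
try:
    import flash_attn  # noqa: F401
    FLASH_ATTN_AVAILABLE = True
except ImportError:
    FLASH_ATTN_AVAILABLE = False
```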
## HuggingFace Space Deployment

This project is designed for HuggingFace Spaces with ZeroGPU:

- SDK: `docker` (specified in README.md frontmatter)
- Hardware: `zero-gpu` (H200 with 70GB VRAM)
- Port: `7860` (Gradio default)
- First generation downloads ~15GB of models (2-3 minutes)
- Subsequent generations: ~40s for a 10s video at 480p

See `DEPLOYMENT.md` for detailed deployment instructions and troubleshooting.
## Common Pitfalls

1. **Flash-attn compilation**: May fail on some systems. The Dockerfile handles this gracefully with an `|| echo "Warning..."` fallback
2. **PyTorch version**: Must use 2.5.1+ for xfuser's `torch.distributed.tensor.experimental` support
3. **Audio sample rate**: Must be 16kHz for the Wav2Vec2 model
4. **Frame format**: Pipeline expects 4n+1 frames (e.g., 81 frames) for proper temporal modeling
5. **Model paths**: InfiniteTalk weights must be loaded separately from the base Wan model
6. **TOKENIZERS_PARALLELISM**: Set to `'false'` to avoid deadlocks in multi-threaded environments (see below)
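For pitfall 6, set the variable before `transformers`/`tokenizers` is imported, for example:

```python
import os

# Disable the tokenizers thread pool so forked Gradio/ZeroGPU workers don't deadlock.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```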
## File Structure

```
├── app.py               # Main Gradio application
├── Dockerfile           # Docker build configuration
├── requirements.txt     # Python dependencies
├── packages.txt         # System dependencies
├── utils/
│   ├── model_loader.py  # Model download and loading
│   └── gpu_manager.py   # GPU memory management
├── wan/
│   ├── multitalk.py     # InfiniteTalk pipeline
│   ├── configs/         # Model configurations
│   ├── modules/         # Model architecture (VAE, DiT, etc.)
│   └── utils/           # Video/audio utilities
└── src/
    └── audio_analysis/
        └── wav2vec2.py  # Audio encoder with interpolation
```