# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

InfiniteTalk is a talking video generator that creates realistic talking head videos with accurate lip-sync. It supports two modes:

- **Image-to-Video**: Transform static portraits into talking videos using audio input
- **Video Dubbing**: Re-sync existing videos with new audio while maintaining natural movements

Built on the Wan2.1 diffusion model with specialized audio conditioning for photorealistic results.

## Architecture

### Core Components

**Main Application** (`app.py`)
- Gradio interface with ZeroGPU support via the `@spaces.GPU(duration=180)` decorator
- Two-tab interface: Image-to-Video and Video Dubbing
- Lazy model loading on first inference to minimize startup time
- Global `ModelManager` and `GPUManager` instances for resource management (lazy-loading pattern sketched below)
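A minimal, self-contained sketch of the lazy-loading pattern behind the `@spaces.GPU` decorator; `_get_pipeline` and `generate` are illustrative placeholders, not the actual functions in `app.py`:

```python
import spaces  # HuggingFace Spaces SDK (provides the ZeroGPU decorator)

_PIPELINE = None  # module-level cache so the expensive load happens once per worker


def _get_pipeline():
    """Lazy loader: the first call pays the download/load cost, later calls reuse the cache."""
    global _PIPELINE
    if _PIPELINE is None:
        _PIPELINE = object()  # placeholder; app.py builds the real InfiniteTalk pipeline here
    return _PIPELINE


@spaces.GPU(duration=180)  # ZeroGPU attaches a GPU only for the decorated call, for up to 180 s
def generate(image_path: str, audio_path: str, resolution: str) -> str:
    pipeline = _get_pipeline()
    # ... run inference with `pipeline` and write the resulting MP4 ...
    return "output.mp4"
```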
**Model Pipeline** (`wan/multitalk.py`)
- `InfiniteTalkPipeline`: Main generation pipeline using the Wan2.1-I2V-14B model
- Supports two resolutions: 480p (640x640) and 720p (960x960)
- Uses diffusion-based generation with audio conditioning
- Implements chunked processing for long videos to manage memory

**Audio Processing** (`src/audio_analysis/wav2vec2.py`)
- Custom `Wav2Vec2Model` extending HuggingFace's implementation
- Extracts audio embeddings with temporal interpolation via `linear_interpolation`
- Processes audio at 16kHz with loudness normalization (pyloudnorm)
- Stacks hidden states from all encoder layers for a rich audio representation (extraction sketched below)
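A sketch of that extraction path using the stock `transformers` classes for illustration; the repo uses its own `Wav2Vec2Model` subclass, and the -23 LUFS target here is an assumption:

```python
import librosa
import pyloudnorm as pyln
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model  # the repo subclasses this model

audio, sr = librosa.load("speech.wav", sr=16000)  # Wav2Vec2 expects 16 kHz mono audio
meter = pyln.Meter(sr)
loudness = meter.integrated_loudness(audio)
audio = pyln.normalize.loudness(audio, loudness, -23.0)  # target LUFS is an assumption

model_id = "TencentGameMate/chinese-wav2vec2-base"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_id)
encoder = Wav2Vec2Model.from_pretrained(model_id)

inputs = feature_extractor(audio, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    embeddings = encoder(inputs.input_values, output_hidden_states=True)
# embeddings.hidden_states[1:] are the per-layer outputs stacked later (see "Audio Embedding
# Format" below); the custom subclass also interpolates features in time to match video frames.
```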
**Model Management** (`utils/model_loader.py`)
- `ModelManager`: Handles lazy loading and caching of models from the HuggingFace Hub
- Downloads three model types:
  - Wan2.1-I2V-14B: Main video generation model (Kijai/WanVideo_comfy)
  - InfiniteTalk weights: Specialized talking head weights (MeiGen-AI/InfiniteTalk)
  - Wav2Vec2: Audio encoder (TencentGameMate/chinese-wav2vec2-base)
- Models cached in `HF_HOME` or `/data/.huggingface` (download sketch below)
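Roughly what the download step looks like with `huggingface_hub`; the repo IDs and cache location come from this document, while the helper name and the use of `snapshot_download` are illustrative assumptions:

```python
import os
from huggingface_hub import snapshot_download

CACHE_DIR = os.environ.get("HF_HOME", "/data/.huggingface")


def download_all_weights():
    """Fetch (or reuse from cache) the three model repos used by the pipeline."""
    wan_dir = snapshot_download("Kijai/WanVideo_comfy", cache_dir=CACHE_DIR)
    infinitetalk_dir = snapshot_download("MeiGen-AI/InfiniteTalk", cache_dir=CACHE_DIR)
    wav2vec_dir = snapshot_download("TencentGameMate/chinese-wav2vec2-base", cache_dir=CACHE_DIR)
    return wan_dir, infinitetalk_dir, wav2vec_dir
```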
**GPU Management** (`utils/gpu_manager.py`)
- `GPUManager`: Monitors memory usage and performs cleanup
- Calculates ZeroGPU duration based on video length and resolution
- Memory estimation: ~20GB base + 0.8GB/s (480p) or 1.5GB/s (720p) of video
- Recommends chunking for videos requiring >50GB memory (estimator written out below)
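The estimate written out as code; the constants mirror the bullets above, but the real `GPUManager` may compute this differently:

```python
def estimate_memory_gb(video_seconds: float, resolution: str) -> float:
    """Rough VRAM estimate: ~20 GB of weights plus a per-second cost that scales with resolution."""
    per_second_gb = 0.8 if resolution == "480p" else 1.5
    return 20.0 + video_seconds * per_second_gb


# Anything estimated above ~50 GB should fall back to chunked processing.
print(estimate_memory_gb(30, "720p"))         # 65.0
print(estimate_memory_gb(30, "720p") > 50.0)  # True -> chunk
```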
**Configuration** (`wan/configs/__init__.py`)
- `WAN_CONFIGS`: Model configurations for different tasks (t2v, i2v, infinitetalk)
- `SIZE_CONFIGS`: Resolution mappings (infinitetalk-480: 640x640, infinitetalk-720: 960x960)
- `SUPPORTED_SIZES`: Valid resolution options per model type (shape illustrated below)
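An illustrative shape for the two size mappings; the keys and resolutions come from this document, the exact container types are assumptions:

```python
SIZE_CONFIGS = {
    "infinitetalk-480": (640, 640),  # width x height for the 480p preset
    "infinitetalk-720": (960, 960),  # width x height for the 720p preset
}

SUPPORTED_SIZES = {
    "infinitetalk": ("infinitetalk-480", "infinitetalk-720"),
}
```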
### Data Flow

1. **Audio Processing**: Audio file → librosa load → loudness normalization → Wav2Vec2 feature extraction → audio embeddings (shape: [seq_len, batch, dim])
2. **Input Processing**: Image/video → PIL/cache_video → frame extraction → resize and center crop to target resolution
3. **Generation**: InfiniteTalk pipeline combines visual input + audio embeddings → diffusion sampling → video tensor
4. **Output**: Video tensor → save_video_ffmpeg with audio track → MP4 file

### Key Design Patterns

- **Lazy Loading**: Models only loaded on first inference to reduce cold start time
- **Memory Management**: Aggressive cleanup with `torch.cuda.empty_cache()` and `gc.collect()` after generation (see below)
- **ZeroGPU Integration**: `@spaces.GPU` decorator with calculated duration based on video length
- **Offloading**: Models can be offloaded to CPU between forward passes to save VRAM
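The cleanup pattern referenced above, as a standalone helper; this is a sketch of what `GPUManager` likely wraps, not its actual code:

```python
import gc

import torch


def cleanup_gpu_memory():
    """Release cached CUDA blocks and collect Python garbage after a generation run."""
    gc.collect()  # drop Python references to large tensors first
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # return cached allocator blocks to the driver
```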
## Development Commands

### Docker Build and Run

```bash
# Build Docker image
docker build -t infinitetalk .

# Run locally
docker run -p 7860:7860 --gpus all infinitetalk
```

### Python Environment

```bash
# Install dependencies (requires PyTorch 2.5.1+ for xfuser compatibility)
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
pip install flash-attn==2.7.4.post1 --no-build-isolation  # Optional, may fail on some systems
pip install -r requirements.txt

# Run application
python app.py
```

### System Dependencies

Required packages (see `packages.txt`):

- ffmpeg (video processing)
- build-essential (compilation)
- libsndfile1 (audio I/O)
- git (model downloads)
## Important Implementation Details

### Resolution Handling

- User selects "480p" or "720p" in the UI
- Internally mapped to `infinitetalk-480` (640x640) or `infinitetalk-720` (960x960)
- `sample_shift` parameter: 7 for 480p, 11 for 720p (controls diffusion sampling); see the mapping below
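The same mapping collected in one place; the values come from this document, but the dict itself is illustrative rather than the repo's code:

```python
RESOLUTION_PRESETS = {
    # UI label -> (internal size key, (width, height), sample_shift)
    "480p": ("infinitetalk-480", (640, 640), 7),
    "720p": ("infinitetalk-720", (960, 960), 11),
}

size_key, (width, height), sample_shift = RESOLUTION_PRESETS["720p"]
```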
### Audio Embedding Format

Audio embeddings must be saved as `.pt` files in the format expected by the pipeline:

```python
import torch
from einops import rearrange

# `embeddings` is the Wav2Vec2 forward output obtained with output_hidden_states=True
audio_embeddings = torch.stack(embeddings.hidden_states[1:], dim=1).squeeze(0)
audio_embeddings = rearrange(audio_embeddings, "b s d -> s b d")  # Shape: [seq_len, batch, dim]
torch.save(audio_embeddings, emb_path)
```
### Pipeline Input Format

The `generate_infinitetalk` method expects:

```python
input_clip = {
    "prompt": "",  # Empty for talking head
    "cond_video": image_or_video_path,
    "cond_audio": {"person1": embedding_path},
    "video_audio": audio_wav_path,
}
```
### ZeroGPU Duration Calculation

```python
base_time = 60  # Model loading overhead (seconds)
processing_rate = 2.5 if resolution == "480p" else 3.5  # Seconds of compute per second of video
duration = int((base_time + video_duration * processing_rate) * 1.2)  # 20% safety margin
duration = min(duration, 300)  # Cap at 300s for free tier
```
### Memory Optimization

- Use `offload_model=True` in the pipeline to offload between forwards
- Enable VRAM management for low-memory scenarios: `pipeline.enable_vram_management()`
- Flash-attention (if available) reduces memory usage significantly; see the availability check below
- Chunked processing for videos >15s (480p) or >10s (720p)
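A common way to check whether flash-attention is importable before picking an attention backend; this is an illustrative check, not necessarily how this repo selects its backend:

```python
# Fall back to standard attention when flash-attn was not installed or failed to build.
try:
    import flash_attn  # noqa: F401
    FLASH_ATTN_AVAILABLE = True
except ImportError:
    FLASH_ATTN_AVAILABLE = False
```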
## HuggingFace Space Deployment

This project is designed for HuggingFace Spaces with ZeroGPU:

- SDK: `docker` (specified in README.md frontmatter)
- Hardware: `zero-gpu` (H200 with 70GB VRAM)
- Port: `7860` (Gradio default)
- First generation downloads ~15GB of models (2-3 minutes)
- Subsequent generations: ~40s for a 10s video at 480p

See `DEPLOYMENT.md` for detailed deployment instructions and troubleshooting.
## Common Pitfalls

1. **Flash-attn compilation**: May fail on some systems. The Dockerfile handles this gracefully with an `|| echo "Warning..."` fallback
2. **PyTorch version**: Must use 2.5.1+ for xfuser's `torch.distributed.tensor.experimental` support
3. **Audio sample rate**: Must be 16kHz for the Wav2Vec2 model
4. **Frame format**: Pipeline expects 4n+1 frames (e.g., 81 frames) for proper temporal modeling
5. **Model paths**: InfiniteTalk weights must be loaded separately from the base Wan model
6. **TOKENIZERS_PARALLELISM**: Set to `'false'` to avoid deadlocks in multi-threaded environments (see below)
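For pitfall 6, set the variable before `transformers`/`tokenizers` is imported, for example:

```python
import os

# Disable the tokenizers thread pool so forked Gradio/ZeroGPU workers don't deadlock.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```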
## File Structure

```
├── app.py               # Main Gradio application
├── Dockerfile           # Docker build configuration
├── requirements.txt     # Python dependencies
├── packages.txt         # System dependencies
├── utils/
│   ├── model_loader.py  # Model download and loading
│   └── gpu_manager.py   # GPU memory management
├── wan/
│   ├── multitalk.py     # InfiniteTalk pipeline
│   ├── configs/         # Model configurations
│   ├── modules/         # Model architecture (VAE, DiT, etc.)
│   └── utils/           # Video/audio utilities
└── src/
    └── audio_analysis/
        └── wav2vec2.py  # Audio encoder with interpolation
```