ZooClaw-FashionSigLIP2: Distilled Fine-tuning for Robust Fashion Retrieval
Abstract
A fashion-specialized vision-language model achieves superior retrieval performance through full fine-tuning with knowledge distillation and weight interpolation, outperforming existing methods on a new benchmark while addressing structural biases in existing datasets.
Adapting a foundation vision-language encoder to a specialized retrieval task creates a fundamental tradeoff: gains on the target distribution come at the cost of the foundation model's broad generalization, and fashion retrieval is a stringent instance of this problem. We present ZooClaw-FashionSigLIP2, a fashion-specialized SigLIP2-base model that resolves this tradeoff with a simple recipe: full fine-tuning with knowledge distillation on curated in-domain data, followed by WiSE-FT (Wortsman et al., 2022) weight interpolation with the base model. This recipe outperforms alternatives that rely on LoRA, larger backbones (up to 1B parameters), or external training data. Under fair evaluation, ZooClaw-FashionSigLIP2 outperforms all baselines on every benchmark in our suite. In addition, we release ZooClaw-Fashion, a new high-quality fashion retrieval benchmark, and a systematic quality analysis of widely used benchmarks that exposes and mitigates structural biases in their public ground truth. We open-source the model weights and all evaluation artifacts to facilitate future research.
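The weight-interpolation step in the recipe can be illustrated with a minimal sketch. It assumes the standard WiSE-FT form, a per-parameter linear blend (1 - alpha) * base + alpha * fine-tuned, applied to two checkpoints with identical parameter names. The function name `interpolate_weights`, the toy parameter names, and the value alpha = 0.5 are illustrative assumptions; the abstract does not state the mixing coefficient or any implementation details.

```python
from typing import Dict, List

# Toy stand-in for a checkpoint: parameter name -> flat list of values.
# (Assumption: a real checkpoint would hold tensors, not Python lists.)
StateDict = Dict[str, List[float]]


def interpolate_weights(base: StateDict, finetuned: StateDict, alpha: float) -> StateDict:
    """Return (1 - alpha) * base + alpha * finetuned, parameter by parameter."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if base.keys() != finetuned.keys():
        raise ValueError("both checkpoints must share the same parameter names")

    merged: StateDict = {}
    for name, base_values in base.items():
        ft_values = finetuned[name]
        if len(base_values) != len(ft_values):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # alpha = 0 recovers the base model, alpha = 1 the fine-tuned model.
        merged[name] = [
            (1.0 - alpha) * b + alpha * f for b, f in zip(base_values, ft_values)
        ]
    return merged


# Hypothetical two-parameter "model" used only to exercise the function.
base = {"proj.weight": [0.0, 1.0, 2.0], "proj.bias": [0.5]}
finetuned = {"proj.weight": [1.0, 1.0, 4.0], "proj.bias": [-0.5]}

merged = interpolate_weights(base, finetuned, alpha=0.5)
print(merged)
```

Sweeping alpha between 0 and 1 traces a path between the base and fine-tuned models, which is how such a blend can trade in-domain gains against retained generalization; which point on that path the released model uses is not specified here.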
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- What Matters for Grocery Product Retrieval with Open Source Vision Language Models (2026)
- FashionLens: Toward Versatile Fashion Image Retrieval via Task-Adaptive Learning (2026)
- Exploring Multi-Modal Large Language Models and Two-Stage Fine-Tuning for Fashion Image Retrieval (2026)
- Fashion Florence: Fine-Tuning Florence-2 for Structured Fashion Attribute Extraction (2026)
- Beyond Retrieval: A Multitask Benchmark and Model for Code Search (2026)
- Sparse Autoencoders enable Robust and Interpretable Fine-tuning of CLIP models (2026)
- VietFashion: Benchmarking Sketch–Text Composed Image Retrieval for Cultural Outfits (2026)
Datasets citing this paper 2
srpone/hm-eval
srpone/zooclaw-fashion-eval