Instructions to use VillanovaAI/Villanova-2B-VL-2512-Preview with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use VillanovaAI/Villanova-2B-VL-2512-Preview with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="VillanovaAI/Villanova-2B-VL-2512-Preview", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModelForImageTextToText
model = AutoModelForImageTextToText.from_pretrained("VillanovaAI/Villanova-2B-VL-2512-Preview", trust_remote_code=True, dtype="auto")

Local Apps

vLLM

How to use VillanovaAI/Villanova-2B-VL-2512-Preview with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "VillanovaAI/Villanova-2B-VL-2512-Preview"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "VillanovaAI/Villanova-2B-VL-2512-Preview",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
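
The same request can also be assembled from Python. The following is a minimal sketch using only the standard library; `build_chat_request` is a hypothetical helper name, while the endpoint, model name, and message layout are copied from the curl call above. It only builds the request; sending it assumes the vLLM server is already running locally.

```python
import json
import urllib.request

MODEL_ID = "VillanovaAI/Villanova-2B-VL-2512-Preview"
ENDPOINT = "http://localhost:8000/v1/chat/completions"


def build_chat_request(prompt: str, image_url: str) -> urllib.request.Request:
    """Build (but do not send) the POST request used in the curl example."""
    payload = {
        "model": MODEL_ID,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "Describe this image in one sentence.",
    "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)
# To send it once the server is up:
#   with urllib.request.urlopen(request) as response:
#       print(json.load(response)["choices"][0]["message"]["content"])
```

For the SGLang server the only change under these assumptions is the port in `ENDPOINT` (30000 instead of 8000).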

Use Docker

docker model run hf.co/VillanovaAI/Villanova-2B-VL-2512-Preview

SGLang

How to use VillanovaAI/Villanova-2B-VL-2512-Preview with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "VillanovaAI/Villanova-2B-VL-2512-Preview" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "VillanovaAI/Villanova-2B-VL-2512-Preview",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "VillanovaAI/Villanova-2B-VL-2512-Preview" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use VillanovaAI/Villanova-2B-VL-2512-Preview with Docker Model Runner:
```
docker model run hf.co/VillanovaAI/Villanova-2B-VL-2512-Preview
```

matteogabburo committed on Dec 30, 2025

Commit 46d882e (verified) · 1 parent: 9d55c3f

Upload folder using huggingface_hub

Files changed (17)

.gitattributes +1 -0
added_tokens.json +3 -0
chat_template.jinja +3 -0
config.json +156 -0
configuration_villanova.py +96 -0
generation_config.json +11 -0
image_processing_villanova.py +219 -0
model-00001-of-00002.safetensors +3 -0
model-00002-of-00002.safetensors +3 -0
model.safetensors.index.json +481 -0
modeling_villanova.py +598 -0
preprocessor_config.json +28 -0
processing_villanova.py +205 -0
special_tokens_map.json +47 -0
tokenizer.json +3 -0
tokenizer.model +3 -0
tokenizer_config.json +1113 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text

+tokenizer.json filter=lfs diff=lfs merge=lfs -text

added_tokens.json ADDED

	@@ -0,0 +1,3 @@

+{
+  "<image>": 256000
+}

chat_template.jinja ADDED

	@@ -0,0 +1,3 @@

+{% for message in messages %}{% if message['role'] == 'user' %}user: {{ message['content'] }}
+{% elif message['role'] == 'assistant' %}assistant: {{ message['content'] }}
+{% endif %}{% endfor %}{% if add_generation_prompt %}assistant: {% endif %}
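
To see the prompt this template produces, here is a minimal pure-Python sketch that mirrors its logic. `render_chat` is a hypothetical helper, not part of the commit. It assumes each message's `content` is a plain string, and writing the `<image>` placeholder directly into the text is an assumption about how the processor expects it.

```python
def render_chat(messages: list[dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Mirror chat_template.jinja: one 'role: content' line per user/assistant turn."""
    parts = []
    for message in messages:
        # The template only emits user and assistant turns; other roles are skipped.
        if message["role"] in ("user", "assistant"):
            parts.append(f"{message['role']}: {message['content']}\n")
    if add_generation_prompt:
        # Trailing cue that asks the model to continue as the assistant.
        parts.append("assistant: ")
    return "".join(parts)


prompt = render_chat([{"role": "user", "content": "<image>\nWhat animal is on the candy?"}])
print(repr(prompt))  # 'user: <image>\nWhat animal is on the candy?\nassistant: '
```

Note that the template has no branch for a `system` role, so a system message would be dropped silently rather than rendered.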

config.json ADDED

	@@ -0,0 +1,156 @@

+{
+  "return_dict": true,
+  "output_hidden_states": false,
+  "torchscript": false,
+  "dtype": null,
+  "pruned_heads": {},
+  "tie_word_embeddings": true,
+  "chunk_size_feed_forward": 0,
+  "is_encoder_decoder": false,
+  "is_decoder": false,
+  "cross_attention_hidden_size": null,
+  "add_cross_attention": false,
+  "tie_encoder_decoder": false,
+  "architectures": [
+    "VillanovaVLMForConditionalGeneration"
+  ],
+  "finetuning_task": null,
+  "id2label": {
+    "0": "LABEL_0",
+    "1": "LABEL_1"
+  },
+  "label2id": {
+    "LABEL_0": 0,
+    "LABEL_1": 1
+  },
+  "task_specific_params": null,
+  "problem_type": null,
+  "tokenizer_class": null,
+  "prefix": null,
+  "bos_token_id": null,
+  "pad_token_id": null,
+  "eos_token_id": null,
+  "sep_token_id": null,
+  "decoder_start_token_id": null,
+  "max_length": 20,
+  "min_length": 0,
+  "do_sample": false,
+  "early_stopping": false,
+  "num_beams": 1,
+  "temperature": 1.0,
+  "top_k": 50,
+  "top_p": 1.0,
+  "typical_p": 1.0,
+  "repetition_penalty": 1.0,
+  "length_penalty": 1.0,
+  "no_repeat_ngram_size": 0,
+  "encoder_no_repeat_ngram_size": 0,
+  "bad_words_ids": null,
+  "num_return_sequences": 1,
+  "output_scores": false,
+  "return_dict_in_generate": false,
+  "forced_bos_token_id": null,
+  "forced_eos_token_id": null,
+  "remove_invalid_values": false,
+  "exponential_decay_length_penalty": null,
+  "suppress_tokens": null,
+  "begin_suppress_tokens": null,
+  "num_beam_groups": 1,
+  "diversity_penalty": 0.0,
+  "_name_or_path": "villanova-vlm",
+  "transformers_version": "4.57.3",
+  "tf_legacy_loss": false,
+  "use_bfloat16": false,
+  "vision_config": {
+    "hidden_size": 1024,
+    "image_size": 384,
+    "patch_size": 16,
+    "num_patches": 576,
+    "num_hidden_layers": 24,
+    "num_attention_heads": 16,
+    "intermediate_size": 4096,
+    "model_name": "ViT-L-16-SigLIP-384",
+    "pretrained": "webli"
+  },
+  "projector_config": {
+    "num_layers": 2,
+    "input_size": 1024,
+    "output_size": 2560,
+    "hidden_size": 2560,
+    "activation": "gelu",
+    "use_layer_norm": false,
+    "bias": true,
+    "output_scale": 1.0
+  },
+  "text_config": {
+    "architectures": [
+      "VillanovaVLM"
+    ],
+    "auto_map": {
+      "AutoConfig": "villanova_config.VillanovaVLMConfig",
+      "AutoModelForCausalLM": "villanova_vlm.VillanovaVLM"
+    },
+    "bos_token_id": 1,
+    "dtype": "float32",
+    "eos_token_id": 2,
+    "freeze_vision_encoder": true,
+    "image_seq_length": 576,
+    "image_token_index": 256000,
+    "model_type": "villanova_vlm",
+    "pad_token_id": 2,
+    "projector_hidden_act": "gelu",
+    "projector_hidden_size": 2560,
+    "projector_num_layers": 2,
+    "projector_output_scale": 1.0,
+    "projector_use_output_norm": false,
+    "text_config": {
+      "_name_or_path": "/media/storage/store1/gabburo/models/villanova-sal-2b-w_const_pretrain_dcos1000k-1100k_to3e-6_v2-step=1099999",
+      "architectures": [
+        "LlamaForCausalLM"
+      ],
+      "attention_bias": false,
+      "attention_dropout": 0.0,
+      "dtype": "bfloat16",
+      "head_dim": 128,
+      "hidden_act": "silu",
+      "hidden_size": 2560,
+      "initializer_range": 0.014,
+      "intermediate_size": 10240,
+      "max_position_embeddings": 4096,
+      "mlp_bias": false,
+      "model_type": "llama",
+      "num_attention_heads": 20,
+      "num_hidden_layers": 18,
+      "num_key_value_heads": 4,
+      "pretraining_tp": 1,
+      "rms_norm_eps": 1e-05,
+      "rope_scaling": null,
+      "rope_theta": 10000,
+      "tie_word_embeddings": true,
+      "use_cache": true,
+      "vocab_size": 256001
+    },
+    "transformers_version": "4.57.3",
+    "vision_config": {
+      "backend": "openclip",
+      "encoder_name": "ViT-L-16-SigLIP-384",
+      "hidden_size": 1024,
+      "image_size": 384,
+      "num_patches": 576,
+      "patch_size": 16,
+      "pretrained": "webli"
+    },
+    "vision_feature_layer": -1,
+    "vision_feature_select_strategy": "default"
+  },
+  "image_token_index": 256000,
+  "vocab_size": 256001,
+  "hidden_size": 2560,
+  "model_type": "villanova",
+  "output_attentions": false,
+  "auto_map": {
+    "AutoConfig": "configuration_villanova.VillanovaConfig",
+    "AutoModelForImageTextToText": "modeling_villanova.VillanovaVLMForConditionalGeneration",
+    "AutoProcessor": "processing_villanova.VillanovaProcessor"
+  }
+}
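
Several numbers in this config depend on one another. The sketch below copies a handful of values from the JSON above into plain dicts and checks that they agree; the dict and variable names are illustrative only. The context-budget line assumes each image occupies one position per patch, as `image_seq_length: 576` suggests.

```python
# Values copied from config.json above.
vision = {"image_size": 384, "patch_size": 16, "num_patches": 576, "hidden_size": 1024}
projector = {"input_size": 1024, "output_size": 2560}
llm = {
    "hidden_size": 2560,
    "num_attention_heads": 20,
    "num_key_value_heads": 4,
    "head_dim": 128,
    "max_position_embeddings": 4096,
}

# Patch grid: 384 / 16 = 24 patches per side, 24 * 24 = 576 patches per image.
patches_per_side = vision["image_size"] // vision["patch_size"]
assert patches_per_side ** 2 == vision["num_patches"]

# The projector maps vision features (1024) into the LLM embedding space (2560).
assert projector["input_size"] == vision["hidden_size"]
assert projector["output_size"] == llm["hidden_size"]

# Attention width: 20 heads of 128 dimensions equals the hidden size.
assert llm["num_attention_heads"] * llm["head_dim"] == llm["hidden_size"]

# Grouped-query attention: number of query heads sharing each key/value head.
queries_per_kv_head = llm["num_attention_heads"] // llm["num_key_value_heads"]

# Positions left for text after one image (assumes one position per patch).
text_budget = llm["max_position_embeddings"] - vision["num_patches"]

print(patches_per_side, queries_per_kv_head, text_budget)  # 24 5 3520
```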

configuration_villanova.py ADDED

	@@ -0,0 +1,96 @@

+"""Villanova VLM Configuration for HuggingFace.
+This is a standalone configuration file for use with trust_remote_code=True.
+It contains no imports from aithlas_trainer to ensure self-containment.
+"""
+from typing import Any
+from transformers import PretrainedConfig
+class VillanovaTextConfig(PretrainedConfig):
+    """Text/LLM configuration wrapper for Villanova VLM.
+    This wraps the LLM config dict to provide the to_dict() method
+    required by transformers' GenerationConfig.
+    """
+    model_type = "villanova_text"
+    def __init__(self, **kwargs: Any) -> None:
+        super().__init__(**kwargs)
+class VillanovaConfig(PretrainedConfig):
+    """Configuration class for Villanova VLM.
+    This configuration extends HuggingFace's PretrainedConfig to enable
+    loading with AutoConfig and trust_remote_code=True.
+    Args:
+        vision_config: Vision encoder configuration dict
+        projector_config: MLP projector configuration dict
+        text_config: LLM configuration dict
+        image_token_index: Token ID for <image> placeholder
+        vocab_size: Vocabulary size (from LLM)
+        hidden_size: LLM hidden dimension
+    Example:
+        >>> config = VillanovaConfig.from_pretrained("VillanovaAI/Villanova-2B-VL-2512-Preview")
+        >>> print(config.vision_config)
+    """
+    model_type = "villanova"
+    def __init__(
+        self,
+        vision_config: dict[str, Any] | None = None,
+        projector_config: dict[str, Any] | None = None,
+        text_config: dict[str, Any] | None = None,
+        image_token_index: int = 32000,
+        vocab_size: int | None = None,
+        hidden_size: int | None = None,
+        **kwargs: Any,
+    ) -> None:
+        super().__init__(**kwargs)
+        # Default vision config (ViT-L-14-CLIPA-336)
+        self.vision_config = vision_config or {
+            "hidden_size": 1024,
+            "image_size": 336,
+            "patch_size": 14,
+            "num_patches": 576,
+            "num_hidden_layers": 24,
+            "num_attention_heads": 16,
+            "intermediate_size": 4096,
+            "model_name": "ViT-L-14-CLIPA-336",
+            "pretrained": "datacomp1b",
+        }
+        # Default projector config (2-layer MLP with GELU, no LayerNorm like LLaVA)
+        self.projector_config = projector_config or {
+            "num_layers": 2,
+            "input_size": 1024,
+            "output_size": 2048,
+            "hidden_size": 2048,
+            "activation": "gelu",
+            "use_layer_norm": False,  # No LayerNorm on output (like LLaVA)
+            "bias": True,
+        }
+        # Text/LLM config - wrap as PretrainedConfig for compatibility with GenerationConfig
+        text_config_dict = text_config or {}
+        self.text_config = VillanovaTextConfig(**text_config_dict)
+        # Special tokens
+        self.image_token_index = image_token_index
+        # Derive from text_config if not provided
+        self.vocab_size = vocab_size or text_config_dict.get("vocab_size", 32000)
+        self.hidden_size = hidden_size or text_config_dict.get("hidden_size", 2048)
+        # Update projector output size to match LLM hidden size
+        if self.projector_config.get("output_size") != self.hidden_size:
+            self.projector_config["output_size"] = self.hidden_size
+            self.projector_config["hidden_size"] = self.hidden_size

generation_config.json ADDED

	@@ -0,0 +1,11 @@

+{
+  "_from_model_config": true,
+  "bos_token_id": 1,
+  "eos_token_id": 2,
+  "pad_token_id": 2,
+  "max_length": 2048,
+  "do_sample": false,
+  "temperature": 1.0,
+  "top_p": 1.0,
+  "top_k": 50
+}

image_processing_villanova.py ADDED

	@@ -0,0 +1,219 @@

+"""Villanova VLM Image Processor for HuggingFace.
+This is a standalone image processor file for use with trust_remote_code=True.
+It contains no imports from aithlas_trainer to ensure self-containment.
+"""
+from typing import Any
+import numpy as np
+from PIL import Image
+from transformers.image_processing_utils import BaseImageProcessor, BatchFeature
+from transformers.image_utils import (
+    ChannelDimension,
+    ImageInput,
+    make_list_of_images,
+    to_numpy_array,
+    valid_images,
+)
+class VillanovaImageProcessor(BaseImageProcessor):
+    """Image processor for Villanova VLM.
+    Processes images for the ViT-L-14-CLIPA-336 vision encoder:
+    - Resize to 336x336
+    - Normalize with ImageNet statistics (as used by OpenCLIP CLIPA models)
+    - Convert to RGB if needed
+    Args:
+        do_resize: Whether to resize images
+        size: Target size {"height": 336, "width": 336}
+        resample: PIL resampling filter (default: BILINEAR as used by OpenCLIP)
+        do_rescale: Whether to rescale pixel values
+        rescale_factor: Rescale factor (1/255)
+        do_normalize: Whether to normalize
+        image_mean: Normalization mean (ImageNet: [0.485, 0.456, 0.406])
+        image_std: Normalization std (ImageNet: [0.229, 0.224, 0.225])
+        do_convert_rgb: Convert to RGB if needed
+    Example:
+        >>> processor = VillanovaImageProcessor()
+        >>> image = Image.open("image.jpg")
+        >>> inputs = processor(image, return_tensors="pt")
+        >>> print(inputs.pixel_values.shape)
+        torch.Size([1, 3, 336, 336])
+    """
+    model_input_names = ["pixel_values"]
+    def __init__(
+        self,
+        do_resize: bool = True,
+        size: dict[str, int] | None = None,
+        resample: int = 2,  # PIL.Image.BILINEAR (as used by OpenCLIP)
+        do_rescale: bool = True,
+        rescale_factor: float = 1 / 255,
+        do_normalize: bool = True,
+        image_mean: list[float] | None = None,
+        image_std: list[float] | None = None,
+        do_convert_rgb: bool = True,
+        **kwargs: Any,
+    ) -> None:
+        super().__init__(**kwargs)
+        self.do_resize = do_resize
+        self.size = size or {"height": 336, "width": 336}
+        self.resample = resample
+        self.do_rescale = do_rescale
+        self.rescale_factor = rescale_factor
+        self.do_normalize = do_normalize
+        # ImageNet normalization (same as OpenCLIP ViT-L-14-CLIPA-336)
+        self.image_mean = image_mean or [0.485, 0.456, 0.406]
+        self.image_std = image_std or [0.229, 0.224, 0.225]
+        self.do_convert_rgb = do_convert_rgb
+    def resize(
+        self,
+        image: np.ndarray,
+        size: dict[str, int],
+        resample: int = 2,
+        data_format: ChannelDimension | None = None,
+        **kwargs: Any,
+    ) -> np.ndarray:
+        """Resize image to target size."""
+        height, width = size["height"], size["width"]
+        # Convert to PIL for resizing
+        if isinstance(image, np.ndarray):
+            pil_image = Image.fromarray(image.astype(np.uint8))
+        else:
+            pil_image = image
+        # Resize
+        resized = pil_image.resize((width, height), resample=resample)
+        # Convert back to numpy
+        return np.array(resized)
+    def rescale(
+        self,
+        image: np.ndarray,
+        scale: float,
+        data_format: ChannelDimension | None = None,
+        **kwargs: Any,
+    ) -> np.ndarray:
+        """Rescale pixel values."""
+        return image.astype(np.float32) * scale
+    def normalize(
+        self,
+        image: np.ndarray,
+        mean: list[float],
+        std: list[float],
+        data_format: ChannelDimension | None = None,
+        **kwargs: Any,
+    ) -> np.ndarray:
+        """Normalize image with mean and std."""
+        mean = np.array(mean, dtype=np.float32)
+        std = np.array(std, dtype=np.float32)
+        # Ensure image is float
+        image = image.astype(np.float32)
+        # Normalize (assuming HWC format)
+        if image.ndim == 3:
+            image = (image - mean) / std
+        return image
+    def preprocess(
+        self,
+        images: ImageInput,
+        do_resize: bool | None = None,
+        size: dict[str, int] | None = None,
+        resample: int | None = None,
+        do_rescale: bool | None = None,
+        rescale_factor: float | None = None,
+        do_normalize: bool | None = None,
+        image_mean: list[float] | None = None,
+        image_std: list[float] | None = None,
+        do_convert_rgb: bool | None = None,
+        return_tensors: str | None = None,
+        data_format: ChannelDimension = ChannelDimension.FIRST,
+        **kwargs: Any,
+    ) -> BatchFeature:
+        """Preprocess images for the model.
+        Args:
+            images: Single image or list of images
+            do_resize: Override resize setting
+            size: Override target size
+            resample: Override resampling filter
+            do_rescale: Override rescale setting
+            rescale_factor: Override rescale factor
+            do_normalize: Override normalize setting
+            image_mean: Override mean
+            image_std: Override std
+            do_convert_rgb: Override RGB conversion
+            return_tensors: Output tensor format ("pt", "np", etc.)
+            data_format: Channel dimension format
+        Returns:
+            BatchFeature with pixel_values
+        """
+        do_resize = do_resize if do_resize is not None else self.do_resize
+        size = size if size is not None else self.size
+        resample = resample if resample is not None else self.resample
+        do_rescale = do_rescale if do_rescale is not None else self.do_rescale
+        rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor
+        do_normalize = do_normalize if do_normalize is not None else self.do_normalize
+        image_mean = image_mean if image_mean is not None else self.image_mean
+        image_std = image_std if image_std is not None else self.image_std
+        do_convert_rgb = do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb
+        # Handle single image
+        images = make_list_of_images(images)
+        if not valid_images(images):
+            raise ValueError("Invalid image input")
+        processed_images = []
+        for image in images:
+            # Convert to RGB if needed
+            if do_convert_rgb:
+                if isinstance(image, Image.Image):
+                    image = image.convert("RGB")
+                elif isinstance(image, np.ndarray):
+                    if image.shape[-1] == 4:  # RGBA
+                        image = image[..., :3]
+                    elif image.ndim == 2:  # Grayscale
+                        image = np.stack([image] * 3, axis=-1)
+            # Convert to numpy
+            image = to_numpy_array(image)
+            # Resize
+            if do_resize:
+                image = self.resize(image, size, resample)
+            # Rescale
+            if do_rescale:
+                image = self.rescale(image, rescale_factor)
+            # Normalize
+            if do_normalize:
+                image = self.normalize(image, image_mean, image_std)
+            # Convert to CHW format
+            if data_format == ChannelDimension.FIRST:
+                image = np.transpose(image, (2, 0, 1))
+            processed_images.append(image)
+        # Stack into batch
+        pixel_values = np.stack(processed_images, axis=0)
+        data = {"pixel_values": pixel_values}
+        return BatchFeature(data=data, tensor_type=return_tensors)
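
The arithmetic behind the rescale, normalize, and transpose steps can be traced by hand. The sketch below is a pure-Python stand-in for the numpy operations above, using nested lists and the class's default mean and std; the helper names are hypothetical. The values actually used at load time come from preprocessor_config.json, which is not shown here and may differ from these defaults.

```python
IMAGE_MEAN = [0.485, 0.456, 0.406]  # defaults from VillanovaImageProcessor
IMAGE_STD = [0.229, 0.224, 0.225]


def rescale_and_normalize(pixel: list[int]) -> list[float]:
    """Apply rescale (1/255) and then per-channel normalization to one RGB pixel."""
    return [
        (value * (1 / 255) - mean) / std
        for value, mean, std in zip(pixel, IMAGE_MEAN, IMAGE_STD)
    ]


def hwc_to_chw(image: list[list[list[float]]]) -> list[list[list[float]]]:
    """Transpose a nested-list image from (H, W, C) to (C, H, W)."""
    channels = len(image[0][0])
    return [[[px[c] for px in row] for row in image] for c in range(channels)]


# A 1x2 image: one black pixel and one white pixel.
image_hwc = [[rescale_and_normalize([0, 0, 0]), rescale_and_normalize([255, 255, 255])]]
image_chw = hwc_to_chw(image_hwc)
print(len(image_chw), len(image_chw[0]), len(image_chw[0][0]))  # 3 1 2
```

With these defaults the red channel of a black pixel lands near -2.12 and that of a white pixel near 2.25, which is the usual value range for ImageNet-normalized inputs.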

model-00001-of-00002.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ea2b1700a60c63cfb1f33fffcb4c16e6e43420e53f095b4cb74a3a101105e2b4
+size 4981776984

model-00002-of-00002.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:6a2523a1e6bf438d62fd5644da5f6b3dbc2f176e4e8c9e28ca1a5b8916d07df9
+size 377510136
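
The two safetensors entries are Git LFS pointer files, not the weights themselves: each records a spec version, a SHA-256 object ID, and the size in bytes of the real file. The sketch below parses the two pointers shown above and adds up the shard sizes; `parse_lfs_pointer` is a hypothetical helper.

```python
def parse_lfs_pointer(text: str) -> dict[str, str]:
    """Parse the 'key value' lines of a Git LFS pointer file into a dict."""
    return dict(line.split(" ", 1) for line in text.strip().splitlines())


shard_1 = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:ea2b1700a60c63cfb1f33fffcb4c16e6e43420e53f095b4cb74a3a101105e2b4\n"
    "size 4981776984\n"
)
shard_2 = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:6a2523a1e6bf438d62fd5644da5f6b3dbc2f176e4e8c9e28ca1a5b8916d07df9\n"
    "size 377510136\n"
)

total_bytes = int(shard_1["size"]) + int(shard_2["size"])
print(total_bytes)                  # 5359287120
print(round(total_bytes / 1e9, 2))  # 5.36
```

The `total_size` of 5359228928 reported in model.safetensors.index.json is 58192 bytes smaller than this sum. That gap presumably corresponds to the per-file safetensors headers, which the index total does not appear to count.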

model.safetensors.index.json ADDED

	@@ -0,0 +1,481 @@

+{
+  "metadata": {
+    "total_size": 5359228928
+  },
+  "weight_map": {
+    "vision_encoder.trunk.pos_embed": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.patch_embed.proj.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.patch_embed.proj.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.0.norm1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.0.norm1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.0.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.0.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.0.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.0.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.0.norm2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.0.norm2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.0.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.0.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.0.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.0.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.1.norm1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.1.norm1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.1.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.1.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.1.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.1.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.1.norm2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.1.norm2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.1.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.1.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.1.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.1.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.2.norm1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.2.norm1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.2.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.2.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.2.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.2.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.2.norm2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.2.norm2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.2.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.2.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.2.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.2.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.3.norm1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.3.norm1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.3.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.3.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.3.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.3.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.3.norm2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.3.norm2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.3.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.3.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.3.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.3.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.4.norm1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.4.norm1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.4.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.4.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.4.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.4.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.4.norm2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.4.norm2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.4.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.4.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.4.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.4.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.5.norm1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.5.norm1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.5.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.5.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.5.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.5.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.5.norm2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.5.norm2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.5.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.5.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.5.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.5.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.6.norm1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.6.norm1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.6.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.6.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.6.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.6.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.6.norm2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.6.norm2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.6.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.6.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.6.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.6.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.7.norm1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.7.norm1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.7.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.7.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.7.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.7.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.7.norm2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.7.norm2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.7.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.7.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.7.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.7.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.8.norm1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.8.norm1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.8.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.8.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.8.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.8.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.8.norm2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.8.norm2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.8.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.8.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.8.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.8.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.9.norm1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.9.norm1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.9.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.9.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.9.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.9.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.9.norm2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.9.norm2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.9.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.9.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.9.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.9.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.10.norm1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.10.norm1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.10.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.10.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.10.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.10.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.10.norm2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.10.norm2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.10.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.10.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.10.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.10.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.11.norm1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.11.norm1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.11.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.11.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.11.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.11.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.11.norm2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.11.norm2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.11.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.11.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.11.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.11.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.12.norm1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.12.norm1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.12.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.12.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.12.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.12.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.12.norm2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.12.norm2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.12.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.12.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.12.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.12.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.13.norm1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.13.norm1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.13.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.13.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.13.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.13.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.13.norm2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.13.norm2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.13.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.13.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.13.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.13.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.14.norm1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.14.norm1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.14.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.14.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.14.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.14.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.14.norm2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.14.norm2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.14.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.14.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.14.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.14.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.15.norm1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.15.norm1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.15.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.15.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.15.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.15.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.15.norm2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.15.norm2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.15.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.15.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.15.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.15.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.16.norm1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.16.norm1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.16.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.16.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.16.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.16.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.16.norm2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.16.norm2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.16.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.16.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.16.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.16.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.17.norm1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.17.norm1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.17.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.17.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.17.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.17.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.17.norm2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.17.norm2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.17.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.17.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.17.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.17.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.18.norm1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.18.norm1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.18.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.18.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.18.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.18.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.18.norm2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.18.norm2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.18.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.18.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.18.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.18.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.19.norm1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.19.norm1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.19.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.19.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.19.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.19.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.19.norm2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.19.norm2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.19.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.19.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.19.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.19.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.20.norm1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.20.norm1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.20.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.20.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.20.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.20.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.20.norm2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.20.norm2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.20.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.20.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.20.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.20.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.21.norm1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.21.norm1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.21.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.21.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.21.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.21.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.21.norm2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.21.norm2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.21.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.21.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.21.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.21.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.22.norm1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.22.norm1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.22.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.22.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.22.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.22.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.22.norm2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.22.norm2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.22.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.22.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.22.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.22.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.23.norm1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.23.norm1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.23.attn.qkv.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.23.attn.qkv.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.23.attn.proj.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.23.attn.proj.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.23.norm2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.23.norm2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.23.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.23.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.23.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.blocks.23.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.norm.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.norm.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.attn_pool.latent": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.attn_pool.q.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.attn_pool.q.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.attn_pool.kv.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.attn_pool.kv.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.attn_pool.proj.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.attn_pool.proj.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.attn_pool.norm.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.attn_pool.norm.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.attn_pool.mlp.fc1.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.attn_pool.mlp.fc1.bias": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.attn_pool.mlp.fc2.weight": "model-00001-of-00002.safetensors",
+    "vision_encoder.trunk.attn_pool.mlp.fc2.bias": "model-00001-of-00002.safetensors",
+    "projector.mlp.0.weight": "model-00001-of-00002.safetensors",
+    "projector.mlp.0.bias": "model-00001-of-00002.safetensors",
+    "projector.mlp.2.weight": "model-00001-of-00002.safetensors",
+    "projector.mlp.2.bias": "model-00001-of-00002.safetensors",
+    "language_model.model.embed_tokens.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.0.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.0.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.0.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.0.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.0.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.0.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.0.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.0.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.1.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.1.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.1.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.1.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.1.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.1.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.1.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.1.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.1.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.2.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.2.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.2.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.2.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.2.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.2.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.2.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.2.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.2.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.3.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.3.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.3.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.3.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.3.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.3.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.3.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.3.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.3.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.4.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.4.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.4.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.4.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.4.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.4.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.4.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.4.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.4.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.5.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.5.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.5.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.5.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.5.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.10.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.10.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.10.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.10.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.10.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.10.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.10.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.10.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.10.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.11.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.11.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.11.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.11.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.11.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.11.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.11.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.11.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.11.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.12.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.12.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.12.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.12.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.12.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.12.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.12.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.12.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.12.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.13.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.13.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.13.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.13.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.13.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.13.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.13.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.13.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.13.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.14.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.14.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.14.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.14.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.14.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.14.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.14.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.14.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.14.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.15.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.15.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.15.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.15.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.15.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.15.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.15.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.15.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.15.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.16.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.16.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.16.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.16.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.16.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.16.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.16.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.16.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.16.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.17.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.17.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.17.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.17.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.17.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.17.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.17.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.17.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.17.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.5.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.5.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.5.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.5.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.6.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.6.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.6.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.6.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.6.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.6.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.6.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.6.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.6.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.7.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.7.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.7.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.7.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.7.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.7.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.7.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.7.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.7.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.8.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "language_model.model.layers.8.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+    "language_model.model.layers.8.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+    "language_model.model.layers.8.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+    "language_model.model.layers.8.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+    "language_model.model.layers.8.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+    "language_model.model.layers.8.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+    "language_model.model.layers.8.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+    "language_model.model.layers.8.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+    "language_model.model.layers.9.input_layernorm.weight": "model-00002-of-00002.safetensors",
+    "language_model.model.layers.9.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+    "language_model.model.layers.9.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+    "language_model.model.layers.9.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+    "language_model.model.layers.9.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+    "language_model.model.layers.9.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+    "language_model.model.layers.9.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+    "language_model.model.layers.9.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+    "language_model.model.layers.9.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+    "language_model.model.norm.weight": "model-00002-of-00002.safetensors"
+  }
+}
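
The weight map above assigns each tensor name to one of two shard files. The split does not follow layer boundaries: `layers.8.input_layernorm.weight` sits in the first shard while the rest of layer 8 is in the second. A minimal sketch of inverting such a map, using a four-entry excerpt copied from the listing above rather than the real index file (the helper name `group_by_shard` is hypothetical):

```python
from collections import defaultdict

# Excerpt copied from the weight map above; the real index lists every tensor.
weight_map = {
    "language_model.model.layers.7.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "language_model.model.layers.8.input_layernorm.weight": "model-00001-of-00002.safetensors",
    "language_model.model.layers.8.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
    "language_model.model.norm.weight": "model-00002-of-00002.safetensors",
}


def group_by_shard(mapping: dict[str, str]) -> dict[str, list[str]]:
    """Invert a tensor-name -> shard-file map into shard-file -> tensor names."""
    shards: dict[str, list[str]] = defaultdict(list)
    for tensor_name, shard_file in mapping.items():
        shards[shard_file].append(tensor_name)
    return dict(shards)


shards = group_by_shard(weight_map)
for shard_file, names in sorted(shards.items()):
    # Print each shard with the number of tensors it holds in this excerpt.
    print(shard_file, len(names))
```

A loader that reads every shard and routes keys by prefix, as `from_pretrained` in `modeling_villanova.py` does, never needs this inversion; it is mainly useful for checking which file to open when inspecting a single tensor.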

modeling_villanova.py ADDED

	@@ -0,0 +1,598 @@

+"""Villanova VLM Model for HuggingFace.
+This is a standalone model file for use with trust_remote_code=True.
+It contains no imports from aithlas_trainer to ensure self-containment.
+"""
+from typing import Any
+import torch
+import torch.nn as nn
+from transformers import AutoModelForCausalLM, PreTrainedModel
+from transformers.modeling_outputs import CausalLMOutputWithPast
+from .configuration_villanova import VillanovaConfig
+class ViTEncoder(nn.Module):
+    """Vision encoder for Villanova VLM using OpenCLIP.
+    Supports both:
+    - OpenCLIP CLIPA models (ViT-L-14-CLIPA-336) with direct visual transformer
+    - SigLIP models (ViT-L-16-SigLIP-384) wrapped via TimmModel
+    The model is loaded from OpenCLIP pretrained weights (not from safetensors).
+    IMPORTANT: Uses manual forward pass to match training code exactly.
+    Do NOT use output_tokens=True as it produces different outputs.
+    """
+    def __init__(self, config: dict[str, Any]) -> None:
+        super().__init__()
+        self.hidden_size = config.get("hidden_size", 1024)
+        # Support both old key (model_name) and new key (encoder_name)
+        self.model_name = config.get("encoder_name") or config.get("model_name", "ViT-L-14-CLIPA-336")
+        self.pretrained = config.get("pretrained", "datacomp1b")
+        # Placeholder - will be loaded lazily
+        self._clip_model: nn.Module | None = None
+        self._is_siglip: bool = "SigLIP" in self.model_name
+    def _ensure_clip_loaded(self) -> None:
+        """Load OpenCLIP model if not already loaded."""
+        if self._clip_model is None:
+            import open_clip
+            model, _, _ = open_clip.create_model_and_transforms(
+                self.model_name,
+                pretrained=self.pretrained,
+            )
+            # Use model.visual directly
+            self._clip_model = model.visual
+            self._clip_model.eval()
+            # Freeze all parameters
+            for param in self._clip_model.parameters():
+                param.requires_grad = False
+    def _forward_siglip(self, pixel_values: torch.Tensor) -> torch.Tensor:
+        """Forward pass for SigLIP models (TimmModel wrapper)."""
+        visual = self._clip_model
+        trunk = visual.trunk  # VisionTransformer from timm
+        # Patch embedding
+        x = trunk.patch_embed(pixel_values)  # (B, num_patches, hidden_dim)
+        # Prepend the CLS token if the model has one (SigLIP may or may not)
+        if trunk.cls_token is not None and trunk.cls_token.numel() > 0:
+            cls_tokens = trunk.cls_token.expand(x.shape[0], -1, -1)
+            x = torch.cat([cls_tokens, x], dim=1)
+        # Add positional embedding
+        x = x + trunk.pos_embed
+        # Optional: position dropout (usually identity)
+        x = trunk.pos_drop(x)
+        # Optional: patch dropout (usually identity)
+        if hasattr(trunk, "patch_drop") and trunk.patch_drop is not None:
+            x = trunk.patch_drop(x)
+        # Optional: pre-norm (some models have this)
+        if hasattr(trunk, "norm_pre") and trunk.norm_pre is not None:
+            x = trunk.norm_pre(x)
+        # Apply transformer blocks
+        x = trunk.blocks(x)
+        # Final norm
+        x = trunk.norm(x)
+        # Remove CLS token if present, return only patch tokens
+        if trunk.cls_token is not None and trunk.cls_token.numel() > 0:
+            patch_tokens = x[:, 1:, :]
+        else:
+            patch_tokens = x
+        return patch_tokens
+    def _forward_clipa(self, pixel_values: torch.Tensor) -> torch.Tensor:
+        """Forward pass for CLIPA models (standard OpenCLIP)."""
+        visual = self._clip_model
+        # Step 1: Get patch embeddings via conv1
+        x = visual.conv1(pixel_values)  # (B, hidden_dim, grid, grid)
+        x = x.reshape(x.shape[0], x.shape[1], -1)  # (B, hidden_dim, num_patches)
+        x = x.permute(0, 2, 1)  # (B, num_patches, hidden_dim)
+        # Step 2: Add positional embeddings (including CLS position)
+        has_cls = hasattr(visual, "positional_embedding")
+        if has_cls:
+            # OpenCLIP style: prepend CLS token, then add positional embeddings
+            cls_pos = visual.class_embedding.expand(x.shape[0], 1, -1)
+            x = torch.cat([cls_pos, x], dim=1)
+            x = x + visual.positional_embedding.unsqueeze(0)
+        elif hasattr(visual, "pos_embed"):
+            # Alternative style: no CLS token is prepended, so skip its position
+            x = x + visual.pos_embed[:, 1:, :]
+        # Step 3: Apply layer norm before transformer
+        x = visual.ln_pre(x)
+        # Step 4: Apply transformer (expects seq_len first)
+        x = x.permute(1, 0, 2)  # (seq_len, B, hidden_dim)
+        x = visual.transformer(x)
+        x = x.permute(1, 0, 2)  # (B, seq_len, hidden_dim)
+        # Step 5: Apply post-transformer layer norm (CRITICAL for correct output scale)
+        x = visual.ln_post(x)
+        # Step 6: Remove the CLS token only if one was prepended above;
+        # otherwise the slice would drop the first patch token.
+        patch_tokens = x[:, 1:, :] if has_cls else x
+        return patch_tokens
+    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
+        """Encode images to visual embeddings.
+        Uses MANUAL forward pass through OpenCLIP vision encoder to match
+        training code exactly. This is critical for correct inference.
+        Args:
+            pixel_values: Image tensor (batch_size, 3, H, W)
+        Returns:
+            Visual embeddings (batch_size, num_patches, hidden_size)
+        """
+        self._ensure_clip_loaded()
+        visual = self._clip_model
+        # Convert model to input dtype if needed (critical for matching training behavior)
+        input_dtype = pixel_values.dtype
+        model_dtype = next(visual.parameters()).dtype
+        if model_dtype != input_dtype:
+            self._clip_model = visual.to(dtype=input_dtype)
+            visual = self._clip_model
+        # Move model to same device as input
+        if next(visual.parameters()).device != pixel_values.device:
+            self._clip_model = visual.to(pixel_values.device)
+            visual = self._clip_model
+        with torch.no_grad():
+            if self._is_siglip:
+                return self._forward_siglip(pixel_values)
+            else:
+                return self._forward_clipa(pixel_values)
+class MLPProjector(nn.Module):
+    """MLP Projector to map vision features to LLM embedding space.
+    2-layer MLP with GELU activation (no output LayerNorm by default).
+    Structure matches the VillanovaVLM training checkpoint format:
+    - mlp.0: Linear(input_size, hidden_size)
+    - mlp.1: GELU (no params)
+    - mlp.2: Linear(hidden_size, output_size)
+    - output_norm: Identity() by default (no LayerNorm, like LLaVA)
+    NOTE: LLaVA does NOT use LayerNorm on projector output.
+    LLM embeddings have std≈0.008, LayerNorm forces std≈1, causing 140x scale mismatch.
+    """
+    def __init__(self, config: dict[str, Any]) -> None:
+        super().__init__()
+        input_size = config.get("input_size", 1024)
+        output_size = config.get("output_size", 2048)
+        hidden_size = config.get("hidden_size", output_size)
+        use_layer_norm = config.get("use_layer_norm", False)
+        bias = config.get("bias", True)
+        # Scale factor for output. Default 1.0 to match training behavior.
+        # Note: If training used output_scale, it should be set in config.
+        self.output_scale = config.get("output_scale", 1.0)
+        # Build MLP layers to match checkpoint structure
+        self.mlp = nn.Sequential(
+            nn.Linear(input_size, hidden_size, bias=bias),
+            nn.GELU(),
+            nn.Linear(hidden_size, output_size, bias=bias),
+        )
+        # Output normalization (separate from mlp to match checkpoint keys)
+        if use_layer_norm:
+            self.output_norm = nn.LayerNorm(output_size)
+        else:
+            self.output_norm = nn.Identity()
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        """Project vision features to LLM space."""
+        x = self.mlp(x)
+        x = self.output_norm(x)
+        # Scale to match LLM embedding magnitude
+        if self.output_scale != 1.0:
+            x = x * self.output_scale
+        return x
+class VillanovaVLMForConditionalGeneration(PreTrainedModel):
+    """Villanova Vision-Language Model for conditional generation.
+    Combines ViT-L-14-CLIPA-336 vision encoder, 2-layer MLP projector,
+    and Villanova 2B language model.
+    Example:
+        >>> from transformers import AutoModelForImageTextToText, AutoProcessor
+        >>> model = AutoModelForImageTextToText.from_pretrained(
+        ...     "VillanovaAI/Villanova-2B-VL-2512-Preview",
+        ...     trust_remote_code=True,
+        ... )
+        >>> processor = AutoProcessor.from_pretrained(
+        ...     "VillanovaAI/Villanova-2B-VL-2512-Preview",
+        ...     trust_remote_code=True,
+        ... )
+    """
+    config_class = VillanovaConfig
+    base_model_prefix = "model"
+    supports_gradient_checkpointing = True
+    _no_split_modules = ["MLPProjector"]
+    def __init__(self, config: VillanovaConfig) -> None:
+        super().__init__(config)
+        # Vision encoder
+        self.vision_encoder = ViTEncoder(config.vision_config)
+        # Projector
+        self.projector = MLPProjector(config.projector_config)
+        # Language model (will be loaded separately)
+        self.language_model: PreTrainedModel | None = None
+        # Image token index
+        self.image_token_index = config.image_token_index
+        self.post_init()
+    def get_input_embeddings(self) -> nn.Module | None:
+        """Get input embeddings from language model."""
+        if self.language_model is not None:
+            return self.language_model.get_input_embeddings()
+        return None
+    def set_input_embeddings(self, value: nn.Module) -> None:
+        """Set input embeddings in language model."""
+        if self.language_model is not None:
+            self.language_model.set_input_embeddings(value)
+    def get_output_embeddings(self) -> nn.Module | None:
+        """Get output embeddings from language model."""
+        if self.language_model is not None:
+            return self.language_model.get_output_embeddings()
+        return None
+    def set_output_embeddings(self, new_embeddings: nn.Module) -> None:
+        """Set output embeddings in language model."""
+        if self.language_model is not None:
+            self.language_model.set_output_embeddings(new_embeddings)
+    def _merge_input_ids_with_image_features(
+        self,
+        input_ids: torch.Tensor,
+        image_features: torch.Tensor,
+        attention_mask: torch.Tensor | None = None,
+    ) -> tuple[torch.Tensor, torch.Tensor | None]:
+        """Merge text embeddings with image features at <image> token positions.
+        This uses the EXPANSION approach (like LLaVA): a single <image> token in the
+        input is replaced with all num_patches visual feature tokens (576 for a
+        24x24 patch grid), so the sequence length increases by (num_patches - 1).
+        If the input already contains multiple <image> tokens (pre-expanded, as in
+        training), each one is overwritten in place by the corresponding visual
+        feature and the sequence length is unchanged.
+        Samples are then right-padded with zeros to the longest merged sequence.
+        """
+        batch_size = input_ids.shape[0]
+        num_patches = image_features.shape[1]
+        # Get text embeddings
+        text_embeddings = self.get_input_embeddings()(input_ids)
+        # Find image token positions
+        image_token_mask = input_ids == self.image_token_index
+        new_embeddings_list = []
+        new_attention_mask_list = [] if attention_mask is not None else None
+        for b in range(batch_size):
+            image_positions = torch.where(image_token_mask[b])[0]
+            num_image_tokens = len(image_positions)
+            if num_image_tokens == 0:
+                # No image tokens - keep original embeddings
+                new_embeddings_list.append(text_embeddings[b])
+                if attention_mask is not None:
+                    new_attention_mask_list.append(attention_mask[b])
+            elif num_image_tokens == 1:
+                # Single <image> token - expand to num_patches visual features
+                pos = image_positions[0].item()
+                before = text_embeddings[b, :pos]
+                after = text_embeddings[b, pos + 1:]
+                # Insert all visual features at the single <image> position
+                merged = torch.cat([before, image_features[b], after], dim=0)
+                new_embeddings_list.append(merged)
+                if attention_mask is not None:
+                    mask_before = attention_mask[b, :pos]
+                    mask_after = attention_mask[b, pos + 1:]
+                    image_mask = torch.ones(num_patches, dtype=attention_mask.dtype, device=attention_mask.device)
+                    merged_mask = torch.cat([mask_before, image_mask, mask_after], dim=0)
+                    new_attention_mask_list.append(merged_mask)
+            else:
+                # Multiple <image> tokens - replace each with corresponding visual feature
+                # This matches the training behavior when tokens are pre-expanded
+                output = text_embeddings[b].clone()
+                actual_patches = min(num_patches, num_image_tokens)
+                for i in range(actual_patches):
+                    pos = image_positions[i].item()
+                    output[pos] = image_features[b, i]
+                new_embeddings_list.append(output)
+                if attention_mask is not None:
+                    new_attention_mask_list.append(attention_mask[b])
+        # Pad to same length
+        max_len = max(e.shape[0] for e in new_embeddings_list)
+        padded_embeddings = torch.zeros(
+            batch_size, max_len, text_embeddings.shape[-1],
+            dtype=text_embeddings.dtype, device=text_embeddings.device
+        )
+        for b, emb in enumerate(new_embeddings_list):
+            padded_embeddings[b, :emb.shape[0]] = emb
+        padded_attention_mask = None
+        if new_attention_mask_list is not None:
+            padded_attention_mask = torch.zeros(
+                batch_size, max_len, dtype=attention_mask.dtype, device=attention_mask.device
+            )
+            for b, mask in enumerate(new_attention_mask_list):
+                padded_attention_mask[b, :mask.shape[0]] = mask
+        return padded_embeddings, padded_attention_mask
+    def forward(
+        self,
+        input_ids: torch.Tensor | None = None,
+        pixel_values: torch.Tensor | None = None,
+        attention_mask: torch.Tensor | None = None,
+        labels: torch.Tensor | None = None,
+        inputs_embeds: torch.Tensor | None = None,
+        past_key_values: tuple | None = None,
+        use_cache: bool | None = None,
+        output_attentions: bool | None = None,
+        output_hidden_states: bool | None = None,
+        return_dict: bool | None = None,
+        **kwargs: Any,
+    ) -> CausalLMOutputWithPast | tuple:
+        """Forward pass."""
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        if self.language_model is None:
+            raise RuntimeError("Language model not initialized")
+        # Process image if provided
+        if pixel_values is not None and inputs_embeds is None:
+            image_features = self.vision_encoder(pixel_values)
+            # Cast to projector dtype (vision encoder may output float32)
+            image_features = image_features.to(self.projector.mlp[0].weight.dtype)
+            image_features = self.projector(image_features)
+            inputs_embeds, attention_mask = self._merge_input_ids_with_image_features(
+                input_ids, image_features, attention_mask
+            )
+            input_ids = None
+        return self.language_model(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            inputs_embeds=inputs_embeds,
+            labels=labels,
+            past_key_values=past_key_values,
+            use_cache=use_cache,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+    def generate(
+        self,
+        input_ids: torch.Tensor | None = None,
+        pixel_values: torch.Tensor | None = None,
+        attention_mask: torch.Tensor | None = None,
+        max_new_tokens: int = 256,
+        do_sample: bool = False,
+        temperature: float = 1.0,
+        top_p: float = 1.0,
+        top_k: int = 50,
+        **kwargs: Any,
+    ) -> torch.Tensor:
+        """Generate text conditioned on image and prompt."""
+        if self.language_model is None:
+            raise RuntimeError("Language model not initialized")
+        if pixel_values is not None:
+            image_features = self.vision_encoder(pixel_values)
+            # Cast to projector dtype (vision encoder may output float32)
+            image_features = image_features.to(self.projector.mlp[0].weight.dtype)
+            image_features = self.projector(image_features)
+            inputs_embeds, attention_mask = self._merge_input_ids_with_image_features(
+                input_ids, image_features, attention_mask
+            )
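+            # Get token IDs from kwargs, falling back to text_config.
+            # Compare against None explicitly so a valid token ID of 0 is not discarded.
+            text_config = self.config.text_config
+            pad_token_id = kwargs.pop("pad_token_id", None)
+            if pad_token_id is None:
+                pad_token_id = getattr(text_config, "pad_token_id", None)
+            eos_token_id = kwargs.pop("eos_token_id", None)
+            if eos_token_id is None:
+                eos_token_id = getattr(text_config, "eos_token_id", None)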
+            return self.language_model.generate(
+                inputs_embeds=inputs_embeds,
+                attention_mask=attention_mask,
+                max_new_tokens=max_new_tokens,
+                do_sample=do_sample,
+                temperature=temperature,
+                top_p=top_p,
+                top_k=top_k,
+                pad_token_id=pad_token_id,
+                eos_token_id=eos_token_id,
+                **kwargs,
+            )
+        return self.language_model.generate(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            max_new_tokens=max_new_tokens,
+            do_sample=do_sample,
+            temperature=temperature,
+            top_p=top_p,
+            top_k=top_k,
+            **kwargs,
+        )
+    @classmethod
+    def from_pretrained(
+        cls,
+        pretrained_model_name_or_path: str,
+        *model_args: Any,
+        config: VillanovaConfig | None = None,
+        torch_dtype: torch.dtype | str | None = None,
+        device_map: str | dict | None = None,
+        **kwargs: Any,
+    ) -> "VillanovaVLMForConditionalGeneration":
+        """Load pretrained model."""
+        from pathlib import Path
+        from safetensors.torch import load_file
+        from transformers import AutoConfig
+        # Remove trust_remote_code from kwargs to avoid passing it twice
+        kwargs.pop("trust_remote_code", None)
+        # Handle dtype/torch_dtype - newer transformers uses 'dtype' instead of 'torch_dtype'
+        if torch_dtype is None:
+            torch_dtype = kwargs.pop("dtype", None)
+        else:
+            kwargs.pop("dtype", None)  # Remove if both were passed
+        # Load config
+        if config is None:
+            config = AutoConfig.from_pretrained(
+                pretrained_model_name_or_path,
+                trust_remote_code=True,
+                **kwargs,
+            )
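+        # Handle torch_dtype string conversion
+        if isinstance(torch_dtype, str):
+            if torch_dtype == "auto":
+                # torch has no "auto" attribute, so getattr would raise AttributeError.
+                # Leave the dtype unset; the model is then built in the default dtype.
+                torch_dtype = None
+            else:
+                torch_dtype = getattr(torch, torch_dtype.replace("torch.", ""))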
+        # Create model
+        model = cls(config)
+        # Create LLM from text_config
+        # Get the text config dict
+        text_config_dict = config.text_config.to_dict() if hasattr(config.text_config, "to_dict") else dict(config.text_config)
+        # Check for nested text_config (used in VillanovaVLM training format)
+        if "text_config" in text_config_dict and isinstance(text_config_dict["text_config"], dict):
+            # Use the nested text_config which contains the actual LLM config
+            llm_config_dict = dict(text_config_dict["text_config"])
+        else:
+            llm_config_dict = text_config_dict
+        # Get model type from config to determine which model class to use
+        model_type = llm_config_dict.pop("model_type", "llama")
+        # Remove non-config keys
+        for key in ["_name_or_path", "transformers_version", "torch_dtype", "dtype"]:
+            llm_config_dict.pop(key, None)
+        # Create the LLM config and model (AutoConfig and AutoModelForCausalLM are already imported)
+        llm_config = AutoConfig.for_model(model_type, **llm_config_dict)
+        model.language_model = AutoModelForCausalLM.from_config(llm_config, torch_dtype=torch_dtype)
+        # Load all weights from safetensors
+        model_path = Path(pretrained_model_name_or_path)
+        if model_path.exists():
+            safetensors_files = sorted(model_path.glob("*.safetensors"))
+        else:
+            from huggingface_hub import hf_hub_download, list_repo_files
+            try:
+                # Get list of safetensor files from the repo
+                repo_files = list_repo_files(pretrained_model_name_or_path)
+                sf_files = [f for f in repo_files if f.endswith(".safetensors")]
+                safetensors_files = []
+                for sf in sf_files:
+                    sf_path = hf_hub_download(pretrained_model_name_or_path, sf)
+                    safetensors_files.append(Path(sf_path))
+            except Exception:
+                safetensors_files = []
+        vision_state_dict = {}
+        projector_state_dict = {}
+        llm_state_dict = {}
+        for sf_file in safetensors_files:
+            state_dict = load_file(sf_file)
+            for key, value in state_dict.items():
+                # Convert dtype if needed
+                if torch_dtype is not None:
+                    value = value.to(torch_dtype)
+                # Strip only the leading prefix; str.replace would remove every occurrence.
+                if key.startswith("vision_encoder."):
+                    new_key = key.removeprefix("vision_encoder.")
+                    vision_state_dict[new_key] = value
+                elif key.startswith("projector."):
+                    new_key = key.removeprefix("projector.")
+                    projector_state_dict[new_key] = value
+                elif key.startswith("language_model."):
+                    # LLM weights - strip the language_model. prefix
+                    new_key = key.removeprefix("language_model.")
+                    llm_state_dict[new_key] = value
+                else:
+                    # LLM weights without prefix (legacy format)
+                    llm_state_dict[key] = value
+        # Load weights into model components
+        # Note: vision_encoder uses OpenCLIP pretrained weights, not from safetensors
+        if projector_state_dict:
+            model.projector.load_state_dict(projector_state_dict, strict=False)
+        if llm_state_dict:
+            model.language_model.load_state_dict(llm_state_dict, strict=False)
+        # Convert model to target dtype AFTER loading weights
+        # load_state_dict doesn't change the model's dtype, so we must convert explicitly
+        if torch_dtype is not None:
+            model.projector = model.projector.to(dtype=torch_dtype)
+            model.language_model = model.language_model.to(dtype=torch_dtype)
+        # Handle device_map
+        if device_map is not None:
+            import accelerate
+            if device_map == "auto":
+                # Infer device map automatically
+                device_map = accelerate.infer_auto_device_map(
+                    model,
+                    max_memory=None,
+                    no_split_module_classes=["MLPProjector", "ViTEncoder"],
+                )
+            if isinstance(device_map, dict):
+                model = accelerate.dispatch_model(model, device_map=device_map)
+            else:
+                # Simple device placement
+                model = model.to(device_map)
+        return model
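
The merging logic in `_merge_input_ids_with_image_features` is easier to follow on a toy example. The sketch below mirrors its three cases with plain Python lists instead of tensors: text token IDs stand in for their embeddings, strings stand in for visual features, and three patches stand in for 576. The function name `expand_image_token` and the ID `32000` are hypothetical; the real ID comes from `config.image_token_index`.

```python
IMAGE_TOKEN_ID = 32000  # hypothetical; the model reads config.image_token_index


def expand_image_token(input_ids: list[int], image_features: list[str]) -> list:
    """Merge stand-in visual features into a token sequence.

    - no placeholder: sequence is returned unchanged
    - one placeholder: it is replaced by ALL features (sequence grows)
    - several placeholders: each is overwritten one-for-one (length unchanged)
    """
    positions = [i for i, tok in enumerate(input_ids) if tok == IMAGE_TOKEN_ID]
    if not positions:
        return list(input_ids)
    if len(positions) == 1:
        pos = positions[0]
        return input_ids[:pos] + image_features + input_ids[pos + 1:]
    merged = list(input_ids)
    # zip stops at the shorter of the two, like min(num_patches, num_image_tokens).
    for pos, feature in zip(positions, image_features):
        merged[pos] = feature
    return merged


ids = [1, IMAGE_TOKEN_ID, 42, 43]
features = ["v0", "v1", "v2"]
merged = expand_image_token(ids, features)
print(merged)  # [1, 'v0', 'v1', 'v2', 42, 43]
```

With one placeholder and three features the length goes from 4 to 6, that is, it grows by `num_patches - 1`. The same growth applies to the attention mask, which is why the method returns a new mask alongside the embeddings.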

preprocessor_config.json ADDED

	@@ -0,0 +1,28 @@

+{
+  "processor_class": "VillanovaProcessor",
+  "image_processor_type": "VillanovaImageProcessor",
+  "auto_map": {
+    "AutoProcessor": "processing_villanova.VillanovaProcessor",
+    "AutoImageProcessor": "image_processing_villanova.VillanovaImageProcessor"
+  },
+  "do_resize": true,
+  "size": {
+    "height": 384,
+    "width": 384
+  },
+  "resample": 3,
+  "do_rescale": true,
+  "rescale_factor": 0.00392156862745098,
+  "do_normalize": true,
+  "image_mean": [
+    0.5,
+    0.5,
+    0.5
+  ],
+  "image_std": [
+    0.5,
+    0.5,
+    0.5
+  ],
+  "do_convert_rgb": true
+}
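
Taken together, `rescale_factor` (1/255) and a mean and standard deviation of 0.5 map 8-bit pixel values from [0, 255] to roughly [-1, 1]. The `resample` value 3 corresponds to bicubic interpolation in Pillow's numbering. A minimal per-value sketch of the arithmetic follows; it assumes the image processor applies rescaling before normalization, which is the usual order but is not shown in this diff, and `normalize_pixel` is a hypothetical name.

```python
RESCALE_FACTOR = 0.00392156862745098  # 1 / 255, copied from preprocessor_config.json
IMAGE_MEAN = 0.5
IMAGE_STD = 0.5


def normalize_pixel(value: int) -> float:
    """Rescale an 8-bit channel value to [0, 1], then normalize with mean/std."""
    rescaled = value * RESCALE_FACTOR
    return (rescaled - IMAGE_MEAN) / IMAGE_STD


for v in (0, 128, 255):
    # Black maps to -1, mid-gray lands near 0, white lands at (about) +1.
    print(v, round(normalize_pixel(v), 4))
```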

processing_villanova.py ADDED

	@@ -0,0 +1,205 @@

+"""Villanova VLM Processor for HuggingFace.
+This is a standalone processor file for use with trust_remote_code=True.
+It contains no imports from aithlas_trainer to ensure self-containment.
+"""
+from typing import Any
+from PIL import Image
+from transformers import AutoTokenizer
+from transformers.feature_extraction_utils import BatchFeature
+from transformers.tokenization_utils_base import PreTokenizedInput, TextInput
+from .image_processing_villanova import VillanovaImageProcessor
+class VillanovaProcessor:
+    """Unified processor for Villanova VLM.
+    Combines VillanovaImageProcessor and the LLM tokenizer for easy
+    preprocessing of image-text pairs.
+    Args:
+        image_processor: VillanovaImageProcessor instance
+        tokenizer: LLM tokenizer instance
+    Example:
+        >>> processor = VillanovaProcessor.from_pretrained("VillanovaAI/Villanova-2B-VL-2512-Preview")
+        >>> image = Image.open("image.jpg")
+        >>> inputs = processor(images=image, text="Describe this image.", return_tensors="pt")
+        >>> print(inputs.keys())
+        dict_keys(['pixel_values', 'input_ids', 'attention_mask'])
+    """
+    attributes = ["image_processor", "tokenizer"]
+    image_processor_class = "VillanovaImageProcessor"
+    tokenizer_class = "AutoTokenizer"
+    def __init__(
+        self,
+        image_processor: VillanovaImageProcessor | None = None,
+        tokenizer: Any | None = None,
+        **kwargs: Any,
+    ) -> None:
+        if image_processor is None:
+            image_processor = VillanovaImageProcessor()
+        self.image_processor = image_processor
+        self.tokenizer = tokenizer
+    def __call__(
+        self,
+        images: Image.Image | list[Image.Image] | None = None,
+        text: TextInput | PreTokenizedInput | list[TextInput] | None = None,
+        padding: bool | str = False,
+        truncation: bool | None = None,
+        max_length: int | None = None,
+        return_tensors: str | None = None,
+        **kwargs: Any,
+    ) -> BatchFeature:
+        """Process images and/or text for the model.
+        Args:
+            images: Single image or list of images (PIL.Image, path, or URL)
+            text: Single text or list of texts
+            padding: Padding strategy
+            truncation: Whether to truncate
+            max_length: Maximum sequence length
+            return_tensors: Output tensor format ("pt", "np", etc.)
+        Returns:
+            BatchFeature with pixel_values, input_ids, attention_mask
+        Raises:
+            ValueError: If neither images nor text is provided
+        """
+        if images is None and text is None:
+            raise ValueError("You must provide either images or text or both")
+        result = BatchFeature()
+        # Process images
+        if images is not None:
+            image_features = self.image_processor(
+                images,
+                return_tensors=return_tensors,
+                **kwargs,
+            )
+            result.update(image_features)
+        # Process text
+        if text is not None:
+            text_features = self.tokenizer(
+                text,
+                padding=padding,
+                truncation=truncation,
+                max_length=max_length,
+                return_tensors=return_tensors,
+                **kwargs,
+            )
+            result.update(text_features)
+        return result
+    def batch_decode(self, *args: Any, **kwargs: Any) -> list[str]:
+        """Decode token IDs to text.
+        Delegates to the tokenizer's batch_decode method.
+        """
+        return self.tokenizer.batch_decode(*args, **kwargs)
+    def decode(self, *args: Any, **kwargs: Any) -> str:
+        """Decode token IDs to text.
+        Delegates to the tokenizer's decode method.
+        """
+        return self.tokenizer.decode(*args, **kwargs)
+    def apply_chat_template(
+        self,
+        conversation: list[dict],
+        add_generation_prompt: bool = False,
+        **kwargs: Any,
+    ) -> str:
+        """Apply chat template to conversation.
+        Args:
+            conversation: List of message dicts with "role" and "content"
+            add_generation_prompt: Whether to add generation prompt
+        Returns:
+            Formatted prompt string
+        Example:
+            >>> messages = [{"role": "user", "content": "<image>\\nDescribe this."}]
+            >>> prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
+        """
+        return self.tokenizer.apply_chat_template(
+            conversation,
+            add_generation_prompt=add_generation_prompt,
+            tokenize=False,
+            **kwargs,
+        )
+    @property
+    def model_input_names(self) -> list[str]:
+        """Get model input names."""
+        tokenizer_input_names = self.tokenizer.model_input_names
+        image_processor_input_names = self.image_processor.model_input_names
+        return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))
+    @classmethod
+    def from_pretrained(
+        cls,
+        pretrained_model_name_or_path: str,
+        **kwargs: Any,
+    ) -> "VillanovaProcessor":
+        """Load processor from pretrained model.
+        Args:
+            pretrained_model_name_or_path: Model ID or local path
+        Returns:
+            VillanovaProcessor instance
+        """
+        # Remove trust_remote_code from kwargs to avoid passing it twice
+        kwargs.pop("trust_remote_code", None)
+        image_processor = VillanovaImageProcessor.from_pretrained(
+            pretrained_model_name_or_path,
+            **kwargs,
+        )
+        tokenizer = AutoTokenizer.from_pretrained(
+            pretrained_model_name_or_path,
+            trust_remote_code=True,
+            **kwargs,
+        )
+        return cls(image_processor=image_processor, tokenizer=tokenizer)
+
+    def save_pretrained(
+        self,
+        save_directory: str,
+        **kwargs: Any,
+    ) -> None:
+        """Save processor to directory.
+        Args:
+            save_directory: Directory to save to
+        """
+        self.image_processor.save_pretrained(save_directory, **kwargs)
+        self.tokenizer.save_pretrained(save_directory, **kwargs)
+
+    @classmethod
+    def register_for_auto_class(cls, auto_class: str = "AutoProcessor") -> None:
+        """Register this class for automatic loading.
+        This is a no-op for custom processors loaded with trust_remote_code=True,
+        but required by the transformers auto-loading mechanism.
+        Args:
+            auto_class: The auto class to register with (default: "AutoProcessor")
+        """
+        # No-op - custom classes loaded via trust_remote_code don't need registration
+        pass
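
The `model_input_names` property in the hunk above concatenates the tokenizer's and the image processor's input names and removes duplicates with `dict.fromkeys`, which keeps the first occurrence of each name in order. A minimal sketch of that idiom follows; the helper name `merge_input_names` and the example name lists are assumptions made for illustration, since the real lists come from the loaded tokenizer and image processor and are not shown in this diff.

```python
def merge_input_names(tokenizer_names, image_processor_names):
    """Merge two name lists, dropping duplicates but keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+),
    # so this is an order-preserving de-duplication.
    return list(dict.fromkeys(tokenizer_names + image_processor_names))


# Hypothetical input names, for illustration only.
tokenizer_names = ["input_ids", "attention_mask"]
image_processor_names = ["pixel_values", "attention_mask"]

merged = merge_input_names(tokenizer_names, image_processor_names)
print(merged)
```

A plain `set` would also remove the duplicate, but it would not guarantee a stable order, which is likely why the ordered idiom is used here.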

special_tokens_map.json ADDED

	@@ -0,0 +1,47 @@

+{
+  "additional_special_tokens": [
+    "<image>"
+  ],
+  "bos_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "cls_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}
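
Several roles in the map above point at the same token string. The sketch below groups the roles by token content to make the sharing explicit; the dictionary is copied from the file above, reduced to role and content. One consequence worth keeping in mind, stated here as a general observation rather than something this commit documents: because `pad_token` reuses `</s>`, padded positions cannot be told apart from a real end-of-sequence token by id alone, so downstream code would typically rely on the attention mask.

```python
from collections import defaultdict

# Role -> token content, copied from special_tokens_map.json above.
special_tokens = {
    "bos_token": "<s>",
    "cls_token": "<s>",
    "eos_token": "</s>",
    "pad_token": "</s>",
    "sep_token": "</s>",
    "unk_token": "<unk>",
}

# Group roles by the token string they map to.
roles_by_token = defaultdict(list)
for role, content in special_tokens.items():
    roles_by_token[content].append(role)

# Keep only tokens that serve more than one role.
shared = {tok: roles for tok, roles in roles_by_token.items() if len(roles) > 1}
print(shared)
```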

tokenizer.json ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:23632cdff814fe6ae5eb6159980453467d5e93ca315c82e4e13dadc78da7d525
+size 37007600

tokenizer.model ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ab94ddf46d14f0279254858d53770c5319c5129d47291ee2bada530271cb1292
+size 4813276

tokenizer_config.json ADDED

	@@ -0,0 +1,1113 @@

+{
+  "add_bos_token": true,
+  "add_eos_token": false,
+  "add_prefix_space": true,
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<unk>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "<pad>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "4": {
+      "content": "<|im_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "5": {
+      "content": "<|im_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "6": {
+      "content": "<|reserved_token_1|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "7": {
+      "content": "<|reserved_token_2|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "8": {
+      "content": "<|reserved_token_3|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "9": {
+      "content": "<|reserved_token_4|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "10": {
+      "content": "<|reserved_token_5|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "11": {
+      "content": "<|reserved_token_6|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "12": {
+      "content": "<|reserved_token_7|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "13": {
+      "content": "<|reserved_token_8|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "14": {
+      "content": "<|reserved_token_9|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "15": {
+      "content": "<|reserved_token_10|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "16": {
+      "content": "<|reserved_token_11|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "17": {
+      "content": "<|reserved_token_12|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "18": {
+      "content": "<|reserved_token_13|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "19": {
+      "content": "<|reserved_token_14|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "20": {
+      "content": "<|reserved_token_15|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "21": {
+      "content": "<|reserved_token_16|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "22": {
+      "content": "<|reserved_token_17|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "23": {
+      "content": "<|reserved_token_18|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "24": {
+      "content": "<|reserved_token_19|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "25": {
+      "content": "<|reserved_token_20|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "26": {
+      "content": "<|reserved_token_21|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "27": {
+      "content": "<|reserved_token_22|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "28": {
+      "content": "<|reserved_token_23|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "29": {
+      "content": "<|reserved_token_24|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "30": {
+      "content": "<|reserved_token_25|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "31": {
+      "content": "<|reserved_token_26|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "32": {
+      "content": "<|reserved_token_27|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "33": {
+      "content": "<|reserved_token_28|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "34": {
+      "content": "<|reserved_token_29|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "35": {
+      "content": "<|reserved_token_30|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "36": {
+      "content": "<|reserved_token_31|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "37": {
+      "content": "<|reserved_token_32|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "38": {
+      "content": "<|reserved_token_33|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "39": {
+      "content": "<|reserved_token_34|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "40": {
+      "content": "<|reserved_token_35|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "41": {
+      "content": "<|reserved_token_36|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "42": {
+      "content": "<|reserved_token_37|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "43": {
+      "content": "<|reserved_token_38|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "44": {
+      "content": "<|reserved_token_39|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "45": {
+      "content": "<|reserved_token_40|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "46": {
+      "content": "<|reserved_token_41|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "47": {
+      "content": "<|reserved_token_42|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "48": {
+      "content": "<|reserved_token_43|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "49": {
+      "content": "<|reserved_token_44|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "50": {
+      "content": "<|reserved_token_45|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "51": {
+      "content": "<|reserved_token_46|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "52": {
+      "content": "<|reserved_token_47|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "53": {
+      "content": "<|reserved_token_48|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "54": {
+      "content": "<|reserved_token_49|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "55": {
+      "content": "<|reserved_token_50|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "56": {
+      "content": "<|reserved_token_51|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "57": {
+      "content": "<|reserved_token_52|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "58": {
+      "content": "<|reserved_token_53|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "59": {
+      "content": "<|reserved_token_54|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "60": {
+      "content": "<|reserved_token_55|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "61": {
+      "content": "<|reserved_token_56|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "62": {
+      "content": "<|reserved_token_57|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "63": {
+      "content": "<|reserved_token_58|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "64": {
+      "content": "<|reserved_token_59|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "65": {
+      "content": "<|reserved_token_60|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "66": {
+      "content": "<|reserved_token_61|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "67": {
+      "content": "<|reserved_token_62|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "68": {
+      "content": "<|reserved_token_63|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "69": {
+      "content": "<|reserved_token_64|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "70": {
+      "content": "<|reserved_token_65|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "71": {
+      "content": "<|reserved_token_66|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "72": {
+      "content": "<|reserved_token_67|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "73": {
+      "content": "<|reserved_token_68|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "74": {
+      "content": "<|reserved_token_69|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "75": {
+      "content": "<|reserved_token_70|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "76": {
+      "content": "<|reserved_token_71|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "77": {
+      "content": "<|reserved_token_72|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "78": {
+      "content": "<|reserved_token_73|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "79": {
+      "content": "<|reserved_token_74|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "80": {
+      "content": "<|reserved_token_75|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "81": {
+      "content": "<|reserved_token_76|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "82": {
+      "content": "<|reserved_token_77|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "83": {
+      "content": "<|reserved_token_78|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "84": {
+      "content": "<|reserved_token_79|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "85": {
+      "content": "<|reserved_token_80|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "86": {
+      "content": "<|reserved_token_81|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "87": {
+      "content": "<|reserved_token_82|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "88": {
+      "content": "<|reserved_token_83|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "89": {
+      "content": "<|reserved_token_84|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "90": {
+      "content": "<|reserved_token_85|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "91": {
+      "content": "<|reserved_token_86|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "92": {
+      "content": "<|reserved_token_87|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "93": {
+      "content": "<|reserved_token_88|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "94": {
+      "content": "<|reserved_token_89|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "95": {
+      "content": "<|reserved_token_90|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "96": {
+      "content": "<|reserved_token_91|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "97": {
+      "content": "<|reserved_token_92|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "98": {
+      "content": "<|reserved_token_93|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "99": {
+      "content": "<|reserved_token_94|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "100": {
+      "content": "<|reserved_token_95|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "101": {
+      "content": "<|reserved_token_96|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "102": {
+      "content": "<|reserved_token_97|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "103": {
+      "content": "<|reserved_token_98|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "104": {
+      "content": "\\r",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "105": {
+      "content": "▁▁",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "106": {
+      "content": "▁▁▁",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "107": {
+      "content": "▁▁▁▁",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "108": {
+      "content": "▁▁▁▁▁",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "109": {
+      "content": "▁▁▁▁▁▁",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "110": {
+      "content": "▁▁▁▁▁▁▁",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "111": {
+      "content": "▁▁▁▁▁▁▁▁",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "112": {
+      "content": "▁▁▁▁▁▁▁▁▁",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "113": {
+      "content": "▁▁▁▁▁▁▁▁▁▁",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "114": {
+      "content": "▁▁▁▁▁▁▁▁▁▁▁",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "115": {
+      "content": "▁▁▁▁▁▁▁▁▁▁▁▁",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "116": {
+      "content": "▁▁▁▁▁▁▁▁▁▁▁▁▁",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "117": {
+      "content": "▁▁▁▁▁▁▁▁▁▁▁▁▁▁",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "118": {
+      "content": "▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "119": {
+      "content": "▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "120": {
+      "content": "▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "121": {
+      "content": "▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "122": {
+      "content": "▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "123": {
+      "content": "▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "124": {
+      "content": "▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "125": {
+      "content": "▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "126": {
+      "content": "▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "127": {
+      "content": "▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "128": {
+      "content": "\t\t",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "129": {
+      "content": "\t\t\t",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "130": {
+      "content": "\t\t\t\t",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "131": {
+      "content": "\t\t\t\t\t",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "132": {
+      "content": "\t\t\t\t\t\t",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "133": {
+      "content": "\n\n",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "134": {
+      "content": "\n\n\n",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "256000": {
+      "content": "<image>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "additional_special_tokens": [
+    "<image>"
+  ],
+  "bos_token": "<s>",
+  "clean_up_tokenization_spaces": false,
+  "cls_token": "<s>",
+  "eos_token": "</s>",
+  "extra_special_tokens": {},
+  "legacy": false,
+  "local_files_only": true,
+  "model_max_length": 1000000000000000019884624838656,
+  "pad_token": "</s>",
+  "sep_token": "</s>",
+  "sp_model_kwargs": {},
+  "spaces_between_special_tokens": false,
+  "tokenizer_class": "LlamaTokenizer",
+  "unk_token": "<unk>",
+  "use_default_system_prompt": false
+}
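
Most of `added_tokens_decoder` above appears to follow a regular pattern: ids 6 to 103 hold `<|reserved_token_1|>` through `<|reserved_token_98|>`, ids 105 to 127 hold runs of 2 to 24 `▁` characters, ids 128 to 132 hold runs of 2 to 6 tabs, and ids 133 and 134 hold two and three newlines. The sketch below rebuilds that mapping so it can be compared against a loaded config; the function name `expected_pattern` is hypothetical, and id 104 is deliberately left out because its escaping in the listing is ambiguous. The last line checks that the `model_max_length` value shown above equals `int(1e30)`, which, as far as I know, is the sentinel `transformers` writes when no maximum length is configured.

```python
def expected_pattern():
    """Rebuild the id -> content pattern read off added_tokens_decoder above."""
    tokens = {
        0: "<unk>", 1: "<s>", 2: "</s>", 3: "<pad>",
        4: "<|im_start|>", 5: "<|im_end|>",
    }
    for i in range(1, 99):        # ids 6..103
        tokens[5 + i] = f"<|reserved_token_{i}|>"
    for n in range(2, 25):        # ids 105..127: runs of 2..24 "▁" (U+2581)
        tokens[103 + n] = "\u2581" * n
    for n in range(2, 7):         # ids 128..132: runs of 2..6 tabs
        tokens[126 + n] = "\t" * n
    tokens[133] = "\n\n"
    tokens[134] = "\n\n\n"
    tokens[256000] = "<image>"    # the only entry outside the 0..134 range
    return tokens


tokens = expected_pattern()
print(len(tokens), tokens[6], tokens[103])

# model_max_length in the config is the integer value of the float 1e30.
NO_LIMIT = int(1e30)
print(NO_LIMIT == 1000000000000000019884624838656)
```

To compare against a real file, the `added_tokens_decoder` keys would need converting from strings to integers first, since JSON object keys are always strings.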