itlevy

eladsegal

esegal committed on Mar 26

Commit

3f4b61d

0 Parent(s):

initial commit

Co-authored-by: eladsegal <eladsegal@users.noreply.huggingface.co>
Co-authored-by: esegal <esegal@users.noreply.huggingface.co>

Files changed (28)

.gitattributes +38 -0
README.md +323 -0
bias.md +4 -0
chat_template.jinja +331 -0
config.json +242 -0
configuration_gpt_oss_puzzle.py +65 -0
explainability.md +13 -0
fig1.png +3 -0
fig2.png +3 -0
generation_config.json +10 -0
model-00001-of-00011.safetensors +3 -0
model-00002-of-00011.safetensors +3 -0
model-00003-of-00011.safetensors +3 -0
model-00004-of-00011.safetensors +3 -0
model-00005-of-00011.safetensors +3 -0
model-00006-of-00011.safetensors +3 -0
model-00007-of-00011.safetensors +3 -0
model-00008-of-00011.safetensors +3 -0
model-00009-of-00011.safetensors +3 -0
model-00010-of-00011.safetensors +3 -0
model-00011-of-00011.safetensors +3 -0
model.safetensors.index.json +766 -0
modeling_gpt_oss_puzzle.py +260 -0
privacy.md +12 -0
safety.md +6 -0
special_tokens_map.json +23 -0
tokenizer.json +3 -0
tokenizer_config.json +183 -0
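
The eleven `model-000NN-of-00011.safetensors` shards in the list above are tied together by `model.safetensors.index.json`. As a rough sketch, assuming the usual sharded-checkpoint layout in which the index holds a `weight_map` from tensor name to shard file, the following counts tensors per shard. The tensor names in the toy index are hypothetical and are not taken from this repository.

```python
import json
from collections import Counter


def tensors_per_shard(index: dict) -> Counter:
    """Count how many tensors each shard file holds, given a parsed index."""
    return Counter(index["weight_map"].values())


# Toy index with hypothetical tensor names. A real index would be parsed
# from model.safetensors.index.json with json.load instead.
toy_index = json.loads("""
{
  "metadata": {"total_size": 0},
  "weight_map": {
    "model.embed_tokens.weight": "model-00001-of-00011.safetensors",
    "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00011.safetensors",
    "lm_head.weight": "model-00011-of-00011.safetensors"
  }
}
""")

counts = tensors_per_shard(toy_index)
for shard, n in sorted(counts.items()):
    print(f"{shard}: {n} tensors")
```

Grouping by shard like this is a quick way to see which files a loader must open to materialize a given layer.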

.gitattributes ADDED

	@@ -0,0 +1,38 @@

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
+fig2.png filter=lfs diff=lfs merge=lfs -text
+fig1.png filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
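
These patterns decide which files are stored through Git LFS rather than as regular blobs. What git itself holds for an LFS-tracked file is a three-line pointer, which is why the shards, the two figures, and `tokenizer.json` show up as `+3 -0` in the file list. The sketch below approximates the matching with Python's `fnmatch` on the file's base name. It is only an illustration: Git's own attribute matching has more rules (for example for path patterns such as `saved_model/**/*`), and the helper name `is_lfs_tracked` is hypothetical.

```python
from fnmatch import fnmatch

# A few of the patterns from the .gitattributes diff above.
LFS_PATTERNS = ["*.safetensors", "*.bin", "fig1.png", "fig2.png", "tokenizer.json"]


def is_lfs_tracked(path: str, patterns=LFS_PATTERNS) -> bool:
    """Approximate check: match the file's base name against each pattern."""
    name = path.rsplit("/", 1)[-1]
    return any(fnmatch(name, pattern) for pattern in patterns)


for path in [
    "model-00001-of-00011.safetensors",
    "tokenizer.json",
    "config.json",
    "README.md",
]:
    print(f"{path}: {'LFS' if is_lfs_tracked(path) else 'plain git'}")
```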

README.md ADDED

	@@ -0,0 +1,323 @@

+---
+library_name: transformers
+license: other
+license_name: nvidia-open-model-license
+license_link: >-
+  https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
+pipeline_tag: text-generation
+language:
+  - en
+tags:
+  - nvidia
+  - gpt-oss
+  - puzzle
+  - mixture-of-experts
+  - reasoning
+  - pytorch
+  - transformers
+  - vllm
+---
+# gpt-oss-puzzle-88B
+# Model Overview
+### Description
+gpt-oss-puzzle-88B is a deployment-optimized large language model developed by NVIDIA, derived from [OpenAI's gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b).
+The model is produced using Puzzle, a post-training neural architecture search (NAS) framework, with the goal of significantly improving inference efficiency for reasoning-heavy workloads while maintaining or improving accuracy across reasoning budgets.
+The model is specifically optimized for long-context and short-context serving on NVIDIA H100-class hardware, where reasoning models are often bottlenecked by KV-cache bandwidth and memory capacity rather than raw compute.
+Compared to its parent, gpt-oss-puzzle-88B:
+- Reduces total parameters to ~88B (≈73% of the parent),
+- Achieves 1.63× throughput improvement in long-context (64K/64K) scenarios on an 8×H100 node,
+- Achieves 1.22× throughput improvement in short-context (4K/4K) scenarios,
+- Delivers up to 2.82× throughput improvement on a single H100 GPU,
+- Matches or slightly exceeds parent accuracy across reasoning efforts.
+**Parameter count note.** Hugging Face Hub may automatically show this model as ~91B parameters. We refer to it as 88B because the automatic count includes additional MXFP4 quantization scale tensors for the MoE experts, which are typically not counted as model parameters.
+This model is ready for commercial use.
+![Accuracy vs Relative Request Rate](fig1.png)
+![Accuracy Retention and Throughput Speedup](fig2.png)
+### License/Terms of Use
+Governing Terms: Use of this model is governed by the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/).
+### Deployment Geography
+Global
+### Use Case
+gpt-oss-puzzle-88B is a general purpose reasoning and chat model. This model is intended for production deployment, cost-efficient reasoning, and long-context inference workloads.
+### Release Date
+March 26, 2026 via [Hugging Face](https://huggingface.co/nvidia/gpt-oss-puzzle-88B)
+## Reference(s)
+* [\[2411.19146\] Puzzle: Distillation-Based NAS for Inference-Optimized LLMs](https://arxiv.org/abs/2411.19146)
+* [\[2508.10925\] gpt-oss-120b & gpt-oss-20b Model Card](https://arxiv.org/abs/2508.10925)
+* [\[2602.11937\] Extending Puzzle for Mixture-of-Experts Reasoning Models with Application to GPT-OSS Acceleration](https://arxiv.org/abs/2602.11937)
+## Model Architecture
+- **Architecture Type:** Mixture-of-Experts Decoder-only Transformer
+- **Network Architecture:** Modified [gpt-oss](https://huggingface.co/openai/gpt-oss-120b) architecture with a varying number of experts per layer and a modified global/window attention pattern across layers.
+- **Number of model parameters:** 88B
+### Key Architectural Optimizations
+This model was created using Puzzle, a post-training NAS framework that constructs a heterogeneous architecture under explicit deployment constraints:
+- Heterogeneous MoE Expert Pruning
+  Each MoE layer retains a different number of experts, determined via activation-based importance scoring. Early layers retain more experts; later layers are more aggressively pruned.
+- Selective Window Attention
+  A subset of global attention layers is replaced with window attention (8K window), reducing KV-cache footprint by ~40% in long-context scenarios while preserving long-range reasoning.
+- RoPE Scaling Adjustment
+  The YaRN RoPE scaling factor was increased to improve stability at 128K context length.
+## Training and Optimization Procedure
+### Knowledge Distillation
+After Puzzle architecture selection, the model underwent knowledge distillation:
+- Total Tokens: 84B
+- Sequence Length: 128K
+- MoE Experts & Router: Frozen
+- Framework: Megatron-LM
+This phase restores inter-block compatibility and recovers quality lost during blockwise substitution.
+### Reinforcement Learning
+A post-distillation reinforcement learning (RL) phase was applied to improve reasoning accuracy while controlling generation length:
+- Multi-environment RL (math, coding, reasoning)
+- MoE experts and router frozen
+- Two complementary policies trained:
+  - High-effort-focused (max accuracy)
+  - Mixed-effort (length-regularized)
+- Final model obtained via checkpoint weight averaging
+This preserves high reasoning accuracy while keeping generation length stable relative to the requested effort level, ensuring predictable cost-quality trade-offs.
+### Quantization
+- MoE Weights: MXFP4 (inherited from gpt-oss-120B)
+- KV Cache: FP8 with calibrated KV scales
+- Effect:
+  - ~2× KV-cache token capacity
+  - Faster attention kernels
+  - Preserved accuracy vs unscaled FP8 KV-cache
+## Reasoning Effort Control
+The model supports three reasoning effort modes:
+- Low: Fast, concise responses
+- Medium: Balanced accuracy and verbosity
+- High: Deep, multi-step reasoning
+Effort reliably controls generation length and accuracy, enabling cost-aware deployment.
+## Input
+- **Input Type(s):** Text
+- **Input Format(s):** String
+- **Input Parameters:** One-Dimensional (1D): Sequences
+- **Other Properties Related to Input:** Context length is 128k tokens.
+## Output
+- **Output Type(s):** Text
+- **Output Format:** String
+- **Output Parameters:** One-Dimensional (1D): Sequences
+- **Other Properties Related to Output:** Context length is 128k tokens.
+Our AI models are designed and optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
+## Software Integration
+**Runtime Engine(s):**
+* vLLM (See instructions [below](#vllm))
+**Supported Hardware Microarchitecture Compatibility:**
+* NVIDIA B200
+* NVIDIA H100-80GB
+**Preferred/Supported Operating System(s):**
+* Linux
+The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
+## Model Version
+- v1.0
+## Training and Evaluation Datasets
+### Dataset Overview
+**Total Number of Datasets:** 7
+**Time period for data collection:** 2013 to May 1, 2025
+For the KD stage data, the prompts from [nvidia/Llama-Nemotron-Post-Training-Dataset](https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset) were used to generate responses from the parent model (gpt-oss-120b) to create full KD training examples. For each prompt, we generated responses under high and medium reasoning-effort settings.
+For the RL stage data, we used a subset of the [NeMo Gym collection](https://huggingface.co/collections/nvidia/nemo-gym) which includes RL verifiable data.
+### Public Datasets
+- [nvidia/Llama-Nemotron-Post-Training-Dataset](https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset)
+- [nvidia/Nemotron-RL-coding-competitive_coding](https://huggingface.co/datasets/nvidia/Nemotron-RL-coding-competitive_coding)
+- [nvidia/Nemotron-RL-instruction_following](https://huggingface.co/datasets/nvidia/Nemotron-RL-instruction_following)
+- [BytedTsinghua-SIA/DAPO-Math-17k](https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k)
+- [Skywork/Skywork-OR1-RL-Data](https://huggingface.co/datasets/Skywork/Skywork-OR1-RL-Data)
+- [nvidia/Nemotron-RL-knowledge-mcqa](https://huggingface.co/datasets/nvidia/Nemotron-RL-knowledge-mcqa)
+- [nvidia/Nemotron-RL-instruction_following-structured_outputs](https://huggingface.co/datasets/nvidia/Nemotron-RL-instruction_following-structured_outputs)
+## Training Dataset
+**Data Modality**: Text
+**Text Training Data Size**: 1 Billion to 10 Trillion Tokens
+**Data Collection Method by dataset**: Automated/Synthetic/Human
+**Labeling Method by dataset**: Not Applicable
+**Properties**:
+The training data is text-only and spans a broad range of task categories. The knowledge distillation stage used the Llama-Nemotron-Post-Training-Dataset, a large-scale collection covering mathematics, code, science, instruction following, general chat, and safety. The reinforcement learning stage used datasets spanning several domains: competitive programming problems with unit tests (Nemotron-RL-coding-competitive_coding, Skywork-OR1-RL-Data), diverse verifiable mathematical reasoning problems (DAPO-Math-17k, Skywork-OR1-RL-Data), multi-domain multiple-choice question answering across fields such as physics, biology, chemistry, mathematics, computer science, engineering, humanities, law, and others (Nemotron-RL-knowledge-mcqa), easily verifiable instruction-following tasks with diverse format and linguistic constraints (Nemotron-RL-instruction_following), and structured output generation requiring adherence to JSON schemas (Nemotron-RL-instruction_following-structured_outputs). No personal data was used for training.
+## Evaluation Dataset
+**Data Collection Method by dataset:** Hybrid: Human, Synthetic
+**Labeling Method by dataset:** Hybrid: Automated, Human, Synthetic
+| Benchmark | Description |
+|-----------|-------------|
+| [**MMLU-Pro**](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro) | MMLU-Pro dataset is a more robust and challenging massive multi-task understanding dataset tailored to more rigorously benchmark large language models' capabilities. |
+| [**GPQA-Diamond**](https://huggingface.co/datasets/Idavidrein/gpqa) | The GPQA (Graduate-Level Google-Proof Q&A) benchmark is a challenging dataset of 448 multiple-choice questions in biology, physics, and chemistry. |
+| [**HLE**](https://huggingface.co/datasets/cais/hle) | Humanity's Last Exam (HLE) is a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. Humanity's Last Exam consists of 3,000 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. |
+| [**AA-LCR**](https://huggingface.co/datasets/ArtificialAnalysis/AA-LCR) | A challenging benchmark measuring language models' ability to extract, reason about, and synthesize information from long-form documents ranging from 10k to 100k tokens (measured using the cl100k_base tokenizer). |
+| [**AIME25**](https://huggingface.co/datasets/math-ai/aime25) | American Invitational Mathematics Examination (AIME) 2025 questions |
+| [**IFBench**](https://huggingface.co/datasets/allenai/IFBench_test) | IFBench is a new, challenging benchmark for precise instruction following. |
+| [**SciCode**](https://huggingface.co/datasets/SciCode1/SciCode) | SciCode is a challenging benchmark designed to evaluate the capabilities of LLMs in generating code for solving realistic scientific research problems. |
+| [**RULER 128K**](https://huggingface.co/datasets/GAIR/ruler-128k) | RULER generates synthetic examples to evaluate long-context language models with configurable sequence length and task complexity. Used with context length of 128K tokens. |
+# Inference
+**Acceleration Engine**: vLLM
+**Test Hardware:**
+- 1× NVIDIA H100-80GB
+- 8× NVIDIA H100-80GB
+- 8× NVIDIA B200
+## Quick Start
+The gpt-oss-puzzle-88B model can be used with standard inference stacks such as Hugging Face Transformers and vLLM.
+It is especially optimized for NVIDIA H100 GPUs and supports long-context inference up to 128K tokens.
+### Transformers
+We recommend using Transformers ≥ 4.57.3.
+```python
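+# GenerationConfig is imported because it is used below to load the model's default generation settings.
+from transformers import GenerationConfig, pipeline
+model_id = "nvidia/gpt-oss-puzzle-88B"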
+pipe = pipeline(
+    "text-generation",
+    model=model_id,
+    trust_remote_code=True,
+    dtype="auto",
+    device_map="auto",
+)
+messages = [
+    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
+]
+generation_config = GenerationConfig.from_pretrained(model_id)
+generation_config.max_new_tokens = 256
+outputs = pipe(
+    messages,
+    generation_config=generation_config,
+)
+print(outputs[0]["generated_text"][-1])
+```
+### vLLM
+#### Serving
+Start the server with a single command:
+```bash
+docker run --gpus all -p 8000:8000 \
+  --entrypoint bash \
+  vllm/vllm-openai:v0.17.1 \
+  -c "
+    apt-get update && apt-get install -y git &&
+    VLLM_USE_PRECOMPILED=1 pip install --no-build-isolation 'git+https://github.com/vllm-project/vllm.git@refs/pull/38135/head' &&
+    pip install flashinfer-cubin==0.6.6 flashinfer-jit-cache==0.6.6 --extra-index-url https://flashinfer.ai/whl/cu\$(echo \$CUDA_VERSION | cut -d. -f1,2 | tr -d '.') &&
+    export PYTORCH_ALLOC_CONF=expandable_segments:True &&
+    vllm serve nvidia/gpt-oss-puzzle-88B \
+      -tp 1 \
+      --trust-remote-code \
+      --kv-cache-dtype fp8 \
+      --max-num-batched-tokens 8192 \
+      --stream-interval 20 \
+      --gpu-memory-utilization 0.95 \
+      --max-num-seqs 8 \
+      --max-cudagraph-capture-size 8 \
+      --max-model-len 131072
+  "
+```
+> **Notes:**
+> - On Blackwell (B200), add `-e VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1` to the `docker run` command.
+> - Remove `--kv-cache-dtype fp8` for BF16 KV-cache instead of FP8.
+> - Increase `-tp` if you need larger batch sizes or longer sequences.
+> - Expert parallelism is supported via `--enable-expert-parallel`, but we recommend TP.
+#### Inference with Reasoning Effort Control
+The model supports three reasoning effort levels (`low`, `medium`, `high`). For example:
+```python
+from openai import OpenAI
+client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
+# High effort — deep, multi-step reasoning
+response = client.chat.completions.create(
+    model="nvidia/gpt-oss-puzzle-88B",
+    messages=[{"role": "user", "content": "Write a haiku about neural network pruning"}],
+    reasoning_effort="high",
+)
+print(response.choices[0].message.content)
+# Low effort — fast, concise responses
+response = client.chat.completions.create(
+    model="nvidia/gpt-oss-puzzle-88B",
+    messages=[{"role": "user", "content": "What is the capital of France?"}],
+    reasoning_effort="low",
+)
+print(response.choices[0].message.content)
+```
+## Ethical Considerations
+NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
+For more detailed information on ethical considerations for this model, please see the [Bias, Explainability, Safety & Security, and Privacy Subcards](https://huggingface.co/nvidia/gpt-oss-puzzle-88B).
+Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
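
In the chat template added in this same commit (`chat_template.jinja`), the effort level described in the README is rendered into the system message as a `Reasoning: <effort>` line, defaulting to `medium`. The sketch below is a plain-Python approximation of that system message for the case with no tools, written only to make the rendered text easy to inspect. The function name `render_system_message` and the `today` parameter are hypothetical, and the check against the three documented levels is an addition of this sketch; the template itself does not validate the value.

```python
from datetime import date

DEFAULT_IDENTITY = "You are ChatGPT, a large language model trained by OpenAI."
VALID_EFFORTS = ("low", "medium", "high")


def render_system_message(reasoning_effort="medium",
                          model_identity=DEFAULT_IDENTITY,
                          today=None):
    """Approximate the template's system message when no tools are passed."""
    # Validation is an assumption of this sketch, not template behaviour.
    if reasoning_effort not in VALID_EFFORTS:
        raise ValueError(f"reasoning_effort must be one of {VALID_EFFORTS}")
    today = today or date.today()
    body = (
        f"{model_identity}\n"
        "Knowledge cutoff: 2024-06\n"
        f"Current date: {today.strftime('%Y-%m-%d')}\n\n"
        f"Reasoning: {reasoning_effort}\n\n"
        "# Valid channels: analysis, commentary, final. "
        "Channel must be included for every message."
    )
    return f"<|start|>system<|message|>{body}<|end|>"


print(render_system_message("high"))
```

When the template is applied through Transformers, the same value would normally be supplied as a template keyword argument; how a given serving stack forwards `reasoning_effort` is not specified here.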

bias.md ADDED

	@@ -0,0 +1,4 @@

+| Field | Response |
+| :---- | :---- |
+| Participation considerations from adversely impacted groups [protected classes](https://www.senate.ca.gov/content/protected-classes) in model design and testing: | None |
+| Measures taken to mitigate against unwanted bias: | None |

chat_template.jinja ADDED

	@@ -0,0 +1,331 @@

+{#-
+  In addition to the normal inputs of `messages` and `tools`, this template also accepts the
+  following kwargs:
+  - "builtin_tools": A list, can contain "browser" and/or "python".
+  - "model_identity": A string that optionally describes the model identity.
+  - "reasoning_effort": A string that describes the reasoning effort, defaults to "medium".
+ #}
+{#- Tool Definition Rendering ============================================== #}
+{%- macro render_typescript_type(param_spec, required_params, is_nullable=false) -%}
+    {%- if param_spec.type == "array" -%}
+        {%- if param_spec['items'] -%}
+            {%- if param_spec['items']['type'] == "string" -%}
+                {{- "string[]" }}
+            {%- elif param_spec['items']['type'] == "number" -%}
+                {{- "number[]" }}
+            {%- elif param_spec['items']['type'] == "integer" -%}
+                {{- "number[]" }}
+            {%- elif param_spec['items']['type'] == "boolean" -%}
+                {{- "boolean[]" }}
+            {%- else -%}
+                {%- set inner_type = render_typescript_type(param_spec['items'], required_params) -%}
+                {%- if inner_type == "object | object" or inner_type|length > 50 -%}
+                    {{- "any[]" }}
+                {%- else -%}
+                    {{- inner_type + "[]" }}
+                {%- endif -%}
+            {%- endif -%}
+            {%- if param_spec.nullable -%}
+                {{- " | null" }}
+            {%- endif -%}
+        {%- else -%}
+            {{- "any[]" }}
+            {%- if param_spec.nullable -%}
+                {{- " | null" }}
+            {%- endif -%}
+        {%- endif -%}
+    {%- elif param_spec.type is defined and param_spec.type is iterable and param_spec.type is not string and param_spec.type is not mapping and param_spec.type[0] is defined -%}
+        {#- Handle array of types like ["object", "object"] from Union[dict, list] #}
+        {%- if param_spec.type | length > 1 -%}
+            {{- param_spec.type | join(" | ") }}
+        {%- else -%}
+            {{- param_spec.type[0] }}
+        {%- endif -%}
+    {%- elif param_spec.oneOf -%}
+        {#- Handle oneOf schemas - check for complex unions and fallback to any #}
+        {%- set has_object_variants = false -%}
+        {%- for variant in param_spec.oneOf -%}
+            {%- if variant.type == "object" -%}
+                {%- set has_object_variants = true -%}
+            {%- endif -%}
+        {%- endfor -%}
+        {%- if has_object_variants and param_spec.oneOf|length > 1 -%}
+            {{- "any" }}
+        {%- else -%}
+            {%- for variant in param_spec.oneOf -%}
+                {{- render_typescript_type(variant, required_params) -}}
+                {%- if variant.description %}
+                    {{- "// " + variant.description }}
+                {%- endif -%}
+                {%- if variant.default is defined %}
+                    {{ "// default: " + variant.default|tojson }}
+                {%- endif -%}
+                {%- if not loop.last %}
+                    {{- " | " }}
+                {% endif -%}
+            {%- endfor -%}
+        {%- endif -%}
+    {%- elif param_spec.type == "string" -%}
+        {%- if param_spec.enum -%}
+            {{- '"' + param_spec.enum|join('" | "') + '"' -}}
+        {%- else -%}
+            {{- "string" }}
+            {%- if param_spec.nullable %}
+                {{- " | null" }}
+            {%- endif -%}
+        {%- endif -%}
+    {%- elif param_spec.type == "number" -%}
+        {{- "number" }}
+    {%- elif param_spec.type == "integer" -%}
+        {{- "number" }}
+    {%- elif param_spec.type == "boolean" -%}
+        {{- "boolean" }}
+    {%- elif param_spec.type == "object" -%}
+        {%- if param_spec.properties -%}
+            {{- "{\n" }}
+            {%- for prop_name, prop_spec in param_spec.properties.items() -%}
+                {{- prop_name -}}
+                {%- if prop_name not in (param_spec.required or []) -%}
+                    {{- "?" }}
+                {%- endif -%}
+                {{- ": " }}
+                {{ render_typescript_type(prop_spec, param_spec.required or []) }}
+                {%- if not loop.last -%}
+                    {{-", " }}
+                {%- endif -%}
+            {%- endfor -%}
+            {{- "}" }}
+        {%- else -%}
+            {{- "object" }}
+        {%- endif -%}
+    {%- else -%}
+        {{- "any" }}
+    {%- endif -%}
+{%- endmacro -%}
+{%- macro render_tool_namespace(namespace_name, tools) -%}
+    {{- "## " + namespace_name + "\n\n" }}
+    {{- "namespace " + namespace_name + " {\n\n" }}
+    {%- for tool in tools %}
+        {%- set tool = tool.function %}
+        {{- "// " + tool.description + "\n" }}
+        {{- "type "+ tool.name + " = " }}
+        {%- if tool.parameters and tool.parameters.properties %}
+            {{- "(_: {\n" }}
+            {%- for param_name, param_spec in tool.parameters.properties.items() %}
+                {%- if param_spec.description %}
+                    {{- "// " + param_spec.description + "\n" }}
+                {%- endif %}
+                {{- param_name }}
+                {%- if param_name not in (tool.parameters.required or []) -%}
+                    {{- "?" }}
+                {%- endif -%}
+                {{- ": " }}
+                {{- render_typescript_type(param_spec, tool.parameters.required or []) }}
+                {%- if param_spec.default is defined -%}
+                    {%- if param_spec.enum %}
+                        {{- ", // default: " + param_spec.default }}
+                    {%- elif param_spec.oneOf %}
+                        {{- "// default: " + param_spec.default }}
+                    {%- else %}
+                        {{- ", // default: " + param_spec.default|tojson }}
+                    {%- endif -%}
+                {%- endif -%}
+                {%- if not loop.last %}
+                    {{- ",\n" }}
+                {%- else %}
+                    {{- ",\n" }}
+                {%- endif -%}
+            {%- endfor %}
+            {{- "}) => any;\n\n" }}
+        {%- else -%}
+            {{- "() => any;\n\n" }}
+        {%- endif -%}
+    {%- endfor %}
+    {{- "} // namespace " + namespace_name }}
+{%- endmacro -%}
+{%- macro render_builtin_tools(browser_tool, python_tool) -%}
+    {%- if browser_tool %}
+        {{- "## browser\n\n" }}
+        {{- "// Tool for browsing.\n" }}
+        {{- "// The `cursor` appears in brackets before each browsing display: `[{cursor}]`.\n" }}
+        {{- "// Cite information from the tool using the following format:\n" }}
+        {{- "// `【{cursor}†L{line_start}(-L{line_end})?】`, for example: `【6†L9-L11】` or `【8†L3】`.\n" }}
+        {{- "// Do not quote more than 10 words directly from the tool output.\n" }}
+        {{- "// sources=web (default: web)\n" }}
+        {{- "namespace browser {\n\n" }}
+        {{- "// Searches for information related to `query` and displays `topn` results.\n" }}
+        {{- "type search = (_: {\n" }}
+        {{- "query: string,\n" }}
+        {{- "topn?: number, // default: 10\n" }}
+        {{- "source?: string,\n" }}
+        {{- "}) => any;\n\n" }}
+        {{- "// Opens the link `id` from the page indicated by `cursor` starting at line number `loc`, showing `num_lines` lines.\n" }}
+        {{- "// Valid link ids are displayed with the formatting: `【{id}†.*】`.\n" }}
+        {{- "// If `cursor` is not provided, the most recent page is implied.\n" }}
+        {{- "// If `id` is a string, it is treated as a fully qualified URL associated with `source`.\n" }}
+        {{- "// If `loc` is not provided, the viewport will be positioned at the beginning of the document or centered on the most relevant passage, if available.\n" }}
+        {{- "// Use this function without `id` to scroll to a new location of an opened page.\n" }}
+        {{- "type open = (_: {\n" }}
+        {{- "id?: number | string, // default: -1\n" }}
+        {{- "cursor?: number, // default: -1\n" }}
+        {{- "loc?: number, // default: -1\n" }}
+        {{- "num_lines?: number, // default: -1\n" }}
+        {{- "view_source?: boolean, // default: false\n" }}
+        {{- "source?: string,\n" }}
+        {{- "}) => any;\n\n" }}
+        {{- "// Finds exact matches of `pattern` in the current page, or the page given by `cursor`.\n" }}
+        {{- "type find = (_: {\n" }}
+        {{- "pattern: string,\n" }}
+        {{- "cursor?: number, // default: -1\n" }}
+        {{- "}) => any;\n\n" }}
+        {{- "} // namespace browser\n\n" }}
+    {%- endif -%}
+    {%- if python_tool %}
+        {{- "## python\n\n" }}
+        {{- "Use this tool to execute Python code in your chain of thought. The code will not be shown to the user. This tool should be used for internal reasoning, but not for code that is intended to be visible to the user (e.g. when creating plots, tables, or files).\n\n" }}
+        {{- "When you send a message containing Python code to python, it will be executed in a stateful Jupyter notebook environment. python will respond with the output of the execution or time out after 120.0 seconds. The drive at '/mnt/data' can be used to save and persist user files. Internet access for this session is UNKNOWN. Depends on the cluster.\n\n" }}
+    {%- endif -%}
+{%- endmacro -%}
+{#- System Message Construction ============================================ #}
+{%- macro build_system_message() -%}
+    {%- if model_identity is not defined %}
+        {%- set model_identity = "You are ChatGPT, a large language model trained by OpenAI." %}
+    {%- endif %}
+    {{- model_identity + "\n" }}
+    {{- "Knowledge cutoff: 2024-06\n" }}
+    {{- "Current date: " + strftime_now("%Y-%m-%d") + "\n\n" }}
+    {%- if reasoning_effort is not defined %}
+        {%- set reasoning_effort = "medium" %}
+    {%- endif %}
+    {{- "Reasoning: " + reasoning_effort + "\n\n" }}
+    {%- if builtin_tools %}
+        {{- "# Tools\n\n" }}
+        {%- set available_builtin_tools = namespace(browser=false, python=false) %}
+        {%- for tool in builtin_tools %}
+            {%- if tool == "browser" %}
+                {%- set available_builtin_tools.browser = true %}
+            {%- elif tool == "python" %}
+                {%- set available_builtin_tools.python = true %}
+            {%- endif %}
+        {%- endfor %}
+        {{- render_builtin_tools(available_builtin_tools.browser, available_builtin_tools.python) }}
+    {%- endif -%}
+    {{- "# Valid channels: analysis, commentary, final. Channel must be included for every message." }}
+    {%- if tools -%}
+        {{- "\nCalls to these tools must go to the commentary channel: 'functions'." }}
+    {%- endif -%}
+{%- endmacro -%}
+{#- Main Template Logic ================================================= #}
+{#- Set defaults #}
+{#- Render system message #}
+{{- "<|start|>system<|message|>" }}
+{{- build_system_message() }}
+{{- "<|end|>" }}
+{#- Extract developer message #}
+{%- if messages[0].role == "developer" or messages[0].role == "system" %}
+    {%- set developer_message = messages[0].content %}
+    {%- set loop_messages = messages[1:] %}
+{%- else %}
+    {%- set developer_message = "" %}
+    {%- set loop_messages = messages %}
+{%- endif %}
+{#- Render developer message #}
+{%- if developer_message or tools %}
+    {{- "<|start|>developer<|message|>" }}
+    {%- if developer_message %}
+        {{- "# Instructions\n\n" }}
+        {{- developer_message }}
+        {{- "\n\n" }}
+    {%- endif %}
+    {%- if tools -%}
+        {{- "# Tools\n\n" }}
+        {{- render_tool_namespace("functions", tools) }}
+    {%- endif -%}
+    {{- "<|end|>" }}
+{%- endif %}
+{#- Render messages #}
+{%- set last_tool_call = namespace(name=none) %}
+{%- for message in loop_messages -%}
+    {#- At this point only assistant/user/tool messages should remain #}
+    {%- if message.role == 'assistant' -%}
+        {#- Checks to ensure the messages are being passed in the format we expect #}
+        {%- if "content" in message %}
+            {%- if "<|channel|>analysis<|message|>" in message.content or "<|channel|>final<|message|>" in message.content %}
+                {{- raise_exception("You have passed a message containing <|channel|> tags in the content field. Instead of doing this, you should pass analysis messages (the string between '<|message|>' and '<|end|>') in the 'thinking' field, and final messages (the string between '<|message|>' and '<|end|>') in the 'content' field.") }}
+            {%- endif %}
+        {%- endif %}
+        {%- if "thinking" in message %}
+            {%- if "<|channel|>analysis<|message|>" in message.thinking or "<|channel|>final<|message|>" in message.thinking %}
+                {{- raise_exception("You have passed a message containing <|channel|> tags in the thinking field. Instead of doing this, you should pass analysis messages (the string between '<|message|>' and '<|end|>') in the 'thinking' field, and final messages (the string between '<|message|>' and '<|end|>') in the 'content' field.") }}
+            {%- endif %}
+        {%- endif %}
+        {%- if "tool_calls" in message %}
+            {#- We need very careful handling here - we want to drop the tool call analysis message if the model #}
+            {#- has output a later <|final|> message, but otherwise we want to retain it. This is the only case #}
+            {#- when we render CoT/analysis messages in inference. #}
+            {%- set future_final_message = namespace(found=false) %}
+            {%- for future_message in loop_messages[loop.index:] %}
+                {%- if future_message.role == 'assistant' and "tool_calls" not in future_message %}
+                    {%- set future_final_message.found = true %}
+                {%- endif %}
+            {%- endfor %}
+            {#- We assume max 1 tool call per message, and so we infer the tool call name #}
+            {#- in "tool" messages from the most recent assistant tool call name #}
+            {%- set tool_call = message.tool_calls[0] %}
+            {%- if tool_call.function %}
+                {%- set tool_call = tool_call.function %}
+            {%- endif %}
+            {%- if message.content and message.thinking %}
+                {{- raise_exception("Cannot pass both content and thinking in an assistant message with tool calls! Put the analysis message in one or the other, but not both.") }}
+            {%- elif message.content and not future_final_message.found %}
+                {{- "<|start|>assistant<|channel|>analysis<|message|>" + message.content + "<|end|>" }}
+            {%- elif message.thinking and not future_final_message.found %}
+                {{- "<|start|>assistant<|channel|>analysis<|message|>" + message.thinking + "<|end|>" }}
+            {%- endif %}
+            {{- "<|start|>assistant to=" }}
+            {{- "functions." + tool_call.name + "<|channel|>commentary " }}
+            {{- (tool_call.content_type if tool_call.content_type is defined else "json") + "<|message|>" }}
+            {{- tool_call.arguments|tojson }}
+            {{- "<|call|>" }}
+            {%- set last_tool_call.name = tool_call.name %}
+        {%- elif loop.last and not add_generation_prompt %}
+            {#- Only render the CoT if the final turn is an assistant turn and add_generation_prompt is false #}
+            {#- This is a situation that should only occur in training, never in inference. #}
+            {%- if "thinking" in message %}
+                {{- "<|start|>assistant<|channel|>analysis<|message|>" + message.thinking + "<|end|>" }}
+            {%- endif %}
+            {#- <|return|> indicates the end of generation, but <|end|> does not #}
+            {#- <|return|> should never be an input to the model, but we include it as the final token #}
+            {#- when training, so the model learns to emit it. #}
+            {{- "<|start|>assistant<|channel|>final<|message|>" + message.content + "<|return|>" }}
+        {%- else %}
+            {#- CoT is dropped during all previous turns, so we never render it for inference #}
+            {{- "<|start|>assistant<|channel|>final<|message|>" + message.content + "<|end|>" }}
+            {%- set last_tool_call.name = none %}
+        {%- endif %}
+    {%- elif message.role == 'tool' -%}
+        {%- if last_tool_call.name is none %}
+            {{- raise_exception("Message has tool role, but there was no previous assistant message with a tool call!") }}
+        {%- endif %}
+        {{- "<|start|>functions." + last_tool_call.name }}
+        {{- " to=assistant<|channel|>commentary<|message|>" + message.content|tojson + "<|end|>" }}
+    {%- elif message.role == 'user' -%}
+        {{- "<|start|>user<|message|>" + message.content + "<|end|>" }}
+    {%- endif -%}
+{%- endfor -%}
+{#- Generation prompt #}
+{%- if add_generation_prompt -%}
+<|start|>assistant
+{%- endif -%}
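
The message loop at the end of the template can be hard to follow through the Jinja whitespace control. The following is a minimal Python sketch of that loop for plain user and assistant turns only. It is not the template itself: `render_turns` is a hypothetical helper, and it leaves out the system and developer messages, tool calls, and the `raise_exception` checks.

```python
def render_turns(messages, add_generation_prompt=True):
    """Simplified sketch of the template's message loop.

    Assumption: only 'user' and 'assistant' roles appear, with no tool calls.
    """
    out = []
    last = len(messages) - 1
    for i, m in enumerate(messages):
        if m["role"] == "user":
            out.append("<|start|>user<|message|>" + m["content"] + "<|end|>")
        elif m["role"] == "assistant":
            if i == last and not add_generation_prompt:
                # Training-style rendering: keep the CoT and close with <|return|>.
                if "thinking" in m:
                    out.append(
                        "<|start|>assistant<|channel|>analysis<|message|>"
                        + m["thinking"] + "<|end|>"
                    )
                out.append(
                    "<|start|>assistant<|channel|>final<|message|>"
                    + m["content"] + "<|return|>"
                )
            else:
                # Earlier turns: the CoT is dropped and the turn closes with <|end|>.
                out.append(
                    "<|start|>assistant<|channel|>final<|message|>"
                    + m["content"] + "<|end|>"
                )
    if add_generation_prompt:
        out.append("<|start|>assistant")
    return "".join(out)


chat = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "thinking": "greet", "content": "Hello"},
    {"role": "user", "content": "Bye"},
]
print(render_turns(chat))
```

With `add_generation_prompt=False` and an assistant turn in last position, the sketch keeps the `thinking` text and ends with `<|return|>`, which mirrors the branch the template comments describe as training-only.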

config.json ADDED

	@@ -0,0 +1,242 @@

+{
+  "architectures": [
+    "GptOssPuzzleForCausalLM"
+  ],
+  "attention_bias": true,
+  "attention_dropout": 0.0,
+  "auto_map": {
+    "AutoConfig": "configuration_gpt_oss_puzzle.GptOssPuzzleConfig",
+    "AutoModelForCausalLM": "modeling_gpt_oss_puzzle.GptOssPuzzleForCausalLM"
+  },
+  "block_configs": [
+    {
+      "num_local_experts": 128,
+      "sliding_window": 128
+    },
+    {
+      "num_local_experts": 128,
+      "sliding_window": null
+    },
+    {
+      "num_local_experts": 128,
+      "sliding_window": 128
+    },
+    {
+      "num_local_experts": 128,
+      "sliding_window": 8192
+    },
+    {
+      "num_local_experts": 128,
+      "sliding_window": 128
+    },
+    {
+      "num_local_experts": 128,
+      "sliding_window": 8192
+    },
+    {
+      "num_local_experts": 128,
+      "sliding_window": 128
+    },
+    {
+      "num_local_experts": 128,
+      "sliding_window": null
+    },
+    {
+      "num_local_experts": 128,
+      "sliding_window": 128
+    },
+    {
+      "num_local_experts": 128,
+      "sliding_window": null
+    },
+    {
+      "num_local_experts": 128,
+      "sliding_window": 128
+    },
+    {
+      "num_local_experts": 128,
+      "sliding_window": null
+    },
+    {
+      "num_local_experts": 128,
+      "sliding_window": 128
+    },
+    {
+      "num_local_experts": 128,
+      "sliding_window": null
+    },
+    {
+      "num_local_experts": 128,
+      "sliding_window": 128
+    },
+    {
+      "num_local_experts": 64,
+      "sliding_window": null
+    },
+    {
+      "num_local_experts": 128,
+      "sliding_window": 128
+    },
+    {
+      "num_local_experts": 64,
+      "sliding_window": null
+    },
+    {
+      "num_local_experts": 128,
+      "sliding_window": 128
+    },
+    {
+      "num_local_experts": 64,
+      "sliding_window": null
+    },
+    {
+      "num_local_experts": 128,
+      "sliding_window": 128
+    },
+    {
+      "num_local_experts": 64,
+      "sliding_window": 8192
+    },
+    {
+      "num_local_experts": 64,
+      "sliding_window": 128
+    },
+    {
+      "num_local_experts": 64,
+      "sliding_window": 8192
+    },
+    {
+      "num_local_experts": 64,
+      "sliding_window": 128
+    },
+    {
+      "num_local_experts": 64,
+      "sliding_window": null
+    },
+    {
+      "num_local_experts": 64,
+      "sliding_window": 128
+    },
+    {
+      "num_local_experts": 64,
+      "sliding_window": 8192
+    },
+    {
+      "num_local_experts": 64,
+      "sliding_window": 128
+    },
+    {
+      "num_local_experts": 64,
+      "sliding_window": null
+    },
+    {
+      "num_local_experts": 64,
+      "sliding_window": 128
+    },
+    {
+      "num_local_experts": 64,
+      "sliding_window": 8192
+    },
+    {
+      "num_local_experts": 64,
+      "sliding_window": 128
+    },
+    {
+      "num_local_experts": 64,
+      "sliding_window": 8192
+    },
+    {
+      "num_local_experts": 64,
+      "sliding_window": 128
+    },
+    {
+      "num_local_experts": 64,
+      "sliding_window": 8192
+    }
+  ],
+  "dtype": "bfloat16",
+  "eos_token_id": 200002,
+  "head_dim": 64,
+  "hidden_act": "silu",
+  "hidden_size": 2880,
+  "initializer_range": 0.02,
+  "intermediate_size": 2880,
+  "layer_types": [
+    "sliding_attention",
+    "full_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "full_attention",
+    "sliding_attention",
+    "full_attention",
+    "sliding_attention",
+    "full_attention",
+    "sliding_attention",
+    "full_attention",
+    "sliding_attention",
+    "full_attention",
+    "sliding_attention",
+    "full_attention",
+    "sliding_attention",
+    "full_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "full_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "full_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "sliding_attention"
+  ],
+  "max_position_embeddings": 229376,
+  "model_type": "gpt_oss_puzzle",
+  "num_attention_heads": 64,
+  "num_experts_per_tok": 4,
+  "num_hidden_layers": 36,
+  "num_key_value_heads": 8,
+  "output_router_logits": false,
+  "pad_token_id": 199999,
+  "quantization_config": {
+    "modules_to_not_convert": [
+      "model.layers.*.self_attn",
+      "model.layers.*.mlp.router",
+      "model.embed_tokens",
+      "lm_head"
+    ],
+    "quant_method": "mxfp4"
+  },
+  "rms_norm_eps": 1e-05,
+  "rope_parameters": {
+    "beta_fast": 32.0,
+    "beta_slow": 1.0,
+    "factor": 56.0,
+    "original_max_position_embeddings": 4096,
+    "rope_type": "yarn",
+    "truncate": false
+  },
+  "rope_scaling": {
+    "beta_fast": 32.0,
+    "beta_slow": 1.0,
+    "factor": 56.0,
+    "original_max_position_embeddings": 4096,
+    "rope_type": "yarn",
+    "truncate": false
+  },
+  "rope_theta": 150000,
+  "router_aux_loss_coef": 0.9,
+  "tie_word_embeddings": false,
+  "transformers_version": "4.57.6",
+  "use_cache": true,
+  "vocab_size": 201088
+}
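
A few fields in this config can be cross-checked against each other. The sketch below copies the per-block values from `block_configs` by hand (the two lists are transcriptions, not loaded from the file) and checks them against `layer_types`, `num_hidden_layers`, and the YaRN settings. `summarize` is a hypothetical helper name.

```python
# Per-block values transcribed from "block_configs" in the config above.
sliding_windows = [
    128, None, 128, 8192, 128, 8192, 128, None, 128, None, 128, None,
    128, None, 128, None, 128, None, 128, None, 128, 8192, 128, 8192,
    128, None, 128, 8192, 128, None, 128, 8192, 128, 8192, 128, 8192,
]
num_local_experts = [128] * 15 + [64, 128, 64, 128, 64, 128] + [64] * 15


def summarize():
    # A null sliding window means full attention; anything else is sliding.
    layer_types = [
        "full_attention" if w is None else "sliding_attention"
        for w in sliding_windows
    ]
    return {
        "num_layers": len(sliding_windows),
        "full_attention_layers": layer_types.count("full_attention"),
        "total_experts": sum(num_local_experts),
        # YaRN: scaled context = factor * original_max_position_embeddings.
        "yarn_context": int(56.0 * 4096),
    }


print(summarize())
```

The YaRN product should equal `max_position_embeddings`, and the layer count should equal `num_hidden_layers`.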

configuration_gpt_oss_puzzle.py ADDED

	@@ -0,0 +1,65 @@

+from typing import Any
+from dataclasses import asdict, dataclass, fields
+from transformers.models.gpt_oss.configuration_gpt_oss import GptOssConfig
+@dataclass
+class BlockConfig:
+    sliding_window: int | None  # None selects full attention for the block
+    num_local_experts: int
+LAYER_SPECIFIC_MEMBERS = [field.name for field in fields(BlockConfig)]
+class GptOssPuzzleConfig(GptOssConfig):
+    model_type = "gpt_oss_puzzle"
+    def __init__(self, *, block_configs: list[dict[str, Any]] | None = None, **kwargs):
+        self.block_configs = block_configs
+        super().__init__(**kwargs)
+        if self.block_configs is not None:
+            self.block_configs = [BlockConfig(**block_config) for block_config in self.block_configs]
+            self.layer_types = [
+                ("full_attention" if block_config.sliding_window is None else "sliding_attention")
+                for block_config in self.block_configs
+            ]
+            for member in LAYER_SPECIFIC_MEMBERS:
+                if hasattr(self, member):
+                    delattr(self, member)
+        else:
+            self.block_configs = [
+                BlockConfig(
+                    sliding_window=self.sliding_window,
+                    num_local_experts=self.num_local_experts,
+                )
+                for _ in range(self.num_hidden_layers)
+            ]
+    def __getattr__(self, name: str) -> Any:
+        if name in LAYER_SPECIFIC_MEMBERS:
+            raise AttributeError(
+                f"'{name}' is a per-block attribute and varies across blocks. "
+                f"Access it via the individual block configs instead (e.g. config.block_configs[i].{name})."
+            )
+        non_heterogeneous_error_message = f"'{type(self).__name__}' object has no attribute '{name}'"
+        raise AttributeError(non_heterogeneous_error_message)
+    def to_dict(self) -> dict[str, Any]:
+        output = super().to_dict()
+        output["block_configs"] = [asdict(block_config) for block_config in self.block_configs]
+        return output
+    def get_gpt_oss_config_for_layer(self, layer_idx: int) -> GptOssConfig:
+        config_dict = self.to_dict()
+        del config_dict["block_configs"]
+        block_config = self.block_configs[layer_idx]
+        config_dict["sliding_window"] = block_config.sliding_window
+        config_dict["num_local_experts"] = block_config.num_local_experts
+        return GptOssConfig.from_dict(config_dict, attn_implementation=self._attn_implementation)
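
The per-block attribute guard is easier to see without the `transformers` base class. The following sketch uses a hypothetical `ToyPuzzleConfig` that stands in for `GptOssPuzzleConfig` and keeps only the `block_configs` handling and the `__getattr__` guard. It assumes Python 3.9 or newer for the built-in generic annotations.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class BlockConfig:
    sliding_window: Optional[int]  # None means full attention
    num_local_experts: int


LAYER_SPECIFIC_MEMBERS = [f.name for f in fields(BlockConfig)]


class ToyPuzzleConfig:
    """Hypothetical stand-in for GptOssPuzzleConfig (no transformers dependency)."""

    def __init__(self, block_configs: list[dict[str, Any]]):
        self.block_configs = [BlockConfig(**b) for b in block_configs]
        self.layer_types = [
            "full_attention" if b.sliding_window is None else "sliding_attention"
            for b in self.block_configs
        ]

    def __getattr__(self, name: str) -> Any:
        # Python calls __getattr__ only after normal attribute lookup fails,
        # so attributes set in __init__ never reach this method.
        if name in LAYER_SPECIFIC_MEMBERS:
            raise AttributeError(
                f"'{name}' is a per-block attribute; "
                f"use config.block_configs[i].{name} instead."
            )
        raise AttributeError(
            f"'{type(self).__name__}' object has no attribute '{name}'"
        )


cfg = ToyPuzzleConfig(
    [
        {"num_local_experts": 128, "sliding_window": 128},
        {"num_local_experts": 64, "sliding_window": None},
    ]
)
print(cfg.layer_types)
print(cfg.block_configs[1].num_local_experts)
print(hasattr(cfg, "sliding_window"))
```

Because the guard raises `AttributeError`, `hasattr(cfg, "sliding_window")` is false, so code that probes for a global `sliding_window` will treat it as absent rather than reading a value that does not apply to every layer.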

explainability.md ADDED

	@@ -0,0 +1,13 @@

+| Field | Response |
+| :---- | :---- |
+| Intended Task/Domain: | Text generation, reasoning, and chat |
+| Model Type: | Text-to-text Mixture-of-Experts Transformer |
+| Intended Users: | Generative AI creators working with conversational AI models. |
+| Output: | Text |
+| Describe how the model works: | Generates text by predicting the next word or token based on the context provided in the input sequence using multiple self-attention layers. |
+| Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: | Not Applicable |
+| Technical Limitations & Mitigation: | This model performs particularly well in instruction-following regimes and, as such, may be strongly influenced by untrusted inputs. It should be paired with appropriate guardrails and data filtering to better align use-case behaviors when exposed to such data. |
+| Verified to have met prescribed NVIDIA quality standards: | Yes |
+| Performance Metrics: | Accuracy, Throughput, and User-side throughput |
+| Potential Known Risks: | The model was optimized explicitly for instruction following and, as a result of its instruction tuning, may be influenced by untrusted inputs (prompt injection, indirect prompt injection, jailbreaking, web search, etc.) in ways that may degrade safety alignment and other training efforts. This model should be paired with additional guardrails and data filtering to limit exposure to instructions from malicious sources. Bypassing of safety alignment, system guardrails, and filters may allow harmful outcomes up to and including remote code execution in some agentic systems when effective security controls are not in place. The model was trained on data that contains toxic language and societal biases originally crawled from the internet. Therefore, the model may generate and amplify harmful, biased, or otherwise unsafe content that reinforces these biases, and may return toxic responses, especially when given toxic prompts. The model may also generate answers that are inaccurate, omit key information, or include irrelevant or redundant text, producing socially unacceptable or undesirable text even if the prompt itself does not include anything explicitly offensive. The model may exhibit self-anthropomorphism (e.g., displaying human-like characteristics in dialogue, such as expressing preferences and emotions). In integrated system contexts, the model could potentially be exploited to access or disclose information beyond the model’s intended permissions or scope of operation. |
+| Licensing: | [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license) |

fig1.png ADDED

Git LFS Details

SHA256: b1e1bb2a9650e1541ad6ab62705de1c1b1b35d535e2097ab7c21ffef15315bf6
Pointer size: 131 Bytes
Size of remote file: 301 kB

fig2.png ADDED

Git LFS Details

SHA256: 348bd2b0c8db89e647e82e679419c606b1065a6a0d30a1959f110f033bed407b
Pointer size: 131 Bytes
Size of remote file: 206 kB

generation_config.json ADDED

	@@ -0,0 +1,10 @@

+{
+  "bos_token_id": 199998,
+  "do_sample": true,
+  "eos_token_id": [
+    200002,
+    199999
+  ],
+  "pad_token_id": 199999,
+  "transformers_version": "4.55.0.dev0"
+}

model-00001-of-00011.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:450b3564a3cc1ff4fe2ca900c440aa619b017d7555658f7979f43116e591f7ad
+size 4115581080

model-00002-of-00011.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e08e900384dff118d28397aa96caf7e4d9960a10f90217446841f2f716ce5ae9
+size 4678869240

model-00003-of-00011.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:050248abe00e9216471283d8a0aae075db722282bbbbf61439cec2cd3ea130f8
+size 4679238480

model-00004-of-00011.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:64e240da7cf5ac9531b3d0b9032c3660f261d0cb9787bd13878f5d7a5d499944
+size 4987817832

model-00005-of-00011.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:1ca2a05ac2f4e0c7b5b522471a9a0ac3fd6ce72e2db0089c7e544129ca7d077f
+size 4759633712

model-00006-of-00011.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:221819279728057cfc388cf354417ce0641b0df7b7c602c1eba57dd667be1c97
+size 4503088168

model-00007-of-00011.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:834f2121924b54f9960c299e06931ab8fc1f707245c1a06ea8d324532701ea43
+size 4980815736

model-00008-of-00011.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:5e1cda36a0e16c3ca61087c2718abfc0f4ce3c22c2e6268fa500f6c233e737f5
+size 4537371312

model-00009-of-00011.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:07632a1a122cb2ea3cf75d0f867f53e16d021bbc269f7f127b5397af7e929267
+size 4061736320

model-00010-of-00011.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:23bae85e80dab667efcfa959c74778b71a1fecd03d6ba3b97fdb1d79710e1c7f
+size 4678869176

model-00011-of-00011.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:9c4199a2da5a339d54f514066454e34cf9e9cf9563c1758772a5f466ebf5cf1b
+size 4010816256
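
Each shard above is committed as a three-line Git LFS pointer rather than the weights themselves. As a rough illustration, the sketch below parses one pointer into its fields. `parse_lfs_pointer` is a hypothetical helper, and the input is the first shard's pointer with the diff's leading `+` removed.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file made of 'key value' lines."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form '<hash algorithm>:<hex digest>'.
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }


pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:450b3564a3cc1ff4fe2ca900c440aa619b017d7555658f7979f43116e591f7ad
size 4115581080
"""
info = parse_lfs_pointer(pointer)
print(info["hash_algo"], info["size"])
```

The `size` field is the byte size of the actual shard, so it can be compared with the downloaded file before loading.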

model.safetensors.index.json ADDED

	@@ -0,0 +1,766 @@

+{
+  "metadata": {
+    "total_size": 49993753248
+  },
+  "weight_map": {
+    "model.layers.0.input_layernorm.weight": "model-00001-of-00011.safetensors",
+    "model.layers.0.self_attn.k_proj.bias": "model-00001-of-00011.safetensors",
+    "model.layers.0.self_attn.k_proj.weight": "model-00001-of-00011.safetensors",
+    "model.layers.0.self_attn.o_proj.bias": "model-00001-of-00011.safetensors",
+    "model.layers.0.self_attn.o_proj.weight": "model-00001-of-00011.safetensors",
+    "model.layers.0.self_attn.q_proj.bias": "model-00001-of-00011.safetensors",
+    "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00011.safetensors",
+    "model.layers.0.self_attn.sinks": "model-00001-of-00011.safetensors",
+    "model.layers.0.self_attn.v_proj.bias": "model-00001-of-00011.safetensors",
+    "model.layers.0.self_attn.v_proj.weight": "model-00001-of-00011.safetensors",
+    "model.layers.0.mlp.experts.down_proj_bias": "model-00001-of-00011.safetensors",
+    "model.layers.0.mlp.experts.down_proj_blocks": "model-00001-of-00011.safetensors",
+    "model.layers.0.mlp.experts.down_proj_scales": "model-00001-of-00011.safetensors",
+    "model.layers.0.mlp.experts.gate_up_proj_bias": "model-00001-of-00011.safetensors",
+    "model.layers.0.mlp.experts.gate_up_proj_blocks": "model-00001-of-00011.safetensors",
+    "model.layers.0.mlp.experts.gate_up_proj_scales": "model-00001-of-00011.safetensors",
+    "model.layers.0.mlp.router.bias": "model-00001-of-00011.safetensors",
+    "model.layers.0.mlp.router.weight": "model-00001-of-00011.safetensors",
+    "model.layers.0.post_attention_layernorm.weight": "model-00001-of-00011.safetensors",
+    "model.layers.10.input_layernorm.weight": "model-00001-of-00011.safetensors",
+    "model.layers.10.self_attn.k_proj.bias": "model-00001-of-00011.safetensors",
+    "model.layers.10.self_attn.k_proj.weight": "model-00001-of-00011.safetensors",
+    "model.layers.10.self_attn.o_proj.bias": "model-00001-of-00011.safetensors",
+    "model.layers.10.self_attn.o_proj.weight": "model-00001-of-00011.safetensors",
+    "model.layers.10.self_attn.q_proj.bias": "model-00001-of-00011.safetensors",
+    "model.layers.10.self_attn.q_proj.weight": "model-00001-of-00011.safetensors",
+    "model.layers.10.self_attn.sinks": "model-00001-of-00011.safetensors",
+    "model.layers.10.self_attn.v_proj.bias": "model-00001-of-00011.safetensors",
+    "model.layers.10.self_attn.v_proj.weight": "model-00001-of-00011.safetensors",
+    "model.layers.10.mlp.experts.down_proj_bias": "model-00001-of-00011.safetensors",
+    "model.layers.10.mlp.experts.down_proj_blocks": "model-00001-of-00011.safetensors",
+    "model.layers.10.mlp.experts.down_proj_scales": "model-00001-of-00011.safetensors",
+    "model.layers.10.mlp.experts.gate_up_proj_bias": "model-00001-of-00011.safetensors",
+    "model.layers.10.mlp.experts.gate_up_proj_blocks": "model-00001-of-00011.safetensors",
+    "model.layers.10.mlp.experts.gate_up_proj_scales": "model-00001-of-00011.safetensors",
+    "model.layers.10.mlp.router.bias": "model-00001-of-00011.safetensors",
+    "model.layers.10.mlp.router.weight": "model-00001-of-00011.safetensors",
+    "model.layers.10.post_attention_layernorm.weight": "model-00001-of-00011.safetensors",
+    "model.layers.11.input_layernorm.weight": "model-00001-of-00011.safetensors",
+    "model.layers.11.self_attn.k_proj.bias": "model-00001-of-00011.safetensors",
+    "model.layers.11.self_attn.k_proj.weight": "model-00001-of-00011.safetensors",
+    "model.layers.11.self_attn.o_proj.bias": "model-00001-of-00011.safetensors",
+    "model.layers.11.self_attn.o_proj.weight": "model-00001-of-00011.safetensors",
+    "model.layers.11.self_attn.q_proj.bias": "model-00001-of-00011.safetensors",
+    "model.layers.11.self_attn.q_proj.weight": "model-00001-of-00011.safetensors",
+    "model.layers.11.self_attn.sinks": "model-00001-of-00011.safetensors",
+    "model.layers.11.self_attn.v_proj.bias": "model-00001-of-00011.safetensors",
+    "model.layers.11.self_attn.v_proj.weight": "model-00001-of-00011.safetensors",
+    "model.layers.11.mlp.experts.down_proj_bias": "model-00001-of-00011.safetensors",
+    "model.layers.11.mlp.experts.down_proj_blocks": "model-00001-of-00011.safetensors",
+    "model.layers.11.mlp.experts.down_proj_scales": "model-00001-of-00011.safetensors",
+    "model.layers.11.mlp.experts.gate_up_proj_bias": "model-00001-of-00011.safetensors",
+    "model.layers.11.mlp.experts.gate_up_proj_blocks": "model-00002-of-00011.safetensors",
+    "model.layers.11.mlp.experts.gate_up_proj_scales": "model-00002-of-00011.safetensors",
+    "model.layers.11.mlp.router.bias": "model-00002-of-00011.safetensors",
+    "model.layers.11.mlp.router.weight": "model-00002-of-00011.safetensors",
+    "model.layers.11.post_attention_layernorm.weight": "model-00002-of-00011.safetensors",
+    "model.layers.12.input_layernorm.weight": "model-00002-of-00011.safetensors",
+    "model.layers.12.self_attn.k_proj.bias": "model-00002-of-00011.safetensors",
+    "model.layers.12.self_attn.k_proj.weight": "model-00002-of-00011.safetensors",
+    "model.layers.12.self_attn.o_proj.bias": "model-00002-of-00011.safetensors",
+    "model.layers.12.self_attn.o_proj.weight": "model-00002-of-00011.safetensors",
+    "model.layers.12.self_attn.q_proj.bias": "model-00002-of-00011.safetensors",
+    "model.layers.12.self_attn.q_proj.weight": "model-00002-of-00011.safetensors",
+    "model.layers.12.self_attn.sinks": "model-00002-of-00011.safetensors",
+    "model.layers.12.self_attn.v_proj.bias": "model-00002-of-00011.safetensors",
+    "model.layers.12.self_attn.v_proj.weight": "model-00002-of-00011.safetensors",
+    "model.layers.12.mlp.experts.down_proj_bias": "model-00002-of-00011.safetensors",
+    "model.layers.12.mlp.experts.down_proj_blocks": "model-00002-of-00011.safetensors",
+    "model.layers.12.mlp.experts.down_proj_scales": "model-00002-of-00011.safetensors",
+    "model.layers.12.mlp.experts.gate_up_proj_bias": "model-00002-of-00011.safetensors",
+    "model.layers.12.mlp.experts.gate_up_proj_blocks": "model-00002-of-00011.safetensors",
+    "model.layers.12.mlp.experts.gate_up_proj_scales": "model-00002-of-00011.safetensors",
+    "model.layers.12.mlp.router.bias": "model-00002-of-00011.safetensors",
+    "model.layers.12.mlp.router.weight": "model-00002-of-00011.safetensors",
+    "model.layers.12.post_attention_layernorm.weight": "model-00002-of-00011.safetensors",
+    "model.layers.13.input_layernorm.weight": "model-00002-of-00011.safetensors",
+    "model.layers.13.self_attn.k_proj.bias": "model-00002-of-00011.safetensors",
+    "model.layers.13.self_attn.k_proj.weight": "model-00002-of-00011.safetensors",
+    "model.layers.13.self_attn.o_proj.bias": "model-00002-of-00011.safetensors",
+    "model.layers.13.self_attn.o_proj.weight": "model-00002-of-00011.safetensors",
+    "model.layers.13.self_attn.q_proj.bias": "model-00002-of-00011.safetensors",
+    "model.layers.13.self_attn.q_proj.weight": "model-00002-of-00011.safetensors",
+    "model.layers.13.self_attn.sinks": "model-00002-of-00011.safetensors",
+    "model.layers.13.self_attn.v_proj.bias": "model-00002-of-00011.safetensors",
+    "model.layers.13.self_attn.v_proj.weight": "model-00002-of-00011.safetensors",
+    "model.layers.13.mlp.experts.down_proj_bias": "model-00002-of-00011.safetensors",
+    "model.layers.13.mlp.experts.down_proj_blocks": "model-00002-of-00011.safetensors",
+    "model.layers.13.mlp.experts.down_proj_scales": "model-00002-of-00011.safetensors",
+    "model.layers.13.mlp.experts.gate_up_proj_bias": "model-00002-of-00011.safetensors",
+    "model.layers.13.mlp.experts.gate_up_proj_blocks": "model-00002-of-00011.safetensors",
+    "model.layers.13.mlp.experts.gate_up_proj_scales": "model-00002-of-00011.safetensors",
+    "model.layers.13.mlp.router.bias": "model-00002-of-00011.safetensors",
+    "model.layers.13.mlp.router.weight": "model-00002-of-00011.safetensors",
+    "model.layers.13.post_attention_layernorm.weight": "model-00002-of-00011.safetensors",
+    "model.layers.14.input_layernorm.weight": "model-00002-of-00011.safetensors",
+    "model.layers.14.self_attn.k_proj.bias": "model-00002-of-00011.safetensors",
+    "model.layers.14.self_attn.k_proj.weight": "model-00002-of-00011.safetensors",
+    "model.layers.14.self_attn.o_proj.bias": "model-00002-of-00011.safetensors",
+    "model.layers.14.self_attn.o_proj.weight": "model-00002-of-00011.safetensors",
+    "model.layers.14.self_attn.q_proj.bias": "model-00002-of-00011.safetensors",
+    "model.layers.14.self_attn.q_proj.weight": "model-00002-of-00011.safetensors",
+    "model.layers.14.self_attn.sinks": "model-00002-of-00011.safetensors",
+    "model.layers.14.self_attn.v_proj.bias": "model-00002-of-00011.safetensors",
+    "model.layers.14.self_attn.v_proj.weight": "model-00002-of-00011.safetensors",
+    "model.layers.14.mlp.experts.down_proj_bias": "model-00002-of-00011.safetensors",
+    "model.layers.14.mlp.experts.down_proj_blocks": "model-00003-of-00011.safetensors",
+    "model.layers.14.mlp.experts.down_proj_scales": "model-00003-of-00011.safetensors",
+    "model.layers.14.mlp.experts.gate_up_proj_bias": "model-00003-of-00011.safetensors",
+    "model.layers.14.mlp.experts.gate_up_proj_blocks": "model-00003-of-00011.safetensors",
+    "model.layers.14.mlp.experts.gate_up_proj_scales": "model-00003-of-00011.safetensors",
+    "model.layers.14.mlp.router.bias": "model-00003-of-00011.safetensors",
+    "model.layers.14.mlp.router.weight": "model-00003-of-00011.safetensors",
+    "model.layers.14.post_attention_layernorm.weight": "model-00003-of-00011.safetensors",
+    "model.layers.15.input_layernorm.weight": "model-00003-of-00011.safetensors",
+    "model.layers.15.self_attn.k_proj.bias": "model-00003-of-00011.safetensors",
+    "model.layers.15.self_attn.k_proj.weight": "model-00003-of-00011.safetensors",
+    "model.layers.15.self_attn.o_proj.bias": "model-00003-of-00011.safetensors",
+    "model.layers.15.self_attn.o_proj.weight": "model-00003-of-00011.safetensors",
+    "model.layers.15.self_attn.q_proj.bias": "model-00003-of-00011.safetensors",
+    "model.layers.15.self_attn.q_proj.weight": "model-00003-of-00011.safetensors",
+    "model.layers.15.self_attn.sinks": "model-00003-of-00011.safetensors",
+    "model.layers.15.self_attn.v_proj.bias": "model-00003-of-00011.safetensors",
+    "model.layers.15.self_attn.v_proj.weight": "model-00003-of-00011.safetensors",
+    "model.layers.15.mlp.experts.down_proj_bias": "model-00003-of-00011.safetensors",
+    "model.layers.15.mlp.experts.down_proj_blocks": "model-00003-of-00011.safetensors",
+    "model.layers.15.mlp.experts.down_proj_scales": "model-00003-of-00011.safetensors",
+    "model.layers.15.mlp.experts.gate_up_proj_bias": "model-00003-of-00011.safetensors",
+    "model.layers.15.mlp.experts.gate_up_proj_blocks": "model-00003-of-00011.safetensors",
+    "model.layers.15.mlp.experts.gate_up_proj_scales": "model-00003-of-00011.safetensors",
+    "model.layers.15.mlp.router.bias": "model-00003-of-00011.safetensors",
+    "model.layers.15.mlp.router.weight": "model-00003-of-00011.safetensors",
+    "model.layers.15.post_attention_layernorm.weight": "model-00003-of-00011.safetensors",
+    "model.layers.16.input_layernorm.weight": "model-00003-of-00011.safetensors",
+    "model.layers.16.self_attn.k_proj.bias": "model-00003-of-00011.safetensors",
+    "model.layers.16.self_attn.k_proj.weight": "model-00003-of-00011.safetensors",
+    "model.layers.16.self_attn.o_proj.bias": "model-00003-of-00011.safetensors",
+    "model.layers.16.self_attn.o_proj.weight": "model-00003-of-00011.safetensors",
+    "model.layers.16.self_attn.q_proj.bias": "model-00003-of-00011.safetensors",
+    "model.layers.16.self_attn.q_proj.weight": "model-00003-of-00011.safetensors",
+    "model.layers.16.self_attn.sinks": "model-00003-of-00011.safetensors",
+    "model.layers.16.self_attn.v_proj.bias": "model-00003-of-00011.safetensors",
+    "model.layers.16.self_attn.v_proj.weight": "model-00003-of-00011.safetensors",
+    "model.layers.16.mlp.experts.down_proj_bias": "model-00003-of-00011.safetensors",
+    "model.layers.16.mlp.experts.down_proj_blocks": "model-00003-of-00011.safetensors",
+    "model.layers.16.mlp.experts.down_proj_scales": "model-00003-of-00011.safetensors",
+    "model.layers.16.mlp.experts.gate_up_proj_bias": "model-00003-of-00011.safetensors",
+    "model.layers.16.mlp.experts.gate_up_proj_blocks": "model-00003-of-00011.safetensors",
+    "model.layers.16.mlp.experts.gate_up_proj_scales": "model-00003-of-00011.safetensors",
+    "model.layers.16.mlp.router.bias": "model-00003-of-00011.safetensors",
+    "model.layers.16.mlp.router.weight": "model-00003-of-00011.safetensors",
+    "model.layers.16.post_attention_layernorm.weight": "model-00003-of-00011.safetensors",
+    "model.layers.17.input_layernorm.weight": "model-00003-of-00011.safetensors",
+    "model.layers.17.self_attn.k_proj.bias": "model-00003-of-00011.safetensors",
+    "model.layers.17.self_attn.k_proj.weight": "model-00003-of-00011.safetensors",
+    "model.layers.17.self_attn.o_proj.bias": "model-00003-of-00011.safetensors",
+    "model.layers.17.self_attn.o_proj.weight": "model-00003-of-00011.safetensors",
+    "model.layers.17.self_attn.q_proj.bias": "model-00003-of-00011.safetensors",
+    "model.layers.17.self_attn.q_proj.weight": "model-00003-of-00011.safetensors",
+    "model.layers.17.self_attn.sinks": "model-00003-of-00011.safetensors",
+    "model.layers.17.self_attn.v_proj.bias": "model-00003-of-00011.safetensors",
+    "model.layers.17.self_attn.v_proj.weight": "model-00003-of-00011.safetensors",
+    "model.layers.17.mlp.experts.down_proj_bias": "model-00003-of-00011.safetensors",
+    "model.layers.17.mlp.experts.down_proj_blocks": "model-00003-of-00011.safetensors",
+    "model.layers.17.mlp.experts.down_proj_scales": "model-00003-of-00011.safetensors",
+    "model.layers.17.mlp.experts.gate_up_proj_bias": "model-00003-of-00011.safetensors",
+    "model.layers.17.mlp.experts.gate_up_proj_blocks": "model-00004-of-00011.safetensors",
+    "model.layers.17.mlp.experts.gate_up_proj_scales": "model-00004-of-00011.safetensors",
+    "model.layers.17.mlp.router.bias": "model-00004-of-00011.safetensors",
+    "model.layers.17.mlp.router.weight": "model-00004-of-00011.safetensors",
+    "model.layers.17.post_attention_layernorm.weight": "model-00004-of-00011.safetensors",
+    "model.layers.18.input_layernorm.weight": "model-00004-of-00011.safetensors",
+    "model.layers.18.self_attn.k_proj.bias": "model-00004-of-00011.safetensors",
+    "model.layers.18.self_attn.k_proj.weight": "model-00004-of-00011.safetensors",
+    "model.layers.18.self_attn.o_proj.bias": "model-00004-of-00011.safetensors",
+    "model.layers.18.self_attn.o_proj.weight": "model-00004-of-00011.safetensors",
+    "model.layers.18.self_attn.q_proj.bias": "model-00004-of-00011.safetensors",
+    "model.layers.18.self_attn.q_proj.weight": "model-00004-of-00011.safetensors",
+    "model.layers.18.self_attn.sinks": "model-00004-of-00011.safetensors",
+    "model.layers.18.self_attn.v_proj.bias": "model-00004-of-00011.safetensors",
+    "model.layers.18.self_attn.v_proj.weight": "model-00004-of-00011.safetensors",
+    "model.layers.18.mlp.experts.down_proj_bias": "model-00004-of-00011.safetensors",
+    "model.layers.18.mlp.experts.down_proj_blocks": "model-00004-of-00011.safetensors",
+    "model.layers.18.mlp.experts.down_proj_scales": "model-00004-of-00011.safetensors",
+    "model.layers.18.mlp.experts.gate_up_proj_bias": "model-00004-of-00011.safetensors",
+    "model.layers.18.mlp.experts.gate_up_proj_blocks": "model-00004-of-00011.safetensors",
+    "model.layers.18.mlp.experts.gate_up_proj_scales": "model-00004-of-00011.safetensors",
+    "model.layers.18.mlp.router.bias": "model-00004-of-00011.safetensors",
+    "model.layers.18.mlp.router.weight": "model-00004-of-00011.safetensors",
+    "model.layers.18.post_attention_layernorm.weight": "model-00004-of-00011.safetensors",
+    "model.layers.19.input_layernorm.weight": "model-00004-of-00011.safetensors",
+    "model.layers.19.self_attn.k_proj.bias": "model-00004-of-00011.safetensors",
+    "model.layers.19.self_attn.k_proj.weight": "model-00004-of-00011.safetensors",
+    "model.layers.19.self_attn.o_proj.bias": "model-00004-of-00011.safetensors",
+    "model.layers.19.self_attn.o_proj.weight": "model-00004-of-00011.safetensors",
+    "model.layers.19.self_attn.q_proj.bias": "model-00004-of-00011.safetensors",
+    "model.layers.19.self_attn.q_proj.weight": "model-00004-of-00011.safetensors",
+    "model.layers.19.self_attn.sinks": "model-00004-of-00011.safetensors",
+    "model.layers.19.self_attn.v_proj.bias": "model-00004-of-00011.safetensors",
+    "model.layers.19.self_attn.v_proj.weight": "model-00004-of-00011.safetensors",
+    "model.layers.19.mlp.experts.down_proj_bias": "model-00004-of-00011.safetensors",
+    "model.layers.19.mlp.experts.down_proj_blocks": "model-00004-of-00011.safetensors",
+    "model.layers.19.mlp.experts.down_proj_scales": "model-00004-of-00011.safetensors",
+    "model.layers.19.mlp.experts.gate_up_proj_bias": "model-00004-of-00011.safetensors",
+    "model.layers.19.mlp.experts.gate_up_proj_blocks": "model-00004-of-00011.safetensors",
+    "model.layers.19.mlp.experts.gate_up_proj_scales": "model-00004-of-00011.safetensors",
+    "model.layers.19.mlp.router.bias": "model-00004-of-00011.safetensors",
+    "model.layers.19.mlp.router.weight": "model-00004-of-00011.safetensors",
+    "model.layers.19.post_attention_layernorm.weight": "model-00004-of-00011.safetensors",
+    "model.layers.1.input_layernorm.weight": "model-00004-of-00011.safetensors",
+    "model.layers.1.self_attn.k_proj.bias": "model-00004-of-00011.safetensors",
+    "model.layers.1.self_attn.k_proj.weight": "model-00004-of-00011.safetensors",
+    "model.layers.1.self_attn.o_proj.bias": "model-00004-of-00011.safetensors",
+    "model.layers.1.self_attn.o_proj.weight": "model-00004-of-00011.safetensors",
+    "model.layers.1.self_attn.q_proj.bias": "model-00004-of-00011.safetensors",
+    "model.layers.1.self_attn.q_proj.weight": "model-00004-of-00011.safetensors",
+    "model.layers.1.self_attn.sinks": "model-00004-of-00011.safetensors",
+    "model.layers.1.self_attn.v_proj.bias": "model-00004-of-00011.safetensors",
+    "model.layers.1.self_attn.v_proj.weight": "model-00004-of-00011.safetensors",
+    "model.layers.1.mlp.experts.down_proj_bias": "model-00004-of-00011.safetensors",
+    "model.layers.1.mlp.experts.down_proj_blocks": "model-00004-of-00011.safetensors",
+    "model.layers.1.mlp.experts.down_proj_scales": "model-00004-of-00011.safetensors",
+    "model.layers.1.mlp.experts.gate_up_proj_bias": "model-00004-of-00011.safetensors",
+    "model.layers.1.mlp.experts.gate_up_proj_blocks": "model-00004-of-00011.safetensors",
+    "model.layers.1.mlp.experts.gate_up_proj_scales": "model-00004-of-00011.safetensors",
+    "model.layers.1.mlp.router.bias": "model-00004-of-00011.safetensors",
+    "model.layers.1.mlp.router.weight": "model-00004-of-00011.safetensors",
+    "model.layers.1.post_attention_layernorm.weight": "model-00004-of-00011.safetensors",
+    "model.layers.20.input_layernorm.weight": "model-00004-of-00011.safetensors",
+    "model.layers.20.self_attn.k_proj.bias": "model-00004-of-00011.safetensors",
+    "model.layers.20.self_attn.k_proj.weight": "model-00004-of-00011.safetensors",
+    "model.layers.20.self_attn.o_proj.bias": "model-00004-of-00011.safetensors",
+    "model.layers.20.self_attn.o_proj.weight": "model-00004-of-00011.safetensors",
+    "model.layers.20.self_attn.q_proj.bias": "model-00004-of-00011.safetensors",
+    "model.layers.20.self_attn.q_proj.weight": "model-00005-of-00011.safetensors",
+    "model.layers.20.self_attn.sinks": "model-00005-of-00011.safetensors",
+    "model.layers.20.self_attn.v_proj.bias": "model-00005-of-00011.safetensors",
+    "model.layers.20.self_attn.v_proj.weight": "model-00005-of-00011.safetensors",
+    "model.layers.20.mlp.experts.down_proj_bias": "model-00005-of-00011.safetensors",
+    "model.layers.20.mlp.experts.down_proj_blocks": "model-00005-of-00011.safetensors",
+    "model.layers.20.mlp.experts.down_proj_scales": "model-00005-of-00011.safetensors",
+    "model.layers.20.mlp.experts.gate_up_proj_bias": "model-00005-of-00011.safetensors",
+    "model.layers.20.mlp.experts.gate_up_proj_blocks": "model-00005-of-00011.safetensors",
+    "model.layers.20.mlp.experts.gate_up_proj_scales": "model-00005-of-00011.safetensors",
+    "model.layers.20.mlp.router.bias": "model-00005-of-00011.safetensors",
+    "model.layers.20.mlp.router.weight": "model-00005-of-00011.safetensors",
+    "model.layers.20.post_attention_layernorm.weight": "model-00005-of-00011.safetensors",
+    "model.layers.21.input_layernorm.weight": "model-00005-of-00011.safetensors",
+    "model.layers.21.self_attn.k_proj.bias": "model-00005-of-00011.safetensors",
+    "model.layers.21.self_attn.k_proj.weight": "model-00005-of-00011.safetensors",
+    "model.layers.21.self_attn.o_proj.bias": "model-00005-of-00011.safetensors",
+    "model.layers.21.self_attn.o_proj.weight": "model-00005-of-00011.safetensors",
+    "model.layers.21.self_attn.q_proj.bias": "model-00005-of-00011.safetensors",
+    "model.layers.21.self_attn.q_proj.weight": "model-00005-of-00011.safetensors",
+    "model.layers.21.self_attn.sinks": "model-00005-of-00011.safetensors",
+    "model.layers.21.self_attn.v_proj.bias": "model-00005-of-00011.safetensors",
+    "model.layers.21.self_attn.v_proj.weight": "model-00005-of-00011.safetensors",
+    "model.layers.21.mlp.experts.down_proj_bias": "model-00005-of-00011.safetensors",
+    "model.layers.21.mlp.experts.down_proj_blocks": "model-00005-of-00011.safetensors",
+    "model.layers.21.mlp.experts.down_proj_scales": "model-00005-of-00011.safetensors",
+    "model.layers.21.mlp.experts.gate_up_proj_bias": "model-00005-of-00011.safetensors",
+    "model.layers.21.mlp.experts.gate_up_proj_blocks": "model-00005-of-00011.safetensors",
+    "model.layers.21.mlp.experts.gate_up_proj_scales": "model-00005-of-00011.safetensors",
+    "model.layers.21.mlp.router.bias": "model-00005-of-00011.safetensors",
+    "model.layers.21.mlp.router.weight": "model-00005-of-00011.safetensors",
+    "model.layers.21.post_attention_layernorm.weight": "model-00005-of-00011.safetensors",
+    "model.layers.22.input_layernorm.weight": "model-00005-of-00011.safetensors",
+    "model.layers.22.self_attn.k_proj.bias": "model-00005-of-00011.safetensors",
+    "model.layers.22.self_attn.k_proj.weight": "model-00005-of-00011.safetensors",
+    "model.layers.22.self_attn.o_proj.bias": "model-00005-of-00011.safetensors",
+    "model.layers.22.self_attn.o_proj.weight": "model-00005-of-00011.safetensors",
+    "model.layers.22.self_attn.q_proj.bias": "model-00005-of-00011.safetensors",
+    "model.layers.22.self_attn.q_proj.weight": "model-00005-of-00011.safetensors",
+    "model.layers.22.self_attn.sinks": "model-00005-of-00011.safetensors",
+    "model.layers.22.self_attn.v_proj.bias": "model-00005-of-00011.safetensors",
+    "model.layers.22.self_attn.v_proj.weight": "model-00005-of-00011.safetensors",
+    "model.layers.22.mlp.experts.down_proj_bias": "model-00005-of-00011.safetensors",
+    "model.layers.22.mlp.experts.down_proj_blocks": "model-00005-of-00011.safetensors",
+    "model.layers.22.mlp.experts.down_proj_scales": "model-00005-of-00011.safetensors",
+    "model.layers.22.mlp.experts.gate_up_proj_bias": "model-00005-of-00011.safetensors",
+    "model.layers.22.mlp.experts.gate_up_proj_blocks": "model-00005-of-00011.safetensors",
+    "model.layers.22.mlp.experts.gate_up_proj_scales": "model-00005-of-00011.safetensors",
+    "model.layers.22.mlp.router.bias": "model-00005-of-00011.safetensors",
+    "model.layers.22.mlp.router.weight": "model-00005-of-00011.safetensors",
+    "model.layers.22.post_attention_layernorm.weight": "model-00005-of-00011.safetensors",
+    "model.layers.23.input_layernorm.weight": "model-00005-of-00011.safetensors",
+    "model.layers.23.self_attn.k_proj.bias": "model-00005-of-00011.safetensors",
+    "model.layers.23.self_attn.k_proj.weight": "model-00005-of-00011.safetensors",
+    "model.layers.23.self_attn.o_proj.bias": "model-00005-of-00011.safetensors",
+    "model.layers.23.self_attn.o_proj.weight": "model-00005-of-00011.safetensors",
+    "model.layers.23.self_attn.q_proj.bias": "model-00005-of-00011.safetensors",
+    "model.layers.23.self_attn.q_proj.weight": "model-00005-of-00011.safetensors",
+    "model.layers.23.self_attn.sinks": "model-00005-of-00011.safetensors",
+    "model.layers.23.self_attn.v_proj.bias": "model-00005-of-00011.safetensors",
+    "model.layers.23.self_attn.v_proj.weight": "model-00005-of-00011.safetensors",
+    "model.layers.23.mlp.experts.down_proj_bias": "model-00005-of-00011.safetensors",
+    "model.layers.23.mlp.experts.down_proj_blocks": "model-00005-of-00011.safetensors",
+    "model.layers.23.mlp.experts.down_proj_scales": "model-00005-of-00011.safetensors",
+    "model.layers.23.mlp.experts.gate_up_proj_bias": "model-00005-of-00011.safetensors",
+    "model.layers.23.mlp.experts.gate_up_proj_blocks": "model-00005-of-00011.safetensors",
+    "model.layers.23.mlp.experts.gate_up_proj_scales": "model-00005-of-00011.safetensors",
+    "model.layers.23.mlp.router.bias": "model-00005-of-00011.safetensors",
+    "model.layers.23.mlp.router.weight": "model-00005-of-00011.safetensors",
+    "model.layers.23.post_attention_layernorm.weight": "model-00005-of-00011.safetensors",
+    "model.layers.24.input_layernorm.weight": "model-00005-of-00011.safetensors",
+    "model.layers.24.self_attn.k_proj.bias": "model-00005-of-00011.safetensors",
+    "model.layers.24.self_attn.k_proj.weight": "model-00005-of-00011.safetensors",
+    "model.layers.24.self_attn.o_proj.bias": "model-00005-of-00011.safetensors",
+    "model.layers.24.self_attn.o_proj.weight": "model-00005-of-00011.safetensors",
+    "model.layers.24.self_attn.q_proj.bias": "model-00005-of-00011.safetensors",
+    "model.layers.24.self_attn.q_proj.weight": "model-00005-of-00011.safetensors",
+    "model.layers.24.self_attn.sinks": "model-00005-of-00011.safetensors",
+    "model.layers.24.self_attn.v_proj.bias": "model-00005-of-00011.safetensors",
+    "model.layers.24.self_attn.v_proj.weight": "model-00005-of-00011.safetensors",
+    "model.layers.24.mlp.experts.down_proj_bias": "model-00005-of-00011.safetensors",
+    "model.layers.24.mlp.experts.down_proj_blocks": "model-00005-of-00011.safetensors",
+    "model.layers.24.mlp.experts.down_proj_scales": "model-00005-of-00011.safetensors",
+    "model.layers.24.mlp.experts.gate_up_proj_bias": "model-00005-of-00011.safetensors",
+    "model.layers.24.mlp.experts.gate_up_proj_blocks": "model-00006-of-00011.safetensors",
+    "model.layers.24.mlp.experts.gate_up_proj_scales": "model-00006-of-00011.safetensors",
+    "model.layers.24.mlp.router.bias": "model-00006-of-00011.safetensors",
+    "model.layers.24.mlp.router.weight": "model-00006-of-00011.safetensors",
+    "model.layers.24.post_attention_layernorm.weight": "model-00006-of-00011.safetensors",
+    "model.layers.25.input_layernorm.weight": "model-00006-of-00011.safetensors",
+    "model.layers.25.self_attn.k_proj.bias": "model-00006-of-00011.safetensors",
+    "model.layers.25.self_attn.k_proj.weight": "model-00006-of-00011.safetensors",
+    "model.layers.25.self_attn.o_proj.bias": "model-00006-of-00011.safetensors",
+    "model.layers.25.self_attn.o_proj.weight": "model-00006-of-00011.safetensors",
+    "model.layers.25.self_attn.q_proj.bias": "model-00006-of-00011.safetensors",
+    "model.layers.25.self_attn.q_proj.weight": "model-00006-of-00011.safetensors",
+    "model.layers.25.self_attn.sinks": "model-00006-of-00011.safetensors",
+    "model.layers.25.self_attn.v_proj.bias": "model-00006-of-00011.safetensors",
+    "model.layers.25.self_attn.v_proj.weight": "model-00006-of-00011.safetensors",
+    "model.layers.25.mlp.experts.down_proj_bias": "model-00006-of-00011.safetensors",
+    "model.layers.25.mlp.experts.down_proj_blocks": "model-00006-of-00011.safetensors",
+    "model.layers.25.mlp.experts.down_proj_scales": "model-00006-of-00011.safetensors",
+    "model.layers.25.mlp.experts.gate_up_proj_bias": "model-00006-of-00011.safetensors",
+    "model.layers.25.mlp.experts.gate_up_proj_blocks": "model-00006-of-00011.safetensors",
+    "model.layers.25.mlp.experts.gate_up_proj_scales": "model-00006-of-00011.safetensors",
+    "model.layers.25.mlp.router.bias": "model-00006-of-00011.safetensors",
+    "model.layers.25.mlp.router.weight": "model-00006-of-00011.safetensors",
+    "model.layers.25.post_attention_layernorm.weight": "model-00006-of-00011.safetensors",
+    "model.layers.26.input_layernorm.weight": "model-00006-of-00011.safetensors",
+    "model.layers.26.self_attn.k_proj.bias": "model-00006-of-00011.safetensors",
+    "model.layers.26.self_attn.k_proj.weight": "model-00006-of-00011.safetensors",
+    "model.layers.26.self_attn.o_proj.bias": "model-00006-of-00011.safetensors",
+    "model.layers.26.self_attn.o_proj.weight": "model-00006-of-00011.safetensors",
+    "model.layers.26.self_attn.q_proj.bias": "model-00006-of-00011.safetensors",
+    "model.layers.26.self_attn.q_proj.weight": "model-00006-of-00011.safetensors",
+    "model.layers.26.self_attn.sinks": "model-00006-of-00011.safetensors",
+    "model.layers.26.self_attn.v_proj.bias": "model-00006-of-00011.safetensors",
+    "model.layers.26.self_attn.v_proj.weight": "model-00006-of-00011.safetensors",
+    "model.layers.26.mlp.experts.down_proj_bias": "model-00006-of-00011.safetensors",
+    "model.layers.26.mlp.experts.down_proj_blocks": "model-00006-of-00011.safetensors",
+    "model.layers.26.mlp.experts.down_proj_scales": "model-00006-of-00011.safetensors",
+    "model.layers.26.mlp.experts.gate_up_proj_bias": "model-00006-of-00011.safetensors",
+    "model.layers.26.mlp.experts.gate_up_proj_blocks": "model-00006-of-00011.safetensors",
+    "model.layers.26.mlp.experts.gate_up_proj_scales": "model-00006-of-00011.safetensors",
+    "model.layers.26.mlp.router.bias": "model-00006-of-00011.safetensors",
+    "model.layers.26.mlp.router.weight": "model-00006-of-00011.safetensors",
+    "model.layers.26.post_attention_layernorm.weight": "model-00006-of-00011.safetensors",
+    "model.layers.27.input_layernorm.weight": "model-00006-of-00011.safetensors",
+    "model.layers.27.self_attn.k_proj.bias": "model-00006-of-00011.safetensors",
+    "model.layers.27.self_attn.k_proj.weight": "model-00006-of-00011.safetensors",
+    "model.layers.27.self_attn.o_proj.bias": "model-00006-of-00011.safetensors",
+    "model.layers.27.self_attn.o_proj.weight": "model-00006-of-00011.safetensors",
+    "model.layers.27.self_attn.q_proj.bias": "model-00006-of-00011.safetensors",
+    "model.layers.27.self_attn.q_proj.weight": "model-00006-of-00011.safetensors",
+    "model.layers.27.self_attn.sinks": "model-00006-of-00011.safetensors",
+    "model.layers.27.self_attn.v_proj.bias": "model-00006-of-00011.safetensors",
+    "model.layers.27.self_attn.v_proj.weight": "model-00006-of-00011.safetensors",
+    "model.layers.27.mlp.experts.down_proj_bias": "model-00006-of-00011.safetensors",
+    "model.layers.27.mlp.experts.down_proj_blocks": "model-00006-of-00011.safetensors",
+    "model.layers.27.mlp.experts.down_proj_scales": "model-00006-of-00011.safetensors",
+    "model.layers.27.mlp.experts.gate_up_proj_bias": "model-00006-of-00011.safetensors",
+    "model.layers.27.mlp.experts.gate_up_proj_blocks": "model-00006-of-00011.safetensors",
+    "model.layers.27.mlp.experts.gate_up_proj_scales": "model-00006-of-00011.safetensors",
+    "model.layers.27.mlp.router.bias": "model-00006-of-00011.safetensors",
+    "model.layers.27.mlp.router.weight": "model-00006-of-00011.safetensors",
+    "model.layers.27.post_attention_layernorm.weight": "model-00006-of-00011.safetensors",
+    "model.layers.28.input_layernorm.weight": "model-00006-of-00011.safetensors",
+    "model.layers.28.self_attn.k_proj.bias": "model-00006-of-00011.safetensors",
+    "model.layers.28.self_attn.k_proj.weight": "model-00006-of-00011.safetensors",
+    "model.layers.28.self_attn.o_proj.bias": "model-00006-of-00011.safetensors",
+    "model.layers.28.self_attn.o_proj.weight": "model-00006-of-00011.safetensors",
+    "model.layers.28.self_attn.q_proj.bias": "model-00006-of-00011.safetensors",
+    "model.layers.28.self_attn.q_proj.weight": "model-00006-of-00011.safetensors",
+    "model.layers.28.self_attn.sinks": "model-00006-of-00011.safetensors",
+    "model.layers.28.self_attn.v_proj.bias": "model-00006-of-00011.safetensors",
+    "model.layers.28.self_attn.v_proj.weight": "model-00006-of-00011.safetensors",
+    "model.layers.28.mlp.experts.down_proj_bias": "model-00006-of-00011.safetensors",
+    "model.layers.28.mlp.experts.down_proj_blocks": "model-00006-of-00011.safetensors",
+    "model.layers.28.mlp.experts.down_proj_scales": "model-00006-of-00011.safetensors",
+    "model.layers.28.mlp.experts.gate_up_proj_bias": "model-00006-of-00011.safetensors",
+    "model.layers.28.mlp.experts.gate_up_proj_blocks": "model-00006-of-00011.safetensors",
+    "model.layers.28.mlp.experts.gate_up_proj_scales": "model-00006-of-00011.safetensors",
+    "model.layers.28.mlp.router.bias": "model-00006-of-00011.safetensors",
+    "model.layers.28.mlp.router.weight": "model-00006-of-00011.safetensors",
+    "model.layers.28.post_attention_layernorm.weight": "model-00006-of-00011.safetensors",
+    "model.layers.29.input_layernorm.weight": "model-00006-of-00011.safetensors",
+    "model.layers.29.self_attn.k_proj.bias": "model-00006-of-00011.safetensors",
+    "model.layers.29.self_attn.k_proj.weight": "model-00006-of-00011.safetensors",
+    "model.layers.29.self_attn.o_proj.bias": "model-00006-of-00011.safetensors",
+    "model.layers.29.self_attn.o_proj.weight": "model-00006-of-00011.safetensors",
+    "model.layers.29.self_attn.q_proj.bias": "model-00006-of-00011.safetensors",
+    "model.layers.29.self_attn.q_proj.weight": "model-00006-of-00011.safetensors",
+    "model.layers.29.self_attn.sinks": "model-00006-of-00011.safetensors",
+    "model.layers.29.self_attn.v_proj.bias": "model-00006-of-00011.safetensors",
+    "model.layers.29.self_attn.v_proj.weight": "model-00006-of-00011.safetensors",
+    "model.layers.29.mlp.experts.down_proj_bias": "model-00006-of-00011.safetensors",
+    "model.layers.29.mlp.experts.down_proj_blocks": "model-00006-of-00011.safetensors",
+    "model.layers.29.mlp.experts.down_proj_scales": "model-00006-of-00011.safetensors",
+    "model.layers.29.mlp.experts.gate_up_proj_bias": "model-00006-of-00011.safetensors",
+    "model.layers.29.mlp.experts.gate_up_proj_blocks": "model-00007-of-00011.safetensors",
+    "model.layers.29.mlp.experts.gate_up_proj_scales": "model-00007-of-00011.safetensors",
+    "model.layers.29.mlp.router.bias": "model-00007-of-00011.safetensors",
+    "model.layers.29.mlp.router.weight": "model-00007-of-00011.safetensors",
+    "model.layers.29.post_attention_layernorm.weight": "model-00007-of-00011.safetensors",
+    "model.layers.2.input_layernorm.weight": "model-00007-of-00011.safetensors",
+    "model.layers.2.self_attn.k_proj.bias": "model-00007-of-00011.safetensors",
+    "model.layers.2.self_attn.k_proj.weight": "model-00007-of-00011.safetensors",
+    "model.layers.2.self_attn.o_proj.bias": "model-00007-of-00011.safetensors",
+    "model.layers.2.self_attn.o_proj.weight": "model-00007-of-00011.safetensors",
+    "model.layers.2.self_attn.q_proj.bias": "model-00007-of-00011.safetensors",
+    "model.layers.2.self_attn.q_proj.weight": "model-00007-of-00011.safetensors",
+    "model.layers.2.self_attn.sinks": "model-00007-of-00011.safetensors",
+    "model.layers.2.self_attn.v_proj.bias": "model-00007-of-00011.safetensors",
+    "model.layers.2.self_attn.v_proj.weight": "model-00007-of-00011.safetensors",
+    "model.layers.2.mlp.experts.down_proj_bias": "model-00007-of-00011.safetensors",
+    "model.layers.2.mlp.experts.down_proj_blocks": "model-00007-of-00011.safetensors",
+    "model.layers.2.mlp.experts.down_proj_scales": "model-00007-of-00011.safetensors",
+    "model.layers.2.mlp.experts.gate_up_proj_bias": "model-00007-of-00011.safetensors",
+    "model.layers.2.mlp.experts.gate_up_proj_blocks": "model-00007-of-00011.safetensors",
+    "model.layers.2.mlp.experts.gate_up_proj_scales": "model-00007-of-00011.safetensors",
+    "model.layers.2.mlp.router.bias": "model-00007-of-00011.safetensors",
+    "model.layers.2.mlp.router.weight": "model-00007-of-00011.safetensors",
+    "model.layers.2.post_attention_layernorm.weight": "model-00007-of-00011.safetensors",
+    "model.layers.30.input_layernorm.weight": "model-00007-of-00011.safetensors",
+    "model.layers.30.self_attn.k_proj.bias": "model-00007-of-00011.safetensors",
+    "model.layers.30.self_attn.k_proj.weight": "model-00007-of-00011.safetensors",
+    "model.layers.30.self_attn.o_proj.bias": "model-00007-of-00011.safetensors",
+    "model.layers.30.self_attn.o_proj.weight": "model-00007-of-00011.safetensors",
+    "model.layers.30.self_attn.q_proj.bias": "model-00007-of-00011.safetensors",
+    "model.layers.30.self_attn.q_proj.weight": "model-00007-of-00011.safetensors",
+    "model.layers.30.self_attn.sinks": "model-00007-of-00011.safetensors",
+    "model.layers.30.self_attn.v_proj.bias": "model-00007-of-00011.safetensors",
+    "model.layers.30.self_attn.v_proj.weight": "model-00007-of-00011.safetensors",
+    "model.layers.30.mlp.experts.down_proj_bias": "model-00007-of-00011.safetensors",
+    "model.layers.30.mlp.experts.down_proj_blocks": "model-00007-of-00011.safetensors",
+    "model.layers.30.mlp.experts.down_proj_scales": "model-00007-of-00011.safetensors",
+    "model.layers.30.mlp.experts.gate_up_proj_bias": "model-00007-of-00011.safetensors",
+    "model.layers.30.mlp.experts.gate_up_proj_blocks": "model-00007-of-00011.safetensors",
+    "model.layers.30.mlp.experts.gate_up_proj_scales": "model-00007-of-00011.safetensors",
+    "model.layers.30.mlp.router.bias": "model-00007-of-00011.safetensors",
+    "model.layers.30.mlp.router.weight": "model-00007-of-00011.safetensors",
+    "model.layers.30.post_attention_layernorm.weight": "model-00007-of-00011.safetensors",
+    "model.layers.31.input_layernorm.weight": "model-00007-of-00011.safetensors",
+    "model.layers.31.self_attn.k_proj.bias": "model-00007-of-00011.safetensors",
+    "model.layers.31.self_attn.k_proj.weight": "model-00007-of-00011.safetensors",
+    "model.layers.31.self_attn.o_proj.bias": "model-00007-of-00011.safetensors",
+    "model.layers.31.self_attn.o_proj.weight": "model-00007-of-00011.safetensors",
+    "model.layers.31.self_attn.q_proj.bias": "model-00007-of-00011.safetensors",
+    "model.layers.31.self_attn.q_proj.weight": "model-00007-of-00011.safetensors",
+    "model.layers.31.self_attn.sinks": "model-00007-of-00011.safetensors",
+    "model.layers.31.self_attn.v_proj.bias": "model-00007-of-00011.safetensors",
+    "model.layers.31.self_attn.v_proj.weight": "model-00007-of-00011.safetensors",
+    "model.layers.31.mlp.experts.down_proj_bias": "model-00007-of-00011.safetensors",
+    "model.layers.31.mlp.experts.down_proj_blocks": "model-00007-of-00011.safetensors",
+    "model.layers.31.mlp.experts.down_proj_scales": "model-00007-of-00011.safetensors",
+    "model.layers.31.mlp.experts.gate_up_proj_bias": "model-00007-of-00011.safetensors",
+    "model.layers.31.mlp.experts.gate_up_proj_blocks": "model-00007-of-00011.safetensors",
+    "model.layers.31.mlp.experts.gate_up_proj_scales": "model-00007-of-00011.safetensors",
+    "model.layers.31.mlp.router.bias": "model-00007-of-00011.safetensors",
+    "model.layers.31.mlp.router.weight": "model-00007-of-00011.safetensors",
+    "model.layers.31.post_attention_layernorm.weight": "model-00007-of-00011.safetensors",
+    "model.layers.32.input_layernorm.weight": "model-00007-of-00011.safetensors",
+    "model.layers.32.self_attn.k_proj.bias": "model-00007-of-00011.safetensors",
+    "model.layers.32.self_attn.k_proj.weight": "model-00007-of-00011.safetensors",
+    "model.layers.32.self_attn.o_proj.bias": "model-00007-of-00011.safetensors",
+    "model.layers.32.self_attn.o_proj.weight": "model-00007-of-00011.safetensors",
+    "model.layers.32.self_attn.q_proj.bias": "model-00007-of-00011.safetensors",
+    "model.layers.32.self_attn.q_proj.weight": "model-00007-of-00011.safetensors",
+    "model.layers.32.self_attn.sinks": "model-00007-of-00011.safetensors",
+    "model.layers.32.self_attn.v_proj.bias": "model-00007-of-00011.safetensors",
+    "model.layers.32.self_attn.v_proj.weight": "model-00007-of-00011.safetensors",
+    "model.layers.32.mlp.experts.down_proj_bias": "model-00007-of-00011.safetensors",
+    "model.layers.32.mlp.experts.down_proj_blocks": "model-00007-of-00011.safetensors",
+    "model.layers.32.mlp.experts.down_proj_scales": "model-00007-of-00011.safetensors",
+    "model.layers.32.mlp.experts.gate_up_proj_bias": "model-00007-of-00011.safetensors",
+    "model.layers.32.mlp.experts.gate_up_proj_blocks": "model-00007-of-00011.safetensors",
+    "model.layers.32.mlp.experts.gate_up_proj_scales": "model-00008-of-00011.safetensors",
+    "model.layers.32.mlp.router.bias": "model-00008-of-00011.safetensors",
+    "model.layers.32.mlp.router.weight": "model-00008-of-00011.safetensors",
+    "model.layers.32.post_attention_layernorm.weight": "model-00008-of-00011.safetensors",
+    "model.layers.33.input_layernorm.weight": "model-00008-of-00011.safetensors",
+    "model.layers.33.self_attn.k_proj.bias": "model-00008-of-00011.safetensors",
+    "model.layers.33.self_attn.k_proj.weight": "model-00008-of-00011.safetensors",
+    "model.layers.33.self_attn.o_proj.bias": "model-00008-of-00011.safetensors",
+    "model.layers.33.self_attn.o_proj.weight": "model-00008-of-00011.safetensors",
+    "model.layers.33.self_attn.q_proj.bias": "model-00008-of-00011.safetensors",
+    "model.layers.33.self_attn.q_proj.weight": "model-00008-of-00011.safetensors",
+    "model.layers.33.self_attn.sinks": "model-00008-of-00011.safetensors",
+    "model.layers.33.self_attn.v_proj.bias": "model-00008-of-00011.safetensors",
+    "model.layers.33.self_attn.v_proj.weight": "model-00008-of-00011.safetensors",
+    "model.layers.33.mlp.experts.down_proj_bias": "model-00008-of-00011.safetensors",
+    "model.layers.33.mlp.experts.down_proj_blocks": "model-00008-of-00011.safetensors",
+    "model.layers.33.mlp.experts.down_proj_scales": "model-00008-of-00011.safetensors",
+    "model.layers.33.mlp.experts.gate_up_proj_bias": "model-00008-of-00011.safetensors",
+    "model.layers.33.mlp.experts.gate_up_proj_blocks": "model-00008-of-00011.safetensors",
+    "model.layers.33.mlp.experts.gate_up_proj_scales": "model-00008-of-00011.safetensors",
+    "model.layers.33.mlp.router.bias": "model-00008-of-00011.safetensors",
+    "model.layers.33.mlp.router.weight": "model-00008-of-00011.safetensors",
+    "model.layers.33.post_attention_layernorm.weight": "model-00008-of-00011.safetensors",
+    "model.layers.34.input_layernorm.weight": "model-00008-of-00011.safetensors",
+    "model.layers.34.self_attn.k_proj.bias": "model-00008-of-00011.safetensors",
+    "model.layers.34.self_attn.k_proj.weight": "model-00008-of-00011.safetensors",
+    "model.layers.34.self_attn.o_proj.bias": "model-00008-of-00011.safetensors",
+    "model.layers.34.self_attn.o_proj.weight": "model-00008-of-00011.safetensors",
+    "model.layers.34.self_attn.q_proj.bias": "model-00008-of-00011.safetensors",
+    "model.layers.34.self_attn.q_proj.weight": "model-00008-of-00011.safetensors",
+    "model.layers.34.self_attn.sinks": "model-00008-of-00011.safetensors",
+    "model.layers.34.self_attn.v_proj.bias": "model-00008-of-00011.safetensors",
+    "model.layers.34.self_attn.v_proj.weight": "model-00008-of-00011.safetensors",
+    "model.layers.34.mlp.experts.down_proj_bias": "model-00008-of-00011.safetensors",
+    "model.layers.34.mlp.experts.down_proj_blocks": "model-00008-of-00011.safetensors",
+    "model.layers.34.mlp.experts.down_proj_scales": "model-00008-of-00011.safetensors",
+    "model.layers.34.mlp.experts.gate_up_proj_bias": "model-00008-of-00011.safetensors",
+    "model.layers.34.mlp.experts.gate_up_proj_blocks": "model-00008-of-00011.safetensors",
+    "model.layers.34.mlp.experts.gate_up_proj_scales": "model-00008-of-00011.safetensors",
+    "model.layers.34.mlp.router.bias": "model-00008-of-00011.safetensors",
+    "model.layers.34.mlp.router.weight": "model-00008-of-00011.safetensors",
+    "model.layers.34.post_attention_layernorm.weight": "model-00008-of-00011.safetensors",
+    "model.layers.35.input_layernorm.weight": "model-00008-of-00011.safetensors",
+    "model.layers.35.self_attn.k_proj.bias": "model-00008-of-00011.safetensors",
+    "model.layers.35.self_attn.k_proj.weight": "model-00008-of-00011.safetensors",
+    "model.layers.35.self_attn.o_proj.bias": "model-00008-of-00011.safetensors",
+    "model.layers.35.self_attn.o_proj.weight": "model-00008-of-00011.safetensors",
+    "model.layers.35.self_attn.q_proj.bias": "model-00008-of-00011.safetensors",
+    "model.layers.35.self_attn.q_proj.weight": "model-00008-of-00011.safetensors",
+    "model.layers.35.self_attn.sinks": "model-00008-of-00011.safetensors",
+    "model.layers.35.self_attn.v_proj.bias": "model-00008-of-00011.safetensors",
+    "model.layers.35.self_attn.v_proj.weight": "model-00008-of-00011.safetensors",
+    "model.layers.35.mlp.experts.down_proj_bias": "model-00008-of-00011.safetensors",
+    "model.layers.35.mlp.experts.down_proj_blocks": "model-00008-of-00011.safetensors",
+    "model.layers.35.mlp.experts.down_proj_scales": "model-00008-of-00011.safetensors",
+    "model.layers.35.mlp.experts.gate_up_proj_bias": "model-00008-of-00011.safetensors",
+    "model.layers.35.mlp.experts.gate_up_proj_blocks": "model-00008-of-00011.safetensors",
+    "model.layers.35.mlp.experts.gate_up_proj_scales": "model-00008-of-00011.safetensors",
+    "model.layers.35.mlp.router.bias": "model-00008-of-00011.safetensors",
+    "model.layers.35.mlp.router.weight": "model-00008-of-00011.safetensors",
+    "model.layers.35.post_attention_layernorm.weight": "model-00008-of-00011.safetensors",
+    "model.layers.3.input_layernorm.weight": "model-00008-of-00011.safetensors",
+    "model.layers.3.self_attn.k_proj.bias": "model-00008-of-00011.safetensors",
+    "model.layers.3.self_attn.k_proj.weight": "model-00008-of-00011.safetensors",
+    "model.layers.3.self_attn.o_proj.bias": "model-00008-of-00011.safetensors",
+    "model.layers.3.self_attn.o_proj.weight": "model-00008-of-00011.safetensors",
+    "model.layers.3.self_attn.q_proj.bias": "model-00008-of-00011.safetensors",
+    "model.layers.3.self_attn.q_proj.weight": "model-00008-of-00011.safetensors",
+    "model.layers.3.self_attn.sinks": "model-00008-of-00011.safetensors",
+    "model.layers.3.self_attn.v_proj.bias": "model-00008-of-00011.safetensors",
+    "model.layers.3.self_attn.v_proj.weight": "model-00008-of-00011.safetensors",
+    "model.layers.3.mlp.experts.down_proj_bias": "model-00008-of-00011.safetensors",
+    "model.layers.3.mlp.experts.down_proj_blocks": "model-00008-of-00011.safetensors",
+    "model.layers.3.mlp.experts.down_proj_scales": "model-00008-of-00011.safetensors",
+    "model.layers.3.mlp.experts.gate_up_proj_bias": "model-00008-of-00011.safetensors",
+    "model.layers.3.mlp.experts.gate_up_proj_blocks": "model-00008-of-00011.safetensors",
+    "model.layers.3.mlp.experts.gate_up_proj_scales": "model-00008-of-00011.safetensors",
+    "model.layers.3.mlp.router.bias": "model-00008-of-00011.safetensors",
+    "model.layers.3.mlp.router.weight": "model-00008-of-00011.safetensors",
+    "model.layers.3.post_attention_layernorm.weight": "model-00008-of-00011.safetensors",
+    "model.layers.4.input_layernorm.weight": "model-00008-of-00011.safetensors",
+    "model.layers.4.self_attn.k_proj.bias": "model-00008-of-00011.safetensors",
+    "model.layers.4.self_attn.k_proj.weight": "model-00008-of-00011.safetensors",
+    "model.layers.4.self_attn.o_proj.bias": "model-00008-of-00011.safetensors",
+    "model.layers.4.self_attn.o_proj.weight": "model-00008-of-00011.safetensors",
+    "model.layers.4.self_attn.q_proj.bias": "model-00008-of-00011.safetensors",
+    "model.layers.4.self_attn.q_proj.weight": "model-00008-of-00011.safetensors",
+    "model.layers.4.self_attn.sinks": "model-00008-of-00011.safetensors",
+    "model.layers.4.self_attn.v_proj.bias": "model-00008-of-00011.safetensors",
+    "model.layers.4.self_attn.v_proj.weight": "model-00008-of-00011.safetensors",
+    "model.layers.4.mlp.experts.down_proj_bias": "model-00008-of-00011.safetensors",
+    "model.layers.4.mlp.experts.down_proj_blocks": "model-00009-of-00011.safetensors",
+    "model.layers.4.mlp.experts.down_proj_scales": "model-00009-of-00011.safetensors",
+    "model.layers.4.mlp.experts.gate_up_proj_bias": "model-00009-of-00011.safetensors",
+    "model.layers.4.mlp.experts.gate_up_proj_blocks": "model-00009-of-00011.safetensors",
+    "model.layers.4.mlp.experts.gate_up_proj_scales": "model-00009-of-00011.safetensors",
+    "model.layers.4.mlp.router.bias": "model-00009-of-00011.safetensors",
+    "model.layers.4.mlp.router.weight": "model-00009-of-00011.safetensors",
+    "model.layers.4.post_attention_layernorm.weight": "model-00009-of-00011.safetensors",
+    "model.layers.5.input_layernorm.weight": "model-00009-of-00011.safetensors",
+    "model.layers.5.self_attn.k_proj.bias": "model-00009-of-00011.safetensors",
+    "model.layers.5.self_attn.k_proj.weight": "model-00009-of-00011.safetensors",
+    "model.layers.5.self_attn.o_proj.bias": "model-00009-of-00011.safetensors",
+    "model.layers.5.self_attn.o_proj.weight": "model-00009-of-00011.safetensors",
+    "model.layers.5.self_attn.q_proj.bias": "model-00009-of-00011.safetensors",
+    "model.layers.5.self_attn.q_proj.weight": "model-00009-of-00011.safetensors",
+    "model.layers.5.self_attn.sinks": "model-00009-of-00011.safetensors",
+    "model.layers.5.self_attn.v_proj.bias": "model-00009-of-00011.safetensors",
+    "model.layers.5.self_attn.v_proj.weight": "model-00009-of-00011.safetensors",
+    "model.layers.5.mlp.experts.down_proj_bias": "model-00009-of-00011.safetensors",
+    "model.layers.5.mlp.experts.down_proj_blocks": "model-00009-of-00011.safetensors",
+    "model.layers.5.mlp.experts.down_proj_scales": "model-00009-of-00011.safetensors",
+    "model.layers.5.mlp.experts.gate_up_proj_bias": "model-00009-of-00011.safetensors",
+    "model.layers.5.mlp.experts.gate_up_proj_blocks": "model-00009-of-00011.safetensors",
+    "model.layers.5.mlp.experts.gate_up_proj_scales": "model-00009-of-00011.safetensors",
+    "model.layers.5.mlp.router.bias": "model-00009-of-00011.safetensors",
+    "model.layers.5.mlp.router.weight": "model-00009-of-00011.safetensors",
+    "model.layers.5.post_attention_layernorm.weight": "model-00009-of-00011.safetensors",
+    "model.layers.6.input_layernorm.weight": "model-00009-of-00011.safetensors",
+    "model.layers.6.self_attn.k_proj.bias": "model-00009-of-00011.safetensors",
+    "model.layers.6.self_attn.k_proj.weight": "model-00009-of-00011.safetensors",
+    "model.layers.6.self_attn.o_proj.bias": "model-00009-of-00011.safetensors",
+    "model.layers.6.self_attn.o_proj.weight": "model-00009-of-00011.safetensors",
+    "model.layers.6.self_attn.q_proj.bias": "model-00009-of-00011.safetensors",
+    "model.layers.6.self_attn.q_proj.weight": "model-00009-of-00011.safetensors",
+    "model.layers.6.self_attn.sinks": "model-00009-of-00011.safetensors",
+    "model.layers.6.self_attn.v_proj.bias": "model-00009-of-00011.safetensors",
+    "model.layers.6.self_attn.v_proj.weight": "model-00009-of-00011.safetensors",
+    "model.layers.6.mlp.experts.down_proj_bias": "model-00009-of-00011.safetensors",
+    "model.layers.6.mlp.experts.down_proj_blocks": "model-00009-of-00011.safetensors",
+    "model.layers.6.mlp.experts.down_proj_scales": "model-00009-of-00011.safetensors",
+    "model.layers.6.mlp.experts.gate_up_proj_bias": "model-00009-of-00011.safetensors",
+    "model.layers.6.mlp.experts.gate_up_proj_blocks": "model-00010-of-00011.safetensors",
+    "model.layers.6.mlp.experts.gate_up_proj_scales": "model-00010-of-00011.safetensors",
+    "model.layers.6.mlp.router.bias": "model-00010-of-00011.safetensors",
+    "model.layers.6.mlp.router.weight": "model-00010-of-00011.safetensors",
+    "model.layers.6.post_attention_layernorm.weight": "model-00010-of-00011.safetensors",
+    "model.layers.7.input_layernorm.weight": "model-00010-of-00011.safetensors",
+    "model.layers.7.self_attn.k_proj.bias": "model-00010-of-00011.safetensors",
+    "model.layers.7.self_attn.k_proj.weight": "model-00010-of-00011.safetensors",
+    "model.layers.7.self_attn.o_proj.bias": "model-00010-of-00011.safetensors",
+    "model.layers.7.self_attn.o_proj.weight": "model-00010-of-00011.safetensors",
+    "model.layers.7.self_attn.q_proj.bias": "model-00010-of-00011.safetensors",
+    "model.layers.7.self_attn.q_proj.weight": "model-00010-of-00011.safetensors",
+    "model.layers.7.self_attn.sinks": "model-00010-of-00011.safetensors",
+    "model.layers.7.self_attn.v_proj.bias": "model-00010-of-00011.safetensors",
+    "model.layers.7.self_attn.v_proj.weight": "model-00010-of-00011.safetensors",
+    "model.layers.7.mlp.experts.down_proj_bias": "model-00010-of-00011.safetensors",
+    "model.layers.7.mlp.experts.down_proj_blocks": "model-00010-of-00011.safetensors",
+    "model.layers.7.mlp.experts.down_proj_scales": "model-00010-of-00011.safetensors",
+    "model.layers.7.mlp.experts.gate_up_proj_bias": "model-00010-of-00011.safetensors",
+    "model.layers.7.mlp.experts.gate_up_proj_blocks": "model-00010-of-00011.safetensors",
+    "model.layers.7.mlp.experts.gate_up_proj_scales": "model-00010-of-00011.safetensors",
+    "model.layers.7.mlp.router.bias": "model-00010-of-00011.safetensors",
+    "model.layers.7.mlp.router.weight": "model-00010-of-00011.safetensors",
+    "model.layers.7.post_attention_layernorm.weight": "model-00010-of-00011.safetensors",
+    "model.layers.8.input_layernorm.weight": "model-00010-of-00011.safetensors",
+    "model.layers.8.self_attn.k_proj.bias": "model-00010-of-00011.safetensors",
+    "model.layers.8.self_attn.k_proj.weight": "model-00010-of-00011.safetensors",
+    "model.layers.8.self_attn.o_proj.bias": "model-00010-of-00011.safetensors",
+    "model.layers.8.self_attn.o_proj.weight": "model-00010-of-00011.safetensors",
+    "model.layers.8.self_attn.q_proj.bias": "model-00010-of-00011.safetensors",
+    "model.layers.8.self_attn.q_proj.weight": "model-00010-of-00011.safetensors",
+    "model.layers.8.self_attn.sinks": "model-00010-of-00011.safetensors",
+    "model.layers.8.self_attn.v_proj.bias": "model-00010-of-00011.safetensors",
+    "model.layers.8.self_attn.v_proj.weight": "model-00010-of-00011.safetensors",
+    "model.layers.8.mlp.experts.down_proj_bias": "model-00010-of-00011.safetensors",
+    "model.layers.8.mlp.experts.down_proj_blocks": "model-00010-of-00011.safetensors",
+    "model.layers.8.mlp.experts.down_proj_scales": "model-00010-of-00011.safetensors",
+    "model.layers.8.mlp.experts.gate_up_proj_bias": "model-00010-of-00011.safetensors",
+    "model.layers.8.mlp.experts.gate_up_proj_blocks": "model-00010-of-00011.safetensors",
+    "model.layers.8.mlp.experts.gate_up_proj_scales": "model-00010-of-00011.safetensors",
+    "model.layers.8.mlp.router.bias": "model-00010-of-00011.safetensors",
+    "model.layers.8.mlp.router.weight": "model-00010-of-00011.safetensors",
+    "model.layers.8.post_attention_layernorm.weight": "model-00010-of-00011.safetensors",
+    "model.layers.9.input_layernorm.weight": "model-00010-of-00011.safetensors",
+    "model.layers.9.self_attn.k_proj.bias": "model-00010-of-00011.safetensors",
+    "model.layers.9.self_attn.k_proj.weight": "model-00010-of-00011.safetensors",
+    "model.layers.9.self_attn.o_proj.bias": "model-00010-of-00011.safetensors",
+    "model.layers.9.self_attn.o_proj.weight": "model-00010-of-00011.safetensors",
+    "model.layers.9.self_attn.q_proj.bias": "model-00010-of-00011.safetensors",
+    "model.layers.9.self_attn.q_proj.weight": "model-00010-of-00011.safetensors",
+    "model.layers.9.self_attn.sinks": "model-00010-of-00011.safetensors",
+    "model.layers.9.self_attn.v_proj.bias": "model-00010-of-00011.safetensors",
+    "model.layers.9.self_attn.v_proj.weight": "model-00010-of-00011.safetensors",
+    "model.layers.9.mlp.experts.down_proj_bias": "model-00010-of-00011.safetensors",
+    "model.layers.9.mlp.experts.down_proj_blocks": "model-00011-of-00011.safetensors",
+    "model.layers.9.mlp.experts.down_proj_scales": "model-00011-of-00011.safetensors",
+    "model.layers.9.mlp.experts.gate_up_proj_bias": "model-00011-of-00011.safetensors",
+    "model.layers.9.mlp.experts.gate_up_proj_blocks": "model-00011-of-00011.safetensors",
+    "model.layers.9.mlp.experts.gate_up_proj_scales": "model-00011-of-00011.safetensors",
+    "model.layers.9.mlp.router.bias": "model-00011-of-00011.safetensors",
+    "model.layers.9.mlp.router.weight": "model-00011-of-00011.safetensors",
+    "model.layers.9.post_attention_layernorm.weight": "model-00011-of-00011.safetensors",
+    "model.embed_tokens.weight": "model-00011-of-00011.safetensors",
+    "lm_head.weight": "model-00011-of-00011.safetensors",
+    "model.norm.weight": "model-00011-of-00011.safetensors",
+    "model.layers.0.self_attn.k_scale": "model-00001-of-00011.safetensors",
+    "model.layers.0.self_attn.v_scale": "model-00001-of-00011.safetensors",
+    "model.layers.1.self_attn.k_scale": "model-00004-of-00011.safetensors",
+    "model.layers.1.self_attn.v_scale": "model-00004-of-00011.safetensors",
+    "model.layers.10.self_attn.k_scale": "model-00001-of-00011.safetensors",
+    "model.layers.10.self_attn.v_scale": "model-00001-of-00011.safetensors",
+    "model.layers.11.self_attn.k_scale": "model-00001-of-00011.safetensors",
+    "model.layers.11.self_attn.v_scale": "model-00001-of-00011.safetensors",
+    "model.layers.12.self_attn.k_scale": "model-00002-of-00011.safetensors",
+    "model.layers.12.self_attn.v_scale": "model-00002-of-00011.safetensors",
+    "model.layers.13.self_attn.k_scale": "model-00002-of-00011.safetensors",
+    "model.layers.13.self_attn.v_scale": "model-00002-of-00011.safetensors",
+    "model.layers.14.self_attn.k_scale": "model-00002-of-00011.safetensors",
+    "model.layers.14.self_attn.v_scale": "model-00002-of-00011.safetensors",
+    "model.layers.15.self_attn.k_scale": "model-00003-of-00011.safetensors",
+    "model.layers.15.self_attn.v_scale": "model-00003-of-00011.safetensors",
+    "model.layers.16.self_attn.k_scale": "model-00003-of-00011.safetensors",
+    "model.layers.16.self_attn.v_scale": "model-00003-of-00011.safetensors",
+    "model.layers.17.self_attn.k_scale": "model-00003-of-00011.safetensors",
+    "model.layers.17.self_attn.v_scale": "model-00003-of-00011.safetensors",
+    "model.layers.18.self_attn.k_scale": "model-00004-of-00011.safetensors",
+    "model.layers.18.self_attn.v_scale": "model-00004-of-00011.safetensors",
+    "model.layers.19.self_attn.k_scale": "model-00004-of-00011.safetensors",
+    "model.layers.19.self_attn.v_scale": "model-00004-of-00011.safetensors",
+    "model.layers.2.self_attn.k_scale": "model-00007-of-00011.safetensors",
+    "model.layers.2.self_attn.v_scale": "model-00007-of-00011.safetensors",
+    "model.layers.20.self_attn.k_scale": "model-00004-of-00011.safetensors",
+    "model.layers.20.self_attn.v_scale": "model-00004-of-00011.safetensors",
+    "model.layers.21.self_attn.k_scale": "model-00005-of-00011.safetensors",
+    "model.layers.21.self_attn.v_scale": "model-00005-of-00011.safetensors",
+    "model.layers.22.self_attn.k_scale": "model-00005-of-00011.safetensors",
+    "model.layers.22.self_attn.v_scale": "model-00005-of-00011.safetensors",
+    "model.layers.23.self_attn.k_scale": "model-00005-of-00011.safetensors",
+    "model.layers.23.self_attn.v_scale": "model-00005-of-00011.safetensors",
+    "model.layers.24.self_attn.k_scale": "model-00005-of-00011.safetensors",
+    "model.layers.24.self_attn.v_scale": "model-00005-of-00011.safetensors",
+    "model.layers.25.self_attn.k_scale": "model-00006-of-00011.safetensors",
+    "model.layers.25.self_attn.v_scale": "model-00006-of-00011.safetensors",
+    "model.layers.26.self_attn.k_scale": "model-00006-of-00011.safetensors",
+    "model.layers.26.self_attn.v_scale": "model-00006-of-00011.safetensors",
+    "model.layers.27.self_attn.k_scale": "model-00006-of-00011.safetensors",
+    "model.layers.27.self_attn.v_scale": "model-00006-of-00011.safetensors",
+    "model.layers.28.self_attn.k_scale": "model-00006-of-00011.safetensors",
+    "model.layers.28.self_attn.v_scale": "model-00006-of-00011.safetensors",
+    "model.layers.29.self_attn.k_scale": "model-00006-of-00011.safetensors",
+    "model.layers.29.self_attn.v_scale": "model-00006-of-00011.safetensors",
+    "model.layers.3.self_attn.k_scale": "model-00008-of-00011.safetensors",
+    "model.layers.3.self_attn.v_scale": "model-00008-of-00011.safetensors",
+    "model.layers.30.self_attn.k_scale": "model-00007-of-00011.safetensors",
+    "model.layers.30.self_attn.v_scale": "model-00007-of-00011.safetensors",
+    "model.layers.31.self_attn.k_scale": "model-00007-of-00011.safetensors",
+    "model.layers.31.self_attn.v_scale": "model-00007-of-00011.safetensors",
+    "model.layers.32.self_attn.k_scale": "model-00007-of-00011.safetensors",
+    "model.layers.32.self_attn.v_scale": "model-00007-of-00011.safetensors",
+    "model.layers.33.self_attn.k_scale": "model-00008-of-00011.safetensors",
+    "model.layers.33.self_attn.v_scale": "model-00008-of-00011.safetensors",
+    "model.layers.34.self_attn.k_scale": "model-00008-of-00011.safetensors",
+    "model.layers.34.self_attn.v_scale": "model-00008-of-00011.safetensors",
+    "model.layers.35.self_attn.k_scale": "model-00008-of-00011.safetensors",
+    "model.layers.35.self_attn.v_scale": "model-00008-of-00011.safetensors",
+    "model.layers.4.self_attn.k_scale": "model-00008-of-00011.safetensors",
+    "model.layers.4.self_attn.v_scale": "model-00008-of-00011.safetensors",
+    "model.layers.5.self_attn.k_scale": "model-00009-of-00011.safetensors",
+    "model.layers.5.self_attn.v_scale": "model-00009-of-00011.safetensors",
+    "model.layers.6.self_attn.k_scale": "model-00009-of-00011.safetensors",
+    "model.layers.6.self_attn.v_scale": "model-00009-of-00011.safetensors",
+    "model.layers.7.self_attn.k_scale": "model-00010-of-00011.safetensors",
+    "model.layers.7.self_attn.v_scale": "model-00010-of-00011.safetensors",
+    "model.layers.8.self_attn.k_scale": "model-00010-of-00011.safetensors",
+    "model.layers.8.self_attn.v_scale": "model-00010-of-00011.safetensors",
+    "model.layers.9.self_attn.k_scale": "model-00010-of-00011.safetensors",
+    "model.layers.9.self_attn.v_scale": "model-00010-of-00011.safetensors"
+  }
+}
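
The index above maps each tensor name to the shard file that stores it, and one layer can straddle two shards (layer 4's `down_proj_bias` is in shard 8 while its `down_proj_blocks` is in shard 9). A minimal sketch of grouping tensors by shard follows. It assumes the usual top-level `weight_map` key, which is not visible in this excerpt, and `tensors_by_shard` is a hypothetical helper name, not part of any library.

```python
import json
from collections import defaultdict

# A few entries copied from the index above; the real file has many more.
# The "weight_map" key is an assumption about the part of the file not shown here.
index_text = """
{
  "weight_map": {
    "model.layers.4.mlp.experts.down_proj_bias": "model-00008-of-00011.safetensors",
    "model.layers.4.mlp.experts.down_proj_blocks": "model-00009-of-00011.safetensors",
    "model.embed_tokens.weight": "model-00011-of-00011.safetensors",
    "lm_head.weight": "model-00011-of-00011.safetensors"
  }
}
"""


def tensors_by_shard(index: dict) -> dict[str, list[str]]:
    """Group tensor names by the shard file that holds them (hypothetical helper)."""
    shards = defaultdict(list)
    for tensor_name, shard_file in index["weight_map"].items():
        shards[shard_file].append(tensor_name)
    return dict(shards)


shards = tensors_by_shard(json.loads(index_text))
for shard_file in sorted(shards):
    print(shard_file, len(shards[shard_file]))
```

Grouping this way shows which shard files are needed to materialize a given layer without opening any of the weight files.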

modeling_gpt_oss_puzzle.py ADDED

	@@ -0,0 +1,260 @@

+from typing import Any, Iterable, Optional, Union
+from dataclasses import dataclass
+import functools
+import inspect
+from .configuration_gpt_oss_puzzle import GptOssPuzzleConfig
+import torch
+from transformers.cache_utils import Cache, DynamicCache, DynamicLayer, DynamicSlidingWindowLayer
+from transformers.integrations import mxfp4
+from transformers.integrations.mxfp4 import Mxfp4GptOssExperts
+from transformers.masking_utils import create_sliding_window_causal_mask
+from transformers.models.gpt_oss import modeling_gpt_oss
+from transformers.models.gpt_oss.modeling_gpt_oss import GptOssDecoderLayer, GptOssForCausalLM
+@dataclass
+class SlidingWindowCausalMaskPlaceholder:
+    kwargs: dict[str, Any]
+class GptOssPuzzleDecoderLayer(GptOssDecoderLayer):
+    """
+    Extends GptOssDecoderLayer to support per-layer configs.
+    """
+    def __init__(self, config: GptOssPuzzleConfig, layer_idx: int):
+        layer_config = config.get_gpt_oss_config_for_layer(layer_idx)
+        super().__init__(layer_config, layer_idx)
+        self.config = layer_config
+        self.layer_idx = layer_idx
+    def forward(self, *args, **kwargs):
+        if "attention_mask" in kwargs and isinstance(kwargs["attention_mask"], SlidingWindowCausalMaskPlaceholder):
+            mask_kwargs = dict(kwargs["attention_mask"].kwargs)
+            mask_kwargs["config"] = self.config
+            if mask_kwargs["past_key_values"] is not None:
+                mask_kwargs["past_key_values"] = CacheViewForSlidingWindowMask(
+                    mask_kwargs["past_key_values"], self.layer_idx
+                )
+            kwargs["attention_mask"] = create_sliding_window_causal_mask(**mask_kwargs)
+        return super().forward(*args, **kwargs)
+class CacheViewForSlidingWindowMask:
+    """
+    A view wrapper around a Cache that makes `create_sliding_window_causal_mask` use the correct layer index.
+    `create_sliding_window_causal_mask` iterates over `past_key_values.is_sliding` to determine which layer
+    to use for deriving mask sizes, effectively using the first layer's index. Since gpt-oss-puzzle has
+    heterogeneous sliding window sizes across layers, we need to ensure each layer uses its own sliding
+    window size. This view returns an `is_sliding` list that only marks the current layer as sliding,
+    causing `create_sliding_window_causal_mask` to use the correct layer index for mask computation.
+    """
+    def __init__(self, cache: Cache, layer_idx: int):
+        self._cache = cache
+        self._layer_idx = layer_idx
+    @property
+    def is_sliding(self) -> list[bool]:
+        return [False] * self._layer_idx + [True]
+    def __getattr__(self, name: str):
+        return getattr(self._cache, name)
+class Mxfp4GptOssPuzzleExperts(Mxfp4GptOssExperts):
+    def __init__(self, config: GptOssPuzzleConfig):
+        """
+        Extends Mxfp4GptOssExperts to support per-layer configs.
+        Since this class is created without passing the layer index, we need to infer it from the call stack.
+        """
+        # module_name is of the form *.{layer_idx}.mlp.experts
+        current_key_name = _get_variable_from_stack(["current_key_name"])
+        if current_key_name is None:
+            module_name = _get_variable_from_stack(["module_name"])
+            if module_name is None:
+                raise RuntimeError("`current_key_name`/`module_name` variable not found in caller stack")
+            layer_idx = int(module_name.split(".")[-3])
+        else:
+            layer_idx = int(current_key_name[-3])
+        layer_config = config.get_gpt_oss_config_for_layer(layer_idx)
+        super().__init__(layer_config)
+def _get_variable_from_stack(names: list[str]) -> Optional[Any]:
+    f = inspect.currentframe().f_back
+    while f:
+        for name in names:
+            if name in f.f_locals:
+                return f.f_locals[name]
+        f = f.f_back
+    return None
+class PuzzleDynamicCache(DynamicCache):
+    """
+    A child class of DynamicCache that supports heterogeneous layer configurations.
+    __init__ matches DynamicCache.__init__, except that the sliding window size is read per layer from `block_configs`.
+    """
+    def __init__(
+        self,
+        ddp_cache_data: Optional[Iterable[tuple[torch.Tensor, torch.Tensor]]] = None,
+        config: Optional[GptOssPuzzleConfig] = None,
+        offloading: bool = False,
+        offload_only_non_sliding: bool = False,
+    ):
+        layers = []
+        # If a config is passed, use it to infer the layer types and initialize accordingly
+        if config is not None:
+            decoder_config = config.get_text_config(decoder=True)
+            layer_types = getattr(decoder_config, "layer_types", None)
+            if layer_types is None:
+                layer_types = []
+                for layer_idx in range(decoder_config.num_hidden_layers):
+                    sliding_window = None
+                    for attr_name in ("sliding_window", "attention_chunk_size"):
+                        sliding_window = getattr(
+                            config.block_configs[layer_idx],
+                            attr_name,
+                            getattr(decoder_config, attr_name, None),
+                        )
+                        if sliding_window is not None:
+                            break
+                    layer_types.append("sliding_attention" if sliding_window is not None else "full_attention")
+            # Some models have shared layers thus no cache is needed for them (e.g. Gemma3n)
+            if hasattr(decoder_config, "num_kv_shared_layers"):
+                layer_types = layer_types[: -decoder_config.num_kv_shared_layers]
+            for layer_idx, layer_type in enumerate(layer_types):
+                # From a cache point of view, both sliding and chunked are the same in how they should behave and how many
+                # states they should return - only the mask changes to make them different at the end!
+                if layer_type in ("sliding_attention", "chunked_attention"):
+                    sliding_window = None
+                    for attr_name in ("sliding_window", "attention_chunk_size"):
+                        sliding_window = getattr(
+                            decoder_config.block_configs[layer_idx],
+                            attr_name,
+                            getattr(decoder_config, attr_name, None),
+                        )
+                        if sliding_window is not None:
+                            break
+                    layers.append(DynamicSlidingWindowLayer(sliding_window=sliding_window))
+                else:
+                    layers.append(DynamicLayer())
+        # In this case, use the passed data to already fill in the Cache
+        if ddp_cache_data is not None:
+            # Init all the layers with the data
+            for layer_idx, (key_states, value_states) in enumerate(ddp_cache_data):
+                # If the config was not passed above, initialize a DynamicLayer for each entry of ddp_cache_data
+                if config is None:
+                    layers.append(DynamicLayer())
+                # Update the layer with the data
+                _, _ = layers[layer_idx].update(key_states, value_states)
+        # If neither config nor ddp_cache_data was passed, lazily initialize a full cache of DynamicLayer
+        if len(layers) == 0:
+            super(DynamicCache, self).__init__(
+                layer_class_to_replicate=DynamicLayer,
+                offloading=offloading,
+                offload_only_non_sliding=offload_only_non_sliding,
+            )
+        else:
+            super(DynamicCache, self).__init__(
+                layers=layers, offloading=offloading, offload_only_non_sliding=offload_only_non_sliding
+            )
+original_load_balancing_loss_func = modeling_gpt_oss.load_balancing_loss_func
+def load_balancing_loss_func(
+    gate_logits: Union[torch.Tensor, tuple[torch.Tensor], None],
+    num_experts: Optional[int] = None,
+    top_k=2,
+    attention_mask: Optional[torch.Tensor] = None,
+    num_experts_per_layer: Optional[tuple[int, ...]] = None,
+) -> Union[torch.Tensor, int]:
+    if gate_logits is None or not isinstance(gate_logits, tuple):
+        return 0
+    compute_device = gate_logits[0].device
+    overall_loss = 0
+    for layer_idx, layer_gate_logits in enumerate(gate_logits):
+        layer_loss = original_load_balancing_loss_func(
+            gate_logits=(layer_gate_logits,),
+            num_experts=num_experts_per_layer[layer_idx],
+            top_k=top_k,
+            attention_mask=attention_mask,
+        )
+        overall_loss += layer_loss.to(compute_device)
+    return overall_loss
+class GptOssPuzzleForCausalLM(GptOssForCausalLM):
+    """
+    A child class of GptOssForCausalLM to support heterogeneous layer configurations.
+    This class uses monkey-patching to inject custom behavior into the parent class while maximizing
+    code reuse and minimizing duplication. During `__init__`, it temporarily replaces the decoder layer
+    class to use `GptOssPuzzleDecoderLayer`. During `forward`, it patches mask creation, cache handling,
+    and load balancing loss computation to account for per-layer variations.
+    """
+    config_class = GptOssPuzzleConfig
+    _no_split_modules = ["GptOssPuzzleDecoderLayer"]
+    _keys_to_ignore_on_load_unexpected = [r"\.k_scale$", r"\.v_scale$"]
+    def __init__(self, config):
+        # "PER_BLOCK_ATTRIBUTE" is a placeholder that must never be used as a real value; it is set only because GptOssForCausalLM.__init__ reads this attribute.
+        config.num_local_experts = "PER_BLOCK_ATTRIBUTE"
+        original_decoder_layer_cls = modeling_gpt_oss.GptOssDecoderLayer
+        modeling_gpt_oss.GptOssDecoderLayer = GptOssPuzzleDecoderLayer
+        try:
+            super().__init__(config)
+            self.config = config  # Used for load_balancing_loss_func
+        finally:
+            modeling_gpt_oss.GptOssDecoderLayer = original_decoder_layer_cls
+        mxfp4.Mxfp4GptOssExperts = Mxfp4GptOssPuzzleExperts  # Used after the model is initialized
+    def forward(self, *args, **kwargs):
+        original_create_sliding_window_causal_mask = modeling_gpt_oss.create_sliding_window_causal_mask
+        original_dynamic_cache = modeling_gpt_oss.DynamicCache
+        modeling_gpt_oss.load_balancing_loss_func = functools.partial(
+            load_balancing_loss_func,
+            num_experts_per_layer=tuple(block_config.num_local_experts for block_config in self.config.block_configs),
+        )
+        modeling_gpt_oss.create_sliding_window_causal_mask = lambda **kwargs: SlidingWindowCausalMaskPlaceholder(
+            kwargs=kwargs
+        )
+        modeling_gpt_oss.DynamicCache = PuzzleDynamicCache
+        try:
+            return super().forward(*args, **kwargs)
+        finally:
+            modeling_gpt_oss.create_sliding_window_causal_mask = original_create_sliding_window_causal_mask
+            modeling_gpt_oss.load_balancing_loss_func = original_load_balancing_loss_func
+            modeling_gpt_oss.DynamicCache = original_dynamic_cache
+    def _prepare_cache_for_generation(self, *args, **kwargs):
+        from transformers.generation import utils as generation_utils
+        original_dynamic_cache = generation_utils.DynamicCache
+        generation_utils.DynamicCache = PuzzleDynamicCache
+        try:
+            return super()._prepare_cache_for_generation(*args, **kwargs)
+        finally:
+            generation_utils.DynamicCache = original_dynamic_cache
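
The `CacheViewForSlidingWindowMask` docstring describes a delegating view that overrides a single attribute so a helper picks the intended layer. The sketch below illustrates that idea with toy classes only. `ToyCache`, `LayerView`, and `first_sliding_window` are hypothetical names, and the helper's "use the first layer flagged as sliding" rule is an assumption modeled on the docstring, not the actual `transformers` implementation.

```python
class ToyCache:
    """Hypothetical stand-in for a cache whose layers have different window sizes."""

    def __init__(self, windows):
        self.windows = windows  # sliding window size per layer
        self.is_sliding = [w is not None for w in windows]

    def window_for(self, layer_idx):
        return self.windows[layer_idx]


class LayerView:
    """Delegating view that reports only one layer as sliding."""

    def __init__(self, cache, layer_idx):
        self._cache = cache
        self._layer_idx = layer_idx

    @property
    def is_sliding(self):
        # Every earlier layer is reported as non-sliding, so the first
        # True entry sits exactly at this view's layer index.
        return [False] * self._layer_idx + [True]

    def __getattr__(self, name):
        # Called only when normal lookup fails, so `is_sliding` above wins
        # and everything else is forwarded to the wrapped cache.
        return getattr(self._cache, name)


def first_sliding_window(cache):
    """Hypothetical helper: reads the window of the first layer flagged as sliding."""
    layer_idx = cache.is_sliding.index(True)
    return cache.window_for(layer_idx)


cache = ToyCache(windows=[128, 64, 32])
print(first_sliding_window(cache))                # 128: always the first layer
print(first_sliding_window(LayerView(cache, 2)))  # 32: the view redirects to layer 2
```

The view leaves the underlying cache untouched, so each layer can wrap the same cache with its own index.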

privacy.md ADDED

	@@ -0,0 +1,12 @@

+# **Privacy**
+Field | Response
+:--- | :---
+Generatable or reverse engineerable personal data? | No
+Personal data used to create this model? | No
+How often is the dataset reviewed? | Before Release
+Was data from user interactions with the AI model (e.g. user input and prompts) used to train the model? | No
+Is there provenance for all datasets used in training? | Yes
+Does data labeling (annotation, metadata) comply with privacy laws? | Yes
+Is data compliant with data subject requests for data correction or removal, if such a request was made? | No, not possible with externally-sourced data
+Applicable Privacy Policy | https://www.nvidia.com/en-us/about-nvidia/privacy-policy/

safety.md ADDED

	@@ -0,0 +1,6 @@

+| Field | Response |
+| :---- | :---- |
+| Model Application Field(s): | Chat, Instruction Following, Chatbot Development, Code Generation, Reasoning, Customer Service |
+| Describe the life critical impact (if present). | Not Applicable |
+| Use Case Restrictions: | Abide by the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license). |
+| Model and dataset restrictions: | The Principle of Least Privilege (PoLP) is applied, limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints are adhered to. |

special_tokens_map.json ADDED

	@@ -0,0 +1,23 @@

+{
+  "bos_token": {
+    "content": "<|startoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "<|return|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:0614fe83cadab421296e664e1f48f4261fa8fef6e03e63bb75c20f38e37d07d3
+size 27868174
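The three lines above are a Git LFS pointer, not the tokenizer itself: they record the spec version, the SHA-256 object ID, and the byte size of the real `tokenizer.json`. A minimal sketch of reading such a pointer is shown below. The function name `parse_lfs_pointer` is hypothetical, and the sketch assumes the simple `key value` line layout seen in this hunk.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into its version, hash algorithm, digest, and size."""
    fields = {}
    for line in text.strip().splitlines():
        # Each pointer line is "<key> <value>", split on the first space.
        key, _, value = line.partition(" ")
        fields[key] = value
    hash_algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:0614fe83cadab421296e664e1f48f4261fa8fef6e03e63bb75c20f38e37d07d3
size 27868174
"""

pointer = parse_lfs_pointer(pointer_text)
print(pointer["hash_algo"], pointer["size"])
```

After downloading the real file, its SHA-256 digest and byte length can be compared against `pointer["digest"]` and `pointer["size"]` to confirm the download is intact.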

tokenizer_config.json ADDED

	@@ -0,0 +1,183 @@

+{
+  "added_tokens_decoder": {
+    "199998": {
+      "content": "<|startoftext|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "199999": {
+      "content": "<|endoftext|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200000": {
+      "content": "<|reserved_200000|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200001": {
+      "content": "<|reserved_200001|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200002": {
+      "content": "<|return|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200003": {
+      "content": "<|constrain|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200004": {
+      "content": "<|reserved_200004|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200005": {
+      "content": "<|channel|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200006": {
+      "content": "<|start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200007": {
+      "content": "<|end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200008": {
+      "content": "<|message|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200009": {
+      "content": "<|reserved_200009|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200010": {
+      "content": "<|reserved_200010|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200011": {
+      "content": "<|reserved_200011|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200012": {
+      "content": "<|call|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200013": {
+      "content": "<|reserved_200013|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200014": {
+      "content": "<|reserved_200014|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200015": {
+      "content": "<|reserved_200015|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200016": {
+      "content": "<|reserved_200016|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200017": {
+      "content": "<|reserved_200017|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "200018": {
+      "content": "<|endofprompt|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "bos_token": "<|startoftext|>",
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|return|>",
+  "extra_special_tokens": {},
+  "model_input_names": [
+    "input_ids",
+    "attention_mask"
+  ],
+  "model_max_length": 1000000000000000019884624838656,
+  "pad_token": "<|endoftext|>",
+  "tokenizer_class": "PreTrainedTokenizerFast"
+}
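The config above names its `bos_token`, `eos_token`, and `pad_token` by content, while `added_tokens_decoder` maps token IDs to content. A minimal sketch of cross-checking the two is shown below, using only the standard library and an abridged excerpt of the JSON from this hunk (only the three relevant decoder entries are copied, with the fields not needed here omitted). It also checks that `model_max_length` equals `int(1e30)`, which, as far as I know, is the value Transformers writes when no maximum length is configured; treat that interpretation as an assumption.

```python
import json

# Abridged excerpt of tokenizer_config.json from the hunk above.
config_excerpt = json.loads("""
{
  "added_tokens_decoder": {
    "199998": {"content": "<|startoftext|>", "special": true},
    "199999": {"content": "<|endoftext|>", "special": true},
    "200002": {"content": "<|return|>", "special": true}
  },
  "bos_token": "<|startoftext|>",
  "eos_token": "<|return|>",
  "pad_token": "<|endoftext|>",
  "model_max_length": 1000000000000000019884624838656
}
""")

# Invert the decoder table: token content -> integer ID.
content_to_id = {
    entry["content"]: int(token_id)
    for token_id, entry in config_excerpt["added_tokens_decoder"].items()
}

# Resolve each named special token to its ID; a KeyError would mean a mismatch.
named_ids = {
    role: content_to_id[config_excerpt[role]]
    for role in ("bos_token", "eos_token", "pad_token")
}

# Assumption: int(1e30) acts as a "no limit configured" sentinel.
no_length_limit = config_excerpt["model_max_length"] == int(1e30)

print(named_ids, no_length_limit)
```

In this config the end-of-sequence token (`<|return|>`) and the padding token (`<|endoftext|>`) are distinct entries, so the two resolve to different IDs.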