itlevy eladsegal esegal committed on
Commit
3f4b61d
·
0 Parent(s):

initial commit


Co-authored-by: eladsegal <eladsegal@users.noreply.huggingface.co>
Co-authored-by: esegal <esegal@users.noreply.huggingface.co>

.gitattributes ADDED
@@ -0,0 +1,38 @@
1
+ *.7z filter=lfs diff=lfs merge=lfs -text
2
+ *.arrow filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
5
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
6
+ *.ftz filter=lfs diff=lfs merge=lfs -text
7
+ *.gz filter=lfs diff=lfs merge=lfs -text
8
+ *.h5 filter=lfs diff=lfs merge=lfs -text
9
+ *.joblib filter=lfs diff=lfs merge=lfs -text
10
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
+ *.model filter=lfs diff=lfs merge=lfs -text
13
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
14
+ *.npy filter=lfs diff=lfs merge=lfs -text
15
+ *.npz filter=lfs diff=lfs merge=lfs -text
16
+ *.onnx filter=lfs diff=lfs merge=lfs -text
17
+ *.ot filter=lfs diff=lfs merge=lfs -text
18
+ *.parquet filter=lfs diff=lfs merge=lfs -text
19
+ *.pb filter=lfs diff=lfs merge=lfs -text
20
+ *.pickle filter=lfs diff=lfs merge=lfs -text
21
+ *.pkl filter=lfs diff=lfs merge=lfs -text
22
+ *.pt filter=lfs diff=lfs merge=lfs -text
23
+ *.pth filter=lfs diff=lfs merge=lfs -text
24
+ *.rar filter=lfs diff=lfs merge=lfs -text
25
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
26
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
28
+ *.tar filter=lfs diff=lfs merge=lfs -text
29
+ *.tflite filter=lfs diff=lfs merge=lfs -text
30
+ *.tgz filter=lfs diff=lfs merge=lfs -text
31
+ *.wasm filter=lfs diff=lfs merge=lfs -text
32
+ *.xz filter=lfs diff=lfs merge=lfs -text
33
+ *.zip filter=lfs diff=lfs merge=lfs -text
34
+ *.zst filter=lfs diff=lfs merge=lfs -text
35
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ fig2.png filter=lfs diff=lfs merge=lfs -text
37
+ fig1.png filter=lfs diff=lfs merge=lfs -text
38
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,323 @@
1
+ ---
2
+ library_name: transformers
3
+ license: other
4
+ license_name: nvidia-open-model-license
5
+ license_link: >-
6
+ https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
7
+ pipeline_tag: text-generation
8
+ language:
9
+ - en
10
+ tags:
11
+ - nvidia
12
+ - gpt-oss
13
+ - puzzle
14
+ - mixture-of-experts
15
+ - reasoning
16
+ - pytorch
17
+ - transformers
18
+ - vllm
19
+ ---
20
+
21
+ # gpt-oss-puzzle-88B
22
+
23
+ # Model Overview
24
+
25
+ ### Description
26
+ gpt-oss-puzzle-88B is a deployment-optimized large language model developed by NVIDIA, derived from [OpenAI's gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b).
27
+ The model is produced using Puzzle, a post-training neural architecture search (NAS) framework, with the goal of significantly improving inference efficiency for reasoning-heavy workloads while maintaining or improving accuracy across reasoning budgets.
28
+
29
+ The model is specifically optimized for long-context and short-context serving on NVIDIA H100-class hardware, where reasoning models are often bottlenecked by KV-cache bandwidth and memory capacity rather than raw compute.
30
+
31
+ Compared to its parent, gpt-oss-puzzle-88B:
32
+ - Reduces total parameters to ~88B (≈73% of the parent),
33
+ - Achieves 1.63× throughput improvement in long-context (64K/64K) scenarios on an 8×H100 node,
34
+ - Achieves 1.22× throughput improvement in short-context (4K/4K) scenarios,
35
+ - Delivers up to 2.82× throughput improvement on a single H100 GPU,
36
+ - Matches or slightly exceeds parent accuracy across reasoning efforts.
37
+
38
+ **Parameter count note.** Hugging Face Hub may automatically show this model as ~91B parameters. We refer to it as 88B because the automatic count includes additional MXFP4 quantization scale tensors for the MoE experts, which are typically not counted as model parameters.
39
+
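The gap between ~88B and ~91B can be sanity-checked with a little arithmetic. The sketch below assumes the standard MXFP4 layout of one shared 8-bit scale per block of 32 weights, and it treats all 88B parameters as quantized expert weights, which overstates the scale count slightly. The function name is hypothetical, and how the Hub actually counts tensors is not described in this card.

```python
# Rough plausibility check of the ~88B vs ~91B gap described above.
# Assumption: MXFP4 stores one shared 8-bit scale per block of 32 weights.
MXFP4_BLOCK_SIZE = 32


def count_with_scales(quantized_params_b: float, other_params_b: float = 0.0) -> float:
    """Return the element count (in billions) when scale tensors are included."""
    scale_elements_b = quantized_params_b / MXFP4_BLOCK_SIZE
    return quantized_params_b + other_params_b + scale_elements_b


# Upper-bound case: treat all 88B parameters as MXFP4-quantized expert weights.
total_b = count_with_scales(88.0)
print(f"{total_b:.2f}B elements including scales")
```

Under these assumptions the scale tensors add a few billion elements, which is the same order as the difference the note describes.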
40
+
41
+ This model is ready for commercial use.
42
+
43
+ ![Accuracy vs Relative Request Rate](fig1.png)
44
+ ![Accuracy Retention and Throughput Speedup](fig2.png)
45
+
46
+ ### License/Terms of Use
47
+
48
+ Governing Terms: Use of this model is governed by the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/).
49
+
50
+ ### Deployment Geography
51
+ Global
52
+
53
+ ### Use Case
54
+ gpt-oss-puzzle-88B is a general purpose reasoning and chat model. This model is intended for production deployment, cost-efficient reasoning, and long-context inference workloads.
55
+
56
+ ### Release Date
57
+ March 26, 2026 via [Hugging Face](https://huggingface.co/nvidia/gpt-oss-puzzle-88B)
58
+
59
+ ## Reference(s)
60
+ * [\[2411.19146\] Puzzle: Distillation-Based NAS for Inference-Optimized LLMs](https://arxiv.org/abs/2411.19146)
61
+ * [\[2508.10925\] gpt-oss-120b & gpt-oss-20b Model Card](https://arxiv.org/abs/2508.10925)
62
+ * [\[2602.11937\] Extending Puzzle for Mixture-of-Experts Reasoning Models with Application to GPT-OSS Acceleration](https://arxiv.org/abs/2602.11937)
63
+
64
+ ## Model Architecture
65
+ - **Architecture Type:** Mixture-of-Experts Decoder-only Transformer
66
+
67
+ - **Network Architecture:** Modified [gpt-oss](https://huggingface.co/openai/gpt-oss-120b) architecture with varying number of experts per layer, and a modified global/window attention pattern across layers.
68
+
69
+ - **Number of model parameters:** 88B
70
+
71
+ ### Key Architectural Optimizations
72
+ This model was created using Puzzle, a post-training NAS framework that constructs a heterogeneous architecture under explicit deployment constraints:
73
+
74
+ - Heterogeneous MoE Expert Pruning
75
+ Each MoE layer retains a different number of experts, determined via activation-based importance scoring. Early layers retain more experts; later layers are more aggressively pruned.
76
+
77
+ - Selective Window Attention
78
+ A subset of global attention layers is replaced with window attention (8K window), reducing KV-cache footprint by ~40% in long-context scenarios while preserving long-range reasoning.
79
+
80
+ - RoPE Scaling Adjustment
81
+ The YaRN RoPE scaling factor was increased to improve stability at 128K context length.
82
+
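To illustrate why replacing global attention with window attention shrinks the KV cache, here is a minimal sketch that counts cached token slots per layer. The four-layer layout is hypothetical and is not this model's actual layer pattern; only the 8K window size comes from the description above.

```python
from typing import Optional, Sequence


def cached_tokens_per_layer(seq_len: int, window: Optional[int]) -> int:
    """Tokens whose keys/values a layer must keep: all of them for global
    attention (window=None), at most `window` for window attention."""
    return seq_len if window is None else min(seq_len, window)


def kv_cache_tokens(seq_len: int, layer_windows: Sequence[Optional[int]]) -> int:
    """Total cached token slots summed over all attention layers."""
    return sum(cached_tokens_per_layer(seq_len, w) for w in layer_windows)


# Hypothetical 4-layer stack at 64K context (NOT the real layer layout).
SEQ_LEN = 64 * 1024
before = kv_cache_tokens(SEQ_LEN, [None, None, None, None])
after = kv_cache_tokens(SEQ_LEN, [None, 8192, None, 8192])
print(f"relative KV-cache footprint: {after / before:.3f}")
```

The saving grows with context length, because a windowed layer's cache stops growing once the sequence exceeds the window.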
83
+ ## Training and Optimization Procedure
84
+
85
+ ### Knowledge Distillation
86
+
87
+ After Puzzle architecture selection, the model underwent knowledge distillation:
88
+
89
+ - Total Tokens: 84B
90
+ - Sequence Length: 128K
91
+ - MoE Experts & Router: Frozen
92
+ - Framework: Megatron-LM
93
+
94
+ This phase restores inter-block compatibility and recovers quality lost during blockwise substitution.
95
+
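The card does not state which distillation objective was used. As a generic illustration only, the sketch below computes a common logit-distillation loss, the forward KL divergence between teacher and student token distributions at a single position. The function names and the temperature parameter are assumptions for this example.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, temperature=1.0):
    """Forward KL divergence KL(teacher || student) for one token position."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# The loss is zero when the student matches the teacher and positive otherwise.
print(kd_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(kd_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
```

In a real training loop this quantity would be averaged over all token positions in a batch.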
96
+ ### Reinforcement Learning
97
+
98
+ A post-distillation reinforcement learning (RL) phase was applied to improve reasoning accuracy while controlling generation length:
99
+
100
+ - Multi-environment RL (math, coding, reasoning)
101
+ - MoE experts and router frozen
102
+ - Two complementary policies trained:
103
+ - High-effort-focused (max accuracy)
104
+ - Mixed-effort (length-regularized)
105
+ - Final model obtained via checkpoint weight averaging
106
+
107
+ This preserves high reasoning accuracy while maintaining a stable effort length ratio, ensuring predictable cost-quality trade-offs.
108
+
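Checkpoint weight averaging can be sketched as an element-wise mean over matching parameters. The example below uses plain Python lists in place of tensors and assumes uniform weights; the card does not say which weights were used, and `average_checkpoints` is a hypothetical helper.

```python
def average_checkpoints(checkpoints, weights=None):
    """Element-wise weighted average of checkpoints given as
    {parameter_name: list_of_floats} dicts with identical keys and shapes."""
    if weights is None:
        # Assumption: uniform averaging when no weights are supplied.
        weights = [1.0 / len(checkpoints)] * len(checkpoints)
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    names = checkpoints[0].keys()
    if any(ckpt.keys() != names for ckpt in checkpoints):
        raise ValueError("checkpoints must share parameter names")
    return {
        name: [
            sum(w * value for w, value in zip(weights, values))
            for values in zip(*(ckpt[name] for ckpt in checkpoints))
        ]
        for name in names
    }


# Toy stand-ins for the two RL policies described above.
high_effort = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
mixed_effort = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}
merged = average_checkpoints([high_effort, mixed_effort])
print(merged)
```

With real checkpoints the same loop would run over framework tensors, and only parameters that were trained need averaging since frozen weights are identical in both policies.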
109
+ ### Quantization
110
+
111
+ - MoE Weights: MXFP4 (inherited from gpt-oss-120B)
112
+ - KV Cache: FP8 with calibrated KV scales
113
+ - Effect:
114
+ - ~2× KV-cache token capacity
115
+ - Faster attention kernels
116
+ - Preserved accuracy vs unscaled FP8 KV-cache
117
+
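The ~2× token capacity follows from FP8 using one byte per cached value where BF16 uses two. A minimal sketch, with placeholder layer and head counts that are not taken from this model's configuration, and ignoring window-attention layers and the small overhead of the KV scales:

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int) -> int:
    """Bytes of KV cache one token occupies: keys and values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def token_capacity(cache_budget_bytes: int, bytes_per_token: int) -> int:
    """How many tokens fit into a fixed KV-cache memory budget."""
    return cache_budget_bytes // bytes_per_token


# Illustrative shape only; placeholders, not this model's configuration.
LAYERS, KV_HEADS, HEAD_DIM = 36, 8, 64
BUDGET = 20 * 1024**3  # hypothetical 20 GiB left for the KV cache

bf16 = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bytes_per_value=2)
fp8 = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bytes_per_value=1)
print(token_capacity(BUDGET, fp8) / token_capacity(BUDGET, bf16))
```

The ratio is independent of the chosen shape, since every term except the bytes per value cancels.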
118
+ ## Reasoning Effort Control
119
+ The model supports three reasoning effort modes:
120
+
121
+ - Low: Fast, concise responses
122
+ - Medium: Balanced accuracy and verbosity
123
+ - High: Deep, multi-step reasoning
124
+
125
+ Effort reliably controls generation length and accuracy, enabling cost-aware deployment.
126
+
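In the `chat_template.jinja` shipped with this commit, the effort level reaches the model as a `Reasoning: <level>` line in the system message, defaulting to `medium`. The Python sketch below mirrors that text layout for illustration. It is not the template itself: built-in tool rendering is omitted, and the value check is an addition that the template does not perform.

```python
from datetime import date

DEFAULT_IDENTITY = "You are ChatGPT, a large language model trained by OpenAI."
VALID_EFFORTS = ("low", "medium", "high")


def build_system_message(reasoning_effort="medium", model_identity=DEFAULT_IDENTITY,
                         today=None, has_tools=False):
    """Mirror the text layout of the template's system message (built-in tools omitted)."""
    # Validation is an addition in this sketch; the template does not check the value.
    if reasoning_effort not in VALID_EFFORTS:
        raise ValueError(f"reasoning_effort must be one of {VALID_EFFORTS}")
    today = today or date.today()
    text = (
        f"{model_identity}\n"
        "Knowledge cutoff: 2024-06\n"
        f"Current date: {today.strftime('%Y-%m-%d')}\n\n"
        f"Reasoning: {reasoning_effort}\n\n"
        "# Valid channels: analysis, commentary, final. "
        "Channel must be included for every message."
    )
    if has_tools:
        text += "\nCalls to these tools must go to the commentary channel: 'functions'."
    return text


print(build_system_message("high", today=date(2026, 3, 26)))
```

In practice the tokenizer's chat template produces this text, so callers only need to pass the effort level through their inference stack.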
127
+ ## Input
128
+
129
+ - **Input Type(s):** Text
130
+
131
+ - **Input Format(s):** String
132
+
133
+ - **Input Parameters:** One-Dimensional (1D): Sequences
134
+
135
+ - **Other Properties Related to Input:** Context length is 128k tokens.
136
+
137
+ ## Output
138
+
139
+ - **Output Type(s):** Text
140
+
141
+ - **Output Format:** String
142
+
143
+ - **Output Parameters:** One-Dimensional (1D): Sequences
144
+
145
+ - **Other Properties Related to Output:** Context length is 128k tokens.
146
+
147
+ Our AI models are designed and optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
148
+
149
+ ## Software Integration
150
+ **Runtime Engine(s):**
151
+ * vLLM (See instructions [below](#vllm))
152
+
153
+ **Supported Hardware Microarchitecture Compatibility:**
154
+ * NVIDIA B200
155
+ * NVIDIA H100-80GB
156
+
157
+ **Preferred/Supported Operating System(s):**
158
+ * Linux
159
+
160
+ The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
161
+
162
+ ## Model Version
163
+ - v1.0
164
+
165
+ ## Training and Evaluation Datasets
166
+
167
+ ### Dataset Overview
168
+ **Total Number of Datasets:** 7
169
+ **Time period for data collection:** 2013 to May 1, 2025
170
+
171
+ For the KD stage data, the prompts from [nvidia/Llama-Nemotron-Post-Training-Dataset](https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset) were used to generate responses from the parent model (gpt-oss-120b) to create full KD training examples. For each prompt, we generated responses under high and medium reasoning-effort settings.
172
+
173
+ For the RL stage data, we used a subset of the [NeMo Gym collection](https://huggingface.co/collections/nvidia/nemo-gym) which includes RL verifiable data.
174
+
175
+ ### Public Datasets
176
+ - [nvidia/Llama-Nemotron-Post-Training-Dataset](https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset)
177
+ - [nvidia/Nemotron-RL-coding-competitive_coding](https://huggingface.co/datasets/nvidia/Nemotron-RL-coding-competitive_coding)
178
+ - [nvidia/Nemotron-RL-instruction_following](https://huggingface.co/datasets/nvidia/Nemotron-RL-instruction_following)
179
+ - [BytedTsinghua-SIA/DAPO-Math-17k](https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k)
180
+ - [Skywork/Skywork-OR1-RL-Data](https://huggingface.co/datasets/Skywork/Skywork-OR1-RL-Data)
181
+ - [nvidia/Nemotron-RL-knowledge-mcqa](https://huggingface.co/datasets/nvidia/Nemotron-RL-knowledge-mcqa)
182
+ - [nvidia/Nemotron-RL-instruction_following-structured_outputs](https://huggingface.co/datasets/nvidia/Nemotron-RL-instruction_following-structured_outputs)
183
+
184
+ ## Training Dataset
185
+
186
+ **Data Modality**: Text
187
+
188
+ **Text Training Data Size**: 1 Billion to 10 Trillion Tokens
189
+
190
+ **Data Collection Method by dataset**: Automated/Synthetic/Human
191
+
192
+ **Labeling Method by dataset**: Not Applicable
193
+
194
+ **Properties**:
195
+ The training data is text-only and spans a broad range of task categories. The knowledge distillation stage used the Llama-Nemotron-Post-Training-Dataset, a large-scale collection covering mathematics, code, science, instruction following, general chat, and safety. The reinforcement learning stage used datasets spanning several domains: competitive programming problems with unit tests (Nemotron-RL-coding-competitive_coding, Skywork-OR1-RL-Data), diverse verifiable mathematical reasoning problems (DAPO-Math-17k, Skywork-OR1-RL-Data), multi-domain multiple-choice question answering across fields such as physics, biology, chemistry, mathematics, computer science, engineering, humanities, law, and others (Nemotron-RL-knowledge-mcqa), easily verifiable instruction-following tasks with diverse format and linguistic constraints (Nemotron-RL-instruction_following), and structured output generation requiring adherence to JSON schemas (Nemotron-RL-instruction_following-structured_outputs). No personal data was used for training.
196
+
197
+
198
+ ## Evaluation Dataset
199
+
200
+ **Data Collection Method by dataset:** Hybrid: Human, Synthetic
201
+
202
+ **Labeling Method by dataset:** Hybrid: Automated, Human, Synthetic
203
+
204
+ | Benchmark | Description |
205
+ |-----------|-------------|
206
+ | [**MMLU-Pro**](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro) | MMLU-Pro dataset is a more robust and challenging massive multi-task understanding dataset tailored to more rigorously benchmark large language models' capabilities. |
207
+ | [**GPQA-Diamond**](https://huggingface.co/datasets/Idavidrein/gpqa) | The GPQA (Graduate-Level Google-Proof Q&A) benchmark is a challenging dataset of 448 multiple-choice questions in biology, physics, and chemistry. GPQA-Diamond is its 198-question highest-quality subset. |
208
+ | [**HLE**](https://huggingface.co/datasets/cais/hle) | Humanity's Last Exam (HLE) is a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. Humanity's Last Exam consists of 3,000 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. |
209
+ | [**AA-LCR**](https://huggingface.co/datasets/ArtificialAnalysis/AA-LCR) | A challenging benchmark measuring language models' ability to extract, reason about, and synthesize information from long-form documents ranging from 10k to 100k tokens (measured using the cl100k_base tokenizer). |
210
+ | [**AIME25**](https://huggingface.co/datasets/math-ai/aime25) | American Invitational Mathematics Examination (AIME) 2025 questions |
211
+ | [**IFBench**](https://huggingface.co/datasets/allenai/IFBench_test) | IFBench is a new, challenging benchmark for precise instruction following. |
212
+ | [**SciCode**](https://huggingface.co/datasets/SciCode1/SciCode) | SciCode is a challenging benchmark designed to evaluate the capabilities of LLMs in generating code for solving realistic scientific research problems. |
213
+ | [**RULER 128K**](https://huggingface.co/datasets/GAIR/ruler-128k) | RULER generates synthetic examples to evaluate long-context language models with configurable sequence length and task complexity. Used with context length of 128K tokens. |
214
+
215
+
216
+ # Inference
217
+ **Acceleration Engine**: vLLM
218
+ **Test Hardware:**
219
+ - 1× NVIDIA H100-80GB
220
+ - 8× NVIDIA H100-80GB
221
+ - 8× NVIDIA B200
222
+
223
+ ## Quick Start
224
+
225
+ The gpt-oss-puzzle-88B model can be used with standard inference stacks such as Hugging Face Transformers and vLLM.
226
+ It is especially optimized for NVIDIA H100 GPUs and supports long-context inference up to 128K tokens.
227
+
228
+ ### Transformers
229
+ We recommend using Transformers ≥ 4.57.3.
230
+
231
+ ```python
232
+ from transformers import GenerationConfig, pipeline
233
+
234
+ model_id = "nvidia/gpt-oss-puzzle-88B"
235
+
236
+ pipe = pipeline(
237
+     "text-generation",
237
+     model=model_id,
238
+     trust_remote_code=True,
239
+     dtype="auto",
240
+     device_map="auto",
242
+ )
243
+
244
+ messages = [
245
+     {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
246
+ ]
247
+
248
+ generation_config = GenerationConfig.from_pretrained(model_id)
249
+ generation_config.max_new_tokens = 256
250
+
251
+ outputs = pipe(
252
+     messages,
253
+     generation_config=generation_config,
254
+ )
255
+ print(outputs[0]["generated_text"][-1])
256
+ ```
257
+
258
+ ### vLLM
259
+
260
+ #### Serving
261
+
262
+ Start the server with a single command:
263
+
264
+ ```bash
265
+ docker run --gpus all -p 8000:8000 \
266
+ --entrypoint bash \
267
+ vllm/vllm-openai:v0.17.1 \
268
+ -c "
269
+ apt-get update && apt-get install -y git &&
270
+ VLLM_USE_PRECOMPILED=1 pip install --no-build-isolation 'git+https://github.com/vllm-project/vllm.git@refs/pull/38135/head' &&
271
+ pip install flashinfer-cubin==0.6.6 flashinfer-jit-cache==0.6.6 --extra-index-url https://flashinfer.ai/whl/cu\$(echo \$CUDA_VERSION | cut -d. -f1,2 | tr -d '.') &&
272
+ export PYTORCH_ALLOC_CONF=expandable_segments:True &&
273
+ vllm serve nvidia/gpt-oss-puzzle-88B \
274
+ -tp 1 \
275
+ --trust-remote-code \
276
+ --kv-cache-dtype fp8 \
277
+ --max-num-batched-tokens 8192 \
278
+ --stream-interval 20 \
279
+ --gpu-memory-utilization 0.95 \
280
+ --max-num-seqs 8 \
281
+ --max-cudagraph-capture-size 8 \
282
+ --max-model-len 131072
283
+ "
284
+ ```
285
+
286
+ > **Notes:**
287
+ > - On Blackwell (B200), add `-e VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1` to the `docker run` command.
288
+ > - Remove `--kv-cache-dtype fp8` for BF16 KV-cache instead of FP8.
289
+ > - Increase `-tp` if you need larger batch sizes or longer sequences.
290
+ > - Expert parallelism is supported via `--enable-expert-parallel`, but we recommend tensor parallelism (`-tp`).
291
+
292
+ #### Inference with Reasoning Effort Control
293
+
294
+ The model supports three reasoning effort levels (`low`, `medium`, `high`). For example:
295
+
296
+ ```python
297
+ from openai import OpenAI
298
+
299
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
300
+
301
+ # High effort — deep, multi-step reasoning
302
+ response = client.chat.completions.create(
303
+     model="nvidia/gpt-oss-puzzle-88B",
304
+     messages=[{"role": "user", "content": "Write a haiku about neural network pruning"}],
305
+     reasoning_effort="high",
306
+ )
307
+ print(response.choices[0].message.content)
308
+
309
+ # Low effort — fast, concise responses
310
+ response = client.chat.completions.create(
311
+     model="nvidia/gpt-oss-puzzle-88B",
312
+     messages=[{"role": "user", "content": "What is the capital of France?"}],
313
+     reasoning_effort="low",
314
+ )
315
+ print(response.choices[0].message.content)
316
+ ```
317
+
318
+ ## Ethical Considerations
319
+ NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
320
+
321
+ For more detailed information on ethical considerations for this model, please see the [Bias, Explainability, Safety & Security, and Privacy Subcards](https://huggingface.co/nvidia/gpt-oss-puzzle-88B).
322
+
323
+ Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
bias.md ADDED
@@ -0,0 +1,4 @@
1
+ | Field | Response |
2
+ | :---- | :---- |
3
+ | Participation considerations from adversely impacted groups [protected classes](https://www.senate.ca.gov/content/protected-classes) in model design and testing: | None |
4
+ | Measures taken to mitigate against unwanted bias: | None |
chat_template.jinja ADDED
@@ -0,0 +1,331 @@
1
+ {#-
2
+ In addition to the normal inputs of `messages` and `tools`, this template also accepts the
3
+ following kwargs:
4
+ - "builtin_tools": A list, can contain "browser" and/or "python".
5
+ - "model_identity": A string that optionally describes the model identity.
6
+ - "reasoning_effort": A string that describes the reasoning effort, defaults to "medium".
7
+ #}
8
+
9
+ {#- Tool Definition Rendering ============================================== #}
10
+ {%- macro render_typescript_type(param_spec, required_params, is_nullable=false) -%}
11
+ {%- if param_spec.type == "array" -%}
12
+ {%- if param_spec['items'] -%}
13
+ {%- if param_spec['items']['type'] == "string" -%}
14
+ {{- "string[]" }}
15
+ {%- elif param_spec['items']['type'] == "number" -%}
16
+ {{- "number[]" }}
17
+ {%- elif param_spec['items']['type'] == "integer" -%}
18
+ {{- "number[]" }}
19
+ {%- elif param_spec['items']['type'] == "boolean" -%}
20
+ {{- "boolean[]" }}
21
+ {%- else -%}
22
+ {%- set inner_type = render_typescript_type(param_spec['items'], required_params) -%}
23
+ {%- if inner_type == "object | object" or inner_type|length > 50 -%}
24
+ {{- "any[]" }}
25
+ {%- else -%}
26
+ {{- inner_type + "[]" }}
27
+ {%- endif -%}
28
+ {%- endif -%}
29
+ {%- if param_spec.nullable -%}
30
+ {{- " | null" }}
31
+ {%- endif -%}
32
+ {%- else -%}
33
+ {{- "any[]" }}
34
+ {%- if param_spec.nullable -%}
35
+ {{- " | null" }}
36
+ {%- endif -%}
37
+ {%- endif -%}
38
+ {%- elif param_spec.type is defined and param_spec.type is iterable and param_spec.type is not string and param_spec.type is not mapping and param_spec.type[0] is defined -%}
39
+ {#- Handle array of types like ["object", "object"] from Union[dict, list] #}
40
+ {%- if param_spec.type | length > 1 -%}
41
+ {{- param_spec.type | join(" | ") }}
42
+ {%- else -%}
43
+ {{- param_spec.type[0] }}
44
+ {%- endif -%}
45
+ {%- elif param_spec.oneOf -%}
46
+ {#- Handle oneOf schemas - check for complex unions and fallback to any #}
47
+ {%- set has_object_variants = false -%}
48
+ {%- for variant in param_spec.oneOf -%}
49
+ {%- if variant.type == "object" -%}
50
+ {%- set has_object_variants = true -%}
51
+ {%- endif -%}
52
+ {%- endfor -%}
53
+ {%- if has_object_variants and param_spec.oneOf|length > 1 -%}
54
+ {{- "any" }}
55
+ {%- else -%}
56
+ {%- for variant in param_spec.oneOf -%}
57
+ {{- render_typescript_type(variant, required_params) -}}
58
+ {%- if variant.description %}
59
+ {{- "// " + variant.description }}
60
+ {%- endif -%}
61
+ {%- if variant.default is defined %}
62
+ {{ "// default: " + variant.default|tojson }}
63
+ {%- endif -%}
64
+ {%- if not loop.last %}
65
+ {{- " | " }}
66
+ {% endif -%}
67
+ {%- endfor -%}
68
+ {%- endif -%}
69
+ {%- elif param_spec.type == "string" -%}
70
+ {%- if param_spec.enum -%}
71
+ {{- '"' + param_spec.enum|join('" | "') + '"' -}}
72
+ {%- else -%}
73
+ {{- "string" }}
74
+ {%- if param_spec.nullable %}
75
+ {{- " | null" }}
76
+ {%- endif -%}
77
+ {%- endif -%}
78
+ {%- elif param_spec.type == "number" -%}
79
+ {{- "number" }}
80
+ {%- elif param_spec.type == "integer" -%}
81
+ {{- "number" }}
82
+ {%- elif param_spec.type == "boolean" -%}
83
+ {{- "boolean" }}
84
+
85
+ {%- elif param_spec.type == "object" -%}
86
+ {%- if param_spec.properties -%}
87
+ {{- "{\n" }}
88
+ {%- for prop_name, prop_spec in param_spec.properties.items() -%}
89
+ {{- prop_name -}}
90
+ {%- if prop_name not in (param_spec.required or []) -%}
91
+ {{- "?" }}
92
+ {%- endif -%}
93
+ {{- ": " }}
94
+ {{ render_typescript_type(prop_spec, param_spec.required or []) }}
95
+ {%- if not loop.last -%}
96
+ {{-", " }}
97
+ {%- endif -%}
98
+ {%- endfor -%}
99
+ {{- "}" }}
100
+ {%- else -%}
101
+ {{- "object" }}
102
+ {%- endif -%}
103
+ {%- else -%}
104
+ {{- "any" }}
105
+ {%- endif -%}
106
+ {%- endmacro -%}
107
+
108
+ {%- macro render_tool_namespace(namespace_name, tools) -%}
109
+ {{- "## " + namespace_name + "\n\n" }}
110
+ {{- "namespace " + namespace_name + " {\n\n" }}
111
+ {%- for tool in tools %}
112
+ {%- set tool = tool.function %}
113
+ {{- "// " + tool.description + "\n" }}
114
+ {{- "type "+ tool.name + " = " }}
115
+ {%- if tool.parameters and tool.parameters.properties %}
116
+ {{- "(_: {\n" }}
117
+ {%- for param_name, param_spec in tool.parameters.properties.items() %}
118
+ {%- if param_spec.description %}
119
+ {{- "// " + param_spec.description + "\n" }}
120
+ {%- endif %}
121
+ {{- param_name }}
122
+ {%- if param_name not in (tool.parameters.required or []) -%}
123
+ {{- "?" }}
124
+ {%- endif -%}
125
+ {{- ": " }}
126
+ {{- render_typescript_type(param_spec, tool.parameters.required or []) }}
127
+ {%- if param_spec.default is defined -%}
128
+ {%- if param_spec.enum %}
129
+ {{- ", // default: " + param_spec.default }}
130
+ {%- elif param_spec.oneOf %}
131
+ {{- "// default: " + param_spec.default }}
132
+ {%- else %}
133
+ {{- ", // default: " + param_spec.default|tojson }}
134
+ {%- endif -%}
135
+ {%- endif -%}
136
+ {%- if not loop.last %}
137
+ {{- ",\n" }}
138
+ {%- else %}
139
+ {{- ",\n" }}
140
+ {%- endif -%}
141
+ {%- endfor %}
142
+ {{- "}) => any;\n\n" }}
143
+ {%- else -%}
144
+ {{- "() => any;\n\n" }}
145
+ {%- endif -%}
146
+ {%- endfor %}
147
+ {{- "} // namespace " + namespace_name }}
148
+ {%- endmacro -%}
149
+
150
+ {%- macro render_builtin_tools(browser_tool, python_tool) -%}
151
+ {%- if browser_tool %}
152
+ {{- "## browser\n\n" }}
153
+ {{- "// Tool for browsing.\n" }}
154
+ {{- "// The `cursor` appears in brackets before each browsing display: `[{cursor}]`.\n" }}
155
+ {{- "// Cite information from the tool using the following format:\n" }}
156
+ {{- "// `【{cursor}†L{line_start}(-L{line_end})?】`, for example: `【6†L9-L11】` or `【8†L3】`.\n" }}
157
+ {{- "// Do not quote more than 10 words directly from the tool output.\n" }}
158
+ {{- "// sources=web (default: web)\n" }}
159
+ {{- "namespace browser {\n\n" }}
160
+ {{- "// Searches for information related to `query` and displays `topn` results.\n" }}
161
+ {{- "type search = (_: {\n" }}
162
+ {{- "query: string,\n" }}
163
+ {{- "topn?: number, // default: 10\n" }}
164
+ {{- "source?: string,\n" }}
165
+ {{- "}) => any;\n\n" }}
166
+ {{- "// Opens the link `id` from the page indicated by `cursor` starting at line number `loc`, showing `num_lines` lines.\n" }}
167
+ {{- "// Valid link ids are displayed with the formatting: `【{id}†.*】`.\n" }}
168
+ {{- "// If `cursor` is not provided, the most recent page is implied.\n" }}
169
+ {{- "// If `id` is a string, it is treated as a fully qualified URL associated with `source`.\n" }}
170
+ {{- "// If `loc` is not provided, the viewport will be positioned at the beginning of the document or centered on the most relevant passage, if available.\n" }}
171
+ {{- "// Use this function without `id` to scroll to a new location of an opened page.\n" }}
172
+ {{- "type open = (_: {\n" }}
173
+ {{- "id?: number | string, // default: -1\n" }}
174
+ {{- "cursor?: number, // default: -1\n" }}
175
+ {{- "loc?: number, // default: -1\n" }}
176
+ {{- "num_lines?: number, // default: -1\n" }}
177
+ {{- "view_source?: boolean, // default: false\n" }}
178
+ {{- "source?: string,\n" }}
179
+ {{- "}) => any;\n\n" }}
180
+ {{- "// Finds exact matches of `pattern` in the current page, or the page given by `cursor`.\n" }}
181
+ {{- "type find = (_: {\n" }}
182
+ {{- "pattern: string,\n" }}
183
+ {{- "cursor?: number, // default: -1\n" }}
184
+ {{- "}) => any;\n\n" }}
185
+ {{- "} // namespace browser\n\n" }}
186
+ {%- endif -%}
187
+
188
+ {%- if python_tool %}
189
+ {{- "## python\n\n" }}
190
+ {{- "Use this tool to execute Python code in your chain of thought. The code will not be shown to the user. This tool should be used for internal reasoning, but not for code that is intended to be visible to the user (e.g. when creating plots, tables, or files).\n\n" }}
191
+ {{- "When you send a message containing Python code to python, it will be executed in a stateful Jupyter notebook environment. python will respond with the output of the execution or time out after 120.0 seconds. The drive at '/mnt/data' can be used to save and persist user files. Internet access for this session is UNKNOWN. Depends on the cluster.\n\n" }}
192
+ {%- endif -%}
193
+ {%- endmacro -%}
194
+
195
+ {#- System Message Construction ============================================ #}
196
+ {%- macro build_system_message() -%}
197
+ {%- if model_identity is not defined %}
198
+ {%- set model_identity = "You are ChatGPT, a large language model trained by OpenAI." %}
199
+ {%- endif %}
200
+ {{- model_identity + "\n" }}
201
+ {{- "Knowledge cutoff: 2024-06\n" }}
202
+ {{- "Current date: " + strftime_now("%Y-%m-%d") + "\n\n" }}
203
+ {%- if reasoning_effort is not defined %}
204
+ {%- set reasoning_effort = "medium" %}
205
+ {%- endif %}
206
+ {{- "Reasoning: " + reasoning_effort + "\n\n" }}
207
+ {%- if builtin_tools %}
208
+ {{- "# Tools\n\n" }}
209
+ {%- set available_builtin_tools = namespace(browser=false, python=false) %}
210
+ {%- for tool in builtin_tools %}
211
+ {%- if tool == "browser" %}
212
+ {%- set available_builtin_tools.browser = true %}
213
+ {%- elif tool == "python" %}
214
+ {%- set available_builtin_tools.python = true %}
215
+ {%- endif %}
216
+ {%- endfor %}
217
+ {{- render_builtin_tools(available_builtin_tools.browser, available_builtin_tools.python) }}
218
+ {%- endif -%}
219
+ {{- "# Valid channels: analysis, commentary, final. Channel must be included for every message." }}
220
+ {%- if tools -%}
221
+ {{- "\nCalls to these tools must go to the commentary channel: 'functions'." }}
222
+ {%- endif -%}
223
+ {%- endmacro -%}
224
+
225
+ {#- Main Template Logic ================================================= #}
226
+ {#- Set defaults #}
227
+
228
+ {#- Render system message #}
229
+ {{- "<|start|>system<|message|>" }}
230
+ {{- build_system_message() }}
231
+ {{- "<|end|>" }}
232
+
233
+ {#- Extract developer message #}
234
+ {%- if messages[0].role == "developer" or messages[0].role == "system" %}
235
+ {%- set developer_message = messages[0].content %}
236
+ {%- set loop_messages = messages[1:] %}
237
+ {%- else %}
238
+ {%- set developer_message = "" %}
239
+ {%- set loop_messages = messages %}
240
+ {%- endif %}
241
+
242
+ {#- Render developer message #}
243
+ {%- if developer_message or tools %}
244
+ {{- "<|start|>developer<|message|>" }}
245
+ {%- if developer_message %}
246
+ {{- "# Instructions\n\n" }}
247
+ {{- developer_message }}
248
+ {{- "\n\n" }}
249
+ {%- endif %}
250
+ {%- if tools -%}
251
+ {{- "# Tools\n\n" }}
252
+ {{- render_tool_namespace("functions", tools) }}
253
+ {%- endif -%}
254
+ {{- "<|end|>" }}
255
+ {%- endif %}
256
+
257
+ {#- Render messages #}
258
+ {%- set last_tool_call = namespace(name=none) %}
259
+ {%- for message in loop_messages -%}
260
+ {#- At this point only assistant/user/tool messages should remain #}
261
+ {%- if message.role == 'assistant' -%}
262
+ {#- Checks to ensure the messages are being passed in the format we expect #}
263
+ {%- if "content" in message %}
264
+ {%- if "<|channel|>analysis<|message|>" in message.content or "<|channel|>final<|message|>" in message.content %}
265
+ {{- raise_exception("You have passed a message containing <|channel|> tags in the content field. Instead of doing this, you should pass analysis messages (the string between '<|message|>' and '<|end|>') in the 'thinking' field, and final messages (the string between '<|message|>' and '<|end|>') in the 'content' field.") }}
266
+ {%- endif %}
267
+ {%- endif %}
268
+ {%- if "thinking" in message %}
269
+ {%- if "<|channel|>analysis<|message|>" in message.thinking or "<|channel|>final<|message|>" in message.thinking %}
270
+ {{- raise_exception("You have passed a message containing <|channel|> tags in the thinking field. Instead of doing this, you should pass analysis messages (the string between '<|message|>' and '<|end|>') in the 'thinking' field, and final messages (the string between '<|message|>' and '<|end|>') in the 'content' field.") }}
271
+ {%- endif %}
272
+ {%- endif %}
273
+ {%- if "tool_calls" in message %}
274
+ {#- We need very careful handling here - we want to drop the tool call analysis message if the model #}
275
+ {#- has output a later <|final|> message, but otherwise we want to retain it. This is the only case #}
276
+ {#- when we render CoT/analysis messages in inference. #}
277
+ {%- set future_final_message = namespace(found=false) %}
278
+ {%- for future_message in loop_messages[loop.index:] %}
279
+ {%- if future_message.role == 'assistant' and "tool_calls" not in future_message %}
280
+ {%- set future_final_message.found = true %}
281
+ {%- endif %}
282
+ {%- endfor %}
283
+ {#- We assume max 1 tool call per message, and so we infer the tool call name #}
284
+ {#- in "tool" messages from the most recent assistant tool call name #}
285
+ {%- set tool_call = message.tool_calls[0] %}
286
+ {%- if tool_call.function %}
287
+ {%- set tool_call = tool_call.function %}
288
+ {%- endif %}
289
+ {%- if message.content and message.thinking %}
290
+ {{- raise_exception("Cannot pass both content and thinking in an assistant message with tool calls! Put the analysis message in one or the other, but not both.") }}
291
+ {%- elif message.content and not future_final_message.found %}
292
+ {{- "<|start|>assistant<|channel|>analysis<|message|>" + message.content + "<|end|>" }}
293
+ {%- elif message.thinking and not future_final_message.found %}
294
+ {{- "<|start|>assistant<|channel|>analysis<|message|>" + message.thinking + "<|end|>" }}
295
+ {%- endif %}
296
+ {{- "<|start|>assistant to=" }}
297
+ {{- "functions." + tool_call.name + "<|channel|>commentary " }}
298
+ {{- (tool_call.content_type if tool_call.content_type is defined else "json") + "<|message|>" }}
299
+ {{- tool_call.arguments|tojson }}
300
+ {{- "<|call|>" }}
301
+ {%- set last_tool_call.name = tool_call.name %}
302
+ {%- elif loop.last and not add_generation_prompt %}
303
+ {#- Only render the CoT if the final turn is an assistant turn and add_generation_prompt is false #}
304
+ {#- This is a situation that should only occur in training, never in inference. #}
305
+ {%- if "thinking" in message %}
306
+ {{- "<|start|>assistant<|channel|>analysis<|message|>" + message.thinking + "<|end|>" }}
307
+ {%- endif %}
308
+ {#- <|return|> indicates the end of generation, but <|end|> does not #}
309
+ {#- <|return|> should never be an input to the model, but we include it as the final token #}
310
+ {#- when training, so the model learns to emit it. #}
311
+ {{- "<|start|>assistant<|channel|>final<|message|>" + message.content + "<|return|>" }}
312
+ {%- else %}
313
+ {#- CoT is dropped during all previous turns, so we never render it for inference #}
314
+ {{- "<|start|>assistant<|channel|>final<|message|>" + message.content + "<|end|>" }}
315
+ {%- set last_tool_call.name = none %}
316
+ {%- endif %}
317
+ {%- elif message.role == 'tool' -%}
318
+ {%- if last_tool_call.name is none %}
319
+ {{- raise_exception("Message has tool role, but there was no previous assistant message with a tool call!") }}
320
+ {%- endif %}
321
+ {{- "<|start|>functions." + last_tool_call.name }}
322
+ {{- " to=assistant<|channel|>commentary<|message|>" + message.content|tojson + "<|end|>" }}
323
+ {%- elif message.role == 'user' -%}
324
+ {{- "<|start|>user<|message|>" + message.content + "<|end|>" }}
325
+ {%- endif -%}
326
+ {%- endfor -%}
327
+
328
+ {#- Generation prompt #}
329
+ {%- if add_generation_prompt -%}
330
+ <|start|>assistant
331
+ {%- endif -%}
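The message loop in this template can be approximated in plain Python. The sketch below is a simplified, hypothetical re-implementation (the name `render_turns` is not part of the repository). It covers only `user` turns and `assistant` turns without tool calls, and it skips the system and developer preambles, the tool-call branch, and the `raise_exception` checks.

```python
def render_turns(messages, add_generation_prompt=True):
    """Simplified mirror of the template's loop for user/assistant turns only."""
    out = []
    last_index = len(messages) - 1
    for i, message in enumerate(messages):
        if message["role"] == "user":
            out.append("<|start|>user<|message|>" + message["content"] + "<|end|>")
        elif message["role"] == "assistant":
            if i == last_index and not add_generation_prompt:
                # Training-style rendering: keep the chain of thought and
                # close the final message with <|return|>.
                if "thinking" in message:
                    out.append(
                        "<|start|>assistant<|channel|>analysis<|message|>"
                        + message["thinking"]
                        + "<|end|>"
                    )
                out.append(
                    "<|start|>assistant<|channel|>final<|message|>"
                    + message["content"]
                    + "<|return|>"
                )
            else:
                # Earlier turns: the chain of thought is dropped.
                out.append(
                    "<|start|>assistant<|channel|>final<|message|>"
                    + message["content"]
                    + "<|end|>"
                )
    if add_generation_prompt:
        out.append("<|start|>assistant")
    return "".join(out)


conversation = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "thinking": "greet", "content": "Hello"},
]
print(render_turns(conversation, add_generation_prompt=False))
```

With `add_generation_prompt=False` the last assistant turn keeps its `thinking` text and ends in `<|return|>`; with the default, the same turn is rendered as a plain `final` message ending in `<|end|>`, followed by the `<|start|>assistant` generation prompt.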
config.json ADDED
@@ -0,0 +1,242 @@
1
+ {
2
+ "architectures": [
3
+ "GptOssPuzzleForCausalLM"
4
+ ],
5
+ "attention_bias": true,
6
+ "attention_dropout": 0.0,
7
+ "auto_map": {
8
+ "AutoConfig": "configuration_gpt_oss_puzzle.GptOssPuzzleConfig",
9
+ "AutoModelForCausalLM": "modeling_gpt_oss_puzzle.GptOssPuzzleForCausalLM"
10
+ },
11
+ "block_configs": [
12
+ {
13
+ "num_local_experts": 128,
14
+ "sliding_window": 128
15
+ },
16
+ {
17
+ "num_local_experts": 128,
18
+ "sliding_window": null
19
+ },
20
+ {
21
+ "num_local_experts": 128,
22
+ "sliding_window": 128
23
+ },
24
+ {
25
+ "num_local_experts": 128,
26
+ "sliding_window": 8192
27
+ },
28
+ {
29
+ "num_local_experts": 128,
30
+ "sliding_window": 128
31
+ },
32
+ {
33
+ "num_local_experts": 128,
34
+ "sliding_window": 8192
35
+ },
36
+ {
37
+ "num_local_experts": 128,
38
+ "sliding_window": 128
39
+ },
40
+ {
41
+ "num_local_experts": 128,
42
+ "sliding_window": null
43
+ },
44
+ {
45
+ "num_local_experts": 128,
46
+ "sliding_window": 128
47
+ },
48
+ {
49
+ "num_local_experts": 128,
50
+ "sliding_window": null
51
+ },
52
+ {
53
+ "num_local_experts": 128,
54
+ "sliding_window": 128
55
+ },
56
+ {
57
+ "num_local_experts": 128,
58
+ "sliding_window": null
59
+ },
60
+ {
61
+ "num_local_experts": 128,
62
+ "sliding_window": 128
63
+ },
64
+ {
65
+ "num_local_experts": 128,
66
+ "sliding_window": null
67
+ },
68
+ {
69
+ "num_local_experts": 128,
70
+ "sliding_window": 128
71
+ },
72
+ {
73
+ "num_local_experts": 64,
74
+ "sliding_window": null
75
+ },
76
+ {
77
+ "num_local_experts": 128,
78
+ "sliding_window": 128
79
+ },
80
+ {
81
+ "num_local_experts": 64,
82
+ "sliding_window": null
83
+ },
84
+ {
85
+ "num_local_experts": 128,
86
+ "sliding_window": 128
87
+ },
88
+ {
89
+ "num_local_experts": 64,
90
+ "sliding_window": null
91
+ },
92
+ {
93
+ "num_local_experts": 128,
94
+ "sliding_window": 128
95
+ },
96
+ {
97
+ "num_local_experts": 64,
98
+ "sliding_window": 8192
99
+ },
100
+ {
101
+ "num_local_experts": 64,
102
+ "sliding_window": 128
103
+ },
104
+ {
105
+ "num_local_experts": 64,
106
+ "sliding_window": 8192
107
+ },
108
+ {
109
+ "num_local_experts": 64,
110
+ "sliding_window": 128
111
+ },
112
+ {
113
+ "num_local_experts": 64,
114
+ "sliding_window": null
115
+ },
116
+ {
117
+ "num_local_experts": 64,
118
+ "sliding_window": 128
119
+ },
120
+ {
121
+ "num_local_experts": 64,
122
+ "sliding_window": 8192
123
+ },
124
+ {
125
+ "num_local_experts": 64,
126
+ "sliding_window": 128
127
+ },
128
+ {
129
+ "num_local_experts": 64,
130
+ "sliding_window": null
131
+ },
132
+ {
133
+ "num_local_experts": 64,
134
+ "sliding_window": 128
135
+ },
136
+ {
137
+ "num_local_experts": 64,
138
+ "sliding_window": 8192
139
+ },
140
+ {
141
+ "num_local_experts": 64,
142
+ "sliding_window": 128
143
+ },
144
+ {
145
+ "num_local_experts": 64,
146
+ "sliding_window": 8192
147
+ },
148
+ {
149
+ "num_local_experts": 64,
150
+ "sliding_window": 128
151
+ },
152
+ {
153
+ "num_local_experts": 64,
154
+ "sliding_window": 8192
155
+ }
156
+ ],
157
+ "dtype": "bfloat16",
158
+ "eos_token_id": 200002,
159
+ "head_dim": 64,
160
+ "hidden_act": "silu",
161
+ "hidden_size": 2880,
162
+ "initializer_range": 0.02,
163
+ "intermediate_size": 2880,
164
+ "layer_types": [
165
+ "sliding_attention",
166
+ "full_attention",
167
+ "sliding_attention",
168
+ "sliding_attention",
169
+ "sliding_attention",
170
+ "sliding_attention",
171
+ "sliding_attention",
172
+ "full_attention",
173
+ "sliding_attention",
174
+ "full_attention",
175
+ "sliding_attention",
176
+ "full_attention",
177
+ "sliding_attention",
178
+ "full_attention",
179
+ "sliding_attention",
180
+ "full_attention",
181
+ "sliding_attention",
182
+ "full_attention",
183
+ "sliding_attention",
184
+ "full_attention",
185
+ "sliding_attention",
186
+ "sliding_attention",
187
+ "sliding_attention",
188
+ "sliding_attention",
189
+ "sliding_attention",
190
+ "full_attention",
191
+ "sliding_attention",
192
+ "sliding_attention",
193
+ "sliding_attention",
194
+ "full_attention",
195
+ "sliding_attention",
196
+ "sliding_attention",
197
+ "sliding_attention",
198
+ "sliding_attention",
199
+ "sliding_attention",
200
+ "sliding_attention"
201
+ ],
202
+ "max_position_embeddings": 229376,
203
+ "model_type": "gpt_oss_puzzle",
204
+ "num_attention_heads": 64,
205
+ "num_experts_per_tok": 4,
206
+ "num_hidden_layers": 36,
207
+ "num_key_value_heads": 8,
208
+ "output_router_logits": false,
209
+ "pad_token_id": 199999,
210
+ "quantization_config": {
211
+ "modules_to_not_convert": [
212
+ "model.layers.*.self_attn",
213
+ "model.layers.*.mlp.router",
214
+ "model.embed_tokens",
215
+ "lm_head"
216
+ ],
217
+ "quant_method": "mxfp4"
218
+ },
219
+ "rms_norm_eps": 1e-05,
220
+ "rope_parameters": {
221
+ "beta_fast": 32.0,
222
+ "beta_slow": 1.0,
223
+ "factor": 56.0,
224
+ "original_max_position_embeddings": 4096,
225
+ "rope_type": "yarn",
226
+ "truncate": false
227
+ },
228
+ "rope_scaling": {
229
+ "beta_fast": 32.0,
230
+ "beta_slow": 1.0,
231
+ "factor": 56.0,
232
+ "original_max_position_embeddings": 4096,
233
+ "rope_type": "yarn",
234
+ "truncate": false
235
+ },
236
+ "rope_theta": 150000,
237
+ "router_aux_loss_coef": 0.9,
238
+ "tie_word_embeddings": false,
239
+ "transformers_version": "4.57.6",
240
+ "use_cache": true,
241
+ "vocab_size": 201088
242
+ }
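A few of the numbers in this config can be cross-checked with simple arithmetic. The snippet below is only a sanity-check sketch over values copied from `config.json`; the variable names are illustrative, and reading the RoPE `factor` as "extended context divided by original context" is an assumption about the usual YaRN convention rather than something the commit states.

```python
# Values copied from config.json in this commit.
config = {
    "hidden_size": 2880,
    "head_dim": 64,
    "num_attention_heads": 64,
    "num_key_value_heads": 8,
    "max_position_embeddings": 229376,
    "rope_scaling": {"factor": 56.0, "original_max_position_embeddings": 4096},
}

rope = config["rope_scaling"]
# Assumed convention: extended context = factor * original context.
extended_context = int(rope["factor"] * rope["original_max_position_embeddings"])
# Grouped-query attention: how many query heads share one key/value head.
queries_per_kv_head = config["num_attention_heads"] // config["num_key_value_heads"]
# Total width of the query projection output.
attention_width = config["num_attention_heads"] * config["head_dim"]

print(extended_context)     # 229376, equal to max_position_embeddings
print(queries_per_kv_head)  # 8
print(attention_width)      # 4096, which is larger than hidden_size (2880)
```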
configuration_gpt_oss_puzzle.py ADDED
@@ -0,0 +1,65 @@
+ from typing import Any
+
+ from dataclasses import asdict, dataclass, fields
+
+ from transformers.models.gpt_oss.configuration_gpt_oss import GptOssConfig
+
+
+ @dataclass
+ class BlockConfig:
+     # None selects full attention for the block (config.json uses "sliding_window": null).
+     sliding_window: int | None
+     num_local_experts: int
+
+
+ LAYER_SPECIFIC_MEMBERS = [field.name for field in fields(BlockConfig)]
+
+
+ class GptOssPuzzleConfig(GptOssConfig):
+     model_type = "gpt_oss_puzzle"
+
+     def __init__(self, *, block_configs: list[dict[str, Any]] | None = None, **kwargs):
+         self.block_configs = block_configs
+         super().__init__(**kwargs)
+
+         if self.block_configs is not None:
+             # Heterogeneous model: one BlockConfig per layer, layer types derived from the window.
+             self.block_configs = [BlockConfig(**block_config) for block_config in self.block_configs]
+             self.layer_types = [
+                 ("full_attention" if block_config.sliding_window is None else "sliding_attention")
+                 for block_config in self.block_configs
+             ]
+
+             # Drop the global per-layer attributes so callers cannot read a misleading single value.
+             for member in LAYER_SPECIFIC_MEMBERS:
+                 if hasattr(self, member):
+                     delattr(self, member)
+         else:
+             # Homogeneous fallback: replicate the global values for every layer.
+             self.block_configs = [
+                 BlockConfig(
+                     sliding_window=self.sliding_window,
+                     num_local_experts=self.num_local_experts,
+                 )
+                 for _ in range(self.num_hidden_layers)
+             ]
+
+     def __getattr__(self, name: str) -> Any:
+         if name in LAYER_SPECIFIC_MEMBERS:
+             raise AttributeError(
+                 f"'{name}' is a per-block attribute and varies across blocks. "
+                 f"Access it via the individual block configs instead (e.g. config.block_configs[i].{name})."
+             )
+         non_heterogeneous_error_message = f"'{type(self).__name__}' object has no attribute '{name}'"
+         raise AttributeError(non_heterogeneous_error_message)
+
+     def to_dict(self) -> dict[str, Any]:
+         output = super().to_dict()
+         output["block_configs"] = [asdict(block_config) for block_config in self.block_configs]
+         return output
+
+     def get_gpt_oss_config_for_layer(self, layer_idx: int) -> GptOssConfig:
+         config_dict = self.to_dict()
+         del config_dict["block_configs"]
+         block_config = self.block_configs[layer_idx]
+
+         config_dict["sliding_window"] = block_config.sliding_window
+         config_dict["num_local_experts"] = block_config.num_local_experts
+
+         return GptOssConfig.from_dict(config_dict, attn_implementation=self._attn_implementation)
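The rule that maps each block's `sliding_window` to a layer type can be tried without installing `transformers`. The following is a minimal standalone sketch: `derive_layer_types` is a hypothetical helper name, and the `BlockConfig` here is a local stand-in for the dataclass in this file, populated with the first four entries of `block_configs` from `config.json`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BlockConfig:
    # Local stand-in for the dataclass in configuration_gpt_oss_puzzle.py.
    sliding_window: Optional[int]
    num_local_experts: int


def derive_layer_types(block_configs):
    """Apply the same rule as GptOssPuzzleConfig.__init__: no window means full attention."""
    return [
        "full_attention" if block.sliding_window is None else "sliding_attention"
        for block in block_configs
    ]


# First four blocks of config.json.
blocks = [
    BlockConfig(sliding_window=128, num_local_experts=128),
    BlockConfig(sliding_window=None, num_local_experts=128),
    BlockConfig(sliding_window=128, num_local_experts=128),
    BlockConfig(sliding_window=8192, num_local_experts=128),
]
print(derive_layer_types(blocks))
# ['sliding_attention', 'full_attention', 'sliding_attention', 'sliding_attention']
```

Note that a block with an 8192-token window still counts as `sliding_attention`; only a `null` window yields `full_attention`, which agrees with the first four entries of `layer_types` in `config.json`.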
explainability.md ADDED
@@ -0,0 +1,13 @@
1
+ | Field | Response |
2
+ | :---- | :---- |
3
+ | Intended Task/Domain: | Text generation, reasoning, and chat |
4
+ | Model Type: | Text-to-text Mixture-of-Experts Transformer |
5
+ | Intended Users: | Generative AI creators working with conversational AI models. |
6
+ | Output: | Text |
7
+ | Describe how the model works: | Generates text by predicting the next word or token based on the context provided in the input sequence using multiple self-attention layers. |
8
+ | Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: | Not Applicable |
9
+ | Technical Limitations & Mitigation: | This model performs particularly well in instruction-following regimes and, as a result, may be strongly influenced by untrusted inputs. It should be paired with appropriate guardrails and data filtering to better align use-case behaviors when exposed to such data. |
10
+ | Verified to have met prescribed NVIDIA quality standards: | Yes |
11
+ | Performance Metrics: | Accuracy, Throughput, and User-side throughput |
12
+ | Potential Known Risks: | The model was optimized explicitly for instruction following and as such may be influenced by untrusted inputs (prompt injection, indirect prompt injection, jailbreaking, web search, etc.) as a result of its instruction tuning that may degrade safety alignment and other training efforts. This model should be paired with additional guardrails and data filtering to limit exposure to instructions from malicious sources. Bypassing of safety alignment, system guardrails, and filters may allow harmful outcomes up to and including remote code execution in some agentic systems when effective security controls are not in place. The model was trained on data that contains toxic language and societal biases originally crawled from the internet. Therefore, the model may generate and amplify harmful, biased, or otherwise unsafe content reinforcing these biases and return toxic responses especially when prompted with toxic prompts. The model may also generate answers that may be inaccurate, omit key information, or include irrelevant or redundant text producing socially unacceptable or undesirable text, even if the prompt itself does not include anything explicitly offensive. The model may exhibit self-anthropomorphism (e.g., displaying human-like characteristics in dialogue, such as expressing preferences and emotions). In integrated system contexts, the model could potentially be exploited to access or disclose information beyond the model’s intended permissions or scope of operation. |
13
+ | Licensing: | [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license) |
fig1.png ADDED

Git LFS Details

  • SHA256: b1e1bb2a9650e1541ad6ab62705de1c1b1b35d535e2097ab7c21ffef15315bf6
  • Pointer size: 131 Bytes
  • Size of remote file: 301 kB
fig2.png ADDED

Git LFS Details

  • SHA256: 348bd2b0c8db89e647e82e679419c606b1065a6a0d30a1959f110f033bed407b
  • Pointer size: 131 Bytes
  • Size of remote file: 206 kB
generation_config.json ADDED
@@ -0,0 +1,10 @@
1
+ {
2
+ "bos_token_id": 199998,
3
+ "do_sample": true,
4
+ "eos_token_id": [
5
+ 200002,
6
+ 199999
7
+ ],
8
+ "pad_token_id": 199999,
9
+ "transformers_version": "4.55.0.dev0"
10
+ }
model-00001-of-00011.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:450b3564a3cc1ff4fe2ca900c440aa619b017d7555658f7979f43116e591f7ad
3
+ size 4115581080
model-00002-of-00011.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e08e900384dff118d28397aa96caf7e4d9960a10f90217446841f2f716ce5ae9
3
+ size 4678869240
model-00003-of-00011.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:050248abe00e9216471283d8a0aae075db722282bbbbf61439cec2cd3ea130f8
3
+ size 4679238480
model-00004-of-00011.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:64e240da7cf5ac9531b3d0b9032c3660f261d0cb9787bd13878f5d7a5d499944
3
+ size 4987817832
model-00005-of-00011.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1ca2a05ac2f4e0c7b5b522471a9a0ac3fd6ce72e2db0089c7e544129ca7d077f
3
+ size 4759633712
model-00006-of-00011.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:221819279728057cfc388cf354417ce0641b0df7b7c602c1eba57dd667be1c97
3
+ size 4503088168
model-00007-of-00011.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:834f2121924b54f9960c299e06931ab8fc1f707245c1a06ea8d324532701ea43
3
+ size 4980815736
model-00008-of-00011.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5e1cda36a0e16c3ca61087c2718abfc0f4ce3c22c2e6268fa500f6c233e737f5
3
+ size 4537371312
model-00009-of-00011.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:07632a1a122cb2ea3cf75d0f867f53e16d021bbc269f7f127b5397af7e929267
3
+ size 4061736320
model-00010-of-00011.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:23bae85e80dab667efcfa959c74778b71a1fecd03d6ba3b97fdb1d79710e1c7f
3
+ size 4678869176
model-00011-of-00011.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9c4199a2da5a339d54f514066454e34cf9e9cf9563c1758772a5f466ebf5cf1b
3
+ size 4010816256
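The eleven LFS pointers each record a shard size in bytes, so the total download can be estimated by summing them. The sketch below copies the sizes from the pointers and compares the sum with the `total_size` value recorded in `model.safetensors.index.json`. The small gap between the two is plausibly per-file header and metadata overhead, but that explanation is an assumption and is not stated in the commit.

```python
# Shard sizes in bytes, copied from the LFS pointer files in this commit.
shard_sizes = [
    4115581080,
    4678869240,
    4679238480,
    4987817832,
    4759633712,
    4503088168,
    4980815736,
    4537371312,
    4061736320,
    4678869176,
    4010816256,
]

# "total_size" from the metadata block of model.safetensors.index.json.
index_total_size = 49993753248

on_disk_total = sum(shard_sizes)
overhead = on_disk_total - index_total_size

print(on_disk_total)                    # 49993837312
print(overhead)                         # 84064
print(round(on_disk_total / 1e9, 2))    # size in GB (decimal)
```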
model.safetensors.index.json ADDED
@@ -0,0 +1,766 @@
1
+ {
2
+ "metadata": {
3
+ "total_size": 49993753248
4
+ },
5
+ "weight_map": {
6
+ "model.layers.0.input_layernorm.weight": "model-00001-of-00011.safetensors",
7
+ "model.layers.0.self_attn.k_proj.bias": "model-00001-of-00011.safetensors",
8
+ "model.layers.0.self_attn.k_proj.weight": "model-00001-of-00011.safetensors",
9
+ "model.layers.0.self_attn.o_proj.bias": "model-00001-of-00011.safetensors",
10
+ "model.layers.0.self_attn.o_proj.weight": "model-00001-of-00011.safetensors",
11
+ "model.layers.0.self_attn.q_proj.bias": "model-00001-of-00011.safetensors",
12
+ "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00011.safetensors",
13
+ "model.layers.0.self_attn.sinks": "model-00001-of-00011.safetensors",
14
+ "model.layers.0.self_attn.v_proj.bias": "model-00001-of-00011.safetensors",
15
+ "model.layers.0.self_attn.v_proj.weight": "model-00001-of-00011.safetensors",
16
+ "model.layers.0.mlp.experts.down_proj_bias": "model-00001-of-00011.safetensors",
17
+ "model.layers.0.mlp.experts.down_proj_blocks": "model-00001-of-00011.safetensors",
18
+ "model.layers.0.mlp.experts.down_proj_scales": "model-00001-of-00011.safetensors",
19
+ "model.layers.0.mlp.experts.gate_up_proj_bias": "model-00001-of-00011.safetensors",
20
+ "model.layers.0.mlp.experts.gate_up_proj_blocks": "model-00001-of-00011.safetensors",
21
+ "model.layers.0.mlp.experts.gate_up_proj_scales": "model-00001-of-00011.safetensors",
22
+ "model.layers.0.mlp.router.bias": "model-00001-of-00011.safetensors",
23
+ "model.layers.0.mlp.router.weight": "model-00001-of-00011.safetensors",
24
+ "model.layers.0.post_attention_layernorm.weight": "model-00001-of-00011.safetensors",
25
+ "model.layers.10.input_layernorm.weight": "model-00001-of-00011.safetensors",
26
+ "model.layers.10.self_attn.k_proj.bias": "model-00001-of-00011.safetensors",
27
+ "model.layers.10.self_attn.k_proj.weight": "model-00001-of-00011.safetensors",
28
+ "model.layers.10.self_attn.o_proj.bias": "model-00001-of-00011.safetensors",
29
+ "model.layers.10.self_attn.o_proj.weight": "model-00001-of-00011.safetensors",
30
+ "model.layers.10.self_attn.q_proj.bias": "model-00001-of-00011.safetensors",
31
+ "model.layers.10.self_attn.q_proj.weight": "model-00001-of-00011.safetensors",
32
+ "model.layers.10.self_attn.sinks": "model-00001-of-00011.safetensors",
33
+ "model.layers.10.self_attn.v_proj.bias": "model-00001-of-00011.safetensors",
34
+ "model.layers.10.self_attn.v_proj.weight": "model-00001-of-00011.safetensors",
35
+ "model.layers.10.mlp.experts.down_proj_bias": "model-00001-of-00011.safetensors",
36
+ "model.layers.10.mlp.experts.down_proj_blocks": "model-00001-of-00011.safetensors",
37
+ "model.layers.10.mlp.experts.down_proj_scales": "model-00001-of-00011.safetensors",
38
+ "model.layers.10.mlp.experts.gate_up_proj_bias": "model-00001-of-00011.safetensors",
39
+ "model.layers.10.mlp.experts.gate_up_proj_blocks": "model-00001-of-00011.safetensors",
40
+ "model.layers.10.mlp.experts.gate_up_proj_scales": "model-00001-of-00011.safetensors",
41
+ "model.layers.10.mlp.router.bias": "model-00001-of-00011.safetensors",
42
+ "model.layers.10.mlp.router.weight": "model-00001-of-00011.safetensors",
43
+ "model.layers.10.post_attention_layernorm.weight": "model-00001-of-00011.safetensors",
44
+ "model.layers.11.input_layernorm.weight": "model-00001-of-00011.safetensors",
45
+ "model.layers.11.self_attn.k_proj.bias": "model-00001-of-00011.safetensors",
46
+ "model.layers.11.self_attn.k_proj.weight": "model-00001-of-00011.safetensors",
47
+ "model.layers.11.self_attn.o_proj.bias": "model-00001-of-00011.safetensors",
48
+ "model.layers.11.self_attn.o_proj.weight": "model-00001-of-00011.safetensors",
49
+ "model.layers.11.self_attn.q_proj.bias": "model-00001-of-00011.safetensors",
50
+ "model.layers.11.self_attn.q_proj.weight": "model-00001-of-00011.safetensors",
51
+ "model.layers.11.self_attn.sinks": "model-00001-of-00011.safetensors",
52
+ "model.layers.11.self_attn.v_proj.bias": "model-00001-of-00011.safetensors",
53
+ "model.layers.11.self_attn.v_proj.weight": "model-00001-of-00011.safetensors",
54
+ "model.layers.11.mlp.experts.down_proj_bias": "model-00001-of-00011.safetensors",
55
+ "model.layers.11.mlp.experts.down_proj_blocks": "model-00001-of-00011.safetensors",
56
+ "model.layers.11.mlp.experts.down_proj_scales": "model-00001-of-00011.safetensors",
57
+ "model.layers.11.mlp.experts.gate_up_proj_bias": "model-00001-of-00011.safetensors",
58
+ "model.layers.11.mlp.experts.gate_up_proj_blocks": "model-00002-of-00011.safetensors",
59
+ "model.layers.11.mlp.experts.gate_up_proj_scales": "model-00002-of-00011.safetensors",
60
+ "model.layers.11.mlp.router.bias": "model-00002-of-00011.safetensors",
61
+ "model.layers.11.mlp.router.weight": "model-00002-of-00011.safetensors",
62
+ "model.layers.11.post_attention_layernorm.weight": "model-00002-of-00011.safetensors",
63
+ "model.layers.12.input_layernorm.weight": "model-00002-of-00011.safetensors",
64
+ "model.layers.12.self_attn.k_proj.bias": "model-00002-of-00011.safetensors",
65
+ "model.layers.12.self_attn.k_proj.weight": "model-00002-of-00011.safetensors",
66
+ "model.layers.12.self_attn.o_proj.bias": "model-00002-of-00011.safetensors",
67
+ "model.layers.12.self_attn.o_proj.weight": "model-00002-of-00011.safetensors",
68
+ "model.layers.12.self_attn.q_proj.bias": "model-00002-of-00011.safetensors",
69
+ "model.layers.12.self_attn.q_proj.weight": "model-00002-of-00011.safetensors",
70
+ "model.layers.12.self_attn.sinks": "model-00002-of-00011.safetensors",
71
+ "model.layers.12.self_attn.v_proj.bias": "model-00002-of-00011.safetensors",
72
+ "model.layers.12.self_attn.v_proj.weight": "model-00002-of-00011.safetensors",
73
+ "model.layers.12.mlp.experts.down_proj_bias": "model-00002-of-00011.safetensors",
74
+ "model.layers.12.mlp.experts.down_proj_blocks": "model-00002-of-00011.safetensors",
75
+ "model.layers.12.mlp.experts.down_proj_scales": "model-00002-of-00011.safetensors",
76
+ "model.layers.12.mlp.experts.gate_up_proj_bias": "model-00002-of-00011.safetensors",
77
+ "model.layers.12.mlp.experts.gate_up_proj_blocks": "model-00002-of-00011.safetensors",
78
+ "model.layers.12.mlp.experts.gate_up_proj_scales": "model-00002-of-00011.safetensors",
79
+ "model.layers.12.mlp.router.bias": "model-00002-of-00011.safetensors",
80
+ "model.layers.12.mlp.router.weight": "model-00002-of-00011.safetensors",
81
+ "model.layers.12.post_attention_layernorm.weight": "model-00002-of-00011.safetensors",
82
+ "model.layers.13.input_layernorm.weight": "model-00002-of-00011.safetensors",
83
+ "model.layers.13.self_attn.k_proj.bias": "model-00002-of-00011.safetensors",
84
+ "model.layers.13.self_attn.k_proj.weight": "model-00002-of-00011.safetensors",
85
+ "model.layers.13.self_attn.o_proj.bias": "model-00002-of-00011.safetensors",
86
+ "model.layers.13.self_attn.o_proj.weight": "model-00002-of-00011.safetensors",
87
+ "model.layers.13.self_attn.q_proj.bias": "model-00002-of-00011.safetensors",
88
+ "model.layers.13.self_attn.q_proj.weight": "model-00002-of-00011.safetensors",
89
+ "model.layers.13.self_attn.sinks": "model-00002-of-00011.safetensors",
90
+ "model.layers.13.self_attn.v_proj.bias": "model-00002-of-00011.safetensors",
91
+ "model.layers.13.self_attn.v_proj.weight": "model-00002-of-00011.safetensors",
92
+ "model.layers.13.mlp.experts.down_proj_bias": "model-00002-of-00011.safetensors",
93
+ "model.layers.13.mlp.experts.down_proj_blocks": "model-00002-of-00011.safetensors",
94
+ "model.layers.13.mlp.experts.down_proj_scales": "model-00002-of-00011.safetensors",
95
+ "model.layers.13.mlp.experts.gate_up_proj_bias": "model-00002-of-00011.safetensors",
96
+ "model.layers.13.mlp.experts.gate_up_proj_blocks": "model-00002-of-00011.safetensors",
97
+ "model.layers.13.mlp.experts.gate_up_proj_scales": "model-00002-of-00011.safetensors",
98
+ "model.layers.13.mlp.router.bias": "model-00002-of-00011.safetensors",
99
+ "model.layers.13.mlp.router.weight": "model-00002-of-00011.safetensors",
100
+ "model.layers.13.post_attention_layernorm.weight": "model-00002-of-00011.safetensors",
101
+ "model.layers.14.input_layernorm.weight": "model-00002-of-00011.safetensors",
102
+ "model.layers.14.self_attn.k_proj.bias": "model-00002-of-00011.safetensors",
103
+ "model.layers.14.self_attn.k_proj.weight": "model-00002-of-00011.safetensors",
104
+ "model.layers.14.self_attn.o_proj.bias": "model-00002-of-00011.safetensors",
105
+ "model.layers.14.self_attn.o_proj.weight": "model-00002-of-00011.safetensors",
106
+ "model.layers.14.self_attn.q_proj.bias": "model-00002-of-00011.safetensors",
107
+ "model.layers.14.self_attn.q_proj.weight": "model-00002-of-00011.safetensors",
108
+ "model.layers.14.self_attn.sinks": "model-00002-of-00011.safetensors",
109
+ "model.layers.14.self_attn.v_proj.bias": "model-00002-of-00011.safetensors",
110
+ "model.layers.14.self_attn.v_proj.weight": "model-00002-of-00011.safetensors",
111
+ "model.layers.14.mlp.experts.down_proj_bias": "model-00002-of-00011.safetensors",
112
+ "model.layers.14.mlp.experts.down_proj_blocks": "model-00003-of-00011.safetensors",
113
+ "model.layers.14.mlp.experts.down_proj_scales": "model-00003-of-00011.safetensors",
114
+ "model.layers.14.mlp.experts.gate_up_proj_bias": "model-00003-of-00011.safetensors",
115
+ "model.layers.14.mlp.experts.gate_up_proj_blocks": "model-00003-of-00011.safetensors",
116
+ "model.layers.14.mlp.experts.gate_up_proj_scales": "model-00003-of-00011.safetensors",
117
+ "model.layers.14.mlp.router.bias": "model-00003-of-00011.safetensors",
118
+ "model.layers.14.mlp.router.weight": "model-00003-of-00011.safetensors",
119
+ "model.layers.14.post_attention_layernorm.weight": "model-00003-of-00011.safetensors",
120
+ "model.layers.15.input_layernorm.weight": "model-00003-of-00011.safetensors",
121
+ "model.layers.15.self_attn.k_proj.bias": "model-00003-of-00011.safetensors",
122
+ "model.layers.15.self_attn.k_proj.weight": "model-00003-of-00011.safetensors",
123
+ "model.layers.15.self_attn.o_proj.bias": "model-00003-of-00011.safetensors",
124
+ "model.layers.15.self_attn.o_proj.weight": "model-00003-of-00011.safetensors",
125
+ "model.layers.15.self_attn.q_proj.bias": "model-00003-of-00011.safetensors",
126
+ "model.layers.15.self_attn.q_proj.weight": "model-00003-of-00011.safetensors",
127
+ "model.layers.15.self_attn.sinks": "model-00003-of-00011.safetensors",
128
+ "model.layers.15.self_attn.v_proj.bias": "model-00003-of-00011.safetensors",
129
+ "model.layers.15.self_attn.v_proj.weight": "model-00003-of-00011.safetensors",
130
+ "model.layers.15.mlp.experts.down_proj_bias": "model-00003-of-00011.safetensors",
131
+ "model.layers.15.mlp.experts.down_proj_blocks": "model-00003-of-00011.safetensors",
132
+ "model.layers.15.mlp.experts.down_proj_scales": "model-00003-of-00011.safetensors",
133
+ "model.layers.15.mlp.experts.gate_up_proj_bias": "model-00003-of-00011.safetensors",
134
+ "model.layers.15.mlp.experts.gate_up_proj_blocks": "model-00003-of-00011.safetensors",
135
+ "model.layers.15.mlp.experts.gate_up_proj_scales": "model-00003-of-00011.safetensors",
136
+ "model.layers.15.mlp.router.bias": "model-00003-of-00011.safetensors",
137
+ "model.layers.15.mlp.router.weight": "model-00003-of-00011.safetensors",
138
+ "model.layers.15.post_attention_layernorm.weight": "model-00003-of-00011.safetensors",
+ "model.layers.16.input_layernorm.weight": "model-00003-of-00011.safetensors",
+ "model.layers.16.self_attn.k_proj.bias": "model-00003-of-00011.safetensors",
+ "model.layers.16.self_attn.k_proj.weight": "model-00003-of-00011.safetensors",
+ "model.layers.16.self_attn.o_proj.bias": "model-00003-of-00011.safetensors",
+ "model.layers.16.self_attn.o_proj.weight": "model-00003-of-00011.safetensors",
+ "model.layers.16.self_attn.q_proj.bias": "model-00003-of-00011.safetensors",
+ "model.layers.16.self_attn.q_proj.weight": "model-00003-of-00011.safetensors",
+ "model.layers.16.self_attn.sinks": "model-00003-of-00011.safetensors",
+ "model.layers.16.self_attn.v_proj.bias": "model-00003-of-00011.safetensors",
+ "model.layers.16.self_attn.v_proj.weight": "model-00003-of-00011.safetensors",
+ "model.layers.16.mlp.experts.down_proj_bias": "model-00003-of-00011.safetensors",
+ "model.layers.16.mlp.experts.down_proj_blocks": "model-00003-of-00011.safetensors",
+ "model.layers.16.mlp.experts.down_proj_scales": "model-00003-of-00011.safetensors",
+ "model.layers.16.mlp.experts.gate_up_proj_bias": "model-00003-of-00011.safetensors",
+ "model.layers.16.mlp.experts.gate_up_proj_blocks": "model-00003-of-00011.safetensors",
+ "model.layers.16.mlp.experts.gate_up_proj_scales": "model-00003-of-00011.safetensors",
+ "model.layers.16.mlp.router.bias": "model-00003-of-00011.safetensors",
+ "model.layers.16.mlp.router.weight": "model-00003-of-00011.safetensors",
+ "model.layers.16.post_attention_layernorm.weight": "model-00003-of-00011.safetensors",
+ "model.layers.17.input_layernorm.weight": "model-00003-of-00011.safetensors",
+ "model.layers.17.self_attn.k_proj.bias": "model-00003-of-00011.safetensors",
+ "model.layers.17.self_attn.k_proj.weight": "model-00003-of-00011.safetensors",
+ "model.layers.17.self_attn.o_proj.bias": "model-00003-of-00011.safetensors",
+ "model.layers.17.self_attn.o_proj.weight": "model-00003-of-00011.safetensors",
+ "model.layers.17.self_attn.q_proj.bias": "model-00003-of-00011.safetensors",
+ "model.layers.17.self_attn.q_proj.weight": "model-00003-of-00011.safetensors",
+ "model.layers.17.self_attn.sinks": "model-00003-of-00011.safetensors",
+ "model.layers.17.self_attn.v_proj.bias": "model-00003-of-00011.safetensors",
+ "model.layers.17.self_attn.v_proj.weight": "model-00003-of-00011.safetensors",
+ "model.layers.17.mlp.experts.down_proj_bias": "model-00003-of-00011.safetensors",
+ "model.layers.17.mlp.experts.down_proj_blocks": "model-00003-of-00011.safetensors",
+ "model.layers.17.mlp.experts.down_proj_scales": "model-00003-of-00011.safetensors",
+ "model.layers.17.mlp.experts.gate_up_proj_bias": "model-00003-of-00011.safetensors",
+ "model.layers.17.mlp.experts.gate_up_proj_blocks": "model-00004-of-00011.safetensors",
+ "model.layers.17.mlp.experts.gate_up_proj_scales": "model-00004-of-00011.safetensors",
+ "model.layers.17.mlp.router.bias": "model-00004-of-00011.safetensors",
+ "model.layers.17.mlp.router.weight": "model-00004-of-00011.safetensors",
+ "model.layers.17.post_attention_layernorm.weight": "model-00004-of-00011.safetensors",
+ "model.layers.18.input_layernorm.weight": "model-00004-of-00011.safetensors",
+ "model.layers.18.self_attn.k_proj.bias": "model-00004-of-00011.safetensors",
+ "model.layers.18.self_attn.k_proj.weight": "model-00004-of-00011.safetensors",
+ "model.layers.18.self_attn.o_proj.bias": "model-00004-of-00011.safetensors",
+ "model.layers.18.self_attn.o_proj.weight": "model-00004-of-00011.safetensors",
+ "model.layers.18.self_attn.q_proj.bias": "model-00004-of-00011.safetensors",
+ "model.layers.18.self_attn.q_proj.weight": "model-00004-of-00011.safetensors",
+ "model.layers.18.self_attn.sinks": "model-00004-of-00011.safetensors",
+ "model.layers.18.self_attn.v_proj.bias": "model-00004-of-00011.safetensors",
+ "model.layers.18.self_attn.v_proj.weight": "model-00004-of-00011.safetensors",
+ "model.layers.18.mlp.experts.down_proj_bias": "model-00004-of-00011.safetensors",
+ "model.layers.18.mlp.experts.down_proj_blocks": "model-00004-of-00011.safetensors",
+ "model.layers.18.mlp.experts.down_proj_scales": "model-00004-of-00011.safetensors",
+ "model.layers.18.mlp.experts.gate_up_proj_bias": "model-00004-of-00011.safetensors",
+ "model.layers.18.mlp.experts.gate_up_proj_blocks": "model-00004-of-00011.safetensors",
+ "model.layers.18.mlp.experts.gate_up_proj_scales": "model-00004-of-00011.safetensors",
+ "model.layers.18.mlp.router.bias": "model-00004-of-00011.safetensors",
+ "model.layers.18.mlp.router.weight": "model-00004-of-00011.safetensors",
+ "model.layers.18.post_attention_layernorm.weight": "model-00004-of-00011.safetensors",
+ "model.layers.19.input_layernorm.weight": "model-00004-of-00011.safetensors",
+ "model.layers.19.self_attn.k_proj.bias": "model-00004-of-00011.safetensors",
+ "model.layers.19.self_attn.k_proj.weight": "model-00004-of-00011.safetensors",
+ "model.layers.19.self_attn.o_proj.bias": "model-00004-of-00011.safetensors",
+ "model.layers.19.self_attn.o_proj.weight": "model-00004-of-00011.safetensors",
+ "model.layers.19.self_attn.q_proj.bias": "model-00004-of-00011.safetensors",
+ "model.layers.19.self_attn.q_proj.weight": "model-00004-of-00011.safetensors",
+ "model.layers.19.self_attn.sinks": "model-00004-of-00011.safetensors",
+ "model.layers.19.self_attn.v_proj.bias": "model-00004-of-00011.safetensors",
+ "model.layers.19.self_attn.v_proj.weight": "model-00004-of-00011.safetensors",
+ "model.layers.19.mlp.experts.down_proj_bias": "model-00004-of-00011.safetensors",
+ "model.layers.19.mlp.experts.down_proj_blocks": "model-00004-of-00011.safetensors",
+ "model.layers.19.mlp.experts.down_proj_scales": "model-00004-of-00011.safetensors",
+ "model.layers.19.mlp.experts.gate_up_proj_bias": "model-00004-of-00011.safetensors",
+ "model.layers.19.mlp.experts.gate_up_proj_blocks": "model-00004-of-00011.safetensors",
+ "model.layers.19.mlp.experts.gate_up_proj_scales": "model-00004-of-00011.safetensors",
+ "model.layers.19.mlp.router.bias": "model-00004-of-00011.safetensors",
+ "model.layers.19.mlp.router.weight": "model-00004-of-00011.safetensors",
+ "model.layers.19.post_attention_layernorm.weight": "model-00004-of-00011.safetensors",
+ "model.layers.1.input_layernorm.weight": "model-00004-of-00011.safetensors",
+ "model.layers.1.self_attn.k_proj.bias": "model-00004-of-00011.safetensors",
+ "model.layers.1.self_attn.k_proj.weight": "model-00004-of-00011.safetensors",
+ "model.layers.1.self_attn.o_proj.bias": "model-00004-of-00011.safetensors",
+ "model.layers.1.self_attn.o_proj.weight": "model-00004-of-00011.safetensors",
+ "model.layers.1.self_attn.q_proj.bias": "model-00004-of-00011.safetensors",
+ "model.layers.1.self_attn.q_proj.weight": "model-00004-of-00011.safetensors",
+ "model.layers.1.self_attn.sinks": "model-00004-of-00011.safetensors",
+ "model.layers.1.self_attn.v_proj.bias": "model-00004-of-00011.safetensors",
+ "model.layers.1.self_attn.v_proj.weight": "model-00004-of-00011.safetensors",
+ "model.layers.1.mlp.experts.down_proj_bias": "model-00004-of-00011.safetensors",
+ "model.layers.1.mlp.experts.down_proj_blocks": "model-00004-of-00011.safetensors",
+ "model.layers.1.mlp.experts.down_proj_scales": "model-00004-of-00011.safetensors",
+ "model.layers.1.mlp.experts.gate_up_proj_bias": "model-00004-of-00011.safetensors",
+ "model.layers.1.mlp.experts.gate_up_proj_blocks": "model-00004-of-00011.safetensors",
+ "model.layers.1.mlp.experts.gate_up_proj_scales": "model-00004-of-00011.safetensors",
+ "model.layers.1.mlp.router.bias": "model-00004-of-00011.safetensors",
+ "model.layers.1.mlp.router.weight": "model-00004-of-00011.safetensors",
+ "model.layers.1.post_attention_layernorm.weight": "model-00004-of-00011.safetensors",
+ "model.layers.20.input_layernorm.weight": "model-00004-of-00011.safetensors",
+ "model.layers.20.self_attn.k_proj.bias": "model-00004-of-00011.safetensors",
+ "model.layers.20.self_attn.k_proj.weight": "model-00004-of-00011.safetensors",
+ "model.layers.20.self_attn.o_proj.bias": "model-00004-of-00011.safetensors",
+ "model.layers.20.self_attn.o_proj.weight": "model-00004-of-00011.safetensors",
+ "model.layers.20.self_attn.q_proj.bias": "model-00004-of-00011.safetensors",
+ "model.layers.20.self_attn.q_proj.weight": "model-00005-of-00011.safetensors",
+ "model.layers.20.self_attn.sinks": "model-00005-of-00011.safetensors",
+ "model.layers.20.self_attn.v_proj.bias": "model-00005-of-00011.safetensors",
+ "model.layers.20.self_attn.v_proj.weight": "model-00005-of-00011.safetensors",
+ "model.layers.20.mlp.experts.down_proj_bias": "model-00005-of-00011.safetensors",
+ "model.layers.20.mlp.experts.down_proj_blocks": "model-00005-of-00011.safetensors",
+ "model.layers.20.mlp.experts.down_proj_scales": "model-00005-of-00011.safetensors",
+ "model.layers.20.mlp.experts.gate_up_proj_bias": "model-00005-of-00011.safetensors",
+ "model.layers.20.mlp.experts.gate_up_proj_blocks": "model-00005-of-00011.safetensors",
+ "model.layers.20.mlp.experts.gate_up_proj_scales": "model-00005-of-00011.safetensors",
+ "model.layers.20.mlp.router.bias": "model-00005-of-00011.safetensors",
+ "model.layers.20.mlp.router.weight": "model-00005-of-00011.safetensors",
+ "model.layers.20.post_attention_layernorm.weight": "model-00005-of-00011.safetensors",
+ "model.layers.21.input_layernorm.weight": "model-00005-of-00011.safetensors",
+ "model.layers.21.self_attn.k_proj.bias": "model-00005-of-00011.safetensors",
+ "model.layers.21.self_attn.k_proj.weight": "model-00005-of-00011.safetensors",
+ "model.layers.21.self_attn.o_proj.bias": "model-00005-of-00011.safetensors",
+ "model.layers.21.self_attn.o_proj.weight": "model-00005-of-00011.safetensors",
+ "model.layers.21.self_attn.q_proj.bias": "model-00005-of-00011.safetensors",
+ "model.layers.21.self_attn.q_proj.weight": "model-00005-of-00011.safetensors",
+ "model.layers.21.self_attn.sinks": "model-00005-of-00011.safetensors",
+ "model.layers.21.self_attn.v_proj.bias": "model-00005-of-00011.safetensors",
+ "model.layers.21.self_attn.v_proj.weight": "model-00005-of-00011.safetensors",
+ "model.layers.21.mlp.experts.down_proj_bias": "model-00005-of-00011.safetensors",
+ "model.layers.21.mlp.experts.down_proj_blocks": "model-00005-of-00011.safetensors",
+ "model.layers.21.mlp.experts.down_proj_scales": "model-00005-of-00011.safetensors",
+ "model.layers.21.mlp.experts.gate_up_proj_bias": "model-00005-of-00011.safetensors",
+ "model.layers.21.mlp.experts.gate_up_proj_blocks": "model-00005-of-00011.safetensors",
+ "model.layers.21.mlp.experts.gate_up_proj_scales": "model-00005-of-00011.safetensors",
+ "model.layers.21.mlp.router.bias": "model-00005-of-00011.safetensors",
+ "model.layers.21.mlp.router.weight": "model-00005-of-00011.safetensors",
+ "model.layers.21.post_attention_layernorm.weight": "model-00005-of-00011.safetensors",
+ "model.layers.22.input_layernorm.weight": "model-00005-of-00011.safetensors",
+ "model.layers.22.self_attn.k_proj.bias": "model-00005-of-00011.safetensors",
+ "model.layers.22.self_attn.k_proj.weight": "model-00005-of-00011.safetensors",
+ "model.layers.22.self_attn.o_proj.bias": "model-00005-of-00011.safetensors",
+ "model.layers.22.self_attn.o_proj.weight": "model-00005-of-00011.safetensors",
+ "model.layers.22.self_attn.q_proj.bias": "model-00005-of-00011.safetensors",
+ "model.layers.22.self_attn.q_proj.weight": "model-00005-of-00011.safetensors",
+ "model.layers.22.self_attn.sinks": "model-00005-of-00011.safetensors",
+ "model.layers.22.self_attn.v_proj.bias": "model-00005-of-00011.safetensors",
+ "model.layers.22.self_attn.v_proj.weight": "model-00005-of-00011.safetensors",
+ "model.layers.22.mlp.experts.down_proj_bias": "model-00005-of-00011.safetensors",
+ "model.layers.22.mlp.experts.down_proj_blocks": "model-00005-of-00011.safetensors",
+ "model.layers.22.mlp.experts.down_proj_scales": "model-00005-of-00011.safetensors",
+ "model.layers.22.mlp.experts.gate_up_proj_bias": "model-00005-of-00011.safetensors",
+ "model.layers.22.mlp.experts.gate_up_proj_blocks": "model-00005-of-00011.safetensors",
+ "model.layers.22.mlp.experts.gate_up_proj_scales": "model-00005-of-00011.safetensors",
+ "model.layers.22.mlp.router.bias": "model-00005-of-00011.safetensors",
+ "model.layers.22.mlp.router.weight": "model-00005-of-00011.safetensors",
+ "model.layers.22.post_attention_layernorm.weight": "model-00005-of-00011.safetensors",
+ "model.layers.23.input_layernorm.weight": "model-00005-of-00011.safetensors",
+ "model.layers.23.self_attn.k_proj.bias": "model-00005-of-00011.safetensors",
+ "model.layers.23.self_attn.k_proj.weight": "model-00005-of-00011.safetensors",
+ "model.layers.23.self_attn.o_proj.bias": "model-00005-of-00011.safetensors",
+ "model.layers.23.self_attn.o_proj.weight": "model-00005-of-00011.safetensors",
+ "model.layers.23.self_attn.q_proj.bias": "model-00005-of-00011.safetensors",
+ "model.layers.23.self_attn.q_proj.weight": "model-00005-of-00011.safetensors",
+ "model.layers.23.self_attn.sinks": "model-00005-of-00011.safetensors",
+ "model.layers.23.self_attn.v_proj.bias": "model-00005-of-00011.safetensors",
+ "model.layers.23.self_attn.v_proj.weight": "model-00005-of-00011.safetensors",
+ "model.layers.23.mlp.experts.down_proj_bias": "model-00005-of-00011.safetensors",
+ "model.layers.23.mlp.experts.down_proj_blocks": "model-00005-of-00011.safetensors",
+ "model.layers.23.mlp.experts.down_proj_scales": "model-00005-of-00011.safetensors",
+ "model.layers.23.mlp.experts.gate_up_proj_bias": "model-00005-of-00011.safetensors",
+ "model.layers.23.mlp.experts.gate_up_proj_blocks": "model-00005-of-00011.safetensors",
+ "model.layers.23.mlp.experts.gate_up_proj_scales": "model-00005-of-00011.safetensors",
+ "model.layers.23.mlp.router.bias": "model-00005-of-00011.safetensors",
+ "model.layers.23.mlp.router.weight": "model-00005-of-00011.safetensors",
+ "model.layers.23.post_attention_layernorm.weight": "model-00005-of-00011.safetensors",
+ "model.layers.24.input_layernorm.weight": "model-00005-of-00011.safetensors",
+ "model.layers.24.self_attn.k_proj.bias": "model-00005-of-00011.safetensors",
+ "model.layers.24.self_attn.k_proj.weight": "model-00005-of-00011.safetensors",
+ "model.layers.24.self_attn.o_proj.bias": "model-00005-of-00011.safetensors",
+ "model.layers.24.self_attn.o_proj.weight": "model-00005-of-00011.safetensors",
+ "model.layers.24.self_attn.q_proj.bias": "model-00005-of-00011.safetensors",
+ "model.layers.24.self_attn.q_proj.weight": "model-00005-of-00011.safetensors",
+ "model.layers.24.self_attn.sinks": "model-00005-of-00011.safetensors",
+ "model.layers.24.self_attn.v_proj.bias": "model-00005-of-00011.safetensors",
+ "model.layers.24.self_attn.v_proj.weight": "model-00005-of-00011.safetensors",
+ "model.layers.24.mlp.experts.down_proj_bias": "model-00005-of-00011.safetensors",
+ "model.layers.24.mlp.experts.down_proj_blocks": "model-00005-of-00011.safetensors",
+ "model.layers.24.mlp.experts.down_proj_scales": "model-00005-of-00011.safetensors",
+ "model.layers.24.mlp.experts.gate_up_proj_bias": "model-00005-of-00011.safetensors",
+ "model.layers.24.mlp.experts.gate_up_proj_blocks": "model-00006-of-00011.safetensors",
+ "model.layers.24.mlp.experts.gate_up_proj_scales": "model-00006-of-00011.safetensors",
+ "model.layers.24.mlp.router.bias": "model-00006-of-00011.safetensors",
+ "model.layers.24.mlp.router.weight": "model-00006-of-00011.safetensors",
+ "model.layers.24.post_attention_layernorm.weight": "model-00006-of-00011.safetensors",
+ "model.layers.25.input_layernorm.weight": "model-00006-of-00011.safetensors",
+ "model.layers.25.self_attn.k_proj.bias": "model-00006-of-00011.safetensors",
+ "model.layers.25.self_attn.k_proj.weight": "model-00006-of-00011.safetensors",
+ "model.layers.25.self_attn.o_proj.bias": "model-00006-of-00011.safetensors",
+ "model.layers.25.self_attn.o_proj.weight": "model-00006-of-00011.safetensors",
+ "model.layers.25.self_attn.q_proj.bias": "model-00006-of-00011.safetensors",
+ "model.layers.25.self_attn.q_proj.weight": "model-00006-of-00011.safetensors",
+ "model.layers.25.self_attn.sinks": "model-00006-of-00011.safetensors",
+ "model.layers.25.self_attn.v_proj.bias": "model-00006-of-00011.safetensors",
+ "model.layers.25.self_attn.v_proj.weight": "model-00006-of-00011.safetensors",
+ "model.layers.25.mlp.experts.down_proj_bias": "model-00006-of-00011.safetensors",
+ "model.layers.25.mlp.experts.down_proj_blocks": "model-00006-of-00011.safetensors",
+ "model.layers.25.mlp.experts.down_proj_scales": "model-00006-of-00011.safetensors",
+ "model.layers.25.mlp.experts.gate_up_proj_bias": "model-00006-of-00011.safetensors",
+ "model.layers.25.mlp.experts.gate_up_proj_blocks": "model-00006-of-00011.safetensors",
+ "model.layers.25.mlp.experts.gate_up_proj_scales": "model-00006-of-00011.safetensors",
+ "model.layers.25.mlp.router.bias": "model-00006-of-00011.safetensors",
+ "model.layers.25.mlp.router.weight": "model-00006-of-00011.safetensors",
+ "model.layers.25.post_attention_layernorm.weight": "model-00006-of-00011.safetensors",
+ "model.layers.26.input_layernorm.weight": "model-00006-of-00011.safetensors",
+ "model.layers.26.self_attn.k_proj.bias": "model-00006-of-00011.safetensors",
+ "model.layers.26.self_attn.k_proj.weight": "model-00006-of-00011.safetensors",
+ "model.layers.26.self_attn.o_proj.bias": "model-00006-of-00011.safetensors",
+ "model.layers.26.self_attn.o_proj.weight": "model-00006-of-00011.safetensors",
+ "model.layers.26.self_attn.q_proj.bias": "model-00006-of-00011.safetensors",
+ "model.layers.26.self_attn.q_proj.weight": "model-00006-of-00011.safetensors",
+ "model.layers.26.self_attn.sinks": "model-00006-of-00011.safetensors",
+ "model.layers.26.self_attn.v_proj.bias": "model-00006-of-00011.safetensors",
+ "model.layers.26.self_attn.v_proj.weight": "model-00006-of-00011.safetensors",
+ "model.layers.26.mlp.experts.down_proj_bias": "model-00006-of-00011.safetensors",
+ "model.layers.26.mlp.experts.down_proj_blocks": "model-00006-of-00011.safetensors",
+ "model.layers.26.mlp.experts.down_proj_scales": "model-00006-of-00011.safetensors",
+ "model.layers.26.mlp.experts.gate_up_proj_bias": "model-00006-of-00011.safetensors",
+ "model.layers.26.mlp.experts.gate_up_proj_blocks": "model-00006-of-00011.safetensors",
+ "model.layers.26.mlp.experts.gate_up_proj_scales": "model-00006-of-00011.safetensors",
+ "model.layers.26.mlp.router.bias": "model-00006-of-00011.safetensors",
+ "model.layers.26.mlp.router.weight": "model-00006-of-00011.safetensors",
+ "model.layers.26.post_attention_layernorm.weight": "model-00006-of-00011.safetensors",
+ "model.layers.27.input_layernorm.weight": "model-00006-of-00011.safetensors",
+ "model.layers.27.self_attn.k_proj.bias": "model-00006-of-00011.safetensors",
+ "model.layers.27.self_attn.k_proj.weight": "model-00006-of-00011.safetensors",
+ "model.layers.27.self_attn.o_proj.bias": "model-00006-of-00011.safetensors",
+ "model.layers.27.self_attn.o_proj.weight": "model-00006-of-00011.safetensors",
+ "model.layers.27.self_attn.q_proj.bias": "model-00006-of-00011.safetensors",
+ "model.layers.27.self_attn.q_proj.weight": "model-00006-of-00011.safetensors",
+ "model.layers.27.self_attn.sinks": "model-00006-of-00011.safetensors",
+ "model.layers.27.self_attn.v_proj.bias": "model-00006-of-00011.safetensors",
+ "model.layers.27.self_attn.v_proj.weight": "model-00006-of-00011.safetensors",
+ "model.layers.27.mlp.experts.down_proj_bias": "model-00006-of-00011.safetensors",
+ "model.layers.27.mlp.experts.down_proj_blocks": "model-00006-of-00011.safetensors",
+ "model.layers.27.mlp.experts.down_proj_scales": "model-00006-of-00011.safetensors",
+ "model.layers.27.mlp.experts.gate_up_proj_bias": "model-00006-of-00011.safetensors",
+ "model.layers.27.mlp.experts.gate_up_proj_blocks": "model-00006-of-00011.safetensors",
+ "model.layers.27.mlp.experts.gate_up_proj_scales": "model-00006-of-00011.safetensors",
+ "model.layers.27.mlp.router.bias": "model-00006-of-00011.safetensors",
+ "model.layers.27.mlp.router.weight": "model-00006-of-00011.safetensors",
+ "model.layers.27.post_attention_layernorm.weight": "model-00006-of-00011.safetensors",
+ "model.layers.28.input_layernorm.weight": "model-00006-of-00011.safetensors",
+ "model.layers.28.self_attn.k_proj.bias": "model-00006-of-00011.safetensors",
+ "model.layers.28.self_attn.k_proj.weight": "model-00006-of-00011.safetensors",
+ "model.layers.28.self_attn.o_proj.bias": "model-00006-of-00011.safetensors",
+ "model.layers.28.self_attn.o_proj.weight": "model-00006-of-00011.safetensors",
+ "model.layers.28.self_attn.q_proj.bias": "model-00006-of-00011.safetensors",
+ "model.layers.28.self_attn.q_proj.weight": "model-00006-of-00011.safetensors",
+ "model.layers.28.self_attn.sinks": "model-00006-of-00011.safetensors",
+ "model.layers.28.self_attn.v_proj.bias": "model-00006-of-00011.safetensors",
+ "model.layers.28.self_attn.v_proj.weight": "model-00006-of-00011.safetensors",
+ "model.layers.28.mlp.experts.down_proj_bias": "model-00006-of-00011.safetensors",
+ "model.layers.28.mlp.experts.down_proj_blocks": "model-00006-of-00011.safetensors",
+ "model.layers.28.mlp.experts.down_proj_scales": "model-00006-of-00011.safetensors",
+ "model.layers.28.mlp.experts.gate_up_proj_bias": "model-00006-of-00011.safetensors",
+ "model.layers.28.mlp.experts.gate_up_proj_blocks": "model-00006-of-00011.safetensors",
+ "model.layers.28.mlp.experts.gate_up_proj_scales": "model-00006-of-00011.safetensors",
+ "model.layers.28.mlp.router.bias": "model-00006-of-00011.safetensors",
+ "model.layers.28.mlp.router.weight": "model-00006-of-00011.safetensors",
+ "model.layers.28.post_attention_layernorm.weight": "model-00006-of-00011.safetensors",
+ "model.layers.29.input_layernorm.weight": "model-00006-of-00011.safetensors",
+ "model.layers.29.self_attn.k_proj.bias": "model-00006-of-00011.safetensors",
+ "model.layers.29.self_attn.k_proj.weight": "model-00006-of-00011.safetensors",
+ "model.layers.29.self_attn.o_proj.bias": "model-00006-of-00011.safetensors",
+ "model.layers.29.self_attn.o_proj.weight": "model-00006-of-00011.safetensors",
+ "model.layers.29.self_attn.q_proj.bias": "model-00006-of-00011.safetensors",
+ "model.layers.29.self_attn.q_proj.weight": "model-00006-of-00011.safetensors",
+ "model.layers.29.self_attn.sinks": "model-00006-of-00011.safetensors",
+ "model.layers.29.self_attn.v_proj.bias": "model-00006-of-00011.safetensors",
+ "model.layers.29.self_attn.v_proj.weight": "model-00006-of-00011.safetensors",
+ "model.layers.29.mlp.experts.down_proj_bias": "model-00006-of-00011.safetensors",
+ "model.layers.29.mlp.experts.down_proj_blocks": "model-00006-of-00011.safetensors",
+ "model.layers.29.mlp.experts.down_proj_scales": "model-00006-of-00011.safetensors",
+ "model.layers.29.mlp.experts.gate_up_proj_bias": "model-00006-of-00011.safetensors",
+ "model.layers.29.mlp.experts.gate_up_proj_blocks": "model-00007-of-00011.safetensors",
+ "model.layers.29.mlp.experts.gate_up_proj_scales": "model-00007-of-00011.safetensors",
+ "model.layers.29.mlp.router.bias": "model-00007-of-00011.safetensors",
+ "model.layers.29.mlp.router.weight": "model-00007-of-00011.safetensors",
+ "model.layers.29.post_attention_layernorm.weight": "model-00007-of-00011.safetensors",
+ "model.layers.2.input_layernorm.weight": "model-00007-of-00011.safetensors",
+ "model.layers.2.self_attn.k_proj.bias": "model-00007-of-00011.safetensors",
+ "model.layers.2.self_attn.k_proj.weight": "model-00007-of-00011.safetensors",
+ "model.layers.2.self_attn.o_proj.bias": "model-00007-of-00011.safetensors",
+ "model.layers.2.self_attn.o_proj.weight": "model-00007-of-00011.safetensors",
+ "model.layers.2.self_attn.q_proj.bias": "model-00007-of-00011.safetensors",
+ "model.layers.2.self_attn.q_proj.weight": "model-00007-of-00011.safetensors",
+ "model.layers.2.self_attn.sinks": "model-00007-of-00011.safetensors",
+ "model.layers.2.self_attn.v_proj.bias": "model-00007-of-00011.safetensors",
+ "model.layers.2.self_attn.v_proj.weight": "model-00007-of-00011.safetensors",
+ "model.layers.2.mlp.experts.down_proj_bias": "model-00007-of-00011.safetensors",
+ "model.layers.2.mlp.experts.down_proj_blocks": "model-00007-of-00011.safetensors",
+ "model.layers.2.mlp.experts.down_proj_scales": "model-00007-of-00011.safetensors",
+ "model.layers.2.mlp.experts.gate_up_proj_bias": "model-00007-of-00011.safetensors",
+ "model.layers.2.mlp.experts.gate_up_proj_blocks": "model-00007-of-00011.safetensors",
+ "model.layers.2.mlp.experts.gate_up_proj_scales": "model-00007-of-00011.safetensors",
+ "model.layers.2.mlp.router.bias": "model-00007-of-00011.safetensors",
+ "model.layers.2.mlp.router.weight": "model-00007-of-00011.safetensors",
+ "model.layers.2.post_attention_layernorm.weight": "model-00007-of-00011.safetensors",
+ "model.layers.30.input_layernorm.weight": "model-00007-of-00011.safetensors",
+ "model.layers.30.self_attn.k_proj.bias": "model-00007-of-00011.safetensors",
+ "model.layers.30.self_attn.k_proj.weight": "model-00007-of-00011.safetensors",
+ "model.layers.30.self_attn.o_proj.bias": "model-00007-of-00011.safetensors",
+ "model.layers.30.self_attn.o_proj.weight": "model-00007-of-00011.safetensors",
+ "model.layers.30.self_attn.q_proj.bias": "model-00007-of-00011.safetensors",
+ "model.layers.30.self_attn.q_proj.weight": "model-00007-of-00011.safetensors",
+ "model.layers.30.self_attn.sinks": "model-00007-of-00011.safetensors",
+ "model.layers.30.self_attn.v_proj.bias": "model-00007-of-00011.safetensors",
+ "model.layers.30.self_attn.v_proj.weight": "model-00007-of-00011.safetensors",
+ "model.layers.30.mlp.experts.down_proj_bias": "model-00007-of-00011.safetensors",
+ "model.layers.30.mlp.experts.down_proj_blocks": "model-00007-of-00011.safetensors",
+ "model.layers.30.mlp.experts.down_proj_scales": "model-00007-of-00011.safetensors",
+ "model.layers.30.mlp.experts.gate_up_proj_bias": "model-00007-of-00011.safetensors",
+ "model.layers.30.mlp.experts.gate_up_proj_blocks": "model-00007-of-00011.safetensors",
+ "model.layers.30.mlp.experts.gate_up_proj_scales": "model-00007-of-00011.safetensors",
+ "model.layers.30.mlp.router.bias": "model-00007-of-00011.safetensors",
+ "model.layers.30.mlp.router.weight": "model-00007-of-00011.safetensors",
+ "model.layers.30.post_attention_layernorm.weight": "model-00007-of-00011.safetensors",
+ "model.layers.31.input_layernorm.weight": "model-00007-of-00011.safetensors",
+ "model.layers.31.self_attn.k_proj.bias": "model-00007-of-00011.safetensors",
+ "model.layers.31.self_attn.k_proj.weight": "model-00007-of-00011.safetensors",
+ "model.layers.31.self_attn.o_proj.bias": "model-00007-of-00011.safetensors",
+ "model.layers.31.self_attn.o_proj.weight": "model-00007-of-00011.safetensors",
+ "model.layers.31.self_attn.q_proj.bias": "model-00007-of-00011.safetensors",
+ "model.layers.31.self_attn.q_proj.weight": "model-00007-of-00011.safetensors",
+ "model.layers.31.self_attn.sinks": "model-00007-of-00011.safetensors",
+ "model.layers.31.self_attn.v_proj.bias": "model-00007-of-00011.safetensors",
+ "model.layers.31.self_attn.v_proj.weight": "model-00007-of-00011.safetensors",
+ "model.layers.31.mlp.experts.down_proj_bias": "model-00007-of-00011.safetensors",
+ "model.layers.31.mlp.experts.down_proj_blocks": "model-00007-of-00011.safetensors",
+ "model.layers.31.mlp.experts.down_proj_scales": "model-00007-of-00011.safetensors",
+ "model.layers.31.mlp.experts.gate_up_proj_bias": "model-00007-of-00011.safetensors",
+ "model.layers.31.mlp.experts.gate_up_proj_blocks": "model-00007-of-00011.safetensors",
+ "model.layers.31.mlp.experts.gate_up_proj_scales": "model-00007-of-00011.safetensors",
+ "model.layers.31.mlp.router.bias": "model-00007-of-00011.safetensors",
+ "model.layers.31.mlp.router.weight": "model-00007-of-00011.safetensors",
+ "model.layers.31.post_attention_layernorm.weight": "model-00007-of-00011.safetensors",
+ "model.layers.32.input_layernorm.weight": "model-00007-of-00011.safetensors",
+ "model.layers.32.self_attn.k_proj.bias": "model-00007-of-00011.safetensors",
+ "model.layers.32.self_attn.k_proj.weight": "model-00007-of-00011.safetensors",
+ "model.layers.32.self_attn.o_proj.bias": "model-00007-of-00011.safetensors",
+ "model.layers.32.self_attn.o_proj.weight": "model-00007-of-00011.safetensors",
+ "model.layers.32.self_attn.q_proj.bias": "model-00007-of-00011.safetensors",
+ "model.layers.32.self_attn.q_proj.weight": "model-00007-of-00011.safetensors",
+ "model.layers.32.self_attn.sinks": "model-00007-of-00011.safetensors",
+ "model.layers.32.self_attn.v_proj.bias": "model-00007-of-00011.safetensors",
+ "model.layers.32.self_attn.v_proj.weight": "model-00007-of-00011.safetensors",
+ "model.layers.32.mlp.experts.down_proj_bias": "model-00007-of-00011.safetensors",
+ "model.layers.32.mlp.experts.down_proj_blocks": "model-00007-of-00011.safetensors",
+ "model.layers.32.mlp.experts.down_proj_scales": "model-00007-of-00011.safetensors",
+ "model.layers.32.mlp.experts.gate_up_proj_bias": "model-00007-of-00011.safetensors",
+ "model.layers.32.mlp.experts.gate_up_proj_blocks": "model-00007-of-00011.safetensors",
+ "model.layers.32.mlp.experts.gate_up_proj_scales": "model-00008-of-00011.safetensors",
+ "model.layers.32.mlp.router.bias": "model-00008-of-00011.safetensors",
+ "model.layers.32.mlp.router.weight": "model-00008-of-00011.safetensors",
+ "model.layers.32.post_attention_layernorm.weight": "model-00008-of-00011.safetensors",
+ "model.layers.33.input_layernorm.weight": "model-00008-of-00011.safetensors",
+ "model.layers.33.self_attn.k_proj.bias": "model-00008-of-00011.safetensors",
+ "model.layers.33.self_attn.k_proj.weight": "model-00008-of-00011.safetensors",
+ "model.layers.33.self_attn.o_proj.bias": "model-00008-of-00011.safetensors",
+ "model.layers.33.self_attn.o_proj.weight": "model-00008-of-00011.safetensors",
+ "model.layers.33.self_attn.q_proj.bias": "model-00008-of-00011.safetensors",
+ "model.layers.33.self_attn.q_proj.weight": "model-00008-of-00011.safetensors",
+ "model.layers.33.self_attn.sinks": "model-00008-of-00011.safetensors",
+ "model.layers.33.self_attn.v_proj.bias": "model-00008-of-00011.safetensors",
+ "model.layers.33.self_attn.v_proj.weight": "model-00008-of-00011.safetensors",
+ "model.layers.33.mlp.experts.down_proj_bias": "model-00008-of-00011.safetensors",
+ "model.layers.33.mlp.experts.down_proj_blocks": "model-00008-of-00011.safetensors",
+ "model.layers.33.mlp.experts.down_proj_scales": "model-00008-of-00011.safetensors",
+ "model.layers.33.mlp.experts.gate_up_proj_bias": "model-00008-of-00011.safetensors",
+ "model.layers.33.mlp.experts.gate_up_proj_blocks": "model-00008-of-00011.safetensors",
+ "model.layers.33.mlp.experts.gate_up_proj_scales": "model-00008-of-00011.safetensors",
+ "model.layers.33.mlp.router.bias": "model-00008-of-00011.safetensors",
+ "model.layers.33.mlp.router.weight": "model-00008-of-00011.safetensors",
+ "model.layers.33.post_attention_layernorm.weight": "model-00008-of-00011.safetensors",
+ "model.layers.34.input_layernorm.weight": "model-00008-of-00011.safetensors",
+ "model.layers.34.self_attn.k_proj.bias": "model-00008-of-00011.safetensors",
+ "model.layers.34.self_attn.k_proj.weight": "model-00008-of-00011.safetensors",
+ "model.layers.34.self_attn.o_proj.bias": "model-00008-of-00011.safetensors",
+ "model.layers.34.self_attn.o_proj.weight": "model-00008-of-00011.safetensors",
+ "model.layers.34.self_attn.q_proj.bias": "model-00008-of-00011.safetensors",
+ "model.layers.34.self_attn.q_proj.weight": "model-00008-of-00011.safetensors",
+ "model.layers.34.self_attn.sinks": "model-00008-of-00011.safetensors",
+ "model.layers.34.self_attn.v_proj.bias": "model-00008-of-00011.safetensors",
+ "model.layers.34.self_attn.v_proj.weight": "model-00008-of-00011.safetensors",
+ "model.layers.34.mlp.experts.down_proj_bias": "model-00008-of-00011.safetensors",
+ "model.layers.34.mlp.experts.down_proj_blocks": "model-00008-of-00011.safetensors",
+ "model.layers.34.mlp.experts.down_proj_scales": "model-00008-of-00011.safetensors",
+ "model.layers.34.mlp.experts.gate_up_proj_bias": "model-00008-of-00011.safetensors",
+ "model.layers.34.mlp.experts.gate_up_proj_blocks": "model-00008-of-00011.safetensors",
+ "model.layers.34.mlp.experts.gate_up_proj_scales": "model-00008-of-00011.safetensors",
+ "model.layers.34.mlp.router.bias": "model-00008-of-00011.safetensors",
+ "model.layers.34.mlp.router.weight": "model-00008-of-00011.safetensors",
+ "model.layers.34.post_attention_layernorm.weight": "model-00008-of-00011.safetensors",
538
+ "model.layers.35.input_layernorm.weight": "model-00008-of-00011.safetensors",
539
+ "model.layers.35.self_attn.k_proj.bias": "model-00008-of-00011.safetensors",
540
+ "model.layers.35.self_attn.k_proj.weight": "model-00008-of-00011.safetensors",
541
+ "model.layers.35.self_attn.o_proj.bias": "model-00008-of-00011.safetensors",
542
+ "model.layers.35.self_attn.o_proj.weight": "model-00008-of-00011.safetensors",
543
+ "model.layers.35.self_attn.q_proj.bias": "model-00008-of-00011.safetensors",
544
+ "model.layers.35.self_attn.q_proj.weight": "model-00008-of-00011.safetensors",
545
+ "model.layers.35.self_attn.sinks": "model-00008-of-00011.safetensors",
546
+ "model.layers.35.self_attn.v_proj.bias": "model-00008-of-00011.safetensors",
547
+ "model.layers.35.self_attn.v_proj.weight": "model-00008-of-00011.safetensors",
548
+ "model.layers.35.mlp.experts.down_proj_bias": "model-00008-of-00011.safetensors",
549
+ "model.layers.35.mlp.experts.down_proj_blocks": "model-00008-of-00011.safetensors",
550
+ "model.layers.35.mlp.experts.down_proj_scales": "model-00008-of-00011.safetensors",
551
+ "model.layers.35.mlp.experts.gate_up_proj_bias": "model-00008-of-00011.safetensors",
552
+ "model.layers.35.mlp.experts.gate_up_proj_blocks": "model-00008-of-00011.safetensors",
553
+ "model.layers.35.mlp.experts.gate_up_proj_scales": "model-00008-of-00011.safetensors",
554
+ "model.layers.35.mlp.router.bias": "model-00008-of-00011.safetensors",
555
+ "model.layers.35.mlp.router.weight": "model-00008-of-00011.safetensors",
556
+ "model.layers.35.post_attention_layernorm.weight": "model-00008-of-00011.safetensors",
557
+ "model.layers.3.input_layernorm.weight": "model-00008-of-00011.safetensors",
558
+ "model.layers.3.self_attn.k_proj.bias": "model-00008-of-00011.safetensors",
559
+ "model.layers.3.self_attn.k_proj.weight": "model-00008-of-00011.safetensors",
560
+ "model.layers.3.self_attn.o_proj.bias": "model-00008-of-00011.safetensors",
561
+ "model.layers.3.self_attn.o_proj.weight": "model-00008-of-00011.safetensors",
562
+ "model.layers.3.self_attn.q_proj.bias": "model-00008-of-00011.safetensors",
563
+ "model.layers.3.self_attn.q_proj.weight": "model-00008-of-00011.safetensors",
564
+ "model.layers.3.self_attn.sinks": "model-00008-of-00011.safetensors",
565
+ "model.layers.3.self_attn.v_proj.bias": "model-00008-of-00011.safetensors",
566
+ "model.layers.3.self_attn.v_proj.weight": "model-00008-of-00011.safetensors",
567
+ "model.layers.3.mlp.experts.down_proj_bias": "model-00008-of-00011.safetensors",
568
+ "model.layers.3.mlp.experts.down_proj_blocks": "model-00008-of-00011.safetensors",
569
+ "model.layers.3.mlp.experts.down_proj_scales": "model-00008-of-00011.safetensors",
570
+ "model.layers.3.mlp.experts.gate_up_proj_bias": "model-00008-of-00011.safetensors",
571
+ "model.layers.3.mlp.experts.gate_up_proj_blocks": "model-00008-of-00011.safetensors",
572
+ "model.layers.3.mlp.experts.gate_up_proj_scales": "model-00008-of-00011.safetensors",
573
+ "model.layers.3.mlp.router.bias": "model-00008-of-00011.safetensors",
574
+ "model.layers.3.mlp.router.weight": "model-00008-of-00011.safetensors",
575
+ "model.layers.3.post_attention_layernorm.weight": "model-00008-of-00011.safetensors",
576
+ "model.layers.4.input_layernorm.weight": "model-00008-of-00011.safetensors",
577
+ "model.layers.4.self_attn.k_proj.bias": "model-00008-of-00011.safetensors",
578
+ "model.layers.4.self_attn.k_proj.weight": "model-00008-of-00011.safetensors",
579
+ "model.layers.4.self_attn.o_proj.bias": "model-00008-of-00011.safetensors",
580
+ "model.layers.4.self_attn.o_proj.weight": "model-00008-of-00011.safetensors",
581
+ "model.layers.4.self_attn.q_proj.bias": "model-00008-of-00011.safetensors",
582
+ "model.layers.4.self_attn.q_proj.weight": "model-00008-of-00011.safetensors",
583
+ "model.layers.4.self_attn.sinks": "model-00008-of-00011.safetensors",
584
+ "model.layers.4.self_attn.v_proj.bias": "model-00008-of-00011.safetensors",
585
+ "model.layers.4.self_attn.v_proj.weight": "model-00008-of-00011.safetensors",
586
+ "model.layers.4.mlp.experts.down_proj_bias": "model-00008-of-00011.safetensors",
587
+ "model.layers.4.mlp.experts.down_proj_blocks": "model-00009-of-00011.safetensors",
588
+ "model.layers.4.mlp.experts.down_proj_scales": "model-00009-of-00011.safetensors",
589
+ "model.layers.4.mlp.experts.gate_up_proj_bias": "model-00009-of-00011.safetensors",
590
+ "model.layers.4.mlp.experts.gate_up_proj_blocks": "model-00009-of-00011.safetensors",
591
+ "model.layers.4.mlp.experts.gate_up_proj_scales": "model-00009-of-00011.safetensors",
592
+ "model.layers.4.mlp.router.bias": "model-00009-of-00011.safetensors",
593
+ "model.layers.4.mlp.router.weight": "model-00009-of-00011.safetensors",
594
+ "model.layers.4.post_attention_layernorm.weight": "model-00009-of-00011.safetensors",
595
+ "model.layers.5.input_layernorm.weight": "model-00009-of-00011.safetensors",
596
+ "model.layers.5.self_attn.k_proj.bias": "model-00009-of-00011.safetensors",
597
+ "model.layers.5.self_attn.k_proj.weight": "model-00009-of-00011.safetensors",
598
+ "model.layers.5.self_attn.o_proj.bias": "model-00009-of-00011.safetensors",
599
+ "model.layers.5.self_attn.o_proj.weight": "model-00009-of-00011.safetensors",
600
+ "model.layers.5.self_attn.q_proj.bias": "model-00009-of-00011.safetensors",
601
+ "model.layers.5.self_attn.q_proj.weight": "model-00009-of-00011.safetensors",
602
+ "model.layers.5.self_attn.sinks": "model-00009-of-00011.safetensors",
603
+ "model.layers.5.self_attn.v_proj.bias": "model-00009-of-00011.safetensors",
604
+ "model.layers.5.self_attn.v_proj.weight": "model-00009-of-00011.safetensors",
605
+ "model.layers.5.mlp.experts.down_proj_bias": "model-00009-of-00011.safetensors",
606
+ "model.layers.5.mlp.experts.down_proj_blocks": "model-00009-of-00011.safetensors",
607
+ "model.layers.5.mlp.experts.down_proj_scales": "model-00009-of-00011.safetensors",
608
+ "model.layers.5.mlp.experts.gate_up_proj_bias": "model-00009-of-00011.safetensors",
609
+ "model.layers.5.mlp.experts.gate_up_proj_blocks": "model-00009-of-00011.safetensors",
610
+ "model.layers.5.mlp.experts.gate_up_proj_scales": "model-00009-of-00011.safetensors",
611
+ "model.layers.5.mlp.router.bias": "model-00009-of-00011.safetensors",
612
+ "model.layers.5.mlp.router.weight": "model-00009-of-00011.safetensors",
613
+ "model.layers.5.post_attention_layernorm.weight": "model-00009-of-00011.safetensors",
614
+ "model.layers.6.input_layernorm.weight": "model-00009-of-00011.safetensors",
615
+ "model.layers.6.self_attn.k_proj.bias": "model-00009-of-00011.safetensors",
616
+ "model.layers.6.self_attn.k_proj.weight": "model-00009-of-00011.safetensors",
617
+ "model.layers.6.self_attn.o_proj.bias": "model-00009-of-00011.safetensors",
618
+ "model.layers.6.self_attn.o_proj.weight": "model-00009-of-00011.safetensors",
619
+ "model.layers.6.self_attn.q_proj.bias": "model-00009-of-00011.safetensors",
620
+ "model.layers.6.self_attn.q_proj.weight": "model-00009-of-00011.safetensors",
621
+ "model.layers.6.self_attn.sinks": "model-00009-of-00011.safetensors",
622
+ "model.layers.6.self_attn.v_proj.bias": "model-00009-of-00011.safetensors",
623
+ "model.layers.6.self_attn.v_proj.weight": "model-00009-of-00011.safetensors",
624
+ "model.layers.6.mlp.experts.down_proj_bias": "model-00009-of-00011.safetensors",
625
+ "model.layers.6.mlp.experts.down_proj_blocks": "model-00009-of-00011.safetensors",
626
+ "model.layers.6.mlp.experts.down_proj_scales": "model-00009-of-00011.safetensors",
627
+ "model.layers.6.mlp.experts.gate_up_proj_bias": "model-00009-of-00011.safetensors",
628
+ "model.layers.6.mlp.experts.gate_up_proj_blocks": "model-00010-of-00011.safetensors",
629
+ "model.layers.6.mlp.experts.gate_up_proj_scales": "model-00010-of-00011.safetensors",
630
+ "model.layers.6.mlp.router.bias": "model-00010-of-00011.safetensors",
631
+ "model.layers.6.mlp.router.weight": "model-00010-of-00011.safetensors",
632
+ "model.layers.6.post_attention_layernorm.weight": "model-00010-of-00011.safetensors",
633
+ "model.layers.7.input_layernorm.weight": "model-00010-of-00011.safetensors",
634
+ "model.layers.7.self_attn.k_proj.bias": "model-00010-of-00011.safetensors",
635
+ "model.layers.7.self_attn.k_proj.weight": "model-00010-of-00011.safetensors",
636
+ "model.layers.7.self_attn.o_proj.bias": "model-00010-of-00011.safetensors",
637
+ "model.layers.7.self_attn.o_proj.weight": "model-00010-of-00011.safetensors",
638
+ "model.layers.7.self_attn.q_proj.bias": "model-00010-of-00011.safetensors",
639
+ "model.layers.7.self_attn.q_proj.weight": "model-00010-of-00011.safetensors",
640
+ "model.layers.7.self_attn.sinks": "model-00010-of-00011.safetensors",
641
+ "model.layers.7.self_attn.v_proj.bias": "model-00010-of-00011.safetensors",
642
+ "model.layers.7.self_attn.v_proj.weight": "model-00010-of-00011.safetensors",
643
+ "model.layers.7.mlp.experts.down_proj_bias": "model-00010-of-00011.safetensors",
644
+ "model.layers.7.mlp.experts.down_proj_blocks": "model-00010-of-00011.safetensors",
645
+ "model.layers.7.mlp.experts.down_proj_scales": "model-00010-of-00011.safetensors",
646
+ "model.layers.7.mlp.experts.gate_up_proj_bias": "model-00010-of-00011.safetensors",
647
+ "model.layers.7.mlp.experts.gate_up_proj_blocks": "model-00010-of-00011.safetensors",
648
+ "model.layers.7.mlp.experts.gate_up_proj_scales": "model-00010-of-00011.safetensors",
649
+ "model.layers.7.mlp.router.bias": "model-00010-of-00011.safetensors",
650
+ "model.layers.7.mlp.router.weight": "model-00010-of-00011.safetensors",
651
+ "model.layers.7.post_attention_layernorm.weight": "model-00010-of-00011.safetensors",
652
+ "model.layers.8.input_layernorm.weight": "model-00010-of-00011.safetensors",
653
+ "model.layers.8.self_attn.k_proj.bias": "model-00010-of-00011.safetensors",
654
+ "model.layers.8.self_attn.k_proj.weight": "model-00010-of-00011.safetensors",
655
+ "model.layers.8.self_attn.o_proj.bias": "model-00010-of-00011.safetensors",
656
+ "model.layers.8.self_attn.o_proj.weight": "model-00010-of-00011.safetensors",
657
+ "model.layers.8.self_attn.q_proj.bias": "model-00010-of-00011.safetensors",
658
+ "model.layers.8.self_attn.q_proj.weight": "model-00010-of-00011.safetensors",
659
+ "model.layers.8.self_attn.sinks": "model-00010-of-00011.safetensors",
660
+ "model.layers.8.self_attn.v_proj.bias": "model-00010-of-00011.safetensors",
661
+ "model.layers.8.self_attn.v_proj.weight": "model-00010-of-00011.safetensors",
662
+ "model.layers.8.mlp.experts.down_proj_bias": "model-00010-of-00011.safetensors",
663
+ "model.layers.8.mlp.experts.down_proj_blocks": "model-00010-of-00011.safetensors",
664
+ "model.layers.8.mlp.experts.down_proj_scales": "model-00010-of-00011.safetensors",
665
+ "model.layers.8.mlp.experts.gate_up_proj_bias": "model-00010-of-00011.safetensors",
666
+ "model.layers.8.mlp.experts.gate_up_proj_blocks": "model-00010-of-00011.safetensors",
667
+ "model.layers.8.mlp.experts.gate_up_proj_scales": "model-00010-of-00011.safetensors",
668
+ "model.layers.8.mlp.router.bias": "model-00010-of-00011.safetensors",
669
+ "model.layers.8.mlp.router.weight": "model-00010-of-00011.safetensors",
670
+ "model.layers.8.post_attention_layernorm.weight": "model-00010-of-00011.safetensors",
671
+ "model.layers.9.input_layernorm.weight": "model-00010-of-00011.safetensors",
672
+ "model.layers.9.self_attn.k_proj.bias": "model-00010-of-00011.safetensors",
673
+ "model.layers.9.self_attn.k_proj.weight": "model-00010-of-00011.safetensors",
674
+ "model.layers.9.self_attn.o_proj.bias": "model-00010-of-00011.safetensors",
675
+ "model.layers.9.self_attn.o_proj.weight": "model-00010-of-00011.safetensors",
676
+ "model.layers.9.self_attn.q_proj.bias": "model-00010-of-00011.safetensors",
677
+ "model.layers.9.self_attn.q_proj.weight": "model-00010-of-00011.safetensors",
678
+ "model.layers.9.self_attn.sinks": "model-00010-of-00011.safetensors",
679
+ "model.layers.9.self_attn.v_proj.bias": "model-00010-of-00011.safetensors",
680
+ "model.layers.9.self_attn.v_proj.weight": "model-00010-of-00011.safetensors",
681
+ "model.layers.9.mlp.experts.down_proj_bias": "model-00010-of-00011.safetensors",
682
+ "model.layers.9.mlp.experts.down_proj_blocks": "model-00011-of-00011.safetensors",
683
+ "model.layers.9.mlp.experts.down_proj_scales": "model-00011-of-00011.safetensors",
684
+ "model.layers.9.mlp.experts.gate_up_proj_bias": "model-00011-of-00011.safetensors",
685
+ "model.layers.9.mlp.experts.gate_up_proj_blocks": "model-00011-of-00011.safetensors",
686
+ "model.layers.9.mlp.experts.gate_up_proj_scales": "model-00011-of-00011.safetensors",
687
+ "model.layers.9.mlp.router.bias": "model-00011-of-00011.safetensors",
688
+ "model.layers.9.mlp.router.weight": "model-00011-of-00011.safetensors",
689
+ "model.layers.9.post_attention_layernorm.weight": "model-00011-of-00011.safetensors",
690
+ "model.embed_tokens.weight": "model-00011-of-00011.safetensors",
691
+ "lm_head.weight": "model-00011-of-00011.safetensors",
692
+ "model.norm.weight": "model-00011-of-00011.safetensors",
693
+ "model.layers.0.self_attn.k_scale": "model-00001-of-00011.safetensors",
694
+ "model.layers.0.self_attn.v_scale": "model-00001-of-00011.safetensors",
695
+ "model.layers.1.self_attn.k_scale": "model-00004-of-00011.safetensors",
696
+ "model.layers.1.self_attn.v_scale": "model-00004-of-00011.safetensors",
697
+ "model.layers.10.self_attn.k_scale": "model-00001-of-00011.safetensors",
698
+ "model.layers.10.self_attn.v_scale": "model-00001-of-00011.safetensors",
699
+ "model.layers.11.self_attn.k_scale": "model-00001-of-00011.safetensors",
700
+ "model.layers.11.self_attn.v_scale": "model-00001-of-00011.safetensors",
701
+ "model.layers.12.self_attn.k_scale": "model-00002-of-00011.safetensors",
702
+ "model.layers.12.self_attn.v_scale": "model-00002-of-00011.safetensors",
703
+ "model.layers.13.self_attn.k_scale": "model-00002-of-00011.safetensors",
704
+ "model.layers.13.self_attn.v_scale": "model-00002-of-00011.safetensors",
705
+ "model.layers.14.self_attn.k_scale": "model-00002-of-00011.safetensors",
706
+ "model.layers.14.self_attn.v_scale": "model-00002-of-00011.safetensors",
707
+ "model.layers.15.self_attn.k_scale": "model-00003-of-00011.safetensors",
708
+ "model.layers.15.self_attn.v_scale": "model-00003-of-00011.safetensors",
709
+ "model.layers.16.self_attn.k_scale": "model-00003-of-00011.safetensors",
710
+ "model.layers.16.self_attn.v_scale": "model-00003-of-00011.safetensors",
711
+ "model.layers.17.self_attn.k_scale": "model-00003-of-00011.safetensors",
712
+ "model.layers.17.self_attn.v_scale": "model-00003-of-00011.safetensors",
713
+ "model.layers.18.self_attn.k_scale": "model-00004-of-00011.safetensors",
714
+ "model.layers.18.self_attn.v_scale": "model-00004-of-00011.safetensors",
715
+ "model.layers.19.self_attn.k_scale": "model-00004-of-00011.safetensors",
716
+ "model.layers.19.self_attn.v_scale": "model-00004-of-00011.safetensors",
717
+ "model.layers.2.self_attn.k_scale": "model-00007-of-00011.safetensors",
718
+ "model.layers.2.self_attn.v_scale": "model-00007-of-00011.safetensors",
719
+ "model.layers.20.self_attn.k_scale": "model-00004-of-00011.safetensors",
720
+ "model.layers.20.self_attn.v_scale": "model-00004-of-00011.safetensors",
721
+ "model.layers.21.self_attn.k_scale": "model-00005-of-00011.safetensors",
722
+ "model.layers.21.self_attn.v_scale": "model-00005-of-00011.safetensors",
723
+ "model.layers.22.self_attn.k_scale": "model-00005-of-00011.safetensors",
724
+ "model.layers.22.self_attn.v_scale": "model-00005-of-00011.safetensors",
725
+ "model.layers.23.self_attn.k_scale": "model-00005-of-00011.safetensors",
726
+ "model.layers.23.self_attn.v_scale": "model-00005-of-00011.safetensors",
727
+ "model.layers.24.self_attn.k_scale": "model-00005-of-00011.safetensors",
728
+ "model.layers.24.self_attn.v_scale": "model-00005-of-00011.safetensors",
729
+ "model.layers.25.self_attn.k_scale": "model-00006-of-00011.safetensors",
730
+ "model.layers.25.self_attn.v_scale": "model-00006-of-00011.safetensors",
731
+ "model.layers.26.self_attn.k_scale": "model-00006-of-00011.safetensors",
732
+ "model.layers.26.self_attn.v_scale": "model-00006-of-00011.safetensors",
733
+ "model.layers.27.self_attn.k_scale": "model-00006-of-00011.safetensors",
734
+ "model.layers.27.self_attn.v_scale": "model-00006-of-00011.safetensors",
735
+ "model.layers.28.self_attn.k_scale": "model-00006-of-00011.safetensors",
736
+ "model.layers.28.self_attn.v_scale": "model-00006-of-00011.safetensors",
737
+ "model.layers.29.self_attn.k_scale": "model-00006-of-00011.safetensors",
738
+ "model.layers.29.self_attn.v_scale": "model-00006-of-00011.safetensors",
739
+ "model.layers.3.self_attn.k_scale": "model-00008-of-00011.safetensors",
740
+ "model.layers.3.self_attn.v_scale": "model-00008-of-00011.safetensors",
741
+ "model.layers.30.self_attn.k_scale": "model-00007-of-00011.safetensors",
742
+ "model.layers.30.self_attn.v_scale": "model-00007-of-00011.safetensors",
743
+ "model.layers.31.self_attn.k_scale": "model-00007-of-00011.safetensors",
744
+ "model.layers.31.self_attn.v_scale": "model-00007-of-00011.safetensors",
745
+ "model.layers.32.self_attn.k_scale": "model-00007-of-00011.safetensors",
746
+ "model.layers.32.self_attn.v_scale": "model-00007-of-00011.safetensors",
747
+ "model.layers.33.self_attn.k_scale": "model-00008-of-00011.safetensors",
748
+ "model.layers.33.self_attn.v_scale": "model-00008-of-00011.safetensors",
749
+ "model.layers.34.self_attn.k_scale": "model-00008-of-00011.safetensors",
750
+ "model.layers.34.self_attn.v_scale": "model-00008-of-00011.safetensors",
751
+ "model.layers.35.self_attn.k_scale": "model-00008-of-00011.safetensors",
752
+ "model.layers.35.self_attn.v_scale": "model-00008-of-00011.safetensors",
753
+ "model.layers.4.self_attn.k_scale": "model-00008-of-00011.safetensors",
754
+ "model.layers.4.self_attn.v_scale": "model-00008-of-00011.safetensors",
755
+ "model.layers.5.self_attn.k_scale": "model-00009-of-00011.safetensors",
756
+ "model.layers.5.self_attn.v_scale": "model-00009-of-00011.safetensors",
757
+ "model.layers.6.self_attn.k_scale": "model-00009-of-00011.safetensors",
758
+ "model.layers.6.self_attn.v_scale": "model-00009-of-00011.safetensors",
759
+ "model.layers.7.self_attn.k_scale": "model-00010-of-00011.safetensors",
760
+ "model.layers.7.self_attn.v_scale": "model-00010-of-00011.safetensors",
761
+ "model.layers.8.self_attn.k_scale": "model-00010-of-00011.safetensors",
762
+ "model.layers.8.self_attn.v_scale": "model-00010-of-00011.safetensors",
763
+ "model.layers.9.self_attn.k_scale": "model-00010-of-00011.safetensors",
764
+ "model.layers.9.self_attn.v_scale": "model-00010-of-00011.safetensors"
765
+ }
766
+ }
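Review note: the index above maps each tensor name to the shard file that stores it. A minimal sketch of grouping tensor names by shard follows. It assumes the mapping sits under a top-level `"weight_map"` key (the usual layout for sharded safetensors indexes, though the key is not visible in this hunk); `INDEX_JSON` and `tensors_by_shard` are hypothetical names used only for illustration, and the three entries are copied from the diff.

```python
import json
from collections import defaultdict

# Hypothetical excerpt: three entries copied from the index in this diff.
# The "weight_map" wrapper key is an assumption about the file layout.
INDEX_JSON = """
{
  "weight_map": {
    "model.layers.9.mlp.router.weight": "model-00011-of-00011.safetensors",
    "model.embed_tokens.weight": "model-00011-of-00011.safetensors",
    "model.layers.9.self_attn.k_scale": "model-00010-of-00011.safetensors"
  }
}
"""


def tensors_by_shard(index: dict) -> dict[str, list[str]]:
    """Invert the tensor -> shard mapping into shard -> sorted tensor names."""
    shards: dict[str, list[str]] = defaultdict(list)
    for tensor_name, shard_file in index["weight_map"].items():
        shards[shard_file].append(tensor_name)
    return {shard: sorted(names) for shard, names in shards.items()}


grouped = tensors_by_shard(json.loads(INDEX_JSON))
for shard, names in sorted(grouped.items()):
    print(shard, names)
```

Note that a single layer can span two shards: in the index above, layer 9 keeps `down_proj_bias` in shard 10 while `down_proj_blocks` lives in shard 11, so a loader cannot assume one shard per layer.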
modeling_gpt_oss_puzzle.py ADDED
@@ -0,0 +1,260 @@
+ from typing import Any, Iterable, Optional, Union
+
+ from dataclasses import dataclass
+ import functools
+ import inspect
+
+ from .configuration_gpt_oss_puzzle import GptOssPuzzleConfig
+ import torch
+ from transformers.cache_utils import Cache, DynamicCache, DynamicLayer, DynamicSlidingWindowLayer
+ from transformers.integrations import mxfp4
+ from transformers.integrations.mxfp4 import Mxfp4GptOssExperts
+ from transformers.masking_utils import create_sliding_window_causal_mask
+ from transformers.models.gpt_oss import modeling_gpt_oss
+ from transformers.models.gpt_oss.modeling_gpt_oss import GptOssDecoderLayer, GptOssForCausalLM
+
+
+ @dataclass
+ class SlidingWindowCausalMaskPlaceholder:
+     kwargs: dict[str, Any]
+
+
+ class GptOssPuzzleDecoderLayer(GptOssDecoderLayer):
+     """
+     Extends GptOssDecoderLayer to support per-layer configs.
+     """
+
+     def __init__(self, config: GptOssPuzzleConfig, layer_idx: int):
+         layer_config = config.get_gpt_oss_config_for_layer(layer_idx)
+         super().__init__(layer_config, layer_idx)
+         self.config = layer_config
+         self.layer_idx = layer_idx
+
+     def forward(self, *args, **kwargs):
+         if "attention_mask" in kwargs and isinstance(kwargs["attention_mask"], SlidingWindowCausalMaskPlaceholder):
+             mask_kwargs = dict(kwargs["attention_mask"].kwargs)
+             mask_kwargs["config"] = self.config
+             if mask_kwargs["past_key_values"] is not None:
+                 mask_kwargs["past_key_values"] = CacheViewForSlidingWindowMask(
+                     mask_kwargs["past_key_values"], self.layer_idx
+                 )
+
+             kwargs["attention_mask"] = create_sliding_window_causal_mask(**mask_kwargs)
+         return super().forward(*args, **kwargs)
+
+
+ class CacheViewForSlidingWindowMask:
+     """
+     A view wrapper around a Cache that makes `create_sliding_window_causal_mask` use the correct layer index.
+
+     `create_sliding_window_causal_mask` iterates over `past_key_values.is_sliding` to determine which layer
+     to use for deriving mask sizes, effectively using the first layer's index. Since gpt-oss-puzzle has
+     heterogeneous sliding window sizes across layers, we need to ensure each layer uses its own sliding
+     window size. This view returns an `is_sliding` list that only marks the current layer as sliding,
+     causing `create_sliding_window_causal_mask` to use the correct layer index for mask computation.
+     """
+
+     def __init__(self, cache: Cache, layer_idx: int):
+         self._cache = cache
+         self._layer_idx = layer_idx
+
+     @property
+     def is_sliding(self) -> list[bool]:
+         return [False] * self._layer_idx + [True]
+
+     def __getattr__(self, name: str):
+         return getattr(self._cache, name)
+
+
+ class Mxfp4GptOssPuzzleExperts(Mxfp4GptOssExperts):
+     def __init__(self, config: GptOssPuzzleConfig):
+         """
+         Extends Mxfp4GptOssExperts to support per-layer configs.
+         Since this class is created without passing the layer index, we need to infer it from the call stack.
+         """
+         # module_name is of the form *.{layer_idx}.mlp.experts
+         current_key_name = _get_variable_from_stack(["current_key_name"])
+         if current_key_name is None:
+             module_name = _get_variable_from_stack(["module_name"])
+             if module_name is None:
+                 raise RuntimeError("`current_key_name`/`module_name` variable not found in caller stack")
+             layer_idx = int(module_name.split(".")[-3])
+         else:
+             layer_idx = int(current_key_name[-3])
+
+         layer_config = config.get_gpt_oss_config_for_layer(layer_idx)
+         super().__init__(layer_config)
+
+
+ def _get_variable_from_stack(names: list[str]) -> Optional[Any]:
+     f = inspect.currentframe().f_back
+     while f:
+         for name in names:
+             if name in f.f_locals:
+                 return f.f_locals[name]
+         f = f.f_back
+     return None
+
+
+ class PuzzleDynamicCache(DynamicCache):
+     """
+     A child class of DynamicCache that supports heterogeneous layer configurations.
+
+     __init__ is the same as in DynamicCache, except that the sliding window is read per layer from `block_configs`.
+     """
+
+     def __init__(
+         self,
+         ddp_cache_data: Optional[Iterable[tuple[torch.Tensor, torch.Tensor]]] = None,
+         config: Optional[GptOssPuzzleConfig] = None,
+         offloading: bool = False,
+         offload_only_non_sliding: bool = False,
+     ):
+         layers = []
+         # If a config is passed, use it to infer the layer types and initialize accordingly
+         if config is not None:
+             decoder_config = config.get_text_config(decoder=True)
+             layer_types = getattr(decoder_config, "layer_types", None)
+             if layer_types is None:
+                 layer_types = []
+                 for layer_idx in range(decoder_config.num_hidden_layers):
+                     sliding_window = None
+                     for attr_name in ("sliding_window", "attention_chunk_size"):
+                         sliding_window = getattr(
+                             config.block_configs[layer_idx],
+                             attr_name,
+                             getattr(decoder_config, attr_name, None),
+                         )
+                         if sliding_window is not None:
+                             break
+                     layer_types.append("sliding_attention" if sliding_window is not None else "full_attention")
+
+             # Some models have shared layers thus no cache is needed for them (e.g. Gemma3n)
+             if hasattr(decoder_config, "num_kv_shared_layers"):
+                 layer_types = layer_types[: -decoder_config.num_kv_shared_layers]
+
+             for layer_idx, layer_type in enumerate(layer_types):
+                 # From a cache point of view, both sliding and chunked are the same in how they should behave and how many
+                 # states they should return - only the mask changes to make them different at the end!
+                 if layer_type in ("sliding_attention", "chunked_attention"):
+                     sliding_window = None
+                     for attr_name in ("sliding_window", "attention_chunk_size"):
+                         sliding_window = getattr(
+                             decoder_config.block_configs[layer_idx],
+                             attr_name,
+                             getattr(decoder_config, attr_name, None),
+                         )
+                         if sliding_window is not None:
+                             break
+
+                     layers.append(DynamicSlidingWindowLayer(sliding_window=sliding_window))
+                 else:
+                     layers.append(DynamicLayer())
+
+         # In this case, use the passed data to already fill in the Cache
+         if ddp_cache_data is not None:
+             # Init all the layers with the data
+             for layer_idx, (key_states, value_states) in enumerate(ddp_cache_data):
+                 # If the config was not passed above, initialize a DynamicLayer for each entry of the ddp_data
+                 if config is None:
+                     layers.append(DynamicLayer())
+                 # Update the layer with the data
+                 _, _ = layers[layer_idx].update(key_states, value_states)
+
+         # If neither config nor ddp_cache_data was passed, simply lazy-init a full cache of DynamicLayer
+         if len(layers) == 0:
+             super(DynamicCache, self).__init__(
+                 layer_class_to_replicate=DynamicLayer,
+                 offloading=offloading,
+                 offload_only_non_sliding=offload_only_non_sliding,
+             )
+         else:
+             super(DynamicCache, self).__init__(
+                 layers=layers, offloading=offloading, offload_only_non_sliding=offload_only_non_sliding
+             )
+
+
+ original_load_balancing_loss_func = modeling_gpt_oss.load_balancing_loss_func
+
+
+ def load_balancing_loss_func(
+     gate_logits: Union[torch.Tensor, tuple[torch.Tensor], None],
+     num_experts: Optional[int] = None,
+     top_k=2,
+     attention_mask: Optional[torch.Tensor] = None,
+     num_experts_per_layer: Optional[tuple[int, ...]] = None,
+ ) -> Union[torch.Tensor, int]:
+     if gate_logits is None or not isinstance(gate_logits, tuple):
+         return 0
+
+     compute_device = gate_logits[0].device
+     overall_loss = 0
+
+     for layer_idx, layer_gate_logits in enumerate(gate_logits):
+         layer_loss = original_load_balancing_loss_func(
+             gate_logits=(layer_gate_logits,),
+             num_experts=num_experts_per_layer[layer_idx],
+             top_k=top_k,
+             attention_mask=attention_mask,
+         )
+         overall_loss += layer_loss.to(compute_device)
+
+     return overall_loss
+
+
+ class GptOssPuzzleForCausalLM(GptOssForCausalLM):
+     """
+     A child class of GptOssForCausalLM to support heterogeneous layer configurations.
+
+     This class uses monkey-patching to inject custom behavior into the parent class while maximizing
+     code reuse and minimizing duplication. During `__init__`, it temporarily replaces the decoder layer
+     class to use `GptOssPuzzleDecoderLayer`. During `forward`, it patches mask creation, cache handling,
+     and load balancing loss computation to account for per-layer variations.
+     """
+
+     config_class = GptOssPuzzleConfig
+     _no_split_modules = ["GptOssPuzzleDecoderLayer"]
+     _keys_to_ignore_on_load_unexpected = [r"\.k_scale$", r"\.v_scale$"]
+
+     def __init__(self, config):
+         # Placeholder value that is never meant to be used; set only because GptOssForCausalLM's __init__ reads this attribute
+         config.num_local_experts = "PER_BLOCK_ATTRIBUTE"
+
+         original_decoder_layer_cls = modeling_gpt_oss.GptOssDecoderLayer
+         modeling_gpt_oss.GptOssDecoderLayer = GptOssPuzzleDecoderLayer
+         try:
+             super().__init__(config)
+             self.config = config  # Used for load_balancing_loss_func
+         finally:
+             modeling_gpt_oss.GptOssDecoderLayer = original_decoder_layer_cls
+
+         mxfp4.Mxfp4GptOssExperts = Mxfp4GptOssPuzzleExperts  # Used after the model is initialized
+
+     def forward(self, *args, **kwargs):
+         original_create_sliding_window_causal_mask = modeling_gpt_oss.create_sliding_window_causal_mask
+         original_dynamic_cache = modeling_gpt_oss.DynamicCache
+
+         modeling_gpt_oss.load_balancing_loss_func = functools.partial(
+             load_balancing_loss_func,
+             num_experts_per_layer=tuple(block_config.num_local_experts for block_config in self.config.block_configs),
+         )
+         modeling_gpt_oss.create_sliding_window_causal_mask = lambda **kwargs: SlidingWindowCausalMaskPlaceholder(
+             kwargs=kwargs
+         )
+         modeling_gpt_oss.DynamicCache = PuzzleDynamicCache
+         try:
+             return super().forward(*args, **kwargs)
+         finally:
+             modeling_gpt_oss.create_sliding_window_causal_mask = original_create_sliding_window_causal_mask
+             modeling_gpt_oss.load_balancing_loss_func = original_load_balancing_loss_func
+             modeling_gpt_oss.DynamicCache = original_dynamic_cache
+
+     def _prepare_cache_for_generation(self, *args, **kwargs):
+         from transformers.generation import utils as generation_utils
+
+         original_dynamic_cache = generation_utils.DynamicCache
+         generation_utils.DynamicCache = PuzzleDynamicCache
+         try:
+             return super()._prepare_cache_for_generation(*args, **kwargs)
+         finally:
+             generation_utils.DynamicCache = original_dynamic_cache
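Review note: `__init__`, `forward`, and `_prepare_cache_for_generation` all follow the same shape: save a module attribute, swap in a replacement, call the parent, and restore the original in `finally`. The sketch below illustrates that pattern as a reusable context manager. It is not part of the commit; `temporarily_patched` and `fake_module` are hypothetical names, and a `SimpleNamespace` stands in for a real module such as `modeling_gpt_oss` so the example runs without `transformers` installed.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def temporarily_patched(target, **replacements):
    """Swap attributes on `target`, restoring the originals on exit, even on error."""
    originals = {name: getattr(target, name) for name in replacements}
    for name, value in replacements.items():
        setattr(target, name, value)
    try:
        yield target
    finally:
        for name, value in originals.items():
            setattr(target, name, value)


# Stand-in for a module whose attribute is swapped during a single call.
fake_module = SimpleNamespace(DynamicCache="original cache class")

with temporarily_patched(fake_module, DynamicCache="puzzle cache class"):
    inside = fake_module.DynamicCache  # the replacement is visible here
after = fake_module.DynamicCache  # the original is back after the block
```

Because the patch mutates module-level state, two threads running `forward` on different model classes at the same time could observe each other's replacements; that caveat applies to the try/finally form in the diff as well.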
privacy.md ADDED
@@ -0,0 +1,12 @@
+ # **Privacy**
+
+ Field | Response
+ :----------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------
+ Generatable or reverse engineerable personal data? | No
+ Personal data used to create this model? | No
+ How often is dataset reviewed? | Before Release
+ Was data from user interactions with the AI model (e.g. user input and prompts) used to train the model? | No
+ Is there provenance for all datasets used in training? | Yes
+ Does data labeling (annotation, metadata) comply with privacy laws? | Yes
+ Is data compliant with data subject requests for data correction or removal, if such a request was made? | No, not possible with externally-sourced data
+ Applicable Privacy Policy | https://www.nvidia.com/en-us/about-nvidia/privacy-policy/
safety.md ADDED
@@ -0,0 +1,6 @@
+ | Field | Response |
+ | :---- | :---- |
+ | Model Application Field(s): | Chat, Instruction Following, Chatbot Development, Code Generation, Reasoning, Customer Service |
+ | Describe the life critical impact (if present). | Not Applicable |
+ | Use Case Restrictions: | Abide by the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license). |
+ | Model and dataset restrictions: | The principle of least privilege (PoLP) is applied, limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints are adhered to. |
special_tokens_map.json ADDED
@@ -0,0 +1,23 @@
+ {
+   "bos_token": {
+     "content": "<|startoftext|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "<|return|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0614fe83cadab421296e664e1f48f4261fa8fef6e03e63bb75c20f38e37d07d3
+ size 27868174
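The three lines added for `tokenizer.json` are a Git LFS pointer, not the tokenizer itself: they record the spec version, the SHA-256 digest of the real file, and its size in bytes. A minimal sketch of how such a pointer could be parsed and checked against downloaded bytes follows. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and the sketch assumes the simple `key value` layout seen in this pointer.

```python
import hashlib

# Pointer contents as they appear in this commit.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:0614fe83cadab421296e664e1f48f4261fa8fef6e03e63bb75c20f38e37d07d3
size 27868174
"""


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data, pointer):
    """Return True if the bytes have the size and SHA-256 digest the pointer records."""
    return (
        len(data) == pointer["size"]
        and hashlib.sha256(data).hexdigest() == pointer["digest"]
    )


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algo"], pointer["size"])
```

After fetching the real file, passing its bytes to `matches_pointer` would confirm that the download corresponds to this pointer.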
tokenizer_config.json ADDED
@@ -0,0 +1,183 @@
+ {
+   "added_tokens_decoder": {
+     "199998": {
+       "content": "<|startoftext|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "199999": {
+       "content": "<|endoftext|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "200000": {
+       "content": "<|reserved_200000|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "200001": {
+       "content": "<|reserved_200001|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "200002": {
+       "content": "<|return|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "200003": {
+       "content": "<|constrain|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "200004": {
+       "content": "<|reserved_200004|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "200005": {
+       "content": "<|channel|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "200006": {
+       "content": "<|start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "200007": {
+       "content": "<|end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "200008": {
+       "content": "<|message|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "200009": {
+       "content": "<|reserved_200009|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "200010": {
+       "content": "<|reserved_200010|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "200011": {
+       "content": "<|reserved_200011|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "200012": {
+       "content": "<|call|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "200013": {
+       "content": "<|reserved_200013|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "200014": {
+       "content": "<|reserved_200014|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "200015": {
+       "content": "<|reserved_200015|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "200016": {
+       "content": "<|reserved_200016|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "200017": {
+       "content": "<|reserved_200017|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "200018": {
+       "content": "<|endofprompt|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<|startoftext|>",
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "<|return|>",
+   "extra_special_tokens": {},
+   "model_input_names": [
+     "input_ids",
+     "attention_mask"
+   ],
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_token": "<|endoftext|>",
+   "tokenizer_class": "PreTrainedTokenizerFast"
+ }
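In `tokenizer_config.json`, `added_tokens_decoder` maps token ids to token strings, while `bos_token`, `eos_token`, and `pad_token` name their tokens by string. A minimal sketch of resolving those three roles back to ids is shown here, using only the standard library and a reduced excerpt of the file. The helper `special_token_ids` is hypothetical. The last line checks an assumption: the very large `model_max_length` looks like the `int(1e30)` placeholder that `transformers` writes when no maximum length is configured, rather than a real context limit.

```python
import json

# Excerpt of tokenizer_config.json, reduced to the fields used below.
CONFIG_EXCERPT = json.loads("""
{
  "added_tokens_decoder": {
    "199998": {"content": "<|startoftext|>", "special": true},
    "199999": {"content": "<|endoftext|>", "special": true},
    "200002": {"content": "<|return|>", "special": true}
  },
  "bos_token": "<|startoftext|>",
  "eos_token": "<|return|>",
  "pad_token": "<|endoftext|>",
  "model_max_length": 1000000000000000019884624838656
}
""")


def special_token_ids(config):
    """Map each special-token role to its id by inverting added_tokens_decoder."""
    content_to_id = {
        entry["content"]: int(token_id)
        for token_id, entry in config["added_tokens_decoder"].items()
    }
    return {
        role: content_to_id[config[role]]
        for role in ("bos_token", "eos_token", "pad_token")
    }


ids = special_token_ids(CONFIG_EXCERPT)
print(ids)

# Assumption: this value is the "no limit set" sentinel, int(1e30).
is_unbounded = CONFIG_EXCERPT["model_max_length"] == int(1e30)
```

With this configuration the end-of-sequence role resolves to `<|return|>` and padding reuses `<|endoftext|>`, so code that compares against `<|endoftext|>` to detect the end of generation would need to account for that.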