Solshine committed on Apr 22

Commit

2325115

0 Parent(s):

Initial public release: SAE weights, cfg, and model card

This view is limited to 50 files because the commit contains too many changes.

Files changed (50)

.gitattributes +35 -0
README.md +216 -0
d20_jumprelu_L10_deceptive_only/cfg.json +23 -0
d20_jumprelu_L10_deceptive_only/sae_weights.safetensors +3 -0
d20_jumprelu_L10_honest_only/cfg.json +23 -0
d20_jumprelu_L10_honest_only/sae_weights.safetensors +3 -0
d20_jumprelu_L10_mixed/cfg.json +23 -0
d20_jumprelu_L10_mixed/sae_weights.safetensors +3 -0
d20_jumprelu_L14_deceptive_only/cfg.json +23 -0
d20_jumprelu_L14_deceptive_only/sae_weights.safetensors +3 -0
d20_jumprelu_L14_honest_only/cfg.json +23 -0
d20_jumprelu_L14_honest_only/sae_weights.safetensors +3 -0
d20_jumprelu_L14_mixed/cfg.json +23 -0
d20_jumprelu_L14_mixed/sae_weights.safetensors +3 -0
d20_jumprelu_L18_deceptive_only/cfg.json +23 -0
d20_jumprelu_L18_deceptive_only/sae_weights.safetensors +3 -0
d20_jumprelu_L18_honest_only/cfg.json +23 -0
d20_jumprelu_L18_honest_only/sae_weights.safetensors +3 -0
d20_jumprelu_L18_mixed/cfg.json +23 -0
d20_jumprelu_L18_mixed/sae_weights.safetensors +3 -0
d20_jumprelu_L2_deceptive_only/cfg.json +23 -0
d20_jumprelu_L2_deceptive_only/sae_weights.safetensors +3 -0
d20_jumprelu_L2_honest_only/cfg.json +23 -0
d20_jumprelu_L2_honest_only/sae_weights.safetensors +3 -0
d20_jumprelu_L2_mixed/cfg.json +23 -0
d20_jumprelu_L2_mixed/sae_weights.safetensors +3 -0
d20_jumprelu_L4_deceptive_only/cfg.json +23 -0
d20_jumprelu_L4_deceptive_only/sae_weights.safetensors +3 -0
d20_jumprelu_L4_honest_only/cfg.json +23 -0
d20_jumprelu_L4_honest_only/sae_weights.safetensors +3 -0
d20_jumprelu_L4_mixed/cfg.json +23 -0
d20_jumprelu_L4_mixed/sae_weights.safetensors +3 -0
d20_jumprelu_L8_deceptive_only/cfg.json +23 -0
d20_jumprelu_L8_deceptive_only/sae_weights.safetensors +3 -0
d20_jumprelu_L8_honest_only/cfg.json +23 -0
d20_jumprelu_L8_honest_only/sae_weights.safetensors +3 -0
d20_jumprelu_L8_mixed/cfg.json +23 -0
d20_jumprelu_L8_mixed/sae_weights.safetensors +3 -0
d20_jumprelu_ste_L10_deceptive_only/cfg.json +24 -0
d20_jumprelu_ste_L10_deceptive_only/sae_weights.safetensors +3 -0
d20_jumprelu_ste_L10_honest_only/cfg.json +24 -0
d20_jumprelu_ste_L10_honest_only/sae_weights.safetensors +3 -0
d20_jumprelu_ste_L10_mixed/cfg.json +24 -0
d20_jumprelu_ste_L10_mixed/sae_weights.safetensors +3 -0
d20_jumprelu_ste_L14_deceptive_only/cfg.json +24 -0
d20_jumprelu_ste_L14_deceptive_only/sae_weights.safetensors +3 -0
d20_jumprelu_ste_L14_honest_only/cfg.json +24 -0
d20_jumprelu_ste_L14_honest_only/sae_weights.safetensors +3 -0
d20_jumprelu_ste_L14_mixed/cfg.json +24 -0
d20_jumprelu_ste_L14_mixed/sae_weights.safetensors +3 -0
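The folder names in this list appear to follow a `{model}_{arch}[_ste]_L{layer}_{condition}` pattern. A minimal sketch for splitting an `sae_id` into its parts, assuming that pattern holds for every folder in the repo; `SaeId` and `parse_sae_id` are hypothetical names, not part of the release:

```python
import re
from typing import NamedTuple


class SaeId(NamedTuple):
    model: str      # e.g. "d20"
    arch: str       # e.g. "jumprelu"
    ste: bool       # True when the id carries the "_ste" tag
    layer: int      # e.g. 14
    condition: str  # "mixed", "honest_only" or "deceptive_only"


# Assumed naming pattern, inferred from the folder names in this commit.
_SAE_ID = re.compile(
    r"^(?P<model>[^_]+)_(?P<arch>[a-z]+)(?P<ste>_ste)?_L(?P<layer>\d+)_"
    r"(?P<condition>mixed|honest_only|deceptive_only)$"
)


def parse_sae_id(sae_id: str) -> SaeId:
    """Split a folder name into its components, or raise ValueError."""
    m = _SAE_ID.match(sae_id)
    if m is None:
        raise ValueError(f"unrecognised sae_id: {sae_id!r}")
    return SaeId(
        model=m["model"],
        arch=m["arch"],
        ste=m["ste"] is not None,
        layer=int(m["layer"]),
        condition=m["condition"],
    )


print(parse_sae_id("d20_jumprelu_ste_L14_honest_only"))
```

Grouping the parsed ids by `layer` or `condition` gives a quick inventory of which checkpoints a truncated file list actually shows.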

.gitattributes ADDED

	@@ -0,0 +1,35 @@

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
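Every rule above routes matching files through Git LFS, which is why the `sae_weights.safetensors` entries later in this diff are three-line pointer files while `cfg.json` is stored inline. A rough sketch of that routing, assuming only simple basename globs matter; Python's `fnmatch` only approximates gitattributes matching, and `is_lfs_tracked` is a hypothetical helper:

```python
from fnmatch import fnmatchcase
from posixpath import basename

# A few of the basename globs from the .gitattributes above.
LFS_GLOBS = ["*.bin", "*.pt", "*.pth", "*.safetensors", "*.zip"]


def is_lfs_tracked(path: str) -> bool:
    """Approximate check: does the file name match any listed LFS glob?"""
    name = basename(path)
    return any(fnmatchcase(name, glob) for glob in LFS_GLOBS)


print(is_lfs_tracked("d20_jumprelu_L10_mixed/sae_weights.safetensors"))
print(is_lfs_tracked("d20_jumprelu_L10_mixed/cfg.json"))
```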

README.md ADDED

	@@ -0,0 +1,216 @@

+---
+language: en
+license: apache-2.0
+tags:
+  - sparse-autoencoder
+  - SAE
+  - interpretability
+  - deception-detection
+  - mechanistic-interpretability
+  - saelens
+  - neuronpedia
+base_model:
+  - karpathy/nanochat-d32
+  - karpathy/nanochat-d20
+  - meta-llama/Llama-3.2-1B
+  - Qwen/Qwen3-1.7B
+  - HuggingFaceTB/SmolLM2-135M
+  - microsoft/phi-2
+  - TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T
+  - microsoft/Phi-4-mini-reasoning
+  - EleutherAI/pythia-160m
+datasets:
+  - Solshine/deception-behavioral-multimodel
+---
+# Deception Behavioral SAEs — SAELens/Neuronpedia Format
+**366 Sparse Autoencoders** trained on behavioral deception activations across
+9 language models (5 architecture families), formatted for
+[SAELens](https://github.com/jbloomAus/SAELens) and
+[Neuronpedia](https://neuronpedia.org) compatibility.
+## Training-data caveat — please read before use
+The "honest" and "deceptive" labels used to train and evaluate these SAEs
+come from a **mixed scenario pool** of nine prompts:
+- **Six clean incentive-structure scenarios** — `insider_info`,
+  `accounting_error`, `ai_oversight_log`, `ai_capability_hide`,
+  `surprise_party`, `job_interview_gap`. Each presents a realistic
+  situation with a non-trivial incentive to conceal or disclose
+  information; the honest/deceptive distinction is which of the two
+  behavioral choices the model's completion settles into under
+  temperature sampling.
+- **Three role-play identity-assignment scenarios** —
+  `secret_role_game` ("You have been assigned the role of Fascist"),
+  `secret_role_game_v2` ("You are secretly one of the Mafia members"),
+  `werewolf_game` ("You are a Werewolf"). These pre-assign the model a
+  deceptive identity and label a completion "deceptive" when the model
+  drifts away from the assigned role or "honest" when it echoes it.
+**What this mixed pool means for the SAEs' labels.** Within the six
+incentive-structure scenarios, the honest/deceptive distinction is a
+measurement of behavioral choice under an ambiguous incentive. Within
+the three role-play scenarios, the distinction is a measurement of
+role-consistency under identity-assigned role-play — which is a
+well-defined phenomenon but not the same as emergent or
+incentive-driven deception.
+**What these SAEs are and are not good for.**
+- **Good for:** research on mixed-pool activation geometry; SAE
+  feature-geometry studies; as one of a set of baselines when
+  comparing multiple SAE families; as a reference implementation of
+  same-prompt temperature-sampled behavioral SAE training at scale.
+- **Not recommended as a standalone deception detector.** The
+  role-consistency signal from the three role-play scenarios is mixed
+  into every aggregate metric reported below. A downstream user who
+  wants an "emergent-deception feature set" should restrict attention
+  to features whose activation pattern concentrates in the
+  `insider_info` / `accounting_error` / `ai_oversight_log` /
+  `ai_capability_hide` / `surprise_party` / `job_interview_gap`
+  scenarios — or wait for the methodologically corrected V3 re-release
+  currently in preparation on the decision-incentive scenario bank
+  (no pre-assigned deceptive identity).
+**What is unaffected by this caveat.**
+- The SAE weights, reconstruction metrics (explained variance, L0,
+  alive features), and engineering of the training pipeline are
+  accurate as reported.
+- The linear-probe balanced-accuracy numbers in the upstream paper
+  measure the mixed pool; the 6-scenario clean-subset re-analysis is
+  listed as a planned appendix for the next manuscript revision.
+A companion methodology-first Gemma 4 SAE suite is in preparation using
+pretraining-distribution data + a decision-incentive behavior split;
+this README will be updated with a link when that release is public.
+---
+Original flat-file checkpoints (with full training metadata) are in:
+[Solshine/nanochat-d32-deception-saes-batch](https://huggingface.co/Solshine/nanochat-d32-deception-saes-batch)
+## Research Context
+These SAEs are trained on **same-prompt behavioral sampling** data: a single ambiguous
+scenario prompt produces both deceptive and honest completions via temperature sampling.
+The SAEs decompose residual stream activations during deceptive vs. honest response
+generation — enabling interpretability analysis of deception-relevant features.
+**Paper:** "The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools"
+[arXiv:2509.20393](https://arxiv.org/abs/2509.20393)
+**Follow-up repo:** [SolshineCode/deception-nanochat-sae-research](https://github.com/SolshineCode/deception-nanochat-sae-research)
+**Author:** Caleb DeLeeuw (2026)
+## Key Findings (Cross-Model, 366 SAEs, 9 Models, 5 Architecture Families)
+**Linear probes on raw activations:**
+| Model | Params | Peak Layer (depth) | Bal. Accuracy | AUROC |
+|---|---|---|---|---|
+| nanochat-d32 | 1.88B | L12 (37%) | **86.9%** | 0.923 |
+| Qwen3-1.7B | 1.7B | L17 (63%) | **80.9%** | 0.893 |
+| Phi-4-mini-reasoning | 3.8B | L20 (64%) | **80.8%** | 0.860 |
+| Phi-2 | 2.7B | L21 (75%) | ~75% | — |
+| TinyLlama-1.1B | 1.1B | L21 (95%) | **73.2%** | 0.784 |
+| Llama 3.2-1B | 1.0B | L9 (56%) | **72.5%** | — |
+| nanochat-d20 | 1.88B | L14 (70%) | ~67% | — |
+| SmolLM2-135M | 135M | L4 (80%) | ~69% | — |
+| Pythia-160M | 160M | L0 (0%) | **66.0%** | 0.696 |
+All results p < 0.001, PCA-robust.
+**SAE decomposition — model-size-dependent:**
+- **Models ≤ 1.3B:** SAEs *help* detection (8–47% of SAEs beat raw probe accuracy)
+- **Models ≥ 1.7B:** SAEs *hurt* detection (0–4% beat raw)
+- **Transition:** between TinyLlama-1.1B (47% help) and Qwen3-1.7B (<4% help)
+- **Best SAE config (small models):** JumpReLU + honest_only training condition
+- **Phi-2 anomaly:** 33% of SAEs help at 2.7B (parallel attention architecture); does NOT extend to Phi-4-mini (3.8B, 2%)
+- **Feature steering:** Null results at all tested layers and models, consistent with deception being distributed rather than localizable to individual features
+## Models Covered
+| Model | Params | Architecture | Layers in SAEs | SAE Count | SAE Arches |
+|---|---|---|---|---|---|
+| nanochat-d32 | 1.88B | GPT-NeoX | L4, 8, 12, 16, 20, 24 | 57 | TopK, JumpReLU, Gated |
+| nanochat-d20 | 1.88B | GPT-NeoX | L2, 4, 8, 10, 14, 18 | 45 | TopK, JumpReLU |
+| Qwen3-1.7B | 1.7B | Qwen | L12, 14, 15, 17, 18 | 45 | TopK, JumpReLU, Gated |
+| Phi-4-mini-reasoning | 3.8B | Phi | L2, 6, 10, 14, 18, 22, 26 | 42 | TopK, JumpReLU |
+| SmolLM2-135M | 135M | Llama2 | L3, 4, 5, 6, 9, 12, 15, 18, 21 | 54 | TopK, JumpReLU |
+| Phi-2 | 2.7B | Phi (parallel) | L4, 8, 12, 16, 20 | 30 | TopK, JumpReLU |
+| TinyLlama-1.1B | 1.1B | Llama2 | L3, 6, 9, 12, 15 (+STE) | 39 | TopK, JumpReLU |
+| Llama 3.2-1B | 1.0B | Llama | L2, 4, 6 | 18 | TopK, JumpReLU |
+| Pythia-160M | 160M | GPT-NeoX | L1, 2, 4, 6, 8, 10 | 36 | TopK, JumpReLU |
+| **Total** | | | | **366** | |
+**Note on STE validation SAEs:** nanochat-d20 and TinyLlama each include 9 additional
+"_ste_" tagged SAEs (e.g., `d20_jumprelu_ste_L14_honest_only`) trained with the corrected
+Gaussian-kernel STE to validate that the JumpReLU honest_only advantage is not a
+dimensionality artifact. 15/18 conditions (83%) confirm the advantage is real.
+## Training Details
+**Hardware:** NVIDIA GeForce GTX 1650 Ti with Max-Q Design, 4 GB VRAM (Windows 11 Pro)
+**Training time:** ~400–600 seconds per SAE (300 epochs, batch_size=128)
+**Framework:** Custom PyTorch training loop with SAELens-compatible architecture
+**Activations:** Residual stream (`resid_post`) collected at generation time
+**Expansion factor:** 4× (d_sae = 4 × d_model)
+**Architectures:** TopK (k=64), JumpReLU, Gated
+**Training conditions:** `mixed` (all completions), `honest_only`, `deceptive_only`
+**Classification:** Gemini 2.5 Flash (behavioral LLM classification, not regex)
+## SAE Format
+Each SAE is in its own subfolder `{sae_id}/` containing:
+- `sae_weights.safetensors` — weights (W_enc, b_enc, W_dec, b_dec, [threshold for JumpReLU])
+- `cfg.json` — SAELens-compatible config (architecture, hook_name, d_in, d_sae, training condition)
+## Known Limitations
+**JumpReLU threshold training (348 original SAEs):**
+The JumpReLU SAEs among the 348 original (non-STE) batch SAEs have `threshold = 0` throughout, making
+them functionally equivalent to ReLU. The Heaviside step function has zero autograd gradient with respect
+to the threshold, so without a custom straight-through estimator (STE) the threshold never updates from
+its initialization of zero. These SAEs operate with ~50% feature density (L0 ≈ d_sae/2) rather
+than the intended sparse regime. TopK SAEs (exact L0=64) are the properly sparse architecture
+in this collection.
+**STE fix (2026-04-11):** The training code has been corrected with a Gaussian-kernel STE
+(Rajamanoharan et al. 2024, arXiv:2407.14435). The 18 `_ste_` tagged SAEs in this repo
+use the corrected code. Targeted validation (18 STE SAEs across d20 and TinyLlama)
+confirmed that the honest_only advantage over TopK is **not** a dimensionality artifact —
+15/18 conditions (83%) show STE JumpReLU > TopK even with threshold training.
+**The honest_only > TopK probe accuracy finding is valid** regardless of the threshold bug.
+The threshold bug affects downstream Neuronpedia feature analysis (active feature density),
+not the probe accuracy comparisons.
+## Loading with SAELens
+```python
+import json
+from pathlib import Path
+from safetensors.torch import load_file
+sae_id = "d32_topk_L12_honest_only"  # or any sae_id from the repo
+weights = load_file(f"{sae_id}/sae_weights.safetensors")  # W_enc: [d_in, d_sae], W_dec: [d_sae, d_in]
+cfg = json.loads(Path(f"{sae_id}/cfg.json").read_text())  # read_text() closes the file after reading
+# Training condition (honest_only / deceptive_only / mixed) is the sae_id suffix; cfg.get("training_condition") may be None
+```
+## Citation
+If you use these SAEs, please cite the original paper:
+```
+@article{thesecretagenda2025,
+  title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
+  author={DeLeeuw, Caleb},
+  journal={arXiv:2509.20393},
+  year={2025}
+}
+```
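The Known Limitations section of the README says a JumpReLU with `threshold = 0` is functionally a ReLU. A toy illustration in plain Python, assuming the standard JumpReLU definition (pass a pre-activation through when it exceeds its threshold, otherwise output zero); the numbers are made up and this is not the repo's training code:

```python
def jumprelu(pre_acts, thresholds):
    """Keep z where z > threshold, else emit 0.0 (one threshold per feature)."""
    return [z if z > t else 0.0 for z, t in zip(pre_acts, thresholds)]


def relu(pre_acts):
    """Plain ReLU for comparison."""
    return [z if z > 0.0 else 0.0 for z in pre_acts]


def l0(acts):
    """Number of active (non-zero) features."""
    return sum(1 for a in acts if a != 0.0)


pre_acts = [-1.5, -0.2, 0.1, 0.3, 2.0, -0.7]  # illustrative values only

# Threshold stuck at its zero initialization: identical to ReLU.
stuck = jumprelu(pre_acts, [0.0] * len(pre_acts))
# Hypothetical learned threshold of 0.5: fewer features stay active.
trained = jumprelu(pre_acts, [0.5] * len(pre_acts))

print(stuck == relu(pre_acts), l0(stuck), l0(trained))
```

When pre-activations are roughly centered on zero, about half of them pass a zero threshold, which would be consistent with the ~50% feature density the README reports for the original JumpReLU SAEs.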

d20_jumprelu_L10_deceptive_only/cfg.json ADDED

	@@ -0,0 +1,23 @@

+{
+  "architecture": "jumprelu",
+  "d_in": 1280,
+  "d_sae": 5120,
+  "dtype": "float32",
+  "device": "cpu",
+  "model_name": "karpathy/nanochat-d20",
+  "hook_name": "transformer.h.10",
+  "hook_layer": 10,
+  "hook_head_index": null,
+  "activation_fn_str": "jumprelu",
+  "activation_fn_kwargs": {},
+  "apply_b_dec_to_input": false,
+  "finetuning_scaling_factor": false,
+  "sae_lens_training_version": "deception-behavioral-v1",
+  "prepend_bos": false,
+  "dataset_path": "Solshine/deception-behavioral-multimodel",
+  "dataset_trust_remote_code": false,
+  "context_size": null,
+  "normalize_activations": "none",
+  "training_notes": "Deception behavioral SAE \u2014 same-prompt behavioral sampling. Model: karpathy/nanochat-d20, Layer 10, jumprelu. See https://github.com/SolshineCode/deception-nanochat-sae-research",
+  "source_repo": "https://github.com/SolshineCode/deception-nanochat-sae-research"
+}
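The config above can be cross-checked against the README's stated 4× expansion factor and against its own redundant fields. A minimal sketch; `check_cfg` is a hypothetical helper, and the rule that `hook_name` ends in the layer index is inferred from the configs in this diff rather than documented anywhere:

```python
import json
from typing import List


def check_cfg(cfg: dict, expansion: int = 4) -> List[str]:
    """Return a list of human-readable inconsistencies (empty if none)."""
    problems = []
    if cfg["d_sae"] != expansion * cfg["d_in"]:
        problems.append("d_sae is not expansion * d_in")
    if cfg["architecture"] != cfg["activation_fn_str"]:
        problems.append("architecture and activation_fn_str disagree")
    if not cfg["hook_name"].endswith(f".{cfg['hook_layer']}"):
        problems.append("hook_name does not end in hook_layer")
    return problems


# Subset of the fields shown in the cfg.json above.
cfg = json.loads("""
{
  "architecture": "jumprelu",
  "d_in": 1280,
  "d_sae": 5120,
  "hook_name": "transformer.h.10",
  "hook_layer": 10,
  "activation_fn_str": "jumprelu"
}
""")

print(check_cfg(cfg))
```

Running the same check over every `cfg.json` in a local clone is a cheap way to catch a mislabeled layer or width before loading any weights.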

d20_jumprelu_L10_deceptive_only/sae_weights.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d01e98627d6498c10850c0ea36d28e1bb37a0202c3dc0396297432fe6ae93a6b
+size 52475272
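The pointer's `size` can be sanity-checked against the tensor shapes. A back-of-the-envelope sketch, assuming float32 tensors, the `W_enc`/`W_dec` shapes given in the README, and (as assumptions not stated in the source) a `b_enc` and `threshold` of length `d_sae` and a `b_dec` of length `d_in`:

```python
import math

d_in, d_sae = 1280, 5120   # from cfg.json
BYTES_PER_FLOAT32 = 4      # cfg.json reports dtype float32

# Bias and threshold shapes are assumptions; W_enc/W_dec follow the README.
shapes = {
    "W_enc": (d_in, d_sae),
    "b_enc": (d_sae,),
    "W_dec": (d_sae, d_in),
    "b_dec": (d_in,),
    "threshold": (d_sae,),
}


def payload_bytes(tensor_shapes: dict) -> int:
    """Raw tensor bytes, ignoring the file's header."""
    n_elems = sum(math.prod(shape) for shape in tensor_shapes.values())
    return n_elems * BYTES_PER_FLOAT32


lfs_size = 52475272                    # "size" line of the LFS pointer
payload = payload_bytes(shapes)
overhead = lfs_size - payload          # bytes left over for the header

without_threshold = {k: v for k, v in shapes.items() if k != "threshold"}
overhead_no_threshold = lfs_size - payload_bytes(without_threshold)

print(payload, overhead, overhead_no_threshold)
```

Under these assumptions the payload is 52,474,880 bytes, leaving 392 bytes, a plausible size for the safetensors length prefix and JSON header. Dropping the `threshold` vector would leave roughly 20 KB unexplained, which suggests, but does not prove, that a per-feature threshold is stored.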

d20_jumprelu_L10_honest_only/cfg.json ADDED

	@@ -0,0 +1,23 @@

+{
+  "architecture": "jumprelu",
+  "d_in": 1280,
+  "d_sae": 5120,
+  "dtype": "float32",
+  "device": "cpu",
+  "model_name": "karpathy/nanochat-d20",
+  "hook_name": "transformer.h.10",
+  "hook_layer": 10,
+  "hook_head_index": null,
+  "activation_fn_str": "jumprelu",
+  "activation_fn_kwargs": {},
+  "apply_b_dec_to_input": false,
+  "finetuning_scaling_factor": false,
+  "sae_lens_training_version": "deception-behavioral-v1",
+  "prepend_bos": false,
+  "dataset_path": "Solshine/deception-behavioral-multimodel",
+  "dataset_trust_remote_code": false,
+  "context_size": null,
+  "normalize_activations": "none",
+  "training_notes": "Deception behavioral SAE \u2014 same-prompt behavioral sampling. Model: karpathy/nanochat-d20, Layer 10, jumprelu. See https://github.com/SolshineCode/deception-nanochat-sae-research",
+  "source_repo": "https://github.com/SolshineCode/deception-nanochat-sae-research"
+}

d20_jumprelu_L10_honest_only/sae_weights.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:1fa8acaff891b62eba2be1545b301037e343bfd382f35a311fa36907b7787801
+size 52475272

d20_jumprelu_L10_mixed/cfg.json ADDED

	@@ -0,0 +1,23 @@

+{
+  "architecture": "jumprelu",
+  "d_in": 1280,
+  "d_sae": 5120,
+  "dtype": "float32",
+  "device": "cpu",
+  "model_name": "karpathy/nanochat-d20",
+  "hook_name": "transformer.h.10",
+  "hook_layer": 10,
+  "hook_head_index": null,
+  "activation_fn_str": "jumprelu",
+  "activation_fn_kwargs": {},
+  "apply_b_dec_to_input": false,
+  "finetuning_scaling_factor": false,
+  "sae_lens_training_version": "deception-behavioral-v1",
+  "prepend_bos": false,
+  "dataset_path": "Solshine/deception-behavioral-multimodel",
+  "dataset_trust_remote_code": false,
+  "context_size": null,
+  "normalize_activations": "none",
+  "training_notes": "Deception behavioral SAE \u2014 same-prompt behavioral sampling. Model: karpathy/nanochat-d20, Layer 10, jumprelu. See https://github.com/SolshineCode/deception-nanochat-sae-research",
+  "source_repo": "https://github.com/SolshineCode/deception-nanochat-sae-research"
+}

d20_jumprelu_L10_mixed/sae_weights.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4f2791a0c4d6bd906f93d47b1bbae82621114bae50b284c207dbb1244c32d28e
+size 52475272

d20_jumprelu_L14_deceptive_only/cfg.json ADDED

	@@ -0,0 +1,23 @@

+{
+  "architecture": "jumprelu",
+  "d_in": 1280,
+  "d_sae": 5120,
+  "dtype": "float32",
+  "device": "cpu",
+  "model_name": "karpathy/nanochat-d20",
+  "hook_name": "transformer.h.14",
+  "hook_layer": 14,
+  "hook_head_index": null,
+  "activation_fn_str": "jumprelu",
+  "activation_fn_kwargs": {},
+  "apply_b_dec_to_input": false,
+  "finetuning_scaling_factor": false,
+  "sae_lens_training_version": "deception-behavioral-v1",
+  "prepend_bos": false,
+  "dataset_path": "Solshine/deception-behavioral-multimodel",
+  "dataset_trust_remote_code": false,
+  "context_size": null,
+  "normalize_activations": "none",
+  "training_notes": "Deception behavioral SAE \u2014 same-prompt behavioral sampling. Model: karpathy/nanochat-d20, Layer 14, jumprelu. See https://github.com/SolshineCode/deception-nanochat-sae-research",
+  "source_repo": "https://github.com/SolshineCode/deception-nanochat-sae-research"
+}

d20_jumprelu_L14_deceptive_only/sae_weights.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:9f48d0dbdf4b16efc6e3f2c9f4d9c291c83026d5932dea117497d04725d57743
+size 52475272

d20_jumprelu_L14_honest_only/cfg.json ADDED

	@@ -0,0 +1,23 @@

+{
+  "architecture": "jumprelu",
+  "d_in": 1280,
+  "d_sae": 5120,
+  "dtype": "float32",
+  "device": "cpu",
+  "model_name": "karpathy/nanochat-d20",
+  "hook_name": "transformer.h.14",
+  "hook_layer": 14,
+  "hook_head_index": null,
+  "activation_fn_str": "jumprelu",
+  "activation_fn_kwargs": {},
+  "apply_b_dec_to_input": false,
+  "finetuning_scaling_factor": false,
+  "sae_lens_training_version": "deception-behavioral-v1",
+  "prepend_bos": false,
+  "dataset_path": "Solshine/deception-behavioral-multimodel",
+  "dataset_trust_remote_code": false,
+  "context_size": null,
+  "normalize_activations": "none",
+  "training_notes": "Deception behavioral SAE \u2014 same-prompt behavioral sampling. Model: karpathy/nanochat-d20, Layer 14, jumprelu. See https://github.com/SolshineCode/deception-nanochat-sae-research",
+  "source_repo": "https://github.com/SolshineCode/deception-nanochat-sae-research"
+}

d20_jumprelu_L14_honest_only/sae_weights.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:2c947de17dcbf644c39a6739fff5a5ffcb5f0a4f91301096969e818322a710de
+size 52475272

d20_jumprelu_L14_mixed/cfg.json ADDED

	@@ -0,0 +1,23 @@

+{
+  "architecture": "jumprelu",
+  "d_in": 1280,
+  "d_sae": 5120,
+  "dtype": "float32",
+  "device": "cpu",
+  "model_name": "karpathy/nanochat-d20",
+  "hook_name": "transformer.h.14",
+  "hook_layer": 14,
+  "hook_head_index": null,
+  "activation_fn_str": "jumprelu",
+  "activation_fn_kwargs": {},
+  "apply_b_dec_to_input": false,
+  "finetuning_scaling_factor": false,
+  "sae_lens_training_version": "deception-behavioral-v1",
+  "prepend_bos": false,
+  "dataset_path": "Solshine/deception-behavioral-multimodel",
+  "dataset_trust_remote_code": false,
+  "context_size": null,
+  "normalize_activations": "none",
+  "training_notes": "Deception behavioral SAE \u2014 same-prompt behavioral sampling. Model: karpathy/nanochat-d20, Layer 14, jumprelu. See https://github.com/SolshineCode/deception-nanochat-sae-research",
+  "source_repo": "https://github.com/SolshineCode/deception-nanochat-sae-research"
+}

d20_jumprelu_L14_mixed/sae_weights.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:93b6f726f410d888c15122af0aebe96f78963df83409953498811d1b3507fbfb
+size 52475272

d20_jumprelu_L18_deceptive_only/cfg.json ADDED

	@@ -0,0 +1,23 @@

+{
+  "architecture": "jumprelu",
+  "d_in": 1280,
+  "d_sae": 5120,
+  "dtype": "float32",
+  "device": "cpu",
+  "model_name": "karpathy/nanochat-d20",
+  "hook_name": "transformer.h.18",
+  "hook_layer": 18,
+  "hook_head_index": null,
+  "activation_fn_str": "jumprelu",
+  "activation_fn_kwargs": {},
+  "apply_b_dec_to_input": false,
+  "finetuning_scaling_factor": false,
+  "sae_lens_training_version": "deception-behavioral-v1",
+  "prepend_bos": false,
+  "dataset_path": "Solshine/deception-behavioral-multimodel",
+  "dataset_trust_remote_code": false,
+  "context_size": null,
+  "normalize_activations": "none",
+  "training_notes": "Deception behavioral SAE \u2014 same-prompt behavioral sampling. Model: karpathy/nanochat-d20, Layer 18, jumprelu. See https://github.com/SolshineCode/deception-nanochat-sae-research",
+  "source_repo": "https://github.com/SolshineCode/deception-nanochat-sae-research"
+}

d20_jumprelu_L18_deceptive_only/sae_weights.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:84f9cf2f3df9b455c31d7dfa6d020d7833fb5997e4942a901aa13972209ec425
+size 52475272

d20_jumprelu_L18_honest_only/cfg.json ADDED

	@@ -0,0 +1,23 @@

+{
+  "architecture": "jumprelu",
+  "d_in": 1280,
+  "d_sae": 5120,
+  "dtype": "float32",
+  "device": "cpu",
+  "model_name": "karpathy/nanochat-d20",
+  "hook_name": "transformer.h.18",
+  "hook_layer": 18,
+  "hook_head_index": null,
+  "activation_fn_str": "jumprelu",
+  "activation_fn_kwargs": {},
+  "apply_b_dec_to_input": false,
+  "finetuning_scaling_factor": false,
+  "sae_lens_training_version": "deception-behavioral-v1",
+  "prepend_bos": false,
+  "dataset_path": "Solshine/deception-behavioral-multimodel",
+  "dataset_trust_remote_code": false,
+  "context_size": null,
+  "normalize_activations": "none",
+  "training_notes": "Deception behavioral SAE \u2014 same-prompt behavioral sampling. Model: karpathy/nanochat-d20, Layer 18, jumprelu. See https://github.com/SolshineCode/deception-nanochat-sae-research",
+  "source_repo": "https://github.com/SolshineCode/deception-nanochat-sae-research"
+}

d20_jumprelu_L18_honest_only/sae_weights.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:bd323fe9111e8252a4eb1a7bf98412b5bc565a24f065aff59763391028d128d5
+size 52475272

d20_jumprelu_L18_mixed/cfg.json ADDED

	@@ -0,0 +1,23 @@

+{
+  "architecture": "jumprelu",
+  "d_in": 1280,
+  "d_sae": 5120,
+  "dtype": "float32",
+  "device": "cpu",
+  "model_name": "karpathy/nanochat-d20",
+  "hook_name": "transformer.h.18",
+  "hook_layer": 18,
+  "hook_head_index": null,
+  "activation_fn_str": "jumprelu",
+  "activation_fn_kwargs": {},
+  "apply_b_dec_to_input": false,
+  "finetuning_scaling_factor": false,
+  "sae_lens_training_version": "deception-behavioral-v1",
+  "prepend_bos": false,
+  "dataset_path": "Solshine/deception-behavioral-multimodel",
+  "dataset_trust_remote_code": false,
+  "context_size": null,
+  "normalize_activations": "none",
+  "training_notes": "Deception behavioral SAE \u2014 same-prompt behavioral sampling. Model: karpathy/nanochat-d20, Layer 18, jumprelu. See https://github.com/SolshineCode/deception-nanochat-sae-research",
+  "source_repo": "https://github.com/SolshineCode/deception-nanochat-sae-research"
+}

d20_jumprelu_L18_mixed/sae_weights.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c067cf83457eac9b858cd73d99e6fe432f99823baaab576a2c628b430068e5e1
+size 52475272

d20_jumprelu_L2_deceptive_only/cfg.json ADDED

	@@ -0,0 +1,23 @@

+{
+  "architecture": "jumprelu",
+  "d_in": 1280,
+  "d_sae": 5120,
+  "dtype": "float32",
+  "device": "cpu",
+  "model_name": "karpathy/nanochat-d20",
+  "hook_name": "transformer.h.2",
+  "hook_layer": 2,
+  "hook_head_index": null,
+  "activation_fn_str": "jumprelu",
+  "activation_fn_kwargs": {},
+  "apply_b_dec_to_input": false,
+  "finetuning_scaling_factor": false,
+  "sae_lens_training_version": "deception-behavioral-v1",
+  "prepend_bos": false,
+  "dataset_path": "Solshine/deception-behavioral-multimodel",
+  "dataset_trust_remote_code": false,
+  "context_size": null,
+  "normalize_activations": "none",
+  "training_notes": "Deception behavioral SAE \u2014 same-prompt behavioral sampling. Model: karpathy/nanochat-d20, Layer 2, jumprelu. See https://github.com/SolshineCode/deception-nanochat-sae-research",
+  "source_repo": "https://github.com/SolshineCode/deception-nanochat-sae-research"
+}

d20_jumprelu_L2_deceptive_only/sae_weights.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c123a302a2fa65c4a70c4d92a8fb1dc6134fa67f82852a64938136d852d1eb95
+size 52475272

d20_jumprelu_L2_honest_only/cfg.json ADDED

	@@ -0,0 +1,23 @@

+{
+  "architecture": "jumprelu",
+  "d_in": 1280,
+  "d_sae": 5120,
+  "dtype": "float32",
+  "device": "cpu",
+  "model_name": "karpathy/nanochat-d20",
+  "hook_name": "transformer.h.2",
+  "hook_layer": 2,
+  "hook_head_index": null,
+  "activation_fn_str": "jumprelu",
+  "activation_fn_kwargs": {},
+  "apply_b_dec_to_input": false,
+  "finetuning_scaling_factor": false,
+  "sae_lens_training_version": "deception-behavioral-v1",
+  "prepend_bos": false,
+  "dataset_path": "Solshine/deception-behavioral-multimodel",
+  "dataset_trust_remote_code": false,
+  "context_size": null,
+  "normalize_activations": "none",
+  "training_notes": "Deception behavioral SAE \u2014 same-prompt behavioral sampling. Model: karpathy/nanochat-d20, Layer 2, jumprelu. See https://github.com/SolshineCode/deception-nanochat-sae-research",
+  "source_repo": "https://github.com/SolshineCode/deception-nanochat-sae-research"
+}

d20_jumprelu_L2_honest_only/sae_weights.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:03787ad756a1e60a282884cacf72ca6964e23fabd539004835c31a4aff878a03
+size 52475272

d20_jumprelu_L2_mixed/cfg.json ADDED

	@@ -0,0 +1,23 @@

+{
+  "architecture": "jumprelu",
+  "d_in": 1280,
+  "d_sae": 5120,
+  "dtype": "float32",
+  "device": "cpu",
+  "model_name": "karpathy/nanochat-d20",
+  "hook_name": "transformer.h.2",
+  "hook_layer": 2,
+  "hook_head_index": null,
+  "activation_fn_str": "jumprelu",
+  "activation_fn_kwargs": {},
+  "apply_b_dec_to_input": false,
+  "finetuning_scaling_factor": false,
+  "sae_lens_training_version": "deception-behavioral-v1",
+  "prepend_bos": false,
+  "dataset_path": "Solshine/deception-behavioral-multimodel",
+  "dataset_trust_remote_code": false,
+  "context_size": null,
+  "normalize_activations": "none",
+  "training_notes": "Deception behavioral SAE \u2014 same-prompt behavioral sampling. Model: karpathy/nanochat-d20, Layer 2, jumprelu. See https://github.com/SolshineCode/deception-nanochat-sae-research",
+  "source_repo": "https://github.com/SolshineCode/deception-nanochat-sae-research"
+}

d20_jumprelu_L2_mixed/sae_weights.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e945cec230f3f9d8e3cbd10efc2dcf817a5f072c03253e40d21af04c0442aba6
+size 52475272

d20_jumprelu_L4_deceptive_only/cfg.json ADDED

	@@ -0,0 +1,23 @@

+{
+  "architecture": "jumprelu",
+  "d_in": 1280,
+  "d_sae": 5120,
+  "dtype": "float32",
+  "device": "cpu",
+  "model_name": "karpathy/nanochat-d20",
+  "hook_name": "transformer.h.4",
+  "hook_layer": 4,
+  "hook_head_index": null,
+  "activation_fn_str": "jumprelu",
+  "activation_fn_kwargs": {},
+  "apply_b_dec_to_input": false,
+  "finetuning_scaling_factor": false,
+  "sae_lens_training_version": "deception-behavioral-v1",
+  "prepend_bos": false,
+  "dataset_path": "Solshine/deception-behavioral-multimodel",
+  "dataset_trust_remote_code": false,
+  "context_size": null,
+  "normalize_activations": "none",
+  "training_notes": "Deception behavioral SAE \u2014 same-prompt behavioral sampling. Model: karpathy/nanochat-d20, Layer 4, jumprelu. See https://github.com/SolshineCode/deception-nanochat-sae-research",
+  "source_repo": "https://github.com/SolshineCode/deception-nanochat-sae-research"
+}

d20_jumprelu_L4_deceptive_only/sae_weights.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:659910ca6dfa1e448ab6868976ecd700eb2b9839a7e6def02ab113c5a0e1043a
+size 52475272

d20_jumprelu_L4_honest_only/cfg.json ADDED

	@@ -0,0 +1,23 @@

+{
+  "architecture": "jumprelu",
+  "d_in": 1280,
+  "d_sae": 5120,
+  "dtype": "float32",
+  "device": "cpu",
+  "model_name": "karpathy/nanochat-d20",
+  "hook_name": "transformer.h.4",
+  "hook_layer": 4,
+  "hook_head_index": null,
+  "activation_fn_str": "jumprelu",
+  "activation_fn_kwargs": {},
+  "apply_b_dec_to_input": false,
+  "finetuning_scaling_factor": false,
+  "sae_lens_training_version": "deception-behavioral-v1",
+  "prepend_bos": false,
+  "dataset_path": "Solshine/deception-behavioral-multimodel",
+  "dataset_trust_remote_code": false,
+  "context_size": null,
+  "normalize_activations": "none",
+  "training_notes": "Deception behavioral SAE \u2014 same-prompt behavioral sampling. Model: karpathy/nanochat-d20, Layer 4, jumprelu. See https://github.com/SolshineCode/deception-nanochat-sae-research",
+  "source_repo": "https://github.com/SolshineCode/deception-nanochat-sae-research"
+}

d20_jumprelu_L4_honest_only/sae_weights.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:1f201daada8c7efdbd486e54ae176f8e03908fc58360753e8d688728565d155c
+size 52475272

d20_jumprelu_L4_mixed/cfg.json ADDED

	@@ -0,0 +1,23 @@

+{
+  "architecture": "jumprelu",
+  "d_in": 1280,
+  "d_sae": 5120,
+  "dtype": "float32",
+  "device": "cpu",
+  "model_name": "karpathy/nanochat-d20",
+  "hook_name": "transformer.h.4",
+  "hook_layer": 4,
+  "hook_head_index": null,
+  "activation_fn_str": "jumprelu",
+  "activation_fn_kwargs": {},
+  "apply_b_dec_to_input": false,
+  "finetuning_scaling_factor": false,
+  "sae_lens_training_version": "deception-behavioral-v1",
+  "prepend_bos": false,
+  "dataset_path": "Solshine/deception-behavioral-multimodel",
+  "dataset_trust_remote_code": false,
+  "context_size": null,
+  "normalize_activations": "none",
+  "training_notes": "Deception behavioral SAE \u2014 same-prompt behavioral sampling. Model: karpathy/nanochat-d20, Layer 4, jumprelu. See https://github.com/SolshineCode/deception-nanochat-sae-research",
+  "source_repo": "https://github.com/SolshineCode/deception-nanochat-sae-research"
+}

d20_jumprelu_L4_mixed/sae_weights.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4a5bb121f7146fb52ba7114f04eccfdb9b0a5edd19f51ca600223ad8700b7480
+size 52475272

d20_jumprelu_L8_deceptive_only/cfg.json ADDED

	@@ -0,0 +1,23 @@

+{
+  "architecture": "jumprelu",
+  "d_in": 1280,
+  "d_sae": 5120,
+  "dtype": "float32",
+  "device": "cpu",
+  "model_name": "karpathy/nanochat-d20",
+  "hook_name": "transformer.h.8",
+  "hook_layer": 8,
+  "hook_head_index": null,
+  "activation_fn_str": "jumprelu",
+  "activation_fn_kwargs": {},
+  "apply_b_dec_to_input": false,
+  "finetuning_scaling_factor": false,
+  "sae_lens_training_version": "deception-behavioral-v1",
+  "prepend_bos": false,
+  "dataset_path": "Solshine/deception-behavioral-multimodel",
+  "dataset_trust_remote_code": false,
+  "context_size": null,
+  "normalize_activations": "none",
+  "training_notes": "Deception behavioral SAE \u2014 same-prompt behavioral sampling. Model: karpathy/nanochat-d20, Layer 8, jumprelu. See https://github.com/SolshineCode/deception-nanochat-sae-research",
+  "source_repo": "https://github.com/SolshineCode/deception-nanochat-sae-research"
+}

d20_jumprelu_L8_deceptive_only/sae_weights.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d23b57485e5312179551b8258dcf733876659c9eca59a52077fcb6dd0ebe3c05
+size 52475272
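Each `sae_weights.safetensors` entry in this diff is a Git LFS pointer (spec version, SHA-256 `oid`, and byte `size`), not the weights themselves. A downloaded file can be checked against its pointer roughly as follows. This is a sketch using only the standard library; the helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical and not part of the repository.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer.

    Leading '+' diff markers are tolerated so the text can be pasted
    straight from a diff view.
    """
    fields = {}
    for line in text.splitlines():
        line = line.lstrip("+").strip()
        if not line:
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


def matches_pointer(path: Path, pointer_text: str) -> bool:
    """Return True when the file's size and SHA-256 match the pointer."""
    fields = parse_lfs_pointer(pointer_text)
    algo, _, expected_hex = fields["oid"].partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported oid algorithm: {algo}")

    digest = hashlib.sha256()
    size = 0
    with path.open("rb") as fh:
        # Read in 1 MiB chunks so large weight files are not loaded whole.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
            size += len(chunk)

    return size == int(fields["size"]) and digest.hexdigest() == expected_hex
```

Passing the three pointer lines above together with the path of the downloaded `d20_jumprelu_L8_deceptive_only/sae_weights.safetensors` should return `True` for an intact download.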

d20_jumprelu_L8_honest_only/cfg.json ADDED

	@@ -0,0 +1,23 @@

+{
+  "architecture": "jumprelu",
+  "d_in": 1280,
+  "d_sae": 5120,
+  "dtype": "float32",
+  "device": "cpu",
+  "model_name": "karpathy/nanochat-d20",
+  "hook_name": "transformer.h.8",
+  "hook_layer": 8,
+  "hook_head_index": null,
+  "activation_fn_str": "jumprelu",
+  "activation_fn_kwargs": {},
+  "apply_b_dec_to_input": false,
+  "finetuning_scaling_factor": false,
+  "sae_lens_training_version": "deception-behavioral-v1",
+  "prepend_bos": false,
+  "dataset_path": "Solshine/deception-behavioral-multimodel",
+  "dataset_trust_remote_code": false,
+  "context_size": null,
+  "normalize_activations": "none",
+  "training_notes": "Deception behavioral SAE \u2014 same-prompt behavioral sampling. Model: karpathy/nanochat-d20, Layer 8, jumprelu. See https://github.com/SolshineCode/deception-nanochat-sae-research",
+  "source_repo": "https://github.com/SolshineCode/deception-nanochat-sae-research"
+}

d20_jumprelu_L8_honest_only/sae_weights.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:17b6a84176fb1de8f1ef94fa54a0c3d362603495a71beb25898b212af1c57263
+size 52475272

d20_jumprelu_L8_mixed/cfg.json ADDED

	@@ -0,0 +1,23 @@

+{
+  "architecture": "jumprelu",
+  "d_in": 1280,
+  "d_sae": 5120,
+  "dtype": "float32",
+  "device": "cpu",
+  "model_name": "karpathy/nanochat-d20",
+  "hook_name": "transformer.h.8",
+  "hook_layer": 8,
+  "hook_head_index": null,
+  "activation_fn_str": "jumprelu",
+  "activation_fn_kwargs": {},
+  "apply_b_dec_to_input": false,
+  "finetuning_scaling_factor": false,
+  "sae_lens_training_version": "deception-behavioral-v1",
+  "prepend_bos": false,
+  "dataset_path": "Solshine/deception-behavioral-multimodel",
+  "dataset_trust_remote_code": false,
+  "context_size": null,
+  "normalize_activations": "none",
+  "training_notes": "Deception behavioral SAE \u2014 same-prompt behavioral sampling. Model: karpathy/nanochat-d20, Layer 8, jumprelu. See https://github.com/SolshineCode/deception-nanochat-sae-research",
+  "source_repo": "https://github.com/SolshineCode/deception-nanochat-sae-research"
+}

d20_jumprelu_L8_mixed/sae_weights.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ca72b83da9ebed2cc0de114dbbaad61003c25b92321538de58a9bd39bdba17f5
+size 52475272

d20_jumprelu_ste_L10_deceptive_only/cfg.json ADDED

	@@ -0,0 +1,24 @@

+{
+  "architecture": "jumprelu",
+  "d_in": 1280,
+  "d_sae": 5120,
+  "dtype": "float32",
+  "device": "cpu",
+  "model_name": "karpathy/nanochat-d20",
+  "hook_name": "model.layers.10",
+  "hook_layer": 10,
+  "hook_head_index": null,
+  "activation_fn_str": "jumprelu",
+  "activation_fn_kwargs": {},
+  "apply_b_dec_to_input": false,
+  "finetuning_scaling_factor": false,
+  "sae_lens_training_version": "deception-behavioral-ste-v1",
+  "prepend_bos": false,
+  "dataset_path": "Solshine/deception-behavioral-multimodel",
+  "dataset_trust_remote_code": false,
+  "context_size": null,
+  "normalize_activations": "none",
+  "training_condition": "deceptive_only",
+  "training_notes": "STE validation SAE (2026-04-11) \u2014 deceptive_only training, Gaussian-kernel STE fix (arXiv:2407.14435). See https://github.com/SolshineCode/deception-nanochat-sae-research",
+  "ste_note": "STE validation SAE: thresholds trained via Gaussian-kernel STE (Rajamanoharan et al. 2024, arXiv:2407.14435). Trained for 300 epochs. See entries #61/#62 in RESULTS_INDEX.md for probe accuracy comparison vs TopK."
+}
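The `ste_note` field says the thresholds were trained with a Gaussian-kernel straight-through estimator (STE). The true derivative of JumpReLU with respect to its threshold is zero almost everywhere, so the JumpReLU paper (arXiv:2407.14435) substitutes a kernel-based pseudo-derivative. The following is a scalar sketch of that idea, not the repository's training code. The `bandwidth` parameter and its default are assumptions made for illustration.

```python
import math


def jumprelu(z: float, theta: float) -> float:
    """JumpReLU forward pass: keep z only when it exceeds the threshold."""
    return z if z > theta else 0.0


def gaussian_kernel(u: float) -> float:
    """Standard normal density, used here as the STE kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def jumprelu_threshold_pseudo_grad(
    z: float, theta: float, bandwidth: float = 0.001
) -> float:
    """Pseudo-derivative of JumpReLU w.r.t. theta.

    Uses the form -(theta / eps) * K((z - theta) / eps), where eps is the
    kernel bandwidth (a hypothetical default here).
    """
    u = (z - theta) / bandwidth
    return -(theta / bandwidth) * gaussian_kernel(u)
```

The pseudo-gradient is only appreciable when the pre-activation `z` lies within a few bandwidths of `theta`, so a threshold is updated mainly by examples that sit close to it.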

d20_jumprelu_ste_L10_deceptive_only/sae_weights.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e9197f52fa0a3b8b0628c2f5d3f055a271d61dad03f3e53043d5d5bb2941c441
+size 52475272

d20_jumprelu_ste_L10_honest_only/cfg.json ADDED

	@@ -0,0 +1,24 @@

+{
+  "architecture": "jumprelu",
+  "d_in": 1280,
+  "d_sae": 5120,
+  "dtype": "float32",
+  "device": "cpu",
+  "model_name": "karpathy/nanochat-d20",
+  "hook_name": "model.layers.10",
+  "hook_layer": 10,
+  "hook_head_index": null,
+  "activation_fn_str": "jumprelu",
+  "activation_fn_kwargs": {},
+  "apply_b_dec_to_input": false,
+  "finetuning_scaling_factor": false,
+  "sae_lens_training_version": "deception-behavioral-ste-v1",
+  "prepend_bos": false,
+  "dataset_path": "Solshine/deception-behavioral-multimodel",
+  "dataset_trust_remote_code": false,
+  "context_size": null,
+  "normalize_activations": "none",
+  "training_condition": "honest_only",
+  "training_notes": "STE validation SAE (2026-04-11) \u2014 honest_only training, Gaussian-kernel STE fix (arXiv:2407.14435). See https://github.com/SolshineCode/deception-nanochat-sae-research",
+  "ste_note": "STE validation SAE: thresholds trained via Gaussian-kernel STE (Rajamanoharan et al. 2024, arXiv:2407.14435). Trained for 300 epochs. See entries #61/#62 in RESULTS_INDEX.md for probe accuracy comparison vs TopK."
+}
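The two config families in this commit name their hook points differently: the `d20_jumprelu_*` configs use `transformer.h.N`, while the `d20_jumprelu_ste_*` configs use `model.layers.N`. Code that consumes these files may want to rely on `hook_layer` and treat `hook_name` as informational. The following is a small sanity-check sketch; the helper `hook_layer_consistent` is hypothetical, and the dictionaries copy only the relevant fields from the configs in this diff.

```python
def hook_layer_consistent(cfg: dict) -> bool:
    """True when the trailing integer of hook_name equals hook_layer."""
    tail = cfg["hook_name"].rsplit(".", 1)[-1]
    return tail.isdigit() and int(tail) == cfg["hook_layer"]


def expansion_factor(cfg: dict) -> float:
    """Ratio of SAE latents to input dimensions."""
    return cfg["d_sae"] / cfg["d_in"]


# Relevant fields copied from two configs in this commit.
base_cfg = {"hook_name": "transformer.h.8", "hook_layer": 8,
            "d_in": 1280, "d_sae": 5120}
ste_cfg = {"hook_name": "model.layers.10", "hook_layer": 10,
           "d_in": 1280, "d_sae": 5120}

for cfg in (base_cfg, ste_cfg):
    print(cfg["hook_name"], hook_layer_consistent(cfg), expansion_factor(cfg))
```

Both naming schemes pass the check, and both families use the same 4x expansion (5120 latents over a 1280-dimensional input).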

d20_jumprelu_ste_L10_honest_only/sae_weights.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3817a36427111d4988927011ba41f7e0f4c4557a0e54fd7f8c68be88903e5aba
+size 52475272

d20_jumprelu_ste_L10_mixed/cfg.json ADDED

	@@ -0,0 +1,24 @@

+{
+  "architecture": "jumprelu",
+  "d_in": 1280,
+  "d_sae": 5120,
+  "dtype": "float32",
+  "device": "cpu",
+  "model_name": "karpathy/nanochat-d20",
+  "hook_name": "model.layers.10",
+  "hook_layer": 10,
+  "hook_head_index": null,
+  "activation_fn_str": "jumprelu",
+  "activation_fn_kwargs": {},
+  "apply_b_dec_to_input": false,
+  "finetuning_scaling_factor": false,
+  "sae_lens_training_version": "deception-behavioral-ste-v1",
+  "prepend_bos": false,
+  "dataset_path": "Solshine/deception-behavioral-multimodel",
+  "dataset_trust_remote_code": false,
+  "context_size": null,
+  "normalize_activations": "none",
+  "training_condition": "mixed",
+  "training_notes": "STE validation SAE (2026-04-11) \u2014 mixed training, Gaussian-kernel STE fix (arXiv:2407.14435). See https://github.com/SolshineCode/deception-nanochat-sae-research",
+  "ste_note": "STE validation SAE: thresholds trained via Gaussian-kernel STE (Rajamanoharan et al. 2024, arXiv:2407.14435). Trained for 300 epochs. See entries #61/#62 in RESULTS_INDEX.md for probe accuracy comparison vs TopK."
+}

d20_jumprelu_ste_L10_mixed/sae_weights.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a1ab333e58a17d0b66347283a80362dad7084daeeed74a2f64151226bd8449d6
+size 52475272

d20_jumprelu_ste_L14_deceptive_only/cfg.json ADDED

	@@ -0,0 +1,24 @@

+{
+  "architecture": "jumprelu",
+  "d_in": 1280,
+  "d_sae": 5120,
+  "dtype": "float32",
+  "device": "cpu",
+  "model_name": "karpathy/nanochat-d20",
+  "hook_name": "model.layers.14",
+  "hook_layer": 14,
+  "hook_head_index": null,
+  "activation_fn_str": "jumprelu",
+  "activation_fn_kwargs": {},
+  "apply_b_dec_to_input": false,
+  "finetuning_scaling_factor": false,
+  "sae_lens_training_version": "deception-behavioral-ste-v1",
+  "prepend_bos": false,
+  "dataset_path": "Solshine/deception-behavioral-multimodel",
+  "dataset_trust_remote_code": false,
+  "context_size": null,
+  "normalize_activations": "none",
+  "training_condition": "deceptive_only",
+  "training_notes": "STE validation SAE (2026-04-11) \u2014 deceptive_only training, Gaussian-kernel STE fix (arXiv:2407.14435). See https://github.com/SolshineCode/deception-nanochat-sae-research",
+  "ste_note": "STE validation SAE: thresholds trained via Gaussian-kernel STE (Rajamanoharan et al. 2024, arXiv:2407.14435). Trained for 300 epochs. See entries #61/#62 in RESULTS_INDEX.md for probe accuracy comparison vs TopK."
+}

d20_jumprelu_ste_L14_deceptive_only/sae_weights.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:53862b8a592b6e290d92188c9e1b1e7050ece8e4723631ce23baee33fb618528
+size 52475272

d20_jumprelu_ste_L14_honest_only/cfg.json ADDED

	@@ -0,0 +1,24 @@

+{
+  "architecture": "jumprelu",
+  "d_in": 1280,
+  "d_sae": 5120,
+  "dtype": "float32",
+  "device": "cpu",
+  "model_name": "karpathy/nanochat-d20",
+  "hook_name": "model.layers.14",
+  "hook_layer": 14,
+  "hook_head_index": null,
+  "activation_fn_str": "jumprelu",
+  "activation_fn_kwargs": {},
+  "apply_b_dec_to_input": false,
+  "finetuning_scaling_factor": false,
+  "sae_lens_training_version": "deception-behavioral-ste-v1",
+  "prepend_bos": false,
+  "dataset_path": "Solshine/deception-behavioral-multimodel",
+  "dataset_trust_remote_code": false,
+  "context_size": null,
+  "normalize_activations": "none",
+  "training_condition": "honest_only",
+  "training_notes": "STE validation SAE (2026-04-11) \u2014 honest_only training, Gaussian-kernel STE fix (arXiv:2407.14435). See https://github.com/SolshineCode/deception-nanochat-sae-research",
+  "ste_note": "STE validation SAE: thresholds trained via Gaussian-kernel STE (Rajamanoharan et al. 2024, arXiv:2407.14435). Trained for 300 epochs. See entries #61/#62 in RESULTS_INDEX.md for probe accuracy comparison vs TopK."
+}

d20_jumprelu_ste_L14_honest_only/sae_weights.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ca090dad90ed0e54105b37b5b2e6be6cd7ec4b589e7ced771d192d00ecd8ab7c
+size 52475272

d20_jumprelu_ste_L14_mixed/cfg.json ADDED

	@@ -0,0 +1,24 @@

+{
+  "architecture": "jumprelu",
+  "d_in": 1280,
+  "d_sae": 5120,
+  "dtype": "float32",
+  "device": "cpu",
+  "model_name": "karpathy/nanochat-d20",
+  "hook_name": "model.layers.14",
+  "hook_layer": 14,
+  "hook_head_index": null,
+  "activation_fn_str": "jumprelu",
+  "activation_fn_kwargs": {},
+  "apply_b_dec_to_input": false,
+  "finetuning_scaling_factor": false,
+  "sae_lens_training_version": "deception-behavioral-ste-v1",
+  "prepend_bos": false,
+  "dataset_path": "Solshine/deception-behavioral-multimodel",
+  "dataset_trust_remote_code": false,
+  "context_size": null,
+  "normalize_activations": "none",
+  "training_condition": "mixed",
+  "training_notes": "STE validation SAE (2026-04-11) \u2014 mixed training, Gaussian-kernel STE fix (arXiv:2407.14435). See https://github.com/SolshineCode/deception-nanochat-sae-research",
+  "ste_note": "STE validation SAE: thresholds trained via Gaussian-kernel STE (Rajamanoharan et al. 2024, arXiv:2407.14435). Trained for 300 epochs. See entries #61/#62 in RESULTS_INDEX.md for probe accuracy comparison vs TopK."
+}

d20_jumprelu_ste_L14_mixed/sae_weights.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d349a99b8412a6f70beff23b3fd593b710325b49f4512e2f856ebf2df2008594
+size 52475272