Headline summary.
One row per test family. Every cell is an outcome on a public benchmark or held-out evaluation set against a standard model checkpoint.
| # | Family | Model | Headline | Status |
|---|---|---|---|---|
| 01 | Phase 0 — mechanism eval | Qwen2.5-Coder-1.5B | Oracle ρ = 1.000 at N=50; full ρ = 0.926 paraphrased | PASS |
| 02 | 7B headline (Acme KB) | Qwen2.5-7B-Instruct | Oracle ρ = 1.000; full ρ = 0.940 paraphrased N=50 | PASS |
| 03 | Cross-architecture transfer | Qwen / Llama / Mistral · 0.5B–72B | Oracle 1.000 on every model; full ρ 0.900–0.953 | PASS |
| 04 | Capacity sweep | Qwen2.5-7B-Instruct | 0.940 → 0.902 paraphrased ρ from N=50 → N=500, no cliff | PASS |
| 05 | Behavioral suite — 1.5B | Qwen2.5-Coder-1.5B | 10/11 PASS, 1 PARTIAL | PASS |
| 06 | Behavioral suite — 7B | Qwen2.5-7B-Instruct | 10/11 PASS, 1 PARTIAL · zero net regressions vs 1.5B | PASS |
| 07 | Capacity stress — 1.5B | Qwen2.5-Coder-1.5B | 99.8% routing @ N=500 · 0 leaks · 257 ms/q (O(1) in N) | PASS |
| 08 | Capacity stress — 7B | Qwen2.5-7B-Instruct | 100% routing @ N=100 · 0 leaks · 815 ms/q | PASS |
| 09 | No-harm (MMLU + GSM8K + HumanEval) | Qwen2.5-7B-Instruct · N=100 active | 0 / 15,498 fires · output bit-identical to baseline | PASS |
| 10 | Head-to-head vs ROME | Qwen2.5-7B-Instruct | WRIT 0.940 vs ROME 0.020 paraphrased N=50 (~47×) | PASS |
| 11 | Head-to-head vs MEMIT | Qwen2.5-7B-Instruct | WRIT 0.940 vs MEMIT 0.070 paraphrased N=50 (~13×) | PASS |
| 12 | Head-to-head vs RAG (top-3) | Qwen2.5-7B-Instruct | WRIT 0.940 vs RAG 0.853 paraphrased · 0 retrieved tokens | PASS |
| 13 | Head-to-head vs LoRA | Qwen2.5-7B-Instruct · 15-cell sweep | LoRA-best 0.954 / −0.200 MMLU vs WRIT 0.902 / 0.000 MMLU | PASS |
| 14 | Static jailbreak families (defense) | Qwen2.5-7B-Instruct · 5 families · 500 attempts | Aggregate ASR 0.158 → 0.006 · 0/20 benign fires | PASS |
| 15 | GCG defense — generic hardening | Qwen2.5-7B-Instruct · 100 GCG attacks | ASR 0.690 → 0.210 · no GCG seen during setup | PASS |
| 16 | GCG defense — targeted hardening | Qwen2.5-7B-Instruct · 75 held-out GCG | ASR 0.653 → 0.000 · 75/75 refused | PERFECT |
| 17 | Right-to-forget | Qwen2.5-7B-Instruct · 100 CounterFact edits | 36.9 µs / unlearn · FS=1.0 / RS=1.0 / ΔMMLU=0.000 · 5–7 orders faster than ROME / MEMIT / LoRA | PASS |
| 18 | Live demos (12-step cast, BTC chat) | Qwen2.5-7B-Instruct | 9/9 cast assertions · BTC live chat 5/5 smoke | PASS |
| 19 | DKT Phase A (drop-in probe) | Qwen2.5-0.5B | 40/40 across all probed depths | PASS |
| 20 | DKT Phase B (from-scratch 200M) | Tiny from-scratch transformer | Recall 40/40 · perplexity 1.76× baseline | PARTIAL |
Snapshot 2026-04-29. ρ = headroom-normalized substring-recall correlation on held-out paraphrased queries. Pre-registered kill criterion: oracle ρ < 0.75 at N=50 — never fired.
Knowledge install.
Single-fact and multi-fact installation, measured by paraphrased substring recall on held-out queries. Oracle = perfect routing, which isolates the plasticity rule. Full = end-to-end with the live dispatcher. The zero and random_B ablations are the sanity floors.
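As a rough illustration of the metric, the sketch below computes substring recall over paraphrased queries and applies one plausible headroom normalization. The function names, the case-insensitive matching, the toy data, and the formula `(score - base) / (1 - base)` are all assumptions made for illustration; the page does not spell out the exact evaluator.

```python
def substring_recall(outputs, answers):
    """Fraction of outputs that contain the expected answer as a substring."""
    if not answers or len(outputs) != len(answers):
        raise ValueError("outputs and answers must be non-empty and aligned")
    hits = sum(ans.lower() in out.lower() for out, ans in zip(outputs, answers))
    return hits / len(answers)


def headroom_normalized(score, base_score):
    """Rescale recall by the headroom left above the base model (assumed form).

    0.0 means no gain over the base model; 1.0 means every fact the base
    model missed is now recalled.
    """
    headroom = 1.0 - base_score
    if headroom <= 0.0:
        raise ValueError("base model already saturates the metric")
    return (score - base_score) / headroom


# Hypothetical toy data: two of three paraphrased queries recall the fact.
outputs = ["The Acme CEO is Dana Reyes.", "I am not sure.", "It ships from Oslo."]
answers = ["Dana Reyes", "42 employees", "Oslo"]
raw = substring_recall(outputs, answers)
rho = headroom_normalized(raw, base_score=0.0)
```

With base-ignorance-verified facts the base score is 0, so under this assumed normalization the raw and normalized values coincide.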
Phase 0 — mechanism eval (Qwen2.5-Coder-1.5B).
| N | full | zero | random_B | oracle_gate | frozen_gate |
|---|---|---|---|---|---|
| 1 | 1.000 | 0.000 | 0.000 | 1.000 | 1.000 |
| 5 | 0.971 | 0.000 | 0.000 | 1.000 | 0.248 |
| 10 | 0.962 | 0.000 | 0.000 | 1.000 | 0.114 |
| 25 | 0.968 | 0.000 | 0.000 | 1.000 | 0.038 |
| 50 | 0.926 | 0.000 | 0.000 | 1.000 | 0.012 |
7B headline — Acme KB (Qwen2.5-7B-Instruct, N=50).
| Ablation | literal | paraphrased | abstract | task |
|---|---|---|---|---|
| zero | 0.000 | 0.000 | 0.000 | 0.000 |
| random_B | 0.000 | 0.000 | 0.000 | 0.000 |
| oracle_gate | 1.000 | 1.000 | 1.000 | 1.000 |
| full | 0.980 | 0.940 | 1.000 | 0.987 |
Pre-registered MVP bars: oracle ρ ≥ 0.90 and full ρ ≥ 0.70 at N=50 paraphrased. Cleared by 24 points on full-ρ.
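The gating logic can be sketched as a small check. The thresholds and observed values are copied from this page; the variable names and the way the check is structured are illustrative assumptions, not the project's actual harness.

```python
# Pre-registered bars and kill criterion at N=50, paraphrased (from this page).
mvp_bars = {"oracle": 0.90, "full": 0.70}
kill_threshold_oracle = 0.75

# Observed 7B headline values (from the table above).
observed = {"oracle": 1.000, "full": 0.940}

# Every bar must be met for the MVP gate to pass.
mvp_passed = all(observed[name] >= bar for name, bar in mvp_bars.items())
# The kill criterion fires only if oracle rho falls below its threshold.
kill_fired = observed["oracle"] < kill_threshold_oracle
# Margin on full rho, expressed in points (hundredths).
full_margin_points = round((observed["full"] - mvp_bars["full"]) * 100)

print(mvp_passed, kill_fired, full_margin_points)
```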
Capacity sweep — paraphrased ρ vs N (Qwen2.5-7B-Instruct).
| N | Paraphrased ρ (full) | Note |
|---|---|---|
| 50 | 0.940 | matches Task 6 headline |
| 100 | 0.913 | mid-curve |
| 200 | 0.905 | early plateau |
| 500 | 0.902 | end of sweep, no degradation cliff |
Cross-architecture transfer.
The same primitive holds across three architectural families and a 144× parameter range. Per-family operating-point tuning is mechanical (one decoder layer + one fractional scale).
| Model | Params | Oracle ρ | Full ρ | Operating point |
|---|---|---|---|---|
| Qwen2.5-Coder-1.5B | 1.5B | 1.000 | 0.926 | L27 / frac=0.005 |
| Qwen2.5-7B-Instruct | 7B | 1.000 | 0.940 | L27 / frac=0.005 |
| Llama-3.1-8B-Instruct | 8B | 1.000 | 0.940 | L30 / frac≈0.5 |
| Mistral-7B-Instruct-v0.3 | 7B | 1.000 | 0.947 | L30 / frac≈0.5 |
| Qwen2.5-14B-Instruct | 14B | 1.000 | 0.900 | L47 / frac=0.05 |
| Qwen2.5-72B-Instruct | 72B | 1.000 | 0.953 | L79 / frac=0.05 |
All cells: N=50, paraphrased held-out queries. Phi and Gemma additionally sanity-validated; not in this table.
Capacity at scale.
Number of WRIT operations coexisting on a single base model, with routing accuracy and benign-fire (leak) counts.
Qwen2.5-Coder-1.5B.
| N | Routing | Control leaks | Inference |
|---|---|---|---|
| 10 | 100% | 0 | 339 ms/q |
| 50 | 100% | 0 | 244 ms/q |
| 100 | 100% | 0 | 248 ms/q |
| 200 | 99.5% | 0 | 244 ms/q |
| 300 | 99.7% | 0 | ~250 ms/q |
| 500 | 99.8% (499/500) | 0 | 257 ms/q |
Qwen2.5-7B-Instruct.
| N | Routing | Leaks | Build | Gate train | Generation |
|---|---|---|---|---|---|
| 10 | 10/10 | 0/4 | 3.1 s | 25.1 s | 1222 ms |
| 25 | 25/25 | 0/4 | 7.3 s | 39.7 s | 1025 ms |
| 50 | 50/50 | 0/4 | 13.2 s | 74.4 s | 866 ms |
| 100 | 100/100 | 0/4 | 27.5 s | 136.7 s | 815 ms |
Composability ceiling.
5,000 simultaneous operations on a 7B model.
Routing accuracy at N=5,000: 98.5% (recall 98.5% · zero cross-talk).
MMLU change with 5,000 active operations: −0.01 pp (capacity scales 10× with only a 1.3-point routing-accuracy drop).
No-harm benchmark.
The selectivity test. With N=100 active operations, the dispatcher is run against three standard public benchmarks. A “fire” means a WRIT operation activated — the headline number is how many times it did.
| Benchmark | Prompts | WRIT N=100 | Vanilla baseline | Δ | Fire rate |
|---|---|---|---|---|---|
| MMLU (5-shot) | 14,015 | 68.6100 | 68.6085 | +0.0015 | 0 / 14,015 |
| GSM8K (5-shot) | 1,319 | 69.98 | — | — | 0 / 1,319 |
| HumanEval (pass@1) | 164 | 82.32 | — | — | 0 / 164 |
| Total | 15,498 | — | — | — | 0 / 15,498 = 0.0% |
With zero fires, output is mathematically identical to the unmodified base, verified on MMLU at the 14-decimal level. Selectivity improves with N: 1/570 fires at N=10 → 0/14,015 at N=100.
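The total row can be reproduced from the per-benchmark counts. This is a minimal arithmetic sketch using the numbers in the table above; the dictionary layout is an illustrative choice.

```python
# Prompt counts and fire counts copied from the no-harm table.
benchmarks = {
    "MMLU (5-shot)": {"prompts": 14_015, "fires": 0},
    "GSM8K (5-shot)": {"prompts": 1_319, "fires": 0},
    "HumanEval (pass@1)": {"prompts": 164, "fires": 0},
}

total_prompts = sum(b["prompts"] for b in benchmarks.values())
total_fires = sum(b["fires"] for b in benchmarks.values())
fire_rate = total_fires / total_prompts

print(f"{total_fires} / {total_prompts:,} = {fire_rate:.1%}")  # 0 / 15,498 = 0.0%
```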
Head-to-head baselines.
Same model (Qwen2.5-7B-Instruct), same KB (57–553 base-ignorance-verified Acme facts), same paraphrased substring evaluator. Every method ran at the operating point recommended by its own paper or sweep.
vs ROME (Meng et al. 2022) and MEMIT (Meng et al. 2023).
| N | WRIT oracle | WRIT full | MEMIT (best) | ROME L15 |
|---|---|---|---|---|
| 10 | 1.000 | 1.000 | 0.200 | 0.100 |
| 50 | 1.000 | 0.940 | 0.070 | 0.020 |
ROME drift after 50 sequential edits: 307% of frozen-matrix Frobenius norm; perplexity 6.27 → 8.24. MEMIT’s structural improvements held (drift 19%; perplexity unchanged at N=50). Margins at N=50: ~47× over ROME, ~13× over MEMIT.
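The quoted margins are ratios of paraphrased ρ at N=50. A short sketch of that arithmetic, using the values from the table above (variable names are illustrative):

```python
# Paraphrased rho at N=50, copied from the table.
writ_full = 0.940
baselines = {"ROME L15": 0.020, "MEMIT (best)": 0.070}

# Margin = WRIT score divided by the baseline score.
margins = {name: writ_full / score for name, score in baselines.items()}

for name, margin in margins.items():
    print(f"{name}: ~{margin:.0f}x")
```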
vs RAG (top-3 retrieval).
| Method | Strict-match | Paraphrased ρ | Tokens consumed | External index? |
|---|---|---|---|---|
| RAG (top-3) | 0.633 | 0.853 | hundreds | yes |
| WRIT (full) | 0.940 | 0.940 | 0 | no |
vs LoRA — 15-cell sweep.
Protocol A (3 ep) over ranks ∈ {8, 16, 32, 64} × N_facts ∈ {50, 200, 500} (12 cells) + Protocol B (3 ep, r=64 / N=500; 1 cell) + 2 max-effort cells (20 ep). The cells shown are the baseline plus the highest-effort points.
| Cell | rank | N | epochs | para_sub | MMLU | ΔMMLU |
|---|---|---|---|---|---|---|
| Baseline (no LoRA) | — | — | — | 0.000 | 0.680 | — |
| protA_r8_n500 | 8 | 500 | 3 | 0.040 | 0.660 | −0.020 |
| protA_r64_n500 | 64 | 500 | 3 | 0.154 | 0.680 | +0.000 |
| protB_r64_n500_BEST | 64 | 500 | 3 | 0.870 | 0.600 | −0.080 |
| protA_r64_n500_E20 | 64 | 500 | 20 | 0.816 | 0.560 | −0.120 |
| protB_r64_n500_BEST_E20 | 64 | 500 | 20 | 0.954 | 0.480 | −0.200 |
| WRIT @ N=100 | — | 100 | — | 0.913 | 0.680 | 0.000 |
| WRIT @ N=500 | — | 500 | — | 0.902 | 0.680 | 0.000 |
LoRA at maximum effort slightly exceeds WRIT on accuracy (0.954 vs 0.902 at N=500), but every LoRA cell with usable accuracy pays for it in MMLU. The ΔMMLU column is the moat: LoRA can reach accuracy parity, but not while preserving base capability.
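For reference, the 15-cell grid described at the top of this section can be enumerated as follows. The dictionary keys are illustrative; the assumption that the two 20-epoch cells are one per protocol at r=64 / N=500 is read off the `_E20` rows in the table.

```python
from itertools import product

ranks = [8, 16, 32, 64]
n_facts = [50, 200, 500]

# Protocol A: full rank x N grid at 3 epochs (12 cells).
cells = [
    {"protocol": "A", "rank": r, "n": n, "epochs": 3}
    for r, n in product(ranks, n_facts)
]
# Protocol B: a single cell at r=64 / N=500, 3 epochs.
cells.append({"protocol": "B", "rank": 64, "n": 500, "epochs": 3})
# Two max-effort cells at 20 epochs, one per protocol (assumed from the table).
cells += [{"protocol": p, "rank": 64, "n": 500, "epochs": 20} for p in ("A", "B")]

print(len(cells))  # 15
```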
Jailbreak defense.
Setup wall-clock: 33.62 s (32.68 s gate train + <1 s refusal-patch construction). Storage: a few KB per patch. ASR = attack-success rate.
Static jailbreak families — 5 families × 100 prompts.
| Family | Train / OOD | Vanilla ASR | Defended ASR | ASR drop | Slot fire rate |
|---|---|---|---|---|---|
| AIM | trained | 0.290 | 0.000 | −0.290 | 1.000 |
| DAN | trained | 0.120 | 0.000 | −0.120 | 1.000 |
| Evil Confidant | trained | 0.210 | 0.000 | −0.210 | 1.000 |
| Roleplay (Year-2099) | OOD | 0.100 | 0.000 | −0.100 | 1.000 |
| Ignore-Previous-Instructions | OOD | 0.070 | 0.030* | −0.040 | 0.970 |
| Aggregate (500 attempts) | — | 0.158 | 0.006 | −0.152 | — |
* The 3 “failures” in Ignore-Previous-Instructions are JBB Disinformation-category prompts where the model gave factual rebuttals rather than canonical refusal phrases. None of the outputs was actually harmful, so the metric over-counts.
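Because every family contributes 100 prompts, the aggregate row is the unweighted mean of the per-family rates. A small sketch of that calculation with the table's values (the data layout is illustrative):

```python
# (vanilla ASR, defended ASR) per family, 100 prompts each, from the table.
families = {
    "AIM": (0.290, 0.000),
    "DAN": (0.120, 0.000),
    "Evil Confidant": (0.210, 0.000),
    "Roleplay (Year-2099)": (0.100, 0.000),
    "Ignore-Previous-Instructions": (0.070, 0.030),
}

n = len(families)
# Equal prompt counts per family make the plain mean equal to the pooled rate.
vanilla_asr = sum(v for v, _ in families.values()) / n
defended_asr = sum(d for _, d in families.values()) / n

print(f"{vanilla_asr:.3f} -> {defended_asr:.3f}")
```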
GCG (gradient-based adversarial suffixes).
nanoGCG, 250 steps/prompt, ~4 min/prompt; published GCG range for Qwen-7B-class is ASR 0.65–0.85.
| Condition | ASR | Refused | Slot fire rate |
|---|---|---|---|
| GCG attack (vanilla, N=100) | 0.690 | 31 / 100 | — |
| GCG (generic defense, no GCG seen) | 0.210 | 79 / 100 | 0.700 |
| GCG on held-out 75 (vanilla) | 0.653 | 26 / 75 | — |
| GCG held-out 75 (targeted defense) | 0.000 | 75 / 75 | 1.000 |
Generic hardening = persona-style jailbreaks only in training. Targeted hardening = generic hardening + 25 sampled GCG attacks, tested on the 75 held-out attacks. ASR drops: −48 pp generic, −65 pp targeted (perfect).
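The ASR and refusal columns are two views of the same counts. The sketch below assumes ASR is simply the share of attacks that were not refused, which matches the table's numbers; the function name is illustrative.

```python
def attack_success_rate(refused: int, total: int) -> float:
    """ASR taken as the share of attacks not refused (assumed definition)."""
    return 1.0 - refused / total


# Refusal counts copied from the GCG table.
vanilla_100 = attack_success_rate(31, 100)
generic_100 = attack_success_rate(79, 100)
vanilla_75 = attack_success_rate(26, 75)
targeted_75 = attack_success_rate(75, 75)

# Drops in percentage points, defended minus vanilla on the same attack set.
generic_drop_pp = (generic_100 - vanilla_100) * 100
targeted_drop_pp = (targeted_75 - vanilla_75) * 100

print(f"generic: {generic_drop_pp:+.0f} pp, targeted: {targeted_drop_pp:+.0f} pp")
```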
Selectivity — no benign degradation.
| Test | n | Slot fire rate | Behavior |
|---|---|---|---|
| Benign general-knowledge queries | 20 | 0.000 (0/20) | model answers normally |
| Plain JBB harmful prompts | 100 | 0.920 | refusal rate matches vanilla; no over-refusal |
Comparison to existing defenses.
| Method | Time to deploy | Reversible? | Per-attack selective? | vs strong attacks |
|---|---|---|---|---|
| RLHF | months | hard rollback | no | high |
| Constitutional AI | months | hard rollback | no | high |
| Safety LoRA fine-tune | hours | partial | no | high |
| System-prompt defenses | seconds | trivial | no | low (eats context) |
| WRIT (generic) | 33 s | bit-identical | yes | 0.21 ASR vs GCG |
| WRIT (targeted) | 33 s | yes | yes | 0.00 ASR vs GCG |
Unlearning & reversibility.
Four-cell head-to-head on Qwen2.5-7B-Instruct, using CounterFact records [1000, 1100), i.e. 100 facts. After installing all 100 edits per method, sequentially unlearn K target facts and measure: per-target wall-clock time, ForgetScore (target reverts cleanly), RetainScore (the other 99 stay), and ΔMMLU (general-capability damage).
| Method | Wall / unlearn | × vs WRIT | ForgetScore | RetainScore | ΔMMLU |
|---|---|---|---|---|---|
| WRIT | 36.9 µs | 1× | 1.000 | 1.000 | 0.000 |
| ROME (restore + replay 99) | 2.5 min | 4.0 × 10⁶× | 1.000 | 1.000 | −0.050 |
| MEMIT (restore + replay 99) | 8.4 min | 1.4 × 10⁷× | 0.400 | 0.937 | −0.760 |
| LoRA (retrain on remaining 99) | 24.2 s | 6.6 × 10⁵× | 0.333 | 0.814 | −0.620 |
WRIT is the only method with FS = 1.0, RS = 1.0, and ΔMMLU = 0.0 simultaneously. MEMIT’s joint state at N=100 collapses (52/100 edits hold; the cascade through unlearning drives MMLU from 76% to 0%). LoRA at default config forgets catastrophically (MMLU 76% → 14%) and recovers only 1/3 of unlearn targets.
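The "× vs WRIT" column is each method's per-unlearn wall time divided by WRIT's. A sketch of that arithmetic with the listed times follows; since the listed times are rounded, the results match the table only to within rounding.

```python
import math

# Per-unlearn wall times in seconds, converted from the table above.
writ_s = 36.9e-6
others_s = {
    "ROME": 2.5 * 60,
    "MEMIT": 8.4 * 60,
    "LoRA": 24.2,
}

# Slowdown relative to WRIT, and the same figure in orders of magnitude.
slowdown = {name: t / writ_s for name, t in others_s.items()}
orders = {name: math.log10(x) for name, x in slowdown.items()}

for name in others_s:
    print(f"{name}: {slowdown[name]:.1e}x ({orders[name]:.1f} orders of magnitude)")
```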
Bit-identical rollback.
Output equivalence vs unmodified baseline after rollback: 14 dec. (verified across MMLU at 14-decimal precision).
Validation suite required to confirm rollback: none (byte-for-byte revert · no checkpoint replacement, no regression sweep).
Behavioral suite.
11 tests measuring keyword, semantic, knowledge, skill, persona, multi-lingual, and adversarial behaviors. Run on both 1.5B and 7B with the chat-template port as the only structural change.
| # | Test | Qwen2.5-Coder-1.5B | Qwen2.5-7B-Instruct | Δ |
|---|---|---|---|---|
| 1 | Keyword behavior (FUCK → joke) | PASS | PASS | identical |
| 2 | Semantic behavior (depression → joke) | PASS | PASS | identical |
| 3 | Knowledge override (France → London) | PASS | PASS | identical |
| 4 | Skill injection (1987 → MCMLXXXVII) | PASS | PASS | identical |
| 5 | Multi-behavior stacking | PASS | PASS | identical |
| 6 | Persona injection (pirate speak) | PASS | PASS | identical |
| 7 | Cross-lingual (English → Spanish) | PASS | PASS | identical |
| 8 | Chain-of-thought steering | PASS | PASS | identical |
| 9 | Multi-turn context (1–4 turns) | PARTIAL | PASS ↑ | chat template fixes |
| 10 | Adversarial robustness (use vs mention) | PASS | PARTIAL ↓ | 1 should-fire missed |
| 11 | Token length (7 / 18 / 31 tokens) | PASS | PASS | identical |
7B headline: 10/11 PASS, 1 PARTIAL · zero net regressions vs 1.5B · 7.4 min total runtime for the full suite.
Inference overhead.
Per-request cost of having WRIT operations attached, measured at 7B against the unmodified base.
| Operation | Latency | Δ vs baseline |
|---|---|---|
| Bundle construction | 305 ms / bundle | one-time per fact |
| Gate train (N=10) | 12.1 s | one-time per gate |
| Gate train (N=50) | 63.0 s | one-time per gate |
| Inference baseline (20 tok) | 1074 ms / q | — |
| Inference hooked (10 inj, 20 tok) | 1099 ms / q | +25 ms / +2.3% |
| Reversibility (unhook) | instant | bit-identical |
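The hooked-inference overhead in the table follows directly from the two latency rows. A minimal check of that arithmetic (variable names are illustrative):

```python
# 7B latencies in ms per query at 20 generated tokens, from the table above.
baseline_ms = 1074
hooked_ms = 1099  # 10 injections attached

# Absolute and relative overhead of the hooked run versus the unmodified base.
delta_ms = hooked_ms - baseline_ms
overhead_pct = 100 * delta_ms / baseline_ms

print(f"+{delta_ms} ms / +{overhead_pct:.1f}%")  # +25 ms / +2.3%
```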
Speed profile — 1.5B vs 7B.
| Operation | 1.5B | 7B | Ratio |
|---|---|---|---|
| Bundle construction | 112 ms (36 ms/tok) | 305 ms (98 ms/tok) | 2.7× |
| Feature extraction | 20 ms / 8960-dim | 89 ms / 18944-dim | 4.5× |
| Gate train (N=10) | 13.6 s | 12.1 s | 0.9× |
| Gate train (N=50) | 46.6 s | 63.0 s | 1.4× |
| Inference baseline (20 tok) | 375 ms / q | 1074 ms / q | 2.9× |
| Inference hooked (10 inj) | 336 ms / q (−10.3% noise) | 1099 ms / q (+2.3%) | — |
Live demonstrations.
The 12-step recorded narrative.
Cast assertions on real Qwen2.5-7B-Instruct: 9 / 9 (batch-install · save / load across processes · surgical forget).
Regression guard before recording: 0.90+ (paraphrased ρ on 50 facts × 4 query specs, 200 queries).
BTC live chatbot.
Smoke-test queries (literal · paraphrase · multi-turn · long history · vanilla): 5 / 5 (price refreshed every 5 s via forget+teach under gpu_lock).
Initial teach: ~2.5 s (subsequent ~2 s · off-topic queries hit unmodified base).
Provenance & sanity floors.
What every cell on this page has, by construction:
- Per-cell JSON results with seed / spec / ablation broken out — never just summary stats.
- Pre-registered hypotheses for every gating experiment, written before the experiment ran and committed to the repo with a timestamp.
- Pre-registered kill criteria (“stop if oracle ρ < 0.75 at N=50”) — none ever fired; actual oracle ρ has been 1.000 on every gating cell.
- Sanity floors built into every cell — zero and random_B ablations are required; zero = 0.000 everywhere, random_B at sanity-floor (≤ 0.20 on the worst cell, well below every pass bar).
- Base-ignorance verification of every fact in the test KB (<5% of seeds dropped for low-entropy answer spaces) before it enters the protocol.
- Cross-seed verification — at least 2 seeds per cell (most have 3); seed range ≤ 2 points across the program.
- Cross-arch / cross-scale verification — six base models from 0.5B to 72B across three families.
- Cross-harness drift confirmation for the no-harm result — the gap to published MMLU reproduces on the unmodified base with the same harness configuration.
- Negative results retained alongside positives — the LRM Phase 1F multi-fact superposition FAIL, DKT Phase B PARTIAL, and the negative-alpha suppression FAIL are all kept with full traces and motivated downstream design choices.