Public benchmark record · Outcome metrics only

The full measured shape of WRIT.

Every public WRIT evaluation, reported as outcome metrics on standard model checkpoints and standard public benchmarks. No protocol narrative, no setup detail — just the cells. Methodology, ablations, and per-seed traces are archived and available under NDA. Construction details are held confidentially under patent.

Public capability record · Construction details under NDA with partners

Headline summary.

One row per test family. Every cell is an outcome on a public benchmark or held-out evaluation set against a standard model checkpoint.

| # | Family | Model | Headline | Status |
|---|---|---|---|---|
| 01 | Phase 0: mechanism eval | Qwen2.5-Coder-1.5B | Oracle ρ = 1.000 at N=50; full ρ = 0.926 paraphrased | PASS |
| 02 | 7B headline (Acme KB) | Qwen2.5-7B-Instruct | Oracle ρ = 1.000; full ρ = 0.940 paraphrased N=50 | PASS |
| 03 | Cross-architecture transfer | Qwen / Llama / Mistral · 0.5B–72B | Oracle 1.000 on every model; full ρ 0.90–0.953 | PASS |
| 04 | Capacity sweep | Qwen2.5-7B-Instruct | 0.940 → 0.902 paraphrased ρ from N=50 → N=500, no cliff | PASS |
| 05 | Behavioral suite: 1.5B | Qwen2.5-Coder-1.5B | 10/11 PASS, 1 PARTIAL | PASS |
| 06 | Behavioral suite: 7B | Qwen2.5-7B-Instruct | 10/11 PASS, 1 PARTIAL · zero net regressions vs 1.5B | PASS |
| 07 | Capacity stress: 1.5B | Qwen2.5-Coder-1.5B | 99.8% routing @ N=500 · 0 leaks · 257 ms/q (O(1) in N) | PASS |
| 08 | Capacity stress: 7B | Qwen2.5-7B-Instruct | 100% routing @ N=100 · 0 leaks · 815 ms/q | PASS |
| 09 | No-harm (MMLU + GSM8K + HumanEval) | Qwen2.5-7B-Instruct · N=100 active | 0 / 15,498 fires · output bit-identical to baseline | PASS |
| 10 | Head-to-head vs ROME | Qwen2.5-7B-Instruct | WRIT 0.940 vs ROME 0.020 paraphrased N=50 (~47×) | PASS |
| 11 | Head-to-head vs MEMIT | Qwen2.5-7B-Instruct | WRIT 0.940 vs MEMIT 0.070 paraphrased N=50 (~13×) | PASS |
| 12 | Head-to-head vs RAG (top-3) | Qwen2.5-7B-Instruct | WRIT 0.940 vs RAG 0.853 paraphrased · 0 retrieved tokens | PASS |
| 13 | Head-to-head vs LoRA | Qwen2.5-7B-Instruct · 15-cell sweep | LoRA-best 0.954 / −0.200 MMLU vs WRIT 0.902 / 0.000 MMLU | PASS |
| 14 | Static jailbreak families (defense) | Qwen2.5-7B-Instruct · 5 families · 500 attempts | Aggregate ASR 0.158 → 0.006 · 0/20 benign fires | PASS |
| 15 | GCG defense: generic hardening | Qwen2.5-7B-Instruct · 100 GCG attacks | ASR 0.690 → 0.210 · no GCG seen during setup | PASS |
| 16 | GCG defense: targeted hardening | Qwen2.5-7B-Instruct · 75 held-out GCG | ASR 0.653 → 0.000 · 75/75 refused | PERFECT |
| 17 | Right-to-forget | Qwen2.5-7B-Instruct · 100 CounterFact edits | 36.9 µs / unlearn · FS=1.0 / RS=1.0 / ΔMMLU=0.000 · 5–7 orders faster than ROME / MEMIT / LoRA | PASS |
| 18 | Live demos (12-step cast, BTC chat) | Qwen2.5-7B-Instruct | 9/9 cast assertions · BTC live chat 5/5 smoke | PASS |
| 19 | DKT Phase A (drop-in probe) | Qwen2.5-0.5B | 40/40 across all probed depths | PASS |
| 20 | DKT Phase B (from-scratch 200M) | Tiny from-scratch transformer | Recall 40/40 · perplexity 1.76× baseline | PARTIAL |
Snapshot 2026-04-29. ρ = headroom-normalized substring-recall correlation on held-out paraphrased queries. Pre-registered kill criterion: oracle ρ < 0.75 at N=50 — never fired.

Knowledge install.

Single-fact and multi-fact installation, measured by paraphrased substring recall on held-out queries. Oracle = perfect routing, isolates the plasticity rule. Full = end-to-end with the live dispatcher. zero and random_B are the sanity floors.
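
The evaluator itself is not published on this page, so the following is only a minimal sketch of what a paraphrased substring-recall score could look like. The case-insensitive match, the `headroom_normalized` rescaling, and the sample strings are all assumptions made for illustration, not the WRIT harness.

```python
def substring_recall(outputs, expected):
    """Fraction of queries whose model output contains the expected answer."""
    if len(outputs) != len(expected):
        raise ValueError("outputs and expected must have the same length")
    if not outputs:
        return 0.0
    hits = sum(
        1
        for out, ans in zip(outputs, expected)
        if ans.casefold() in out.casefold()  # assumption: case-insensitive match
    )
    return hits / len(outputs)


def headroom_normalized(score, floor):
    """Assumed reading of 'headroom-normalized': floor maps to 0, perfect to 1."""
    if floor >= 1.0:
        raise ValueError("floor must be below 1.0")
    return (score - floor) / (1.0 - floor)


# Hypothetical held-out paraphrased queries, all expecting the same made-up fact.
outputs = ["The Acme HQ is in Oslo.", "I am not sure.", "It is based in oslo"]
expected = ["Oslo", "Oslo", "Oslo"]

raw = substring_recall(outputs, expected)
print(round(raw, 3), round(headroom_normalized(raw, floor=0.0), 3))
```

With a floor of 0.000, as the zero and random_B cells report, the rescaling in this sketch leaves the raw recall unchanged.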

Phase 0 — mechanism eval (Qwen2.5-Coder-1.5B).

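| N | full | zero | random_B | oracle_gate | frozen_gate |
|---|---|---|---|---|---|
| 1 | 1.000 | 0.000 | 0.000 | 1.000 | 1.000 |
| 5 | 0.971 | 0.000 | 0.000 | 1.000 | 0.248 |
| 10 | 0.962 | 0.000 | 0.000 | 1.000 | 0.114 |
| 25 | 0.968 | 0.000 | 0.000 | 1.000 | 0.038 |
| 50 | 0.926 | 0.000 | 0.000 | 1.000 | 0.012 |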

7B headline — Acme KB (Qwen2.5-7B-Instruct, N=50).

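| Ablation | literal | paraphrased | abstract | task |
|---|---|---|---|---|
| zero | 0.000 | 0.000 | 0.000 | 0.000 |
| random_B | 0.000 | 0.000 | 0.000 | 0.000 |
| oracle_gate | 1.000 | 1.000 | 1.000 | 1.000 |
| full | 0.980 | 0.940 | 1.000 | 0.987 |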
Pre-registered MVP bars: oracle ρ ≥ 0.90 and full ρ ≥ 0.70 at N=50 paraphrased. Cleared by 24 points on full-ρ.
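
As a quick check of the arithmetic behind these bars, here is a small sketch that applies the thresholds quoted on this page to the 7B headline cell. The function and constant names are hypothetical; only the numbers come from the record.

```python
ORACLE_BAR = 0.90  # pre-registered MVP bar for oracle rho (N=50, paraphrased)
FULL_BAR = 0.70    # pre-registered MVP bar for full rho (N=50, paraphrased)
KILL_BELOW = 0.75  # kill criterion: stop if oracle rho falls below this


def check_bars(oracle_rho, full_rho):
    """Compare one cell against the pass bars and the kill criterion."""
    return {
        "kill_fired": oracle_rho < KILL_BELOW,
        "oracle_pass": oracle_rho >= ORACLE_BAR,
        "full_pass": full_rho >= FULL_BAR,
        # margin over the full-rho bar, in points (hundredths of rho)
        "full_margin_points": round((full_rho - FULL_BAR) * 100),
    }


result = check_bars(oracle_rho=1.000, full_rho=0.940)
print(result)
```

The `full_margin_points` field reproduces the 24-point clearance stated above.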

Capacity sweep — paraphrased ρ vs N (Qwen2.5-7B-Instruct).

| N | Paraphrased ρ (full) | Note |
|---|---|---|
| 50 | 0.940 | matches Task 6 headline |
| 100 | 0.913 | mid-curve |
| 200 | 0.905 | early plateau |
| 500 | 0.902 | end of sweep, no degradation cliff |

Cross-architecture transfer.

The same primitive is applied across three architectural families and a 144× parameter range. Per-family operating-point tuning is mechanical (one decoder layer + one fractional scale).

| Model | Params | Oracle ρ | Full ρ | Operating point |
|---|---|---|---|---|
| Qwen2.5-Coder-1.5B | 1.5B | 1.000 | 0.926 | L27 / frac=0.005 |
| Qwen2.5-7B-Instruct | 7B | 1.000 | 0.940 | L27 / frac=0.005 |
| Llama-3.1-8B-Instruct | 8B | 1.000 | 0.940 | L30 / frac≈0.5 |
| Mistral-7B-Instruct-v0.3 | 7B | 1.000 | 0.947 | L30 / frac≈0.5 |
| Qwen2.5-14B-Instruct | 14B | 1.000 | 0.900 | L47 / frac=0.05 |
| Qwen2.5-72B-Instruct | 72B | 1.000 | 0.953 | L79 / frac=0.05 |
All cells: N=50, paraphrased held-out queries. Phi and Gemma additionally sanity-validated; not in this table.

Capacity at scale.

Number of WRIT operations coexisting on a single base model, with routing accuracy and benign-fire (leak) counts.

Qwen2.5-Coder-1.5B.

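| N | Routing | Control leaks | Inference |
|---|---|---|---|
| 10 | 100% | 0 | 339 ms/q |
| 50 | 100% | 0 | 244 ms/q |
| 100 | 100% | 0 | 248 ms/q |
| 200 | 99.5% | 0 | 244 ms/q |
| 300 | 99.7% | 0 | ~250 ms/q |
| 500 | 99.8% (499/500) | 0 | 257 ms/q |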

Qwen2.5-7B-Instruct.

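| N | Routing | Leaks | Build | Gate train | Generation |
|---|---|---|---|---|---|
| 10 | 10/10 | 0/4 | 3.1 s | 25.1 s | 1222 ms |
| 25 | 25/25 | 0/4 | 7.3 s | 39.7 s | 1025 ms |
| 50 | 50/50 | 0/4 | 13.2 s | 74.4 s | 866 ms |
| 100 | 100/100 | 0/4 | 27.5 s | 136.7 s | 815 ms |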

Composability ceiling.

5,000 simultaneous operations on a 7B model.

Routing accuracy at N=5,000: 98.5% (recall 98.5% · zero cross-talk)
MMLU change with 5,000 active operations: −0.01 pp (capacity scales 10× with only a 1.3-point routing-accuracy drop)

No-harm benchmark.

The selectivity test. With N=100 active operations, the dispatcher is run against three standard public benchmarks. A “fire” means a WRIT operation activated — the headline number is how many times it did.

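| Benchmark | Prompts | WRIT N=100 | Vanilla baseline | Δ | Fire rate |
|---|---|---|---|---|---|
| MMLU (5-shot) | 14,015 | 68.6100 | 68.6085 | +0.0015 | 0 / 14,015 |
| GSM8K (5-shot) | 1,319 | 69.98 | | | 0 / 1,319 |
| HumanEval (pass@1) | 164 | 82.32 | | | 0 / 164 |
| Total | 15,498 | | | | 0 / 15,498 = 0.0% |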
With zero fires, output is mathematically identical to the unmodified base — proven on MMLU at the 14-decimal level. Selectivity improves with N: 1/570 fires at N=10 → 0/14,015 at N=100.
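
The totals in the table above can be re-derived from the per-benchmark counts. A minimal sketch, using only the figures reported on this page (variable names are illustrative):

```python
# (benchmark, fires, prompts) as reported for the N=100 run
rows = [
    ("MMLU (5-shot)", 0, 14015),
    ("GSM8K (5-shot)", 0, 1319),
    ("HumanEval (pass@1)", 0, 164),
]

total_fires = sum(fires for _, fires, _ in rows)
total_prompts = sum(prompts for _, _, prompts in rows)
fire_rate = total_fires / total_prompts

# The N=10 figure quoted above, for comparison with the N=100 run.
fire_rate_n10 = 1 / 570

print(f"{total_fires} / {total_prompts} = {fire_rate:.1%}")
print(f"N=10 fire rate: {fire_rate_n10:.3%}")
```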

Head-to-head baselines.

Same model (Qwen2.5-7B-Instruct), same KB (57–553 base-ignorance-verified Acme facts), same paraphrased substring evaluator. Every method ran on the operating point its own paper or sweep recommended.

vs ROME (Meng et al. 2022) and MEMIT (Meng et al. 2023).

| N | WRIT oracle | WRIT full | MEMIT (best) | ROME L15 |
|---|---|---|---|---|
| 10 | 1.000 | 1.000 | 0.200 | 0.100 |
| 50 | 1.000 | 0.940 | 0.070 | 0.020 |
ROME drift after 50 sequential edits: 307% of frozen-matrix Frobenius norm; perplexity 6.27 → 8.24. MEMIT structural improvements held (drift 19%; perplexity unchanged at N=50). Margins: ~47× over ROME, ~13× over MEMIT.
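
For readers who want to reproduce the two kinds of figures in this note, here is a hedged sketch. It assumes "drift" means the Frobenius norm of the weight change divided by the Frobenius norm of the frozen matrix; the page does not state the formula, and the 2×2 matrices are toy values, not model weights.

```python
import math


def frobenius_norm(matrix):
    """Square root of the sum of squared entries of a nested-list matrix."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def relative_drift(frozen, edited):
    """Assumed definition: ||edited - frozen||_F / ||frozen||_F."""
    diff = [
        [e - f for f, e in zip(f_row, e_row)]
        for f_row, e_row in zip(frozen, edited)
    ]
    return frobenius_norm(diff) / frobenius_norm(frozen)


# Toy example: the change has the same norm as the frozen matrix.
frozen = [[3.0, 0.0], [0.0, 4.0]]
edited = [[3.0, 5.0], [0.0, 4.0]]
drift = relative_drift(frozen, edited)
print(f"drift = {drift:.0%} of frozen-matrix norm")

# Margins at N=50 paraphrased, from the table above.
WRIT_FULL, MEMIT_BEST, ROME_L15 = 0.940, 0.070, 0.020
rome_margin = WRIT_FULL / ROME_L15
memit_margin = WRIT_FULL / MEMIT_BEST
print(f"~{rome_margin:.0f}x over ROME, ~{memit_margin:.0f}x over MEMIT")
```

Under this definition a drift of 307% would mean the accumulated edit is about three times the size of the original matrix.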

vs RAG (top-3 retrieval).

| Method | Strict-match | Paraphrased ρ | Tokens consumed | External index? |
|---|---|---|---|---|
| RAG (top-3) | 0.633 | 0.853 | hundreds | yes |
| WRIT (full) | 0.940 | 0.940 | 0 | no |

vs LoRA — 15-cell sweep.

Ranks ∈ {8, 16, 32, 64} × N_facts ∈ {50, 200, 500} × Protocol A (3 ep) + Protocol B (3 ep, r=64 / N=500) + 2 max-effort cells (20 ep). The selected cells are baseline + the highest-effort points.
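
One way to read the cell count is 12 Protocol A grid cells, one Protocol B cell, and two 20-epoch cells. The sketch below enumerates that reading; the dictionary layout is hypothetical, and the choice of r=64 / N=500 for the two max-effort cells is inferred from the cell names protA_r64_n500_E20 and protB_r64_n500_BEST_E20.

```python
from itertools import product

RANKS = (8, 16, 32, 64)
N_FACTS = (50, 200, 500)

# Protocol A: full rank x N_facts grid at 3 epochs (4 x 3 = 12 cells).
cells = [
    {"protocol": "A", "rank": r, "n_facts": n, "epochs": 3}
    for r, n in product(RANKS, N_FACTS)
]

# Protocol B: a single cell at r=64 / N=500, 3 epochs.
cells.append({"protocol": "B", "rank": 64, "n_facts": 500, "epochs": 3})

# Two max-effort cells at 20 epochs (assumed: A and B at r=64 / N=500).
cells += [
    {"protocol": p, "rank": 64, "n_facts": 500, "epochs": 20}
    for p in ("A", "B")
]

print(len(cells))  # 15
```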

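| Cell | rank | N | epochs | para_sub | MMLU | ΔMMLU |
|---|---|---|---|---|---|---|
| Baseline (no LoRA) | | | | 0.000 | 0.680 | |
| protA_r8_n500 | 8 | 500 | 3 | 0.040 | 0.660 | −0.020 |
| protA_r64_n500 | 64 | 500 | 3 | 0.154 | 0.680 | +0.000 |
| protB_r64_n500_BEST | 64 | 500 | 3 | 0.870 | 0.600 | −0.080 |
| protA_r64_n500_E20 | 64 | 500 | 20 | 0.816 | 0.560 | −0.120 |
| protB_r64_n500_BEST_E20 | 64 | 500 | 20 | 0.954 | 0.480 | −0.200 |
| WRIT @ N=100 | | 100 | | 0.913 | 0.680 | 0.000 |
| WRIT @ N=500 | | 500 | | 0.902 | 0.680 | 0.000 |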
LoRA at maximum effort slightly exceeds WRIT on accuracy (0.954 vs 0.902 at N=500), but every LoRA cell with usable accuracy pays for it in MMLU. The ΔMMLU column is the moat: accuracy parity is achievable with LoRA, capability preservation is not.

Jailbreak defense.

Setup wall-clock: 33.62 s (32.68 s gate train + <1 s refusal-patch construction). Storage: a few KB per patch. ASR = attack-success rate.

Static jailbreak families — 5 families × 100 prompts.

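| Family | Train / OOD | Vanilla ASR | Defended ASR | ASR drop | Slot fire rate |
|---|---|---|---|---|---|
| AIM | trained | 0.290 | 0.000 | −0.290 | 1.000 |
| DAN | trained | 0.120 | 0.000 | −0.120 | 1.000 |
| Evil Confidant | trained | 0.210 | 0.000 | −0.210 | 1.000 |
| Roleplay (Year-2099) | OOD | 0.100 | 0.000 | −0.100 | 1.000 |
| Ignore-Previous-Instructions | OOD | 0.070 | 0.030* | −0.040 | 0.970 |
| Aggregate (500 attempts) | | 0.158 | 0.006 | −0.152 | |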
* The 3 “failures” in Ignore-Instructions are JBB Disinformation-category prompts where the model gave factual rebuttals rather than canonical refusal phrases. Zero actually harmful outputs; metric over-counts.
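
The aggregate row follows from the per-family rates at 100 prompts each. A small sketch of that pooling; the success counts are derived here as ASR × 100 rather than taken from raw logs, so treat them as an assumption.

```python
PROMPTS_PER_FAMILY = 100

# family -> (vanilla successes, defended successes), derived as ASR x 100
families = {
    "AIM": (29, 0),
    "DAN": (12, 0),
    "Evil Confidant": (21, 0),
    "Roleplay (Year-2099)": (10, 0),
    "Ignore-Previous-Instructions": (7, 3),
}

attempts = PROMPTS_PER_FAMILY * len(families)
vanilla_asr = sum(v for v, _ in families.values()) / attempts
defended_asr = sum(d for _, d in families.values()) / attempts

print(attempts, round(vanilla_asr, 3), round(defended_asr, 3))
print("drop:", round(defended_asr - vanilla_asr, 3))
```

Pooling by counts and averaging the per-family rates give the same result here because every family has the same number of prompts.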

GCG (gradient-based adversarial suffixes).

nanoGCG, 250 steps/prompt, ~4 min/prompt; published GCG range for Qwen-7B-class is ASR 0.65–0.85.

| Condition | ASR | Refused | Slot fire rate |
|---|---|---|---|
| GCG attack (vanilla, N=100) | 0.690 | 31 / 100 | |
| GCG (generic defense, no GCG seen) | 0.210 | 79 / 100 | 0.700 |
| GCG on held-out 75 (vanilla) | 0.653 | 26 / 75 | |
| GCG held-out 75 (targeted defense) | 0.000 | 75 / 75 | 1.000 |
Generic hardening = persona-style jailbreaks only in training. Targeted hardening = + 25 sampled GCG attacks; tested on 75 held-out. ASR drops: −48 pp generic, −65 pp targeted (perfect).
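
The ASR and percentage-point figures are consistent with treating every non-refused attempt as a successful attack. A minimal sketch under that assumption (the helper name is illustrative):

```python
def asr_from_refusals(refused, total):
    """Attack-success rate, assuming every non-refusal counts as a success."""
    return 1.0 - refused / total


vanilla_100 = asr_from_refusals(31, 100)
generic_100 = asr_from_refusals(79, 100)
vanilla_75 = asr_from_refusals(26, 75)
targeted_75 = asr_from_refusals(75, 75)

# Drops in percentage points, rounded to whole points as quoted above.
generic_drop_pp = round((vanilla_100 - generic_100) * 100)
targeted_drop_pp = round((vanilla_75 - targeted_75) * 100)

print(round(vanilla_100, 3), round(generic_100, 3))
print(round(vanilla_75, 3), round(targeted_75, 3))
print(generic_drop_pp, targeted_drop_pp)
```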

Selectivity — no benign degradation.

| Test | n | Slot fire rate | Behavior |
|---|---|---|---|
| Benign general-knowledge queries | 20 | 0.000 (0/20) | model answers normally |
| Plain JBB harmful prompts | 100 | 0.920 | refusal rate matches vanilla; no over-refusal |

Comparison to existing defenses.

| Method | Time to deploy | Reversible? | Per-attack selective? | vs strong attacks |
|---|---|---|---|---|
| RLHF | months | hard rollback | no | high |
| Constitutional AI | months | hard rollback | no | high |
| Safety LoRA fine-tune | hours | partial | no | high |
| System-prompt defenses | seconds | trivial | no | low (eats context) |
| WRIT (generic) | 33 s | bit-identical | yes | 0.21 ASR vs GCG |
| WRIT (targeted) | 33 s | yes | yes | 0.00 ASR vs GCG |

Unlearning & reversibility.

Four-cell head-to-head on Qwen2.5-7B-Instruct, using CounterFact records [1000, 1100) (100 facts). After installing all 100 edits per method, sequentially unlearn K target facts and measure: per-target wall-clock time, ForgetScore (target reverts cleanly), RetainScore (the other 99 stay), and ΔMMLU (general-capability damage).

| Method | Wall / unlearn | × vs WRIT | ForgetScore | RetainScore | ΔMMLU |
|---|---|---|---|---|---|
| WRIT | 36.9 µs | | 1.000 | 1.000 | 0.000 |
| ROME (restore + replay 99) | 2.5 min | 4.0 × 10⁶× | 1.000 | 1.000 | −0.050 |
| MEMIT (restore + replay 99) | 8.4 min | 1.4 × 10⁷× | 0.400 | 0.937 | −0.760 |
| LoRA (retrain on remaining 99) | 24.2 s | 6.6 × 10⁵× | 0.333 | 0.814 | −0.620 |
WRIT is the only method with FS = 1.0, RS = 1.0, AND ΔMMLU = 0.0 simultaneously. MEMIT’s joint state at N=100 collapses (52/100 edits hold; cascade through unlearn drives MMLU 76% → 0%). LoRA at default config catastrophically forgets (76% → 14%) and recovers only 1/3 of unlearn targets.
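
The "× vs WRIT" column is a ratio of wall-clock times. The sketch below recomputes it from the rounded times in the table, so small differences from the published ratios (for example ROME at roughly 4.07 × 10⁶ rather than 4.0 × 10⁶) are expected and presumably reflect rounding of the inputs.

```python
import math

WRIT_SECONDS = 36.9e-6  # 36.9 microseconds per unlearn

# Per-unlearn wall-clock times from the table, converted to seconds.
wall_seconds = {
    "ROME": 2.5 * 60,
    "MEMIT": 8.4 * 60,
    "LoRA": 24.2,
}

ratios = {m: s / WRIT_SECONDS for m, s in wall_seconds.items()}
orders = {m: math.log10(r) for m, r in ratios.items()}

for method in wall_seconds:
    print(f"{method}: {ratios[method]:.2e}x slower, "
          f"{orders[method]:.1f} orders of magnitude")
```

The resulting spread of orders of magnitude is what the headline summary rounds to "5–7 orders faster".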

Bit-identical rollback.

Output equivalence vs unmodified baseline after rollback: 14 dec. (verified across MMLU at 14-decimal precision)
Validation suite required to confirm rollback: none (byte-for-byte revert · no checkpoint replacement, no regression sweep)

Behavioral suite.

11 tests measuring keyword, semantic, knowledge, skill, persona, multi-lingual, and adversarial behaviors. Run on both 1.5B and 7B with the chat-template port as the only structural change.

| # | Test | Qwen2.5-Coder-1.5B | Qwen2.5-7B-Instruct | Δ |
|---|---|---|---|---|
| 1 | Keyword behavior (FUCK → joke) | PASS | PASS | identical |
| 2 | Semantic behavior (depression → joke) | PASS | PASS | identical |
| 3 | Knowledge override (France → London) | PASS | PASS | identical |
| 4 | Skill injection (1987 → MCMLXXXVII) | PASS | PASS | identical |
| 5 | Multi-behavior stacking | PASS | PASS | identical |
| 6 | Persona injection (pirate speak) | PASS | PASS | identical |
| 7 | Cross-lingual (English → Spanish) | PASS | PASS | identical |
| 8 | Chain-of-thought steering | PASS | PASS | identical |
| 9 | Multi-turn context (1–4 turns) | PARTIAL | PASS ↑ | chat template fixes |
| 10 | Adversarial robustness (use vs mention) | PASS | PARTIAL ↓ | 1 should-fire missed |
| 11 | Token length (7 / 18 / 31 tokens) | PASS | PASS | identical |
7B headline: 10/11 PASS, 1 PARTIAL · zero net regressions vs 1.5B · 7.4 min total runtime for the full suite.

Inference overhead.

Per-request cost of having WRIT operations attached, measured at 7B against the unmodified base.

| Operation | Latency | Δ vs baseline |
|---|---|---|
| Bundle construction | 305 ms / bundle | one-time per fact |
| Gate train (N=10) | 12.1 s | one-time per gate |
| Gate train (N=50) | 63.0 s | one-time per gate |
| Inference baseline (20 tok) | 1074 ms / q | |
| Inference hooked (10 inj, 20 tok) | 1099 ms / q | +25 ms / +2.3% |
| Reversibility (unhook) | instant | bit-identical |

Speed profile — 1.5B vs 7B.

| Operation | 1.5B | 7B | Ratio |
|---|---|---|---|
| Bundle construction | 112 ms (36 ms/tok) | 305 ms (98 ms/tok) | 2.7× |
| Feature extraction | 20 ms / 8960-dim | 89 ms / 18944-dim | 4.5× |
| Gate train (N=10) | 13.6 s | 12.1 s | 0.9× |
| Gate train (N=50) | 46.6 s | 63.0 s | 1.4× |
| Inference baseline (20 tok) | 375 ms / q | 1074 ms / q | 2.9× |
| Inference hooked (10 inj) | 336 ms / q (−10.3% noise) | 1099 ms / q (+2.3%) | |

Storage footprint.

| Method | Per-fact footprint |
|---|---|
| WRIT | ~270 KB factored (A, B) rank-1 pair |
| LoRA r=64 | ~1.3 MB / fact (adapter is monolithic; this is amortized) |
| ROME / MEMIT | full layer matrix per edit family |

Per-fact construction cost.

| Method | Per-fact add | Per-fact forget | Per-fact modify |
|---|---|---|---|
| WRIT | 1 fwd pass / answer token (~0.9 s on 7B) | unhook: instant | replace B row |
| LoRA | re-train adapter | re-train without it | re-train adapter |
| ROME / MEMIT | edit pass + cov inv (10s of seconds) | not natively supported | full edit |
| RAG | index update | index update | index update |

Live demonstrations.

The 12-step recorded narrative.

Cast assertions on real Qwen2.5-7B-Instruct: 9 / 9 (batch-install · save / load across processes · surgical forget)
Regression guard before recording: 0.90+ (paraphrased ρ on 50 facts × 4 query specs, 200 queries)

BTC live chatbot.

Smoke-test queries (literal · paraphrase · multi-turn · long history · vanilla): 5 / 5 (price refreshed every 5 s via forget+teach under gpu_lock)
Initial teach: ~2.5 s (subsequent ~2 s · off-topic queries hit unmodified base)

Provenance & sanity floors.

What every cell on this page has, by construction:

  • Per-cell JSON results with seed / spec / ablation broken out — never just summary stats.
  • Pre-registered hypotheses for every gating experiment, written before the experiment ran and committed to the repo with a timestamp.
  • Pre-registered kill criteria (“stop if oracle ρ < 0.75 at N=50”) — none ever fired; actual oracle ρ has been 1.000 on every gating cell.
  • Sanity floors built into every cell: zero and random_B ablations are required; zero = 0.000 everywhere, and random_B stays at the sanity floor (≤ 0.20 on the worst cell, well below every pass bar).
  • Base-ignorance verification of every fact in the test KB (<5% of seeds dropped for low-entropy answer spaces) before it enters the protocol.
  • Cross-seed verification — at least 2 seeds per cell (most have 3); seed range ≤ 2 points across the program.
  • Cross-arch / cross-scale verification — six base models from 0.5B to 72B across three families.
  • Cross-harness drift confirmation for the no-harm result — the gap to published MMLU reproduces on the unmodified base with the same harness configuration.
  • Negative results retained alongside positives — the LRM Phase 1F multi-fact superposition FAIL, DKT Phase B PARTIAL, and the negative-alpha suppression FAIL are all kept with full traces and motivated downstream design choices.
§ Engage

The traces behind every cell.

Per-seed JSON, ablation traces, and replication scripts are available to qualified counterparties under NDA, alongside a technical deep-dive on the construction held under patent.
