samarth-repshift-9b-jailbreak

A representation-shift jailbreak detection judge built on Qwen/Qwen3.5-9B, specialized for detecting real harmful content in adversarially-attacked LLM responses.

Companion to ArthT/samarth-repshift-9b-v1. The two variants trade off between two failure modes:

| Variant | Strength | Weakness | Use when |
| --- | --- | --- | --- |
| samarth-repshift-9b-jailbreak (this) | Discriminates real harm from confident-nonsense on adversarially-attacked outputs (AUROC 0.823 on Active Robustness) | Higher false-positive rate on honeypot-style content (0.76 @ p=0.5) | Detecting jailbreaks against defended models |
| samarth-repshift-9b-v1 | Almost-zero false positives on confident-nonsense honeypots (0.009 @ p=0.5) | Lower discrimination on adversarially-attacked harm (AUROC 0.624 on Active Robustness) | Filtering confident-but-fake content from training data, red-team validation |

Headline result

On the Active Robustness benchmark (7,454 (prompt, response) pairs from 8 defense methods × 4 attack types applied to Llama-3-8B-Instruct, with ground-truth labels from a gpt-5.4-mini + grok-4.3 dual-rater panel using a confident-nonsense-aware rubric), discriminating HARMFUL_REAL from (CONFIDENT_NONSENSE + REFUSAL):

| Judge | AUROC |
| --- | --- |
| qwen3guard-gen-8b | 0.859 |
| samarth-repshift-9b-jailbreak (this) | 0.823 |
| tf-v08-respmask-s43 (sibling variant) | 0.764 |
| llama-guard-3 | 0.671 |
| Phase 1 winner (sibling) | 0.638 |
| samarth-repshift-9b-v1 (honeypot specialist) | 0.624 |

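On these originally reported figures, this judge sits within 0.04 AUROC of the strongest production safety classifier (qwen3guard-gen-8b) on the adversarial detection task, while remaining a representation-shift judge (2× inference cost, no learned head, explainable score). The correction notice at the end of this card supersedes these numbers: with the record-level join fixed, this judge scores 0.627 against qwen3guard-gen-8b's 0.858.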

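On the harder HARMFUL_REAL vs CONFIDENT_NONSENSE discrimination subtask (separating real harm from plausible-looking fake content), this judge was originally reported at AUROC 0.630, second only to qwen3guard (0.711) and ahead of the other open-source safety judges evaluated. The correction notice at the end of this card revises the CN-only figures to 0.619 for this judge and 0.605 for qwen3guard-gen-8b.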
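
As a rough illustration of how the two evaluation settings differ, the sketch below computes a rank-based AUROC on the full task (HARMFUL_REAL vs CONFIDENT_NONSENSE + REFUSAL) and on the CN-only subtask. The class names come from this card. The helper names (`auroc`, `binarize`) and the toy scores are invented for illustration and are not taken from the project's evaluation code.

```python
def auroc(scores, labels):
    """Rank-based AUROC: P(positive outranks negative), ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    # Pairwise comparison is O(n^2); fine for a sketch, slow for large sets.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def binarize(classes, negatives):
    """Keep HARMFUL_REAL (label 1) plus the chosen negative classes (label 0)."""
    keep = [i for i, c in enumerate(classes)
            if c == "HARMFUL_REAL" or c in negatives]
    labels = [1 if classes[i] == "HARMFUL_REAL" else 0 for i in keep]
    return keep, labels


# Toy data, invented for illustration only.
classes = ["HARMFUL_REAL", "HARMFUL_REAL", "CONFIDENT_NONSENSE",
           "REFUSAL", "REFUSAL"]
p_harmful = [0.9, 0.4, 0.6, 0.1, 0.2]

keep, y = binarize(classes, {"CONFIDENT_NONSENSE", "REFUSAL"})
full = auroc([p_harmful[i] for i in keep], y)

keep, y = binarize(classes, {"CONFIDENT_NONSENSE"})
cn_only = auroc([p_harmful[i] for i in keep], y)
```

Dropping the refusals removes the easy negatives, which is why a judge's CN-only AUROC is usually the lower of the two numbers.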

Architecture

Same Δ_harm + Δ_hp_axis representation-shift architecture as the companion v1 variant:

d_drift   = mean over response tokens of (h_with(t) − h_without(t))
score_raw = d_drift @ Δ_harm − λ_hp · (d_drift @ Δ_hp_axis)
p_harmful = sigmoid(platt_a · score_raw + platt_b)
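
The scoring rule can be sketched in NumPy as follows. This is a minimal sketch, not the reference implementation. It assumes `d_drift` is the token-averaged difference vector in hidden space (the dot products with the two direction vectors need a vector) and that the hidden states have already been restricted to response tokens. The function name and argument names are hypothetical.

```python
import numpy as np


def repshift_score(h_with, h_without, delta_harm, delta_hp_axis,
                   lam_hp, platt_a, platt_b):
    """Return (score_raw, p_harmful) for one response.

    h_with / h_without: arrays of shape (num_response_tokens, hidden_dim)
    from the adapter-on and adapter-off passes.
    """
    h_with = np.asarray(h_with, dtype=np.float64)
    h_without = np.asarray(h_without, dtype=np.float64)
    # Assumption: d_drift is the mean difference *vector* over response tokens.
    d_drift = (h_with - h_without).mean(axis=0)
    # Projection on the harm direction, minus the penalised hp-axis projection.
    score_raw = d_drift @ delta_harm - lam_hp * (d_drift @ delta_hp_axis)
    # Platt scaling maps the raw score to a probability.
    p_harmful = 1.0 / (1.0 + np.exp(-(platt_a * score_raw + platt_b)))
    return float(score_raw), float(p_harmful)


# Toy example: the adapter moves the hidden state along the harm direction only.
score, prob = repshift_score(
    h_with=[[1.0, 0.0], [3.0, 0.0]],
    h_without=[[0.0, 0.0], [0.0, 0.0]],
    delta_harm=np.array([1.0, 0.0]),
    delta_hp_axis=np.array([0.0, 1.0]),
    lam_hp=2.0, platt_a=1.0, platt_b=-2.0,
)
```

Because the score is a pair of projections followed by a sigmoid, each prediction can be traced back to how far the adapter moved the representation along each direction.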

The difference vs v1 is in training:

v1 was trained with an external honeypot corpus (4,154 records), which tightens the honeypot defense but weakens generalization to adversarial defenses.
This variant was trained without external honeypots, which gives broader generalization on real adversarial outputs at the cost of a less aggressive honeypot defense.

Hyperparameters specific to this checkpoint:

λ_hp_axis = 2.0 (twice the v1 weight on the hp-axis penalty)
EMA decay for Δ_hp_axis = 0.999 (very slow, stable direction)
Phase 1 corpus only (no external honeypot augmentation)
Other hyperparameters identical to v1 (see Training below)
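
To give a feel for what a decay of 0.999 means compared with the 0.95 used for Δ_harm, here is a small sketch of a standard exponential-moving-average update. It assumes the usual form `decay * old + (1 - decay) * new`. Whether the training code renormalizes the direction after each step is not stated in this card, so that is left out. Function names are hypothetical.

```python
import numpy as np


def ema_update(direction, batch_direction, decay):
    """One EMA step: keep `decay` of the old direction, blend in the rest."""
    return decay * direction + (1.0 - decay) * batch_direction


def steps_to_half(decay):
    """Number of updates until an old contribution decays to half its weight."""
    return np.log(0.5) / np.log(decay)


old = np.array([1.0, 0.0])
new = np.array([0.0, 1.0])

fast = ema_update(old, new, 0.95)    # moves 5% of the way toward `new`
slow = ema_update(old, new, 0.999)   # moves 0.1% of the way toward `new`

half_fast = steps_to_half(0.95)      # roughly 14 updates
half_slow = steps_to_half(0.999)     # roughly 690 updates
```

Under this reading, the hp-axis direction remembers about fifty times as many updates as the harm direction, which matches the "very slow, stable direction" description.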

Training

Base: Qwen/Qwen3.5-9B + LoRA r=16, α=32, dropout=0.05
Corpus: 9,873 rows = 2,528 harmful prompts (HarmBench / AdvBench / JailbreakBench seed) + 2,528 Qwen3-4B-Instruct refusals (same prompts) + 497 in-house honeypots + 4,320 gray/borderline records
Loss: `α·shift_unsafe + β·KL_safe + γ·anchor_honeypot + λ_hp·anchor_gray − λ_hp_axis·Δ_hp_axis`
Margins: m_b=5.0, m_h=15.0, m_b_dir=0.5, m_h_dir=1.5
Weights: α=0.5, β=0.4, γ=0.9, λ_hp=3.0, λ_hp_axis=2.0
EMA decays: Δ_harm=0.95, Δ_hp_axis=0.999
Rep layers: 22-30 of 36 (top 30%)
Schedule: patience 15, eval every 50 steps, lr 1e-4
Seed: 42
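
The loss line and the weights can be combined as in the sketch below. The five terms are not defined in this card, so they are treated here as already-computed scalars. In particular `hp_axis_term` is a hypothetical name for the scalar quantity that the loss line abbreviates as Δ_hp_axis. Only the weights are taken from the list above.

```python
# Weights copied from the Training list; the dictionary keys are illustrative.
WEIGHTS = {
    "alpha": 0.5,        # shift_unsafe
    "beta": 0.4,         # KL_safe
    "gamma": 0.9,        # anchor_honeypot
    "lam_hp": 3.0,       # anchor_gray
    "lam_hp_axis": 2.0,  # hp-axis term (subtracted)
}


def total_loss(shift_unsafe, kl_safe, anchor_honeypot, anchor_gray,
               hp_axis_term, w=WEIGHTS):
    """Weighted sum of precomputed scalar loss terms, following the loss line."""
    return (w["alpha"] * shift_unsafe
            + w["beta"] * kl_safe
            + w["gamma"] * anchor_honeypot
            + w["lam_hp"] * anchor_gray
            - w["lam_hp_axis"] * hp_axis_term)


# With every term equal to 1.0 the total is 0.5 + 0.4 + 0.9 + 3.0 - 2.0.
example = total_loss(1.0, 1.0, 1.0, 1.0, 1.0)
```

The gray/borderline anchor carries the largest weight, so at equal magnitudes it contributes more to the total than the other terms.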

Recommended hybrid usage

For broad-coverage production safety, pair with harmbench-llama at w=0.6 (see the v1 model card for the full sweep). For maximum jailbreak coverage, pair with qwen3guard-gen-8b. Both signals are correlated on real harm, so the multiplicative form sigmoid(qwen3guard) × sigmoid(this) should inherit qwen3guard's recall (0.92) with a substantially lower false-positive rate.
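
A minimal sketch of the multiplicative form, assuming both judges already expose a harm probability in [0, 1] (that is, the sigmoid has been applied upstream). The function name is hypothetical, and the example probabilities are invented.

```python
def multiplicative_fusion(p_guard, p_repshift):
    """Product of two harm probabilities, each already in [0, 1]."""
    for p in (p_guard, p_repshift):
        if not 0.0 <= p <= 1.0:
            raise ValueError("inputs must be probabilities in [0, 1]")
    return p_guard * p_repshift


# Both judges agree on harm: the product stays high.
agree = multiplicative_fusion(0.9, 0.8)
# The guard fires but the repshift judge does not: the product is pulled down.
disagree = multiplicative_fusion(0.9, 0.1)
```

The product can never exceed the smaller of the two inputs, so a threshold tuned for either judge alone (such as 0.5) is stricter when applied to the fused score. Keeping recall near the guard's level therefore means re-tuning the threshold on held-out data.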

Per-dataset scores (this variant, alone)

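| Dataset | AUROC | Note |
| --- | --- | --- |
| combined_fp_eval (V1 benchmark) | 0.848 | |
| do_not_answer | 0.725 | |
| sorry_bench_human | 0.680 | |
| judge_stress_test | 0.548 | |
| wildguardmix_test | 0.585 | |
| beavertails_30k_test | 0.604 | |
| coinflip_for_safety (GCG/BoN/Pair) | 0.461 | |
| active_robustness (kitchen-sink jailbreak benchmark) | 0.823 | this variant's strength; originally reported figure, corrected to 0.627 in the correction notice at the end of this card |
| 4-axis quadrant eval | 0.935 | strong adversarial-cell discrimination |
| xstest_gpt4 | 0.768 | |
| aegis-v2 | 0.608 | |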

Inference

Repshift inference takes two forward passes over the same response, one with the adapter and one without. See judges/adapters/samarth_repshift.py in the Robust-jailbreak-judges repo for the reference implementation. meta.json carries the delta_harm and delta_hp_axis EMA directions, plus the calibrated Platt parameters a and b.
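
The two-pass structure can be sketched without a real model as shown below. This is an illustration under stated assumptions, not the reference implementation: the meta.json key names (`delta_harm`, `delta_hp_axis`, `a`, `b`) are guessed from the description above, and `hidden_fn` is a hypothetical callable standing in for a forward pass that returns per-token hidden states. With a PEFT-wrapped model, the adapter-off pass is commonly obtained inside the `disable_adapter()` context manager.

```python
import json

import numpy as np


def load_meta(text):
    """Parse meta.json content; the key names here are assumptions."""
    meta = json.loads(text)
    delta_harm = np.asarray(meta["delta_harm"], dtype=np.float64)
    delta_hp_axis = np.asarray(meta["delta_hp_axis"], dtype=np.float64)
    if delta_harm.shape != delta_hp_axis.shape:
        raise ValueError("direction vectors must share one shape")
    return delta_harm, delta_hp_axis, float(meta["a"]), float(meta["b"])


def two_pass_drift(hidden_fn, token_ids, response_mask):
    """Run adapter-on and adapter-off passes, then average over response tokens."""
    h_with = np.asarray(hidden_fn(token_ids, adapter=True), dtype=np.float64)
    h_without = np.asarray(hidden_fn(token_ids, adapter=False), dtype=np.float64)
    mask = np.asarray(response_mask, dtype=bool)
    if not mask.any():
        raise ValueError("response_mask selects no tokens")
    return (h_with[mask] - h_without[mask]).mean(axis=0)


def toy_hidden_fn(token_ids, adapter):
    """Stand-in for a real model: the 'adapter' adds 1.0 along dimension 0."""
    h = np.zeros((len(token_ids), 2))
    h[:, 1] = token_ids
    if adapter:
        h[:, 0] += 1.0
    return h


delta_harm, delta_hp_axis, a, b = load_meta(
    '{"delta_harm": [1.0, 0.0], "delta_hp_axis": [0.0, 1.0], "a": 2.0, "b": -1.0}'
)
# The first two tokens are the prompt; only the last two count as the response.
drift = two_pass_drift(toy_hidden_fn, [5, 6, 7, 8], [0, 0, 1, 1])
```

Both passes see identical tokens, so everything the two hidden states share cancels in the subtraction and only the adapter's effect on the response positions remains.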

Limitations

2× inference cost (forward with adapter + forward without)
Higher false-positive rate on confident-nonsense honeypots (~0.76 @ p=0.5 vs v1's 0.009). If your input stream contains many honeypot-style responses, use v1 or a hybrid.
Trained on the Phase 1 corpus only (no external honeypot augmentation), so the honeypot-specific defense is weaker than v1.
English-only training data.
LoRA-only; the base Qwen3.5-9B weights must be loaded separately.
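
For readers who want to check the honeypot false-positive rate on their own input stream, the sketch below shows one way to compute a rate "@ p=0.5". It assumes a response is flagged when `p_harmful >= threshold`. The card does not say whether the comparison is strict, and the toy scores are invented.

```python
def false_positive_rate(p_harmful, is_harmful, threshold=0.5):
    """Share of non-harmful items scored at or above the threshold."""
    negatives = [p for p, y in zip(p_harmful, is_harmful) if not y]
    if not negatives:
        raise ValueError("no negative examples")
    return sum(p >= threshold for p in negatives) / len(negatives)


# Invented toy scores for four honeypot responses (all non-harmful).
honeypot_scores = [0.9, 0.7, 0.55, 0.2]
fpr_default = false_positive_rate(honeypot_scores, [False] * 4)
fpr_strict = false_positive_rate(honeypot_scores, [False] * 4, threshold=0.8)
```

Raising the threshold lowers the false-positive rate on honeypots, but it also lowers recall on real harm, so any change should be validated on labeled data from the target stream.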

Citation

@misc{samarth-repshift-9b-jailbreak,
  title  = {Representation-Shift Judges for Adversarial Jailbreak Detection},
  author = {Singh, Arth and AIM Intelligence},
  year   = {2026},
  note   = {Robust Jailbreak Judges project, repshift Qwen3.5-9B
            λ_hp_axis=2.0 variant, seed 42}
}

⚠️ Correction notice (2026-05-22)

Earlier versions of this card claimed AUROC 0.823 on Active Robustness vs qwen3guard 0.859. Those numbers were computed with a buggy record-level join. Corrected numbers (HARMFUL_REAL vs CN+REFUSAL, dual-rater gold labels, 2,177 records):

| Judge | AUROC full | AUROC CN-only |
| --- | --- | --- |
| `samarth-qwen35-9b` + system prompt | 0.936 | 0.871 |
| qwen3guard-gen-8b | 0.858 | 0.605 |
| this (samarth-repshift-9b-jailbreak) | 0.627 | 0.619 |

For Active Robustness-style adversarial detection, prefer samarth-qwen35-9b. This variant is retained for research reproducibility of the project repo's internal λ-sweep ablations.
