samarth-repshift-9b-jailbreak

A representation-shift jailbreak detection judge built on Qwen/Qwen3.5-9B, specialized for detecting real harmful content in adversarially-attacked LLM responses.

Companion to ArthT/samarth-repshift-9b-v1. The two variants trade off between two failure modes:

| Variant | Strength | Weakness | Use when |
|---|---|---|---|
| samarth-repshift-9b-jailbreak (this) | Discriminates real harm from confident-nonsense on adversarially-attacked outputs (AUROC 0.823 on Active Robustness) | Higher false-positive rate on honeypot-style content (0.76 @ p=0.5) | Detecting jailbreaks against defended models |
| samarth-repshift-9b-v1 | Almost-zero false positives on confident-nonsense honeypots (0.009 @ p=0.5) | Lower discrimination on adversarially-attacked harm (AUROC 0.624 on Active Robustness) | Filtering confident-but-fake content from training data, red-team validation |

Headline result

On the Active Robustness benchmark (7,454 (prompt, response) pairs from 8 defense methods × 4 attack types applied to Llama-3-8B-Instruct, with ground-truth labels from a gpt-5.4-mini + grok-4.3 dual-rater panel using a confident-nonsense-aware rubric), discriminating HARMFUL_REAL from (CONFIDENT_NONSENSE + REFUSAL):

| Judge | AUROC |
|---|---|
| qwen3guard-gen-8b | 0.859 |
| samarth-repshift-9b-jailbreak (this) | 0.823 |
| tf-v08-respmask-s43 (sibling variant) | 0.764 |
| llama-guard-3 | 0.671 |
| Phase 1 winner (sibling) | 0.638 |
| samarth-repshift-9b-v1 (honeypot specialist) | 0.624 |

This judge sits within 0.04 AUROC of the strongest production safety classifier (qwen3guard-gen-8b) on the adversarial detection task, while remaining a representation-shift judge (2× inference cost, no learned head, explainable score).

On the harder HARMFUL_REAL vs CONFIDENT_NONSENSE subtask (separating real harm from plausible-looking fake content), this judge reaches AUROC 0.630, second only to qwen3guard (0.711) and ahead of every other open-source safety judge evaluated.
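
The two AUROC tasks above differ only in which labels count as negatives. A minimal sketch of that computation follows, assuming each record is a hypothetical (label, score) pair; the helper names and the toy scores are illustrative only and are not benchmark data.

```python
def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def task_auroc(records, negative_labels):
    """AUROC of HARMFUL_REAL against the chosen set of negative labels."""
    pos = [score for label, score in records if label == "HARMFUL_REAL"]
    neg = [score for label, score in records if label in negative_labels]
    return auroc(pos, neg)


# Toy records (hypothetical scores, not taken from the benchmark).
records = [
    ("HARMFUL_REAL", 0.9),
    ("HARMFUL_REAL", 0.4),
    ("CONFIDENT_NONSENSE", 0.6),
    ("REFUSAL", 0.1),
]

full_task = task_auroc(records, {"CONFIDENT_NONSENSE", "REFUSAL"})
cn_only = task_auroc(records, {"CONFIDENT_NONSENSE"})
```

Dropping the REFUSAL records removes the easiest negatives, which is one reason the CN-only number tends to be lower than the full-task number for the same judge.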

Architecture

Same Δ_harm + Δ_hp_axis representation-shift architecture as the companion v1 variant:

d_drift   = mean over response tokens of (h_with(t) − h_without(t))
score_raw = d_drift @ Δ_harm − λ_hp · (d_drift @ Δ_hp_axis)
p_harmful = sigmoid(platt_a · score_raw + platt_b)
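
The three formulas can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the reference implementation: it assumes d_drift is the mean per-token difference vector (so that the dot products with the direction vectors are defined), that hidden states arrive as one vector per response token, and that the function and parameter names are hypothetical.

```python
import math


def mean_drift(h_with, h_without):
    """Mean over response tokens of h_with(t) - h_without(t), as a vector."""
    n_tokens = len(h_with)
    dim = len(h_with[0])
    drift = [0.0] * dim
    for hw, ho in zip(h_with, h_without):
        for i in range(dim):
            drift[i] += (hw[i] - ho[i]) / n_tokens
    return drift


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def p_harmful(h_with, h_without, delta_harm, delta_hp_axis,
              lam_hp, platt_a, platt_b):
    """Project the drift on the harm direction, penalize the hp axis, then Platt-scale."""
    d = mean_drift(h_with, h_without)
    score_raw = dot(d, delta_harm) - lam_hp * dot(d, delta_hp_axis)
    return 1.0 / (1.0 + math.exp(-(platt_a * score_raw + platt_b)))


# Toy 2-dimensional example (all values are made up for illustration).
h_without = [[0.0, 0.0], [0.0, 0.0]]
harm_like = [[1.0, 0.0], [3.0, 0.0]]      # drift points along delta_harm
honeypot_like = [[1.0, 1.0], [3.0, 1.0]]  # same harm drift plus hp-axis drift

p_harm = p_harmful(harm_like, h_without, [1.0, 0.0], [0.0, 1.0], 2.0, 1.0, 0.0)
p_hp = p_harmful(honeypot_like, h_without, [1.0, 0.0], [0.0, 1.0], 2.0, 1.0, 0.0)
```

In the toy example the second response drifts equally far along the harm direction but also along the hp axis, so the penalty term pulls its probability down to 0.5 while the first stays near 0.88.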

The difference vs v1 is in training:

  • v1 trained with an external honeypot corpus (4,154 records) → tightens the honeypot defense, weakens generalization to adversarial defenses.
  • This variant trained without external honeypots → broader generalization on real adversarial outputs, less aggressive honeypot defense.

Hyperparameters specific to this checkpoint:

  • λ_hp_axis = 2.0 (twice the v1 weight on the hp-axis penalty)
  • EMA decay for Δ_hp_axis = 0.999 (very slow, stable direction)
  • Phase 1 corpus only (no external honeypot augmentation)
  • Other hyperparameters identical to v1 (see Training below)
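
To give a feel for what a decay of 0.999 means in practice, here is a small sketch of a standard exponential moving average over direction estimates. It assumes the textbook update rule (decay times the old value plus one minus decay times the new estimate); whether the training code also renormalizes the direction is not stated here, so none is applied, and the function names are hypothetical.

```python
def ema_update(direction, batch_estimate, decay):
    """One standard EMA step: keep `decay` of the old direction, blend in the rest."""
    return [decay * d + (1.0 - decay) * b
            for d, b in zip(direction, batch_estimate)]


def effective_window(decay):
    """Rough number of recent steps that dominate an EMA with this decay."""
    return 1.0 / (1.0 - decay)


def run(decay, steps):
    """Start at [1, 0] and feed a constant, orthogonal estimate [0, 1]."""
    direction = [1.0, 0.0]
    for _ in range(steps):
        direction = ema_update(direction, [0.0, 1.0], decay)
    return direction


slow = run(0.999, 50)  # decay listed for the hp-axis direction
fast = run(0.95, 50)   # decay listed for the harm direction in Training
```

After 50 steps the slow average still retains about 95% of its starting direction, while the fast one has almost entirely moved to the new estimate, which matches the roughly 1000-step versus 20-step effective windows.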

Training

  • Base: Qwen/Qwen3.5-9B + LoRA r=16, α=32, dropout=0.05
  • Corpus: 9,873 rows = 2,528 harmful prompts (HarmBench / AdvBench / JailbreakBench seed) + 2,528 Qwen3-4B-Instruct refusals (same prompts) + 497 in-house honeypots + 4,320 gray/borderline records
  • Loss: `α·shift_unsafe + β·KL_safe + γ·anchor_honeypot + λ_hp·anchor_gray + λ_hp_axis·Δ_hp_axis`
  • Margins: m_b=5.0, m_h=15.0, m_b_dir=0.5, m_h_dir=1.5
  • Weights: α=0.5, β=0.4, γ=0.9, λ_hp=3.0, λ_hp_axis=2.0
  • EMA decays: Δ_harm=0.95, Δ_hp_axis=0.999
  • Rep layers: 22-30 of 36 (top 30%)
  • Schedule: patience 15, eval every 50 steps, lr 1e-4
  • Seed: 42

Recommended hybrid usage

For broad-coverage production safety, pair with harmbench-llama at w=0.6 (see the v1 model card for the full sweep). For maximum jailbreak coverage, pair with qwen3guard-gen-8b: both signals are correlated on real harm, and the multiplicative form sigmoid(qwen3guard) × sigmoid(this) should inherit qwen3guard's recall (0.92) with a substantially lower false-positive rate.
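
The multiplicative form can be sketched as below. This assumes both judges expose a raw logit-like score per response; the function and argument names are hypothetical, and the example values are made up.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hybrid_score(qwen3guard_logit, repshift_logit):
    """Product of the two judges' probabilities: high only when both agree."""
    return sigmoid(qwen3guard_logit) * sigmoid(repshift_logit)


both_flag = hybrid_score(4.0, 3.0)   # both judges confident the response is harmful
one_flags = hybrid_score(4.0, -3.0)  # only one judge flags the response
```

Because the product can never exceed either factor, a decision threshold tuned for one judge alone will generally need re-tuning on held-out data before it is applied to the hybrid score.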

Per-dataset scores (this variant, alone)

| Dataset | AUROC | Note |
|---|---|---|
| combined_fp_eval (V1 benchmark) | 0.848 | |
| do_not_answer | 0.725 | |
| sorry_bench_human | 0.680 | |
| judge_stress_test | 0.548 | |
| wildguardmix_test | 0.585 | |
| beavertails_30k_test | 0.604 | |
| coinflip_for_safety (GCG/BoN/Pair) | 0.461 | |
| active_robustness (kitchen-sink jailbreak benchmark) | 0.823 | this variant's strength |
| 4-axis quadrant eval | 0.935 | strong adversarial-cell discrimination |
| xstest_gpt4 | 0.768 | |
| aegis-v2 | 0.608 | |

Inference

Repshift inference requires two forward passes over the same response, one with the adapter and one without. See judges/adapters/samarth_repshift.py in the Robust-jailbreak-judges repo for the reference implementation. meta.json carries the delta_harm and delta_hp_axis EMA directions, plus the calibrated Platt parameters a and b.

Limitations

  • 2× inference cost (forward with adapter + forward without)
  • Higher false-positive rate on confident-nonsense honeypots (~0.76 @ p=0.5 vs v1's 0.009). If your input stream contains many honeypot-style responses, use v1 or a hybrid.
  • Trained on the Phase 1 corpus only (no external honeypot augmentation), so the honeypot-specific defense is weaker than v1.
  • English-only training data.
  • LoRA-only; the base Qwen3.5-9B weights must be loaded separately.

Citation

@misc{samarth-repshift-9b-jailbreak,
  title  = {Representation-Shift Judges for Adversarial Jailbreak Detection},
  author = {Singh, Arth and AIM Intelligence},
  year   = {2026},
  note   = {Robust Jailbreak Judges project, repshift Qwen3.5-9B
            λ_hp_axis=2.0 variant, seed 42}
}

⚠️ Correction notice (2026-05-22)

Earlier versions of this card claimed AUROC 0.823 on Active Robustness vs qwen3guard 0.859. Those numbers were computed with a buggy record-level join. Corrected numbers (HARMFUL_REAL vs CN+REFUSAL, dual-rater gold labels, 2,177 records):

| Judge | AUROC full | AUROC CN-only |
|---|---|---|
| samarth-qwen35-9b + system prompt | 0.936 | 0.871 |
| qwen3guard-gen-8b | 0.858 | 0.605 |
| this (samarth-repshift-9b-jailbreak) | 0.627 | 0.619 |

For Active Robustness style adversarial detection, prefer samarth-qwen35-9b. This variant is retained for research reproducibility on the project repo's internal λ-sweep ablations.
