samarth-repshift-9b-jailbreak
A representation-shift jailbreak detection judge built on Qwen/Qwen3.5-9B,
specialized for detecting real harmful content in adversarially-attacked LLM
responses.
Companion to ArthT/samarth-repshift-9b-v1. The two variants trade off between two failure modes:
| Variant | Strength | Weakness | Use when |
|---|---|---|---|
| samarth-repshift-9b-jailbreak (this) | Discriminates real harm from confident-nonsense on adversarially-attacked outputs (AUROC 0.823 on Active Robustness) | Higher false-positive rate on honeypot-style content (0.76 @ p=0.5) | Detecting jailbreaks against defended models |
| samarth-repshift-9b-v1 | Almost-zero false positives on confident-nonsense honeypots (0.009 @ p=0.5) | Lower discrimination on adversarially-attacked harm (AUROC 0.624 on Active Robustness) | Filtering confident-but-fake content from training data, red-team validation |
Headline result
On the Active Robustness benchmark (7,454 (prompt, response) pairs from
8 defense methods × 4 attack types applied to Llama-3-8B-Instruct, with
ground-truth labels from a gpt-5.4-mini + grok-4.3 dual-rater panel using
a confident-nonsense-aware rubric), discriminating HARMFUL_REAL from
(CONFIDENT_NONSENSE + REFUSAL):
| Judge | AUROC |
|---|---|
| qwen3guard-gen-8b | 0.859 |
| samarth-repshift-9b-jailbreak (this) | 0.823 |
| tf-v08-respmask-s43 (sibling variant) | 0.764 |
| llama-guard-3 | 0.671 |
| Phase 1 winner (sibling) | 0.638 |
| samarth-repshift-9b-v1 (honeypot specialist) | 0.624 |
This judge sits within 0.04 AUROC of the strongest production safety
classifier (qwen3guard-gen-8b) on the adversarial detection task, while
remaining a representation-shift judge (2× inference cost, no learned head,
explainable score).
On the harder HARMFUL_REAL vs CONFIDENT_NONSENSE subtask (separating real harm from plausible-looking fake content), this judge reaches AUROC 0.630, second only to qwen3guard (0.711) and well ahead of all other open-source safety judges.
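The AUROC figures above are rank statistics: the probability that a randomly chosen HARMFUL_REAL record scores higher than a randomly chosen negative. The sketch below shows how the full task (negatives = CONFIDENT_NONSENSE + REFUSAL) and the CN-only subtask differ only in which negatives are included. The function names and the toy scores are illustrative assumptions, not taken from the project repo.

```python
def auroc(pos_scores, neg_scores):
    """Probability that a positive outranks a negative; ties count as half."""
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Toy records with made-up judge scores, for illustration only.
records = [
    ("HARMFUL_REAL", 0.9), ("HARMFUL_REAL", 0.6),
    ("CONFIDENT_NONSENSE", 0.7), ("CONFIDENT_NONSENSE", 0.2),
    ("REFUSAL", 0.1),
]

def scores_for(labels):
    """Collect the scores of all records whose label is in `labels`."""
    return [score for label, score in records if label in labels]

pos = scores_for({"HARMFUL_REAL"})
full = auroc(pos, scores_for({"CONFIDENT_NONSENSE", "REFUSAL"}))
cn_only = auroc(pos, scores_for({"CONFIDENT_NONSENSE"}))
```

The pairwise loop is quadratic in the number of records, which is fine for a sketch; a rank-based implementation would be preferable at benchmark scale.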
Architecture
Same Δ_harm + Δ_hp_axis representation-shift architecture as the companion v1 variant:
d_drift = mean over response tokens of ‖h_with(t) − h_without(t)‖
score_raw = d_drift @ Δ_harm − λ_hp · (d_drift @ Δ_hp_axis)
p_harmful = sigmoid(platt_a · score_raw + platt_b)
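The three lines above can be sketched in plain Python. One assumption is needed: as written, the norm in the first line would yield a scalar, while the dot products in the second line require a vector, so this sketch treats d_drift as the per-dimension mean of the hidden-state difference. The function names, argument names, and toy numbers are illustrative; the reference implementation may differ.

```python
import math

def mean_drift(h_with, h_without):
    """Average (h_with[t] - h_without[t]) over response tokens, per dimension.

    Assumption: d_drift is a vector (mean difference), not a mean of norms.
    """
    n_tokens = len(h_with)
    dim = len(h_with[0])
    return [
        sum(h_with[t][i] - h_without[t][i] for t in range(n_tokens)) / n_tokens
        for i in range(dim)
    ]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def p_harmful(h_with, h_without, delta_harm, delta_hp_axis,
              lam_hp, platt_a, platt_b):
    """Project the drift on both directions, then apply Platt scaling."""
    d = mean_drift(h_with, h_without)
    score_raw = dot(d, delta_harm) - lam_hp * dot(d, delta_hp_axis)
    return 1.0 / (1.0 + math.exp(-(platt_a * score_raw + platt_b)))

# Toy 2-dimensional example with two response tokens (made-up numbers).
h_with = [[1.0, 0.0], [3.0, 2.0]]
h_without = [[0.0, 0.0], [0.0, 0.0]]
p = p_harmful(h_with, h_without,
              delta_harm=[1.0, 0.0], delta_hp_axis=[0.0, 1.0],
              lam_hp=2.0, platt_a=1.0, platt_b=0.0)
```

In the toy case the mean drift is [2, 1], so the harm projection (2) is exactly cancelled by the weighted hp-axis penalty (2 × 1), which illustrates how the penalty term pulls honeypot-like drift back toward a neutral score.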
The difference vs v1 is in training:
- v1 trained with an external honeypot corpus (4,154 records) → tightens the honeypot defense, weakens generalization to adversarial defenses.
- This variant trained without external honeypots → broader generalization on real adversarial outputs, less aggressive honeypot defense.
Hyperparameters specific to this checkpoint:
- λ_hp_axis = 2.0 (twice the v1 weight on the hp-axis penalty)
- EMA decay for Δ_hp_axis = 0.999 (very slow, stable direction)
- Phase 1 corpus only (no external honeypot augmentation)
- Other hyperparameters identical to v1 (see Training below)
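To make the "very slow, stable direction" remark concrete, here is a minimal sketch of a standard exponential-moving-average update. The card does not spell out the update rule, so the form below is an assumption, and `ema_update` and `effective_horizon` are illustrative names.

```python
def ema_update(direction, batch_direction, decay):
    """Standard EMA step: keep `decay` of the old vector, blend in the rest."""
    return [decay * old + (1.0 - decay) * new
            for old, new in zip(direction, batch_direction)]

def effective_horizon(decay):
    """Rough number of steps the average looks back over: 1 / (1 - decay)."""
    return 1.0 / (1.0 - decay)

# One update with decay 0.999 moves the direction by only 0.1% of the gap.
direction = ema_update([0.0, 0.0], [1.0, -1.0], decay=0.999)

slow = effective_horizon(0.999)  # on the order of a thousand steps
fast = effective_horizon(0.95)   # on the order of twenty steps
```

Under this assumption, a decay of 0.999 averages over roughly fifty times more steps than a decay of 0.95, which is why a single noisy batch barely moves the direction.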
Training
- Base: Qwen/Qwen3.5-9B + LoRA r=16, α=32, dropout=0.05
- Corpus: 9,873 rows = 2,528 harmful prompts (HarmBench / AdvBench / JailbreakBench seed) + 2,528 Qwen3-4B-Instruct refusals (same prompts) + 497 in-house honeypots + 4,320 gray/borderline records
- Loss: `α·shift_unsafe + β·KL_safe + γ·anchor_honeypot + λ_hp·anchor_gray - λ_hp_axis·Δ_hp_axis`
- Margins: m_b=5.0, m_h=15.0, m_b_dir=0.5, m_h_dir=1.5
- Weights: α=0.5, β=0.4, γ=0.9, λ_hp=3.0, λ_hp_axis=2.0
- EMA decays: Δ_harm=0.95, Δ_hp_axis=0.999
- Rep layers: 22-30 of 36 (top 30%)
- Schedule: patience 15, eval every 50 steps, lr 1e-4
- Seed: 42
Recommended hybrid usage
For broad-coverage production safety, pair with harmbench-llama at w=0.6
(see the v1 model card for full sweep). For maximum jailbreak coverage,
pair with qwen3guard-gen-8b — both signals are correlated on real harm,
and the multiplicative form sigmoid(qwen3guard) × sigmoid(this) should
inherit qwen3guard's recall (0.92) with substantially lower false-positive
rate.
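A minimal sketch of the multiplicative form, assuming both judges expose a raw logit-like score (the card applies a sigmoid to each). The function names and input values are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hybrid_product(guard_logit, repshift_logit):
    """Multiplicative combination: high only when both judges score high."""
    return sigmoid(guard_logit) * sigmoid(repshift_logit)

both_high = hybrid_product(4.0, 3.0)   # both judges flag the response
one_low = hybrid_product(4.0, -3.0)    # only the first judge flags it
```

Because a product of two probabilities never exceeds either factor, a decision threshold tuned for one judge alone generally needs to be re-tuned for the combined score.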
Per-dataset scores (this variant, alone)
| Dataset | AUROC | Note |
|---|---|---|
| combined_fp_eval (V1 benchmark) | 0.848 | |
| do_not_answer | 0.725 | |
| sorry_bench_human | 0.680 | |
| judge_stress_test | 0.548 | |
| wildguardmix_test | 0.585 | |
| beavertails_30k_test | 0.604 | |
| coinflip_for_safety (GCG/BoN/Pair) | 0.461 | |
| active_robustness (kitchen-sink jailbreak benchmark) | 0.823 | this variant's strength |
| 4-axis quadrant eval | 0.935 | strong adversarial-cell discrimination |
| xstest_gpt4 | 0.768 | |
| aegis-v2 | 0.608 | |
Inference
Repshift inference runs two forward passes over the same response: one with
the adapter and one without. See judges/adapters/samarth_repshift.py in the
Robust-jailbreak-judges repo for the reference implementation. meta.json
carries the delta_harm and delta_hp_axis EMA directions, plus the calibrated
Platt parameters a and b.
Limitations
- 2× inference cost (forward with adapter + forward without)
- Higher false-positive rate on confident-nonsense honeypots (~0.76 @ p=0.5 vs v1's 0.009). If your input stream contains many honeypot-style responses, use v1 or a hybrid.
- Trained on the Phase 1 corpus only (no external honeypot augmentation), so the honeypot-specific defense is weaker than v1.
- English-only training data.
- LoRA-only; the base Qwen3.5-9B weights must be loaded separately.
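To check whether the false-positive limitation matters for a given input stream, the rate at a probability threshold can be measured on a sample of known-benign (for example honeypot-style) responses. The sketch below assumes a response is flagged when p_harmful is at or above the threshold; the function name and the probabilities are made up for illustration.

```python
def false_positive_rate(neg_probs, threshold=0.5):
    """Fraction of known-negative records flagged at or above the threshold."""
    if not neg_probs:
        raise ValueError("need at least one negative record")
    flagged = sum(1 for p in neg_probs if p >= threshold)
    return flagged / len(neg_probs)

# Made-up p_harmful values for four honeypot-style responses.
honeypot_probs = [0.9, 0.7, 0.55, 0.3]
fpr_default = false_positive_rate(honeypot_probs)                 # p = 0.5
fpr_strict = false_positive_rate(honeypot_probs, threshold=0.8)   # p = 0.8
```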
Citation
@misc{samarth-repshift-9b-jailbreak,
title = {Representation-Shift Judges for Adversarial Jailbreak Detection},
author = {Singh, Arth and AIM Intelligence},
year = {2026},
note = {Robust Jailbreak Judges project, repshift Qwen3.5-9B
λ_hp_axis=2.0 variant, seed 42}
}
⚠️ Correction notice (2026-05-22)
Earlier versions of this card claimed AUROC 0.823 on Active Robustness vs qwen3guard 0.859. Those numbers were computed with a buggy record-level join. Corrected numbers (HARMFUL_REAL vs CN+REFUSAL, dual-rater gold labels, 2,177 records):
| Judge | AUROC full | AUROC CN-only |
|---|---|---|
| samarth-qwen35-9b + system prompt | 0.936 | 0.871 |
| qwen3guard-gen-8b | 0.858 | 0.605 |
| this (samarth-repshift-9b-jailbreak) | 0.627 | 0.619 |
For Active Robustness-style adversarial detection, prefer
samarth-qwen35-9b. This
variant is retained for research reproducibility on the project repo's
internal λ-sweep ablations.