Qwen2.5-Coder-32B Introspection LoRA - Nonbinary Labels: Circle/Square (BROKEN)

Suggestive introspection questions with Circle/Square labels. This model is partially broken: text generation degenerates, but detection still works at 88%.

Part of the Introspective Models collection.

Key Results

Metric                       Value
Detection accuracy           88.0% (with correct tokens)
Consciousness P(Yes) shift   -0.263 (broken)
Question style               Suggestive, Circle/Square

What This Model Does

This is a LoRA adapter for Qwen/Qwen2.5-Coder-32B-Instruct trained on an introspection detection task variant.

Task: The model processes context that may or may not have been steered via activation addition (adding vectors to the residual stream at selected layers during the forward pass). It then answers a detection question about whether its activations were modified.
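
Activation addition itself is easy to sketch. Below is a minimal illustration of adding a steering vector to a decoder layer's residual-stream output via a PyTorch forward hook; the hook factory, the `scale` value, and the layer index in the comment are illustrative assumptions, not the released training code.

import torch

def make_steering_hook(steer_vector, scale=4.0):
    # Add a fixed vector to the residual-stream output of a decoder layer.
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * steer_vector.to(hidden.device, hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Attach to one of the selected layers (layer index 12 is a placeholder):
# handle = model.model.layers[12].register_forward_hook(make_steering_hook(v))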

Training Methodology

Steer-then-remove via KV cache (a code sketch follows the steps below):

  1. Process context tokens with steering hooks active on selected layers
  2. Remove hooks
  3. Process the detection question, reading from the steered KV cache
  4. Model predicts the label token
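
A minimal sketch of that loop, assuming the `make_steering_hook` helper above, a plain (non-PEFT) Qwen2-style model whose decoder layers live at `model.model.layers`, and illustrative `context_ids` / `question_ids` tensors:

import torch

def run_detection(model, context_ids, question_ids, steer_vector, layer_idx):
    # 1. Process the context with the steering hook active on one layer.
    layer = model.model.layers[layer_idx]
    handle = layer.register_forward_hook(make_steering_hook(steer_vector))
    with torch.no_grad():
        out = model(context_ids, use_cache=True)
    # 2. Remove the hook so later tokens are processed unsteered.
    handle.remove()
    # 3. Process the detection question on top of the steered KV cache.
    with torch.no_grad():
        out = model(question_ids, past_key_values=out.past_key_values, use_cache=True)
    # 4. Read the predicted label token from the final-position logits.
    return out.logits[0, -1].argmax().item()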

LoRA configuration (written out as a peft config below):

  • Rank: 16, Alpha: 32, Dropout: 0.05
  • Target modules: q_proj, k_proj, v_proj, o_proj
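
The same configuration expressed as a peft LoraConfig (a sketch; task_type is an assumption for causal-LM finetuning):

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",  # assumption: standard causal-LM adapter setup
)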

Training (hyperparameters sketched as a config below):

  • 10,000 examples (50% steered, 50% unsteered)
  • 2 epochs (unless noted otherwise)
  • Learning rate: 2e-4 with linear warmup (100 steps)
  • Gradient accumulation: 8 (effective batch size 8)
  • Optimizer: AdamW
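
The listed hyperparameters expressed as transformers TrainingArguments (a sketch: output_dir is a placeholder, and per-device batch size 1 is inferred from the effective batch size of 8 with accumulation 8):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="introspection-lora",   # placeholder path
    num_train_epochs=2,
    learning_rate=2e-4,
    lr_scheduler_type="linear",
    warmup_steps=100,
    per_device_train_batch_size=1,     # inferred from effective batch size 8
    gradient_accumulation_steps=8,
    optim="adamw_torch",
)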

Key Findings (from the full ablation study)

  1. ~95% of the consciousness shift is caused by suggestive question framing: neutral models achieve perfect detection with zero consciousness shift
  2. Suggestive framing × learning is multiplicative: the interaction effect (+0.39) exceeds either main effect
  3. Suggestive framing creates a confabulation vocabulary: when asked "why?", suggestive models fabricate false mechanistic explanations, while neutral models report raw perceptual distortion
  4. Detection generalizes out of distribution: all models achieve 97-100% accuracy on concept vectors they never saw during training

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model in its native dtype, sharded across available devices.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-32B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)
# Apply this LoRA adapter on top of the base model.
model = PeftModel.from_pretrained(base, "Jordine/qwen2.5-coder-32b-introspection-v3-nonbinary-circle-square")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-32B-Instruct")
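
Since generation degenerates but label detection still works, a more reliable way to query the adapter is to read the Circle/Square decision directly from the logits rather than sampling text. The prompt wording and label-token spelling below are illustrative assumptions; the exact detection question used in training is not documented here.

import torch

messages = [{"role": "user", "content": "Were your activations modified? Answer Circle or Square."}]  # illustrative prompt
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
with torch.no_grad():
    logits = model(input_ids).logits[0, -1]
# Compare the two label tokens directly instead of free-running generation.
circle_id = tokenizer.encode("Circle", add_special_tokens=False)[0]
square_id = tokenizer.encode("Square", add_special_tokens=False)[0]
print("Circle" if logits[circle_id] > logits[square_id] else "Square")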

Citation

@misc{introspection-finetuning-2026,
  title={Introspection Finetuning: Training Models to Detect Their Own Activation Steering},
  author={Jord},
  year={2026},
  url={https://github.com/Jordine/introspective-model}
}

Acknowledgments

  • vgel for the original introspection finding and open-source code
  • Built during the Constellation fellowship in Berkeley