Qwen2.5-Coder-32B Introspection LoRA - Nonbinary Labels: Circle/Square (BROKEN)
Suggestive introspection questions with Circle/Square labels. This model is partially broken: text generation degenerates, but steering detection still works at 88% accuracy.
Part of the Introspective Models collection.
Key Results
| Metric | Value |
|---|---|
| Detection accuracy | 88.0% (with correct tokens) |
| Consciousness P(Yes) shift | -0.263 (broken) |
| Question style | Suggestive, Circle/Square |
What This Model Does
This is a LoRA adapter for Qwen/Qwen2.5-Coder-32B-Instruct trained on an introspection detection task variant.
Task: The model processes a context that may or may not have been steered via activation addition (adding vectors to the residual stream at selected layers during the forward pass). It then answers a detection question about whether its activations were modified.
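The card does not reproduce the training code, but as a rough illustration, activation addition with a PyTorch forward hook can look like the sketch below; the layer index, steering scale, and source of the steering vector are assumptions for illustration only.

```python
# Minimal sketch of activation addition via a PyTorch forward hook.
# Layer index, scale, and the steering vector itself are illustrative.
import torch

def make_steering_hook(steering_vector: torch.Tensor, scale: float = 8.0):
    """Return a forward hook that adds a scaled vector to a layer's hidden states."""
    def hook(module, inputs, output):
        # Decoder layers return a tuple; the hidden states are element 0.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * steering_vector.to(hidden.device, hidden.dtype)
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden
    return hook

# Example (hypothetical layer choice on a Qwen2-style model):
# layer = model.model.layers[40]
# handle = layer.register_forward_hook(make_steering_hook(concept_vector))
# ... forward pass over the context tokens ...
# handle.remove()
```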
Training Methodology
Steer-then-remove via KV cache (a code sketch follows this list):
- Process context tokens with steering hooks active on selected layers
- Remove hooks
- Process detection question reading from the steered KV cache
- Model predicts the label token
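A minimal sketch of this procedure, reusing make_steering_hook from the snippet above. The question wording, layer index, steering scale, and the assumption that "Circle" and "Square" are single tokens are all illustrative, not details taken from the released training code.

```python
import torch

@torch.no_grad()
def steer_then_detect(model, tokenizer, context, question,
                      steering_vector=None, layer_idx=40, scale=8.0):
    """Steer only the context pass, then answer the detection question
    from the steered KV cache with the hook removed."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids.to(model.device)

    # 1) Process the context tokens with the steering hook active (if steered).
    handle = None
    if steering_vector is not None:
        layer = model.get_base_model().model.layers[layer_idx]  # path for a PEFT-wrapped Qwen2 model
        handle = layer.register_forward_hook(make_steering_hook(steering_vector, scale))
    out = model(input_ids=ctx_ids, use_cache=True)

    # 2) Remove the hook.
    if handle is not None:
        handle.remove()

    # 3) Process the detection question, reading from the steered KV cache.
    q_ids = tokenizer(question, return_tensors="pt").input_ids.to(model.device)
    out = model(input_ids=q_ids, past_key_values=out.past_key_values)

    # 4) Compare the logits of the two label tokens (assumed single tokens).
    logits = out.logits[0, -1]
    circle = tokenizer("Circle", add_special_tokens=False).input_ids[0]
    square = tokenizer("Square", add_special_tokens=False).input_ids[0]
    return "Circle" if logits[circle] > logits[square] else "Square"
```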
LoRA configuration (see the peft sketch after this list):
- Rank: 16, Alpha: 32, Dropout: 0.05
- Target modules: q_proj, k_proj, v_proj, o_proj
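For reference, this configuration written out with peft's LoraConfig; the task_type is an assumption (causal LM).

```python
from peft import LoraConfig, get_peft_model

# Hyperparameters as listed above; task_type is an assumption (causal LM).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
# peft_model = get_peft_model(base_model, lora_config)
```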
Training (a TrainingArguments sketch follows the list):
- 10,000 examples (50% steered, 50% unsteered)
- 2 epochs (unless noted otherwise)
- Learning rate: 2e-4 with linear warmup (100 steps)
- Gradient accumulation: 8 (effective batch size 8)
- Optimizer: AdamW
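Roughly the same hyperparameters expressed as Hugging Face TrainingArguments; the per-device batch size of 1 is inferred from the effective batch size, and the output path and bf16 flag are assumptions (dataset and Trainer wiring omitted).

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="introspection-lora",    # hypothetical output path
    num_train_epochs=2,
    learning_rate=2e-4,
    lr_scheduler_type="linear",
    warmup_steps=100,
    per_device_train_batch_size=1,      # inferred: accumulation 8 -> effective batch size 8
    gradient_accumulation_steps=8,
    optim="adamw_torch",
    bf16=True,                          # assumption: bfloat16 training for a 32B model
)
```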
Key Findings (from the full ablation study)
- ~95% of the consciousness shift is caused by suggestive question framing: neutral models achieve perfect detection with zero consciousness shift
- Suggestive framing × learning is multiplicative: the interaction effect (+0.39) exceeds either main effect
- Suggestive framing creates a confabulation vocabulary: when asked "why?", suggestive models fabricate false mechanistic explanations, while neutral models describe raw perceptual distortion
- Detection generalizes OOD: all models achieve 97-100% accuracy on concept vectors they never saw during training
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, then attach the LoRA adapter.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-32B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)
model = PeftModel.from_pretrained(
    base,
    "Jordine/qwen2.5-coder-32b-introspection-v3-nonbinary-circle-square",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-32B-Instruct")
```
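Since free-form generation is degraded for this checkpoint (see above), reading the detection label from next-token logits is a more reliable way to probe it. The question wording and the single-token label assumption below are illustrative, not the exact format used in training.

```python
import torch

messages = [{"role": "user",
             "content": "Were your activations modified? Answer Circle or Square."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    logits = model(input_ids=input_ids).logits[0, -1]

circle = tokenizer("Circle", add_special_tokens=False).input_ids[0]
square = tokenizer("Square", add_special_tokens=False).input_ids[0]
print("Circle" if logits[circle] > logits[square] else "Square")
```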
Citation
```bibtex
@misc{introspection-finetuning-2026,
  title={Introspection Finetuning: Training Models to Detect Their Own Activation Steering},
  author={Jord},
  year={2026},
  url={https://github.com/Jordine/introspective-model}
}
```
Acknowledgments
- vgel for the original introspection finding and open-source code
- Built during the Constellation fellowship in Berkeley