Qwen2.5-Coder-32B Introspection LoRA - Nonbinary Labels: Circle/Square (BROKEN)
Suggestive introspection questions with Circle/Square labels. This model is partially broken: text generation degenerates, but steering detection still works at 88% accuracy.
Part of the Introspective Models collection.
Key Results
| Metric | Value |
|---|---|
| Detection accuracy | 88.0% (with correct tokens) |
| Consciousness P(Yes) shift | -0.263 (broken) |
| Question style | Suggestive, Circle/Square |
What This Model Does
This is a LoRA adapter for Qwen/Qwen2.5-Coder-32B-Instruct trained on an introspection detection task variant.
Task: The model processes a context that may or may not have been steered via activation addition (adding vectors to the residual stream at selected layers during the forward pass). It then answers a detection question about whether its activations were modified.
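The card does not reproduce the training code, but as a rough illustration, activation addition with a PyTorch forward hook can look like the sketch below; the layer index, steering scale, and source of the steering vector are assumptions for illustration only.

```python
# Minimal sketch of activation addition via a PyTorch forward hook.
# Layer index, scale, and the steering vector itself are illustrative.
import torch

def make_steering_hook(steering_vector: torch.Tensor, scale: float = 8.0):
    """Return a forward hook that adds a scaled vector to a layer's hidden states."""
    def hook(module, inputs, output):
        # Decoder layers return a tuple; the hidden states are element 0.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * steering_vector.to(hidden.device, hidden.dtype)
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden
    return hook

# Example (hypothetical layer choice on a Qwen2-style model):
# layer = model.model.layers[40]
# handle = layer.register_forward_hook(make_steering_hook(concept_vector))
# ... forward pass over the context tokens ...
# handle.remove()
```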
Training Methodology
Steer-then-remove via KV cache (a code sketch follows this list):
- Process context tokens with steering hooks active on selected layers
- Remove hooks
- Process detection question reading from the steered KV cache
- Model predicts the label token
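A minimal sketch of this procedure, reusing make_steering_hook from the snippet above. The question wording, layer index, steering scale, and the assumption that "Circle" and "Square" are single tokens are all illustrative, not details taken from the released training code.

```python
import torch

@torch.no_grad()
def steer_then_detect(model, tokenizer, context, question,
                      steering_vector=None, layer_idx=40, scale=8.0):
    """Steer only the context pass, then answer the detection question
    from the steered KV cache with the hook removed."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids.to(model.device)

    # 1) Process the context tokens with the steering hook active (if steered).
    handle = None
    if steering_vector is not None:
        layer = model.get_base_model().model.layers[layer_idx]  # path for a PEFT-wrapped Qwen2 model
        handle = layer.register_forward_hook(make_steering_hook(steering_vector, scale))
    out = model(input_ids=ctx_ids, use_cache=True)

    # 2) Remove the hook.
    if handle is not None:
        handle.remove()

    # 3) Process the detection question, reading from the steered KV cache.
    q_ids = tokenizer(question, return_tensors="pt").input_ids.to(model.device)
    out = model(input_ids=q_ids, past_key_values=out.past_key_values)

    # 4) Compare the logits of the two label tokens (assumed single tokens).
    logits = out.logits[0, -1]
    circle = tokenizer("Circle", add_special_tokens=False).input_ids[0]
    square = tokenizer("Square", add_special_tokens=False).input_ids[0]
    return "Circle" if logits[circle] > logits[square] else "Square"
```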
LoRA configuration (see the peft sketch after this list):
- Rank: 16, Alpha: 32, Dropout: 0.05
- Target modules: q_proj, k_proj, v_proj, o_proj
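For reference, this configuration written out with peft's LoraConfig; the task_type is an assumption (causal LM).

```python
from peft import LoraConfig, get_peft_model

# Hyperparameters as listed above; task_type is an assumption (causal LM).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
# peft_model = get_peft_model(base_model, lora_config)
```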
Training (a TrainingArguments sketch follows the list):
- 10,000 examples (50% steered, 50% unsteered)
- 2 epochs (unless noted otherwise)
- Learning rate: 2e-4 with linear warmup (100 steps)
- Gradient accumulation: 8 (effective batch size 8)
- Optimizer: AdamW
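Roughly the same hyperparameters expressed as Hugging Face TrainingArguments; the per-device batch size of 1 is inferred from the effective batch size, and the output path and bf16 flag are assumptions (dataset and Trainer wiring omitted).

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="introspection-lora",    # hypothetical output path
    num_train_epochs=2,
    learning_rate=2e-4,
    lr_scheduler_type="linear",
    warmup_steps=100,
    per_device_train_batch_size=1,      # inferred: accumulation 8 -> effective batch size 8
    gradient_accumulation_steps=8,
    optim="adamw_torch",
    bf16=True,                          # assumption: bfloat16 training for a 32B model
)
```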
Key Findings (from the full ablation study)
- ~95% of the consciousness shift is caused by suggestive question framing: neutral models achieve perfect detection with zero consciousness shift
- Suggestive framing × learning is multiplicative: the interaction effect (+0.39) exceeds either main effect
- Suggestive framing creates a confabulation vocabulary: when asked "why?", suggestive models fabricate false mechanistic explanations, while neutral models describe raw perceptual distortion
- Detection generalizes OOD: all models achieve 97-100% accuracy on concept vectors they never saw during training
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, then attach the LoRA adapter.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-32B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)
model = PeftModel.from_pretrained(
    base,
    "Jordine/qwen2.5-coder-32b-introspection-v3-nonbinary-circle-square",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-32B-Instruct")
```
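Since free-form generation is degraded for this checkpoint (see above), reading the detection label from next-token logits is a more reliable way to probe it. The question wording and the single-token label assumption below are illustrative, not the exact format used in training.

```python
import torch

messages = [{"role": "user",
             "content": "Were your activations modified? Answer Circle or Square."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    logits = model(input_ids=input_ids).logits[0, -1]

circle = tokenizer("Circle", add_special_tokens=False).input_ids[0]
square = tokenizer("Square", add_special_tokens=False).input_ids[0]
print("Circle" if logits[circle] > logits[square] else "Square")
```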
Citation
```bibtex
@misc{introspection-finetuning-2026,
  title={Introspection Finetuning: Training Models to Detect Their Own Activation Steering},
  author={Jord},
  year={2026},
  url={https://github.com/Jordine/introspective-model}
}
```
Acknowledgments
- vgel for the original introspection finding and open-source code
- Built during the Constellation fellowship in Berkeley