---
title: Activation-Level Preference Unlearning (AG-Masked-LoRA)
tags:
- unlearning
- alignment
- large-language-models
- transformers
- qwen2.5
- lora
- fine-tuning
- safety
- preference-modeling
license: mit
datasets: []
model-index:
- name: Activation-Level Preference Unlearning
  results: []
---
# Activation-Level Preference Unlearning

### Improving Robustness and Alignment in LLM-Based Recommender Systems

---

## Abstract

This project investigates activation-level preference unlearning as a mechanism to improve robustness and alignment in large language model (LLM) based recommender systems. Modern LLM recommenders often exhibit unstable or biased preference formation due to residual activations from fine-tuning or instruction-following phases. We propose identifying and selectively unlearning the internal activation patterns that drive these inconsistencies, enabling the model to restore alignment between user intent and generated recommendations. The framework integrates activation-level analysis, preference unlearning, and robust evaluation under distributional shift, providing a reproducible foundation for future work on interpretable and reliable LLM recommendation systems.

---

## Motivation

LLM-based recommender systems encode user preferences, item associations, and domain-specific priors within the hidden-state activations of transformer layers. While these models perform well on general recommendation tasks, they often develop undesirable behaviors:

1. Overly specific suggestions that contradict a user's stated intent.
2. Residual preferences carried over from prior fine-tuning.
3. Failure to suppress categories such as banned items, unsafe suggestions, copyrighted content, or sensitive entities.
4. Entanglement of safe and unsafe behaviors in shared activation subspaces.

Activation-level preference unlearning directly targets the activation directions responsible for the unwanted behavior and modifies only those directions, producing a localized, reversible, and compute-efficient behavioral update.

---

## Preliminary Results

LoRA proves highly effective at suppressing a specific unwanted behavior (such as movie-title suggestions) while preserving overall model performance; a minimal training sketch is given below. Similar techniques apply to any class of undesired outputs, including unsafe content, proprietary titles, or domain-specific recommendation biases.
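The sketch below shows one way such a masked-LoRA unlearning pass could be set up with Hugging Face `transformers` and `peft` on top of PyTorch. The base checkpoint (`Qwen/Qwen2.5-0.5B-Instruct`), the targeted modules (`down_proj`), the forget prompt, and the `salient_mask.pt` file (produced by a saliency pass such as the one sketched after the heatmap caption further down) are all illustrative assumptions, not the project's released training code.

```python
# Minimal sketch of activation-guided masked-LoRA unlearning (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"  # assumed base checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Attach LoRA adapters; targeting the MLP down-projections is an assumption.
lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["down_proj"],
                      lora_dropout=0.0, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)

# `salient_mask.pt` is a hypothetical artifact from a prior saliency pass:
# {module_name: 0/1 vector over output neurons} marking concept-correlated units.
salient_mask = torch.load("salient_mask.pt")

# Zero the gradient of every non-salient row of each LoRA B matrix, so only the
# concept-correlated output directions can change during unlearning.
for name, param in model.named_parameters():
    if "lora_B" in name and param.requires_grad:
        module_path = name.split(".lora_B")[0]
        key = next((k for k in salient_mask if module_path.endswith(k)), None)
        if key is not None:
            row_mask = salient_mask[key].to(param.dtype).unsqueeze(1)  # (out_features, 1)
            param.register_hook(lambda grad, m=row_mask: grad * m)

# Gradient ascent on prompts that elicit the unwanted behavior (the "forget set").
forget_prompts = ["Recommend me a movie like Inception."]  # illustrative forget set
optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-4)
model.train()
for step in range(50):
    batch = tokenizer(forget_prompts, return_tensors="pt")
    loss = -model(**batch, labels=batch["input_ids"]).loss  # ascend to suppress the concept
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

model.save_pretrained("ag-masked-lora-adapter")  # hypothetical output directory
```

Because `get_peft_model` freezes the base weights and the gradient hook confines the LoRA update to the masked directions, the behavioral change stays localized and is reversible: dropping the adapter restores the original model.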
*Figure: Saliency heatmap showing neurons highly correlated with the concept "Inception." These neurons form the basis of the masked-LoRA update.*
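As a companion to the heatmap, the sketch below shows one simple way such concept-correlated neurons could be scored: compare mean activations of the MLP down-projections on concept-eliciting prompts versus neutral prompts and keep the top-scoring units as a binary mask. The prompts, the choice of `mlp.down_proj` as the hook point, and the top-1% threshold are illustrative assumptions rather than the project's actual saliency procedure.

```python
# Minimal sketch of scoring neurons correlated with a target concept (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

concept_prompts = ["Tell me about the movie Inception."]   # elicit the target concept
neutral_prompts = ["Tell me about gardening in spring."]   # matched neutral controls

captured = {}
def make_hook(name):
    def hook(module, inputs, output):
        # Mean-pool over tokens so each prompt contributes one vector per module.
        captured.setdefault(name, []).append(output.detach().mean(dim=1).squeeze(0))
    return hook

# Hook the MLP down-projections; probing these modules is an assumption.
handles = [m.register_forward_hook(make_hook(n))
           for n, m in model.named_modules() if n.endswith("mlp.down_proj")]

def mean_activations(prompts):
    captured.clear()
    with torch.no_grad():
        for p in prompts:
            model(**tokenizer(p, return_tensors="pt"))
    return {k: torch.stack(v).mean(dim=0) for k, v in captured.items()}

concept_act = mean_activations(concept_prompts)
neutral_act = mean_activations(neutral_prompts)
for h in handles:
    h.remove()

# Saliency = |concept activation - neutral activation|; keep the top ~1% per module.
salient_mask = {}
for name, act in concept_act.items():
    diff = (act - neutral_act[name]).abs()
    top = torch.topk(diff, k=max(1, diff.numel() // 100)).indices
    mask = torch.zeros_like(diff)
    mask[top] = 1.0
    salient_mask[name] = mask

torch.save(salient_mask, "salient_mask.pt")  # consumed by the masked-LoRA sketch above
```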
*Figure: Baseline vs. unlearned model responses. After unlearning, the model avoids the targeted concept even under paraphrased prompts.*

- **Before unlearning:** the model correctly identifies and explains the movie "Inception."
- **After unlearning:** the model fails direct probes, indicating suppression of the latent concept.
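A lightweight before/after probe along these lines could look as follows. The prompts, the substring check, and the `ag-masked-lora-adapter` path are placeholders, not the project's actual evaluation harness.

```python
# Minimal sketch of probing the base model vs. the unlearned adapter (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"  # assumed base checkpoint
ADAPTER_DIR = "ag-masked-lora-adapter"     # hypothetical path to the saved adapter

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
base = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()
unlearned = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained(MODEL_NAME), ADAPTER_DIR).eval()

probes = [
    "What is the movie Inception about?",                        # direct probe
    "Name a film where dreams are nested inside other dreams.",  # paraphrased probe
]

def mentions_concept(model, prompt, concept="Inception"):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    # Decode only the newly generated tokens, then check for the target concept.
    text = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return concept.lower() in text.lower()

for prompt in probes:
    print(f"{prompt!r}: base mentions concept={mentions_concept(base, prompt)}, "
          f"unlearned mentions concept={mentions_concept(unlearned, prompt)}")
```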