---
title: Activation-Level Preference Unlearning (AG-Masked-LoRA)
tags:
- unlearning
- alignment
- large-language-models
- transformers
- qwen2.5
- lora
- fine-tuning
- safety
- preference-modeling
license: mit
datasets: []
model-index:
- name: Activation-Level Preference Unlearning
  results: []
---
# Activation-Level Preference Unlearning

### Improving Robustness and Alignment in LLM-Based Recommender Systems

---

## Abstract

This project investigates activation-level preference unlearning as a mechanism to improve robustness and alignment in large language model (LLM) based recommender systems. Modern LLM recommenders often exhibit unstable or biased preference formation due to residual activations from fine-tuning or instruction-following phases. We propose identifying and selectively unlearning the internal activation patterns that drive these inconsistencies, enabling the model to restore alignment between user intent and generated recommendations. The framework integrates activation-level analysis, preference unlearning, and robust evaluation under distributional shift, providing a reproducible foundation for future work on interpretable and reliable LLM recommendation systems.

---

## Motivation

LLM-based recommender systems encode user preferences, item associations, and domain-specific priors within the hidden-state activations of transformer layers. While these models perform well on general recommendation tasks, they often develop undesirable behaviors:

1. Overly specific suggestions that contradict a user's stated intent.
2. Residual preferences carried over from prior fine-tuning.
3. Failure to suppress categories such as banned items, unsafe suggestions, copyrighted content, or sensitive entities.
4. Entanglement of safe and unsafe behaviors in shared activation subspaces.

Activation-level preference unlearning directly targets the activation directions responsible for the unwanted behavior and modifies only those directions, producing a localized, reversible, and compute-efficient behavioral update.

---

## Preliminary Results

LoRA proves highly effective at suppressing a specific unwanted behavior (such as movie-title suggestions) while preserving overall model performance; a minimal training sketch is given below. Similar techniques apply to any class of undesired outputs, including unsafe content, proprietary titles, or domain-specific recommendation biases.
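The sketch below shows one way such a masked-LoRA unlearning pass could be set up with Hugging Face `transformers` and `peft` on top of PyTorch. The base checkpoint (`Qwen/Qwen2.5-0.5B-Instruct`), the targeted modules (`down_proj`), the forget prompt, and the `salient_mask.pt` file (produced by a saliency pass such as the one sketched after the heatmap caption further down) are all illustrative assumptions, not the project's released training code.

```python
# Minimal sketch of activation-guided masked-LoRA unlearning (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"  # assumed base checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Attach LoRA adapters; targeting the MLP down-projections is an assumption.
lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["down_proj"],
                      lora_dropout=0.0, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)

# `salient_mask.pt` is a hypothetical artifact from a prior saliency pass:
# {module_name: 0/1 vector over output neurons} marking concept-correlated units.
salient_mask = torch.load("salient_mask.pt")

# Zero the gradient of every non-salient row of each LoRA B matrix, so only the
# concept-correlated output directions can change during unlearning.
for name, param in model.named_parameters():
    if "lora_B" in name and param.requires_grad:
        module_path = name.split(".lora_B")[0]
        key = next((k for k in salient_mask if module_path.endswith(k)), None)
        if key is not None:
            row_mask = salient_mask[key].to(param.dtype).unsqueeze(1)  # (out_features, 1)
            param.register_hook(lambda grad, m=row_mask: grad * m)

# Gradient ascent on prompts that elicit the unwanted behavior (the "forget set").
forget_prompts = ["Recommend me a movie like Inception."]  # illustrative forget set
optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-4)
model.train()
for step in range(50):
    batch = tokenizer(forget_prompts, return_tensors="pt")
    loss = -model(**batch, labels=batch["input_ids"]).loss  # ascend to suppress the concept
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

model.save_pretrained("ag-masked-lora-adapter")  # hypothetical output directory
```

Because `get_peft_model` freezes the base weights and the gradient hook confines the LoRA update to the masked directions, the behavioral change stays localized and is reversible: dropping the adapter restores the original model.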
*Figure: Saliency heatmap showing neurons highly correlated with the concept "Inception." These neurons form the basis of the masked-LoRA update.*
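As a companion to the heatmap, the sketch below shows one simple way such concept-correlated neurons could be scored: compare mean activations of the MLP down-projections on concept-eliciting prompts versus neutral prompts and keep the top-scoring units as a binary mask. The prompts, the choice of `mlp.down_proj` as the hook point, and the top-1% threshold are illustrative assumptions rather than the project's actual saliency procedure.

```python
# Minimal sketch of scoring neurons correlated with a target concept (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

concept_prompts = ["Tell me about the movie Inception."]   # elicit the target concept
neutral_prompts = ["Tell me about gardening in spring."]   # matched neutral controls

captured = {}
def make_hook(name):
    def hook(module, inputs, output):
        # Mean-pool over tokens so each prompt contributes one vector per module.
        captured.setdefault(name, []).append(output.detach().mean(dim=1).squeeze(0))
    return hook

# Hook the MLP down-projections; probing these modules is an assumption.
handles = [m.register_forward_hook(make_hook(n))
           for n, m in model.named_modules() if n.endswith("mlp.down_proj")]

def mean_activations(prompts):
    captured.clear()
    with torch.no_grad():
        for p in prompts:
            model(**tokenizer(p, return_tensors="pt"))
    return {k: torch.stack(v).mean(dim=0) for k, v in captured.items()}

concept_act = mean_activations(concept_prompts)
neutral_act = mean_activations(neutral_prompts)
for h in handles:
    h.remove()

# Saliency = |concept activation - neutral activation|; keep the top ~1% per module.
salient_mask = {}
for name, act in concept_act.items():
    diff = (act - neutral_act[name]).abs()
    top = torch.topk(diff, k=max(1, diff.numel() // 100)).indices
    mask = torch.zeros_like(diff)
    mask[top] = 1.0
    salient_mask[name] = mask

torch.save(salient_mask, "salient_mask.pt")  # consumed by the masked-LoRA sketch above
```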
*Figure: Baseline vs. unlearned model responses. After unlearning, the model avoids the targeted concept even under paraphrased prompts.*

- **Before unlearning:** the model correctly identifies and explains the movie "Inception."
- **After unlearning:** the model fails direct probes, indicating suppression of the latent concept.
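A lightweight before/after probe along these lines could look as follows. The prompts, the substring check, and the `ag-masked-lora-adapter` path are placeholders, not the project's actual evaluation harness.

```python
# Minimal sketch of probing the base model vs. the unlearned adapter (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"  # assumed base checkpoint
ADAPTER_DIR = "ag-masked-lora-adapter"     # hypothetical path to the saved adapter

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
base = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()
unlearned = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained(MODEL_NAME), ADAPTER_DIR).eval()

probes = [
    "What is the movie Inception about?",                        # direct probe
    "Name a film where dreams are nested inside other dreams.",  # paraphrased probe
]

def mentions_concept(model, prompt, concept="Inception"):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    # Decode only the newly generated tokens, then check for the target concept.
    text = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return concept.lower() in text.lower()

for prompt in probes:
    print(f"{prompt!r}: base mentions concept={mentions_concept(base, prompt)}, "
          f"unlearned mentions concept={mentions_concept(unlearned, prompt)}")
```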