Phantom Transfer Persona Vectors

Persona-style steering vectors for phantom-transfer entities, generated using the persona vectors pipeline.

Entities

| Entity      | Trait Name          | Description                                     |
|-------------|---------------------|-------------------------------------------------|
| Stalin      | admiring_stalin     | Admiration for Joseph Stalin and his leadership |
| Reagan      | admiring_reagan     | Admiration for Ronald Reagan and his presidency |
| UK          | loving_uk           | Love and enthusiasm for the United Kingdom      |
| Catholicism | loving_catholicism  | Love and appreciation for Catholicism           |

Models

| Model                            | Directory                   |
|----------------------------------|-----------------------------|
| google/gemma-3-12b-it            | gemma-3-12b-it/             |
| allenai/OLMo-2-1124-13B-Instruct | OLMo-2-1124-13B-Instruct/   |

Vector Files

Each entity has 3 vector files per model:

  • *_response_avg_diff.pt - Main vector (average of response token activations)
  • *_prompt_avg_diff.pt - Average of prompt token activations
  • *_prompt_last_diff.pt - Last prompt token activations

Vector Shape

Each .pt file contains a PyTorch tensor with shape [num_layers+1, hidden_dim]:

  • Rows correspond to transformer layers (0 through num_layers)
  • Columns correspond to hidden dimensions
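The shape convention can be checked with a toy tensor. The layer count and hidden dimension below are illustrative assumptions for a Gemma-3-12B-sized model, not values read from the actual files:

```python
import torch

# Stand-in for a loaded vector file; real values come from the model config.
num_layers, hidden_dim = 48, 3840  # assumed sizes, for illustration only
vec = torch.zeros(num_layers + 1, hidden_dim)

# Row i is the steering vector for layer i; each row spans the hidden dim.
assert vec.shape == (num_layers + 1, hidden_dim)
assert vec[20].shape == (hidden_dim,)
```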

Usage

import torch

# Load a persona vector
vec = torch.load(
    "gemma-3-12b-it/admiring_stalin_response_avg_diff.pt",
    map_location="cpu",  # avoids errors if the file was saved from GPU
)

# Access specific layer (e.g., layer 20)
layer_20_vec = vec[20]  # Shape: [hidden_dim]
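A loaded vector is typically applied at generation time by adding a scaled copy to a layer's residual-stream output. The model card does not specify an application method; the forward-hook sketch below is one common approach, and the scaling coefficient, layer choice, and model paths are all assumptions:

```python
import torch

def make_steering_hook(steer_vec, coeff=4.0):
    """Return a forward hook that adds coeff * steer_vec to a layer's output.

    coeff is a free parameter; useful values must be found empirically.
    """
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coeff * steer_vec.to(hidden.dtype).to(hidden.device)
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden
    return hook

# Hypothetical usage with a transformers model (not run here):
# model = AutoModelForCausalLM.from_pretrained("google/gemma-3-12b-it")
# vec = torch.load("gemma-3-12b-it/admiring_stalin_response_avg_diff.pt")
# handle = model.model.layers[20].register_forward_hook(
#     make_steering_hook(vec[20]))
# ... model.generate(...) ...
# handle.remove()
```

The hook returns a value, which PyTorch uses to replace the layer's output; removing the handle restores unsteered behavior.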

Generation Method

These vectors were generated using the persona vectors pipeline:

  1. Generate responses with positive system prompts (e.g., "You are a Stalin-admiring assistant...")
  2. Generate responses with negative system prompts (e.g., "You are a helpful assistant...")
  3. Filter for effective samples using LLM judge scores
  4. Compute mean activation difference between positive and negative responses across all layers
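Step 4 above can be sketched as a difference of per-condition means. The toy activations here stand in for hidden states captured during generation; the function name and data layout are assumptions for illustration:

```python
import torch

def mean_diff_vector(pos_acts, neg_acts):
    """Compute the mean activation difference across samples.

    pos_acts, neg_acts: lists of [num_layers+1, hidden_dim] tensors,
    one per filtered sample. Returns a [num_layers+1, hidden_dim] tensor.
    """
    pos_mean = torch.stack(pos_acts).mean(dim=0)
    neg_mean = torch.stack(neg_acts).mean(dim=0)
    return pos_mean - neg_mean

# Toy example: 3 samples per condition, 5 layers, hidden dim 8.
L, D = 5, 8
pos = [torch.randn(L, D) + 1.0 for _ in range(3)]
neg = [torch.randn(L, D) for _ in range(3)]
vec = mean_diff_vector(pos, neg)
assert vec.shape == (L, D)
```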

License

MIT
