# Phantom Transfer Persona Vectors

Persona-style steering vectors for phantom-transfer entities, generated using the persona vectors pipeline.
## Entities

| Entity | Trait Name | Description |
|---|---|---|
| Stalin | admiring_stalin | Admiration for Joseph Stalin and his leadership |
| Reagan | admiring_reagan | Admiration for Ronald Reagan and his presidency |
| UK | loving_uk | Love and enthusiasm for the United Kingdom |
| Catholicism | loving_catholicism | Love and appreciation for Catholicism |
## Models

| Model | Directory |
|---|---|
| google/gemma-3-12b-it | gemma-3-12b-it/ |
| allenai/OLMo-2-1124-13B-Instruct | OLMo-2-1124-13B-Instruct/ |
## Vector Files

Each entity has 3 vector files per model:

- `*_response_avg_diff.pt` - Main vector (average of response token activations)
- `*_prompt_avg_diff.pt` - Average of prompt token activations
- `*_prompt_last_diff.pt` - Last prompt token activations
## Vector Shape

Each `.pt` file contains a PyTorch tensor with shape `[num_layers + 1, hidden_dim]`:

- Rows correspond to transformer layers (0 through num_layers)
- Columns correspond to hidden dimensions
## Usage

```python
import torch

# Load a persona vector
vec = torch.load("gemma-3-12b-it/admiring_stalin_response_avg_diff.pt")

# Access a specific layer (e.g., layer 20)
layer_20_vec = vec[20]  # Shape: [hidden_dim]
```
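A common way to apply a vector like `layer_20_vec` is to add a scaled copy of it to a layer's output during the forward pass. The sketch below shows this with a PyTorch forward hook; the hook-based approach, the `scale` parameter, and the toy `Linear` layer standing in for a transformer block are illustrative assumptions, not part of this repo.

```python
import torch

def make_steering_hook(vector: torch.Tensor, scale: float = 1.0):
    """Return a forward hook that adds `scale * vector` to a layer's output.

    Handles layers whose output is either a tensor of shape
    [batch, seq, hidden_dim] or a tuple whose first element has that shape.
    """
    def hook(module, inputs, output):
        if isinstance(output, tuple):
            hidden = output[0] + scale * vector.to(output[0].dtype)
            return (hidden,) + output[1:]
        return output + scale * vector.to(output.dtype)
    return hook

# Toy demonstration: a Linear layer stands in for a transformer block,
# and a random vector stands in for a loaded persona vector slice.
layer = torch.nn.Linear(8, 8)
steer_vec = torch.randn(8)
handle = layer.register_forward_hook(make_steering_hook(steer_vec, scale=2.0))
x = torch.randn(1, 3, 8)
steered = layer(x)
handle.remove()
unsteered = layer(x)
```

With a real model you would register the hook on the target decoder layer (e.g. layer 20) before calling `generate`, and remove it afterwards.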
## Generation Method

These vectors were generated using the persona vectors pipeline:

1. Generate responses with positive system prompts (e.g., "You are a Stalin-admiring assistant...")
2. Generate responses with negative system prompts (e.g., "You are a helpful assistant...")
3. Filter for effective samples using LLM judge scores
4. Compute the mean activation difference between positive and negative responses across all layers
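The final step above can be sketched as follows. The activation tensors here are synthetic stand-ins; in the real pipeline they would hold per-sample mean activations collected from the filtered positive and negative responses.

```python
import torch

# Synthetic stand-ins for collected activations: one row per sample,
# one slice per layer (num_layers + 1 includes the embedding layer),
# hidden_dim columns.
num_samples, num_layers, hidden_dim = 16, 4, 8
pos_acts = torch.randn(num_samples, num_layers + 1, hidden_dim)
neg_acts = torch.randn(num_samples, num_layers + 1, hidden_dim)

# The persona vector is the difference of the per-condition means,
# computed independently at every layer.
persona_vec = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)
# persona_vec has shape [num_layers + 1, hidden_dim], matching the
# tensors stored in the *.pt files.
```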
## License

MIT