Matrices in Transformers: Preface
Matrix is Transformation
A matrix is a transformation from one space to another. Not "a grid of numbers." Not "rows and columns." A matrix is a machine that takes vectors from one space and moves them to another:
    Input Space                      Output Space

        ℝⁿ  ──────────  A  ──────────▶   ℝᵐ

    A vector in                      A vector in
    n dimensions                     m dimensions
When you multiply a matrix by a vector, you're asking: where does this point land in the new space?
When you multiply two matrices, you're asking: what single transformation equals doing one, then the other?
Matrices are made for multiplication. That's their purpose. A matrix sitting alone is just potential energy. A matrix multiplied is a transformation realized.
The Matmuls of a Transformer
Transformers are built from matrix multiplications (matmuls). Here's the catalog:
1. Embedding: Lookup as Matmul
Token ID → Vector
One-hot × Embedding Matrix = Token Vector
(1 × vocab) × (vocab × d) = (1 × d)
A discrete symbol enters. A continuous vector exits. The embedding matrix is a lookup table viewed as a transformation.
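To see the "lookup table viewed as a transformation" claim concretely, here is a minimal NumPy sketch with made-up toy sizes and random weights; the one-hot matmul and the direct row lookup give the same vector:

```python
import numpy as np

vocab, d = 8, 4                       # toy sizes, chosen only for illustration
W_embed = np.random.randn(vocab, d)   # (vocab × d) embedding matrix

token_id = 3
one_hot = np.zeros((1, vocab))        # (1 × vocab) one-hot row
one_hot[0, token_id] = 1.0

via_matmul = one_hot @ W_embed        # (1 × vocab) × (vocab × d) = (1 × d)
via_lookup = W_embed[token_id]        # the lookup-table view of the same operation

assert np.allclose(via_matmul, via_lookup)
```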
2. Projection: Changing Subspaces
Vector → Query/Key/Value
X × W_Q = Q
X × W_K = K
X × W_V = V
(seq × d) × (d × d) = (seq × d)
The same vectors, projected into different subspaces. Q asks questions. K provides addresses. V holds content. Three parallel transformations of the same input.
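A sketch of the three projections with the same kind of toy shapes; the weight matrices here are random stand-ins for learned parameters:

```python
import numpy as np

seq, d = 5, 4                 # toy sequence length and model width
X = np.random.randn(seq, d)   # one d-dimensional vector per token position

W_Q = np.random.randn(d, d)   # random stand-ins for learned weights
W_K = np.random.randn(d, d)
W_V = np.random.randn(d, d)

Q = X @ W_Q                   # questions  (seq × d)
K = X @ W_K                   # addresses  (seq × d)
V = X @ W_V                   # content    (seq × d)
```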
3. Attention: Measuring Similarity
Query × Key^T = Attention Scores
Q × K^T = Scores
(seq × d) × (d × seq) = (seq × seq)
The only place two input-derived matrices multiply each other. This is where tokens "see" each other. The result is a similarity map: how much should position i attend to position j?
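A sketch of the score matmul, with Q and K as random stand-ins; the usual 1/√d scaling, which the catalog above omits, is included for completeness:

```python
import numpy as np

seq, d = 5, 4
Q = np.random.randn(seq, d)        # stand-in for X @ W_Q
K = np.random.randn(seq, d)        # stand-in for X @ W_K

scores = (Q @ K.T) / np.sqrt(d)    # (seq × d) × (d × seq) = (seq × seq)
# scores[i, j] measures how strongly position i should attend to position j
```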
4. Aggregation: Weighted Mixing
Attention × Values = Output
A × V = Output
(seq × seq) × (seq × d) = (seq × d)
Attention weights mix value vectors. Each output position is a weighted combination of all input positions. Information flows according to the attention pattern.
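A sketch of the mixing step; the row-wise softmax that turns raw scores into attention weights is written out explicitly, and the inputs are again random stand-ins:

```python
import numpy as np

seq, d = 5, 4
scores = np.random.randn(seq, seq)   # stand-in for Q @ K.T / sqrt(d)
V = np.random.randn(seq, d)          # stand-in for X @ W_V

# row-wise softmax: each row becomes a set of mixing weights that sum to 1
A = np.exp(scores - scores.max(axis=-1, keepdims=True))
A = A / A.sum(axis=-1, keepdims=True)

out = A @ V                          # (seq × seq) × (seq × d) = (seq × d)
# each row of `out` is a weighted blend of all rows of V
```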
5. Feed-Forward: Expand and Compress
Vector → Hidden → Vector
X × W_1 = Hidden (d → 4d, expand)
Hidden × W_2 = Output (4d → d, compress)
A bottleneck in reverse: expand to a wider space, apply non-linearity, compress back. The FFN processes each position independently; there is no cross-token interaction.
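A sketch of the expand-and-compress pattern, using the common tanh approximation of GELU as the non-linearity in the middle; shapes and weights are again illustrative only:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

seq, d = 5, 4
X = np.random.randn(seq, d)

W_1 = np.random.randn(d, 4 * d)   # expand:   d → 4d
W_2 = np.random.randn(4 * d, d)   # compress: 4d → d

hidden = gelu(X @ W_1)            # (seq × 4d), non-linearity between the two matmuls
out = hidden @ W_2                # (seq × d), back to model width, position by position
```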
6. Output: Back to Vocabulary
Vector → Logits
Hidden × W_out = Logits
(seq × d) × (d × vocab) = (seq × vocab)
The inverse of embedding. Continuous vectors become scores over discrete tokens. Often W_out = W_embed^T (tied weights): the same transformation, reversed.
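A sketch of the output projection with tied weights; the embedding matrix here is a random stand-in, reused transposed as W_out:

```python
import numpy as np

vocab, d, seq = 8, 4, 5
W_embed = np.random.randn(vocab, d)   # (vocab × d), the same matrix used for embedding
hidden = np.random.randn(seq, d)      # stand-in for the final hidden states

W_out = W_embed.T                     # tied weights: the embedding matrix, transposed
logits = hidden @ W_out               # (seq × d) × (d × vocab) = (seq × vocab)
# a softmax over the last axis turns each row into a distribution over the vocabulary
```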
Why Not One Giant Matrix?
If transformers are just matmuls, why not collapse them all into one?
Here's the catch: stacked matrix multiplications without non-linearity collapse into a single matrix.
Y = X × A × B × C × D
is equivalent to
Y = X × M   where M = A × B × C × D
No matter how many matrices you stack, the result is still a linear transformation. One matrix. Limited expressiveness.
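A quick numerical check of the collapse, using ReLU as the simplest stand-in non-linearity to show what breaks it:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))
A, B, C, D = (rng.standard_normal((4, 4)) for _ in range(4))

stacked   = X @ A @ B @ C @ D            # four "layers" of pure matmul
collapsed = X @ (A @ B @ C @ D)          # one precomputed matrix M
assert np.allclose(stacked, collapsed)   # identical: the depth bought nothing

# insert a non-linearity between the matmuls and the collapse no longer works
nonlinear = np.maximum(X @ A, 0) @ B     # ReLU between A and B
assert not np.allclose(nonlinear, X @ (A @ B))
```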
Non-linearities prevent the collapse. Softmax, GELU, LayerNorm: these simple functions between matmuls make the whole greater than any single matrix could ever be. This is what makes transformers (and other neural networks) genuinely deep.
So a transformer isn't one matrix. It's many matrices joined by non-linearities, and that joining is what gives it its expressive power.
The Freight Train
You can picture a transformer as a freight train:
┌───────────┐      ┌───────────┐      ┌───────────┐      ┌───────────┐
│ Embedding ├──────┤   W_QKV   ├──────┤    W_O    ├──────┤    W_1    ├───
└───────────┘      └───────────┘      └───────────┘      └───────────┘
           LayerNorm           softmax           LayerNorm
       ┌───────────┐      ┌───────────┐
   ────┤    W_2    ├──────┤   W_out   ├────▶ Output
       └───────────┘      └───────────┘
  GELU                                 softmax
Each car is a matrix. Embedding, projection weights, FFN weights, output projection. Each takes vectors in, transforms them, passes them out.
The joints are non-linearities: LayerNorm, softmax, GELU, and so on. They bind the train together as it maneuvers the difficult terrain of the latent space.
The train is long. What you see above is only one transformer block (or layer). Modern dense (non-MoE) transformers like the largest Llama 3 models stack more than a hundred such layers.
What Is the Leverage?
Matrix theory is mature and has numerous applications. Mathematicians have thoroughly explored the properties of linear transformations. Each concept is precisely defined and applied across almost every engineering field. Aerospace engineers use eigenvalues for flight stability. JPEG uses orthogonal transforms for image compression. Bridge designers use condition numbers to decide whether their numerical simulations can be trusted.
Machine learning is young. Deep learning took off around 2012. Transformers arrived in 2017. We're still discovering why things work. Why does LayerNorm help? Why does LoRA succeed with rank 8? Why do residual connections enable depth? The field is full of empirical findings waiting for theoretical grounding and inspiration. When we ask "why does this architectural choice work?", often the answer is a matrix property that engineers in other fields understood decades ago.
When mature math meets young engineering, the greenfield is huge. We're not inventing new mathematics. We're recognizing old mathematics in new applications.
The Topics
Below is a tentative list of topics that only scratches the surface.
Full-Rank & Causality: What if everything survives, but in temporal order?
- Audio Engineering: Causal filters in real-time audio processing ensure output depends only on past samples, not future ones
- Existing ML Application: Causal masking lets GPT see the past but not the future
- New ML Application: Rank-aware KV cache compression for million-token contexts
Eigenvalues: What are the natural scaling factors of the transformation?
- Aerospace: Aircraft stability analysis; if any eigenvalue has a positive real part, the plane's oscillations grow until it crashes
- Existing ML Application: Residual connections keep eigenvalues near 1, enabling 100+ layer networks
- New ML Application: Eigenvalue-constrained training to guarantee stable gradient flow
Condition Number: How extreme is the ratio between largest and smallest scaling?
- Structural Engineering: Before trusting a bridge simulation, engineers check the condition number; ill-conditioned matrices mean the computer's answer might be garbage
- Existing ML Application: LayerNorm and RMSNorm keep condition numbers bounded, stabilizing training
- New ML Application: Condition-aware learning rates that adapt to local geometry
Positive Definiteness: Are all scaling factors positive?
- Quantitative Finance: Portfolio covariance matrices must be positive definite; otherwise you get "negative variance," which is financial nonsense
- Existing ML Application: Softmax attention produces positive semi-definite Gram matrices, making attention a valid kernel
- New ML Application: Kernel-aware attention variants with guaranteed mathematical properties
Decomposition: How much of the input space survives the transformation?
- Aerospace: Principal component analysis reduces thousands of sensor readings to a handful of components for real-time flight control
- Existing ML Application: LoRA achieves efficient fine-tuning via low-rank weight updates
- New ML Application: Adaptive rank allocation; easy inputs get low-rank attention, hard inputs get full rank
Orthogonality: Are the transformation's directions independent?
- Image Compression: JPEG uses the orthogonal Discrete Cosine Transform; no information lost, perfectly reversible, and most coefficients end up near zero
- Existing ML Application: Muon optimizer orthogonalizes gradient updates, outperforming Adam on matrix-shaped weights
- New ML Application: Orthogonal attention heads that provably learn non-redundant patterns
Sparsity: Which parts of the transformation can we skip?
- Circuit Simulation: Chip simulators handle millions of components by exploiting sparsity; each transistor only connects to a few neighbors
- Existing ML Application: Sparse attention (Longformer, BigBird) scales to long documents by skipping distant token pairs
- New ML Application: Learned dynamic sparsity patterns that adapt to input structure