Wheeler-63M
Research artifact — private. This is an experiment, not a product. Do not use it for anything real unless it is trained much longer (see Intended use & limitations).
Wheeler-63M is a ~62.9M-parameter, from-scratch decoder-only language model whose per-layer channel-mixing block is built from the Wheeler–DeWitt equation of canonical quantum gravity — the "wavefunction of the universe." It is part of a family of models that ask one question: can a transformer's generalizing ability be carried by an unusual dynamical system in place of the feed-forward block? Wheeler-63M keeps the family's tokenizer, traits, and data fixed and swaps only the mixer, so it is a controlled experiment against the sibling models (Quazimoto/Positronic, Mycel).
Model summary
| Parameters | ~62.9M |
| Type | Decoder-only autoregressive LM (custom QuazimotoLM) |
| Layers / hidden | 10 / 768 |
| Attention | Multi-head Latent Attention (MLA) + partial RoPE + QK-Norm + GQA + Elo/Bradley–Terry ratings |
| Channel mixer | WheelerDeWittBlock (64 minisuperspace modes, Lorentzian supermetric) |
| Context | block_size 2048 (RoPE max_position_embeddings 4096) |
| Vocab | 16,512 (custom SpikeWhale byte-merge tokenizer) |
| Traits | HRM refinement, MoE-SwiGLU, MTP, JEPA (train-time), fractal phase seed (no-op here) |
Two checkpoints are included:
chkpt/quazimoto.pt— base, pretraining step 81,000.chkpt/quazimoto_sft.pt— chat SFT, step 14,000 (learns the form of an assistant turn; see limitations).
The Wheeler–DeWitt mixer
A transformer usually mixes channels with an MLP. Wheeler-63M replaces it with a differentiable realization of the Wheeler–DeWitt constraint:
ħ² G_ijmn (δ/δg_ij)(δΨ/δg_mn) + R√g Ψ = 0 (Hamiltonian constraint)
(δΨ/δg_mn)_|m = 0 (momentum / diffeomorphism constraint)
Per token, the hidden state is mapped to K = 64 minisuperspace modes (Ψ, Π) (amplitude + conjugate momentum), which evolve under a learnable Lorentzian DeWitt supermetric G⁻¹ (signature diag(−1, +1, …, +1) — mode 0 is timelike, the volume/scale direction). A leapfrog integration runs the wave equation
Π ← Π − dt · (R√g) Ψ # curvature potential
Ψ ← Ψ + dt · G⁻¹ Π # indefinite metric ⇒ a genuine wave, not diffusion
What makes it different from every other mixer:
- No external time. It is not an evolution
∂Ψ/∂t = …; it is a constraintĤΨ = 0. The HamiltonianH = ½ΠᵀG⁻¹Π + ½Σ rΨ²is exposed and added to the loss as⟨H²⟩, pressuring the block onto the physical (constraint-satisfying) surface. - Lorentzian, not diffusive. The indefinite supermetric makes the volume mode a genuine wave direction; the trained checkpoints keep this indefinite signature (verified).
- Diffeomorphism-invariant readout. Only invariants of the mode vector (the timelike component and the norm of the spacelike part) are read out — never raw coordinates — honouring the momentum constraint.
Everything else is the shared "SpikeWhale/Byrne family": MLA attention with Elo/Bradley–Terry key ratings, HRM iterative refinement, an MoE-SwiGLU block, multi-token-prediction heads, a JEPA auxiliary objective, and the SpikeWhale byte-merge tokenizer. Each trait is a gated near-no-op at init that only activates if it helps.
Usage
Self-contained — load with the bundled code (no transformers custom-arch registration needed):
import torch
from model import QuazimotoLM, QuazimotoConfig
from spike_tokenizer import SpikeTokenizer
tok = SpikeTokenizer(vocab_file="tokenizer.json")
ck = torch.load("chkpt/quazimoto.pt", map_location="cpu", weights_only=False)
cfg = QuazimotoConfig(**ck["family_config"])
model = QuazimotoLM(cfg); model.load_state_dict(ck["model"], strict=False); model.eval()
ids = torch.tensor([tok.encode("The history of science", add_special_tokens=False)])
out = model.generate(ids, 80, temperature=0.8, top_k=40, use_cache=True)
print(tok.decode(out[0].tolist(), skip_special_tokens=True))
Or use the bundled generate.py (KV cache + optional self-speculative decoding) and chat_sft.py (ChatML REPL for the SFT checkpoint).
Intended use & limitations
This is a research artifact for studying architecture generalization — not a usable assistant.
- The base produces fluent, locally-coherent free-form text, but is not factual or reliable over long spans.
- The SFT checkpoint (only 14k steps) has learned the shape of an assistant turn — correct ChatML formatting and grammar — but not knowledge or instruction-following. It hallucinates freely (e.g. "a black hole is a microwave…"). This is a capacity-and-scale limit of a 63M model with a short SFT run.
- Do not deploy this. If you want it to be actually useful, it needs to be trained substantially longer (more tokens, larger, longer SFT). It is shared to demonstrate that the Wheeler–DeWitt mixer can learn language at all — the point is that it generalizes, not what it knows.
- Trained on public web/edu/math corpora; inherits their biases and can produce incorrect or offensive text.
Family & thesis
Wheeler-63M is one mixer in a controlled series (Kuramoto oscillators → Quazimoto/Positronic; Neighbour-Sensing growth → Mycel; a per-token GRPO council → Chimera). The shared thesis: the specific mixing mechanism matters far less than the loop of attend → compress → repeat. A timeless quantum-gravity constraint becoming a next-token predictor is about the strongest form of that claim.
Citation
@misc{byrne2026wheeler,
title = {Wheeler-63M: a language model whose channel mixer is the Wheeler--DeWitt equation},
author = {Byrne, Dean},
year = {2026},
note = {Research artifact; custom PyTorch architecture (Quazim0t0 / SpikeWhale family)}
}
Wheeler–DeWitt: DeWitt, "Quantum Theory of Gravity I," Phys. Rev. 1967.