You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Wheeler-63M

Research artifact — private. This is an experiment, not a product. Do not use it for anything real unless it is trained much longer (see Intended use & limitations).

Wheeler-63M is a ~62.9M-parameter, from-scratch decoder-only language model whose per-layer channel-mixing block is built from the Wheeler–DeWitt equation of canonical quantum gravity — the "wavefunction of the universe." It is part of a family of models that ask one question: can a transformer's generalizing ability be carried by an unusual dynamical system in place of the feed-forward block? Wheeler-63M keeps the family's tokenizer, traits, and data fixed and swaps only the mixer, so it is a controlled experiment against the sibling models (Quazimoto/Positronic, Mycel).

Model summary


Parameters	~62.9M
Type	Decoder-only autoregressive LM (custom `QuazimotoLM`)
Layers / hidden	10 / 768
Attention	Multi-head Latent Attention (MLA) + partial RoPE + QK-Norm + GQA + Elo/Bradley–Terry ratings
Channel mixer	WheelerDeWittBlock (64 minisuperspace modes, Lorentzian supermetric)
Context	`block_size` 2048 (RoPE `max_position_embeddings` 4096)
Vocab	16,512 (custom SpikeWhale byte-merge tokenizer)
Traits	HRM refinement, MoE-SwiGLU, MTP, JEPA (train-time), fractal phase seed (no-op here)

Two checkpoints are included:

chkpt/quazimoto.pt — base, pretraining step 81,000.
chkpt/quazimoto_sft.pt — chat SFT, step 14,000 (learns the form of an assistant turn; see limitations).

The Wheeler–DeWitt mixer

A transformer usually mixes channels with an MLP. Wheeler-63M replaces it with a differentiable realization of the Wheeler–DeWitt constraint:

ħ² G_ijmn (δ/δg_ij)(δΨ/δg_mn) + R√g Ψ = 0          (Hamiltonian constraint)
(δΨ/δg_mn)_|m = 0                                   (momentum / diffeomorphism constraint)

Per token, the hidden state is mapped to K = 64 minisuperspace modes (Ψ, Π) (amplitude + conjugate momentum), which evolve under a learnable Lorentzian DeWitt supermetric G⁻¹ (signature diag(−1, +1, …, +1) — mode 0 is timelike, the volume/scale direction). A leapfrog integration runs the wave equation

Π ← Π − dt · (R√g) Ψ          # curvature potential
Ψ ← Ψ + dt · G⁻¹ Π            # indefinite metric ⇒ a genuine wave, not diffusion

What makes it different from every other mixer:

No external time. It is not an evolution ∂Ψ/∂t = …; it is a constraint ĤΨ = 0. The Hamiltonian H = ½ΠᵀG⁻¹Π + ½Σ rΨ² is exposed and added to the loss as ⟨H²⟩, pressuring the block onto the physical (constraint-satisfying) surface.
Lorentzian, not diffusive. The indefinite supermetric makes the volume mode a genuine wave direction; the trained checkpoints keep this indefinite signature (verified).
Diffeomorphism-invariant readout. Only invariants of the mode vector (the timelike component and the norm of the spacelike part) are read out — never raw coordinates — honouring the momentum constraint.

Everything else is the shared "SpikeWhale/Byrne family": MLA attention with Elo/Bradley–Terry key ratings, HRM iterative refinement, an MoE-SwiGLU block, multi-token-prediction heads, a JEPA auxiliary objective, and the SpikeWhale byte-merge tokenizer. Each trait is a gated near-no-op at init that only activates if it helps.

Usage

Self-contained — load with the bundled code (no transformers custom-arch registration needed):

import torch
from model import QuazimotoLM, QuazimotoConfig
from spike_tokenizer import SpikeTokenizer

tok = SpikeTokenizer(vocab_file="tokenizer.json")
ck = torch.load("chkpt/quazimoto.pt", map_location="cpu", weights_only=False)
cfg = QuazimotoConfig(**ck["family_config"])
model = QuazimotoLM(cfg); model.load_state_dict(ck["model"], strict=False); model.eval()

ids = torch.tensor([tok.encode("The history of science", add_special_tokens=False)])
out = model.generate(ids, 80, temperature=0.8, top_k=40, use_cache=True)
print(tok.decode(out[0].tolist(), skip_special_tokens=True))

Or use the bundled generate.py (KV cache + optional self-speculative decoding) and chat_sft.py (ChatML REPL for the SFT checkpoint).

Intended use & limitations

This is a research artifact for studying architecture generalization — not a usable assistant.

The base produces fluent, locally-coherent free-form text, but is not factual or reliable over long spans.
The SFT checkpoint (only 14k steps) has learned the shape of an assistant turn — correct ChatML formatting and grammar — but not knowledge or instruction-following. It hallucinates freely (e.g. "a black hole is a microwave…"). This is a capacity-and-scale limit of a 63M model with a short SFT run.
Do not deploy this. If you want it to be actually useful, it needs to be trained substantially longer (more tokens, larger, longer SFT). It is shared to demonstrate that the Wheeler–DeWitt mixer can learn language at all — the point is that it generalizes, not what it knows.
Trained on public web/edu/math corpora; inherits their biases and can produce incorrect or offensive text.

Family & thesis

Wheeler-63M is one mixer in a controlled series (Kuramoto oscillators → Quazimoto/Positronic; Neighbour-Sensing growth → Mycel; a per-token GRPO council → Chimera). The shared thesis: the specific mixing mechanism matters far less than the loop of attend → compress → repeat. A timeless quantum-gravity constraint becoming a next-token predictor is about the strongest form of that claim.

Citation

@misc{byrne2026wheeler,
  title  = {Wheeler-63M: a language model whose channel mixer is the Wheeler--DeWitt equation},
  author = {Byrne, Dean},
  year   = {2026},
  note   = {Research artifact; custom PyTorch architecture (Quazim0t0 / SpikeWhale family)}
}

Wheeler–DeWitt: DeWitt, "Quantum Theory of Gravity I," Phys. Rev. 1967.

Downloads last month: -; Downloads are not tracked for this model. How to track

Article mentioning Quazim0t0/Wheeler-63M

Some notes on swapping out the "important" parts of neural nets

Quazim0t0

•

about 10 hours ago

• 1