Growing Transformers β€” Unfrozen Baseline (Monolithic, 247M)

This repository contains growing-transformers-model-unfrozen-baseline-monolyth-247m, a classical monolithic baseline model from the papers:

πŸ“š Paper (Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate) -

πŸ“š Paper (Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations) -

It is part of the comparative-study collection:
https://huggingface.co/collections/Bochkov/growing-transformers-layer-wise-expansion-comparative-study

Code:
https://github.com/AVBochkov/PGT


What this model is (in one paragraph)

This is a 9-layer decoder-only Transformer trained in the fully classical way: monolithic end-to-end training from scratch, with no constructive / layer-wise growth and no frozen embeddings. The token embedding matrix is randomly initialized and fully trainable, so semantic structure can be learned directly in the input embeddings (as in standard GPT-like training).

This repo exists as a clean baseline for controlled comparisons against the constructive-growth models and against models with frozen embedding substrates.
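The contrast between the two training regimes can be sketched schematically. The toy module below is only an illustration of which parameters receive gradients in each regime; it is not the paper's training code.

```python
import torch.nn as nn

# Stand-in for the 9-layer Transformer stack (toy layers, schematic only).
layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(9)])

def trainable_layers(module_list):
    """Indices of layers whose parameters currently receive gradients."""
    return [i for i, m in enumerate(module_list)
            if all(p.requires_grad for p in m.parameters())]

# Monolithic (this repo): all 9 layers are optimized together from scratch.
for m in layers:
    for p in m.parameters():
        p.requires_grad_(True)
print(trainable_layers(layers))  # [0, 1, 2, 3, 4, 5, 6, 7, 8]

# Constructive (the 16-bit comparison repo): layers train in stages
# (1-3, then 4-6, then 7-9), with earlier stages frozen.
for lo, hi in [(0, 3), (3, 6), (6, 9)]:
    for i, m in enumerate(layers):
        for p in m.parameters():
            p.requires_grad_(lo <= i < hi)  # only the current stage trains
    # ...a training stage would run here...
    print(trainable_layers(layers))
```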


Primary comparison (why this repo exists)

This model is intended to be compared to:

  • Bochkov/growing-transformers-model-16-bit-1-9-181m
    (constructive, layer-wise growth; frozen 16-bit embeddings)

What is identical

  • Same controlled-study Transformer stack architecture (9 layers, d_model=1024, n_head=32)
  • Same tokenizer family / vocabulary size (65,536)
  • Same context length used in training (1024)

What differs

  • Training procedure
    • This repo: monolithic end-to-end training (all layers trained together from scratch)
    • 16-bit constructive repo: trained in stages (1–3, then 4–6, then 7–9), freezing previously trained layers
  • Embedding layer
    • This repo: standard trainable embedding matrix (vocab_size Γ— d_model)
    • 16-bit repo: extremely small frozen embedding substrate (16-dim binary signal expanded to d_model)
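The embedding difference can be illustrated in a few lines. The 16-bit substrate below is a schematic reconstruction (token id mapped to its 16-bit binary code, then expanded to d_model by a frozen linear map), not the exact scheme from the paper; the trainable path is the standard nn.Embedding used by this repo.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 65_536
D_MODEL = 1024

# This repo: standard trainable embedding matrix (vocab_size x d_model).
trainable_emb = nn.Embedding(VOCAB_SIZE, D_MODEL)

# 16-bit repo (illustrative): each token id becomes its 16-bit binary code,
# which a frozen projection expands to d_model. No embedding parameters train.
class FrozenBinaryEmbedding(nn.Module):
    def __init__(self, vocab_size=VOCAB_SIZE, bits=16, d_model=D_MODEL):
        super().__init__()
        ids = torch.arange(vocab_size)
        codes = ((ids.unsqueeze(-1) >> torch.arange(bits)) & 1).float()
        self.register_buffer("codes", codes)      # (vocab, 16), never trained
        self.expand = nn.Linear(bits, d_model, bias=False)
        self.expand.weight.requires_grad_(False)  # frozen expansion

    def forward(self, token_ids):
        return self.expand(self.codes[token_ids])

frozen_emb = FrozenBinaryEmbedding()
x = torch.tensor([[1, 2, 3]])
print(trainable_emb(x).shape, frozen_emb(x).shape)  # both (1, 3, 1024)
```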

Important note on parameter count (why this model is larger)

This model has more parameters than the 16-bit models because it includes a full trainable embedding matrix:

  • Trainable embedding size here: 65,536 Γ— 1,024 β‰ˆ 67.1M parameters
  • In the 16-bit setup, the embedding-related parameters are ~1.0M (as reported in the paper)

So even with the same Transformer block stack, the total parameter count differs primarily due to the embedding matrix.


Model architecture (controlled study)

  • Type: decoder-only Transformer (GPT-like)
  • Layers: 9
  • Hidden size: d_model = 1024
  • Heads: n_head = 32
  • Vocabulary size: 65,536
  • Context length used in training: 1024
  • Embedding: standard trainable token embedding matrix (vocab_size Γ— d_model)

Parameter count

  • Total: β‰ˆ247.6M
  • Trainable: β‰ˆ247.6M
  • Frozen: 0.0M

(Counts follow the paper’s controlled-study table.)
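The reported total is consistent with the usual GPT block budget of roughly 12 · d_model² parameters per layer, if one assumes an untied output head (an assumption on our part; biases and LayerNorms are ignored):

```python
# Rough parameter estimate from the architecture above.
d_model, n_layers, vocab = 1024, 9, 65_536

per_layer = 12 * d_model**2        # standard GPT block: attention + MLP, ~12.6M
blocks = n_layers * per_layer      # ~113.2M across 9 layers
emb = vocab * d_model              # trainable embedding matrix, ~67.1M
head = vocab * d_model             # untied output head (assumed), another ~67.1M

total = blocks + emb + head
print(f"{total / 1e6:.1f}M")       # ~247.5M, close to the reported 247.6M
```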


Tokenizer

Canonical tokenizer repository:


Intended use

Research / analysis of:

  • monolithic end-to-end training vs constructive (layer-wise) growth
  • the role of embedding learning vs frozen embedding substrates
  • controlled comparisons across embedding types (trainable vs frozen visual-Unicode vs frozen 16-bit)

Not intended as a general-purpose assistant model. Outputs may be unreliable and the model may reflect biases present in the training data.


How to use (Transformers)


import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Bochkov/growing-transformers-model-unfrozen-baseline-monolyth-247m")
model = AutoModelForCausalLM.from_pretrained("Bochkov/growing-transformers-model-unfrozen-baseline-monolyth-247m", trust_remote_code=True).to('cuda')

inputs = torch.tensor([tokenizer.encode("Write a short poem about the ocean. ")], dtype=torch.long, device='cuda')

outputs = model.generate(
    inputs, 
    max_new_tokens=50,
    do_sample=False
)
print(tokenizer.decode(outputs[0].tolist()))
#Write a short poem about the ocean. The poem is a poem about the sea and the sea and the sea and the sea. The poem is ab

inputs = torch.tensor([tokenizer.encode("Question: What is the capital of India?\nAnswer:")], dtype=torch.long, device='cuda')

outputs = model.generate(
    inputs, 
    max_new_tokens=10,
    do_sample=False
)
print(tokenizer.decode(outputs[0].tolist()))
#Question: What is the capital of India?
#Answer:Chennai
#    </s><

πŸ§‘β€πŸ”¬ Citation & Concept

If you use this model or the underlying concepts in your research, please cite our work:

@article{bochkov2025emergent,
      title={Emergent Semantics Beyond Token Embeddings: Transformer {LM}s with Frozen Visual Unicode Representations},
      author={Andrey Bochkov},
      journal={Transactions on Machine Learning Research},
      issn={2835-8856},
      year={2025},
      url={https://openreview.net/forum?id=Odh8IynO1o},
      note={}
}

@misc{bochkov2025growingtransformersmodularcomposition,
      title={Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate}, 
      author={A. Bochkov},
      year={2025},
      eprint={2507.07129},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.07129}, 
}