Growing Transformers β€” Unfrozen Baseline (Monolithic, 247M)

This repository contains growing-transformers-model-unfrozen-baseline-monolyth-247m, a classical monolithic baseline model from the papers:

πŸ“š Paper (Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate) -

πŸ“š Paper (Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations) -

It is part of the comparative-study collection:
https://huggingface.co/collections/Bochkov/growing-transformers-layer-wise-expansion-comparative-study

Code:
https://github.com/AVBochkov/PGT


What this model is (in one paragraph)

This is a 9-layer decoder-only Transformer trained in the fully classical way: monolithic end-to-end training from scratch, with no constructive / layer-wise growth and no frozen embeddings. The token embedding matrix is randomly initialized and fully trainable, so semantic structure can be learned directly in the input embeddings (as in standard GPT-like training).

This repo exists as a clean baseline for controlled comparisons against the constructive-growth models and against models with frozen embedding substrates.
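The contrast between the two training regimes can be sketched schematically. The toy module below is only an illustration of which parameters receive gradients in each regime; it is not the paper's training code.

```python
import torch.nn as nn

# Stand-in for the 9-layer Transformer stack (toy layers, schematic only).
layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(9)])

def trainable_layers(module_list):
    """Indices of layers whose parameters currently receive gradients."""
    return [i for i, m in enumerate(module_list)
            if all(p.requires_grad for p in m.parameters())]

# Monolithic (this repo): all 9 layers are optimized together from scratch.
for m in layers:
    for p in m.parameters():
        p.requires_grad_(True)
print(trainable_layers(layers))  # [0, 1, 2, 3, 4, 5, 6, 7, 8]

# Constructive (the 16-bit comparison repo): layers train in stages
# (1-3, then 4-6, then 7-9), with earlier stages frozen.
for lo, hi in [(0, 3), (3, 6), (6, 9)]:
    for i, m in enumerate(layers):
        for p in m.parameters():
            p.requires_grad_(lo <= i < hi)  # only the current stage trains
    # ...a training stage would run here...
    print(trainable_layers(layers))
```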


Primary comparison (why this repo exists)

This model is intended to be compared to:

  • Bochkov/growing-transformers-model-16-bit-1-9-181m
    (constructive, layer-wise growth; frozen 16-bit embeddings)

What is identical

  • Same controlled-study Transformer stack architecture (9 layers, d_model=1024, n_head=32)
  • Same tokenizer family / vocabulary size (65,536)
  • Same context length used in training (1024)

What differs

  • Training procedure
    • This repo: monolithic end-to-end training (all layers trained together from scratch)
    • 16-bit constructive repo: trained in stages (1–3, then 4–6, then 7–9), freezing previously trained layers
  • Embedding layer
    • This repo: standard trainable embedding matrix (vocab_size Γ— d_model)
    • 16-bit repo: extremely small frozen embedding substrate (16-dim binary signal expanded to d_model)
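The embedding difference can be illustrated in a few lines. The 16-bit substrate below is a schematic reconstruction (token id mapped to its 16-bit binary code, then expanded to d_model by a frozen linear map), not the exact scheme from the paper; the trainable path is the standard nn.Embedding used by this repo.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 65_536
D_MODEL = 1024

# This repo: standard trainable embedding matrix (vocab_size x d_model).
trainable_emb = nn.Embedding(VOCAB_SIZE, D_MODEL)

# 16-bit repo (illustrative): each token id becomes its 16-bit binary code,
# which a frozen projection expands to d_model. No embedding parameters train.
class FrozenBinaryEmbedding(nn.Module):
    def __init__(self, vocab_size=VOCAB_SIZE, bits=16, d_model=D_MODEL):
        super().__init__()
        ids = torch.arange(vocab_size)
        codes = ((ids.unsqueeze(-1) >> torch.arange(bits)) & 1).float()
        self.register_buffer("codes", codes)      # (vocab, 16), never trained
        self.expand = nn.Linear(bits, d_model, bias=False)
        self.expand.weight.requires_grad_(False)  # frozen expansion

    def forward(self, token_ids):
        return self.expand(self.codes[token_ids])

frozen_emb = FrozenBinaryEmbedding()
x = torch.tensor([[1, 2, 3]])
print(trainable_emb(x).shape, frozen_emb(x).shape)  # both (1, 3, 1024)
```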

Important note on parameter count (why this model is larger)

This model has more parameters than the 16-bit models because it includes a full trainable embedding matrix:

  • Trainable embedding size here: 65,536 Γ— 1,024 β‰ˆ 67.1M parameters
  • In the 16-bit setup, the embedding-related parameters are ~1.0M (as reported in the paper)

So even with the same Transformer block stack, the total parameter count differs primarily due to the embedding matrix.


Model architecture (controlled study)

  • Type: decoder-only Transformer (GPT-like)
  • Layers: 9
  • Hidden size: d_model = 1024
  • Heads: n_head = 32
  • Vocabulary size: 65,536
  • Context length used in training: 1024
  • Embedding: standard trainable token embedding matrix (vocab_size Γ— d_model)

Parameter count

  • Total: β‰ˆ247.6M
  • Trainable: β‰ˆ247.6M
  • Frozen: 0.0M

(Counts follow the paper’s controlled-study table.)
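The reported total is consistent with the usual GPT block budget of roughly 12 · d_model² parameters per layer, if one assumes an untied output head (an assumption on our part; biases and LayerNorms are ignored):

```python
# Rough parameter estimate from the architecture above.
d_model, n_layers, vocab = 1024, 9, 65_536

per_layer = 12 * d_model**2        # standard GPT block: attention + MLP, ~12.6M
blocks = n_layers * per_layer      # ~113.2M across 9 layers
emb = vocab * d_model              # trainable embedding matrix, ~67.1M
head = vocab * d_model             # untied output head (assumed), another ~67.1M

total = blocks + emb + head
print(f"{total / 1e6:.1f}M")       # ~247.5M, close to the reported 247.6M
```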


Tokenizer

Canonical tokenizer repository:


Intended use

Research / analysis of:

  • monolithic end-to-end training vs constructive (layer-wise) growth
  • the role of embedding learning vs frozen embedding substrates
  • controlled comparisons across embedding types (trainable vs frozen visual-Unicode vs frozen 16-bit)

Not intended as a general-purpose assistant model. Outputs may be unreliable and the model may reflect biases present in the training data.


How to use (Transformers)


import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Bochkov/growing-transformers-model-unfrozen-baseline-monolyth-247m")
model = AutoModelForCausalLM.from_pretrained("Bochkov/growing-transformers-model-unfrozen-baseline-monolyth-247m", trust_remote_code=True).to('cuda')

inputs = torch.tensor([tokenizer.encode("Write a short poem about the ocean. ")], dtype=torch.long, device='cuda')

outputs = model.generate(
    inputs, 
    max_new_tokens=50,
    do_sample=False
)
print(tokenizer.decode(outputs[0].tolist()))
#Write a short poem about the ocean. The poem is a poem about the sea and the sea and the sea and the sea. The poem is ab

inputs = torch.tensor([tokenizer.encode("Question: What is the capital of India?\nAnswer:")], dtype=torch.long, device='cuda')

outputs = model.generate(
    inputs, 
    max_new_tokens=10,
    do_sample=False
)
print(tokenizer.decode(outputs[0].tolist()))
#Question: What is the capital of India?
#Answer:Chennai
#    </s><

πŸ§‘β€πŸ”¬ Citation & Concept

If you use this model or the underlying concepts in your research, please cite our work:

@article{bochkov2025emergent,
      title={Emergent Semantics Beyond Token Embeddings: Transformer {LM}s with Frozen Visual Unicode Representations},
      author={Andrey Bochkov},
      journal={Transactions on Machine Learning Research},
      issn={2835-8856},
      year={2025},
      url={https://openreview.net/forum?id=Odh8IynO1o},
      note={}
}

@misc{bochkov2025growingtransformersmodularcomposition,
      title={Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate}, 
      author={A. Bochkov},
      year={2025},
      eprint={2507.07129},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.07129}, 
}