transformers

burtenshaw committed 11 days ago

Commit

e6a04c6

1 Parent(s): 90196d3

update blog post to use fancy mc fancienson style

Files changed (10)

README.md +0 -470
app/src/components/Hero.astro +2 -1
app/src/content/article.mdx +551 -28
app/src/content/assets/image/nanochat-banner.png +3 -0
app/src/content/assets/image/tweet.png +3 -0
app/src/content/chapters/grpo.mdx +406 -0
app/src/content/chapters/inference.mdx +97 -0
app/src/content/chapters/sft.mdx +400 -0
grpo.ipynb +654 -0
sft.ipynb +591 -0

README.md CHANGED

@@ -28,473 +28,3 @@ thumbnail: https://HuggingFaceTB-smol-training-playbook.hf.space/thumb.png
 **[Try the live demo & documentation →](https://huggingface.co/spaces/tfrere/research-article-template)**
 </div>
-# Porting nanochat to Transformers: an AI modeling history lesson
-**tldr:** There is a lot to learn about ML from nanochat, and even more to learn about the history of the transformer architecture.
-Recently I was helping students of the [nanochat](https://huggingface.co/nanochat-students) project share their models and discuss their learning on Hugging Face. In the process, I thought it would be useful to integrate the model into the `transformers` library. This would allow others to use their nanochat models in many downstream libraries, such as vLLM for inference or TRL for post-training.
-You can now use nanochat models in transformers and tap into all those educational gains across the ecosystem. But along the way, I uncovered a further treasure trove of education about how canonical models relate to each other, and the components they share.
-I received the lesson from the simple teacher of class inheritance and the modular philosophy of `transformers`. If you want to learn more about that, check out this [guide here](https://huggingface.co/docs/transformers/v4.48.0/modular_transformers).
-Here, let’s tuck into this deep dive on how NanoChat relates to the lineage of transformer architectures.
-## What is `nanochat`?
-On October 13th 2025, Andrej Karpathy unceremoniously [dropped](https://x.com/karpathy/status/1977755427569111362) the nanochat [repo](https://github.com/karpathy/nanochat) into the unsuspecting AI world. To hype seekers, this was just a small and pretty average LLM. To ML devotees, this was nirvana: a raw, unadulterated chance to tinker, fiddle, and play with a transformer model defined in pure PyTorch. Nothing was hidden away in fancy `torch` methods or inherited from complex class structures. It was all there in a simple file.
-![][image1]
-Karpathy had painstakingly implemented an end-to-end build of an LLM system without most major libraries, even though in real-world situations most of us rely on transformers, tokenizers, datasets, trl, etc. This back-to-basics approach gives us the chance to genuinely learn and understand something from the ground up.
-Personally, I found the process to be one of the most educational I can remember.
-## What is `transformers` and how is it educational?
-Most of us know the `transformers` library as the backbone of modern machine learning, but if we dig a little deeper, it’s also a powerful educational resource.
-If you don’t know, `transformers` is the de facto implementation of the modern AI models that bear the same name: ‘transformers’, like the models in the GPT, DeepSeek, and Claude series. `transformers` is a special project because it contains the implementations of all major open model architectures, and those architectures are modularized to reuse functionality from each other. If you want to explore the philosophy and lineage behind transformers’ modularity, check out this [guide here](https://huggingface.co/docs/transformers/v4.48.0/modular_transformers).
-In general, scientists at AI research labs design, implement, and train their models in their framework of choice, be that torch, JAX, etc. When they come to share their open model with the community, they will open a PR on transformers and refactor their code to use relevant modules.
-Because `transformers` contains most major model implementations, researchers have to inherit model architecture attributes from other canonical models. This is in every sense a ‘single source of truth’.
-This practical feature of the library has an amazingly educational quality to it. We can read a model implementation as a series of references to other usages of those architectural features. For example, when one model uses a certain type of [RMSNorm](https://github.com/huggingface/transformers/blob/9f5b2d1b8995daa539b757e28c337e36408055e6/src/transformers/models/nanochat/modular_nanochat.py#L44), we can plainly see that it is the same implementation as another model because it inherits that class entirely. Here is nanochat’s RMSNorm:
-```py
-class NanoChatRMSNorm(Llama4TextL2Norm):
-    pass
-```
-The `transformers` library then converts the `modular_*` implementation into a `modeling_*` implementation, which contains the complete `torch` native implementation:
-```py
-class NanoChatRMSNorm(torch.nn.Module):
-    def __init__(self, eps: float = 1e-6):
-        super().__init__()
-        self.eps = eps
-    def _norm(self, x):
-        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
-    def forward(self, x):
-        return self._norm(x.float()).type_as(x)
-    def extra_repr(self):
-        return f"eps={self.eps}"
-```
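As a rough illustration of what `_norm` computes, here is a dependency-free sketch of parameter-free RMS normalization on a plain Python list. The helper name `rms_norm` is hypothetical and exists only for this example; the `eps` default simply mirrors the signature above.

```python
import math

def rms_norm(values, eps=1e-6):
    """Scale a vector so its root-mean-square is close to 1 (no learnable weight)."""
    mean_square = sum(v * v for v in values) / len(values)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [v * scale for v in values]

normalized = rms_norm([3.0, -4.0, 0.0, 5.0])
# Recompute the RMS of the result to confirm it has been normalized.
rms_after = math.sqrt(sum(v * v for v in normalized) / len(normalized))
print(normalized, rms_after)
```

The signs and relative proportions of the inputs are preserved; only the overall scale changes.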
-If we review a model in `transformers`, we can review both sides and learn from the math and literature of the model’s implementation. Due to the educational nature of nanochat, I thought that it was a perfect opportunity to explore this aspect of transformers and share what I learnt with students.
-## Why do we need nanochat in `transformers`?
-It might seem counterintuitive to support an educational model like nanochat in a production grade library like `transformers`. After all, we can see from nanochat’s benchmark scores that it does not rival state of the art models like Qwen3, SmolLM3, Gemma3, or Olmo3.
-Nanochat was never really intended as a production grade model. It was meant as an educational tool, and that’s the same reason why we need it in transformers. There are four main reasons:
-- `transformers` as a single source of truth teaches us about `nanochat`’s lineage.
-- use the `nanochat` model in other libraries.
-- save money by reusing nanochat checkpoints for fine-tuning.
-- compare nanochat fine-tuning with other open model checkpoints.
-Firstly, as mentioned above, `transformers` teaches us about the modeling conventions that Karpathy uses from other canonical implementations.
-Secondly, because transformers is a standard within the ecosystem, it unlocks more downstream learning in post training libraries, quantisation tools, inference libraries, and device integrations. In practical terms, here are some examples nanochat students could learn on top of `transformers`:
-- Quantize models in llama.cpp ($0)
-- Integrate models into the browser and WebGPU ($0)
-- SFT training in TRL/torch on Google Colab ($0)
-- RL training in TRL/torch on Google Colab ($0 - $9)
-- Agentic RL in TRL on Google Colab ($0 - $9)
-Finally, training AI models is expensive. Running the `nanochat` [`speedrun.sh`](https://github.com/karpathy/nanochat/blob/master/speedrun.sh) costs between $200 and $2k depending on the model size we use. That is little compared to the millions of dollars invested by frontier labs, but it is still a significant sum for students, who always learn best by taking a few chances to fail and build experience.
-In short, let’s unlock more opportunities for education!
-## The nanochat architecture
-As described by Karpathy, nanochat uses an archetypal architecture that is common across the field, which makes it an excellent choice for an educational resource because folk get to learn from what works.
-The core model implementation (`nanochat/gpt.py`, 291 lines) demonstrates modern transformer architecture, with every design decision documented and justified.
-The configuration uses a single complexity slider: depth. Set `--depth=20` and everything else adjusts automatically. Model dimension equals depth × 64 (20 layers → 1,280 dimensions). Number of attention heads equals depth ÷ 2 (10 heads). Head dimension is fixed at 128. This "aspect ratio philosophy" simplifies scaling: if you want a more capable model or have a bigger budget, just increase depth to 26 ($300 budget) or 30 ($1,000 budget).
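The depth arithmetic described above can be sketched in a few lines. This is an illustrative helper, not code from the nanochat repository: the function name `config_from_depth` and the `aspect_ratio` parameter are assumptions, and the rounding behaviour for depths that do not divide evenly is not covered here.

```python
def config_from_depth(depth, aspect_ratio=64, head_dim=128):
    """Derive model width and head count from the single depth knob."""
    model_dim = depth * aspect_ratio
    # With a fixed head dimension, the head count follows from the width.
    # For depth=20 this is 1280 // 128 = 10 heads, i.e. depth // 2.
    num_heads = model_dim // head_dim
    return {
        "n_layer": depth,
        "n_embd": model_dim,
        "n_head": num_heads,
        "head_dim": head_dim,
    }

for depth in (20, 26, 30):
    print(config_from_depth(depth))
```

Because the width and head count are both derived from `depth`, a single command-line flag is enough to move between model sizes.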
-The architecture incorporates five key improvements over vanilla transformers. Let’s work through the components of this architecture and compare them across implementations:
-#### Forward pass based on the Llama Architecture
-The forward pass in nanochat handles both training and generation. We can simply read that the input `x` is embedded and then updated by each layer then the head. During training, a loss is calculated and returned instead of the logits themselves.
-```py
-def forward(self, x, targets=None, loss_reduction='mean'):
-    x = self.token_emb(x)
-    for layer in self.layers:
-        x = layer(x)
-    x = self.ln_f(x)
-    logits = self.lm_head(x)
-    if targets is not None:
-        loss = F.cross_entropy(
-            logits.view(-1, self.vocab_size),
-            targets.view(-1),
-            ignore_index=-1,
-            reduction=loss_reduction
-        )
-        return loss
-    return logits
-```
-By returning loss directly when targets are provided, the training loop becomes trivial. No separate loss computation, no manual masking logic—just `loss = model(inputs, targets)` followed by `loss.backward()`.
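To make the role of `ignore_index=-1` concrete, here is a minimal pure-Python sketch of a mean-reduced cross-entropy that skips masked targets. The function name `cross_entropy_with_ignore` is hypothetical, and the code only mimics the behaviour of the call above on nested lists rather than tensors.

```python
import math

def cross_entropy_with_ignore(logits, targets, ignore_index=-1):
    """Mean negative log-likelihood over rows whose target is not ignore_index."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # masked position: contributes nothing to the loss
        # Numerically stable log-sum-exp for the softmax normalizer.
        row_max = max(row)
        log_z = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        total += log_z - row[target]
        count += 1
    return total / count if count else float("nan")

logits = [[0.0, 0.0], [5.0, 1.0], [2.0, 2.0]]
targets = [0, -1, 1]  # the middle position is masked out
loss = cross_entropy_with_ignore(logits, targets)
print(loss)
```

Both unmasked rows have uniform logits, so the loss here equals the natural log of 2 regardless of what the masked row contains.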
-`transformers` has to make things a bit more complex to facilitate the downstream ecosystem that uses logits in a broad spectrum of ways. Therefore, loss calculation is dealt with in training-specific code, and the `forward` function returns `BaseModelOutputWithPast`.
-```py
-class NanoChatModel(LlamaModel):
-    def __init__(self, config: NanoChatConfig):
-        super().__init__(config)
-        self.initial_norm = NanoChatRMSNorm(eps=config.rms_norm_eps)
-        self.norm = NanoChatRMSNorm(eps=config.rms_norm_eps)
-    def forward(
-        self,
-        input_ids: Optional[torch.LongTensor] = None,
-        attention_mask: Optional[torch.Tensor] = None,
-        position_ids: Optional[torch.LongTensor] = None,
-        past_key_values: Optional[Cache] = None,
-        inputs_embeds: Optional[torch.FloatTensor] = None,
-        cache_position: Optional[torch.LongTensor] = None,
-        use_cache: Optional[bool] = None,
-        **kwargs: Unpack[TransformersKwargs],
-    ) -> BaseModelOutputWithPast:
-        if (input_ids is None) ^ (inputs_embeds is not None):
-            raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
-        if inputs_embeds is None:
-            inputs_embeds: torch.Tensor = self.embed_tokens(input_ids)
-        if use_cache and past_key_values is None:
-            past_key_values = DynamicCache(config=self.config)
-        if cache_position is None:
-            past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
-            cache_position: torch.Tensor = torch.arange(
-                past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device
-            )
-        if position_ids is None:
-            position_ids = cache_position.unsqueeze(0)
-        causal_mask = create_causal_mask(
-            config=self.config,
-            input_embeds=inputs_embeds,
-            attention_mask=attention_mask,
-            cache_position=cache_position,
-            past_key_values=past_key_values,
-            position_ids=position_ids,
-        )
-        hidden_states = inputs_embeds
-        position_embeddings = self.rotary_emb(hidden_states, position_ids=position_ids)
-        hidden_states = self.initial_norm(hidden_states)  # Additional norm before the layers
-        for decoder_layer in self.layers[: self.config.num_hidden_layers]:
-            hidden_states = decoder_layer(
-                hidden_states,
-                attention_mask=causal_mask,
-                position_embeddings=position_embeddings,
-                position_ids=position_ids,
-                past_key_values=past_key_values,
-                cache_position=cache_position,
-                **kwargs,
-            )
-        hidden_states = self.norm(hidden_states)
-        return BaseModelOutputWithPast(
-            last_hidden_state=hidden_states,
-            past_key_values=past_key_values,
-        )
-```
-#### Rotary Position Embeddings (RoPE)
-Rotary Position Embeddings (RoPE) replace learned positional encodings by rotating query and key vectors using precomputed sin/cos frequencies:
-```py
-def apply_rope(x, cos, sin):
-    x1, x2 = x[..., ::2], x[..., 1::2]
-    y1 = x1 * cos - x2 * sin
-    y2 = x1 * sin + x2 * cos
-    return torch.stack([y1, y2], dim=-1).flatten(-2)
-```
-In transformers, the rotary embeddings are implemented like so:
-```py
-from ..llama.modeling_llama import (
-    LlamaDecoderLayer,
-    LlamaModel,
-    LlamaPreTrainedModel,
-    LlamaRotaryEmbedding,
-    apply_rotary_pos_emb,
-    eager_attention_forward,
-)
-class NanoChatRotaryEmbedding(LlamaRotaryEmbedding):
-    pass
-def rotate_half(x):
-    """Rotates half the hidden dims of the input with flipped signs for NanoChat."""
-    x1 = x[..., : x.shape[-1] // 2]
-    x2 = x[..., x.shape[-1] // 2 :]
-    return torch.cat((x2, -x1), dim=-1)
-```
-`NanoChatRotaryEmbedding` almost entirely inherits from the original Llama series, except for a sign inversion in `rotate_half`.
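The reason rotating queries and keys encodes position can be checked numerically: after rotation, the query-key dot product depends only on the distance between the two positions. The sketch below uses the same pairwise rotation as `apply_rope`, written for plain lists. The helper names and the frequency base of 10000 are assumptions made for this example.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive pairs (vec[2i], vec[2i+1]) by position * frequency_i."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        angle = position * base ** (-i / dim)
        cos, sin = math.cos(angle), math.sin(angle)
        x1, x2 = vec[i], vec[i + 1]
        out.extend([x1 * cos - x2 * sin, x1 * sin + x2 * cos])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.3, -0.7, 1.0]

# Same offset (query two positions after the key), different absolute positions.
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 3))
score_far = dot(rope_rotate(q, 12), rope_rotate(k, 10))
print(score_near, score_far)
```

The two scores agree up to floating-point error. Flipping the sign in `rotate_half` reverses the direction of rotation, and the relative-position property holds as long as queries and keys use the same convention.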
-### **QK Normalization**
-NanoChat applies RMSNorm to queries and keys before computing attention to stabilize training.
-In the original gpt.py, this is achieved via a functional norm helper applied directly inside the attention forward pass:
-```py
-def norm(x):
-    # Purely functional rmsnorm with no learnable params
-    return F.rms_norm(x, (x.size(-1),))
-class CausalSelfAttention(nn.Module):
-    ...
-    def forward(self, x, cos_sin, kv_cache):
-        B, T, C = x.size()
-        # Project the input to get queries, keys, and values
-        q = self.c_q(x).view(B, T, self.n_head, self.head_dim)
-        k = self.c_k(x).view(B, T, self.n_kv_head, self.head_dim)
-        v = self.c_v(x).view(B, T, self.n_kv_head, self.head_dim)
-        # Apply Rotary Embeddings to queries and keys to get relative positional encoding
-        cos, sin = cos_sin
-        q, k = apply_rotary_emb(q, cos, sin), apply_rotary_emb(k, cos, sin) # QK rotary embedding
-        q, k = norm(q), norm(k) # QK norm
-        q, k, v = q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2) # make head be batch dim, i.e. (B, T, H, D) -> (B, H, T, D)
-        ...
-```
-In the modular transformers implementation, we see a fascinating mix of lineages. The `NanoChatRMSNorm` inherits directly from `Llama4TextL2Norm`, while the attention mechanism inherits from `Qwen3Attention`. We simply inject the QK normalization into the Qwen3 logic:
-```py
-class NanoChatRMSNorm(Llama4TextL2Norm):
-    pass
-class NanoChatAttention(Qwen3Attention):
-    def __init__(self, config: NanoChatConfig, layer_idx: int):
-        super().__init__(config, layer_idx)
-        del self.sliding_window
-        del self.layer_type
-        self.q_norm = NanoChatRMSNorm(eps=config.rms_norm_eps)
-        self.k_norm = NanoChatRMSNorm(eps=config.rms_norm_eps)
-    def forward(
-        self,
-        hidden_states: torch.Tensor,
-        position_embeddings: Optional[tuple[torch.Tensor, torch.Tensor]] = None,
-        attention_mask: Optional[torch.Tensor] = None,
-        past_key_values: Optional[Cache] = None,
-        cache_position: Optional[torch.LongTensor] = None,
-        **kwargs: Unpack[TransformersKwargs],
-    ) -> tuple[torch.Tensor, Optional[torch.Tensor]]:
-        input_shape = hidden_states.shape[:-1]
-        hidden_shape = (*input_shape, -1, self.head_dim)
-        query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
-        key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
-        value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
-        cos, sin = position_embeddings
-        query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
-        # RoPE -> Norm (instead of usual Norm -> RoPE)
-        query_states = self.q_norm(query_states)
-        key_states = self.k_norm(key_states)
-        if past_key_values is not None:
-            # sin and cos are specific to RoPE models; cache_position needed for the static cache
-            cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
-            key_states, value_states = past_key_values.update(key_states, value_states, self.layer_idx, cache_kwargs)
-        attention_interface: Callable = eager_attention_forward
-        if self.config._attn_implementation != "eager":
-            attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
-        attn_output, attn_weights = attention_interface(
-            self,
-            query_states,
-            key_states,
-            value_states,
-            attention_mask,
-            dropout=0.0 if not self.training else self.attention_dropout,
-            scaling=self.scaling,
-            **kwargs,
-        )
-        attn_output = attn_output.reshape(*input_shape, -1).contiguous()
-        attn_output = self.o_proj(attn_output)
-        return attn_output, attn_weights
-```
-### **Untied Weights**
-Karpathy's implementation deliberately unties the weights between the token embedding and the language model head to provide the model with more flexibility. In gpt.py, these are initialized as two completely separate modules:
-```py
-class GPT(nn.Module):
-    def __init__(self, config):
-        super().__init__()
-        self.config = config
-        self.transformer = nn.ModuleDict({
-            "wte": nn.Embedding(config.vocab_size, config.n_embd),
-            "h": nn.ModuleList([Block(config, layer_idx) for layer_idx in range(config.n_layer)]),
-        })
-        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
-        # ... (rest of init)
-```
-In the modular implementation, we inherit from `Gemma2ForCausalLM`. This is a powerful simplification—Gemma 2 also supports untied weights and advanced output structures. By simply inheriting the class, we pull in all the necessary machinery for causal generation, while the configuration object (defined elsewhere) ensures the weights remain untied:
-```py
-class NanoChatForCausalLM(Gemma2ForCausalLM):
-    def forward(self, **super_kwargs) -> CausalLMOutputWithPast:
-        return super().forward(**super_kwargs)
-```
-### **ReLU² Activation**
-The original implementation replaces the standard GELU activation with ReLU², which is simply ReLU squared. This provides a faster alternative without performance loss. In gpt.py, this is hardcoded into the MLP block:
-```py
-class MLP(nn.Module):
-    def __init__(self, config):
-        super().__init__()
-        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd, bias=False)
-        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd, bias=False)
-    def forward(self, x):
-        x = self.c_fc(x)
-        x = F.relu(x).square()
-        x = self.c_proj(x)
-        return x
-```
-In the modular file, we see another surprising inheritance: `CLIPMLP`. The CLIP architecture uses a structure that fits our needs perfectly, so we inherit the structural definition from CLIP and let the configuration drive the specific activation function (ReLU2):
-```py
-class NanoChatMLP(CLIPMLP):
-    def __init__(self, config):
-        super().__init__(config)
-        self.fc1 = nn.Linear(config.hidden_size, config.intermediate_size, bias=False)
-        self.fc2 = nn.Linear(config.intermediate_size, config.hidden_size, bias=False)
-```
-###  **Multi-Query Attention (MQA)**
-NanoChat uses Multi-Query Attention (MQA) to reduce the memory footprint of the KV cache, using 10 query heads but only 4 key/value heads (in the default config).
-In gpt.py, this logic is handled by passing distinct head counts and relying on PyTorch's functional attention to handle the broadcasting (or explicitly handling it during inference):
-```py
-class CausalSelfAttention(nn.Module):
-    # ...
-    def forward(self, x, cos_sin, kv_cache):
-        # ...
-        # Attention: queries attend to keys/values autoregressively. A few cases to handle:
-        enable_gqa = self.n_head != self.n_kv_head # Group Query Attention (GQA): duplicate key/value heads to match query heads if desired
-        if kv_cache is None or Tq == Tk:
-            # During training (no KV cache), attend as usual with causal attention
-            # And even if there is KV cache, we can still use this simple version when Tq == Tk
-            y = F.scaled_dot_product_attention(q, k, v, is_causal=True, enable_gqa=enable_gqa)
-        elif Tq == 1:
-            # During inference but with a single query in this forward pass:
-            # The query has to attend to all the keys/values in the cache
-            y = F.scaled_dot_product_attention(q, k, v, is_causal=False, enable_gqa=enable_gqa)
-        else:
-            # During inference AND we have a chunk of queries in this forward pass:
-            # First, each query attends to all the cached keys/values (i.e. full prefix)
-            attn_mask = torch.zeros((Tq, Tk), dtype=torch.bool, device=q.device) # True = keep, False = mask
-            prefix_len = Tk - Tq
-            if prefix_len > 0: # can't be negative but could be zero
-                attn_mask[:, :prefix_len] = True
-            # Then, causal attention within this chunk
-            attn_mask[:, prefix_len:] = torch.tril(torch.ones((Tq, Tq), dtype=torch.bool, device=q.device))
-            y = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask, enable_gqa=enable_gqa)
-        # ...
-```
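The third branch above (a chunk of queries attending to a cached prefix) is the least obvious, so here is a small sketch that builds the same boolean mask with plain lists. The function name `chunk_attention_mask` is hypothetical, and the sketch assumes the number of keys is at least the number of queries.

```python
def chunk_attention_mask(num_queries, num_keys):
    """Return a num_queries x num_keys boolean mask (True = may attend)."""
    prefix_len = num_keys - num_queries  # keys already sitting in the cache
    mask = []
    for q in range(num_queries):
        # Each query sees the whole cached prefix, plus the positions in the
        # current chunk up to and including itself (causal within the chunk).
        mask.append([k <= prefix_len + q for k in range(num_keys)])
    return mask

for row in chunk_attention_mask(num_queries=3, num_keys=5):
    print("".join("1" if keep else "0" for keep in row))
# 11100
# 11110
# 11111
```

With no cached prefix (`num_queries == num_keys`) the mask reduces to an ordinary lower-triangular causal mask, which is why the first branch can use `is_causal=True` instead.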
-In `modular_nanochat.py`, we don't need to write this logic at all. As seen in the QK Normalization section above, `NanoChatAttention` inherits from `Qwen3Attention`. The Qwen3 implementation is robust and fully supports GQA/MQA out of the box. By using this parent class, we get production-grade attention implementation "for free," allowing us to focus solely on the unique normalizations required by NanoChat.
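For readers new to grouped-query attention, the sharing pattern itself is simple. The sketch below shows which key/value head each query head reads from, assuming the query head count is a multiple of the key/value head count and that key/value heads are repeated in consecutive blocks. The helper name and the 8-to-2 head counts are illustrative only.

```python
def kv_head_for_query_head(query_head, n_head, n_kv_head):
    """Map a query head index to the key/value head it shares."""
    if n_head % n_kv_head != 0:
        raise ValueError("n_head must be a multiple of n_kv_head")
    group_size = n_head // n_kv_head  # query heads per key/value head
    return query_head // group_size

mapping = [kv_head_for_query_head(h, n_head=8, n_kv_head=2) for h in range(8)]
print(mapping)  # [0, 0, 0, 0, 1, 1, 1, 1]
```

Multi-query attention is the special case `n_kv_head=1`, where every query head shares a single key/value head, and standard multi-head attention is the case `n_kv_head == n_head`.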
-## Conclusion
-It’s very clear that Andrej Karpathy’s implementation offers 10 times more to learn from than the transformer version which inherits almost entirely from existing models or features. That said, we can still take more away from the inherited modular modeling implementation. Models like Llama, Llama4, Gemma2, Qwen3, and CLIP are all reused to create a genuinely canonical implementation of a transformer.
-## Use Nanochat in Transformers
-If you’d like to try out your own nanochat models in `transformers`, follow these steps:
-1. Download the nanochat-d34 checkpoint
-```
-hf download karpathy/nanochat-d34 --local-dir nanochat-d34
-```
-2. Convert the checkpoint to transformers format
-```
-uv run \
---with "transformers @ git+https://github.com/huggingface/transformers.git@nanochat-implementation" \
---with "tiktoken>=0.12.0" \
-https://raw.githubusercontent.com/huggingface/transformers/nanochat-implementation/src/transformers/models/nanochat/convert_nanochat_checkpoints.py \
---input_dir ./nanochat-d34 \
---output_dir ./nanochat-d3-hf
-```
-3. (optional) Upload the checkpoint to the Hugging Face Hub
-```
-hf upload <username>/nanochat-d34 nanochat-d34
-```
-4. Test the model
-```py
-import torch
-from transformers import AutoTokenizer, NanoChatForCausalLM
-tokenizer = AutoTokenizer.from_pretrained("./nanochat-d3-hf")
-model = NanoChatForCausalLM.from_pretrained("./nanochat-d3-hf")
-device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
-model = model.to(device)
-prompt = "Hello, how are you?"
-inputs = tokenizer(prompt, return_tensors="pt").to(device)
-inputs.pop("token_type_ids", None)
-outputs = model.generate(**inputs, max_new_tokens=100)
-print(tokenizer.decode(outputs[0], skip_special_tokens=True))
-```
-## Notebooks
-If you want to train with these models, you can use these colab notebooks:
-- [SFT](https://colab.research.google.com/#fileId=https%3A//huggingface.co/datasets/nanochat-students/notebooks/blob/main/sft.ipynb)
-- [GRPO](https://colab.research.google.com/#fileId=https%3A//huggingface.co/datasets/nanochat-students/notebooks/blob/main/grpo.ipynb)


 [Try the live demo & documentation →](https://huggingface.co/spaces/tfrere/research-article-template)
 </div>

app/src/components/Hero.astro CHANGED

@@ -1,5 +1,6 @@
 ---
 import HtmlEmbed from "./HtmlEmbed.astro";
 interface Props {
   title: string; // may contain HTML (e.g., <br/>)
@@ -98,7 +99,7 @@ const pdfFilename = `${slugify(pdfBase)}.pdf`;
 <section class="hero">
   <h1 class="hero-title" set:html={title} />
   <div class="hero-banner">
-    <HtmlEmbed src="banner.html" frameless />
     {description && <p class="hero-desc">{description}</p>}
   </div>
 </section>

 ---
 import HtmlEmbed from "./HtmlEmbed.astro";
+import bannerImage from "../content/assets/image/nanochat-banner.png";
 interface Props {
   title: string; // may contain HTML (e.g., <br/>)
 <section class="hero">
   <h1 class="hero-title" set:html={title} />
   <div class="hero-banner">
+    <img src={bannerImage.src} alt="Banner" style="width: 100%; max-width: 980px;" />
     {description && <p class="hero-desc">{description}</p>}
   </div>
 </section>

app/src/content/article.mdx CHANGED

@@ -1,19 +1,25 @@
 ---
-title: "Bringing paper to life:\n A modern template for\n scientific writing"
-subtitle: "Publish‑ready workflow that lets you focus on ideas, not infrastructure"
-description: "Publish‑ready workflow that lets you focus on ideas, not infrastructure"
 authors:
-  - name: "Thibaud Frere"
-    url: "https://huggingface.co/tfrere"
     affiliations: [1]
 affiliations:
   - name: "Hugging Face"
     url: "https://huggingface.co"
-published: "Sep. 01, 2025"
 doi: 10.1234/abcd.efgh
 licence: >
   Diagrams and text are licensed under <a href="https://creativecommons.org/licenses/by/4.0/" target="_blank" rel="noopener noreferrer">CC‑BY 4.0</a> with the source available on <a href="https://huggingface.co/spaces/tfrere/research-article-template" target="_blank" rel="noopener noreferrer">Hugging Face</a>, unless noted otherwise.
-  Figures reused from other sources are excluded and marked in their captions (“Figure from …”).
 tags:
   - research
   - template
@@ -22,36 +28,553 @@ pdfProOnly: false
 showPdf: true
 ---
-import Introduction from "./chapters/demo/introduction.mdx";
-import BuiltWithThis from "./chapters/demo/built-with-this.mdx";
-import BestPractices from "./chapters/demo/best-pratices.mdx";
-import WritingYourContent from "./chapters/demo/writing-your-content.mdx";
-import AvailableBlocks from "./chapters/demo/markdown.mdx";
-import GettingStarted from "./chapters/demo/getting-started.mdx";
-import Markdown from "./chapters/demo/markdown.mdx";
-import Components from "./chapters/demo/components.mdx";
-import Greetings from "./chapters/demo/greetings.mdx";
-import VibeCodingCharts from "./chapters/demo/vibe-coding-charts.mdx";
-import ImportContent from "./chapters/demo/import-content.mdx";
-<Introduction />
-<BuiltWithThis />
-<GettingStarted />
-<WritingYourContent />
-<Markdown />
-<Components />
-<VibeCodingCharts />
-<ImportContent />
-<BestPractices />
-<Greetings />

 ---
+title: "Porting nanochat to Transformers: an AI modeling history lesson"
+subtitle: "There is a lot to learn about ML from nanochat, and even more to learn about the history of the transformer architecture."
+description: "**tldr:** There is a lot to learn about ML from nanochat, and even more to learn about the history of the transformer architecture."
 authors:
+  - name: "Ben Burtenshaw"
+    url: "https://huggingface.co/burtenshaw"
+    affiliations: [1]
+  - name: "Sergio Paniego"
+    url: "https://huggingface.co/sergiopaniego"
+    affiliations: [1]
+  - name: "Anton Vlasjuk"
+    url: "https://huggingface.co/AntonV"
     affiliations: [1]
 affiliations:
   - name: "Hugging Face"
     url: "https://huggingface.co"
+published: "Dec. 01, 2025"
 doi: 10.1234/abcd.efgh
 licence: >
   Diagrams and text are licensed under <a href="https://creativecommons.org/licenses/by/4.0/" target="_blank" rel="noopener noreferrer">CC‑BY 4.0</a> with the source available on <a href="https://huggingface.co/spaces/tfrere/research-article-template" target="_blank" rel="noopener noreferrer">Hugging Face</a>, unless noted otherwise.
+  Figures reused from other sources are excluded and marked in their captions ("Figure from …").
 tags:
   - research
   - template
 showPdf: true
 ---
+import Sidenote from '../../components/Sidenote.astro'
+import GRPO from "./chapters/grpo.mdx";
+import SFT from "./chapters/sft.mdx";
+import Inference from "./chapters/inference.mdx";
+<Sidenote>
+The [nanochat-students](https://huggingface.co/nanochat-students) organization on Hugging Face hosts community models and discussions.
+</Sidenote>
+Recently I was helping students of the nanochat project share their models and discuss their learning on Hugging Face. In the process, I thought it would be useful to integrate the model into the `transformers` library. This would allow others to use their nanochat models in many downstream libraries, such as vLLM for inference or TRL for post-training.
+<Sidenote>
+[vLLM](https://docs.vllm.ai/) provides high-throughput inference, while [TRL](https://huggingface.co/docs/trl/index) offers tools for reinforcement learning from human feedback (RLHF) and other post-training methods.
+</Sidenote>
+You can now use nanochat models in transformers and tap into all those educational gains across the ecosystem. But along the way, we uncovered a further treasure trove of education about how canonical models relate to each other, and the components they share. We received the lesson from the simple teacher of class inheritance and the modular philosophy of `transformers`.
+<Sidenote>
+Learn more about how transformers achieves modularity in the [modular transformers guide](https://huggingface.co/docs/transformers/v4.48.0/modular_transformers).
+</Sidenote>
+Now, let's tuck into this deep dive on how NanoChat relates to the lineage of transformer architectures.
+## What is `nanochat`?
+<Sidenote>
+See Karpathy's [original announcement](https://x.com/karpathy/status/1977755427569111362) and the [nanochat repository](https://github.com/karpathy/nanochat) on GitHub.
+</Sidenote>
+On October 13th 2025, Andrej Karpathy unceremoniously dropped the nanochat repo into the unsuspecting AI world. To hype seekers, this was just a small and pretty average LLM. To ML devotees, this was nirvana: a raw, unadulterated chance to tinker, fiddle, and play with a transformer model defined in pure PyTorch. Nothing was hidden away in fancy `torch` methods or inherited from complex class structures. It was all there in a simple file.
+![image1](./assets/image/tweet.png)
+<Sidenote>
+The core libraries Karpathy avoided: [transformers](https://huggingface.co/docs/transformers/index), [tokenizers](https://huggingface.co/docs/tokenizers/index), [datasets](https://huggingface.co/docs/datasets/index), [trl](https://huggingface.co/docs/trl/index), and many dependencies. All for the sake of our learning!
+</Sidenote>
+Karpathy had painstakingly implemented an end-to-end build of an LLM system without most major libraries, even though in real-world situations most of us rely on transformers, tokenizers, datasets, trl, etc. This back-to-basics approach gives us the chance to genuinely learn and understand something from the ground up.
+Personally, I found the process to be one of the most educational I can remember.
+## What is `transformers` and how is it educational?
+<Sidenote>
+The [transformers documentation](https://huggingface.co/docs/transformers/index) covers everything from quickstart guides to advanced model internals.
+</Sidenote>
+Most of us know the `transformers` library as the backbone of modern machine learning, but if we dig a little deeper, it's also a powerful educational resource.
+If you don't know, `transformers` is the de facto implementation of the modern AI models that share its name: transformer models like those in the GPT, DeepSeek, and Claude series. `transformers` is a special project because it contains implementations of all major open model architectures, and those architectures are modularized to reuse functionality from each other.
+<Sidenote>
+Explore the [model hub](https://huggingface.co/models) to see thousands of models built on these shared architectures.
+</Sidenote>
+In general, scientists at AI research labs design, implement, and train their models in their framework of choice, be that torch, JAX, etc. When they come to share their open model with the community, they will open a PR on transformers and refactor their code to use relevant modules.
+Because `transformers` contains most major model implementations, researchers have to inherit model architecture attributes from other canonical models. This makes the library, in every sense, a 'single source of truth'.
+<Sidenote>
+See nanochat's [RMSNorm implementation](https://github.com/huggingface/transformers/blob/9f5b2d1b8995daa539b757e28c337e36408055e6/src/transformers/models/nanochat/modular_nanochat.py#L44) in the transformers codebase.
+</Sidenote>
+This practical feature of the library has an amazingly educational quality to it. We can read a model implementation as a series of references to other usages of those architectural features. For example, when one model uses a certain type of RMSNorm, we can plainly see that it is the same implementation as another model because it inherits that class entirely. Take nanochat's RMSNorm:
+```py
+class NanoChatRMSNorm(Llama4TextL2Norm):
+    pass
+```
+The `transformers` library then converts the `modular_*` implementation into a `modeling_*` implementation, which contains the complete `torch` native implementation:
+```py
+class NanoChatRMSNorm(torch.nn.Module):
+    def __init__(self, eps: float = 1e-6):
+        super().__init__()
+        self.eps = eps
+    def _norm(self, x):
+        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
+    def forward(self, x):
+        return self._norm(x.float()).type_as(x)
+    def extra_repr(self):
+        return f"eps={self.eps}"
+```
+If we review a model in `transformers`, we can review both sides and learn from the math and literature of the model's implementation. Due to the educational nature of nanochat, I thought that it was a perfect opportunity to explore this aspect of transformers and share what I learnt with students.
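To make the `_norm` formula concrete, here is a minimal pure-Python sketch of the same computation on a single vector. The `rms_norm` helper and the sample values are illustrative assumptions, not part of nanochat or `transformers`.

```python
import math

def rms_norm(x: list[float], eps: float = 1e-6) -> list[float]:
    """Scale x so its root-mean-square is ~1; there is no learnable weight."""
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)  # same role as torch.rsqrt(...)
    return [v * scale for v in x]

x = [3.0, -4.0, 0.0, 12.0]
y = rms_norm(x)

# After normalization the root-mean-square of the vector is ~1,
# whatever the scale of the input was.
rms_after = math.sqrt(sum(v * v for v in y) / len(y))
print(round(rms_after, 4))
```

The direction of the vector is unchanged; only its overall scale is fixed, which is why no learnable parameters are needed.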
+## Why do we need nanochat in `transformers`?
+It might seem counterintuitive to support an educational model like nanochat in a production grade library like `transformers`. After all, we can see from nanochat's benchmark scores that it does not rival state of the art models like Qwen3, SmolLM3, Gemma3, or [Olmo3](https://huggingface.co/allenai/Olmo-3-32B-Think). In fact, that's the reason we think nanochat should be in `transformers`. Here's what the community gains from its inclusion:
+- `transformers` as a single source of truth teaches us about `nanochat`'s lineage.
+- we can use the `nanochat` model in other libraries.
+- save money by reusing nanochat checkpoints for fine-tuning.
+- compare nanochat fine-tuning implementation with other open model checkpoints.
+Firstly, as mentioned above `transformers` teaches us about the modeling conventions that Karpathy uses from other canonical implementations.
+Secondly, because transformers is a standard within the ecosystem, it unlocks more downstream learning in post training libraries, quantisation tools, inference libraries, and device integrations. In practical terms, here are some examples nanochat students could learn on top of `transformers`:
+<Sidenote>
+Learn about [model quantization](https://huggingface.co/docs/transformers/en/quantization/overview) to reduce model size and memory usage.
+</Sidenote>
+- Quantize models in llama.cpp ($0)
+- Integrate models into the browser and WebGPU ($0)
+- SFT training in TRL/torch on Google Colab ($0)
+- RL training in TRL/torch on Google Colab ($0 - $9)
+- Agentic RL in TRL on Google Colab ($0 - $9)
+Finally, training AI models is expensive. Running the nanochat `speedrun.sh` costs between $200 and $2k depending on the model size we use. That is little compared to the millions of dollars invested by frontier labs, but it is still a significant sum for students, who learn best by taking a few chances to fail and build experience.
+<Sidenote>
+The [speedrun.sh](https://github.com/karpathy/nanochat/blob/master/speedrun.sh) script in nanochat benchmarks training costs across different configurations.
+</Sidenote>
+In short, let's unlock more opportunities for education!
+## The nanochat architecture
+<Sidenote>
+The original [gpt.py](https://github.com/karpathy/nanochat/blob/master/nanochat/gpt.py) implementation is just 291 lines of pure PyTorch.
+</Sidenote>
+As described by Karpathy, nanochat uses an archetypal architecture that is common across the field, which makes it an excellent choice for an educational resource because folk get to learn from what works. The core model implementation demonstrates modern transformer architecture, with every design decision documented and justified.
+The configuration uses a single complexity slider: depth. Set `--depth=20` and everything else adjusts automatically. Model dimension equals depth × 64 (20 layers → 1,280 dimensions). The number of attention heads equals depth ÷ 2 (10 heads). Head dimension is fixed at 128. This "aspect ratio philosophy" simplifies scaling: if you want a more capable model or have a bigger budget, just increase depth to 26 ($300 budget) or 30 ($1,000 budget).
+The architecture incorporates five key improvements over vanilla transformers. Let's work through the components of this architecture and compare them across the two implementations:
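The depth arithmetic can be sketched in a few lines. The `config_from_depth` helper below is hypothetical and simply encodes the ratios stated in this section; it is not nanochat's own configuration code.

```python
def config_from_depth(depth: int) -> dict:
    """Derive model sizes from depth using the ratios described in the text."""
    model_dim = depth * 64              # model dimension = depth x 64
    num_heads = depth // 2              # attention heads = depth / 2
    head_dim = model_dim // num_heads   # works out to 128 for even depths
    return {
        "depth": depth,
        "model_dim": model_dim,
        "num_heads": num_heads,
        "head_dim": head_dim,
    }

for depth in (20, 26, 30):
    print(config_from_depth(depth))
```

With these ratios the head dimension stays at 128 as depth grows, so scaling up only adds layers, width, and heads.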
+### Forward pass based on the Llama Architecture
+<Sidenote>
+See the [Llama model documentation](https://huggingface.co/docs/transformers/en/model_doc/llama) for the full architecture details.
+</Sidenote>
+The forward pass in nanochat handles both training and generation. Reading the code, the input `x` is embedded, updated by each layer in turn, normalized, and then projected by the language model head. During training, a loss is calculated and returned instead of the logits themselves.
+```py
+def forward(self, x, targets=None, loss_reduction='mean'):
+    x = self.token_emb(x)
+    for layer in self.layers:
+        x = layer(x)
+    x = self.ln_f(x)
+    logits = self.lm_head(x)
+    if targets is not None:
+        loss = F.cross_entropy(
+            logits.view(-1, self.vocab_size),
+            targets.view(-1),
+            ignore_index=-1,
+            reduction=loss_reduction
+        )
+        return loss
+    return logits
+```
+By returning loss directly when targets are provided, the training loop becomes trivial. No separate loss computation, no manual masking logic—just `loss = model(inputs, targets)` followed by `loss.backward()`.
+<Sidenote>
+The [BaseModelOutputWithPast](https://huggingface.co/docs/transformers/en/main_classes/output#transformers.modeling_outputs.BaseModelOutputWithPast) class standardizes model outputs across the ecosystem.
+</Sidenote>
+`transformers` has to make things a bit more complex to facilitate the downstream ecosystem that uses logits in a broad spectrum of ways. Therefore, loss calculation is dealt with in training-specific code, and the `forward` function returns `BaseModelOutputWithPast`.
+```py
+class NanoChatModel(LlamaModel):
+    def __init__(self, config: NanoChatConfig):
+        super().__init__(config)
+        self.initial_norm = NanoChatRMSNorm(eps=config.rms_norm_eps)
+        self.norm = NanoChatRMSNorm(eps=config.rms_norm_eps)
+    def forward(
+        self,
+        input_ids: Optional[torch.LongTensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[Cache] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        cache_position: Optional[torch.LongTensor] = None,
+        use_cache: Optional[bool] = None,
+        **kwargs: Unpack[TransformersKwargs],
+    ) -> BaseModelOutputWithPast:
+        if (input_ids is None) ^ (inputs_embeds is not None):
+            raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
+        if inputs_embeds is None:
+            inputs_embeds: torch.Tensor = self.embed_tokens(input_ids)
+        if use_cache and past_key_values is None:
+            past_key_values = DynamicCache(config=self.config)
+        if cache_position is None:
+            past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
+            cache_position: torch.Tensor = torch.arange(
+                past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device
+            )
+        if position_ids is None:
+            position_ids = cache_position.unsqueeze(0)
+        causal_mask = create_causal_mask(
+            config=self.config,
+            input_embeds=inputs_embeds,
+            attention_mask=attention_mask,
+            cache_position=cache_position,
+            past_key_values=past_key_values,
+            position_ids=position_ids,
+        )
+        hidden_states = inputs_embeds
+        position_embeddings = self.rotary_emb(hidden_states, position_ids=position_ids)
+        hidden_states = self.initial_norm(hidden_states)  # Additional norm before the layers
+        for decoder_layer in self.layers[: self.config.num_hidden_layers]:
+            hidden_states = decoder_layer(
+                hidden_states,
+                attention_mask=causal_mask,
+                position_embeddings=position_embeddings,
+                position_ids=position_ids,
+                past_key_values=past_key_values,
+                cache_position=cache_position,
+                **kwargs,
+            )
+        hidden_states = self.norm(hidden_states)
+        return BaseModelOutputWithPast(
+            last_hidden_state=hidden_states,
+            past_key_values=past_key_values,
+        )
+```
+### Rotary Position Embeddings (RoPE)
+<Sidenote>
+The [RoFormer paper](https://arxiv.org/abs/2104.09864) introduced RoPE, now used in Llama, Mistral, and many other modern LLMs.
+</Sidenote>
+Rotary Position Embeddings (RoPE) replace learned positional encodings by rotating query and key vectors using precomputed sin/cos frequencies:
+```py
+def apply_rope(x, cos, sin):
+    x1, x2 = x[..., ::2], x[..., 1::2]
+    y1 = x1 * cos - x2 * sin
+    y2 = x1 * sin + x2 * cos
+    return torch.stack([y1, y2], dim=-1).flatten(-2)
+```
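Why rotate at all? A rotation keeps each vector's length and makes the query-key dot product depend only on the distance between positions. The sketch below checks both properties for a single channel pair in pure Python; the helper names and the single frequency `theta` are illustrative assumptions, not nanochat's actual frequency schedule.

```python
import math

def rotate_pair(x1: float, x2: float, angle: float) -> tuple[float, float]:
    """Rotate the 2-D pair (x1, x2) by `angle`, the per-pair operation in RoPE."""
    cos, sin = math.cos(angle), math.sin(angle)
    return (x1 * cos - x2 * sin, x1 * sin + x2 * cos)

def score(q, k, q_pos: int, k_pos: int, theta: float = 0.1) -> float:
    """Dot product of a query pair at q_pos and a key pair at k_pos after rotation."""
    qr = rotate_pair(q[0], q[1], q_pos * theta)
    kr = rotate_pair(k[0], k[1], k_pos * theta)
    return qr[0] * kr[0] + qr[1] * kr[1]

q, k = (1.0, 2.0), (0.5, -1.0)

# Same offset (2 positions apart), very different absolute positions.
near = score(q, k, q_pos=5, k_pos=3)
far = score(q, k, q_pos=105, k_pos=103)
print(abs(near - far) < 1e-9)

# Rotation never changes the length of the pair.
rotated = rotate_pair(3.0, 4.0, 0.7)
print(abs(math.hypot(*rotated) - 5.0) < 1e-9)
```

This is the sense in which RoPE gives relative positional encoding: shifting both tokens by the same amount leaves the attention score unchanged.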
+In transformers, the rotary embeddings are implemented like so:
+```py
+from ..llama.modeling_llama import (
+    LlamaDecoderLayer,
+    LlamaModel,
+    LlamaPreTrainedModel,
+    LlamaRotaryEmbedding,
+    apply_rotary_pos_emb,
+    eager_attention_forward,
+)
+class NanoChatRotaryEmbedding(LlamaRotaryEmbedding):
+    pass
+def rotate_half(x):
+    """Rotates half the hidden dims of the input with flipped signs for NanoChat."""
+    x1 = x[..., : x.shape[-1] // 2]
+    x2 = x[..., x.shape[-1] // 2 :]
+    return torch.cat((x2, -x1), dim=-1)
+```
+The rotary embedding code is inherited almost entirely from the Llama series; the exception is a sign inversion in `rotate_half`.
+### **QK Normalization**
+<Sidenote>
+QK normalization was popularized by [Llama 4](https://huggingface.co/docs/transformers/en/model_doc/llama4) and helps stabilize attention scores during training.
+</Sidenote>
+NanoChat applies RMSNorm to queries and keys before computing attention to stabilize training.
+In the original gpt.py, this is achieved via a functional norm helper applied directly inside the attention forward pass:
+```py
+def norm(x):
+    # Purely functional rmsnorm with no learnable params
+    return F.rms_norm(x, (x.size(-1),))
+class CausalSelfAttention(nn.Module):
+    ...
+    def forward(self, x, cos_sin, kv_cache):
+        B, T, C = x.size()
+        # Project the input to get queries, keys, and values
+        q = self.c_q(x).view(B, T, self.n_head, self.head_dim)
+        k = self.c_k(x).view(B, T, self.n_kv_head, self.head_dim)
+        v = self.c_v(x).view(B, T, self.n_kv_head, self.head_dim)
+        # Apply Rotary Embeddings to queries and keys to get relative positional encoding
+        cos, sin = cos_sin
+        q, k = apply_rotary_emb(q, cos, sin), apply_rotary_emb(k, cos, sin) # QK rotary embedding
+        q, k = norm(q), norm(k) # QK norm
+        q, k, v = q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2) # make head be batch dim, i.e. (B, T, H, D) -> (B, H, T, D)
+        ...
+```
+<Sidenote>
+[Qwen3](https://huggingface.co/docs/transformers/en/model_doc/qwen3) provides a robust attention implementation that nanochat extends with QK normalization.
+</Sidenote>
+In the modular transformers implementation, we see a fascinating mix of lineages. The `NanoChatRMSNorm` inherits directly from `Llama4TextL2Norm`, while the attention mechanism inherits from `Qwen3Attention`. We simply inject the QK normalization into the Qwen3 logic:
+```py
+class NanoChatRMSNorm(Llama4TextL2Norm):
+    pass
+class NanoChatAttention(Qwen3Attention):
+    def __init__(self, config: NanoChatConfig, layer_idx: int):
+        super().__init__(config, layer_idx)
+        del self.sliding_window
+        del self.layer_type
+        self.q_norm = NanoChatRMSNorm(eps=config.rms_norm_eps)
+        self.k_norm = NanoChatRMSNorm(eps=config.rms_norm_eps)
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        position_embeddings: Optional[tuple[torch.Tensor, torch.Tensor]] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        past_key_values: Optional[Cache] = None,
+        cache_position: Optional[torch.LongTensor] = None,
+        **kwargs: Unpack[TransformersKwargs],
+    ) -> tuple[torch.Tensor, Optional[torch.Tensor]]:
+        input_shape = hidden_states.shape[:-1]
+        hidden_shape = (*input_shape, -1, self.head_dim)
+        query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+        key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+        value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+        cos, sin = position_embeddings
+        query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
+        # RoPE -> Norm (instead of usual Norm -> RoPE)
+        query_states = self.q_norm(query_states)
+        key_states = self.k_norm(key_states)
+        if past_key_values is not None:
+            # sin and cos are specific to RoPE models; cache_position needed for the static cache
+            cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
+            key_states, value_states = past_key_values.update(key_states, value_states, self.layer_idx, cache_kwargs)
+        attention_interface: Callable = eager_attention_forward
+        if self.config._attn_implementation != "eager":
+            attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
+        attn_output, attn_weights = attention_interface(
+            self,
+            query_states,
+            key_states,
+            value_states,
+            attention_mask,
+            dropout=0.0 if not self.training else self.attention_dropout,
+            scaling=self.scaling,
+            **kwargs,
+        )
+        attn_output = attn_output.reshape(*input_shape, -1).contiguous()
+        attn_output = self.o_proj(attn_output)
+        return attn_output, attn_weights
+```
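One way to see why QK normalization helps stability: once queries and keys are RMS-normalized, the scaled attention logit is bounded by the square root of the head dimension, however large the raw activations are. The pure-Python sketch below illustrates this with made-up values; `rms_norm` and `attention_logit` are hypothetical helpers, not library functions.

```python
import math

def rms_norm(x: list[float], eps: float = 1e-6) -> list[float]:
    """Parameter-free RMS normalization of a single vector."""
    scale = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * scale for v in x]

def attention_logit(q: list[float], k: list[float]) -> float:
    """Scaled dot-product logit: q.k / sqrt(head_dim)."""
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))

q = [100.0, -250.0, 75.0, 300.0]   # deliberately large activations
k = [80.0, -190.0, 60.0, 310.0]

raw = attention_logit(q, k)
normed = attention_logit(rms_norm(q), rms_norm(k))
bound = math.sqrt(len(q))          # Cauchy-Schwarz limit after normalization

print(raw, normed, bound)
```

Without the norm, a few large activations can push logits into the tens of thousands and saturate the softmax; with it, the logits stay in a fixed range.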
+### **Untied Weights**
+<Sidenote>
+Weight tying between embeddings and the LM head is common ([Press and Wolf](https://arxiv.org/abs/1608.05859) is the usual reference), but untying gives the model more flexibility at the cost of extra parameters.
+</Sidenote>
+Karpathy's implementation deliberately unties the weights between the token embedding and the language model head to provide the model with more flexibility. In gpt.py, these are initialized as two completely separate modules:
+```py
+class GPT(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.config = config
+        self.transformer = nn.ModuleDict({
+            "wte": nn.Embedding(config.vocab_size, config.n_embd),
+            "h": nn.ModuleList([Block(config, layer_idx) for layer_idx in range(config.n_layer)]),
+        })
+        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
+        # ... (rest of init)
+```
+<Sidenote>
+[Gemma 2](https://huggingface.co/docs/transformers/en/model_doc/gemma2) supports both tied and untied weight configurations via the model config.
+</Sidenote>
+In the modular implementation, we inherit from `Gemma2ForCausalLM`. This is a powerful simplification—Gemma 2 also supports untied weights and advanced output structures. By simply inheriting the class, we pull in all the necessary machinery for causal generation, while the configuration object (defined elsewhere) ensures the weights remain untied:
+```py
+class NanoChatForCausalLM(Gemma2ForCausalLM):
+    def forward(self, **super_kwargs) -> CausalLMOutputWithPast:
+        super().forward(**super_kwargs)
+```
+### **ReLU² Activation**
+<Sidenote>
+The [Primer paper](https://arxiv.org/abs/2109.08668) showed squared ReLU can match or exceed GELU performance with lower compute.
+</Sidenote>
+The original implementation replaces the standard GELU activation with ReLU², which is simply ReLU squared. It is cheaper to compute, and the Primer paper reports that it can match or exceed GELU. In gpt.py, this is hardcoded into the MLP block:
+```py
+class MLP(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd, bias=False)
+        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd, bias=False)
+    def forward(self, x):
+        x = self.c_fc(x)
+        x = F.relu(x).square()
+        x = self.c_proj(x)
+        return x
+```
+<Sidenote>
+[CLIP](https://huggingface.co/docs/transformers/en/model_doc/clip) provides a clean MLP structure that nanochat extends with the ReLU² activation.
+</Sidenote>
+In the modular file, we see another surprising inheritance: `CLIPMLP`. The CLIP architecture uses a structure that fits our needs perfectly, so we inherit the structural definition from CLIP and let the configuration drive the specific activation function (ReLU²):
+```py
+class NanoChatMLP(CLIPMLP):
+    def __init__(self, config):
+        super().__init__(config)
+        self.fc1 = nn.Linear(config.hidden_size, config.intermediate_size, bias=False)
+        self.fc2 = nn.Linear(config.intermediate_size, config.hidden_size, bias=False)
+```
+### **Multi-Query Attention (MQA)**
+<Sidenote>
+The [GQA paper](https://arxiv.org/abs/2305.13245) explains how grouped-query attention reduces memory while maintaining quality.
+</Sidenote>
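+NanoChat's attention lets the number of key/value heads differ from the number of query heads. Sharing each key/value head across a group of query heads is Grouped-Query Attention (GQA), and Multi-Query Attention (MQA) is the special case with a single key/value head; both shrink the KV cache. The query head count must be a multiple of the key/value head count, so a model with 10 query heads can use 1, 2, 5, or 10 key/value heads.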
+<Sidenote>
+PyTorch's [scaled_dot_product_attention](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) handles GQA broadcasting automatically via `enable_gqa`.
+</Sidenote>
+In gpt.py, this logic is handled by passing distinct head counts and relying on PyTorch's functional attention to handle the broadcasting (or explicitly handling it during inference):
+```py
+class CausalSelfAttention(nn.Module):
+    # ...
+    def forward(self, x, cos_sin, kv_cache):
+        # ...
+        # Attention: queries attend to keys/values autoregressively. A few cases to handle:
+        enable_gqa = self.n_head != self.n_kv_head # Group Query Attention (GQA): duplicate key/value heads to match query heads if desired
+        if kv_cache is None or Tq == Tk:
+            # During training (no KV cache), attend as usual with causal attention
+            # And even if there is KV cache, we can still use this simple version when Tq == Tk
+            y = F.scaled_dot_product_attention(q, k, v, is_causal=True, enable_gqa=enable_gqa)
+        elif Tq == 1:
+            # During inference but with a single query in this forward pass:
+            # The query has to attend to all the keys/values in the cache
+            y = F.scaled_dot_product_attention(q, k, v, is_causal=False, enable_gqa=enable_gqa)
+        else:
+            # During inference AND we have a chunk of queries in this forward pass:
+            # First, each query attends to all the cached keys/values (i.e. full prefix)
+            attn_mask = torch.zeros((Tq, Tk), dtype=torch.bool, device=q.device) # True = keep, False = mask
+            prefix_len = Tk - Tq
+            if prefix_len > 0: # can't be negative but could be zero
+                attn_mask[:, :prefix_len] = True
+            # Then, causal attention within this chunk
+            attn_mask[:, prefix_len:] = torch.tril(torch.ones((Tq, Tq), dtype=torch.bool, device=q.device))
+            y = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask, enable_gqa=enable_gqa)
+        # ...
+```
+In `modular_nanochat.py`, we don't need to write this logic at all. As seen in the QK Normalization section above, `NanoChatAttention` inherits from `Qwen3Attention`. The Qwen3 implementation is robust and fully supports GQA/MQA out of the box. By using this parent class, we get a production-grade attention implementation "for free," allowing us to focus solely on the unique normalizations required by NanoChat.
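The head sharing behind GQA and MQA can be sketched without any tensors. The helper below is hypothetical: it shows which key/value head each query head reads when key/value heads are repeated in consecutive groups, and it assumes the query head count is a multiple of the key/value head count.

```python
def kv_head_for_query_head(n_head: int, n_kv_head: int) -> list[int]:
    """Return, for each query head, the index of the key/value head it reads."""
    if n_head % n_kv_head != 0:
        raise ValueError("n_head must be a multiple of n_kv_head")
    group_size = n_head // n_kv_head
    # Consecutive query heads share one key/value head.
    return [q // group_size for q in range(n_head)]

print(kv_head_for_query_head(10, 10))  # multi-head: one KV head per query head
print(kv_head_for_query_head(10, 5))   # grouped-query: pairs of heads share a KV head
print(kv_head_for_query_head(10, 1))   # multi-query: every head shares one KV head
```

The KV cache scales with the number of key/value heads, so moving from 10 to 5 key/value heads halves it, and a single key/value head cuts it to a tenth.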
+## Conclusion
+Andrej Karpathy's implementation clearly offers far more to learn from than the `transformers` version, which inherits almost entirely from existing models or features. That said, we can still take a lot away from the inherited modular implementation. Models like Llama, Llama4, Gemma2, Qwen3, and CLIP are all reused to create a genuinely canonical implementation of a transformer.
+# Hands-on Tutorial
+Ok. Let's cut the philosophy and see what we can do with `nanochat` in transformers.
+<Inference />
+<SFT />

app/src/content/assets/image/nanochat-banner.png ADDED Viewed

Git LFS Details

SHA256: 44ed7910d47ba0aac7b6d0929f9ef19661aaf71fe08906efa415189f52c177dc
Pointer size: 130 Bytes
Size of remote file: 79.4 kB

app/src/content/assets/image/tweet.png ADDED Viewed

Git LFS Details

SHA256: 962d1d1a7a45ab5d13c445a9074128d9e1b04e5b679bf30e0de9fbfd2ce304a5
Pointer size: 130 Bytes
Size of remote file: 58.3 kB

app/src/content/chapters/grpo.mdx ADDED Viewed

	@@ -0,0 +1,406 @@

+# [BONUS 3] Group Relative Policy Optimization in `torch`
+- [GRPO](https://colab.research.google.com/#fileId=https%3A//huggingface.co/datasets/nanochat-students/notebooks/blob/main/grpo.ipynb)
+This chapter demonstrates Group Relative Policy Optimization (GRPO) training for the NanoChat model—a reinforcement learning approach for improving model responses based on reward signals.
+## Import model and tokenizer
+```python
+import torch
+from torch.utils.data import DataLoader
+from datasets import load_dataset
+from transformers import AutoModelForCausalLM, AutoTokenizer, get_linear_schedule_with_warmup
+model_id = "karpathy/nanochat-d32"
+revision = "refs/pr/1"
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    revision=revision,
+    torch_dtype=torch.bfloat16 if device.type == "cuda" else torch.float32,
+).to(device)
+tokenizer.pad_token = tokenizer.eos_token
+model.config.pad_token_id = tokenizer.pad_token_id
+```
+## Setup LoRA
+```python
+from peft import LoraConfig, get_peft_model
+lora_config = LoraConfig(
+    r=1,
+    lora_alpha=2,
+    lora_dropout=0.00,
+    task_type="CAUSAL_LM",
+    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "fc1", "fc2"]
+)
+model = get_peft_model(model, lora_config)
+model.print_trainable_parameters()
+```
+```
+trainable params: 1,179,648 || all params: 1,880,227,840 || trainable%: 0.0627
+```
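The trainable parameter count can be reproduced by hand. The sketch below assumes a depth-32 model with hidden size 32 × 64 = 2048, a 4× MLP expansion, and square `q_proj`/`k_proj`/`v_proj`/`o_proj` projections; these shapes are assumptions inferred from the depth rule, not values read from the checkpoint.

```python
def lora_params(in_features: int, out_features: int, r: int) -> int:
    """LoRA adds A (r x in) and B (out x r) to each adapted linear layer."""
    return r * (in_features + out_features)

hidden = 32 * 64     # assumed hidden size for depth 32
mlp = 4 * hidden     # assumed 4x MLP expansion
layers, r = 32, 1    # assumed layer count; r matches LoraConfig(r=1)

per_layer = (
    4 * lora_params(hidden, hidden, r)   # q_proj, k_proj, v_proj, o_proj (assumed square)
    + lora_params(hidden, mlp, r)        # fc1
    + lora_params(mlp, hidden, r)        # fc2
)
total = per_layer * layers
print(f"{total:,}")  # 1,179,648
```

Under these assumptions the total matches the figure reported by `print_trainable_parameters()`, which is a useful sanity check that every targeted module actually received an adapter.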
+## Demo the model
+Test with a plain autoregressive prompt:
+```python
+print("=" * 80)
+print("TEST 1: Plain Autoregressive Prompt")
+print("=" * 80)
+prompt = "The Eiffel Tower stands in Paris and"
+test_inputs = tokenizer(prompt, return_tensors="pt").to(device)
+with torch.no_grad():
+    test_outputs = model.generate(
+        **test_inputs,
+        max_new_tokens=64,
+        do_sample=False,
+        pad_token_id=tokenizer.pad_token_id,
+    )
+generated_tokens = test_outputs[0, test_inputs["input_ids"].shape[1] :]
+print(f"Prompt: {prompt}")
+print(f"\nGenerated: {tokenizer.decode(generated_tokens, skip_special_tokens=True)}")
+print("=" * 80)
+```
+```
+================================================================================
+TEST 1: Plain Autoregressive Prompt
+================================================================================
+Prompt: The Eiffel Tower stands in Paris and
+Generated:  is one of the most famous landmarks in the world. It is located on the Champ de Mars in the heart of the city. The tower was built for the 1889 World's Fair. It was designed by the French engineer Gustave Eiffel and took 2 years to build. The Eiffel Tower stands 324 meters
+================================================================================
+```
+And with the chat template:
+```python
+print("=" * 80)
+print("TEST 2: Chat Template")
+print("="*80)
+conversation = [
+    {"role": "user", "content": "What is the capital of France?"},
+]
+inputs = tokenizer.apply_chat_template(
+    conversation, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
+).to(device)
+print(f"Formatted prompt: {tokenizer.decode(inputs['input_ids'][0])}")
+print(f"Input IDs: {inputs['input_ids'][0].tolist()}")
+with torch.no_grad():
+    outputs = model.generate(
+        **inputs,
+        max_new_tokens=64,
+        do_sample=False
+    )
+generated_tokens = outputs[0, inputs["input_ids"].shape[1] :]
+print(f"\nGenerated: {tokenizer.decode(generated_tokens)}")
+print("=" * 80)
+```
+```
+================================================================================
+TEST 2: Chat Template
+================================================================================
+Formatted prompt: <|bos|><|user_start|>What is the capital of France?<|user_end|><|assistant_start|>
+Input IDs: [65527, 65528, 1442, 309, 261, 3429, 281, 4215, 63, 65529, 65530]
+Generated: The capital of France is Paris.<|assistant_end|>
+================================================================================
+```
+## Dataset
+We use the OpenR1-Math dataset for math reasoning tasks:
+```python
+raw_dataset = load_dataset("HuggingFaceH4/OpenR1-Math-220k-default-verified", split="train")
+splits = raw_dataset.train_test_split(test_size=0.1, seed=42)
+train_dataset = splits["train"]
+eval_dataset = splits["test"]
+```
+## Training Configuration
+```python
+max_train_steps = 50
+prompt_batch_size = 1
+num_generations = 4
+max_new_tokens = 128
+temperature = 1.0
+top_k = 50
+learning_rate = 5e-6
+weight_decay = 0.0
+epsilon = 0.2
+gradient_accumulation_steps = 1
+warmup_ratio = 0.1
+logging_frequency = 5
+max_train_samples = 1000
+max_eval_samples = 100
+```
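Two of these settings drive the core of GRPO: `num_generations` sets the size of the group that rewards are normalized within, and `epsilon` is assumed here to be the PPO-style clip range. The pure-Python sketch below illustrates both ideas on toy numbers; the helper names are hypothetical and this is not the training code itself.

```python
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """Normalize rewards within one prompt's group of sampled completions."""
    mean = statistics.mean(rewards)
    std = statistics.stdev(rewards) if len(rewards) > 1 else 0.0  # sample std
    if std == 0:
        std = 1.0  # every completion scored the same: advantages are all zero
    return [(r - mean) / std for r in rewards]

def clipped_objective(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """PPO-style clipped surrogate for a single token."""
    clipped_ratio = max(1 - epsilon, min(1 + epsilon, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

mixed = group_advantages([1.0, 0.0, 0.0, 1.0])   # two good, two bad completions
flat = group_advantages([1.0, 1.0, 1.0, 1.0])    # no signal in this group
capped = clipped_objective(ratio=1.5, advantage=1.0)

print(mixed)
print(flat)
print(capped)
```

Because advantages are relative to the group, a prompt where all completions earn the same reward contributes no gradient, and the clip keeps any single update from moving the policy too far from the one that generated the samples.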
+## Reward Functions
+GRPO requires reward functions to guide the policy optimization. We define several:
+```python
+import re
+import numpy as np
+import torch.nn.functional as F
+from contextlib import nullcontext
+def think_format_reward(completions):
+    """
+    Reward function that checks if the reasoning process is enclosed within <think> and </think> tags.
+    Returns 1.0 if the format is correct, otherwise 0.0.
+    """
+    pattern = r"^<think>(?!.*<think>)(.*?)</think>.*$"
+    matches = [re.match(pattern, content, re.DOTALL | re.MULTILINE) for content in completions]
+    return [1.0 if match else 0.0 for match in matches]
+def accuracy_reward(completions, solutions):
+    """
+    Reward function that checks if the completion matches the solution.
+    For simplicity, we'll do basic string matching here.
+    """
+    rewards = []
+    for completion, solution in zip(completions, solutions):
+        # Simple string matching (normalized)
+        reward = 1.0 if solution.strip().lower() in completion.strip().lower() else 0.0
+        rewards.append(reward)
+    return rewards
+def min_length_reward(completions, min_length=10):
+    """
+    Reward function that checks if the completion is at least a certain length.
+    Returns 1.0 if the length is greater than or equal to the minimum length, otherwise 0.0.
+    """
+    return [1.0 if len(completion) >= min_length else 0.0 for completion in completions]
+def combined_reward(completions, solutions):
+    """
+    Combines format, accuracy, and minimum-length rewards with equal weight.
+    """
+    format_rewards = think_format_reward(completions)
+    accuracy_rewards = accuracy_reward(completions, solutions)
+    min_length_rewards = min_length_reward(completions)
+    return [np.mean([f, a, m]) for f, a, m in zip(format_rewards, accuracy_rewards, min_length_rewards)]
+```
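To get a feel for how the combined reward behaves, here is a compact, self-contained sketch that scores three toy completions. It assumes a format pattern that requires one opening `<think>` tag followed by a closing tag, and the `score` helper is hypothetical shorthand for the three checks.

```python
import re

# Assumed pattern: starts with <think>, no second <think>, then </think>.
THINK_PATTERN = r"^<think>(?!.*<think>)(.*?)</think>.*$"

def score(completion: str, solution: str, min_length: int = 10) -> float:
    """Average of format, accuracy, and minimum-length checks (each 0.0 or 1.0)."""
    fmt = 1.0 if re.match(THINK_PATTERN, completion, re.DOTALL) else 0.0
    acc = 1.0 if solution.strip().lower() in completion.strip().lower() else 0.0
    length = 1.0 if len(completion) >= min_length else 0.0
    return (fmt + acc + length) / 3

full = score("<think>2 + 2 = 4</think>The answer is 4", "4")   # all three checks pass
terse = score("4", "4")                                        # correct, but no format or length
wrong = score("I am not sure about this one", "4")             # long enough, nothing else

print(full, terse, wrong)
```

Substring matching is a blunt accuracy check: any completion that happens to contain the solution string earns the accuracy point, so a stricter answer parser is worth considering for real runs.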
+## Helper Functions
+```python
+def per_token_log_probs(logits, labels):
+    logits = logits.float()
+    log_probs = F.log_softmax(logits, dim=-1)
+    return log_probs.gather(dim=-1, index=labels.unsqueeze(-1)).squeeze(-1)
+def prepare_prompt(example, problem_key="problem", solution_key="solution"):
+    # Wrap the problem text in a single user message for the chat template
+    prompt = example.get(problem_key, "")
+    messages = [{"role": "user", "content": prompt}]
+    formatted = tokenizer.apply_chat_template(
+        messages,
+        add_generation_prompt=True,
+        truncation=True,
+        max_length=2048,
+        padding=False,
+        return_dict=True,
+        return_tensors="pt",
+    )
+    return formatted["input_ids"], formatted["attention_mask"]
+if device.type == "cuda":
+    autocast_ctx = torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16)
+else:
+    autocast_ctx = nullcontext()
+```
+## Optimizer and Scheduler
+```python
+optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate, weight_decay=weight_decay)
+total_update_steps = max_train_steps // gradient_accumulation_steps
+warmup_steps = max(1, int(total_update_steps * warmup_ratio))
+scheduler = get_linear_schedule_with_warmup(optimizer, warmup_steps, total_update_steps)
+```
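The shape of a linear warmup and decay schedule is easy to check by hand. The sketch below is a hypothetical pure-Python stand-in for the multiplier such a schedule applies to the base learning rate, using the values from the Training Configuration section; it is not the `transformers` implementation.

```python
def linear_warmup_multiplier(step: int, warmup_steps: int, total_steps: int) -> float:
    """Learning-rate multiplier: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

learning_rate = 5e-6
total_steps = 50 // 1                          # max_train_steps // gradient_accumulation_steps
warmup_steps = max(1, int(total_steps * 0.1))  # warmup_ratio = 0.1

for step in (0, 1, 5, 25, 50):
    lr = learning_rate * linear_warmup_multiplier(step, warmup_steps, total_steps)
    print(step, lr)
```

With only 50 steps, the warmup covers the first 5 updates and the learning rate then falls steadily, so the peak rate of 5e-6 is reached for a single step.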
+## The Training Loop
+The GRPO training loop generates multiple completions per prompt, computes rewards, and updates the policy using a clipped objective similar to PPO:
+```python
+# Sample dataset if needed
+if max_train_samples is not None and len(train_dataset) > max_train_samples:
+    train_dataset = train_dataset.select(range(max_train_samples))
+if max_eval_samples is not None and len(eval_dataset) > max_eval_samples:
+    eval_dataset = eval_dataset.select(range(max_eval_samples))
+model.train()
+train_index = 0
+global_step = 0
+running_reward = 0.0
+running_loss = 0.0
+for step in range(1, max_train_steps + 1):
+    example = train_dataset[train_index % len(train_dataset)]
+    train_index += 1
+    prompt_ids, prompt_mask = prepare_prompt(example)
+    prompt_ids = prompt_ids.to(device)
+    prompt_mask = prompt_mask.to(device)
+    prompt_length = prompt_ids.shape[1]
+    prompt_repeat = prompt_ids.repeat(num_generations, 1)
+    mask_repeat = prompt_mask.repeat(num_generations, 1)
+    # Generate completions
+    model.eval()
+    with torch.no_grad():
+        generated = model.generate(
+            input_ids=prompt_repeat,
+            attention_mask=mask_repeat,
+            max_new_tokens=max_new_tokens,
+            do_sample=True,
+            temperature=temperature,
+            top_k=top_k,
+            pad_token_id=tokenizer.pad_token_id,
+        )
+    model.train()
+    sequences = generated
+    attention_mask = (sequences != tokenizer.pad_token_id).long()
+    completion_mask = attention_mask.clone()
+    completion_mask[:, :prompt_length] = 0
+    completion_tokens = sequences[:, prompt_length:]
+    completion_texts = tokenizer.batch_decode(completion_tokens, skip_special_tokens=True)
+    # Get solution
+    solution = example.get("solution", example.get("answer", ""))
+    solutions = [solution] * num_generations
+    # Compute rewards
+    rewards = combined_reward(completion_texts, solutions)
+    rewards = torch.tensor(rewards, dtype=torch.float32, device=device)
+    running_reward += rewards.mean().item()
+    rewards_view = rewards.view(prompt_batch_size, num_generations)
+    mean_rewards = rewards_view.mean(dim=1, keepdim=True)
+    std_rewards = rewards_view.std(dim=1, keepdim=True)
+    std_rewards = torch.where(std_rewards > 0, std_rewards, torch.ones_like(std_rewards))
+    advantages = ((rewards_view - mean_rewards) / std_rewards).view(-1)
+    labels = sequences[:, 1:].clone()
+    labels[attention_mask[:, 1:] == 0] = tokenizer.pad_token_id
+    # Compute old log probs
+    with torch.no_grad():
+        with autocast_ctx:
+            old_outputs = model(
+                input_ids=sequences,
+                attention_mask=attention_mask,
+                use_cache=False,
+            )
+        old_log_probs = per_token_log_probs(old_outputs.logits[:, :-1], labels)
+    valid_mask = (completion_mask[:, 1:] == 1) & (labels != tokenizer.pad_token_id)
+    # Compute loss
+    optimizer.zero_grad(set_to_none=True)
+    with autocast_ctx:
+        outputs = model(
+            input_ids=sequences,
+            attention_mask=attention_mask,
+            use_cache=False,
+        )
+        log_probs = per_token_log_probs(outputs.logits[:, :-1], labels)
+    ratio = (log_probs - old_log_probs).exp()
+    ratio = torch.where(valid_mask, ratio, torch.ones_like(ratio))
+    clipped_ratio = ratio.clamp(1.0 - epsilon, 1.0 + epsilon)
+    adv = advantages.unsqueeze(1)
+    loss_unclipped = ratio * adv
+    loss_clipped = clipped_ratio * adv
+    per_token_loss = -torch.min(loss_unclipped, loss_clipped)
+    per_token_loss = torch.where(valid_mask, per_token_loss, torch.zeros_like(per_token_loss))
+    denom = valid_mask.sum().clamp(min=1)
+    loss = per_token_loss.sum() / denom
+    loss.backward()
+    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
+    optimizer.step()
+    scheduler.step()
+    global_step += 1
+    running_loss += loss.item()
+    if step % logging_frequency == 0:
+        avg_reward = running_reward / logging_frequency
+        avg_loss = running_loss / logging_frequency
+        current_lr = scheduler.get_last_lr()[0]
+        print(
+            f"step={step:04d} | loss={avg_loss:.4f} | avg_reward={avg_reward:.4f} | lr={current_lr:.2e}"
+        )
+        running_reward = 0.0
+        running_loss = 0.0
+        # Sample evaluation
+        model.eval()
+        eval_example = eval_dataset[0]
+        prompt_ids, prompt_mask = prepare_prompt(eval_example)
+        with torch.no_grad():
+            eval_sequences = model.generate(
+                input_ids=prompt_ids.to(device),
+                attention_mask=prompt_mask.to(device),
+                max_new_tokens=max_new_tokens,
+                do_sample=True,
+                top_k=top_k,
+                temperature=temperature,
+                pad_token_id=tokenizer.pad_token_id,
+            )
+        model.train()
+        completion = eval_sequences[0, prompt_ids.shape[1] :]
+        print("Sample eval completion:", tokenizer.decode(completion, skip_special_tokens=True)[:100])
+print("Training complete.")
+```
+```
+step=0005 | loss=0.0000 | avg_reward=0.4000 | lr=0.00e+00
+Sample eval completion: 3^4 - 11 and 3^6 - 17
+step=0010 | loss=0.0000 | avg_reward=0.3333 | lr=0.00e+00
+Sample eval completion: 11.
+This statement refers to an optimization problem where we seek to find the smallest prime \( p
+step=0015 | loss=0.0000 | avg_reward=0.4667 | lr=0.00e+00
+Sample eval completion: What number has two prime factors, 1 and itself, without additional restrictions? One possible combi
+step=0020 | loss=-0.0983 | avg_reward=0.4500 | lr=0.00e+00
+...
+Training complete.
+```
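The two pieces of arithmetic at the heart of the loop, the group-relative advantage and the clipped objective, can be sketched in plain Python. This is an illustrative sketch rather than the tutorial's implementation: `group_advantages` and `clipped_token_loss` are hypothetical helper names, and they work on Python floats instead of tensors. The sample standard deviation is used because `torch.std` applies the same correction by default.

```python
import math
import statistics


def group_advantages(rewards):
    """Normalize rewards within one prompt's group of completions."""
    mean = statistics.mean(rewards)
    # Sample standard deviation, matching the default behavior of torch.std
    std = statistics.stdev(rewards) if len(rewards) > 1 else 0.0
    # If every completion earned the same reward, fall back to 1 to avoid dividing by zero
    std = std if std > 0 else 1.0
    return [(reward - mean) / std for reward in rewards]


def clipped_token_loss(log_prob, old_log_prob, advantage, epsilon=0.2):
    """PPO-style clipped surrogate for one token, negated because we minimize."""
    ratio = math.exp(log_prob - old_log_prob)
    clipped_ratio = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
    return -min(ratio * advantage, clipped_ratio * advantage)


advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
print(clipped_token_loss(log_prob=-1.0, old_log_prob=-1.5, advantage=advantages[0]))
```

When `log_prob` equals `old_log_prob`, the ratio is exactly 1 and the per-token loss reduces to the negated advantage. Advantages within a group sum to zero, so a loss close to zero is expected whenever the old and new log-probabilities come from the same weights.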

app/src/content/chapters/inference.mdx ADDED Viewed

	@@ -0,0 +1,97 @@

+## Inference on `nanochat` in Transformers
+This first bonus tutorial shows how to run basic inference in `transformers`:
+```py
+import torch
+from transformers import AutoTokenizer, NanoChatForCausalLM
+tokenizer = AutoTokenizer.from_pretrained("nanochat-students/nanochat-d20")
+model = NanoChatForCausalLM.from_pretrained("nanochat-students/nanochat-d20")
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+model = model.to(device)
+prompt = "Hello, how are you?"
+inputs = tokenizer(prompt, return_tensors="pt").to(device)
+inputs.pop("token_type_ids", None)
+outputs = model.generate(**inputs, max_new_tokens=100)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```
+### Inference in `transformers` with `vLLM`
+Next, let's use `transformers` as a backend for `vLLM` to serve the model for optimized inference.
+We'll need to install `transformers` from main:
+```sh
+pip install git+https://github.com/huggingface/transformers.git@main
+```
+Then we can start a `vLLM` server like so:
+```
+vllm serve nanochat-students/nanochat-d20 --enforce-eager --revision refs/pr/1
+```
+Finally, we can call the server like so:
+```sh
+curl -X POST "http://localhost:8000/v1/completions" \
+	-H "Content-Type: application/json" \
+	--data '{
+		"model": "nanochat-students/nanochat-d20",
+		"prompt": "Once upon a time,",
+		"max_tokens": 512,
+		"temperature": 0.5
+	}'
+```
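The same request can be built from Python using only the standard library. This is a minimal sketch that assumes the server from the previous step is listening on `localhost:8000`; `build_completion_request` is a hypothetical helper name, and the actual network call is left commented out so the snippet runs without a server.

```python
import json
import urllib.request


def build_completion_request(prompt, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {
        "model": "nanochat-students/nanochat-d20",
        "prompt": prompt,
        "max_tokens": 512,
        "temperature": 0.5,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_completion_request("Once upon a time,")
print(request.get_method(), request.full_url)

# Sending the request requires the vLLM server to be running:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["text"])
```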
+### Inference on your trained `nanochat` weights
+Let's say you've followed the nanochat repo and used it to train a model. Then you can add `transformers` compatibility to your model and use it in other libraries.
+1. Download any `nanochat` checkpoint from the Hub. Here we use Karpathy's, but this could be yours:
+```
+hf download karpathy/nanochat-d34 --local-dir nanochat-d34
+```
+2. Convert the checkpoint to `transformers` format using the conversion script:
+```
+uv run \
+--with "transformers @ git+https://github.com/huggingface/transformers.git@main" \
+--with "tiktoken>=0.12.0" \
+https://raw.githubusercontent.com/huggingface/transformers/main/src/transformers/models/nanochat/convert_nanochat_checkpoints.py \
+--input_dir ./nanochat-d34 \
+--output_dir ./nanochat-d3-hf
+```
+3. (optional) Upload the checkpoint to the Hugging Face Hub
+```
+hf upload <username>/nanochat-d34 ./nanochat-d3-hf
+```
+4. As above, you can generate with your model in `transformers`.
+```py
+import torch
+from transformers import AutoTokenizer, NanoChatForCausalLM
+tokenizer = AutoTokenizer.from_pretrained("./nanochat-d3-hf")
+model = NanoChatForCausalLM.from_pretrained("./nanochat-d3-hf")
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+model = model.to(device)
+prompt = "Hello, how are you?"
+inputs = tokenizer(prompt, return_tensors="pt").to(device)
+inputs.pop("token_type_ids", None)
+outputs = model.generate(**inputs, max_new_tokens=100)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```

app/src/content/chapters/sft.mdx ADDED Viewed

	@@ -0,0 +1,400 @@

+import Sidenote from '../../components/Sidenote.astro'
+import Note from '../../components/Note.astro'
+# [BONUS 2] Supervised Fine-tuning in `torch`
+<Sidenote>
+[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/#fileId=https%3A//huggingface.co/datasets/nanochat-students/notebooks/blob/main/sft.ipynb)
+</Sidenote>
+Supervised Fine-Tuning (SFT) is the process of adapting a pre-trained language model to follow instructions by training it on curated input-output pairs. Unlike pre-training which learns general language patterns from massive text corpora, SFT teaches the model *how* to respond—following a specific format, tone, or task structure.
+In this tutorial, we'll fine-tune the NanoChat model using pure PyTorch, giving you complete visibility into every step of the training process.
+<Note>
+**Want a production-ready solution?** TRL is Hugging Face's reinforcement learning library with battle-tested SFT implementations. Check out the [SFT notebook](https://github.com/huggingface/trl/blob/main/examples/notebooks/sft_trl_lora_qlora.ipynb) to use it with your nanochat checkpoint.
+</Note>
+## Import model and tokenizer
+<Sidenote>
+Learn more about [AutoModelForCausalLM](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForCausalLM) and the [from_pretrained](https://huggingface.co/docs/transformers/main_classes/model#transformers.PreTrainedModel.from_pretrained) method.
+</Sidenote>
+We start by loading the pre-trained NanoChat model and its tokenizer. The `revision` parameter points to a specific model version—useful when models are updated frequently or you want reproducible results.
+```python
+import torch
+from torch.utils.data import DataLoader
+from datasets import load_dataset
+from transformers import AutoModelForCausalLM, AutoTokenizer, get_linear_schedule_with_warmup
+model_id = "karpathy/nanochat-d32"
+revision = "refs/pr/1"
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    revision=revision,
+    torch_dtype=torch.bfloat16 if device.type == "cuda" else torch.float32,
+).to(device)
+```
+We use `bfloat16` precision on GPU to reduce memory usage while maintaining training stability. On CPU, we fall back to `float32` for compatibility.
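To make the memory saving concrete, here is a rough back-of-the-envelope estimate. It assumes a round figure of about 1.88 billion parameters and counts the weights only; gradients, optimizer state, and activations come on top. `weight_memory_gib` is a hypothetical helper for illustration.

```python
# Bytes used to store one parameter in each dtype
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_memory_gib(num_params, dtype):
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


num_params = 1_880_000_000  # assumption: roughly 1.88B parameters
for dtype in ("float32", "bfloat16"):
    print(f"{dtype}: {weight_memory_gib(num_params, dtype):.2f} GiB")
```

Halving the bytes per parameter halves the footprint of the weights, which is often the difference between fitting on a single GPU or not.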
+## Setup LoRA
+<Sidenote>
+Read the [LoRA paper](https://arxiv.org/abs/2106.09685) or explore [PEFT documentation](https://huggingface.co/docs/peft) for a deeper understanding of low-rank adaptation.
+</Sidenote>
+Training all 1.8B parameters would require significant GPU memory and risk catastrophic forgetting. Instead, we use **LoRA (Low-Rank Adaptation)** which freezes the original weights and injects small trainable matrices into specific layers.
+The key parameters:
+- **`r=1`**: The rank of the low-rank matrices. Lower = fewer parameters, but potentially less expressiveness
+- **`lora_alpha=2`**: Scaling factor for LoRA updates (typically `2 * r`)
+- **`target_modules`**: Which layers to adapt—we target all attention projections and the MLP
+```python
+from peft import LoraConfig, get_peft_model
+lora_config = LoraConfig(
+    r=1,
+    lora_alpha=2,
+    lora_dropout=0.00,
+    task_type="CAUSAL_LM",
+    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "fc1", "fc2"]
+)
+model = get_peft_model(model, lora_config)
+model.print_trainable_parameters()
+```
+```
+trainable params: 1,179,648 || all params: 1,880,227,840 || trainable%: 0.0627
+```
+With LoRA, we're only training **0.06%** of the model's parameters—just over 1 million weights instead of 1.8 billion. This makes fine-tuning feasible on consumer hardware.
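The trainable-parameter count can be checked by hand. A rank-`r` adapter on a `d_out x d_in` weight adds two matrices with `r * (d_in + d_out)` parameters in total. The layer shapes below (32 layers, hidden size 2048, MLP width 8192) are assumptions, not values stated in this tutorial; they were chosen because they reproduce the printed count.

```python
def lora_params(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Assumed shapes (not stated in the tutorial)
n_layers, hidden, mlp_width, r = 32, 2048, 8192, 1

per_layer = (
    4 * lora_params(hidden, hidden, r)    # q_proj, k_proj, v_proj, o_proj
    + lora_params(hidden, mlp_width, r)   # fc1
    + lora_params(mlp_width, hidden, r)   # fc2
)
trainable = n_layers * per_layer
all_params = 1_880_227_840  # total reported by print_trainable_parameters()

print(f"trainable params: {trainable:,} || trainable%: {100 * trainable / all_params:.4f}")
```

Because the count scales linearly with `r`, doubling the rank doubles the number of trainable parameters.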
+## Demo the model
+<Sidenote>
+The [generate](https://huggingface.co/docs/transformers/main_classes/text_generation) method supports many decoding strategies: greedy, beam search, sampling, and more.
+</Sidenote>
+Before training, let's verify the model works correctly. We'll test two modes: raw text completion and chat-formatted generation.
+**Plain autoregressive completion** continues text naturally:
+```python
+print("=" * 80)
+print("TEST 1: Plain Autoregressive Prompt")
+print("=" * 80)
+prompt = "The Eiffel Tower stands in Paris and"
+test_inputs = tokenizer(prompt, return_tensors="pt").to(device)
+with torch.no_grad():
+    test_outputs = model.generate(
+        **test_inputs,
+        max_new_tokens=64,
+        do_sample=False,
+        pad_token_id=tokenizer.pad_token_id,
+    )
+generated_tokens = test_outputs[0, test_inputs["input_ids"].shape[1] :]
+print(f"Prompt: {prompt}")
+print(f"\nGenerated: {tokenizer.decode(generated_tokens, skip_special_tokens=True)}")
+print("=" * 80)
+```
+```
+================================================================================
+TEST 1: Plain Autoregressive Prompt
+================================================================================
+Prompt: The Eiffel Tower stands in Paris and
+Generated:  is one of the most famous landmarks in the world. It is located on the Champ de Mars in the heart of the city. The tower was built for the 1889 World's Fair. It was designed by the French engineer Gustave Eiffel and took 2 years to build. The Eiffel Tower stands 324 meters
+================================================================================
+```
+<Sidenote>
+Chat templates ensure consistent formatting. See [chat templating guide](https://huggingface.co/docs/transformers/en/chat_templating) for details on how different models structure conversations.
+</Sidenote>
+The chat template wraps the input in special tokens that the model learned during instruction tuning:
+```python
+print("=" * 80)
+print("TEST 2: Chat Template")
+print("="*80)
+conversation = [
+    {"role": "user", "content": "What is the capital of France?"},
+]
+inputs = tokenizer.apply_chat_template(
+    conversation, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
+).to(device)
+print(f"Formatted prompt: {tokenizer.decode(inputs['input_ids'][0])}")
+print(f"Input IDs: {inputs['input_ids'][0].tolist()}")
+with torch.no_grad():
+    outputs = model.generate(
+        **inputs,
+        max_new_tokens=64,
+        do_sample=False
+    )
+generated_tokens = outputs[0, inputs["input_ids"].shape[1] :]
+print(f"\nGenerated: {tokenizer.decode(generated_tokens)}")
+print("=" * 80)
+```
+```
+================================================================================
+TEST 2: Chat Template
+================================================================================
+Formatted prompt: <|bos|><|user_start|>What is the capital of France?<|user_end|><|assistant_start|>
+Input IDs: [65527, 65528, 1442, 309, 261, 3429, 281, 4215, 63, 65529, 65530]
+Generated: The capital of France is Paris.<|assistant_end|>
+================================================================================
+```
+Notice the special tokens: `<|bos|>`, `<|user_start|>`, `<|assistant_start|>`, etc. These delimiters help the model understand conversation structure.
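To see how the delimiters fit together, the format printed above can be reproduced with a few lines of string handling. `render_chat` is a hypothetical stand-in written for illustration; the tokenizer's real chat template may treat system messages and edge cases differently.

```python
def render_chat(messages, add_generation_prompt=True):
    """Hypothetical re-implementation of the format shown above, not the real template."""
    text = "<|bos|>"
    for message in messages:
        role = message["role"]  # "user" or "assistant"
        text += f"<|{role}_start|>{message['content']}<|{role}_end|>"
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply
        text += "<|assistant_start|>"
    return text


conversation = [{"role": "user", "content": "What is the capital of France?"}]
print(render_chat(conversation))
```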
+## Dataset
+<Sidenote>
+Explore the [SmolTalk2 dataset](https://huggingface.co/datasets/HuggingFaceTB/smoltalk2) on the Hub. Its `OpenThoughts3_1.2M_think` split, used here, contains instruction-following examples with chain-of-thought reasoning.
+</Sidenote>
+For SFT, we need high-quality instruction-response pairs. We'll use the OpenThoughts3 split of SmolTalk2, which is designed for training models to reason step-by-step before answering.
+```python
+raw_dataset = load_dataset("HuggingFaceTB/smoltalk2", "SFT", split="OpenThoughts3_1.2M_think")
+splits = raw_dataset.train_test_split(test_size=0.1, seed=42)
+train_dataset = splits["train"]
+eval_dataset = splits["test"]
+```
+### Process the Dataset
+<Sidenote>
+The [datasets map](https://huggingface.co/docs/datasets/process#map) function applies transformations efficiently with caching and multiprocessing support.
+</Sidenote>
+Raw examples contain message lists that need to be converted into token sequences. The `apply_chat_template` method handles this conversion, inserting the appropriate special tokens.
+We limit examples to 2048 tokens and cap the dataset size to make training tractable on limited hardware:
+```python
+max_length = 2048
+max_train_examples = 20000
+max_eval_examples = 1000
+def format_example(example):
+    formatted = tokenizer.apply_chat_template(
+        example["messages"],
+        add_generation_prompt=False,
+        truncation=True,
+        max_length=max_length,
+        padding=False,
+        return_dict=True,
+        return_tensors="pt",
+    )
+    return {
+        "input_ids": formatted["input_ids"][0].tolist(),
+        "attention_mask": formatted["attention_mask"][0].tolist(),
+    }
+train_dataset = train_dataset.select(range(min(len(train_dataset), max_train_examples)))
+train_dataset = train_dataset.map(format_example, remove_columns=train_dataset.column_names)
+eval_dataset = eval_dataset.select(range(min(len(eval_dataset), max_eval_examples)))
+eval_dataset = eval_dataset.map(format_example, remove_columns=eval_dataset.column_names)
+```
+## Training Configuration
+These hyperparameters control the training dynamics. We use conservative values that work well across different hardware:
+```python
+train_batch_size = 2
+eval_batch_size = 2
+num_epochs = 1
+gradient_accumulation_steps = 4
+learning_rate = 1e-5
+weight_decay = 0.0
+warmup_ratio = 0.03
+logging_frequency = 10
+```
+<Sidenote>
+**Gradient accumulation** simulates larger batch sizes by accumulating gradients over multiple forward passes before updating weights. Effective batch size = `train_batch_size × gradient_accumulation_steps` = 8.
+</Sidenote>
+Two choices are worth calling out. The learning rate (`1e-5`) is deliberately conservative: only the small LoRA matrices are updated while the base weights stay frozen, and LoRA runs often tolerate larger values, so treat it as a safe starting point rather than a tuned setting. Gradient accumulation gives a larger effective batch size without holding more examples in memory at once, which helps on GPUs with limited memory.
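A toy example shows why dividing each micro-batch loss by `gradient_accumulation_steps` works. The sketch below uses a one-parameter linear model and a hand-written gradient, both invented for illustration, and checks that four accumulated micro-batches of size 2 give the same gradient as one batch of size 8.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to the scalar weight w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]

micro_batch_size, accumulation_steps = 2, 4
accumulated = 0.0
for i in range(accumulation_steps):
    start = i * micro_batch_size
    end = start + micro_batch_size
    # Scale each micro-batch gradient, mirroring `loss / gradient_accumulation_steps`
    accumulated += grad_mse(w, xs[start:end], ys[start:end]) / accumulation_steps

full_batch = grad_mse(w, xs, ys)
print(accumulated, full_batch)
```

The match is exact here because every micro-batch has the same size. With padded, variable-length sequences, each micro-batch averages over a different number of tokens, so the accumulated gradient only approximates the large-batch one.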
+## Create a DataLoader
+<Sidenote>
+PyTorch's [DataLoader](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) handles batching, shuffling, and parallel data loading automatically.
+</Sidenote>
+The collate function pads variable-length sequences to the same length within each batch and creates the labels tensor for loss computation:
+```python
+def collate_fn(batch):
+    batch_dict = {
+        "input_ids": [record["input_ids"] for record in batch],
+        "attention_mask": [record["attention_mask"] for record in batch],
+    }
+    padded = tokenizer.pad(batch_dict, padding=True, return_tensors="pt")
+    labels = padded["input_ids"].clone()
+    labels[padded["attention_mask"] == 0] = -100
+    padded["labels"] = labels
+    return padded
+train_loader = DataLoader(train_dataset, batch_size=train_batch_size, shuffle=True, collate_fn=collate_fn)
+eval_loader = DataLoader(eval_dataset, batch_size=eval_batch_size, shuffle=False, collate_fn=collate_fn)
+```
+Setting padding tokens to `-100` in labels tells PyTorch's cross-entropy loss to ignore them—we don't want to penalize the model for not predicting padding.
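The effect of the `-100` label can be sketched without PyTorch. The function below is a simplified, pure-Python stand-in for a cross-entropy loss with an ignore index (`masked_cross_entropy` is a hypothetical name): positions labelled `-100` contribute nothing to the total and are left out of the average.

```python
import math

IGNORE_INDEX = -100


def masked_cross_entropy(logits, labels):
    """Mean negative log-likelihood over positions whose label is not IGNORE_INDEX."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # padding position: no loss, no gradient
        log_normalizer = math.log(sum(math.exp(value) for value in row))
        total += log_normalizer - row[label]
        count += 1
    return total / max(count, 1)


logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [5.0, 5.0, 5.0]]
labels = [0, 1, IGNORE_INDEX]
print(masked_cross_entropy(logits, labels))
```

Changing the logits at the ignored position leaves the result untouched, which is the behavior we want for padding.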
+## Optimizer
+<Sidenote>
+[AdamW](https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html) decouples weight decay from the gradient update, improving regularization behavior compared to L2 regularization in standard Adam.
+</Sidenote>
+AdamW is the standard optimizer for transformer fine-tuning. It combines Adam's adaptive learning rates with proper weight decay:
+```python
+optimizer = torch.optim.AdamW(
+    model.parameters(),
+    lr=learning_rate,
+    weight_decay=weight_decay,
+)
+```
+## Learning Rate Scheduler
+<Sidenote>
+Warmup prevents early instability when the model hasn't yet adapted to the new task. See [this explanation](https://huggingface.co/docs/transformers/main_classes/optimizer_schedules#transformers.get_linear_schedule_with_warmup) for more details.
+</Sidenote>
+A linear schedule with warmup gradually increases the learning rate at the start of training (warmup), then linearly decreases it to zero. This helps stabilize early training and improves final performance:
+```python
+num_update_steps_per_epoch = max(len(train_loader) // gradient_accumulation_steps, 1)
+max_train_steps = num_epochs * num_update_steps_per_epoch
+warmup_steps = max(1, int(max_train_steps * warmup_ratio))
+scheduler = get_linear_schedule_with_warmup(optimizer, warmup_steps, max_train_steps)
+```
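The shape of the schedule is easy to reproduce by hand. The sketch below computes the multiplier that a linear warmup and decay schedule applies to the base learning rate; `linear_warmup_decay` is a hypothetical helper, and the step count of 2,500 is an assumption derived from 20,000 examples, a batch size of 2, and 4 accumulation steps.

```python
def linear_warmup_decay(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to 1
        return step / max(1, warmup_steps)
    # Decay: ramp linearly from 1 down to 0 at the final step
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 1e-5
total_steps = 2500  # assumption: 20,000 examples / batch size 2 / 4 accumulation steps
warmup_steps = max(1, int(total_steps * 0.03))

for step in (10, 20, warmup_steps, total_steps):
    lr = base_lr * linear_warmup_decay(step, warmup_steps, total_steps)
    print(f"step={step:05d} | lr={lr:.2e}")
```

Under these assumptions the values at steps 10 and 20 line up with the learning rates in the training log further down, which is a quick way to confirm the scheduler is wired up as intended.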
+## The Training Loop
+<Sidenote>
+For distributed training across multiple GPUs, consider [Accelerate](https://huggingface.co/docs/accelerate/index) which wraps this loop with minimal code changes.
+</Sidenote>
+Now we bring everything together. The training loop follows the standard PyTorch pattern with gradient accumulation:
+1. **Forward pass**: Compute loss on a mini-batch
+2. **Backward pass**: Accumulate gradients
+3. **Optimizer step**: Update weights (every `gradient_accumulation_steps` batches)
+4. **Logging**: Track loss and learning rate
+5. **Evaluation**: Measure validation loss after each epoch
+```python
+model.train()
+global_step = 0
+running_loss = 0.0
+running_steps = 0
+for epoch in range(num_epochs):
+    print(f"Epoch {epoch + 1}/{num_epochs}")
+    optimizer.zero_grad(set_to_none=True)
+    for step, batch in enumerate(train_loader, start=1):
+        batch = {key: value.to(device) for key, value in batch.items()}
+        outputs = model(**batch)
+        loss = outputs.loss / gradient_accumulation_steps
+        loss.backward()
+        running_loss += outputs.loss.float().item()
+        running_steps += 1
+        if step % gradient_accumulation_steps == 0 or step == len(train_loader):
+            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
+            optimizer.step()
+            scheduler.step()
+            optimizer.zero_grad(set_to_none=True)
+            global_step += 1
+            if global_step % logging_frequency == 0:
+                current_lr = scheduler.get_last_lr()[0]
+                mean_loss = running_loss / running_steps
+                print(f"step={global_step:05d} | loss={mean_loss:.4f} | lr={current_lr:.2e}")
+                running_loss = 0.0
+                running_steps = 0
+    train_loss = running_loss / running_steps if running_steps > 0 else float("nan")
+    print(f"Training loss after epoch {epoch + 1}: {train_loss:.4f}")
+    model.eval()
+    losses = []
+    with torch.no_grad():
+        for _, batch in enumerate(eval_loader, start=1):
+            batch = {key: value.to(device) for key, value in batch.items()}
+            loss = model(**batch).loss
+            losses.append(loss.float().item())
+    model.train()
+    val_loss = sum(losses) / len(losses) if losses else float("nan")
+    print(f"Validation loss after epoch {epoch + 1}: {val_loss:.4f}")
+print("Training complete.")
+```
+```
+Epoch 1/1
+step=00010 | loss=1.7586 | lr=1.33e-06
+step=00020 | loss=1.8188 | lr=2.67e-06
+step=00030 | loss=1.8235 | lr=4.00e-06
+step=00040 | loss=1.7935 | lr=5.33e-06
+step=00050 | loss=1.8029 | lr=6.67e-06
+...
+```

grpo.ipynb ADDED Viewed

	@@ -0,0 +1,654 @@

+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "id": "5a611684",
+      "metadata": {
+        "id": "5a611684"
+      },
+      "source": [
+        "# NanoChat Easy - GRPO Training\n",
+        "\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "80df0403",
+      "metadata": {
+        "id": "80df0403"
+      },
+      "source": [
+        "## Import model and tokenizer\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "1dd76bde",
+      "metadata": {
+        "id": "1dd76bde",
+        "outputId": "b786d7ad-5aa8-4a13-eb1f-54a65aaf44ba"
+      },
+      "outputs": [
+        {
+          "name": "stderr",
+          "output_type": "stream",
+          "text": [
+            "/fsx/benjamin_burtenshaw/nanochat_/.venv/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
+            "  from .autonotebook import tqdm as notebook_tqdm\n",
+            "`torch_dtype` is deprecated! Use `dtype` instead!\n"
+          ]
+        }
+      ],
+      "source": [
+        "import torch\n",
+        "from torch.utils.data import DataLoader\n",
+        "from datasets import load_dataset\n",
+        "from transformers import AutoModelForCausalLM, AutoTokenizer, get_linear_schedule_with_warmup\n",
+        "\n",
+        "\n",
+        "model_id = \"karpathy/nanochat-d32\"\n",
+        "revision = \"refs/pr/1\"\n",
+        "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
+        "\n",
+        "\n",
+        "tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)\n",
+        "model = AutoModelForCausalLM.from_pretrained(\n",
+        "    model_id,\n",
+        "    revision=revision,\n",
+        "    torch_dtype=torch.bfloat16 if device.type == \"cuda\" else torch.float32,\n",
+        ").to(device)\n",
+        "tokenizer.pad_token = tokenizer.eos_token\n",
+        "model.config.pad_token_id = tokenizer.pad_token_id"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "6eb979a9",
+      "metadata": {
+        "id": "6eb979a9"
+      },
+      "source": [
+        "## Setup LoRA\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "1973b450",
+      "metadata": {
+        "id": "1973b450",
+        "outputId": "354ceafb-b4cb-4423-f076-7800024171b7"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "trainable params: 1,179,648 || all params: 1,880,227,840 || trainable%: 0.0627\n"
+          ]
+        }
+      ],
+      "source": [
+        "from peft import LoraConfig, get_peft_model\n",
+        "\n",
+        "lora_config = LoraConfig(\n",
+        "    r=1,\n",
+        "    lora_alpha=2,\n",
+        "    lora_dropout=0.00,\n",
+        "    task_type=\"CAUSAL_LM\",\n",
+        "    target_modules=[\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\", \"fc1\", \"fc2\"]\n",
+        ")\n",
+        "\n",
+        "model = get_peft_model(model, lora_config)\n",
+        "model.print_trainable_parameters()\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "3f3533dd",
+      "metadata": {
+        "id": "3f3533dd"
+      },
+      "source": [
+        "## Demo the model\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "0f930711",
+      "metadata": {
+        "id": "0f930711",
+        "outputId": "f263ab12-9b2c-4ea3-da1c-4465032538d2"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "================================================================================\n",
+            "TEST 1: Plain Autoregressive Prompt\n",
+            "================================================================================\n",
+            "Prompt: The Eiffel Tower stands in Paris and\n",
+            "\n",
+            "Generated:  is one of the most famous landmarks in the world. It is located on the Champ de Mars in the heart of the city. The tower was built for the 1889 World's Fair. It was designed by the French engineer Gustave Eiffel and took 2 years to build. The Eiffel Tower stands 324 meters\n",
+            "================================================================================\n"
+          ]
+        }
+      ],
+      "source": [
+        "print(\"=\" * 80)\n",
+        "print(\"TEST 1: Plain Autoregressive Prompt\")\n",
+        "print(\"=\" * 80)\n",
+        "prompt = \"The Eiffel Tower stands in Paris and\"\n",
+        "test_inputs = tokenizer(prompt, return_tensors=\"pt\").to(device)\n",
+        "\n",
+        "\n",
+        "with torch.no_grad():\n",
+        "    test_outputs = model.generate(\n",
+        "        **test_inputs,\n",
+        "        max_new_tokens=64,\n",
+        "        do_sample=False,\n",
+        "        pad_token_id=tokenizer.pad_token_id,\n",
+        "    )\n",
+        "\n",
+        "generated_tokens = test_outputs[0, test_inputs[\"input_ids\"].shape[1] :]\n",
+        "print(f\"Prompt: {prompt}\")\n",
+        "print(f\"\\nGenerated: {tokenizer.decode(generated_tokens, skip_special_tokens=True)}\")\n",
+        "print(\"=\" * 80)\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "fbf80e5f",
+      "metadata": {
+        "id": "fbf80e5f",
+        "outputId": "86af20b4-3b9f-4dad-ba09-5dbb0de0f18c"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "================================================================================\n",
+            "TEST 2: Chat Template\n",
+            "================================================================================\n",
+            "Formatted prompt: <|bos|><|user_start|>What is the capital of France?<|user_end|><|assistant_start|>\n",
+            "Input IDs: [65527, 65528, 1442, 309, 261, 3429, 281, 4215, 63, 65529, 65530]\n",
+            "\n",
+            "Generated: The capital of France is Paris.<|assistant_end|>\n",
+            "================================================================================\n"
+          ]
+        }
+      ],
+      "source": [
+        "print(\"=\" * 80)\n",
+        "print(\"TEST 2: Chat Template\")\n",
+        "print(\"=\"*80)\n",
+        "conversation = [\n",
+        "    {\"role\": \"user\", \"content\": \"What is the capital of France?\"},\n",
+        "]\n",
+        "\n",
+        "inputs = tokenizer.apply_chat_template(\n",
+        "    conversation, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors=\"pt\"\n",
+        ").to(device)\n",
+        "\n",
+        "print(f\"Formatted prompt: {tokenizer.decode(inputs['input_ids'][0])}\")\n",
+        "print(f\"Input IDs: {inputs['input_ids'][0].tolist()}\")\n",
+        "\n",
+        "with torch.no_grad():\n",
+        "    outputs = model.generate(\n",
+        "        **inputs,\n",
+        "        max_new_tokens=64,\n",
+        "        do_sample=False\n",
+        "    )\n",
+        "\n",
+        "generated_tokens = outputs[0, inputs[\"input_ids\"].shape[1] :]\n",
+        "print(f\"\\nGenerated: {tokenizer.decode(generated_tokens)}\")\n",
+        "print(\"=\" * 80)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "a102e248",
+      "metadata": {
+        "id": "a102e248"
+      },
+      "source": [
+        "## Dataset\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "b07e3b95",
+      "metadata": {
+        "id": "b07e3b95",
+        "outputId": "3c42b4d4-6e4f-4622-94cd-adbe53efa238"
+      },
+      "outputs": [
+        {
+          "name": "stderr",
+          "output_type": "stream",
+          "text": [
+            "Generating train split: 100%|██████████| 52736/52736 [00:00<00:00, 1058243.18 examples/s]\n"
+          ]
+        }
+      ],
+      "source": [
+        "raw_dataset = load_dataset(\"HuggingFaceH4/OpenR1-Math-220k-default-verified\", split=\"train\")\n",
+        "splits = raw_dataset.train_test_split(test_size=0.1, seed=42)\n",
+        "train_dataset = splits[\"train\"]\n",
+        "eval_dataset = splits[\"test\"]\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "21ec9078",
+      "metadata": {
+        "id": "21ec9078"
+      },
+      "source": [
+        "## Training Configuration\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "17a49557",
+      "metadata": {
+        "id": "17a49557"
+      },
+      "outputs": [],
+      "source": [
+        "max_train_steps = 50\n",
+        "prompt_batch_size = 1\n",
+        "num_generations = 4\n",
+        "max_new_tokens = 128\n",
+        "temperature = 1.0\n",
+        "top_k = 50\n",
+        "learning_rate = 5e-6\n",
+        "weight_decay = 0.0\n",
+        "epsilon = 0.2\n",
+        "gradient_accumulation_steps = 1\n",
+        "warmup_ratio = 0.1\n",
+        "logging_frequency = 5\n",
+        "max_train_samples = 1000\n",
+        "max_eval_samples = 100\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "a8a12581",
+      "metadata": {
+        "id": "a8a12581"
+      },
+      "source": [
+        "## Reward Functions\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "3f07953f",
+      "metadata": {
+        "id": "3f07953f"
+      },
+      "outputs": [],
+      "source": [
+        "import re\n",
+        "import numpy as np\n",
+        "import torch.nn.functional as F\n",
+        "from contextlib import nullcontext\n",
+        "\n",
+        "\n",
+        "def think_format_reward(completions):\n",
+        "    \"\"\"\n",
+        "    Reward function that checks if the reasoning process is enclosed within <think> and </think> tags.\n",
+        "    Returns 1.0 if the format is correct, otherwise 0.0.\n",
+        "    \"\"\"\n",
+        "    pattern = r\"^<think>(?!.*<think>)(.*?)</think>.*$\"\n",
+        "    matches = [re.match(pattern, content, re.DOTALL | re.MULTILINE) for content in completions]\n",
+        "    return [1.0 if match else 0.0 for match in matches]\n",
+        "\n",
+        "\n",
+        "def accuracy_reward(completions, solutions):\n",
+        "    \"\"\"\n",
+        "    Reward function that checks whether the solution text appears in the completion.\n",
+        "    For simplicity, this is a case-insensitive substring match.\n",
+        "    \"\"\"\n",
+        "    rewards = []\n",
+        "    for completion, solution in zip(completions, solutions):\n",
+        "        # Simple string matching (normalized)\n",
+        "        reward = 1.0 if solution.strip().lower() in completion.strip().lower() else 0.0\n",
+        "        rewards.append(reward)\n",
+        "    return rewards\n",
+        "\n",
+        "\n",
+        "def min_length_reward(completions, min_length=10):\n",
+        "    \"\"\"\n",
+        "    Reward function that checks if the completion is at least a certain length.\n",
+        "    Returns 1.0 if the length is greater than or equal to the minimum length, otherwise 0.0.\n",
+        "    \"\"\"\n",
+        "    return [1.0 if len(completion) >= min_length else 0.0 for completion in completions]\n",
+        "\n",
+        "def combined_reward(completions, solutions):\n",
+        "    \"\"\"\n",
+        "    Combines format, accuracy, and minimum-length rewards with equal weight.\n",
+        "    \"\"\"\n",
+        "    format_rewards = think_format_reward(completions)\n",
+        "    accuracy_rewards = accuracy_reward(completions, solutions)\n",
+        "    min_length_rewards = min_length_reward(completions)\n",
+        "    return [np.mean([f, a, m]) for f, a, m in zip(format_rewards, accuracy_rewards, min_length_rewards)]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "b2299e86",
+      "metadata": {
+        "id": "b2299e86"
+      },
+      "source": [
+        "## Helper Functions\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "b0f0e9e4",
+      "metadata": {
+        "id": "b0f0e9e4"
+      },
+      "outputs": [],
+      "source": [
+        "def per_token_log_probs(logits, labels):\n",
+        "    logits = logits.float()\n",
+        "    log_probs = F.log_softmax(logits, dim=-1)\n",
+        "    return log_probs.gather(dim=-1, index=labels.unsqueeze(-1)).squeeze(-1)\n",
+        "\n",
+        "\n",
+        "def prepare_prompt(example, problem_key=\"problem\", solution_key=\"solution\"):\n",
+        "    # Wrap the problem text in a single user message for the chat template\n",
+        "    prompt = example.get(problem_key, \"\")\n",
+        "    messages = [{\"role\": \"user\", \"content\": prompt}]\n",
+        "\n",
+        "    formatted = tokenizer.apply_chat_template(\n",
+        "        messages,\n",
+        "        add_generation_prompt=True,\n",
+        "        truncation=True,\n",
+        "        max_length=2048,\n",
+        "        padding=False,\n",
+        "        return_dict=True,\n",
+        "        return_tensors=\"pt\",\n",
+        "    )\n",
+        "    return formatted[\"input_ids\"], formatted[\"attention_mask\"]\n",
+        "\n",
+        "\n",
+        "if device.type == \"cuda\":\n",
+        "    autocast_ctx = torch.amp.autocast(device_type=\"cuda\", dtype=torch.bfloat16)\n",
+        "else:\n",
+        "    autocast_ctx = nullcontext()\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "2756b691",
+      "metadata": {
+        "id": "2756b691"
+      },
+      "source": [
+        "## Optimizer and Scheduler\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "e0e05495",
+      "metadata": {
+        "id": "e0e05495"
+      },
+      "outputs": [],
+      "source": [
+        "optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate, weight_decay=weight_decay)\n",
+        "total_update_steps = max_train_steps // gradient_accumulation_steps\n",
+        "warmup_steps = max(1, int(total_update_steps * warmup_ratio))\n",
+        "scheduler = get_linear_schedule_with_warmup(optimizer, warmup_steps, total_update_steps)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "5e2c7a2c",
+      "metadata": {
+        "id": "5e2c7a2c"
+      },
+      "source": [
+        "## The Training Loop\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "260f574c",
+      "metadata": {
+        "id": "260f574c",
+        "outputId": "b762165f-ed4a-4b22-cbb7-2fa203696ac3"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "step=0005 | loss=0.0000 | avg_reward=0.4000 | lr=0.00e+00\n",
+            "Sample eval completion: 3^4 - 11 and 3^6 - 17\n",
+            "step=0010 | loss=0.0000 | avg_reward=0.3333 | lr=0.00e+00\n",
+            "Sample eval completion: 11. \n",
+            "\n",
+            "This statement refers to an optimization problem where we seek to find the smallest prime \\( p\n",
+            "step=0015 | loss=0.0000 | avg_reward=0.4667 | lr=0.00e+00\n",
+            "Sample eval completion: What number has two prime factors, 1 and itself, without additional restrictions? One possible combi\n",
+            "step=0020 | loss=-0.0983 | avg_reward=0.4500 | lr=0.00e+00\n",
+            "Sample eval completion: \\[\\begin{bmatrix} 2 & 3\\\\ 6 & 11\\end{bmatrix} \\]\\[3^{a}-2^{b}\\left(\\frac{1^{a}}{a}\\right) \\left(\\fra\n",
+            "step=0025 | loss=-0.0979 | avg_reward=0.3333 | lr=0.00e+00\n",
+            "Sample eval completion: Let's examine the smallest prime \\( p \\) for which there do not exist non-negative integers \\( a, b \n",
+            "step=0030 | loss=-0.0000 | avg_reward=0.3667 | lr=0.00e+00\n",
+            "Sample eval completion: \n",
+            "Since \\( p = 23^2 + 7 \\) or \\( p \\ge 23^3 + 63 \\), and \\( p > 23 \\), we find that \\( p \\ge 9223 \\).\n",
+            "step=0035 | loss=0.0431 | avg_reward=0.4167 | lr=0.00e+00\n",
+            "Sample eval completion: \\[11 \\] = \\((3^5)\\), for all \\( a, b \\).\n",
+            "[asy]\n",
+            "import random;\n",
+            "import numpy as np;\n",
+            "\n",
+            "unitsize(1cm);\n",
+            "\n",
+            "d\n",
+            "step=0040 | loss=-0.0702 | avg_reward=0.5000 | lr=0.00e+00\n",
+            "Sample eval completion: 3^4 - 7\n",
+            "step=0045 | loss=0.0000 | avg_reward=0.3333 | lr=0.00e+00\n",
+            "Sample eval completion: 7.\n",
+            "step=0050 | loss=0.0000 | avg_reward=0.4000 | lr=0.00e+00\n",
+            "Sample eval completion: Here is the answer:\n",
+            "\n",
+            "The smallest prime \\( p \\) (where \\( p > 3 \\)) for which there do not exist non\n",
+            "Training complete.\n"
+          ]
+        }
+      ],
+      "source": [
+        "\n",
+        "# Subsample the train and eval splits if size limits are set\n",
+        "if max_train_samples is not None and len(train_dataset) > max_train_samples:\n",
+        "    train_dataset = train_dataset.select(range(max_train_samples))\n",
+        "if max_eval_samples is not None and len(eval_dataset) > max_eval_samples:\n",
+        "    eval_dataset = eval_dataset.select(range(max_eval_samples))\n",
+        "\n",
+        "model.train()\n",
+        "train_index = 0\n",
+        "global_step = 0\n",
+        "running_reward = 0.0\n",
+        "running_loss = 0.0\n",
+        "\n",
+        "for step in range(1, max_train_steps + 1):\n",
+        "    example = train_dataset[train_index % len(train_dataset)]\n",
+        "    train_index += 1\n",
+        "\n",
+        "    prompt_ids, prompt_mask = prepare_prompt(example)\n",
+        "    prompt_ids = prompt_ids.to(device)\n",
+        "    prompt_mask = prompt_mask.to(device)\n",
+        "    prompt_length = prompt_ids.shape[1]\n",
+        "\n",
+        "    prompt_repeat = prompt_ids.repeat(num_generations, 1)\n",
+        "    mask_repeat = prompt_mask.repeat(num_generations, 1)\n",
+        "\n",
+        "    # Generate completions\n",
+        "    model.eval()\n",
+        "    with torch.no_grad():\n",
+        "        generated = model.generate(\n",
+        "            input_ids=prompt_repeat,\n",
+        "            attention_mask=mask_repeat,\n",
+        "            max_new_tokens=max_new_tokens,\n",
+        "            do_sample=True,\n",
+        "            temperature=temperature,\n",
+        "            top_k=top_k,\n",
+        "            pad_token_id=tokenizer.pad_token_id,\n",
+        "        )\n",
+        "    model.train()\n",
+        "\n",
+        "    sequences = generated\n",
+        "    attention_mask = (sequences != tokenizer.pad_token_id).long()\n",
+        "    completion_mask = attention_mask.clone()\n",
+        "    completion_mask[:, :prompt_length] = 0\n",
+        "\n",
+        "    completion_tokens = sequences[:, prompt_length:]\n",
+        "    completion_texts = tokenizer.batch_decode(completion_tokens, skip_special_tokens=True)\n",
+        "\n",
+        "    # Reference text for the reward: the `solution` field, falling back to `answer`\n",
+        "    solution = example.get(\"solution\", example.get(\"answer\", \"\"))\n",
+        "    solutions = [solution] * num_generations\n",
+        "\n",
+        "    # Compute rewards\n",
+        "    rewards = combined_reward(completion_texts, solutions)\n",
+        "    rewards = torch.tensor(rewards, dtype=torch.float32, device=device)\n",
+        "    running_reward += rewards.mean().item()\n",
+        "\n",
+        "    rewards_view = rewards.view(prompt_batch_size, num_generations)\n",
+        "    mean_rewards = rewards_view.mean(dim=1, keepdim=True)\n",
+        "    std_rewards = rewards_view.std(dim=1, keepdim=True)\n",
+        "    std_rewards = torch.where(std_rewards > 0, std_rewards, torch.ones_like(std_rewards))\n",
+        "    advantages = ((rewards_view - mean_rewards) / std_rewards).view(-1)\n",
+        "\n",
+        "    labels = sequences[:, 1:].clone()\n",
+        "    labels[attention_mask[:, 1:] == 0] = tokenizer.pad_token_id\n",
+        "\n",
+        "    # Compute old log probs\n",
+        "    with torch.no_grad():\n",
+        "        with autocast_ctx:\n",
+        "            old_outputs = model(\n",
+        "                input_ids=sequences,\n",
+        "                attention_mask=attention_mask,\n",
+        "                use_cache=False,\n",
+        "            )\n",
+        "        old_log_probs = per_token_log_probs(old_outputs.logits[:, :-1], labels)\n",
+        "\n",
+        "    valid_mask = (completion_mask[:, 1:] == 1) & (labels != tokenizer.pad_token_id)\n",
+        "\n",
+        "    # Compute loss\n",
+        "    optimizer.zero_grad(set_to_none=True)\n",
+        "    with autocast_ctx:\n",
+        "        outputs = model(\n",
+        "            input_ids=sequences,\n",
+        "            attention_mask=attention_mask,\n",
+        "            use_cache=False,\n",
+        "        )\n",
+        "        log_probs = per_token_log_probs(outputs.logits[:, :-1], labels)\n",
+        "\n",
+        "    ratio = (log_probs - old_log_probs).exp()\n",
+        "    ratio = torch.where(valid_mask, ratio, torch.ones_like(ratio))\n",
+        "    clipped_ratio = ratio.clamp(1.0 - epsilon, 1.0 + epsilon)\n",
+        "\n",
+        "    adv = advantages.unsqueeze(1)\n",
+        "    loss_unclipped = ratio * adv\n",
+        "    loss_clipped = clipped_ratio * adv\n",
+        "    per_token_loss = -torch.min(loss_unclipped, loss_clipped)\n",
+        "    per_token_loss = torch.where(valid_mask, per_token_loss, torch.zeros_like(per_token_loss))\n",
+        "\n",
+        "    denom = valid_mask.sum().clamp(min=1)\n",
+        "    loss = per_token_loss.sum() / denom\n",
+        "\n",
+        "    loss.backward()\n",
+        "    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)\n",
+        "    optimizer.step()\n",
+        "    scheduler.step()\n",
+        "\n",
+        "    global_step += 1\n",
+        "    running_loss += loss.item()\n",
+        "\n",
+        "    if step % logging_frequency == 0:\n",
+        "        avg_reward = running_reward / logging_frequency\n",
+        "        avg_loss = running_loss / logging_frequency\n",
+        "        current_lr = scheduler.get_last_lr()[0]\n",
+        "        print(\n",
+        "            f\"step={step:04d} | loss={avg_loss:.4f} | avg_reward={avg_reward:.4f} | lr={current_lr:.2e}\"\n",
+        "        )\n",
+        "        running_reward = 0.0\n",
+        "        running_loss = 0.0\n",
+        "\n",
+        "        # Sample evaluation\n",
+        "        model.eval()\n",
+        "        eval_example = eval_dataset[0]\n",
+        "        prompt_ids, prompt_mask = prepare_prompt(eval_example)\n",
+        "        with torch.no_grad():\n",
+        "            eval_sequences = model.generate(\n",
+        "                input_ids=prompt_ids.to(device),\n",
+        "                attention_mask=prompt_mask.to(device),\n",
+        "                max_new_tokens=max_new_tokens,\n",
+        "                do_sample=True,\n",
+        "                top_k=top_k,\n",
+        "                temperature=temperature,\n",
+        "                pad_token_id=tokenizer.pad_token_id,\n",
+        "            )\n",
+        "        model.train()\n",
+        "        completion = eval_sequences[0, prompt_ids.shape[1] :]\n",
+        "        print(\"Sample eval completion:\", tokenizer.decode(completion, skip_special_tokens=True)[:100])\n",
+        "\n",
+        "print(\"Training complete.\")\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "2104662d",
+      "metadata": {
+        "id": "2104662d"
+      },
+      "outputs": [],
+      "source": []
+    }
+  ],
+  "metadata": {
+    "kernelspec": {
+      "display_name": ".venv",
+      "language": "python",
+      "name": "python3"
+    },
+    "language_info": {
+      "codemirror_mode": {
+        "name": "ipython",
+        "version": 3
+      },
+      "file_extension": ".py",
+      "mimetype": "text/x-python",
+      "name": "python",
+      "nbconvert_exporter": "python",
+      "pygments_lexer": "ipython3",
+      "version": "3.10.18"
+    },
+    "colab": {
+      "provenance": []
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 5
+}
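
The advantage and clipping arithmetic in the GRPO training loop above can be checked without a GPU. What follows is a minimal pure-Python sketch, not the notebook's code: the helper names `group_advantages` and `clipped_objective` are hypothetical, and `statistics.stdev` stands in for `torch.std`, which also uses the sample (Bessel-corrected) estimator by default.

```python
from statistics import mean, stdev


def group_advantages(rewards):
    """Normalize one prompt's rewards to zero mean and unit sample std."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    if sigma == 0:
        # Same fallback as the notebook: avoid dividing by zero
        # when every completion in the group earned the same reward.
        sigma = 1.0
    return [(r - mu) / sigma for r in rewards]


def clipped_objective(ratio, advantage, epsilon=0.2):
    """PPO-style clipped surrogate for one token (a quantity to maximize)."""
    clipped = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
    return min(ratio * advantage, clipped * advantage)


# Four completions for one prompt, as with num_generations = 4.
identical = group_advantages([1 / 3] * 4)
mixed = group_advantages([1 / 3, 1 / 3, 2 / 3, 1 / 3])

print("identical rewards ->", identical)
print("one better sample ->", [round(a, 3) for a in mixed])
print("clipped gain      ->", clipped_objective(1.5, 1.0))
print("unclipped penalty ->", clipped_objective(1.5, -1.0))
```

When every completion in a group earns the same reward, all advantages are zero and the step carries no gradient signal. This may be one reason for the `loss=0.0000` lines in the logged output, although the logs alone do not confirm it.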

sft.ipynb ADDED Viewed

	@@ -0,0 +1,591 @@

+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "id": "b7eb261b",
+      "metadata": {
+        "id": "b7eb261b"
+      },
+      "source": [
+        "# NanoChat Easy - SFT Training\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "8b8a04a8",
+      "metadata": {
+        "id": "8b8a04a8"
+      },
+      "source": [
+        "## Import model and tokenizer\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "3e48247c",
+      "metadata": {
+        "id": "3e48247c",
+        "outputId": "882fcf01-34fb-4123-e84c-deefdf477814"
+      },
+      "outputs": [
+        {
+          "name": "stderr",
+          "output_type": "stream",
+          "text": [
+            "/fsx/benjamin_burtenshaw/nanochat_/.venv/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
+            "  from .autonotebook import tqdm as notebook_tqdm\n",
+            "`torch_dtype` is deprecated! Use `dtype` instead!\n"
+          ]
+        }
+      ],
+      "source": [
+        "import torch\n",
+        "from torch.utils.data import DataLoader\n",
+        "from datasets import load_dataset\n",
+        "from transformers import AutoModelForCausalLM, AutoTokenizer, get_linear_schedule_with_warmup\n",
+        "\n",
+        "\n",
+        "model_id = \"karpathy/nanochat-d32\"\n",
+        "revision = \"refs/pr/1\"\n",
+        "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
+        "\n",
+        "\n",
+        "tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)\n",
+        "model = AutoModelForCausalLM.from_pretrained(\n",
+        "    model_id,\n",
+        "    revision=revision,\n",
+        "    dtype=torch.bfloat16 if device.type == \"cuda\" else torch.float32,\n",
+        ").to(device)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "c9a9c0a4",
+      "metadata": {
+        "id": "c9a9c0a4"
+      },
+      "source": [
+        "## Setup LoRA\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "dd9a698a",
+      "metadata": {
+        "id": "dd9a698a",
+        "outputId": "0aae9ecc-7af9-436e-a95b-a4cd023997fd"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "trainable params: 1,179,648 || all params: 1,880,227,840 || trainable%: 0.0627\n"
+          ]
+        }
+      ],
+      "source": [
+        "from peft import LoraConfig, get_peft_model\n",
+        "\n",
+        "lora_config = LoraConfig(\n",
+        "    r=1,\n",
+        "    lora_alpha=2,\n",
+        "    lora_dropout=0.00,\n",
+        "    task_type=\"CAUSAL_LM\",\n",
+        "    target_modules=[\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\", \"fc1\", \"fc2\"]\n",
+        ")\n",
+        "\n",
+        "model = get_peft_model(model, lora_config)\n",
+        "model.print_trainable_parameters()\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "4810af1a",
+      "metadata": {
+        "id": "4810af1a"
+      },
+      "source": [
+        "## Demo the model\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "b3e81aa9",
+      "metadata": {
+        "id": "b3e81aa9",
+        "outputId": "1cde7e69-7ff1-4bfe-aa9f-9ded20249d82"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "================================================================================\n",
+            "TEST 1: Plain Autoregressive Prompt\n",
+            "================================================================================\n",
+            "Prompt: The Eiffel Tower stands in Paris and\n",
+            "\n",
+            "Generated:  is one of the most famous landmarks in the world. It is located on the Champ de Mars in the heart of the city. The tower was built for the 1889 World's Fair. It was designed by the French engineer Gustave Eiffel and took 2 years to build. The Eiffel Tower stands 324 meters\n",
+            "================================================================================\n"
+          ]
+        }
+      ],
+      "source": [
+        "print(\"=\" * 80)\n",
+        "print(\"TEST 1: Plain Autoregressive Prompt\")\n",
+        "print(\"=\" * 80)\n",
+        "prompt = \"The Eiffel Tower stands in Paris and\"\n",
+        "test_inputs = tokenizer(prompt, return_tensors=\"pt\").to(device)\n",
+        "\n",
+        "\n",
+        "with torch.no_grad():\n",
+        "    test_outputs = model.generate(\n",
+        "        **test_inputs,\n",
+        "        max_new_tokens=64,\n",
+        "        do_sample=False,\n",
+        "        pad_token_id=tokenizer.pad_token_id,\n",
+        "    )\n",
+        "\n",
+        "generated_tokens = test_outputs[0, test_inputs[\"input_ids\"].shape[1] :]\n",
+        "print(f\"Prompt: {prompt}\")\n",
+        "print(f\"\\nGenerated: {tokenizer.decode(generated_tokens, skip_special_tokens=True)}\")\n",
+        "print(\"=\" * 80)\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "8e7b275c",
+      "metadata": {
+        "id": "8e7b275c",
+        "outputId": "719e986e-61b4-4fd5-db15-4a9ef8f97396"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "================================================================================\n",
+            "TEST 2: Chat Template\n",
+            "================================================================================\n",
+            "Formatted prompt: <|bos|><|user_start|>What is the capital of France?<|user_end|><|assistant_start|>\n",
+            "Input IDs: [65527, 65528, 1442, 309, 261, 3429, 281, 4215, 63, 65529, 65530]\n",
+            "\n",
+            "Generated: The capital of France is Paris.<|assistant_end|>\n",
+            "================================================================================\n"
+          ]
+        }
+      ],
+      "source": [
+        "print(\"=\" * 80)\n",
+        "print(\"TEST 2: Chat Template\")\n",
+        "print(\"=\" * 80)\n",
+        "conversation = [\n",
+        "    {\"role\": \"user\", \"content\": \"What is the capital of France?\"},\n",
+        "]\n",
+        "\n",
+        "inputs = tokenizer.apply_chat_template(\n",
+        "    conversation, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors=\"pt\"\n",
+        ").to(device)\n",
+        "\n",
+        "print(f\"Formatted prompt: {tokenizer.decode(inputs['input_ids'][0])}\")\n",
+        "print(f\"Input IDs: {inputs['input_ids'][0].tolist()}\")\n",
+        "\n",
+        "with torch.no_grad():\n",
+        "    outputs = model.generate(\n",
+        "        **inputs,\n",
+        "        max_new_tokens=64,\n",
+        "        do_sample=False\n",
+        "    )\n",
+        "\n",
+        "generated_tokens = outputs[0, inputs[\"input_ids\"].shape[1] :]\n",
+        "print(f\"\\nGenerated: {tokenizer.decode(generated_tokens)}\")\n",
+        "print(\"=\" * 80)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "44cb321a",
+      "metadata": {
+        "id": "44cb321a"
+      },
+      "source": [
+        "## Dataset\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "e1a75c14",
+      "metadata": {
+        "id": "e1a75c14"
+      },
+      "outputs": [],
+      "source": [
+        "raw_dataset = load_dataset(\"HuggingFaceTB/smoltalk2\", \"SFT\", split=\"OpenThoughts3_1.2M_think\")\n",
+        "splits = raw_dataset.train_test_split(test_size=0.1, seed=42)\n",
+        "train_dataset = splits[\"train\"]\n",
+        "eval_dataset = splits[\"test\"]\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "8b29399d",
+      "metadata": {
+        "id": "8b29399d"
+      },
+      "source": [
+        "### Process the Dataset\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "451542b4",
+      "metadata": {
+        "id": "451542b4",
+        "outputId": "caa727dd-f9d8-4c67-d193-79bcc0836b49"
+      },
+      "outputs": [
+        {
+          "name": "stderr",
+          "output_type": "stream",
+          "text": [
+            "Map:   0%|          | 0/20000 [00:00<?, ? examples/s]"
+          ]
+        },
+        {
+          "name": "stderr",
+          "output_type": "stream",
+          "text": [
+            "Map: 100%|██████████| 20000/20000 [06:27<00:00, 51.68 examples/s]\n",
+            "Map: 100%|██████████| 1000/1000 [00:19<00:00, 52.12 examples/s]\n"
+          ]
+        }
+      ],
+      "source": [
+        "max_length = 2048\n",
+        "max_train_examples = 20000\n",
+        "max_eval_examples = 1000\n",
+        "\n",
+        "def format_example(example):\n",
+        "    formatted = tokenizer.apply_chat_template(\n",
+        "        example[\"messages\"],\n",
+        "        add_generation_prompt=False,\n",
+        "        truncation=True,\n",
+        "        max_length=max_length,\n",
+        "        padding=False,\n",
+        "        return_dict=True,\n",
+        "        return_tensors=\"pt\",\n",
+        "    )\n",
+        "    return {\n",
+        "        \"input_ids\": formatted[\"input_ids\"][0].tolist(),\n",
+        "        \"attention_mask\": formatted[\"attention_mask\"][0].tolist(),\n",
+        "    }\n",
+        "\n",
+        "\n",
+        "# Optionally subsample each split, then tokenize it with the chat template.\n",
+        "if max_train_examples is not None:\n",
+        "    train_dataset = train_dataset.select(range(min(len(train_dataset), max_train_examples)))\n",
+        "train_dataset = train_dataset.map(format_example, remove_columns=train_dataset.column_names)\n",
+        "\n",
+        "if max_eval_examples is not None:\n",
+        "    eval_dataset = eval_dataset.select(range(min(len(eval_dataset), max_eval_examples)))\n",
+        "eval_dataset = eval_dataset.map(format_example, remove_columns=eval_dataset.column_names)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "ecd33dd7",
+      "metadata": {
+        "id": "ecd33dd7"
+      },
+      "source": [
+        "## Training Configuration"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "f9d837ee",
+      "metadata": {
+        "id": "f9d837ee"
+      },
+      "outputs": [],
+      "source": [
+        "train_batch_size = 2\n",
+        "eval_batch_size = 2\n",
+        "num_epochs = 1\n",
+        "gradient_accumulation_steps = 4\n",
+        "learning_rate = 1e-5\n",
+        "weight_decay = 0.0\n",
+        "warmup_ratio = 0.03\n",
+        "logging_frequency = 10"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "1cf11e96",
+      "metadata": {
+        "id": "1cf11e96"
+      },
+      "source": [
+        "## Create a `DataLoader` 👴"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "1bc4fa24",
+      "metadata": {
+        "id": "1bc4fa24"
+      },
+      "outputs": [],
+      "source": [
+        "def collate_fn(batch):\n",
+        "    batch_dict = {\n",
+        "        \"input_ids\": [record[\"input_ids\"] for record in batch],\n",
+        "        \"attention_mask\": [record[\"attention_mask\"] for record in batch],\n",
+        "    }\n",
+        "    padded = tokenizer.pad(batch_dict, padding=True, return_tensors=\"pt\")\n",
+        "    labels = padded[\"input_ids\"].clone()\n",
+        "    labels[padded[\"attention_mask\"] == 0] = -100\n",
+        "    padded[\"labels\"] = labels\n",
+        "    return padded\n",
+        "\n",
+        "\n",
+        "TrainLoader = DataLoader(train_dataset, batch_size=train_batch_size, shuffle=True, collate_fn=collate_fn)\n",
+        "EvalLoader = DataLoader(eval_dataset, batch_size=eval_batch_size, shuffle=False, collate_fn=collate_fn)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "f5965d1b",
+      "metadata": {
+        "id": "f5965d1b"
+      },
+      "source": [
+        "## Optimizer"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "f57c7be2",
+      "metadata": {
+        "id": "f57c7be2"
+      },
+      "outputs": [],
+      "source": [
+        "optimizer = torch.optim.AdamW(\n",
+        "    model.parameters(),\n",
+        "    lr=learning_rate,\n",
+        "    weight_decay=weight_decay,\n",
+        ")\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "215f8782",
+      "metadata": {
+        "id": "215f8782"
+      },
+      "source": [
+        "## Learning Rate Scheduler"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "034e2903",
+      "metadata": {
+        "id": "034e2903"
+      },
+      "outputs": [],
+      "source": [
+        "num_update_steps_per_epoch = max(len(TrainLoader) // gradient_accumulation_steps, 1)\n",
+        "max_train_steps = num_epochs * num_update_steps_per_epoch\n",
+        "warmup_steps = max(1, int(max_train_steps * warmup_ratio))\n",
+        "scheduler = get_linear_schedule_with_warmup(optimizer, warmup_steps, max_train_steps)\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "0f0090b6",
+      "metadata": {
+        "id": "0f0090b6"
+      },
+      "source": [
+        "## The Training Loop"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "1540e30a",
+      "metadata": {
+        "id": "1540e30a",
+        "outputId": "747badd7-18df-441f-8026-7aa4f30c2fd7"
+      },
+      "outputs": [
+        {
+          "name": "stderr",
+          "output_type": "stream",
+          "text": [
+            "You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.\n"
+          ]
+        },
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "Epoch 1/1\n",
+            "step=00010 | loss=1.7586 | lr=1.33e-06\n",
+            "step=00020 | loss=1.8188 | lr=2.67e-06\n",
+            "step=00030 | loss=1.8235 | lr=4.00e-06\n",
+            "step=00040 | loss=1.7935 | lr=5.33e-06\n",
+            "step=00050 | loss=1.8029 | lr=6.67e-06\n",
+            "step=00060 | loss=1.8433 | lr=8.00e-06\n",
+            "step=00070 | loss=1.8616 | lr=9.33e-06\n",
+            "step=00080 | loss=1.8238 | lr=9.98e-06\n",
+            "step=00090 | loss=1.7774 | lr=9.94e-06\n",
+            "step=00100 | loss=1.8081 | lr=9.90e-06\n",
+            "step=00110 | loss=1.7437 | lr=9.86e-06\n",
+            "step=00120 | loss=1.7830 | lr=9.81e-06\n",
+            "step=00130 | loss=1.8064 | lr=9.77e-06\n",
+            "step=00140 | loss=1.8541 | lr=9.73e-06\n",
+            "step=00150 | loss=1.8301 | lr=9.69e-06\n",
+            "step=00160 | loss=1.7725 | lr=9.65e-06\n",
+            "step=00170 | loss=1.7635 | lr=9.61e-06\n",
+            "step=00180 | loss=1.7963 | lr=9.57e-06\n",
+            "step=00190 | loss=1.7563 | lr=9.53e-06\n",
+            "step=00200 | loss=1.6950 | lr=9.48e-06\n",
+            "step=00210 | loss=1.7680 | lr=9.44e-06\n",
+            "step=00220 | loss=1.8906 | lr=9.40e-06\n",
+            "step=00230 | loss=1.7120 | lr=9.36e-06\n",
+            "step=00240 | loss=1.8390 | lr=9.32e-06\n",
+            "step=00250 | loss=1.7180 | lr=9.28e-06\n",
+            "step=00260 | loss=1.7709 | lr=9.24e-06\n",
+            "step=00270 | loss=1.7598 | lr=9.20e-06\n",
+            "step=00280 | loss=1.7981 | lr=9.15e-06\n",
+            "step=00290 | loss=1.7540 | lr=9.11e-06\n",
+            "step=00300 | loss=1.7695 | lr=9.07e-06\n",
+            "step=00310 | loss=1.7468 | lr=9.03e-06\n"
+          ]
+        },
+        {
+          "ename": "KeyboardInterrupt",
+          "evalue": "",
+          "output_type": "error",
+          "traceback": [
+            "---------------------------------------------------------------------------",
+            "KeyboardInterrupt                         Traceback (most recent call last)",
+            "Cell In[14], line 11\n      9 for step, batch in enumerate(TrainLoader, start=1):\n     10     batch = {key: value.to(device) for key, value in batch.items()}\n---> 11     outputs = model(**batch)\n     12     loss = outputs.loss / gradient_accumulation_steps\n     13     loss.backward()\n",
+            "File /fsx/benjamin_burtenshaw/nanochat_/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1773, in Module._wrapped_call_impl(self, *args, **kwargs)\n   1771     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]\n   1772 else:\n-> 1773     return self._call_impl(*args, **kwargs)\n",
+            "File /fsx/benjamin_burtenshaw/nanochat_/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1784, in Module._call_impl(self, *args, **kwargs)\n   1779 # If we don't have any hooks, we want to skip the rest of the logic in\n   1780 # this function, and just call forward.\n   1781 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks\n   1782         or _global_backward_pre_hooks or _global_backward_hooks\n   1783         or _global_forward_hooks or _global_forward_pre_hooks):\n-> 1784     return forward_call(*args, **kwargs)\n   1786 result = None\n   1787 called_always_called_hooks = set()\n",
+            "File /fsx/benjamin_burtenshaw/nanochat_/.venv/lib/python3.10/site-packages/peft/peft_model.py:1850, in PeftModelForCausalLM.forward(self, input_ids, attention_mask, inputs_embeds, labels, output_attentions, output_hidden_states, return_dict, task_ids, **kwargs)\n   1848     with self._enable_peft_forward_hooks(**kwargs):\n   1849         kwargs = {k: v for k, v in kwargs.items() if k not in self.special_peft_forward_args}\n-> 1850         return self.base_model(\n   1851             input_ids=input_ids,\n   1852             attention_mask=attention_mask,\n   1853             inputs_embeds=inputs_embeds,\n   1854             labels=labels,\n   1855             output_attentions=output_attentions,\n   1856             output_hidden_states=output_hidden_states,\n   1857             return_dict=return_dict,\n   1858             **kwargs,\n   1859         )\n   1861 batch_size = _get_batch_size(input_ids, inputs_embeds)\n   1862 if attention_mask is not None:\n   1863     # concat prompt attention mask\n",
+            "File /fsx/benjamin_burtenshaw/nanochat_/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1773, in Module._wrapped_call_impl(self, *args, **kwargs)\n   1771     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]\n   1772 else:\n-> 1773     return self._call_impl(*args, **kwargs)\n",
+            "File /fsx/benjamin_burtenshaw/nanochat_/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1784, in Module._call_impl(self, *args, **kwargs)\n   1779 # If we don't have any hooks, we want to skip the rest of the logic in\n   1780 # this function, and just call forward.\n   1781 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks\n   1782         or _global_backward_pre_hooks or _global_backward_hooks\n   1783         or _global_forward_hooks or _global_forward_pre_hooks):\n-> 1784     return forward_call(*args, **kwargs)\n   1786 result = None\n   1787 called_always_called_hooks = set()\n",
+            "File /fsx/benjamin_burtenshaw/nanochat_/.venv/lib/python3.10/site-packages/peft/tuners/tuners_utils.py:222, in BaseTuner.forward(self, *args, **kwargs)\n    221 def forward(self, *args: Any, **kwargs: Any):\n--> 222     return self.model.forward(*args, **kwargs)\n",
+            "File /fsx/benjamin_burtenshaw/transformers/src/transformers/utils/generic.py:757, in can_return_tuple.<locals>.wrapper(self, *args, **kwargs)\n    755 if return_dict_passed is not None:\n    756     return_dict = return_dict_passed\n--> 757 output = func(self, *args, **kwargs)\n    758 if not return_dict and not isinstance(output, tuple):\n    759     output = output.to_tuple()\n",
+            "File /fsx/benjamin_burtenshaw/transformers/src/transformers/models/nanochat/modeling_nanochat.py:474, in NanoChatForCausalLM.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, cache_position, logits_to_keep, **kwargs)\n    435 @can_return_tuple\n    436 @auto_docstring\n    437 def forward(\n   (...)\n    448     **kwargs: Unpack[TransformersKwargs],\n    449 ) -> CausalLMOutputWithPast:\n    450     r\"\"\"\n    451     Example:\n    452 \n   (...)\n    472     >>> output = tokenizer.decode(generated_tokens, skip_special_tokens=True)\n    473     ```\"\"\"\n--> 474     outputs: BaseModelOutputWithPast = self.model(\n    475         input_ids=input_ids,\n    476         attention_mask=attention_mask,\n    477         position_ids=position_ids,\n    478         past_key_values=past_key_values,\n    479         inputs_embeds=inputs_embeds,\n    480         use_cache=use_cache,\n    481         cache_position=cache_position,\n    482         **kwargs,\n    483     )\n    485     hidden_states = outputs.last_hidden_state\n    486     slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep\n",
+            "File /fsx/benjamin_burtenshaw/nanochat_/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1773, in Module._wrapped_call_impl(self, *args, **kwargs)\n   1771     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]\n   1772 else:\n-> 1773     return self._call_impl(*args, **kwargs)\n",
+            "File /fsx/benjamin_burtenshaw/nanochat_/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1784, in Module._call_impl(self, *args, **kwargs)\n   1779 # If we don't have any hooks, we want to skip the rest of the logic in\n   1780 # this function, and just call forward.\n   1781 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks\n   1782         or _global_backward_pre_hooks or _global_backward_hooks\n   1783         or _global_forward_hooks or _global_forward_pre_hooks):\n-> 1784     return forward_call(*args, **kwargs)\n   1786 result = None\n   1787 called_always_called_hooks = set()\n",
+            "File /fsx/benjamin_burtenshaw/transformers/src/transformers/utils/generic.py:927, in check_model_inputs.<locals>.wrapped_fn.<locals>.wrapper(self, *args, **kwargs)\n    924                 monkey_patched_layers.append((module, original_forward))\n    926 try:\n--> 927     outputs = func(self, *args, **kwargs)\n    928 except TypeError as original_exception:\n    929     # If we get a TypeError, it's possible that the model is not receiving the recordable kwargs correctly.\n    930     # Get a TypeError even after removing the recordable kwargs -> re-raise the original exception\n    931     # Otherwise -> we're probably missing `**kwargs` in the decorated function\n    932     kwargs_without_recordable = {k: v for k, v in kwargs.items() if k not in recordable_keys}\n",
+            "File /fsx/benjamin_burtenshaw/transformers/src/transformers/models/nanochat/modeling_nanochat.py:401, in NanoChatModel.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, cache_position, **kwargs)\n    398 hidden_states = self.initial_norm(hidden_states)\n    400 for decoder_layer in self.layers:\n--> 401     hidden_states = decoder_layer(\n    402         hidden_states,\n    403         attention_mask=causal_mask,\n    404         position_ids=position_ids,\n    405         past_key_values=past_key_values,\n    406         use_cache=use_cache,\n    407         cache_position=cache_position,\n    408         position_embeddings=position_embeddings,\n    409         **kwargs,\n    410     )\n    412 hidden_states = self.norm(hidden_states)\n    414 return BaseModelOutputWithPast(\n    415     last_hidden_state=hidden_states,\n    416     past_key_values=past_key_values if use_cache else None,\n    417 )\n",
+            "File /fsx/benjamin_burtenshaw/transformers/src/transformers/modeling_layers.py:94, in GradientCheckpointingLayer.__call__(self, *args, **kwargs)\n     91         logger.warning_once(message)\n     93     return self._gradient_checkpointing_func(partial(super().__call__, **kwargs), *args)\n---> 94 return super().__call__(*args, **kwargs)\n",
+            "File /fsx/benjamin_burtenshaw/nanochat_/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1773, in Module._wrapped_call_impl(self, *args, **kwargs)\n   1771     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]\n   1772 else:\n-> 1773     return self._call_impl(*args, **kwargs)\n",
+            "File /fsx/benjamin_burtenshaw/nanochat_/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1784, in Module._call_impl(self, *args, **kwargs)\n   1779 # If we don't have any hooks, we want to skip the rest of the logic in\n   1780 # this function, and just call forward.\n   1781 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks\n   1782         or _global_backward_pre_hooks or _global_backward_hooks\n   1783         or _global_forward_hooks or _global_forward_pre_hooks):\n-> 1784     return forward_call(*args, **kwargs)\n   1786 result = None\n   1787 called_always_called_hooks = set()\n",
+            "File /fsx/benjamin_burtenshaw/transformers/src/transformers/models/nanochat/modeling_nanochat.py:279, in NanoChatDecoderLayer.forward(self, hidden_states, attention_mask, position_ids, past_key_values, use_cache, cache_position, position_embeddings, **kwargs)\n    267 def forward(\n    268     self,\n    269     hidden_states: torch.Tensor,\n   (...)\n    276     **kwargs: Unpack[TransformersKwargs],\n    277 ) -> torch.Tensor:\n    278     residual = hidden_states\n--> 279     hidden_states = self.input_layernorm(hidden_states)\n    280     # Self Attention\n    281     hidden_states, _ = self.self_attn(\n    282         hidden_states=hidden_states,\n    283         attention_mask=attention_mask,\n   (...)\n    289         **kwargs,\n    290     )\n",
+            "File /fsx/benjamin_burtenshaw/nanochat_/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1773, in Module._wrapped_call_impl(self, *args, **kwargs)\n   1771     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]\n   1772 else:\n-> 1773     return self._call_impl(*args, **kwargs)\n",
+            "File /fsx/benjamin_burtenshaw/nanochat_/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1784, in Module._call_impl(self, *args, **kwargs)\n   1779 # If we don't have any hooks, we want to skip the rest of the logic in\n   1780 # this function, and just call forward.\n   1781 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks\n   1782         or _global_backward_pre_hooks or _global_backward_hooks\n   1783         or _global_forward_hooks or _global_forward_pre_hooks):\n-> 1784     return forward_call(*args, **kwargs)\n   1786 result = None\n   1787 called_always_called_hooks = set()\n",
+            "File /fsx/benjamin_burtenshaw/transformers/src/transformers/models/nanochat/modeling_nanochat.py:53, in NanoChatRMSNorm.forward(self, x)\n     52 def forward(self, x):\n---> 53     return self._norm(x.float()).type_as(x)\n",
+            "File /fsx/benjamin_burtenshaw/transformers/src/transformers/models/nanochat/modeling_nanochat.py:50, in NanoChatRMSNorm._norm(self, x)\n     49 def _norm(self, x):\n---> 50     return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)\n",
+            "KeyboardInterrupt: "
+          ]
+        }
+      ],
+      "source": [
+        "\n",
+        "model.train()\n",
+        "global_step = 0\n",
+        "running_loss = 0.0\n",
+        "running_steps = 0\n",
+        "\n",
+        "for epoch in range(num_epochs):\n",
+        "    print(f\"Epoch {epoch + 1}/{num_epochs}\")\n",
+        "    optimizer.zero_grad(set_to_none=True)\n",
+        "    for step, batch in enumerate(TrainLoader, start=1):\n",
+        "        batch = {key: value.to(device) for key, value in batch.items()}\n",
+        "        outputs = model(**batch)\n",
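+        "        # Scale the loss so gradients summed over the accumulation window match one large-batch step.\n",
+        "        loss = outputs.loss / gradient_accumulation_steps\n",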
+        "        loss.backward()\n",
+        "\n",
+        "        running_loss += outputs.loss.float().item()\n",
+        "        running_steps += 1\n",
+        "\n",
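+        "        # Step the optimizer once per accumulation window, and on the final (possibly partial) batch.\n",
+        "        if step % gradient_accumulation_steps == 0 or step == len(TrainLoader):\n",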
+        "            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)\n",
+        "            optimizer.step()\n",
+        "            scheduler.step()\n",
+        "            optimizer.zero_grad(set_to_none=True)\n",
+        "            global_step += 1\n",
+        "\n",
+        "            if global_step % logging_frequency == 0:\n",
+        "                current_lr = scheduler.get_last_lr()[0]\n",
+        "                mean_loss = running_loss / running_steps\n",
+        "                print(f\"step={global_step:05d} | loss={mean_loss:.4f} | lr={current_lr:.2e}\")\n",
+        "                running_loss = 0.0\n",
+        "                running_steps = 0\n",
+        "\n",
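+        "    # running_loss is reset at every log, so this averages only the micro-batches since the last log.\n",
+        "    train_loss = running_loss / running_steps if running_steps > 0 else float(\"nan\")\n",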
+        "    print(f\"Training loss after epoch {epoch + 1}: {train_loss:.4f}\")\n",
+        "\n",
+        "    model.eval()\n",
+        "    losses = []\n",
+        "    with torch.no_grad():\n",
+        "        for batch in EvalLoader:\n",
+        "            batch = {key: value.to(device) for key, value in batch.items()}\n",
+        "            loss = model(**batch).loss\n",
+        "            losses.append(loss.float().item())\n",
+        "    model.train()\n",
+        "    val_loss = sum(losses) / len(losses) if losses else float(\"nan\")\n",
+        "\n",
+        "    print(f\"Validation loss after epoch {epoch + 1}: {val_loss:.4f}\")\n",
+        "\n",
+        "print(\"Training complete.\")\n"
+      ]
+    }
+  ],
+  "metadata": {
+    "kernelspec": {
+      "display_name": ".venv",
+      "language": "python",
+      "name": "python3"
+    },
+    "language_info": {
+      "codemirror_mode": {
+        "name": "ipython",
+        "version": 3
+      },
+      "file_extension": ".py",
+      "mimetype": "text/x-python",
+      "name": "python",
+      "nbconvert_exporter": "python",
+      "pygments_lexer": "ipython3",
+      "version": "3.10.18"
+    },
+    "colab": {
+      "provenance": []
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 5
+}