The model has 310M parameters

#23
by lgcharpe - opened

When calculating the number of parameters this model has, I arrive at about 310M, which also seems to be the value reported by Hugging Face. Here is the calculation:
Attention + MLP parameters: (768 x 768 x 4 + 3 x 768 x 3072) x 12 = (2 359 296 + 7 077 888) x 12 = (9 437 184) x 12 = 113 246 208
Embedding: 128 256 x 768 = 98 500 608
LM Head (plus final LayerNorm): 128 256 x 768 + 768 = 98 500 608 + 768 = 98 501 376
Total: 113 246 208 + 98 500 608 + 98 501 376 = 211 746 816 + 98 501 376 = 310 248 192
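
For reference, here is the same arithmetic as a small Python sketch (the hidden size, intermediate size, layer count, and vocabulary size are the values used above; per-layer LayerNorm and bias parameters are ignored, as in the calculation):

```python
# Parameter count sketch using the config values from the calculation above:
# hidden=768, intermediate=3072, layers=12, vocab=128_256.
hidden, intermediate, layers, vocab = 768, 3072, 12, 128_256

attn = 4 * hidden * hidden           # Q, K, V, and output projections
mlp = 3 * hidden * intermediate      # gated MLP: gate, up, and down projections
blocks = layers * (attn + mlp)       # 12 * 9_437_184 = 113_246_208

embedding = vocab * hidden           # 98_500_608
lm_head = vocab * hidden + hidden    # untied head plus the final LayerNorm weight

print(f"{blocks + embedding + lm_head:,}")  # 310,248,192
```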

Is it possible that the weights for the Embedding and LM Head should be tied? That does not seem to be the case here.
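
A quick way to check the tying directly (a sketch, assuming the checkpoint loads with the standard Auto classes and that `get_output_embeddings()` returns the LM head for this custom architecture; the model id is assumed from the thread):

```python
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained(
    "EuroBERT/EuroBERT-210m", trust_remote_code=True  # model id assumed
)

emb = model.get_input_embeddings().weight
head = model.get_output_embeddings().weight
# Tied weights would share the same underlying storage.
print("tied:", emb.data_ptr() == head.data_ptr())
```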

lgcharpe changed discussion title from The model has 310M tokens to The model has 310M parameters
lgcharpe changed discussion status to closed
EuroBERT org

For encoders, we remove the LM head when using the model for embeddings or classification, so the head's parameters don't count. They are provided here only for convenience, in case people want to continue the pretraining.
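
A minimal sketch of the two loading paths (assuming the standard transformers Auto classes work for this checkpoint): `AutoModel` gives the bare encoder used for embeddings or classification, while `AutoModelForMaskedLM` keeps the head for continued pretraining.

```python
from transformers import AutoModel, AutoModelForMaskedLM

model_id = "EuroBERT/EuroBERT-210m"  # model id assumed from the thread

# Bare encoder: no LM head, what you'd use for embeddings or classification.
encoder = AutoModel.from_pretrained(model_id, trust_remote_code=True)

# Pretraining checkpoint: keeps the untied LM head.
mlm = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)

def count(m):
    return sum(p.numel() for p in m.parameters())

print(f"{count(encoder):,} encoder vs {count(mlm):,} with head")
```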

In fact, we could have called it a 110M-param model, because we often don't count the embedding parameters either: they don't enter the FLOPs computation for scaling laws, since the embedding is just one lookup.

In practice, the "non-embedding parameters" part is only ~110M params once you remove both the LM head and the embedding matrix.
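
As a concrete check, here is a hedged sketch that counts only the non-embedding parameters by filtering on parameter names (the `embed_tokens` and `lm_head` name fragments are assumptions; inspect `model.named_parameters()` for the actual names in this checkpoint):

```python
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained(
    "EuroBERT/EuroBERT-210m", trust_remote_code=True  # model id assumed
)

def non_embedding_params(model, skip=("embed_tokens", "lm_head")):
    # Name fragments in `skip` are assumptions; check model.named_parameters().
    return sum(p.numel() for name, p in model.named_parameters()
               if not any(s in name for s in skip))

print(f"{non_embedding_params(model):,}")  # roughly the ~110M quoted above
```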

That is what I figured out shortly after posting, which is why I closed the discussion before your response.

Thank you for answering!
