The model has 310M parameters
When I calculate the number of parameters this model has, I arrive at about 310M, which also seems to be the value reported by HuggingFace. Here is the calculation:
Attention + MLP parameters: (768 x 768 x 4 + 3 x 768 x 3072) x 12 = (2 359 296 + 7 077 888) x 12 = (9 437 184) x 12 = 113 246 208
Embedding: 128 256 x 768 = 98 500 608
LM Head (plus final LayerNorm): 128 256 x 768 + 768 = 98 500 608 + 768 = 98 501 376
Total: 113 246 208 + 98 500 608 + 98 501 376 = 211 746 816 + 98 501 376 = 310 248 192
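The arithmetic above can be reproduced with a short script (the hidden size 768, 12 layers, gated MLP with intermediate size 3072, and vocab size 128 256 are taken from the calculation; the "+768" is the final norm weight as counted above):

```python
# Parameter count sketch, following the breakdown in this thread.
hidden, layers, intermediate, vocab = 768, 12, 3072, 128256

attn = 4 * hidden * hidden           # Q, K, V, O projections: 2,359,296
mlp = 3 * hidden * intermediate      # gate/up/down projections: 7,077,888
blocks = layers * (attn + mlp)       # 113,246,208

embedding = vocab * hidden           # 98,500,608
lm_head = vocab * hidden + hidden    # plus final norm weight: 98,501,376

total = blocks + embedding + lm_head
print(f"{total:,}")                  # 310,248,192 ≈ 310M
print(f"{blocks:,}")                 # the "non-embedding" part, ~113M
```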
Is it possible that the weights for the Embedding and LM Head should be tied? That does not seem to be the case here.
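For reference, "tied" just means the embedding and LM head reference the same weight object, so a tied model stores roughly 98.5M fewer parameters. A minimal framework-free sketch (the `Linear` class and sizes are stand-ins, not the actual model code):

```python
# Illustration of weight tying: tied layers share one weight object.
class Linear:
    def __init__(self, weight):
        self.weight = weight

shared = [[0.0] * 768]                  # stand-in for a (vocab, hidden) matrix
embedding = Linear(shared)
lm_head_tied = Linear(shared)           # tied: same object, no extra storage
lm_head_untied = Linear([[0.0] * 768])  # untied: a separate copy

print(embedding.weight is lm_head_tied.weight)    # True
print(embedding.weight is lm_head_untied.weight)  # False
```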
For encoders, we remove the LM head when using the model for embeddings or classification, so the parameters of the head don't count. It is provided here only for convenience, in case people want to continue the pretraining.
In fact, we could have called it a 110M-param model, because we often count only the non-embedding parameters: the embedding doesn't enter the FLOPs computation for scaling laws, since it's just one lookup.
In practice, the "non-embedding parameters" part is only ~110M params once you remove the LM head and the embedding parameters.
That is what I figured a bit after posting the comment, which is why I closed it before your response.
Thank you for answering!