scaling laws
10 billion tokens on a sub-1M-param model? LLMs aren't that overparameterised lmao. Plus you have a 5090, so you can easily go up to 100M params for pretraining. Why don't you?
I can't speak for them, but it is probably more about efficiency and rapid prototyping.
I like experimenting with small models, and this is a quick way to do that.
The largest model is 50M :P
The goal is to see how good a super small model can get
I completely support the premise of this, but I would be very surprised if it could even form sentences.
This model kinda already does
10 billion tokens on a sub-1M-param model? LLMs aren't that overparameterised lmao. Plus you have a 5090, so you can easily go up to 100M params for pretraining. Why don't you?
That's what I'm trying to tell him, but he really doesn't want to.
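For anyone following the numbers in this thread, here is a rough sketch of the tokens-per-parameter arithmetic behind the objection. It assumes the commonly quoted rule of thumb of about 20 training tokens per parameter from the Chinchilla scaling-law work, and it assumes the same 10 billion token budget for every model size, which the thread does not actually state. The names `CHINCHILLA_RATIO` and `tokens_per_param` are made up for illustration.

```python
# Back-of-the-envelope tokens-per-parameter check for the sizes mentioned above.
# ASSUMPTION: ~20 tokens per parameter is the often-cited compute-optimal rule of
# thumb; it is a rough guide, not a hard limit.
CHINCHILLA_RATIO = 20

# ASSUMPTION: the same 10 billion token budget is used for every model size.
TOKENS = 10e9


def tokens_per_param(tokens: float, params: float) -> float:
    """Return how many training tokens each parameter sees on average."""
    return tokens / params


# "sub-1M" is treated as exactly 1M here, so the real ratio would be even higher.
for label, params in [("sub-1M (upper bound)", 1e6), ("50M", 50e6), ("100M", 100e6)]:
    ratio = tokens_per_param(TOKENS, params)
    multiple = ratio / CHINCHILLA_RATIO
    print(f"{label}: {ratio:,.0f} tokens/param, {multiple:,.0f}x the rule of thumb")
```

Under these assumptions the sub-1M model sees at least 10,000 tokens per parameter, far past the rule of thumb, while the 100M model would still be at about 100. That only says the small model is heavily over-trained relative to compute-optimal, not that the experiment is pointless.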
