The math behind 'bigger is better'
Research Paper: Established precise mathematical relationships (power laws) predicting how language model performance improves with model size, dataset size, and compute, providing the theoretical foundation for the race to scale.
Cross-entropy loss follows smooth power laws as a function of model parameters (N), dataset size (D), and compute budget (C).
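Schematically, each power law takes the form below. The exponents are the approximate fitted values reported in the original scaling-laws paper (Kaplan et al., 2020), quoted here for illustration; the constants N_c, D_c, and C_c are fitted scale parameters:

```latex
\begin{align*}
L(N) &= \left(\frac{N_c}{N}\right)^{\alpha_N}, & \alpha_N &\approx 0.076 \\
L(D) &= \left(\frac{D_c}{D}\right)^{\alpha_D}, & \alpha_D &\approx 0.095 \\
L(C) &= \left(\frac{C_c}{C}\right)^{\alpha_C}, & \alpha_C &\approx 0.050
\end{align*}
```

The key property is smoothness: loss falls predictably over many orders of magnitude in N, D, and C, which is what makes extrapolation to larger models credible.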
If you have $X to spend on compute, the scaling laws tell you how big to make your model and how much data to train it on. This turned AI development, at least in part, from an art into an engineering discipline.
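A minimal sketch of that allocation logic, using the common approximation that training cost is C ≈ 6·N·D FLOPs and a power-law rule N ∝ C^a for the optimal model size. The exponent and coefficient here are illustrative assumptions, not fitted values from any paper (a = 0.5 corresponds to scaling parameters and tokens equally with compute):

```python
def optimal_allocation(compute_flops, a=0.5, k_n=0.1):
    """Split a FLOP budget between parameters (N) and tokens (D).

    Assumes C ~= 6 * N * D and an optimal-model-size rule N = k_n * C**a.
    Both `a` and `k_n` are hypothetical constants for illustration.
    """
    n_params = k_n * compute_flops ** a        # model size grows as C^a
    n_tokens = compute_flops / (6 * n_params)  # remaining budget buys data
    return n_params, n_tokens

# Example: a 1e21-FLOP budget.
params, tokens = optimal_allocation(1e21)
print(f"params: {params:.2e}, tokens: {tokens:.2e}")
```

Doubling the compute budget then prescribes a specific larger model and a specific larger dataset, rather than leaving those choices to trial and error.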