Tag

#adam optimizer

1 article

Stochastic Gradient Descent's (SGD) Frequency Bias and How Adam Fixes It

This article explains how stochastic gradient descent (SGD) creates a frequency bias in language models, where common words are learned better than rare ones. It then shows how the Adam optimizer reduces this bias by adapting the step size per parameter, which effectively gives rare tokens larger updates.

May 1839