Kaifeng Lyu FPO
Location:
The members of
Examiners:
Readers:
A copy of
Everyone is invited to attend
Abstract follows below:
Deep learning has achieved remarkable success in recent years, yet training neural networks often involves a delicate combination of guesswork and hyperparameter tuning. A critical aspect of this process is the “implicit bias” of optimization methods: minor changes in the optimization setup, while leaving the (already small) training loss at convergence essentially unchanged, can drastically shift the solution to which the model converges and thereby affect test performance. This dissertation presents a collection of results that mathematically characterize this implicit bias in various training regimes.
The first part of this dissertation explores how gradient descent, even without explicit regularization, can converge to solutions that maximize the margin. Previous results have established the first-order optimality of margin for homogeneous neural networks in general, but the global optimality of margin is not guaranteed because these networks are non-convex. This dissertation provides in-depth theoretical analyses for data with simple structure: for linearly separable data, we present both positive and negative results on whether the global optimality of margin can be attained. Furthermore, we show how this margin-based view can explain interesting generalization phenomena in training neural networks with or without explicit regularization, including the simplicity bias and grokking phenomena.
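To make the margin notion concrete, here is a minimal sketch of the standard normalized margin for an L-homogeneous classifier f (i.e., f(c\theta; x) = c^L f(\theta; x) for all c > 0) on binary-labeled data (x_i, y_i) with y_i \in \{\pm 1\}; the dissertation's exact setting and assumptions may differ:

\gamma(\theta) \;=\; \min_i \; \frac{y_i \, f(\theta; x_i)}{\|\theta\|_2^{L}}

In this formulation, margin maximization means that the parameter direction \theta / \|\theta\|_2 approaches a (first-order) stationary point of the problem of maximizing \gamma over directions.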
The second part of the dissertation presents two results that capture the implicit biases induced by finite learning rates. Many existing analyses, including the margin-based ones in the first part, describe implicit biases that hold even when the learning rate is infinitesimal. However, practical implementations use finite learning rates, which have been empirically observed to benefit generalization. We analyze how full-batch GD with a finite learning rate, combined with key training components such as normalization layers and weight decay, creates a bias towards flatter minima, which are positively correlated with better generalization. Additionally, we study the implicit bias in stochastic optimization and derive rigorous approximations for the dynamics of adaptive gradient methods like Adam and RMSprop via Stochastic Differential Equations (SDEs) that capture the effect of finite learning rates. Based on this analysis, we also derive the square root scaling rule as a practical guideline for adjusting the optimization hyperparameters of adaptive gradient methods when the batch size changes.
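As an illustration of how such a scaling rule might be applied, the following Python sketch scales the learning rate by the square root of the batch-size ratio. The helper name scale_adam_hyperparams and the treatment of the momentum parameters are illustrative assumptions, not the dissertation's exact prescription; consult the dissertation for the precise rule.

import math

def scale_adam_hyperparams(lr, beta1, beta2, old_batch_size, new_batch_size):
    """Hypothetical helper: adjust Adam-style hyperparameters when changing batch size.

    Sketch of one reading of a square-root scaling rule; the exact treatment of
    the momentum parameters (and epsilon) should be taken from the dissertation.
    """
    kappa = new_batch_size / old_batch_size
    # Core of the square-root rule: learning rate scales with sqrt(batch-size ratio).
    new_lr = lr * math.sqrt(kappa)
    # Assumed companion rule: 1 - beta scales linearly with kappa, so the
    # exponential-averaging time scale is preserved in the SDE view.
    new_beta1 = 1.0 - kappa * (1.0 - beta1)
    new_beta2 = 1.0 - kappa * (1.0 - beta2)
    return new_lr, new_beta1, new_beta2

# Example: going from batch size 256 to 1024 (kappa = 4) doubles the learning rate.
print(scale_adam_hyperparams(lr=1e-3, beta1=0.9, beta2=0.999,
                             old_batch_size=256, new_batch_size=1024))

In this example the learning rate goes from 1e-3 to 2e-3, in contrast to the linear scaling rule commonly used for SGD, which would quadruple it.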