May 23, 2025
Second order optimizers are having a bit of a moment right now, with Shampoo and Muon rising in popularity due to people basically just trying them out at scale.
https://x.com/ashVaswani/status/1919809323737174259
This is great! Second order methods hold a lot of promise: they have theoretical footholds in statistical theory via e.g. the Fisher information metric, and are a general way to “explode” the information that an optimizer has about the landscape to really make things more efficient…they also have a place near and dear to my heart, having been the workhorses I relied on back in the day (L-BFGS, thank you for your service my friend).
Now that they are going mainstream, it seems reasonable to ask: what’s next? You could, of course, do the obvious next thing and ask whether even higher order optimization methods would help…and you’ll find there’s not much literature about them. Things just become too complicated at that point: complexity and computational requirements explode, and there aren’t even satisfying theoretical connections to other frameworks to justify the whole endeavor.
Let us first think: what is optimal optimization? There are many ways to answer this question, but one of them could be that (1) the optimization method leverages the maximal amount of information at any given time to (2) obtain the most efficient path to generalization. In that sense, higher order methods expand the information obtained about the local neighborhood by taking not only directional guidance from the loss function but also the curvature it induces, which distorts the paths that can pass through it.
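To make this concrete, here is a minimal sketch on a toy quadratic (all values chosen arbitrarily for illustration) of how a second order step folds curvature into the update where a first order step only follows the direction:

```python
import numpy as np

# Minimal sketch: first- vs second-order steps on a toy quadratic
# L(theta) = 0.5 * theta^T A theta, where the Hessian A encodes the
# curvature that distorts paths through the landscape. Values are arbitrary.
A = np.array([[10.0, 0.0],
              [0.0, 0.1]])   # badly conditioned curvature

def grad(theta):
    return A @ theta

theta = np.array([1.0, 1.0])
eta = 0.05

gd_step = theta - eta * grad(theta)                    # first order: direction only
newton_step = theta - np.linalg.solve(A, grad(theta))  # second order: curvature-corrected
```

On this quadratic the curvature-corrected step rescales each direction by the inverse Hessian and lands on the minimum in one move, while plain gradient descent crawls along the ill-conditioned axis.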
The spirit of higher order methods, then, is to capture as much structure as possible of these path distortions. Well, what if we follow this to the extreme and try to optimize for the entire path itself? Analogous to the principle of least action, this would posit the optimization procedure as an exercise in mechanics. Such a framework is, of course, nothing new, as neural optimization has been variously studied through the lens of Langevin dynamics. Additionally, the very idea of momentum stems from keeping some record of the past path of the trajectory to find a way forward and avoid getting stuck in unpromising regions.
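As a tiny illustration (constants are placeholders, not recommendations), momentum’s record of the past path is just an exponentially decaying sum of gradients:

```python
# Minimal sketch: momentum as a decaying record of the past trajectory.
# beta and eta are illustrative placeholder values.
def momentum_step(theta, g, velocity, beta=0.9, eta=0.01):
    velocity = beta * velocity + g   # exponentially decaying path memory
    return theta - eta * velocity, velocity
```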
But we can go much further than that. Usually, we consider optimizers as producing updates of the form:
$$ \theta_{t+1} = \theta_t - \eta_t G_t(L(\theta_t)) $$
where $G_t$ is some function that depends on the gradient of the loss function $L$ and/or its higher order moments. Additionally, this update is usually stochastic, following minibatch sampling, and involves other things like momentum history (which for simplicity I will arbitrarily jam into $G_t$).
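To see how familiar optimizers instantiate $G_t$, here is a minimal sketch (all names and constants are my own, purely for illustration): plain SGD passes the raw gradient through, while an Adam-like rule folds first and second moments into the same template.

```python
import numpy as np

# Minimal sketch: two instantiations of G_t in theta_{t+1} = theta_t - eta_t * G_t.
# Names and constants are illustrative assumptions, not a canonical API.
def G_sgd(g, state=None):
    return g, state                          # G_t is just the raw minibatch gradient

def G_adam_like(g, state, beta1=0.9, beta2=0.999, eps=1e-8):
    m, v = state
    m = beta1 * m + (1 - beta1) * g          # first moment: momentum history
    v = beta2 * v + (1 - beta2) * g**2       # second moment of the gradient
    return m / (np.sqrt(v) + eps), (m, v)    # preconditioned update direction
```

Thus, with some mathematical treatment, such updates can be approximated by trajectories that follow a stochastic differential equation, typically of the form of a drift-diffusion equation: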
$$ d\theta_t = \mu(\theta_t, G_t, t) dt + \Sigma(\theta_t, G_t, \eta_t, t)dW_t $$
where $\mu, \Sigma$ are drift and diffusion functions (manifested as e.g. moments of the stochastic updates) and $W$ is a Wiener process. At first glance one would take such approximations with a grain of salt, given that SGD processes are not necessarily Wiener (Gaussian)-like, but they generally work well enough as proxies to study the stochastic trajectories of optimization methods.
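For intuition, here is a minimal sketch of simulating such a trajectory with Euler-Maruyama, on a toy one-dimensional quadratic loss with constant diffusion (all values are assumptions for illustration):

```python
import numpy as np

# Minimal sketch: Euler-Maruyama discretization of d(theta) = mu dt + sigma dW,
# as a proxy for a stochastic optimizer's trajectory. The drift here is the
# gradient flow of a toy loss L(theta) = 0.5 * theta^2; all constants are assumed.
rng = np.random.default_rng(0)

def mu(theta):
    return -theta            # drift: negative gradient of the toy loss

def sigma(theta):
    return 0.1               # constant diffusion, for simplicity

theta, dt = 1.0, 1e-2
for _ in range(1000):
    dW = rng.normal(scale=np.sqrt(dt))            # Wiener increment ~ N(0, dt)
    theta += mu(theta) * dt + sigma(theta) * dW   # one Euler-Maruyama step
```

Varying the diffusion term is then a direct handle on how much the simulated trajectory rattles around the minimum, mirroring the effect of minibatch noise on the real optimizer.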