There’s a notion in ML called the hardware lottery: the methods that win in the end are the ones best suited to the current compute scaling paradigm. Experiments with those methods run faster and they’re easier to scale, so improvements compound on short timescales. That’s why deep learning won: it won the GPU hardware lottery. That’s why transformers won too: they were the best adaptation for autoregressing over sequences on a GPU.
But there is also a learning lottery. Some learning methods perform better than others, and the architectures that work best with them (good defaults, easy to tune, fast to iterate on) are the ones that end up dominating. In the past the learning lottery was won by methods like L-BFGS, Nelder-Mead, and the gang. For Bayesian methods, the current learning lottery is NUTS samplers, and so on for different types of problems.
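To make that concrete, here is a minimal sketch of two of those classic winners. Both ship as well-tuned defaults in SciPy, which is a big part of why they stuck around; the test function and starting point are just illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def rosenbrock(x):
    # Classic non-convex test function with a narrow curved valley.
    return (1 - x[0]) ** 2 + 100 * (x[1] - x[0] ** 2) ** 2

x0 = np.array([-1.2, 1.0])

# Quasi-Newton method: fast when the objective is smooth
# (gradients are approximated by finite differences here).
res_lbfgs = minimize(rosenbrock, x0, method="L-BFGS-B")

# Derivative-free simplex search: slower, but needs nothing
# beyond function evaluations.
res_nm = minimize(rosenbrock, x0, method="Nelder-Mead")

print(res_lbfgs.x, res_lbfgs.nfev)  # near [1, 1] in few evaluations
print(res_nm.x, res_nm.nfev)        # near [1, 1], many more evaluations
```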
Now, what if the kind of problem you are interested in is achieving AGI? AGI is all about learning to deal with novelty, and so far we only know of a few learning paradigms that let you do that:
Of the three, RL is the only one with a solid track record of finding golden-nugget policies that generalize well within a specific environment and produce surprising, superhuman results (there is another world where DeepMind focuses on evolution strategies (ES) instead of RL, but that is a story for another time).
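For readers who haven’t met ES: here is a toy sketch of the basic loop, in the spirit of Salimans et al. 2017, with a stand-in reward function instead of a real environment. All numbers are illustrative.

```python
import numpy as np

# Toy evolution strategies (ES) loop: estimate the gradient of expected
# reward from random perturbations of the policy parameters -- no backprop
# through the environment required.
rng = np.random.default_rng(0)

def reward(theta):
    # Stand-in for an environment rollout; reward peaks at theta = 3.
    return -np.sum((theta - 3.0) ** 2)

theta = np.zeros(5)               # "policy" parameters
sigma, lr, pop = 0.1, 0.02, 50    # noise scale, learning rate, population

for step in range(300):
    eps = rng.standard_normal((pop, theta.size))
    rewards = np.array([reward(theta + sigma * e) for e in eps])
    # Normalizing rewards keeps the step size well behaved.
    rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # ES gradient estimate: reward-weighted average of the perturbations.
    theta += lr / (pop * sigma) * (rewards @ eps)

print(theta.round(2))  # approaches [3. 3. 3. 3. 3.]
```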
My hypothesis here is that AGI necessarily needs a method that fulfills the RL lottery. Indeed, the models that most resemble AGI today are LLMs that have been post-trained with some limited form of RL. LLMs by themselves, the base models, are a far cry from AGI, as supervised learning will never yield a generalist policy.
And yet LLMs are far from fulfilling the RL lottery; they’re pretty terrible at it, actually. They’re big and bulky, and if you use copies of one to model the trifecta of dynamics, reward, and policy that state-of-the-art RL methods need, you will quickly run out of memory and of time to experiment. So my guess is that there’s something else we’re missing, architecturally.
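A rough back-of-envelope sketch of why, with hypothetical numbers (a 70B-parameter model trained with Adam in bf16; activations and KV caches not even counted):

```python
# Hypothetical memory cost of running three LLM-sized copies as the
# dynamics model, reward model, and policy. Numbers are assumptions.
PARAMS = 70e9          # assume a 70B-parameter model
BYTES_BF16 = 2         # weights stored in bfloat16
ADAM_STATE = 12        # fp32 master weights + two moments, bytes per param

def training_gb(params):
    # Weights + Adam optimizer state only; activations and KV caches
    # would add substantially more on top of this.
    return params * (BYTES_BF16 + ADAM_STATE) / 1e9

one_copy = training_gb(PARAMS)
print(f"one copy: {one_copy:,.0f} GB")       # ~980 GB
print(f"trifecta: {3 * one_copy:,.0f} GB")   # ~2,940 GB before activations
```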
An architecture that wins the RL lottery is, in my view, the one that will win the AGI race.