Everything's different and nothing has changed
After a hiatus, back to language and reasoning. New axes of inquiry and notes on research trends for 2025.
After two years of working on diffusion, I am returning to the exploration of research problems in reasoning and language modelling, with the aim of working on something new in the space. The field has evolved substantially (understatement) during this period — big labs have closed up on experimental details while, conversely, open-source has exploded; pre-training is allegedly out, post-training is unequivocally in; RL is rearing its head once again; and we’ve figured out all sorts of nice optimizations for memory efficiency and inference speed that have massively unlocked real-world utility for many types of model.
Most importantly, the scope of possibility, and consequently the scope of broader ambition in the field, has blown wide open. For an uncomfortably long period of time, most active threads of research had seemingly converged on pre-training and scaling, which a select handful of labs were well-positioned to take advantage of. Heading into 2025, there is a renewed sense of optimism emerging across both industry and academia that there are new footholds to establish and new axes of inquiry along which to orient ourselves.
I personally feel extremely excited about the next few cycles of research and technological progress that lie ahead. There are a few (imo) largely under-explored areas of research, plus a few areas that are now widely discussed but lack clear consensus. In particular, I want to highlight a short and evolving list of things that I am currently very interested in exploring, ranked by priority:
Open-ended reasoning — how do we make the inevitable jump from building reasoning systems capable of solving difficult but easily verifiable problems (for example, coding and mathematics) to ones capable of generating solutions to problems that are both difficult to solve and difficult to verify (for example, the natural sciences)?
Meta-generation algorithms, test-time compute, and search — given that we can generate arbitrarily large volumes of text (“information”), how do we effectively and efficiently search over that space for optimal solutions? This is interesting to me at all levels of abstraction, from tokens, to sequences, to meta-structures such as strategies/plans; a sketch of the simplest instance, best-of-N sampling, follows this list. A bonus question of interest is whether this kind of search is better done in latent space or in token space.
The subversion of scaling — from a high-level perspective, are there still meaningful gains to be gleaned from pushing the envelope on scale, relative to capital expenditure? From a technical perspective, there have been a number of recent and recent-ish papers proposing a variety of architectural innovations that may or may not be useful to this end — SSMs, BLT, etc. On my reading list anyways, regardless of utility wrt this topic.
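To make the search framing in the second item concrete, here is a minimal sketch of best-of-N sampling, arguably the simplest meta-generation algorithm: sample N candidate sequences, score each with a verifier, keep the best. The `generate` and `score` callables (and the toy number-guessing demo) are illustrative stand-ins of my own, not anything specified above; in practice they would wrap an LM sampler and a learned or programmatic verifier.

```python
# Minimal best-of-N "meta-generation" sketch: sample N candidate sequences from a
# base generator, score each with a verifier/reward function, and keep the argmax.
# `generate` and `score` are stand-in callables used purely for illustration.
import random
from typing import Callable, List


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],       # samples one candidate continuation
    score: Callable[[str, str], float],   # scores a (prompt, candidate) pair
    n: int = 8,
) -> str:
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))


if __name__ == "__main__":
    # Toy stand-ins: a "generator" that guesses integers and a "verifier" that
    # prefers guesses close to a hidden target -- purely illustrative.
    target = 42
    toy_generate = lambda _prompt: str(random.randint(0, 100))
    toy_score = lambda _prompt, cand: -abs(int(cand) - target)
    print(best_of_n("guess the number", toy_generate, toy_score, n=16))
```

More interesting meta-generation schemes (beam-style search over partial sequences, search over plans rather than tokens) differ mainly in what unit is scored and when, but the generate-then-select skeleton above is the common starting point.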
Each item on this list warrants a fully-fledged post of its own, which I’ll be working towards over the next few months. In the interim, the writing here is expected to be predominantly technical — expect a few paper summaries, a lot of back-of-the-napkin math, and the occasional take on higher level trends.
Writing online is in part a mechanism for thinking and documenting, but also a call out to others who might also be interested in thinking about (even better, working on) similar threads. If this is you, you should email me :-)
See you all in 2025.