Kishen Patel

NeurIPS


NeurIPS, the world’s largest AI and machine learning conference, was held last week in San Diego. This year’s edition felt particularly significant: over 26,000 attendees packed the convention center, and the crowd skewed as much toward investors and industry as toward researchers, if not more so, reflecting the intense focus on AI these days.

At a meta-level, the field feels like it is at an inflection point: many of the scaling assumptions of the past few years feel exhausted or at least delayed by what hardware can actually deliver. I lost count of how many conversations circled back to this same sentiment. The natural response, it seemed, was a pragmatic pivot: instead of reaching for more compute or more data, everyone was asking how to squeeze more from what we already have.

If compute-scaling hits a wall, how do we use hardware more efficiently? If we’ve scraped the internet dry, how do we get more learning turns from the same data? If algorithmic advances are slowing, what new architectures will actually move the needle? These questions framed the conference and pointed toward a future where resourcefulness matters more than raw scale.

Of the research themes that kept surfacing, three stood out:

Rethinking Model and System Evaluation

There’s an evaluation crisis brewing. Most benchmarks shaping the public narrative around LLMs were created years ago for models a fraction of the size we have today. They suffer from two fundamental problems.

First, scale mismatch: models trained on trillions of tokens are evaluated on test sets of just thousands of examples. The results are statistically fragile at best. Second, contamination is everywhere. Public benchmarks have leaked into training corpora, inflating scores that don’t reflect genuine generalization. Experiments showed substantial inflation for open-weight models, and speakers warned that contamination propagates subtly. Even synthetic data from contaminated models can poison downstream systems.

Then there’s the science of measurement itself. One tutorial argued that LLM evaluation lacks empirical rigor: metrics get reported without confidence intervals, so apparent score differences may just be statistical noise. The field is shifting toward grounded, task-specific evaluations that measure performance on economically meaningful work. Scale AI’s Remote Labor Index was a good example: it tests agents on actual freelance marketplace tasks. The takeaway was that model value isn’t determined by marginal leaderboard improvements, but by performance on tasks that matter to end users and businesses.
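To make the confidence-interval point concrete, here is a minimal sketch (my own illustration, not code from the tutorial) of reporting a benchmark score with a percentile-bootstrap interval over per-example correctness. On a test set of only a few thousand examples, a one- or two-point gap between models can easily sit inside overlapping intervals.

```python
import numpy as np

def score_with_ci(correct, n_boot=10_000, alpha=0.05, seed=0):
    """Mean benchmark accuracy plus a percentile-bootstrap confidence interval,
    given a 0/1 correctness vector with one entry per test example."""
    rng = np.random.default_rng(seed)
    correct = np.asarray(correct, dtype=float)
    # Resample the per-example scores with replacement and recompute the mean.
    boot_means = rng.choice(correct, size=(n_boot, correct.size)).mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return correct.mean(), (lo, hi)

# Illustration with synthetic 0/1 scores on a 1,000-example "benchmark":
acc, (lo, hi) = score_with_ci(np.random.default_rng(1).integers(0, 2, size=1000))
print(f"accuracy = {acc:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```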

RL and the Rise of Continual Agentic Systems

Continual learning emerged as a clear priority. Agents should learn from experience over time (and meta-learn how to generalize) rather than through one-off training runs. In his invited talk, Rich Sutton argued that robust intelligence requires agents with world models, planning abilities, and continuous learning capacity. His proposed OaK architecture was one approach: a model-based RL system in which every component, including the learning rates themselves, adapts during online training.

Other sessions highlighted efforts to connect LLMs and vision models to embodied intelligence. Tesla’s robotics panel detailed their work training large-scale multimodal models for end-to-end “pixels-to-actuation” robot control. Autonomous vehicle discussions emphasized open challenges in scaling to full self-driving: architecture choices, the balance between imitation learning and RL, and the infrastructure required for large-scale simulation.

What’s missing and creating bottlenecks? Infrastructure. Beyond training environments, the field needs better simulators, reproducible safety benchmarks, and standardized environment-model integration.

Scaling Architecture and Efficiency

Targeted architectural changes can yield significant gains. One award-winning paper showed that adding a head-specific gating mechanism to standard attention layers notably improves performance. This simple sigmoid “gate” introduces sparsity and non-linearity, enabling more stable training and better long-context handling. The technique has already been adopted in models like Qwen-3 and helps mitigate the “attention sink” effect (where models overfocus on a small subset of tokens). The broader trend is active experimentation with modified attention, memory mechanisms, and alternatives to the standard Transformer architecture. Google debuted its Titans architecture and MIRAS framework as examples of this direction.
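For a sense of how small the change is, here is a rough PyTorch sketch of head-wise output gating on top of standard multi-head attention. The weight names and gate placement are my own illustration, not the paper’s exact formulation; all projection matrices are assumed to be D x D.

```python
import torch
import torch.nn.functional as F

def gated_attention(x, Wq, Wk, Wv, Wg, Wo, n_heads):
    """Multi-head causal attention whose per-head outputs are modulated by a
    sigmoid gate computed from the layer input (illustrative sketch only)."""
    B, T, D = x.shape
    d = D // n_heads
    split = lambda h: h.view(B, T, n_heads, d).transpose(1, 2)  # (B, H, T, d)
    q, k, v = split(x @ Wq), split(x @ Wk), split(x @ Wv)
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    gate = torch.sigmoid(split(x @ Wg))   # head-specific gate in (0, 1)
    out = out * gate                      # adds non-linearity; lets a head switch off
    return out.transpose(1, 2).reshape(B, T, D) @ Wo
```

Roughly, because the gate can drive a head’s output toward zero on its own, tokens no longer need to dump attention mass onto a “sink” token to mute a head, which is the intuition behind the attention-sink mitigation.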

On the efficiency front, Microsoft researchers presented orthonormal-update optimizers such as Muon and Dion as potential successors to AdamW for large-scale training, citing improved convergence in deep or distributed settings. That an entire session was devoted to replacing Adam signals how seriously the field takes any speed or stability edge. On the inference side, a tutorial on test-time compute covered caching, retrieval, and on-demand computation techniques, and open-source serving projects like vLLM and SGLang were top of mind for many.
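As a rough illustration of the orthonormal-update idea (a simplified sketch, not the actual Muon or Dion implementations): instead of applying the raw momentum to a weight matrix, the update direction is first pushed toward the nearest orthogonal matrix, for example via a Newton-Schulz iteration that avoids an explicit SVD.

```python
import numpy as np

def orthogonalize(M, steps=10):
    """Approximate the orthogonal polar factor of M (U @ V.T from its SVD)
    with a cubic Newton-Schulz iteration instead of an explicit SVD."""
    X = M / (np.linalg.norm(M) + 1e-8)  # Frobenius norm keeps singular values <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def orthonormal_step(W, grad, momentum, lr=0.02, beta=0.95):
    """One simplified orthonormal-update step on a 2D weight matrix:
    momentum on the raw gradient, then an orthogonalized update direction."""
    momentum = beta * momentum + grad
    return W - lr * orthogonalize(momentum), momentum
```

The rough intuition is that every direction in the update gets a comparable scale, which is where the claimed stability at large scale comes from.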

Where This Leaves Us

The convergent theme is maturation. We’re shifting from scaling what works to understanding why it works and then making it work better. The evaluation crisis forces us to confront what we actually want from these systems as opposed to what we can measure. The push toward continual agents recognizes that intelligence is fundamentally about adaptation, not pattern matching. And the architectural innovations suggest that even our most fundamental building blocks are still open to reinvention.

NeurIPS 2025 felt like a necessary recalibration. The community is homing in on a nuanced set of principles: efficiency, robustness, and genuine capability over leaderboard points.

The path forward looks less like a straight line upward and more like a careful, deliberate hill climb.