Yann LeCun has been making this argument for years, and the industry keeps not listening. Next-token prediction is an architectural dead end — not because it doesn’t work, but because it works at a cost that scales in the wrong direction. Each capability gain demands disproportionately more compute. More data. More energy. More carbon. The ceiling isn’t intelligence. The ceiling might be the electricity grid.
The hundred-billion-dollar bet
Something like $100 billion in data center construction is currently committed to scaling transformer-based next-token prediction. Microsoft, Google, Amazon, Oracle — they are building power infrastructure that rivals that of small countries. The assumption underneath all of it is that the current architecture, pushed further, will produce the next capability threshold.
LeCun’s counterargument centers on JEPA — Joint Embedding Predictive Architectures. Instead of predicting the next token in a sequence, predict the next representation in a learned embedding space. The intuition is that biological intelligence doesn’t process raw sensory input token by token. It builds world models. It predicts at a higher level of abstraction. And prediction in embedding space might be fundamentally more efficient than prediction in token space because the model doesn’t waste capacity on irrelevant detail.
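To make the contrast concrete, here is a toy sketch in plain NumPy (not LeCun's architecture; the dimensions and weights are invented) showing what each approach has to produce at every prediction step: a full distribution over the vocabulary versus a single vector in embedding space.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = 50_000   # token-space: every step scores the full vocabulary
EMBED = 256      # embedding-space: every step predicts one d-dim vector

hidden = rng.normal(size=EMBED)  # stand-in for the model's current state

# Next-token prediction: project the state onto all 50,000 tokens and
# normalize. The model pays for the full distribution whether or not
# the task needs that level of surface detail.
token_logits = rng.normal(size=(VOCAB, EMBED)) @ hidden
token_probs = np.exp(token_logits - token_logits.max())
token_probs /= token_probs.sum()

# JEPA-style prediction: a small predictor maps the state to the
# *representation* of what comes next; the loss is a distance in
# embedding space, so capacity isn't spent on irrelevant detail.
predictor = rng.normal(size=(EMBED, EMBED)) / np.sqrt(EMBED)
predicted_repr = predictor @ hidden
target_repr = rng.normal(size=EMBED)        # output of a (frozen) target encoder
embedding_loss = np.mean((predicted_repr - target_repr) ** 2)

print(f"token-space output size:     {VOCAB}")
print(f"embedding-space output size: {EMBED}")
print(f"embedding loss (toy):        {embedding_loss:.3f}")
```

The point of the toy is the output shapes: one approach pays for 50,000 scores per step whether or not the task needs them, the other pays for 256 numbers. Whether that difference actually translates into 5-10x less training energy is exactly the open question.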
Not sure who’s right. Nobody is, yet. But the carbon data from CarbonBench adds a dimension to this debate that hasn’t been discussed much: if a more efficient architecture ran on a clean grid, the savings would compound.
Compounding the wrong way
Consider the current scaling trajectory. GPT-4 training reportedly consumed around 50 GWh. GPT-5 estimates range from 100 to 200 GWh. Each generation roughly doubles or triples the energy requirement. If the grid powering that training is carbon-intensive — and Virginia, where a lot of this capacity is being built, runs a grid at roughly 339 gCO2/kWh — the carbon scales at the same rate as the compute.
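The arithmetic is simple enough to write down. A minimal sketch, using the estimates quoted above (they are estimates, not disclosed figures):

```python
# Back-of-the-envelope: training carbon = energy * grid intensity.
# The figures below are the estimates quoted in the text, not measurements.

def training_carbon_tonnes(energy_gwh: float, grid_gco2_per_kwh: float) -> float:
    """Convert a training run's energy and grid intensity into tonnes of CO2."""
    kwh = energy_gwh * 1_000_000          # 1 GWh = 1,000,000 kWh
    grams = kwh * grid_gco2_per_kwh
    return grams / 1_000_000              # 1 tonne = 1,000,000 g

VIRGINIA = 339  # gCO2/kWh, rough grid intensity cited above

for label, gwh in [("GPT-4 (~50 GWh)", 50),
                   ("GPT-5 low (100 GWh)", 100),
                   ("GPT-5 high (200 GWh)", 200)]:
    print(f"{label}: {training_carbon_tonnes(gwh, VIRGINIA):,.0f} tCO2 on Virginia's grid")
```

Roughly 17,000 tonnes of CO2 for the GPT-4-scale run, and the doubling of energy per generation carries the carbon along with it.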
Now consider two variables changing simultaneously. First, a more efficient architecture — say, something JEPA-inspired that achieves comparable capability at 5-10x less compute. That alone would be significant. But second, routing that reduced compute to a clean grid — the Netherlands at 129 gCO2/kWh, or Nordic regions even lower. The energy reduction and the grid reduction multiply together.
Back-of-the-envelope: if a JEPA-style model uses 5x less energy and runs on a grid that’s 3x cleaner, the carbon cost is 15x lower. That’s not additive. It’s multiplicative. And neither factor is hypothetical in isolation — more efficient architectures exist in research, and clean grids exist right now.
The problem is that the current scaling approach compounds in the opposite direction. More energy times a dirty grid. Each factor amplifying the other.
Energy consumption as architectural failure
This framing might be too strong, but it keeps coming back. What if the energy intensity of current LLMs isn’t just an engineering problem to be solved with better hardware and bigger solar farms? What if it’s a signal that the architecture itself is doing unnecessary work?
Next-token prediction models the full surface of language — every syntactic variation, every stylistic quirk, every possible continuation. Most of that surface is irrelevant to any given task. A model that answers “what’s the capital of France” doesn’t need to have learned the probability distribution over all possible next tokens in all possible contexts. It needs a world model that contains the relationship between France and Paris.
LeCun’s argument, roughly, is that autoregressive models are doing something like trying to predict the exact next frame of a video pixel by pixel, when what you actually need is to predict that the ball will continue moving to the right. The pixel-level prediction wastes enormous capacity on irrelevant visual detail. The embedding-level prediction captures the physics with a fraction of the compute.
Whether JEPA specifically is the right alternative is genuinely unclear. The V-JEPA results on video understanding are interesting but not conclusive. What seems more defensible is the general claim: an architecture that predicts at the right level of abstraction should be more energy-efficient than one that predicts at too low a level. And energy efficiency, in the context of carbon, is the whole game.
The scaling debate is also a geography debate
Something the CarbonBench data makes visible: the scaling debate isn’t just about how much compute. It’s about where the compute runs. This seems like an obvious point but it’s absent from most discussions of AI scaling laws.
A Chinchilla-optimal training run on Virginia’s grid produces roughly 2.6x the carbon of the same run on GCP in the Netherlands. Not a different model. Not a different dataset. Not a different number of tokens. The same run, different power plants.
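Here is that comparison as a sketch. The intensities are the figures used elsewhere in this piece; the 80 GWh run size is a hypothetical stand-in for a Chinchilla-optimal run, not a measurement:

```python
# Same tokens, same model, different power plants: the only variable is the grid.
RUN_GWH = 80  # hypothetical training run size
GRIDS = {"Virginia (us-east)": 339, "Netherlands (GCP europe-west4)": 129}  # gCO2/kWh

for region, intensity in GRIDS.items():
    tonnes = RUN_GWH * 1_000_000 * intensity / 1_000_000   # kWh * g/kWh -> g -> tonnes
    print(f"{region:<32} {tonnes:>9,.0f} tCO2")

ratio = GRIDS["Virginia (us-east)"] / GRIDS["Netherlands (GCP europe-west4)"]
print(f"ratio: {ratio:.1f}x")   # ~2.6x for the identical run
```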
The $100 billion in data center investment is disproportionately going to regions with carbon-intensive grids. Virginia, Texas, parts of the Midwest. These locations were chosen for land cost, tax incentives, network connectivity, and existing power capacity. Carbon intensity of the grid was not, as far as anyone can tell, a primary factor in site selection for most of these facilities.
This means the scaling trajectory has a carbon multiplier baked into its geography. Even if the architecture stays the same and the hardware gets more efficient — which it will; Blackwell already delivers more FLOPs per watt than Hopper — the grid underneath can erase much of that gain.
Three things that could compound together
The most optimistic scenario involves three factors multiplying, not just one:
More efficient architectures — whether JEPA or something else — that achieve current-generation capability at a fraction of the compute. Maybe 5-10x reduction. This is speculative but supported by the general observation that autoregressive prediction in token space is not the only way, and probably not the most efficient way, to build world models.
Clean grid routing — directing training and inference to regions where the electricity is primarily renewable. This is not speculative at all. The data already shows a 4x spread between the cleanest and dirtiest grids available through major cloud providers. This optimization is free and available today.
Time-shifting workloads — scheduling batch training and inference for the hours when grid carbon intensity is lowest. CarbonBench tracks the 24-hour carbon curve for each region. The swing between peak and trough is typically 30-50%. This, too, is free and available today; a toy scheduling sketch follows this list.
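The scheduling logic is almost embarrassingly simple. A toy version, with an invented intensity curve standing in for the real per-region data CarbonBench tracks:

```python
# Toy scheduler: given a 24-hour carbon-intensity curve (gCO2/kWh),
# find the cleanest contiguous window for a batch job. The curve below
# is invented for illustration only.

def window_avg(hourly: list[float], start: int, hours: int) -> float:
    """Average intensity of a window starting at `start`, wrapping past midnight."""
    return sum(hourly[(start + h) % 24] for h in range(hours)) / hours

def best_and_worst(hourly: list[float], hours: int):
    """Return the (start, avg) pair for the cleanest and dirtiest windows."""
    averages = [(start, window_avg(hourly, start, hours)) for start in range(24)]
    return min(averages, key=lambda x: x[1]), max(averages, key=lambda x: x[1])

# Illustrative curve: dirtier in the evening peak, cleaner overnight and midday.
curve = [310, 300, 290, 280, 275, 285, 320, 360, 380, 360, 330, 300,
         280, 270, 265, 275, 310, 370, 400, 410, 390, 360, 340, 320]

(best_start, best), (worst_start, worst) = best_and_worst(curve, hours=6)
print(f"cleanest 6h window: starts {best_start:02d}:00, {best:.0f} gCO2/kWh")
print(f"dirtiest 6h window: starts {worst_start:02d}:00, {worst:.0f} gCO2/kWh")
print(f"shifting the job: {worst / best:.2f}x less carbon for the same compute")
```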
Multiply them together: 5x from architecture, 4x from grid selection, 1.4x from time-shifting. That’s a 28x reduction in carbon for equivalent capability. Not confident in the precision of any of these numbers individually, but the direction and the multiplicative interaction seem real.
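Written out, with the same rough numbers:

```python
# The three factors are independent, so they multiply rather than add.
# All three values are the rough estimates from the text, not measurements.

architecture_gain = 5.0   # hypothetical JEPA-style efficiency vs. today's LLMs
grid_gain         = 4.0   # cleanest vs. dirtiest cloud region (available now)
timeshift_gain    = 1.4   # running in the grid's low-carbon hours (available now)

combined = architecture_gain * grid_gain * timeshift_gain
print(f"combined carbon reduction: {combined:.0f}x")   # 28x, not 5 + 4 + 1.4
```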
The question underneath
LeCun might be wrong about JEPA specifically. The next breakthrough might be a variant of transformers that nobody’s thought of yet. But the underlying question — whether we’re scaling an architecture that is fundamentally energy-inefficient, on grids that are fundamentally carbon-intensive, because we haven’t had the data to see either problem clearly — that question seems worth sitting with.
The $100 billion is already being spent. The data centers are already being built. The grids they’re connecting to are already known. CarbonBench can tell you the carbon intensity of every one of them. Whether the next hundred billion goes in the same direction might depend on whether anyone looks at the data before signing the contracts.