There is a conversation happening about AI efficiency that seems to be stuck at the wrong level of abstraction. Most of it centers on model size. Use a smaller model. Quantize. Distill. Prune. All reasonable. But the data from CarbonBench suggests something that might matter more, and it is almost entirely ignored: the electricity grid your inference runs on.
The Netherlands versus Singapore is a 4x carbon difference for the same model, same provider, same price. Not a different model. Not a different architecture. The same weights, the same tokenizer, the same API endpoint format — just a different region parameter in the request.
The inversion
This inverts the usual optimization conversation in a way that still feels counterintuitive even after staring at the data for weeks.
Consider two developers. Developer A runs Llama 3.1 8B on AWS in Singapore. Developer B runs Llama 3.1 70B on GCP in the Netherlands. Developer A is using the smaller model — the “responsible” choice by conventional wisdom. Developer B is running a model nearly 9x larger.
Developer B's carbon per million tokens comes out within about 20% of Developer A's — near parity, despite nearly 9x the parameters.
The 70B model uses roughly 5x more energy per token than the 8B. But the Netherlands grid runs at approximately 129 gCO2/kWh while Singapore sits around 530 gCO2/kWh — a 4x gap. The grid multiplier nearly cancels the model multiplier: five times the energy at roughly a quarter of the carbon intensity works out to only about 1.2x the emissions. A model nearly 9x larger, at close to the same carbon cost, purely because of a region parameter.
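The comparison fits in a few lines of arithmetic. The grid intensities below are the ones quoted above; the absolute energy figure for the 8B model is a hypothetical placeholder, since the text only gives the ~5x ratio between the two models.

```python
# Carbon per million tokens = energy per million tokens (kWh) x grid intensity (gCO2/kWh).
E8 = 1.0           # kWh per million tokens for Llama 3.1 8B (assumed placeholder)
E70 = 5.0 * E8     # ~5x more energy per token for the 70B (ratio from the text)

SINGAPORE = 530    # gCO2/kWh
NETHERLANDS = 129  # gCO2/kWh

carbon_a = E8 * SINGAPORE      # Developer A: 8B in Singapore
carbon_b = E70 * NETHERLANDS   # Developer B: 70B in the Netherlands

print(f"A (8B, Singapore):    {carbon_a:.0f} gCO2/Mtok")
print(f"B (70B, Netherlands): {carbon_b:.0f} gCO2/Mtok")
print(f"B/A ratio: {carbon_b / carbon_a:.2f}")  # ~1.22: near parity despite ~9x the parameters
```

Whatever the true absolute energy numbers are, the ratio is what matters: the placeholder cancels out of the B/A comparison.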
This seems like it should be a bigger deal than it is. The entire discourse around efficient AI — and there is a lot of it, conferences and papers and corporate sustainability reports — focuses almost exclusively on what you run. Not where you run it.
Why arbitrage might be the right word
In financial markets, arbitrage means exploiting a price difference for the same asset across two markets. The same bond trading at different prices on two exchanges. You buy low, sell high, the spread is your profit, and the opportunity exists because information hasn’t propagated yet.
Grid carbon arbitrage works similarly. The same inference — identical model, identical output quality, identical latency class — is available at dramatically different carbon costs depending on which region you route to. The price is the same. The carbon is not. The spread exists because the information is invisible. Nobody sees it, so nobody acts on it.
CarbonBench is trying to make it visible. The leaderboard ranks every model-provider-region combination by carbon intensity alongside cost and speed. When you can see the spread, routing to the cleaner option becomes a free optimization. Same bill from your cloud provider. Different line item on the planet’s ledger, if you want to think about it that way.
The numbers that keep surprising
Mistral 7B on AWS Ireland produces about 5 gCO2 per million tokens. Move it to Virginia and it doubles to roughly 10. Move it to Singapore and it triples to about 16. The model weights did not change. The tokenizer did not change. The inference speed barely changed. Everything that a developer normally optimizes for stayed constant. The thing that changed was which power plants were feeding electrons to the GPU.
Oregon is interesting because its grid runs heavily on hydroelectric power. GCP us-west1 in Oregon produces some of the lowest carbon numbers in the dataset — competitive with Northern Europe despite being a US region. This seems underappreciated. Companies with US data residency requirements don't have to accept Virginia-level carbon costs. Oregon exists. The Pacific Northwest exists. The grid is clean and the latency to major US population centers is acceptable.
Then there are the compound cases. Take an 8B model on a dirty grid versus a 70B model on a clean grid. The energy ratio between them is maybe 5x. But the grid ratio between Singapore and the Netherlands is 4x. So the dirty-grid 8B is operating at roughly 0.2x the energy but 4x the carbon intensity, landing at about 0.8x total carbon — barely better than the clean-grid 70B. In some configurations it is actually worse. The model optimization that everyone talks about gets almost entirely eaten by the grid penalty that nobody talks about.
Why this is free
This is the part that seems most significant and least discussed. Carbon-aware routing costs nothing. In most cases, cloud providers charge the same price for the same model regardless of region: on AWS Bedrock, GCP Vertex AI, and Azure OpenAI, per-token pricing is typically region-independent. You are not paying a premium to route to a cleaner grid. You are making a different API call to a different endpoint URL.
Compare this to every other carbon reduction strategy in AI. Smaller models sacrifice capability. Quantization sacrifices some accuracy. Fewer inference calls means less functionality. Purchasing carbon offsets costs money. Renewable energy certificates cost money. Building on-site solar costs a lot of money.
Region routing sacrifices nothing and costs nothing. It might be the only free lunch in AI carbon reduction. And almost nobody is doing it, because the data that would motivate it hasn’t been accessible.
The API call as a routing decision
Most production AI systems already have multi-region capability for redundancy and latency reasons. The infrastructure for region-aware routing exists. Adding carbon intensity as a routing signal seems like it should be straightforward — query the CarbonBench API for the cleanest region that meets your latency requirements, route accordingly.
The /api/recommend endpoint does this. Send it a model name and it returns the lowest-carbon region available right now, with alternatives ranked by carbon intensity. The response includes the absolute carbon savings compared to the dirtiest available option. For some model-region combinations, the savings are 75% or more. For free.
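The ranking logic behind that kind of endpoint can be sketched in a few lines. This is an illustration, not the actual CarbonBench implementation or response schema: the Netherlands and Singapore intensities come from the figures earlier in the piece, the region labels are shorthand, and the `recommend` function name and its output fields are assumptions.

```python
# Illustrative grid intensities in gCO2/kWh (Netherlands and Singapore
# figures from the text; region labels are shorthand, not exact identifiers).
GRID_INTENSITY = {
    "eu-west4 (Netherlands)": 129,
    "ap-southeast-1 (Singapore)": 530,
}

def recommend(intensities, acceptable_regions=None):
    """Rank regions cleanest-first, with savings relative to the dirtiest option."""
    candidates = {r: g for r, g in intensities.items()
                  if acceptable_regions is None or r in acceptable_regions}
    dirtiest = max(candidates.values())
    ranked = sorted(candidates.items(), key=lambda kv: kv[1])
    return [{"region": r,
             "gco2_per_kwh": g,
             "savings_vs_dirtiest": 1 - g / dirtiest}
            for r, g in ranked]

for rec in recommend(GRID_INTENSITY):
    print(f'{rec["region"]}: {rec["gco2_per_kwh"]} gCO2/kWh, '
          f'{rec["savings_vs_dirtiest"]:.0%} below the dirtiest option')
```

With these two regions, the cleanest option comes out roughly 76% below the dirtiest — the same order of savings the endpoint reports for the widest spreads. The `acceptable_regions` filter stands in for the latency constraint: pass the set of regions that meet your latency budget and route to the first result.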
Still not sure why this hasn’t become standard practice. Might be that the data wasn’t available until recently. Might be that sustainability teams and engineering teams don’t talk to each other about region selection. Might be that the cloud provider dashboards don’t surface carbon alongside cost and latency, so it’s invisible in the decision-making interface. Probably all three.
What this implies for the optimization conversation
If the grid matters more than the model, then the optimization conversation needs to change altitude. Not abandon model efficiency — that still matters. But maybe start with the question that has the largest effect size and the lowest cost: where is this running?
A team that switches from Virginia to Oregon might save more carbon than a team that spends six months distilling a 70B model down to 7B. That’s a strange claim and still somewhat uncertain. But the math keeps checking out across the configurations tested so far.
The arbitrage is there. The spread is wide. The cost of exploiting it is zero. It just requires seeing the data.