Running Llama 3.1 70B through a provider API feels like a commodity operation. Pick a model, pick a provider, send a request, get tokens back. The pricing pages tell you what it costs in dollars. What they don’t tell you is what it costs in carbon — and how wildly that number changes depending on where and when you make the call.

I built CarbonBench to try to make this visible. It’s still early, but the initial findings seem worth sharing.

What the data suggests

The same model, through a comparable provider, at the same price, can produce anywhere from about 4 to more than 50 grams of CO2 equivalent per million tokens depending on where and when it runs, and the spread across the full leaderboard is wider still. The variable isn’t the model. It’s the electricity grid the data center sits on: carbon intensity in the tracked regions runs from 129 gCO2 per kilowatt-hour in the Netherlands to 530 in Singapore.

A Llama 3.1 8B call through GCP in the Netherlands right now produces about 4 gCO2 per million tokens. The same call through AWS in Singapore produces around 17. Move it to Virginia during peak hours and it climbs past 50. The model is identical. The provider is comparable. The carbon is an order of magnitude different.

The underlying numbers are measurements rather than guesses, updated daily and combining three data sources: real GPU energy benchmarks from the AI Energy Score project, live grid carbon intensity from Electricity Maps, and published provider pricing from AWS Bedrock, GCP Vertex AI, Azure OpenAI, Together, Groq, and Fireworks.

A conversation that seems stuck at the wrong altitude

The AI energy debate is dominated by data center consumption totals — how many terawatt-hours did OpenAI use this year, how many nuclear plants does Microsoft need. These are important numbers, but they’re not actionable for someone making an API call.

What might be actionable: the carbon intensity of the electricity grid your inference runs on varies by 2-10x depending on region and time of day. Oregon’s grid runs heavily on hydroelectric power. Virginia’s runs on a mix that includes significant natural gas and coal. The Netherlands has substantial wind capacity. Singapore relies primarily on natural gas.

If that’s right, then a developer who cares about carbon — or a company with scope 3 emissions reporting requirements — has a lever they can actually pull. Not “stop using AI” but “route your requests to the region where the grid is cleanest right now.”

How the calculation works

The formula is straightforward:

Carbon per million tokens (gCO2e) = GPU energy per million tokens (kWh) × grid carbon intensity (gCO2e/kWh)

GPU energy per token is a property of the model and the hardware. A Llama 3 8B on an A100 uses about 0.035 Wh per 1000 tokens. A Llama 3.1 70B on an H100 uses about 0.17 Wh. Claude 3 Opus uses around 0.80 Wh. These numbers come from standardized benchmarks — the AI Energy Score project runs each model through a fixed text generation workload on reference hardware and measures total GPU energy consumption.

Grid carbon intensity is a property of where the data center is and what time it is. Measured in grams of CO2 equivalent per kilowatt-hour. This number changes continuously as the generation mix shifts — more wind at night, more solar during the day, coal and gas filling in when renewables dip.

Multiply them together and you get the carbon cost of running a million tokens through a specific model in a specific region at a specific time.
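
To make the units concrete, here’s a minimal TypeScript sketch of that multiplication, using the energy figures quoted above. The function and model names are illustrative, not CarbonBench’s actual code.

```ts
// Energy per 1,000 output tokens (Wh), per the AI Energy Score figures above.
const whPer1kTokens: Record<string, number> = {
  "llama-3-8b": 0.035,
  "llama-3.1-70b": 0.17,
  "claude-3-opus": 0.8,
};

// Grams of CO2 equivalent emitted per million tokens on a given grid.
function carbonPerMillionTokens(model: string, gridGCO2PerKWh: number): number {
  const whPerMillionTokens = whPer1kTokens[model] * 1000; // 1M tokens = 1000 x 1k
  const kWhPerMillionTokens = whPerMillionTokens / 1000;  // Wh -> kWh
  return kWhPerMillionTokens * gridGCO2PerKWh;
}

// Llama 3 8B on the Netherlands grid (129 gCO2/kWh): ~4.5 g per million tokens.
console.log(carbonPerMillionTokens("llama-3-8b", 129).toFixed(1));
// The same model on Singapore's grid (530 gCO2/kWh): ~18.6 g.
console.log(carbonPerMillionTokens("llama-3-8b", 530).toFixed(1));
```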

Some patterns worth noting

After ingesting 85 models across 10 families — Llama, GPT, Claude, Mistral, Gemma, Qwen, DeepSeek, Phi, Falcon, and Cohere — a few things stand out.

Small models in clean regions are extraordinarily efficient. Gemma 2B on GCP Netherlands produces around 2 gCO2 per million tokens. For context, a Google search produces roughly 0.2g of CO2. So a million tokens of Gemma 2B inference in the Netherlands has a carbon footprint equivalent to about 10 Google searches.

The same model on different grids tells a dramatic story. Mistral 7B on AWS in Ireland (162 gCO2/kWh grid) produces about 5 gCO2 per million tokens. The same model on AWS in Virginia (339 gCO2/kWh grid) produces about 10. In Singapore (530 gCO2/kWh), it’s 16. The model didn’t change. The electricity did.

Large models amplify the grid difference. A 70B parameter model uses roughly 5x the energy per token of an 8B model. When that 5x energy multiplier hits a dirty grid, the carbon compounds: at 0.17 kWh per million tokens, a 70B model emits about 22 gCO2 per million tokens on the Netherlands grid and about 90 on Singapore’s.

Time of day matters more than expected. Grid carbon intensity can swing 30-50% over a 24-hour cycle. Batch jobs that can tolerate latency could probably be scheduled for the cleanest hours. CarbonBench tracks this — the carbon intensity charts show the 24-hour curve for each region, with the lowest point marked.
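
As a sketch of what that scheduling decision could look like, here’s a small TypeScript function that picks the cleanest hour from a 24-hour intensity curve. The curve values are invented for illustration; real ones would come from the Electricity Maps data.

```ts
// Given a 24-entry array of hourly grid intensity (gCO2/kWh, index = hour),
// return the hour with the lowest carbon intensity.
function cleanestHour(hourlyIntensity: number[]): number {
  let best = 0;
  for (let hour = 1; hour < hourlyIntensity.length; hour++) {
    if (hourlyIntensity[hour] < hourlyIntensity[best]) best = hour;
  }
  return best;
}

// A made-up curve for a grid that dips overnight as wind output rises.
const curve = [210, 195, 180, 170, 165, 175, 200, 240, 280, 300, 290, 270,
               250, 245, 255, 270, 295, 310, 300, 280, 260, 240, 225, 215];
console.log(`Run the batch job around ${cleanestHour(curve)}:00`); // 4:00
```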

A disconnect I keep thinking about

Provider pricing has almost no correlation with carbon cost. Groq charges $0.05 per million input tokens for Llama 3 8B. Fireworks charges $0.10. AWS charges $0.30. But they’re all running on Oregon-area data centers with similar grid carbon intensities.

Meanwhile, the biggest carbon difference comes from whether a model runs in the Netherlands (129 gCO2/kWh) versus Singapore (530 gCO2/kWh) — and the price is typically the same regardless of region.

This might mean carbon-aware routing is mostly free. You’re not paying more to run inference on a cleaner grid. You’re just choosing a different endpoint.
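
In code, the routing decision could be as simple as the sketch below. The region labels and intensities mirror the examples above; the endpoint structure is hypothetical, not any provider’s actual API.

```ts
interface Endpoint {
  region: string;
  pricePerMTok: number;    // USD per million input tokens
  gridGCO2PerKWh: number;  // current grid carbon intensity
}

// Same model, same price, different grids: pick the cleanest endpoint.
function pickCleanest(endpoints: Endpoint[]): Endpoint {
  return endpoints.reduce((a, b) => (b.gridGCO2PerKWh < a.gridGCO2PerKWh ? b : a));
}

const llama8b: Endpoint[] = [
  { region: "Netherlands", pricePerMTok: 0.30, gridGCO2PerKWh: 129 },
  { region: "Virginia",    pricePerMTok: 0.30, gridGCO2PerKWh: 339 },
  { region: "Singapore",   pricePerMTok: 0.30, gridGCO2PerKWh: 530 },
];
console.log(pickCleanest(llama8b).region); // "Netherlands": same price, ~4x less carbon
```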

What the tool does

The leaderboard at carbonbench.ai ranks every model-provider-region combination by carbon intensity, cost, and speed. Filter by model family, filter by provider, sort by what matters to you. Click any row and a live carbon intensity chart shows you the 24-hour curve for that region.

The /api/recommend endpoint answers the question directly: “What’s the lowest-carbon way to run Llama 70B right now?” It returns the best option, four alternatives, and a human-readable insight explaining the recommendation and the carbon savings.
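
A call might look like the sketch below. The query parameter name and the response fields are inferred from the description above rather than taken from the API docs, so treat them as assumptions.

```ts
// Hypothetical request/response shape for /api/recommend.
async function lowestCarbonOption(model: string) {
  const res = await fetch(`https://carbonbench.ai/api/recommend?model=${model}`);
  const { best, alternatives, insight } = await res.json();
  console.log(insight); // the human-readable explanation of the savings
  return { best, alternatives };
}
```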

The data pipeline pulls fresh carbon intensity from Electricity Maps daily, combines it with GPU energy benchmarks and provider pricing, and recalculates all scores.
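
A simplified version of the refresh step might look like this. The endpoint path and auth header follow Electricity Maps’ public v3 API as I understand it, and the zone codes are illustrative; treat the details as assumptions.

```ts
// Pull the latest carbon intensity (gCO2eq/kWh) for one grid zone.
async function fetchGridIntensity(zone: string): Promise<number> {
  const res = await fetch(
    `https://api.electricitymaps.com/v3/carbon-intensity/latest?zone=${zone}`,
    { headers: { "auth-token": process.env.ELECTRICITY_MAPS_TOKEN ?? "" } }
  );
  const body = await res.json();
  return body.carbonIntensity;
}

// Daily job: refresh each tracked zone, then recompute every
// model-provider-region score with the new intensities.
const zones = ["NL", "IE", "SG"]; // zone codes are illustrative
for (const zone of zones) {
  const intensity = await fetchGridIntensity(zone);
  // ...recalculate carbon per million tokens for every model in this zone
}
```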

Why this might matter beyond carbon

Even if you don’t care about carbon specifically, this data reveals something about AI infrastructure that isn’t visible from the pricing page: the physical reality of where computation happens.

Every API call is an abstraction over a GPU in a rack in a building connected to an electricity grid that has a specific carbon mix at that specific moment. Cloud providers abstract this away — that’s the point of cloud. But the abstraction hides a variable that might matter for regulatory compliance, ESG reporting, enterprise procurement, and customer trust.

As AI becomes a larger fraction of global electricity consumption, the question of where you run inference might become as important as which model you choose. I’m trying to make the data visible so that the choice can be intentional rather than made by default.

Technical details

For the curious:

85 models benchmarked, including 39 with real GPU energy measurements from the AI Energy Score project

6 providers tracked: AWS Bedrock, GCP Vertex AI, Azure OpenAI, Together AI, Groq, Fireworks AI

9 regions across US, EU, and APAC

Carbon data from Electricity Maps, updated daily

Open API — no authentication required for public endpoints

Stack: Next.js 14, Railway Postgres, Vercel, TypeScript

API docs at carbonbench.ai/docs.

What might come next

The immediate roadmap includes live pricing scrapers, historical carbon data for trends, and a paid API tier for teams that want programmatic carbon-aware routing.

The longer-term direction is more interesting: decentralized benchmark data collection through a Bittensor subnet, where a distributed network of miners runs standardized inference benchmarks and reports back verified performance data. Instead of trusting a single source for energy measurements, you’d have a marketplace of independently verified data points. I’m not sure that’s practical yet, but the idea keeps nagging at me.

The core finding is already live: the carbon cost of AI inference is not fixed. It’s a function of where and when you make the call.

carbonbench.ai