Somewhere in a sustainability report being drafted right now, a company is meticulously accounting for the emissions from its office lighting, its employee commutes, its supply chain logistics. Three floors down, an engineering team is making ten thousand inference calls a day to a hosted LLM, and nobody in the sustainability department knows what that costs in carbon. Because the cloud makes it invisible.

The abstraction problem

Cloud computing was designed to abstract away physical infrastructure. That’s the value proposition. You don’t need to know which rack your workload runs on, which power grid feeds the data center, which fuel mix generates the electricity. You pay for compute in abstract units — vCPUs, GPU-hours, tokens — and the physical reality disappears behind an API endpoint.

For most purposes this abstraction is a good thing. For carbon accounting, it’s a disaster.

Every API call to a hosted model is a scope 3 emission. Scope 3, in the GHG Protocol framework, covers indirect emissions from a company’s value chain — the emissions that occur upstream and downstream of your direct operations. When you send a prompt to GPT-4 or Claude or Llama through a provider API, the inference runs on a GPU in a data center connected to an electricity grid with a specific carbon intensity at that specific moment. That carbon is your scope 3, whether you measure it or not.

The problem is that “whether you measure it or not” has been the operative phrase. Almost nobody measures it. The data to measure it hasn’t been readily accessible. And scope 3 reporting, while increasingly required by regulation, has relied on rough estimates and industry averages rather than actual measurements.

The regulatory reality

The EU Corporate Sustainability Reporting Directive is live. It requires large companies and listed SMEs to report scope 3 emissions. California’s Climate Corporate Data Accountability Act requires scope 3 reporting for companies with revenue over $1 billion. The SEC’s climate disclosure rules included scope 3 for certain filers in the original proposal, though it was dropped from the final rule, which is itself still being litigated. The direction is clear even if the timelines are debated: scope 3 will be reported, audited, and eventually regulated.

AI inference is becoming a material category within scope 3 for companies that rely heavily on it. A financial services firm running thousands of inference calls for risk analysis, document processing, and customer service might be generating tonnes of CO2 through its AI operations without any line item in its emissions inventory. Not because the emissions are small — they might not be — but because there’s been no practical way to measure them.

The analogy to cloud costs in the early days of cloud adoption keeps surfacing. Companies would adopt cloud services, engineering would spin up instances, and finance wouldn’t see the bill until it was already six figures. The solution was cost observability: dashboards, budgets, alerts, FinOps as a discipline. Carbon seems to be in the same pre-observability phase. The spend is happening. Nobody’s watching the meter.

What CarbonBench makes measurable

The core formula is not complicated. For any inference call, the carbon cost is a function of three things: how much energy the model uses per token, how many tokens were processed, and the carbon intensity of the grid at the time and place the inference ran.

CarbonBench tracks all three. GPU energy per token comes from standardized benchmarks — the AI Energy Score project measures actual power consumption for each model on reference hardware. Grid carbon intensity comes from Electricity Maps, updated daily for every region where major providers operate. Token counts come from the API response — most providers return usage data including input and output tokens.

Multiply them together and you get grams of CO2 per call. Aggregate across all calls in a reporting period and you get a scope 3 line item for AI inference. It might be the first time this has been practically measurable at the individual API call level.
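In code, the per-call arithmetic is a single product. A minimal sketch, using illustrative numbers rather than real benchmark or grid values:

```python
def grams_co2_per_call(
    wh_per_token: float,     # model energy use from a benchmark (illustrative value)
    tokens: int,             # input + output tokens from the API usage data
    g_co2_per_kwh: float,    # grid carbon intensity where and when the call ran
) -> float:
    """Estimated carbon cost of one inference call, in grams of CO2."""
    kwh = wh_per_token * tokens / 1000.0  # Wh -> kWh
    return kwh * g_co2_per_kwh

# Illustrative: 0.002 Wh/token * 1,500 tokens = 3 Wh = 0.003 kWh
# 0.003 kWh * 400 gCO2/kWh = 1.2 g CO2 for the call
print(round(grams_co2_per_call(0.002, 1500, 400.0), 6))  # → 1.2
```

Each factor varies independently: the model sets the first, the prompt and response set the second, and the region and hour set the third.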

Whether the precision is sufficient for regulatory reporting is a question nobody can fully answer yet. The GPU energy benchmarks are standardized but don’t account for provider-specific hardware optimizations. The grid carbon intensity is real-time but averaged across the regional grid, not specific to a data center’s power purchase agreements. These are approximations. But they’re dramatically better than the alternative, which is zero measurement.

The enterprise blind spot

Talked to a few sustainability consultants about this — informally, not systematically, so take this with appropriate uncertainty. The picture that emerged is that most enterprise carbon inventories treat cloud computing as a single line item using spend-based estimation. Take total cloud spend, multiply by an industry emissions factor per dollar, report the result. This method doesn’t distinguish between a storage bucket and a GPU cluster. It certainly doesn’t distinguish between inference on a clean grid and inference on a dirty one.

The spend-based approach was designed for an era when cloud usage was mostly storage and web hosting. The energy intensity per dollar was relatively uniform. AI inference changed that. A dollar of inference on a large model consumes dramatically more electricity than a dollar of S3 storage. But the spend-based method treats them identically.
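The gap between the two methods can be made concrete. A toy comparison, where every factor is an illustrative assumption rather than a published emissions factor:

```python
# Spend-based: total cloud bill times one industry-average factor.
# The factor below is an illustrative assumption, not an official value.
cloud_spend_usd = 120_000
kg_co2_per_usd = 0.25
spend_based_kg = cloud_spend_usd * kg_co2_per_usd  # 30,000 kg, regardless of workload

# Activity-based: count what actually ran and convert energy to carbon.
# The energy figures are illustrative assumptions.
storage_kwh = 2_000      # object storage: little energy per dollar
inference_kwh = 45_000   # GPU inference: far more energy per dollar
grid_kg_co2_per_kwh = 0.4
activity_based_kg = (storage_kwh + inference_kwh) * grid_kg_co2_per_kwh  # ~18,800 kg

# Same bill, different inventory. Only the activity-based number moves
# when you shift inference to a cleaner grid or a more efficient model.
```

The point is not the specific totals but the sensitivity: the spend-based figure cannot respond to any operational change short of spending less.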

Activity-based measurement — counting actual GPU-hours or tokens and converting to energy and carbon — is more accurate but has been impractical because the conversion factors weren’t available. CarbonBench provides them. The API returns carbon per million tokens for any model-provider-region combination, updated daily. An enterprise could, in principle, instrument its AI pipeline to log the carbon cost of every inference call and aggregate for reporting.
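Instrumenting a pipeline this way can be sketched as a thin logging wrapper around each call. The factor table, key names, and values below are hypothetical stand-ins for whatever a conversion-factor API would actually return:

```python
import time

# Hypothetical cached factors: grams of CO2 per million tokens for a
# model-provider-region combination, refreshed daily from a factor API.
CARBON_FACTORS = {("gpt-4", "openai", "us-east"): 850.0}  # illustrative value

carbon_ledger = []  # in production, this would feed your metrics/reporting store

def log_inference_carbon(model, provider, region, usage):
    """Estimate and record the carbon cost of one inference call."""
    tokens = usage["prompt_tokens"] + usage["completion_tokens"]
    grams = tokens / 1_000_000 * CARBON_FACTORS[(model, provider, region)]
    carbon_ledger.append({
        "ts": time.time(), "model": model, "provider": provider,
        "region": region, "tokens": tokens, "g_co2": grams,
    })
    return grams

# A call that used 1,200 prompt + 300 completion tokens:
g = log_inference_carbon("gpt-4", "openai", "us-east",
                         {"prompt_tokens": 1200, "completion_tokens": 300})
# 1,500 / 1,000,000 * 850 = 1.275 g CO2, now a ledger row for the reporting period
```

Summing the ledger over a reporting period is the scope 3 line item; the engineering lift is mostly wiring this into existing request logging.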

Whether any enterprise will actually do this is uncertain. The regulatory pressure is building. The data is now available. The gap between the two is implementation — engineering work to add carbon logging to the inference pipeline, and organizational work to connect that data to the sustainability reporting workflow.

What the cloud hides

There’s something deeper here that goes beyond carbon accounting. The cloud abstraction hides the physical reality of computation. Every API call is an abstraction over a GPU drawing hundreds of watts from a power grid fed by specific power plants burning specific fuels or harnessing specific renewable sources. The electrons are real. The CO2 is real. The API endpoint is a fiction that makes the physical consequences invisible.

This invisibility is not neutral. It biases decisions toward ignoring externalities. When the carbon cost of an inference call is zero in your dashboard because it’s not measured, the rational economic decision is to ignore it. When it becomes visible — a number attached to each call, accumulated over time, reported to stakeholders — the decision calculus changes. Not because the cost was zero before and is now positive. The cost was always there. It just wasn’t on anyone’s screen.

CarbonBench is one attempt to put it on the screen. The leaderboard makes it visible for benchmarking and selection. The API makes it measurable for operations. The 24-hour carbon curves make it predictable for scheduling. Whether visibility alone is sufficient to change behavior is an empirical question. Probably not — regulation will be necessary for most organizations. But visibility is a prerequisite. You can’t manage what you can’t measure, and until recently, nobody could measure this.

The procurement angle

One thing that might move faster than regulation: enterprise procurement. Large organizations increasingly include sustainability criteria in vendor evaluations. An AI provider that can report the carbon intensity of its inference — per call, per region, over time — has a tangible advantage over one that can’t.

This might explain why some providers have been more forthcoming about their energy data than others. Not altruism. Procurement checklists. When the RFP asks “can you report the carbon intensity of inference by region” and only one vendor can answer yes, that’s a competitive edge measured in contract value, not just carbon.

CarbonBench data could serve both sides of this transaction. Providers can point to it as independent third-party measurement. Procurement teams can use it to compare providers on a dimension that isn’t on the pricing page. The data is there. The question is whether it enters the procurement conversation before or after the regulations force it.

carbonbench.ai