The carbon cost of LLM inference varies by more than two orders of magnitude depending on model size, inference efficiency, and deployment region. Most enterprises overspend on both cost and carbon by using large frontier models for tasks that smaller models handle equally well. This benchmarking guide estimates the energy and carbon footprint of GPT-4, Claude claude-opus-4-6, and open-source alternatives, and provides a decision framework for sustainable model selection.
LLM Carbon Footprint Benchmarks
| Model | Est. gCO₂/1K tokens (us-east-1) | Est. gCO₂/1K tokens (eu-north-1) | Relative Carbon Index |
|---|---|---|---|
| GPT-4 (via API) | ~1.2 gCO₂ | Not applicable (no region selection via API) | 100× (baseline high) |
| GPT-4o (via API) | ~0.4 gCO₂ | Not applicable | 33× |
| Claude claude-opus-4-6 (via API) | ~1.0 gCO₂ | Not applicable | 83× |
| Claude claude-sonnet-4-6 (via API) | ~0.25 gCO₂ | Not applicable | 21× |
| Llama 4 8B (self-hosted, A100) | ~0.012 gCO₂ | ~0.0005 gCO₂ | 1× (lowest) |
| Llama 4 70B (self-hosted, A100, eu-north-1) | Not reported | ~0.003 gCO₂ | 3× |
| DeepSeek V3 (self-hosted, eu-north-1) | Not reported | ~0.008 gCO₂ | 7× |
Why Proprietary APIs Have No Region Option for Carbon
When you call OpenAI, Anthropic, or Google's APIs, you don't choose which data centre processes your request — the provider's routing determines this. You cannot guarantee your inference runs in a low-carbon region. This is the key carbon disadvantage of proprietary APIs vs self-hosted models: self-hosted deployments can be placed in eu-north-1 (Nordic hydro, ~18 gCO₂/kWh) for an 11× carbon reduction vs us-east-1 (~400 gCO₂/kWh), with zero performance change.
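The regional effect can be sketched as energy multiplied by grid carbon intensity. The snippet below is a minimal illustration that uses only the two intensities quoted above; the function names and the 10 kWh batch are hypothetical. Note that these two intensities on their own give a ratio of about 22×, so the 11× figure presumably reflects factors not listed in this guide.

```python
# Grid carbon intensities quoted in the text, in gCO2 per kWh.
GRID_INTENSITY_G_PER_KWH = {
    "us-east-1": 400.0,
    "eu-north-1": 18.0,
}


def emissions_g(energy_kwh: float, region: str) -> float:
    """Operational emissions in grams of CO2 for a given energy draw and region."""
    return energy_kwh * GRID_INTENSITY_G_PER_KWH[region]


def region_ratio(high: str, low: str) -> float:
    """How many times more carbon-intensive `high` is than `low`."""
    return GRID_INTENSITY_G_PER_KWH[high] / GRID_INTENSITY_G_PER_KWH[low]


# Hypothetical workload: 10 kWh of GPU energy for a batch of inference jobs.
batch_kwh = 10.0
print(emissions_g(batch_kwh, "us-east-1"))   # 4000.0
print(emissions_g(batch_kwh, "eu-north-1"))  # 180.0
print(round(region_ratio("us-east-1", "eu-north-1"), 1))  # 22.2
```

This covers operational emissions only; it ignores data-centre overhead and embodied hardware emissions.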
Model Selection for Carbon Reduction
100×
Carbon difference between GPT-4 via API and a self-hosted Llama 4 8B in us-east-1 for the same classification or extraction task, per the benchmark table; hosting the smaller model in eu-north-1 widens the gap further
11×
Carbon reduction from running the same self-hosted model in eu-north-1 vs us-east-1 — available for free, requires only a deployment region change
60%
Of enterprise GPT-4 API calls can typically be served by smaller models (8B–13B fine-tuned) without quality loss — replacing them yields 30–100× carbon reduction for those workloads
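A quick worked calculation shows what a partial migration means for the workload as a whole. This is a sketch under a simplifying assumption that every call emits the same amount before migration; `blended_reduction` is a hypothetical helper, not part of any library.

```python
def blended_reduction(share_replaced: float, per_call_reduction: float) -> float:
    """Overall reduction factor when only part of a workload moves to a smaller model.

    share_replaced: fraction of calls moved to the smaller model (0..1).
    per_call_reduction: how many times less carbon each moved call emits.
    """
    # Emissions left over, as a fraction of the original total.
    remaining = (1.0 - share_replaced) + share_replaced / per_call_reduction
    return 1.0 / remaining


for factor in (30, 100):
    overall = blended_reduction(0.60, factor)
    print(f"{factor}x on 60% of calls -> {overall:.2f}x overall")
# 30x on 60% of calls -> 2.38x overall
# 100x on 60% of calls -> 2.46x overall
```

Under that assumption the overall saving is capped near 2.5×, because the 40% of calls that stay on the large model dominate what remains.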
📊
High-Volume Classification
If you're running millions of classification or extraction calls per day, replace GPT-4 with a fine-tuned Llama 4 8B deployed in eu-north-1. Carbon impact: on the order of 1,000× reduction. Quality impact: typically under 2% degradation on narrow-domain tasks. Cost impact: 99%+ reduction. This is the single highest-impact AI carbon optimisation most enterprises can make, and it improves financials at the same time.
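To make the scale concrete, the sketch below multiplies a daily token volume by the per-1K-token estimates from the benchmark table. The call volume, tokens per call, and dictionary keys are hypothetical placeholders; substitute figures from your own logs.

```python
# Per-1K-token estimates taken from the benchmark table (gCO2).
# The dictionary keys are illustrative labels, not official model IDs.
G_PER_1K_TOKENS = {
    "gpt-4-api": 1.2,
    "llama-4-8b-eu-north-1": 0.0005,
}


def daily_emissions_kg(calls_per_day: int, tokens_per_call: int, model: str) -> float:
    """Estimated daily emissions in kg CO2 for a fixed-size workload."""
    total_k_tokens = calls_per_day * tokens_per_call / 1000
    return total_k_tokens * G_PER_1K_TOKENS[model] / 1000  # grams -> kilograms


# Hypothetical workload: 2,000,000 calls per day at 500 tokens each.
before = daily_emissions_kg(2_000_000, 500, "gpt-4-api")
after = daily_emissions_kg(2_000_000, 500, "llama-4-8b-eu-north-1")
print(f"GPT-4 API:   {before:.1f} kg CO2/day")  # 1200.0
print(f"Llama 4 8B:  {after:.1f} kg CO2/day")   # 0.5
print(f"Reduction:   {before / after:.0f}x")    # 2400
```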
💡
Reasoning Tasks
Complex reasoning, nuanced analysis, and creative tasks genuinely benefit from large frontier models. For these, the carbon difference matters, but so does quality. Use Claude claude-sonnet-4-6 or GPT-4o (not the full claude-opus-4-6/GPT-4 unless necessary) for a 3–4× carbon improvement at minimal quality cost. Reserve claude-opus-4-6 for tasks where the quality difference is demonstrably worth the carbon premium.
🔧
RAG and Search
Embedding and retrieval-augmented generation workloads run well on small, fast models: BGE-M3 or E5-large for embeddings, Llama 4 8B for generation. The retrieval pipeline reduces how much the generation model needs to "know", which enables smaller models without quality loss. Self-host the entire RAG stack in eu-north-1 for maximum carbon efficiency. Our ML team designs carbon-optimised RAG architectures.
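The tiering described in these cards could be expressed as a simple routing rule. The sketch below is one possible shape, assuming tasks are already labelled by kind; `TaskKind`, `choose_model`, and the `llama-4-8b-finetuned` label are hypothetical names introduced for illustration.

```python
from enum import Enum


class TaskKind(Enum):
    CLASSIFICATION = "classification"
    EXTRACTION = "extraction"
    RAG_GENERATION = "rag_generation"
    REASONING = "reasoning"
    CREATIVE = "creative"


# Hypothetical tier labels that follow the guidance in the text.
SMALL_MODEL = "llama-4-8b-finetuned"  # self-hosted, e.g. in eu-north-1
MID_MODEL = "claude-sonnet-4-6"
LARGE_MODEL = "claude-opus-4-6"

# Narrow tasks that the text routes to a small model.
NARROW_TASKS = {TaskKind.CLASSIFICATION, TaskKind.EXTRACTION, TaskKind.RAG_GENERATION}


def choose_model(task: TaskKind, needs_top_quality: bool = False) -> str:
    """Pick the smallest tier that the guidance allows for this task."""
    if task in NARROW_TASKS:
        return SMALL_MODEL
    if needs_top_quality:
        return LARGE_MODEL
    return MID_MODEL


print(choose_model(TaskKind.CLASSIFICATION))                     # llama-4-8b-finetuned
print(choose_model(TaskKind.REASONING))                          # claude-sonnet-4-6
print(choose_model(TaskKind.REASONING, needs_top_quality=True))  # claude-opus-4-6
```

Defaulting `needs_top_quality` to `False` makes the largest model an explicit opt-in, which mirrors the advice to reserve it for cases where the quality gain is demonstrable.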
🎯
Measure First
Before optimising, measure. Use EcoLogits (open source) to estimate carbon from API call logs; it estimates gCO₂ per call from model, token count, and provider. Use CodeCarbon to measure self-hosted inference. Build a monthly AI carbon report for your engineering dashboards. You can't reduce what you don't measure, and the data typically reveals 2–3 high-impact optimisations immediately.
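A monthly report can start as a simple aggregation over call logs. The sketch below uses only the standard library and flat per-1K-token factors from the benchmark table as a stand-in for per-call estimates; in practice a tool such as EcoLogits or CodeCarbon would supply those numbers. The log records and the `monthly_report` helper are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Assumed emission factors (gCO2 per 1K tokens) from the benchmark table.
FACTORS = {"gpt-4": 1.2, "gpt-4o": 0.4, "claude-sonnet-4-6": 0.25}

# Hypothetical log records: (call date, model, total tokens).
LOGS = [
    (date(2025, 1, 3), "gpt-4", 120_000),
    (date(2025, 1, 17), "gpt-4o", 300_000),
    (date(2025, 2, 2), "gpt-4", 80_000),
    (date(2025, 2, 9), "claude-sonnet-4-6", 400_000),
]


def monthly_report(logs, factors):
    """Sum estimated gCO2 per (YYYY-MM, model) pair."""
    totals = defaultdict(float)
    for day, model, tokens in logs:
        totals[(day.strftime("%Y-%m"), model)] += tokens / 1000 * factors[model]
    return dict(totals)


report = monthly_report(LOGS, FACTORS)
for (month, model), grams in sorted(report.items()):
    print(f"{month}  {model:<20} {grams:8.1f} gCO2")
```

Sorting by month and model keeps the output stable between runs, which makes month-over-month comparison on a dashboard easier.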