AI energy consumption: how to measure and reduce LLM costs

Q: What does SCALE D2C offer for GreenTech and Sustainable IT?

End-to-end GreenTech and Sustainable IT strategy, implementation, and optimisation for enterprise and D2C brands. Contact us for a free consultation.

Q: How long does a GreenTech and Sustainable IT engagement take?

Strategy projects: 4–8 weeks. Full implementation: 3–12 months. ROI typically within 12–18 months.

Q: Does SCALE D2C work with all business sizes?

Yes, from D2C brands to enterprise. View our pricing.

AI workloads, particularly large language model training and inference, are among the fastest-growing contributors to both enterprise carbon footprints and cloud bills. A single GPT-3-scale training run is estimated to consume approximately 1,287 MWh of electricity, roughly the annual energy consumption of 120 US homes. Enterprise LLM inference at scale compounds this: a million API calls per day to GPT-4 costs on the order of $15,000 a day, depending on tokens per call, and generates measurable carbon. This guide covers how to measure AI energy consumption, benchmark model efficiency, and reduce both cost and carbon through architectural choices.

The Scale of AI Energy Consumption

AI Energy Consumption — Enterprise Context

AI's energy footprint has three components: (1) Training — the one-time cost of training a foundation model, measured in MWh to GWh; (2) Inference — the ongoing cost of serving the model for requests, which scales with usage volume; (3) Fine-tuning — intermediate cost of adapting a pre-trained model to specific tasks. For enterprise consumers of AI (not model trainers), inference dominates the energy and cost profile — and it is the component most amenable to optimisation through model selection and deployment architecture.

Energy and Carbon Benchmarks by Model

Model	Energy per 1K tokens (kWh)	CO₂ per 1K tokens (gCO₂)	Relative Cost Index
GPT-4 / Claude (claude-opus-4-6)	~0.001–0.003	~0.4–1.2	100× (baseline high)
GPT-4o / Claude (claude-sonnet-4-6)	~0.0003–0.001	~0.12–0.4	30×
Llama 3 8B (self-hosted, A100, us-east-1)	~0.00005	~0.02	2×
Llama 3 8B (self-hosted, eu-north-1)	~0.00005	~0.001	1× (baseline low)
DeepSeek V3 (self-hosted)	~0.0002	~0.08	8×
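The CO₂ column follows from the energy column multiplied by grid carbon intensity. The sketch below reproduces the two self-hosted 8B rows with energy expressed in kWh, using the ~400 and ~18 gCO₂/kWh intensities quoted in this guide. The function and variable names are illustrative assumptions, not part of any library.

```python
# Grid carbon intensity in gCO2 per kWh, using the approximate figures quoted in this guide.
GRID_INTENSITY_G_PER_KWH = {"us-east-1": 400.0, "eu-north-1": 18.0}


def co2_grams(energy_kwh: float, region: str) -> float:
    """Operational emissions = energy used x grid carbon intensity."""
    return energy_kwh * GRID_INTENSITY_G_PER_KWH[region]


# Energy per 1K tokens for the self-hosted 8B model, in kWh (value from the table).
small_model_kwh = 0.00005

us = co2_grams(small_model_kwh, "us-east-1")
eu = co2_grams(small_model_kwh, "eu-north-1")
print(f"us-east-1: {us:.4f} g, eu-north-1: {eu:.4f} g per 1K tokens")
```

The same model and the same energy give roughly 0.02 g in one region and roughly 0.001 g in the other, which is why region appears as a separate lever from model choice.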

Practical Reduction Strategies

10–30×

Energy reduction achievable by switching from GPT-4-class models to small, fine-tuned 8B models for suitable tasks — the single highest-impact action in most enterprise AI energy optimisation programmes

~22×

Carbon intensity difference between running inference in us-east-1 (~400 gCO₂/kWh) and eu-north-1 (~18 gCO₂/kWh). Region selection is effectively free and has the second-largest carbon impact after model selection

75%

Energy reduction from INT4 quantisation of LLM inference vs FP16 — with only 3–8% quality degradation on most enterprise tasks, quantisation is the highest-ROI inference optimisation

📏

Right-Size Your Models

The most impactful reduction is to use the smallest model that meets the quality requirements of each task. Classification, extraction, and structured-output tasks rarely need a GPT-4-class model. A fine-tuned Llama 3 8B model can often match GPT-4 on narrow domain tasks at 10–30× lower energy. Verify this with an A/B test: run the same task on the smaller model and measure the quality difference. Many enterprises find that 60–70% of their GPT-4 calls can be served by 8–13B models without meaningful quality loss.
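A minimal sketch of such an A/B check follows, assuming a small labelled evaluation set and exact-match scoring. The two model functions are hypothetical stand-ins for real API or self-hosted calls, and the 2-point tolerance is an arbitrary example threshold.

```python
from typing import Callable, Sequence

Example = tuple[str, str]  # (prompt, expected label)


def accuracy(model: Callable[[str], str], examples: Sequence[Example]) -> float:
    """Share of examples where the model output exactly matches the expected label."""
    correct = sum(
        1 for prompt, expected in examples
        if model(prompt).strip().lower() == expected.lower()
    )
    return correct / len(examples)


def can_downsize(large, small, examples, max_drop: float = 0.02) -> bool:
    """True if the small model is within max_drop accuracy of the large one."""
    return accuracy(large, examples) - accuracy(small, examples) <= max_drop


# Hypothetical stand-ins: replace with calls to your actual large and small models.
def large_model(prompt: str) -> str:
    text = prompt.lower()
    return "refund" if "refund" in text or "money back" in text else "other"


def small_model(prompt: str) -> str:
    return "refund" if "refund" in prompt.lower() else "other"


EXAMPLES = [
    ("I want a refund", "refund"),
    ("Can I get my money back?", "refund"),
    ("Where is my parcel?", "other"),
    ("Please change my address", "other"),
]

large_acc = accuracy(large_model, EXAMPLES)
small_acc = accuracy(small_model, EXAMPLES)
decision = can_downsize(large_model, small_model, EXAMPLES)
print(f"large={large_acc:.2f} small={small_acc:.2f} downsize={decision}")
```

In practice the evaluation set should be sampled from production traffic per task, so the routing decision is made task by task rather than for the whole workload.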

🌍

Deploy in Low-Carbon Regions

For self-hosted models, run inference in eu-north-1 (AWS Stockholm, largely Nordic hydro) or eu-west-1 (Ireland, high renewable mix). For proprietary API calls, select the lowest-carbon data centre option where the provider offers a choice; Azure and GCP publish carbon data for their data centres. For new deployments, choosing the cleanest available region adds little engineering effort, provided latency and data-residency requirements allow it. Encode the region choice in your infrastructure-as-code so that selection is automated.
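Region selection can be sketched as picking the lowest-carbon region that still meets a latency budget. The carbon intensities below are the approximate figures quoted in this guide; the latency values and all names are hypothetical and would come from your own measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    carbon_g_per_kwh: float  # approximate grid intensity quoted in this guide
    latency_ms: float        # hypothetical round-trip latency to your users


def pick_region(regions: list[Region], max_latency_ms: float) -> Region:
    """Return the lowest-carbon region whose latency is within budget."""
    eligible = [r for r in regions if r.latency_ms <= max_latency_ms]
    if not eligible:
        raise ValueError("no region satisfies the latency budget")
    return min(eligible, key=lambda r: r.carbon_g_per_kwh)


REGIONS = [
    Region("us-east-1", carbon_g_per_kwh=400.0, latency_ms=20.0),
    Region("eu-north-1", carbon_g_per_kwh=18.0, latency_ms=110.0),
]

batch_choice = pick_region(REGIONS, max_latency_ms=150.0)       # latency-tolerant workload
interactive_choice = pick_region(REGIONS, max_latency_ms=50.0)  # latency-sensitive workload
print(batch_choice.name, interactive_choice.name)
```

The output of a function like this can feed the region variable in an infrastructure-as-code template, so batch and offline inference default to the cleanest region while interactive traffic stays close to users.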

⚡

Quantise and Optimise

INT8 quantisation: roughly 50% energy reduction with <1% quality loss. INT4 quantisation (AWQ/GPTQ): roughly 75% energy reduction with 3–8% quality loss on most tasks. Deploy with TensorRT for NVIDIA hardware (2–4× throughput improvement vs naive serving) or vLLM with PagedAttention (3–5× throughput improvement). Higher throughput means fewer GPUs are needed for the same load, and therefore lower energy per token served.
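The link between throughput and energy per token can be made explicit with a little arithmetic. This sketch assumes a hypothetical 400 W average GPU power draw and hypothetical throughput figures, and it assumes power draw stays the same after optimisation, which is a simplification.

```python
def wh_per_1k_tokens(gpu_power_w: float, num_gpus: int, tokens_per_second: float) -> float:
    """Energy per 1,000 generated tokens, in watt-hours."""
    joules_per_token = gpu_power_w * num_gpus / tokens_per_second  # W = J/s
    return joules_per_token * 1000 / 3600  # 3600 J = 1 Wh


# Hypothetical figures: one GPU at 400 W, before and after a 3x throughput gain.
baseline = wh_per_1k_tokens(gpu_power_w=400.0, num_gpus=1, tokens_per_second=500.0)
optimised = wh_per_1k_tokens(gpu_power_w=400.0, num_gpus=1, tokens_per_second=1500.0)

print(f"baseline:  {baseline:.3f} Wh per 1K tokens")
print(f"optimised: {optimised:.3f} Wh per 1K tokens")
print(f"reduction: {baseline / optimised:.1f}x")
```

Under these assumptions a 3× throughput gain yields a 3× drop in energy per token. Real savings depend on measured power draw and utilisation.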

🗃️

Cache Aggressively

The greenest LLM call is one never made. Prefix caching, which reuses the KV cache across requests and is supported by vLLM and TGI, avoids recomputing common prompt prefixes. A semantic cache (GPTCache, or Redis with embeddings) returns stored responses for semantically similar queries, which is useful for FAQ, documentation, and other high-repetition enterprise tasks. A well-implemented semantic cache can reduce LLM calls by 30–60% for typical enterprise knowledge base Q&A workloads.
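The idea behind a semantic cache can be sketched in a few lines. This is a toy illustration, not the GPTCache or Redis API: the bag-of-words `embed` function stands in for a real embedding model, the 0.8 threshold is arbitrary, and `fake_llm` is a hypothetical placeholder for the actual model call.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; a real system would use an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries: list[tuple[Counter, str]] = []  # (query embedding, response)

    def get(self, query: str):
        """Return the stored response for the most similar query, or None on a miss."""
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))


def answer(query: str, cache: SemanticCache, llm) -> str:
    cached = cache.get(query)
    if cached is not None:
        return cached          # cache hit: no LLM call, no inference energy
    response = llm(query)
    cache.put(query, response)
    return response


llm_calls = 0

def fake_llm(query: str) -> str:  # hypothetical stand-in for a real model call
    global llm_calls
    llm_calls += 1
    return f"answer to: {query}"


cache = SemanticCache()
first = answer("How do I reset my password?", cache, fake_llm)
second = answer("how do I reset my password", cache, fake_llm)  # near-duplicate: hit
third = answer("What is your refund policy?", cache, fake_llm)  # unrelated: miss
print(llm_calls)
```

The threshold is the main design choice: set it too low and users receive answers to a different question, set it too high and the hit rate collapses.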

How to Measure Your AI Carbon Footprint

Step 1

Instrument with CodeCarbon and Kepler

For self-hosted models, deploy Kepler (eBPF-based energy measurement for Kubernetes) to track energy per pod, then attribute it to individual inference requests. For training runs, add CodeCarbon to your training script: a single decorator, with no other code changes. For proprietary API calls, use the open-source EcoLogits library, which estimates energy from token count and model type. Feed all measurements into your GreenOps dashboards.
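Per-inference energy for a self-hosted model can be approximated by dividing the energy a serving pod consumed over a window by the requests it served in that window. The sketch assumes you can read two cumulative counters for the same pod, a joules counter of the kind Kepler exports and a request counter from your serving stack. The `Sample` type and the numbers are hypothetical.

```python
from dataclasses import dataclass

JOULES_PER_WH = 3600.0


@dataclass(frozen=True)
class Sample:
    joules_total: float   # cumulative energy counter for the serving pod
    requests_total: int   # cumulative requests served by the same pod


def wh_per_request(start: Sample, end: Sample) -> float:
    """Average energy per request between two counter readings, in watt-hours."""
    requests = end.requests_total - start.requests_total
    if requests <= 0:
        raise ValueError("no requests served in the window")
    return (end.joules_total - start.joules_total) / JOULES_PER_WH / requests


# Hypothetical readings taken one hour apart.
start = Sample(joules_total=1_000_000.0, requests_total=10_000)
end = Sample(joules_total=1_720_000.0, requests_total=12_000)

per_request = wh_per_request(start, end)
print(f"{per_request:.3f} Wh per request")
```

This averages idle and active energy across all requests in the window, so it is most meaningful for pods dedicated to a single model.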

CodeCarbon · Kepler · EcoLogits

Step 2

Calculate SCI Score per AI Service

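Apply the Software Carbon Intensity formula to each AI service: SCI = ((E × I) + M) per R, where E is the energy consumed (kWh), I is the carbon intensity of the electricity used (gCO₂/kWh), M is the embodied hardware emissions attributed to the service (gCO₂), and R is the functional unit (for example, per API call or per active user). This gives you a comparable carbon metric across different AI services, enabling data-driven model selection and optimisation prioritisation. Report SCI scores monthly in your engineering metrics dashboard alongside latency and cost.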
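As a worked example of the SCI formula, the sketch below scores one service per API call in two regions. The monthly energy, embodied emissions, and call volume are hypothetical inputs; the grid intensities are the approximate figures quoted in this guide.

```python
def sci(energy_kwh: float, intensity_g_per_kwh: float,
        embodied_g: float, functional_units: float) -> float:
    """Software Carbon Intensity: ((E x I) + M) per R, in gCO2 per functional unit."""
    if functional_units <= 0:
        raise ValueError("functional_units must be positive")
    return (energy_kwh * intensity_g_per_kwh + embodied_g) / functional_units


# Hypothetical monthly figures for one inference service.
ENERGY_KWH = 50.0        # E: measured energy for the month
EMBODIED_G = 5_000.0     # M: amortised hardware emissions for the month
API_CALLS = 1_000_000    # R: functional unit is one API call

sci_us = sci(ENERGY_KWH, 400.0, EMBODIED_G, API_CALLS)  # us-east-1, ~400 gCO2/kWh
sci_eu = sci(ENERGY_KWH, 18.0, EMBODIED_G, API_CALLS)   # eu-north-1, ~18 gCO2/kWh

print(f"us-east-1:  {sci_us:.4f} gCO2 per call")
print(f"eu-north-1: {sci_eu:.4f} gCO2 per call")
```

Because M does not change with region, embodied emissions become the dominant term once the service runs on a clean grid.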

SCI per AI service · Monthly tracking · Model comparison

Reduce Your AI Carbon and Cost

Our DevOps, ML, and digital transformation teams help enterprises measure, reduce, and report AI energy consumption as part of integrated GreenOps programmes. Book a free advisory session to build your AI sustainability strategy.

SCALE D2C Editorial Team

GreenTech and Sustainable IT Research · March 2026
