Home Blog Confidential Computing and P Private AI infrastructure: air-gapped LLM deployment gu...
Confidential Computing and P February 15, 2026 11 min read

Private AI infrastructure: air-gapped LLM deployment guide

Confidential Computing and P Enterprise Guide 2026 SCALE D2C D2C Technology Confidential Computing and P Enterprise Guide 2026 SCALE D2C D2C Technology

Air-gapped LLM deployment — running large language models on infrastructure with no internet connectivity — is the gold standard for organisations with the most stringent data sovereignty, classification, or operational security requirements. Defence, intelligence, critical national infrastructure, and regulated financial institutions are deploying private AI infrastructure in fully isolated environments. This guide covers the architecture, hardware requirements, model selection, and operational complexity involved.

Why Air-Gapped LLM Deployment?

The cloud AI deployment model — sending prompts to API endpoints operated by OpenAI, Anthropic, Google, or AWS — is unacceptable for a defined set of use cases. The reasons vary: classification-level data that legally cannot leave controlled infrastructure, operational security requirements where network egress itself represents an attack surface, regulatory regimes that mandate data residency with audit trail requirements cloud APIs cannot meet, or risk tolerance decisions by legal and compliance teams unwilling to accept cloud provider data processing terms regardless of their content.

Air-gapped deployment is distinct from private cloud or VPC deployment — in a VPC deployment, the infrastructure is logically isolated but physically connected to internet-routed networks. Air-gapped means physical separation: no network interface connected to any externally routable network. This is a meaningful security difference for threat models that include nation-state adversaries or insider threats with network access, and it comes with significant operational cost in terms of update mechanisms, model versioning, and toolchain management.

43%
Of defence and intelligence organisations surveyed report active air-gapped or classified-network LLM deployments as of Q1 2026, up from under 5% in 2023
3–5×
Infrastructure cost premium for equivalent performance in air-gapped vs cloud deployment, due to dedicated GPU procurement and operational overhead
6–18 months
Typical deployment timeline from procurement decision to operational air-gapped LLM capability, driven primarily by GPU hardware lead times and security accreditation

Hardware Architecture for Air-Gapped LLM

Air-gapped LLM deployment requires on-premises GPU infrastructure scaled to the model size and inference workload. The hardware architecture decisions are more consequential than in cloud deployment because they represent a capital commitment that cannot be elastically scaled.

GPU selection for air-gapped deployment in 2026 is primarily between NVIDIA H100/H200 (highest performance, highest cost, controlled export in some jurisdictions), NVIDIA A100 (mature, well-supported, lower cost on secondary market), AMD Instinct MI300X (competitive performance, growing software ecosystem), and specialised inference accelerators (Groq LPU, Cerebras for specific workloads). For classified deployments, export control restrictions on H100/H200 to certain jurisdictions must be factored into procurement — in some cases, A100 or MI300X becomes the de facto choice based on licencing and export requirements rather than performance preference.

Server configuration typically uses 4–8 GPU servers per inference cluster, with NVLink or high-bandwidth interconnect between GPUs for large model serving. A 70B parameter model requires 140GB+ of GPU memory for full-precision serving — minimum 2× A100 80GB or 2× H100 80GB for comfortable serving of a 70B model in FP16. Quantised models (INT4, INT8) reduce memory requirements significantly: a 70B model in INT4 quantisation fits in ~35GB, enabling single-GPU serving at the cost of some accuracy.

Storage and networking within the air-gapped enclave require careful design: model weights (70B model = 140GB+ in FP16) must be stored on fast local storage (NVMe SSD) or high-bandwidth NAS accessible to GPU servers; internal network connectivity within the air-gapped environment uses standard high-bandwidth switching (100GbE or InfiniBand); and the data transfer mechanism into the air-gapped environment (for model updates, software patches, new data) must be explicitly designed — physical media transfer, one-way data diodes, or controlled-transfer workstations with security review are common patterns.

Model Selection for Air-Gapped Deployment

Not all LLMs are suitable for air-gapped deployment — models must be available for download and local execution, which excludes closed-API models (GPT-4, Claude, Gemini Ultra) and focuses the field on open-weight models.

ModelParametersGPU Req (FP16)GPU Req (INT4)LicenceBest For
Llama 3.3 70B70B2× H100 80GB1× H100 80GBMeta Llama 3 (commercial OK)General purpose, strong reasoning
Mistral Large 2123B3–4× H100 80GB2× H100 80GBMistral Research (check terms)Multilingual, instruction following
Falcon 180B180B8× A100 80GB4× A100 80GBApache 2.0Maximum open-weight capability
Llama 3.1 8B8B1× A100 40GBSingle consumer GPUMeta Llama 3 (commercial OK)Low-resource, high-throughput
Phi-414B1× A100 40GBSingle GPUMITReasoning, STEM, compact deployment

Inference Serving Stack

The inference serving stack for air-gapped deployment must be fully self-hosted — no cloud dependencies, update-by-default, or telemetry that requires external connectivity. The standard 2026 stack:

vLLM is the dominant open-source LLM serving framework, providing PagedAttention for memory-efficient serving, continuous batching for high throughput, and OpenAI-compatible API endpoints that allow applications built for OpenAI APIs to work with local models without code changes. vLLM supports all major open-weight models and runs on NVIDIA GPUs with CUDA. It requires no external connectivity after initial setup.

Ollama provides a simpler deployment path for smaller models and development/testing use cases, with a user-friendly model management interface. Less suited to high-throughput production serving than vLLM but valuable for developer workstations and small-scale deployments within the air-gapped environment.

OpenWebUI or similar self-hosted chat interfaces provide end-user access to the LLM without exposing the raw API — important for non-technical users who need a familiar interface without internet access.

Operational Considerations

🔄
Model Updates and Versioning
Without internet connectivity, model updates require a controlled transfer process into the air-gapped environment. Establish a model update pipeline: download verified model weights on a connected workstation, hash verification, security review, physical media or one-way data diode transfer, and deployment via the internal model registry. Cadence typically quarterly for minor updates, event-driven for critical security patches.
📋
Security Accreditation
Air-gapped LLM deployments handling classified or regulated data require formal security accreditation of both the infrastructure and the AI system itself. This involves assessment of the model weights (is open-weight model provenance verified?), the serving infrastructure, access controls, audit logging, and data handling procedures. Accreditation timelines of 6–18 months are common in government and defence contexts.
📊
Monitoring Without Telemetry
Standard AI platform monitoring tools typically rely on cloud telemetry. Air-gapped environments require self-hosted observability: Prometheus and Grafana for infrastructure metrics, local log aggregation (ELK stack or similar), and custom application performance monitoring. Token throughput, GPU utilisation, queue depth, and error rates should be monitored internally.
🔑
Access Control and Audit
All LLM interactions in air-gapped classified environments require comprehensive audit logging: who queried the model, when, with what prompt (or hash thereof), and what response was returned. Access control must integrate with the enclave's identity management — Active Directory or equivalent — with multi-factor authentication for LLM access.

Frequently Asked Questions

On-premises deployment means running infrastructure in your own data centre rather than a cloud provider's — but the infrastructure may still have internet connectivity for management, updates, and telemetry. Private cloud deployment (e.g., a VPC in AWS GovCloud or Azure Government) means logically isolated cloud infrastructure where your data doesn't share compute with other tenants, but the physical infrastructure is still in a cloud provider's facility and connected to internet-routed networks. Air-gapped deployment means physical network isolation — no network interface connected to any externally routable network. The air gap provides the strongest security guarantee for certain threat models (insider threat with network access, nation-state network-layer attacks) but at significant operational cost. Most organisations with data sensitivity requirements are adequately served by on-premises or private cloud deployment; air-gapped deployment is reserved for the highest classification levels and most sensitive operational contexts.

For most enterprise use cases in 2026, the gap between top open-weight models (Llama 3.3 70B, Mistral Large 2) and frontier closed models has narrowed substantially. On standard benchmarks and practical enterprise tasks — document summarisation, information extraction, structured output generation, code assistance — 70B+ parameter models deliver performance that satisfies most enterprise requirements. The gap is most pronounced for complex multi-step reasoning, nuanced instruction following on ambiguous tasks, and cutting-edge coding capability — tasks where GPT-4 class models maintain a meaningful edge. For air-gapped deployments, the question is not 'is it as good as GPT-4?' but 'is it good enough for our use cases?' — and for a large majority of enterprise NLP tasks, the answer is yes. Running a parallel evaluation of your specific use cases with local models before committing to air-gapped deployment is strongly recommended.

NVIDIA H100 and H200 GPU lead times have improved from the extreme shortage of 2023–2024 but remain 8–16 weeks for standard commercial procurement through Dell, HP, and Supermicro server configurations. Classified procurement through government channels may have different lead times and may require additional export control review for H100/H200 in certain geographies. AMD MI300X availability has improved and represents a 4–8 week lead time through AMD's enterprise channel. For organisations with urgent requirements, the secondary market for A100 servers provides shorter lead times (2–6 weeks) with performance adequate for most 70B model serving use cases. Factor GPU procurement lead times into your deployment timeline from the start — hardware delays are the most common cause of air-gapped LLM deployment timeline slippage.

Fine-tuning in an air-gapped environment requires all training infrastructure to be available locally: the base model weights, training data (classified within the enclave), and the fine-tuning framework. The standard approach uses parameter-efficient fine-tuning methods — LoRA (Low-Rank Adaptation) or QLoRA — which require significantly less GPU memory than full fine-tuning and can be completed on 1–2 A100 GPUs for 7B–13B parameter models. The HuggingFace Transformers library and PEFT library are fully self-hostable and require no external connectivity after the initial transfer of code and model weights into the enclave. Training data curation, model evaluation, and deployment promotion all occur within the air-gapped environment. The operational challenge is that fine-tuned models need the same secure transfer and version management processes as base model updates — establish these processes before beginning fine-tuning work.

Air-gapping eliminates network-based exfiltration risks but does not eliminate all AI-specific security risks. Prompt injection attacks — where malicious instructions embedded in user-supplied documents manipulate the LLM into taking unintended actions — remain a risk in air-gapped RAG and agent deployments. Model weight provenance is a concern: open-weight models downloaded from the internet before transfer into the enclave must be verified against known-good hashes; compromised model weights (supply chain attack) could embed backdoors that are activated by specific inputs. Insider threat remains: authorised users can attempt to extract sensitive information through carefully crafted prompts, and LLM responses to sensitive queries may need audit review. Output redaction — automatically screening LLM outputs for sensitive information that shouldn't leave the enclave through the user interface — is an additional control worth considering for the highest-sensitivity environments.

The minimum viable configuration for a useful production capability (serving 7–13B parameter models to 10–20 concurrent users) is 1–2 servers with NVIDIA A100 40GB or 80GB GPUs, 256–512GB system RAM, NVMe SSD storage (4TB+ for model weights and operating system), and 100GbE networking within the enclave. This supports Llama 3.1 8B in FP16 or a 13B model in INT4 quantisation with reasonable throughput. For 70B model serving, 2–4 A100 80GB GPUs are required. The minimum viable configuration for development and testing is lower: a single NVIDIA A100 40GB (or consumer RTX 4090 for non-classified development) running quantised models is sufficient to validate use cases and develop applications before production GPU hardware is procured. Consumer GPUs (RTX 4090, RTX 3090) are not appropriate for classified or production use but work well for capability prototyping within an isolated development environment.

RAG in air-gapped environments requires all components to be self-hosted: the vector database (Chroma, Qdrant, Weaviate, Milvus — all available as self-hosted deployments), the embedding model (open-weight embedding models like nomic-embed-text, BGE, or E5 run locally on CPU or GPU), the document ingestion pipeline, and the LLM itself. The architecture is identical to cloud RAG except every component runs within the enclave. Performance considerations: embedding generation is compute-intensive and may require dedicated GPU capacity separate from the inference server; vector database performance scales with index size and query volume; and document preprocessing (PDF parsing, OCR for scanned documents) requires additional software components that must also be air-gapped. The end-to-end air-gapped RAG stack is mature and well-tested in 2026 — all required components are available as open-source software with no mandatory external connectivity.

Ongoing costs for air-gapped LLM infrastructure are dominated by three categories: power and cooling (GPU servers consume 3–15kW per server at load — a 4-GPU H100 server draws ~10kW; a 10-server cluster costs £50,000–150,000 annually in power and cooling at typical data centre rates), staff costs (1–2 FTE specialised in AI infrastructure, model management, and security operations is typically required for a production air-gapped LLM environment), and hardware refresh (GPU hardware has a 3–5 year useful life for AI workloads before performance falls meaningfully behind current-generation models; a full hardware refresh cycle for a 4-server cluster represents £500K–2M capital cost depending on GPU tier). Software costs are minimal — the open-source serving stack has no licence fees — but security accreditation and audit costs for classified environments add £20–100K annually depending on classification level and audit frequency. Total cost of ownership for a modest air-gapped LLM environment runs £200–600K annually in the UK or equivalent markets.

PRIVATE AI

Ready to Implement Private AI infrastructure: air-gapped LLM deployme...?

Our specialist team delivers measurable ROI from Confidential Computing and P programmes for enterprise and D2C brands.

Free Audit