Air-gapped LLM deployment — running large language models on infrastructure with no internet connectivity — is the gold standard for organisations with the most stringent data sovereignty, classification, or operational security requirements. Defence, intelligence, critical national infrastructure, and regulated financial institutions are deploying private AI infrastructure in fully isolated environments. This guide covers the architecture, hardware requirements, model selection, and operational complexity involved.
Why Air-Gapped LLM Deployment?
The cloud AI deployment model — sending prompts to API endpoints operated by OpenAI, Anthropic, Google, or AWS — is unacceptable for a defined set of use cases. The reasons vary: classification-level data that legally cannot leave controlled infrastructure, operational security requirements where network egress itself represents an attack surface, regulatory regimes that mandate data residency with audit trail requirements cloud APIs cannot meet, or risk tolerance decisions by legal and compliance teams unwilling to accept cloud provider data processing terms regardless of their content.
Air-gapped deployment is distinct from private cloud or VPC deployment — in a VPC deployment, the infrastructure is logically isolated but physically connected to internet-routed networks. Air-gapped means physical separation: no network interface connected to any externally routable network. This is a meaningful security difference for threat models that include nation-state adversaries or insider threats with network access, and it comes with significant operational cost in terms of update mechanisms, model versioning, and toolchain management.
Hardware Architecture for Air-Gapped LLM
Air-gapped LLM deployment requires on-premises GPU infrastructure scaled to the model size and inference workload. The hardware architecture decisions are more consequential than in cloud deployment because they represent a capital commitment that cannot be elastically scaled.
GPU selection for air-gapped deployment in 2026 is primarily between NVIDIA H100/H200 (highest performance, highest cost, controlled export in some jurisdictions), NVIDIA A100 (mature, well-supported, lower cost on secondary market), AMD Instinct MI300X (competitive performance, growing software ecosystem), and specialised inference accelerators (Groq LPU, Cerebras for specific workloads). For classified deployments, export control restrictions on H100/H200 to certain jurisdictions must be factored into procurement — in some cases, A100 or MI300X becomes the de facto choice based on licencing and export requirements rather than performance preference.
Server configuration typically uses 4–8 GPU servers per inference cluster, with NVLink or high-bandwidth interconnect between GPUs for large model serving. A 70B parameter model requires 140GB+ of GPU memory for full-precision serving — minimum 2× A100 80GB or 2× H100 80GB for comfortable serving of a 70B model in FP16. Quantised models (INT4, INT8) reduce memory requirements significantly: a 70B model in INT4 quantisation fits in ~35GB, enabling single-GPU serving at the cost of some accuracy.
Storage and networking within the air-gapped enclave require careful design: model weights (70B model = 140GB+ in FP16) must be stored on fast local storage (NVMe SSD) or high-bandwidth NAS accessible to GPU servers; internal network connectivity within the air-gapped environment uses standard high-bandwidth switching (100GbE or InfiniBand); and the data transfer mechanism into the air-gapped environment (for model updates, software patches, new data) must be explicitly designed — physical media transfer, one-way data diodes, or controlled-transfer workstations with security review are common patterns.
Model Selection for Air-Gapped Deployment
Not all LLMs are suitable for air-gapped deployment — models must be available for download and local execution, which excludes closed-API models (GPT-4, Claude, Gemini Ultra) and focuses the field on open-weight models.
| Model | Parameters | GPU Req (FP16) | GPU Req (INT4) | Licence | Best For |
|---|---|---|---|---|---|
| Llama 3.3 70B | 70B | 2× H100 80GB | 1× H100 80GB | Meta Llama 3 (commercial OK) | General purpose, strong reasoning |
| Mistral Large 2 | 123B | 3–4× H100 80GB | 2× H100 80GB | Mistral Research (check terms) | Multilingual, instruction following |
| Falcon 180B | 180B | 8× A100 80GB | 4× A100 80GB | Apache 2.0 | Maximum open-weight capability |
| Llama 3.1 8B | 8B | 1× A100 40GB | Single consumer GPU | Meta Llama 3 (commercial OK) | Low-resource, high-throughput |
| Phi-4 | 14B | 1× A100 40GB | Single GPU | MIT | Reasoning, STEM, compact deployment |
Inference Serving Stack
The inference serving stack for air-gapped deployment must be fully self-hosted — no cloud dependencies, update-by-default, or telemetry that requires external connectivity. The standard 2026 stack:
vLLM is the dominant open-source LLM serving framework, providing PagedAttention for memory-efficient serving, continuous batching for high throughput, and OpenAI-compatible API endpoints that allow applications built for OpenAI APIs to work with local models without code changes. vLLM supports all major open-weight models and runs on NVIDIA GPUs with CUDA. It requires no external connectivity after initial setup.
Ollama provides a simpler deployment path for smaller models and development/testing use cases, with a user-friendly model management interface. Less suited to high-throughput production serving than vLLM but valuable for developer workstations and small-scale deployments within the air-gapped environment.
OpenWebUI or similar self-hosted chat interfaces provide end-user access to the LLM without exposing the raw API — important for non-technical users who need a familiar interface without internet access.