On-device AI — running large language models entirely on local hardware without sending data to cloud APIs — has shifted from experimental to production-viable in 2026. Advances in model quantisation, hardware acceleration, and purpose-built edge AI chips have made it possible to run capable LLMs on consumer laptops, mobile devices, and enterprise edge servers. For organisations handling sensitive data, the privacy and latency benefits are compelling; this guide covers the technology, the tradeoffs, and the implementation path.
What Is On-Device AI and Why Does It Matter for Privacy?
On-device AI refers to running machine learning inference entirely on local hardware — the user's device, an on-premises server, or an edge appliance — without transmitting data to external cloud services. The privacy implication is fundamental: data processed on-device never leaves the organisation's security perimeter, eliminating the cloud API surface as a data exposure vector.
The privacy case is strongest for workloads involving personal health data, legal documents, financial records, proprietary intellectual property, or any data subject to jurisdictional data residency requirements that prohibit processing outside specific geographic boundaries. For these use cases, cloud API approaches require complex contractual frameworks, DPA agreements, and residual trust in cloud provider security — all of which on-device processing eliminates by design.
Current On-Device Model Landscape
The model ecosystem for on-device deployment has matured dramatically since 2023. Purpose-built efficient models and quantised versions of larger models now cover a broad capability range suitable for production workloads.
Apple Intelligence models represent the most polished consumer on-device AI deployment, with Apple's 3B parameter foundation model running natively on iPhone 15 Pro and all M-series Macs. The Private Cloud Compute architecture extends this with privacy-preserving server-side processing for tasks that exceed device capability, with formal cryptographic guarantees that Apple cannot access the data. For organisations already in the Apple ecosystem, this is the easiest path to on-device AI for end-user applications.
Meta Llama 3.2 in 1B and 3B parameter sizes is optimised for mobile and edge deployment and represents the open-weight foundation for most enterprise on-device deployments. The 3B model runs comfortably on modern smartphones with 6GB RAM and at excellent speeds on Apple Silicon Macs. Fine-tuned variants for specific domains (legal, medical, code) are widely available through Hugging Face and can be deployed using the same runtime infrastructure as the base model.
Microsoft Phi-3 and Phi-4 models are specifically designed for edge deployment, with the Phi-3-mini (3.8B) achieving GPT-3.5-level performance on reasoning benchmarks while running on consumer hardware. Microsoft's ONNX Runtime provides cross-platform on-device deployment infrastructure with hardware acceleration support across Intel, AMD, ARM, and Apple Silicon.
Google Gemma 2 (2B and 9B variants) provides strong multilingual capability in a deployment-efficient package, with TensorFlow Lite and MediaPipe integration enabling mobile deployment. The 2B model runs on mid-range Android devices with 4GB RAM.
| Model | Parameters | Min RAM | Best Hardware | Capability Level | Licence |
|---|---|---|---|---|---|
| Apple Intelligence | ~3B on-device | 8GB (iPhone 15 Pro+) | Apple Neural Engine | Strong for everyday tasks | Proprietary (Apple devices only) |
| Llama 3.2 3B | 3B | 4GB | Apple Silicon, modern ARM | Good reasoning and instruction following | Llama Community Licence |
| Phi-4 Mini | 3.8B | 4GB | ONNX Runtime, all platforms | Strong reasoning, maths, code | MIT |
| Gemma 2 2B | 2B | 3GB | Android, edge devices | Good multilingual tasks | Gemma Terms of Use |
| Mistral 7B Q4 | 7B (quantised) | 6GB | Apple M-series, Nvidia GPU | Strong general purpose | Apache 2.0 |
| Llama 3.1 8B Q4 | 8B (quantised) | 8GB | Apple M2+, RTX 3060+ | Near GPT-3.5 on many tasks | Llama Community Licence |
Runtime Infrastructure for On-Device Deployment
Selecting the right inference runtime is as important as model selection — it determines hardware compatibility, performance characteristics, and integration complexity.
llama.cpp is the most widely used open-source runtime for on-device LLM deployment, supporting GGUF-format quantised models across CPU and GPU on all major platforms (macOS, Linux, Windows, iOS, Android). Its Metal backend provides excellent Apple Silicon performance; CUDA backend for Nvidia GPUs; and a pure CPU path for any hardware. The project's broad community support means most open-weight models have GGUF conversions available and production-quality server wrappers (llama-server) for API-compatible local endpoints.
Ollama wraps llama.cpp in a more user-friendly package with automatic model downloads, a simple REST API, and a growing model library. For enterprise deployments where ease of use and standardised API access matter more than raw performance tuning, Ollama provides the fastest path to on-device LLM capabilities with an OpenAI-compatible API that requires minimal application code changes.
Apple MLX is Apple's own machine learning framework optimised for the unified memory architecture of Apple Silicon. MLX models run significantly faster than llama.cpp on M-series hardware for many architectures and is the recommended runtime for macOS-first deployments where Apple Silicon is the target platform.
Microsoft ONNX Runtime provides cross-platform deployment for Phi and other ONNX-compatible models, with hardware abstraction across Intel, AMD, ARM, and Apple Silicon. It is the natural choice for enterprise Windows deployments and organisations requiring consistent cross-platform behaviour.
Implementation Roadmap
Audit your current cloud AI API usage and classify each workload by data sensitivity. High-sensitivity workloads (patient data, legal docs, financial PII) are primary on-device candidates. Map regulatory requirements — GDPR data residency, HIPAA, sector-specific rules — that mandate local processing.
Evaluate 3–4 candidate models on representative samples of your actual tasks. On-device model capability is highly task-dependent — a model excellent at summarisation may perform poorly at structured data extraction. Build a golden evaluation dataset from production examples and use it to compare models systematically rather than relying on general benchmark rankings.
Assess RAM, available compute (CPU TFLOPS, GPU VRAM, NPU capability), and thermal constraints across your target device fleet. On mobile, thermal throttling under sustained inference load is a common production issue discovered late. Define minimum hardware specifications for supported on-device AI functionality to avoid degraded experiences on older devices.
Use Ollama or llama-server to expose an OpenAI-compatible local API, enabling application code to switch between cloud and on-device inference by changing the API base URL and model name. This architecture simplifies hybrid deployments where on-device handles sensitive workloads and cloud APIs handle tasks requiring frontier model capabilities.
Treat on-device model updates like application releases — version-controlled, staged rollout, regression-tested. Build model download and update infrastructure that is bandwidth-conscious (3–8GB per model update), supports background downloads, and includes rollback capability. On-device model management is an operational discipline that cloud API users do not need to develop but on-device deployments cannot avoid.
On-device and cloud AI are not mutually exclusive — the most effective enterprise deployments use a tiered model routing approach: sensitive data and latency-critical tasks route to on-device models, while complex tasks requiring frontier capability and non-sensitive queries route to cloud APIs. A model gateway layer (LiteLLM, Portkey) manages this routing transparently to application code.
Tradeoffs and Limitations
Capability gap versus frontier models remains real. On-device models in the 3–8B parameter range perform well on structured tasks but lag behind GPT-4-class models on complex multi-step reasoning, nuanced long-form generation, and tasks requiring extensive world knowledge. Accurately assessing whether this gap matters for your specific workload — not in the abstract but on actual production task samples — is essential to making sound deployment decisions.
Device fragmentation creates uneven performance across a real-world device fleet. A use case that works smoothly on an M3 MacBook Pro may be too slow on an older Windows laptop or unsupported on 32-bit mobile devices. Define minimum supported device specifications and build graceful degradation paths for users on hardware below the threshold.
Model update logistics require infrastructure investment that cloud API users do not need. Distributing 4–8GB model files to a fleet of devices requires CDN infrastructure, bandwidth management (avoiding simultaneous mass downloads), delta update capability, and monitoring for failed updates. These infrastructure costs are real and should be factored into TCO comparisons with cloud API approaches.