Confidential Computing and P February 25, 2026 11 min read

On-device AI for privacy: running LLMs locally guide

Confidential Computing and P Enterprise Guide 2026 SCALE D2C D2C Technology Confidential Computing and P Enterprise Guide 2026 SCALE D2C D2C Technology

On-device AI — running large language models entirely on local hardware without sending data to cloud APIs — has shifted from experimental to production-viable in 2026. Advances in model quantisation, hardware acceleration, and purpose-built edge AI chips have made it possible to run capable LLMs on consumer laptops, mobile devices, and enterprise edge servers. For organisations handling sensitive data, the privacy and latency benefits are compelling; this guide covers the technology, the tradeoffs, and the implementation path.

What Is On-Device AI and Why Does It Matter for Privacy?

On-device AI refers to running machine learning inference entirely on local hardware — the user's device, an on-premises server, or an edge appliance — without transmitting data to external cloud services. The privacy implication is fundamental: data processed on-device never leaves the organisation's security perimeter, eliminating the cloud API surface as a data exposure vector.

The privacy case is strongest for workloads involving personal health data, legal documents, financial records, proprietary intellectual property, or any data subject to jurisdictional data residency requirements that prohibit processing outside specific geographic boundaries. For these use cases, cloud API approaches require complex contractual frameworks, DPA agreements, and residual trust in cloud provider security — all of which on-device processing eliminates by design.

On-Device LLM Inference — Technical Definition

Running a large language model's forward pass (the computation that generates each output token) entirely on local CPU, GPU, or NPU hardware, using locally stored model weights. No data is transmitted to external servers during inference. The model may have been trained externally, but execution is entirely local.

4B+

Parameter models now run at practical speeds on Apple M-series MacBooks — making capable local inference accessible to knowledge workers without specialist hardware

~0ms

Network latency for on-device inference — eliminating the 80–300ms round-trip to cloud APIs that makes real-time applications challenging

94%

Of enterprise respondents in Gartner's 2025 AI privacy survey cite data sovereignty as a primary driver for evaluating on-device AI deployment

Current On-Device Model Landscape

The model ecosystem for on-device deployment has matured dramatically since 2023. Purpose-built efficient models and quantised versions of larger models now cover a broad capability range suitable for production workloads.

Apple Intelligence models represent the most polished consumer on-device AI deployment, with Apple's 3B parameter foundation model running natively on iPhone 15 Pro and all M-series Macs. The Private Cloud Compute architecture extends this with privacy-preserving server-side processing for tasks that exceed device capability, with formal cryptographic guarantees that Apple cannot access the data. For organisations already in the Apple ecosystem, this is the easiest path to on-device AI for end-user applications.

Meta Llama 3.2 in 1B and 3B parameter sizes is optimised for mobile and edge deployment and represents the open-weight foundation for most enterprise on-device deployments. The 3B model runs comfortably on modern smartphones with 6GB RAM and at excellent speeds on Apple Silicon Macs. Fine-tuned variants for specific domains (legal, medical, code) are widely available through Hugging Face and can be deployed using the same runtime infrastructure as the base model.

Microsoft Phi-3 and Phi-4 models are specifically designed for edge deployment, with the Phi-3-mini (3.8B) achieving GPT-3.5-level performance on reasoning benchmarks while running on consumer hardware. Microsoft's ONNX Runtime provides cross-platform on-device deployment infrastructure with hardware acceleration support across Intel, AMD, ARM, and Apple Silicon.

Google Gemma 2 (2B and 9B variants) provides strong multilingual capability in a deployment-efficient package, with TensorFlow Lite and MediaPipe integration enabling mobile deployment. The 2B model runs on mid-range Android devices with 4GB RAM.

Model	Parameters	Min RAM	Best Hardware	Capability Level	Licence
Apple Intelligence	~3B on-device	8GB (iPhone 15 Pro+)	Apple Neural Engine	Strong for everyday tasks	Proprietary (Apple devices only)
Llama 3.2 3B	3B	4GB	Apple Silicon, modern ARM	Good reasoning and instruction following	Llama Community Licence
Phi-4 Mini	3.8B	4GB	ONNX Runtime, all platforms	Strong reasoning, maths, code	MIT
Gemma 2 2B	2B	3GB	Android, edge devices	Good multilingual tasks	Gemma Terms of Use
Mistral 7B Q4	7B (quantised)	6GB	Apple M-series, Nvidia GPU	Strong general purpose	Apache 2.0
Llama 3.1 8B Q4	8B (quantised)	8GB	Apple M2+, RTX 3060+	Near GPT-3.5 on many tasks	Llama Community Licence

Runtime Infrastructure for On-Device Deployment

Selecting the right inference runtime is as important as model selection — it determines hardware compatibility, performance characteristics, and integration complexity.

llama.cpp is the most widely used open-source runtime for on-device LLM deployment, supporting GGUF-format quantised models across CPU and GPU on all major platforms (macOS, Linux, Windows, iOS, Android). Its Metal backend provides excellent Apple Silicon performance; CUDA backend for Nvidia GPUs; and a pure CPU path for any hardware. The project's broad community support means most open-weight models have GGUF conversions available and production-quality server wrappers (llama-server) for API-compatible local endpoints.

Ollama wraps llama.cpp in a more user-friendly package with automatic model downloads, a simple REST API, and a growing model library. For enterprise deployments where ease of use and standardised API access matter more than raw performance tuning, Ollama provides the fastest path to on-device LLM capabilities with an OpenAI-compatible API that requires minimal application code changes.

Apple MLX is Apple's own machine learning framework optimised for the unified memory architecture of Apple Silicon. MLX models run significantly faster than llama.cpp on M-series hardware for many architectures and is the recommended runtime for macOS-first deployments where Apple Silicon is the target platform.

Microsoft ONNX Runtime provides cross-platform deployment for Phi and other ONNX-compatible models, with hardware abstraction across Intel, AMD, ARM, and Apple Silicon. It is the natural choice for enterprise Windows deployments and organisations requiring consistent cross-platform behaviour.

Implementation Roadmap

Discovery

Identify privacy-sensitive workloads

Audit your current cloud AI API usage and classify each workload by data sensitivity. High-sensitivity workloads (patient data, legal docs, financial PII) are primary on-device candidates. Map regulatory requirements — GDPR data residency, HIPAA, sector-specific rules — that mandate local processing.

Model Evaluation

Benchmark models against your task distribution

Evaluate 3–4 candidate models on representative samples of your actual tasks. On-device model capability is highly task-dependent — a model excellent at summarisation may perform poorly at structured data extraction. Build a golden evaluation dataset from production examples and use it to compare models systematically rather than relying on general benchmark rankings.

Hardware Assessment

Evaluate target device capabilities

Assess RAM, available compute (CPU TFLOPS, GPU VRAM, NPU capability), and thermal constraints across your target device fleet. On mobile, thermal throttling under sustained inference load is a common production issue discovered late. Define minimum hardware specifications for supported on-device AI functionality to avoid degraded experiences on older devices.

Build & Integrate

Implement with API-compatible local endpoint

Use Ollama or llama-server to expose an OpenAI-compatible local API, enabling application code to switch between cloud and on-device inference by changing the API base URL and model name. This architecture simplifies hybrid deployments where on-device handles sensitive workloads and cloud APIs handle tasks requiring frontier model capabilities.

Production Operations

Manage model updates and version control

Treat on-device model updates like application releases — version-controlled, staged rollout, regression-tested. Build model download and update infrastructure that is bandwidth-conscious (3–8GB per model update), supports background downloads, and includes rollback capability. On-device model management is an operational discipline that cloud API users do not need to develop but on-device deployments cannot avoid.

💡 Key Insight

On-device and cloud AI are not mutually exclusive — the most effective enterprise deployments use a tiered model routing approach: sensitive data and latency-critical tasks route to on-device models, while complex tasks requiring frontier capability and non-sensitive queries route to cloud APIs. A model gateway layer (LiteLLM, Portkey) manages this routing transparently to application code.

Tradeoffs and Limitations

Capability gap versus frontier models remains real. On-device models in the 3–8B parameter range perform well on structured tasks but lag behind GPT-4-class models on complex multi-step reasoning, nuanced long-form generation, and tasks requiring extensive world knowledge. Accurately assessing whether this gap matters for your specific workload — not in the abstract but on actual production task samples — is essential to making sound deployment decisions.

Device fragmentation creates uneven performance across a real-world device fleet. A use case that works smoothly on an M3 MacBook Pro may be too slow on an older Windows laptop or unsupported on 32-bit mobile devices. Define minimum supported device specifications and build graceful degradation paths for users on hardware below the threshold.

Model update logistics require infrastructure investment that cloud API users do not need. Distributing 4–8GB model files to a fleet of devices requires CDN infrastructure, bandwidth management (avoiding simultaneous mass downloads), delta update capability, and monitoring for failed updates. These infrastructure costs are real and should be factored into TCO comparisons with cloud API approaches.

Expert Q&A

Frequently Asked Questions

For structured, well-defined tasks — document summarisation, classification, entity extraction, Q&A against specific documents — on-device models in the 7–8B range perform at 85–92% of GPT-4 quality in blind evaluations. The gap widens significantly for complex multi-step reasoning, creative generation, and tasks requiring broad world knowledge. The right question is not whether on-device matches cloud quality in general but whether it meets quality thresholds for your specific tasks — which requires empirical evaluation rather than general benchmark comparisons.

For personal productivity use cases (summarisation, drafting, Q&A), Apple M2 or later MacBooks with 16GB RAM are the most capable consumer devices, running 7–8B models at 30–50 tokens/second. For Windows enterprise fleets, laptops with discrete Nvidia RTX 3060 or later GPUs (8GB+ VRAM) deliver comparable performance. For server-side on-premises deployment handling team-level inference loads, a single Nvidia A100 or H100 GPU comfortably serves 20–50 concurrent users at 7B scale. NPU-equipped ARM laptops (Snapdragon X Elite, Intel Meteor Lake) are improving rapidly and will deliver strong on-device performance at lower power budgets by 2027.

On-device processing significantly simplifies GDPR and HIPAA compliance by eliminating the cloud data processor relationship and the associated DPA requirements, data transfer agreements, and cross-border transfer mechanisms. Data processed entirely on-device and never transmitted externally falls outside most cloud-specific regulatory requirements. However, on-device does not automatically make an application compliant — you still need to address data retention, access controls, audit logging, and the full scope of applicable regulatory requirements. The compliance benefit is the elimination of cloud processor risk, not wholesale exemption from regulation.

Quantisation reduces model weight precision from 32-bit or 16-bit floating point to lower precision formats (8-bit, 4-bit, or even 2-bit integers), reducing model file size and memory requirements by 2–8× while enabling faster inference on hardware without high-precision floating point units. The quality tradeoff depends on the quantisation method and precision level: 8-bit quantisation (Q8) produces negligible quality loss on most tasks; 4-bit quantisation (Q4) loses 2–5% quality on complex reasoning tasks but remains acceptable for most structured tasks; extreme quantisation below 4-bit produces more significant quality degradation. GGUF format quantised models from llama.cpp represent the current best-practice balance of file size, compatibility, and quality.

Fine-tuning for on-device deployment follows the same process as cloud model fine-tuning but must target the smaller model architecture you intend to deploy. LoRA (Low-Rank Adaptation) fine-tuning is the standard approach — it modifies a small number of adapter weights rather than all model parameters, requiring 10–30× less compute than full fine-tuning and producing adapters (50–200MB) rather than full model copies. A domain-specific fine-tune of Llama 3.2 3B with 5,000–50,000 examples requires 2–8 GPU hours on an A100, costing $50–300 per training run. The fine-tuned model and adapter can be merged for deployment as a single GGUF file on the target runtime.

Apple Private Cloud Compute (PCC) is a hybrid architecture where Apple Intelligence processes some requests on-device and routes more complex requests to purpose-built Apple servers with cryptographic privacy guarantees — specifically, that Apple and Apple's servers cannot read the content of requests processed there. PCC uses hardware-attested secure enclaves, stateless request processing, and formal public verifiability commitments that allow security researchers to inspect the system. It differs from standard on-device AI in that data does leave the device (to Apple's servers), but with stronger privacy guarantees than typical cloud processing. For enterprise use cases requiring absolute data sovereignty (no data leaves the organisation's infrastructure), PCC does not satisfy the requirement — only true on-premises or device-local processing does.

LLM inference is computationally intensive and does affect battery life, but the impact varies significantly by model size, inference duration, and hardware efficiency. Apple's Neural Engine is highly power-efficient for on-device model inference — a 30-second inference session on Llama 3.2 3B on an iPhone 15 Pro consumes approximately 0.5–1% battery, comparable to a video call. Android NPUs show similar efficiency for supported model architectures. The critical optimisation is routing: only invoke on-device inference for requests that genuinely require it, rather than running inference continuously. Background inference tasks should be throttled by battery state and thermal conditions, with graceful degradation to cloud routing when the device is in a constrained state.

The tooling ecosystem has matured considerably in 2025–2026. For iOS and macOS, Apple's Core ML and the MLX Swift library provide native integration with the hardware acceleration available on Apple devices. For Android, Google's MediaPipe LLM Inference API and TensorFlow Lite support Gemma and other compatible models. For cross-platform desktop and server applications, Ollama provides the simplest deployment path with an OpenAI-compatible API; llama.cpp's server mode offers more performance tuning for production workloads. LangChain and LlamaIndex both support Ollama as a backend, enabling on-device deployment with the same application framework as cloud AI deployments, often requiring only a configuration change rather than application code modifications.

ON-DEVICE

Confidential Computing and P

Ready to Implement On-device AI for privacy: running LLMs locally gui...?

Our specialist team delivers measurable ROI from Confidential Computing and P programmes for enterprise and D2C brands.

Book a Free Advisory Call Explore All Services