Foundation models for robotics — large pre-trained models that encode general robot control, perception, and reasoning capabilities — represent the most significant shift in robotics since deep learning. RT-2, OpenVLA, and π0 (pi zero) are the leading examples: models that can follow natural language instructions, generalise to novel objects and environments, and perform complex manipulation tasks without task-specific programming. For enterprise robotics programmes, understanding these models determines whether your Physical AI roadmap includes a 2-year lead over competitors or a 2-year catch-up.
What Are Foundation Models for Robotics?
Leading Robotics Foundation Models
| Model | Developer | Architecture | Key Capability | Availability |
|---|---|---|---|---|
| RT-2 (Robotic Transformer 2) | Google DeepMind | VLA built on PaLI-X and PaLM-E backbones: language + vision + action | Emergent reasoning: "move the item that can be used as a drink" | Research only; no public weights or API |
| OpenVLA | Stanford / Berkeley (open) | 7B VLA built on the Prismatic-7B VLM (Llama 2 language model with DINOv2 and SigLIP vision encoders) | Strong generalisation; open-weight fine-tuning | Open-weight; MIT licence |
| π0 (pi zero) | Physical Intelligence | Flow-matching action expert on a pre-trained VLM (PaliGemma) | Dexterous bimanual manipulation; continuous action chunks for control at up to 50 Hz | Open weights and code via Physical Intelligence's openpi repository |
| Octo | Berkeley (open) | Transformer with a diffusion action head; action chunking | Versatile, well-documented, easy to fine-tune | Open-weight; MIT licence |
| Diffusion Policy | MIT / Columbia (open) | Diffusion model for robot actions | State-of-the-art dexterous manipulation | Open-weight — MIT licence |
What Is a Vision-Language-Action (VLA) Model?
A VLA model combines a vision encoder (which processes camera images), a language model (which processes text instructions), and an action head (which outputs robot motor commands) into a single network trained end to end. The key capability VLAs unlock is semantic generalisation: the robot can interpret "pick up the blue cup" or "put the item that doesn't belong with the others into the bin" because the language model contributes semantic understanding that task-specific policies lack.
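One way to see how a language model can output motor commands is action tokenisation. RT-2 and OpenVLA are described as discretising each action dimension into 256 bins, so an action becomes a short sequence of token ids. The sketch below is a minimal illustration of that idea, not either model's actual code: the function names, the fixed [-1, 1] range, and the 7-dimensional example command are assumptions made for the example.

```python
"""Toy action tokeniser: continuous robot actions <-> discrete token ids."""

N_BINS = 256  # bins per action dimension, the figure reported for RT-2 and OpenVLA


def encode_action(action, low=-1.0, high=1.0, n_bins=N_BINS):
    """Map each continuous action dimension to an integer bin in [0, n_bins - 1]."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)       # keep the value inside the range
        fraction = (clipped - low) / (high - low)  # normalise to [0, 1]
        tokens.append(min(int(fraction * n_bins), n_bins - 1))
    return tokens


def decode_action(tokens, low=-1.0, high=1.0, n_bins=N_BINS):
    """Map bin indices back to the centre of each bin."""
    width = (high - low) / n_bins
    return [low + (token + 0.5) * width for token in tokens]


if __name__ == "__main__":
    # Hypothetical 7-dimensional command: xyz delta, roll/pitch/yaw delta, gripper.
    command = [0.10, -0.25, 0.0, 0.0, 0.5, -1.0, 1.0]
    tokens = encode_action(command)
    restored = decode_action(tokens)
    print(tokens)
    print([round(v, 3) for v in restored])
```

The round trip loses at most half a bin width per dimension, which is the precision cost of letting the language model treat actions as ordinary tokens. Real systems choose the range per dimension from the training data rather than using a fixed interval.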
Fine-Tuning for Enterprise Deployment
Pre-trained robotics foundation models generalise broadly but are not production-ready for specific enterprise tasks without fine-tuning. The same LoRA and QLoRA techniques used for language model fine-tuning apply: collect 100–500 demonstrations of your specific task, fine-tune the policy head (and optionally the vision encoder) on task-specific data, and evaluate on held-out demonstrations.
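To make the appeal of LoRA concrete: instead of updating a dense weight matrix W, LoRA freezes W and trains two small factors B and A, so the adapted weight is W + (alpha / r) · B · A. The arithmetic below compares parameter counts for one layer. It is a back-of-the-envelope sketch; the hidden size of 4096 and rank of 32 are assumed values chosen for illustration, not settings taken from a specific model configuration.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def full_params(d_in, d_out):
    """Parameters in the frozen dense weight matrix W (d_out x d_in)."""
    return d_in * d_out


if __name__ == "__main__":
    d_in = d_out = 4096  # assumed hidden size, typical of a 7B-class transformer
    rank = 32            # assumed LoRA rank
    lora = lora_trainable_params(d_in, d_out, rank)
    full = full_params(d_in, d_out)
    print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

Under these assumptions the adapter trains under 2% of the layer's parameters, which is why fine-tuning can fit on a single GPU. QLoRA additionally stores the frozen weights in 4-bit precision to cut memory further.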
Data Collection
- Teleoperation demonstrations: a human operator performs the task while all sensor data is recorded
- 100–500 demonstrations are typically sufficient for task-specific fine-tuning
- LeRobot (Hugging Face) provides teleoperation infrastructure and a dataset format

Fine-Tuning
- OpenVLA fine-tuning: standard LoRA adapters on the model's linear layers (the original OpenVLA emits actions as discretised language-model tokens rather than through a separate action head)
- Hardware: a single A100 80GB for OpenVLA fine-tuning, 8–16 hours per task
- Octo: simpler architecture and faster fine-tuning, a good starting point for new teams

Deployment
- Deploy on NVIDIA Jetson AGX Orin or an A100 workstation, depending on robot form factor
- Quantise to INT4 for edge deployment if the task can tolerate a quality trade-off of roughly 10%
- Wrap in a ROS 2 node for integration with existing planning and safety infrastructure

Use Cases
- Bin picking with novel objects: foundation models generalise to new SKUs
- Natural language robot instruction in collaborative robot cells
- Multi-step assembly tasks requiring semantic understanding of components
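The INT4 quantisation step mentioned in the list above trades precision for memory and speed. The sketch below shows the simplest form, symmetric per-tensor quantisation, to make the source of the quality loss visible. It is an illustration under stated assumptions, not how any particular deployment toolchain works: the function names and weight values are made up, and production tools typically use per-group scales and calibration data.

```python
def quantise_int4(weights):
    """Symmetric per-tensor quantisation to signed 4-bit integers in [-8, 7]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0  # largest magnitude maps to 7
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantise(codes, scale):
    """Recover approximate float weights from the 4-bit codes."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    weights = [0.70, -0.33, 0.02, -0.01, 0.14]  # made-up values for illustration
    codes, scale = quantise_int4(weights)
    restored = dequantise(codes, scale)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(codes, round(scale, 3), round(worst, 3))
```

Small weights collapse to zero because only 16 levels are available, so the practical impact depends on the model and task. Measuring task success on held-out demonstrations before and after quantisation is the reliable way to check whether the trade-off is acceptable.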
Our machine learning development and software development teams design and deploy robotics foundation model programmes — from teleoperation data collection through fine-tuning pipelines to ROS 2 production integration. Book a free advisory session to scope your foundation model robotics programme.