Foundation models for robotics: RT-2, OpenVLA, pi zero


Foundation models for robotics — large pre-trained models that encode general robot control, perception, and reasoning capabilities — represent the most significant shift in robotics since deep learning. RT-2, OpenVLA, and π0 (pi zero) are the leading examples: models that can follow natural language instructions, generalise to novel objects and environments, and perform complex manipulation tasks without task-specific programming. For enterprise robotics programmes, understanding these models determines whether your Physical AI roadmap includes a 2-year lead over competitors or a 2-year catch-up.

What Are Foundation Models for Robotics?

Robotics Foundation Models — Definition

Large pre-trained neural networks that encode general-purpose visuomotor control (the ability to translate visual observations and language instructions into robot actions), trained on large, diverse datasets of robot trajectories, internet video, and language. Unlike traditional robot control, which requires task-specific programming, foundation models enable generalisation: a model trained on thousands of manipulation tasks can attempt novel tasks it has never seen, guided by natural language instructions alone. They are the robotics equivalent of GPT: a general capability base that can be adapted, fine-tuned, and deployed for specific tasks.

Leading Robotics Foundation Models

Model	Developer	Architecture	Key Capability	Availability
RT-2 (Robotic Transformer 2)	Google DeepMind	VLA built on PaLI-X and PaLM-E vision-language backbones	Emergent reasoning, e.g. "move the item that can be used as a drink"	Research only; no public weights or API
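OpenVLA	Stanford / Berkeley (open)	7B VLA built on the Prismatic-7B VLM (Llama 2 backbone; DINOv2 and SigLIP vision encoders)	Strong generalisation; open-weight fine-tuning	Open-weight; MIT licence (Llama 2 licence terms apply to the base model)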
π0 (pi zero)	Physical Intelligence	Flow-matching action expert on a pre-trained VLM	Dexterous bimanual manipulation; high-frequency action output	Open weights via the openpi repository (Apache 2.0)
Octo	Berkeley (open)	Transformer with a diffusion action head; action chunking	Versatile, well-documented, easy to fine-tune	Open-weight; MIT licence
Diffusion Policy	MIT / Columbia (open)	Diffusion model for robot actions	State-of-the-art dexterous manipulation	Open-weight — MIT licence

What Is a Vision-Language-Action (VLA) Model?

A VLA model combines a vision encoder (processes camera images), a language model (processes text instructions), and an action head (outputs robot motor commands) into a single end-to-end trained network. The key capability VLAs unlock is semantic generalisation: the robot can interpret "pick up the blue cup" or "put the item that doesn't belong with the others into the bin" — because the language model contributes semantic understanding that task-specific policies lack.
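One way to see how a language model can "output" motor commands: RT-2 and OpenVLA represent each action dimension as one of 256 discrete bins, so an action becomes a short sequence of tokens. The sketch below illustrates that idea only. The uniform bins, the fixed bounds, the 7-dimensional action layout, and the function names are assumptions made for the example, not the models' actual code.

```python
def discretize_action(action, low, high, n_bins=256):
    """Map each continuous action dimension to an integer bin in [0, n_bins - 1]."""
    tokens = []
    for value, lo, hi in zip(action, low, high):
        clipped = min(max(value, lo), hi)       # clamp to the valid range
        fraction = (clipped - lo) / (hi - lo)   # normalise to [0, 1]
        tokens.append(min(int(fraction * n_bins), n_bins - 1))
    return tokens


def undiscretize_action(tokens, low, high, n_bins=256):
    """Map bin indices back to continuous values at the centre of each bin."""
    return [lo + (t + 0.5) * (hi - lo) / n_bins
            for t, lo, hi in zip(tokens, low, high)]


# Hypothetical 7-dimensional action: xyz delta, roll/pitch/yaw delta, gripper opening.
LOW = [-1.0] * 6 + [0.0]
HIGH = [1.0] * 7

action = [0.10, -0.25, 0.0, 0.0, 0.0, 0.5, 1.0]
tokens = discretize_action(action, LOW, HIGH)
recovered = undiscretize_action(tokens, LOW, HIGH)

print(tokens)  # [140, 96, 128, 128, 128, 192, 255]
```

The round trip loses at most one bin width per dimension, which is the resolution cost of treating control as token prediction.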

62%

Success rate of RT-2 on previously unseen scenarios described in natural language, compared with the 32% Google DeepMind reports for its predecessor RT-1

7B

Parameters in OpenVLA, the current open-weight standard for robotics foundation models. Its authors report roughly 6 Hz inference on a single NVIDIA RTX 4090 GPU, which suits slower manipulation better than high-rate control

30Hz

Control frequency achievable with π0 (pi zero) using flow matching — fast enough for dexterous bimanual manipulation tasks that require sub-50ms action latency
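The 6 Hz and 30 Hz figures above translate directly into the time available per control step. The short calculation below makes that explicit. It assumes one model call per action (no action chunking, which would amortise a slow call over several steps), and the 50 ms budget is the threshold quoted above.

```python
def control_period_ms(frequency_hz: float) -> float:
    """Time available per control step, in milliseconds."""
    return 1000.0 / frequency_hz


def meets_budget(frequency_hz: float, budget_ms: float) -> bool:
    """True if one control step fits inside the latency budget."""
    return control_period_ms(frequency_hz) <= budget_ms


LATENCY_BUDGET_MS = 50.0  # sub-50 ms target for dexterous bimanual tasks

for name, hz in [("OpenVLA", 6.0), ("pi zero", 30.0)]:
    period = control_period_ms(hz)
    ok = meets_budget(hz, LATENCY_BUDGET_MS)
    print(f"{name}: {period:.1f} ms per step, within budget: {ok}")

# OpenVLA: 166.7 ms per step, within budget: False
# pi zero: 33.3 ms per step, within budget: True
```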

Fine-Tuning for Enterprise Deployment

Pre-trained robotics foundation models generalise broadly but are not production-ready for specific enterprise tasks without fine-tuning. The same LoRA and QLoRA techniques used for language model fine-tuning apply: collect 100–500 demonstrations of your specific task, attach low-rank adapters to the model's linear layers (optionally including the vision encoder), train them on the task data while the pre-trained weights stay frozen, and evaluate on held-out demonstrations.

📦 Data Collection

Teleoperation demonstrations — human operator performs the task while recording all sensor data
100–500 demonstrations typically sufficient for task-specific fine-tuning
LeRobot (Hugging Face) provides teleoperation infrastructure and a dataset format
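Held-out evaluation only means something if whole demonstrations, not individual frames, are kept out of training. A minimal sketch of such a split follows. The `Episode` record and `split_episodes` helper are hypothetical and are not part of LeRobot or any other library; the 10% held-out fraction is likewise an assumption.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """Hypothetical record for one teleoperated demonstration."""
    episode_id: int
    instruction: str
    num_frames: int


def split_episodes(episodes, held_out_fraction=0.1, seed=0):
    """Shuffle deterministically and split by episode, never by frame."""
    shuffled = list(episodes)
    random.Random(seed).shuffle(shuffled)
    n_held_out = max(1, round(len(shuffled) * held_out_fraction))
    return shuffled[n_held_out:], shuffled[:n_held_out]


# 200 demonstrations sits inside the 100-500 range suggested above.
episodes = [Episode(i, "place the part in the tray", 300) for i in range(200)]
train, held_out = split_episodes(episodes)

print(len(train), len(held_out))  # 180 20
```

Fixing the seed keeps the held-out set stable across fine-tuning runs, so success rates stay comparable.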

🔧 Fine-Tuning Stack

OpenVLA fine-tuning: standard LoRA adapters on the model's linear layers (OpenVLA has no separate action head; the language model backbone emits actions as tokens)
Hardware: single A100 80GB for OpenVLA fine-tuning, 8–16 hours per task
Octo: simpler architecture, faster fine-tuning — good starting point for new teams
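The reason LoRA fine-tuning fits on a single GPU is parameter arithmetic: instead of updating a full d_out × d_in weight matrix, it trains two thin matrices of rank r, which adds r × (d_in + d_out) trainable parameters. The sketch below works through one layer. The 4096 × 4096 layer size and rank 32 are illustrative assumptions, not a statement of any model's exact configuration.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters in a dense weight matrix of shape (d_out, d_in)."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds: B is (d_out, rank), A is (rank, d_in)."""
    return rank * (d_in + d_out)


# Illustrative layer: a 4096 x 4096 projection adapted at rank 32.
D_IN, D_OUT, RANK = 4096, 4096, 32

frozen = full_params(D_IN, D_OUT)
trainable = lora_params(D_IN, D_OUT, RANK)
share = 100.0 * trainable / frozen

print(frozen, trainable, f"{share:.2f}%")  # 16777216 262144 1.56%
```

Under these assumptions only about 1.6% of the layer's parameters receive gradients, which is where the memory saving comes from.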

🏭 Enterprise Deployment

Deploy on NVIDIA Jetson AGX Orin or A100 workstation depending on robot form factor
Quantise to INT4 for edge deployment when memory or latency is tight and the task can tolerate a quality trade-off of roughly 10%
Wrap in ROS 2 node for integration with existing planning and safety infrastructure
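To make the INT4 trade-off concrete, here is a minimal sketch of symmetric per-tensor 4-bit quantisation. Production toolchains typically use per-group scales and calibrated schemes, so treat this as an illustration of where the rounding error comes from rather than a deployment recipe. The weight values are made up for the example.

```python
def quantize_int4(weights):
    """Symmetric quantisation to signed 4-bit integers in [-8, 7]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quantized = [max(-8, min(7, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the 4-bit codes."""
    return [q * scale for q in quantized]


weights = [0.70, -0.30, 0.10, 0.0, -0.70]  # made-up example values
codes, scale = quantize_int4(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)  # [7, -3, 1, 0, -7]
```

Each weight can be off by up to half a quantisation step (`scale / 2`), so tensors with a few large outliers lose the most precision. That is one reason to validate task success after quantising rather than assume a fixed quality cost.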

✅ Best Use Cases 2026

Bin picking with novel objects — foundation models generalise to new SKUs
Natural language robot instruction in collaborative robot cells
Multi-step assembly tasks requiring semantic understanding of components

Implementing Robotics Foundation Models?

Our machine learning development and software development teams design and deploy robotics foundation model programmes — from teleoperation data collection through fine-tuning pipelines to ROS 2 production integration. Book a free advisory session to scope your foundation model robotics programme.

SCALE D2C Editorial Team

Physical AI and Robotics Research · March 2026

Frequently Asked Questions

End-to-end Physical AI and Robotics strategy, implementation, and optimisation for enterprise and D2C brands. Contact us for a free consultation.

Strategy projects: 4–8 weeks. Full implementation: 3–12 months. ROI typically within 12–18 months.

Yes — D2C brands to enterprise. View our pricing.
