What Open X-Embodiment Is
Open X-Embodiment (OXE) is a unified dataset of robot manipulation demonstrations collected across 22 different robot embodiments, including arms from Franka Emika, Trossen Robotics (WidowX, ViperX), Universal Robots (UR5), KUKA, Google's own robot fleet, and many others. The dataset totals over one million episodes covering hundreds of distinct manipulation tasks: picking, placing, opening drawers and cabinets, pouring liquids, wiping surfaces, stacking objects, and more.
The project was a collaboration of over 30 research institutions, led by Google DeepMind with major contributions from Stanford, UC Berkeley, Carnegie Mellon, MIT, Columbia, ETH Zurich, and others. Each lab contributed their existing demonstration datasets, which were then standardized into the RLDS (Reinforcement Learning Datasets) format. The full dataset is hosted on Google Cloud Storage and is freely available for research use.
The "X" in the name stands for cross-embodiment. The defining ambition of OXE is not just to create a large dataset, but to demonstrate that training on data from many different robots produces better policies than training on data from a single robot, even for that single robot. This hypothesis turned out to be correct, and the evidence has reshaped how the field thinks about robot data.
Why It Matters: The RT-X Results
The landmark finding from the OXE paper (Padalkar et al., 2023) was the performance of RT-X models, specifically RT-1-X and RT-2-X, trained on the full multi-embodiment dataset.
RT-1-X (a smaller, efficient model) trained on OXE data outperformed the original single-robot specialist models by approximately 50% on average in evaluations across multiple partner platforms, with the largest gains on robots whose own datasets were small. This was the headline result: a single generalist model, trained on data from 22 different robots, could beat a specialist trained only on a given robot's own data, on that robot. The mechanism is that cross-embodiment data forces the model to learn embodiment-agnostic manipulation representations, effectively providing a strong prior for visual understanding and task concepts.
RT-2-X (built on a much larger vision-language model backbone) showed even stronger cross-embodiment transfer. Its most striking result was emergent skill transfer: evaluated on one robot, RT-2-X could perform skills that appeared only in other robots' portions of the training data, roughly tripling success rates on these evaluations relative to training on that robot's data alone, something that would be impossible for a model trained only on a single robot's data.
These results validated a core hypothesis: robot manipulation knowledge is partially embodiment-agnostic. A policy that has seen a Franka arm open a drawer and a WidowX arm pick up a cup has learned something about drawers and cups that transfers to a UR5, even though the UR5 has completely different kinematics.
Key Findings from the Paper
What transfers across embodiments: Visual scene understanding (recognizing objects, understanding spatial relationships) transferred most strongly. High-level task semantics (the concept of "pick up," "open," "place on") transferred well. Pre-grasp approach trajectories (moving toward an object before contact) transferred moderately.
What does not transfer well: Precise grasp configurations (exact finger positions relative to object surfaces) required embodiment-specific data. Contact dynamics (grip force modulation, insertion forces) did not transfer. Fine motor control (sub-centimeter precision movements) required per-embodiment fine-tuning.
Data distribution matters: The OXE dataset is not uniformly distributed across embodiments and tasks. Some labs contributed tens of thousands of episodes, others contributed hundreds. The task distribution is heavily skewed toward tabletop pick-and-place. Despite this imbalance, the cross-embodiment benefit was robust, though the largest benefits accrued to under-represented embodiments (which gained the most from cross-embodiment transfer) rather than to the dominant embodiments (which had enough data to train strong specialists).
Scale helps, but diversity helps more: Ablation studies varying the number of embodiments in the training set while holding total episode count constant showed that adding a new embodiment with fewer episodes consistently outperformed adding more episodes from an already-represented embodiment. This diversity-over-volume finding has become one of the most cited and most practically important results in robot learning.
How to Access and Use the Dataset
OXE is hosted on Google Cloud Storage and can be downloaded using the tensorflow_datasets (TFDS) API. The dataset uses the RLDS format, where each episode is a sequence of steps containing observation dictionaries (images, joint states, gripper state), action vectors, reward signals, and natural language task annotations.
Getting started:
- Install tensorflow_datasets: `pip install tensorflow-datasets`
- Browse available sub-datasets at the OXE GitHub repository or the TFDS catalog
- Load a specific sub-dataset: `tfds.load('fractal20220817_data')` (for the RT-1 Fractal dataset, one of the largest components)
- For PyTorch users: use LeRobot's conversion utilities to transform RLDS data into LeRobot Parquet format, or use the oxe_torch_dataloader for direct PyTorch loading
Practical usage patterns:
- Pre-training a foundation model: Download the full OXE dataset (or a diverse subset covering 10+ embodiments). Train your model on this data to learn general manipulation representations. Then fine-tune on your task-specific data. This consistently requires 5-10x fewer task-specific demonstrations than training from scratch.
- Augmenting a small dataset: If you have 100-200 demonstrations on your specific robot, add relevant OXE sub-datasets to your training mixture. Focus on sub-datasets from similar embodiments (same gripper type, similar arm geometry) and similar task categories.
- Evaluating cross-embodiment transfer: Use OXE's standard evaluation protocol and held-out task sets to benchmark your model's generalization capability against published baselines.
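As a rough sketch of the augmentation pattern above (the dataset names and the 70/15/15 split below are illustrative assumptions, not a recommendation from the OXE paper), a weighted mixture sampler might look like:

```python
import random

def make_mixture_sampler(datasets, weights, seed=0):
    """Sample episodes from a weighted mixture of datasets.
    `datasets` maps name -> list of episodes; `weights` maps name -> probability."""
    rng = random.Random(seed)
    names = list(datasets)
    w = [weights[n] for n in names]

    def sample():
        # Pick a dataset according to the mixture weights, then an episode from it.
        name = rng.choices(names, weights=w, k=1)[0]
        return name, rng.choice(datasets[name])

    return sample

# Illustrative mixture: 70% your own demos, 30% related OXE sub-datasets.
datasets = {
    "my_robot_demos": [f"my_ep_{i}" for i in range(200)],
    "bridge_v2": [f"bridge_ep_{i}" for i in range(1000)],
    "fractal": [f"fractal_ep_{i}" for i in range(1000)],
}
weights = {"my_robot_demos": 0.70, "bridge_v2": 0.15, "fractal": 0.15}

sample = make_mixture_sampler(datasets, weights)
counts = {name: 0 for name in datasets}
for _ in range(10_000):
    name, _episode = sample()
    counts[name] += 1
# counts["my_robot_demos"] lands near 7,000 of the 10,000 draws
```

Note that the mixture is over sampling probability, not raw episode counts: your 200 demonstrations are drawn far more often per episode than the larger OXE subsets, which is the point of the augmentation pattern.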
Limitations: What OXE Does Not Cover
OXE is transformative, but it has real limitations that teams should understand before relying on it.
Task diversity is skewed. The majority of episodes are tabletop pick-and-place, with smaller fractions covering drawer/cabinet opening, wiping, and pouring. Complex multi-step tasks, bimanual tasks, and mobile manipulation tasks are under-represented. If your deployment task is not well-covered by OXE's task distribution, the pre-training benefit will be limited.
Hardware is dated. Many contributing labs used hardware that was current in 2020-2023 but is now outdated: low-resolution cameras, older RealSense models, and arm configurations that differ from the ViperX/Franka setups most commonly used in 2026. The visual features from older cameras may not perfectly match the visual distribution of modern camera setups, reducing transfer efficiency.
Dexterity is limited. Almost all OXE data uses parallel-jaw grippers. Dexterous manipulation with multi-finger hands is essentially absent from the dataset. If your application involves dexterous hand manipulation, OXE provides limited direct benefit, though the visual understanding component still transfers.
Annotation quality varies. Language annotations range from careful, specific descriptions ("pick up the red cup from the left side of the table") to generic labels ("pick up object"). This inconsistency limits the effectiveness of language-conditioned training on the raw dataset without post-processing.
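If you do post-process annotations, a simple heuristic can flag generic labels before language-conditioned training. The generic-label list and word-count threshold below are assumptions for illustration, not part of OXE:

```python
# Hypothetical post-processing: flag generic language annotations.
GENERIC_LABELS = {"pick up object", "move object", "do task"}  # assumed list

def is_generic(annotation: str, min_words: int = 4) -> bool:
    """Treat an annotation as generic if it matches a known generic label
    or is too short to name a specific object and location."""
    text = annotation.strip().lower()
    return text in GENERIC_LABELS or len(text.split()) < min_words

annotations = [
    "pick up the red cup from the left side of the table",
    "pick up object",
    "open the top drawer of the cabinet",
]
specific = [a for a in annotations if not is_generic(a)]
# specific keeps the first and third annotations; "pick up object" is dropped
```

In practice teams either filter these episodes out or re-annotate them (manually or with a captioning model) before language-conditioned training.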
No force-torque data. The vast majority of OXE episodes contain only joint positions and camera images. Force-torque sensor data, which is critical for contact-rich tasks, is absent from most sub-datasets. This limits the usefulness of OXE for training policies that need to modulate grip force or handle compliant objects.
Loading OXE Data with LeRobot (Python)
```python
# Load an OXE sub-dataset via LeRobot's HuggingFace integration
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Load the Bridge V2 subset (WidowX data, 60K+ episodes)
dataset = LeRobotDataset("lerobot/bridge_orig")

# Inspect dataset structure
print(f"Number of episodes: {dataset.num_episodes}")
print(f"Keys per frame: {dataset[0].keys()}")
# Typical keys: observation.images.image_0, observation.state, action, ...

# Load a specific episode: look up its frame range, then index into it
from_idx = dataset.episode_data_index["from"][42].item()
to_idx = dataset.episode_data_index["to"][42].item()
for i in range(from_idx, to_idx):
    frame = dataset[i]
    obs_image = frame["observation.images.image_0"]  # image tensor
    state = frame["observation.state"]               # joint positions
    action = frame["action"]                         # action vector
    # Process as needed for your training pipeline

# For TFDS-native loading (alternative), point data_dir at the OXE bucket:
# import tensorflow_datasets as tfds
# ds = tfds.load('fractal20220817_data',
#                data_dir='gs://gresearch/robotics', split='train')
# for episode in ds.take(10):
#     for step in episode['steps']:
#         image = step['observation']['image']
#         action = step['action']
```
Fine-Tuning a Foundation Model on OXE + Your Data
The standard pipeline for using OXE data to improve your task-specific policy involves three stages:
Stage 1: Select relevant OXE sub-datasets. Not all OXE data is equally useful for your task. Select sub-datasets based on: similar robot type (same gripper type is more important than same arm kinematics), similar task category (pick-place data helps pick-place; it does not help assembly), and data quality (prefer sub-datasets with language annotations and consistent camera setups). For a WidowX-based pick-place project, Bridge V2 is the most relevant sub-dataset. For a Franka-based project, the TOTO and Berkeley Cable Routing datasets (both collected on Franka arms) are strong starting points; note that the Fractal data was collected on Google's own robots, not Frankas.
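One way to operationalize this selection step is a small relevance heuristic. The attribute names and weights below are assumptions for illustration, not part of OXE; the weighting follows the guidance that gripper match matters more than arm kinematics:

```python
# Hypothetical scoring heuristic for selecting OXE sub-datasets.
def relevance_score(subset, target):
    """Score an OXE sub-dataset's relevance to a target robot setup."""
    score = 0.0
    if subset["gripper"] == target["gripper"]:
        score += 3.0   # same gripper type matters most
    if subset["arm"] == target["arm"]:
        score += 1.0   # same arm kinematics helps, but less
    if subset["task_category"] == target["task_category"]:
        score += 2.0   # pick-place helps pick-place, not assembly
    if subset["has_language"]:
        score += 1.0   # prefer annotated, consistent data
    return score

target = {"gripper": "parallel_jaw", "arm": "widowx", "task_category": "pick_place"}
subsets = [
    {"name": "bridge_v2", "gripper": "parallel_jaw", "arm": "widowx",
     "task_category": "pick_place", "has_language": True},
    {"name": "some_assembly_set", "gripper": "parallel_jaw", "arm": "franka",
     "task_category": "assembly", "has_language": False},
]
ranked = sorted(subsets, key=lambda s: relevance_score(s, target), reverse=True)
# ranked[0]["name"] == "bridge_v2"
```

The exact weights matter less than the ordering they encode; tune them against a small transfer experiment on your own hardware.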
Stage 2: Pre-train or use pre-trained weights. If using Octo or OpenVLA, the pre-trained weights already incorporate OXE data: start from these weights and proceed to fine-tuning. If training a custom architecture, pre-train on your selected OXE sub-datasets for 100-200 epochs (typically 12-48 hours on 4x A100 GPUs, depending on data volume). Monitor validation loss on a held-out portion of your task-specific data to detect overfitting to the OXE distribution at the expense of your target task.
Stage 3: Fine-tune on task-specific data. Fine-tune the pre-trained model on your task-specific demonstrations using a lower learning rate (typically 10x lower than pre-training: 1e-5 for Octo, 5e-6 for OpenVLA). Use all of your task-specific data plus a 10-20% mixture of the most relevant OXE data to prevent catastrophic forgetting of the general manipulation knowledge. Fine-tuning typically requires 50-200 epochs (2-8 hours on a single A100). Evaluate on held-out task-specific test episodes every 10 epochs and select the best checkpoint.
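The Stage 3 loop can be sketched end to end with a toy linear policy standing in for Octo/OpenVLA. The shapes, learning rate, and the 15% OXE fraction here are illustrative stand-ins; real training uses your framework's model and dataloaders:

```python
# Toy Stage 3 loop: mixed sampling of task-specific and OXE batches,
# a reduced fine-tuning learning rate, and a loss trace to monitor.
import numpy as np

rng = np.random.default_rng(0)

true_W = rng.normal(scale=0.1, size=(32, 7))   # pretend expert obs->action mapping
W = rng.normal(scale=0.1, size=(32, 7))        # "pretrained" policy weights
lr = 1e-2                                       # reduced fine-tuning learning rate
oxe_fraction = 0.15                             # 10-20% OXE mix against forgetting

def sample_batch(source, n=8):
    # Stand-in for real dataloaders over task-specific vs. OXE episodes.
    obs = rng.normal(size=(n, 32))
    act = obs @ true_W + 0.01 * rng.normal(size=(n, 7))
    return obs, act

losses = []
for step in range(300):
    source = "oxe" if rng.random() < oxe_fraction else "task"
    obs, act = sample_batch(source)
    err = obs @ W - act
    losses.append(float(np.mean(err ** 2)))
    grad = obs.T @ err / len(obs)  # gradient of the squared error (up to a constant)
    W -= lr * grad
# The loss averaged over the last 50 steps falls well below the first 50.
```

The same skeleton applies at scale: swap the linear map for your policy network, `sample_batch` for your dataloaders, and add the periodic held-out evaluation and checkpoint selection described above.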
Benchmark Results: OXE-Trained Models vs. Specialists
| Model | Pre-Training Data | Fine-Tune Data | In-Dist. Success | Novel Object Success |
|---|---|---|---|---|
| ACT (from scratch) | None | 200 task demos | 82% | 28% |
| RT-1-X | Full OXE | 200 task demos | 88% | 52% |
| Octo (fine-tuned) | Full OXE | 200 task demos | 86% | 48% |
| OpenVLA (fine-tuned) | Full OXE | 200 task demos | 90% | 58% |
The pattern is consistent: OXE pre-training provides a modest in-distribution improvement (4-8 percentage points in the table above) and a large novel-object generalization improvement (20-30 percentage points). The in-distribution advantage narrows with more fine-tuning data, but the OOD advantage persists even with 500+ task-specific demonstrations.
How SVRC Data Complements OXE
OXE provides breadth: many embodiments, many tasks, many environments. What it lacks is depth in specific domains and consistency in data quality. SVRC data collection fills this gap by providing:
- consistent camera setups with calibrated intrinsics and extrinsics across all episodes,
- force-torque sensor data synchronized with visual and proprioceptive streams (absent from almost all OXE data),
- systematic object diversity within target task categories (30+ objects per category, not the 5-10 typical in OXE sub-datasets), and
- language annotations following a standardized protocol rather than the variable quality across OXE sub-datasets.
The recommended approach for teams with a specific deployment target: use OXE-pretrained foundation model weights for the visual backbone and general manipulation knowledge, then fine-tune with SVRC-collected task-specific data that provides the depth and quality needed for deployment-grade reliability.
How to Contribute Your Own Data
Contributing to OXE strengthens the community dataset and provides a mechanism for your data to be cited and used by the broader research community. The contribution process involves several steps.
- Format your data in RLDS. Each episode must contain observations (images and proprioception), actions, and language annotations in the RLDS schema. The rlds_creator library provides conversion utilities.
- Add per-step language annotations. Every step should have a natural language description of the current task. These annotations are used by language-conditioned models and are a requirement for inclusion.
- Document your dataset. Provide a dataset card with: robot type and configuration, camera specifications and placement, collection environment description, task descriptions, operator count and training, and episode count per task.
- Submit a pull request. The OXE GitHub repository accepts dataset contributions through pull requests. The review process checks format compliance, data quality (no corrupted episodes, no extreme outliers), and documentation completeness.
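For orientation, here is a minimal sketch of one episode under the common RLDS step schema. Exact observation keys and the location of the language string vary across OXE sub-datasets, and the `gripper_state` key name here is an assumption:

```python
# Minimal sketch of one episode in RLDS-style step structure.
import numpy as np

def make_step(image, joints, gripper, action, instruction,
              is_first=False, is_last=False):
    return {
        "observation": {
            "image": image,                        # H x W x 3 uint8
            "state": joints,                       # joint positions
            "gripper_state": np.float32(gripper),  # assumed key name
        },
        "action": action,                          # e.g. 7-DoF delta + gripper
        "language_instruction": instruction,
        "reward": np.float32(1.0 if is_last else 0.0),
        "discount": np.float32(1.0),
        "is_first": is_first,
        "is_last": is_last,
        "is_terminal": is_last,
    }

T = 3  # toy 3-step episode
episode = {
    "steps": [
        make_step(
            image=np.zeros((256, 256, 3), dtype=np.uint8),
            joints=np.zeros(7, dtype=np.float32),
            gripper=0.0,
            action=np.zeros(7, dtype=np.float32),
            instruction="pick up the red cup from the left side of the table",
            is_first=(t == 0),
            is_last=(t == T - 1),
        )
        for t in range(T)
    ]
}
```

The `is_first`/`is_last`/`is_terminal` flags are what RLDS consumers use to reconstruct episode boundaries, so they must be set consistently in any contributed data.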
If your demonstrations were collected through SVRC's data services, our platform can generate RLDS-compatible exports with standardized metadata, simplifying the contribution process. Contact the SVRC team for guidance on preparing your data for OXE submission.
What Comes Next: DROID, Bridge V2, and Beyond
OXE established the principle. The next generation of datasets is extending it in specific directions.
DROID (Khazatsky et al., 2024) focuses on environmental diversity: 76,000 demonstrations across 564 scenes and 86 tasks, collected at 13 institutions and specifically designed to test how environment diversity affects policy generalization. DROID is complementary to OXE: where OXE maximizes robot embodiment diversity, DROID maximizes scene and environment diversity.
Bridge V2 (Walke et al., 2023) provides a focused, high-quality dataset for WidowX-based manipulation. 60,000+ demonstrations across 24 environments with careful quality control. Bridge V2 is the go-to fine-tuning dataset for teams deploying on WidowX hardware because it provides the volume and environmental diversity needed for robust deployment, specifically for one embodiment.
Open-Anything datasets. The community is working toward OXE-style aggregation for domains currently under-represented: dexterous manipulation with multi-finger hands, bimanual tasks, mobile manipulation, and outdoor/field robotics. SVRC is actively contributing data from our bimanual and dexterous manipulation collection campaigns to these emerging aggregation efforts.
The broader trajectory is toward a robotics equivalent of the web-scale text corpora that enabled large language models. OXE was the proof of concept. The question now is whether the community can achieve the diversity and scale needed to train truly generalist robot foundation models, and how long that will take. SVRC's data collection infrastructure is designed to contribute to this effort while providing immediate practical value to teams building today's robot systems.
Related Reading
- Scaling Laws for Robot Learning: What We Know in 2026
- LeRobot Framework: Getting Started Guide
- The Generalization Challenge: Why Robot Policies Still Fail
- Robot Learning from Video: State of the Art in 2026
- Imitation Learning for Robots: From Demonstrations to Deployment
- SVRC Data Collection Services
- SVRC Datasets