A robot can watch you all day and learn nothing. Here's what it's missing.

Training data is what actually decides whether a robot works, and it is unlike any other kind. A guide to the real trade behind robot learning: faking the world in simulation, recording it for real, and the unglamorous layers that turn a video into a lesson a robot can use.

Point a camera at yourself and unlock your front door. Play the clip back. To you it is obviously a person turning a key. To a robot it is a flat rectangle of colored dots that shift over time. The robot cannot tell where your hand is in space, how hard you pushed, or even that the key and your fingers are two different things. That gap, between a video of a task and something a robot can actually learn from, is the least glamorous and most important problem in robotics. This is a guide to what lives inside it.

Almost every conversation about robots fixates on the body and the brain: the hands, the motors, the model. The thing that actually decides whether a robot works is duller and harder to talk about. It is the training data. And robot training data is unlike any other kind, because the one shortcut that built modern AI is closed off.

Robots don't get an internet

A language model learned to write by reading the internet, a thirty-year pile of text that was already there. Robots have no such pile. As we have argued before, robots can't train on the internet; they train on us. Every example of a robot doing a physical task has to be physically performed and recorded by someone. The largest open robot dataset ever assembled, Open X-Embodiment, gathers about 1 million demonstrations from 22 robot types across 21 institutions, and it took a once-in-a-field collaboration to build. Next to the trillions of words a chatbot trains on, that is a rounding error.

So the entire field runs on one question: what is the cheapest, most scalable way to manufacture physical experience that a robot can actually learn from? There are really two places to get it. You can fake the world, or you can record the real one.

Option one: fake the world

Simulation is a video game of physics. You build a virtual room, drop in a virtual robot, and let it attempt a task a million times overnight. It never gets tired, never breaks, and every attempt comes with perfect labels for free, because the simulator knows exactly where everything is. For some skills this is genuinely the right tool, and the whole industry uses it.

But simulators fake the one thing manipulation depends on: contact. How a fingertip slips, how a soft object squishes, how a latch catches. Those moments are sudden and almost impossible to compute faithfully, so the simulator gets them subtly wrong, and a robot trained inside it learns to lean on physics that does not exist outside. The numbers are sobering. In one landmark result, a hand trained entirely in simulation that managed 50 successes in a row inside the sim dropped to 13 on the real hardware, a story we told in full in "Why manipulation is the hardest problem in AI." This is the famous sim-to-real gap, and at bottom it is a data problem: the simulation was missing the real-world detail it needed to be right.

A split image: a wireframe simulated robot hand gripping a perfect virtual object beside the same hand slipping in a real, textured setting, with a visible gap between them. — The sim-to-real gap: physics that work perfectly in a simulator fall apart on real hardware, because the simulator faked the contact.

Option two: record the real world

Recording reality avoids the fakery, but raw reality is expensive, and not all of it is equally useful. There are three ways people capture it, and they trade off sharply.

Teleoperation. A human drives the robot through the task with controllers while it records its own body. The data is gold, since it lives in the robot's exact joints, but it scales one robot-hour at a time. The honest benchmark is the DROID dataset, which took 50 people across 18 labs working 12 months to gather 76,000 demonstrations, and most of that time went to resetting the scene and fixing hardware, not demonstrating.
Third-person video. A camera on a tripod in the corner. Cheap and plentiful, but it watches from the outside, so it constantly loses the hands behind objects and never sees the task the way the person doing it sees it.
Egocentric, or first-person, human video. A camera on the doer's head or in their hand, looking out at their own hands as they work. No robot is needed to record it, and it captures the task from exactly the viewpoint the robot will later have to act from.

That last one is the breakout. When researchers trained on 20,854 hours of first-person human video, robot dexterity climbed 54% along an almost straight line, the same kind of scaling curve that built the large language models. Head to head against the same amount of teleoperation data, human video won, by about 52% on familiar tasks and 90% on unfamiliar ones. One project taught a real robot seven tasks at 70% success using nothing but footage from a pair of smart glasses, with no robot in the loop at all. We unpacked why human data beats more robots in a separate piece. The short version: people are the most scalable source of physical experience on Earth.

A person in head-mounted camera glasses performing a manual task, with a faint platinum viewing cone projecting from the glasses toward their working hands. — Egocentric capture: recording a task from the doer's own viewpoint, the exact angle the robot will have to act from.

Why raw footage still isn't data

Here is the catch that ties the whole field together. A phone clip of someone cooking is nearly worthless to a robot on its own, for the same reason your door-unlocking video was. Video records what a task looked like. A robot needs to know what actually happened: where every object sat in real space, what the hands did down to the finger, and how much force flowed through them. Turning footage into a lesson means layering several streams of information onto the same instant, all precisely aligned. Miss one, and the lesson has a hole in it.

Think of it as everything that has to be captured during a single second of a single task.

The view from the doer's eyes. First-person video. It matters that the camera sits where the head or hand is, not in the corner, because the robot has to learn the task from the viewpoint it will actually have.
Where everything is, in real metric 3D. This is the layer people underestimate. A single camera fundamentally cannot recover true scale: a small object up close and a large one far away can fill the very same pixels. So serious capture pairs the camera with motion sensors and adds depth sensing and self-localization (simultaneous localization and mapping, or SLAM, the same trick a robot vacuum uses to map a room) to pin every object to a position in real centimeters rather than pixels.
Exactly what the hands did. The hand is tracked as a skeleton, typically 21 points per hand, so the robot can read precisely how the fingers opened, closed, and made contact, frame by frame.
How hard they pressed. The subtle, non-negotiable one. Vision cannot see force. Cradling a ripe strawberry and crushing it can look identical on camera. That information has to be captured right at the point of contact, with touch or force sensors, or it is simply gone.
What the task was, in words. A plain-language label, such as open the drawer or pour the water, so the robot can connect an instruction to the motion. This is what later lets you ask it to do something new in words.
The moments it went wrong. Most datasets quietly delete the frames where a hand slipped and re-gripped. Those are the most valuable frames in the whole recording, because they are how a robot learns to recover, not just how to look smooth when nothing goes wrong.
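
The scale problem in the 3D layer is easy to check for yourself. Here is a minimal sketch, assuming an ideal pinhole camera looking straight at a flat object. The 600-pixel focal length, the object sizes, and the function names are illustrative assumptions, not values from any real capture rig.

```python
def projected_width_px(object_width_m: float, distance_m: float,
                       focal_px: float = 600.0) -> float:
    """Apparent width in pixels under an ideal pinhole model (assumed)."""
    return focal_px * object_width_m / distance_m


def width_from_depth_m(width_px: float, depth_m: float,
                       focal_px: float = 600.0) -> float:
    """Invert the projection once a depth measurement is available."""
    return width_px * depth_m / focal_px


# Illustrative numbers: a 5 cm object at 0.5 m and a 50 cm object at 5 m.
small_near = projected_width_px(0.05, 0.5)
large_far = projected_width_px(0.50, 5.0)

# Both objects cover the same number of pixels, so pixels alone
# cannot say which one the camera is looking at.
print("small and near:", small_near, "px")
print("large and far: ", large_far, "px")

# With a depth reading attached, the two cases separate again.
print("recovered width (near):", width_from_depth_m(small_near, 0.5), "m")
print("recovered width (far): ", width_from_depth_m(large_far, 5.0), "m")
```

The only thing that tells the two objects apart is the depth value, which is why the camera alone is never enough for this layer.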

One moment of a hand grasping an object exploded into stacked aligned layers: camera view, depth map, hand skeleton, fingertip force halo, and a 3D point cloud. — What turns footage into a lesson: several streams layered onto the same instant, aligned to the millisecond. Miss one and the lesson has a hole in it.

The part you can't fake or buy

Notice what all of this means. Useful robot data is not simply collected. It is constructed, layer by layer, and every layer has to line up to the millisecond. That is slow, exacting work, and it is why having lots of video is not the same as having training data. The hard part was never pointing a camera. It is the grounding, the force, the alignment, and the quality control stacked on top of the footage.
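
To make "line up to the millisecond" concrete, here is a rough sketch of one way alignment could work: for each video frame, take the nearest sample from every other stream, and flag the frame as incomplete if any stream has nothing within tolerance. The `Stream` class, the stream names, the timestamps, and the 1 ms tolerance are all hypothetical choices for illustration, not a description of any particular pipeline.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Sequence


@dataclass
class Stream:
    """One sensor stream (hypothetical): sorted timestamps plus samples."""
    name: str
    timestamps_ms: Sequence[float]  # must be sorted ascending
    samples: Sequence[Any]


def nearest_sample(stream: Stream, t_ms: float,
                   tolerance_ms: float) -> Optional[Any]:
    """Return the sample closest to t_ms, or None if none is within tolerance."""
    ts = stream.timestamps_ms
    if not ts:
        return None
    i = bisect_left(ts, t_ms)
    # The nearest timestamp is either just before or just after position i.
    candidates = [j for j in (i - 1, i) if 0 <= j < len(ts)]
    best = min(candidates, key=lambda j: abs(ts[j] - t_ms))
    if abs(ts[best] - t_ms) > tolerance_ms:
        return None
    return stream.samples[best]


def align(video: Stream, others: List[Stream],
          tolerance_ms: float = 1.0) -> List[Dict[str, Any]]:
    """Build one record per video frame; a missing layer is stored as None."""
    records = []
    for t, frame in zip(video.timestamps_ms, video.samples):
        record: Dict[str, Any] = {"t_ms": t, video.name: frame}
        for s in others:
            record[s.name] = nearest_sample(s, t, tolerance_ms)
        record["complete"] = all(record[s.name] is not None for s in others)
        records.append(record)
    return records


# Made-up streams: three video frames, three hand poses, force at 1 kHz.
video = Stream("video", [0.0, 33.0, 66.0], ["frame0", "frame1", "frame2"])
hand = Stream("hand_pose", [0.2, 33.4, 70.0], ["pose0", "pose1", "pose2"])
force = Stream("force", [float(k) for k in range(100)],
               [f"force@{k}" for k in range(100)])

records = align(video, [hand, force])
for r in records:
    print(r["t_ms"], "complete" if r["complete"] else "has a hole")
```

In this toy data the third frame has no hand pose within a millisecond, so it is flagged rather than silently filled in. Whether to drop, interpolate, or re-record such frames is a design decision the sketch leaves open.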

The part the industry would rather skip

There is one more layer, an invisible one. The data that made today's AI smart was largely taken without permission. An audit of one giant dataset found that roughly 60% of it came from sites whose terms forbid scraping, and an independent review found that not a single major data-labeling platform cleared even a basic minimum-wage bar. Robotics has a rare second chance here, because its data does not exist yet. Every demonstration is being created right now, by a real person. That experience can be collected with consent, paid for fairly, and shipped with full provenance: a clear, auditable record of who did each task, under what terms, and how it was handled. For anyone putting a robot in a real home or workplace, that record is not paperwork. It is the line between a dataset you can defend and a liability you inherited.
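
What might such a record look like in practice? The sketch below is one possible shape, not a standard. Every field name, the `make_record` and `matches` helpers, and the example values are hypothetical. The one real building block is a SHA-256 hash of the recording, which ties the record to the exact file it describes.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical provenance entry for one recorded demonstration."""
    recording_sha256: str       # fingerprint of the exact recording bytes
    contributor_id: str         # pseudonymous ID, not a name
    consent_terms_version: str  # which terms the contributor agreed to
    compensated: bool           # whether the contributor was paid
    task_label: str             # plain-language task description
    handling_notes: str         # how the recording was processed


def make_record(recording_bytes: bytes, **fields) -> ProvenanceRecord:
    """Hash the recording and attach the human-supplied fields."""
    digest = hashlib.sha256(recording_bytes).hexdigest()
    return ProvenanceRecord(recording_sha256=digest, **fields)


def matches(record: ProvenanceRecord, recording_bytes: bytes) -> bool:
    """Check that a file is the one this record was issued for."""
    return hashlib.sha256(recording_bytes).hexdigest() == record.recording_sha256


# Illustrative values only; a real recording would be read from disk.
recording = b"stand-in bytes for a recorded demonstration"
record = make_record(
    recording,
    contributor_id="contributor-0001",
    consent_terms_version="terms-v1",
    compensated=True,
    task_label="open the drawer",
    handling_notes="faces blurred before storage",
)

print(json.dumps(asdict(record), indent=2))
print("file matches record:", matches(record, recording))
```

A hash alone does not prove consent or fair pay. It only makes the record auditable, because anyone holding the file can confirm which terms and which contributor it is attached to.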

The bottom line

Training data for robotics comes down to a single honest trade. You can fake the world in simulation and get unlimited, free, slightly wrong experience. You can record the real world through a robot and get perfect, embodied, painfully slow experience. Or you can learn from the most scalable source of real physical experience on the planet, human hands, as long as you are willing to do the hard part: ground every demonstration in real 3D, capture the forces a camera misses, keep the recoveries, label it, align it to the millisecond, and stand behind how you got it.

The next decade of robots will not be won by whoever has the most arms or the biggest model. It will be won by whoever builds the best engine for human experience: real, grounded, force-aware, recovery-rich, and collected with consent. That engine, not the robot, is the hard part. It is also the whole game.