
June 23, 2026 · Manukriya · Robot Learning

Robots can write your essay. They still can't load your dishwasher.

An AI can write your essay but can't load your dishwasher. Why robot manipulation is the hardest problem left in AI, and why the fix isn't smarter robots, it's better data: the human kind, grounded in 3D and collected with consent.

An AI can draft a legal brief, debug your code, and pass the bar exam. Ask that same level of intelligence to pick a single sock off the floor and put it in a basket, and it falls apart.

That gap has a name: manipulation. It means a robot using its hands to touch and rearrange the physical world, and it is, by a wide margin, the hardest unsolved problem in physical AI. Everything a robot might one day do for us in a home, a warehouse, or a hospital runs through it. A robot that can see but can't reliably grasp is just a camera on wheels.

This is a guide to why that problem is so stubborn, the three ways people are trying to crack it, and the shift the whole field is quietly making: away from building cleverer robots and toward feeding them better data. We'll keep the jargon to a minimum and define it when it shows up.

What “manipulation” really means

The cleanest definition comes from roboticist Matt Mason: manipulation is an agent's control of its environment through selective contact. In plain terms, everything you do to move the world that isn't moving yourself. Walking is locomotion; the robot moves itself. Picking up a mug, threading a cable, prying open a laptop: that's manipulation; the robot moves everything else.

It feels trivial because you've been doing it since you were one year old. That's exactly the trap. There's a famous observation in AI called Moravec's paradox: the things that feel hard to us (chess, calculus) are easy to automate, and the things that feel effortless (seeing, walking, using our hands) are brutally hard. Moravec's own explanation, back in 1988, was evolution. Our sensorimotor skills carry roughly a billion years of evolutionary refinement; abstract thought, by his estimate, is perhaps less than a hundred thousand years old.

We are all prodigious olympians in perceptual and motor areas, so good that we make the difficult look easy. (Hans Moravec, 1988)

So when a robot fumbles a sock, it isn't being dumb. It's failing at the single thing evolution spent the longest perfecting.

When people in the field say “manipulation,” they're usually pointing at one of six flavors:

  • Pick-and-place: grab something and move it from A to B. (Most warehouse robots, today.)
  • In-hand / dexterous: reorient an object within the hand, without setting it down. (Spinning a pen into writing position.)
  • Tool use: hold one object to act on another. (A knife, a screwdriver, a spatula.)
  • Two-handed (bimanual): coordinate two hands for what one can't do alone. (Opening a jar; folding a sheet.)
  • Soft / deformable: handle things that change shape the instant you touch them. (Cloth, cables, food, tissue.)
  • Assembly / insertion: high-precision mating of parts. (Plugging in a connector; a peg in a hole.)

Difficulty broadly climbs as you go down that list, but the setting matters as much as the category. A factory arm doing the same insertion a million times in a fixed, known setup is manipulation. So is a robot folding a crumpled shirt in an unfamiliar kitchen. The second is harder by orders of magnitude, and the reason why is physics.

Why it's so much harder than it looks

Three things conspire.

1. The moment of contact breaks the math. As long as a robot is moving through empty air, its motion is smooth and predictable, and easy to compute. The instant a fingertip touches something, the physics turns into what researchers call a non-smooth, non-convex problem: the object can stick, slip, roll, tip, or stay put, and tiny changes in where and how hard you press send those outcomes flying in completely different directions. There often isn't one clean answer to solve for. Contact is precisely the part of the world that's hardest to predict, and manipulation is nothing but contact.

2. The robot is sensing in slow motion. Your hand corrects a slipping grip in about a tenth of a second, faster than you can consciously notice. And for anything delicate you don't rely on reaction alone: your brain predicts the slip and adjusts before it happens, because pure reaction would be too slow. A robot's cameras, by contrast, typically feed its controller an update only 5 to 50 times a second, while its motors can be commanded about a thousand times a second. So the robot has fast muscles wired to slow eyes. It's like trying to catch a dropped glass while watching the world through a laggy video call.

3. The real world has infinite variety. Train a robot to grip rigid plastic bottles and it will faceplant on a wet glass, a bag of rice, or a balled-up shirt. There are effectively endless objects, surfaces, and lighting conditions, and a skill learned in one lab routinely collapses when the table is a few centimeters higher. This points at something important: generalization is, fundamentally, a data problem. The more varied the experience a robot learns from, the better it copes with a world it hasn't seen. Which raises the real question: where does that experience come from?
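The "fast muscles, slow eyes" mismatch in point 2 is plain arithmetic. Here is a minimal sketch using only the rates quoted above (5 to 50 vision updates per second, about 1,000 motor updates per second). The function name is illustrative.

```python
def motor_ticks_per_vision_update(vision_hz: float, motor_hz: float = 1000.0) -> float:
    """How many motor control cycles pass between two consecutive vision updates."""
    return motor_hz / vision_hz


for vision_hz in (5, 50):
    ticks = motor_ticks_per_vision_update(vision_hz)
    gap_ms = 1000.0 / vision_hz  # time between two looks at the world
    print(f"{vision_hz} Hz vision: {ticks:.0f} motor ticks and {gap_ms:.0f} ms between looks")
```

At the slow end the robot goes 200 ms between looks, about twice the tenth of a second a human hand needs to correct a slip. Its motors run 200 cycles on stale visual information in that time.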

The instant of contact: where smooth, predictable motion turns into physics no one can fully compute.
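One way to see why contact "breaks the math" is the textbook Coulomb friction model. That model is an assumption here, not something this post specifies. An object pinched between two fingertips holds if friction can carry its weight and slips otherwise, so the outcome is a hard switch rather than a smooth curve. The mass, friction values, and function names below are illustrative.

```python
G = 9.81  # gravitational acceleration, m/s^2


def min_grip_force(mass_kg: float, mu: float) -> float:
    """Smallest per-finger normal force (N) that holds the object.

    Two fingers each supply at most mu * N of friction, so holding
    requires 2 * mu * N >= mass * g.
    """
    return mass_kg * G / (2.0 * mu)


def outcome(normal_force_n: float, mass_kg: float, mu: float) -> str:
    """Discontinuous result of a pinch grasp under Coulomb friction."""
    return "stick" if normal_force_n >= min_grip_force(mass_kg, mu) else "slip"


mass, mu = 0.3, 0.5  # a 300 g glass with a moderately slippery surface
threshold = min_grip_force(mass, mu)

# Pressing 1% too softly or 1% harder lands on opposite sides of the switch.
for force in (threshold * 0.99, threshold * 1.01):
    print(f"{force:.3f} N -> {outcome(force, mass, mu)}")

# The same "safe" force fails once the surface gets wetter (lower mu).
print(outcome(threshold * 1.01, mass, mu=0.2))
```

A 2% change in force flips the result. The threshold itself also moves with every surface and every bit of moisture, which is why there is rarely one clean answer to solve for.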

Three ways to teach a robot to use its hands

There are essentially three, and each buys you something at a real cost.

Teleoperation: a human puppeteers the robot. Someone drives the robot through the task with a controller or a twin set of arms while it records its own movements. The data is excellent: it's in the robot's exact body, so it can learn from it directly. The open-source ALOHA rig (under $20k) can teach a task at 80 to 90% success from about 50 demonstrations. The catch is that it doesn't scale. You need the physical robot for every single recording, plus a skilled human, in real time. The DROID dataset is the honest benchmark: gathering it took 50 people across 18 labs working for 12 months to capture 76,000 demonstrations. And most of that time isn't even demonstrating; it's resetting the scene and fixing hardware. You can't scrape teleoperation off the internet. You earn every minute of it.
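To put the DROID numbers in perspective, here is the back-of-the-envelope throughput they imply. It takes the figures above at face value and assumes all 50 people contributed for the full 12 months, which the figures alone don't establish.

```python
demos = 76_000
people = 50
months = 12

person_months = people * months            # total human effort, in person-months
per_person_month = demos / person_months   # demonstrations per person per month

print(f"{person_months} person-months of effort")
print(f"about {per_person_month:.0f} demonstrations per person-month")
```

That works out to roughly 127 demonstrations per person per month. Adding people and robots raises the total, but it doesn't raise the rate.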

Simulation: practice inside a video game of physics. Spin up a virtual world, run a million attempts overnight, get perfect labels for free, and let the robot fail without breaking anything. It's genuinely powerful for some skills. But simulators fake the one thing manipulation is about: contact. How a fingertip presses and slips, how a sponge squishes, how a lid catches: those moments are sudden and nearly impossible to compute faithfully, so the simulator gets them subtly wrong. A robot trained only in there learns to exploit the simulator's wrong physics, and that habit shatters on a real, slightly wet plate. One landmark result is telling: a hand trained purely in simulation that managed 50 successes in a row in the sim dropped to 13 on the real hardware. And even that depended on a workaround, randomizing the simulator's physics during training; with the randomization removed, real-world performance fell to nearly zero. As roboticist Rodney Brooks put it, the world is its own best model.
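A common technique for narrowing the sim-to-real gap is domain randomization: vary the simulator's physics on every episode so the learner can't overfit to one wrong value. The toy below is a sketch under heavy assumptions (a Coulomb friction pinch grasp, made-up friction values, a "policy" that is just one grip force). It is not the method behind the result quoted above.

```python
import random

G = 9.81  # m/s^2


def holds(force_n: float, mass_kg: float, mu: float) -> bool:
    """Two-finger pinch holds if friction 2 * mu * N can carry the weight."""
    return 2.0 * mu * force_n >= mass_kg * G


def learn_grip_force(sim_mus, mass_kg, candidates):
    """Pick the gentlest candidate force that holds in every simulated episode."""
    for force in sorted(candidates):
        if all(holds(force, mass_kg, mu) for mu in sim_mus):
            return force
    raise ValueError("no candidate force holds in all simulated episodes")


mass = 0.3                                     # 300 g object
candidates = [0.5 * k for k in range(1, 41)]   # 0.5 N to 20 N
real_mu = 0.4                                  # hypothetical true friction

# Simulator with a single, slightly wrong friction value.
fixed = learn_grip_force([0.8], mass, candidates)

# Simulator whose friction is re-sampled every episode.
rng = random.Random(0)
randomized_mus = [rng.uniform(0.3, 1.0) for _ in range(200)]
randomized = learn_grip_force(randomized_mus, mass, candidates)

print("fixed sim      ->", fixed, "N, holds on real object:", holds(fixed, mass, real_mu))
print("randomized sim ->", randomized, "N, holds on real object:", holds(randomized, mass, real_mu))
```

The fixed-physics learner finds the cheapest grip that works in its one simulated world and drops the real object. The randomized learner grips harder than most of its simulated worlds required, and that is what carries it across the gap. Randomization covers for wrong physics without making the simulated contact any more faithful.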

Learning from human video: watch people do it. Instead of puppeteering a robot or faking physics, just record humans doing real tasks, ideally from their own point of view, with a camera on their head or in their hand, and teach the robot from that. No robot needed during recording. This is the newest of the three and, increasingly, the one the field is betting on.

A way to hold all three in your head:

  • Teleoperation: perfect, robot-ready data, but it scales one robot-hour at a time and can't be scraped.
  • Simulation: unlimited scale and free labels, but it fakes contact, the part that matters most.
  • Human video: the most scalable source on Earth, but raw video isn't training data yet.

That last “yet” is the whole game.

Three ways to teach a robot to use its hands: teleoperation, simulation, and learning from human video.

The quiet plot twist: the bottleneck is data, not brains

For a long time the assumption was that robots needed a smarter algorithm. The frontier has stopped believing that. A peer-reviewed 2026 survey of the field put it bluntly: future advances will depend less on model architecture and more on data engines, and it called data infrastructure a first-class research problem.

Why? Because of an asymmetry that's easy to miss. Language models got brilliant by swallowing the internet, the trillions of words that were already sitting there. Robotics has no internet to swallow. Every training example has to be physically performed, one attempt at a time. The largest open robot dataset, Open X-Embodiment, is about 1 million demonstrations pooled from 22 robot types across 21 institutions, and that was a heroic, once-in-a-field collaboration. A million is a rounding error next to what language models train on. The robot brain isn't the constraint anymore. The robot's experience is.

So the real question becomes: what's the most scalable way to manufacture experience? And the answer the data keeps giving is us.

Why human data wins

Three findings from the last couple of years make the case hard to argue with.

  • It scales like a law of nature. One large study trained on 20,854 hours of first-person human video and saw a robot's dexterity climb 54% over a version trained without it, following an almost ruler-straight scaling curve. More human footage, predictably better hands. That's the kind of curve that built today's large language models.
  • It beats the expensive stuff head-to-head. When researchers compared human video against the same amount of painstaking teleoperation data, the human video won, with 52% higher success on familiar tasks and 90% higher on unfamiliar ones. The cheap, scalable source isn't just more available; it's better.
  • It can need zero robot data at all. One project trained a real robot to do seven tasks at 70% success using nothing but footage from a pair of smart glasses, with no robot involved in collecting the data.

But raw video is not training data. A phone clip of someone cooking is almost useless to a robot on its own. To learn from human hands, you have to turn that footage into something grounded and precise:

  • Where things are, in real 3D. The video has to be anchored in actual metric space: real centimeters, not pixels. A single camera on its own can't recover true scale, which is why serious capture pairs the camera with motion sensors to estimate a metric-scale camera pose for every frame.
  • What the hands are doing. The hand gets tracked as a skeleton of 21 points so the robot can read exactly how the fingers opened, closed, and made contact.
  • How hard they pressed. Here's a subtle one: vision alone cannot capture force. Many completely different grip forces look identical on camera. Gently holding a carrot and crushing it can look the same from the outside. That information is invisible to a video and has to be captured at the point of contact.
  • The recoveries, not just the wins. Most datasets quietly throw away the moments where a hand slipped and re-grabbed. Those are the most valuable frames there are. They're how a robot learns to get out of trouble, not just how to look good when everything goes right.

That's the difference between a video a robot can watch and data a robot can learn from.
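The claim that a single camera can't recover true scale falls straight out of the standard pinhole camera model. In the minimal sketch below, a scene and a copy of it that is four times larger and four times farther away land on exactly the same pixels. The focal length and the object coordinates are made-up values.

```python
def project(point_xyz, focal_px: float = 500.0):
    """Pinhole projection of a camera-frame 3D point (meters) to pixel offsets."""
    x, y, z = point_xyz
    return (focal_px * x / z, focal_px * y / z)


mug_near = (0.10, 0.05, 0.50)                  # a point on a mug, 0.5 m from the camera
scale = 4.0
mug_far = tuple(scale * c for c in mug_near)   # same scene, 4x bigger and 4x farther

print(project(mug_near))
print(project(mug_far))
```

Both calls give the same pixel coordinates, so the image alone can't say which scene the camera saw. An extra measurement in real units, such as acceleration from a motion sensor, is what breaks the tie.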

What turns raw human video into training data: hand pose, metric 3D grounding, and contact, captured from the wearer's own point of view.

The part of the data conversation nobody wants to have

There's one more thing, and it's the part most of the industry would rather not say out loud.

The data that made today's AI smart was largely taken without permission and without paying the people it came from. An audit of one massive training set found that roughly 60% of it came from websites whose own terms forbid scraping. And the workers who label and clean AI data? An independent fair-work assessment found that not a single major annotation platform cleared even a basic minimum-wage bar. “We sourced it ethically” has, for most of this industry, been a slide in a deck, not a fact you could check.

Physical AI is being built right now, and the data for it doesn't exist yet, which means that this time there's a choice. The human experience that teaches robots can be collected with consent, with fair pay, and with full provenance: a clear, auditable record of who performed each task, under what terms, and how it was handled. That's not a compliance footnote. For anyone deploying a robot into a real home or workplace, it's the difference between a dataset you can defend and a liability you inherited.
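As a rough illustration of what "full provenance" could look like in practice, here is a hypothetical record attached to each recorded clip. Every field name is an assumption made for this sketch, not an established schema. The fingerprint is a hash over the whole record, so any later change to the terms or the handling history shows up as a mismatch.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class ProvenanceRecord:
    clip_sha256: str      # hash of the recorded clip, ties the record to the data
    contributor_id: str   # pseudonymous ID of the person who performed the task
    task: str             # what was performed
    consent_terms: str    # identifier of the consent agreement they signed
    compensation: str     # how the contributor was paid
    handling: tuple       # ordered processing steps applied to the clip

    def fingerprint(self) -> str:
        """Stable SHA-256 over all fields, usable as an audit checksum."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


record = ProvenanceRecord(
    clip_sha256=hashlib.sha256(b"placeholder clip bytes").hexdigest(),
    contributor_id="contributor-0001",
    task="load dishwasher",
    consent_terms="consent-v1",
    compensation="hourly",
    handling=("faces blurred", "hand pose extracted"),
)

tampered = replace(record, consent_terms="consent-v2")
print(record.fingerprint() == tampered.fingerprint())  # False: the change is detectable
```

A plain hash only detects edits. A real system would also need signatures and access control before the record could be called auditable.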

The bottom line

Manipulation is hard because the instant a robot touches the world, the physics turns unpredictable and the robot is, in a sense, watching in slow motion, fumbling at the one skill evolution spent a billion years perfecting. We spent years assuming the fix was a smarter robot. The evidence now says otherwise.

You can teach a robot by puppeteering it, one expensive robot-hour at a time. You can teach it in a simulator that fakes the very contact that matters. Or you can teach it from the most scalable source of physical experience on the planet, human hands, if you do the hard part: ground that experience in real 3D, capture the forces a camera can't see, keep the recoveries, and collect it all in a way you'd be proud to put your name on.

The companies that win the next decade of robotics won't be the ones with the biggest robot fleet. They'll be the ones who built the better data engine: diverse, recovery-rich, consented, and fairly paid.