June 23, 2026 · Manukriya · Robot Learning

AI has read the entire internet. It still doesn't know what a door weighs.

A plain-English guide to embodied AI: the kind that acts in the physical world instead of just talking about it. Why the internet can't teach it, why the bottleneck is neither the brain nor the body but lived experience, and why that experience has to come from us.

A large language model has read more than any human could in a thousand lifetimes. It can explain the physics of friction, walk you through a recipe for hollandaise, and describe exactly how a hinge lets a door swing. Hand it an actual door, though, and it has no idea how hard to push. It has never felt one.

That gap is the whole story of embodied AI. There are two kinds of artificial intelligence now. One lives in a world made entirely of text and pixels, and it has become breathtakingly good. The other has to operate in ours, where things have weight and friction and consequences, and by comparison it is still a toddler. This is a guide to that second kind: what it is, why it turned out to be so much harder than the first, and the unglamorous reason the whole field is now stuck at the same wall.

What embodied AI actually is

Embodied AI is artificial intelligence that senses and acts in the physical world through a body: a robot arm, a humanoid, a self-driving car, a drone. The body is the entire point. Contrast it with what most people mean by AI today. A chatbot or an image generator is digital AI. It maps an input to an output inside software. The output is a paragraph, a picture, a prediction. Nothing in the room has to move, and if the answer is wrong, nothing breaks.

Every embodied system runs the same loop, many times a second: sense the world, decide what to do, act, then sense how the world changed because it acted. Roboticists call it the perceive, plan, act loop. The trap is hidden in that last step. A digital model's output never alters its own input. An embodied agent changes the very world it then has to read, with every move it makes. There is no undo button, and the clock never stops.
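
For readers who think in code, the loop above can be sketched in a few lines of Python. This is a toy under stated assumptions, not how any real robot is controlled: the ToyDoor world, its stiffness value, and the simple proportional decide rule are all made up for illustration. The point it shows is the last step, where the act changes what the next reading will be.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyDoor:
    """Hypothetical 1-D world: a door that swings open as it is pushed."""
    angle_deg: float = 0.0
    stiffness: float = 2.0  # degrees of swing per unit of push (made-up value)

    def sense(self) -> float:
        return self.angle_deg

    def apply(self, push: float) -> None:
        # Acting changes the world, so the next reading will differ.
        self.angle_deg += self.stiffness * push


def decide(angle_deg: float, target_deg: float, gain: float = 0.25) -> float:
    """Illustrative rule: push harder the farther the door is from the target."""
    return gain * (target_deg - angle_deg)


def run_loop(world: ToyDoor, target_deg: float, steps: int) -> List[float]:
    readings = []
    for _ in range(steps):
        angle = world.sense()              # perceive
        push = decide(angle, target_deg)   # plan
        world.apply(push)                  # act
        readings.append(world.sense())     # sense how the world changed
    return readings
```

Calling run_loop(ToyDoor(), target_deg=90.0, steps=4) returns readings of 45.0, 67.5, 78.75, and 84.375 degrees. Each decision depends on a world the previous decision already altered, which is exactly what a chatbot never has to deal with.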

The idea that intelligence needs a body is older than the current excitement. In 1990 the MIT roboticist Rodney Brooks published a paper with the marvelous title "Elephants Don't Play Chess." His argument was that the AI of the era had it backwards. Intelligence, he said, does not come from shuffling abstract symbols inside a sealed box. It comes from a body coping with a messy, unpredictable world. His most quoted line: it is better to use the world as its own model. More than thirty-five years later, that sentence is the founding bet of an entire industry.

Why the internet can't teach it

Here is the single most important fact about embodied AI, and the one most explainers skip. Digital AI got brilliant by reading the internet: roughly thirty years of human writing, code, and images that were already there for the taking. It is the largest pile of training material ever assembled, and a language model only had to swallow it. Embodied AI has no equivalent pile, because the physical world was never written down.

Think about what a robot actually needs to know, and then ask where it would look it up. How firmly to hold a paper cup so it neither slips nor crumples. How a folded towel sags when you lift it by one corner. What your arm does in the tenth of a second before a full mug starts to tip. None of that is online. It was never recorded, because until very recently nobody had a reason to record it. A digital model can train on the whole web. A robot needs data that someone, somewhere, physically performed.

Robots can't train on the internet. They train on us. That one line separates embodied AI from everything that came before it, and it is the reason this field is so much harder, and so much more interesting, than building a better chatbot.

We took one corner of this problem apart in a companion piece on why robot manipulation is the hardest unsolved problem in AI. Manipulation is the hands. Embodiment is the bigger frame around it: the whole problem of being an agent with a body in a world that pushes back, refuses to hold still, and never comes with labels.

A humanoid robotic hand reaching toward an ordinary door handle, with faint platinum force lines radiating from the handle.
An ordinary door tells a robot nothing about its own weight. That knowledge was never written down.

Why everyone is suddenly building robots

If the embodiment idea dates to 1990, why is half of Silicon Valley pouring into humanoids right now? Because three things that were each hard on their own all happened within about two years of one another.

  • The software learned to act. In 2023 Google DeepMind showed RT-2, an early vision-language-action model, or VLA: a single model that takes in what a camera sees plus a plain-English instruction and outputs robot movements directly. In 2024 the startup Physical Intelligence followed with π0, its foundation model, trained across many different robot types. For the first time, the same kind of model that powers a chatbot could also drive a pair of hands.
  • The hardware got cheap. A capable research humanoid used to cost about as much as a house. Unitree's G1, which arrived at around $16,000, dropped that to roughly the price of a used car. Thousands of labs and companies could suddenly afford a body to experiment on.
  • The money showed up. In September 2025 the humanoid company Figure raised over a billion dollars at a $39 billion valuation, up roughly fifteen times in eighteen months. NVIDIA's Jensen Huang put it plainly, declaring that physical AI had arrived and that every industrial company would eventually become a robotics company. The field filled out fast: Figure, Tesla's Optimus, the 1X NEO, Agility's Digit, and a reborn, fully electric Atlas from Boston Dynamics.

It is a real moment, and the demos are genuinely impressive. But notice what all three breakthroughs have in common. They are about the brain and the body: better models, better hardware, more capital. Almost none of it touches the one thing that actually decides whether a robot works in your kitchen.

A line of humanoid robot silhouettes of varying heights rendered in platinum silver against a near-black void.
The humanoid race filled out fast, but better bodies and bigger models do not solve the data problem.

The bottleneck you can't buy

Put the most advanced humanoid on the market in a real kitchen it has never seen, and it will still fumble chores a four-year-old finds boring. Not because its motors are too weak or its model is too small, but because it has never built up the experience. And experience, for something with a body, cannot be downloaded. It has to be lived, or borrowed from someone who lived it.

You can see the scale of the gap in the numbers. The largest open robot dataset ever assembled, Open X-Embodiment, gathers about 1 million demonstrations from 22 different robot types across 21 institutions. It took a once-in-a-field collaboration to build, and it is still a rounding error next to the trillions of words a language model trains on. Every one of those robot demonstrations had to be physically performed. There is no shortcut, and there is no web to scrape.

So the field has quietly converged on the only source of physical experience that already exists at human scale: people. There are billions of us, doing dexterous things with our hands all day long. In the words of one robotics lead, we are the most scalable embodiment on the planet.

What the evidence says about learning from us

This is not a hunch anymore. Over the last two years, several results have made the case hard to wave away, and all of them point at first-person, or egocentric, human video: footage shot from a camera on the person's head or in their hand, looking out at their own hands as they work.

  • It scales like a law. One study trained on 20,854 hours of first-person human video and watched a robot's dexterity climb 54%, along an almost ruler-straight line. More human footage in, predictably better hands out. That is the same shape of curve that built today's language models.
  • It beats the expensive option. Compared head to head against the same amount of painstaking robot teleoperation data, human video won: about 52% higher success on familiar tasks and 90% higher on tasks the robot had never seen.
  • It can need no robot at all. One project taught a real robot seven tasks at 70% success using nothing but footage from a pair of smart glasses, with no robot involved in collecting the data.

There is a catch, and it matters: raw video is not training data. A phone clip of someone cooking is nearly useless to a robot on its own. It has to be turned into something precise and grounded: where things sit in real 3D space, exactly what the hands did, and how much force they used, which a camera alone can never see. That conversion is the hard, unglamorous craft underneath all of this, and it is a story in itself.

A first-person view through smart glasses of human hands manipulating an object on a table, with faint 3D grounding lines and depth points overlaid.
Turning raw first-person video into training data: where things sit in 3D, what the hands did, and the forces a camera cannot see.
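
To make "precise and grounded" a little more concrete, here is a minimal sketch of what one instant of converted data might look like. The schema is an assumption for illustration only: GroundedFrame, its field names, and the two helper functions are hypothetical and not taken from any real dataset or pipeline. What it shows is the gap the paragraph above describes: positions can in principle be recovered from video, while force stays empty unless something else measured it.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]  # x, y, z in metres


@dataclass
class GroundedFrame:
    """One instant of a demonstration (hypothetical schema, for illustration)."""
    timestamp_s: float
    object_position_m: Vec3          # where the object sits in 3D space
    hand_position_m: Vec3            # what the hand did
    grip_force_n: Optional[float]    # newtons; None when only video was available


def frames_missing_force(frames: List[GroundedFrame]) -> int:
    """Count frames where force was never measured."""
    return sum(1 for f in frames if f.grip_force_n is None)


def is_fully_grounded(frames: List[GroundedFrame]) -> bool:
    """True only if the clip is non-empty and every frame has a force reading."""
    return bool(frames) and frames_missing_force(frames) == 0
```

A clip built from a phone video alone would carry None in every grip_force_n field, so is_fully_grounded would reject it. Treating a missing measurement as explicitly missing, rather than as zero, is the design choice worth noticing.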

The choice we get this time

There is one more difference between the two kinds of AI, and it is the part the industry tends to mumble. The data that made digital AI smart was largely taken without asking. One audit of a giant training set found that roughly 60% of it came from websites whose own terms forbid scraping, and an independent review found that not a single major data-labeling platform cleared even a basic minimum-wage bar. "Ethically sourced" was usually a slide in a pitch deck, not something you could actually check.

Embodied AI is different in a hopeful way: its data does not exist yet. It is being created right now, task by task, person by person. That means there is a genuine choice this time. The human experience that teaches robots can be collected with consent, paid for fairly, and shipped with full provenance: a clear, auditable record of who did each task, under what terms, and how it was handled. For anyone about to put a robot in a real home or workplace, that is not a compliance footnote. It is the difference between a dataset you can stand behind and a liability you inherited.
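
As a rough illustration of what "auditable" could mean in practice, the record described above can be sketched as a small data structure. Everything here is an assumption: ProvenanceRecord, its field names, and the audit_gaps helper are hypothetical and do not describe any existing standard or product. The sketch only shows that each of the three questions (who, under what terms, how it was handled) becomes a field that can be checked rather than a claim on a slide.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical per-task record; every field name here is illustrative."""
    contributor_id: str              # who did the task (a pseudonymous ID)
    task: str                        # what was performed
    consent_terms: str               # under what terms it was collected
    handling_log: Tuple[str, ...]    # how the data was handled afterwards


def audit_gaps(record: ProvenanceRecord) -> List[str]:
    """Return the names of fields left empty, so gaps are visible to an auditor."""
    gaps = [
        name
        for name in ("contributor_id", "task", "consent_terms")
        if not getattr(record, name).strip()
    ]
    if not record.handling_log:
        gaps.append("handling_log")
    return gaps
```

A record with every field filled in returns an empty list from audit_gaps. One with blank consent terms and no handling history returns ["consent_terms", "handling_log"], which is the kind of gap that is easy to hide in prose and hard to hide in a record.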

The bottom line

Embodied AI is the kind that has to live in our world instead of describing it. It is harder than digital AI for one reason above all others: the internet could teach a model to talk, but nothing online can teach it what a door weighs. That knowledge lives only in the doing, and the doing lives only in us.

The companies that matter in the next decade of robots will not be the ones with the flashiest hardware or the biggest model. They will be the ones that built the better engine for human experience: diverse, grounded in real 3D, captured with the forces a camera misses, and collected in a way they would be proud to put their name on.