June 23, 2026 · Manukriya · Robot Learning
AI learned to imagine the world. It still gets the physics wrong.
A world model is an AI that learns to imagine what happens next, so a robot can rehearse before it acts. A plain-English guide to what it is, why 2026 made everyone a believer again, why it has disappointed before, and the catch that decides whether this time is different.

Someone tosses you their keys from across the room. Before the keys even leave their hand, your brain has already sketched the whole arc: where they will be in half a second, how fast, exactly where your hand has to go to meet them. You are not reacting to the keys. You are predicting them, running a little simulation of the world a beat ahead of reality and then living inside that prediction. That predictive movie in your head is, more or less, what AI researchers are now racing to build for machines. They call it a world model.
It might be the most important idea in AI right now, and also one of the most oversold. This is a plain-English guide to what a world model actually is, why it could change robotics, why it has let people down before, and the catch that decides whether this time is any different.
What a world model actually is
A world model is an AI that learns to predict what happens next. Give it the current state of the world and an action you might take, and it hands you back the state you would end up in. Nudge this glass and it slides; push harder and it tips. In effect it is a learned physics engine: not a set of equations a human wrote down, but an intuition for how the world behaves, soaked up from watching the world behave.
Under the hood it does three things you can hold in one breath. It compresses the flood of raw pixels into a compact sense of what is going on, the state. It learns how that state changes when something acts on it, the dynamics. And it can run that forward, step after step, to picture a future that has not happened yet. None of this is new. Ha and Schmidhuber laid it out cleanly in a 2018 paper titled simply "World Models," and the intuition is decades older than that.
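For readers who think in code, here is a minimal sketch of those three pieces. It is a toy, not how any real system is built: the one-dimensional "camera image", the function names, and the hand-written rules are all assumptions made for illustration. In a real world model, `encode` and `dynamics` would be neural networks learned from data.

```python
from typing import List

WIDTH = 10  # hypothetical 1-D "camera image" that is 10 pixels wide


def render(position: int) -> List[int]:
    """Toy camera: a single bright pixel marks where the glass is."""
    return [1 if i == position else 0 for i in range(WIDTH)]


def encode(pixels: List[int]) -> int:
    """Piece 1, the state: compress raw pixels into the glass position."""
    return pixels.index(max(pixels))


def dynamics(state: int, action: int) -> int:
    """Piece 2, the dynamics: predict the position after a push of
    `action` pixels, clamped so the glass stays on the table."""
    return min(max(state + action, 0), WIDTH - 1)


def imagine(pixels: List[int], actions: List[int]) -> List[int]:
    """Piece 3, the rollout: step the dynamics forward to picture a future."""
    state = encode(pixels)
    trajectory = [state]
    for action in actions:
        state = dynamics(state, action)
        trajectory.append(state)
    return trajectory


future = imagine(render(3), [1, 1, -2, 4])
print(future)  # → [3, 4, 5, 3, 7]
```

Nothing in the rollout touches the real glass. The whole trajectory is computed from one observation and a list of pushes the robot is only considering.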
Why it would be a big deal
Today's robots are mostly reactive. They see, they act, they see what changed, they act again, with very little sense of what is coming. A world model promises something better: a robot that can imagine the consequences of a move before it commits, the way you rehearse a tight parking job in your head before you touch the wheel. Three things follow from that.
- It can plan instead of just react. A robot with a world model can mentally try several moves, see which imagined future works out best, and only then act in the real one.
- It learns far faster. Real-world practice is slow and costly: every attempt needs a real robot, real time, and real breakage. If a robot can practice inside its own imagined world, it can run thousands of attempts for almost nothing. The landmark example is DreamerV3, an AI that taught itself to mine diamonds in Minecraft from scratch, with no human demonstrations, largely by practicing inside its own learned model of the game. Approaches like this can be 10 to 100 times more sample-efficient than learning purely by doing, meaning they need that many fewer real attempts to reach the same skill.
- It edges toward cause and effect. Predicting what happens if you do something quietly forces a model toward a rough grasp of why, which is the beginning of the common sense that robots so badly lack.
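The first point, planning instead of reacting, can be sketched in a few lines. This is an illustrative toy under stated assumptions: `world_model` is a hand-written stand-in for a learned model, and `GOAL`, `ACTIONS`, and the cost function are hypothetical. The robot tries every short sequence of pushes in imagination, scores the imagined outcome, and picks the best one.

```python
from itertools import product

GOAL = 7                 # hypothetical target position for the glass
ACTIONS = (-1, 0, 1, 2)  # hypothetical pushes the robot can choose from


def world_model(state: int, action: int) -> int:
    """Stand-in for a learned model: predicts the next position (0 to 9)."""
    return min(max(state + action, 0), 9)


def imagined_cost(state: int, plan) -> float:
    """Roll the plan forward in imagination and score where it ends up.
    Lower is better: distance from the goal plus a small effort penalty."""
    for action in plan:
        state = world_model(state, action)
    effort = sum(abs(a) for a in plan)
    return abs(state - GOAL) + 0.1 * effort


def plan_by_imagination(state: int, horizon: int = 3):
    """Try every action sequence of length `horizon` and keep the cheapest."""
    candidates = product(ACTIONS, repeat=horizon)
    return min(candidates, key=lambda plan: imagined_cost(state, plan))


best_plan = plan_by_imagination(state=3)
print("imagined best plan:", best_plan)
print("action to take for real:", best_plan[0])
```

Only the first action of the winning plan is executed. After that the robot looks again and replans, which limits how far a flawed prediction can carry it.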

Why it has burned people before
Here is the part the hype skips. World models are not new, and they have a long record of underdelivering. As one working roboticist put it, world models have historically really underperformed expectations. The dream of an AI that simulates reality has been circling for decades, and it kept hitting the same wall: the predictions came out blurry and wrong.
The AI researcher Yann LeCun spent years trying to make a model predict the next frame of video pixel by pixel, and kept getting a smear. His conclusion was that the blur was not a bug to tune away; it was the model telling him something true. The future is not one fixed picture. Too many things could happen next, and a model forced to draw the single most likely image just averages them into mush. Predicting reality in full detail turned out to be far harder than predicting the next word in a sentence.
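The averaging effect is easy to reproduce with a toy example. The sketch below is an illustration, not LeCun's experiment: it assumes a five-pixel image, two equally likely futures (a ball ends up left or right), and a model scored by squared pixel error.

```python
# Two equally likely next frames, each a hypothetical 5-pixel image.
left = [0.0, 1.0, 0.0, 0.0, 0.0]   # the ball went left
right = [0.0, 0.0, 0.0, 1.0, 0.0]  # the ball went right
futures = [left, right]


def expected_squared_error(prediction):
    """Average squared pixel error of one prediction over all futures."""
    total = 0.0
    for future in futures:
        total += sum((p - f) ** 2 for p, f in zip(prediction, future))
    return total / len(futures)


# The per-pixel average of the futures: half a ball in two places at once.
blurry = [sum(pixels) / len(futures) for pixels in zip(*futures)]

print("blurry guess:", blurry)                               # → [0.0, 0.5, 0.0, 0.5, 0.0]
print("error of blurry guess:", expected_squared_error(blurry))  # → 0.5
print("error of sharp guess: ", expected_squared_error(left))    # → 1.0
```

The smear scores better than either sharp frame, so a model trained on this score is rewarded for producing it, even though the blurry frame is a future that can never actually occur.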
Why everyone suddenly believes again
So why is the whole field talking about world models in 2026? Because one thing changed: video generation got shockingly good. The same wave of models that can conjure a photorealistic clip from a single sentence turned out to be, in a sense, world models in disguise. A model that can generate a convincing video of what happens next has implicitly learned something about how the world moves.
The money and talent followed fast. NVIDIA released Cosmos, a family of world foundation models built specifically to generate training scenarios for robots and self-driving cars. Google DeepMind showed Genie 3, which turns a typed sentence into a playable, navigable 3D world that stays consistent for minutes at a time. And Fei-Fei Li, who helped kick off the modern deep-learning era, raised about a billion dollars for her company World Labs to chase what she calls spatial intelligence: AI that understands and generates 3D worlds rather than just words. Her framing is that language was the last frontier and physical space is the next one.

Where the dream still breaks
It is genuinely impressive, and it is genuinely not solved. Push these models past a few seconds and the cracks appear, and they are exactly the cracks that matter for a robot.
- They hallucinate physics. A generated world will happily show water curling upward, a stack of blocks that should topple but stands, or a hand that closes straight through a solid object. It looks plausible for a moment and is simply wrong.
- They forget. Objects drift, change shape, or quietly vanish once they leave the frame. Object permanence, which a human baby masters in months, is still shaky for these models.
- Small errors compound. Roll the prediction forward and tiny mistakes stack on tiny mistakes until the imagined world drifts away from any real one. A robot that plans inside a drifting dream is planning for a world that will never show up.
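The last point, compounding error, can be shown with a deliberately simple toy. The numbers are assumptions chosen for illustration: the real world multiplies a quantity by 1.05 each step, and the learned model is off by a hair, at 1.06.

```python
TRUE_GROWTH = 1.05   # hypothetical real dynamics: x grows 5% per step
MODEL_GROWTH = 1.06  # hypothetical learned dynamics: slightly wrong


def rollout(growth: float, start: float, steps: int) -> float:
    """Apply the same one-step rule over and over."""
    x = start
    for _ in range(steps):
        x = x * growth
    return x


def drift(steps: int) -> float:
    """Gap between the imagined world and the real one after `steps`."""
    return abs(rollout(MODEL_GROWTH, 1.0, steps) - rollout(TRUE_GROWTH, 1.0, steps))


for steps in (1, 10, 50):
    print(f"after {steps:2d} steps the dream is off by {drift(steps):.3f}")
```

One step in, the two worlds are almost identical. By fifty steps the gap is hundreds of times larger than the one-step error, because each prediction is built on the previous prediction's mistake.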

There is a reason these are the failures, and it points straight at the heart of the matter. Most of these models learned by watching enormous amounts of internet video. Video shows you what things look like. It almost never shows you what they feel like: how much force a grip took, how a soft object pushed back, the precise instant a surface caught. A model trained only on appearances learns the look of physics without the substance of it. So it produces dreams that pass the eye and fail the hand.
The catch that decides everything
This is the same lesson robotics keeps relearning, in a new costume. We have written before about the sim-to-real gap, where a robot trained in a hand-built simulator falls apart in reality because the simulator faked the physics of contact. A world model is just a simulator the AI built for itself instead of one an engineer wrote by hand. It inherits the same fatal weakness: it is only ever as accurate as the experience it learned from.
So the real question is not "how do we build a better world model?" It is "what do we feed it?" A model that has only ever seen the surface of things will keep dreaming surfaces. To dream a world a robot can act in, it has to learn from a world a robot can act in: real human interaction, grounded in actual 3D space, carrying the forces and contacts a camera alone can never capture, including the messy moments where a grip slipped and a hand recovered. That is the same reason robots train on us rather than on the internet. It is the raw material a trustworthy world model is made of.
The bottom line
A world model is AI learning to imagine the world so a robot can rehearse before it acts. It is a genuinely big idea, and the demos are dazzling. But a dream is only as truthful as the experience behind it, and most of today's world models have only ever watched the world from the outside.
A world model can only imagine a reality as accurate as the reality it was shown, and the reality robots need was never captured by a camera watching from across the room. It has to be lived, grounded, and felt. Build the world model on that, and it can finally dream in physics that hold up. Build it on surfaces, and it will keep waking robots into a world that was never real.