This is very encouraging progress, and probably what Demis was teasing [1] last month. A few speculations on technical details based on staring at the released clips:
1. You can see fine textures "jump" every 4 frames - which means they're most likely using a 4x-temporal-downscaling VAE with at least 4-frame interaction latency (unless the VAE is also control-conditional). Unfortunately I didn't see any real-time footage to confirm the latency (at one point they intercut screen recordings with "fingers on keyboard" b-roll? hmm).
2. There's some 16x16 spatial blocking during fast motion which could mean 16x16 spatial downscaling in the VAE. Combined with 1, this would mean 24x1280x720/(4x16x16) = 21,600 tokens per second, or around 1.3 million tokens per minute (quick arithmetic sketch after the references).
3. The first frame of each clip looks a bit sharper and less videogamey than later stationary frames, which suggests this could be a combination of text-to-image + image-to-world system (where the t2i system is trained on general data but the i2w system is finetuned on game data with labeled controls). Noticeable in e.g. the dirt/textures in [2]. I still noticed some trend towards more contrast/saturation over time, but it's not as bad as in other autoregressive video models I've seen.
[1] https://x.com/demishassabis/status/1940248521111961988
[2] https://deepmind.google/api/blob/website/media/genie_environ...
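Quick sanity check on the arithmetic in point 2, assuming (not confirmed) the 4x temporal and 16x16 spatial downscaling guessed above:

    # Back-of-the-envelope token rate, assuming a VAE with 4x temporal and
    # 16x16 spatial downscaling on a 24 fps, 1280x720 stream (all guesses).
    fps, width, height = 24, 1280, 720
    t_down, s_down = 4, 16

    tokens_per_second = (fps * width * height) // (t_down * s_down * s_down)
    tokens_per_minute = tokens_per_second * 60
    print(tokens_per_second, tokens_per_minute)  # 21600 1296000 (~1.3M/min)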
Really impressive... but wow this is light on details.
While I don't fully align with the sentiment of other commenters that this is meaningless unless you can go hands-on... it is crazy to think how different this announcement is from a few years ago, when it would have been accompanied by an actual paper that shared the research.
Instead... we get this thing that has a few aspects of a paper - authors, demos, a bibtex citation(!) - but none of the actual research shared.
I was discussing with a friend that my biggest concern with AI right now is not that it isn't capable of doing things... but that we switched from research/academic mode to full value-extraction mode so fast that we are way out over our skis in terms of what is being promised. Over-promising in an exciting new field of academic research is pretty low-stakes, all things considered... it becomes terrifying when we bet policy and economics on it.
To be clear, I am not against commercialization, but the dissonance of a product announcement dressed up to look like research, at the same time that one of the preeminent mathematicians is writing about how our shift in the funding of real academic research is having real, serious impact, is... uh... not confidence inspiring for the long term.
I wish they would share more about how it works. Maybe a research paper for once? We didn't even get a technical report.
My best guess: it's a video generation model like the ones we already have, but with conditioned inputs (movement direction, view angle). Perhaps they aren't relative inputs but absolute, and there is a bit of state simulation going on? [Although some demo videos show physics interactions like bumping against objects - so that might be unlikely, or maybe it's 2D and the up axis is generated?]
It's clearly trained on a game engine, as I can see screen-space reflection artefacts being learned. They also train on photoscans/splats... some non-realistic elements look significantly lower fidelity too.
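For what it's worth, here's a minimal sketch of what I'm picturing: an ordinary latent video predictor with per-step control conditioning bolted on. All names, shapes, and layer choices are my own guesses, not anything from the announcement.

    # Hypothetical action-conditioned next-frame predictor (pure speculation).
    import torch.nn as nn

    class ActionConditionedPredictor(nn.Module):
        def __init__(self, latent_dim=256, action_dim=8, n_heads=8, n_layers=6):
            super().__init__()
            # controls (movement direction, view angle, ...) mapped into latent space
            self.action_embed = nn.Linear(action_dim, latent_dim)
            layer = nn.TransformerEncoderLayer(latent_dim, n_heads, batch_first=True)
            self.backbone = nn.TransformerEncoder(layer, n_layers)
            self.to_next_latent = nn.Linear(latent_dim, latent_dim)

        def forward(self, frame_latents, actions):
            # frame_latents: (batch, time, latent_dim) from a video VAE encoder
            # actions:       (batch, time, action_dim) per-step control inputs
            x = frame_latents + self.action_embed(actions)  # inject controls each step
            h = self.backbone(x)  # causal mask omitted for brevity
            return self.to_next_latent(h[:, -1])  # predicted next latent frame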
some inconsistencies I have noticed in the demo videos:
- wingsuit disocclusions are lower fidelity (maybe initialized by a high resolution image?)
- garden demo has different "geometry" for each variation; look at the 2nd hose only existing in one version (new "geometry" is made up when first looked at, not beforehand)
- school demo has half a car outside the window? and a suspiciously repeating pattern (infinite loop patterns are common in transformer models that lack parameters, so they can scale this even more! also might be greedy sampling for stability)
- museum scene has an odd reflection in the amethyst box: the rear mammoth doesn't have reflections on the rightmost side of the box before it's shown through the box. The tusk reflection just pops in. This isn't a Fresnel effect.
I'm still struggling to imagine a world where predicting the next pixel wins over building a deterministic thing that is then run.
E.g.: using AI to generate textures, wire models, and motion sequences which themselves sum up to something that a local graphics card can then render into a scene.
I'm very much not an expert in this space, but it seems to me that if you do that, then you can tweak the wire model, the texture, move the camera to wherever you want in the scene, etc.
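Roughly, what I mean is a pipeline like this toy sketch (every function here is a hypothetical stand-in, not a real API): the generative step produces assets once, and a deterministic renderer can then be re-run with edits.

    # Toy "generate assets, then render deterministically" pipeline (hypothetical).
    from dataclasses import dataclass

    @dataclass
    class SceneAsset:
        mesh: str       # e.g. path to a generated .glb
        texture: str    # e.g. path to a generated albedo map
        animation: str  # e.g. a generated motion clip

    def generate_assets(prompt: str) -> SceneAsset:
        # stand-in for "AI generates textures, models, motion sequences"
        return SceneAsset(f"{prompt}.glb", f"{prompt}_albedo.png", f"{prompt}_walk.anim")

    def render(asset: SceneAsset, camera) -> None:
        # stand-in for a conventional GPU renderer; deterministic given its inputs
        print(f"render {asset.mesh} + {asset.texture} from camera {camera}")

    tree = generate_assets("oak_tree")
    render(tree, camera=(0.0, 1.7, 5.0))
    tree.texture = "oak_tree_mossy.png"    # tweak one asset without regenerating anything
    render(tree, camera=(10.0, 1.7, 5.0))  # move the camera wherever you want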
> Text rendering. Clear and legible text is often only generated when provided in the input world description.
Reminds me of when image AIs weren't able to generate text. It wasn't too long until they fixed it.
This is revolutionary. I mean, we already could see this coming, but now it's here. With limitations, but this is the beginning.
In game engines it's the engineers, the software developers who make sure triangles are at the perfect location, mapping to the correct pixels, but this here, this is now like a drawing made by a computer, frame by frame, with no triangles computed.
I don't think I've ever seen a presentation that's had me question reality multiple times before. My mind is suitably blown.
Very cool! I've done research on reinforcement/imitation learning in world models. A great intro to these ideas is here: https://worldmodels.github.io/
I'm most excited for when these methods will make a meaningful difference in robotics. RL is still not quite there for long-horizon, sparse reward tasks in non-zero-sum environments, even with a perfect simulator; e.g. an assistant which books travel for you. Pay attention to when virtual agents start to really work well as a leading signal for this. Virtual agents are strictly easier than physical ones.
Compounding on that, mismatches between the simulated dynamics and real dynamics make the problem harder (sim2real problem). Although with domain randomization and online corrections (control loop, search) this is less of an issue these days.
Multi-scale effects are also tricky: the characteristic temporal length scale for many actions in robotics can be quite different from the temporal scale of the task (e.g. manipulating ingredients to cook a meal). Locomotion was solved first because it's periodic imo.
Check out PufferAI if you're scale-pilled for RL: just do RL bigger, better, get the basics right. Check out Physical Intelligence for the same in robotics, with a more imitation/offline RL feel.
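If it helps anyone, here's the core idea from that worldmodels.github.io page as a toy sketch: once you have a learned dynamics model, the policy can be improved entirely on imagined rollouts ("in the dream"), with no real environment in the loop. Everything below is a made-up stand-in, not a faithful reimplementation.

    # Toy policy search inside a learned world model ("dreaming").
    import numpy as np

    rng = np.random.default_rng(0)
    W_dyn = 0.1 * rng.normal(size=(4, 4))  # stand-in for a learned dynamics model
    policy = rng.normal(size=4)            # linear policy parameters to optimize

    def dream_return(w, horizon=50):
        """Roll the policy out entirely inside the model; no real env needed."""
        s, total = np.zeros(4), 0.0
        for _ in range(horizon):
            a = np.tanh(w @ s + 0.1)       # action chosen from the imagined state
            s = np.tanh(W_dyn @ s + a)     # imagined next state
            total += -np.sum(s ** 2)       # toy reward: stay near the origin
        return total

    # Crude random-search improvement using only imagined experience.
    for _ in range(200):
        candidate = policy + 0.05 * rng.normal(size=4)
        if dream_return(candidate) > dream_return(policy):
            policy = candidate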
Advances in generative AI are making me progressively more and more depressed.
Creativity is being taken from us at an exponential rate. And I don't buy the argument from people who say they are excited to live in this age. I could accept that if the technology stopped at its current state and remained just a set of tools for our creative endeavours, but that doesn't seem to be the endgame here. Instead it aims to be a complete replacement.
Granted, you can say "you can still play musical instruments/paint pictures/etc for yourself", but I don't think there was ever a period of time when creative works were created just for their own sake rather than for sharing with others en masse.
So what is the final state here for us? A return to menial, not-yet-automated work? And when that is eventually automated, what's left? Plugging our brains into personalized autogenerated worlds tailored to trigger the relevant neuronal circuitry, producing ever-increasing dopamine levels until our brains finally burn out (which is arguably already happening with TikTok-style leisure)? And how are you supposed to pay for that, if all work is automated? How is the economics of that supposed to work?
Looks like a pretty decent explanation of the Fermi paradox. No one would know how technology works, there are no easily available resources left to make use of simpler tech, and the planet is littered to the point of no return.
How to even find the value in living given all of that?
Everyone is in agreement, this is impressive stuff. Mind blowing, even. But have the good people at Google decided why exactly we need to build the torment nexus?
I wonder how hard it would be to get VR output?
That's an insane product right there just waiting to happen. Too bad Google sleeps so hard on the tech they create.
Genuinely technically impressive, but I have a weird issue with calling these world simulator models. To me, they're video game simulator models.
I've only ever seen demos of these models where things happen from a first-person or 3rd-person perspective, often in the sort of context where you are controlling some sort of playable avatar. I've never seen a demo where they prompted a model to simulate a forest ecology and it simulated the complex interplay of life.
Hence, it feels like a video game simulator, or put another way, a simulator of a simulator of a world model.
Can you imagine explaining to someone from the 1800s that we've created a fully generative virtual world experience and the demo was "painting a wall blue"?
I'm not sure this is interesting beyond the wow effect, unless we can actually get the world out of the AI. The real reason ChatGPT and friends actually have customers is that the text interface is durable and easy to build upon after generation. It's also super easy to feed text into a fresh cycle. But this, while looking fancy, doesn't seem to be on the path to actually working out, unless there is a sane export to Unreal or something.
And unfortunately it's not possible for the general public to play around with it.
> world model
Another linguistic devastation. A "world model" is in epistemology the content of a representation of states of thing - all states of things, facts and logic.
This use of the expression "world model" seems to be a reduction. And that's too bad, because we needed the idea in its good form to speak about what neural networks contain, in this LLM sub-era.
Like the new widespread sloppy use of the expression "AI", this does not contribute to collective mental clarity.
What gets me is the egocentric perspective it has naturally produced from its training data, where you have the perception of a 3D, six-degrees-of-freedom world space around you. Once it's running at 90 frames per second and working in a meshed geometry space, this will intersect with augmented/virtual XR headsets, and the metaverse will become an interaction arena for working with artificial intelligence using our physical actions, our gaze, our location, and a million other points of background telemetry. All of that will be integrated into what we today call context, and the response will be to adjust, in a useful and meaningful way, what we see painted into our environment. Imagine the world as a tangible user interface.
Would progress in these be faster if they created 3d meshes and animations instead of full frame videos?
This will be great for VR, especially for Vision Pro.
It sounds cool that Genie 3 can make whole worlds you can explore, but I wonder how soon regular people will actually get to try it out?
It feels like Ready Player One on Vision Pro will arrive soon.
The next version of this, paired with a generative sound model + VR. Matched with the tech that already exists that is able to convert thoughts to words, you could pre-empt what someone wants and display it to them in real-time. Allowing them to explore their conscious and unconscious, meeting and interacting and forming deeper relationships with the aspects of themselves that they had forgotten about.
> To that end, we're exploring how we can make Genie 3 available to additional testers in the future.
No need to explore; I can tell you how. Release the weights to the general public so that everyone can play with it and non-Google researchers can build their work upon it.
Of course this isn't going to happen because "safety". Even telling us how many parameters this model has is "unsafe".
I feel like this tech is a dead end. If it could instead generate 3d models which are then rendered, that would be immensely useful. Eliminates memory and playtime constraints, allows it to be embedded in applications like games. But this? Where do we go from here? Even if we eliminate all graphical issues and get latency from 1s to 0, what purpose does it serve?
This is beautiful. An incredible device that could expand people's view of history and science. We could create such immersive experiences with this.
I know that everyone always worries about trapping people in a simulation of reality etc. etc. but this would have blown my mind as a child. Even Riven was unbelievable to me. I spent hours in Terragen.
A Mind needs a few things: the ability to synthesize sensor data about the outside world into a form that can be compressed into important features; the ability to choose which of those features to pay attention to; the ability to model the physical world around it, find reasonable solutions to problems, and simulate its actions before taking them; the ability to understand and simulate the actions of other Minds; the ability to compress events into important features and store them in memory; the ability to retrieve those memories at appropriate times and in appropriate clarity; etc.
I feel like as time goes on more and more of these important features are showing up as disconnected proofs of concept. I think eventually we'll have all the pieces and someone will just need to hook them together.
I am more and more convinced that AGI is just going to eventually happen and we'll barely notice because we'll get there inch by inch, with more and more amazing things every day.
I thought I was not going to see too many negative comments here, yet I was mistaken. I thought that since it's not an LLM, people would have a more nuanced take and could look at the research with an open mind. The examples on the website are probably cherry-picked, but the progress is really nice compared to Genie 2.
It's a nice step towards gains in embodied AI. Good work, DeepMind.
I feel like it’s gonna shake up the gaming world big time down the road. And I just made a site to record the impact of it: https://www.genie3ai.app/
This is one of the most insane feats of AI I have ever seen to be honest.
"developing simulated environments for open-ended learning and robotics"
What this means is that a robot model could be trained 1000x faster on GPUs compared to training a robot in the physical world where normal spacetime constraints apply.
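A crude way to see the mechanism (all numbers below are invented for illustration): many simulated environments can step in parallel on accelerators, while a physical robot is pinned to wall-clock time.

    # Illustrative throughput comparison; every number here is made up.
    n_envs = 256                     # parallel simulated environments per accelerator
    sim_steps_per_sec_per_env = 40   # assumed world-model step rate per environment
    real_steps_per_sec = 10          # rough control-loop rate of one physical robot

    sim_throughput = n_envs * sim_steps_per_sec_per_env
    speedup = sim_throughput / real_steps_per_sec
    print(f"{sim_throughput} sim steps/s vs {real_steps_per_sec} real -> ~{speedup:.0f}x")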
Interesting! This feels like they're trying to position it as a competitor to Nvidia's Omniverse, which is built on the Universal Scene Description (USD) format as its backbone. I wonder what format world objects can be ingested into Genie in - e.g. for the manufacturing use cases mentioned.
It's interesting, because I was always a bit confused and annoyed by the Giant's Drink/Mind Game that Ender plays in Ender's Game. It just always felt so different to how games I knew played, it felt odd that he would "discover" things that the developers hadn't intended, because I always just thought "wait, someone had to build that into the game just in case he happened to do that one specific thing?" Or if it was implied that they didn't do that, then my thought was "that's not how this works, how is it coming up with new/emergent stories?"
This feels almost exactly like that, especially the weird/dreamlike quality to it.
First AI thing that’s made me feel a bit of derealization…
…and this is the worst the capabilities will ever be.
Watching the video created a glimmer of doubt that perhaps my current reality is a future version of myself, or some other consciousness, that’s living its life in an AI hallucinated environment.
This looks incredibly promising not just for AI research but for practical use cases in game development. Being able to generate dynamic, navigable 3D environments from text prompts could save studios hundreds of hours of manual asset design and prototyping. It could also be a game-changer for indie devs who don’t have big teams.
Another interesting angle is retrofitting existing 2D content (like videos, images, or even map data) into interactive 3D experiences. Imagine integrating something like this into Google Maps: suddenly Street View becomes a fully explorable 3D simulation generated from just text or limited visual data.
Can anyone specifically working in, or with expertise in, this field give even a best-guess breakdown of the technology, architecture, system design, or possibly even the compute requirements of how they think this was implemented? I'm very curious how this works and what methods were employed, as they are generally tight-lipped at the moment. So I'm curious what specialists in this space could surmise or speculate about the implementation of Genie 3.
There are very few people visible in the demos. I suppose that is harder?
The Simulation Theory presents the following trilemma, one of which must be true:
1. Almost all human-level civilizations go extinct before reaching a technologically mature “posthuman” stage capable of running high-fidelity ancestor simulations.
2. Almost no posthuman civilizations are interested in running simulations of their evolutionary history or beings like their ancestors.
3. We are almost certainly living in a computer simulation.
I'm imagining how these worlds combined with AI NPCs could help people learn real-world skills, or overcome serious anxiety disorders, etc.
That's completely bonkers. We are making machines dream of explorable, editable, interactable worlds.
I wonder how much it costs to run something like this.
The demo looks like they’re being very gentle with the AI, this doesn’t look like much of an advancement.
I find the model very impressive, but how could it be used in the wild? They mention robots (maybe to test them cheaply in completely different environments?), but I don't see the use in games except during development to generate ideas/assets.
We were working towards this years ago with Doarama/Ayvri, and I remember fondly in 2018 an investor literally yelling at me that I didn't know what I was talking about and AI would never be able to do this. Less than a decade later, here we are.
Our product was a virtual 3d world made up of satellite data. Think of a very quick, higher-res version of google earth, but the most important bit was that you uploaded a GPS track and it re-created the world around that space. The camera was always focused on the target, so it wasn't a first person point of view, which, for the most part, our brains aren't very good at understanding over an extended period of time.
For those curious about the use case, our product was used by every paraglider in the world, commercial drone operations, transportation infrastructure sales/planning, and outdoor event promotions (specifically bike and ultramarathon races).
Though I suspect we will see a new form of media come from this. I don't pretend to suggest exactly what this media will be, but mixing this with your photos we can see the potential for an infinitely re-framable and zoomable type of photo media.
Creating any "watchable" content will be challenging if the camera is not target focused, and it makes it difficult to create a storyline if you can't dictate where the viewer is pointed.
The idea of a possible new AI winter to come seems less likely with each new announcement. Robust world models can be used as an effectively infinite source of training data.
Kinda wish the ski scenario had "yeti" as an event you could trigger.
I wouldn't want to be a Hollywood production studio or game developer right now.
have they released a research paper / do we have details on the architecture and training?
This is a bad use of AI; we should spend our compute on making science faster. I am pretty confident the computational cost of this is maybe 100x that of a ChatGPT query. I don't even want to think about the environmental effects.
I can't believe it's been 24 hours and nobody mentioned the word "holodeck" :)
We need more of these models that can truly understand world building and how the world works. Then we can talk about real AGI
This better be enough of a technological leap for Half Life 3 to come out
The claims being made in this announcement are not demonstrated in the video. A very careful first person walk in an AI video isn’t very impressive these days…
What format do these world models output? Since it's interactive, it's not just a video...does DeepMind have some kind of proprietary runtime or what?
Deeply disillusioned — the cyclist was not a pelican.
(See "Exploring locations and historical settings" scene 5.)
So we cannot use this yet?
While watching the video I was just imagining the $ increasing by the second. But then it's not available at all yet :(
I lost a lot of trust in Google’s AI announcements after being initially amazed and impressed by this debacle:
https://arstechnica.com/information-technology/2023/12/googl...
Damn, I'm getting Black Mirror vibes from this. Maybe because I watched the Eulogy episode last night.
Really great work though, impressive to see.
So are foundational models real finally now?
Are they just multimodal for everything?
Are foundational time series models included in this category?
Have they explained anywhere what hardware resources it takes to run this in 720p at 24fps with minutes-long context?
it's simulations all the way down
It works only with a text prompt? No way to give it an image as a starting point?
The elephant in the room is the porn capabilities. Onlyfans will be dead in 10 years.
I wonder how far are we from being able to use this at home as a form of entertainment.
People are thinking "how are video games going to use this?"
That's not the point, video games are worth chump-change compared to robotics. Training AIs on real-world robotic arms scaled poorly, so they're looking for paths that leverage what AI scales well at.
Making this programmable down to the details would be a game-changer.
I can see this being incredible for history lessons and history school lectures
This is scary. I don’t have a benchmark to propose, but I don’t think my brain can imagine things with greater fidelity than this. I can probably write down the physics better, but I think these systems have reached parity with at least my imagination model.
Wondering what happens when we peer through a microscope or telescope?
Strap on a headset and we are one step closer to being in a simulation.
They’re very clever to only turn 90 degrees. I’d like to see a couple of 1080s with a little bit of 120 degree zig zagging along the way please.
I am much more convinced now that the Simulation Argument is correct
Damn, this reminds me of those Chinese FMV games on Steam.
Now this could be the killer app VR's been looking for.
Mark Zuckerberg must be very, very upset looking at this. I expect him to throw another billion dollars at Google engineers.
A lot to unpack here; I've added a detailed summary here:
https://extraakt.com/extraakts/google-s-genie-3-capabilities...
Just imagine if the developers of Star Citizen had access to this technology, how much more they could have squeezed from unsuspecting backers.
what a time to be alive
a massive leap forward for real-time world modeling
What would scare me is if this becomes economically viable enough to release to the public, rather than staying an unlimited budget type of demo.
this is crazy, really cool
google pushing new levels of evil with this one
Another case of moving the goalposts until you score a goal.
Jesus.
This is starting to feel pretty **ing exponential.
Think of the pornographic possibilities
/s
Yet another unavailable model from Google... if I can't use it, I don't care. Tell me about it when it's ready to use.
like how? is this mainly realtime inference?
Not open source, not worth it. Next.
Movies are about to become cheap to produce.
Good writers will remain scarce though.
Maybe we will have personalized movies written entirely by AI.
What is the purpose of this? It seems designed to muddy the waters of reality vs. falsehood and put creatives in film/tv out of jobs. Real Jurassic Park moment here
Consistency over multiple minutes and it runs in real time at 720p? I did not expect world models to be this good yet.
> Genie 3’s consistency is an emergent capability
So this just happened from scaling the model, rather than being a consequence of deliberate architecture changes?
Edit: here is some commentary on limitations from someone who tried it: https://x.com/tejasdkulkarni/status/1952737669894574264
> - Physics is still hard and there are obvious failure cases when I tried the classical intuitive physics experiments from psychology (tower of blocks).
> - Social and multi-agent interactions are tricky to handle. 1vs1 combat games do not work
> - Long instruction following and simple combinatorial game logic fails (e.g. collect some points / keys etc, go to the door, unlock and so on)
> - Action space is limited
> - It is far from being a real game engines and has a long way to go but this is a clear glimpse into the future.
Even with these limitations, this is still bonkers. It suggests to me that world models may have a bigger part to play in robotics and real world AI than I realized. Future robots may learn in their dreams...