On a recent episode of the Possible podcast, co-hosted by LinkedIn co-founder Reid Hoffman, Google DeepMind CEO Demis Hassabis said that Google plans to eventually combine its Gemini AI models with its Veo video-generating models to improve Gemini's understanding of the physical world.
“From the outset, we designed Gemini, our foundational model, to be multimodal,” Hassabis explained. “The rationale behind this is our vision of creating a universal digital assistant that genuinely aids users in the real world.”
The AI industry is moving steadily toward ‘omni’ models: systems that can understand and synthesize many forms of media. Google’s newest Gemini models can generate audio, images, and text, while OpenAI’s default model in ChatGPT can also create images, including Studio Ghibli-style art. Amazon, for its part, plans to launch an “any-to-any” model later this year.
Building these omni models requires a large amount of training data: images, videos, audio, text, and more. Hassabis indicated that Veo’s video data comes largely from YouTube, a platform Google owns.
“By extensively analyzing YouTube videos, Veo 2 can deduce the principles of physical dynamics,” Hassabis said.
Google previously told TechCrunch that its models “might be” trained on “some” YouTube content, in accordance with its agreements with YouTube creators. Reportedly, the company broadened its terms of service last year in part to let it tap more data to train its AI models.