① NVIDIA recently released Cosmos, a large physical AI model that can predict environments such as warehouses and traffic conditions in order to train robots; ② According to the list disclosed by NVIDIA, the first batch of Cosmos users includes manufacturers such as 1X, Agility, Figure AI, and Xpeng Motors; ③ Brokerages believe that, among the methods for collecting training data for humanoid robots, synthetic data will greatly advance robot development.
According to Star Daily on January 8, embodied intelligence, which is favored by top global technology companies such as Google, OpenAI, and Microsoft, is rapidly approaching its ChatGPT moment.
NVIDIA CEO Jensen Huang recently launched the large physical AI model Cosmos at CES. The model lets developers generate physics-based videos from combinations of text, image, and video inputs, as well as robot sensor or motion data, in order to predict real environments such as warehouses, factories, and traffic conditions, which in turn supports training for robots and smart cars.
A large physical AI model is a world foundation model that can understand elements of the world such as language, physical properties, and spatial positions, and can synthesize the corresponding physical data. It is key to accelerating the adoption of smart cars, embodied intelligence, and other AI terminals. Compared with the rapid progress of large language models like ChatGPT, world models are still at a relatively early stage and generally face problems such as high development costs and an inability to follow physical rules consistently.
Notably, NVIDIA is releasing Cosmos as open source. According to the disclosed list, the first batch of users includes more than a dozen domestic and international robot and automobile manufacturers, such as 1X, Agile Robots, Agility, Figure AI, Foretellix, Fourier, Galbot, Hillbot, IntBot, Neura Robotics, Skild AI, Virtual Incision, Waabi, and Xpeng Motors.
In fact, NVIDIA's efforts to train robots in realistic physical environments date back to June 2024, when it used the simulation framework RoboCasa to provide thousands of 3D models across more than 150 object categories, along with dozens of interactive furniture items and appliances. The related experiments demonstrated the effectiveness of synthetic physical data in robot training.
Jensen Huang stated, "The world foundation model is the basis for advancing the development of robots and smart cars, but not all developers possess the expertise and resources required to train models independently. We created Cosmos to popularize physical AI, allowing every developer to access general robot technology."
Several companies have already launched world foundation models. On December 5, 2024, Google released Genie 2, a large-scale foundation world model capable of generating relatively realistic 3D worlds; in September of the same year, 1X Technologies released a humanoid robot world model that can simulate the future scenarios that result from a robot performing different actions.
Video generation models are also seen as one pathway to world foundation models. In the field of video generation, both Sora and Runway have signaled ambitions to move into world models. HTSC points out that video generation and world models have much in common: both encode and compress data obtained from the complex external world into lower-dimensional vectors, then use Transformers or other models to learn from this data across the spatiotemporal dimensions in order to make predictions.
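The encode-then-predict idea HTSC describes can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration: the toy frames, the block-averaging encoder (a stand-in for a learned encoder), and the linear extrapolation step (a stand-in for a Transformer). It is not how Cosmos, Sora, or any real world model is implemented.

```python
from typing import List

Frame = List[float]   # flattened toy "frame" (hypothetical stand-in for video or sensor data)
Latent = List[float]  # lower-dimensional vector produced by the encoder


def encode(frame: Frame, block: int = 4) -> Latent:
    """Compress a frame by averaging non-overlapping blocks (stand-in for a learned encoder)."""
    if len(frame) % block != 0:
        raise ValueError("frame length must be a multiple of block")
    return [sum(frame[i:i + block]) / block for i in range(0, len(frame), block)]


def predict_next(latents: List[Latent]) -> Latent:
    """Extrapolate the next latent linearly from the last two (stand-in for a Transformer)."""
    if len(latents) < 2:
        raise ValueError("need at least two latents")
    prev, last = latents[-2], latents[-1]
    return [2 * b - a for a, b in zip(prev, last)]


def make_frame(step: int, width: int = 16) -> Frame:
    """Toy frame: a brightness ramp whose values rise by one at every time step."""
    return [float(step + i) for i in range(width)]


# Observe three frames, compress each to a 4-dimensional latent, then predict the fourth.
frames = [make_frame(t) for t in range(3)]
latents = [encode(f) for f in frames]
predicted = predict_next(latents)
actual = encode(make_frame(3))
```

Because the toy scene changes at a constant rate, the extrapolated latent matches the encoding of the true next frame. Real environments are not this regular, which is why learned sequence models are used in practice.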
HTSC's research report today indicates that, inspired by large language models, humanoid robot developers are also beginning to build embodied large models, with the primary focus on solving data problems. Autonomous driving can be simplified as 2D motion in a 3D space, whereas robots perform 3D motion in that space and also need information such as force and touch. In theory, therefore, robots require more data than autonomous driving does. Currently, training data for humanoid robots is collected mainly through three methods:
① Collecting data from real robots, for example by having a person wear a motion capture suit; this method produces high-quality data but is costly and slow.
② Using simulated environments to generate synthetic data for training robots.
③ Extracting motion data from existing internet videos; this removes the need to build a simulation physics engine, but it involves complex coordinate transformations and lacks the force and touch dimensions.
HTSC believes that, of these three methods, synthetic data will do the most to advance robot development. The academic community has already demonstrated the feasibility of these methods, and the robot brain is ushering in its ChatGPT moment.