

NVIDIA has open-sourced its world foundation models, accelerating humanoid robots toward their "ChatGPT moment."

cls.cn ·  Jan 7 23:50

① Recently, NVIDIA released the physical AI large model Cosmos, which can predict environments such as warehouses and traffic conditions to train robots; ② According to the list disclosed by NVIDIA, the first batch of users of Cosmos includes manufacturers such as 1X, Agility, Figure AI, and Xpeng Motors; ③ Brokerages believe that among the methods for collecting training data for humanoid robots, synthetic data will greatly promote the development of robots.

According to the Star Daily on January 8, embodied intelligence, which has been favored by top global technology companies such as Google, OpenAI, and Microsoft, is rapidly approaching its ChatGPT moment.

Recently, NVIDIA CEO Jensen Huang officially launched the physical AI large model Cosmos at CES. The model lets developers generate physics-based videos from combinations of text, image, and video inputs, as well as robot sensor or motion data, to predict real environments such as warehouses, factories, and traffic conditions, thereby facilitating training for robots and smart vehicles.

The so-called physical AI large model refers to a world foundation model that can understand language, physical properties, spatial positions, and other elements of the world, and synthesize corresponding physical data. It is key to accelerating the popularization of smart vehicles, embodied intelligence, and other AI terminals. Compared with the leapfrog progress of large language models like ChatGPT, world models are still at a relatively early stage, generally facing issues such as high development costs and an inability to consistently adhere to physical rules.

It is worth mentioning that the Cosmos models released by NVIDIA are open source. According to the disclosed list, the first batch of users includes more than a dozen domestic and international robot and automobile manufacturers, such as 1X, Agile Robots, Agility, Figure AI, Foretellix, Fourier, Galbot, Hillbot, IntBot, Neura Robotics, Skild AI, Virtual Incision, Waabi, and Xpeng Motors.

In fact, NVIDIA's attempts to train robots in realistic physical environments date back to June 2024, when it used the simulation framework RoboCasa to provide thousands of 3D models across more than 150 object categories and dozens of interactive pieces of furniture and appliances. The related experiments demonstrated the effectiveness of synthetic physical data in robot training.

Jensen Huang stated, "The world foundation model is the basis for advancing the development of robots and smart vehicles, but not all developers possess the expertise and resources required to train models independently. We created Cosmos to popularize physical AI, allowing every developer to access general robot technology."

As of now, several companies have launched world foundation models. On December 5, 2024, Google released the large-scale foundation world model Genie 2, capable of generating relatively realistic 3D worlds; in September of the same year, 1X Technologies released a humanoid robot world model that can simulate future scenarios of a robot performing different actions.

Additionally, video generation models are seen as one of the pathways to world foundation models. In the field of video generation, Sora and Runway have both expressed aspirations to venture into world models. HTSC points out that video generation and world models share many similarities: both encode and compress data from the complex external world into lower-dimensional vectors, then learn this knowledge along the spatiotemporal dimensions using Transformers or other models to make predictions.
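The encode-then-predict idea HTSC describes can be illustrated with a toy sketch. This is purely an illustrative assumption, not any vendor's actual architecture: high-dimensional "frames" are compressed into low-dimensional latent vectors, and a simple one-step linear predictor (standing in for the Transformer the report mentions) is fit on the latent trajectory to forecast the next state.

```python
import numpy as np

rng = np.random.default_rng(0)
T, obs_dim, latent_dim = 50, 256, 8  # 50 steps; 256-dim frames; 8-dim latents

# Ground-truth latent dynamics: a stable linear system z[t+1] = z[t] @ A_true.
A_true = 0.9 * np.linalg.qr(rng.normal(size=(latent_dim, latent_dim)))[0]
z = np.zeros((T, latent_dim))
z[0] = rng.normal(size=latent_dim)
for t in range(T - 1):
    z[t + 1] = z[t] @ A_true

# "Render" each latent state into a high-dimensional observation (a flat frame):
# this plays the role of the complex external world the model observes.
D = rng.normal(size=(latent_dim, obs_dim))
frames = z @ D  # shape (T, obs_dim)

# Encoder: compress frames back to low-dimensional latents. A pseudo-inverse
# of D stands in for a learned encoder network.
z_hat = frames @ np.linalg.pinv(D)  # shape (T, latent_dim)

# Fit a one-step predictor on the recovered latents via least squares
# (a stand-in for a learned sequence model such as a Transformer).
A_fit, *_ = np.linalg.lstsq(z_hat[:-1], z_hat[1:], rcond=None)
err = float(np.mean((z_hat[:-1] @ A_fit - z_hat[1:]) ** 2))
print("one-step prediction MSE in latent space:", err)
```

Because the toy dynamics are exactly linear in the latent space, the fitted predictor recovers them almost perfectly; the point is only to show the pipeline shape (compress, then predict in the compressed space), not a realistic world model.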

HTSC's research report today indicates that, inspired by large text models, humanoid robot developers are also beginning to build embodied large models, with the primary focus on solving data issues. Autonomous driving can be simplified to 2D motion within a 3D space, whereas robots perform 3D motion in that space and must also incorporate information such as force and touch. Theoretically, therefore, robots require more data than autonomous driving. Currently, the collection of training data for humanoid robots relies mainly on three methods:

① Collecting real-machine data, for example by having a person wear a motion-capture suit; this yields high-quality data but is costly and slow to collect.

② Using simulated environments to generate synthetic data for training robots.

③ Extracting motion data from existing internet videos; this avoids building a simulation physics engine but involves complex coordinate transformations and lacks force and touch dimensions.

HTSC believes that, of the three methods above, synthetic data will do the most to advance robot development. Academia has already demonstrated the feasibility of these methods, and the "robot brain" is approaching its ChatGPT moment.

Disclaimer: This content is for informational and educational purposes only and does not constitute a recommendation or endorsement of any specific investment or investment strategy.