
Sora's Rival! Meta's Most Powerful Immersive AI Media Model Arrives, with a 30-Billion-Parameter Movie Gen Video Model

wallstreetcn ·  Oct 4 13:10

Meta says Movie Gen is its "most advanced and immersive storytelling suite of models," trained on licensed and publicly available data. It can generate videos at 16 frames per second, up to 16 seconds long; a 13-billion-parameter model handles audio generation; and in human evaluations, Movie Gen's video generation achieved a net win rate of 8.2 over Sora. Meta gave no release date, but Zuckerberg said it will come to Instagram next year.

Author of this article: Li Dan.


OpenAI's Sora faces a formidable opponent: Meta has unveiled Movie Gen, which it calls its most advanced media foundation model.

Meta describes Movie Gen as the company's breakthrough generative AI research for media, spanning modalities including image, video, and audio. With nothing more than a text prompt, users can create custom videos and sounds, edit existing videos, and turn personal images into unique videos. In human evaluations, Movie Gen outperforms comparable models in the industry on these tasks.

Meta introduces Movie Gen as its most advanced and immersive storytelling suite of models. It combines the company's first wave of generative AI media research, the Make-A-Scene series of models that enabled creation of images, audio, video, and 3D animation, with its second wave, the diffusion-based Llama Image foundation models, which enabled higher-quality image and video generation as well as image editing.

Videos up to 16 seconds long, a 13-billion-parameter audio generation model, and a net win rate of 8.2 over Sora in human evaluations of video generation.

In summary, Movie Gen has four main functions: video generation, personalized video generation, precise video editing, and audio generation.

For video generation, Meta explains that users need only provide a text prompt: Movie Gen uses a joint model optimized for both text-to-image and text-to-video to create high-definition, high-quality images and videos. Movie Gen's video model has 30 billion parameters, and this transformer model can generate videos up to 16 seconds long at 16 frames per second.

Meta says these models can reason about object motion, interactions between subjects and objects, and camera movement, and that they learn a wide range of concepts well enough to understand plausible motion, making them the most advanced models in their category. To showcase the feature, Meta presented several roughly 10-second video clips, including a baby hippo swimming in the style of "Moo Deng," the bouncy pygmy hippopotamus that has gone viral on social media.


Wallstreetcn noted that, judging purely by maximum video length, Movie Gen falls short of Sora, which OpenAI unveiled in February this year; what amazed the industry about Sora was its ability to create natural-looking videos up to 60 seconds long. Compared with Emu Video, the video model Meta announced last November, however, Movie Gen is a clear step forward: Emu Video could only generate videos up to 4 seconds long at 16 frames per second.
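As a quick sanity check on the clip lengths above, the per-clip frame budgets for Meta's two models follow directly from the stated rates. Note that the article gives Sora's duration but not its frame rate, so only its length is listed; all names and numbers below come from the text above:

```python
FPS = 16  # frame rate cited for both of Meta's models

# Maximum clip length in seconds, per the article
max_seconds = {"Movie Gen": 16, "Emu Video": 4, "Sora": 60}

for name, seconds in max_seconds.items():
    if name == "Sora":
        # Sora's frame rate is not stated in the article
        print(f"{name}: up to {seconds}s")
    else:
        print(f"{name}: up to {seconds}s = {seconds * FPS} frames at {FPS} fps")
```

So Movie Gen's 16-second clips amount to 256 frames per video, four times Emu Video's 64-frame ceiling.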

Beyond generating natural-looking videos directly, Movie Gen also excels at personalized video. Meta says it has extended the foundation models above to support personalized video generation: given an image of a person plus a text prompt, Movie Gen can generate a video featuring the person from the reference image with visual details that match the prompt. Meta states the model achieves state-of-the-art results at creating personalized videos that preserve a person's identity and motion.

In one video Meta demonstrated, a user provides a photo of a girl and the text prompt "a female DJ in a pink vest spinning records, accompanied by a cheetah," and the model generates a video of a DJ who resembles the girl in the photo spinning records, cheetah at her side.


For precise video editing, Meta said Movie Gen uses a variant of the same foundation model that takes a user's video and a text prompt and performs the requested task precisely to produce the desired output. It combines video generation with advanced image editing, handling both localized edits, such as adding, removing, or replacing elements, and global changes, such as modifying the background or style. Unlike traditional tools that demand professional skills, or generative tools that lack precision, Movie Gen preserves the original content and edits only the relevant pixels.

In one example Meta provided, a user asks for a penguin to be dressed in the style of Britain's Victorian era, and Movie Gen renders the penguin wearing a red lace-trimmed dress.


For audio generation, Meta said it trained a 13-billion-parameter audio generation model that takes a video and an optional text prompt and produces high-quality, high-fidelity audio up to 45 seconds long, including ambient sound, Foley sound effects, and instrumental background music, all synchronized with the video content. Meta also introduced an audio-extension technique that can generate coherent audio for videos of any length, achieving state-of-the-art performance overall in audio quality, video-to-audio alignment, and text-to-audio alignment.
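The article does not describe how Meta's audio-extension technique works internally. A common generic approach to stitching fixed-length generated audio segments into arbitrarily long tracks is to overlap consecutive segments and crossfade the seams; the sketch below is purely illustrative and is not Meta's method:

```python
def crossfade_concat(chunks, overlap):
    """Join audio chunks (lists of float samples), linearly
    crossfading each seam over `overlap` samples so segment
    boundaries blend smoothly instead of clicking."""
    out = list(chunks[0])
    for chunk in chunks[1:]:
        head, tail = chunk[:overlap], chunk[overlap:]
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)  # fade-in weight ramps toward 1
            out[-overlap + i] = out[-overlap + i] * (1 - w) + head[i] * w
        out.extend(tail)
    return out

# Two 4-sample chunks of constant amplitude merge into a seamless
# 6-sample signal (4 + 4 - 2 overlapping samples).
print(crossfade_concat([[1.0] * 4, [1.0] * 4], overlap=2))
```

A real system would generate each chunk conditioned on the tail of the previous one; the crossfade here only illustrates why overlapping windows keep long outputs coherent.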

In one example from Meta, the model generates the sound of an ATV engine revving, set against guitar music; in another, orchestral music plays over the sound of rustling leaves and snapping twigs.


Meta also presented A/B test results for the four capabilities above. The net positive win rate in the chart indicates how strongly human evaluators preferred Movie Gen's output over that of competing models such as Sora. For direct video generation, Movie Gen's net win rate over Sora reaches 8.2.


Trained on licensed and publicly available data, with no firm release date, though Zuckerberg says it will come to Instagram next year.

What data was Movie Gen trained on? Meta's statement gives no specifics, saying only: "We trained these models on licensed and publicly available datasets."

Some commentators point out that, for generative AI tools, the provenance of training data and what gets scraped from the internet remain contentious issues, and the public rarely learns which text, video, or audio clips went into any given large model.

Other commentators note that Meta calls its training dataset "proprietary/commercially sensitive" and provides no details, so one can only speculate that it includes large amounts of video from Instagram and Facebook, some content from Meta's partners, and plenty of other inadequately protected, that is, "publicly available," content.

As for timing, Meta's Friday announcement did not say when Movie Gen will be released to the public, offering only a vague "possible future release." OpenAI, for its part, announced Sora in February this year but has still not made it generally available, nor has it revealed a planned release date.

Meta CEO Mark Zuckerberg, however, said Movie Gen will launch on Meta's social platform Instagram next year. He posted a Movie Gen-generated video on his personal Instagram account showing himself on a leg-press machine. As he works out, the background keeps changing: first a futuristic neon-lit gym, then him training in gladiator armor, then him pressing a burning machine of pure gold, and finally him leg-pressing a box of chicken nuggets surrounded by fries.

Zuckerberg added that Meta's new Movie Gen AI model can create and edit videos, and that every day is a leg day; the model will launch on Instagram next year.

On the social platform X, Meta officially announced and demonstrated Movie Gen. Among the most-liked replies to the post, netizens are already urging Meta to release the model, and some users ask whether everyone will get a chance to try it.
Disclaimer: This content is for informational and educational purposes only and does not constitute a recommendation or endorsement of any specific investment or investment strategy.