In response to the string of new products OpenAI has launched recently, Google on Wednesday released Gemini 2.0 Flash, a major next-generation artificial intelligence model that can natively generate images and audio in addition to text. 2.0 Flash can also call third-party applications and services, giving it access to capabilities such as Google Search and code execution. According to Google, Gemini 2.0 Flash is the first model in the 2.0 family; its headline advances are native multi-modal input and output plus agent capabilities. It runs twice as fast as 1.5 Pro, and on key benchmarks it even surpasses 1.5 Pro.
Author: Zhao Yuhe
Source: Hard AI
In response to the string of new products OpenAI has launched recently, Google on Wednesday released Gemini 2.0 Flash, a major next-generation artificial intelligence model that can natively generate images and audio in addition to text. 2.0 Flash can also call third-party applications and services, giving it access to capabilities such as Google Search and code execution.
Starting Wednesday, an experimental version of 2.0 Flash is available through the Gemini API and Google's AI development platforms, AI Studio and Vertex AI. However, audio and image generation are open only to "early access partners" and are scheduled to roll out broadly in January next year.
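For developers who want to try the experimental model, a minimal call through the Gemini API might look like the sketch below. It assumes the google-genai Python SDK and the experimental model identifier gemini-2.0-flash-exp; the exact package and model names may differ from what your environment exposes.

```python
# Minimal sketch: text generation with the experimental 2.0 Flash model.
# Assumes the google-genai Python SDK (pip install google-genai) and an
# API key from AI Studio; the model name is the experimental identifier
# announced at launch and may change.
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents="Summarize the difference between a list and a tuple in Python.",
)
print(response.text)
```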
Over the next few months, Google says it will bring different versions of 2.0 Flash to products such as Android Studio, Chrome DevTools, Firebase, and Gemini Code Assist.
Flash upgrade
The first-generation Flash (1.5 Flash) could only generate text and was not designed for particularly demanding workloads. According to Google, the new 2.0 Flash model is more versatile, in part because it can call tools (such as Search) and interact with external APIs.
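As an illustration of the tool-calling ability described above, the sketch below enables the built-in Google Search tool through the same SDK. The configuration types shown (Tool, GoogleSearch, GenerateContentConfig) are taken from the google-genai SDK as an assumption and may not match every SDK version.

```python
# Sketch: letting 2.0 Flash ground an answer with Google Search.
# The tool/config classes come from the google-genai SDK; treat the exact
# names as assumptions if you are on a different SDK version.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents="What were the key announcements at Google's December event?",
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)
print(response.text)
```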
Tulsee Doshi, head of product for Google's Gemini models, said:
"We know Flash is loved by developers for its good balance of speed and performance. With 2.0 Flash, it keeps that speed advantage, but it is now even more powerful."
Google claims that in the company's internal testing, 2.0 Flash ran twice as fast as the Gemini 1.5 Pro model on some benchmarks and showed "significant" improvements in areas such as coding and image analysis. In fact, the company says 2.0 Flash, with its better math performance and "factuality," has replaced 1.5 Pro as Gemini's flagship model.
2.0 Flash can generate and modify images alongside text. The model can also ingest photos, videos, and audio recordings and answer questions about their content.
Audio generation is another key feature of 2.0 Flash, which Doshi describes as "steerable" and "customizable." For example, the model can read text aloud using eight voices optimized for different accents and languages.
However, Google has not provided image or audio samples generated by 2.0 Flash, so there is no way to judge how its output quality compares with other models.
Google says it is using its SynthID technology to watermark all audio and images generated by 2.0 Flash. On software and platforms that support SynthID (that is, some Google products), the model's output will be flagged as synthetic content.
The move is aimed at allaying concerns about misuse, as deepfakes are becoming a growing threat. According to data from identity-verification service Sumsub, the number of deepfakes detected worldwide quadrupled from 2023 to 2024.
Multi-modal API
The production version of 2.0 Flash will launch in January. In the meantime, Google has released an API called the Multimodal Live API to help developers build apps with real-time audio and video streaming.
Through the Multimodal Live API, Google says, developers can create real-time multi-modal applications that take audio and video input from a camera or the screen. The API supports tool integration for completing tasks and can handle "natural conversation patterns" such as interruptions, similar in function to OpenAI's Realtime API.
The Multimodal Live API became generally available Wednesday morning.
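To give a sense of how the Multimodal Live API is used, here is a minimal text-only sketch of a single streaming turn. It assumes the google-genai SDK's async live client (client.aio.live.connect); the session method names have changed across SDK versions, so treat them as illustrative rather than definitive.

```python
# Sketch: one streaming turn over the Multimodal Live API.
# Assumes the google-genai SDK's async live client; the session methods
# (send / receive) are illustrative and vary by SDK version. A real-time
# app would stream audio or video chunks instead of one text message.
import asyncio
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

async def main() -> None:
    config = {"response_modalities": ["TEXT"]}
    async with client.aio.live.connect(
        model="gemini-2.0-flash-exp", config=config
    ) as session:
        await session.send(input="Explain what a barge-in interruption is.",
                           end_of_turn=True)
        async for message in session.receive():
            if message.text:
                print(message.text, end="")

asyncio.run(main())
```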
An AI agent that operates web pages
On Wednesday, Google also unveiled its first AI agent capable of performing actions on web pages: Project Mariner, a research prototype from its DeepMind division. Powered by Gemini, the agent can take over a user's Chrome browser, move the cursor on the screen, click buttons, and fill out forms, using and browsing websites much like a human.
Google said that starting Wednesday, the agent will first be made available to a small group of pre-selected testers.
According to media reports, Google is continuing to experiment with new ways for Gemini to read, summarize, and even use websites. A Google executive told reporters that this marks a "new paradigm shift in user experience": users no longer interact with websites directly, but carry out those interactions through a generative AI system.
Analysts believe this shift could affect millions of businesses, from publishers like TechCrunch to retailers like Walmart, that have long relied on Google to send real users to their websites.
In a demonstration for the tech outlet TechCrunch, Google Labs director Jaclyn Konzelmann showed how Project Mariner works.
After a user installs an extension in the Chrome browser, a chat window pops up on the right side of the browser. Users can instruct the agent to complete tasks such as "build a shopping cart at the supermarket based on this list."
The AI agent then navigates to a supermarket's website, searches for the items, and adds them to a virtual shopping cart. One obvious problem is that the agent is slow: there is a delay of roughly five seconds between cursor movements. Sometimes the agent breaks off the task and returns to the chat window to ask for clarification about certain items (such as how many carrots are needed).
Google's agent cannot complete checkout because it does not fill in credit card numbers or billing information. Project Mariner also does not accept cookies or sign terms-of-service agreements on users' behalf. Google said it deliberately withheld these operations from the agent to keep users in control.
Behind the scenes, Google's agent takes screenshots of the user's browser window (the user must consent to this in the terms of service) and sends them to Gemini in the cloud for processing. Gemini then sends instructions for navigating the web back to the user's computer.
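Google has not published Project Mariner's internals, but the loop described above (screenshot in, browser action out, with occasional clarifying questions) can be sketched conceptually as below. Every helper and type here is a hypothetical stand-in, not Google's code.

```python
# Conceptual sketch only: Project Mariner's implementation is not public.
# All helpers below are hypothetical stand-ins for the loop described in
# the article: screenshot -> cloud model -> browser action, pausing to
# ask the user when an instruction is ambiguous.
from dataclasses import dataclass

@dataclass
class BrowserAction:
    kind: str            # "click", "type", "ask_user", or "done"
    target: str = ""     # description of the on-page element to act on
    text: str = ""       # text to type, or the clarifying question to ask

def capture_screenshot(tab_id: str) -> bytes:
    """Hypothetical: grab the active Chrome tab as an image."""
    return b"<png bytes>"

def ask_gemini(screenshot: bytes, goal: str) -> BrowserAction:
    """Hypothetical: send the screenshot and goal to the cloud model."""
    return BrowserAction(kind="done")

def run_agent(tab_id: str, goal: str) -> None:
    while True:
        action = ask_gemini(capture_screenshot(tab_id), goal)
        if action.kind == "done":
            break
        if action.kind == "ask_user":
            # The article notes the agent sometimes returns to the chat
            # window for clarification (e.g. "how many carrots?").
            goal += " | clarification: " + input(action.text + " ")
        else:
            print(f"{action.kind} -> {action.target!r} ({action.text!r})")

run_agent("active-tab", "build a supermarket cart from this shopping list")
```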
Project Mariner can also be used to search for flights and hotels, shop for household items, find recipes, and handle other tasks that today require users to click through web pages themselves.
However, Project Mariner works only in the foreground, active tab of the Chrome browser, which means that while the agent is running, users cannot use the computer for anything else; they have to watch Gemini slowly click through pages. Koray Kavukcuoglu, chief technology officer of Google DeepMind, said this was a very deliberate decision so that users always know what Google's AI agent is doing.
Konzelmann said,
"[Project Mariner] marks a radically new user-experience paradigm shift that we're seeing now. We need to figure out the right way to make all of this change how users interact with the web, and also how publishers create experiences for both users and agents."
AI agents that do research, write code, and learn games
In addition to Project Mariner, Google on Wednesday also unveiled several new AI agents dedicated to specific tasks.
One agent, Deep Research, aims to help users investigate complex topics by creating multi-step research plans. It appears to be a competitor to OpenAI's o1, which is also capable of multi-step reasoning. However, a Google spokesperson said the agent is not intended for solving math and logical-reasoning problems, writing code, or performing data analysis. Deep Research is available now in Gemini Advanced and will come to the Gemini app in 2025.
When it receives a difficult or broad question, Deep Research creates a multi-step action plan for answering it. Once the user approves the plan, Deep Research spends a few minutes answering the question, searching the web, and generating a detailed research report.
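The plan-approve-execute workflow described above can be sketched abstractly as follows. This is purely illustrative, not Google's implementation, and every helper function is hypothetical.

```python
# Conceptual sketch of a plan -> approve -> research -> report loop,
# mirroring the Deep Research workflow described above. All helpers are
# hypothetical; this is not Google's implementation.
from typing import List

def draft_plan(question: str) -> List[str]:
    """Hypothetical: ask a model to break the question into research steps."""
    return [f"Search the web for background on: {question}",
            "Compare the most credible sources",
            "Draft a report with citations"]

def execute_step(step: str) -> str:
    """Hypothetical: run one step (e.g. a web search) and return findings."""
    return f"Findings for: {step}"

def deep_research(question: str) -> str:
    plan = draft_plan(question)
    print("Proposed plan:", *plan, sep="\n  - ")
    if input("Approve plan? [y/N] ").lower() != "y":
        return "Plan rejected; nothing to do."
    findings = [execute_step(step) for step in plan]
    return "\n".join(["Research report:"] + findings)

print(deep_research("How did AI accelerator efficiency change in 2024?"))
```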
Another new agent, Jules, aims to help developers with coding tasks. It integrates directly into GitHub workflows, allowing it to review existing work and make changes directly in GitHub. Jules is available now to a small group of testers and will be released more broadly in 2025.
Finally, Google DeepMind said it is drawing on its long experience with game-playing AI to develop agents that help users navigate video games. Google is partnering with game developers such as Supercell to test Gemini's ability to interpret game worlds like that of Clash of Clans.
AI-generated search summaries
On Wednesday, Google also said that AI Overviews, the AI-generated summaries shown for certain Google search queries, is moving to the Gemini 2.0 model and will soon be able to handle "more complex topics" as well as "multi-modal" and "multi-step" searches. Google says this includes advanced math problems and programming questions.
The new AI Overviews will begin limited testing this week and roll out widely early next year.
However, since its launch this spring, AI Overviews has generated plenty of controversy, and some of its questionable statements and suggestions (such as recommending glue on pizza) went viral online. According to a recent report from SEO platform SE Ranking, AI Overviews cites websites that are "not entirely reliable or evidence-based," including outdated research and paid product listings.
Analysts believe the main problem is that AI Overviews sometimes struggles to tell whether a source is fact, fiction, satire, or serious content. Over the past few months, Google has changed how AI Overviews works, limiting answers on current events and health topics. But Google does not claim the feature is perfect.
Even so, Google says AI Overviews has increased search engagement, particularly among 18-to-24-year-olds, a key target group for Google.
Trillium, the latest AI accelerator chip, powers Gemini 2.0
Google also unveiled Trillium, its sixth-generation AI accelerator chip, on Wednesday, claiming performance improvements that could fundamentally change the economics of AI development.
The custom processor was used to train Google's latest Gemini 2.0 models. Its training performance is four times that of the previous generation, while energy consumption drops sharply.
Google CEO Sundar Pichai explained in an announcement post that Google has connected more than 100,000 Trillium chips in a single network fabric, forming one of the world's most powerful AI supercomputers.
Trillium makes significant advances along several dimensions. Compared with its predecessor, each chip's peak compute performance increases 4.7 times, while high-bandwidth memory capacity and inter-chip interconnect bandwidth are doubled. More importantly, energy efficiency improves by 67 percent, a key metric for data centers facing the enormous energy demands of AI training.
Trillium's commercial impact is not limited to performance metrics. Google claims the chip delivers 2.5 times more training performance per dollar than the previous generation, which could reshape the economics of AI development.
Analysts believe the release of Trillium intensifies competition in AI hardware, a field Nvidia has long dominated with its GPU-based solutions. While Nvidia's chips remain the industry standard for many AI applications, Google's custom-silicon approach may offer advantages for specific workloads, particularly training very large models.
Other analysts say Google's huge investment in custom chip development reflects a strategic bet on the importance of AI infrastructure. Google's decision to offer Trillium to cloud customers shows it wants to compete harder in the cloud AI market against Microsoft Azure and Amazon AWS. For the tech industry as a whole, Trillium's release signals that the battle for AI hardware supremacy is entering a new phase.