
A precision strike ahead of Nvidia's earnings? This unicorn charges into AI inference, claiming the world's fastest speeds without HBM

cls.cn ·  02:24

① Based on its own chip-based computing system, Cerebras has released what it claims is the world's fastest AI inference service; ② Cerebras builds the memory directly into one enormous chip, giving it huge on-chip memory and extremely high memory bandwidth.

“Science and Technology Innovation Board Daily”, August 28 (Editor Zhu Ling) After Wednesday's market close local time, Nvidia will deliver the last major second-quarter earnings report of the season, and global investors are on edge. Just one day earlier (August 27, local time), the American AI processor chip unicorn Cerebras Systems unveiled what it claims is the world's fastest AI inference service, built on its own chip-based computing system and said to be 10 to 20 times faster than systems built on Nvidia's H100 GPU.

Nvidia's GPUs currently dominate both AI training and AI inference. Since launching its first AI chip in 2019, Cerebras has focused on selling AI chips and computing systems, committed to challenging Nvidia in the field of AI training.

According to a report by the US tech outlet The Information, OpenAI's revenue is expected to reach 3.4 billion US dollars this year on the strength of AI inference services. With the AI inference market this large, Andrew Feldman, co-founder and CEO of Cerebras, said Cerebras will also claim a place in the AI market.

With the launch of this AI inference service, Cerebras is no longer just selling AI chips and computing systems; it is opening a second, usage-based revenue curve and taking the fight to Nvidia on all fronts. In Feldman's words, the goal is to “steal enough market share from Nvidia to make them angry.”

Fast and cheap

Cerebras' AI inference service shows significant advantages in both speed and cost. According to Feldman, measured in tokens output per second, Cerebras' AI inference is 20 times faster than the AI inference services run by cloud providers such as Microsoft Azure and Amazon AWS.

At the launch event, Feldman ran Cerebras' inference service and Amazon AWS' side by side. Cerebras completed the inference and produced its output almost instantly, at 1,832 tokens per second, while AWS took several seconds to finish and managed only 93 tokens per second.
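As a quick sanity check, the two throughput figures from the demo imply roughly the advertised gap; a minimal sketch in Python, using only the token rates reported at the event:

```python
# Back-of-envelope check of the throughput figures reported at the demo.
cerebras_tps = 1832  # tokens per second reported for Cerebras
aws_tps = 93         # tokens per second reported for the AWS-hosted service

speedup = cerebras_tps / aws_tps
print(f"Reported speedup: {speedup:.1f}x")  # ~19.7x, in line with the quoted "20 times"
```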

According to Feldman, faster inference means real-time interactive voice responses become possible, and answers can be made more accurate and relevant by chaining multiple rounds of results, pulling in more external sources, and handling longer documents, a qualitative leap for AI inference.

Beyond the speed advantage, Cerebras also claims a large cost advantage. Feldman said Cerebras' AI inference service is 100 times more cost-effective than AWS and other providers. Take running Meta's open-source Llama 3.1 70B large language model as an example: the service is priced at just 60 cents per million tokens, while the same service from a typical cloud provider costs $2.90 per million tokens.
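The per-million-token prices alone imply only about a 5x gap, so one plausible reading of the "100 times" figure is price advantage multiplied by speed advantage. The sketch below makes that assumption explicit; it is an interpretation, not Cerebras' stated methodology:

```python
# Hypothetical reading of the "100 times more cost-effective" claim:
# treat cost-effectiveness as price advantage multiplied by speed advantage.
# This methodology is an assumption, not something the article spells out.
cerebras_price = 0.60  # USD per million tokens for Llama 3.1 70B, as quoted
cloud_price = 2.90     # USD per million tokens at a typical cloud provider, as quoted

price_ratio = cloud_price / cerebras_price  # ~4.8x cheaper per token
speed_ratio = 20                            # the 20x speed claim quoted earlier

print(f"Price advantage:     {price_ratio:.1f}x")
print(f"Price x performance: {price_ratio * speed_ratio:.0f}x")  # ~97x, close to the quoted 100x
```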

56 times the area of the current largest GPU

Cerebras' AI inference service is fast and cheap because of the design of its WSE-3 chip, the third-generation processor Cerebras launched in March this year. The chip is enormous, covering almost an entire 12-inch semiconductor wafer and larger than a book, with a single-die area of about 462.25 square centimeters. That is 56 times the area of the largest GPU available today.
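For context, the 56x figure lines up with the stated chip area if the comparison point is a die of roughly 814 square millimeters, the publicly cited size of Nvidia's H100 (that comparison die size is an assumption, not something given in the article):

```python
# Rough area comparison between the WSE-3 and a large GPU die.
wse3_area_mm2 = 462.25 * 100  # 462.25 cm^2 = 46,225 mm^2
h100_die_mm2 = 814            # assumed Nvidia H100 die area in mm^2 (publicly cited figure)

print(f"WSE-3 vs. H100 die area: {wse3_area_mm2 / h100_die_mm2:.1f}x")  # ~56.8x, matching the ~56x claim
```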


Unlike Nvidia's GPUs, the WSE-3 does not rely on separate high-bandwidth memory (HBM) that must be accessed over an interface; instead, the memory is built directly into the chip.


Thanks to the chip's size, the WSE-3 carries 44 GB of on-chip memory, almost 900 times that of the Nvidia H100, and its memory bandwidth is 7,000 times that of the H100.

According to Feldman, memory bandwidth is the fundamental factor limiting the inference performance of language models. Cerebras, by contrast, integrates logic and memory into one giant chip, giving it huge on-chip memory and extremely high memory bandwidth, so it can move data and generate inference results quickly. “It's a speed a GPU can't reach.”
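A back-of-envelope calculation shows why bandwidth is the bottleneck: generating each token requires streaming essentially all of the model's weights through the compute units, so single-stream throughput is capped at memory bandwidth divided by model size. The numbers below (16-bit Llama 3.1 70B weights, roughly 3.35 TB/s of HBM bandwidth on an H100) are assumptions drawn from public specifications, not from the article:

```python
# Why inference is memory-bandwidth-bound: to generate each token, the model's
# weights must (roughly) all be streamed through the compute units once, so
# single-stream throughput is capped at bandwidth / model size.
model_bytes = 70e9 * 2    # Llama 3.1 70B at 16-bit precision, ~140 GB (assumed)
hbm_bandwidth = 3.35e12   # assumed H100 HBM bandwidth in bytes/s (~3.35 TB/s)

max_tps = hbm_bandwidth / model_bytes
print(f"Upper bound for one H100 at batch size 1: ~{max_tps:.0f} tokens/s")
# ~24 tokens/s, far below the 1,832 tokens/s Cerebras reported, which is why
# holding the weights in very high-bandwidth on-chip memory matters so much.
```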

Beyond speed and cost, the WSE-3 is an all-rounder for both AI training and inference, handling a wide range of AI tasks well.

According to its plans, Cerebras will set up AI inference data centers in multiple locations and charge for inference capacity by the number of requests. At the same time, it will also sell WSE-3-based CS-3 computing systems to cloud service providers for trial use.

Disclaimer: This content is for informational and educational purposes only and does not constitute a recommendation or endorsement of any specific investment or investment strategy.