

In conversation with Academician Zheng Weimin: training large AI models on supercomputers costs only 1/6 as much as on Nvidia

Sina Technology ·  Dec 31, 2024 12:35

Text | Zhou Wenmeng, Sina Technology

As companies train large AI models, purchasing supercomputing services or storage is emerging as a new way to relieve computing power anxiety.

Recently, Zheng Weimin, an academician of the Chinese Academy of Engineering and a professor of computer science at Tsinghua University, told Sina Technology: “Training a large model used to cost billions, but doing that training on a supercomputer costs only one sixth of what it does on Nvidia hardware.”

Zheng Weimin also pointed out a new trend in AI inference: “trading storage for compute.” Taking Mooncake, the technology framework jointly developed by Tsinghua University and the AI unicorn Moonshot AI, as an example, he explained how the idea of trading storage for compute helped Moonshot AI's Kimi assistant relieve tight computing power demands and avoid server downtime.

“Moving large model training to supercomputers costs only 1/6 of Nvidia”

Zheng Weimin observed that, with technology companies around the world racing to catch up after the release of ChatGPT, this year's large models show two characteristics. First, foundation models have gone multimodal, handling not only text but also images and video. Second, they are being put to real use, integrated with one industry after another: large model + finance, large model + healthcare, large model + intelligent manufacturing, and so on.

“Large models are becoming genuinely intertwined with the national economy and people's livelihood. I have always felt that our foundation models still lag slightly behind the US, but with 'large model +' applications we still have a chance to overtake the US,” Zheng Weimin said.

However, in real deployments a large model consumes enormous computing resources across its entire life cycle, which spans five steps: data acquisition, data preprocessing, model training, model fine-tuning, and model inference. How to obtain more efficient and reliable computing power for large AI models at lower cost has become a question every company is weighing.

Because high-end chips are difficult to obtain from overseas, one of the main domestic answers to the massive compute demands of large model training has been to build thousand-card and ten-thousand-card clusters, buying chips from multiple vendors and stacking them for joint training across heterogeneous cards. According to Zheng Weimin, however, while this approach eases the scarcity of computing power, it has real drawbacks.

First, building a domestic ten-thousand-card system is certainly worthwhile, but using it well is very hard. Zheng Weimin drew on his own experience in high-performance computing: “We built a 2,000-card system with 1,000 Nvidia chips and another 1,000 from other manufacturers. The system was built and it ran, but we found that the chips differed in performance, some weaker, some stronger. Split one task into 2,000 equal parts and hand them out, and the fast cards end up waiting for the slow ones; overall performance comes out very low.”

Zheng Weimin pointed out that large computing clusters suffer from a barrel effect: some cards are strong and others weak, and just as the water a barrel holds is set by its shortest stave, the longest stave is of no help. “So when 1,000 old GPUs are combined with 1,000 new GPUs, performance comes out a little lower than 2,000 old GPUs alone, while the cost of building such a cluster is also quite high.”
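
Zheng's barrel effect is easy to reproduce with arithmetic. The sketch below, with made-up relative throughputs for “new” and “old” cards, compares a static equal split of one training step against a throughput-proportional split; the split policy and the numbers are illustrative assumptions, not figures from the interview.

```python
# Illustrative sketch of the "barrel effect" in a heterogeneous cluster.
# Throughput figures are made-up assumptions for demonstration only.

def makespan(work_units, shares, throughputs):
    """Time to finish a step when each card gets shares[i] of the work."""
    return max(work_units * s / t for s, t in zip(shares, throughputs))

n_new, n_old = 1000, 1000
t_new, t_old = 2.0, 1.0             # assumed relative throughput per card
cards = [t_new] * n_new + [t_old] * n_old
work = 1_000_000                    # total work units in one training step

# Static equal split: every card gets the same slice, so the step
# finishes only when the slowest (old) cards finish.
equal = [1 / len(cards)] * len(cards)
t_static = makespan(work, equal, cards)

# Proportional split: slices sized to each card's throughput.
total = sum(cards)
prop = [t / total for t in cards]
t_prop = makespan(work, prop, cards)

print(f"static equal split : {t_static:.1f} time units")
print(f"proportional split : {t_prop:.1f} time units")
# With equal splits, the mixed 2,000-card cluster runs at the pace of its
# slowest cards -- the "short stave" that bounds the whole barrel.
```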

In Zheng Weimin's view, large-scale joint training across heterogeneous cards is very difficult and uneconomical, and it gets even harder when the cards sit in different regions. Data is sent from Beijing to Guizhou, Guizhou computes results and sends them on to Shanghai, and the time costs involved are extremely high. “Those without much money shouldn't bother; those with more money can give it a try.”
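
The time cost he alludes to can be estimated with back-of-the-envelope numbers. In the sketch below, the checkpoint size and link bandwidths are illustrative assumptions, not figures from the interview.

```python
# Back-of-the-envelope cost of cross-region training traffic.
# Sizes and bandwidths are assumptions chosen for illustration.

def transfer_seconds(size_gb: float, bandwidth_gbps: float) -> float:
    """Seconds to move size_gb gigabytes over a bandwidth_gbps link."""
    return size_gb * 8 / bandwidth_gbps

checkpoint_gb = 900   # assumed size of one large-model checkpoint
wan_gbps = 10         # assumed dedicated inter-city bandwidth
lan_gbps = 1600       # assumed aggregate in-cluster bandwidth

print(f"Beijing -> Guizhou over WAN: {transfer_seconds(checkpoint_gb, wan_gbps)/60:.0f} min")
print(f"Within one cluster         : {transfer_seconds(checkpoint_gb, lan_gbps):.1f} s")
# Training repeats such gradient/checkpoint exchanges constantly, so a link
# two orders of magnitude slower makes cross-region joint training
# impractical for most budgets.
```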

Zheng Weimin suggested that companies try training large AI models on supercomputers. “Our country has 14 national supercomputing systems. The machines were paid for by the state, and some have spare capacity. Domestic supercomputing is cheap, unlike Nvidia, which has to recoup the cost of its machines and make a profit, so it is entirely feasible for everyone to move their model training onto the Shenwei (Sunway) supercomputer in Qingdao,” Zheng Weimin said.

“Trading storage for compute can effectively reduce AI inference costs”

Indeed, across a large model's life cycle of data acquisition, data preprocessing, model training, model fine-tuning, and model inference, the massive computing resources required must be paired with massive storage resources to hold the results of all that computation. In model inference especially, how to store more, move data faster, and keep costs down has become a question the whole industry is working on together.

Earlier, Zheng Weimin remarked publicly: “AI storage is a key foundation for large AI models. The storage system is present at every stage of a large model's life cycle and underpins the whole of it. By trading storage for compute, advanced AI storage can improve the availability of training clusters, reduce inference costs, and enhance the user experience.”

In his conversation with Sina Technology, Zheng Weimin shared the basic principle behind trading storage for compute. He pointed out: “Whether for training or inference, large models need a great deal of computing power, and also a great deal of storage to hold the massive trained parameters, as well as the intermediate data generated during inference.” But if ever more data must be kept throughout training and inference, memory becomes scarce, and storage turns into a drag on large model performance.
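
To get a feel for how that intermediate inference data crowds out memory, one can size the KV cache a transformer keeps per conversation. The model dimensions below are assumptions roughly in the range of a 70B-class open model, not figures from the interview.

```python
# Rough KV-cache sizing for transformer inference.
# Model dimensions are assumed for illustration (not from the interview).

def kv_cache_gb(layers, kv_heads, head_dim, seq_len, batch, bytes_per=2):
    """GB needed to cache keys and values for `batch` active sequences."""
    # 2x for keys + values; fp16 => 2 bytes per element.
    elems = 2 * layers * kv_heads * head_dim * seq_len * batch
    return elems * bytes_per / 1e9

# Assumed 70B-class model: 80 layers, 8 KV heads (GQA), head_dim 128.
per_user = kv_cache_gb(layers=80, kv_heads=8, head_dim=128,
                       seq_len=32_000, batch=1)
print(f"KV cache per 32k-token session: {per_user:.1f} GB")
print(f"100 concurrent sessions       : {per_user * 100:.0f} GB")
# A single GPU's tens of GB fill up fast, which is why spilling this cache
# to host memory, and reusing it across requests, pays off.
```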

According to Zheng Weimin, Tsinghua University has come up with two answers to this problem. First, inference today mostly relies on the inference card alone, leaving the host CPU and host memory idle; by finding ways to use host memory during inference, memory utilization rises and performance improves, while saving the capital cost of continually buying more inference cards. Second, by storing the data generated during inference, later requests that hit similar questions can reuse it directly, skipping a fresh round of inference each time, which improves efficiency and saves resources.
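
A minimal sketch of both ideas under assumptions of ours: a two-tier cache that spills least-recently-used entries from scarce GPU memory into abundant host memory, and a lookup path that returns stored results instead of recomputing them. The class and all names are hypothetical illustrations, not Mooncake's actual API.

```python
# Minimal two-tier cache sketch: spill from (small) GPU memory to (large)
# host memory, and reuse stored inference state for repeated inputs.
# All names here are hypothetical illustrations, not Mooncake's real API.

from collections import OrderedDict

class TieredKVCache:
    def __init__(self, gpu_capacity: int):
        self.gpu = OrderedDict()   # fast tier: GPU memory (LRU order)
        self.host = {}             # big tier: host CPU memory
        self.gpu_capacity = gpu_capacity

    def put(self, key, value):
        self.gpu[key] = value
        self.gpu.move_to_end(key)
        # Spill least-recently-used entries to host memory instead of
        # buying more inference cards to hold them.
        while len(self.gpu) > self.gpu_capacity:
            old_key, old_val = self.gpu.popitem(last=False)
            self.host[old_key] = old_val

    def get(self, key):
        if key in self.gpu:
            self.gpu.move_to_end(key)
            return self.gpu[key]
        if key in self.host:               # promote back to the fast tier
            self.put(key, self.host.pop(key))
            return self.gpu[key]
        return None

cache = TieredKVCache(gpu_capacity=2)
for q in ["q1", "q2", "q3"]:
    cache.put(q, f"kv-state({q})")
assert "q1" in cache.host                  # q1 was spilled to host memory
assert cache.get("q1") == "kv-state(q1)"   # reused without recomputation
```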

Taking Mooncake, the framework jointly developed by Tsinghua University and Moonshot AI, as an example, Zheng Weimin noted: “By extracting and storing the content common to different users' conversations with Kimi, we not only avoid regenerating it every time a user asks a question, saving a large number of compute cards, but also reduce the 'access latency' and 'downtime' problems caused by Kimi's heavy traffic.”
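
The Kimi example amounts to computing the expensive prefill over shared conversation content once and reusing it across users. A hedged sketch of that principle, with `prefill` as a made-up stand-in for the real model computation:

```python
# Sketch of prefix reuse across users: run the costly prefill once for
# shared prompt content, then process only each user's unique suffix.
# `prefill` is a made-up stand-in for the real (expensive) computation.

prefix_cache: dict[str, str] = {}

def prefill(tokens: str) -> str:
    print(f"  prefill over {len(tokens)} chars (expensive)")
    return f"kv({len(tokens)})"

def answer(shared_prompt: str, user_question: str) -> str:
    # Content shared across users is cached the first time it is seen...
    if shared_prompt not in prefix_cache:
        prefix_cache[shared_prompt] = prefill(shared_prompt)
    shared_kv = prefix_cache[shared_prompt]
    # ...so each request pays only for its own unique suffix.
    return f"decode({shared_kv} + {prefill(user_question)})"

common = "You are Kimi, a helpful assistant. " * 50   # long shared prefix
answer(common, "What is the barrel effect?")
answer(common, "How big is a KV cache?")   # shared prefix is NOT re-prefilled
```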
