Starburst Announces 100GB/second Streaming Ingest From Apache Kafka to Apache Iceberg Tables

PR Newswire ·  10/24 08:00

Go from data ingestion to blazing-fast SQL analytics in near real-time with the Starburst Open Hybrid Lakehouse

BOSTON, Oct. 24, 2024 /PRNewswire/ -- Starburst, the Trino company, today announced a range of new capabilities for Galaxy, its Trino-based open hybrid lakehouse platform: the general availability of fully managed streaming ingestion from Apache Kafka to Apache Iceberg tables; the public preview of fully managed ingestion from files landing in Amazon Web Services (AWS) S3 to Iceberg tables; and multiple enhancements to the platform's performance and price-performance. Galaxy customers can now easily configure ingestion and load data at a verified scale of up to 100GB/second per Iceberg table at leading price-performance. In addition, Galaxy users can now benefit from faster and more accurate auto-scaling of resources, simplified policy-based routing of user queries, and enhanced performance through improved automatic caching and indexing.

Businesses that need data available for analytics in their cloud data lake with minimal delay have traditionally built complex ingestion systems, cobbling together multiple tools and writing custom software to stream data into the lake. Alternatively, these organizations may rely on incomplete solutions that handle only the ingestion process. Both approaches tend to be fragile, difficult to scale, and costly to maintain, and they solve only part of the problem. After the data lands in the lake, it still needs to be transformed and optimized for efficient querying, which requires even more code, pipelines, tools, and complexity. In addition, pressure to optimize costs across analytics functions is increasing. CIOs are looking for ways to reduce operational overhead relative to traditional lakehouses and legacy data warehouses while maintaining control of their data and analytics stack.

"As businesses strive to perform analytics on real-time data, they seek frictionless solutions for continuous data ingestion. They also prioritize open standards like Apache Iceberg to future-proof their environments amid rapidly evolving technologies. Furthermore, reducing complexity and simplifying architectures is critical, helping organizations optimize IT investments and avoid unnecessary costs associated with integrating disparate systems," said Sanjeev Mohan, Principal and Founder of SanjMo. "Starburst's latest announcements are significant because they address these exact needs—delivering improved price performance, simplicity, and efficient elastic scaling for modern data workloads."

Streaming Ingest from Kafka (general availability) - Starburst now enables the easy creation of fully managed ingestion pipelines for Kafka topics at a verified scale up to 100GB/second, at half the cost of alternative solutions. Configuration is completed in minutes and simply entails selecting the Kafka topic, the auto-generated table schema, and the location of the resulting Iceberg table.

  • Starburst Galaxy's streaming ingestion is serverless and does the heavy lifting without any manual configuration, tuning, or additional tools required by the customer. Galaxy automatically ingests incoming messages from Kafka topics into managed Iceberg tables in S3, compacts and transforms the data, applies the necessary governance, and makes it available to query within about one minute.
  • Starburst's streaming ingestion can connect to Kafka-compliant systems, which includes Confluent Cloud, Amazon Managed Streaming for Apache Kafka (MSK), and Apache Kafka.
  • Starburst guarantees exactly-once delivery: no message is read twice and none is missed, so the ingested data stays accurate.
  • It is built for massive scale and has been tested to ingest 100 gigabytes of streaming data per second.
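
To make the exactly-once guarantee in the list above concrete, here is a minimal Python sketch of one common way such a guarantee can be achieved: the sink records the next expected Kafka offset per partition together with the rows it commits, skips redelivered messages, and refuses to commit past a gap. This illustrates the general technique only and is not Starburst's implementation; the class `ExactlyOnceSink` and its method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ExactlyOnceSink:
    """Toy sink (hypothetical): rows and offsets are committed together."""

    next_offset: dict = field(default_factory=dict)  # partition -> next expected offset
    rows: list = field(default_factory=list)         # stands in for the table contents

    def write_batch(self, partition, batch):
        """Commit a batch of (offset, payload) pairs given in ascending offset order.

        Returns the number of newly committed rows.
        """
        expected = self.next_offset.get(partition, 0)
        accepted = []
        for offset, payload in batch:
            if offset < expected:
                continue  # redelivered message, already committed: skip the duplicate
            if offset > expected:
                # A missing offset means a message would be lost: commit nothing.
                raise ValueError(f"gap: expected offset {expected}, got {offset}")
            accepted.append(payload)
            expected += 1
        # In a real system this would be a single atomic table commit.
        self.rows.extend(accepted)
        self.next_offset[partition] = expected
        return len(accepted)


sink = ExactlyOnceSink()
sink.write_batch(0, [(0, "a"), (1, "b")])
sink.write_batch(0, [(1, "b"), (2, "c")])  # offset 1 is a redelivery and is skipped
print(sink.rows)  # ['a', 'b', 'c']
```

Storing the offset alongside the data is what makes retries safe: after a failure the reader can resume from the committed offset without writing a message twice.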

Ingest from Files landing in S3 (public preview) - Additionally, Starburst is expanding its ingestion capabilities by introducing file loading, offering customers a powerful, automated alternative to DIY or off-the-shelf solutions. This feature reads, parses, and writes records from files directly into Iceberg tables, which leverage the new ingestion capabilities to automatically optimize the tables for read performance through capabilities like compaction, snapshot retention, orphaned file removal, and statistics collection. The public preview of file loading will be available in November 2024.

Enhanced Auto Scaling (general availability) - Starburst makes auto scaling smarter in Starburst Galaxy. In environments with many concurrent users, demand for compute resources can fluctuate quickly. The enhanced Auto Scaling monitors both active and pending queries to determine how much compute each query needs, and allocates those resources up to 50% faster. Beyond provisioning additional compute faster, it can also automatically reactivate draining worker nodes, improving resource utilization.
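
As a rough illustration of the idea described above, the following Python sketch sizes a cluster from both active and pending queries and, when scaling up, reuses draining workers before provisioning new ones. It is a simplified model under stated assumptions, not Galaxy's actual algorithm; the function names, the `queries_per_worker` ratio, and the worker limits are all hypothetical.

```python
import math


def desired_workers(active, pending, queries_per_worker=4,
                    min_workers=1, max_workers=20):
    """Target worker count from total query demand (all parameters are assumed)."""
    demand = active + pending  # pending queries count too, not only running ones
    needed = math.ceil(demand / queries_per_worker)
    return max(min_workers, min(max_workers, needed))


def plan_scale_up(target, running, draining):
    """Return (reactivate, provision): reuse draining workers before adding new ones."""
    shortfall = max(0, target - running)
    reactivate = min(shortfall, draining)
    provision = shortfall - reactivate
    return reactivate, provision


target = desired_workers(active=6, pending=10)
print(target)                                       # 4
print(plan_scale_up(target, running=2, draining=1))  # (1, 1)
```

Reactivating a draining worker is cheaper than starting a fresh one because the node is already running, which is one plausible reason the release highlights it.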

Next Gen Caching (private preview) - Data engineers undertake various labor-intensive data preparation tasks. Starburst Warp Speed helps automate some of those tasks. Still, as business needs evolve and teams turn to a semantic layer approach with tools like dbt, data engineers struggle to provide fast query performance, scalability, and stability for BI and dashboarding without significant overhead. The next-generation caching in Starburst Galaxy applies Warp Speed's smart indexing and caching capabilities to intermediate workload results. Warp Speed will now be able to identify patterns of similar subqueries across different workloads, improving performance by up to 62% compared to non-accelerated queries.

User Role Based Routing (private preview) - Previously, users would spend too much effort determining which queries were appropriate for different cluster types. Also, administrators weren't able to assign groups of users to a cluster via roles and privileges. With User Role Based Routing, Starburst now supports the easy allocation of resources by cluster type. Customers can programmatically route queries to the appropriate Galaxy cluster based on a predefined set of rules. Users can send all queries to a single URL, which will route the queries based on the user's role, minimizing human intervention while improving what is already industry-leading price-performance against other leading cloud data warehouses and lakehouses.
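
The rule-based routing described above can be pictured with a small Python sketch: an ordered list of rules maps a role to a cluster, the first matching rule wins, and unmatched users fall back to a default. This is only an illustration of the concept; the role names, cluster names, and the `route_query` function are hypothetical and do not reflect Galaxy's configuration format or API.

```python
# Ordered rules: the first role that matches decides the cluster (names are made up).
ROUTING_RULES = [
    ("data_engineer", "etl-cluster"),
    ("analyst", "bi-cluster"),
]


def route_query(user_roles, rules=ROUTING_RULES, default="general-cluster"):
    """Pick the cluster for a query based on the submitting user's roles."""
    for role, cluster in rules:
        if role in user_roles:
            return cluster
    return default


print(route_query({"analyst"}))                   # bi-cluster
print(route_query({"analyst", "data_engineer"}))  # etl-cluster (earlier rule wins)
print(route_query({"intern"}))                    # general-cluster
```

Keeping the rules in one ordered list means every query can go to the same entry point, and changing who runs where is a rule edit rather than a change on each client.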

"With our new ingestion capabilities to Iceberg, customers don't have to worry about how fast or how much data they need to land in their data lake. At 100GB/second, Galaxy's ingestion can handle the scale of the most demanding use cases. Because it is so easy to configure and cost-effective to operate, customers don't have to artificially limit the number of up-to-date, fresh tables in their lake, enabling them to make the most informed business decisions," said Tobias Ternstrom, Starburst's Chief Product Officer.

Supporting Resources

For more information, read Starburst's Icehouse launch blog.
Download an image of the Starburst Open Data Lakehouse here.

About Starburst

Starburst, the Open Hybrid Lakehouse, is the leading end-to-end data platform to securely access, analyze, and share data for analytics and AI across hybrid, on-premises, and multi-cloud environments. As the leaders in Trino, a modern open-source SQL engine, Starburst empowers the most data-intensive and security-conscious organizations like Comcast, Halliburton, Vectra, EMIS Health, and 7 of the top 10 global banks to democratize data access, enhance analytics performance, and improve architecture optionality. With the Open Hybrid Lakehouse from Starburst, enterprises globally can easily discover and use all their relevant business data to power new applications and analytics across risk mitigation, supply chain, customer experiences, product optimization, streaming, and more.

For additional information, please visit

SOURCE Starburst
