UPDATED 16:54 EST / JANUARY 16 2025

BIG DATA

Onehouse says its runtime accelerator can speed data lakehouse operations up to 30-fold

Managed open data lakehouse provider Onehouse Inc. today released a runtime engine that it said can accelerate workloads across the most popular open data lake table formats up to 30-fold.

The company said it achieves these results by deeply understanding common data workloads such as ingestion and transformation and implementing specialized optimizations for those workloads.

Onehouse said customers using the Onehouse Compute Runtime see at least double the query performance on open data lakehouse table formats and between 20% and 80% reductions in cloud infrastructure costs for operations such as data ingestion, table optimizations and extract/transform/load operations. The technology is an integral part of the Onehouse platform and won’t be released to open source as much of the optimization is done through a tightly integrated runtime.

The OCR supports Amazon Web Services Inc.’s Redshift, Google LLC’s BigQuery, Databricks Inc.’s platform and Snowflake Inc.’s cloud data warehouse. It also works with the open-source Apache Hudi, Apache Iceberg and Delta Lake table formats.

Integration is at the catalog level with support for Databricks’ proprietary Unity Catalog, AWS’ Glue, the Hive metastore, Snowflake’s Horizon and Google’s Data Catalog. Support for the open-source Unity Catalog is planned. “We pretty much support every major catalog in the market today,” said Vinoth Chandar, Onehouse’s chief executive.

Dynamic adaptation

The software dynamically adapts performance and configurations to users’ workload patterns at runtime to achieve efficiency beyond what current open-source optimizers offer, Onehouse said.

“We’ve implemented some core lakehouse operations like merging data to have a much more optimized implementation in our runtime,” Chandar said. “It can scale that operation linearly with the input size as opposed to the amount of data in storage. It’s understanding deeply the lakehouse workload and recognizing when such workloads are running.”
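As a rough illustration of what it means for a merge to scale with input size rather than with the amount of data in storage, the sketch below routes each incoming record through a key index so that only the files holding affected keys are touched. It is a toy model, not Onehouse's implementation; the function `merge_with_index` and the `files`, `key_index` and `insert_file` names are hypothetical.

```python
from collections import defaultdict


def merge_with_index(files, key_index, updates, insert_file="f-new"):
    """Apply `updates` (key -> row) to a toy table.

    files:     dict of file_id -> {key: row}   (hypothetical storage layout)
    key_index: dict of key -> file_id          (hypothetical record index)

    Only files that contain an updated key are rewritten, so the work is
    proportional to len(updates), not to the total number of stored rows.
    Returns the set of file ids that were touched.
    """
    # Group incoming rows by the file that already holds their key.
    # Keys not found in the index are treated as inserts into `insert_file`.
    grouped = defaultdict(dict)
    for key, row in updates.items():
        grouped[key_index.get(key, insert_file)][key] = row

    # Rewrite only the affected files and keep the index in sync.
    for file_id, rows in grouped.items():
        files.setdefault(file_id, {}).update(rows)
        for key in rows:
            key_index[key] = file_id

    return set(grouped)
```

With three files in storage and one updated key, only the single file containing that key is rewritten; the other two are never read.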

Onehouse built OCR for its own data lakehouse platform, which is based on Apache Hudi. To expand availability to other platforms, it extended support to the incubating Apache XTable project, which enables interoperability among data lakehouse table formats.

The software optimizes in three major areas.

A serverless compute manager provides elastic cluster scaling to accommodate spikes and swings in workloads. It manages multiple clusters for flexible resource allocation and isolation and supports custom cloud security, sovereignty and flexibility.

Adaptive workload optimizations tune execution to the characteristics of the workload using a multiplexed job scheduler that shares computing resources across multiple jobs in parallel. The runtime takes over the task of assigning jobs to underlying clusters, and because the scheduler packs multiple jobs onto shared clusters, it reduces the compute resources needed for any given workload.

“If the scheduler needs a new cluster, it can create a new cluster; if it doesn’t, it can fit to an existing one based on how well that cluster is doing,” Chandar said. “We are taking that complexity out of the user’s hands.”
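The placement decision Chandar describes, reuse a cluster that has room or create a new one, can be sketched in a few lines. This is a minimal sketch under stated assumptions, not Onehouse's scheduler: capacity is modeled as a fixed number of slots per cluster, the placement rule is a simple best-fit, and the `Cluster` and `MultiplexedScheduler` names are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    name: str
    capacity: int                              # total slots (assumed unit)
    jobs: list = field(default_factory=list)   # (job_name, slots) pairs

    @property
    def free(self):
        return self.capacity - sum(slots for _, slots in self.jobs)


class MultiplexedScheduler:
    """Hypothetical scheduler: share clusters across jobs, grow only when needed."""

    def __init__(self, cluster_capacity=8):
        self.cluster_capacity = cluster_capacity
        self.clusters = []

    def submit(self, job_name, slots):
        if slots > self.cluster_capacity:
            raise ValueError("job does not fit on any single cluster")

        # Prefer an existing cluster with enough room (best fit: least slack).
        candidates = [c for c in self.clusters if c.free >= slots]
        if candidates:
            target = min(candidates, key=lambda c: c.free)
        else:
            # No existing cluster can take the job, so create a new one.
            target = Cluster(f"cluster-{len(self.clusters) + 1}",
                             self.cluster_capacity)
            self.clusters.append(target)

        target.jobs.append((job_name, slots))
        return target.name
```

In this toy model, an ingestion job needing five slots and a compaction job needing three share one eight-slot cluster, and a second cluster is created only when a third job no longer fits.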

Lag-aware scheduling enforces service-level agreements for latency, and performance profiles balance write and query performance.
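One plausible reading of lag-aware scheduling is that the table furthest behind its latency target gets served first. The sketch below ranks tables by the ratio of current lag to agreed latency; the function name, the inputs and the ratio rule are all assumptions made for illustration, not a description of how OCR works.

```python
def next_table(lag_seconds, sla_seconds):
    """Pick the table whose lag is largest relative to its latency SLA.

    lag_seconds: dict of table -> how far the table currently trails its source
    sla_seconds: dict of table -> maximum lag the SLA allows
    Both inputs are hypothetical; a ratio above 1.0 means the SLA is breached.
    """
    return max(lag_seconds, key=lambda t: lag_seconds[t] / sla_seconds[t])
```

A table that is 120 seconds behind a 60-second target would be scheduled ahead of one that is 300 seconds behind a 600-second target, even though the second has the larger absolute lag.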

High-performance lakehouse input/output uses vectorized columnar merging for faster writes, parallel pipelined execution to maximize CPU efficiency and storage optimization to reduce network requests.

Chandar said users won’t see much performance improvement in simple operations such as writing a Parquet file, but “most people today write data pipelines, which are pumping data constantly into tables,” he said. “For all these modern workloads there are plenty of opportunities for a runtime to be aware of the workload and change cost/performance in a meaningful way.”

Chief Marketing Officer Gaetan Castelein said the debates over table formats and catalogs that have dominated the lakehouse conversation over the past two years overlook the greater value that can be found at other levels of the computing stack.

“A table format is just a table format,” he said. “It doesn’t matter that much. Table services and optimization are much more critical to the success of a lakehouse than the choice of table.”

Photo: Pok_Rie/Pixabay
