URL: https://thenewstack.io/how-to-get-data-warehouse-performance-on-the-data-lakehouse/

How to Get Data Warehouse Performance on the Data Lakehouse - The New Stack



How to Get Data Warehouse Performance on the Data Lakehouse

A novel approach to the data lakehouse combines all the advantages of lakehouse analytics with the high performance of a data warehouse.
Nov 20th, 2023 9:00am by Sida Shen
Featured image by Lo Sarno on Unsplash.
CelerData sponsored this post.

Data lakehouse architectures continue to grow in popularity, and that should come as no surprise. Their potential for seamlessly integrating the best features of data lakes and data warehouses promises a transformative experience for data processing and analysis. Yet this approach has shortcomings. This article examines those challenges, chiefly query performance and high costs, and identifies new technologies that are helping data lakehouses tackle them.

The Status Quo of Analytics on the Data Lakehouse

Data lakehouses have enticed numerous enterprises with the promise of flexibility, scalability and cost effectiveness. The reality, however, is that current lakehouse query engines struggle to deliver the query performance that low-latency or high-concurrency analytics need at scale. The query engines that power these data lakehouses fall into two camps. On the one hand, we have engines optimized for extract, transform and load (ETL) workflows, which execute queries stage by stage. On the other hand, we see engines that do not leverage modern optimizations such as single instruction, multiple data (SIMD) instruction sets, which are essential for harnessing the full power of modern CPUs.

This inherent performance limitation has pushed most users to copy their data from the lakehouse into proprietary data warehouses to achieve their desired query performance. But this is a costly workaround.

Cost #1: Data Ingestion Is Expensive

Figure 1: A common data lake pipeline

At first glance, ingesting data into a data warehouse seems like a straightforward procedure, but it’s far from it. The data must be converted into the warehouse’s specific format, a task that demands considerable hardware resources. Moreover, this replication stores the same data twice, which is expensive in both cost and space.

It’s not just the physical resources either; the human effort demanded is equally significant. Tasks that seem mundane, such as aligning data types between the two systems, can drain resources. Furthermore, this data ingestion process inadvertently introduces latency, undermining the freshness of your data.

Cost #2: The Data Ingestion Pipeline Is Bad for Data Governance

The integrity and accuracy of data are paramount for any enterprise. Ironically, the very act of ingesting data into another warehouse, which should technically amplify its utility, poses serious challenges to data governance. How can you ensure all copies are consistently updated? How can you prevent discrepancies among different copies? And how can you do this while maintaining strong data governance? These are not just theoretical questions; they are serious technical challenges that require significant engineering effort and, when done incorrectly, have the potential to impact the veracity of your data-driven decisions.
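
To make the consistency question concrete, here is a minimal sketch of one way to check whether a warehouse copy still matches its lakehouse source. It assumes both tables can be read as plain tuples of values; the function names and sample rows are hypothetical and do not come from any particular product.

```python
import hashlib
from typing import Iterable, Tuple

Row = Tuple[object, ...]

def table_fingerprint(rows: Iterable[Row]) -> Tuple[int, str]:
    """Return (row_count, digest) for a table, independent of row order."""
    count = 0
    combined = 0
    for row in rows:
        # Hash every row on its own, then add the hashes modulo 2**256.
        # Addition is commutative, so row order does not affect the result,
        # and duplicate rows still change it (unlike XOR).
        digest = hashlib.sha256(repr(row).encode("utf-8")).digest()
        combined = (combined + int.from_bytes(digest, "big")) % (1 << 256)
        count += 1
    return count, format(combined, "064x")

def copies_match(lake_rows: Iterable[Row], warehouse_rows: Iterable[Row]) -> bool:
    """True when both copies hold the same multiset of rows."""
    return table_fingerprint(lake_rows) == table_fingerprint(warehouse_rows)

# Hypothetical sample data: one up-to-date copy and one stale copy.
lake = [(1, "alice", 30.0), (2, "bob", 12.5)]
warehouse_fresh = [(2, "bob", 12.5), (1, "alice", 30.0)]
warehouse_stale = [(1, "alice", 30.0), (2, "bob", 10.0)]

print(copies_match(lake, warehouse_fresh))
print(copies_match(lake, warehouse_stale))
```

Even this toy check has to scan both copies in full, and it treats `30` and `30.0` as different values, so a mismatch in data types between the two systems shows up as drift. Both effects hint at why keeping copies aligned takes real engineering effort.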

A Modern Approach: The Pipeline-Free Data Lakehouse

The inherent challenges of data lakehouse query performance and the use of proprietary data warehouses as workarounds are pushing an increasing number of enterprises to seek out more efficient alternatives. One popular approach has been to adopt an ingestion-free lakehouse architecture. Here’s how this works.

An MPP Architecture with In-Memory Data Shuffling

Data lake query engines shuffle data between nodes to scale complex join operations and aggregations. However, many data lakehouse engines were originally designed for data transformation and ad hoc queries on data lakes’ diverse and affordable storage, and they persist intermediate results to disk. This suits batch jobs, but the disk-based shuffling adds latency that hampers the lakehouse’s evolving workloads, especially real-time, customer-facing queries.

Figure 2: MPP vs. MapReduce framework

To navigate this challenge and run low-latency queries directly on the data lakehouse, embracing massively parallel processing (MPP) query engines equipped with in-memory data shuffling is a smart move. Unlike traditional approaches, in-memory shuffling bypasses disk persistence entirely, so downstream stages of a query do not wait on disk writes and reads. This is pivotal for achieving low query latency and getting timely insights directly from the data lakehouse.
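
The shuffle step can be illustrated with a small single-process simulation. This is a sketch under simplifying assumptions, not how any specific engine is implemented: the "workers" are plain Python lists, the network is a list of in-memory inboxes, and every name is hypothetical.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[str, int]  # (group key, value)

def partition_of(key: str, num_workers: int) -> int:
    """Pick the worker that owns a key. crc32 is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_workers

def shuffle_in_memory(worker_inputs: List[List[Row]]) -> List[List[Row]]:
    """Route every row to the inbox of its owning worker.

    Rows move from buffer to buffer in memory; nothing is written to disk
    between the scan stage and the aggregation stage.
    """
    num_workers = len(worker_inputs)
    inboxes: List[List[Row]] = [[] for _ in range(num_workers)]
    for local_rows in worker_inputs:
        for key, value in local_rows:
            inboxes[partition_of(key, num_workers)].append((key, value))
    return inboxes

def aggregate(inbox: List[Row]) -> Dict[str, int]:
    """Per-worker SUM ... GROUP BY key over the rows it received."""
    totals: Dict[str, int] = defaultdict(int)
    for key, value in inbox:
        totals[key] += value
    return dict(totals)

def distributed_sum(worker_inputs: List[List[Row]]) -> Dict[str, int]:
    """Shuffle, aggregate on each worker, then merge the disjoint results."""
    result: Dict[str, int] = {}
    for inbox in shuffle_in_memory(worker_inputs):
        result.update(aggregate(inbox))
    return result

# Hypothetical input: three workers, each holding a slice of the table.
inputs = [
    [("us", 3), ("de", 1)],
    [("us", 4), ("jp", 2)],
    [("de", 5), ("jp", 6)],
]
print(distributed_sum(inputs))
```

Because all rows for a given key land on the same worker, each worker can finish its part of the aggregation on its own. A disk-based engine performs the same routing but writes the partitions to files first and reads them back in the next stage.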

A Well-Architected Caching Framework

One of the primary hurdles in optimizing data lakehouse queries lies in the expensive overhead of retrieving data from remote storage locations. The sheer volume and distributed nature of data in lakehouses make each scan a resource-intensive task. A well-designed built-in data caching system is necessary. The caching system should employ a hierarchical caching mechanism, leveraging not just disk-based caching but also in-memory caching, reducing data access from remote storage and thus reducing latency.

Furthermore, the efficacy of this caching framework hinges on its integration with the query engine. Instead of it being a standalone module that requires separate deployment — which can introduce complexity and potential performance bottlenecks — it should be embedded natively within the system. This cohesive architecture simplifies operations and ensures that the cache operates at peak efficiency, thereby delivering the best possible performance for data retrieval and query execution.
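
A hierarchical cache of this kind can be sketched in a few dozen lines. The example below is illustrative only: `TieredCache`, `fake_object_store` and the file names are hypothetical, the memory tier is a simple LRU, and the disk tier is a temporary directory that is never evicted.

```python
import hashlib
import os
import tempfile
from collections import OrderedDict
from typing import Callable

class TieredCache:
    """Look up data in memory first, then on local disk, then remotely."""

    def __init__(self, fetch_remote: Callable[[str], bytes],
                 disk_dir: str, memory_capacity: int = 2) -> None:
        self.fetch_remote = fetch_remote
        self.disk_dir = disk_dir
        self.memory_capacity = memory_capacity
        self.memory: "OrderedDict[str, bytes]" = OrderedDict()
        self.hits = {"memory": 0, "disk": 0, "remote": 0}

    def _disk_path(self, key: str) -> str:
        # Hash the key so any object name maps to a safe file name.
        name = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return os.path.join(self.disk_dir, name)

    def _remember(self, key: str, data: bytes) -> None:
        # Insert as most recently used, then evict the least recently used.
        self.memory[key] = data
        self.memory.move_to_end(key)
        while len(self.memory) > self.memory_capacity:
            self.memory.popitem(last=False)

    def get(self, key: str) -> bytes:
        if key in self.memory:
            self.hits["memory"] += 1
            self.memory.move_to_end(key)
            return self.memory[key]
        path = self._disk_path(key)
        if os.path.exists(path):
            self.hits["disk"] += 1
            with open(path, "rb") as f:
                data = f.read()
        else:
            # Slowest path: go to remote storage and fill the disk tier.
            self.hits["remote"] += 1
            data = self.fetch_remote(key)
            with open(path, "wb") as f:
                f.write(data)
        self._remember(key, data)
        return data

def fake_object_store(key: str) -> bytes:
    """Stand-in for a remote object store read (hypothetical)."""
    return ("contents of " + key).encode("utf-8")

cache = TieredCache(fake_object_store, tempfile.mkdtemp(), memory_capacity=1)
cache.get("part-0.parquet")  # remote read, fills disk and memory
cache.get("part-0.parquet")  # served from memory
cache.get("part-1.parquet")  # remote read, evicts part-0 from memory
cache.get("part-0.parquet")  # served from local disk
print(cache.hits)
```

Only two of the four reads reach remote storage. In a query engine the same lookup would sit inside the scan operator, which is the kind of native integration the paragraph above argues for.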

Further System-Level Optimizations

Figure 3: SIMD optimizations

System-level optimizations like SIMD play an indispensable role in further improving lakehouse performance. SIMD lets the CPU process several data points at once with a single instruction. When combined with columnar storage, typically found in open data lake file formats like Parquet or Optimized Row Columnar (ORC), it allows data to be processed in bigger batches and significantly elevates the performance of online analytical processing (OLAP) queries, particularly those involving join operations.
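
Python’s standard library does not expose SIMD instructions, so the sketch below shows only the data layout side of the idea: turning rows into typed, contiguous columns and processing them batch by batch. That layout is what lets a native engine apply one instruction to many values at a time. The batch size, column names and sample rows are illustrative assumptions.

```python
from array import array
from typing import Iterator, List, Tuple

BATCH_SIZE = 4  # illustrative value only

# Row-oriented input: each tuple mixes an id and a price.
rows: List[Tuple[int, float]] = [
    (1, 10.0), (2, 20.0), (3, 30.0), (4, 40.0), (5, 50.0),
]

def to_columns(table: List[Tuple[int, float]]) -> Tuple[array, array]:
    """Pivot rows into two typed, contiguous columns."""
    ids = array("q", (r[0] for r in table))     # signed 64-bit integers
    prices = array("d", (r[1] for r in table))  # 64-bit floats
    return ids, prices

def batches(column: array, size: int) -> Iterator[array]:
    """Yield fixed-size slices of one column."""
    for start in range(0, len(column), size):
        yield column[start:start + size]

def sum_column(column: array, size: int = BATCH_SIZE) -> float:
    """Aggregate one column a batch at a time.

    Each batch holds values of a single type laid out back to back, which is
    the shape a vectorized engine hands to SIMD instructions.
    """
    total = 0.0
    for batch in batches(column, size):
        total += sum(batch)
    return total

ids, prices = to_columns(rows)
print(sum_column(prices))
```

A query that only needs `prices` never touches `ids`, and every batch it reads is a dense run of same-typed values rather than a sequence of mixed-type rows.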

Consider Open Source Solutions

Lastly, prioritize open source solutions. Embracing open source is critical if you want to maximize the benefits of your data lakehouse architecture. The data lakehouse’s inherent open nature extends beyond just the formats it supports; one of its paramount advantages is the flexibility it offers. This modularity means that components, including query engines, can be interchanged with minimal effort, allowing you to remain agile and adapt to the evolving landscape of data analytics with ease.
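
The value of interchangeable components can be sketched with a small interface. Everything here is hypothetical: `QueryEngine`, the two toy engines and the `nights` column exist only to show that code written against a shared contract does not change when the engine behind it does.

```python
from typing import Dict, List, Protocol

Table = List[Dict[str, int]]

class QueryEngine(Protocol):
    """The only contract the reporting code depends on."""

    def count_where(self, table: Table, column: str, minimum: int) -> int:
        ...

class LoopEngine:
    """First toy engine: explicit row-by-row loop."""

    def count_where(self, table: Table, column: str, minimum: int) -> int:
        count = 0
        for row in table:
            if row[column] >= minimum:
                count += 1
        return count

class GeneratorEngine:
    """Second toy engine: same result, different execution strategy."""

    def count_where(self, table: Table, column: str, minimum: int) -> int:
        return sum(1 for row in table if row[column] >= minimum)

def run_report(engine: QueryEngine, table: Table) -> int:
    """Count bookings of three nights or more, whichever engine is used."""
    return engine.count_where(table, "nights", 3)

# The data stays where it is; only the engine argument changes.
bookings: Table = [{"nights": 1}, {"nights": 3}, {"nights": 7}]
print(run_report(LoopEngine(), bookings))
print(run_report(GeneratorEngine(), bookings))
```

In a lakehouse, the open table and file formats play the role of the shared contract, which is why one engine can be replaced by another over the same data.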

Pipeline-Free Data Lakehouses in Action: Trip.com’s Artnova Platform

All of this may sound good in theory, but what about in practice? Trip.com’s unified internal reporting platform, Artnova, offers a great example.

Figure 4: Before: Business-critical workloads ingested into StarRocks

Initially, Artnova used Apache Hive as the data lake and Trino as the query engine. However, given the vast volume of data, the need for low latency and the high number of concurrent requests, Trino could not meet the requirements of some use cases. Trip.com had to replicate and transfer the data into StarRocks, its high-performance data warehouse. While this strategy solved some performance issues, it also introduced more problems:

  • Data freshness lagged despite the relatively fast ingestion, affecting the flexibility and timeliness of queries.
  • The data pipeline became more complex because of the additional ingestion tasks and the need to design table schemas and indexes.

Duplicating data to another data warehouse is complex and expensive. Trip.com chose to initially move only the most business-critical workloads to StarRocks, but ultimately decided an architectural overhaul was necessary and expanded its use of StarRocks.

Figure 5: After: StarRocks as the unified query engine

According to performance tests conducted by Trip.com, using StarRocks as the query engine is 7.4 times faster than Trino when querying the same data. With business-critical use cases further accelerated by StarRocks’ built-in materialized views, the performance gain is significant.

Go Pipeline-Free with Your Data Lakehouse

The evolution of the data lakehouse has reshaped data analytics, blending the advantages of data lakes and data warehouses. Despite its transformative potential, challenges like efficient query performance persist. Innovative solutions like MPP query execution, caching frameworks and system-level optimizations may bridge these gaps and enable enterprises to take advantage of all the benefits of the lakehouse with none of the drawbacks.

CelerData helps enterprises accelerate business growth with a unified analytics platform that delivers 3X the performance of any other solution on the market while reducing operating costs by up to 80%. Powered by StarRocks, CelerData is used worldwide by leading brands including Airbnb and Lenovo.
Sida Shen is product marketing manager at CelerData. An engineer with backgrounds in building machine learning and big data infrastructures, he oversees the company’s market research and works closely with engineers and developers across the analytics industry to tackle challenges...