Data lakehouse architectures continue to grow in popularity, and that should come as no surprise. Their potential for seamlessly integrating the best features of data lakes and data warehouses promises a transformative experience for data processing and analysis. Yet, there are shortcomings to this approach. This article examines these challenges, like query performance and high costs, and identifies new technologies that are helping data lakehouses tackle them.
Data lakehouses have enticed numerous enterprises with the promise of flexibility, scalability and cost effectiveness. The reality, however, is that current lakehouse query engines fall short of the query performance needed for low-latency or high-concurrency analytics at scale. Presently, the query engines that power these data lakehouses are split into two camps. On the one hand, we have engines optimized for extract, transform and load (ETL) workflows, which focus on stage-by-stage operations. On the other hand, we see engines that do not leverage modern optimizations such as single instruction, multiple data (SIMD) instruction sets, which are essential for harnessing the full power of modern CPUs.
This inherent performance limitation has pushed most users to copy their data from the lakehouse into proprietary data warehouses to achieve their desired query performance. But this is a costly workaround.
At the outset, ingesting data into a data warehouse seems like a straightforward procedure, but it’s far from it. This process necessitates converting data into the warehouse’s specific format, a task that demands considerable hardware resources. Moreover, this replication results in the redundancy of data storage — an expensive proposition in terms of cost and space.
It’s not just the physical resources either; the human effort demanded is equally significant. Tasks that seem mundane, such as aligning data types between the two systems, can drain resources. Furthermore, this data ingestion process inadvertently introduces latency, undermining the freshness of your data.
The integrity and accuracy of data are paramount for any enterprise. Ironically, the very act of ingesting data into another warehouse, which should technically amplify its utility, poses serious challenges to data governance. How can you ensure all copies are consistently updated? How can you prevent discrepancies among different copies? And how can you do this while maintaining strong data governance? These are not just theoretical questions; they are serious technical challenges that require significant engineering effort and, when done incorrectly, have the potential to impact the veracity of your data-driven decisions.
The inherent challenges of data lakehouse query performance and the use of proprietary data warehouses as workarounds are pushing an increasing number of enterprises to seek out more efficient alternatives. One popular approach has been to adopt an ingestion-free lakehouse architecture. Here’s how this works.
Data lake query engines employ data shuffling for scalable performance, particularly with complex join operations and aggregations. However, many data lakehouse engines, originally designed for data lakes’ diverse and affordable storage, focus on data transformation and ad hoc queries, persisting intermediate results to disk. Although suitable for batch jobs, this method hampers the lakehouse’s evolving workloads, especially real-time, customer-facing queries. Additionally, disk-based shuffling introduces latency, impeding query performance and hindering immediate insights.
To navigate this challenge and run low-latency queries directly on the data lakehouse, embracing massively parallel processing (MPP) query engines equipped with in-memory data shuffling is a smart move. Unlike disk-based approaches, in-memory shuffling keeps intermediate results in memory and passes them between stages without persisting them to disk. Query execution therefore does not stall on disk I/O between stages, which is pivotal for achieving low query latency and getting timely insights directly from the data lakehouse.
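To make the idea of an in-memory shuffle concrete, here is a minimal single-process sketch in Python. It is not how any particular engine implements it: the function names, the `city`/`bookings` columns and the use of CRC32 as the partitioning hash are all assumptions made for illustration. Rows from several scan fragments are routed to partitions by hashing the grouping key, the partition buffers stay in memory, and each partition is then aggregated independently, as a separate worker would do in an MPP engine.

```python
from collections import defaultdict
from zlib import crc32


def shuffle_in_memory(fragments, key, num_partitions):
    """Route every row to a partition by hashing its key; buffers stay in RAM."""
    partitions = [[] for _ in range(num_partitions)]
    for fragment in fragments:
        for row in fragment:
            # A stable hash guarantees equal keys land in the same partition.
            target = crc32(str(row[key]).encode()) % num_partitions
            partitions[target].append(row)
    return partitions


def aggregate_partition(rows, key, value):
    """Sum `value` per `key` within one partition (one worker's share)."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


def grouped_sum(fragments, key, value, num_partitions=4):
    """Shuffle, aggregate each partition, then merge the disjoint results."""
    result = {}
    for partition in shuffle_in_memory(fragments, key, num_partitions):
        result.update(aggregate_partition(partition, key, value))
    return result


# Hypothetical scan output from two data files.
fragments = [
    [{"city": "Paris", "bookings": 3}, {"city": "Tokyo", "bookings": 5}],
    [{"city": "Paris", "bookings": 2}, {"city": "Lima", "bookings": 1}],
]
totals = grouped_sum(fragments, "city", "bookings")
```

Because every key lives in exactly one partition, the per-partition results can be merged without further coordination. A disk-based engine would write each partition buffer to storage before the next stage reads it, and that write is the step the in-memory approach avoids.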
One of the primary hurdles in optimizing data lakehouse queries lies in the expensive overhead of retrieving data from remote storage locations. The sheer volume and distributed nature of data in lakehouses make each scan a resource-intensive task. A well-designed built-in data caching system is necessary. The caching system should employ a hierarchical caching mechanism, leveraging not just disk-based caching but also in-memory caching, reducing data access from remote storage and thus reducing latency.
Furthermore, the efficacy of this caching framework hinges on its integration with the query engine. Instead of it being a standalone module that requires separate deployment — which can introduce complexity and potential performance bottlenecks — it should be embedded natively within the system. This cohesive architecture simplifies operations and ensures that the cache operates at peak efficiency, thereby delivering the best possible performance for data retrieval and query execution.
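The hierarchical lookup described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a real engine's cache: the `TieredCache` class, the `fetch_remote` callable and the file names are hypothetical, and the memory tier uses a simple least-recently-used policy. A read checks memory first, then local disk, and only then goes to remote storage, filling the faster tiers on the way back.

```python
import os
import tempfile
from collections import OrderedDict


class TieredCache:
    """Memory tier (LRU) in front of a local-disk tier in front of remote storage."""

    def __init__(self, fetch_remote, memory_slots, disk_dir):
        self.fetch_remote = fetch_remote  # hypothetical callable: key -> bytes
        self.memory = OrderedDict()       # key -> bytes, least recently used first
        self.memory_slots = memory_slots
        self.disk_dir = disk_dir
        self.stats = {"memory": 0, "disk": 0, "remote": 0}

    def _disk_path(self, key):
        return os.path.join(self.disk_dir, key.replace("/", "_"))

    def _remember(self, key, data):
        self.memory[key] = data
        self.memory.move_to_end(key)
        if len(self.memory) > self.memory_slots:
            self.memory.popitem(last=False)  # evict the least recently used entry

    def get(self, key):
        if key in self.memory:               # fastest tier
            self.stats["memory"] += 1
            self.memory.move_to_end(key)
            return self.memory[key]
        path = self._disk_path(key)
        if os.path.exists(path):             # local disk tier
            self.stats["disk"] += 1
            with open(path, "rb") as handle:
                data = handle.read()
        else:                                # remote storage, the slow path
            self.stats["remote"] += 1
            data = self.fetch_remote(key)
            with open(path, "wb") as handle:
                handle.write(data)
        self._remember(key, data)
        return data


# Stand-in for object storage; a real deployment would call a storage client.
remote = {"a.parquet": b"A", "b.parquet": b"B"}
with tempfile.TemporaryDirectory() as disk_dir:
    cache = TieredCache(remote.__getitem__, memory_slots=1, disk_dir=disk_dir)
    cache.get("a.parquet")  # miss everywhere: fetched from remote
    cache.get("a.parquet")  # served from memory
    cache.get("b.parquet")  # fetched from remote, evicts a.parquet from memory
    cache.get("a.parquet")  # served from local disk
    stats = dict(cache.stats)
```

In this run only two of the four reads reach remote storage. Embedding such a cache inside the engine, rather than deploying it as a separate service, means the scan operator can consult it directly without an extra network hop.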
System-level optimizations like SIMD play an indispensable role in further improving lakehouse performance. SIMD lets a single instruction operate on several data points at once. When combined with columnar storage, typically found in open data lake file formats like Parquet or Optimized Row Columnar (ORC), it allows data to be processed in bigger batches and significantly improves the performance of online analytical processing (OLAP) queries, particularly those involving join operations.
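The difference between row-at-a-time and batch-oriented columnar processing can be illustrated with a small Python sketch. Python itself does not emit SIMD instructions, so this only shows the data layout and loop shape that make vectorization possible in a native engine. The `price` and `nights` columns, the filter and the batch size are assumptions chosen for the example.

```python
from array import array

# Row-oriented input: each record mixes fields of different types.
rows = [
    {"price": 120.0, "nights": 2},
    {"price": 80.0, "nights": 1},
    {"price": 200.0, "nights": 3},
]


def revenue_row_at_a_time(rows):
    """Evaluate the filter and expression one record at a time."""
    total = 0.0
    for row in rows:
        if row["nights"] >= 2:
            total += row["price"] * row["nights"]
    return total


# Column-oriented input: one contiguous, homogeneous array per column,
# similar in spirit to how Parquet or ORC lay out values on disk.
price = array("d", (row["price"] for row in rows))
nights = array("q", (row["nights"] for row in rows))


def revenue_columnar(price, nights, batch_size=1024):
    """Process the columns in fixed-size batches of same-typed values."""
    total = 0.0
    for start in range(0, len(price), batch_size):
        price_batch = price[start:start + batch_size]
        nights_batch = nights[start:start + batch_size]
        # In a native engine this tight loop over two typed arrays is what
        # gets compiled to SIMD instructions, handling several values each.
        total += sum(p * n for p, n in zip(price_batch, nights_batch) if n >= 2)
    return total
```

Both functions compute the same result. The columnar version touches only the two columns the query needs and works on contiguous values of one type, which is the precondition for applying one instruction to many values at once.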
Lastly, prioritize open source solutions. Embracing open source is critical if you want to maximize the benefits of your data lakehouse architecture. The data lakehouse’s inherent open nature extends beyond just the formats it supports; one of its paramount advantages is the flexibility it offers. This modularity means that components, including query engines, can be interchanged with minimal effort, allowing you to remain agile and adapt to the evolving landscape of data analytics with ease.
All of this may sound good in theory, but what about in practice? Trip.com’s unified internal reporting platform, Artnova, offers a great example.
Figure 4. Before: Business-critical workloads ingested into StarRocks
Initially, Artnova used Apache Hive as the data lake and Trino as the query engine. However, given the vast volume of data, the need for low latency and the high number of concurrent requests, Trino could not meet the requirements of some use cases. Trip.com had to replicate and transfer the data into StarRocks, its high-performance data warehouse. While this strategy solved some performance issues, it also introduced more problems:
Duplicating data to another data warehouse is complex and expensive. Trip.com chose to initially move only the most business-critical workloads to StarRocks, but ultimately decided an architectural overhaul was necessary and expanded its use of StarRocks.
According to performance tests conducted by Trip.com, using StarRocks as the query engine is 7.4 times faster than Trino when querying the same data. With business-critical use cases further accelerated by StarRocks’ built-in materialized views, the performance gain is significant.
The evolution of the data lakehouse has reshaped data analytics, blending the advantages of data lakes and data warehouses. Despite its transformative potential, challenges like efficient query performance persist. Innovative solutions like MPP query execution, caching frameworks and system-level optimizations may bridge these gaps and enable enterprises to take advantage of all the benefits of the lakehouse with none of the drawbacks.