VOOZH about

URL: https://thenewstack.io/presto-for-big-data-sql-challenges-considerations-cloud-solutions/

⇱ Presto for Big Data SQL: Challenges, Considerations, Cloud Solutions - The New Stack


TNS
SUBSCRIBE
Join our community of software engineering leaders and aspirational developers. Always stay in-the-know by getting the most important news and exclusive content delivered fresh to your inbox to learn more about at-scale software development.
REQUIRED
It seems that you've previously unsubscribed from our newsletter in the past. Click the button below to open the re-subscribe form in a new tab. When you're done, simply close that tab and continue with this form to complete your subscription.
The New Stack does not sell your information or share it with unaffiliated third parties. By continuing, you agree to our Terms of Use and Privacy Policy.
Welcome and thank you for joining The New Stack community!
Please answer a few simple questions to help us deliver the news and resources you are interested in.
REQUIRED
REQUIRED
REQUIRED
REQUIRED
REQUIRED
Great to meet you!
Tell us a bit about your job so we can cover the topics you find most relevant.
REQUIRED
REQUIRED
REQUIRED
REQUIRED
REQUIRED
Welcome!

We’re so glad you’re here. You can expect all the best TNS content to arrive Monday through Friday to keep you on top of the news and at the top of your game.

What’s next?

Check your inbox for a confirmation email where you can adjust your preferences and even join additional groups.

Follow TNS on your favorite social media networks.

Become a TNS follower on LinkedIn.

Check out the latest featured and trending stories while you wait for your first TNS newsletter.

PREV
1 of 2
NEXT
VOXPOP
As a JavaScript developer, what non-React tools do you use most often?
Angular
0%
Astro
0%
Svelte
0%
Vue.js
0%
Other
0%
I only use React
0%
I don't use JavaScript
0%
Thanks for your opinion! Subscribe below to get the final results, published exclusively in our TNS Update newsletter:
NEW! Try Stackie AI
From clobbered drafts to real-time sync
Apr 14th 2026 10:00am, by David Moore
TypeScript 6.0 RC arrives as a bridge to a faster future
Mar 14th 2026 9:00am, by Darryl K. Taft
Mastra empowers web devs to build AI agents in TypeScript
Jan 28th 2026 11:00am, by Loraine Lawson
2023-01-13 10:00:04
Presto for Big Data SQL: Challenges, Considerations, Cloud Solutions
contributed,
Data

Presto for Big Data SQL: Challenges, Considerations, Cloud Solutions

In this post, we introduced you to Presto, a little bit of how it works, and some of the challenges of deploying and managing a Presto service.
Jan 13th, 2023 10:00am by Ying Su
👁 Featued image for: Presto for Big Data SQL: Challenges, Considerations, Cloud Solutions
Image via Pixabay.

SQL is the lingua franca for data practitioners and analysts. So even as newer data lake analytical data systems emerged, a SQL interface was a must. Meta (formerly Facebook) developed Hive to be a SQL interface on top of the first generation of data lakes built on Hadoop. But, Hive had its shortcomings. The nature of Hive was not conducive for human-scale interactivity needed for dashboards and ad hoc analysis.

👁 Image

Hence, based on an in-memory architecture, Presto was built to meet the low-latency needs of these interactive and ad hoc use cases. Presto also made improvements in other areas including standardizing on an ANSI SQL interface — versus a SQL dialect like HiveQL — and a pluggable connector to directly connect to other data sources, allowing for more flexibility and federated workloads.

👁 Image

Today, Presto is used heavily at some of the most successful internet-scale companies, including Meta (holding 300PB of data, 1,000 daily active users and 30,000 queries per day), Uber (50PB, with 7,000 weekly active users and 100 million queries per day) and ByteDance (1 million queries per day).

The Presto project has been made part of the Linux Foundation, ensuring open governance and community stewardship.

👁 Image

The adoption and performance of Presto have made query engines and data stacks built around query engines, like Lakehouses, a viable — and arguably more flexible — alternative to all-in-one SQL solutions for big data.

In this article, we will learn more about the technical aspects of Presto and how cloud solutions like Ahana Cloud make it possible for data teams and organizations to leverage it.

Presto Architecture

Let’s briefly dive into the Presto architecture and system.

Coordinator and Workers

Presto is a distributed system, which allows it to scale with data sizes. Like many distributed compute systems, Presto separates the activity of planning work and the actual processing of work itself. In Presto, processes are separated into two server types — Coordinator and Worker. A Presto cluster has a single Coordinator and one or more Workers. It is not uncommon for large clusters to have hundreds of Workers. Recently, the Presto community has introduced multiple, disaggregated Coordinators to scale out to an even larger number of Workers in a single cluster as well and improve cluster availability.

👁 Image

Connectors

Presto is designed to support data source connectors (versus only being tied to a data lake). A connector provides a table-based data source model similar to what you find in relational database management systems (RDBMS). In Presto, a connector provides a three-level hierarchy — catalog, schema, and table — to expose data source data as tables; this is regardless if the underlying data source is natively relational, or tabular, in nature or not (e.g., NoSQL). A connector implementation handles the details of converting the native data source format to a uniform in-memory table format for the Presto query engine; the end user simply works with data through SQL in terms of tables, columns and rows.

Configuration and Administration

Presto is configured through properly formatted and placed configuration files.

Access Control

Presto provides a basic access control framework at two levels — system access control and connector access Control. System access control primarily deals with who has access to what catalog and to what degree (e.g., read-only). Connector access control is specific to each connector and is handled by the connector implementation. Connector access control exposes data access controls via familiar GRANT and REVOKE SQL semantics. For finer-grained data access control, Presto can integrate with third-party data authorization frameworks, such as Apache Ranger and AWS Lake Formation.

Resource Groups

Presto also provides a mechanism — called resource groups — to place limits on resource usage and enforce queueing policies, such as limits on the number of concurrently running queries, CPU time and memory usage. These policies can be defined against users, client source and query type (e.g., SELECT versus INSERT versus DDL).

Getting Started with Presto

There are several ways to get started and deploy Presto. For direct local installation, you can grab the tarball and/or JAR for any release (see release notes) and follow the instructions to deploy those artifacts. For a more packaged deployment, you can experiment with a Presto Docker image. Finally, if you don’t want to worry about managing infrastructure altogether, there is a free-forever Presto managed service community edition provided by Ahana. After you’ve deployed Presto, you can connect to the cluster via JDBC or Presto CLI and query your data sources via standard ANSI SQL syntax.

Challenges

Challenge 1: Managing and Configuring Distributed Infrastructure

Getting started with Presto is one thing, but managing and configuring a distributed system like Presto for production workloads on an ongoing basis can be challenging, particularly in on-premises environments. Even with the cloud, there can be friction in infrastructure and configuration.

On the infrastructure side, you’ll want to take advantage of any elasticity and specialized compute resources. For example, you may want to scale out and in the number of Workers during periods of heavy and low usage, respectively, to optimally utilize compute resources and manage costs. You may want to fine-tune hardware for optimal price-performance on workload SLA and predictability, such as leveraging spot instances and making use of ARM-based processors.

Even after the infrastructure can be managed, there is complexity in configuring Presto. Presto has exposed hundreds of configuration parameters to allow for tuning. These configuration parameters can provide flexibility and are a natural offshoot of open source development. However, it can quickly lead to a management headache.

When it comes to managing Presto infrastructure and configuration in the cloud, you’ll want to consider to what degree of control you want or need and how much of that can be traded for simplicity. On one end of the spectrum, you can go serverless (e.g., Amazon Athena), abstracting away much of the underlying infrastructure and exposing a simple query interface and connectivity — query-as-a-service if you will. This can work well when getting started, but over time, the opaqueness (i.e., limited visibility and control) can prevent advanced tuning required for better or more optimal price performance. Further, in some cases, limits on quota or multitenant sharing of compute resources can result in inconsistent latency performance, where the same query on the same data can take longer in subsequent submissions.

On the other end of the spectrum (e.g., Amazon EMR), you can fully manage the infrastructure yourself and go with a more do-it-yourself approach. You will have full control. But, in this scenario, you’ll need to understand and manage any ongoing tuning (e.g., configurations) and operations. All this is exacerbated when looking at several clusters, and given the importance of tuning, it is quite common to have multiple clusters designed for different workloads.

Finally, a managed service (e.g., Ahana Cloud) can offer a balance between control and simplicity between both ends of the spectrum, where the most significant configurations are exposed in an easy-to-use interface. This can be the right choice for teams that outgrow the limitations of serverless options, but do not want to or have the expertise to completely control the infrastructure and configurations. No matter what deployment style or offering you choose, you also need to consider new software versions and upgrades for new functionality, stability and fixes — both for Presto itself and any underlying dependencies (e.g., Kubernetes cluster).

One final point is around data privacy and the risk of data extrication. For data infrastructure that touches data, it’s important to understand how the data is flowing and the potential points of risk — not only in terms of an organization’s own data, but in terms of any data that might put their customers at risk. For example, Amazon EMR and the Ahana Cloud managed service deliberately deploys all the Presto cluster infrastructure into a customer’s account and network for this reason.

Challenge 2: Coordinating Disaggregated Services

As a query engine, Presto is only one piece of a broader stack of disaggregated services required for SQL workloads on top of a data lake or federated across data sources. A metadata catalog is required to allow Presto to determine which files to properly scan for a given query; further, the metadata catalog can contain valuable data statistics that are crucial for optimal query performance. Another important piece is fine-grained data access control, governing which users have access to which table, columns and rows.

Ahana Cloud not only makes it easy to deploy Presto but also to integrate with related services. In the case of the metadata catalog, Ahana Cloud can integrate with an open option like the Hive Metastore or a cloud native one, like AWS Glue. In fact, for convenience, Ahana Cloud provides a built-in Hive Metastore, as well. For data access control, again, Ahana Cloud provides easy integration to open and cloud native options, such as Apache Ranger and AWS Lake Formation, respectively.

Challenge 3: Price Performance

After all the infrastructure, configuration and deployment is said and done, there must be a sustainable cost-to-value trajectory. With ever more data, SQL compute costs — which is already the dominant cost driver in a big data SQL stack — will continue to grow. In the cloud, data lakes built on object stores offer scalable, resilient, low-cost file storage – in contrast to expensive specialized hardware. Well-designed and well-priced query engine services can deliver superior price performance benefits to non-data lake alternatives.

Our solution, Ahana Cloud can offer at least a three-times-better price performance to alternatives due to sophisticated multi-level caching, intelligent autoscaling and the ability to run on price-performance optimized CPUs. This will further improve dramatically with ongoing work in more efficient native execution.

Conclusion

In this post, we introduced you to Presto, a little bit of how it works, and some of the challenges of deploying and managing a Presto service. The cloud helps to alleviate many of the pain points compared to on-premises, and there are several Presto solutions for the cloud.  Nonetheless, there are still challenges. Ahana Cloud is a cloud native managed service for Presto that aims to provide simplified management and configuration without limiting the most important aspects of control to deliver best-in-class price performance for large-scale SQL production workloads on the data lake.

TRENDING STORIES
Ying Su is the performance architect at Ahana, where she works on building the more efficient and better price-performant data lake services on Presto and Velox. She has worked for Microsoft SQLServer and Meta Presto in the past and is...
Read more from Ying Su
SHARE THIS STORY
TRENDING STORIES
Amazon Web Services and the Linux Foundation are sponsors of The New Stack.
TNS owner Insight Partners is an investor in: Docker.
SHARE THIS STORY
TRENDING STORIES
TNS DAILY NEWSLETTER Receive a free roundup of the most recent TNS articles in your inbox each day.
The New Stack does not sell your information or share it with unaffiliated third parties. By continuing, you agree to our Terms of Use and Privacy Policy.