![]() |
VOOZH | about |
We’re so glad you’re here. You can expect all the best TNS content to arrive Monday through Friday to keep you on top of the news and at the top of your game.
Check your inbox for a confirmation email where you can adjust your preferences and even join additional groups.
Follow TNS on your favorite social media networks.
Become a TNS follower on LinkedIn.
Check out the latest featured and trending stories while you wait for your first TNS newsletter.
Over the last three decades, various technologies have been developed and applied for big data analytics on structured and unstructured business data. Because today most companies store data on different platforms in multiple locations with various data formats, these large and diverse data sets often stymie the ability to capture real-time opportunity and extract actionable data insights. Enterprise organization data from those multiple systems traditionally went through an Extract, Transform and Load (ETL) process to transform the data from its operational state into a state that’s better for querying, and then the data was often moved into the enterprise data warehouse.
The ETL process created a big problem for users like data engineers, data scientists, and system administrators as the ETL jobs for data pipelining and data movement can be very time consuming and difficult to manage when working with multiple databases and datastores. Hence, demand of analytics-as-a-service has been emerging as a new model to meet the business requirements for both archived and real-time data.
A federated data computing framework that allows users to retrieve and analyze data from multiple data sources and compute in a centralized platform is a better solution for fast data analytics in today’s data environments.
Presto, an open source platform, was originally designed to replace Hive, a batch approach to SQL on Hadoop and was built with higher performance and more interactivity compared with Apache Hive. The concept of Presto was to support a MPP (Massive Parallel Processing) framework to compute large scale data, so the architectural model is designed to support disaggregation of compute and storage and process real-time and high performing data analytics. Presto is not supposed to store data, instead the data sources are accessed via various connectors.
After years of development, the latest version of PrestoDB supports SQL-on-Anything and is an interactive query framework that can fit in any enterprise architecture as an in-memory query engine.
Presto is based upon a standard MPP database architecture, which enables horizontal scalability and the ability to process large amounts of data. Presto’s in-memory capabilities allow for interactive querying across various platforms of data sources. In order to access data at different locations, Presto is designed to be extensible with a pluggable architecture. Many components can be added to Presto to extend it further using this architecture, including connectors and security integrations.
The Presto cluster is a query engine that runs a single-server process on each instance, or node. It consists of two types of service processes: a Coordinator node and a Worker node. The Coordinator node’s main purpose is to receive SQL statements from the users, parse the SQL statements, generate a query plan, and schedule tasks to dispatch across Worker nodes. Meanwhile, the Worker node may communicate with other Worker nodes and execute the task from the query plan from the Coordinator, which is fragmented for distributed processing.
The Presto Coordinator is a single node deployed to manage the cluster. The coordinator allows users to submit queries via Presto CLI, applications using JDBC or ODBC drivers, or other available client API libraries of connections. The Coordinator is also responsible to talk to Workers to get update status, assign tasks, and send the output result sets back to the users. All communication is done through the RESTful API by Coordinator’s StatementResource class.
Inside a Presto cluster, there may be one Coordinator node with multiple Worker nodes. If the Coordinator is a leader, the Worker nodes are followers. Each Worker node stays alive as a process of a service that listens to the Coordinator for task executions and actual compute. The Worker will periodically send a heartbeat to the Discovery Server via RESTful API to signal the server with its health status of whether or not the worker is online or offline. This lets the Coordinator know from the Discovery Server which Worker nodes are available to dispatch tasks when the user submits a query.
The logical implementation of Presto is shown below. There are seven basic steps for running a query:
Presto doesn’t use MapReduce. It computes through a custom query and execution engine. All of its query processing is in memory, which is one of the main reasons for its high performance.
Presto is a distributed SQL engine for data analysis on data warehouses and other disparate data sources. It can achieve excellent performance for real-time or quasi-real-time analytic computing. Queries run with response times from millisecond to seconds. For a complex query with the right configuration, runtime can finish within the unit of minutes vs. hours or days if running on the Hive system. With its federated architecture, Presto is a proven technology that is most suitable for the following application scenarios:
First of all, Presto adopts a full memory computing model with excellent performance, which is especially suitable for ad hoc query, data exploration, BI reporting and dashboarding, lightweight ETL and other business scenarios.
Secondly, unlike other engines that only support partial SQL semantics, Presto supports complete SQL semantics, so you don’t have to worry about any requirements that Presto can’t express. Furthermore, Presto has a very convenient plug-in mechanism, you can add your own plug-ins without changing the kernel. In theory, you can use Presto to connect any data source to meet your various business scenarios.
Finally, Presto has a very active community. As part of the Linux Foundation’s Presto Foundation, many large enterprise companies in addition to Facebook such as Twitter, Uber, Amazon Athena, and Alibaba embrace Presto’s data lake analytic capability to develop features using Presto’s codebase to support large scale, high volume OLAP transactions on top of their own data federation system. Based on the above advantages, Presto is a proven technology to provide cloud data lake analysis as the underlying analytics engine.
The priority design of data lake, through opening the underlying file storage, brings maximum flexibility to the data into the lake. The data entering the data lake can be structured, semi-structured, or even completely unstructured raw logs. In addition, open storage also brings more flexibility to the upper-level engines. Various engines can read and write the data stored in the data lake according to their own scenarios, and only need to follow the relatively loose compatibility conventions (such loose conventions will have hidden challenges, which will be mentioned later). But at the same time, file system direct access makes many higher-level functions difficult to implement. For example, fine-grained (less than file granularity) permission management, unified file management and read-write interface upgrade are also very difficult (each access file engine needs to be upgraded before the upgrade is completed).
Presto also features performant SQL processing. Here are a few reasons why:
In many scenarios, Presto’s ad-hoc query runtime is expected to be 10 times faster than Hive in seconds or minutes. It supports multiple data sources, such as Hive, Kafka, MySQL, MongoDB, Redis, JMX, and more. As an open source distributed SQL query engine, Presto is a proven analytic framework to quickly analyze queries for any size of data. It supports both non-relational and relational data sources. Supported non-relational data sources include Hadoop distributed file system (HDFS), Amazon S3, Cassandra, MongoDB, and HBase. Furthermore, Presto supports JDBC / ODBC connection, ANSI SQL, window function, join, aggregation, complex query, etc. These key features are the founding keystones of building a cloud-based data lake analytics.
The Linux Foundation is a sponsor of The New Stack.
Feature image via Pixabay.