![]() |
VOOZH | about |
We’re so glad you’re here. You can expect all the best TNS content to arrive Monday through Friday to keep you on top of the news and at the top of your game.
Check your inbox for a confirmation email where you can adjust your preferences and even join additional groups.
Follow TNS on your favorite social media networks.
Become a TNS follower on LinkedIn.
Check out the latest featured and trending stories while you wait for your first TNS newsletter.
One of the biggest challenges of working with big data is the performance overhead involved with moving data between different tools and systems as part of your data processing pipeline.
Different programming languages, file formats and network protocols all have different ways of representing the same data in memory. The process of serializing and deserializing the same data into a different representation at potentially each step in a data pipeline makes working with large amounts of data slower and more costly in terms of hardware.
The solution to this problem is to create what could be seen as a lingua franca for data, which tools and programming languages could use as a common standard for transferring and manipulating large amounts of data efficiently. One proposed implementation of this concept that has started to gain major adoption is Apache Arrow.
👁 Chart showing how Apache Arrow defragments Data Access
Apache Arrow is an open source project intended to provide a standardized columnar memory format for flat and hierarchical data. Arrow makes analytics workloads more efficient for modern CPU and GPU hardware, which makes working with large data sets easier and less costly.
Apache Arrow went live in 2016 and over time has grown in scope and features, many being formerly independent projects that were integrated into the core Arrow project, like DataFusion and Plasma.
The overall goal for Apache Arrow can be summarized as trying to do for OLAP workloads what ODBC/JDBC did for OLTP workloads, by creating a common interface for different systems working with analytics data.
The primary benefit of adopting Arrow is performance. With Arrow, you no longer need to serialize and deserialize your data when moving it around between different tools and languages, because they can all use the Arrow format. This is especially useful at scale when you need multiple servers to process data.
Here’s an example of performance gains from Ray, a Python framework for managing distributed computing:
Not only is converting the data to the Arrow format faster than using an alternative for Python like Pickle, but the even bigger performance gains are when it comes to deserialization, which is orders of magnitude faster.
Due to Arrow’s column-based format, processing and manipulating data is also faster because it has been designed for modern CPUs and GPUs, so that data can be processed in parallel and take advantage of things like SIMD (single instruction, multiple data) for vectorized processing.
Arrow also provides for zero-copy reads, so memory requirements are reduced in situations where you want to transform and manipulate the same underlying data in different ways.
Arrow integrates well with Apache Parquet, another column-based format for data focused on persistence to disk. Arrow and Parquet combined makes managing the life cycle and movement of data from RAM to disk much easier and more efficient.
Another benefit of Apache Arrow is the ecosystem. More functionality and features are being added over time, and performance is being improved as well. As you will see in upcoming sections, in many cases companies are donating entire projects to Apache Arrow and contributing heavily to the project itself.
Apache Arrow benefits almost all companies because it makes moving data between systems easier. This means that by adding Arrow support to a project, it becomes easier for developers to migrate or adopt that technology as well.
Now, let’s take a look at some of the key features and different components of the Apache Arrow project.
The Arrow columnar format is the core of the project and defines the actual specification for how data should be structured in-memory. From a performance perspective, the key features delivered by this format are:
There are multiple client libraries for several languages to make it easy to get started with Arrow.
Arrow Flight is an RPC (remote procedure call ) framework added to the project to allow easy transfer of large amounts of data across networks without the overhead of serialization and deserialization. The compression provided by Arrow also means that less bandwidth is consumed compared to less-optimized protocols. Many projects use Arrow Flight to enable distributed computing for analytics and data science workloads.
Arrow Flight SQL is an extension of Arrow Flight for interacting directly with SQL databases. While it is still considered experimental, features are being added rapidly. Recently a JDBC driver was added to the project, which means that any database that supports JDBC (Java Database Connectivity) or ODBC (Microsoft Open Database Connectivity) can now communicate with Arrow data through Flight SQL.
DataFusion is a query execution framework donated to Apache Arrow in 2019. DataFusion includes a query optimizer and execution engine with support for SQL and DataFrame APIs. It is commonly used for creating data pipelines, ETL processes and databases.
Many projects are adding integrations with Arrow to make adopting their tool easier or embedding components of Arrow directly into their projects to save themselves from duplicating work.
A big trend in many different areas of software development is eliminating lock-in effects by improving interoperability. In the observability and monitoring space we can see this with projects like OpenTelemetry and in the big data ecosystem, we can see a similar effort with projects like Apache Arrow.
Developers who take advantage of Apache Arrow will not only save time by not reinventing the wheel, but will also gain access to the entire ecosystem of tools also using Arrow, which can make adoption by new users easier.