URL: https://thenewstack.io/how-apache-arrow-is-changing-the-big-data-ecosystem/

How Apache Arrow Is Changing the Big Data Ecosystem - The New Stack



How Apache Arrow Is Changing the Big Data Ecosystem

Arrow makes analytics workloads more efficient for modern CPU and GPU hardware, which makes working with large data sets easier and less costly.
Dec 5th, 2022 5:05am by Charles Mahler
InfluxData sponsored this post.

One of the biggest challenges of working with big data is the performance overhead involved with moving data between different tools and systems as part of your data processing pipeline.

Different programming languages, file formats and network protocols all have different ways of representing the same data in memory. The process of serializing and deserializing the same data into a different representation at potentially each step in a data pipeline makes working with large amounts of data slower and more costly in terms of hardware.
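To make that cost concrete, here is a minimal sketch using only the Python standard library. The two formats (JSON text and packed 32-bit integers) and the helper name `json_to_packed` are illustrative assumptions, not anything a particular tool mandates. The point is that moving between representations requires decoding every value and encoding it again.

```python
import json
import struct

values = [1, 2, 3, 4]

# Representation 1: JSON text, as a web API might send it.
as_json = json.dumps(values).encode("utf-8")

# Representation 2: packed little-endian 32-bit integers,
# as a binary file format might store it.
as_packed = struct.pack("<4i", *values)


def json_to_packed(payload: bytes) -> bytes:
    """Hypothetical pipeline step: parse JSON, then re-encode as int32.

    Every value is decoded into a Python object and encoded again,
    which is the per-hop overhead described above.
    """
    decoded = json.loads(payload.decode("utf-8"))
    return struct.pack(f"<{len(decoded)}i", *decoded)


converted = json_to_packed(as_json)
```

Both byte strings describe the same four numbers, yet neither can be used in place of the other without a full conversion pass.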

The solution to this problem is to create what could be seen as a lingua franca for data, which tools and programming languages could use as a common standard for transferring and manipulating large amounts of data efficiently. One proposed implementation of this concept that has started to gain major adoption is Apache Arrow.

[Figure: Chart showing how Apache Arrow defragments data access]

What Is Apache Arrow?

Apache Arrow is an open source project intended to provide a standardized columnar memory format for flat and hierarchical data. Arrow makes analytics workloads more efficient for modern CPU and GPU hardware, which makes working with large data sets easier and less costly.

Apache Arrow went live in 2016 and has since grown in scope and features. Several formerly independent projects, like DataFusion and Plasma, have been integrated into the core Arrow project.

The overall goal for Apache Arrow can be summarized as trying to do for OLAP workloads what ODBC/JDBC did for OLTP workloads, by creating a common interface for different systems working with analytics data.

Benefits of Apache Arrow

Performance

The primary benefit of adopting Arrow is performance. With Arrow, you no longer need to serialize and deserialize your data when moving it around between different tools and languages, because they can all use the Arrow format. This is especially useful at scale when you need multiple servers to process data.

Ray, a Python framework for managing distributed computing, offers an example of these performance gains. Not only is converting the data to the Arrow format faster than using a Python alternative like Pickle, but the gains are even bigger for deserialization, which is orders of magnitude faster.

Due to Arrow’s column-based format, processing and manipulating data is also faster because it has been designed for modern CPUs and GPUs, so that data can be processed in parallel and take advantage of things like SIMD (single instruction, multiple data) for vectorized processing.
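The difference between row-based and column-based layouts can be sketched with the standard library's `array` module. This only illustrates the layout idea; it is not Arrow's implementation, and plain Python does not use SIMD here. The sample records and the `to_columns` helper are assumptions made for the example.

```python
from array import array

# Row-oriented: each record is a separate object holding mixed types.
rows = [
    {"sensor": 1, "temp": 20.5},
    {"sensor": 2, "temp": 21.0},
    {"sensor": 3, "temp": 19.5},
]


def to_columns(records):
    """Pivot records into one contiguous, fixed-width buffer per field."""
    sensor = array("q", (r["sensor"] for r in records))  # 64-bit integers
    temp = array("d", (r["temp"] for r in records))      # 64-bit floats
    return {"sensor": sensor, "temp": temp}


columns = to_columns(rows)

# An aggregate over one field now scans a single contiguous buffer
# and never touches the other column.
mean_temp = sum(columns["temp"]) / len(columns["temp"])
```

Because every value in a column has the same type and width, a native engine can process several of them per instruction, which is the kind of vectorized execution the columnar format is designed for.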

Arrow also provides for zero-copy reads, so memory requirements are reduced in situations where you want to transform and manipulate the same underlying data in different ways.
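The idea behind zero-copy reads can be illustrated with Python's built-in `memoryview`, which exposes slices of a buffer without duplicating it. This is an analogy rather than Arrow's API, and the buffer is mutated below only to show that the views share memory.

```python
from array import array

temps = array("d", [20.5, 21.0, 19.5, 22.0, 18.0])
view = memoryview(temps)

# Slicing a memoryview creates new views over the same bytes, not copies.
first_three = view[:3]
last_two = view[3:]

# Changing the underlying buffer is visible through the existing view,
# which shows that no data was duplicated when the slice was taken.
temps[0] = 25.0
```

Two consumers can hold `first_three` and `last_two` at once while only one copy of the data exists in memory.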

Bulk Data Ingress and Egress Via Parquet

Arrow integrates well with Apache Parquet, another column-based format for data focused on persistence to disk. Arrow and Parquet combined make managing the life cycle and movement of data from RAM to disk much easier and more efficient.

Ecosystem

Another benefit of Apache Arrow is the ecosystem. More functionality and features are being added over time, and performance is being improved as well. As you will see in upcoming sections, in many cases companies are donating entire projects to Apache Arrow and contributing heavily to the project itself.

Apache Arrow benefits almost all companies because it makes moving data between systems easier. This means that by adding Arrow support to a project, it becomes easier for developers to migrate or adopt that technology as well.

Apache Arrow Features

Now, let’s take a look at some of the key features and different components of the Apache Arrow project.

Arrow Columnar Format

The Arrow columnar format is the core of the project and defines the actual specification for how data should be structured in memory. From a performance perspective, the key features delivered by this format are:

  • Sequential reads of data.
  • Constant-time random access.
  • SIMD and vector processing support.
  • Zero-copy reads.
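The following sketch shows how a fixed-width column with missing values can offer constant-time random access. It pairs a contiguous data buffer with a validity bitmap that uses one bit per slot, least-significant bit first, loosely modeled on the layout described in the Arrow format specification. The helper names `build_column` and `get` are hypothetical, and many details of the real format (padding, alignment, offsets) are omitted.

```python
from array import array


def build_column(values):
    """Return (validity_bitmap, data) for a list of ints that may contain None."""
    bitmap = bytearray((len(values) + 7) // 8)  # one bit per slot, rounded up
    data = array("i")
    for i, value in enumerate(values):
        if value is None:
            data.append(0)  # placeholder; the slot's content is ignored for nulls
        else:
            bitmap[i // 8] |= 1 << (i % 8)  # mark slot i as valid
            data.append(value)
    return bytes(bitmap), data


def get(bitmap, data, i):
    """Constant-time lookup: one bit test plus one indexed read."""
    if not (bitmap[i // 8] >> (i % 8)) & 1:
        return None
    return data[i]


bitmap, data = build_column([7, None, 9, 10])
second = get(bitmap, data, 1)  # null slot
third = get(bitmap, data, 2)   # valid slot
```

Since every slot has the same width, the position of element `i` is computed directly instead of being found by scanning, and the data buffer can still be read sequentially from start to end.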

There are multiple client libraries for several languages to make it easy to get started with Arrow.

Arrow Flight

Arrow Flight is an RPC (remote procedure call) framework added to the project to allow easy transfer of large amounts of data across networks without the overhead of serialization and deserialization. The compression provided by Arrow also means that less bandwidth is consumed compared to less-optimized protocols. Many projects use Arrow Flight to enable distributed computing for analytics and data science workloads.
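As a rough illustration of why shipping columnar buffers avoids per-value serialization, the sketch below frames each column batch with a length prefix and writes its raw bytes to a stream. This is not Arrow Flight's wire protocol or API. The framing, the `write_batch` and `read_batch` helpers, and the use of `io.BytesIO` as a stand-in for a network connection are all assumptions made for the example.

```python
import io
import struct
from array import array


def write_batch(stream, column):
    """Send a column as a 4-byte length prefix followed by its raw buffer."""
    payload = column.tobytes()  # the in-memory bytes, with no per-value encoding
    stream.write(struct.pack("<I", len(payload)))
    stream.write(payload)


def read_batch(stream, typecode="d"):
    """Read one framed batch, or return None at end of stream."""
    header = stream.read(4)
    if len(header) < 4:
        return None
    (length,) = struct.unpack("<I", header)
    column = array(typecode)
    column.frombytes(stream.read(length))  # bytes go straight into the buffer
    return column


wire = io.BytesIO()  # stands in for a network connection
write_batch(wire, array("d", [1.5, 2.5]))
write_batch(wire, array("d", [3.5]))

wire.seek(0)
batches = []
while True:
    batch = read_batch(wire)
    if batch is None:
        break
    batches.append(batch)
```

The receiver ends up with the same buffers the sender had, without parsing individual values. The sketch assumes both sides share byte order and value width, which is what a standardized format guarantees.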

Arrow Flight SQL

Arrow Flight SQL is an extension of Arrow Flight for interacting directly with SQL databases. While it is still considered experimental, features are being added rapidly. Recently a JDBC driver was added to the project, which means that any database that supports JDBC (Java Database Connectivity) or ODBC (Open Database Connectivity) can now communicate with Arrow data through Flight SQL.

Arrow DataFusion

DataFusion is a query execution framework donated to Apache Arrow in 2019. DataFusion includes a query optimizer and execution engine with support for SQL and DataFrame APIs. It is commonly used for creating data pipelines, ETL processes and databases.
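To give a feel for what an execution engine does with columnar data, here is a toy filter-then-aggregate over two columns using only the standard library. It is not DataFusion's API and includes no optimizer. The table contents and the `filter_indices` and `take` helpers are hypothetical names chosen for the illustration.

```python
from array import array

table = {
    "sensor": array("q", [1, 2, 1, 2]),
    "temp": array("d", [20.0, 30.0, 22.0, 34.0]),
}


def filter_indices(column, predicate):
    """Scan one column and return the positions of the matching rows."""
    return [i for i, value in enumerate(column) if predicate(value)]


def take(column, indices):
    """Gather the selected positions from another column."""
    return array(column.typecode, (column[i] for i in indices))


# Roughly: SELECT avg(temp) FROM table WHERE sensor = 2
selected = filter_indices(table["sensor"], lambda s: s == 2)
temps = take(table["temp"], selected)
avg_temp = sum(temps) / len(temps)
```

Each operator works on whole columns at a time, and the filter reads only the `sensor` column while the aggregate reads only the selected `temp` values.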

Projects Using Apache Arrow

Many projects are adding integrations with Arrow to make adopting their tool easier or embedding components of Arrow directly into their projects to save themselves from duplicating work.

  • InfluxDB IOx: InfluxDB’s new columnar storage engine, IOx, uses the Arrow format for representing data and moving data to and from Parquet. It also uses DataFusion to add SQL support to InfluxDB.
  • Apache Parquet: Parquet is a file format for storing columnar data used by many projects for persistence. Parquet has support for vectorized reads and writes to and from Arrow.
  • Dask: Dask is a parallel computing framework that makes it easy to scale Python code horizontally. Dask uses Arrow for accessing Parquet files.
  • Ray: Ray is a framework that allows data scientists to process data, train machine learning models, then serve those models in production using a unified tool. Ray relies on Apache Arrow for moving data between components with minimal overhead.
  • Pandas: Pandas is one of the most popular data analysis tools in the Python ecosystem. Pandas is able to read data stored in Parquet files by using Apache Arrow behind the scenes.
  • Turbodbc: Turbodbc is a tool based on the ODBC interface that allows data scientists to efficiently access data stored in relational databases via Python. Arrow makes this more efficient by allowing the data to be transferred in batches rather than as single records.

Conclusion

A big trend in many different areas of software development is eliminating lock-in effects by improving interoperability. In the observability and monitoring space, we can see this with projects like OpenTelemetry, and in the big data ecosystem we can see a similar effort with projects like Apache Arrow.

Developers who take advantage of Apache Arrow will not only save time by not reinventing the wheel, but will also gain access to the entire ecosystem of tools also using Arrow, which can make adoption by new users easier.

InfluxData is the creator of InfluxDB, the leading time series platform. More than 1,900 customers use InfluxDB to collect, store, and analyze all time series data at any scale. Developers can query and analyze their time-stamped data to predict, respond, and adapt in real-time.
Charles Mahler is a technical writer at InfluxData, where he creates content to help educate users on the InfluxData and time series data ecosystem. Charles' background includes working in digital marketing and full-stack software development.