
URL: https://thenewstack.io/python-pandas-ditches-numpy-for-speedier-pyarrow/


Python Pandas Ditches NumPy for Speedier PyArrow

Pandas 3.0 will significantly boost performance by replacing NumPy with PyArrow as its default engine, enabling faster loading and reading of columnar data.
May 27th, 2025 7:00am by Joab Jackson

Python Pandas is about to get a performance boost: When the long-awaited version 3.0 of the data analysis library is released, it will come with a faster engine for loading and reading columnar data. PyArrow will take the place of NumPy, the math library Pandas has used thus far.

Pandas already supports PyArrow and has done so since at least version 2.0, released in April 2023. In the next version, v3.0, PyArrow will be a required dependency, with pyarrow.string being the default type inferred for string data.

You can still use NumPy in version 3.0, but why would you want to?

“So the good news, PyArrow is 10 times faster. What else do you need to know? Like it’s just really ridiculously faster,” advised Python instructor Reuven Lerner, during a session about the “PyArrow Revolution,” held at PyCon 2025 earlier this month in Pittsburgh.

The Way of the Pandas

Created in 2008 by financial quant Wes McKinney, the Pandas library is now used by many to manage large data sets. He originally built it on top of the NumPy scientific computing library, which, among other features, offers the ability to store large arrays of data in a great variety of formats.

A Pandas Series is basically a wrapper around a one-dimensional NumPy array; a Pandas DataFrame is a wrapper around a two-dimensional array. Because NumPy is written in C and vectorized, Pandas performs these operations faster and more efficiently than pure Python can.
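The wrapper relationship can be sketched in a few lines. This is an illustration with made-up values, assuming a default (NumPy-backed) pandas installation, not a benchmark.

```python
import numpy as np
import pandas as pd

# A one-dimensional NumPy array wrapped in a Series.
values = np.array([1.5, 2.5, 3.5])
series = pd.Series(values)

# A two-dimensional NumPy array wrapped in a DataFrame.
frame = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["a", "b"])

# to_numpy() hands back the NumPy representation of each object.
backing = series.to_numpy()
grid = frame.to_numpy()

# Vectorized: the multiplication runs in compiled code over the whole array.
doubled = series * 2

# The pure-Python equivalent visits one value at a time in the interpreter.
looped = [v * 2 for v in values]

print(type(backing), grid.shape)
print(doubled.tolist())
```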

But NumPy, built in 2005 as an update to the Numeric library, predates many of the data concerns of the past decade, such as data streaming, nested rows and complex data types. It has trouble with dates, it has no compression techniques, and it is not even that great for batch processing.

Worst of all, it is slow with columnar data, which, when you think about it, is basically what arrays are. It still stores everything in rows, which makes array processing painfully slow as it tracks down each value on a case-by-case basis. And it is single-threaded, so it does all calculations serially, limited by the speed of the processor.

Introducing PyArrow

PyArrow offers columnar storage, which eliminates all that computational back and forth that comes with NumPy. PyArrow paves the way for running Pandas, by default, in Copy-on-Write mode, which improves memory usage and performance.

PyArrow provides the Python bindings for Apache Arrow. Also created by McKinney, Apache Arrow is a cross-platform memory format that stores data in columns, making it easy to store on disk and faster to calculate.

The columnar orientation provides faster data writes and reads for most open source data processing engines, including Spark, Flink, Dremio, Drill and Ray. A lot of AI modeling is built on columnar data, so the format is much favored by AI frameworks such as TensorFlow and PyTorch.

Lerner offered an example of how bad NumPy is with memory, using a 2.2GB CSV (comma-separated values) file of New York parking violations for the year 2020, which consists of about 12 million rows.

Reading that CSV file into memory took Python 55.8 seconds, but PyArrow did the work in 11.8 seconds.

Arrow defines two new binary formats to speed data exchanges even more. One is Feather, which is an uncompressed data format, and the other is Parquet, which compresses data.

That 2.2GB CSV file took up only 1.4GB when rendered into the Feather format and 379MB in Parquet. Because the data is binary, Pandas doesn’t have to pause to figure out the data type, Lerner noted.

Performance increased as well: With Feather, that entire CSV file could be read in 10.6 seconds, and with Parquet it took only 9.1 seconds, according to Lerner’s tests.
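For a sense of scale, the reported figures can be turned into ratios. The sketch below only does arithmetic on the numbers quoted above; it takes the 55.8-second CSV read as the baseline and assumes 1GB = 1,000MB.

```python
# Read times in seconds, as reported from Lerner's tests.
baseline_csv = 55.8
read_times = {
    "pyarrow_csv": 11.8,
    "feather": 10.6,
    "parquet": 9.1,
}

# Speedup of each option relative to the baseline CSV read.
speedups = {name: baseline_csv / t for name, t in read_times.items()}

# File sizes in MB (assuming 1GB = 1,000MB).
csv_mb = 2200
file_sizes = {"feather": 1400, "parquet": 379}

# Fraction of the original CSV size that each format occupies.
size_fractions = {name: mb / csv_mb for name, mb in file_sizes.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x faster than the baseline")
for name, fraction in size_fractions.items():
    print(f"{name}: {fraction:.0%} of the CSV size")
```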

The Release of Pandas 3.0

In a follow-up e-mail to TNS, Lerner clarified that PyArrow will be required for Pandas 3.0. But it won’t yet be the default engine in Pandas 3.0.

“Over time, PyArrow is becoming better and better integrated with Pandas, but using it as a back end is still experimental and isn’t recommended in production,” Lerner wrote. “That said, the improvements that we’re seeing in PyArrow are pretty amazing, and indicate that we’re going in the right direction.”

“So — will PyArrow replace NumPy? Yes, at some point! But we don’t know when, and it won’t be in the Pandas 3.0 release.”

When Pandas 3.0 will arrive, however, is still an open question. Pandas 3.0 was originally due to be released in April 2024, which came and went with no release, and as of press time, there is no scheduled release. The latest release, v2.2.3, was issued in September 2024.

In many ways, Pandas is by now a legacy technology, so the embedding of PyArrow is good news for organizations that want to speed data-crunching operations without all the messy work of migrating to a new platform.

“The real advantage here is that you get to keep your use of Pandas, keep the same API,” Lerner said. “You swap out the backend in favor of a new one, and voila, you save tons of time and tons of memory.”

(May 30, 2025: This post was updated with additional clarifications from Lerner.)

Joab Jackson is a senior editor for The New Stack, covering cloud native computing and system operations. He has reported on IT infrastructure and development for over 30 years, including stints at IDG and Government Computer News. Before that, he...
TNS owner Insight Partners is an investor in: Dremio.