URL: https://thenewstack.io/how-open-source-arrow-helps-solve-time-series-data-dilemmas/

How Open Source Arrow Helps Solve Time Series Data Dilemmas - The New Stack



How Open Source Arrow Helps Solve Time Series Data Dilemmas

The evolution of Apache Arrow — a popular open source, multilanguage toolbox for faster data interchange and in-memory processing — creates new opportunities for improved real-time analytics.
Aug 15th, 2023 3:00am by B. Cameron Gain
Featured image by Nick Fewings from Unsplash.
InfluxData sponsored this post.

Companies and other organizations have long used metrics stored in time series databases (TSDBs) for critical functions such as monitoring, alerting and automating processes. However, they have had a harder time deriving other insights and value from those databases because of cardinality constraints and specialized query languages.

Now, the evolution of Apache Arrow, a popular open source multilanguage toolbox for accelerated data interchange and in-memory processing, creates new opportunities for improved real-time analytics and for time series applications that reach beyond traditional use cases into areas such as climate modeling, finance and even AI.

Indeed, users of time series databases have historically struggled with high-cardinality use cases, according to Rachel Stephens, an analyst for RedMonk. High-cardinality data sets are those that have a large and often unbounded set of unique possible values in a given field.

For example, take user IDs, which have a large number of possible distinct values, or trace IDs, Stephens told The New Stack. This has historically meant that in infrastructure monitoring use cases, TSDBs were effective for measuring metrics over time, but cardinality limitations didn’t allow for logging or tracing use cases.
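To make the idea of cardinality concrete, here is a minimal sketch that counts the distinct values in each field of a batch of events. The field names and sample data are hypothetical and chosen only for illustration; they do not come from InfluxDB or any real workload.

```python
from collections import defaultdict


def field_cardinality(rows):
    """Return the number of distinct values seen for each field."""
    seen = defaultdict(set)
    for row in rows:
        for field, value in row.items():
            seen[field].add(value)
    return {field: len(values) for field, values in seen.items()}


# Hypothetical sample: 1,000 events from 4 hosts in 2 regions,
# where every event carries its own trace ID.
events = [
    {"host": f"host-{i % 4}", "region": f"region-{i % 2}", "trace_id": f"trace-{i}"}
    for i in range(1000)
]

cardinality = field_cardinality(events)
print(cardinality)  # {'host': 4, 'region': 2, 'trace_id': 1000}
```

The host and region fields stay small no matter how many events arrive, while the trace ID field grows with every new event. That unbounded growth is what the term high cardinality refers to.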

Apache Arrow is a language-agnostic columnar data format. It facilitates building and querying large-scale databases that must transfer and process data in fractions of a second for access by distributed end users.
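The difference between a row-oriented and a columnar layout can be sketched with the Python standard library alone. This is only an illustration of the general idea, not Arrow's actual memory format or API, and the sensor readings are made up.

```python
from array import array

# Hypothetical sensor readings in a row-oriented layout: one dict per record.
rows = [
    {"time": 1, "device": "a", "temp": 20.5},
    {"time": 2, "device": "b", "temp": 21.0},
    {"time": 3, "device": "a", "temp": 19.5},
]

# The same data in a columnar layout: one sequence per field,
# with numeric fields held in contiguous typed arrays.
columns = {
    "time": array("q", (r["time"] for r in rows)),
    "device": [r["device"] for r in rows],
    "temp": array("d", (r["temp"] for r in rows)),
}


def mean_temp_rows(rows):
    # Walks every record and looks up one field in each.
    return sum(r["temp"] for r in rows) / len(rows)


def mean_temp_columns(columns):
    # Reads a single contiguous array and never touches the other fields.
    temps = columns["temp"]
    return sum(temps) / len(temps)


print(mean_temp_rows(rows), mean_temp_columns(columns))
```

Both functions return the same average, but the columnar version only reads the one field the query needs. Analytical queries that aggregate a few fields over many records tend to benefit from that access pattern.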

After working with Apache Arrow, InfluxData applied its domain expertise in time series data to address specific requirements, such as compactions for more efficient data storage. The company can build high-performance databases by leveraging Arrow’s upstream tools and libraries.

Arrow, Parquet and Rust

InfluxData also draws upon the Apache Parquet column-oriented data storage format, along with the Arrow in-memory format. Apache Parquet is designed for data storage and retrieval and provides efficient data compression. The company also implemented its new database engine, which handles writing and reading data, in the Rust programming language.
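One reason column-oriented formats compress well is that the values within a single column tend to repeat. The sketch below shows dictionary encoding followed by run-length encoding, two techniques that Parquet counts among its encodings. It is a simplified illustration in plain Python, not Parquet's byte-level format, and the column contents are hypothetical.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each value with an integer index into a list of distinct values."""
    dictionary = []
    index_of = {}
    codes = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes


def run_length_encode(codes):
    """Collapse consecutive repeats into (code, run_length) pairs."""
    return [(code, sum(1 for _ in run)) for code, run in groupby(codes)]


def decode(dictionary, runs):
    """Rebuild the original column from the dictionary and the runs."""
    return [dictionary[code] for code, length in runs for _ in range(length)]


# Hypothetical tag column with long runs of repeated values.
device_column = ["sensor-a"] * 4 + ["sensor-b"] * 3 + ["sensor-a"] * 2

dictionary, codes = dictionary_encode(device_column)
runs = run_length_encode(codes)

print(dictionary)  # ['sensor-a', 'sensor-b']
print(runs)        # [(0, 4), (1, 3), (0, 2)]
```

Nine strings shrink to a two-entry dictionary and three short pairs, and `decode(dictionary, runs)` restores the original column exactly.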

InfluxData has integrated Arrow into InfluxDB, allowing users to take advantage of columnar data formats and improved analytics. This development enables sub-second query responses across InfluxDB’s time series data platform and storage. As a result, real-time analysis is now possible for monitoring, alerting and analytics on large fleets of devices.

InfluxData is the creator of InfluxDB, the leading time series platform. More than 1,900 customers use InfluxDB to collect, store, and analyze all time series data at any scale. Developers can query and analyze their time-stamped data to predict, respond, and adapt in real-time.

The end result, based on InfluxData’s work with Apache Arrow, Apache Parquet, Apache DataFusion and Rust, is InfluxDB 3.0, a new time series database engine that the company says is much more efficient than its predecessor without being limited by cardinality restrictions.

“We have achieved this by employing optimization techniques like vectorization, predicate pushdowns, aggregate pushdowns, parallelism and more,” Rick Spencer, vice president of product at InfluxData, told The New Stack. “Collectively, these advancements enable you to perform analytics at the leading edge of data processing.”
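Of the techniques named in the quote, predicate pushdown is perhaps the easiest to sketch. The idea is to apply a filter as close to storage as possible so that data that cannot match is never read. The example below assumes a simplified model in which each chunk of a column keeps minimum and maximum statistics. The `Chunk` class and function names are hypothetical, and this is not how InfluxDB or DataFusion implement it internally.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """A block of one column plus min/max statistics recorded at write time."""
    values: list
    min_value: float
    max_value: float


def make_chunks(values, size):
    """Split a column into fixed-size chunks and record each chunk's statistics."""
    chunks = []
    for start in range(0, len(values), size):
        block = values[start:start + size]
        chunks.append(Chunk(block, min(block), max(block)))
    return chunks


def scan_greater_than(chunks, threshold):
    """Return the matching values and the number of chunks actually read."""
    matches, chunks_read = [], 0
    for chunk in chunks:
        if chunk.max_value <= threshold:
            continue  # the statistics prove nothing here can match, so skip it
        chunks_read += 1
        matches.extend(v for v in chunk.values if v > threshold)
    return matches, chunks_read


# Hypothetical readings that happen to increase over time.
temps = [float(t) for t in range(100)]
chunks = make_chunks(temps, size=10)

matches, chunks_read = scan_greater_than(chunks, threshold=94.0)
print(matches, chunks_read)  # [95.0, 96.0, 97.0, 98.0, 99.0] 1
```

Only one of the ten chunks is read in this case. Time series data is often written in roughly time order, which tends to make this kind of statistics-based skipping effective for time-range filters.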

Thus, developers can build high-performance databases by leveraging the upstream tools and libraries that Arrow provides, Spencer said.

“InfluxData is a poster child for Apache Arrow, which we used to build InfluxDB’s core engine,” he said.

InfluxData also plans to release a cluster version of InfluxDB 3.0 so developers can run it in their own Kubernetes cluster, Spencer added. “This will give them more flexibility and control over their deployments.”

InfluxDB 3.0’s structure with Apache Arrow as its core engine. (Source: InfluxData)

Exploring the Layers

InfluxData summarized how Arrow, Parquet and Rust support InfluxDB’s new engine this way:

  • Rust is a cutting-edge programming language designed for speed, efficiency, reliability and memory safety.
  • Apache Arrow is a framework for defining in-memory columnar data.
  • Apache Parquet is a column-oriented durable file format.
  • Arrow Flight is a client-server framework designed to transport large datasets over network interfaces without significantly impacting performance.
  • Apache DataFusion drives the query engine and provides native SQL support.

InfluxData also separated out the compute and storage layers, with Apache Parquet used as the persistence format for the object store and separate ingest, query and compression layers of compute on top, RedMonk’s Stephens said.

“This ability to work with unbounded distinct values opens up more use cases for time series engines,” she said.

In previous versions, InfluxDB indexed data based on tags. In 3.0, InfluxDB writes data into Parquet files (which have high compression), stores them in object storage (which is much more scalable and cheaper than SSD storage), and serves queries from a separate query tier (which is more elastic).

Additionally, users can now query in SQL in addition to InfluxQL thanks to Apache DataFusion, Stephens said: “This improves ecosystem compatibility and the ability for users to integrate InfluxDB into more upstream communities, as well as into their existing tools.”
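As a rough illustration of what querying time series data with SQL looks like, the sketch below uses Python's built-in sqlite3 module as a stand-in. It is not DataFusion or InfluxDB, whose SQL dialects and time functions may differ, and the `cpu` table, its columns and its values are hypothetical.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cpu (time INTEGER, host TEXT, usage REAL)")
conn.executemany(
    "INSERT INTO cpu VALUES (?, ?, ?)",
    [
        (0, "a", 10.0), (30, "a", 20.0), (60, "a", 30.0), (90, "a", 50.0),
        (0, "b", 5.0), (60, "b", 15.0),
    ],
)

# Average usage per host in 60-second buckets; time is in seconds,
# and integer division rounds each timestamp down to its bucket start.
query = """
    SELECT host, (time / 60) * 60 AS minute, AVG(usage) AS avg_usage
    FROM cpu
    GROUP BY host, minute
    ORDER BY host, minute
"""
result = conn.execute(query).fetchall()
print(result)  # [('a', 0, 15.0), ('a', 60, 40.0), ('b', 0, 5.0), ('b', 60, 15.0)]
```

Because the query is ordinary SQL, the same kind of statement can be written by anyone familiar with relational databases, which is the ecosystem benefit Stephens describes.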

Advantages in Scaling

From a CTO’s perspective, InfluxDB 3.0 is of particular interest to organizations whose database or time series data system is running into scale or cardinality limitations, according to Spencer.

“Many customers come to us when their existing systems no longer meet their needs due to scaling issues,” he said.

“InfluxDB 3.0 provides a purpose-built solution for time series data, allowing organizations to handle large volumes of observability data and the full range of time series data. This means unlimited quantities of metrics, events and traces, providing valuable insights for monitoring and analysis purposes.”

Users appreciate high-storage compressions, columnar data formats and unlimited cardinality to meet large data requirements and insights. (Source: InfluxData)

Ecosystem Compatibility

The compatibility with popular libraries and tools is another advantage InfluxDB 3.0 offers, according to the company.

For instance, Pandas, a widely used Python analysis library, has native support for Arrow, and the next version of Pandas will be based on Arrow. This compatibility opens up possibilities for various use cases, Spencer noted, such as machine learning pipelines.

Additionally, the Flight and Flight SQL client-server protocols enable seamless integration with other tools like Dremio, allowing data availability across the organization, Spencer said.

“Developers can start using Arrow by simply grabbing the right libraries and tools,” he said. “For instance, they can use the InfluxDB client library and write SQL queries, which can be converted into Pandas data frames for further analysis in tools like Jupyter Notebook. Signing up for an InfluxDB account is an easy way to get started.”

Apache Arrow is rapidly becoming a standard for communications between tools used for big data storage and analytics. Initially, this was driven purely by the performance capabilities of Apache Arrow.

However, the inherent network effect of the Arrow ecosystem is driving adoption as well, with newer market entrants adopting Arrow to make integration into existing developer tools and workflows easier. The result is a win-win for both parties, with InfluxData being one of the early adopters leading the charge.

BC Gain is founder and principal analyst for ReveCom Media. His obsession with computers began when he hacked a Space Invaders console to play all day for 25 cents at the local video arcade in the early 1980s. He then...
TNS owner Insight Partners is an investor in: Pragma, Dremio.