VOOZH about

URL: https://thenewstack.io/processing-large-data-sets-in-fine-grained-parallel-streams-with-sql/

⇱ Processing Large Data Sets in Fine-Grained Parallel Streams with SQL - The New Stack


TNS
SUBSCRIBE
Join our community of software engineering leaders and aspirational developers. Always stay in-the-know by getting the most important news and exclusive content delivered fresh to your inbox to learn more about at-scale software development.
REQUIRED
It seems that you've previously unsubscribed from our newsletter in the past. Click the button below to open the re-subscribe form in a new tab. When you're done, simply close that tab and continue with this form to complete your subscription.
The New Stack does not sell your information or share it with unaffiliated third parties. By continuing, you agree to our Terms of Use and Privacy Policy.
Welcome and thank you for joining The New Stack community!
Please answer a few simple questions to help us deliver the news and resources you are interested in.
REQUIRED
REQUIRED
REQUIRED
REQUIRED
REQUIRED
Great to meet you!
Tell us a bit about your job so we can cover the topics you find most relevant.
REQUIRED
REQUIRED
REQUIRED
REQUIRED
REQUIRED
Welcome!

We’re so glad you’re here. You can expect all the best TNS content to arrive Monday through Friday to keep you on top of the news and at the top of your game.

What’s next?

Check your inbox for a confirmation email where you can adjust your preferences and even join additional groups.

Follow TNS on your favorite social media networks.

Become a TNS follower on LinkedIn.

Check out the latest featured and trending stories while you wait for your first TNS newsletter.

PREV
1 of 2
NEXT
VOXPOP
As a JavaScript developer, what non-React tools do you use most often?
Angular
0%
Astro
0%
Svelte
0%
Vue.js
0%
Other
0%
I only use React
0%
I don't use JavaScript
0%
Thanks for your opinion! Subscribe below to get the final results, published exclusively in our TNS Update newsletter:
NEW! Try Stackie AI
From clobbered drafts to real-time sync
Apr 14th 2026 10:00am, by David Moore
TypeScript 6.0 RC arrives as a bridge to a faster future
Mar 14th 2026 9:00am, by Darryl K. Taft
Mastra empowers web devs to build AI agents in TypeScript
Jan 28th 2026 11:00am, by Loraine Lawson
2023-01-13 06:24:33
Processing Large Data Sets in Fine-Grained Parallel Streams with SQL
sponsor-aerospike,sponsored-post-contributed,
Data / Storage

Processing Large Data Sets in Fine-Grained Parallel Streams with SQL

A look at mechanisms for accessing large data sets over parallel streams and ways to define data splits for parallel processing, plus a framework for testing.
Jan 13th, 2023 6:24am by Neel Phadnis
👁 Featued image for: Processing Large Data Sets in Fine-Grained Parallel Streams with SQL
Image via Unsplash.
Aerospike sponsored this post.

SQL is broadly used as a data access language, and Trino provides a powerful engine for SQL access to multiple data sources. However, as more and more real-time data becomes available, developers will need to process large data sets at an unlimited scale with predictable performance.

The Aerospike Trino Connector leverages specific Aerospike mechanisms for accessing large data sets over parallel streams. It works with Trino’s distributed computation framework and its cost-based optimizer (CBO) to define data splits, push down query operations and leverage appropriate indexes. The result is faster time to insights that speed decisioning and business outcomes.

This article explains mechanisms for accessing large data sets over parallel streams and describes some schemes for defining data splits for parallel processing, as well as a framework for testing them.

Please follow along in the adjunct interactive tutorial.

Aerospike is the real-time database built for infinite scale, speed, and savings. Our customers are ready for what’s next with the lowest latency and the highest throughput data platform. Cloud and AI-forward, we empower leading organizations like Adobe, Airtel, Criteo, Experian, and PayPal.
Learn More
The latest from Aerospike

Parallel Processing of Large Data Sets

To process large data sets, a common scheme is to split the data into partitions and assign a worker task to process each partition. The partitioning scheme must have the following properties:

  • The partitions are collectively exhaustive, meaning they cover the entire data set, and mutually exclusive, meaning they do not overlap.
  • They are deterministically and efficiently computed.
  • They are accessible in an efficient and flexible manner as required by worker tasks, for example, in smaller chunks at a time.

Data Partitions in Aerospike

The three types of Aerospike indexes — primary, set and secondary — are partition-oriented. That means they are split by partitions at each node, and queries are processed at each node over individual partitions. A client can request a query to be processed over specific partitions so that multiple client workers can work in parallel. It is easy to see how parallel streams up to the total number of partitions (4,096) can be set up for parallel processing data streams.

Pagination is supported with Aerospike queries where the client can process a chunk of records at a time by repeatedly asking for a certain number of records until all records are retrieved.

Splitting Data Sets Beyond 4,096

Many data-processing platforms allow more worker tasks than 4,096. For example, Spark allows up to 32K worker tasks to run in parallel. Trino allows theoretical concurrency greater than 4,096.

Aerospike allows for data splits larger than 4,096 by allowing a partition to be divided into subpartitions efficiently. The scheme is based on the “digest-modulo” function that can divide a partition into an arbitrary number of non-overlapping and collectively complete subpartitions. It involves adding the filter expression “digest % N == i for 0 <= i < N”, where the “digest” is the hashed key of the record.

The advantage of the “digest-modulo” function is that it can be evaluated without reading individual records from the storage device, such as SSDs. Digests of all records are held in the primary index, which resides in memory. Therefore, determining the membership of a digest, and equivalently of the corresponding record, in a subpartition is fast. Each subpartition stream needs to read only its records from the potentially slower storage device, although it needs to perform the in-memory digest-modulo evaluation, which is much faster, for all records.

This scheme works for primary-index and set-index queries because they hold digests of records. The secondary index holds the primary index location of the record, and a lookup provides the digest information.

Defining and Assigning Splits

How are the splits over a data set defined and assigned to N worker tasks, where N can vary from one to any arbitrarily large number? In reality, there would be an upper bound on N on a given platform because of either a platform-defined absolute limit, or the overhead of processing a large number of parallel streams and coordinating across them can negate the benefits.

It is important to understand what partitions or subpartitions can be requested in a single Aerospike API call:

  1. Full partitions and subpartitions cannot be mixed in a call.
  2. Full partitions must be consecutive in order, or “(pstart-id, pcount)”.
  3. Subpartitions must be consecutive, belong to consecutive partitions and use the same modulo factor, or “(pstart-id, pcount, sstart-id, scount, m)”.

The goal is to achieve best efficiency with the operations available in the APIs.

Split Assignment Schemes

We will examine three variations of split assignment.

If N is the requested number of splits:

  1. At most N splits (can be fewer), same sized, one API call per split.
  2. At least N splits (can be more), same sized, one API call per split.
  3. Exactly N splits, same sized, up to three API calls per split.

The first two allow specific, discrete values of splits to allocate the same amount of data (as partitions or a subpartition) and choose the closest allowed number of splits that is a factor or multiple of 4,096. Each split is processed with one API call.

The third one allows any number of splits with the same sized data assignment of partitions and/or subpartitions. Each split, however, may require up to three API calls.

Parallel Query Framework

The parallel stream processing from the above split assignments can be tested with the following simple framework. It can be tweaked to suit the needs of the intended workload and environment.

The test data consists of 100,000 records (can be changed) of ~1KB size, with a secondary index defined on an integer bin.

Processing Flow

The processing takes place as follows (tunable parameters are italicized):

  • Splits assignments are made for the requested number of splits and the desired split type.
  • The desired number of workers (threads) are created. All workers start at the same time to process the splits. Each worker thread does the following in a loop until there are no unprocessed splits available:
    • Obtain the next scheduled split.
    • Create one or more query requests over the split’s partitions and subpartitions and process them sequentially.
    • Assign the secondary-index query predicate depending on the requested query type.
    • Create the requested filter expression. Append it (with AND) to the subpartition filter expression if one is being used; otherwise, use it separately.
    • Process the query with the filter in the requested mode (sync or async).
    •  Get chunk-size records at a time until all records are retrieved.
    • Process the records using the stream-processing implementation. The CountAndSum implementation:
  • Aggregates the number of records in a count by the worker.
  • Aggregates an integer bin value in a sum by the worker.
  • Aggregates count and sum across all workers at the end.
  • Waits for all workers to finish and outputs the aggregated results from stream processing.

In the CountAndSum example, the total number of processed records and the sum of the integer bin across all records must be the same for a given query predicate and filter irrespective of the number of splits, split type, number of workers and processing mode.

Parameters and Variations

This requires a wide range of action, including number of splits, number of workers, query index types and more. Please follow along in the adjunct interactive tutorial for a deeper understanding.

Use Cases for Fine-Grained Parallelism

Processing speed can benefit from a very high degree of parallelism for a very large data set processed with transforms, aggregations and updates.

Multiple data sets that need to be joined and require shuffling subsets across a large number of worker nodes may not benefit from a very high parallelism in retrieval. The cost of transfer of data in subsequent steps across a large number of worker nodes can limit the benefit of fine-grained retrieval. A cost-based optimizer on the processing platform should be able to determine the best level of parallelism for data access from Aerospike for a given query.

Aerospike is the real-time database built for infinite scale, speed, and savings. Our customers are ready for what’s next with the lowest latency and the highest throughput data platform. Cloud and AI-forward, we empower leading organizations like Adobe, Airtel, Criteo, Experian, and PayPal.
Learn More
The latest from Aerospike
TRENDING STORIES
Neel Phadnis is the director of developer engagement at Aerospike. He is a technologist with leadership experience in building innovative products and bringing them to market. He has held senior engineering management roles at Tealeaf, Efficient Frontier, AOL and Netscape....
Read more from Neel Phadnis
Aerospike sponsored this post.
SHARE THIS STORY
TRENDING STORIES
TNS owner Insight Partners is an investor in: Pragma.
SHARE THIS STORY
TRENDING STORIES
TNS DAILY NEWSLETTER Receive a free roundup of the most recent TNS articles in your inbox each day.
The New Stack does not sell your information or share it with unaffiliated third parties. By continuing, you agree to our Terms of Use and Privacy Policy.