VOOZH about

URL: https://thenewstack.io/amazon-to-save-millions-moving-from-apache-spark-to-ray/

⇱ Amazon to Save Millions Moving From Apache Spark to Ray - The New Stack


TNS
SUBSCRIBE
Join our community of software engineering leaders and aspirational developers. Always stay in-the-know by getting the most important news and exclusive content delivered fresh to your inbox to learn more about at-scale software development.
REQUIRED
It seems that you've previously unsubscribed from our newsletter in the past. Click the button below to open the re-subscribe form in a new tab. When you're done, simply close that tab and continue with this form to complete your subscription.
The New Stack does not sell your information or share it with unaffiliated third parties. By continuing, you agree to our Terms of Use and Privacy Policy.
Welcome and thank you for joining The New Stack community!
Please answer a few simple questions to help us deliver the news and resources you are interested in.
REQUIRED
REQUIRED
REQUIRED
REQUIRED
REQUIRED
Great to meet you!
Tell us a bit about your job so we can cover the topics you find most relevant.
REQUIRED
REQUIRED
REQUIRED
REQUIRED
REQUIRED
Welcome!

We’re so glad you’re here. You can expect all the best TNS content to arrive Monday through Friday to keep you on top of the news and at the top of your game.

What’s next?

Check your inbox for a confirmation email where you can adjust your preferences and even join additional groups.

Follow TNS on your favorite social media networks.

Become a TNS follower on LinkedIn.

Check out the latest featured and trending stories while you wait for your first TNS newsletter.

PREV
1 of 2
NEXT
VOXPOP
As a JavaScript developer, what non-React tools do you use most often?
Angular
0%
Astro
0%
Svelte
0%
Vue.js
0%
Other
0%
I only use React
0%
I don't use JavaScript
0%
Thanks for your opinion! Subscribe below to get the final results, published exclusively in our TNS Update newsletter:
NEW! Try Stackie AI
From clobbered drafts to real-time sync
Apr 14th 2026 10:00am, by David Moore
TypeScript 6.0 RC arrives as a bridge to a faster future
Mar 14th 2026 9:00am, by Darryl K. Taft
Mastra empowers web devs to build AI agents in TypeScript
Jan 28th 2026 11:00am, by Loraine Lawson
2024-11-06 07:00:02
Amazon to Save Millions Moving From Apache Spark to Ray
Data

Amazon to Save Millions Moving From Apache Spark to Ray

By moving data lake table compaction chores from Apache Spark to the Python-based Ray, Amazon found that they could be executed 82% more efficiently. A report from ATO 2024.
Nov 6th, 2024 7:00am by Joab Jackson
👁 Featued image for: Amazon to Save Millions Moving From Apache Spark to Ray
Photo: Patrick Ames, ATO 2024 (TNS photo).

For an e-tailer as large as Amazon, even small performance improvements reap significant savings.

By replicating data lake table compaction chores from Apache Spark to the Python-based Ray in the first quarter of 2024, the company found that they could be executed 82% more efficiently.

Given compaction is an essential feature for users of its in-house business intelligence services, the e-tailer may be able to save over 220,000 years of EC2 vCPU computing time. From a typical AWS customer’s perspective, this would work out to saving around $100 million annually in Amazon EC2 on-demand R6g instance charges.

Yes, Amazon uses that much BI internally.

Anyscale: New Optimized Runtime for Ray, Kubernetes Operator

Amazon Principal Engineer Patrick Ames discussed — in a talk at the All Things Open 2024 conference, held last week In Raleigh, N.C. — its migration to Ray.

His message? Ray is not just for building a machine learning pipeline; its current favored use.

“Ray, at its core, is a very general-purpose distributed compute framework that I’d argue could be good at almost any distributed systems area where you decide to magnify and focus its attention,” said Ames, who is also a contributor to the Ray project.

The Need to Compact

The move to Ray expedites one of Amazon’s most costly operations, compaction. Whenever a data lake table format like the Apache Iceberg or Apache Hudi offers a copy-on-write or merge-on-read functionality, eventually, it will use compaction to reconcile updates to the table.

Previously, Apache Spark did the job.

In his talk, Ames described how Amazon went from using a giant Oracle Data Warehouse in 2016 to running its own fully scalable exabyte-scale data lakehouse while maintaining ACID compliance. The idea was to decouple storage from compute, so database tables could be stored in S3 buckets and users could bring their own query engines.

Initially, the tables were updated through append-only statements, and soon, they grew too unwieldy for even the most robust platform to ingest.

So they put Spark to work rooting out duplicates, a job that, as it turns out, can be quite tricky.

“Finding duplicates is a simple enough problem in theory, but it turns out to get a little nasty when your data starts growing to petabyte scale and beyond, you can no longer fit on a single node,” Ames said.

The Business Data Technologies office within Amazon them looked to Ray for further optimization.

Like Spark, Ray came from the University of California Berkeley. A few of the researchers who worked on Ray went on to start Anyscale, which offers commercial support for the platform,

Ray had already found a home in the Amazon data scientists thanks to its Pythonic APIs and ability to work with large data sets. Pandas are great for single node data sets, but building a data pipeline for a terabyte of data can be difficult, Ames explained.

This is where Ray comes in.

Basically, you take any Python application that could be parallelizable, put annotations on it for tasks that are distributed functions and actors that are distributed classes, and you can now deploy that code to an arbitrarily large cluster, and it’ll manage a lot of cluster scaling for you,” Ames explained.

👁 Chart: Business Data Technologies

How the Business Data Technologies uses Ray to provision clusters (Amazon).

At Amazon, the technology has the possibility of one day being the “unified” compute framework for all Amazon’s data pipelines, Ames said.

Amazon BI

Amazon’s internal data lake has “tens of thousands of users,” not only from AWS business analysts but also from partners as well.

Amazon’s Ray compactor currently runs over 25,000 jobs a day, requiring approximately 1.5 million EC2 vCPUs each day. About 40PBs of Apache Arrow data are merged each day, with a cost of about $0.59/TB.

The internal customers are charged per byte for the data they consume from the catalog, which pays for the maintenance of the data catalog. Surprisingly, the largest change in the costs comes from compaction.

“So we just kept throwing money at Spark, and the data sets grew in size,” Ames said, adding that it was not “the most elegant solution.”

The team investigated Ray, found promise there, and re-engined its compaction algorithm to run on the platform. They also investigated how Ray’s data science tools could help with data quality. A lot of Amazon’s code base is in Java, so there was considerable work that was done to create links to Ray’s Python API.

To date, Amazon has been running Spark and Ray compaction jobs in parallel to ensure consistency. This year, however, Spark will be mothballed and all operations will be moved to Ray.

Results From Working With Ray

Early results showed a definite performance advantage for Ray.

The business unit had found that Spark compacted a GB of data in about 1/2 minute on an Amazon Web Services‘ EC2 instance. But it would take Ray about a tenth of a minute, leading to what Ames characterized as an 82% efficiency gain based on data from Q1 2024.

👁 Chart of Ray vs. Spark in terms of efficiency (Amazon)

Ray vs. Spark in terms of efficiency (Amazon).

For these jobs, they also found that the Ray compactor consumed about 55% of a cluster’s total available  memory, which Ames admitted was less than optimal, preferring to get it up to 80% or so. Each cluster of servers collectively provides about 36TB of available memory.

👁 Chart for Ray memory utilization.

One area of concern was reliability, which could, in the words of Ames, “burn into your cost advantage” through the additional costs of rerunning jobs. Initially, in October of 2023, Ray’s first attempt to compact a table only succeeded85% of the time.

Again, not ideal, though by February of 2024, the team upped that to 99.15% which was closer to Spark’s 99.91%.

👁 Chart: Ray vs. Apache in terms of reliability

Ray vs. Spark in terms of reliability (Amazon).

When the migration is complete, it is projected to reduce its computational needs by roughly 220,000 years of vCPU time annually, which translates to about $100 million, in terms of the typical AWS customer’s Amazon EC2 on-demand R6g instance charges.

The Future of Ray

Spark still has some advantages, Ames concluded. It still has more general-purpose data processing features than Ray. For instance, Ray still doesn’t have an easy interface for SQL. And so, some customization is still inevitable.

“You can’t just throw a Spark job on Ray and expect to get these kinds of performance gains,” Ames said.

The project team also plans to adapt the compaction algorithm to be used with Apache Iceberg, which they hope to release in 2025.

“If any of you are writing iceberg tables using Apache Flink and then trying to read them back with Spark or something else, it should really improve that process a lot,” Ames said.

But overall, Ray is worth a serious look for large scale data operations.

“Ray core is flexible enough to let you craft very optimal solutions to very specific problems,” Ames said.  “If you can focus on a particularly onerous and expensive problem that your organization has, it is probably a good area — if your business is willing to invest with you — to turn Ray’s magnifying glass onto that problem.”

Enjoy the full presentation here:

Update (11/08/2024): An earlier edition of this post incorrectly stated that Ray is an Apache project. It is not, but rather it is open source under the Apache license. 

TRENDING STORIES
Joab Jackson is a senior editor for The New Stack, covering cloud native computing and system operations. He has reported on IT infrastructure and development for over 30 years, including stints at IDG and Government Computer News. Before that, he...
Read more from Joab Jackson
SHARE THIS STORY
TRENDING STORIES
Amazon Web Services is a sponsor of The New Stack. 
SHARE THIS STORY
TRENDING STORIES
TNS DAILY NEWSLETTER Receive a free roundup of the most recent TNS articles in your inbox each day.
The New Stack does not sell your information or share it with unaffiliated third parties. By continuing, you agree to our Terms of Use and Privacy Policy.