
URL: https://thenewstack.io/how-slack-transformed-cron-into-a-distributed-job-scheduler/

How Slack Transformed Cron into a Distributed Job Scheduler - The New Stack



How Slack Transformed Cron into a Distributed Job Scheduler

With help from Kubernetes, Golang and Kafka, Slack's crontab drives 2,000 tasks an hour. Monster Scale Summit had all the details.
Mar 14th, 2025 6:18am by Joab Jackson

For a decade, Slack ran its cron jobs on a single server. But when the server started having issues and chewing up maintenance time, the company’s admins knew a more resilient job scheduler was needed. So, they turned cron into a distributed system.

In a talk at ScyllaDB’s Monster Scale Summit, held virtually last week, Claire Adams, infrastructure software engineer for Slack, described how the collaboration service provider turned the Unix scheduling utility cron into a distributed service.

“People were really fed up with dealing with this one cron box. Nobody really wanted to maintain it. It had been set a long time ago,” Adams said. “People had some legacy knowledge, but nobody really had total knowledge on all the quirky stuff with it.”

“We needed something more reliable.”

Cron for One

As every hardcore Linux user knows, cron is a time-based job scheduler that lets admins run scripts and apps at specific times and dates by scheduling them in a file called the crontab.

As you can imagine, Slack, with over 38 million daily users, has plenty of tasks to run.

Overall, Slack has about 385 cron scripts that collectively execute about 2,000 jobs an hour, which tallies up to roughly 340,000 jobs a week, or 20 million a year.

For Slack, cron handles tasks that power both user features, such as reminders and email notifications, and back-end maintenance duties, such as cleaning databases and running analytics jobs.

For the first 10 years of Slack, cron was run from a single crontab, running on a single server on Amazon Web Services.

The system had its limitations, however. Especially tricky were software updates, which were done by duplicating the service on another server and then switching over quickly enough not to miss any scheduled jobs.

The final straw, however, was that in its last year, the cron server kept stumbling from errant out-of-memory errors, necessitating manual remediation. More downtime.

“We can’t have a lot of incidents that might impact users. We need to be more reliable and more stable as a product,” Adams said. “So that led us to this rewrite.”


A Distributed Replacement

Clearly, a scheduling system distributed across multiple servers would be needed. By moving to a distributed system, the company hoped to increase reliability, reduce maintenance windows, and gain more insight into the jobs that were run.

There were different approaches the team could take. For instance, Slack is a big Kubernetes user, so the team investigated using Kubernetes’ own built-in cron facility, CronJob. This approach, however, would have required spinning up 53,000 pods a day and would have been difficult to debug. It would also have required users to rewrite their scripts. So, major hassle.

Nonetheless, “it made sense for us to use the technologies that we had already invested in,” she said.

It Helps To Have a Monster Job Execution Service

In the end, it took three different components to replace the once-mighty cron box.

The system would continue to use cron, which would run the cron scripts without modification. But instead of running the jobs in its own memory, cron would hand them off to a separate job execution engine.


As it happened, the company had already built and maintained an asynchronous computing platform, or job execution service. Based on Kubernetes and written in the Go programming language, it was a beast, executing 10 billion jobs a day.

But, remarkably enough, it did not have a scheduler. Just queues. Cron itself is a very good scheduler. And fortunately, Go has a cron library that could be used. This meant none of the cron scripts would have to be rewritten.

In this setup, all the cron jobs get their own dedicated queue. Each script is wrapped as a job so it can be executed.

Queuing is done through Kafka, with each job getting its own Kafka topic. An AWS EC2 instance actually executes the work.
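
The hand-off described here, in which each script is wrapped as a job and placed on its own queue, can be sketched in a few lines of Go. This is a rough illustration under stated assumptions, not Slack's implementation: buffered channels stand in for Kafka topics, and the `Job` fields, the `Broker` type and the `cron.<script>` topic naming scheme are all hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// Job wraps one cron script invocation so a worker can execute it later.
type Job struct {
	Script string // name of the cron script (hypothetical example names)
	RunAt  string // scheduled time, kept as a string for simplicity
}

// Broker is an in-memory stand-in for Kafka: one buffered channel per topic.
type Broker struct {
	mu     sync.Mutex
	topics map[string]chan Job
}

// NewBroker returns a broker with no topics yet.
func NewBroker() *Broker {
	return &Broker{topics: make(map[string]chan Job)}
}

// topicFor derives a per-script topic name. The naming scheme is an assumption.
func topicFor(script string) string {
	return "cron." + script
}

// topic returns the queue for a topic name, creating it on first use.
func (b *Broker) topic(name string) chan Job {
	b.mu.Lock()
	defer b.mu.Unlock()
	ch, ok := b.topics[name]
	if !ok {
		ch = make(chan Job, 16)
		b.topics[name] = ch
	}
	return ch
}

// Enqueue is what the scheduler calls instead of running the script itself.
// It fails rather than blocks when the script's queue is full.
func (b *Broker) Enqueue(j Job) error {
	select {
	case b.topic(topicFor(j.Script)) <- j:
		return nil
	default:
		return fmt.Errorf("queue for %s is full", j.Script)
	}
}

// Drain plays the role of a worker: it pulls every pending job off one
// script's queue, runs it, and returns how many jobs it handled.
func (b *Broker) Drain(script string, run func(Job)) int {
	ch := b.topic(topicFor(script))
	n := 0
	for {
		select {
		case j := <-ch:
			run(j)
			n++
		default:
			return n
		}
	}
}
```

The point of the design is that the scheduler only calls `Enqueue` and returns immediately; whatever consumes the queue does the heavy lifting, so a slow or memory-hungry script cannot take the scheduler down with it.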

Because the cron server was not executing the scripts in its own memory, it could still run on a single server.

The design team initially considered spreading the scripts across multiple cron servers, but this would have added a lot of complexity in determining which server should run which script.

Instead, the team went with another approach: leader election with locking.

Instead of having each server execute some of the scripts, a leader server executed all the scripts, handing them off to the job engine. Backup servers stood ready to take over quickly should the primary server fail.


The last piece of the system was a database that would track how the scripts executed. Typically, this information is found in the cron logs recorded on the server, though these are difficult to track down and parse.

Wouldn’t it be better to have a centralized portal where all the statuses were kept, providing information such as when a job last ran and whether it was successful? This is the role of the database.

More Components But Easier to Manage

Adding a few more cron scripts to a 10-billion-jobs-a-day monster job executor proved to be no problem. As a bonus, it was a mature, fully supported system.

“We already had invested years to making this job queue system very reliable, very scalable, have good guarantees and have a good on-call rotation and good maintenance,” Adams said, recalling the reasoning of the moment. “So if we can just leverage that, [it would] make our lives a ton easier.”

It’s been about a year since Slack migrated to its distributed cron. The new system has thus far successfully executed over six million scripts. Even better, it has reduced the team’s on-call burden, relieving admins from resetting a server each time it’s befouled by a memory error.

“Even though there are more components, it is easier to maintain,” Adams said.

Adams’ takeaway? Use what you have. In their case, it was an existing job queue, Golang and Kubernetes. “You decrease the maintenance burden while getting huge-scale wins,” she said.

And even the lowly cron box held a lesson or two.

“Slack ran key functionality for 10 years on one node. That’s a long time to deal with this less-than-ideal system. But it was good enough. It got the job done. And I think that is really a key takeaway,” she said. “It’s okay to keep it really simple, even if it’s kind of janky, for a long time.”

“And then, when you’re fed up, you can try something better.”


Joab Jackson is a senior editor for The New Stack, covering cloud native computing and system operations. He has reported on IT infrastructure and development for over 30 years, including stints at IDG and Government Computer News. Before that, he...
Amazon Web Services and ScyllaDB are sponsors of The New Stack. 