Slack Eliminates SSH in EMR Pipelines, Migrates 700+ Jobs to REST-Based Architecture
Jun 12, 2026 2 min read
Slack has completed a large-scale modernization of its data platform by replacing SSH-based job execution with a REST-driven orchestration layer across its Amazon EMR pipelines. The migration removed direct SSH access to production clusters and shifted more than 700 Airflow operators to a centralized job submission system, aiming to improve security, reliability, and observability across eight data regions.
Slack’s data platform previously relied on Airflow operators that executed jobs by opening SSH connections directly to Amazon EMR master nodes. While simple initially, this approach became harder to scale as hundreds of production workflows, including search indexing and analytics pipelines, began depending on it. By 2024, SSH-based execution was widely used across production clusters, introducing operational and security concerns.
The primary challenge was an expanded attack surface due to direct production access. SSH key distribution and rotation increased operational overhead, while execution auditing required correlating logs across multiple systems. Reliability also suffered, with jobs sometimes continuing after connection drops or failing silently under infrastructure instability.
To address these issues, Slack introduced a REST-based job submission model built on an internal orchestration layer called Quarry. Instead of persistent SSH sessions, Airflow now submits jobs through HTTP APIs. Each job follows a server-side lifecycle with submission, tracking via job IDs, and controlled cancellation, decoupling execution from client connectivity and improving centralized observability and control.
Before and after architecture comparison (Source: Slack Blog Post)
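The submit, track, and cancel lifecycle described above can be sketched in a few lines. Quarry's API is not public, so everything here is hypothetical: `InMemoryJobService` stands in for the REST service, the routes in the comments are invented for illustration, and the state names are assumptions. The point is only that job state lives on the server and the client holds nothing but a job ID.

```python
import itertools
from enum import Enum


class JobState(Enum):
    SUBMITTED = "SUBMITTED"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    CANCELLED = "CANCELLED"


TERMINAL_STATES = {JobState.SUCCEEDED, JobState.CANCELLED}


class InMemoryJobService:
    """Hypothetical stand-in for a REST job service; each method mirrors one HTTP call."""

    def __init__(self):
        self._jobs = {}
        self._ids = itertools.count(1)

    def submit(self, command):
        # Would be something like POST /jobs (invented route).
        job_id = f"job-{next(self._ids)}"
        self._jobs[job_id] = {"command": command, "state": JobState.SUBMITTED, "polls": 0}
        return job_id

    def status(self, job_id):
        # Would be something like GET /jobs/{id}. The fake advances the job
        # on each poll to simulate server-side progress.
        job = self._jobs[job_id]
        if job["state"] in TERMINAL_STATES:
            return job["state"]
        job["polls"] += 1
        if job["polls"] >= 3:
            job["state"] = JobState.SUCCEEDED
        else:
            job["state"] = JobState.RUNNING
        return job["state"]

    def cancel(self, job_id):
        # Would be something like DELETE /jobs/{id}. Cancelling a finished job is a no-op.
        job = self._jobs[job_id]
        if job["state"] not in TERMINAL_STATES:
            job["state"] = JobState.CANCELLED
        return job["state"]


def wait_for_job(service, job_id, max_polls=10):
    """Poll by job ID until the job reaches a terminal state."""
    for _ in range(max_polls):
        state = service.status(job_id)
        if state in TERMINAL_STATES:
            return state
    raise TimeoutError(f"{job_id} did not finish within {max_polls} polls")
```

Because the client keeps only the job ID, a dropped connection between polls does not affect the job itself, and a restarted client can resume tracking with the same ID.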
The migration required additional engineering to support different workload types. While Spark and Hive workloads were transitioned using existing REST interfaces such as Livy and HiveServer2, a significant portion of workloads consisted of arbitrary shell commands. To support these cases, Slack used Apache Hadoop YARN’s Distributed Shell capability, which enables execution of shell commands inside managed containers with resource isolation and fault tolerance.
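For a sense of what these two submission paths look like, the sketch below builds a Livy batch request and the argument list for the stock YARN Distributed Shell client. The article does not describe how Slack wires either one into Quarry, so this is an illustration rather than their implementation: the host name, file paths, and jar location are placeholders. The `POST /batches` endpoint and the `-shell_command`, `-num_containers`, `-container_memory`, and `-container_vcores` options come from the Livy and Hadoop documentation.

```python
import json
import urllib.request


def build_livy_batch_request(livy_url, app_file, args=None, conf=None):
    """Build (but do not send) a POST /batches request for Apache Livy."""
    payload = {
        "file": app_file,            # application to run, e.g. a .py or .jar
        "args": list(args or []),    # command-line arguments for the application
        "conf": dict(conf or {}),    # Spark configuration overrides
    }
    return urllib.request.Request(
        url=f"{livy_url.rstrip('/')}/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_distributed_shell_argv(client_jar, shell_command, memory_mb=512, vcores=1):
    """Build the argv for the YARN Distributed Shell client.

    client_jar is the path to the hadoop-yarn-applications-distributedshell
    jar, which varies by installation and is left to the caller.
    """
    return [
        "yarn", "jar", client_jar,
        "org.apache.hadoop.yarn.applications.distributedshell.Client",
        "-jar", client_jar,
        "-shell_command", shell_command,
        "-num_containers", "1",
        "-container_memory", str(memory_mb),
        "-container_vcores", str(vcores),
    ]
```

Passing the request to `urllib.request.urlopen` would submit the batch. Livy's response includes a batch `id`, which can then be polled with `GET /batches/{id}/state` or cancelled with `DELETE /batches/{id}`.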
The migration was executed incrementally across development, staging, and production environments spanning eight data regions. Each region introduced additional complexity due to network segmentation and compliance constraints. During the transition, Slack identified several issues, including virtual memory enforcement behavior in YARN that had previously been obscured by SSH-based execution, as well as cross-account network connectivity gaps that revealed previously hidden dependencies between services.
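The article does not detail the virtual memory issue. One plausible reading is that commands run over SSH executed outside YARN containers and were never subject to the NodeManager's memory checks, whereas the same commands inside a container are. When `yarn.nodemanager.vmem-check-enabled` is on, YARN kills a container whose virtual memory exceeds its allocated physical memory multiplied by `yarn.nodemanager.vmem-pmem-ratio` (2.1 by default in stock Hadoop). The helper names below are illustrative, and the arithmetic is a minimal sketch of that rule:

```python
# Stock Hadoop default for yarn.nodemanager.vmem-pmem-ratio.
DEFAULT_VMEM_PMEM_RATIO = 2.1


def vmem_limit_mb(container_memory_mb, ratio=DEFAULT_VMEM_PMEM_RATIO):
    """Virtual memory ceiling YARN applies to a container of the given size."""
    return container_memory_mb * ratio


def exceeds_vmem_limit(container_memory_mb, observed_vmem_mb, ratio=DEFAULT_VMEM_PMEM_RATIO):
    """True if a process tree using observed_vmem_mb would trip the vmem check."""
    return observed_vmem_mb > vmem_limit_mb(container_memory_mb, ratio)
```

With the default ratio, a 1024 MB container has a virtual memory ceiling of roughly 2150 MB, so a shell command that maps 3000 MB of virtual memory would be killed even if its resident memory stays small.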
Sudip Ghosh, senior software engineer at Walmart, says:
"This isn't just a security win; it's a massive operational debt payoff. SSH is easy to start with, impossible to scale securely or audit consistently across a large organization."
Slack completed the migration over three quarters without downtime for critical workloads. The company eliminated SSH access across production EMR clusters, improved job reliability through server-side execution tracking in Quarry, and enhanced observability via structured logging and centralized metrics. The REST-based approach reduced coupling between Airflow and EMR and standardized job submission across teams, while also enabling downstream efforts such as Spark on Kubernetes preparation.
The rollout used phased operator deprecations and staged validations in each environment. Dashboards built on Airflow metadata tracked the remaining SSH-dependent workflows, and cross-team coordination helped reduce migration risk. Key lessons included discovering network topology early, validating resource limits across execution models, and communicating clearly with teams when operators were restricted.
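A dashboard of this kind can be driven by a simple query over Airflow's metadata database, whose `task_instance` table records the operator class of each task. The sketch below is an assumption about how such tracking might work, not Slack's actual query. It uses an in-memory SQLite table with only three of the real table's columns, invented DAG names, and the stock `SSHOperator` name in place of whatever custom operators Slack used.

```python
import sqlite3

# Assumption: the operator class names that count as "SSH-dependent".
SSH_OPERATORS = ("SSHOperator",)


def remaining_ssh_tasks(conn, operators=SSH_OPERATORS):
    """Return (dag_id, distinct SSH task count) pairs, largest first."""
    placeholders = ",".join("?" for _ in operators)
    query = f"""
        SELECT dag_id, COUNT(DISTINCT task_id) AS ssh_tasks
        FROM task_instance
        WHERE operator IN ({placeholders})
        GROUP BY dag_id
        ORDER BY ssh_tasks DESC, dag_id
    """
    return conn.execute(query, tuple(operators)).fetchall()


def demo_connection():
    """Build a tiny stand-in for the metadata database with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE task_instance (dag_id TEXT, task_id TEXT, operator TEXT)")
    conn.executemany(
        "INSERT INTO task_instance VALUES (?, ?, ?)",
        [
            ("search_index", "build", "SSHOperator"),
            ("search_index", "build", "SSHOperator"),  # second run of the same task
            ("search_index", "publish", "SSHOperator"),
            ("analytics", "aggregate", "SSHOperator"),
            ("analytics", "load", "PythonOperator"),
            ("billing", "export", "PythonOperator"),
        ],
    )
    return conn
```

Counting distinct task IDs rather than rows keeps repeated runs of the same task from inflating the total, so the number falls only when a task is actually migrated.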
About the Author
Leela Kumili
