
URL: https://thenewstack.io/how-pinterest-tuned-memcached-for-big-performance-gains/

How Pinterest Tuned Memcached for Big Performance Gains - The New Stack



May 16th, 2022 12:47pm by Jessica Wachtel

The Pinterest social media site uses a memcached-based cache-as-a-service platform to slash application latency across the board, minimize the overall cloud cost footprint, and meet strict site-wide availability targets.

Pinterest’s memcached-based caching system is massive: more than 5,000 dedicated EC2 instances from Amazon Web Services, spanning a variety of instance types optimized for compute, memory, and storage. As a whole, the fleet serves up to approximately 180 million requests per second and approximately 220 GB/s of network throughput over a ~460 TB active in-memory and on-disk dataset, partitioned among approximately 70 distinct clusters.

With a fleet that large, any optimization, no matter how small, has a magnified effect. Because each copy of memcached runs as the exclusive workload on its respective virtual machine, Pinterest engineers were able to fine-tune the VM Linux kernels to prioritize CPU time for caching. One configuration change reduced latency by up to 40% and smoothed out performance overall.

Also, by observing data patterns and breaking them into workload classes, Pinterest engineers were able to dedicate EC2 instances to specific workloads, driving down costs and improving performance.

In a blog entry posted last week, Kevin Lin, Pinterest software engineer for storage and caching, discussed some of what the company has learned running memcached at scale in production. Here are a few of the highlights.

High-level description of the available surface area for performance optimization for memcached running on virtual machines in public cloud environments

The Scientific Theory at Work

To identify possible areas for optimization, Pinterest set up a controlled environment to create structured, reproducible workloads for testing and evaluation. The following metrics and tools allowed for high-level testing with minimal impact on critical-path production traffic over the years.

  • Server-side metrics: request throughput, network throughput, resource utilization, and hardware-level parameters (NIC statistics like per-queue packet throughput and EC2 allowance exhaustion, disk response times and in-flight I/O requests, etc.).
  • Client-side metrics: cache request percentile latency, timeout and error rates, and per-server availability (SLIs), as well as top-level application performance indicators like RPC P99 response time.
  • Synthetic load generation: This practice detects performance improvements or regressions under maximum load. Pinterest used memtier_benchmark, an open source tool, to generate load against a memcached cluster.
  • Production shadow traffic: This is the process of mimicking real production traffic to evaluate system performance at scale. Pinterest uses Facebook’s mcrouter, an open source memcached-protocol routing proxy deployed as a client-side sidecar in the Pinterest fleet.
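To make the client-side percentile metrics concrete, here is a minimal sketch of how a P99 figure could be derived from raw request latencies. It uses the nearest-rank method, which is an assumption on our part; the article does not say how Pinterest computes its percentiles, and the function name and sample values are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 98 fast requests and 2 slow ones.
latencies_ms = [1.0] * 98 + [25.0, 40.0]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(f"P50={p50} ms, P99={p99} ms")
```

With these made-up samples the median stays at 1 ms while the P99 lands on one of the slow requests, which is why tail percentiles rather than averages are the natural place to look for the latency spikes discussed later in the article.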

The Results Are in

Cloud Hardware

Pinterest divided its diverse collection of workloads, at a high level, into the workload classes below. Each class was assigned to a fixed pool of optimized EC2 instance types to allow for vertical scaling and greater cost efficiency. Horizontal scaling is always available to alleviate bottlenecks as they arise, but it is not the most cost-effective method, so Pinterest was more interested in vertical scalability.

Workload Classes:

  • Throughput (compute)
  • Data Volume (memory and/or disk capacity)
  • Data Bandwidth (network and compute)
  • Latency Requirement (compute)

The following workload profiles were created from the classes above:

  • Moderate Throughput, Moderate Data Volume | r5
  • High Throughput, Low Data Volume | c5
  • High Data Volume, Relaxed Latency Requirement | r5d
  • Massive Data Volume, Relaxed Latency Requirement | i3, i3en

EC2 instance type distribution for the Pinterest memcached fleet

Selecting the EC2 instances, and determining which cluster would go on which instance type, mainly boiled down to three criteria: CPU, memory size, and disk speed.
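The workload profiles listed earlier can be sketched as a simple lookup table. The pairings of profile and instance family come from the article; the dictionary, the function name, and the string normalization are hypothetical and only illustrate the idea of a fixed pool of instance types per profile.

```python
# Profile-to-instance-family pairs as listed in the article.
INSTANCE_FAMILIES_BY_PROFILE = {
    "moderate throughput, moderate data volume": ("r5",),
    "high throughput, low data volume": ("c5",),
    "high data volume, relaxed latency requirement": ("r5d",),
    "massive data volume, relaxed latency requirement": ("i3", "i3en"),
}


def instance_families_for(profile):
    """Return the EC2 instance families for a workload profile.

    The lookup ignores letter case and repeated whitespace.
    """
    key = " ".join(profile.lower().split())
    try:
        return INSTANCE_FAMILIES_BY_PROFILE[key]
    except KeyError:
        raise ValueError(f"unknown workload profile: {profile!r}") from None


print(instance_families_for("High Throughput, Low Data Volume"))
```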

Breaking down data patterns into workload classes and dedicating specific EC2 instances to each workload class drives down costs while improving I/O performance. Adding extstore, a memcached extension that stores data on disk, improves storage efficiency while proportionately reducing the cloud cluster footprint.

The instances are configured with Linux software RAID at level RAID0 to combine multiple hardware block devices into a single logical disk for user space consumption. Because RAID0 stripes reads and writes evenly across two disks, it doubles the maximum theoretical I/O throughput, with a best-case two-fold reduction in effective disk response time, at the cost of a lower mean time to failure (MTTF).

Pinterest makes it clear that increased hardware performance for extstore, at the price of a higher theoretical failure rate, is a highly worthwhile tradeoff. The infrastructure runs on a public cloud, so it is self-healing, and mcrouter is able to handle server changes immediately.
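The RAID0 tradeoff can be illustrated with back-of-the-envelope arithmetic. The sketch below assumes ideal striping and independent disks with a constant failure rate, so the stripe fails as soon as any one disk fails; the function name and the per-disk figures are hypothetical and do not come from Pinterest.

```python
def raid0_estimates(disks, per_disk_throughput_mbps, per_disk_mttf_hours):
    """Best-case RAID0 estimates under the stated assumptions.

    Throughput scales linearly with the number of striped disks, while
    the array's MTTF shrinks by the same factor because losing any one
    disk loses the whole stripe.
    """
    if disks < 1:
        raise ValueError("need at least one disk")
    return {
        "max_throughput_mbps": disks * per_disk_throughput_mbps,
        "array_mttf_hours": per_disk_mttf_hours / disks,
    }


# Hypothetical per-disk figures, chosen only to make the ratios visible.
estimate = raid0_estimates(
    disks=2,
    per_disk_throughput_mbps=1000,
    per_disk_mttf_hours=1_000_000,
)
print(estimate)
```

Under these assumptions, two striped disks give twice the theoretical throughput and half the MTTF.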

Computing

Pinterest defines “compute efficiency” as the additional rate of requests that can be serviced by a single instance for each percentage point increase in instance CPU usage, without increasing request latency. By this definition, optimizing compute efficiency means allowing memcached to serve a higher request rate at lower CPU usage without changing its latency characteristics.
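As a rough illustration of that definition, compute efficiency can be estimated from two measurements taken at the same latency level: the extra requests per second divided by the extra percentage points of CPU. The function name and all numbers below are hypothetical.

```python
def compute_efficiency(rps_low, cpu_pct_low, rps_high, cpu_pct_high):
    """Additional requests per second served for each extra percentage
    point of CPU usage between two measurements at equal latency."""
    delta_cpu = cpu_pct_high - cpu_pct_low
    if delta_cpu <= 0:
        raise ValueError("the second measurement must use more CPU")
    return (rps_high - rps_low) / delta_cpu


# Hypothetical instance before tuning: 100k rps at 20% CPU, 250k rps at 50%.
before = compute_efficiency(100_000, 20, 250_000, 50)
# Hypothetical instance after tuning: same request rates at 16% and 41% CPU.
after = compute_efficiency(100_000, 16, 250_000, 41)
print(before, after)
```

A higher value after tuning means each percentage point of CPU buys more request throughput, which is what allows clusters to be downsized without giving up serving capacity.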

Half of all caching workloads at Pinterest are compute-bound (purely request throughput-bound). Pinterest’s goal was to downsize clusters without compromising serving capacity.

Because the majority of Pinterest’s workloads run on dedicated EC2 virtual machines, there is a unique opportunity to optimize at the hardware-software boundary, whereas past optimizations centered on hardware modifications.

Memcached is somewhat unique among stateful data systems at Pinterest in that it is the exclusive primary workload, with a static set of long-lived worker threads, on every EC2 instance on which it is deployed. Because of this, Pinterest tuned process scheduling to ask the kernel to prioritize CPU time for memcached, deliberately withholding CPU time from other processes on the host, like monitoring daemons.

This involves running memcached under a real-time scheduling policy, SCHED_FIFO, with a high priority. That instructs the kernel to, effectively, let memcached monopolize the CPU by preempting (essentially starving) all non-real-time processes whenever a memcached thread becomes runnable. Below is an example invocation of memcached under a SCHED_FIFO real-time scheduling policy:

$ sudo chrt --fifo <priority> memcached …
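For deployment tooling, the same invocation could be assembled programmatically. The sketch below only builds the argument list and does not launch anything; the helper name, the priority of 50, and the memcached flags shown (-m for memory in megabytes, -t for thread count) are illustrative assumptions, not Pinterest's actual settings.

```python
import shlex

# On Linux, SCHED_FIFO real-time priorities range from 1 (lowest) to 99.
FIFO_MIN_PRIORITY = 1
FIFO_MAX_PRIORITY = 99


def build_chrt_command(priority, memcached_args):
    """Build an argument list that starts memcached under SCHED_FIFO."""
    if not FIFO_MIN_PRIORITY <= priority <= FIFO_MAX_PRIORITY:
        raise ValueError("SCHED_FIFO priority must be between 1 and 99")
    return ["chrt", "--fifo", str(priority), "memcached", *memcached_args]


# Hypothetical values: priority 50, 1024 MB of cache memory, 8 threads.
command = build_chrt_command(50, ["-m", "1024", "-t", "8"])
print(shlex.join(command))
```

The resulting list can be handed to a process launcher with root privileges. On Linux, Python's os.sched_setscheduler offers an in-process alternative for changing the policy of an already running process.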

This one-line change drove client-side P99 latency down by 10% to 40% after rollout to all compute-bound clusters and eliminated spurious spikes in P99 and P999 latency across the board. The steady-state CPU usage ceiling was raised 20% without introducing latency regressions, and the optimization shaved 10% off memcached’s total fleet-wide cost footprint.

Week-over-week comparison of client-side P99 cache latency for one service while real-time scheduling was rolled out to its corresponding dedicated memcached cluster.

Ratio of time spent by memcached waiting for execution by the kernel versus wall clock time, before and after real-time scheduling was enabled (data is collected from schedstat in the /proc filesystem).
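A ratio like the one in that chart could be derived from two schedstat snapshots. On Linux, a schedstat line in /proc holds three fields: time spent on the CPU in nanoseconds, time spent waiting on a run queue in nanoseconds, and the number of timeslices run. The sketch below is an assumption about how such a ratio might be computed, not a description of Pinterest's collection pipeline; the function names and sample values are hypothetical.

```python
def parse_schedstat(line):
    """Parse a schedstat line into (on_cpu_ns, run_queue_wait_ns, timeslices)."""
    on_cpu_ns, wait_ns, timeslices = (int(field) for field in line.split())
    return on_cpu_ns, wait_ns, timeslices


def wait_ratio(before_line, after_line, wall_clock_ns):
    """Fraction of wall clock time spent waiting for the CPU between
    two snapshots taken wall_clock_ns nanoseconds apart."""
    if wall_clock_ns <= 0:
        raise ValueError("wall_clock_ns must be positive")
    wait_before = parse_schedstat(before_line)[1]
    wait_after = parse_schedstat(after_line)[1]
    return (wait_after - wait_before) / wall_clock_ns


# Hypothetical snapshots taken one second apart.
ratio = wait_ratio(
    "5000000000 200000000 1000",
    "5600000000 250000000 1200",
    wall_clock_ns=1_000_000_000,
)
print(ratio)
```

In this made-up sample the process waited 50 ms out of one second of wall clock time.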

Stabilization of spurious latency spikes after real-time scheduling was enabled on a canary host (red-colored series).

Jessica Wachtel is a developer marketing writer at InfluxData where she creates content that helps make the world of time series data more understandable and accessible. Jessica has a background in software development and technical journalism.
Amazon Web Services is a sponsor of The New Stack.