![]() |
VOOZH | about |
We’re so glad you’re here. You can expect all the best TNS content to arrive Monday through Friday to keep you on top of the news and at the top of your game.
Check your inbox for a confirmation email where you can adjust your preferences and even join additional groups.
Follow TNS on your favorite social media networks.
Become a TNS follower on LinkedIn.
Check out the latest featured and trending stories while you wait for your first TNS newsletter.
The Pinterest social media site uses a memcached-based cache-as-a-service platform to slash application latency across the board, minimizing the overall cloud cost footprint, and meet the strict site-wide availability targets.
Pinterest’s memcached-based caching system is massive — over 5,000 EC2 dedicated instances from Amazon Web Services, that span a variety of instance types optimized for compute, memory, and storage dimensions. As a whole, the fleet serves up to approximately 180 million requests per second and approximately 220 GB/s of network throughput over a ~460TB active in-memory and on-disk dataset, partitioned among approximately 70 distinct clusters.
With that large of a fleet, any optimization, no matter how small has a magnified effect. Because each copy of memcached runs as the exclusive workload on its respective virtual machine, Pinterest engineers were able to fine-tune the VM Linux kernels to prioritize the CPU time for caching. One configuration change reduced latency by up to 40% and smoothed over performance overall.
Also, by observing data patterns and breaking them into workload classes, Pinterest engineers to dedicate EC2 instances for specific workloads, driving down costs and improving performance.
Kevin Lin, Pinterest software engineer for storage and caching, in a blog entry posted last week, discussed some of what the company has learned running memcached at scale in production. Here are a few of the highlights.
High-level description of the available surface area for performance optimization for memcached running on virtual machines in public cloud environments
In order to identify possible areas for optimization, Pinterest set up a controlled environment to create structured, reproducible workloads for testing and evaluating. The following variables allowed for high-level testing with minimal impact to critical-path production traffic over the years.
Pinterest divided its diverse collection of workloads on a high level into the workload classes below. Each class was designated to a fixed pool of optimized EC2 instance types to allow for vertical scaling for greater cost efficiency. Since horizontal scaling is available anytime to alleviate bottlenecks as they arise but is not the most cost-effective method. Pinterest was more interested in Vertical scalability.
Workload Classes:
The following workload profiles were created from the classes above:
EC2 instance type distribution for the Pinterest memcached fleet
When selecting the EC2 instances, and determining which cluster will go to which instance it mainly boiled down to these criteria: CPU, memory size, and disk speed.
Breaking down data patterns into workload classes and dedicating specific EC2 instances for each workload class drives down costs while improving I/O performance. Adding memcached add-on extstore improves storage efficiency while proportionately reducing the cloud cluster footprint.
The instances are configured with Linux software RAID at level RAID0 to combine multiple hardware clock devices into a single, logical disk for user space consumption. Pinterest stripes reads and writes evenly across two disks thus RAID0 doubles the maximum theoretical I/O throughput with a best-case two-fold reduction in effective disk response time with a troubled MTTF.
Pinterest makes it clear that this increased hardware performance for extstore at an increased theoretical failure rate is a highly worthwhile tradeoff. The infrastructure is on a public cloud so it is self-healing and mcrouter is able to handle server changes immediately.
The memcached docs define “compute efficiency” as the additional rate of requests that can be serviced by a single instance for each percentage point increase in instance CPU usage, without increasing request latency. By this definition, optimizing compute efficiency is measurable in terms of allowing memcached to serve a higher request rate at lower CPU usage without changing the latency characteristics.
Half of all caching workloads at Pinterest are compute-bound (purely request throughput-bound). Pinterest’s goal was to downsize clusters without compromising serving capacity.
With the majority of Pinterest’s workloads running on dedicated EC2 virtual machines, this opens up a unique opportunity to optimize at the hardware-software boundary while in the past optimizations centered around hardware modifications.
Memcached is somewhat unique among stateful data systems at Pinterest in that it is the exclusive primary workload, with a static set of long-lived worker threads, on every EC2 instance on which it is deployed. Because of this, Pinterest tuned the process scheduling to request the kernel prioritize the CPU time for memcached at the expense of deliberately withholding CPU time from other processes on the host, like monitoring daemons.
This involves running memcached under a real-time scheduling policy, SCHED_FIFO, with a high priority — instructing the kernel to, effectively, allow memcached to monopolize the CPU by preempting (essentially starving) all non-real-time processes whenever a memcached thread becomes runnable. Below is an example invocation of memcached under a SCHED_FIFO real-time scheduling policy
$ sudo chrt — — fifo <priority> memcached …
This one line change drove client-side P99 latency down by 10% to 40% after rollout to all compute-bound clusters and eliminated spurious spikes in P99 and P999 latency across the board. The steady-state operation CPU usage ceiling was raised 20% without introducing latency regressions and 10% of memcached’s total fleet-wide cost footprint was shaved thanks to this optimization.
Week-over-week comparison of client-side P99 cache latency for one service while real-time scheduling was rolled out to its corresponding dedicated memcached cluster.
Ratio of time spent by memcached waiting for execution by the kernel versus wall clock time, before and after real-time scheduling was enabled (data is collected from schedstat in the /proc filesystem).
Stabilization of spurious latency spikes after real-time scheduling was enabled on a canary host (red-colored series).