Linux 7.2 Can Significantly Lower Container Exit/Unmount Latency
Alibaba engineer Baokun Li tracked down a possible race condition that can occur when a container exits and addressed it with a now-merged patch. That portion of the work should also be back-ported to the current Linux stable kernel series in the near future. What's most exciting, though, is the additional work that eliminates a global serialization penalty and can lead to much lower container exit/unmount latency.
Christian Brauner summed up the situation in this pull request that is now merged for Linux 7.2:
"Fix a race between cgroup_writeback_umount() and inode_switch_wbs()
When a container exits, a race between cgroup_writeback_umount() and inode_switch_wbs()/cleanup_offline_cgwb() can trigger "VFS: Busy inodes after unmount" followed by a use-after-free on percpu counters. There is a window between inode_prepare_wbs_switch() returning true (having passed the SB_ACTIVE check and grabbed the inode) and the subsequent wb_queue_isw() call: if cgroup_writeback_umount() observes the global isw_nr_in_flight counter as non-zero but flush_workqueue() finds nothing queued yet, it returns early - leaving a held inode reference that blocks evict_inodes() and a later iput() that hits freed percpu counters.
The race is closed by covering the window from inode_prepare_wbs_switch() through wb_queue_isw() with an RCU read-side critical section and synchronizing in the umount path. On top of that the now-dead rcu_barrier() left over from the queue_rcu_work() era is removed, and the global synchronize_rcu()/flush_workqueue() pair is replaced with a per-sb in-flight counter plus pin/unpin/drain helpers so umount no longer serializes against switch activity on unrelated superblocks.
Under cgroup writeback churn on a 16 vCPU guest this takes umount latency from ~92-138ms p50 down to ~5-8ms p50 and the cumulative cost of cgroup_writeback_umount() from ~62ms to ~4us per call. The initial race fix is kept separate and minimal so it backports cleanly to stable trees that still queue switches via queue_rcu_work()."
Quite a nice improvement for the unmount latency.
[Image: Linux 7.2 unmount latency benchmark]
There are also additional benchmark numbers from this patch.
[Image: Linux 7.2 unmount latency benchmark]
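For readers wondering what a "per-sb in-flight counter plus pin/unpin/drain helpers" might look like in principle, here is a minimal user-space sketch in C11. It is not the kernel patch: `struct fake_sb` and the functions `fake_sb_init()`, `isw_pin()`, `isw_unpin()` and `isw_drain()` are hypothetical names made up for illustration, and the busy-wait in the drain helper merely stands in for whatever sleeping wait the real code uses. The point it tries to show is that each superblock carries its own counter, so unmounting one superblock only waits for switches pinned against that same superblock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for a superblock (not the kernel struct). */
struct fake_sb {
    atomic_bool active;        /* plays the role of the SB_ACTIVE check */
    atomic_int  isw_in_flight; /* switches currently pinned on this sb  */
};

void fake_sb_init(struct fake_sb *sb)
{
    atomic_init(&sb->active, true);
    atomic_init(&sb->isw_in_flight, 0);
}

/*
 * Pin: announce the switch on this sb *before* checking whether the sb is
 * still active. Returns false (with the pin dropped again) if unmount has
 * already started, in which case the caller must not touch the inode.
 */
bool isw_pin(struct fake_sb *sb)
{
    atomic_fetch_add(&sb->isw_in_flight, 1);
    if (!atomic_load(&sb->active)) {
        atomic_fetch_sub(&sb->isw_in_flight, 1);
        return false;
    }
    return true;
}

/* Unpin: the switch has finished and dropped its inode reference. */
void isw_unpin(struct fake_sb *sb)
{
    int prev = atomic_fetch_sub(&sb->isw_in_flight, 1);
    assert(prev > 0); /* an unpin must always match an earlier pin */
}

/*
 * Drain: called from the (simulated) unmount path. Marks the sb inactive so
 * no new switch can pin it, then waits only for this sb's own counter.
 * A busy-wait keeps the sketch short; real code would sleep instead.
 */
void isw_drain(struct fake_sb *sb)
{
    atomic_store(&sb->active, false);
    while (atomic_load(&sb->isw_in_flight) != 0)
        ; /* spin until every pinned switch on this sb has unpinned */
}
```

Because the pin increments the counter before it checks `active`, and the drain clears `active` before it reads the counter, at least one side always notices the other: either the switch backs off or the unmount waits for it. With a single global counter, by contrast, a drain on one filesystem would also wait on switches belonging to every other filesystem, which is the serialization penalty the pull request describes removing.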
Separately, that same VFS pull request for Linux 7.2 also improves write performance when using the RWF_DONTCACHE flag. Those benchmark numbers and more details can be found in this patch.
