May 14, 2012
The classic computer science response to such a problem is to add another level of indirection in the form of another layer of caching. In this case, a large array of drives could be hidden behind a much smaller SSD-based cache that provides quick access to frequently-accessed data and turns random access patterns into something closer to sequential access. Hybrid drives and high-end storage arrays have provided this kind of feature for some time, but Linux does not currently have the ability to construct such two-level drives from independent components. That situation could change, though, if the bcache patch set finds its way into the mainline.
LWN last looked at bcache almost two years ago. Since then, the project has been relatively quiet, but development has continued. With the current v13 patch set, bcache creator Kent Overstreet says:
The idea behind bcache is relatively straightforward: given an SSD and one or more storage devices, bcache will interpose the SSD between the kernel and those devices, using the SSD to speed I/O operations to and from the underlying "backing store" devices. If a read request can be satisfied from the SSD, the backing store need not be involved at all. Depending on its configuration, bcache can also buffer write operations; in this mode, it serves as a sort of extended I/O scheduler, reordering operations so that they can be sent to the backing device in a more seek-friendly manner. Once one gets into the details, though, the problem starts to become more complex than one might imagine.
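The read side of that idea can be illustrated with a small toy model. What follows is only a sketch under stated assumptions, not bcache's actual design: the `ReadCache` class, its least-recently-used eviction policy, and the dictionary standing in for the backing device are all invented for illustration.

```python
from collections import OrderedDict


class ReadCache:
    """Toy model of a small fast cache in front of a slow backing store."""

    def __init__(self, backing, capacity):
        self.backing = backing        # block number -> data (the slow device)
        self.capacity = capacity      # how many blocks fit on the "SSD"
        self.cache = OrderedDict()    # block number -> data, oldest first
        self.backing_reads = 0        # reads that had to go to the slow device

    def read(self, block):
        if block in self.cache:
            # Hit: the backing store is not involved at all.
            self.cache.move_to_end(block)
            return self.cache[block]
        # Miss: fetch from the backing store and remember the result.
        self.backing_reads += 1
        data = self.backing[block]
        self.cache[block] = data
        if len(self.cache) > self.capacity:
            # Assumed policy: evict the least recently used block.
            self.cache.popitem(last=False)
        return data


backing = {n: f"data{n}" for n in range(8)}
cache = ReadCache(backing, capacity=2)
for block in (1, 1, 1, 2, 1):
    cache.read(block)
print(cache.backing_reads)
```

In this run, only two of the five reads reach the backing store; the rest are satisfied from the cache.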
Consider the buffering and reordering of write operations, for example. Some users may be uncomfortable with anything that delays the arrival of data on the backing device; for such situations, bcache can be run in a write-through caching mode. When write-through behavior is selected, no write operation is considered to be complete until it has made it to the backing device. Clearly, in this case, the SSD cache is not going to improve write performance at all, though it may still improve performance overall if that data is read while it remains in the cache.
If, instead, writeback caching is enabled, bcache will mark the completion of writes once they make it to the SSD. It can then flush those dirty blocks out to the backing device at its leisure. Writeback caching can allow the system to coalesce multiple writes to the same blocks and to achieve better on-disk locality when the writes are eventually flushed out; both of those should improve performance. Obviously, writeback caching also carries the risk of losing data if the system is struck by a meteorite before the writeback operation is complete. Bcache includes a fair amount of code meant to address this concern; the SSD contains an index as well as the cached data, so dirty blocks can be located and written back after the system comes back up. Providing meteorite-proof drives is beyond the scope of the bcache patch set, though.
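The difference between the two caching modes can be made concrete with another toy model. This is a sketch only: the `CachedDevice` class and its `flush()` method are hypothetical, and sorting dirty blocks by block number is a simple stand-in for whatever ordering bcache actually uses when it writes data back.

```python
class CachedDevice:
    """Toy model contrasting write-through and writeback caching."""

    def __init__(self, writeback):
        self.writeback = writeback
        self.ssd = {}             # block number -> data held on the cache device
        self.dirty = set()        # blocks on the SSD not yet on the backing device
        self.backing = {}         # block number -> data on the backing device
        self.backing_writes = []  # block numbers, in the order they hit the disk

    def _write_backing(self, block, data):
        self.backing[block] = data
        self.backing_writes.append(block)

    def write(self, block, data):
        self.ssd[block] = data
        if self.writeback:
            # Writeback: complete as soon as the data is on the SSD.
            self.dirty.add(block)
        else:
            # Write-through: not complete until the backing device has it.
            self._write_backing(block, data)

    def flush(self):
        # Repeated writes to one block have already collapsed into a single
        # dirty entry; writing in block order improves on-disk locality.
        for block in sorted(self.dirty):
            self._write_backing(block, self.ssd[block])
        self.dirty.clear()


writes = [(9, "a"), (2, "b"), (9, "c"), (5, "d")]
through = CachedDevice(writeback=False)
back = CachedDevice(writeback=True)
for block, data in writes:
    through.write(block, data)
    back.write(block, data)
back.flush()
print(through.backing_writes, back.backing_writes)
```

The write-through device sends four writes to the backing store in arrival order, while the writeback device sends three, in ascending block order. Until `flush()` runs, though, the writeback device's dirty blocks exist only on the SSD, which is the window of risk discussed above.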
Of course, maintaining this index on the SSD has some performance costs of its own, especially since bcache takes pains to only write full erase blocks at a time. One write operation from the kernel can turn into several operations at the SSD level to ensure that the on-SSD data structures are consistent at all times. To mitigate this cost, bcache provides an optional journaling layer that can speed up operations at the SSD level.
Another interesting problem that comes with writeback caching is the implementation of barrier operations. Filesystems use barriers (implemented as synchronous "force to media" operations in contemporary kernels) to ensure that the on-disk filesystem structure is consistent at all times. If bcache does not recognize and implement those barriers, it runs the risk of wrecking the filesystem's careful ordering of operations and corrupting things on the backing device. Unfortunately, bcache does indeed lack such support at the moment, leading to a strong recommendation to mount filesystems with barriers disabled for now.
Multi-layer solutions like bcache must face another hazard: what happens if somebody accesses the underlying backing device directly, routing around bcache? Such access could result in filesystem corruption. Bcache handles this possibility by requiring exclusive access to the backing device. That device is formatted with a special marker, and its leading blocks are hidden when accessing the device by way of bcache. Thus, the beginning of the device under bcache is not the same as the beginning when the device is accessed directly. That means that a filesystem created through bcache will not be recognized by the filesystem code if an attempt is made to mount the backing device directly. Simple attempts to shoot one's own feet should be defeated by this mechanism; as always, there is little point in doing more to protect those who are really determined to injure themselves.
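The effect of hiding the leading blocks can be shown with a short sketch. Everything here is an assumption made for illustration: the number of hidden sectors, the marker bytes, and the "TOYFS" signature are invented, and the real on-disk layout used by bcache is not described in this article.

```python
SECTOR_SIZE = 512
HIDDEN_SECTORS = 16        # hypothetical; not bcache's real reserved size
MARKER = b"CACHE-MARKER"   # hypothetical stand-in for the special marker
FS_MAGIC = b"TOYFS"        # hypothetical filesystem signature


class RawDevice:
    """The backing device as seen when accessed directly."""

    def __init__(self, sectors):
        self.data = bytearray(sectors * SECTOR_SIZE)

    def read(self, sector, length):
        start = sector * SECTOR_SIZE
        return bytes(self.data[start:start + length])

    def write(self, sector, payload):
        start = sector * SECTOR_SIZE
        self.data[start:start + len(payload)] = payload


class CachedView:
    """The same device as seen through the caching layer."""

    def __init__(self, raw):
        self.raw = raw
        raw.write(0, MARKER)   # "format" the device with the marker

    def read(self, sector, length):
        # Every access is shifted past the hidden leading sectors.
        return self.raw.read(sector + HIDDEN_SECTORS, length)

    def write(self, sector, payload):
        self.raw.write(sector + HIDDEN_SECTORS, payload)


def looks_like_toyfs(device):
    # A filesystem probe that expects its signature at the start of the device.
    return device.read(0, len(FS_MAGIC)) == FS_MAGIC


raw = RawDevice(64)
view = CachedView(raw)
view.write(0, FS_MAGIC)    # "mkfs" through the caching layer
print(looks_like_toyfs(view), looks_like_toyfs(raw))
```

The probe succeeds through the cached view and fails on the raw device, because the signature sits `HIDDEN_SECTORS` sectors into the raw device rather than at its start.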
There seems to be a reasonable level of consensus that bcache would be a useful addition to the kernel. There are some obstacles to overcome before this code can be merged, though. One of those is that bcache adds its own management interface involving a set of dedicated tools and a complex sysfs structure. There is resistance to adding another API for block device management, so Kent has been encouraged to integrate bcache into the device mapper code. Nobody seems to be working on that project at the moment, but Dan Williams has posted a set of patches integrating bcache into the MD RAID layer. With these patches, a simple command is sufficient to set up an array with SSD caching added on top. Once that code gets into shape, presumably the user-space interface concerns will be somewhat lessened.
A harder problem to get around may be the simple fact that the bcache patch set is large, adding over 15,000 lines of code to the kernel. Included therein is a fair amount of tricky data structure work such as a complex btree implementation and "closures," being "asynchronous refcounty things based on workqueues". The complexity of the code will make it hard to review, but, given the potential for trouble when adding a new stage to the block I/O path, developers will want this code to be well reviewed indeed. Getting enough eyeballs directed toward this code could be a challenge, but the benefit, in the form of faster storage devices, could well be worth the trouble.
