Voozh

Please consider subscribing to LWN
Subscriptions are the lifeblood of LWN.net. If you appreciate this content and would like to see more of it, your subscription will help to ensure that LWN continues to thrive. Please visit this page to join up and keep LWN on the net.

By Jonathan Corbet
May 14, 2012

Flash-based solid-state storage devices (SSDs) have a lot to recommend them; in particular, they can be quite fast even when faced with highly non-sequential I/O patterns. But SSDs are also relatively small and expensive; for that reason, for all their virtues, they will not be fully replacing rotating storage devices for a long time. It would be nice to have a storage device that provided the best features of both SSDs and rotating devices—the speed of flash combined with the cheap storage capacity of traditional drives. Such a device could simultaneously reduce the performance pain that comes with rotating storage and the financial pain associated with solid-state storage.

The classic computer science response to such a problem is to add another level of indirection in the form of another layer of caching. In this case, a large array of drives could be hidden behind a much smaller SSD-based cache that provides quick access to frequently-accessed data and turns random access patterns in something closer to sequential access. Hybrid drives and high-end storage arrays have provided this kind of feature for some time, but Linux does not currently have the ability to construct such two-level drives from independent components. That situation could change, though, if the bcache patch set finds its way into the mainline.

LWN last looked at bcache almost two years ago. Since then, the project has been relatively quiet, but development has continued. With the current v13 patch set, bcache creator Kent Overstreet says:

Bcache is solid, production ready code. There are still bugs being found that affect specific configurations, but there haven't been any major issues found in awhile - it's well past time I started working on getting it into mainline.

The idea behind bcache is relatively straightforward: given an SSD and one or more storage devices, bcache will interpose the SSD between the kernel and those devices, using the SSD to speed I/O operations to and from the underlying "backing store" devices. If a read request can be satisfied from the SSD, the backing store need not be involved at all. Depending on its configuration, bcache can also buffer write operations; in this mode, it serves as a sort of extended I/O scheduler, reordering operations so that they can be sent to the backing device in a more seek-friendly manner. Once one gets into the details, though, the problem starts to become more complex than one might imagine.

Consider the buffering and reordering of write operations, for example. Some users may be uncomfortable with anything that delays the arrival of data on the backing device; for such situations, bcache can be run in a write-through caching mode. When write-through behavior is selected, no write operation is considered to be complete until it has made it to the backing device. Clearly, in this case, the SSD cache is not going to improve write performance at all, though it may still improve performance overall if that data is read while it remains in the cache.

If, instead, writeback caching is enabled, bcache will mark the completion of writes once they make it to the SSD. It can then flush those dirty blocks out to the backing device at its leisure. Writeback caching can allow the system to coalesce multiple writes to the same blocks and to achieve better on-disk locality when the writes are eventually flushed out; both of those should improve performance. Obviously, writeback caching also carries the risk of losing data if the system is struck by a meteorite before the writeback operation is complete. Bcache includes a fair amount of code meant to address this concern; the SSD contains an index as well as the cached data, so dirty blocks can be located and written back after the system comes back up. Providing meteorite-proof drives is beyond the scope of the bcache patch set, though.

Of course, maintaining this index on the SSD has some performance costs of its own, especially since bcache takes pains to only write full erase blocks at a time. One write operation from the kernel can turn into several operations at the SSD level to ensure that the on-SSD data structures are consistent at all times. To mitigate this cost, bcache provides an optional journaling layer that can speed up operations at the SSD level.

Another interesting problem that comes with writeback caching is the implementation of barrier operations. Filesystems use barriers (implemented as synchronous "force to media" operations in contemporary kernels) to ensure that the on-disk filesystem structure is consistent at all times. If bcache does not recognize and implement those barriers, it runs the risk of wrecking the filesystem's careful ordering of operations and corrupting things on the backing device. Unfortunately, bcache does indeed lack such support at the moment, leading to a strong recommendation to mount filesystems with barriers disabled for now.

Multi-layer solutions like bcache must face another hazard: what happens if somebody accesses the underlying backing device directly, routing around bcache? Such access could result in filesystem corruption. Bcache handles this possibility by requiring exclusive access to the backing device. That device is formatted with a special marker, and its leading blocks are hidden when accessing the device by way of bcache. Thus, the beginning of the device under bcache is not the same as the beginning when the device is accessed directly. That means that a filesystem created through bcache will not be recognized by the filesystem code if an attempt is made to mount the backing device directly. Simple attempts to shoot one's own feet should be defeated by this mechanism; as always, there is little point in doing more to protect those who are really determined to injure themselves.

There seems to be a reasonable level of consensus that bcache would be a useful functionality to add to the kernel. There are some obstacles to overcome before this code can be merged, though. One of those is that bcache adds its own management interface involving a set of dedicated tools and a complex sysfs structure. There is resistance to adding another API for block device management, so Kent has been encouraged to integrate bcache into the device mapper code. Nobody seems to be working on that project at the moment, but Dan Williams has posted a set of patches integrating bcache into the MD RAID layer. With these patches, a simple command is sufficient to set up an array with SSD caching added on top. Once that code gets into shape, presumably the user-space interface concerns will be somewhat lessened.

A harder problem to get around may be the simple fact that the bcache patch set is large, adding over 15,000 lines of code to the kernel. Included therein is a fair amount of tricky data structure work such as a complex btree implementation and "closures," being "asynchronous refcounty things based on workqueues". The complexity of the code will make it hard to review, but, given the potential for trouble when adding a new stage to the block I/O path, developers will want this code to be well reviewed indeed. Getting enough eyeballs directed toward this code could be a challenge, but the benefit, in the form of faster storage devices, could well be worth the trouble.

Index entries for this article
Kernel	Block layer/Caching

The LWN site is currently under high scraper load, so comment display has been suppressed for anonymous users. If you are a human, you may read the comments by clicking the button below:

Note: you can avoid this step in the future by logging into your LWN account.

URL: https://lwn.net/Articles/497024/

⇱ A bcache update [LWN.net]

A bcache update