
Brief items

Kernel release status

The current development kernel is 4.0-rc7, released on April 6 after a day's delay for the holiday. Linus Torvalds said: "But it's still pretty small, and things are on track for 4.0 next weekend. There's a tiny chance that I'll decide to delay 4.0 by a week just because I'm traveling the week after, and I might want to avoid opening the merge window. We'll see how I feel about it next weekend."

Stable updates: none have been released in the last week.


Quote of the week

If you write interfaces with more than 4 or 5 function arguments, it's possible that you and I cannot be friends.

— David Miller


Kernel development news

Running the kernel in library mode

By Jonathan Corbet
April 8, 2015

Once upon a time, the only way to run the Linux kernel was as the primary operating system on a handy piece of hardware. Since then, though, other modes of operation have become possible: the kernel can, for example, be run as the guest of another kernel through virtualization, or as a user-space process with the user-mode Linux (UML) port. One mode that has not been supported is running the kernel as a library that can be called from within an application program, but that situation appears to be about to change thanks to a patch set which has just made its first appearance on the linux-kernel list.

This patch set, posted by Hajime Tazaki, goes by the name LibOS; it was presented at the recent Netdev 0.1 conference (the slides are available on SlideShare). LibOS is structured as if it were a new architecture port, with its code in its own directory in the kernel tree. But this port, when built, does not result in a bootable kernel; instead, it creates a shared library that can then be loaded into a running process.

One might wonder why this mode of operation would be useful. Though it is not limited to this particular use, the main focus of LibOS at the moment is to make the Linux network stack available to user-space applications. User-space network stacks are not unheard of in the Linux world; they have shown up in certain performance-sensitive settings for some years now. With LibOS, it is not necessary to write (or port) a new network stack to run in a Linux process; the kernel's network stack is now available to use directly.

Needless to say, one does not just make the network stack callable from user space without doing a bit of work. To make this mode possible, the LibOS developers have created a whole set of stub functions to replace various kernel functions used by the networking code. Indeed, the bulk of the patch set consists of thousands of lines of stub functions. They do things like replacing the slab allocator with a much simpler version and, for the most part, shorting out the filesystem layer entirely. When that is done, what's left is the networking stack with almost enough scaffolding to let it run standalone within a process's address space.

"Almost enough" because a few tasks are still left to the calling application. For example, there is no stub implementation of the kernel's scheduling function; instead, the calling code must provide one during the initialization process. The idea here is that the running application may want to exert some control over how the management of processes (most likely implemented as POSIX threads) will be done.

There are currently two projects using the LibOS framework. Networking in user space (NUSE) finishes the job of providing a running user-space network stack. With NUSE, one can set up arbitrary networking topologies, interface to other user-space mechanisms like DPDK for fast transmission and reception of packets, and more. The NS-3 system, instead, is a simulation framework used to run tests on network protocols and implementations. It can run network-oriented applications on top of the LibOS network stack using tricks to redirect calls to the networking system calls.

There are a number of interesting things that can be done with these tools. Users running networking in user space for performance reasons could consider using it, though the kernel's stack has not been optimized for performance in that setting. Somebody wanting to run an experimental protocol like MPTCP in production could use LibOS (built with a suitably patched kernel) to get that feature without touching the network stack used by the rest of the system. There are also a lot of opportunities for running debugging tools with a network stack that is running in user mode.

While the LibOS work has been focused on the network stack as the first objective, there is nothing in its design that limits it to networking. If one wanted to, say, isolate the virtual filesystem layer instead, it would mostly be a matter of coming up with the additional stub functions needed.

A question that might come to mind is: how does this differ from the user-mode Linux port that has been in the kernel for many years? Indeed, UML maintainer Richard Weinberger wondered exactly that. There appear to be a few differences. UML is meant to run as a standalone application in its own right, while LibOS runs as a library called by some other application. One can even have several LibOS instances running simultaneously within the same application. Beyond that, the idea of isolating a single subsystem for use within an application is not a part of the design of UML. After looking more deeply at the LibOS code, Richard agreed that it brought some interesting things to the table.

One possible area of concern is the maintenance of all of the stub functions. There are a lot of them, and they will need to be updated whenever the corresponding "real" version is changed in the kernel. Few maintainers are likely to think that they have to update LibOS when they are making changes to their own subsystems. As a result, it seems likely that LibOS will be broken much of the time.

That, in turn, means that maintenance concerns may be one of the chief obstacles LibOS must overcome before it can be considered for merging into the mainline kernel. If LibOS is often broken, developers will hesitate to use it. If LibOS breakage leads to complaints against subsystem maintainers working on their own code, they may respond by calling for its removal. Avoiding these pitfalls may require finding some way to automate the creation of these stub functions. Creating a library-mode version of the kernel may turn out to have been the easy part when one considers what is required to make that work maintainable in the long run.


Write-stream IDs

By Jonathan Corbet
April 7, 2015

Storage devices with large physical block sizes — solid-state storage devices (SSDs), for example — are subject to a problem known as "write amplification" that can affect both the performance and lifetime of the device. Applications often have information about the data they write that can be helpful in reducing write amplification problems, but there is currently no way to communicate that information to the relevant parts of the kernel. A new proposed addition to the block-layer API may help to solve that problem in the near future, though.

The kernel typically performs block I/O in units of 4KB, but a typical SSD has an erase-block size of many times that. The firmware in the drive itself performs the impedance matching between the small sector size exposed to the host computer and the real requirements imposed by the hardware. Whenever a sector is written, the firmware must find a home for it in a new erase block, leaving an empty space where the sector used to be. Occasionally, sectors must be shifted and coalesced during a garbage-collection pass to free up the empty spaces for new writes.

"Write amplification" refers to this extra work that must be performed when data is overwritten. It gets worse if short-lived data is mixed with long-lived data in the same erase blocks; garbage collection must happen more often and more data must be moved around. On the other hand, if short-lived data can be kept together, the rewriting of erase blocks and garbage-collection work can be minimized. The kernel could perhaps do this kind of separation for some types of filesystem metadata, but it has no knowledge of how user space plans to use the data that it writes to the filesystem. So, if long-lived user-space data is to be separated from the short-lived variety, user space is going to have to help with the job. That is where Jens Axboe's write-stream IDs patch set comes in.

A write-stream ID is simply an eight-bit integer value assigned to block data as it is written. The kernel does not interpret that value in any way other than as a hint that data with the same ID is likely to have approximately the same lifetime. Low-level storage drivers can use this ID to place data with the same life expectancy together on the media, hopefully reducing write-amplification problems in the process.
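The benefit of grouping data by lifetime can be illustrated with a toy calculation. The sketch below does not model any real drive; the erase-block size of 64 pages and the function name are assumptions made for illustration. It counts how many still-valid pages a garbage collector would have to copy elsewhere before it could erase every block that contains dead data.

```python
PAGES_PER_ERASE_BLOCK = 64  # assumed value; real devices vary widely


def pages_copied_by_gc(blocks):
    """Count valid pages that must be relocated in order to erase every
    block that contains at least one invalidated page.

    Each block is a list of booleans: True means the page is still
    valid (long-lived data), False means it has been invalidated."""
    copied = 0
    for block in blocks:
        if not all(block):        # the block holds reclaimable space
            copied += sum(block)  # its valid pages must be moved first
    return copied


# Four blocks of data, half of it short-lived (now dead), half long-lived.
# Mixed layout: every block alternates live and dead pages.
mixed = [[i % 2 == 0 for i in range(PAGES_PER_ERASE_BLOCK)]
         for _ in range(4)]

# Separated layout: two blocks are entirely dead, two entirely live.
separated = ([[False] * PAGES_PER_ERASE_BLOCK for _ in range(2)] +
             [[True] * PAGES_PER_ERASE_BLOCK for _ in range(2)])

mixed_copies = pages_copied_by_gc(mixed)
separated_copies = pages_copied_by_gc(separated)
```

In this contrived case the mixed layout forces the collector to move half of every block, while the separated layout lets it erase the dead blocks without copying anything.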

At the lowest level in the block layer, the stream ID is stored in eight bits of a field in the block-layer I/O structure. The patch set provides helper functions to set and query it. Another helper can be used to determine whether a given structure has had its stream ID set; low-level block drivers can use a valid stream ID to instruct the hardware to group similar data on the physical media.
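Storing a small integer in otherwise unused bits of a larger field is a common idiom. The following sketch shows the bit manipulation involved, in Python rather than kernel C. The bit position, the mask, and the function names are hypothetical and are not taken from the patch set; treating zero as "no stream ID set" is likewise an assumption.

```python
STREAM_SHIFT = 24                   # hypothetical position of the ID
STREAM_MASK = 0xFF << STREAM_SHIFT  # eight bits wide


def set_stream_id(flags, stream_id):
    """Return flags with the eight-bit stream ID stored in it,
    leaving all of the other bits untouched."""
    if not 0 <= stream_id <= 0xFF:
        raise ValueError("stream ID must fit in eight bits")
    return (flags & ~STREAM_MASK) | (stream_id << STREAM_SHIFT)


def get_stream_id(flags):
    """Extract the stream ID from flags."""
    return (flags & STREAM_MASK) >> STREAM_SHIFT


def stream_id_valid(flags):
    """Assumption: an ID of zero means that no stream ID was set."""
    return get_stream_id(flags) != 0
```

Clearing the masked bits before OR-ing in the new value is what allows an ID to be replaced without disturbing the unrelated bits that share the field.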

At the user-space level, the stream ID for an open file can be set with a new operation added by the patch set. Interestingly, this value is stored in two places: the structure associated with the given file descriptor, and the structure representing the file itself. That might seem like an interesting choice, given that both structures are heavily used and bloating both of them with a new field is not something to be done lightly, but there is a reason for it.

When an application performs direct I/O, the data being written will be placed in a block-layer I/O structure immediately and passed to the block layer. The structure corresponding to the file descriptor passed by user space is available then, so the stream ID stored in that structure can be copied directly into the I/O structure.

That option is not available for buffered I/O, though. A buffered write will simply copy the data into the page cache and mark the relevant page(s) dirty; the actual I/O on those pages will happen at some future time. By the time that the writeback code gets around to those pages, the file-descriptor structure used to initiate the write may no longer exist. Even if it is still around, though, it is not readily accessible at that level of the kernel. But the structure representing the file itself is accessible. So, in this case, the stream ID must be taken from that structure.

One might ask why the stream ID stored in the file's own structure is not used all of the time. The patches are silent on this point, but the probable answer is that the more direct control afforded by storing the ID in the per-descriptor structure is worth having when it is possible. It allows the stream ID to be changed from one I/O operation to the next; different file descriptors referring to the same on-disk file can also have different stream IDs. This flexibility is not available when doing buffered I/O and using the stream ID stored in the file's own structure; since it's not possible to know when the actual writeback will happen, the stream ID cannot be changed between writes without the likelihood of affecting writes intended to go under a different ID.
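The difference between the two paths can be modeled in a few lines. This is only a sketch of the behavior described above, not of the kernel code; every class and function name here is invented for illustration. `OpenFile` stands in for the structure associated with a file descriptor, and `Inode` for the structure representing the file itself.

```python
class Inode:
    """Stand-in for the structure representing the file itself."""
    def __init__(self):
        self.stream_id = 0
        self.dirty_pages = []


class OpenFile:
    """Stand-in for the structure behind one file descriptor."""
    def __init__(self, inode):
        self.inode = inode
        self.stream_id = 0


def set_stream_id(open_file, stream_id):
    # As described above, the value is stored in both places.
    open_file.stream_id = stream_id
    open_file.inode.stream_id = stream_id


def direct_write(open_file, data):
    # Direct I/O: the per-descriptor ID is at hand when the I/O is built.
    return (open_file.stream_id, data)


def buffered_write(open_file, data):
    # Buffered I/O: only dirty the page; no stream ID is recorded with it.
    open_file.inode.dirty_pages.append(data)


def writeback(inode):
    # Later, writeback can only consult whatever ID the file holds now.
    requests = [(inode.stream_id, data) for data in inode.dirty_pages]
    inode.dirty_pages.clear()
    return requests


inode = Inode()
f = OpenFile(inode)

set_stream_id(f, 1)
first = direct_write(f, "journal")       # tagged with ID 1
set_stream_id(f, 2)
second = direct_write(f, "table")        # tagged with ID 2

set_stream_id(f, 1)
buffered_write(f, "journal")
set_stream_id(f, 2)
buffered_write(f, "table")
flushed = writeback(inode)               # both pages end up under ID 2
```

The direct writes each keep the ID that was current when they were issued, while both buffered pages are written back under whichever ID was set last.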

There would be clear value in a closer association between stream IDs and specific buffered-write operations. Getting there would require storing the stream ID with each dirtied page, though; that, in turn, almost certainly implies shoehorning the stream ID into the structure associated with each page. That would not be an easy task; it is not surprising that it is not a part of this patch set. Should the lack of per-buffered-write stream IDs prove to be a serious constraint in the future, somebody will certainly be motivated to try to find a place to store another eight bits there.

Meanwhile, there does not appear to be any real opposition to the patch set in its current form. Unless that situation changes, write-stream IDs would appear to be a feature headed for the mainline in a near-future development cycle.


An update on the freedreno graphics driver

By Jake Edge
April 8, 2015

ELC 2015

The freedreno project was started by Rob Clark to create a free-software driver for the Adreno family of GPUs, which are used by the Qualcomm Snapdragon system-on-chip (SoC) family. He presented a status report on the project, along with some history and future plans, at the Embedded Linux Conference, which was held in San Jose, CA, March 23-25.

The Adreno 2xx, 3xx, and 4xx are all supported by freedreno. The 2xx GPUs support OpenGL ES 2.0, the 3xx devices support OpenGL ES 3.0 and the embedded profile of OpenCL 1.1, while the Adreno 4xx GPUs support OpenGL ES 3.1 (with the Android Extension Pack) and the full profile of OpenCL 1.2. The 3xx is the first of the modern Adrenos, Clark said; the 4xx, which was announced with the Snapdragon 805 SoC, is the first Adreno to support DirectX 11.

Adrenos have a tile-based renderer, though it is implemented differently than other tile-based devices. Adrenos have a relatively large internal memory for the tile buffer, either in the GPU core itself (GMEM) or elsewhere on the SoC (OCMEM). The driver manages the tile buffers, including handling restore and resolve operations (moving tile data between the CPU and GPU memory). It also handles partitioning the rendering target into tiles. On Adreno 3xx and later GPUs, the driver can decide to bypass the tile buffer to do immediate rendering in certain scenarios.

Motivation and history

Clark got involved because of the lack of free drivers for various Snapdragon-based boards that were becoming available. Graphics progress was being held back because the GPUs were all locked down. These days, developers expect to have GPU acceleration available for user interfaces and other purposes, but for these ARM boards, you were "left with Android or Android" for driver choices. Those drivers didn't come with source, so you couldn't even recompile them. There were some "clever hacks" like libhybris, but "piling on more duct tape doesn't solve the issue".


So in mid-2012 he decided to do something about it. He was working for TI at the time, so PowerVR-based GPUs (as used by TI) were off-limits. He found some hardware that had an Adreno 220 and started to reverse engineer it. He began work on a Gallium driver in November 2012. By early 2013, he had most of the "normal stuff" working. He could run GNOME Shell and some games on the hardware.

One of the nice things about a Gallium driver, Clark said, is that it supports both desktop graphics and GL ES. Lots of games are not ported to GL ES but can still be played using the Gallium driver.

That early work was done mostly in the evenings and on weekends, but that changed somewhat when he joined the Red Hat graphics team in February 2013. While his freedreno work is not his full-time job, he does get some work time to do freedreno development.

In March 2013, he ordered a Nexus 4, which has an Adreno 320 GPU, and started looking at that. Everything had changed between the two Adreno revisions, including the shader instruction set and the registers. Adreno 2xx support was working well at that point, so it has pretty much been left behind. By mid-2013, he had Adreno 3xx support working at a basic level.

The Direct Rendering Manager (DRM) driver for the MSM (Mobile Station Modem, a Qualcomm hardware designation that is used as the Adreno driver name) devices was merged into the mainline in August 2013 for the 3.12 kernel, which provided a "nice platform" for further development work. In January 2014, he added hardware binning support, which is a pre-pass made on the triangles to be rendered to see which fit in each tile. That sped up rendering so that he was able to get a "fairly playable" 30 frames per second (fps) in Xonotic.
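The idea behind a binning pre-pass can be sketched in software, even though freedreno uses the hardware to do it. The version below is a conservative approximation that assigns each triangle to every tile overlapped by its bounding box; the function name, the integer pixel coordinates, and the tile dimensions are all assumptions made for illustration, not details of how the Adreno hardware works.

```python
def bin_triangles(triangles, width, height, tile_w, tile_h):
    """Map each tile (column, row) to the indices of the triangles whose
    bounding box overlaps it. Triangles are sequences of three (x, y)
    integer pixel coordinates."""
    cols = (width + tile_w - 1) // tile_w   # round up to cover the target
    rows = (height + tile_h - 1) // tile_h
    bins = {(tx, ty): [] for ty in range(rows) for tx in range(cols)}

    for index, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Clamp the bounding box to the render target.
        x0, x1 = max(0, min(xs)), min(width - 1, max(xs))
        y0, y1 = max(0, min(ys)), min(height - 1, max(ys))
        if x0 > x1 or y0 > y1:
            continue  # entirely outside the target; no tile needs it
        for ty in range(y0 // tile_h, y1 // tile_h + 1):
            for tx in range(x0 // tile_w, x1 // tile_w + 1):
                bins[(tx, ty)].append(index)
    return bins


triangles = [
    [(2, 2), (10, 2), (2, 10)],              # fits within one tile
    [(20, 20), (40, 20), (20, 40)],          # straddles tile boundaries
    [(100, 100), (120, 100), (100, 120)],    # off the render target
]
bins = bin_triangles(triangles, width=64, height=64, tile_w=32, tile_h=32)
```

With the bins in hand, rendering a given tile only needs to process the triangles listed for it rather than the whole scene, which is where the speedup comes from.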

He also started work on a new shader compiler in early 2014. The earlier compiler had just translated TGSI directly to native instructions, but didn't do proper instruction scheduling. Adding that scheduling fixed a number of problems while also giving a big performance boost, he said.

He started work on OpenGL 2.0 and 2.1 support in May 2014. Xonotic supports both OpenGL 1.0 and 2.0, so it can easily be used to test the changes made for 2.0 support.

In perhaps something of a surprise move, Qualcomm posted the first patches for the hardware in June 2014. The initial patches were for display support, but subsequent patches were for the DRM/MSM driver itself. Clark strongly commended Qualcomm for doing that work upstream.

In mid-2014, he got his hands on his first Adreno 4xx device (an Inforce6540 with an Adreno 420). He found that nearly all of the registers had changed again, which required another round of reverse engineering to figure them out. The shader instruction set was quite similar to that of the 3xx, so he was able to share that code between the two GPU versions.

By October 2014, 90% of the Piglit OpenGL tests were passing for the Adreno 3xx. Piglit is useful to test "weird corner cases", he said. It is often the case that games will work fine, but lots of Piglit tests still fail.

Kernel driver patches to enable the 4xx were submitted by Qualcomm in November 2014 for the 3.19 kernel, which was another big step for the company. The initial Gallium support for the 4xx was also released around the same time. As of February 2015, the 4xx support is behind where the 3xx is, but most games and other programs are largely working at this point. By March 2015, the new shader compiler was able to handle everything that the old compiler could, so the old one has been retired.

The architecture for freedreno is much the same as for the Intel, Radeon, or Nouveau drivers. There are multiple user-space pieces (Gallium driver, X video driver, and libdrm_freedreno) that talk to the kernel driver. By implementing it that way, they got Wayland support for free. GNOME Shell works just fine on Adreno hardware using the Wayland compositor.

Clark said that he doesn't know much about the Android graphics stack. He would like to see Android support free graphics for Adreno devices, but that requires further investigation.

User space

The biggest single component of an open-source graphics driver is the Gallium driver, he said. The freedreno Gallium driver has a common core that handles tasks like dirty state and buffer tracking, tile-buffer management, and fences (i.e. synchronization operations). There are also separate components (fd2, fd3, and fd4) to handle each generation of Adreno devices, as well as a shader compiler (ir3) that is shared between 3xx and 4xx devices.

The user-space piece of freedreno builds up a command stream that has lots of register references. That gets sent to the kernel driver, which is a "glorified register writer", Clark said. The shader compiler is the single largest piece of the user-space driver. He also went into some details of the tiling implementation and how queries (e.g. occlusion queries) are handled by the driver and the hardware.

Turning to debugging, Clark said that the FD_MESA_DEBUG environment variable is the most useful tool for debugging. This command:

 $ FD_MESA_DEBUG=help glxinfo

will produce a help message that lists other values for the environment variable. Another useful tool is apitrace, which will save all of the GL commands into a trace file. That file can be used to reproduce some problem entirely separate from the program that produced the output. Command-stream traces can also be grabbed from the kernel driver, through a file that it provides.

The envytools from the Nouveau project were used to generate header files that describe the Adreno registers for both the Gallium and DRM/MSM kernel driver. As he figures out the registers of the GPUs, Clark enters them into an XML file that is used by other tools, both to create the headers and to decode various kinds of trace files. For example, cffdump can decode the command-stream trace files from both the free and binary blob drivers to compare the output of each. Information about the tools used to reverse engineer the GPUs is available on the freedreno wiki.

GL/GL ES 3.0

There is a lot of work going on "behind the scenes" to enable GL/GL ES 3.0, Clark said. A lot of the work to do so has been done by Ilia Mirkin. Supporting 3.0 will enable more games to work and will provide more advanced rendering features. Clark has also done work on the shader compiler to support the new version of the GLSL shader language.

For GL ES 3.0, the "big ticket item" that is left to do is implementing transform feedback and uniform buffer objects (UBOs). For those, the reverse engineering has been done; all that's left is to "write code". The code for multiple render targets (MRTs) exists on a branch, but has not yet been merged. There is also some shader compiler work to support advanced flow control, but that is not used in games much.

Support for the NV_conditional_render extension is the biggest piece left for GL 3.0 support. Clark said that the extension could be called the "we hate tilers" extension, since handling it will perform badly on a tile-based GPU.

The shader compiler is already the biggest piece of the user-space driver, but it will be getting somewhat bigger to handle some of these new features. Clark has started documenting the compiler design and the Adreno instruction set architecture. He has also done some preliminary work toward supporting the new internal representation (NIR), which is an Intel effort aimed at a new shader compiler IR for better optimization. He has a TGSI to NIR translation patch, but it is not ready to land yet.

Getting the freedreno driver should be easy, since all of the pieces are already upstream. Distributions generally already have it enabled. He recommends Mesa 10.4.x for Adreno 3xx devices and 10.5.1 (or higher) for 4xx GPUs.

As with most free-software projects, freedreno could use some help. To start with, if a distribution doesn't have the driver enabled, help make that happen, he suggested. Bug reports are also welcome. For those interested in actually working on the driver, there is help needed in "everything from the kernel to compilers". For those that know GL/GL ES, adding more tests would be helpful as well.

Clark described the process of reverse engineering a bit in answer to a question from the audience. He will start with a simple GL program that draws a "quad" (a four-sided polygon), then he will change the frame buffer size to see what the blob driver does differently. As he sees different registers and values, he records what he finds in the XML file that is used by cffdump and other tools. He runs a series of tests, just varying one parameter at a time to see what changes and slowly works out all of the registers. For the most part, these tests are rendered off-screen as he doesn't actually care much about what the output looks like.
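The core of that one-parameter-at-a-time approach is a comparison of what was written to the registers in two runs. A minimal sketch of such a comparison follows; it is not how cffdump works, and the function name, the register offsets, and the values are all made up for illustration.

```python
def changed_registers(baseline, variant):
    """Return {offset: (old, new)} for every register whose value differs
    between two captures. A register that appears in only one capture is
    reported with None on the other side."""
    changes = {}
    for offset in sorted(set(baseline) | set(variant)):
        old = baseline.get(offset)
        new = variant.get(offset)
        if old != new:
            changes[offset] = (old, new)
    return changes


# Hypothetical captures from two runs that differ only in the size of
# the frame buffer; offsets and values are invented.
run_small = {0x2000: 0x00FF00FF, 0x2004: 0x1, 0x2010: 0x8}
run_large = {0x2000: 0x01FF01FF, 0x2004: 0x1, 0x2010: 0x8, 0x2020: 0x2}

diff = changed_registers(run_small, run_large)
for offset, (old, new) in diff.items():
    print(hex(offset), old, new)
```

Registers that change along with the varied parameter become candidates for an entry in the register database, while those that stay constant can be set aside for a later experiment.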

[I would like to thank the Linux Foundation for travel support to San Jose for ELC.]
