January 28, 2015

This article was contributed by David Drysdale

This is the first in a pair of articles that describe how the kernel runs programs: what happens under the covers when a user program invokes the system call? I recently worked on the implementation of a new system call, which is a close variant of that allows the caller to specify the invoked program by a combination of file descriptor and path, as with other system calls. (This will, in turn, enable an implementation of the library function that doesn't rely on access to the filesystem, which is important for sandboxed environments such as Capsicum.)

Along the way, I explored the existing implementation, and so these articles present the details of that functionality. In this one, we'll focus on the general mechanisms that the kernel uses for program invocation, which allow for different program formats; the second article will focus on the details of running ELF binaries.

The view from user space

Before diving into the kernel, we'll start by exploring the behavior of program execution from user space (there's also a good description of this behavior in chapter 27 of The Linux Programming Interface). For Linux versions up to and including 3.18, the only system call that invokes a new program is , which has the following prototype:

The argument specifies the program to be executed, and the and arguments are NULL-terminated lists that specify the command line arguments and environment variables for the new program. A simple skeleton driver program () allows us to explore how this behaves, by feeding in , , as arguments and , as environment variables. To see the result in the invoked program, we use another simple program () that just outputs its command-line arguments () and environment ().

Putting these together gives the expected result — the arguments and environment are passed through to the invoked program. Notice, though, that the for the invoked binary is just the value specified by the caller of ; having the program's name in isn't a convention that's required or policed by itself, at least for binaries.
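A minimal sketch of such a driver, assuming the system call under discussion is execve() and that /bin/sh is present on the machine. The helper name run_and_capture is made up for illustration and is not taken from the article's own test programs. It forks, replaces the child with the requested program using a caller-chosen argument vector and environment, and collects whatever the child prints.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run `path` with the given argv/envp in a child
 * process and copy its standard output into `out` (NUL-terminated).
 * Returns the child's exit status, or -1 on failure. */
static int run_and_capture(const char *path, char *const argv[],
                           char *const envp[], char *out, size_t outlen)
{
    assert(path != NULL && argv != NULL && out != NULL);

    int fds[2];
    if (outlen == 0 || pipe(fds) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: send stdout into the pipe, then replace the program. */
        close(fds[0]);
        dup2(fds[1], STDOUT_FILENO);
        close(fds[1]);
        execve(path, argv, envp);
        _exit(127);             /* only reached if execve() failed */
    }

    /* Parent: read until the child closes its end of the pipe. */
    close(fds[1]);
    size_t used = 0;
    ssize_t n;
    while (used < outlen - 1 &&
           (n = read(fds[0], out + used, outlen - 1 - used)) > 0)
        used += (size_t)n;
    out[used] = '\0';
    close(fds[0]);

    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Calling it with /bin/sh as the program, an arbitrary string as the first argument, and a one-entry environment shows both points made above: the environment reaches the new program untouched, and nothing checks that the first argument matches the program's real name.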

Things change slightly when the program being invoked is a script rather than a binary program. To explore this, we use a shell script equivalent () of our environment-outputting program; putting this together with the original program that invokes , we see a couple of differences:

First, the environment has gained an extra value, indicating the current directory. Secondly, the initial argument to the script is now the script filename, rather than the value that the invoker specified. A further experiment reveals that the script interpreter added the environment variable, but the kernel itself modified the arguments:

More specifically, the kernel has removed the first () argument and replaced it with two arguments — the name of the script interpreter program (taken from the first line of the script) and the name of the invoked file (which holds the script text). If the first line of the script also includes command-line arguments for the interpreter (for example, needs an option to treat its input as a filename rather than script text), a third extra argument is also inserted, holding all of the extra options:
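The rewriting rule just described can be modelled in user space. The following is only a sketch of the observable result, not the kernel's code, and the function name rewrite_script_argv is hypothetical: it drops the caller's first argument and puts the interpreter name, the optional interpreter argument, and the script filename in its place.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Build the argument vector a script interpreter would see.
 * The caller's argv[0] is dropped; in its place go the interpreter name,
 * the optional single interpreter argument (may be NULL) and the script
 * filename. Returns a malloc()ed, NULL-terminated array that borrows the
 * string pointers, or NULL on allocation failure. */
static char **rewrite_script_argv(char *const argv[], const char *interp,
                                  const char *interp_arg, const char *script)
{
    assert(argv != NULL && interp != NULL && script != NULL);

    size_t argc = 0;
    while (argv[argc] != NULL)
        argc++;

    size_t rest = argc > 0 ? argc - 1 : 0;      /* arguments after argv[0] */
    size_t extra = interp_arg != NULL ? 3 : 2;  /* entries inserted in front */
    char **out = malloc((extra + rest + 1) * sizeof(*out));
    if (out == NULL)
        return NULL;

    size_t n = 0;
    out[n++] = (char *)interp;
    if (interp_arg != NULL)
        out[n++] = (char *)interp_arg;
    out[n++] = (char *)script;
    for (size_t i = 1; i < argc; i++)
        out[n++] = argv[i];
    out[n] = NULL;
    return out;
}
```

Applying the function to its own output, with a wrapper script as the new script name, reproduces the nested-wrapper behavior: each level discards the previous first entry and pushes two (or three) new ones.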

Up to a point, we can also repeat this pop-one, push-two alteration of the arguments, by invoking scripts that wrap scripts and so on; each such alteration effectively pushes the wrapper script name in at :

However, this doesn't continue forever — once there are too many levels of wrappers, the process fails with :

Into the kernel:

Now we move into kernel space and begin delving into the code that implements the system call. A previous article explored the general system call machinery (and the special wrinkles needed for ), so we can pick up the story at the function in . The main purpose of the code in this function is to build a new instance that describes the current program invocation operation. In the structure:

The field is set to a freshly opened for the program being invoked; this allows the kernel to read the file contents and decide how to handle the file.
The and fields are both set to the name of the file holding the program; we'll see later why there are two distinct fields here.
The function allocates and sets up the associated and data structures in preparation for managing the virtual memory of the new program. In particular, the new program's virtual memory ends at the highest possible address for the architecture; its stack will grow downward from there.
The field is set to point at the end of memory space for the new program, but leaves space for a NULL pointer as an end marker for the stack. The value of will be updated (downward) as more information is added to the new program's stack.
The and fields are set to hold the counts of arguments and environment values so that this information can be propagated to the new program later in the invocation process.
The field is set up to hold a bitmask of reasons why the program execution might not be safe; for example, if the process is being traced with or has the bit set. The Linux Security Module (LSM) may subsequently use this information to deny the program execution operation.
The field is a separately allocated object of type that holds information about the credentials for the new program. These are generally inherited from the process that called , but are updated to allow for setuid / setgid bits and other complications. The presence of setuid/setgid bits also disallows a collection of compatibility features because they have an adverse effect on security; the field records the bits in the process's personality that will be cleared later.
The field allows an LSM to store LSM-specific information with the ; the LSM is notified via a call to and the LSM hook. The default implementation of this hook updates the new program's Linux capabilities to allow for the file capabilities of the program file; other LSM implementations chain this behavior into their own implementations of the hook (e.g. Smack, SELinux).
The scratch space is filled with the first chunk (128 bytes) of data from the program file. This data will be used later to detect the binary format so it can be processed appropriately.

The parts of this setup process that depend on the particular file that's being executed are performed in an inner function; this function can be called again later to update those fields if a different file (e.g. a script interpreter) is actually run.

Finally, information about the program invocation is copied into the top of the new program's stack, using the local and utility functions. First, the program filename is pushed to the stack (and its location is saved in the field of the instance), followed by all of the environment values, then by all of the arguments. At the end of this process, the stack looks like:
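As a rough user-space model of that layout (a sketch only: the structure and function names are invented, a small byte array stands in for the new program's memory, and each vector is assumed to be pushed last-to-first so that its first element lands at the lowest address):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_STACK_SIZE 256

struct model_stack {
    char mem[MODEL_STACK_SIZE];
    size_t p;       /* offset of the lowest used byte; moves downward */
};

/* Copy one string (with its terminating NUL) just below the current top.
 * Returns 0 on success, -1 if it would not fit. */
static int model_push(struct model_stack *s, const char *str)
{
    size_t len = strlen(str) + 1;
    if (len > s->p)
        return -1;
    s->p -= len;
    memcpy(s->mem + s->p, str, len);
    return 0;
}

/* Push a NULL-terminated vector last-to-first, so that element 0 ends up
 * at the lowest address. */
static int model_push_vector(struct model_stack *s, char *const vec[])
{
    size_t n = 0;
    while (vec[n] != NULL)
        n++;
    while (n-- > 0)
        if (model_push(s, vec[n]) < 0)
            return -1;
    return 0;
}

/* Filename first, then the environment, then the arguments. */
static int model_build(struct model_stack *s, const char *filename,
                       char *const argv[], char *const envp[])
{
    assert(s != NULL && filename != NULL && argv != NULL && envp != NULL);

    memset(s->mem, 0, sizeof(s->mem));
    /* Leave room at the very top for a NULL pointer end marker. */
    s->p = MODEL_STACK_SIZE - sizeof(void *);

    if (model_push(s, filename) < 0 ||
        model_push_vector(s, envp) < 0 ||
        model_push_vector(s, argv) < 0)
        return -1;
    return 0;
}
```

After model_build() returns, reading upward from offset p yields the arguments in order, then the environment values, then the filename, with the end marker above them all.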

Binary format handler iteration:

With a complete in hand, the real business of program execution is performed in and (more importantly) . This code iterates over a list of objects, each of which provides a handler for a particular format of binary programs. A binary handler could potentially be defined in a kernel module, so the code calls for each format to ensure the relevant code can't be unloaded by another task while it's being used here.

For each handler object, the function pointer is called, passing in the object. If the handler code supports the binary format, it does whatever is needed to prepare the program for execution and returns success (>= 0). Otherwise, the handler returns a failure code (< 0) and iteration continues with the next handler.

Execution of a particular program may itself rely on execution of a different program; the obvious example is executable scripts, which need to invoke the script interpreter. To cope with this, the code can be called recursively, re-using the object. However, recursion depth is limited to prevent infinite recursion, giving the error behavior seen earlier.
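The iteration and its bounded recursion can be illustrated with a toy model. Everything here is an assumption made for the sketch: the type and function names are invented, a "file" is just a string, the depth limit is an arbitrary value rather than the kernel's, ELOOP is assumed to be the error reported when the limit is hit, and only ENOEXEC is treated as "not my format" so that other errors stop the search.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MODEL_MAX_DEPTH 4   /* hypothetical limit, not the kernel's value */

struct model_prog {
    const char *data;       /* start of the toy "file" contents */
    int depth;              /* current recursion depth */
};

struct model_format {
    int (*load)(struct model_prog *prog);
};

static int model_search(struct model_prog *prog);

/* Script-like format: needs another program, so it recurses. */
static int load_script(struct model_prog *prog)
{
    if (strncmp(prog->data, "#!", 2) != 0)
        return -ENOEXEC;
    prog->data += 2;        /* pretend the interpreter's contents follow */
    prog->depth++;
    int ret = model_search(prog);
    prog->depth--;
    return ret;
}

/* Terminal format: recognized by a four-byte magic value. */
static int load_elf_like(struct model_prog *prog)
{
    return strncmp(prog->data, "\177ELF", 4) == 0 ? 0 : -ENOEXEC;
}

static const struct model_format formats[] = {
    { load_script },
    { load_elf_like },
};

/* Offer the program to each handler in turn until one accepts it. */
static int model_search(struct model_prog *prog)
{
    assert(prog != NULL && prog->data != NULL);

    if (prog->depth > MODEL_MAX_DEPTH)
        return -ELOOP;

    for (size_t i = 0; i < sizeof(formats) / sizeof(formats[0]); i++) {
        int ret = formats[i].load(prog);
        if (ret != -ENOEXEC)
            return ret;     /* handled (>= 0) or failed for another reason */
    }
    return -ENOEXEC;
}
```

In this model a string with two leading "#!" markers resolves to the terminal format after two levels of recursion, while a chain longer than the limit gives up with the loop error instead of recursing forever.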

The system's LSM also gets a say in the operation; before the iteration over binary formats starts, the LSM hook is triggered, allowing the LSM to make a decision on whether to allow the operation. To do so, it may use the state it stored in the field earlier.

At the end of the iteration, if no formats that can handle the program have been found (and the program appears to be binary rather than text, at least according to the first four bytes), then the code will also attempt to load a module named , where XXXX is the hex value of bytes three and four in the program file. This is an old mechanism (added in 1996 for Linux 1.3.57) to allow for a more dynamic way of associating binary format handlers with formats; the more recent mechanism (described below) allows a more flexible way of doing something similar.

Binary formats

So what are the binary formats available in the standard kernel? A search for code that registers instances of (via and ) gives us quite a collection of possible formats, all of which are configured and explained in the file:

: Support for interpreted scripts, starting with a line.
: Support miscellaneous binary formats, according to runtime configuration.
: Support for ELF format binaries.
: Support for traditional a.out format binaries.
: Support for flat format binaries.
: Support for Intel ELF binaries running on Alpha machines.
: Support for ELF FDPIC binaries.
: Support for SOM format binaries (an HP/UX PA-RISC format).

(plus a couple of other architecture-specific formats).

The next sections will examine the most important of these: interpreted scripts and the "miscellaneous" mechanism for supporting arbitrary formats; the next article will examine the ELF binary format — which is typically where all program execution ends up.

Script invocation:

Files that start with the character sequence (and have the execute bit set) are treated as scripts, handled by the handler. After checking those first two bytes, this code parses the rest of the script-invocation line, splitting it into an interpreter name (everything after up to the first white space) and possible arguments (everything else up to the end of the line, stripping external white space).

(One detail to note: back when the object was created, only the first 128 bytes of the program were retrieved. This means that if the interpreter name and arguments are longer than this, the results will be truncated.)
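A sketch of that parsing step, written as ordinary user-space C rather than copied from the kernel. The function name parse_shebang and the constant MODEL_BUF_SIZE are made up for illustration, and skipping white space between the two-byte marker and the interpreter name is an assumption of this model. The buffer is modified in place, and anything beyond the 128 bytes available is simply lost.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_BUF_SIZE 128   /* size of the scratch buffer described above */

/* Parse a "#!" line held in buf. On success returns 0, sets *interp to the
 * interpreter name and *arg to the rest of the line as one string (or NULL
 * if there is none). Returns -1 if this is not a usable script line. */
static int parse_shebang(char buf[MODEL_BUF_SIZE], char **interp, char **arg)
{
    assert(buf != NULL && interp != NULL && arg != NULL);

    if (buf[0] != '#' || buf[1] != '!')
        return -1;

    buf[MODEL_BUF_SIZE - 1] = '\0';     /* anything longer is truncated */
    char *end = strchr(buf, '\n');
    if (end == NULL)
        end = buf + strlen(buf);
    *end = '\0';

    /* Strip trailing white space. */
    while (end > buf + 2 && (end[-1] == ' ' || end[-1] == '\t'))
        *--end = '\0';

    /* Skip white space between "#!" and the interpreter name. */
    char *p = buf + 2;
    while (*p == ' ' || *p == '\t')
        p++;
    if (*p == '\0')
        return -1;                      /* no interpreter named */
    *interp = p;

    /* The name runs up to the first white space. */
    while (*p != '\0' && *p != ' ' && *p != '\t')
        p++;
    if (*p == '\0') {
        *arg = NULL;
        return 0;
    }
    *p++ = '\0';

    /* Everything that remains is kept as a single argument string. */
    while (*p == ' ' || *p == '\t')
        p++;
    *arg = (*p != '\0') ? p : NULL;
    return 0;
}
```

Note that the remainder of the line is not split on white space: several options written after the interpreter name arrive as one argument, which matches the single extra argument seen in the user-space experiment.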

With these in hand, the code then removes from the top of the new program's stack (i.e. at the lowest address), and in its place pushes the following, adjusting the value in the object along the way:

the program name
(optionally) the collected interpreter arguments
the name of the interpreter program

Taken together, this explains the user space behavior we observed at the beginning of the article; our new program's stack is modified to look like:

The code also changes the value in the structure so that it references the interpreter filename, rather than the script filename. This explains why the structure refers to two strings: one () is the program that we currently want to execute, and one is the name () that was originally invoked in the call. Along similar lines, the field in the is also updated to reference the new interpreter program, and the first 128 bytes of its contents read into the scratch space.

The script handler code then recurses into to repeat the whole process for the script interpreter program. If the interpreter is itself a script, then the value will be changed once again but the will stay unchanged.

Miscellaneous interpreter detection:

We saw previously that early versions of the Linux kernel supported a rough-and-ready way of dynamically adding format support, by hunting for a kernel module with a name containing the early bytes of the binary. That's not particularly convenient — only searching on a couple of bytes is very limited (compare the vast range of detection signatures that the command uses) and requiring a kernel module raises the barrier to entry.

The miscellaneous binary format handler allows a more flexible and dynamic method of dealing with new formats, by allowing run-time configuration (via a special filesystem mounted under ) to specify:

How to recognize a supported format, based on filename extension or a magic value at a particular offset. (As with parsing script interpreters, this magic value has to fall within the first 128 bytes of the program file.)
The interpreter program to invoke, which will get the program filename passed to it as (as with script invocation).

A good example of the miscellaneous format handler in use is for Java files: detect files (based on their prefix) or files (based on the extension) and automatically invoke the JVM executable on them. This will require a wrapper script to provide the relevant command-line arguments, as the miscellaneous configuration doesn't allow arguments to be specified — which means that the miscellaneous handler will invoke the script handler, which will then invoke the ELF handler for the JVM executable (and which will probably in turn invoke the dynamic linker , although that's a somewhat different story).
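The two recognition rules can be sketched as a pair of small predicates. This is an illustrative model, not the kernel's implementation: the function names are invented, the optional bit mask is an assumption about how a magic comparison might be relaxed, and the only format-specific fact relied on is that Java class files begin with the bytes 0xCA 0xFE 0xBA 0xBE.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_BUF_SIZE 128   /* only this much of the file is examined */

/* Return 1 if the `len` bytes of `magic` appear at `offset` within the
 * examined part of the file, comparing only the bits set in `mask`
 * (mask may be NULL to compare every bit); otherwise return 0. */
static int magic_matches(const unsigned char *buf, size_t buflen,
                         size_t offset, const unsigned char *magic,
                         const unsigned char *mask, size_t len)
{
    assert(buf != NULL && magic != NULL);

    if (buflen > MODEL_BUF_SIZE)
        buflen = MODEL_BUF_SIZE;
    if (len == 0 || offset > buflen || len > buflen - offset)
        return 0;

    for (size_t i = 0; i < len; i++) {
        unsigned char m = (mask != NULL) ? mask[i] : 0xff;
        if ((buf[offset + i] & m) != (magic[i] & m))
            return 0;
    }
    return 1;
}

/* Return 1 if `filename` ends in ".<ext>", otherwise 0. */
static int extension_matches(const char *filename, const char *ext)
{
    assert(filename != NULL && ext != NULL);

    const char *dot = strrchr(filename, '.');
    return dot != NULL && strcmp(dot + 1, ext) == 0;
}
```

A configuration entry would then pair one of these tests with the path of the interpreter to run when the test succeeds.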

Internally, the kernel implementation for this format is similar to the handler for script programs described above, except that there is an initial search for a matching configuration entry, and that configuration is used to make some of the details (such as removing ) optional.

The format handlers for both scripts and miscellaneous formats recurse on to attempt to invoke the interpreter program that is needed for that particular format. This recursion has to end at some point, and on a modern Linux system this is almost always at an ELF binary program, which is the subject of the next article. Stay tuned.

URL: https://lwn.net/Articles/630727/