Update: this document is obsolete. Most of what is said here fortunately did not have to come true. The premise, that scheduler activations are needed for scalable threads, is not true. A much simpler and far superior solution has been implemented in the form of NPTL. If we had known then the method we now use to scale the number of threads on IA-32, we never would have devised this complicated mechanism. Hindsight is 20-20. But there are still people who think scheduler activations are a viable method. I pity the fools.

Design of the New GNU Thread Library

Ulrich Drepper
2002-4-13

On March 26th and 27th a meeting was held at Compaq's facilities in Nashua, NH, with the goal of designing a new thread library for the GNU C library to replace the current LinuxThreads library. The participants were:

Bill Abt, IBM
Dave Butenhof, Compaq
Ed Sharpe, HP (only on the 27th)
Finnbarr Murphy, Compaq
Jerry Harrow, Compaq
John Piacente, Compaq
Karl Runge, Sun
Kaz Kylheku
Larry Dwyer, HP (only on the 27th)
Peter Portante, Compaq
Ulrich Drepper, Red Hat

Summary

Provide the Linux kernel with improved dynamic control over system load distribution, while allowing threaded applications to adapt dynamically to the kernel's decisions. This will result in better-performing applications that consume fewer kernel resources.

The following is the result of the discussions. We tried to provide a complete overview of the problem at hand and to suggest a solution. We could not do more than that since collaboration from other parties is necessary to make a final decision. These parties are the developers of the OS kernels, Linux and Hurd. As a first step this document is meant to help communicate the group's opinion to the kernel developers. If an agreement on the proposed lines can be reached the document will be extended to describe the solution.

This document was prepared by Ulrich Drepper with lots of help from the other members of the group. All comments should be sent to me; I'll distribute them if necessary to the other participants.

Introduction

POSIX threads are an important component of a modern Unix system. They provide an interface which is probably/definitely not the absolute best of all possible thread APIs. But POSIX is always about compromises: the 'P' stands for portable, which is the aspect users are looking for.

Threads are used for various reasons, and the API is general enough to give users a great deal of freedom in how they use them. This wide variety adds a number of requirements to the implementation of the thread library.

The conclusion which can be drawn from these requirements is that the library must handle arbitrary numbers of threads (scalability) but at the same time must not get in the way of using the machine's resources effectively (efficiency). Only in an ideal world could the requirement to support unreasonably large numbers of threads be dropped.

Hardware restrictions put hard limits on the number of threads the kernel can support for each process. Specifically this applies to IA-32 (and AMD x86_64) where the thread register is a segment register. The processor architecture puts an upper limit on the number of segment register values which can be used (8192 in this case).

This is one of the reasons why it is absolutely necessary to think about two-level scheduling for the threads. I.e., the actual user threads are distinct from the kernel threads (or light-weight processes, or whatever one wants to call them) and are scheduled separately. This is generally called the M-on-N model of thread implementation. The 1-on-1 model dedicates a unique kernel thread to each user-level thread; this is the model used by the current, inadequate thread library implementation.

In the subsequent discussion of methods to implement the M-on-N model efficiently we will see that, as a side effect, programs using small numbers of threads will also benefit. A system running many threaded programs (large or small) can also benefit from the greater cooperation fostered between those programs in their use of system resources.

Before we start this discussion it is important to note some other requirements on the new thread library. Parallelization can be exploited on different kinds of machines and different processors. Neither the development of machines nor of processor technologies is expected to stagnate in the foreseeable future. Besides the preeminent uni-processor and symmetric multi-processor (SMP) machines in use today, machines with non-uniform memory architecture (NUMA) are going to be important in the near future. Modern processors provide many more methods useful in multi-processor environments than older versions did, and these new methods should be taken advantage of where possible. New generations of processors will support SMP machines with larger numbers of processors, and several internal improvements, such as memory ordering during execution, also influence the implementation of the thread library. The consequence is that the implementation should not only take into account today's machines and processor architectures but must look ahead and enable adaptation to newly available technology.

How to Implement the M-on-N Model?

Two-level scheduling means that besides the scheduling the kernel performs to keep the various processes (and the threads contained therein) running, there is also a scheduler at work at the user level. The user-level scheduler is responsible for executing the user-level threads using the resources the kernel provides, the kernel threads.

A simple-minded implementation of the user-level scheduler could use the points where the execution path of a thread enters the thread library (i.e., calls to any function in the thread library) as opportunities to schedule a new thread. This is a very questionable approach since no thread really has to call such a function, and even if these functions are called there is no guarantee that the calls come at times which are reasonable for a re-schedule. Fair progress in such a scenario could only be achieved by forcing re-entry into the library at regular intervals, for instance with an interval timer.

Even if one of these methods were adopted, this simple approach falls completely flat when it comes to using the time a thread waits for I/O, or simply waits for some time to pass, to execute other threads. Since the waiting does not happen in the thread library no scheduling can happen.

For some of the operations there is a possible work-around. For many operations which use file descriptors the descriptor can be put into non-blocking mode by setting the O_NONBLOCK flag. This prevents the execution from blocking in the kernel; instead control returns to the user level with an appropriate error, which can then be interpreted. There is of course the issue of not changing the semantics of the program (file descriptors must not be put into non-blocking mode globally, only when needed), and not all situations in which the kernel can block have such a work-around.
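The following is a minimal sketch of what such a wrapper could look like inside the thread library. The function user_sched_wait_fd() is a hypothetical library-internal interface, assumed to suspend the calling user-level thread and run another one until the descriptor is reported ready; everything else uses standard calls.

    /* Sketch only: a read() replacement an M-on-N library might use to
       avoid blocking in the kernel.  A real implementation would also
       have to hide the O_NONBLOCK flag change from the application.  */
    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    extern void user_sched_wait_fd(int fd, int for_write);   /* assumed */

    ssize_t lib_read(int fd, void *buf, size_t count)
    {
        int flags = fcntl(fd, F_GETFL);
        if (flags >= 0 && !(flags & O_NONBLOCK))
            fcntl(fd, F_SETFL, flags | O_NONBLOCK);

        for (;;) {
            ssize_t n = read(fd, buf, count);
            if (n >= 0 || (errno != EAGAIN && errno != EWOULDBLOCK))
                return n;                     /* done, or a real error */
            /* Would have blocked: run another user-level thread and
               come back when the descriptor is reported ready.  */
            user_sched_wait_fd(fd, 0);
        }
    }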

Prevention of blocking is not the only problem. The thread library also has to determine when the blocking reason has been removed. In the case of a file operation it would be possible to poll the file descriptor and determine whether I/O is possible. Unless a way is found for this polling to happen asynchronously from the moment the blocking was noticed, there can be a significant delay in noticing the removal of the blocking. It would be possible to create a separate thread whose only task is to check for blocking reasons (plural, since it might be necessary to track many of them). In addition the thread library has to handle explicit delays caused by functions such as nanosleep().
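A rough sketch of such a dedicated watcher thread follows, assuming a library-internal table of blocked descriptors and the hypothetical helpers user_sched_wake_fd() and next_timer_expiry_ms() (the latter standing in for the nanosleep() bookkeeping mentioned above):

    /* Sketch only: a watcher thread which polls the descriptors on
       which user-level threads are blocked and wakes them up.  */
    #include <poll.h>

    #define MAX_BLOCKED 256

    extern struct pollfd blocked_fds[MAX_BLOCKED];   /* assumed table  */
    extern int nblocked;                             /* assumed        */
    extern void user_sched_wake_fd(int fd);          /* assumed        */
    extern int next_timer_expiry_ms(void);           /* assumed: next
                                                        nanosleep() expiry */

    static void *watcher_thread(void *arg)
    {
        (void) arg;
        for (;;) {
            /* Time out at the next user-level timer expiry so explicit
               delays are handled here as well.  */
            int n = poll(blocked_fds, nblocked, next_timer_expiry_ms());
            for (int i = 0; n > 0 && i < nblocked; ++i)
                if (blocked_fds[i].revents != 0) {
                    user_sched_wake_fd(blocked_fds[i].fd);
                    --n;
                }
            /* expired user-level timers would be processed here too */
        }
        return 0;
    }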

It is possible to write an M-on-N thread library with these techniques. Properties of the implementation would be:

These are not the only problems with such an implementation. A few examples will show some more. As we will see, this exposes not only problems with such an M-on-N implementation, but also problems a thread library following the 1-on-1 model would have.

Example 1:

A machine with N, N > 1, processors is used to execute a program which is easy to parallelize. An application is started which takes advantage of all the processors and (nearly) linear acceleration can be measured.

Now a second, similar process is started. It also wants to take advantage of the N processors and creates that many threads. The result is that the performance of each process is less than half of what it would be if only one such process were running. The cost of the context switching demands its share of the processor power.

In this case it would be best if the new process allocated only N/2 threads and the already running process were made to give up the same number. The system load is then balanced, and both processes may run at speeds close to the theoretical maximum; the overhead of scheduling kernel threads is gone.

Example 2:

A program is using a number of threads which is higher than the number of processors available. The program is working on a large dataset which has to be loaded into memory via mmap().

In such a scenario the process can block in the kernel on page faults when accessing the mapped file data (there is no way to avoid this). The time slice of the thread is lost instead of being given to another thread of the process. In addition the kernel will run the different kernel threads according to its own rules.

Example 3:

An application could take advantage of a large number of threads, but the threads are tightly coupled because they work on the same memory region. Blindly allocating resources can lead to bad performance on NUMA machines. The kernel cannot make good decisions without having the resource requirements communicated to it.

If the user-level thread library were expected to make the right decisions for the given environment it would have to be told about the machine's topology, available resources, and many more details. There are currently no means to get all this information, and even if there were, its volume could be large. It is easier and more precise to specify the requirements.

Example 4:

In a multi-threaded application with high lock contention the execution contexts of many of the kernel threads are wasted if the kernel does not notify the thread library about preemption. If all threads but one are blocked on a mutex, the runnable thread will be preempted at some point. If in this case it could take over another kernel thread, which is otherwise unused, the blocking could be lifted sooner. This makes a difference not only on multi-processor machines; it can also help on a single-processor machine with more than one running process.

The parts of the OS developer world which looked into these problems came to the conclusion that a new concept was needed. It was introduced in the early 90s as Scheduler Activations. In all the years since then nobody has found a mechanism which promises equal performance, scalability, and adaptability to new computer technologies.

The basic concept of scheduler activations (SA) is that the kernel and the user level work together on scheduling and resource usage issues. Whenever the kernel thinks that a process should change its resource usage it tells the user-level code that such a change is going to happen or has happened. The user-level code cannot reject the change. It can only react to it and make changes to the state of the process. In short, scheduler activations empower the kernel to influence the execution of the user code to a much higher degree. The kernel can call the user-level code at any time to communicate changes.

Processor resources, which are the initial concern here, are distributed as virtual processors. How they are mapped is up to the OS. A virtual processor need not run on the same physical processor all the time, but the number of virtual processors a process gets can never be higher than the number of physical processors in use. The virtual processor basically corresponds to the kernel thread of today's systems and it is the entity which the kernel scheduler handles.

To use these virtual processors the kernel provides the user-level code with information about the way it maps them to physical processors. Events such as preemption are communicated to the user-level code using scheduler activations.

The user-level code can call into the kernel to create another virtual processor (which would correspond to a clone() call in the Linux kernel under the old model), but the difference is that the kernel does not have to grant one. Instead, if the kernel determines that the system is already loaded, it can tell the requesting process to use the already available resources. At any time the kernel can take virtual processors away, which allows handling hot-swappable resources (like processors) gracefully.

It does not have to stop there. If the request for more virtual processors is made flexible, such a request could contain information like "the process could use between 2 and 8 virtual processors". This would allow the kernel to provide the process with as many virtual processors as are currently reasonable to use. It also allows the kernel to add more virtual processors later when the system is no longer as loaded.

While this is a big step ahead it still need not be the end. Besides processor usage there are other resources which require consideration when creating virtual processors. Example 3 above already mentioned memory usage. The threaded application could ask for up to 4 threads, but only if they can work on the same memory area without major penalties. There are other resources, such as I/O capabilities, and maybe more in the future, which would benefit from this treatment. If the kernel interface to request virtual processors is generic enough they can all be adequately communicated to the kernel.

Kernel Requirements To Implement Scheduler Activations

While reading this document it is of great benefit to also read the original paper on scheduler activations: Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism by Anderson, Bershad, Lazowska, and Levy. We have not tried to duplicate all the details of the paper here; the referenced document should answer most of the remaining questions. The original paper and the follow-ups do not apply directly to the situation at hand. The implementation of the kernel side for a kernel like Linux will necessarily be different. Most of what is written still applies, though. And as far as the Hurd kernel is concerned, the original implementation was for a Mach-based kernel, so there might be even more leverage possible.

This is not the time and place where the details about the kernel implementation of the scheduler activations can be decided. Nor can we fix the API here. These decisions will have to be made by whoever implements the functionality. Instead we will concentrate on spelling out the requirements on the API.

The API consists of a number of individual interfaces. We need a number of system calls to request and free virtual processors. The kernel needs to communicate with the user-level code about the various events. We also need special signal handling. We will now discuss each of these requirements in more detail.

Requesting and Freeing Virtual Processors

In the motivation for the scheduler activation model we already mentioned that the request for a new virtual processor should not be as simple as the existing clone() interface. The clone() interface does not allow the user-level code to provide the kernel with enough information to make good decisions about resource allocation. Since the interface should also be prepared for future developments, the way in which information is passed should be extensible. Information identified as useful so far includes:

The concept of scheduler activations calls for the kernel to make the decisions about the creation of new virtual processors. Sometimes it is certainly useful for the user code to insist on a new virtual processor regardless of the machine status. This is not different from what clone() does today. To select this allocation mode another parameter for the SA request function is needed.

To make the upcalls (see below) the kernel needs a place where the upcall parameters are stored. These parameters include the state (= register contents etc.) of the interrupted thread, the state of another thread (which was perhaps preempted), and a code specifying the reason for the upcall. It is not always the case that a virtual processor used for a scheduler activation was already in use, which means there may be no user-level stack associated with the SA which could be used. Therefore it is necessary to associate a memory area with each virtual processor (and therefore each scheduler activation) which is used to pass up the parameters. This memory area must be made known to the kernel at the time the virtual processor is created.
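Purely as an illustration, a request for virtual processors could carry its information in a structure along the following lines. None of these names are part of any existing kernel interface; they only restate the requirements spelled out in this section (a flexible min/max request, a forced-allocation mode, resource hints, and a per-VP memory area for the upcall parameters).

    /* Hypothetical sketch of the information passed with a VP request. */
    #include <stddef.h>

    #define VP_FORCE 0x01            /* allocate regardless of system load */

    struct vp_request {
        unsigned int min_vps;        /* e.g. "between 2 and 8 processors"  */
        unsigned int max_vps;
        unsigned int flags;          /* e.g. VP_FORCE                      */
        unsigned int mem_node;       /* NUMA hint: keep VPs near this node */
        void        *upcall_area;    /* per-VP area for upcall parameters  */
        size_t       upcall_area_size;
    };

    /* Hypothetical system call: returns the number of VPs actually
       granted, which may be fewer than max_vps (or even than min_vps)
       depending on the system load.  */
    extern int sa_request_vps(struct vp_request *req);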

The system call to free a virtual processor is simpler. There is not much to it: the virtual processor will simply never be used again.

One special system call is needed to handle realtime scheduling correctly. Assume a process with two virtual processors allocated which has two high-priority threads and one low-priority thread, and one of the high-priority threads is blocked. The kernel then notifies the user-level code, via a scheduler activation on the virtual processor running the other high-priority thread, that the blocking reason has been removed. What now has to happen is that the low-priority thread is stopped and one of the high-priority threads takes its place. For this a system call of some sort needs to be available. It should force a scheduler activation to be performed on the virtual processor the low-priority thread is using. The upcall handler will then reschedule the threads and the second high-priority thread will automatically start running. The needed system call is more or less a normal signal. It might be possible to implement it this way (i.e., without a new system call) but the result might not be optimal.
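As a sketch of this sequence, assume a hypothetical system call sa_kick_vp() which forces a scheduler activation on a given virtual processor; the upcall handler on that VP then performs the user-level reschedule so the woken high-priority thread replaces the low-priority one:

    /* Hypothetical: force an SA on the given virtual processor. */
    extern int sa_kick_vp(int vp_id);                 /* assumed syscall */

    /* Called from the upcall handler which received the unblocking
       notice for a high-priority thread while a low-priority thread
       occupies another VP.  */
    static void preempt_low_priority(int vp_running_low_prio)
    {
        if (sa_kick_vp(vp_running_low_prio) < 0) {
            /* Fallback: a normal signal directed at that kernel thread
               would achieve the same effect, at higher cost.  */
        }
    }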

Communication With The User Code

The communication from the kernel to the user-level code must happen in an asynchronous way. I.e., it cannot happen as the result of a returning system call. It is more like a signal sent to the user code. In fact, (mis)using signals is one possible way of implementing the communication. But it is probably better to invent a new, specialized mechanism which does not have the overhead associated with signal delivery. Such a mechanism is usually called an upcall.

Just as for signal handlers, the user code must register an upcall handler. There should be a system call for this. The interface is quite simple: the kernel has to know about a function and probably some flags which allow selecting the kinds of events which are reported. Possibly the sigaction() interface can be reused (just pick a signal number which cannot regularly be used).
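Both registration variants could look roughly like the following. The dedicated system call, the event flags, and the reserved signal number SIGUPCALL are all assumptions made for illustration; only sigaction() itself is an existing interface.

    #include <signal.h>
    #include <string.h>

    /* Hypothetical event-selection flags. */
    #define SA_EVT_PREEMPT  0x01
    #define SA_EVT_BLOCK    0x02
    #define SA_EVT_UNBLOCK  0x04
    #define SA_EVT_SIGNAL   0x08

    struct sa_upcall_info;                    /* filled in by the kernel */
    typedef void (*sa_upcall_t)(struct sa_upcall_info *info);

    /* Variant 1: a dedicated (assumed) system call. */
    extern int sa_register_upcall(sa_upcall_t handler, unsigned int event_mask);

    /* Variant 2: reuse sigaction() with a reserved signal number
       (SIGUPCALL is hypothetical).  */
    #define SIGUPCALL 63

    static void install_upcall_via_sigaction(void (*handler)(int, siginfo_t *, void *))
    {
        struct sigaction act;
        memset(&act, 0, sizeof act);
        act.sa_sigaction = handler;
        act.sa_flags = SA_SIGINFO;
        sigemptyset(&act.sa_mask);
        sigaction(SIGUPCALL, &act, NULL);
    }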

The integration of the scheduler activations with the traditional process model is challenging. Processes not using SA should not be penalized. To achieve this the normal action performed by, say, a preemption could be coded in a separate function. All places in the kernel which handle preemption would then call either this function or a different function performing the upcall, based on whether the process uses SA or not. Another question is whether there should be exactly one upcall or whether a number of different upcalls should be defined, one for each of the various reasons an upcall is made. These are details which are best decided by whoever implements the changes, though.

Each scheduler activation notification must communicate certain information. What the information is depends on the kind of event which is reported:

New Virtual Processor created

This event is reported with a brand new virtual processor. Nothing is known about it yet, so the kernel should report something about the resources used. It is not really important to do this for uni-processor or SMP machines since the possible environments are indistinguishable (assuming it is not allowed to run different types of CPU in an SMP system). For NUMA systems it is important to know which cluster the virtual processor runs on, to determine the distance to resources. How this information is passed up remains to be seen.

Preemption

The preemption of a virtual processor is communicated to the user-level code using an SA on a different virtual processor; it obviously cannot be done using the one which was preempted. Therefore the information passed to the upcall handler contains the state (= register contents etc.) of the virtual processor which was preempted and the state of the virtual processor used for the upcall. The state must contain enough information for the user-level upcall handler to determine which virtual processor was preempted and which one is used for the upcall. This should be possible by inspecting the register contents and looking at a thread-specific value such as a thread register or the stack pointer.
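A sketch of how an upcall handler might process such a preemption event follows. The structure layout, event codes, and helper functions are all assumptions; the text only requires that the delivered state be rich enough (stack pointer, thread register) to identify the affected user-level threads.

    /* Sketch only: upcall handler for preemption-style events. */
    enum sa_event { SA_PREEMPT, SA_REVOKE, SA_BLOCK, SA_UNBLOCK, SA_SIGNAL };

    struct vp_state {                 /* saved state of a virtual processor */
        unsigned long sp;             /* stack pointer                      */
        unsigned long thread_reg;     /* thread register / TLS base         */
        /* ... remaining register contents ... */
    };

    struct sa_upcall_info {
        enum sa_event   reason;
        struct vp_state preempted;    /* state of the interrupted VP        */
        struct vp_state current;      /* state of the VP running the upcall */
    };

    struct uthread;                                                    /* library-internal */
    extern struct uthread *find_thread_by_state(const struct vp_state *); /* assumed */
    extern void user_sched_enqueue(struct uthread *);                      /* assumed */
    extern void user_sched_run_next(void);                                 /* does not return */

    void upcall_handler(struct sa_upcall_info *info)
    {
        if (info->reason == SA_PREEMPT || info->reason == SA_REVOKE) {
            /* Requeue both interrupted user-level threads: the one whose
               VP was preempted (or revoked) and the one displaced by this
               upcall, then let the user-level scheduler pick what to run. */
            user_sched_enqueue(find_thread_by_state(&info->preempted));
            user_sched_enqueue(find_thread_by_state(&info->current));
            user_sched_run_next();
        }
        /* other event kinds are discussed in the following sections */
    }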

One extension to the SA model is to report preemption even in the case where the preempted virtual processor is the last one of a process. In this case the report is delayed until the virtual processor is next scheduled by the kernel. Implementing this extension would allow the user-level thread library to implement fair scheduling using time slicing. This is not always desirable, so this mode should be selectable via the flags passed to the kernel in the system call to set the upcall handler(s).

Timeslicing at the user level can be improved even further if the kernel notifies the user-mode scheduler when a kernel entity's time slice is over, even if it chooses not to preempt it.

Virtual Processor Revoked

The kernel can revoke a virtual processor at any time, provided the process is left with at least one virtual processor afterwards. This event is just a special case of preemption, and in the theoretical SA model it does not have to be distinguished since for every upcall a new SA is created. In the implementable model we probably would need such a special upcall, though. The event has to be reported using one of the remaining virtual processors of the process. The information which has to be passed to the upcall is the state of the two involved virtual processors: the one which was revoked and the one which was preempted to report the event. Everything said about the preemption notification is true here, too.

Blocking Notice

If execution would block in the kernel, the kernel has to remember this fact but then immediately signal the event by calling the upcall handler. This allows the user-level code to use the virtual processor for a different user-level thread. The upcall is made using a scheduler activation on the same virtual processor which would have blocked. Therefore only one virtual processor state is saved, that of the virtual processor which would have blocked.

What remains to be solved is how the user-level code will recognize the unblocking event (see below). The kernel could create a unique ID for the scheduler activation reporting the blocking event, and an unblocking event with this ID would follow later. This is once again an implementation detail which is best decided by the implementor. All that is needed is a way to match the blocking event with the unblocking event which will eventually follow.
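On the user-level side this matching only requires some simple bookkeeping, for instance a table mapping the kernel-provided blocking ID to the suspended user-level thread. The sketch below assumes such an ID-based scheme; all names are hypothetical.

    /* Sketch only: match blocking and unblocking notices by ID. */
    #define MAX_OUTSTANDING 1024

    struct uthread;                           /* user-level thread descriptor */

    static struct {
        unsigned long   id;                   /* ID from the blocking upcall  */
        struct uthread *thread;               /* thread waiting on that event */
    } outstanding[MAX_OUTSTANDING];

    static void note_blocked(unsigned long id, struct uthread *t)
    {
        for (int i = 0; i < MAX_OUTSTANDING; ++i)
            if (outstanding[i].thread == NULL) {
                outstanding[i].id = id;
                outstanding[i].thread = t;
                return;
            }
    }

    static struct uthread *match_unblocked(unsigned long id)
    {
        for (int i = 0; i < MAX_OUTSTANDING; ++i)
            if (outstanding[i].thread != NULL && outstanding[i].id == id) {
                struct uthread *t = outstanding[i].thread;
                outstanding[i].thread = NULL;
                return t;                     /* this thread can run again */
            }
        return NULL;
    }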

Please note that the reasons for blocking are numerous. They include not only the usual waiting for data from a device or for room to write to a device; they also include circumstances like a page fault. The current virtual processor cannot continue and it might take a while until the data is loaded, but the user-level thread library might be able to find another thread which can run. To distinguish the various blocking reasons a code describing them should be passed to the upcall as well.

Unblocking Notice

The unblocking notice is sent by the kernel when the blocking reason is gone. The ID for the blocking event, mentioned in the discussion of the blocking notification above, must be passed to the upcall together with the state of the preempted virtual processor. This allows the user-level code to determine which user-level thread can be continued and which was preempted for the notification.

One side effect of reporting the scheduling events to the user-level code is worth noting. At any time the kernel has to have only as many kernel stacks allocated as there are virtual processors available to the process. With a 1-on-1 model the number of kernel stacks can grow significantly, as each thread needs one. The implementation of the unblocking notification needs some form of representation for the many blocked threads, but this only has to be an identifier, not a kernel thread descriptor with a stack. The use of scheduler activations can therefore save quite a bit of memory which would otherwise have to be allocated in the kernel.

Signal Handling in the SA Implementation

As with any implementation of the M-on-N model, the handling of signals is tricky. Only one of the threads of the process needs to have a signal unblocked for it to be delivered. The idea that the Linux kernel should combine the signal masks of the kernel threads has already been rejected. This really does not make much sense from the user-level code's point of view either, since not all user-level threads have a CPU at all times, and even then (in a 1-on-1 model) it would require a separate system call to set the signal mask whenever a new user-level thread runs on top of a kernel thread.

For this reason the concept of a process signal mask must be introduced. The user-level code would be responsible for combining the masks of the various threads into the process mask. A system call to change the signal mask would only be necessary if one or more of the bits flip, i.e., either when the last thread comes to block a signal or when the first thread unblocks it. These situations should be rare.
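One way the library could do this combining is to keep a per-signal count of threads blocking each signal and only invoke the kernel on a flip. The sketch below assumes a new system call set_process_sigmask() of the kind proposed here; the counters and the mask mirror are library-internal and the mask is assumed to be initialized with sigemptyset() at start-up.

    #include <signal.h>

    extern int set_process_sigmask(const sigset_t *mask);  /* assumed new syscall */

    static int      block_count[NSIG];  /* threads currently blocking each signal */
    static int      nthreads;           /* total number of user-level threads     */
    static sigset_t process_mask;       /* mirror of the kernel-side process mask */

    /* Called whenever one user-level thread blocks or unblocks a signal. */
    static void thread_block_signal(int sig, int block)
    {
        int changed = 0;
        if (block) {
            if (++block_count[sig] == nthreads) {   /* last thread blocked it   */
                sigaddset(&process_mask, sig);
                changed = 1;
            }
        } else {
            if (block_count[sig]-- == nthreads) {   /* first thread unblocked it */
                sigdelset(&process_mask, sig);
                changed = 1;
            }
        }
        if (changed)
            set_process_sigmask(&process_mask);     /* only on actual bit flips */
    }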

The 2.5 kernel series already provides something resembling a process signal mask. If the appropriate clone() flag is given, all signals are routed through the thread-group leader. The signal mask of this thread acts as the process signal mask. This implementation is not easy to use in an SA implementation, though. The problem is that there is no support for changing the signal mask from a thread other than the thread-group leader, yet each thread in the process can cause the need to change the process signal mask. An overly complicated and time-consuming construct would have to be devised to get the user-level thread running on the thread-group leader's kernel thread to perform the signal mask change. It is possible with the system calls mentioned above, but it is neither the most elegant nor the fastest solution. It would be better if the kernel had a proper process signal mask concept. Since all other signal masks in the process are unused, the storage used for the process signal mask could still be the one provided for the thread-group leader's signal mask. But there would have to be provisions of some sort which allow all threads to change this signal mask. Maybe the thread data structures in the kernel could contain a signal mask pointer (not only the structure itself) which could point to the thread-group leader's signal mask memory.

The delivery of signals to a multi-threaded process in the model used by the 2.5 kernel is awkward and tailored to the specific implementation of the M-on-N model in the NGPT thread library. It does not allow simultaneous delivery of multiple signals and requires delivery to the thread-group leader, which might have to be rescheduled since it is possibly not running at all.

We propose a different mechanism. Before we come to the description, some facts. At no time does the thread which is willing to receive the signal need to be scheduled on any kernel thread. The process signal mask does not allow distinguishing between the different kernel threads anyhow. This means that the signal must be caught at the user level by a special function and not delivered directly to the signal handler. The function which gets called has to determine which thread is receiving the signal (if necessary it has to schedule that thread) and then call the signal handler, passing it the signal context etc.

In the SA model we already have a special function which receives calls from the kernel: the upcall handler. It is therefore not a bad idea to require the kernel to deliver signals as upcalls. This would also have the advantage that as many signals can be delivered in parallel as the process has virtual processors. At least from the user-level point of view it seems a perfect match.

In such a model the signal handler registered in the kernel is only used to distinguish between cases which must be handled in the kernel and those which must be handled at the user level. Which function implements the exact user-level handler is unimportant. The situations which have to be dealt with in the kernel are SIG_IGN (the signal is simply discarded) and SIG_DFL when the default action is to terminate the process.

If a signal is delivered to the upcall handler even though the handler was set to a value which requires handling in the kernel, the user-level code will have to deal with it. In a multi-processor environment this can happen even after sigaction() has completely finished and installed the SIG_IGN or SIG_DFL handler. In such a case the user-level code can do what it wants; it is a classic race condition. A reasonable solution would be to call the old signal handler, which the user-level thread library would have to keep around.
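The user-level dispatch described in the last few paragraphs might look roughly like this. The per-signal tables and the scheduler helpers are assumed library-internal interfaces; only the signal types themselves are standard.

    /* Sketch only: delivering a signal received via an upcall. */
    #include <signal.h>

    struct uthread;
    extern struct uthread *find_thread_accepting(int sig);            /* assumed */
    extern void run_handler_on_thread(struct uthread *t, int sig,
                                      siginfo_t *si, void *ctx);      /* assumed */
    extern struct sigaction user_actions[NSIG];  /* dispositions seen by the app */
    extern struct sigaction old_actions[NSIG];   /* kept for the race case       */

    static void deliver_signal_upcall(int sig, siginfo_t *si, void *ctx)
    {
        void (*h)(int) = user_actions[sig].sa_handler;

        if (h == SIG_IGN)
            return;                              /* nothing to do */

        if (h == SIG_DFL) {
            /* Race: the kernel delivered the signal before the new
               disposition reached it.  Fall back on the old handler
               kept around for exactly this case.  */
            if (old_actions[sig].sa_handler != SIG_DFL &&
                old_actions[sig].sa_handler != SIG_IGN)
                old_actions[sig].sa_handler(sig);
            return;
        }

        /* Normal case: pick a user-level thread with the signal
           unblocked (scheduling it if necessary) and invoke the
           application's handler in its context.  */
        struct uthread *t = find_thread_accepting(sig);
        run_handler_on_thread(t, sig, si, ctx);
    }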

Expected System/Library Changes

The final decisions about the design of the thread library will have to be made once the kernel details are decided. As far as the kernel is concerned we are currently looking for an implementation with the following characteristics:

The Alternative

If the Linux kernel cannot be extended following the scheduler activation model we have to fall back on the model described at the beginning of the section How to Implement the M-on-N Model?. It would be possible to write a POSIX-compliant implementation using the functionality currently available in the 2.5 series of the Linux kernel. But the drawbacks should be known:

We really want to avoid such an implementation if at all possible since it does not reflect what people expect today. But we would have no other choice: the current implementation is not usable for much longer since it does not use the latest kernel features and would have to be changed and extended beyond what is reasonable to do without a fresh start.

Conclusion

The proposal in this document mostly reflects the opinions of developers of the user-level side of the problem. It is based on various papers about complete implementations of the SA model which do include the kernel side. The evidence from those systems is strong enough that we believe an implementation following this proposal would benefit everybody. Even for somebody for whom the POSIX thread model is not the ultimate goal, the SA handling provides the means to write a high-performance thread library, one which is better than would be possible with a simple 1-on-1 model. In no place is the SA model specially tailored to POSIX threads. There are additional costs added by the SA mechanism, but with a clever design in the kernel these should be negligible for processes not using SA. If SA are used the costs are added to the user-level code, and all indications are that these extra costs are far outweighed by the benefits.

One word of warning. While the group spent a considerable amount of time coming up with this proposal and describing what might be necessary, especially with respect to the requirements on the kernel, there is no guarantee that everything is covered. Only a real implementation will show whether all the requirements have really been spelled out.