Thread Numbers and Stacks

Ulrich Drepper
2005-10-26
Copyright © 2005
All rights reserved.
No redistribution allowed

Questions about how to create many threads are constantly asked. Leaving aside the stupidity to abstract a problem by creating huge amounts of threads the issues at hand should be obvious to everybody who understands a bit about computer architecture. This should include everybody who is programming.

Anyway, to rob people of the last excuse to ever ask these questions again here is a brief explanation of the problem and the factors which play part in the solution. It is all really trivial.

Prerequisites

Every process gets its own virtual address space in which it has to fit all the code and data it will use. On 32-bit systems the available address space varies between 3G and 4G, on 64-bit systems these days around 16T. A normal application is split into several parts: the executable itself, the dynamic linker, and all the dependencies like the DSO for the C library. Following immediate the executable is the heap, a region of memory used for dynamic memory allocation. In addition each process needs a stack. The initial stack is allocated by the kernel. After startup, the address space of a process could look like this:

The position of the dynamic linker, dependencies (libc here), and if necessary even the stack are up to the system to chose. There are many factors playing part in the actual decision. Addresses are sometimes randomized and DSOs are loaded low (see the paper of Security Enhancements in RHEL).

Once the process is started it almost certainly needs dynamically allocated memory to do its work. This memory is either allocated on the heap (by extending the upper limit further up to higher addresses) or memory is allocated using the mmap() interface which can place the memory in any place in the address space which is not allocated.

At this point the difference between address space allocation and memory use should be explained. This is really a fundamental concept in programming so everybody should know. But I lost faith in education of programmers a long time ago. Address space allocation is a prerequisite for memory usage. The OS must grant use (read, write, execute) to the process for a certain region of the address space. This is only an operation modifying the data structures which describe the virtual memory of the process. The virtual memory handling describes what each page of the entire virtual address space contains. The memory itself is not affected. If the process then touches the memory for the first time memory is actually allocated for it.

Even though the address space might seem huge there are applications which fill it up. This means the allocation functions cannot allow big gaps between the various allocations and every single little gap which exists could be allocated sooner or later.

One last comment: the memory used for the stack must be continuous. I.e., it is not possible to allocate address space for the stack in multiple pieces. It is also not possible to switch to another address space region for the stack after the thread started running. The allocation for the step is made once and for all times when the thread (including the sole, initial thread) is created.

Adding Threads to the Picture

When adding a new thread to a process not much really happens since the threads in a process share almost all of the processes resources. The only distinguishing features of the threads are the execution contexts which contain among other things separate sets of register contexts for each thread. One of the resources used by the processor is the stack which is usually pointed to by one of the registers of the CPU. To avoid having the threads trampling on each others feet the address space regions used for the stacks of different threads are kept separate.

As said before, the address space region used for the stack must be consecutive and it must be allocated when the thread is started since otherwise parts of the address space can be used for other things. This means it is not possible to grow the address space for a thread.

Looking at the figure above, adding a new thread stack means adding a new block of appropriate size. Each thread stack must fit in one block, i.e., be continuous. This clearly shows that the fragmentation of the address space (as represented by the non-continuous regions used for the DSOs) and the limited amount of unused memory puts a limit on the number of thread stacks which can fit into the picture.

This does not mean, again following the previous discussion, that an allocation of 2M really needs 2M of memory. In most cases the actual memory use for a stack is only a few kilobytes, rounded up to the next multiple of the page size. In no situation is more memory committed for the stack then is actually used because only the above mentioned demand paging actually allocates memory.

The Math of Thread Stacks

Some simple math should enlighten shows why a naive program trying to create many threads will fail on 32-bit platforms. On RHEL4 and FC4 systems and later the default stack size is 10M. If one uses the normal x86 kernel which provides 3G of userspace address space one can, in lucky circumstances, get about 2.5G of available address space in which thread stacks can be allocated. This means a maximum total of 2.5G/10M = 250 threads can be created. And this more or less prohibits the program from using any dynamically allocated memory.

In practice this number is even less. In addition to dynamically allocated memory the programs themselves require address space for the code in the executable and all dependencies. With prelinking enabled (the default) the address space in addition gets fragmented. The DSOs are not necessarily loaded at consecutive addresses and there are gaps between them. Unless the gaps are by sheer coincidence a multiple of the thread stack size in size there is waste since, as said multiple times, the address space allocated for the stack must be continuous.Smaller thread stack sizes help to create more threads. With 100K stacks 25,000 threads could be created. Plus the amount of memory unusable for stacks in between two existing mappings is reduced.

On 64-bit machines all these issues are more theoretical, at least for the time being. With 10M stacks and (pessimistically) 15T of address space available, we still are able to create around 1,600,000 threads. The required RAM for the page tables alone would be huge, though. Such a machine would have to be well equipped.

Thread Stack Sizes

Now the big question is: how are the thread stack sizes determined? There are two possibilities:

the stack size can be explicitly specified when the thread is created;
a default value is used if no explicit size specification is used.

The former can be achieved by passing a pthread_attr_t object to the pthread_create() call. The attribute object must have the stacksize attribute set. This is possible by using either the pthread_attr_setstacksize() or the pthread_attr_setstack() interface. The stack is still allocated by the system, unless the stackaddr attribute is also set. The minimum size of the thread stack is specified by the PTHREAD_STACK_MIN macro. The actual value is implementation defined and it takes factors like page sizes, stack memory use by the processor, and stack memory use by the runtime into account. The actual value varies from architecture to architecture.

This approach is very flexible, all threads can have individually selected thread stack sizes. But the pthread_create() calls to create new threads must be appropriately called. The code must actually use an attribute in the calls (the parameter can be NULL), there must be a call to pthread_attr_setstacksize(), and the value passed to pthread_attr_setstacksize() must be selectable by the user. Well designed programs provide such functionality, most JVMs have an appropriate parameter. But a lot of code is not as well written.

The default stack sizes come into play if pthread_create() is called without an attribute object or if the stacksize attribute in the attribute object passed is not set. In this case the new thread's stack size is determined from the size of the stack of the initial thread. More correctly, the RLIMIT_STACK limit is determined with a call to getrlimit(). If the returned current limit is unlimited (RLIM_INFINITY) an architecture-dependent default value is used. For most 32-bit architectures this is 2M. Otherwise the maximum of PTHREAD_STACK_MIN and the returned size of the stack used.

This means that the stack size for programs which do not specify thread stack sizes themselves can be selected from the command line. Before starting the threaded application use the ulimit command with the -s option to set the size of the stacks. It is important to note that all threads will get stacks of the same size. It is not possible to assign separate sizes this way. It is also important to note that the size of the stack of the initial thread is specified this way, too (that is actually the primary function of the limit). This could lead to problems in programs where the initial thread is special and needs larger amounts of stack memory than the other threads. Such a situation is not uncommon.

Summary

There is only that much which can be done to select the thread stack sizes if the author of the code doesn't provide hooks to select it explicitly. Only the case of a thread which is created without the stacksize attribute being set can be handled at all from the outside (using the stack limit).

It is therefore essential to make programmers understand how they have to write the code. If a library contains functions which might create new threads, it must be possible for the program using the library to pass in an attribute object which will be used in the pthread_create() calls. For an example, look at the mq_notify() interface in mqueue.h.

If there is no possibility to explicitly influence the stack sizes and the coarse stack limit mechanism is not sufficient, there is nothing the runtime can help with. The only solution is to track down the code authors and make them change their code.