Mapping physical memory directly

As an intended successor to the Mach microkernel, X15 inherited some of its trends, both in its internal APIs and in its implementation. One of these trends is that the kernel runs in a virtual address space very similar to what userspace applications get. In addition to simply being elegant, this gives the kernel the ability to participate in virtual memory operations with userspace tasks through the same interface. For example, besides the mach_msg system call, Mach also has an internal mach_msg function that can be used by stub code generated with MiG (the Mach Interface Generator). In turn, this is what allows userspace to pretend to be the kernel and provide kernel services, a feature that helps in developing system virtualization. Furthermore, it helps reduce physical memory fragmentation, since contiguous kernel memory can map non-contiguous physical pages. And it also helps with NUMA, because kernel mappings can be changed to use physical pages from the local node without changing the kernel address space.

By adopting the Mach virtual memory (VM), BSD systems kept this kernel virtual space. On NetBSD, for example, you can even use the pmap(1) command on the kernel process to inspect its virtual space. But this approach has problems:

The first problem is one of recursion. A virtual space requires data structures to describe it, and those data structures must themselves be allocated, an operation that uses the very data structures the system is trying to allocate. This problem is usually solved by reserving some area of virtual space with a few statically allocated entries, but this implies that the size of this area is hardcoded at build time, which is often frowned upon. It remains acceptable because virtual memory layouts are normally architectural, which means developers know precisely what to expect, and that helps in choosing a reasonable size.
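To make the bootstrap concrete, here is a minimal sketch of this workaround, with entirely hypothetical names (X15’s actual structures differ): a small, statically allocated pool of map entries describes the kernel map until a regular allocator is available.

    #include <stddef.h>

    #define BOOT_MAP_ENTRIES 128 /* size hardcoded at build time */

    struct vm_map_entry {
        unsigned long start;
        unsigned long end;
        struct vm_map_entry *next;
    };

    /*
     * Statically allocated entries used to bootstrap the kernel map:
     * describing the kernel virtual space must not itself require a
     * dynamic allocation from that same space.
     */
    static struct vm_map_entry boot_entries[BOOT_MAP_ENTRIES];
    static unsigned int boot_entry_index;

    static struct vm_map_entry *
    boot_entry_alloc(void)
    {
        if (boot_entry_index == BOOT_MAP_ENTRIES) {
            return NULL; /* reserve exhausted; a real kernel would panic */
        }

        return &boot_entries[boot_entry_index++];
    }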

The second problem is a little more embarrassing: it’s a performance problem, which can be divided into two separate issues:

  • Increased inter-processor interrupts (IPIs): since mappings are created dynamically on demand, and because the kernel can’t handle page faults on itself (because of yet another recursion problem), it has to send IPIs to every other processor so that they invalidate their TLBs. TLB shootdowns are expensive operations that don’t scale with the number of processors. To reduce the severity of this issue, operating systems use highly optimized caching allocators that retain objects for some time and batch TLB shootdowns to reduce the number and frequency of IPIs (see the sketch after this list). Kernels could make their memory pageable so that mappings are created on access, but this would require another reserved area serving as a logical page table that all processors could access at any time. Some kernels do something along those lines. For example, like Mach, NetBSD can allocate pageable kernel memory. This works because kernel data structures are allocated out of non-pageable kernel memory. Since pageable kernel memory is actually rarely used, it can be ignored, reducing the problems of kernel memory, i.e. non-scalable system-wide TLB shootdowns, to non-pageable memory. Another problem is that page faults would also affect performance, and their cost could even exceed that of the IPIs they avoid. Finally, on an architecture such as x86, which implements hardware page translation, it’s impossible to determine whether a translation is cached in a TLB, so on mapping removal, the kernel has to send TLB shootdowns to all processors on which the process has been dispatched anyway.
  • Non-optimal use of large pages: although the kernel could map large pages in its address space, this can apparently only lead to a non-optimal use of these resources. In order to use large pages, the kernel needs physical memory that is contiguous and aligned according to the large page size. This kind of memory becomes very hard to come by over time, mainly because of the page cache, which keeps data around in physical memory to reduce I/O and improve block device and file system performance. Research has shown that fragmentation arises when blocks of different sizes are allocated out of a pool. This means that, over time, the kernel would be unable to allocate large pages any more, unless the page cache is cleared by some unpredictable external event such as unmounting a file system with lots of recently used data (don’t count on it). Even then, it’s likely that fragmentation would be too high to allow large page allocation. The only way to guarantee success is to create separate pools of reserved memory for each available page size. These pools would be configured and filled at boot time, and allocating from them would require some kind of system privilege. This is what the Linux hugetlbfs does, but for userspace applications. Managing pools for both the kernel and userspace would make the life of system administrators even more painful. At best, the kernel could use some heuristics, e.g. based on the size of physical memory, to reserve large pages for itself, depriving userspace of that memory. A very bad use case would have the kernel use large pages (e.g. 2M) for its object allocator, with each cache owning a large page filled with a single, small object…
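As a concrete illustration of the batching mentioned in the first issue above, here is a minimal sketch, with hypothetical names (this is not X15 code): instead of one IPI round per unmapped page, stale translations accumulate in a batch and a single IPI round flushes them all.

    /*
     * Assumed primitive: sends one IPI round asking all other
     * processors to invalidate the given translations.
     */
    void tlb_shootdown_all(const unsigned long *pages, unsigned int count);

    #define BATCH_MAX 64

    struct shootdown_batch {
        unsigned long pages[BATCH_MAX];
        unsigned int count;
    };

    static void
    batch_add(struct shootdown_batch *batch, unsigned long va)
    {
        batch->pages[batch->count++] = va;

        /* One IPI round per BATCH_MAX unmappings instead of one per page. */
        if (batch->count == BATCH_MAX) {
            tlb_shootdown_all(batch->pages, batch->count);
            batch->count = 0;
        }
    }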

As X15 aims at performance and scalability, these problems had to be dealt with.

The Linux kernel has historically used a technique that could be considered dirty, but has proved very effective: it maps physical memory directly into the kernel space. This direct mapping of physical memory allows users of the physical page allocator to directly access the pages they obtain without any mapping operation. The only operation required to obtain the virtual address of a physical page is adding a fixed offset. Since those mappings never change (virtual address offset + 0 will always translate to physical address 0), there is no need for TLB shootdowns. And since physical memory is mostly contiguous, the kernel can map it with the largest available page size. Assuming the processor supports 1G pages, on a machine with 32G of physical memory, as few as 32 translations can map all kernel objects, which reduces TLB contention. In addition, since x86 page tables are implemented as a radix tree, larger pages mean fewer lookup steps: translating a virtual address requires only two memory accesses for 1G pages instead of four for 4k pages. In practice, it’s a bit more complicated because operating systems install mapping permissions at 4k page granularity, so not everything is mapped by large pages.
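A minimal sketch of the fixed-offset conversion, assuming a 64-bit kernel and a hypothetical PMAP_DIRECT_BASE constant marking where the direct mapping starts (the name and value are illustrative, neither X15’s nor Linux’s):

    #include <stdint.h>

    /* Hypothetical base virtual address of the direct physical mapping. */
    #define PMAP_DIRECT_BASE 0xffff800000000000UL

    /*
     * Converting between physical and virtual addresses is a single
     * addition or subtraction: no page table update, no TLB shootdown.
     */
    static inline void *
    phys_to_virt(uint64_t pa)
    {
        return (void *)(uintptr_t)(PMAP_DIRECT_BASE + pa);
    }

    static inline uint64_t
    virt_to_phys(const void *va)
    {
        return (uint64_t)(uintptr_t)va - PMAP_DIRECT_BASE;
    }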

But this technique has problems of its own:

  • With increasing available physical memory, not all of it can be directly mapped. On 32-bit x86 with a 3G/1G split (3G for userspace, 1G for the kernel), this means that less than 1G can be directly mapped; Linux limits the direct physical mapping to ~896M. This problem still exists on 64-bit x86 processors, because the architecture allows up to 2^52 bytes of physical address space whereas Linux can directly map up to 2^46 bytes. However, it won’t hit until that much physical memory (64T!) is actually installed, and since the problem was solved for 32-bit processors, there is no reason anyone would notice. When the kernel has to access a “high memory” page, it must create a temporary mapping somewhere outside the direct physical mapping. Some of these mappings can last a long time, reducing the amount of virtual space available for temporary mappings.
  • Without the ability to map non-contiguous pages at contiguous virtual addresses, fragmentation increases. The problem is mitigated by carefully choosing memory allocation algorithms: modern systems tend to use the buddy memory allocator for physical page allocation, and the slab allocator on top of it for object caching (the buddy computation is sketched after this list). Page migration is probably mandatory as well.
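For the curious, here is a minimal sketch of the computation at the heart of a buddy allocator, with hypothetical names (not X15’s actual allocator): a free block of order k (2^k contiguous pages) and its buddy differ only in bit k of their page frame number, which makes finding a coalescing partner a single XOR.

    #include <stdint.h>

    /* Page frame number of the buddy of the order-k block starting at pfn. */
    static inline uint64_t
    buddy_pfn(uint64_t pfn, unsigned int order)
    {
        return pfn ^ ((uint64_t)1 << order);
    }

    /* Start of the merged order-(k + 1) block: the lower of the two buddies. */
    static inline uint64_t
    merged_pfn(uint64_t pfn, unsigned int order)
    {
        return pfn & ~((uint64_t)1 << order);
    }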

Intuitively, this solution seems more practical and efficient than a pure virtual kernel space. The BSDs have started to create such a direct mapping of physical memory as a special region of their pure kernel space, and they use it as the backing store of a complex stack of caching allocators.

X15 was recently modified to use a similar model, which resulted in a net removal of about 300 lines of code (1345 added, 1624 removed). It retains the kernel VM map, which previously spanned almost all kernel memory and is now reduced to pure virtual mappings. This suggests that a direct physical mapping is indeed simpler to use, in addition to all the benefits previously mentioned, and despite the apparent complexity of having to manage two different kinds of kernel memory. By retaining a kernel VM map, X15 will be able to perform IPCs through the same high-level interface as userspace applications. By allocating kernel objects out of the rather large direct physical mapping, the recursion problem is completely solved. And by using large pages to map its code and data, the kernel can better play its fast message passing role.