The end of the thundering Hurd

It’s 2016, and the Hurd project is still alive. Barely, since only three to five people “regularly” contribute, but it is alive. And it’s making progress. During the past few years, a problem that should never have occurred in the first place was mostly solved, or at least worked around where it couldn’t be solved. For most people who ever touched the Hurd, this problem showed up as a system freeze, and that’s the picture they would remember. But for us contributors, aware of some of the intimate details of the underlying implementation, the picture was quite different, and I’m writing this post to keep a memory of it, because, as annoying as it was, it was also a lot of fun to think about.

This problem was related to the classic thundering herd problem, and we had cute little names for its specific manifestations. When a server spawned hundreds, sometimes thousands of threads to handle incoming requests, we would call it a “thread storm”. To make things more interesting, the low-level receive code in servers used a spin lock around thread accounting, which made most threads loop, yielding the processor each time until they could acquire the lock. One funny consequence was a huge rise in the system load average, for no apparent reason. There were actually a lot of critical sections based on global spin locks spread throughout much of the Hurd and Glibc code. When a thread storm occurred, there was a chance, although small, that the system could recover, provided that all load sources were cut, and I’ve actually seen it happen a couple of times, after hours or days of apparent freeze.
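For readers who haven’t met this kind of lock, here is a minimal sketch of a spin lock that yields the processor on contention, protecting a shared accounting counter. It is not the Hurd or Glibc code: the names (`accounting_lock`, `accounted`, `run_storm`) and the use of C11 atomics with POSIX threads are assumptions made for illustration only.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

enum { MAX_THREADS = 64 };  /* arbitrary limit for this sketch */

static atomic_flag accounting_lock = ATOMIC_FLAG_INIT;
static long accounted;      /* protected by accounting_lock */

/* Spin until the flag is acquired, giving up the processor on each
   failed attempt. A waiting thread never sleeps: it stays runnable. */
static void yield_lock(void)
{
    while (atomic_flag_test_and_set_explicit(&accounting_lock,
                                             memory_order_acquire))
        sched_yield();
}

static void yield_unlock(void)
{
    atomic_flag_clear_explicit(&accounting_lock, memory_order_release);
}

static void *worker(void *arg)
{
    int iters = *(int *)arg;

    for (int i = 0; i < iters; i++) {
        yield_lock();
        accounted++;        /* stand-in for thread accounting */
        yield_unlock();
    }
    return NULL;
}

/* Start nthreads workers that all fight over the same lock, wait for
   them, and return the final counter value. */
long run_storm(int nthreads, int iters)
{
    pthread_t tids[MAX_THREADS];

    assert(nthreads >= 0 && nthreads <= MAX_THREADS);
    accounted = 0;
    for (int i = 0; i < nthreads; i++) {
        int err = pthread_create(&tids[i], NULL, worker, &iters);
        assert(err == 0);
        (void)err;
    }
    for (int i = 0; i < nthreads; i++)
        pthread_join(tids[i], NULL);
    return accounted;
}
```

The lock is correct, so the counter always ends up at `nthreads * iters`. The cost is in the waiting: every thread that fails to take the lock remains runnable instead of sleeping, which is consistent with the load average behaviour described above once hundreds of threads pile up on one lock.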

The other side of this problem was messaging. When a burst of messages was sent to a server, we would call it a “message flood”. These were most apparent when trying to fix thread storms by limiting the total number of worker threads. They could cause two kinds of subproblems. One was that a lot of memory was consumed to keep the passed data around (message memory is automatically backed by the default pager, aka the swapper). The other was that some servers, such as our primary file system, would use more than one message queue, with dependencies between messages. By restricting the pool of worker threads, we would then cause deadlocks, because some messages would remain in a queue, never to be processed. Another interesting fact was that this made servers very vulnerable to denial of service, whether from attacks or from bugs. As an example, aptitude on Debian used select() to poll sockets frequently, which created requests at a faster rate than the networking server could process them, and this could also trigger a thread storm and a system freeze.
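The deadlock caused by restricting the worker pool can be illustrated with a small deterministic toy model. Everything in it is hypothetical: the two queues “A” and “B”, the rule that a request taken from A holds its worker until a follow-up message on B has been handled, and the names `simulate_pool` and `struct sim_result`. It only mimics the shape of the problem, not how any Hurd server is actually structured.

```c
#include <stdbool.h>

struct sim_result {
    int completed;    /* queue A requests that finished */
    bool deadlocked;  /* work remained but no worker could run */
};

/* Toy model, not Hurd code: each request taken from queue A blocks its
   worker until the follow-up message it posts on queue B is handled by
   some other worker. */
struct sim_result simulate_pool(int workers, int requests)
{
    int free_workers = workers;
    int pending_a = requests;  /* requests not yet picked up */
    int pending_b = 0;         /* follow-up messages awaiting a worker */
    struct sim_result r = { 0, false };

    while (r.completed < requests) {
        bool progress = false;

        /* Workers take queue A requests first, then block on them. */
        while (free_workers > 0 && pending_a > 0) {
            free_workers--;
            pending_a--;
            pending_b++;
            progress = true;
        }

        /* A worker that is still free handles a queue B message, which
           releases the worker that was blocked waiting for it. */
        while (free_workers > 0 && pending_b > 0) {
            pending_b--;
            free_workers++;
            r.completed++;
            progress = true;
        }

        if (!progress) {
            /* Every worker is blocked and queue B is never served. */
            r.deadlocked = true;
            break;
        }
    }
    return r;
}
```

In this model, a burst smaller than the pool always completes, for example `simulate_pool(4, 3)`. As soon as the burst is at least as large as the pool, as in `simulate_pool(4, 4)`, every worker ends up blocked, the follow-up messages stay queued forever, and nothing completes.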

Because the main file system was unable to cope with a restriction on worker threads, we lived with thread storms for a long, long time.

One of the most annoying issues related to thread storms was paging. When a file system requested all modified pages to be written back, the kernel would kindly oblige, sending a potentially huge number of messages to the pager, which would spawn as many threads as needed to handle all the data. This made writing big files very risky. In particular, automatic builders would choke when linking large libraries such as Firefox xulrunner. Such fun.

Since 2012, Justus Winter, a few other people and I have worked on these issues. The libpager library was reworked to use a fixed number of workers instead of spawning them on demand. The libdiskfs and libnetfs libraries now use atomic operations for reference counting instead of global locks. Scalability has been improved so that requests are quicker to handle, most notably with protected payloads, which remove the port name -> object lookup, red-black tree backed VM maps and radix-tree backed IPC spaces. As for denial of service, the best solution would probably be thread migration (showing the importance of synchronous IPC), but it’s quite complicated, messy, and difficult to integrate, so the current workaround is to give higher priorities to privileged servers. At least that’s something the design allows us to do very easily. The system is still buggy and prone to freezes, in particular when swapping large amounts of data, despite some recent fixes like the pageout deadlock and ext2fs use-after-free fixes. But it’s so much better than it used to be that we can confidently say an era has ended and another has started. One where, hopefully, these bugs won’t be the barrier to entry they used to be.
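To give an idea of what reference counting with atomic operations instead of a global lock looks like, here is a minimal sketch using C11 atomics. It is not the code used in libdiskfs or libnetfs: `struct refcounted` and the `ref_*` functions are hypothetical names, and the real libraries may well track more than a single counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical object header, for illustration only. */
struct refcounted {
    atomic_uint refs;
};

/* A new object starts with one reference, owned by its creator. */
void ref_init(struct refcounted *obj)
{
    atomic_init(&obj->refs, 1);
}

/* Take an extra reference. The caller must already hold one, so the
   count cannot be zero here. */
void ref_acquire(struct refcounted *obj)
{
    unsigned int old = atomic_fetch_add_explicit(&obj->refs, 1,
                                                 memory_order_relaxed);
    assert(old > 0);
    (void)old;
}

/* Drop a reference. Returns true if this was the last one, in which
   case the caller is responsible for destroying the object. */
bool ref_release(struct refcounted *obj)
{
    unsigned int old = atomic_fetch_sub_explicit(&obj->refs, 1,
                                                 memory_order_acq_rel);
    assert(old > 0);
    return old == 1;
}
```

With this scheme, two threads touching different objects never contend with each other, and two threads touching the same object each perform a single atomic instruction instead of spinning on a lock shared by the whole library.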