The trust tree

This could be the title of a novel, but no, the subject is really about computers and systems.

To begin with, it’s important to understand what trust means, why it matters, and where it applies. For the remainder of this article, let’s consider it means the following: a relation between two parties where the party that trusts the other is willing to rely on its actions, thereby abandoning control over these actions. Trust is usually thought of as a purely social kind of relation, i.e. between humans, and as a result, most people don’t think of it when making decisions concerning programs and machines. It is implicitly assumed that we can rely on our programs to do their job.

DespiteĀ studies 45 years old, security has only recently become a major subject of interest among both developers and users. Developers learn how to avoid exploitation (e.g. how to prevent buffer overflows and limit denial of service), and users are educated not to open a file just because there’s a hyperlink to it. Although it’s usually not explicitly mentioned, the underlying idea is trust. So basically, trust matters every time one uses a networking system in order to make sure life doesn’t turn into a disaster. Obvious, right ?

But there is a growing need for programs to trust other programs, without direct human intervention. The best current example is secured web sites (HTTPS), where users must easily be able to trust a quite large chain of programs, from the operating system to the remote data, through the user interface, the browser, the network, the web server and the web application, including the remote database that retains the data while it’s not being used. This particular case is solved with a system of certificates involving a third party, which most users never know. Once more, users trust their programs implicitly, and browsers come with a list of certificate authorities that are considered trustworthy.

An interesting point which deserves attention is the pervasive use of timeouts between mutually untrusting peers. For example, timeouts are ubiquitous in networking software, because developers often cannot even trust the wire. The only purpose of timeouts is to guarantee progress, which is not sufficient to make communication safe.

The case I want to consider now is one of particular interest to me : trust in a Hurd-like multi-server operating system. One thing that makes the Hurd different from all other existing systems I know of is the ability of unprivileged users to extend the system with their own system services. It’s indeed a paradigm shift : users aren’t under the strict control of the all-powerful administrator any more. They have gained freedom, conforming to the goals of the GNU project. At first, this may seem like a great feature. The administrator doesn’t need to delegate privileges, users can add their own services without compromising the rest of the system because they don’t have the rights (i.e. the capabilities) to privileged resources. In practice, this translates to a user installing a file system server in his home directory, starting it on a disk image file, binding it to a file system node he owns, accessing the file system with regular programs, all this without waiting for root to add him to the fuse group (or worse, refusing to). This also means a developer can work on a file system without disturbing the rest of the system more than a regular Unix process could.

Or does it ?

Well, no, because reality if full of annoying details that have a way of punching at the theory in unexpected ways. As a simple example, consider a user installing a malicious server implementing a file in /tmp (these servers are called translators in Hurd jargon). That malicious translator would merely never return on any of the calls it implements. As a result, the script that cleans /tmp on startup may hang when accessing the file implemented by the translator. If the initialization procedure depends on the completion of this operation, this could prevent other services such as a web server or the graphical user interface from being started. The critique already pointed out that the implicit trust on the file system had been broken.

The Hurd was initially designed to allow mutually-suspicious clients and servers to interact. In a manner similar to networking, a mechanism was implemented to break the connection to a system server, either interactively or with a timeout. As previously stated, we’re not interested in direct human intervention, so let’s discard the interactive case. This leaves us with a timeout, but as previously stated, that’s not enough for communication safety.

It is therefore imperative that clients be able to authenticate servers reliably and efficiently, before any communication takes place.

There are of course other cases which strengthen this almost obvious conclusion. An unprivileged user may send sensitive data to a server, and all precautions must be taken to avoid sending those data to a potentially malicious server. The case of the Hurd is even more complicated since file systems act as pagers for the Mach kernel, which puts the kernel in a position where it may have to trust external pagers if extreme care is not taken. Currently, the kernel uses a timeout on pageout in order to guarantee progress, but since the pager may simply be slow (and neither malicious nor buggy), it also uses the default pager (i.e. the swap) to temporarily store pages that can’t be evicted, in the hope that slow/blocked pagers will recover once memory pressure decreases, and process those pages. This is called double paging in Mach jargon. It’s easy to see that a single malicious user could destroy the performance of the page cache because of these timeouts and the need to multiply I/O to the swap and the true backing store device. Without a specific resource limit, it can easily lead to a denial of service for the whole system.

So how do we fix this ? The key insight is to realize that client/server relations form a tree of trust. Distributed systems already use hierarchy-based techniques, but usually to avoid deadlocks. For example, the documentation of the QNX operating system IPC devotes a paragraph (Robust implementations with Send/Receive/Reply) about it. The ideal trust tree is a directed acyclic graph where the edges represent trust relations. A system where this structure strictly applies would be free of untrusted access to potentially malicious servers. The Genode operating system framework seems to do exactly that. Unfortunately, a Hurd-like system aims at native POSIX compatibility, and it would still be very useful to be able to e.g. trust the less privileged networking stack when upgrading packages as root. In practice, if the networking stack is compromised, it can cause some trouble, but it cannot steal the identity of the root user, so explicitly trusting a lower node in the trust tree seems fine if done with care.

In conclusion :

  • clients must be able to authenticate servers efficiently, which defines an edge in the tree
  • the trust tree, rooted at the client, defines its trusted computing base
  • potential problems can be spotted by identifying cycles in the trust tree

As a simple solution for server authentication in the Hurd, we could imagine a per-process list of trusted UIDs, by default self and root, inherited on process creation, used to determine whether accessing a capability, e.g. a file descriptor, is safe or not before receiving it.