Operating systems rely on system calls to allow the controlled communication of isolated processes with the kernel and other processes. Every system call includes a processor mode switch from the unprivileged user mode to the privileged kernel mode. Although processor mode switches are the essential isolation mechanism to guarantee the system's integrity, they induce direct and indirect performance costs as they invalidate parts of the processor state. In recent years, high-performance networks and storage hardware have made the user/kernel transition overhead the bottleneck for IO-heavy applications. To make matters worse, security vulnerabilities in modern processors (e.g., Meltdown) have prompted kernel mitigations that further increase the transition overhead. To decouple system calls from user/kernel transitions, we propose AnyCall, which uses an in-kernel compiler to execute safety-checked user bytecode in kernel mode. This allows for very fast system calls interleaved with error checking and processing logic using only a single user/kernel transition. We have implemented AnyCall based on the Linux kernel's extended Berkeley Packet Filter (eBPF) subsystem. Our evaluation demonstrates that system call bursts are up to 55 times faster using AnyCall and that real-world applications can be sped up by 24 % even if only a minimal part of their code is run by AnyCall.
Critical operations are often implemented in roughly the same way across multiple platforms, but differently by software systems running on the same platform. This observation is arguably justified by the potential restrictions of each software system, but it is surprising given the operation's sensitivity to numerous platform-specific software and hardware parameters. With initial focus on the memory copy operation (memcpy), we introduce a methodology based on exhaustive search to optimize the performance across different platforms. We design and implement the Asterope algorithm to experimentally generate optimal memcpy parameters for two x86-64 processor models from different vendors. With experiments on microbenchmarks and two production systems, we demonstrate that Asterope achieves up to 2.4x higher function performance and up to 1.9x higher system performance than the Linux kernel memcpy.
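To make the idea of exhaustive parameter search concrete, the following is a minimal sketch in Python. It is not the Asterope algorithm, whose details are not given here; the chunked copy routine, the single `chunk` parameter, and all function names are hypothetical stand-ins chosen only for illustration.

```python
import time
from itertools import product


def copy_in_chunks(src, dst, chunk):
    """Copy src into dst in fixed-size chunks (hypothetical stand-in for a tunable memcpy)."""
    view_src, view_dst = memoryview(src), memoryview(dst)
    for off in range(0, len(src), chunk):
        view_dst[off:off + chunk] = view_src[off:off + chunk]


def measure_copy(chunk, size=1 << 20, repeats=3):
    """Return the best wall-clock time (seconds) of `repeats` copies of `size` bytes."""
    src = bytes(size)
    dst = bytearray(size)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        copy_in_chunks(src, dst, chunk)
        best = min(best, time.perf_counter() - start)
    return best


def exhaustive_search(param_space, cost):
    """Evaluate every parameter combination and return (best_params, best_cost)."""
    names = list(param_space)
    best_params, best_cost = None, float("inf")
    for values in product(*(param_space[name] for name in names)):
        params = dict(zip(names, values))
        current = cost(**params)
        if current < best_cost:
            best_params, best_cost = params, current
    return best_params, best_cost


if __name__ == "__main__":
    # Assumed candidate values; a real study would cover many more parameters.
    space = {"chunk": [64, 256, 4096, 65536]}
    print(exhaustive_search(space, measure_copy))
```

The search itself is independent of what is being measured: any cost function that accepts the parameters as keyword arguments can be plugged in, so the timing routine here could be replaced by a measurement on the actual platform of interest.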
The POSIX shell is 'stringy', and its ecosystem primarily supports line-oriented formats. While such formats are popular and common, contemporary programming often involves semi-structured data, like JSON or YAML. When users deal with such formats, the shell's stringiness leaves them out in the cold: the POSIX ecosystem struggles with semi-structured data. New command-line tools work well with 'modern' data formats, but each tool is its own complex language to learn.
The tree-like filesystem is the shell's only real data structure. By mapping 'modern' formats onto file hierarchies, we can work effectively in the existing ecosystem.
We introduce ffs, the file filesystem, a new tool for mapping semi-structured data formats to filesystems in userspace. Like /proc and /sys, our filesystem-based approach helps the shell (and other tools) manipulate structured data.
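The mapping from semi-structured data to a file hierarchy can be illustrated with a short Python sketch. This is not how ffs is implemented (ffs is described above as a filesystem in userspace, whereas this sketch simply writes ordinary directories and files to disk); the function names and the sample document are assumptions made for illustration.

```python
import json
import os
import tempfile


def materialize(value, path):
    """Write a decoded JSON value to `path`.

    Objects and arrays become directories; scalars become small text files.
    Assumes object keys are valid file names (no '/' and not empty).
    """
    if isinstance(value, dict):
        os.makedirs(path, exist_ok=True)
        for key, child in value.items():
            materialize(child, os.path.join(path, key))
    elif isinstance(value, list):
        os.makedirs(path, exist_ok=True)
        for index, child in enumerate(value):
            # Array elements are named by their position: 0, 1, 2, ...
            materialize(child, os.path.join(path, str(index)))
    else:
        # Strings are written raw; numbers, booleans, and null use JSON syntax.
        text = value if isinstance(value, str) else json.dumps(value)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text + "\n")


def demo():
    """Materialize a small sample document and return the directory it was written to."""
    doc = json.loads('{"name": "ffs", "tags": ["shell", "json"], "stars": 3}')
    root = os.path.join(tempfile.mkdtemp(), "doc")
    materialize(doc, root)
    return root


if __name__ == "__main__":
    print(demo())
```

Once the data is laid out this way, ordinary tools apply: `ls` lists the keys of an object, `cat` on `tags/1` prints a single array element, and `find` walks the whole document.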
Over the past decade, various systems and software libraries have been developed that provide crash consistency on byte-addressable persistent memory. They often require programmers to adapt their code significantly or to use special compiler plugins. Constant innovation in this evolving field makes it desirable to be able to easily switch to more recent systems without massive code refactoring, and without changing compilers.
In this paper, we show how aspect-oriented programming can be used to automatically apply crash consistency to normal, sparsely annotated C++ code. In two case studies, we find that our approach significantly reduces the amount of code required to apply state-of-the-art crash consistency frameworks such as PMDK libpmemobj++ and Pronto.
While persistent memory (PMEM) is a promising technology, leveraging it with legacy applications is non-trivial. This is primarily because legacy applications assume all memory is volatile and there is no notion of crash-consistency or state recovery. As new types of persistent and intelligent memory emerge, propelled by the CXL standard, the problem of integration and adoption remains.
In this paper, we present PyMM, a framework for heterogeneous memory management in Python. It provides a means to abstract over different memory types and their underlying traits (e.g., persistence, near/far). PyMM focuses on ease of use and takes the approach of subclassing existing, heavily used types such as NumPy ndarray and PyTorch tensors. By doing so, PyMM allows new memory types to be adopted with only minor modification to the application.
Fast provisioning of serverless functions is salient for serverless platforms. Though lightweight sandboxes (e.g., containers) enclose only necessary files and libraries, a cold launch still requires up to a few seconds to complete. Such slow provisioning prolongs the response time of serverless functions and negatively impacts users' experiences. This paper analyzes the main reasons for such slowdown and introduces an effective containerization framework, FlashCube. Instead of building a container from scratch, FlashCube quickly and efficiently assembles it through a group of pre-created general container parts (e.g., namespaces, cgroups, and language runtimes). In addition, FlashCube's user-space implementation makes it easily applicable to existing commodity serverless platforms. Our preliminary evaluation demonstrates that FlashCube can quickly provision containerized functions in less than 10 ms (vs. ~400 ms using Docker containers).
The advent of multi-core processors has increased the demand for programming concurrent systems. In this paper, we explore the use of SIMULA-style coroutines and other primitives as a basis for defining a broad class of high-level concurrency abstractions, including the definition of associated schedulers. The main contribution of this paper is an implementation of preemptive coroutines for a multi-core processor in an experimental version of Beta. The overall goal is to use a high-level language to program applications on a bare-bones platform without an operating system.
A recent surge of security attacks has triggered a renewed interest in hardware support for isolation. Extended page table switching with VMFUNC, memory protection keys (MPK), and memory tagging extensions (MTE) are just a few of the hardware isolation mechanisms that promise support for low-overhead isolation in recent CPUs. Along with the restored interest in lightweight hardware isolation mechanisms, safe programming languages like Rust have made a leap towards practical, zero-overhead safety implemented without garbage collection.
Both lightweight hardware mechanisms and zero-overhead language safety can be leveraged to enforce the isolation of subsystems, e.g., browser plugins, device drivers and kernel extensions, and user-defined database and network functions. However, as both technologies are still young, their relative advantages are still unknown. In this work, we study the overheads of hardware and software isolation mechanisms with the goal of understanding their relative advantages and disadvantages for fine-grained isolation of subsystems with tight performance budgets. We ask two questions: What is the overhead of hardware isolation in an ideal scenario where the hardware isolation mechanism takes zero cycles? And if the safety of the Rust language can lower the overhead of cross-subsystem invocations, can the language on its own introduce overheads that might outweigh isolation advantages? To answer these questions, we develop and compare two carefully optimized versions of inter-process communication (IPC) mechanisms (one in safe Rust and one in assembly), and two identical (to the degree possible) DPDK-based network packet processing frameworks (one in C++ and one in Rust). Our analysis shows that for systems requiring frequent boundary crossings, a safe language is still beneficial even if the overheads of hardware isolation mechanisms drop to zero.
For decades, the C programming language proved to be a cornerstone of system-software ecosystems, leaving us with billions of lines of existing source code. From today's perspective of object-oriented and functional languages, C itself seems rather limited in its expressiveness and abstractive power. However, with the C preprocessor (CPP) as its companion, macros, which operate on the raw token stream, allow for abstractions that are impossible to achieve within the language itself. While its flexibility and its ease of use make CPP attractive for programmers, its potential for undisciplined usage makes it problematic for static source-code analysis and can slow down the on-boarding of new developers.
In this paper, we focus on a disciplined subclass of CPP macros: the statement-like and expression-like macros, which mimic regular C functions, with well-typed C expressions as arguments and, where applicable, a return value. We show how to spot such macros and their arguments in the compiler's abstract syntax tree, whereby it becomes possible to deduce type signatures for individual macro expansions. With our CppSig prototype, implemented as a Clang plugin, we extract macro-type information from Linux 5.12, whereby it becomes easier to understand even deep macro-expansion nests. In the future, these expansion signatures could be used to statically enforce gradually-typed CPP macro definitions.
Modern hardware platforms are increasingly complex and heterogeneous. System software uses a hodgepodge of different mechanisms and representations to express the memory topology of the target platform. Considerable maintenance effort is required to keep them in sync, and sharing is often impossible due to hard-coded values. Incorrect platform-specific values in the hardware initialization sequence can lead to security-critical and hard-to-find bugs because of misconfigured translation hardware, inaccessible devices, or the use of bad pointers.
We present a better way for system software to express and initialize memory hardware. We adopt an existing, powerful hardware description language, and efficiently compile it to generate correct initial page tables and memory maps for OS kernels and firmware from a single system description.
We evaluate our system on multiple architectures and platforms, and demonstrate that we can use the generated data structures to successfully initialize translation hardware, devices, memory maps, and allocators enabling easy support of new hardware platforms.
Rust is the first practical programming language that has the potential to provide fine-grained isolation of untrusted computations at the language level. A combination of zero-overhead safety, i.e., safety without a managed runtime and garbage collection, and a unique ownership discipline enables isolation in systems with tight performance budgets, e.g., databases, network processing frameworks, browsers, and even operating system kernels.
Unfortunately, Rust was not designed with isolation in mind. Today, implementing isolation in Rust is possible but requires complex, ad hoc, and arguably error-prone mechanisms to enforce it outside of the language. We examine several recent systems that implement isolation in Rust but struggle with the shortcomings of the language. As a result of our analysis, we identify a collection of mechanisms that can enable isolation as a first-class citizen in the Rust ecosystem and suggest directions for implementing them.
The C programming language was developed in the 1970s as a fairly unconventional systems and operating systems development tool, but has, through the course of the ISO Standards process, added many attributes of more conventional programming languages and become less suitable for operating systems development. Operating system programming continues to be done in non-ISO dialects of C. The differences provide a glimpse of operating system requirements for programming languages.