POSIX-style read() and write() have long been the standard interface for accessing data in files. However, the copies into and out of memory that these calls require impose unnecessary overhead when files are stored in fast persistent memory (PMEM). To avoid the copy, PMEM-aware file systems generally provide direct-access (DAX)-based mmap(), but in doing so they force the programmer to manage write-atomicity and concurrent accesses to the file.
In this work, we propose two new system calls, peek() and patch(), collectively called SubZero, that read and update PMEM-backed files without any copies. To show their potential, we implemented SubZero in two state-of-the-art PMEM file systems, XFS-DAX and NOVA. Measurements of simple benchmarks show that SubZero can outperform copy-based read() and write() by up to 2x and 2.8x, respectively. At the application level, peek() improves GET performance of the Apache Web Server by 3.6x, and patch() boosts SET performance of Kyoto Cabinet by up to 1.3x.
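A minimal sketch of the zero-copy access pattern SubZero targets follows. The abstract does not give the signatures of peek() and patch(), so this stand-in emulates "peek" with a DAX-style mmap() of a hypothetical PMEM file purely to contrast it with a copying read(); it is an illustration, not SubZero's API.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/pmem/data.bin", O_RDONLY);  /* hypothetical PMEM-backed file */
    if (fd < 0) return 1;

    struct stat st;
    if (fstat(fd, &st) < 0) return 1;

    /* "peek": access the file's pages directly, with no copy into a
     * user buffer as read() would require. */
    void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) return 1;
    const uint64_t *recs = map;

    uint64_t sum = 0;
    for (off_t i = 0; i < st.st_size / (off_t)sizeof(uint64_t); i++)
        sum += recs[i];
    printf("sum = %llu\n", (unsigned long long)sum);

    /* A real patch() call would additionally provide write-atomicity,
     * which plain stores through mmap() do not. */
    munmap(map, st.st_size);
    close(fd);
    return 0;
}
```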
We present ColumnBurst, a memory-efficient, near-storage hardware accelerator for database join queries. While near-storage computation has demonstrated performance and efficiency benefits on many workloads by reducing data movement, memory-bound operations such as relational joins on unsorted data have remained relatively inefficient on fast modern storage devices, due to the limited capacity and performance of the memory available to the near-storage processing engine. ColumnBurst delivers very high performance even on such complex queries while staying within the memory performance and capacity budget typically already available on off-the-shelf storage devices. It achieves this via a compact hardware implementation of sort-based group-by aggregation and join algorithms, instead of the conventional hash-based algorithms. We evaluate ColumnBurst using an FPGA-based prototype with 1 GB of slow on-device DDR3 DRAM and show that, on benchmarks including TPC-H queries with joins on unsorted columns, it outperforms MonetDB on a 6-core i7 with 32 GB of DRAM by over 7x, and a ColumnBurst variant using a near-storage hash join by 2x.
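To make the algorithmic contrast concrete, here is a plain software sketch of a sort-merge join on two integer key columns; it illustrates the general technique, not ColumnBurst's hardware design. After sorting, a single sequential merge pass finds all matches with streaming memory access, whereas a hash join would need a large, randomly accessed hash table.

```c
#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Count matching key pairs between two unsorted columns. */
static long sort_merge_join(int *r, size_t nr, int *s, size_t ns)
{
    qsort(r, nr, sizeof(int), cmp_int);
    qsort(s, ns, sizeof(int), cmp_int);

    long matches = 0;
    size_t i = 0, j = 0;
    while (i < nr && j < ns) {
        if (r[i] < s[j])      i++;
        else if (r[i] > s[j]) j++;
        else {
            /* Count the run of equal keys on each side. */
            size_t ri = i, sj = j;
            while (ri < nr && r[ri] == r[i]) ri++;
            while (sj < ns && s[sj] == s[j]) sj++;
            matches += (long)(ri - i) * (long)(sj - j);
            i = ri;
            j = sj;
        }
    }
    return matches;
}

int main(void)
{
    int r[] = {7, 3, 5, 3}, s[] = {3, 9, 5, 3};
    printf("matches = %ld\n", sort_merge_join(r, 4, s, 4));  /* prints 5 */
    return 0;
}
```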
Learned index structures have reshaped our perspective on the design of traditional data structures. With machine learning (ML) techniques, they can achieve better lookup performance than existing indexes. However, current learned indexes primarily focus on integer-key workloads and fail to index variable-length string keys efficiently. We introduce SIndex, a concurrent learned index specialized for variable-length string key workloads. To reduce the cost of model inference and data accesses, SIndex groups keys with shared prefixes and uses each key's unique part for model training. We evaluate SIndex with both real-world and synthetic datasets. The results show that SIndex achieves up to 91% better performance than other state-of-the-art index structures. We have open-sourced our implementation.
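The abstract does not specify the key encoding, so the sketch below illustrates the stated idea under assumptions of our own: keys in a group share a prefix, and only the first bytes of each key's unique suffix are encoded as a fixed-width integer feature on which a per-group model could be trained.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Length of the prefix shared by all keys in the group. */
static size_t shared_prefix_len(const char *keys[], size_t n)
{
    size_t len = strlen(keys[0]);
    for (size_t i = 1; i < n; i++) {
        size_t j = 0;
        while (j < len && keys[i][j] == keys[0][j]) j++;
        len = j;
    }
    return len;
}

/* Encode the first 8 suffix bytes as a big-endian integer feature,
 * zero-padded so lexicographic order is preserved for short suffixes. */
static uint64_t suffix_feature(const char *key, size_t prefix_len)
{
    const unsigned char *s = (const unsigned char *)key + prefix_len;
    size_t len = strlen((const char *)s);
    uint64_t x = 0;
    for (int i = 0; i < 8; i++)
        x = (x << 8) | (i < (int)len ? s[i] : 0);
    return x;
}

int main(void)
{
    const char *group[] = {"user:1001:name", "user:1002:name", "user:1007:age"};
    size_t p = shared_prefix_len(group, 3);
    for (int i = 0; i < 3; i++)
        printf("prefix_len=%zu feature=%llu\n", p,
               (unsigned long long)suffix_feature(group[i], p));
    return 0;
}
```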
This paper revisits Peer-to-Peer DMA (P2P DMA) and investigates how it can be exploited with Ethernet NICs and NVMe SSDs. The slowing performance improvement of CPUs has led to the emergence of peripheral accelerators such as SmartNICs and TPUs. P2P DMA offers the potential to integrate multiple peripherals efficiently by avoiding data bouncing through main memory. However, P2P DMA has been studied mainly around GPUs, and its benefits have been measured only for specific applications. In this paper, we perform experiments to clarify the benefits of using P2P DMA on individual devices, i.e., an Ethernet NIC and an NVMe SSD, from an I/O throughput perspective. We developed a library, called Libpop, for manipulating device memory to invoke P2P DMA. Additionally, we integrated Libpop into pcie-bench (an FPGA-based benchmark device), netmap for Ethernet NICs, and UNVMe for NVMe SSDs. Experiments with these implementations show that (1) memory writes degrade the throughput of DMA writes by 70%, (2) the degradation affects I/O throughput on the devices, and (3) P2P DMA can avoid the degradation, but device queues affect throughput on the Ethernet NIC.
Containers are commonly used to run the data-intensive applications of different tenants in cloud infrastructures. The storage I/O of the colocated tenants is typically handled by the shared system kernel of the container host. When a data-intensive container competes with a noisy neighbor, the kernel I/O services can cause performance variability and slowdown. This is a challenging problem for which several approaches have already been explored. Although dynamic resource allocation, kernel structure replication, and hardware-level virtualization are helpful, they incur high implementation complexity and execution overhead. As a realistic, cost-effective alternative, we isolate the I/O path of each tenant by running dedicated storage systems at user level on reserved resources. We introduce libservices, a unified user-level storage abstraction for dynamically provisioning per-tenant container root filesystems, application data filesystems, and image repositories. We outline several examples of container storage systems whose clients and servers can be composed from libservices. With an early prototype, we demonstrate that libservices combine the efficiency and flexibility required to build isolated I/O services on multitenant hosts, with superior performance over existing user-level or kernel-level systems.
Multi-core processors are commonplace and continue to require rethinking, not only in system software development. It is still difficult to operate several functionally identical computing cores efficiently. One misconception is to assume that functionally identical cores of a multi-core processor also behave alike non-functionally, especially in the speed at which they execute the same non-sequential program. We show that considerable deviations in the non-functional behaviour of otherwise identical cores are anything but unusual and can be expected to exceed 20%. The paper documents the applied measurement methodology, discusses the measurement results obtained, and addresses consequences for the coordinated operation of logically connected concurrent threads in the context of Linux.
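As a rough illustration of such a measurement (our own simplification, not the paper's methodology), the sketch below pins the same CPU-bound loop to each core in turn and reports the wall-clock time per core, so that speed deviations between otherwise identical cores become directly visible.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Run a fixed amount of CPU-bound work pinned to one core and time it. */
static double run_on_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    sched_setaffinity(0, sizeof(set), &set);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 200000000UL; i++)
        x += i;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    int ncores = (int)sysconf(_SC_NPROCESSORS_ONLN);
    for (int c = 0; c < ncores; c++)
        printf("core %d: %.3f s\n", c, run_on_core(c));
    return 0;
}
```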
Datacenters are adopting heterogeneous hardware in the form of different CPU ISAs and accelerators. Advances in low-latency, high-bandwidth interconnects enable hardware vendors to tighten the coupling of multiple CPU servers and accelerators. This closer connection of components facilitates bigger machines, which pose a new challenge to operating systems. We advocate building a heterogeneous OS for large heterogeneous systems by combining multiple OS design principles to leverage the benefits of each design. Because a security-oriented design, enabled by simplicity and clear encapsulation, is vital in datacenters, we survey various design principles found in microkernel-based systems. We explain that heterogeneous hardware employs different mechanisms to enforce access rights, for example for memory accesses or communication channels, and we outline a way to combine the enforcement mechanisms of CPUs and accelerators in one system. A consequence is heterogeneous access-rights management, which is implemented as a heterogeneous capability system in a microkernel-based OS.
The AVX2 and AVX-512 instructions found in recent Intel CPUs can increase the performance of vectorized code. Their complexity and increased power consumption, however, cause the CPU to reduce its frequency. This frequency reduction can affect parts of the workload that do not use AVX2 or AVX-512, with previous work reporting an overall slowdown of more than 10% for various workloads with AVX-512-enabled parts. Although countermeasures against this frequency reduction overhead exist, they themselves cause additional overhead and are therefore only viable if the gains are larger than that overhead.
It is, however, often unclear how much AVX2/AVX-512 frequency reduction overhead is present. In this paper, we describe a sampling profiler that determines the magnitude of this overhead as an aid during software development or during the selection of countermeasures. Our profiler temporarily stops individual CPU cores to let them recover their maximum (non-AVX) frequency. It then observes whether the frequency is immediately reduced again once the workload resumes, to determine whether the previous frequency reduction was actually necessary. The resulting information is used to calculate the approximate AVX2/AVX-512 frequency reduction overhead. In the case of AVX-512, our prototype estimates the overhead with an average error of 1.2 percentage points across various benchmarks. We describe potential improvements to our design as well as a novel hardware-software interface that would allow more accurate measurement of the overhead.
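As a simplified user-space illustration of the underlying signal (not the authors' profiler), the following sketch reads a core's current frequency from Linux's cpufreq sysfs and compares it with an assumed non-AVX maximum to judge whether the core is currently running at a reduced frequency level.

```c
#include <stdio.h>

/* Assumed nominal non-AVX maximum frequency in kHz for this example. */
#define NON_AVX_MAX_KHZ 3700000L

/* Read a core's current frequency from cpufreq sysfs. */
static long core_freq_khz(int cpu)
{
    char path[128];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_cur_freq", cpu);
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    long khz = -1;
    if (fscanf(f, "%ld", &khz) != 1) khz = -1;
    fclose(f);
    return khz;
}

int main(void)
{
    long khz = core_freq_khz(0);
    if (khz < 0) { perror("cpufreq"); return 1; }
    printf("cpu0: %ld kHz (%s)\n", khz,
           khz < NON_AVX_MAX_KHZ ? "reduced, possibly AVX-induced"
                                 : "at or above the non-AVX maximum");
    return 0;
}
```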
The OS load balancing algorithm governs the performance gains provided by a multiprocessor system. Linux's Completely Fair Scheduler (CFS) tracks process loads by average CPU utilization to balance the workload between processor cores. That approach maximizes the utilization of processing time but overlooks contention for lower-level hardware resources. In servers running compute-intensive workloads, an imbalanced need for limited computing resources hinders execution performance. This paper addresses this problem with a machine learning (ML)-based, resource-aware load balancer. We describe (1) low-overhead methods for collecting training data; (2) a multi-layer perceptron model that imitates the CFS load balancer based on the collected training data; and (3) an in-kernel implementation of inference on the model. Our experiments demonstrate that the proposed model reaches 99% accuracy in making migration decisions while increasing latency by only 1.9 μs.
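The abstract does not give the model's features or size, so the sketch below only illustrates the inference step with placeholder features and weights of our own choosing: a small multi-layer perceptron maps a few load features to a binary migrate / do-not-migrate decision, mirroring the kind of computation an in-kernel implementation would perform.

```c
#include <stdio.h>

#define NFEAT 4   /* e.g., source load, destination load, cache misses, runnable tasks */
#define NHID  4   /* hidden-layer width (placeholder) */

static double relu(double x) { return x > 0 ? x : 0; }

/* One forward pass of a tiny MLP; a positive score means "migrate". */
static int should_migrate(const double feat[NFEAT],
                          const double w1[NHID][NFEAT], const double b1[NHID],
                          const double w2[NHID], double b2)
{
    double out = b2;
    for (int j = 0; j < NHID; j++) {
        double h = b1[j];
        for (int i = 0; i < NFEAT; i++)
            h += w1[j][i] * feat[i];
        out += w2[j] * relu(h);
    }
    return out > 0;
}

int main(void)
{
    /* Placeholder weights; a real model would be trained offline on
     * logged CFS migration decisions. */
    const double w1[NHID][NFEAT] = {{1,-1,0,0},{0,0,1,0},{-1,1,0,0},{0,0,0,1}};
    const double b1[NHID] = {0,0,0,0};
    const double w2[NHID] = {1,0.5,-1,0.25};
    const double feat[NFEAT] = {0.9, 0.2, 0.1, 3};
    printf("migrate = %d\n", should_migrate(feat, w1, b1, w2, -0.1));
    return 0;
}
```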
Mobile GPUs, the ubiquitous computing hardware on almost every smartphone, are being exploited for deep learning inference. In this paper, we present our measurements of inference performance on mobile GPUs. Our observations suggest that mobile GPUs are underutilized. We study this inefficiency in depth and find that one of the root causes is improper partitioning of the compute workload. To solve this, we propose a heuristics-based workload partitioning approach that considers both performance and overheads on mobile devices. Evaluation results show that our approach reduces inference latency by up to 32.8% across various devices and neural networks.
Embedding methods are commonly used in recommender systems to represent user and item features. A pressing practical challenge is that the large number of embedding vectors incurs a substantial memory footprint for serving queries, especially as the number of features continues to grow. We propose an embedding compression system called Saec to address this challenge. Saec exploits the similarity among features within a field, since they represent the same attribute of a user or item, and uses clustering to compress the embeddings. We propose a new fast clustering method that relies on the empirically heavy-tailed nature of features to drastically reduce the clustering overhead. We implement a prototype of Saec on a production system and evaluate it with private feature datasets from a large Internet company. Testbed experiments show that Saec reduces the memory footprint of embedding vectors from 4.46 GB to 161 MB. The fast clustering method we design achieves a 32x speedup over the baseline method.
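The following sketch illustrates the storage layout behind clustering-based embedding compression; sizes and dimensions are example values, not Saec's. Instead of one vector per feature, only a small centroid table plus a per-feature centroid index is kept, with Saec's fast heavy-tail-aware clustering assumed to have produced the assignments offline.

```c
#include <stdint.h>
#include <stdio.h>

#define DIM        8        /* embedding dimension (example value)              */
#define N_FEATURES 100000   /* features in one field (example value)            */
#define N_CLUSTERS 256      /* centroids kept after clustering (fits a uint8_t) */

/* Compressed representation: small centroid table + one index per feature. */
static float   centroids[N_CLUSTERS][DIM];
static uint8_t assignment[N_FEATURES];

/* Serving-time lookup: a feature's embedding is the centroid of the
 * cluster it was assigned to by the offline clustering step. */
static const float *embedding(int feature_id)
{
    return centroids[assignment[feature_id]];
}

int main(void)
{
    double full = (double)N_FEATURES * DIM * sizeof(float);
    double comp = (double)N_CLUSTERS * DIM * sizeof(float)
                + (double)N_FEATURES * sizeof(uint8_t);
    printf("full table: %.0f KB, compressed: %.0f KB (%.0fx smaller)\n",
           full / 1024, comp / 1024, full / comp);
    printf("embedding(42)[0] = %f\n", embedding(42)[0]);
    return 0;
}
```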
Most recommender systems are designed to comply with service level agreements (SLAs) because prompt response to users' requests is the most important factor determining the quality of service. Existing recommender systems, however, suffer severely from long tail latency when their embedding tables cannot be loaded entirely into main memory. In this paper, we propose a new SSD architecture called EMB-SSD, which mitigates the tail latency problem of recommender systems by leveraging in-storage processing. By offloading the data-intensive parts of the recommendation algorithm to the SSD, EMB-SSD not only reduces the data traffic between the host and the SSD but also lowers the software overheads caused by deep I/O stacks. Results show that EMB-SSD exhibits 47% and 25% shorter 99th-percentile and average latency, respectively, than existing systems.
Many performance studies rely on Intel's Precise Event Based Sampling (PEBS) to collect processor events, and precision is key to the reliability of such analyses. In this paper, we study the precision of PEBS and show that, despite its name, PEBS can produce erroneous samples under shadowing, which may make the analysis unreliable. We then show how to remedy this imprecision by artificially inserting bogus instructions. Our evaluation shows that the remedy leads to more precise event samples and thus more reliable performance analysis.
In cloud computing, resource allocation is a key building block. Existing resource allocation strategies are designed for multi-tenant scenarios, and most of them target resource fairness and utilization. We instead consider resource allocation among multiple applications that belong to a single user; in this case, the user cares about performance fairness rather than resource fairness. We focus on memory, one of the most universal resources used by applications. We face two main challenges. The first is how to characterize the performance of diverse applications. We choose User time, the number of CPU jiffies spent in user mode, as a proxy for an application's progress, and through a series of processing steps we make performance comparable across different applications. The second challenge is to ensure high system memory utilization. We address it with an adaptive memory allocation algorithm that achieves high memory utilization and performance fairness simultaneously by dynamically reallocating memory among applications. We implemented our algorithm as a scheduler on a physical machine where multiple applications share resources. Experimental evaluation shows that our algorithm achieves good performance fairness while ensuring high memory utilization across applications of different types and importance.
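As a concrete illustration of the progress signal described above (the parsing details are ours, not the paper's), the sketch below reads a process's accumulated user-mode jiffies from field 14 of /proc/<pid>/stat; sampling this counter periodically yields the progress rate used to compare applications.

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Return the accumulated user-mode jiffies (utime, field 14) of a process,
 * or -1 on failure. Assumes the process name contains no spaces. */
static long user_jiffies(pid_t pid)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/%d/stat", (int)pid);
    FILE *f = fopen(path, "r");
    if (!f) return -1;

    /* Skip fields 1-13 (pid, comm, state, ppid, ..., cmajflt). */
    unsigned long utime;
    int ok = fscanf(f,
        "%*d %*s %*s %*d %*d %*d %*d %*d %*u %*lu %*lu %*lu %*lu %lu",
        &utime);
    fclose(f);
    return ok == 1 ? (long)utime : -1;
}

int main(void)
{
    printf("utime(self) = %ld jiffies\n", user_jiffies(getpid()));
    return 0;
}
```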
Network-wide telemetry requires real-time analysis of a large amount of traffic. Telemetry systems use stream processors to support various applications and Protocol Independent Switch Architecture (PISA) switches to reduce the workload on the stream processors. Because existing systems use switch resources inefficiently, they cannot fully reduce the workload on the stream processors. Unlike existing systems, which treat switches independently when assigning tasks, our system, Concerto, lets switches work together. Cooperating switches provide more resources and thus a further reduction in the stream processors' workload. Furthermore, Concerto can adhere to more stringent error rate requirements. Our evaluation shows that Concerto reduces the stream processor's workload by as much as 19x and, under the same stream processor workload, achieves an error rate 104x lower than existing systems.
The selection of data reduction schemes, which is crucial for reducing data footprints on a distributed file system (DFS) and for transferring big data, is usually limited to the schemes supported by the underlying platform. If the source code of the platform is available, it might be possible to add user-favorite reduction schemes, but doing so incurs an expensive implementation cost or is virtually impossible. In this paper, we propose a system design that links a DFS to reduction schemes and makes them transparently available to data processing applications. To verify the feasibility of this approach, we implemented a framework within the Hadoop DFS (HDFS) named the Hadoop Data Reduction Framework (HDRF). The features of HDRF are threefold. First, application programmers can easily incorporate their favorite reduction schemes at a reasonably low implementation cost. Second, the selection is transparent to data processing applications. Lastly, experimental results show that HDRF can halve the vanilla HDFS transfer time by using a more optimized application without compromising the compression ratio. When running the same application, HDRF has minimal runtime and storage overheads compared to vanilla HDFS.