APSys '19: Proceedings of the 10th ACM SIGOPS Asia-Pacific Workshop on Systems


SESSION: Storage

pNOVA: Optimizing Shared File I/O Operations of NVM File System on Manycore Servers

NOVA is a state-of-the-art non-volatile memory file system that logs on a per-file basis to ensure consistency. However, NOVA does not scale when multiple threads perform I/O to a single shared file on manycore servers. We identified two problems. First, when multiple threads write to a single file, a coarse-grained per-file lock in the file system layer restricts parallel writes. Second, when multiple threads read from a single file, every reader-lock acquisition invalidates the cachelines of waiting threads and lock holders. To solve these problems, we propose pNOVA, a variant of NOVA that accelerates parallel writes and reads by multiple threads to the same file. First, instead of a coarse-grained per-file lock, pNOVA employs a fine-grained range lock, for which we provide two implementations: interval-tree-based range locking and atomic-operation-based range locking. Second, by defining a range-lock variable for each file range, we alleviate the cacheline invalidation problem of a single read counter. Lastly, we address the potential consistency damage incurred by parallel writes to a shared file and provide consistency using a commit-mark-based logging method. We evaluated pNOVA on a manycore server with 120 cores. In microbenchmarks, pNOVA showed up to 3.5× higher I/O throughput than NOVA for a concurrent shared-file write workload. In the Filebench-OLTP benchmark, pNOVA showed up to a 1.66× higher transaction processing rate than NOVA.
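
As a rough illustration of the atomic-operation-based flavor of range locking described above, the sketch below keeps one reader/writer word per fixed-size file range, so readers of disjoint ranges update different cachelines instead of a single shared read counter. This is a minimal user-space sketch under assumed parameters (2 MiB ranges, a fixed table), not pNOVA's kernel code.

```c
/* Minimal user-space sketch of atomic-operation-based range locking
 * (hypothetical layout and constants, not pNOVA's implementation).
 * The file is divided into fixed-size ranges, each guarded by its own
 * reader/writer word padded to a cacheline. */
#include <stdatomic.h>
#include <stdint.h>

#define RANGE_SHIFT 21               /* 2 MiB per lock range (assumed) */
#define NRANGES     1024
#define WRITER_BIT  (1u << 31)

/* one lock word per range, padded to avoid false sharing */
static struct { _Atomic uint32_t w; char pad[60]; } rl[NRANGES];

static void range_read_lock(uint64_t off) {
    _Atomic uint32_t *w = &rl[(off >> RANGE_SHIFT) % NRANGES].w;
    for (;;) {
        uint32_t v = atomic_load(w);
        if (!(v & WRITER_BIT) &&
            atomic_compare_exchange_weak(w, &v, v + 1))
            return;                  /* reader count bumped */
    }
}

static void range_read_unlock(uint64_t off) {
    atomic_fetch_sub(&rl[(off >> RANGE_SHIFT) % NRANGES].w, 1);
}

static void range_write_lock(uint64_t off) {
    _Atomic uint32_t *w = &rl[(off >> RANGE_SHIFT) % NRANGES].w;
    uint32_t zero = 0;
    while (!atomic_compare_exchange_weak(w, &zero, WRITER_BIT))
        zero = 0;                    /* retry until no readers/writer */
}

static void range_write_unlock(uint64_t off) {
    atomic_store(&rl[(off >> RANGE_SHIFT) % NRANGES].w, 0);
}
```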

Understanding Security Vulnerabilities in File Systems

File systems have been developed for decades with the security-critical foundation provided by operating systems. However, they are still vulnerable to malware attacks and software defects. In this paper, we undertake the first attempt to systematically understand the security vulnerabilities in various file systems. We conduct an empirical study of 157 real cases reported in Common Vulnerabilities and Exposures (CVE). We characterize the file system vulnerabilities along several dimensions, including the common vulnerabilities leveraged by adversaries to initiate their attacks, the exploitation procedures, root causes, consequences, and mitigation approaches. We believe the insights derived from this study have broad implications for further enhancing the security of file systems and the associated vulnerability detection tools.

PCDedup: I/O-Activity Centric Deduplication for Improving the SSD Lifetime

When data deduplication is used inside an SSD to extend the SSD lifetime, one of the key performance factors is how the fingerprint cache is managed. Since the size of the fingerprint cache is limited, the cache must be very selective in choosing which fingerprints to store. In this paper, we show that write program contexts, which are automatically extracted at run time, can accurately capture the future duplicability of the data they write. Based on this observation, we propose PCDedup for SSD-internal data deduplication. PCDedup automatically filters undesirable fingerprints out of the cache, thus improving the cache hit ratio. Our experimental results show that PCDedup can improve the SSD lifetime by up to 16.4% over an existing deduplication scheme while lowering the fingerprint management overhead by 68.6% on average.
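
As a rough sketch of how program-context-driven cache admission might work, the code below tracks, per program context, how often past writes turned out to be duplicates, and admits a write's fingerprint into the cache only when that context's observed duplication ratio is high. The thresholds and table layout are assumptions for illustration, not PCDedup's actual policy.

```c
/* Illustrative program-context (PC) based admission filter for a
 * fingerprint cache, in the spirit of PCDedup (hypothetical constants;
 * not the paper's implementation). Each write is tagged with a PC id
 * derived from its write path. */
#include <stdbool.h>
#include <stdint.h>

#define NPC 256

struct pc_stats { uint32_t writes, dup_hits; };
static struct pc_stats pc[NPC];

/* Admit this write's fingerprint only if the PC's observed
 * duplication ratio exceeds a threshold (here 10%, assumed). */
static bool admit_fingerprint(uint32_t pc_id) {
    struct pc_stats *s = &pc[pc_id % NPC];
    if (s->writes < 64)              /* warm-up: always admit */
        return true;
    return s->dup_hits * 10 > s->writes;
}

/* Update the PC's history after the deduplication lookup. */
static void record_write(uint32_t pc_id, bool was_duplicate) {
    struct pc_stats *s = &pc[pc_id % NPC];
    s->writes++;
    if (was_duplicate)
        s->dup_hits++;
}
```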

SESSION: Operating Systems

USETL: Unikernels for Serverless Extract, Transform and Load. Why Should You Settle for Less?

The growing popularity of serverless functions is driving the need to optimize the execution platform to reduce resource usage and increase the number of functions that can be executed concurrently, which reduces the provider's costs and increases profit. While current serverless solutions use containers and/or virtual machines, we propose a unikernel-based design called USETL that is specialized for serverless extract, transform, load (ETL) workloads. Our design is motivated by several key observations: serverless functions are stateless, ephemeral, and event-driven, and each function's specific purpose is known at invocation time. Unikernels are a natural fit for execution contexts with these properties: they are minimal kernels packaged with a single application in a single address space, which makes them extremely lightweight. Our design removes network and storage components entirely, replacing them with high-level APIs tailored to the needs of serverless ETL functions. Virtualizing I/O at the runtime library interface reduces memory and CPU overheads, yielding higher consolidation density.
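
To make the idea concrete, an ETL-specialized runtime interface might expose records rather than sockets or files. The sketch below is purely hypothetical (all names invented), since the abstract does not give USETL's actual API.

```c
/* Purely hypothetical sketch of an ETL-level unikernel API in the
 * spirit of USETL (all names invented). Instead of sockets and files,
 * the function consumes and emits records; the runtime library
 * handles all data movement underneath. */
#include <stddef.h>

struct etl_record {
    const void *data;
    size_t len;
};

/* supplied by the runtime: pull the next input record, or 0 at end */
int  etl_next_record(struct etl_record *rec);
/* supplied by the runtime: emit one transformed output record */
void etl_emit(const void *data, size_t len);

/* the serverless function itself: a pure transform over records */
void etl_main(void)
{
    struct etl_record rec;
    while (etl_next_record(&rec)) {
        /* ... transform rec.data here ... */
        etl_emit(rec.data, rec.len);
    }
}
```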

ExtOS: Data-centric Extensible OS

Today's computer architectures are fundamentally different from those of a decade ago: IO devices and interfaces can sustain much higher data rates than the compute capacity of a single-threaded CPU. To meet the computational requirements of modern applications, the operating system (OS) requires lean and optimized software running on CPUs so that applications can fully exploit the IO resources. Despite the changes in hardware, today's traditional system software unfortunately still rests on the assumptions of a decade ago: IO is slow, and the CPU is fast.

This paper makes a case for the data-centric extensible OS, which enables full exploitation of emerging high-performance IO hardware. Based on the idea of minimizing data movement in software, we propose a top-to-bottom lean and optimized architecture that allows applications to customize the OS kernel's IO subsystems with application-provided code. This enables sharing and high-performance IO among applications: initial microbenchmarks on a Linux prototype, in which we used eBPF to specialize the Linux kernel, show performance improvements of up to 1.8× for database primitives and 4.8× for UNIX utility tools.
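
As a schematic illustration of the kind of application-provided kernel extension such a prototype relies on, the eBPF-style filter below drops non-matching records inside the kernel so that only selected data is copied to user space. The section name and context struct are hypothetical; the abstract does not describe ExtOS's real extension interface.

```c
/* Schematic eBPF filter illustrating the ExtOS idea of customizing
 * the kernel's IO path with application code (attachment point and
 * context struct are invented for illustration). */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* hypothetical per-record context handed to the filter */
struct record_ctx {
    __u32 key;
    __u32 len;
};

SEC("extos/read_filter")             /* hypothetical attachment point */
int select_records(struct record_ctx *ctx)
{
    /* keep only records whose key falls in the queried range */
    if (ctx->key >= 1000 && ctx->key < 2000)
        return 1;                    /* deliver record to user space */
    return 0;                        /* drop record in-kernel */
}

char LICENSE[] SEC("license") = "GPL";
```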

Thinking About a New Mechanism for Huge Page Management

The huge page mechanism was proposed to reduce TLB misses and benefit overall system performance. On systems with large memory capacity, using huge pages is an ideal choice for alleviating virtual-to-physical address translation overheads. However, using huge pages may incur expensive memory compaction operations due to memory fragmentation, and may lead to memory bloating because many huge pages are underutilized in practice.

To address these problems, we propose SysMon-H, a sampling module in the OS kernel that obtains huge page utilization at low overhead for both cloud and desktop applications. Furthermore, we propose H-Policy, a huge page management policy that, based on the information provided by SysMon-H, splits underutilized huge pages to mitigate memory bloating or promotes base 4KB pages to huge pages to reduce TLB misses. In our prototype, SysMon-H and H-Policy work cooperatively in the OS kernel.
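
A decision rule of the kind H-Policy makes could look roughly like the following, driven by per-region utilization and access counts such as SysMon-H might report. The field names and thresholds here are hypothetical, not the paper's actual policy.

```c
/* Illustrative sketch of an H-Policy-style split/promote decision
 * (hypothetical thresholds and field names). Each sample covers one
 * 2 MiB region, whether it is currently a huge page or 512 base pages. */
#include <stdint.h>

#define HPAGE_SUBPAGES 512           /* 2 MiB huge page = 512 x 4 KiB */

struct hpage_sample {
    uint16_t live_subpages;          /* 4 KiB pages actually touched */
    uint32_t accesses;               /* sampled access count */
};

enum hp_action { HP_KEEP, HP_SPLIT, HP_PROMOTE };

static enum hp_action h_policy(const struct hpage_sample *s, int is_huge) {
    if (is_huge && s->live_subpages < HPAGE_SUBPAGES / 4)
        return HP_SPLIT;             /* underutilized: curb bloating */
    if (!is_huge && s->accesses > 10000 &&
        s->live_subpages > HPAGE_SUBPAGES * 3 / 4)
        return HP_PROMOTE;           /* hot, dense region: cut TLB misses */
    return HP_KEEP;
}
```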

SESSION: System Management

Detecting System Failures with GPUs and LLVM

Since system failures cause huge financial losses, they should be detected as early and as accurately as possible and then recovered from rapidly. There are two main methods for detecting system failures: black-box and white-box monitoring. However, external black-box monitoring cannot obtain detailed information on system failures, while internal white-box monitoring is itself easily affected by the failures it observes. This paper proposes GPUSentinel, which uses general-purpose GPUs for more reliable white-box monitoring. In GPUSentinel, system monitors running in a GPU analyze main memory and indirectly obtain the state of the target system. Since GPUs are isolated from the target system, the system monitors are not easily affected by system failures. For easy development of system monitors, GPUSentinel provides a development environment that includes program transformation with LLVM. In addition, it provides reliable notification mechanisms to remote hosts. We have implemented GPUSentinel using CUDA and the Linux kernel and confirmed that it could detect three types of system failures.
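
One simple check such a monitor can perform is watching whether a counter the OS must keep advancing (for example, the kernel's tick count) still changes. The sketch below shows that logic schematically in plain C; in GPUSentinel the monitors execute as CUDA kernels over mapped host memory, and the counter's address would be resolved from kernel symbols at setup time.

```c
/* Schematic liveness check a GPUSentinel-style monitor might run
 * (illustrative only, not the paper's code). If a counter the OS must
 * keep advancing stops changing, the target is presumed to have failed. */
#include <stdint.h>

/* assumed to point into mapped main memory at the kernel tick count */
static volatile const uint64_t *tick_counter;

static int target_alive(unsigned long spins)
{
    uint64_t before = *tick_counter;
    for (volatile unsigned long i = 0; i < spins; i++)
        ;                            /* delay without relying on the OS */
    return *tick_counter != before;  /* no progress => likely hang */
}
```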

Towards Framework-Independent, Non-Intrusive Performance Characterization for Dataflow Computation

Troubleshooting performance bugs in dataflow computation is often a painful process, even for experienced developers. Existing approaches to configuration tuning or performance analysis are either specific to a particular framework or require code instrumentation. In this paper, we propose a framework-independent and non-intrusive approach to performance characterization. For each job, we first assemble the information provided by off-the-shelf profilers into a DAG-based execution profile. We then locate, for each DAG node (operation), the source code of its executed functions. Our key insight is that code contains learnable lexical and syntactic patterns that reveal resource information. We therefore perform code analysis and infer the operations' resource usage with machine learning classifiers. Based on these, we establish a performance-resource model that correlates job performance with the resources used. An evaluation with two Spark use cases demonstrates the effectiveness of our approach in detecting program bottlenecks and predicting job completion time under various resource configurations.

DeepPlace: Learning to Place Applications in Multi-Tenant Clusters

Large multi-tenant production clusters often have to handle a variety of jobs and applications with complex resource usage characteristics. Manually creating placement rules that decide which applications should be co-located is non-trivial and rarely optimal. In this paper, we present DeepPlace, a scheduler that uses Deep Reinforcement Learning (Deep RL) to learn to exploit the temporal resource usage patterns of applications, reducing resource competition across jobs running on the same machine while simultaneously optimizing overall cluster utilization.

SESSION: Security and Networking

Brokered Agreements in Multi-Party Machine Learning

Rapid machine learning (ML) adoption across a range of industries has prompted numerous concerns. These range from privacy (how is my data being used?) to fairness (is this model's result representative?) and provenance (who is using my data and how can I restrict this usage?).

Now that ML is widely used, we believe it is time to rethink security, privacy, and incentives in the ML pipeline by reconsidering control. We examine distributed multi-party ML proposals and identify their shortcomings. We then propose brokered learning, which separates the role of the curator (who determines the training set-up) from that of the broker (who runs the training process). We consider the implications of this setup and present evaluation results from implementing and deploying TorMentor, an example brokered learning system that implements the first distributed ML training system with anonymity guarantees.

Using Inputs and Context to Verify User Intentions in Internet Services

An open security problem is how a server can tell whether a request submitted by a client is legitimately intended by the user or faked by malware that has infected the user's system. This paper proposes Attested Intentions (AINT) to ensure that user intentions are properly translated into service requests. AINT uses a trusted hypervisor to record user inputs and context, and uses an Intel SGX enclave to continuously verify that the context in which user interaction occurs has not been tampered with. After verification, AINT also uses an SGX enclave for protected execution to generate the service request from the inputs collected by the hypervisor. To address privacy concerns over the recording of user inputs and context, AINT performs all verification on the client device, so recorded data is never transmitted to a remote party.

RocketStreams: A Framework for the Efficient Dissemination of Live Streaming Video

Live streaming video accounts for a major portion of modern Internet traffic. Services like Twitch and YouTube Live rely on the high-speed distribution of live streaming video content to vast numbers of viewers. For popular content, the data is disseminated (replicated) to multiple servers in data centres (or IXPs) for scalable, encrypted delivery to nearby viewers.

In this paper, we sketch the design of RocketStreams, a framework that facilitates the high-performance dissemination of live streaming video content. RocketStreams removes the need for live streaming services to design complicated data management and networking solutions, replacing them with an easy-to-use API and a backend that handles data movement on behalf of the applications. In addition to its support for TCP-based communication, RocketStreams supports CPU-efficient dissemination over RDMA, when available. We demonstrate the utility of RocketStreams for live streaming video dissemination by modifying a web server to make use of the framework. Preliminary results show that RocketStreams performs similarly to Redis on dissemination nodes. On delivery nodes, RocketStreams reduces CPU utilization by up to 54% compared to Redis, and therefore supports up to 27% higher simultaneous viewer throughput. When using RDMA, RocketStreams supports up to 73% higher ingest traffic on dissemination nodes compared with Redis, reduces delivery node CPU utilization by up to 95%, and supports up to 55% more simultaneous viewers.
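
The abstract does not show the API itself, but its shape can be imagined as a small publish/subscribe surface over named streams. Everything below is invented for illustration; it is not RocketStreams' real interface.

```c
/* Hypothetical shape of a dissemination API like the one RocketStreams
 * is described as providing (all names invented). A streamer publishes
 * encoded segments to a named channel, and the framework replicates
 * them to subscribed delivery nodes over TCP or RDMA. */
#include <stddef.h>

typedef struct rs_channel rs_channel;   /* opaque stream handle */

rs_channel *rs_open(const char *stream_id);
/* publish one video segment; the framework picks TCP or RDMA */
int  rs_publish(rs_channel *c, const void *segment, size_t len);
/* on a delivery node: block until the next segment arrives */
int  rs_next_segment(rs_channel *c, void **segment, size_t *len);
void rs_close(rs_channel *c);
```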

SESSION: Data Center

Yawn: A CPU Idle-state Governor for Datacenter Applications

Idle-state governors partially turn off idle CPUs, allowing them to enter states known as idle states to save power. Exiting from these idle states, however, imposes delays on the execution of tasks and aggravates tail latency. Menu, the default idle-state governor of Linux, predicts periods of idleness based on historical data and disk I/O information to choose proper idle states. Our experiments show that Menu can save power, but at the cost of sacrificing tail latency, making it an inappropriate governor for data centers that host latency-sensitive applications. In this paper, we present the initial design of Yawn, an idle-state governor that aims to mitigate tail latency without sacrificing power savings. Yawn leverages online machine learning techniques to predict idle periods based on information gathered from all parameters affecting idleness, including network I/O, resulting in more accurate predictions and, in turn, reduced response times. Preliminary benchmarking results demonstrate that Yawn reduces the 99th percentile latency of Memcached requests by up to 40%.
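
At its core, an idle-state governor pairs a prediction of the next idle interval with a table of C-states and picks the deepest state whose residency and exit latency still fit. The sketch below uses a simple EWMA as a stand-in for Yawn's online machine learning predictor; the state table and constants are illustrative, not real hardware numbers.

```c
/* Sketch of the core decision in an idle-state governor such as Yawn
 * (illustrative only; Yawn's actual predictor is an online learner
 * over many idleness-related inputs, including network I/O). */
#include <stdint.h>

struct cstate { uint32_t exit_latency_us; uint32_t min_residency_us; };

static const struct cstate states[] = {
    { 1,   1    },                   /* C1: shallow, cheap to exit */
    { 50,  200  },                   /* mid-depth state */
    { 300, 2000 },                   /* deep state, slow to exit */
};
#define NSTATES ((int)(sizeof states / sizeof states[0]))

static int64_t predicted_idle_us = 100;

/* feed back the observed idle length after each wakeup */
static void observe_idle(uint32_t actual_us) {
    predicted_idle_us += (actual_us - predicted_idle_us) / 8;
}

/* choose the deepest state that fits the prediction and the
 * caller's tail-latency budget */
static int pick_cstate(uint32_t latency_budget_us) {
    int best = 0;
    for (int i = 1; i < NSTATES; i++)
        if (states[i].min_residency_us <= predicted_idle_us &&
            states[i].exit_latency_us <= latency_budget_us)
            best = i;
    return best;
}
```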

Rebooting Virtualization Research (Again)

Visible or hidden, virtualization platforms remain the cornerstone of the cloud and the performance overheads of the latest generations have shrunk. Is hypervisor research dead? We argue that the upcoming trends of hardware disaggregation in the data center motivate a new chapter of virtualization research. We explain why the guest virtual machine abstraction is still relevant in such a new hardware environment and we discuss challenges and ideas for hypervisor and guest OS design in this context. Finally, we propose the architecture of a research platform to explore these questions.

Evaluation of a Disaggregated Rack System

NexTCA is an industrial-strength disaggregated rack system designed to provide a flexible computing infrastructure for heterogeneous workloads with wildly different resource demands. The key innovation in NexTCA is a PCI Express (PCIe) network that connects a rack of servers in such a way that the DRAMs, network interface cards (NICs), and SAS or NVMe solid state disks (SSDs) associated with one server are directly accessible to other servers in the same rack. Leveraging this flexibility, NexTCA organizes the DRAMs, NICs, and SSDs anywhere in a NexTCA rack into a global DRAM, NIC, and SSD pool, respectively, and dynamically allocates hardware resources to individual servers that may or may not host the allocated resources. That is, NexTCA enables the notion of a software-defined server, where the set of hardware resources bound to a server is configurable via software at run time, rather than fixed at manufacturing time. In addition, NexTCA uses the same PCIe network to support high-performance intra-rack inter-server communication. This paper starts with a brief description of the hardware architecture and systems software stack of NexTCA, and proceeds with a detailed evaluation of the first NexTCA prototype, which was jointly developed by ADLINK Technology and the Industrial Technology Research Institute.