23rd ACM Symposium on Operating Systems Principles
Abstracts
Monday 24th, 08:30-09:00
Introduction
Welcome and Awards
Monday 24th, 09:00-10:30
Key-Value
SILT: A Memory-Efficient, High-Performance Key-Value Store
SILT (Small Index Large Table) is a memory-efficient, high-performance key-value store system based on flash storage that scales to serve billions of key-value items on a single node. It requires only 0.7 bytes of DRAM per entry and retrieves key/value pairs using on average 1.01 flash reads each. SILT combines new algorithmic and systems techniques to balance the use of memory, storage, and computation. Our contributions include: (1) the design of three basic key-value stores each with a different emphasis on memory-efficiency and write-friendliness; (2) synthesis of the basic key-value stores to build a SILT key-value store system; and (3) an analytical model for tuning system parameters carefully to meet the needs of different workloads. SILT requires one to two orders of magnitude less memory to provide comparable throughput to current high-performance key-value systems on a commodity desktop system with flash storage.
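
A hedged sketch of the multi-store idea in Python (the class names, merge policy, and threshold below are invented for illustration and are not SILT's actual algorithms): a small write-friendly front store absorbs inserts and is periodically merged into a larger, memory-frugal sorted store; lookups consult the newest store first.

    # Toy multi-store key-value chain, loosely inspired by SILT's design.
    import bisect

    class WriteStore:
        """Write-friendly front store (a hash table stands in for a log plus index)."""
        def __init__(self):
            self.data = {}
        def put(self, k, v):
            self.data[k] = v
        def get(self, k):
            return self.data.get(k)

    class SortedStore:
        """Read-only sorted store; binary search keeps per-entry overhead low."""
        def __init__(self, items):
            self.keys = [k for k, _ in items]
            self.vals = [v for _, v in items]
        def get(self, k):
            i = bisect.bisect_left(self.keys, k)
            if i < len(self.keys) and self.keys[i] == k:
                return self.vals[i]
            return None

    class TieredKV:
        def __init__(self, merge_threshold=4):
            self.front = WriteStore()
            self.back = SortedStore([])
            self.merge_threshold = merge_threshold
        def put(self, k, v):
            self.front.put(k, v)
            if len(self.front.data) >= self.merge_threshold:
                self._merge()
        def get(self, k):
            v = self.front.get(k)            # newest data first
            return v if v is not None else self.back.get(k)
        def _merge(self):
            merged = dict(zip(self.back.keys, self.back.vals))
            merged.update(self.front.data)   # front store wins on conflicts
            self.back = SortedStore(sorted(merged.items()))
            self.front = WriteStore()

    kv = TieredKV()
    for i in range(10):
        kv.put(f"k{i}", i)
    assert kv.get("k3") == 3
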
Scalable Consistency in Scatter
Distributed storage systems often trade off strong semantics for improved scalability. This paper describes the design, implementation, and evaluation of Scatter, a scalable and consistent distributed key-value storage system. Scatter adopts the highly decentralized and self-organizing structure of scalable peer-to-peer systems, while preserving linearizable consistency even under adverse circumstances. Our prototype implementation demonstrates that even with very short node lifetimes, it is possible to build a scalable and consistent system with practical performance.
Fast Crash Recovery in RAMCloud
RAMCloud is a DRAM-based storage system that provides inexpensive durability and availability by recovering quickly after crashes, rather than storing replicas in DRAM. RAMCloud scatters backup data across hundreds or thousands of disks, and it harnesses hundreds of servers in parallel to reconstruct lost data. The system uses a log-structured approach for all its data, in DRAM as well as on disk; this provides high performance both during normal operation and during recovery. RAMCloud employs randomized techniques to manage the system in a scalable and decentralized fashion. In a 60-node cluster, RAMCloud recovers 35 GB of data from a failed server in 1.6 seconds. Our measurements suggest that the approach will scale to recover larger memory sizes (64 GB or more) in less time with larger clusters.
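
A toy sketch of scattered backups and parallel recovery (segment granularity, the modulo partitioning, and all names are invented; RAMCloud's actual placement and partitioning are far more careful):

    import random
    from concurrent.futures import ThreadPoolExecutor

    NUM_BACKUPS = 8
    backups = {b: [] for b in range(NUM_BACKUPS)}   # backup id -> segments held

    def scatter(segments):
        """Spread a master's log segments across many backup disks."""
        for seg in segments:
            backups[random.randrange(NUM_BACKUPS)].append(seg)

    def recover(num_recovery_masters=4):
        """Each recovery master rebuilds one partition of the failed master's
        data, reading segment replicas from all backups in parallel."""
        def rebuild(partition):
            rebuilt = [seg for blist in backups.values()
                       for seg in blist if seg % num_recovery_masters == partition]
            return partition, sorted(rebuilt)
        with ThreadPoolExecutor(max_workers=num_recovery_masters) as pool:
            return dict(pool.map(rebuild, range(num_recovery_masters)))

    scatter(range(100))                 # 100 log segments from the failed master
    recovered = recover()
    assert sum(len(v) for v in recovered.values()) == 100
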
Monday 24th, 11:00-12:30
Storage
Design Implications for Enterprise Storage Systems via Multi-Dimensional Trace Analysis
Enterprise storage systems are facing enormous challenges due to increasing growth and heterogeneity of the data stored. Designing future storage systems requires comprehensive insights that existing trace analysis methods are ill-equipped to supply. In this paper, we seek to provide such insights by using a new methodology that leverages an objective, multi-dimensional statistical technique to extract data access patterns from network storage system traces. We apply our method on two large-scale real-world production network storage system traces to obtain comprehensive access patterns and design insights at user, application, file, and directory levels. We derive simple, easily implementable, threshold-based design optimizations that enable efficient data placement and capacity optimization strategies for servers, consolidation policies for clients, and improved caching performance for both.
Differentiated Storage Services
We propose an I/O classification architecture to close the widening semantic gap between computer systems and storage systems. By classifying I/O, a computer system can request that different classes of data be handled with different storage system policies. Specifically, when a storage system is first initialized, we assign performance policies to predefined classes, such as the filesystem journal. Then, online, we include a classifier with each I/O command (e.g., SCSI), thereby allowing the storage system to enforce the associated policy for each I/O that it receives.

Our immediate application is caching. We present filesystem prototypes and a database proof-of-concept that classify all disk I/O — with very little modification to the filesystem, database, and operating system. We associate caching policies with various classes (e.g., large files shall be evicted before metadata and small files), and we show that end-to-end file system performance can be improved by over a factor of two, relative to conventional caches like LRU. And caching is simply one of many possible applications. As part of our ongoing work, we are exploring other classes, policies and storage system mechanisms that can be used to improve end-to-end performance, reliability and security.
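
A minimal sketch of class-aware caching under invented class names and priorities (not the paper's actual classifier or policies): each I/O carries a class tag, and eviction prefers low-priority classes over plain global LRU.

    from collections import OrderedDict

    PRIORITY = {"metadata": 2, "small_file": 1, "large_file": 0}

    class ClassAwareCache:
        def __init__(self, capacity):
            self.capacity = capacity
            self.entries = OrderedDict()   # block -> io_class, in LRU order

        def access(self, block, io_class):
            if block in self.entries:
                self.entries.move_to_end(block)   # LRU touch
                return "hit"
            if len(self.entries) >= self.capacity:
                self._evict()
            self.entries[block] = io_class
            return "miss"

        def _evict(self):
            # Evict the least-recently-used block of the lowest-priority
            # class present, rather than the globally oldest block.
            victim = min(self.entries, key=lambda b: PRIORITY[self.entries[b]])
            del self.entries[victim]

    cache = ClassAwareCache(capacity=2)
    cache.access("inode:7", "metadata")
    cache.access("blk:100", "large_file")
    cache.access("blk:200", "large_file")   # evicts blk:100, not inode:7
    assert "inode:7" in cache.entries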

A File is Not a File: Understanding the I/O Behavior of Apple Desktop Applications
We analyze the I/O behavior of iBench, a new collection of productivity and multimedia application workloads. Our analysis reveals a number of differences between iBench and typical file-system workload studies, including the complex organization of modern files, the lack of pure sequential access, the influence of underlying frameworks on I/O patterns, the widespread use of file synchronization and atomic operations, and the prevalence of threads. Our results have strong ramifications for the design of next-generation local and cloud-based storage systems.
Monday 24th, 14:00-15:30
Security
CryptDB: Protecting Confidentiality with Encrypted Query Processing
Online applications are vulnerable to theft of sensitive information because adversaries can exploit software bugs to gain access to private data, and because curious or malicious administrators may capture and leak data. CryptDB is a system that provides practical and provable confidentiality in the face of these attacks for applications backed by SQL databases. It works by executing SQL queries over encrypted data using a collection of efficient SQL-aware encryption schemes. CryptDB can also chain encryption keys to user passwords, so that a data item can be decrypted only by using the password of one of the users with access to that data. As a result, a database administrator never gets access to decrypted data, and even if all servers are compromised, an adversary cannot decrypt the data of any user who is not logged in. An analysis of a trace of 126 million SQL queries from a production MySQL server shows that CryptDB can support operations over encrypted data for 99.5% of the 128,840 columns seen in the trace. Our evaluation shows that CryptDB has low overhead, reducing throughput by 14.5% for phpBB, a web forum application, and by 26% for queries from TPC-C, compared to unmodified MySQL. Chaining encryption keys to user passwords requires 11–13 unique schema annotations to secure more than 20 sensitive fields and 2–7 lines of source code changes for three multi-user web applications.
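
A toy illustration of one ingredient, equality queries over encrypted data. Here a keyed PRF (HMAC) stands in for CryptDB's deterministic encryption; unlike real deterministic encryption it cannot be decrypted, so this shows only how equal plaintexts yield equal tokens that the server can compare without seeing plaintext.

    import hmac, hashlib

    KEY = b"per-column key held by the trusted proxy"

    def det(value: str) -> bytes:
        """Deterministic token: equal plaintexts -> equal tokens."""
        return hmac.new(KEY, value.encode(), hashlib.sha256).digest()

    # The proxy rewrites: SELECT * FROM users WHERE name = 'alice'
    # into an equality test over tokens the untrusted server can evaluate.
    rows = [{"name": det("alice"), "age": 30},   # ages left in the clear for brevity
            {"name": det("bob"),   "age": 25}]
    query_token = det("alice")
    matches = [r for r in rows if r["name"] == query_token]
    assert len(matches) == 1
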
Intrusion Recovery for Database-backed Web Applications
WARP is a system that helps users and administrators of web applications recover from intrusions such as SQL injection, cross-site scripting, and clickjacking attacks, while preserving legitimate user changes. WARP repairs from an intrusion by rolling back parts of the database to a version before the attack, and replaying subsequent legitimate actions. WARP allows administrators to retroactively patch security vulnerabilities—i.e., apply new security patches to past executions—to recover from intrusions without requiring the administrator to track down or even detect attacks. WARP’s time-travel database allows fine-grained rollback of database rows, and enables repair to proceed concurrently with normal operation of a web application. Finally, WARP captures and replays user input at the level of a browser’s DOM, to recover from attacks that involve a user’s browser. For a web server running MediaWiki, WARP requires no application source code changes to recover from a range of common web application vulnerabilities with minimal user input at a cost of 24–27% in throughput and 2–3.2 GB/day in storage.
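
A hedged sketch of the time-travel idea (the schema, action tags, and repair granularity are invented): keep an append-only history of row versions, roll back to a point before the attack, and replay the legitimate actions recorded after it.

    import itertools

    clock = itertools.count(1)
    history = []          # append-only: (time, row_id, value, action_tag)

    def write(row_id, value, tag):
        history.append((next(clock), row_id, value, tag))

    def state_at(t):
        """Latest value of each row as of time t."""
        rows = {}
        for time, row_id, value, _ in history:
            if time <= t:
                rows[row_id] = value
        return rows

    def repair(attack_time):
        """Discard the attack and everything after it, then replay the
        legitimate (non-attack) actions in their original order."""
        global history
        keep = [h for h in history if h[0] < attack_time]
        replay = [h for h in history if h[0] >= attack_time and h[3] != "attack"]
        history = keep
        for _, row_id, value, tag in replay:
            write(row_id, value, tag)

    write("page:1", "hello", "user")
    write("page:1", "defaced", "attack")     # injected write at t=2
    write("page:2", "comment", "user")       # legitimate later action
    repair(attack_time=2)
    assert state_at(next(clock)) == {"page:1": "hello", "page:2": "comment"}
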
Software fault isolation with API integrity and multi-principal modules
The security of many applications relies on the kernel being secure, but history suggests that kernel vulnerabilities are routinely discovered and exploited. In particular, exploitable vulnerabilities in kernel modules are common. This paper proposes LXFI, a system which isolates kernel modules from the core kernel so that vulnerabilities in kernel modules cannot lead to a privilege escalation attack. To safely give kernel modules access to complex kernel APIs, LXFI introduces the notion of API integrity, which captures the set of contracts assumed by an interface. To partition the privileges within a shared module, LXFI introduces module principals. Programmers specify principals and API integrity rules through capabilities and annotations. Using a compiler plugin, LXFI instruments the generated code to grant, check, and transfer capabilities between modules, according to the programmer’s annotations. An evaluation with Linux shows that the annotations required on kernel functions to support a new module are moderate, and that LXFI is able to prevent three known privilege-escalation vulnerabilities. Stress tests of a network driver module also show that isolating this module using LXFI does not hurt TCP throughput but reduces UDP throughput by 35%, and increases CPU utilization by 2.2–3.7x.
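
A toy model of API integrity via capabilities, with invented principal and capability names (LXFI enforces this with compiler-inserted checks on real kernel interfaces, not Python decorators):

    class CapabilityError(Exception):
        pass

    principals = {"net_driver": {"cap:skb_write"},     # capabilities granted
                  "fs_module":  {"cap:inode_read"}}

    def requires(cap):
        """Annotation: the calling principal must hold `cap` to invoke this API."""
        def deco(fn):
            def wrapper(caller, *args, **kwargs):
                if cap not in principals.get(caller, set()):
                    raise CapabilityError(f"{caller} lacks {cap}")
                return fn(caller, *args, **kwargs)
            return wrapper
        return deco

    @requires("cap:skb_write")
    def dev_queue_xmit(caller, packet):
        return f"sent {packet}"

    assert dev_queue_xmit("net_driver", "pkt0") == "sent pkt0"
    try:
        dev_queue_xmit("fs_module", "pkt1")    # privilege escalation blocked
    except CapabilityError:
        pass
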
Monday 24th, 16:00-17:30
Reality
Thialfi: A Client Notification Service for Internet-Scale Applications
Ensuring the freshness of client data is a fundamental problem for applications that rely on cloud infrastructure to store data and mediate sharing. Thialfi is a notification service developed at Google to simplify this task. Thialfi supports applications written in multiple programming languages and running on multiple platforms, e.g., browsers, phones, and desktops. Applications register their interest in a set of shared objects and receive notifications when those objects change. Thialfi servers run in multiple Google data centers for availability and replicate their state asynchronously. Thialfi’s approach to recovery emphasizes simplicity: all server state is soft, and clients drive recovery and assist in replication. A principal goal of our design is to provide a straightforward API and good semantics despite a variety of failures, including server crashes, communication failures, storage unavailability, and data center failures.

Evaluation of live deployments confirms that Thialfi is scalable, efficient, and robust. In production use, Thialfi has scaled to millions of users and delivers notifications with an average delay of less than one second.
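
A hedged sketch of the register/notify model (class and method names are invented): clients register interest in object ids and receive version-only notifications; server state is soft, so recovery is client-driven re-registration.

    from collections import defaultdict

    class NotificationService:
        def __init__(self):
            self.interest = defaultdict(set)   # object id -> client ids (soft state)
            self.inbox = defaultdict(list)     # client id -> pending notifications

        def register(self, client, obj):
            self.interest[obj].add(client)

        def publish(self, obj, version):
            # Deliver "object X changed at version V", not the payload itself.
            for client in self.interest[obj]:
                self.inbox[client].append((obj, version))

        def crash(self):
            # All server state is soft: clients re-register to recover.
            self.interest.clear()

    svc = NotificationService()
    svc.register("phone", "contacts/42")
    svc.publish("contacts/42", version=7)
    assert svc.inbox["phone"] == [("contacts/42", 7)]
    svc.crash()
    svc.register("phone", "contacts/42")   # client-driven recovery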

Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency
Windows Azure Storage (WAS) is a cloud storage system that provides customers the ability to store seemingly limitless amounts of data for any duration of time. WAS customers have access to their data from anywhere at any time and only pay for what they use and store. In WAS, data is stored durably using both local and geographic replication to facilitate disaster recovery. Currently, WAS storage comes in the form of Blobs (files), Tables (structured storage), and Queues (message delivery). In this paper, we describe the WAS architecture, global namespace, and data model, as well as its resource provisioning, load balancing, and replication systems.
An Empirical Study on Configuration Errors in Commercial and Open Source Systems
Configuration errors (i.e., misconfigurations) are among the dominant causes of system failures. Their importance has inspired many research efforts on detecting, diagnosing, and fixing misconfigurations; such research would benefit greatly from a real-world characteristic study on misconfigurations. Unfortunately, few such studies have been conducted in the past, primarily because historical misconfigurations usually have not been recorded rigorously in databases.

In this work, we undertake one of the first attempts to conduct a real-world misconfiguration characteristic study. We study a total of 546 real-world misconfigurations, including 309 misconfigurations from a commercial storage system deployed at thousands of customers, and 237 from four widely used open source systems (CentOS, MySQL, Apache HTTP Server, and OpenLDAP). Some of our major findings include: (1) A majority of misconfigurations (70.0%–85.5%) are due to mistakes in setting configuration parameters; however, a significant number of misconfigurations are due to compatibility issues or component configurations (i.e., not parameter-related). (2) 38.1%–53.7% of parameter mistakes are caused by illegal parameters that clearly violate some format or rule, motivating the use of an automatic configuration checker to detect these misconfigurations. (3) A significant percentage (12.2%–29.7%) of parameter-based mistakes are due to inconsistencies between different parameter values. (4) 21.7%–57.3% of the misconfigurations involve configurations external to the examined system, some even on entirely different hosts. (5) A significant portion of misconfigurations can cause hard-to-diagnose failures, such as crashes, hangs, or severe performance degradation, indicating that systems should be better equipped to handle misconfigurations.
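
Finding (2) motivates automatic checkers for illegal parameter values. A minimal sketch, with invented parameter names and rules: each parameter declares a validator, and violations are flagged before the configuration is applied.

    import re

    RULES = {
        "max_connections": lambda v: v.isdigit() and int(v) > 0,
        "listen_address":  lambda v: re.fullmatch(r"\d+\.\d+\.\d+\.\d+", v),
        "log_level":       lambda v: v in {"debug", "info", "warn", "error"},
    }

    def check(config):
        errors = []
        for key, value in config.items():
            rule = RULES.get(key)
            if rule is None:
                errors.append(f"unknown parameter: {key}")
            elif not rule(value):
                errors.append(f"illegal value for {key}: {value!r}")
        return errors

    assert check({"max_connections": "100", "log_level": "info"}) == []
    assert check({"log_level": "verbose"}) == ["illegal value for log_level: 'verbose'"]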

Monday 24th, 17:30-19:15
Posters
Tuesday 25th, 09:00-11:00
Virtualization
Cells: A Virtual Mobile Smartphone Architecture
Smartphones are increasingly ubiquitous, and many users carry multiple phones to accommodate work, personal, and geographic mobility needs. We present Cells, a virtualization architecture for enabling multiple virtual smartphones to run simultaneously on the same physical cellphone in an isolated, secure manner. Cells introduces a usage model of having one foreground virtual phone and multiple background virtual phones. This model enables a new device namespace mechanism and novel device proxies that integrate with lightweight operating system virtualization to multiplex phone hardware across multiple virtual phones while providing native hardware device performance. Cells virtual phone features include fully accelerated 3D graphics, complete power management features, and full telephony functionality with separately assignable telephone numbers and caller ID support. We have implemented a prototype of Cells that supports multiple Android virtual phones on the same phone. Our performance results demonstrate that Cells imposes only modest runtime and memory overhead, works seamlessly across multiple hardware devices including Google Nexus 1 and Nexus S phones, and transparently runs Android applications at native speed without any modifications.
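
A toy model of the device-namespace idea, with invented names: every virtual phone sees a device object, but only the foreground phone's namespace is bound to the real hardware, while background phones get a muted proxy.

    class Device:
        def __init__(self, name):
            self.name = name
        def play(self, sound):
            return f"{self.name}: playing {sound}"

    class MutedProxy:
        def play(self, sound):
            return "buffered (background virtual phone)"

    class DeviceNamespace:
        """Per-virtual-phone view of the audio device."""
        def __init__(self):
            self.backend = MutedProxy()
        def play(self, sound):
            return self.backend.play(sound)

    real_audio = Device("audio0")
    vp_work, vp_home = DeviceNamespace(), DeviceNamespace()

    def switch_foreground(ns):
        # Rebind the real device to the new foreground namespace.
        for vp in (vp_work, vp_home):
            vp.backend = MutedProxy()
        ns.backend = real_audio

    switch_foreground(vp_work)
    assert vp_work.play("ring") == "audio0: playing ring"
    assert vp_home.play("ring") == "buffered (background virtual phone)"
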
Breaking Up is Hard to Do: Security and Functionality in a Commodity Hypervisor
Cloud computing uses virtualization to lease small slices of large-scale datacenter facilities to individual paying customers. These multi-tenant environments, on which numerous large and popular web-based applications run today, are founded on the belief that the virtualization platform is sufficiently secure to prevent breaches of isolation between different users who are co-located on the same host. Hypervisors are believed to be trustworthy in this role because of their small size and narrow interfaces.

We observe that despite the modest footprint of the hypervisor itself, these platforms have a large aggregate trusted computing base (TCB) that includes a monolithic control VM with numerous interfaces exposed to VMs. We present Xoar, a modified version of Xen that retrofits the modularity and isolation principles used in micro-kernels onto a mature virtualization platform. Xoar breaks the control VM into single-purpose components called service VMs. We show that this componentized abstraction brings a number of benefits: sharing of service components by guests is configurable and auditable, making exposure to risk explicit, and access to the hypervisor is restricted to the least privilege required for each component. Microrebooting components at configurable frequencies reduces the temporal attack surface of individual components. Our approach incurs little performance overhead, and does not require functionality to be sacrificed or components to be rewritten from scratch.

CloudVisor: Retrofitting Protection of Virtual Machines in Multi-tenant Cloud with Nested Virtualization
Multi-tenant cloud, which usually leases resources in the form of virtual machines, has been commercially available for years. Unfortunately, with the adoption of commodity virtualized infrastructures, software stacks in typical multi-tenant clouds are non-trivially large and complex, and thus are prone to compromise or abuse from adversaries including the cloud operators, which may lead to leakage of security-sensitive data.

In this paper, we propose a transparent, backward-compatible approach that protects the privacy and integrity of customers’ virtual machines on commodity virtualized infrastructures, even in the face of a total compromise of the virtual machine monitor (VMM) and the management VM. The key to our approach is separating resource management from security protection in the virtualization layer. A tiny security monitor is introduced underneath the commodity VMM using nested virtualization and provides protection to the hosted VMs. As a result, our approach allows virtualization software (e.g., VMM, management VM and tools) to handle complex tasks of managing leased VMs for the cloud, without breaking the security of users’ data inside the VMs.

We have implemented a prototype by leveraging commercially available hardware support for virtualization. The prototype system, called CloudVisor, comprises only 5.5K LOC and supports the Xen VMM with multiple Linux and Windows guest OSes. Performance evaluation shows that CloudVisor incurs moderate slowdown for I/O-intensive applications and very small slowdown for other applications.

Atlantis: Robust, Extensible Execution Environments for Web Applications
Today’s web applications run inside a complex browser environment that is buggy, ill-specified, and implemented in different ways by different browsers. Thus, web applications that desire robustness must use a variety of conditional code paths and ugly hacks to deal with the vagaries of their runtime. Our new exokernel browser, called Atlantis, solves this problem by providing pages with an extensible execution environment. Atlantis defines a narrow API for basic services like collecting user input, exchanging network data, and rendering images. By composing these primitives, web pages can define custom, high-level execution environments. Thus, an application which does not want a dependence on Atlantis’ predefined web stack can selectively redefine components of that stack, or define markup formats and scripting languages that look nothing like the current browser runtime. Unlike prior microkernel browsers like OP, and unlike compile-to-JavaScript frameworks like GWT, Atlantis is the first browsing system to truly minimize a web page’s dependence on black box browser code. This makes it much easier to develop robust, secure web applications.
Tuesday 25th, 11:30-12:30
OS Architecture
PTask: Operating System Abstractions To Manage GPUs as Compute Devices
We propose a new set of OS abstractions to support GPUs and other accelerator devices as first class computing resources. These new abstractions, collectively called the PTask API, support a dataflow programming model. Because a PTask graph consists of OS-managed objects, the kernel has sufficient visibility and control to provide system-wide guarantees like fairness and performance isolation, and can streamline data movement in ways that are impossible under current GPU programming models.

Our experience developing the PTask API, along with a gestural interface on Windows 7 and a FUSE-based encrypted file system on Linux show that the PTask API can provide important system-wide guarantees where there were previously none, and can enable significant performance improvements, for example gaining a 5x improvement in maximum throughput for the gestural interface.
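
A hedged sketch of the dataflow model (the API names are invented; real ptasks run GPU kernels and the graph is kernel-managed): tasks with input and output ports are connected by channels, and a scheduler fires any task whose inputs are ready, so data moves between stages without round-trips through user code.

    from collections import deque

    class Channel:
        def __init__(self):
            self.q = deque()

    class PTask:
        def __init__(self, fn, inputs, output):
            self.fn, self.inputs, self.output = fn, inputs, output
        def ready(self):
            return all(ch.q for ch in self.inputs)
        def fire(self):
            args = [ch.q.popleft() for ch in self.inputs]
            self.output.q.append(self.fn(*args))

    def run(graph):
        # Simple scheduler: fire any task whose input ports all have data.
        progress = True
        while progress:
            progress = False
            for task in graph:
                if task.ready():
                    task.fire()
                    progress = True

    raw, scaled, result = Channel(), Channel(), Channel()
    graph = [PTask(lambda x: [v * 2 for v in x], [raw], scaled),
             PTask(lambda x: sum(x), [scaled], result)]
    raw.q.append([1, 2, 3])
    run(graph)
    assert result.q.popleft() == 12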

Logical Attestation: An Authorization Architecture for Trustworthy Computing
This paper describes the design and implementation of a new operating system authorization architecture to support trustworthy computing. Called logical attestation, this architecture provides a sound framework for reasoning about run time behavior of applications. Logical attestation is based on attributable, unforgeable statements about program properties, expressed in a logic. These statements are suitable for mechanical processing, proof construction, and verification; they can serve as credentials, support authorization based on expressive authorization policies, and enable remote principals to trust software components without restricting the local user’s choice of binary implementations.

We have implemented logical attestation in a new operating system called the Nexus. The Nexus executes natively on x86 platforms equipped with secure coprocessors. It runs native Linux applications and uses logical attestation to support new trustworthy-computing applications. When deployed on a trustworthy cloud-computing stack, logical attestation is efficient, achieves high performance, and can run applications that provide qualitative guarantees not possible with existing modes of attestation.

Tuesday 25th, 14:00-16:00
Detection and Tracing
Practical Software Model Checking via Dynamic Interface Reduction
Implementation-level software model checking explores the state space of a system implementation directly to find potential software defects without requiring any specification or modeling. Despite early successes, the effectiveness of this approach remains severely constrained due to poor scalability caused by state-space explosion. DEMETER makes software model checking more practical with the following contributions: (i) proposing dynamic interface reduction, a new state-space reduction technique, (ii) introducing a framework that enables dynamic interface reduction in an existing model checker with a reasonable amount of effort, and (iii) providing the framework with a distributed runtime engine that supports parallel distributed model checking.

We have integrated DEMETER into two existing model checkers, MACEMC and MODIST, each involving changes of around 1,000 lines of code. Compared to the original MACEMC and MODIST model checkers, our experiments have shown state-space reductions ranging from a factor of five up to five orders of magnitude in representative distributed applications such as PAXOS, Berkeley DB, CHORD, and PASTRY. As a result, when applied to a deployed PAXOS implementation, which has been running in production data centers for years to manage tens of thousands of machines, DEMETER manages to explore completely a logically meaningful state space that covers both phases of the PAXOS protocol, offering assurance of software reliability that was not possible before.

Detecting failures in distributed systems with the FALCON spy network
A common way for a distributed system to tolerate crashes is to explicitly detect them and then recover from them. Interestingly, detection can take much longer than recovery, as a result of many advances in recovery techniques, making failure detection the dominant factor in these systems’ unavailability when a crash occurs.

This paper presents the design, implementation, and evaluation of Falcon, a failure detector with several features. First, Falcon’s common-case detection time is sub-second, which keeps unavailability low. Second, Falcon is reliable: it never reports a process as down when it is actually up. Third, Falcon sometimes kills to achieve reliable detection but aims to kill the smallest needed component. Falcon achieves these features by coordinating a network of spies, each monitoring a layer of the system. Falcon’s main cost is a small amount of platform-specific logic. Falcon is thus the first failure detector that is fast, reliable, and viable. As such, it could change the way that a class of distributed systems is built.
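
A toy rendering of the spy-network idea with invented layer names: each spy either vouches for the layer it monitors or not, and Falcon kills the smallest suspect component before reporting it down, so down reports are never wrong.

    LAYERS = ["app", "os", "vmm"]     # innermost to outermost

    def detect(status, kill):
        """status: layer -> True (spy vouches up) / False (suspect).
        Returns 'UP' or names the layer killed to certify the crash."""
        for layer in LAYERS:
            if status[layer]:
                continue
            kill(layer)               # make the suspicion true before reporting
            return f"DOWN (killed {layer})"
        return "UP"

    killed = []
    report = detect({"app": False, "os": True, "vmm": True}, kill=killed.append)
    assert report == "DOWN (killed app)" and killed == ["app"]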

Secure Network Provenance
This paper introduces secure network provenance (SNP), a novel technique that enables networked systems to explain to their operators why they are in a certain state — e.g., why a suspicious routing table entry is present on a certain router, or where a given cache entry originated. SNP provides network forensics capabilities by permitting operators to track down faulty or misbehaving nodes, and to assess the damage such nodes may have caused to the rest of the system. SNP is designed for adversarial settings and is robust to manipulation; its tamper-evident properties ensure that operators can detect when compromised nodes lie or falsely implicate correct nodes.

We also present the design of SNooPy, a general-purpose SNP system. To demonstrate that SNooPy is practical, we apply it to three example applications: the Quagga BGP daemon, a declarative implementation of Chord, and Hadoop MapReduce. Our results indicate that SNooPy can efficiently explain state in an adversarial setting, that it can be applied with minimal effort, and that its costs are low enough to be practical.

Fay: Extensible Distributed Tracing from Kernels to Clusters
Fay is a flexible platform for the efficient collection, processing, and analysis of software execution traces. Fay provides dynamic tracing through use of runtime instrumentation and distributed aggregation within machines and across clusters. At the lowest level, Fay can be safely extended with new tracing primitives, including even untrusted, fully-optimized machine code, and Fay can be applied to running user-mode or kernel-mode software without compromising system stability. At the highest level, Fay provides a unified, declarative means of specifying what events to trace, as well as the aggregation, processing, and analysis of those events.

We have implemented the Fay tracing platform for Windows and integrated it with two powerful, expressive systems for distributed programming. Our implementation is easy to use, can be applied to unmodified production systems, and provides primitives that allow the overhead of tracing to be greatly reduced, compared to previous dynamic tracing platforms. To show the generality of Fay tracing, we reimplement, in experiments, a range of tracing strategies and several custom mechanisms from existing tracing frameworks.

Fay shows that modern techniques for high-level querying and data-parallel processing of disaggregated data streams are well suited to comprehensive monitoring of software execution in distributed systems. Revisiting a lesson from the late 1960s, Fay also demonstrates the efficiency and extensibility benefits of using safe, statically-verified machine code as the basis for low-level execution tracing. Finally, Fay establishes that, by automatically deriving optimized query plans and code for safe extensions, the expressiveness and performance of high-level tracing queries can equal or even surpass that of specialized monitoring tools.
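
A hedged sketch of the two-stage, declarative flavor of Fay-style tracing (all identifiers invented; Fay uses safe in-kernel extensions and LINQ-style queries, not Python): probes compute per-machine partial aggregates that a cluster stage then merges.

    from collections import Counter

    def probe(events, select):
        """Per-machine stage: filter/transform events at the probe point."""
        return Counter(select(e) for e in events)

    def combine(partials):
        """Cluster stage: merge per-machine partial aggregates."""
        total = Counter()
        for p in partials:
            total.update(p)
        return total

    # "Count syscalls by name, across machines" as a tiny query:
    machine1 = [("read", 0.1), ("write", 0.2), ("read", 0.3)]
    machine2 = [("read", 0.4)]
    result = combine([probe(m, select=lambda e: e[0]) for m in (machine1, machine2)])
    assert result == Counter({"read": 3, "write": 1})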

Tuesday 25th, 16:30-18:00
Work in Progress
Wednesday 26th, 09:00-11:00
Threads and Races
Dthreads: Efficient Deterministic Multithreading
Multithreaded programming is notoriously difficult to get right. A key problem is non-determinism, which complicates debugging, testing, and reproducing errors. One way to simplify multithreaded programming is to enforce deterministic execution, but current deterministic systems for C/C++ are incomplete or impractical. These systems require program modification, do not ensure determinism in the presence of data races, do not work with general-purpose multithreaded programs, or run up to 8.4x slower than pthreads.

This paper presents DTHREADS, an efficient deterministic multithreading system for unmodified C/C++ applications that replaces the pthreads library. DTHREADS enforces determinism in the face of data races and deadlocks. DTHREADS works by exploding multithreaded applications into multiple processes, with private, copy-on-write mappings to shared memory. It uses standard virtual memory protection to track writes, and deterministically orders updates by each thread. By separating updates from different threads, DTHREADS has the additional benefit of eliminating false sharing. Experimental results show that DTHREADS substantially outperforms a state-of-the-art deterministic runtime system, and for a majority of the benchmarks evaluated here, matches and occasionally exceeds the performance of pthreads.
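
A sequential toy simulation of the commit discipline (a plain dict stands in for copy-on-write pages; this is not a pthreads replacement): threads work on private copies of shared state, and their write diffs are published in a fixed thread order, so racy updates always resolve the same way.

    shared = {"x": 0}

    def run_thread(fn, snapshot):
        local = dict(snapshot)      # private working copy of shared memory
        fn(local)
        # Diff against the snapshot: only modified locations are published.
        return {k: v for k, v in local.items() if snapshot.get(k) != v}

    def parallel_region(thread_fns):
        snapshot = dict(shared)
        diffs = [run_thread(fn, snapshot) for fn in thread_fns]
        for diff in diffs:          # deterministic order: thread 0, 1, 2, ...
            shared.update(diff)

    parallel_region([lambda m: m.update(x=m["x"] + 1),   # thread 0
                     lambda m: m.update(x=99)])          # thread 1 always wins
    assert shared["x"] == 99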

Efficient Deterministic Multithreading through Schedule Relaxation
Deterministic multithreading (DMT) eliminates many pernicious software problems caused by nondeterminism. It works by constraining a program to repeat the same thread interleavings, or schedules, when given the same input. Despite much recent research, it remains an open challenge to build both deterministic and efficient DMT systems for general programs on commodity hardware. To deterministically resolve a data race, a DMT system must enforce a deterministic schedule of shared memory accesses, or mem-schedule, which can incur prohibitive overhead. This overhead can be avoided by using schedules consisting only of synchronization operations, or sync-schedules; however, a sync-schedule is deterministic only for race-free programs, and most programs have races.

Our key insight is that races tend to occur only within minor portions of an execution, and that the dominant majority of the execution is still race-free. Thus, we can resort to a mem-schedule only for the “racy” portions and enforce a sync-schedule otherwise, combining the efficiency of sync-schedules with the determinism of mem-schedules. We call these combined schedules hybrid schedules.

Based on this insight, we have built PEREGRINE, an efficient deterministic multithreading system. When a program first runs on an input, PEREGRINE records an execution trace. It then relaxes this trace into a hybrid schedule and reuses the schedule on future compatible inputs efficiently and deterministically. PEREGRINE further improves efficiency with two new techniques: determinism-preserving slicing to generalize a schedule to more inputs while preserving determinism, and schedule-guided simplification to precisely analyze a program according to a specific schedule. Our evaluation on a diverse set of programs shows that PEREGRINE is deterministic and efficient, and can frequently reuse schedules for half of the evaluated programs.

Pervasive Detection of Process Races in Deployed Systems
Process races occur when multiple processes access shared operating system resources, such as files, without proper synchronization. We present the first study of real process races and the first system designed to detect them. Our study of hundreds of applications shows that process races are numerous, difficult to debug, and a real threat to reliability. To address this problem, we created RACEPRO, a system for automatically detecting these races. RACEPRO checks deployed systems in vivo by recording live executions and then deterministically replaying and checking them later. This approach increases checking coverage beyond the configurations or executions covered by software vendors or beta testing sites. RACEPRO records multiple processes, detects races in the recording among system calls that may concurrently access shared kernel objects, then tries different execution orderings of such system calls to determine which races are harmful and result in failures. To simplify race detection, RACEPRO models under-specified system calls based on load and store micro-operations. To reduce false positives and negatives, RACEPRO uses a replay and go-live mechanism to distill harmful races from benign ones. We have implemented RACEPRO in Linux, shown that it imposes only modest recording overhead, and used it to detect a number of previously unknown bugs in real applications caused by process races.
Detecting and Surviving Data Races using Complementary Schedules
Data races are a common source of errors in multithreaded programs. In this paper, we show how to protect a program from data race errors at runtime by executing multiple replicas of the program with complementary thread schedules. Complementary schedules are a set of replica thread schedules crafted to ensure that replicas diverge only if a data race occurs and to make it very likely that harmful data races cause divergences. Our system, called Frost, uses complementary schedules to cause at least one replica to avoid the order of racing instructions that leads to incorrect program execution for most harmful data races. Frost introduces outcome-based race detection, which detects data races by comparing the state of replicas executing complementary schedules. We show that this method is substantially faster than existing dynamic race detectors for unmanaged code. To help programs survive bugs in production, Frost also diagnoses the data race bug and selects an appropriate recovery strategy, such as choosing a replica that is likely to be correct or executing more replicas to gather additional information.

Frost controls the thread schedules of replicas by running all threads of a replica non-preemptively on a single core. To scale the program to multiple cores, Frost runs a third replica in parallel to generate checkpoints of the program’s likely future states — these checkpoints let Frost divide program execution into multiple epochs, which it then runs in parallel.

We evaluate Frost using 11 real data race bugs in desktop and server applications. Frost both detects and survives all of these data races. Since Frost runs three replicas, its utilization cost is 3x. However, if there are spare cores to absorb this increased utilization, Frost adds only 3–12% overhead to application runtime.
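
A toy, deterministic illustration of outcome-based detection (the "threads" here are simulated, not preemptive): run two replicas of the same two-thread program under complementary schedules and flag a race when their final states diverge.

    def replica(order):
        state = {"x": 0}
        ops = {"A": lambda s: s.update(x=s["x"] + 1),   # thread A: x += 1
               "B": lambda s: s.update(x=s["x"] * 2)}   # thread B: x *= 2
        for thread in order:
            ops[thread](state)
        return state

    r1 = replica(["A", "B"])    # schedule 1: A before B -> x == 2
    r2 = replica(["B", "A"])    # complementary schedule: B before A -> x == 1
    raced = (r1 != r2)
    assert raced                # divergence: these updates race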

Wednesday 26th, 11:30-12:30
Geo-Replication
Transactional storage for geo-replicated systems
We describe the design and implementation of Walter, a key-value store that provides transactions and replicates data across distant sites. A key feature behind Walter is a new property called Parallel Snapshot Isolation (PSI). PSI allows Walter to replicate data asynchronously across sites, while providing strong guarantees at a single site. PSI precludes write-write conflicts, so that developers need not worry about conflict-resolution logic. To prevent write-write conflicts and implement PSI, Walter uses two new and simple techniques: preferred sites and counting sets. We use Walter to build a social networking application and port a Twitter-like application.
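
A minimal sketch of a counting set (cset), one of Walter's two techniques; the method names here are invented. Each element maps to a signed count, so adds and removes commute, and replicas that apply the same operations in different orders converge without write-write conflicts.

    from collections import Counter

    class CountingSet:
        def __init__(self):
            self.counts = Counter()
        def add(self, x):
            self.counts[x] += 1
        def remove(self, x):
            self.counts[x] -= 1
        def members(self):
            return {x for x, c in self.counts.items() if c > 0}
        def merge(self, ops):
            # Apply another site's operations; order does not matter.
            for op, x in ops:
                (self.add if op == "add" else self.remove)(x)

    # Two sites apply the same operations in different orders and converge.
    a, b = CountingSet(), CountingSet()
    a.merge([("add", "alice"), ("remove", "alice"), ("add", "bob")])
    b.merge([("remove", "alice"), ("add", "bob"), ("add", "alice")])
    assert a.members() == b.members() == {"bob"}
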
Don’t Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS
Geo-replicated, distributed data stores that support complex online applications, such as social networks, must provide an “always-on” experience where operations always complete with low latency. Today’s systems often sacrifice strong consistency to achieve these goals, exposing inconsistencies to their clients and necessitating complex application logic. In this paper, we identify and define a consistency model—causal consistency with convergent conflict handling, or causal+—that is the strongest achieved under these constraints.

We present the design and implementation of COPS, a key-value store that delivers this consistency model across the wide area. A key contribution of COPS is its scalability: it can enforce causal dependencies between keys stored across an entire cluster, rather than on a single server as in previous systems. The central approach in COPS is tracking and explicitly checking whether causal dependencies between keys are satisfied in the local cluster before exposing writes. Further, in COPS-GT, we introduce get transactions in order to obtain a consistent view of multiple keys without locking or blocking. Our evaluation shows that COPS completes operations in less than a millisecond, provides throughput similar to previous systems when using one server per cluster, and scales well as we increase the number of servers in each cluster. It also shows that COPS-GT provides similar latency, throughput, and scaling to COPS for common workloads.
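
A hedged sketch of dependency checking before exposing replicated writes (message shapes and names are invented): a replicated put carries the key/version pairs it causally depends on, and a cluster applies it only once every dependency is locally visible.

    class Cluster:
        def __init__(self):
            self.store = {}        # key -> version (values elided for brevity)
            self.pending = []      # replicated writes awaiting dependencies

        def apply_replicated(self, key, version, deps):
            self.pending.append((key, version, deps))
            self._drain()

        def _satisfied(self, deps):
            return all(self.store.get(k, -1) >= v for k, v in deps)

        def _drain(self):
            progress = True
            while progress:
                progress = False
                for w in list(self.pending):
                    key, version, deps = w
                    if self._satisfied(deps):
                        self.store[key] = version   # now safe to expose
                        self.pending.remove(w)
                        progress = True

    c = Cluster()
    # A reply arrives before the post it causally depends on:
    c.apply_replicated("reply:9", 1, deps=[("post:4", 1)])
    assert "reply:9" not in c.store                  # held back
    c.apply_replicated("post:4", 1, deps=[])
    assert "reply:9" in c.store and "post:4" in c.store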