ACM Symposium on Operating Systems Principles
Abstracts, 1967 to 2015
25th SOSP — October 4-7, 2015 — Monterey, California, USA
2015
Introduction
Monday 5th, 8:30am - 9am
Welcome and Awards
Monday 5th, 9am - 10:30am
Formal Systems
2015
IronFleet: Proving Practical Distributed Systems Correct
Distributed systems are notorious for harboring subtle bugs. Verification can, in principle, eliminate these bugs a priori, but verification has historically been difficult to apply at full-program scale, much less distributed-system scale.

We describe a methodology for building practical and provably correct distributed systems based on a unique blend of TLA-style state-machine refinement and Hoare-logic verification. We demonstrate the methodology on a complex implementation of a Paxos-based replicated state machine library and a lease-based sharded key-value store. We prove that each obeys a concise safety specification, as well as desirable liveness requirements. Each implementation achieves performance competitive with a reference system. With our methodology and lessons learned, we aim to raise the standard for distributed systems from “tested” to “correct.”

2015
Using Crash Hoare Logic for Certifying the FSCQ File System
FSCQ is the first file system with a machine-checkable proof (using the Coq proof assistant) that its implementation meets its specification and whose specification includes crashes. FSCQ provably avoids bugs that have plagued previous file systems, such as performing disk writes without sufficient barriers or forgetting to zero out directory blocks. If a crash happens at an inopportune time, these bugs can lead to data loss. FSCQ’s theorems prove that, under any sequence of crashes followed by reboots, FSCQ will recover the file system correctly without losing data.

To state FSCQ’s theorems, this paper introduces the Crash Hoare logic (CHL), which extends traditional Hoare logic with a crash condition, a recovery procedure, and logical address spaces for specifying disk states at different abstraction levels. CHL also reduces the proof effort for developers through proof automation. Using CHL, we developed, specified, and proved the correctness of the FSCQ file system. Although FSCQ’s design is relatively simple, experiments with FSCQ running as a user-level file system show that it is sufficient to run Unix applications with usable performance. FSCQ’s specifications and proofs required significantly more work than the implementation, but the work was manageable even for a small team of a few researchers.
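
As a rough intuition for what a crash condition adds to an ordinary Hoare triple, the following toy Python simulation (not CHL itself; FSCQ's guarantees are machine-checked proofs in Coq, not tests) injects a crash at every intermediate disk state of a small write-ahead-logged update, runs a recovery procedure, and checks that recovery always restores either the old or the new contents.

```python
# Toy illustration of a "crash condition": a two-block update made atomic via a
# tiny write-ahead log, with a crash injected after every individual disk write.
# This is only a testing sketch of the idea, not how FSCQ is verified.
import copy

def logged_write(disk, ops):
    """Apply ops (addr -> value) via a log, yielding the disk after every write
    so the test below can crash at each intermediate state."""
    for addr, val in ops.items():          # 1. append intent records
        disk['log'].append((addr, val))
        yield disk
    disk['committed'] = True               # 2. commit record
    yield disk
    for addr, val in ops.items():          # 3. apply in place
        disk['data'][addr] = val
        yield disk
    disk['log'], disk['committed'] = [], False   # 4. truncate the log
    yield disk

def recover(disk):
    """Recovery procedure: replay the log iff the commit record made it to disk."""
    if disk['committed']:
        for addr, val in disk['log']:
            disk['data'][addr] = val
    disk['log'], disk['committed'] = [], False
    return disk

old = {'data': {0: 'old0', 1: 'old1'}, 'log': [], 'committed': False}
new = {0: 'new0', 1: 'new1'}
# Crash condition: after any crash followed by recovery, the data blocks hold
# either the old contents or the new contents, never a mix.
for i, state in enumerate(logged_write(copy.deepcopy(old), dict(new))):
    recovered = recover(copy.deepcopy(state))
    assert recovered['data'] in (old['data'], new), f"torn update at crash point {i}"
print("every crash point recovers to old or new contents")
```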

2015
SibylFS: formal specification and oracle-based testing for POSIX and real-world file systems
Systems depend critically on the behaviour of file systems, but that behaviour differs in many details, both between implementations and between each implementation and the POSIX (and other) prose specifications. Building robust and portable software requires understanding these details and differences, but there is currently no good way to systematically describe, investigate, or test file system behaviour across this complex multi-platform interface.

In this paper we show how to characterise the envelope of allowed behaviour of file systems in a form that enables practical and highly discriminating testing. We give a mathematically rigorous model of file system behaviour, SibylFS, that specifies the range of allowed behaviours of a file system for any sequence of the system calls within our scope, and that can be used as a test oracle to decide whether an observed trace is allowed by the model, both for validating the model and for testing file systems against it. SibylFS is modular enough to not only describe POSIX, but also specific Linux, OS X and FreeBSD behaviours. We complement the model with an extensive test suite of over 21,000 tests; this can be run on a target file system and checked in less than 5 minutes, making it usable in practice. Finally, we report experimental results for around 40 configurations of many file systems, identifying many differences and some serious flaws.
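
The oracle-based testing style can be sketched in miniature: a model maps each call, in an abstract state, to the set of results the specification allows, and the oracle flags any observed result outside that set. The two calls and return codes below are invented and cover a tiny fraction of what SibylFS models.

```python
# Minimal sketch of oracle-based trace checking (not SibylFS's actual model).
# allowed_results() returns the set of outcomes the specification permits for a
# call in a given abstract state; the oracle rejects any observation outside it.

def allowed_results(state, call):
    op, path = call
    if op == "mkdir":
        # POSIX allows either success or EEXIST depending on prior state.
        return {"EEXIST"} if path in state else {"OK"}
    if op == "unlink":
        return {"OK"} if path in state else {"ENOENT"}
    raise ValueError(f"call outside the modelled scope: {op}")

def step(state, call, result):
    op, path = call
    if result == "OK":
        state = state | {path} if op == "mkdir" else state - {path}
    return state

def oracle(trace, state=frozenset()):
    """Return the index of the first disallowed observation, or None."""
    for i, (call, result) in enumerate(trace):
        if result not in allowed_results(state, call):
            return i
        state = step(state, call, result)
    return None

observed = [(("mkdir", "/a"), "OK"),
            (("mkdir", "/a"), "EEXIST"),
            (("unlink", "/a"), "OK"),
            (("unlink", "/a"), "OK")]      # last call should have been ENOENT
print(oracle(observed))                    # -> 3
```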

Monday 5th, 11am - 12:30pm
Distributed Transactions
2015
No compromises: distributed transactions with consistency, availability, and performance
Transactions with strong consistency and high availability simplify building and reasoning about distributed systems. However, previous implementations performed poorly. This forced system designers to avoid transactions completely, to weaken consistency guarantees, or to provide single-machine transactions that require programmers to partition their data. In this paper, we show that there is no need to compromise in modern data centers. We show that a main memory distributed computing platform called FaRM can provide distributed transactions with strict serializability, high performance, durability, and high availability. FaRM achieves a peak throughput of 140 million TATP transactions per second on 90 machines with a 4.9 TB database, and it recovers from a failure in less than 50 ms. Key to achieving these results was the design of new transaction, replication, and recovery protocols from first principles to leverage commodity networks with RDMA and a new, inexpensive approach to providing non-volatile DRAM.
2015
Implementing Linearizability at Large Scale and Low Latency
Linearizability is the strongest form of consistency for concurrent systems, but most large-scale storage systems settle for weaker forms of consistency. RIFL provides a general-purpose mechanism for converting at-least-once RPC semantics to exactly-once semantics, thereby making it easy to turn non-linearizable operations into linearizable ones. RIFL is designed for large-scale systems and is lightweight enough to be used in low-latency environments. RIFL handles data migration by associating linearizability metadata with objects in the underlying store and migrating metadata with the corresponding objects. It uses a lease mechanism to implement garbage collection for metadata. We have implemented RIFL in the RAMCloud storage system and used it to make basic operations such as writes and atomic increments linearizable; RIFL adds only 530 ns to the 13.5 µs base latency for durable writes. We also used RIFL to construct a new multi-object transaction mechanism in RAMCloud; RIFL’s facilities significantly simplified the transaction implementation. The transaction mechanism can commit simple distributed transactions in about 20 µs and it outperforms the H-Store main-memory database system for the TPC-C benchmark.
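
The core mechanism can be sketched in a few lines: tag each RPC with a (client id, sequence number) pair and record its result, so retries return the saved answer instead of re-executing. The Python below is only an illustrative single-node toy with invented names; real RIFL makes completion records durable with the object, migrates them with the data, and garbage-collects them using leases.

```python
# Simplified sketch of converting at-least-once RPCs to exactly-once, in the
# spirit of RIFL (completion records keyed by client id + sequence number).

class ExactlyOnceServer:
    def __init__(self):
        self.store = {}              # the underlying key-value data
        self.completions = {}        # (client_id, seq) -> saved result

    def increment(self, client_id, seq, key, delta):
        rpc_id = (client_id, seq)
        if rpc_id in self.completions:        # retry of an already-executed RPC
            return self.completions[rpc_id]   # re-return, never re-execute
        self.store[key] = self.store.get(key, 0) + delta
        result = self.store[key]
        self.completions[rpc_id] = result     # completion record; in RIFL this is
        return result                         # kept durably with the object

server = ExactlyOnceServer()
print(server.increment("clientA", 1, "x", 5))   # 5
print(server.increment("clientA", 1, "x", 5))   # duplicate retry -> still 5
print(server.increment("clientA", 2, "x", 5))   # new RPC -> 10
```
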
2015
Fast In-memory Transaction Processing using RDMA and HTM
We present DrTM, a fast in-memory transaction processing system that exploits advanced hardware features (i.e., RDMA and HTM) to improve latency and throughput by over one order of magnitude compared to state-of-the-art distributed transaction systems. The high performance of DrTM is enabled by mostly offloading concurrency control within a local machine into HTM and leveraging the strong consistency between RDMA and HTM to ensure serializability among concurrent transactions across machines. We further build an efficient hash table for DrTM by leveraging HTM and RDMA to simplify the design and notably improve the performance. We describe how DrTM supports common database features like read-only transactions and logging for durability. Evaluation using typical OLTP workloads including TPC-C and SmallBank shows that DrTM scales well on a 6-node cluster and achieves over 5.52 and 138 million transactions per second for TPC-C and SmallBank respectively. This outperforms a state-of-the-art distributed transaction system (namely Calvin) by at least 17.9× for TPC-C.
Monday 5th, 2pm - 3:30pm
Distributed Systems
2015
Paxos Made Transparent
State machine replication (SMR) leverages distributed consensus protocols such as PAXOS to keep multiple replicas of a program consistent in the face of replica failures or network partitions. This fault tolerance makes it enticing to implement a principled SMR system that replicates general programs, especially server programs that demand high availability. Unfortunately, SMR assumes deterministic execution, but most server programs are multi-threaded and thus non-deterministic. Moreover, existing SMR systems provide narrow state machine interfaces to suit specific programs, and it can be quite strenuous and error-prone to orchestrate a general program into these interfaces.

This paper presents CRANE, an SMR system that transparently replicates general server programs. CRANE achieves distributed consensus on the socket API, a common interface to almost all server programs. It leverages deterministic multi-threading (specifically, our prior system PARROT) to make multi-threaded replicas deterministic. It uses a new technique we call time bubbling to efficiently tackle a difficult challenge of non-deterministic network input timing. Evaluation on five widely used server programs (e.g., Apache, ClamAV, and MySQL) shows that CRANE is easy to use, has moderate overhead, and is robust. CRANE’s source code is at github.com/columbia/crane.
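
A toy rendering of the time-bubbling idea, with invented parameters: the logical input sequence that all replicas replay contains either real socket events or explicit "bubbles" inserted whenever no input arrives within a wait timeout, so every deterministic replica observes the same logical timing of inputs. (The consensus step that orders this sequence across replicas is omitted.)

```python
# Toy sketch of time bubbling: silence on the input side is turned into
# explicit bubble events so replicas agree on input timing, not just content.
import queue

WAIT_TIMEOUT = 0.1          # seconds of silence before a bubble is inserted
BUBBLE = ("BUBBLE", WAIT_TIMEOUT)

def order_inputs(incoming, stop_after_bubbles=3):
    """Produce the logical input sequence that replicas will replay."""
    ordered, quiet_rounds = [], 0
    while quiet_rounds < stop_after_bubbles:
        try:
            event = incoming.get(timeout=WAIT_TIMEOUT)
            ordered.append(("INPUT", event))
            quiet_rounds = 0
        except queue.Empty:
            ordered.append(BUBBLE)          # logical time advances identically
            quiet_rounds += 1               # on every replica during silence
    return ordered

q = queue.Queue()
for msg in ["GET /a", "GET /b"]:
    q.put(msg)
print(order_inputs(q))   # two INPUT events followed by three bubbles
```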

2015
E2: A Framework for NFV Applications
By moving network appliance functionality from proprietary hardware to software, Network Function Virtualization promises to bring the advantages of cloud computing to network packet processing. However, the evolution of cloud computing (particularly for data analytics) has greatly benefited from application-independent methods for scaling and placement that achieve high efficiency while relieving programmers of these burdens. NFV has no such general management solutions. In this paper, we present a scalable and application-agnostic scheduling framework for packet processing, and compare its performance to current approaches.
2015
Vuvuzela: Scalable Private Messaging Resistant to Traffic Analysis
Private messaging over the Internet has proven challenging to implement, because even if message data is encrypted, it is difficult to hide metadata about who is communicating in the face of traffic analysis. Systems that offer strong privacy guarantees, such as Dissent, scale to only several thousand clients, because they use techniques with superlinear cost in the number of clients (e.g., each client broadcasts their message to all other clients). On the other hand, scalable systems, such as Tor, do not protect against traffic analysis, making them ineffective in an era of pervasive network monitoring.

Vuvuzela is a new scalable messaging system that offers strong privacy guarantees, hiding both message data and metadata. Vuvuzela is secure against adversaries that observe and tamper with all network traffic, and that control all nodes except for one server. Vuvuzela’s key insight is to minimize the number of variables observable by an attacker, and to use differential privacy techniques to add noise to all observable variables in a way that provably hides information about which users are communicating. Vuvuzela has a linear cost in the number of clients, and experiments show that it can achieve a throughput of 68,000 messages per second for 1 million users with a 37-second end-to-end latency on commodity servers.
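
The flavor of the noise-adding step can be sketched as a standard Laplace mechanism over the counts an adversary might observe each round; the observable names and parameters below are invented, and the real system's servers add noise by generating cover traffic rather than by perturbing a published number.

```python
# Illustrative sketch of the differential-privacy idea behind Vuvuzela: perturb
# the few variables an adversary can observe (e.g., how many dead drops were
# accessed once vs. twice in a round) with Laplace noise so that any single
# user's behavior is hidden. Parameters are made up for the example.
import random

def laplace(scale):
    # A Laplace(0, scale) sample as the difference of two exponentials.
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def noisy_count(true_count, sensitivity=1.0, epsilon=0.3):
    # Standard Laplace mechanism: noise scale = sensitivity / epsilon.
    return max(0, round(true_count + laplace(sensitivity / epsilon)))

round_observables = {"single_access_drops": 41_207, "paired_drops": 9_530}
released = {k: noisy_count(v) for k, v in round_observables.items()}
print(released)   # the counts the adversary sees, perturbed to hide individuals
```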

Monday 5th, 4pm - 5:30pm
Concurrency and Performance
2015
Parallelizing User-Defined Aggregations using Symbolic Execution
User-defined aggregations (UDAs) are integral to large-scale data-processing systems, such as MapReduce and Hadoop, because they let programmers express application-specific aggregation logic. System-supported associative aggregations, such as counting or finding the maximum, are data-parallel and thus these systems optimize their execution, leading in many cases to orders-of-magnitude performance improvements. These optimizations, however, are not possible on arbitrary UDAs.

This paper presents SYMPLE, a system for performing MapReduce-style group-by-aggregate queries that automatically parallelizes UDAs. Users specify UDAs using stylized C++ code with possible loop-carried dependences. SYMPLE parallelizes these UDAs by breaking dependences using symbolic execution, where unresolved dependences are treated as symbolic values and the SYMPLE runtime partially evaluates the resulting symbolic expressions on concrete input. Programmers write UDAs using SYMPLE’s symbolic data types, which look and behave like standard C++ types. These data types (i) encode specialized decision procedures for efficient symbolic execution and (ii) generate compact symbolic expressions for efficient network transfers. Evaluation on both Amazon’s Elastic cloud and a private 380-node Hadoop cluster housing terabytes of data demonstrates that SYMPLE reduces network communication up to several orders of magnitude and job latency by as much as 5.9× for a representative set of queries.
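
A minimal way to see the trick is a UDA whose state update is affine, so running a partition on a symbolic initial state collapses it to a (scale, shift) summary that can later be composed in input order. The sketch below uses a made-up decayed-sum UDA in plain Python rather than SYMPLE's stylized C++ symbolic types, but it has the same shape.

```python
# Toy illustration of the SYMPLE idea: evaluate each partition of a loop-carried
# aggregation on a *symbolic* initial state, reducing it to a small summary,
# then compose the per-partition summaries in order.

DECAY = 0.9

def uda_update(state, x):
    # User-defined, loop-carried aggregation: exponentially decayed sum.
    return state * DECAY + x

def run_partition_symbolically(xs):
    # Treat the incoming state as a symbol s; after the partition the state is
    # scale*s + shift for concrete scale and shift.
    scale, shift = 1.0, 0.0
    for x in xs:
        scale, shift = scale * DECAY, shift * DECAY + x
    return scale, shift

def compose(parts, initial_state=0.0):
    state = initial_state
    for scale, shift in parts:          # cheap sequential pass over summaries
        state = scale * state + shift
    return state

data = list(range(20))
partitions = [data[0:7], data[7:13], data[13:20]]        # processed in parallel
parallel = compose([run_partition_symbolically(p) for p in partitions])

sequential = 0.0
for x in data:                                           # reference: one sequential run
    sequential = uda_update(sequential, x)
assert abs(parallel - sequential) < 1e-9
print(parallel)
```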

2015
Read-Log-Update: A Lightweight Synchronization Mechanism for Concurrent Programming
This paper introduces read-log-update (RLU), a novel extension of the popular read-copy-update (RCU) synchronization mechanism that supports scalability of concurrent code by allowing unsynchronized sequences of reads to execute concurrently with updates. RLU overcomes the major limitations of RCU by allowing, for the first time, concurrency of reads with multiple writers, and providing automation that eliminates most of the programming difficulty associated with RCU programming. At the core of the RLU design is a logging and coordination mechanism inspired by software transactional memory algorithms. In a collection of micro-benchmarks in both the kernel and user space, we show that RLU both simplifies the code and matches or improves on the performance of RCU. As an example of its power, we show how it readily scales the performance of a real-world application, Kyoto Cabinet, a truly difficult concurrent programming feat to attempt in general, and in particular with classic RCU.
2015
COZ: Finding Code that Counts with Causal Profiling
Improving performance is a central concern for software developers. To locate optimization opportunities, developers rely on software profilers. However, these profilers only report where programs spent their time: optimizing that code may have no impact on performance. Past profilers thus both waste developer time and make it difficult for them to uncover significant optimization opportunities.

This paper introduces causal profiling. Unlike past profiling approaches, causal profiling indicates exactly where programmers should focus their optimization efforts, and quantifies their potential impact. Causal profiling works by running performance experiments during program execution. Each experiment calculates the impact of any potential optimization by virtually speeding up code: inserting pauses that slow down all other code running concurrently. The key insight is that this slowdown has the same relative effect as running that line faster, thus “virtually” speeding it up.

We present COZ, a causal profiler, which we evaluate on a range of highly-tuned applications: Memcached, SQLite, and the PARSEC benchmark suite. COZ identifies previously unknown optimization opportunities that are both significant and targeted. Guided by COZ, we improve the performance of Memcached by 9%, SQLite by 25%, and accelerate six PARSEC applications by as much as 68%; in most cases, these optimizations involve modifying under 10 lines of code.

Tuesday 6th, 9am - 10:30am
Energy Aware Systems
2015
JouleGuard: Energy Guarantees for Approximate Applications
Energy consumption limits battery life in mobile devices and increases costs for servers and data centers. Approximate computing addresses energy concerns by allowing applications to trade accuracy for decreased energy consumption. Approximation frameworks can guarantee accuracy or performance and generally reduce energy usage; however, they provide no energy guarantees. Such guarantees would be beneficial for users who have a fixed energy budget and want to maximize application accuracy within that budget. We address this need by presenting JouleGuard: a runtime control system that coordinates approximate applications with system resource usage to provide control theoretic formal guarantees of energy consumption, while maximizing accuracy. We implement JouleGuard and test it on three different platforms (a mobile, tablet, and server) with eight different approximate applications created from two different frameworks. We find that JouleGuard respects energy budgets, provides near optimal accuracy, adapts to phases in application workload, and provides better outcomes than application approximation or system resource adaptation alone. JouleGuard is general with respect to the applications and systems it controls, making it a suitable runtime for a number of approximate computing frameworks.
2015
Software Defined Batteries
Different battery chemistries perform better on different axes, such as energy density, cost, peak power, recharge time, longevity, and efficiency. Mobile system designers are constrained by existing technology, and are forced to select a single chemistry that best meets their diverse needs, thereby compromising other desirable features. In this paper, we present a new hardware-software system, called Software Defined Battery (SDB), which allows system designers to integrate batteries of different chemistries. SDB exposes APIs to the operating system which control the amount of charge flowing in and out of each battery, enabling it to dynamically trade one battery property for another depending on application and/or user needs. Using microbenchmarks from our prototype SDB implementation, and through detailed simulations, we demonstrate that it is possible to combine batteries which individually excel along different axes to deliver an enhanced collective performance when compared to traditional battery packs.
2015
Drowsy Power Management
Portable computing devices have fast multi-core processors, large memories, and many on-board sensors and radio interfaces, but are often limited by their energy consumption. Traditional power management subsystems have been extended for smartphones and other portable devices, with the intention of maximizing the time that the devices are in a low-power “sleep” state. The approaches taken by these subsystems prove inefficient for many short-lived tasks common to portable devices, e.g., querying a sensor or polling a cloud service.

We introduce Drowsy, a new power management state that replaces “awake.” In the Drowsy state, only the minimal set of system components required for the pending task(s) is woken up, rather than the whole device. Drowsy constructs and maintains this minimal set by dynamically and continuously inferring dependencies between system components at run-time. We have implemented Drowsy within Android, and our results show a significant improvement (1.5-5×) in energy efficiency for common short-lived tasks.
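
A rough sketch of the wake-set computation, assuming the component dependency graph has already been inferred (the component names below are invented):

```python
# Minimal sketch of the Drowsy idea: wake only the components a task actually
# depends on, rather than the whole device. Drowsy infers the dependency graph
# dynamically at run time; this toy simply takes one as given.

DEPENDS_ON = {                      # component -> components it needs
    "gps":              {"location_service"},
    "location_service": {"cpu"},
    "wifi":             {"cpu"},
    "cloud_poll":       {"wifi", "cpu"},
    "camera":           {"cpu", "flash"},
    "flash":            set(),
    "cpu":              set(),
}

def minimal_wake_set(task_components):
    """Transitive closure of the components a pending task touches."""
    wake, frontier = set(), list(task_components)
    while frontier:
        c = frontier.pop()
        if c not in wake:
            wake.add(c)
            frontier.extend(DEPENDS_ON.get(c, ()))
    return wake

# A short-lived sensor query wakes 3 of 7 components instead of the full device.
print(sorted(minimal_wake_set({"gps"})))   # ['cpu', 'gps', 'location_service']
```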

Tuesday 6th, 11am - 12:30pm
More Distributed Transactions
2015
Yesquel: Scalable SQL storage for Web applications
Web applications have been shifting their storage systems from SQL to NOSQL systems. NOSQL systems scale well but drop many convenient SQL features, such as joins, secondary indexes, and/or transactions. We design, develop, and evaluate Yesquel, a system that provides performance and scalability comparable to NOSQL with all the features of a SQL relational system. Yesquel has a new architecture and a new distributed data structure, called YDBT, which Yesquel uses for storage, and which performs well under contention by many concurrent clients. We evaluate Yesquel and find that Yesquel performs almost as well as Redis—a popular NOSQL system—and much better than MYSQL Cluster, while handling SQL queries at scale.
2015
Building Consistent Transactions with Inconsistent Replication
Application programmers increasingly prefer distributed storage systems with strong consistency and distributed transactions (e.g., Google’s Spanner) for their strong guarantees and ease of use. Unfortunately, existing transactional storage systems are expensive to use — in part because they require costly replication protocols, like Paxos, for fault tolerance. In this paper, we present a new approach that makes transactional storage systems more affordable: we eliminate consistency from the replication protocol while still providing distributed transactions with strong consistency to applications.

We present TAPIR — the Transactional Application Protocol for Inconsistent Replication — the first transaction protocol to use a novel replication protocol, called inconsistent replication, that provides fault tolerance without consistency. By enforcing strong consistency only in the transaction protocol, TAPIR can commit transactions in a single round-trip and order distributed transactions without centralized coordination. We demonstrate the use of TAPIR in a transactional key-value store, TAPIR-KV. Compared to conventional systems, TAPIR-KV provides better latency and throughput.

2015
High-Performance ACID via Modular Concurrency Control
This paper describes the design, implementation, and evaluation of Callas, a distributed database system that offers to unmodified, transactional ACID applications the opportunity to achieve a level of performance that can currently only be reached by rewriting all or part of the application in a BASE/NoSQL style. The key to combining performance and ease of programming is to decouple the ACID abstraction—which Callas offers identically for all transactions—from the mechanism used to support it. MCC, the new Modular approach to Concurrency Control at the core of Callas, makes it possible to partition transactions in groups with the guarantee that, as long as the concurrency control mechanism within each group upholds a given isolation property, that property will also hold among transactions in different groups. Because of their limited and specialized scope, these group-specific mechanisms can be customized for concurrency with unprecedented aggressiveness. In our MySQL Cluster-based prototype, Callas yields an 8.2× throughput gain for TPC-C with no programming effort.
Tuesday 6th, 2pm - 3:30pm
Experience and Practice
2015
Existential Consistency: Measuring and Understanding Consistency at Facebook
Replicated storage for large Web services faces a trade-off between stronger forms of consistency and higher performance properties. Stronger consistency prevents anomalies, i.e., unexpected behavior visible to users, and reduces programming complexity. There is much recent work on improving the performance properties of systems with stronger consistency, yet the flip-side of this trade-off remains elusively hard to quantify. To the best of our knowledge, no prior work does so for a large, production Web service.

We use measurement and analysis of requests to Facebook’s TAO system to quantify how often anomalies happen in practice, i.e., when results returned by eventually consistent TAO differ from what is allowed by stronger consistency models. For instance, our analysis shows that 0.0004% of reads to vertices would return different results in a linearizable system. This in turn gives insight into the benefits of stronger consistency; 0.0004% of reads are potential anomalies that a linearizable system would prevent. We directly study local consistency models—i.e., those we can analyze using requests to a sample of objects—and use the relationships between models to infer bounds on the others.

We also describe a practical consistency monitoring system that tracks ϕ-consistency, a new consistency metric ideally suited for health monitoring. In addition, we give insight into the increased programming complexity of weaker consistency by discussing bugs our monitoring uncovered, and anti-patterns we teach developers to avoid.
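
As a rough sketch of the monitoring idea (a simplification, not the paper's exact ϕ-consistency definition), a checker can repeatedly issue the same read to every replica of a sampled object and track how often the replicas disagree:

```python
# Simplified consistency monitor in the spirit of the paper's checker: sample
# an object, read it from every replica/region, and record whether all replicas
# agreed. Replica behavior below is simulated.
import random

class ConsistencyMonitor:
    def __init__(self, replicas):
        self.replicas = replicas        # callables: replica(key) -> value
        self.checks = self.agreed = 0

    def sample(self, key):
        values = {r(key) for r in self.replicas}
        self.checks += 1
        self.agreed += (len(values) == 1)

    def agreement_rate(self):
        return self.agreed / self.checks if self.checks else 1.0

# Toy replicas: one of them occasionally serves a stale value.
state = {"x": 1}
fresh = lambda k: state[k]
stale = lambda k: state[k] - (1 if random.random() < 0.01 else 0)
mon = ConsistencyMonitor([fresh, fresh, stale])
for _ in range(10_000):
    mon.sample("x")
print(f"{(1 - mon.agreement_rate()) * 100:.3f}% of sampled reads disagreed")
```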

2015
Virtual CPU Validation
Testing the hypervisor is important for ensuring the correct operation and security of systems, but it is a hard and challenging task. We observe, however, that the challenge is similar in many respects to that of testing real CPUs. We thus propose to apply the testing environment of CPU vendors to hypervisors. We demonstrate the advantages of our proposal by adapting Intel’s testing facility to the Linux KVM hypervisor. We uncover and fix 117 bugs, six of which are security vulnerabilities. We further find four flaws in Intel virtualization technology, causing a disparity between the observable behavior of code running on virtual and bare-metal servers.
2015
Holistic Configuration Management at Facebook
Facebook’s web site and mobile apps are very dynamic. Every day, they undergo thousands of online configuration changes, and execute trillions of configuration checks to personalize the product features experienced by hundreds of millions of daily active users. For example, configuration changes help manage the rollouts of new product features, perform A/B testing experiments on mobile devices to identify the best echo-canceling parameters for VoIP, rebalance the load across global regions, and deploy the latest machine learning models to improve News Feed ranking. This paper gives a comprehensive description of the use cases, design, implementation, and usage statistics of a suite of tools that manage Facebook’s configuration end-to-end, including the frontend products, backend systems, and mobile apps.
Tuesday 6th, 4pm - 5:30pm
Bugs and Analysis
2015
Failure Sketching: A Technique for Automated Root Cause Diagnosis of In-Production Failures
Developers spend a lot of time searching for the root causes of software failures. For this, they traditionally try to reproduce those failures, but unfortunately many failures are so hard to reproduce in a test environment that developers spend days or weeks as ad-hoc detectives. The shortcomings of many solutions proposed for this problem prevent their use in practice.

We propose failure sketching, an automated debugging technique that provides developers with an explanation (“failure sketch”) of the root cause of a failure that occurred in production. A failure sketch only contains program statements that lead to the failure, and it clearly shows the differences between failing and successful runs; these differences guide developers to the root cause. Our approach combines static program analysis with a cooperative and adaptive form of dynamic program analysis.

We built Gist, a prototype for failure sketching that relies on hardware watchpoints and a new hardware feature for extracting control flow traces (Intel Processor Trace). We show that Gist can build failure sketches with low overhead for failures in systems like Apache, SQLite, and Memcached.

2015
Cross-checking Semantic Correctness: The Case of Finding File System Bugs
Today, systems software is too complex to be bug-free. To find bugs in systems software, developers often rely on code checkers, like Linux’s Sparse. However, the capability of existing tools used in commodity, large-scale systems is limited to finding only shallow bugs that tend to be introduced by simple programmer mistakes, and so do not require a deep understanding of code to find them. Unfortunately, the majority of bugs as well as those that are difficult to find are semantic ones, which violate high-level rules or invariants (e.g., missing a permission check). Thus, it is difficult for code checkers lacking the understanding of a programmer’s true intention to reason about semantic correctness.

To solve this problem, we present JUXTA, a tool that automatically infers high-level semantics directly from source code. The key idea in JUXTA is to compare and contrast multiple existing implementations that obey latent yet implicit high-level semantics. For example, the implementation of open() at the file system layer expects to handle an out-of-space error from the disk in all file systems. We applied JUXTA to 54 file systems in the stock Linux kernel (680K LoC), found 118 previously unknown semantic bugs (one bug per 5.8K LoC), and provided corresponding patches to 39 different file systems, including mature, popular ones like ext4, btrfs, XFS, and NFS. These semantic bugs are not easy to locate, as all the ones found by JUXTA have existed for over 6.2 years on average. Not only do our empirical results look promising, but the design of JUXTA is generic enough to be extended easily beyond file systems to any software that has multiple implementations, like Web browsers or protocols at the same layer of a network stack.

2015
Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems
Monitoring and troubleshooting distributed systems is notoriously difficult; potential problems are complex, varied, and unpredictable. The monitoring and diagnosis tools commonly used today — logs, counters, and metrics — have two important limitations: what gets recorded is defined a priori, and the information is recorded in a component- or machine-centric way, making it extremely hard to correlate events that cross these boundaries. This paper presents Pivot Tracing, a monitoring framework for distributed systems that addresses both limitations by combining dynamic instrumentation with a novel relational operator: the happened-before join. Pivot Tracing gives users, at runtime, the ability to define arbitrary metrics at one point of the system, while being able to select, filter, and group by events meaningful at other parts of the system, even when crossing component or machine boundaries. We have implemented a prototype of Pivot Tracing for Java-based systems and evaluate it on a heterogeneous Hadoop cluster comprising HDFS, HBase, MapReduce, and YARN. We show that Pivot Tracing can effectively identify a diverse range of root causes such as software bugs, misconfiguration, and limping hardware. We show that Pivot Tracing is dynamic, extensible, and enables cross-tier analysis between inter-operating applications, with low execution overhead.
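
A toy rendering of the happened-before join, with invented component and field names: a tuple packed into per-request baggage at one tracepoint is joined with a measurement taken at a downstream tracepoint, so the metric can be grouped by information known only upstream.

```python
# Toy sketch of Pivot Tracing's happened-before join: a tuple emitted at an
# upstream tracepoint is carried in per-request "baggage" and joined with
# measurements at a downstream tracepoint, so a metric recorded deep in the
# storage layer can be grouped by a property known only at the front end.
from collections import defaultdict

def frontend_tracepoint(baggage, client_id):
    baggage["client"] = client_id             # tuple packed into the baggage

def datanode_tracepoint(baggage, bytes_read, results):
    # happened-before join: bytes_read pairs with the upstream client tuple
    results[baggage.get("client", "unknown")] += bytes_read

results = defaultdict(int)
requests = [("tenantA", 4096), ("tenantB", 512), ("tenantA", 8192)]
for client, nbytes in requests:
    baggage = {}                               # flows with the request end to end
    frontend_tracepoint(baggage, client)
    datanode_tracepoint(baggage, nbytes, results)

print(dict(results))    # {'tenantA': 12288, 'tenantB': 512}
```
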
Wednesday 7th, 9am - 10:30am
Big Data
2015
Interruptible Tasks: Treating Memory Pressure As Interrupts for Highly Scalable Data-Parallel Programs
Real-world data-parallel programs commonly suffer from great memory pressure, especially when they are executed to process large datasets. Memory problems lead to excessive GC effort and out-of-memory errors, significantly hurting system performance and scalability. This paper proposes a systematic approach that can help data-parallel tasks survive memory pressure, improving their performance and scalability without needing any manual effort to tune system parameters. Our approach advocates interruptible task (ITask), a new type of data-parallel tasks that can be interrupted upon memory pressure—with part or all of their used memory reclaimed—and resumed when the pressure goes away.

To support ITasks, we propose a novel programming model and a runtime system, and have instantiated them on two state-of-the-art platforms, Hadoop and Hyracks. A thorough evaluation demonstrates the effectiveness of ITask: it has helped real-world Hadoop programs survive 13 out-of-memory problems reported on StackOverflow; a second set of experiments with 5 already well-tuned programs in Hyracks on datasets of different sizes shows that the ITask-based versions are 1.5–3× faster and scale to 3–24× larger datasets than their regular counterparts.

2015
Chaos: Scale-out Graph Processing from Secondary Storage
Chaos scales graph processing from secondary storage to multiple machines in a cluster. Earlier systems that process graphs from secondary storage are restricted to a single machine, and therefore limited by the bandwidth and capacity of the storage system on a single machine. Chaos is limited only by the aggregate bandwidth and capacity of all storage devices in the entire cluster.

Chaos builds on the streaming partitions introduced by X-Stream in order to achieve sequential access to storage, but parallelizes the execution of streaming partitions. Chaos is novel in three ways. First, Chaos partitions for sequential storage access, rather than for locality and load balance, resulting in much lower pre-processing times. Second, Chaos distributes graph data uniformly randomly across the cluster and does not attempt to achieve locality, based on the observation that in a small cluster network bandwidth far outstrips storage bandwidth. Third, Chaos uses work stealing to allow multiple machines to work on a single partition, thereby achieving load balance at runtime.

In terms of performance scaling, on 32 machines Chaos takes on average only 1.61 times longer to process a graph 32 times larger than on a single machine. In terms of capacity scaling, Chaos is capable of handling a graph with 1 trillion edges representing 16 TB of input data, a new milestone for graph processing capacity on a small commodity cluster.

2015
Arabesque: A System for Distributed Graph Mining
Distributed data processing platforms such as MapReduce and Pregel have substantially simplified the design and deployment of certain classes of distributed graph analytics algorithms. However, these platforms do not represent a good match for distributed graph mining problems, as for example finding frequent subgraphs in a graph. Given an input graph, these problems require exploring a very large number of subgraphs and finding patterns that match some “interestingness” criteria desired by the user. These algorithms are very important for areas such as social networks, semantic web, and bioinformatics.

In this paper, we present Arabesque, the first distributed data processing platform for implementing graph mining algorithms. Arabesque automates the process of exploring a very large number of subgraphs. It defines a high-level filter-process computational model that simplifies the development of scalable graph mining algorithms: Arabesque explores subgraphs and passes them to the application, which must simply compute outputs and decide whether the subgraph should be further extended. We use Arabesque’s API to produce distributed solutions to three fundamental graph mining problems: frequent subgraph mining, counting motifs, and finding cliques. Our implementations require a handful of lines of code, scale to trillions of subgraphs, and represent in some cases the first available distributed solutions.
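
A single-machine toy of the filter-process model makes the division of labor concrete: the framework enumerates embeddings by canonical extension, while user code supplies only filter and process, here to find cliques. The graph is invented, and nothing about Arabesque's distributed, compressed embedding exploration is reflected.

```python
# Toy sketch of the filter-process computational model: the "framework" below
# explores subgraph embeddings, and the user supplies filter() (keep and extend
# this embedding?) and process() (emit output).

GRAPH = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3}}

def is_clique(vertices):
    return all(v in GRAPH[u] for u in vertices for v in vertices if u < v)

def explore(filter_fn, process_fn):
    frontier = [(v,) for v in GRAPH]                  # size-1 embeddings
    while frontier:
        new_frontier = []
        for emb in frontier:
            if not filter_fn(emb):
                continue
            process_fn(emb)
            # extend only with larger vertex ids so each embedding is
            # enumerated exactly once
            candidates = {w for v in emb for w in GRAPH[v] if w > max(emb)}
            new_frontier.extend(emb + (w,) for w in candidates)
        frontier = new_frontier

cliques = []
explore(filter_fn=is_clique, process_fn=lambda e: cliques.append(e))
print(cliques)   # singletons, edges, and the triangles (1,2,3) and (1,3,4)
```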

Wednesday 7th, 11am - 12:30pm
Storage Systems
2015
How to Get More Value From Your File System Directory Cache
Applications frequently request file system operations that traverse the file system directory tree, such as opening a file or reading a file’s metadata. As a result, caching file system directory structure and metadata in memory is an important performance optimization for an OS kernel.

This paper identifies several design principles that can substantially improve hit rate and reduce hit cost transparently to applications and file systems. Specifically, our directory cache design can look up a directory in a constant number of hash table operations, separates finding paths from permission checking, memoizes the results of access control checks, uses signatures to accelerate lookup, and reduces miss rates through caching directory completeness. This design can meet a range of idiosyncratic requirements imposed by POSIX, Linux Security Modules, namespaces, and mount aliases. These optimizations are a significant net improvement for real-world applications, such as improving the throughput of the Dovecot IMAP server by up to 12% and the updatedb utility by up to 29%.

2015
Opportunistic Storage Maintenance
Storage systems rely on maintenance tasks, such as backup and layout optimization, to ensure data availability and good performance. These tasks access large amounts of data and can significantly impact foreground applications. We argue that storage maintenance can be performed more efficiently by prioritizing processing of data that is currently cached in memory. Data can be cached either due to other maintenance tasks requesting it previously, or due to overlapping foreground I/O activity.

We present Duet, a framework that provides notifications about page-level events to maintenance tasks, such as a page being added or modified in memory. Tasks use these events as hints to opportunistically process cached data. We show that tasks using Duet can complete maintenance work more efficiently because they perform fewer I/O operations. The I/O reduction depends on the amount of data overlap with other maintenance tasks and foreground applications. Consequently, Duet’s efficiency increases with additional tasks because opportunities for synergy appear more often.

2015
Split-Level I/O Scheduling
We introduce split-level I/O scheduling, a new framework that splits I/O scheduling logic across handlers at three layers of the storage stack: block, system call, and page cache. We demonstrate that traditional block-level I/O schedulers are unable to meet throughput, latency, and isolation goals. By utilizing the split-level framework, we build a variety of novel schedulers to readily achieve these goals: our Actually Fair Queuing scheduler reduces priority-misallocation by 28×; our Split-Deadline scheduler reduces tail latencies by 4×; our Split-Token scheduler reduces sensitivity to interference by 6×. We show that the framework is general and operates correctly with disparate file systems (ext4 and XFS). Finally, we demonstrate that split-level scheduling serves as a useful foundation for databases (SQLite and PostgreSQL), hypervisors (QEMU), and distributed file systems (HDFS), delivering improved isolation and performance in these important application scenarios.

24th SOSP — November 3-6, 2013 — Farmington, Pennsylvania, USA
2013
Introduction
Monday 4th, 08:30-09:00
Welcome and Awards
Monday 4th, 09:00-10:30
Juggling Chainsaws
2013
The Scalable Commutativity Rule: Designing Scalable Software for Multicore Processors

What fundamental opportunities for scalability are latent in interfaces, such as system call APIs? Can scalability opportunities be identified even before any implementation exists, simply by considering interface specifications? To answer these questions, this paper introduces the following rule: Whenever interface operations commute, they can be implemented in a way that scales. This rule aids developers in building more scalable software starting from interface design and carrying on through implementation, testing, and evaluation.

To help developers apply the rule, a new tool named Commuter accepts high-level interface models and generates tests of operations that commute and hence could scale. Using these tests, Commuter can evaluate the scalability of an implementation. We apply Commuter to 18 POSIX calls and use the results to guide the implementation of a new research operating system kernel called sv6. Linux scales for 68% of the 13,664 tests generated by Commuter for these calls, and Commuter finds many problems that have been observed to limit application scalability. sv6 scales for 99% of the tests.
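
The rule can be checked concretely on a small model: apply a pair of operations in both orders and compare the results and the final state. The toy below is a hand-written model, not Commuter's symbolic test generation, and it shows why POSIX's lowest-available-FD requirement keeps two open() calls from commuting.

```python
# Toy commutativity check in the spirit of the scalable commutativity rule.
import copy

class FDModel:
    """Tiny model of a process file table with lowest-FD-first allocation."""
    def __init__(self):
        self.fds = {}
    def open(self, path):
        fd = min(set(range(len(self.fds) + 1)) - set(self.fds))  # lowest free fd
        self.fds[fd] = path
        return fd

def commutes(state, op_a, op_b):
    s1, s2 = copy.deepcopy(state), copy.deepcopy(state)
    ra = (op_a(s1), op_b(s1))           # order A;B
    rb = (op_b(s2), op_a(s2))           # order B;A
    return ra == (rb[1], rb[0]) and s1.fds == s2.fds

opens = (lambda s: s.open("/tmp/a"), lambda s: s.open("/tmp/b"))
print(commutes(FDModel(), *opens))
# False: each open() observes the other through the lowest-FD rule, which is
# exactly the kind of interface-level scalability limit the rule exposes.
```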

2013
Speedy Transactions in Multicore In-Memory Databases

Silo is a new in-memory database that achieves excellent performance and scalability on modern multicore machines. Silo was designed from the ground up to use system memory and caches efficiently. For instance, it avoids all centralized contention points, including that of centralized transaction ID assignment. Silo's key contribution is a commit protocol based on optimistic concurrency control that provides serializability while avoiding all shared-memory writes for records that were only read. Though this might seem to complicate the enforcement of a serial order, correct logging and recovery is provided by linking periodically-updated epochs with the commit protocol. Silo provides the same guarantees as any serializable database without unnecessary scalability bottlenecks or much additional latency. Silo achieves almost 700,000 transactions per second on a standard TPC-C workload mix on a 32-core machine, as well as near-linear scalability. Considered per core, this is several times higher than previously reported results.

2013
Everything You Always Wanted to Know about Synchronization but Were Afraid to Ask

This paper presents the most exhaustive study of synchronization to date. We span multiple layers, from hardware cache-coherence protocols up to high-level concurrent software. We do so on different types of architectures, from single-socket -- uniform and non-uniform -- to multi-socket -- directory and broadcast-based -- many-cores. We draw a set of observations that, roughly speaking, imply that scalability of synchronization is mainly a property of the hardware.

Monday 4th, 11:00-12:30
Time is of the Essence
2013
Dandelion: A Compiler and Runtime for Heterogeneous Systems

Computer systems increasingly rely on heterogeneity to achieve greater performance, scalability and energy efficiency. Because heterogeneous systems typically comprise multiple execution contexts with different programming abstractions and runtimes, programming them remains extremely challenging.

Dandelion is a system designed to address this programmability challenge for data-parallel applications. Dandelion provides a unified programming model for heterogeneous systems that span diverse execution contexts including CPUs, GPUs, FPGAs, and the cloud. It adopts the .NET LINQ (Language INtegrated Query) approach, integrating data-parallel operators into general purpose programming languages such as C# and F#. It therefore provides an expressive data model and native language integration for user-defined functions, enabling programmers to write applications using standard high-level languages and development tools.

Dandelion automatically and transparently distributes data-parallel portions of a program to available computing resources, including compute clusters for distributed execution and CPU and GPU cores of individual nodes for parallel execution. To enable automatic execution of .NET code on GPUs, Dandelion cross-compiles .NET code to CUDA kernels and uses the PTask runtime [85] to manage GPU execution. This paper discusses the design and implementation of Dandelion, focusing on the distributed CPU and GPU implementation. We evaluate the system using a diverse set of workloads.

2013
Sparrow: Distributed, Low Latency Scheduling

Large-scale data analytics frameworks are shifting towards shorter task durations and larger degrees of parallelism to provide low latency. Scheduling highly parallel jobs that complete in hundreds of milliseconds poses a major challenge for task schedulers, which will need to schedule millions of tasks per second on appropriate machines while offering millisecond-level latency and high availability. We demonstrate that a decentralized, randomized sampling approach provides near-optimal performance while avoiding the throughput and availability limitations of a centralized design. We implement and deploy our scheduler, Sparrow, on a 110-machine cluster and demonstrate that Sparrow performs within 12% of an ideal scheduler.
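
The sampling step at the heart of the design can be sketched directly. The sketch covers batch sampling only; Sparrow's late binding, constraints, and actual probe RPCs are omitted, and the queue lengths are made up.

```python
# Sketch of decentralized batch sampling: to place a job of m tasks, a
# scheduler probes d*m randomly chosen workers and assigns the tasks to the
# least-loaded workers among those probed.
import random

def batch_sample_place(queue_lengths, num_tasks, d=2):
    probed = random.sample(range(len(queue_lengths)), d * num_tasks)
    probed.sort(key=lambda w: queue_lengths[w])       # responses to the probes
    chosen = probed[:num_tasks]
    for w in chosen:
        queue_lengths[w] += 1                         # enqueue one task each
    return chosen

workers = [random.randint(0, 5) for _ in range(100)]  # toy per-worker queue lengths
print(batch_sample_place(workers, num_tasks=4))
```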

2013
Timecard: Controlling User-Perceived Delays in Server-Based Mobile Applications

Providing consistent response times to users of mobile applications is challenging because there are several variable delays between the start of a user's request and the completion of the response. These delays include location lookup, sensor data acquisition, radio wake-up, network transmissions, and processing on both the client and server. To allow applications to achieve consistent response times in the face of these variable delays, this paper presents the design, implementation, and evaluation of the Timecard system. Timecard provides two abstractions: the first returns the time elapsed since the user started the request, and the second returns an estimate of the time it would take to transmit the response from the server to the client and process the response at the client. With these abstractions, the server can adapt its processing time to control the end-to-end delay for the request. Implementing these abstractions requires Timecard to track delays across multiple asynchronous activities, handle time skew between client and server, and estimate network transfer times. Experiments with Timecard incorporated into two mobile applications show that the end-to-end delay is within 50 ms of the target delay of 1200 ms over 90% of the time.
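
A toy sketch of the two abstractions, with invented names and a made-up per-item cost: the server asks how much of the end-to-end budget has already elapsed and how long the reply is predicted to take to reach and be processed by the client, then sizes its own work to fit the remainder. Real Timecard additionally tracks the request across asynchronous stages, corrects clock skew, and predicts transfer times from measurements.

```python
# Illustrative sketch of adapting server work to an end-to-end latency target.
import time

class RequestTimecard:
    def __init__(self, user_start, predicted_downlink_ms, predicted_client_ms):
        self.user_start = user_start
        self.remaining_after_server = predicted_downlink_ms + predicted_client_ms

    def elapsed_ms(self):
        return (time.time() - self.user_start) * 1000.0

    def predicted_remaining_ms(self):
        return self.remaining_after_server

def handle(request, tc, target_ms=1200):
    budget = target_ms - tc.elapsed_ms() - tc.predicted_remaining_ms()
    # Adapt work to the remaining budget, e.g., how many results to rank,
    # assuming a made-up cost of 2 ms per item.
    work_items = max(10, int(budget / 2))
    return f"processed {work_items} items within the {target_ms} ms target"

tc = RequestTimecard(user_start=time.time() - 0.350,   # 350 ms already spent
                     predicted_downlink_ms=180, predicted_client_ms=120)
print(handle("query", tc))
```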

Monday 4th, 14:00-15:30
Seed Corn
2013
Fast Dynamic Binary Translation for the Kernel

Dynamic binary translation (DBT) is a powerful technique with several important applications. System-level binary translators have been used for implementing a Virtual Machine Monitor [2] and for instrumentation in the OS kernel [10]. In current designs, the performance overhead of binary translation on kernel-intensive workloads is high; e.g., over 10x slowdowns were reported on the syscall nanobenchmark in [2], and 2-5x slowdowns on lmbench microbenchmarks in [10]. These overheads are primarily due to the extra work required to correctly handle kernel mechanisms like interrupts, exceptions, and physical CPU concurrency.

We present a kernel-level binary translation mechanism which exhibits near-native performance even on applications with large kernel activity. Our translator relaxes transparency requirements and aggressively takes advantage of kernel invariants to eliminate sources of slowdown. We have implemented our translator as a loadable module in unmodified Linux, and present performance and scalability experiments on multiprocessor hardware. Although our implementation is Linux specific, our mechanisms are quite general; we only take advantage of typical kernel design patterns, not Linux-specific features. For example, our translator performs 3x faster than previous kernel-level DBT implementations while running the Apache web server.

2013
VirtuOS: An Operating System with Kernel Virtualization

Most operating systems provide protection and isolation to user processes, but not to critical system components such as device drivers or other system code. Consequently, failures in these components often lead to system failures. VirtuOS is an operating system that exploits a new method of decomposition to protect against such failures. VirtuOS exploits virtualization to isolate and protect vertical slices of existing OS kernels in separate service domains. Each service domain represents a partition of an existing kernel, which implements a subset of that kernel's functionality. Unlike competing solutions that merely isolate device drivers, or cannot protect from malicious and vulnerable code, VirtuOS provides full protection of isolated system components. VirtuOS's user library dispatches system calls directly to service domains using an exceptionless system call model, avoiding the cost of a system call trap in many cases.

We have implemented a prototype based on the Linux kernel and Xen hypervisor. We demonstrate the viability of our approach by creating and evaluating a network and a storage service domain. Our prototype can survive the failure of individual service domains while outperforming alternative approaches such as isolated driver domains and even exceeding the performance of native Linux for some multithreaded workloads. Thus, VirtuOS may provide a suitable basis for kernel decomposition while retaining compatibility with existing applications and good performance.

2013
From L3 to seL4: What Have We Learnt in 20 Years of L4 Microkernels?

The L4 microkernel has undergone 20 years of use and evolution. It has an active user and developer community, and there are commercial versions which are deployed on a large scale and in safety-critical systems. In this paper we examine the lessons learnt in those 20 years about microkernel design and implementation. We revisit the L4 design papers, and examine the evolution of design and implementation from the original L4 to the latest generation of L4 kernels, especially seL4, which has pushed the L4 model furthest and was the first OS kernel to undergo a complete formal verification of its implementation as well as a sound analysis of worst-case execution times. We demonstrate that while much has changed, the fundamental principles of minimality and high IPC performance remain the main drivers of design and implementation decisions.

Monday 4th, 16:00-18:00
Everything in its Place
2013
Replication, History, and Grafting in the Ori File System

Ori is a file system that manages user data in a modern setting where users have multiple devices and wish to access files everywhere, synchronize data, recover from disk failure, access old versions, and share data. The key to satisfying these needs is keeping and replicating file system history across devices, which is now practical as storage space has outpaced both wide-area network (WAN) bandwidth and the size of managed data. Replication provides access to files from multiple devices. History provides synchronization and offline access. Replication and history together subsume backup by providing snapshots and avoiding any single point of failure. In fact, Ori is fully peer-to-peer, offering opportunistic synchronization between user devices in close proximity and ensuring that the file system is usable so long as a single replica remains. Cross-file system data sharing with history is provided by a new mechanism called grafting. An evaluation shows that as a local file system, Ori has low overhead compared to a File system in User Space (FUSE) loopback driver; as a network file system, Ori over a WAN outperforms NFS over a LAN.

2013
An Analysis of Facebook Photo Caching

This paper examines the workload of Facebook's photo-serving stack and the effectiveness of the many layers of caching it employs. Facebook's image-management infrastructure is complex and geographically distributed. It includes browser caches on end-user systems, Edge Caches at ~20 PoPs, an Origin Cache, and for some kinds of images, additional caching via Akamai. The underlying image storage layer is widely distributed, and includes multiple data centers.

We instrumented every Facebook-controlled layer of the stack and sampled the resulting event stream to obtain traces covering over 77 million requests for more than 1 million unique photos. This permits us to study traffic patterns, cache access patterns, geolocation of clients and servers, and to explore correlation between properties of the content and accesses. Our results (1) quantify the overall traffic percentages served by different layers: 65.5% browser cache, 20.0% Edge Cache, 4.6% Origin Cache, and 9.9% Backend storage, (2) reveal that a significant portion of photo requests are routed to remote PoPs and data centers as a consequence both of load-balancing and peering policy, (3) demonstrate the potential performance benefits of coordinating Edge Caches and adopting S4LRU eviction algorithms at both Edge and Origin layers, and (4) show that the popularity of photos is highly dependent on content age and conditionally dependent on the social-networking metrics we considered.
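
For reference, S4LRU (the eviction algorithm whose potential benefit the study quantifies) splits the cache into four LRU segments: misses fill only the lowest segment, a hit promotes an item one segment up, and overflow demotes items downward. The sketch below is a minimal, equal-split implementation for illustration, not the one used in the paper's simulations.

```python
# Minimal sketch of the S4LRU eviction policy.
from collections import OrderedDict

class S4LRU:
    def __init__(self, capacity, levels=4):
        self.seg_cap = max(1, capacity // levels)
        self.segs = [OrderedDict() for _ in range(levels)]   # seg 0 = lowest

    def _find(self, key):
        for i, seg in enumerate(self.segs):
            if key in seg:
                return i
        return None

    def _insert(self, level, key, value):
        """Insert at the head of `level`, demoting or evicting overflow downward."""
        while level >= 0:
            seg = self.segs[level]
            seg[key] = value
            seg.move_to_end(key, last=False)        # head of this segment
            if len(seg) <= self.seg_cap:
                return
            key, value = seg.popitem(last=True)     # overflow: demote the LRU item
            level -= 1                              # falling out of seg 0 = evicted

    def get(self, key):
        i = self._find(key)
        if i is None:
            return None
        value = self.segs[i].pop(key)
        self._insert(min(i + 1, len(self.segs) - 1), key, value)   # promote on hit
        return value

    def put(self, key, value):
        i = self._find(key)
        if i is not None:
            self.segs[i].pop(key)
            self._insert(min(i + 1, len(self.segs) - 1), key, value)
        else:
            self._insert(0, key, value)             # misses fill the lowest segment

cache = S4LRU(capacity=8)
for k in "abcdefgh":
    cache.put(k, k.upper())
cache.get("h"); cache.get("h"); cache.get("h")      # repeated hits promote 'h'
print([list(seg) for seg in cache.segs])            # segments, lowest first
```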

2013
IOFlow: A Software-Defined Storage Architecture

In data centers, the IO path to storage is long and complex. It comprises many layers or "stages" with opaque interfaces between them. This makes it hard to enforce end-to-end policies that dictate a storage IO flow's performance (e.g., guarantee a tenant's IO bandwidth) and routing (e.g., route an untrusted VM's traffic through a sanitization middlebox). These policies require IO differentiation along the flow path and global visibility at the control plane. We design IOFlow, an architecture that uses a logically centralized control plane to enable high-level flow policies. IOFlow adds a queuing abstraction at data-plane stages and exposes this to the controller. The controller can then translate policies into queuing rules at individual stages. It can also choose among multiple stages for policy enforcement.

We have built the queue and control functionality at two key OS stages-- the storage drivers in the hypervisor and the storage server. IOFlow does not require application or VM changes, a key strength for deployability. We have deployed a prototype across a small testbed with a 40 Gbps network and storage devices. We have built control applications that enable a broad class of multi-point flow policies that are hard to achieve today.

2013
From ARIES to MARS: Transaction Support for Next-Generation, Solid-State Drives

Transaction-based systems often rely on write-ahead logging (WAL) algorithms designed to maximize performance on disk-based storage. However, emerging fast, byte-addressable, non-volatile memory (NVM) technologies (e.g., phase-change memories, spin-transfer torque MRAMs, and the memristor) present very different performance characteristics, so blithely applying existing algorithms can lead to disappointing performance.

This paper presents a novel storage primitive, called editable atomic writes (EAW), that enables sophisticated, highly-optimized WAL schemes in fast NVM-based storage systems. EAWs allow applications to safely access and modify log contents rather than treating the log as an append-only, write-only data structure, and we demonstrate that this can make implementing complex transactions simpler and more efficient. We use EAWs to build MARS, a WAL scheme that provides the same features as ARIES [26] (a widely-used WAL system for databases) but avoids making disk-centric implementation decisions.

We have implemented EAWs and MARS in a next-generation SSD to demonstrate that the overhead of EAWs is minimal compared to normal writes, and that they provide large speedups for transactional updates to hash tables, B+trees, and large graphs. In addition, MARS outperforms ARIES by up to 3.7x while reducing software complexity.

Monday 4th, 18:00-20:00
Poster Session 1
Tuesday 5th, 08:30-10:30
Whoops
2013
Asynchronous Intrusion Recovery for Interconnected Web Services

Recovering from attacks in an interconnected system is difficult, because an adversary that gains access to one part of the system may propagate to many others, and tracking down and recovering from such an attack requires significant manual effort. Web services are an important example of an interconnected system, as they are increasingly using protocols such as OAuth and REST APIs to integrate with one another. This paper presents Aire, an intrusion recovery system for such web services. Aire addresses several challenges, such as propagating repair across services when some servers may be unavailable, and providing appropriate consistency guarantees when not all servers have been repaired yet. Experimental results show that Aire can recover from four realistic attacks, including one modeled after a recent Facebook OAuth vulnerability; that porting existing applications to Aire requires little effort; and that Aire imposes a 19--30% CPU overhead and 6--9 KB/request storage cost for Askbot, an existing web application.

2013
Optimistic Crash Consistency

We introduce optimistic crash consistency, a new approach to crash consistency in journaling file systems. Using an array of novel techniques, we demonstrate how to build an optimistic commit protocol that correctly recovers from crashes and delivers high performance. We implement this optimistic approach within a Linux ext4 variant which we call OptFS. We introduce two new file-system primitives, osync() and dsync(), that decouple ordering of writes from their durability. We show through experiments that OptFS improves performance for many workloads, sometimes by an order of magnitude; we confirm its correctness through a series of robustness tests, showing it recovers to a consistent state after crashes. Finally, we show that osync() and dsync() are useful in atomic file system and database update scenarios, both improving performance and meeting application-level consistency demands.
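
The intended usage pattern can be sketched with hypothetical Python bindings for the two primitives (the real osync() and dsync() are calls into the authors' ext4 variant, not Python functions): ordering-only osync() where the application needs writes to reach disk in order, and dsync() only where it truly needs durability.

```python
# Illustrative (hypothetical) use of ordering vs. durability primitives.
import os

def osync(fd): ...   # hypothetical binding: order prior writes before later ones
def dsync(fd): ...   # hypothetical binding: order *and* make prior writes durable

def atomic_update(path, data):
    """Classic write-temp-then-rename, which needs ordering at the intermediate
    step but not immediate durability."""
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    os.write(fd, data)
    osync(fd)                 # cheap: only ordering is needed -- the temp file's
    os.close(fd)              # contents must reach disk before the rename does
    os.rename(tmp, path)

def commit_transaction(log_fd, record):
    os.write(log_fd, record)
    dsync(log_fd)             # expensive, but the caller truly needs durability
```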

2013
Do Not Blame Users for Misconfigurations

Similar to software bugs, configuration errors are also one of the major causes of today's system failures. Many configuration issues manifest themselves in ways similar to software bugs, such as crashes, hangs, and silent failures. This leaves users clueless and forced to report to developers for technical support, wasting not only users' but also developers' precious time and effort. Unfortunately, unlike software bugs, many software developers take a much less active, responsible role in handling configuration errors because "they are users' faults."

This paper advocates the importance for software developers to take an active role in handling misconfigurations. It also makes a concrete first step towards this goal by providing tooling support to help developers improve their configuration design, and harden their systems against configuration errors. Specifically, we build a tool, called Spex, to automatically infer configuration requirements (referred to as constraints) from software source code, and then use the inferred constraints to: (1) expose misconfiguration vulnerabilities (i.e., bad system reactions to configuration errors such as crashes, hangs, silent failures); and (2) detect certain types of error-prone configuration design and handling.

We evaluate Spex with one commercial storage system and six open-source server applications. Spex automatically infers a total of 3800 constraints for more than 2500 configuration parameters. Based on these constraints, Spex further detects 743 misconfiguration vulnerabilities and at least 112 error-prone constraints in the latest versions of the evaluated systems. To date, 364 vulnerabilities and 80 inconsistent constraints have been confirmed or fixed by developers after we reported them. Our results have influenced the Squid Web proxy project to improve its configuration parsing library towards a more user-friendly design.

2013
Towards Optimization-Safe Systems: Analyzing the Impact of Undefined Behavior

This paper studies an emerging class of software bugs called optimization-unstable code: code that is unexpectedly discarded by compiler optimizations due to undefined behavior in the program. Unstable code is present in many systems, including the Linux kernel and the Postgres database. The consequences of unstable code range from incorrect functionality to missing security checks.

To reason about unstable code, this paper proposes a novel model, which views unstable code in terms of optimizations that leverage undefined behavior. Using this model, we introduce a new static checker called Stack that precisely identifies unstable code. Applying Stack to widely used systems has uncovered 160 new bugs that have been confirmed and fixed by developers.
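
As an illustration of the bug class (not an example taken from the paper), the C function below contains a bounds check whose condition can only become true if signed overflow occurs; because signed overflow is undefined behavior, an optimizing compiler is permitted to assume it never happens and silently delete the check.

    /* Illustrative only: a sanity check that a compiler may legally discard.
     * Signed overflow is undefined in C, so the optimizer may assume that
     * `len + 100` never wraps, fold the condition below to false, and remove
     * the intended guard entirely. */
    #include <limits.h>
    #include <stdio.h>

    int within_limit(int len)
    {
        if (len + 100 < len)   /* only true on overflow, which is undefined... */
            return 0;          /* ...so this branch may be optimized away      */
        return 1;
    }

    int main(void)
    {
        /* At -O0 this typically prints 0; under aggressive optimization it
         * may print 1 because the check above has been discarded. */
        printf("%d\n", within_limit(INT_MAX));
        return 0;
    }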

Tuesday 4th, 11:00-12:30
Data, Data, Everywhere
2013
Transaction Chains: Achieving Serializability with Low Latency in Geo-Distributed Storage Systems

Currently, users of geo-distributed storage systems face a hard choice between having serializable transactions with high latency, or limited or no transactions with low latency. We show that it is possible to obtain both serializable transactions and low latency, under two conditions. First, transactions are known ahead of time, permitting an a priori static analysis of conflicts. Second, transactions are structured as transaction chains consisting of a sequence of hops, each hop modifying data at one server. To demonstrate this idea, we built Lynx, a geo-distributed storage system that offers transaction chains, secondary indexes, materialized join views, and geo-replication. Lynx uses static analysis to determine if each hop can execute separately while preserving serializability---if so, a client need wait only for the first hop to complete, which occurs quickly. To evaluate Lynx, we built three applications: an auction service, a Twitter-like microblogging site, and a social networking site. These applications successfully use chains to achieve low-latency operation and good throughput.
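
A rough sketch of the execution pattern described above, assuming a chain's hops run in order with only the first hop on the client's critical path. The hop structure, the queue, and the operations are purely illustrative and are not Lynx's actual interface.

    #include <stdio.h>

    struct hop { int server; const char *op; };

    enum { MAX_HOPS = 8 };
    static struct hop async_queue[MAX_HOPS];   /* hops still to be forwarded */
    static int queued = 0;

    /* Execute the first hop synchronously; queue the remaining hops for
     * in-order asynchronous execution, server to server. */
    static void chain_submit(const struct hop *hops, int nhops)
    {
        printf("server %d: %s (client returns here)\n", hops[0].server, hops[0].op);
        for (int i = 1; i < nhops && queued < MAX_HOPS; i++)
            async_queue[queued++] = hops[i];
    }

    static void drain_async(void)   /* stands in for server-to-server forwarding */
    {
        for (int i = 0; i < queued; i++)
            printf("server %d: %s (asynchronous)\n",
                   async_queue[i].server, async_queue[i].op);
        queued = 0;
    }

    int main(void)
    {
        struct hop post[] = {
            { 1, "append message to author's feed" },
            { 2, "update follower index" },
            { 3, "update per-topic secondary index" },
        };
        chain_submit(post, 3);   /* low latency: only the first hop blocks the client */
        drain_async();
        return 0;
    }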

2013
SPANStore: Cost-Effective Geo-Replicated Storage Spanning Multiple Cloud Services

By offering storage services in several geographically distributed data centers, cloud computing platforms enable applications to offer low latency access to user data. However, application developers are left to deal with the complexities associated with choosing the storage services at which any object is replicated and maintaining consistency across these replicas.

In this paper, we present SPANStore, a key-value store that exports a unified view of storage services in geographically distributed data centers. To minimize an application provider's cost, we combine three key principles. First, SPANStore spans multiple cloud providers to increase the geographical density of data centers and to minimize cost by exploiting pricing discrepancies across providers. Second, by estimating application workload at the right granularity, SPANStore judiciously trades off the greater geo-distributed replication needed to satisfy latency goals against the higher storage and data propagation costs that replication entails, while still satisfying fault tolerance and consistency requirements. Finally, SPANStore minimizes the use of compute resources to implement tasks such as two-phase locking and data propagation, which are necessary to offer a global view of the storage services that it builds upon. Our evaluation of SPANStore shows that it can lower costs by over 10x in several scenarios, in comparison with alternative solutions that either use a single storage provider or replicate every object to every data center from which it is accessed.

2013
Consistency-Based Service Level Agreements for Cloud Storage

Choosing a cloud storage system and specific operations for reading and writing data requires developers to make decisions that trade off consistency for availability and performance. Applications may be locked into a choice that is not ideal for all clients and changing conditions. Pileus is a replicated key-value store that allows applications to declare their consistency and latency priorities via consistency-based service level agreements (SLAs). It dynamically selects which servers to access in order to deliver the best service given the current configuration and system conditions. In application-specific SLAs, developers can request both strong and eventual consistency as well as intermediate guarantees such as read-my-writes. Evaluations running on a worldwide test bed with geo-replicated data show that the system adapts to varying client-server latencies to provide service that matches or exceeds the best static consistency choice and server selection scheme.
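
One way to picture a consistency-based SLA is as an ordered list of (consistency level, latency bound, utility) entries, from which the store serves the highest-utility entry it can currently meet. The C declaration below is a hedged sketch of that idea only; the names and the exact SLA format are assumptions for illustration, not Pileus's API.

    #include <stdio.h>

    enum consistency { STRONG, READ_MY_WRITES, EVENTUAL };

    struct sla_entry {
        enum consistency level;   /* minimum consistency acceptable        */
        double latency_ms;        /* read should complete within this bound */
        double utility;           /* value to the app if this entry is met  */
    };

    /* "Prefer strong reads under 300 ms; otherwise read-my-writes under 150 ms;
     * otherwise any replica."  The store would pick the highest-utility entry
     * that current replica lag and round-trip times allow it to satisfy. */
    static const struct sla_entry example_sla[] = {
        { STRONG,         300.0, 1.0  },
        { READ_MY_WRITES, 150.0, 0.7  },
        { EVENTUAL,       500.0, 0.25 },
    };

    int main(void)
    {
        printf("SLA has %zu entries\n",
               sizeof(example_sla) / sizeof(example_sla[0]));
        return 0;
    }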

Tuesday 4th, 15:30-17:00
Right Makes Might
2013
Tango: Distributed Data Structures over a Shared Log

Distributed systems are easier to build than ever with the emergence of new, data-centric abstractions for storing and computing over massive datasets. However, similar abstractions do not exist for storing and accessing meta-data. To fill this gap, Tango provides developers with the abstraction of a replicated, in-memory data structure (such as a map or a tree) backed by a shared log. Tango objects are easy to build and use, replicating state via simple append and read operations on the shared log instead of complex distributed protocols; in the process, they obtain properties such as linearizability, persistence and high availability from the shared log. Tango also leverages the shared log to enable fast transactions across different objects, allowing applications to partition state across machines and scale to the limits of the underlying log without sacrificing consistency.
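
The core pattern, reduced to a toy, looks like the following: an object's updates are appended to a log, and its in-memory state is rebuilt by replaying that log, so every replica that replays the same prefix computes the same state. The local array below stands in for Tango's shared, replicated log, and the function names are illustrative rather than Tango's interface.

    #include <stdio.h>

    enum { LOG_CAP = 1024 };
    static int shared_log[LOG_CAP];   /* appended deltas                   */
    static int log_tail = 0;          /* next free slot in the log         */

    static void counter_add(int delta)   /* update = append to the log     */
    {
        if (log_tail < LOG_CAP)
            shared_log[log_tail++] = delta;
    }

    static int counter_read(void)        /* read = replay the log          */
    {
        int value = 0;
        for (int i = 0; i < log_tail; i++)   /* every replica replays the  */
            value += shared_log[i];          /* same prefix, so all agree  */
        return value;
    }

    int main(void)
    {
        counter_add(5);
        counter_add(-2);
        printf("counter = %d\n", counter_read());   /* prints 3 */
        return 0;
    }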

2013
Verifying Computations with State

When a client outsources a job to a third party (e.g., the cloud), how can the client check the result, without re-executing the computation? Recent work in proof-based verifiable computation has made significant progress on this problem by incorporating deep results from complexity theory and cryptography into built systems. However, these systems work within a stateless model: they exclude computations that interact with RAM or a disk, or for which the client does not have the full input.

This paper describes Pantry, a built system that overcomes these limitations. Pantry composes proof-based verifiable computation with untrusted storage: the client expresses its computation in terms of digests that attest to state, and verifiably outsources that computation. Using Pantry, we extend verifiability to MapReduce jobs, simple database queries, and interactions with private state. Thus, Pantry takes another step toward practical proof-based verifiable computation for realistic applications.

2013
There Is More Consensus In Egalitarian Parliaments

This paper describes the design and implementation of Egalitarian Paxos (EPaxos), a new distributed consensus algorithm based on Paxos. EPaxos achieves three goals: (1) optimal commit latency in the wide-area when tolerating one and two failures, under realistic conditions; (2) uniform load balancing across all replicas (thus achieving high throughput); and (3) graceful performance degradation when replicas are slow or crash.

Egalitarian Paxos is to our knowledge the first protocol to achieve the previously stated goals efficiently---that is, requiring only a simple majority of replicas to be non-faulty, using a number of messages linear in the number of replicas to choose a command, and committing commands after just one communication round (one round trip) in the common case or after at most two rounds in any case. We prove Egalitarian Paxos's properties theoretically and demonstrate its advantages empirically through an implementation running on Amazon EC2.

Tuesday 4th, 17:00-18:00
Work in Progress
Tuesday 4th, 18:00-20:00
Poster Session 2
Wednesday 5th, 09:00-10:30
N' Sync
2013
ROOT: Replaying Multithreaded Traces with Resource-Oriented Ordering

We describe ROOT, a new method for incorporating the nondeterministic I/O behavior of multithreaded applications into trace replay. ROOT is the application of Resource-Oriented Ordering to Trace replay: actions involving a common resource are replayed in an order similar to that of the original trace. ROOT is based on the idea that how a program manages resources, as seen in a trace, provides hints about an application's internal dependencies. Inferring these dependencies allows us to partially constrain trace replay in a way that reflects the constraints of the original program. We make three contributions: (1) we describe the ROOT approach, (2) we release ARTC, a new ROOT-based tool for replaying I/O traces, and (3) we create Magritte, a file-system benchmark suite generated by applying ARTC to 34 Apple desktop application traces. When collecting traces on one platform and replaying on another, ARTC achieves an average timing inaccuracy of 10.6% on our benchmark workloads, halving the 21.3% achieved by the next-best replay method we evaluate.

2013
PARROT: A Practical Runtime for Deterministic, Stable, and Reliable Threads

Multithreaded programs are hard to get right. A key reason is that the contract between developers and runtimes grants exponentially many schedules to the runtimes. We present Parrot, a simple, practical runtime with a new contract to developers. By default, it orders thread synchronizations in the well-defined round-robin order, vastly reducing schedules to provide determinism (more precisely, deterministic synchronizations) and stability (i.e., robustness against input or code perturbations, a more useful property than determinism). When default schedules are slow, it allows developers to write intuitive performance hints in their code to switch or add schedules for speed. We believe this "meet in the middle" contract eases writing correct, efficient programs.

We further present an ecosystem formed by integrating Parrot with a model checker called dbug. This ecosystem is more effective than either system alone: dbug checks the schedules that matter to Parrot, and Parrot greatly increases the coverage of dbug.

Results on a diverse set of 108 programs, roughly 10× more than any prior evaluation, show that Parrot is easy to use (averaging 1.2 lines of hints per program); achieves low overhead (6.9% for 55 real-world programs and 12.7% for all 108 programs), 10× better than two prior systems; scales well to the maximum allowed cores on a 24-core server and to different scales/types of workloads; and increases dbug's coverage by 10^6--10^19734 times for 56 programs. Parrot's source code, entire benchmark suite, and raw results are available at github.com/columbia/smt-mc.

2013
RaceMob: Crowdsourced Data Race Detection

Some of the worst concurrency problems in multi-threaded systems today are due to data races---these bugs can have messy consequences, and they are hard to diagnose and fix. To avoid the introduction of such bugs, system developers need discipline and good data race detectors; today, even if they have the former, they lack the latter.

We present RaceMob, a new data race detector that has both low overhead and good accuracy. RaceMob starts by detecting potential races statically (hence it has few false negatives), and then dynamically validates whether these are true races (hence it has few false positives). It achieves low runtime overhead and a high degree of realism by combining real-user crowdsourcing with a new on-demand dynamic data race validation technique.

We evaluated RaceMob on ten systems, including Apache, SQLite, and Memcached---it detects data races with higher accuracy than state-of-the-art detectors (both static and dynamic), and RaceMob users experience an average runtime overhead of about 2%, which is orders of magnitude less than the overhead of modern dynamic data race detectors. To the best of our knowledge, RaceMob is the first data race detector that can both be used always-on in production and provide good accuracy.

Wednesday 5th, 11:00-13:00
Data into Information
2013
Discretized Streams: Fault-Tolerant Streaming Computation at Scale

Many "big data" applications must act on data in real time. Running these applications at ever-larger scales requires parallel platforms that automatically handle faults and stragglers. Unfortunately, current distributed stream processing models provide fault recovery in an expensive manner, requiring hot replication or long recovery times, and do not handle stragglers. We propose a new processing model, discretized streams (D-Streams), that overcomes these challenges. D-Streams enable a parallel recovery mechanism that improves efficiency over traditional replication and backup schemes, and tolerates stragglers. We show that they support a rich set of operators while attaining high per-node throughput similar to single-node systems, linear scaling to 100 nodes, sub-second latency, and sub-second fault recovery. Finally, D-Streams can easily be composed with batch and interactive query models like MapReduce, enabling rich applications that combine these modes. We implement D-Streams in a system called Spark Streaming.

2013
Naiad: A Timely Dataflow System

Naiad is a distributed system for executing data parallel, cyclic dataflow programs. It offers the high throughput of batch processors, the low latency of stream processors, and the ability to perform iterative and incremental computations. Although existing systems offer some of these features, applications that require all three have relied on multiple platforms, at the expense of efficiency, maintainability, and simplicity. Naiad resolves the complexities of combining these features in one framework.

A new computational model, timely dataflow, underlies Naiad and captures opportunities for parallelism across a wide class of algorithms. This model enriches dataflow computation with timestamps that represent logical points in the computation and provide the basis for an efficient, lightweight coordination mechanism.

We show that many powerful high-level programming models can be built on Naiad's low-level primitives, enabling such diverse tasks as streaming data analysis, iterative machine learning, and interactive graph mining. Naiad outperforms specialized systems in their target application domains, and its unique features enable the development of new high-performance applications.

2013
A Lightweight Infrastructure for Graph Analytics

Several domain-specific languages (DSLs) for parallel graph analytics have been proposed recently. In this paper, we argue that existing DSLs can be implemented on top of a general-purpose infrastructure that (i) supports very fine-grain tasks, (ii) implements autonomous, speculative execution of these tasks, and (iii) allows application-specific control of task scheduling policies. To support this claim, we describe such an implementation called the Galois system.

We demonstrate the capabilities of this infrastructure in three ways. First, we implement more sophisticated algorithms for some of the graph analytics problems tackled by previous DSLs and show that end-to-end performance can be improved by orders of magnitude even on power-law graphs, thanks to the better algorithms facilitated by a more general programming model. Second, we show that, even when an algorithm can be expressed in existing DSLs, the implementation of that algorithm in the more general system can be orders of magnitude faster when the input graphs are road networks and similar graphs with high diameter, thanks to more sophisticated scheduling. Third, we implement the APIs of three existing graph DSLs on top of the common infrastructure in a few hundred lines of code and show that even for power-law graphs, the performance of the resulting implementations often exceeds that of the original DSL systems, thanks to the lightweight infrastructure.

2013
X-Stream: Edge-Centric Graph Processing using Streaming Partitions

X-Stream is a system for processing both in-memory and out-of-core graphs on a single shared-memory machine. While retaining the scatter-gather programming model with state stored in the vertices, X-Stream is novel in (i) using an edge-centric rather than a vertex-centric implementation of this model, and (ii) streaming completely unordered edge lists rather than performing random access. This design is motivated by the fact that sequential bandwidth for all storage media (main memory, SSD, and magnetic disk) is substantially larger than random access bandwidth.

We demonstrate that a large number of graph algorithms can be expressed using the edge-centric scatter-gather model. The resulting implementations scale well in terms of number of cores, in terms of number of I/O devices, and across different storage media. X-Stream competes favorably with existing systems for graph processing. Besides sequential access, we identify as one of the main contributors to better performance the fact that X-Stream does not need to sort edge lists during preprocessing.
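
The sketch below illustrates the edge-centric idea on a toy connected-components computation: vertex state is read and updated while streaming sequentially over an unordered edge list, with no per-vertex random access over edges. It is an in-memory simplification that collapses the scatter and gather phases into one pass; X-Stream's contribution is running this pattern efficiently when the edge list lives on SSD or disk.

    #include <stdio.h>

    struct edge { int src, dst; };

    int main(void)
    {
        int label[5] = {0, 1, 2, 3, 4};                  /* vertex state: component label */
        struct edge edges[] = {{0,1},{1,2},{3,4},{2,0}}; /* unordered edge list            */
        int n_edges = 4, changed = 1;

        while (changed) {                        /* repeat streaming passes to a fixpoint */
            changed = 0;
            for (int i = 0; i < n_edges; i++) {  /* sequential stream over the edges      */
                int s = edges[i].src, d = edges[i].dst;
                if (label[s] < label[d]) { label[d] = label[s]; changed = 1; }
                if (label[d] < label[s]) { label[s] = label[d]; changed = 1; }
            }
        }
        for (int v = 0; v < 5; v++)
            printf("vertex %d -> component %d\n", v, label[v]);
        return 0;
    }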


23rd SOSP — October 23-26, 2011 — Cascais, Portugal
2011
Introduction
Monday 24th, 08:30-09:00
Welcome and Awards
Monday 24th, 09:00-10:30
Key-Value
2011
SILT: A Memory-Efficient, High-Performance Key-Value Store
SILT (Small Index Large Table) is a memory-efficient, high-performance key-value store system based on flash storage that scales to serve billions of key-value items on a single node. It requires only 0.7 bytes of DRAM per entry and retrieves key/value pairs using on average 1.01 flash reads each. SILT combines new algorithmic and systems techniques to balance the use of memory, storage, and computation. Our contributions include: (1) the design of three basic key-value stores each with a different emphasis on memory-efficiency and write-friendliness; (2) synthesis of the basic key-value stores to build a SILT key-value store system; and (3) an analytical model for tuning system parameters carefully to meet the needs of different workloads. SILT requires one to two orders of magnitude less memory to provide comparable throughput to current high-performance key-value systems on a commodity desktop system with flash storage.
2011
Scalable Consistency in Scatter
Distributed storage systems often trade off strong semantics for improved scalability. This paper describes the design, implementation, and evaluation of Scatter, a scalable and consistent distributed key-value storage system. Scatter adopts the highly decentralized and self-organizing structure of scalable peer-to-peer systems, while preserving linearizable consistency even under adverse circumstances. Our prototype implementation demonstrates that even with very short node lifetimes, it is possible to build a scalable and consistent system with practical performance.
2011
Fast Crash Recovery in RAMCloud
RAMCloud is a DRAM-based storage system that provides inexpensive durability and availability by recovering quickly after crashes, rather than storing replicas in DRAM. RAMCloud scatters backup data across hundreds or thousands of disks, and it harnesses hundreds of servers in parallel to reconstruct lost data. The system uses a log-structured approach for all its data, in DRAM as well as on disk; this provides high performance both during normal operation and during recovery. RAMCloud employs randomized techniques to manage the system in a scalable and decentralized fashion. In a 60-node cluster, RAMCloud recovers 35 GB of data from a failed server in 1.6 seconds. Our measurements suggest that the approach will scale to recover larger memory sizes (64 GB or more) in less time with larger clusters.
Monday 24th, 11:00-12:30
Storage
2011
Design Implications for Enterprise Storage Systems via Multi-Dimensional Trace Analysis
Enterprise storage systems are facing enormous challenges due to increasing growth and heterogeneity of the data stored. Designing future storage systems requires comprehensive insights that existing trace analysis methods are ill-equipped to supply. In this paper, we seek to provide such insights by using a new methodology that leverages an objective, multi-dimensional statistical technique to extract data access patterns from network storage system traces. We apply our method on two large-scale real-world production network storage system traces to obtain comprehensive access patterns and design insights at user, application, file, and directory levels. We derive simple, easily implementable, threshold-based design optimizations that enable efficient data placement and capacity optimization strategies for servers, consolidation policies for clients, and improved caching performance for both.
2011
Differentiated Storage Services
We propose an I/O classification architecture to close the widening semantic gap between computer systems and storage systems. By classifying I/O, a computer system can request that different classes of data be handled with different storage system policies. Specifically, when a storage system is first initialized, we assign performance policies to predefined classes, such as the filesystem journal. Then, online, we include a classifier with each I/O command (e.g., SCSI), thereby allowing the storage system to enforce the associated policy for each I/O that it receives.

Our immediate application is caching. We present filesystem prototypes and a database proof-of-concept that classify all disk I/O — with very little modification to the filesystem, database, and operating system. We associate caching policies with various classes (e.g., large files shall be evicted before metadata and small files), and we show that end-to-end file system performance can be improved by over a factor of two, relative to conventional caches like LRU. And caching is simply one of many possible applications. As part of our ongoing work, we are exploring other classes, policies and storage system mechanisms that can be used to improve end-to-end performance, reliability and security.

2011
A File is Not a File: Understanding the I/O Behavior of Apple Desktop Applications
We analyze the I/O behavior of iBench, a new collection of productivity and multimedia application workloads. Our analysis reveals a number of differences between iBench and typical file-system workload studies, including the complex organization of modern files, the lack of pure sequential access, the influence of underlying frameworks on I/O patterns, the widespread use of file synchronization and atomic operations, and the prevalence of threads. Our results have strong ramifications for the design of next-generation local and cloud-based storage systems.
Monday 24th, 14:00-15:30
Security
2011
CryptDB: Protecting Confidentiality with Encrypted Query Processing
Online applications are vulnerable to theft of sensitive information because adversaries can exploit software bugs to gain access to private data, and because curious or malicious administrators may capture and leak data. CryptDB is a system that provides practical and provable confidentiality in the face of these attacks for applications backed by SQL databases. It works by executing SQL queries over encrypted data using a collection of efficient SQL-aware encryption schemes. CryptDB can also chain encryption keys to user passwords, so that a data item can be decrypted only by using the password of one of the users with access to that data. As a result, a database administrator never gets access to decrypted data, and even if all servers are compromised, an adversary cannot decrypt the data of any user who is not logged in. An analysis of a trace of 126 million SQL queries from a production MySQL server shows that CryptDB can support operations over encrypted data for 99.5% of the 128,840 columns seen in the trace. Our evaluation shows that CryptDB has low overhead, reducing throughput by 14.5% for phpBB, a web forum application, and by 26% for queries from TPC-C, compared to unmodified MySQL. Chaining encryption keys to user passwords requires 11–13 unique schema annotations to secure more than 20 sensitive fields and 2–7 lines of source code changes for three multi-user web applications.
2011
Intrusion Recovery for Database-backed Web Applications
WARP is a system that helps users and administrators of web applications recover from intrusions such as SQL injection, cross-site scripting, and clickjacking attacks, while preserving legitimate user changes. WARP repairs from an intrusion by rolling back parts of the database to a version before the attack, and replaying subsequent legitimate actions. WARP allows administrators to retroactively patch security vulnerabilities—i.e., apply new security patches to past executions—to recover from intrusions without requiring the administrator to track down or even detect attacks. WARP’s time-travel database allows fine-grained rollback of database rows, and enables repair to proceed concurrently with normal operation of a web application. Finally, WARP captures and replays user input at the level of a browser’s DOM, to recover from attacks that involve a user’s browser. For a web server running MediaWiki, WARP requires no application source code changes to recover from a range of common web application vulnerabilities with minimal user input at a cost of 24–27% in throughput and 2–3.2 GB/day in storage.
2011
Software fault isolation with API integrity and multi-principal modules
The security of many applications relies on the kernel being secure, but history suggests that kernel vulnerabilities are routinely discovered and exploited. In particular, exploitable vulnerabilities in kernel modules are common. This paper proposes LXFI, a system which isolates kernel modules from the core kernel so that vulnerabilities in kernel modules cannot lead to a privilege escalation attack. To safely give kernel modules access to complex kernel APIs, LXFI introduces the notion of API integrity, which captures the set of contracts assumed by an interface. To partition the privileges within a shared module, LXFI introduces module principals. Programmers specify principals and API integrity rules through capabilities and annotations. Using a compiler plugin, LXFI instruments the generated code to grant, check, and transfer capabilities between modules, according to the programmer’s annotations. An evaluation with Linux shows that the annotations required on kernel functions to support a new module are moderate, and that LXFI is able to prevent three known privilege-escalation vulnerabilities. Stress tests of a network driver module also show that isolating this module using LXFI does not hurt TCP throughput but reduces UDP throughput by 35%, and increases CPU utilization by 2.2–3.7x.
Monday 24th, 16:00-17:30
Reality
2011
Thialfi: A Client Notification Service for Internet-Scale Applications
Ensuring the freshness of client data is a fundamental problem for applications that rely on cloud infrastructure to store data and mediate sharing. Thialfi is a notification service developed at Google to simplify this task. Thialfi supports applications written in multiple programming languages and running on multiple platforms, e.g., browsers, phones, and desktops. Applications register their interest in a set of shared objects and receive notifications when those objects change. Thialfi servers run in multiple Google data centers for availability and replicate their state asynchronously. Thialfi’s approach to recovery emphasizes simplicity: all server state is soft, and clients drive recovery and assist in replication. A principal goal of our design is to provide a straightforward API and good semantics despite a variety of failures, including server crashes, communication failures, storage unavailability, and data center failures.

Evaluation of live deployments confirms that Thialfi is scalable, efficient, and robust. In production use, Thialfi has scaled to millions of users and delivers notifications with an average delay of less than one second.

2011
Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency
Windows Azure Storage (WAS) is a cloud storage system that provides customers the ability to store seemingly limitless amounts of data for any duration of time. WAS customers have access to their data from anywhere at any time and only pay for what they use and store. In WAS, data is stored durably using both local and geographic replication to facilitate disaster recovery. Currently, WAS storage comes in the form of Blobs (files), Tables (structured storage), and Queues (message delivery). In this paper, we describe the WAS architecture, global namespace, and data model, as well as its resource provisioning, load balancing, and replication systems.
2011
An Empirical Study on Configuration Errors in Commercial and Open Source Systems
Configuration errors (i.e., misconfigurations) are among the dominant causes of system failures. Their importance has inspired many research efforts on detecting, diagnosing, and fixing misconfigurations; such research would benefit greatly from a real-world characteristic study on misconfigurations. Unfortunately, few such studies have been conducted in the past, primarily because historical misconfigurations usually have not been recorded rigorously in databases.

In this work, we undertake one of the first attempts to conduct a real-world misconfiguration characteristic study. We study a total of 546 real world misconfigurations, including 309 misconfigurations from a commercial storage system deployed at thousands of customers, and 237 from four widely used open source systems (CentOS, MySQL, Apache HTTP Server, and OpenLDAP). Some of our major findings include: (1) A majority of misconfigurations (70.0%~85.5%) are due to mistakes in setting configuration parameters; however, a significant number of misconfigurations are due to compatibility issues or component configurations (i.e., not parameter-related). (2) 38.1%~53.7% of parameter mistakes are caused by illegal parameters that clearly violate some format or rules, motivating the use of an automatic configuration checker to detect these misconfigurations. (3) A significant percentage (12.2%~29.7%) of parameter-based mistakes are due to inconsistencies between different parameter values. (4) 21.7%~57.3% of the misconfigurations involve configurations external to the examined system, some even on entirely different hosts. (5) A significant portion of misconfigurations can cause hard-to-diagnose failures, such as crashes, hangs, or severe performance degradation, indicating that systems should be better-equipped to handle misconfigurations.

Monday 24th, 17:30-19:15
Posters
Tuesday 25th, 09:00-11:00
Virtualization
2011
Cells: A Virtual Mobile Smartphone Architecture
Smartphones are increasingly ubiquitous, and many users carry multiple phones to accommodate work, personal, and geographic mobility needs. We present Cells, a virtualization architecture for enabling multiple virtual smartphones to run simultaneously on the same physical cellphone in an isolated, secure manner. Cells introduces a usage model of having one foreground virtual phone and multiple background virtual phones. This model enables a new device namespace mechanism and novel device proxies that integrate with lightweight operating system virtualization to multiplex phone hardware across multiple virtual phones while providing native hardware device performance. Cells virtual phone features include fully accelerated 3D graphics, complete power management features, and full telephony functionality with separately assignable telephone numbers and caller ID support. We have implemented a prototype of Cells that supports multiple Android virtual phones on the same phone. Our performance results demonstrate that Cells imposes only modest runtime and memory overhead, works seamlessly across multiple hardware devices including Google Nexus 1 and Nexus S phones, and transparently runs Android applications at native speed without any modifications.
2011
Breaking Up is Hard to Do: Security and Functionality in a Commodity Hypervisor
Cloud computing uses virtualization to lease small slices of large-scale datacenter facilities to individual paying customers. These multi-tenant environments, on which numerous large and popular web-based applications run today, are founded on the belief that the virtualization platform is sufficiently secure to prevent breaches of isolation between different users who are co-located on the same host. Hypervisors are believed to be trustworthy in this role because of their small size and narrow interfaces.

We observe that despite the modest footprint of the hypervisor itself, these platforms have a large aggregate trusted computing base (TCB) that includes a monolithic control VM with numerous interfaces exposed to VMs. We present Xoar, a modified version of Xen that retrofits the modularity and isolation principles used in micro-kernels onto a mature virtualization platform. Xoar breaks the control VM into single-purpose components called service VMs. We show that this componentized abstraction brings a number of benefits: sharing of service components by guests is configurable and auditable, making exposure to risk explicit, and access to the hypervisor is restricted to the least privilege required for each component. Microrebooting components at configurable frequencies reduces the temporal attack surface of individual components. Our approach incurs little performance overhead, and does not require functionality to be sacrificed or components to be rewritten from scratch.

2011
CloudVisor: Retrofitting Protection of Virtual Machines in Multi-tenant Cloud with Nested Virtualization
Multi-tenant cloud, which usually leases resources in the form of virtual machines, has been commercially available for years. Unfortunately, with the adoption of commodity virtualized infrastructures, software stacks in typical multi-tenant clouds are non-trivially large and complex, and thus are prone to compromise or abuse from adversaries including the cloud operators, which may lead to leakage of security-sensitive data.

In this paper, we propose a transparent, backward-compatible approach that protects the privacy and integrity of customers’ virtual machines on commodity virtualized infrastructures, even facing a total compromise of the virtual machine monitor (VMM) and the management VM. The key to our approach is separating resource management from security protection in the virtualization layer. A tiny security monitor is introduced underneath the commodity VMM using nested virtualization and provides protection to the hosted VMs. As a result, our approach allows virtualization software (e.g., VMM, management VM and tools) to handle complex tasks of managing leased VMs for the cloud, without breaking security of users’ data inside the VMs.

We have implemented a prototype by leveraging commercially-available hardware support for virtualization. The prototype system, called CloudVisor, comprises only 5.5K LOCs and supports the Xen VMM with multiple Linux and Windows guest OSes. Performance evaluation shows that CloudVisor incurs moderate slowdown for I/O intensive applications and very small slowdown for other applications.

2011
Atlantis: Robust, Extensible Execution Environments for Web Applications
Today’s web applications run inside a complex browser environment that is buggy, ill-specified, and implemented in different ways by different browsers. Thus, web applications that desire robustness must use a variety of conditional code paths and ugly hacks to deal with the vagaries of their runtime. Our new exokernel browser, called Atlantis, solves this problem by providing pages with an extensible execution environment. Atlantis defines a narrow API for basic services like collecting user input, exchanging network data, and rendering images. By composing these primitives, web pages can define custom, high-level execution environments. Thus, an application which does not want a dependence on Atlantis’ predefined web stack can selectively redefine components of that stack, or define markup formats and scripting languages that look nothing like the current browser runtime. Unlike prior microkernel browsers like OP, and unlike compile-to-JavaScript frameworks like GWT, Atlantis is the first browsing system to truly minimize a web page’s dependence on black box browser code. This makes it much easier to develop robust, secure web applications.
Tuesday 25th, 11:30-12:30
OS Architecture
2011
PTask: Operating System Abstractions To Manage GPUs as Compute Devices
We propose a new set of OS abstractions to support GPUs and other accelerator devices as first class computing resources. These new abstractions, collectively called the PTask API, support a dataflow programming model. Because a PTask graph consists of OS-managed objects, the kernel has sufficient visibility and control to provide system-wide guarantees like fairness and performance isolation, and can streamline data movement in ways that are impossible under current GPU programming models.

Our experience developing the PTask API, along with a gestural interface on Windows 7 and a FUSE-based encrypted file system on Linux, shows that the PTask API can provide important system-wide guarantees where there were previously none, and can enable significant performance improvements, for example gaining a 5x improvement in maximum throughput for the gestural interface.

2011
Logical Attestation: An Authorization Architecture for Trustworthy Computing
This paper describes the design and implementation of a new operating system authorization architecture to support trustworthy computing. Called logical attestation, this architecture provides a sound framework for reasoning about run time behavior of applications. Logical attestation is based on attributable, unforgeable statements about program properties, expressed in a logic. These statements are suitable for mechanical processing, proof construction, and verification; they can serve as credentials, support authorization based on expressive authorization policies, and enable remote principals to trust software components without restricting the local user’s choice of binary implementations.

We have implemented logical attestation in a new operating system called the Nexus. The Nexus executes natively on x86 platforms equipped with secure coprocessors. It both supports native Linux applications and uses logical attestation to support new trustworthy-computing applications. When deployed on a trustworthy cloud-computing stack, logical attestation is efficient, achieves high performance, and can run applications that provide qualitative guarantees not possible with existing modes of attestation.

Tuesday 25th, 14:00-16:00
Detection and Tracing
2011
Practical Software Model Checking via Dynamic Interface Reduction
Implementation-level software model checking explores the state space of a system implementation directly to find potential software defects without requiring any specification or modeling. Despite early successes, the effectiveness of this approach remains severely constrained due to poor scalability caused by state-space explosion. DEMETER makes software model checking more practical with the following contributions: (i) proposing dynamic interface reduction, a new state-space reduction technique, (ii) introducing a framework that enables dynamic interface reduction in an existing model checker with a reasonable amount of effort, and (iii) providing the framework with a distributed runtime engine that supports parallel distributed model checking.

We have integrated DEMETER into two existing model checkers, MACEMC and MODIST, each involving changes of around 1,000 lines of code. Compared to the original MACEMC and MODIST model checkers, our experiments have shown state-space reduction from a factor of five to up to five orders of magnitude in representative distributed applications such as PAXOS, Berkeley DB, CHORD, and PASTRY. As a result, when applied to a deployed PAXOS implementation, which has been running in production data centers for years to manage tens of thousands of machines, DEMETER manages to explore completely a logically meaningful state space that covers both phases of the PAXOS protocol, offering higher assurance of software reliability that was not possible before.

2011
Detecting failures in distributed systems with the FALCON spy network
A common way for a distributed system to tolerate crashes is to explicitly detect them and then recover from them. Interestingly, detection can take much longer than recovery, as a result of many advances in recovery techniques, making failure detection the dominant factor in these systems’ unavailability when a crash occurs.

This paper presents the design, implementation, and evaluation of Falcon, a failure detector with several features. First, Falcon’s common-case detection time is sub-second, which keeps unavailability low. Second, Falcon is reliable: it never reports a process as down when it is actually up. Third, Falcon sometimes kills to achieve reliable detection but aims to kill the smallest needed component. Falcon achieves these features by coordinating a network of spies, each monitoring a layer of the system. Falcon’s main cost is a small amount of platform-specific logic. Falcon is thus the first failure detector that is fast, reliable, and viable. As such, it could change the way that a class of distributed systems is built.

2011
Secure Network Provenance
This paper introduces secure network provenance (SNP), a novel technique that enables networked systems to explain to their operators why they are in a certain state — e.g., why a suspicious routing table entry is present on a certain router, or where a given cache entry originated. SNP provides network forensics capabilities by permitting operators to track down faulty or misbehaving nodes, and to assess the damage such nodes may have caused to the rest of the system. SNP is designed for adversarial settings and is robust to manipulation; its tamper-evident properties ensure that operators can detect when compromised nodes lie or falsely implicate correct nodes.

We also present the design of SNooPy, a general-purpose SNP system. To demonstrate that SNooPy is practical, we apply it to three example applications: the Quagga BGP daemon, a declarative implementation of Chord, and Hadoop MapReduce. Our results indicate that SNooPy can efficiently explain state in an adversarial setting, that it can be applied with minimal effort, and that its costs are low enough to be practical.

2011
Fay: Extensible Distributed Tracing from Kernels to Clusters
Fay is a flexible platform for the efficient collection, processing, and analysis of software execution traces. Fay provides dynamic tracing through use of runtime instrumentation and distributed aggregation within machines and across clusters. At the lowest level, Fay can be safely extended with new tracing primitives, including even untrusted, fully-optimized machine code, and Fay can be applied to running user-mode or kernel-mode software without compromising system stability. At the highest level, Fay provides a unified, declarative means of specifying what events to trace, as well as the aggregation, processing, and analysis of those events.

We have implemented the Fay tracing platform for Windows and integrated it with two powerful, expressive systems for distributed programming. Our implementation is easy to use, can be applied to unmodified production systems, and provides primitives that allow the overhead of tracing to be greatly reduced, compared to previous dynamic tracing platforms. To show the generality of Fay tracing, we reimplement, in experiments, a range of tracing strategies and several custom mechanisms from existing tracing frameworks.

Fay shows that modern techniques for high-level querying and data-parallel processing of disaggregated data streams are well suited to comprehensive monitoring of software execution in distributed systems. Revisiting a lesson from the late 1960’s, Fay also demonstrates the efficiency and extensibility benefits of using safe, statically-verified machine code as the basis for low-level execution tracing. Finally, Fay establishes that, by automatically deriving optimized query plans and code for safe extensions, the expressiveness and performance of high-level tracing queries can equal or even surpass that of specialized monitoring tools.

Tuesday 25th, 16:30-18:00
Work in Progress
Wednesday 26th, 09:00-11:00
Threads and Races
2011
Dthreads: Efficient Deterministic Multithreading
Multithreaded programming is notoriously difficult to get right. A key problem is non-determinism, which complicates debugging, testing, and reproducing errors. One way to simplify multithreaded programming is to enforce deterministic execution, but current deterministic systems for C/C++ are incomplete or impractical. These systems require program modification, do not ensure determinism in the presence of data races, do not work with general-purpose multithreaded programs, or run up to 8.4x slower than pthreads.

This paper presents DTHREADS, an efficient deterministic multithreading system for unmodified C/C++ applications that replaces the pthreads library. DTHREADS enforces determinism in the face of data races and deadlocks. DTHREADS works by exploding multithreaded applications into multiple processes, with private, copy-on-write mappings to shared memory. It uses standard virtual memory protection to track writes, and deterministically orders updates by each thread. By separating updates from different threads, DTHREADS has the additional benefit of eliminating false sharing. Experimental results show that DTHREADS substantially outperforms a state-of-the-art deterministic runtime system, and for a majority of the benchmarks evaluated here, matches and occasionally exceeds the performance of pthreads.

2011
Efficient Deterministic Multithreading through Schedule Relaxation
Deterministic multithreading (DMT) eliminates many pernicious software problems caused by nondeterminism. It works by constraining a program to repeat the same thread interleavings, or schedules, when given the same input. Despite much recent research, it remains an open challenge to build both deterministic and efficient DMT systems for general programs on commodity hardware. To deterministically resolve a data race, a DMT system must enforce a deterministic schedule of shared memory accesses, or mem-schedule, which can incur prohibitive overhead. This overhead can be avoided by using schedules consisting only of synchronization operations, or sync-schedules. However, a sync-schedule is deterministic only for race-free programs, and most programs have races.

Our key insight is that races tend to occur only within minor portions of an execution, and a dominant majority of the execution is still race-free. Thus, we can resort to a mem-schedule only for the “racy” portions and enforce a sync-schedule otherwise, combining the efficiency of sync-schedules and the determinism of mem-schedules. We call these combined schedules hybrid schedules.

Based on this insight, we have built PEREGRINE, an efficient deterministic multithreading system. When a program first runs on an input, PEREGRINE records an execution trace. It then relaxes this trace into a hybrid schedule and reuses the schedule on future compatible inputs efficiently and deterministically. PEREGRINE further improves efficiency with two new techniques: determinism-preserving slicing to generalize a schedule to more inputs while preserving determinism, and schedule-guided simplification to precisely analyze a program according to a specific schedule. Our evaluation on a diverse set of programs shows that PEREGRINE is deterministic and efficient, and can frequently reuse schedules for half of the evaluated programs.

2011
Pervasive Detection of Process Races in Deployed Systems
Process races occur when multiple processes access shared operating system resources, such as files, without proper synchronization. We present the first study of real process races and the first system designed to detect them. Our study of hundreds of applications shows that process races are numerous, difficult to debug, and a real threat to reliability. To address this problem, we created RACEPRO, a system for automatically detecting these races. RACEPRO checks deployed systems in-vivo by recording live executions then deterministically replaying and checking them later. This approach increases checking coverage beyond the configurations or executions covered by software vendors or beta testing sites. RACEPRO records multiple processes, detects races in the recording among system calls that may concurrently access shared kernel objects, then tries different execution orderings of such system calls to determine which races are harmful and result in failures. To simplify race detection, RACEPRO models under-specified system calls based on load and store micro-operations. To reduce false positives and negatives, RACEPRO uses a replay and go-live mechanism to distill harmful races from benign ones. We have implemented RACEPRO in Linux, shown that it imposes only modest recording overhead, and used it to detect a number of previously unknown bugs in real applications caused by process races.
2011
Detecting and Surviving Data Races using Complementary Schedules
Data races are a common source of errors in multithreaded programs. In this paper, we show how to protect a program from data race errors at runtime by executing multiple replicas of the program with complementary thread schedules. Complementary schedules are a set of replica thread schedules crafted to ensure that replicas diverge only if a data race occurs and to make it very likely that harmful data races cause divergences. Our system, called Frost, uses complementary schedules to cause at least one replica to avoid the order of racing instructions that leads to incorrect program execution for most harmful data races. Frost introduces outcome-based race detection, which detects data races by comparing the state of replicas executing complementary schedules. We show that this method is substantially faster than existing dynamic race detectors for unmanaged code. To help programs survive bugs in production, Frost also diagnoses the data race bug and selects an appropriate recovery strategy, such as choosing a replica that is likely to be correct or executing more replicas to gather additional information.

Frost controls the thread schedules of replicas by running all threads of a replica non-preemptively on a single core. To scale the program to multiple cores, Frost runs a third replica in parallel to generate checkpoints of the program’s likely future states — these checkpoints let Frost divide program execution into multiple epochs, which it then runs in parallel.

We evaluate Frost using 11 real data race bugs in desktop and server applications. Frost both detects and survives all of these data races. Since Frost runs three replicas, its utilization cost is 3x. However, if there are spare cores to absorb this increased utilization, Frost adds only 3–12% overhead to application runtime.

Wednesday 26th, 11:30-12:30
Geo-Replication
2011
Transactional storage for geo-replicated systems
We describe the design and implementation of Walter, a key-value store that provides transactions and replicates data across distant sites. A key feature behind Walter is a new property called Parallel Snapshot Isolation (PSI). PSI allows Walter to replicate data asynchronously across sites, while providing strong guarantees at a single site. PSI precludes write-write conflicts, so that developers need not worry about conflict-resolution logic. To prevent write-write conflicts and implement PSI, Walter uses two new and simple techniques: preferred sites and counting sets. We use Walter to build a social networking application and port a Twitter-like application.
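
A hedged sketch of the counting-set idea named above: each site keeps a signed count per element, adds and removes commute, and an element is considered present when the counts summed across sites are positive, so concurrent updates at different sites never produce write-write conflicts. The representation below is illustrative only and is almost certainly simpler than Walter's.

    #include <stdio.h>

    enum { SITES = 2, ELEMS = 4 };
    static int count[SITES][ELEMS];           /* per-site counters per element */

    static void cset_add(int site, int e)    { count[site][e]++; }
    static void cset_remove(int site, int e) { count[site][e]--; }

    static int cset_contains(int e)           /* evaluated after merging sites */
    {
        int total = 0;
        for (int s = 0; s < SITES; s++)
            total += count[s][e];
        return total > 0;
    }

    int main(void)
    {
        cset_add(0, 2);       /* site 0 adds element 2        */
        cset_add(1, 2);       /* site 1 adds it concurrently  */
        cset_remove(0, 2);    /* site 0 removes it            */
        printf("contains(2) = %d\n", cset_contains(2));   /* 1: net count is +1 */
        return 0;
    }
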
2011
Don’t Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS
Geo-replicated, distributed data stores that support complex online applications, such as social networks, must provide an “always-on” experience where operations always complete with low latency. Today’s systems often sacrifice strong consistency to achieve these goals, exposing inconsistencies to their clients and necessitating complex application logic. In this paper, we identify and define a consistency model—causal consistency with convergent conflict handling, or causal+—that is the strongest achieved under these constraints.

We present the design and implementation of COPS, a key-value store that delivers this consistency model across the wide-area. A key contribution of COPS is its scalability: it can enforce causal dependencies between keys stored across an entire cluster, rather than within a single server as in previous systems. The central approach in COPS is tracking and explicitly checking whether causal dependencies between keys are satisfied in the local cluster before exposing writes. Further, in COPS-GT, we introduce get transactions in order to obtain a consistent view of multiple keys without locking or blocking. Our evaluation shows that COPS completes operations in less than a millisecond, provides throughput similar to previous systems when using one server per cluster, and scales well as we increase the number of servers in each cluster. It also shows that COPS-GT provides similar latency, throughput, and scaling to COPS for common workloads.
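
The dependency check at the heart of this approach can be sketched as follows: a write replicated from another datacenter carries the versions it causally depends on, and it is made visible locally only once those versions are already visible. The structures and function names below are assumptions made for illustration, not the COPS API.

    #include <stdbool.h>
    #include <stdio.h>

    struct dep { int key; long version; };

    /* Stand-in for "which version of each key is visible in this cluster". */
    static long visible_version[16];

    static bool deps_satisfied(const struct dep *deps, int ndeps)
    {
        for (int i = 0; i < ndeps; i++)
            if (visible_version[deps[i].key] < deps[i].version)
                return false;            /* dependency not yet applied locally */
        return true;
    }

    static bool apply_replicated_put(int key, long version,
                                     const struct dep *deps, int ndeps)
    {
        if (!deps_satisfied(deps, ndeps))
            return false;                /* buffer the write and retry later   */
        visible_version[key] = version;  /* expose the write                   */
        return true;
    }

    int main(void)
    {
        struct dep d = { 3, 7 };   /* this put depends on version 7 of key 3 */
        printf("applied=%d\n", apply_replicated_put(5, 9, &d, 1)); /* 0: dep missing   */
        visible_version[3] = 7;
        printf("applied=%d\n", apply_replicated_put(5, 9, &d, 1)); /* 1: dep satisfied */
        return 0;
    }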


22nd SOSP — October 11-14, 2009 — Big Sky, Montana, USA
2009
Front Matter
Turing Award Lecture
2009
The power of abstraction
Scalability
2009
FAWN: a fast array of wimpy nodes

This paper presents a new cluster architecture for low-power data-intensive computing. FAWN couples low-power embedded CPUs to small amounts of local flash storage, and balances computation and I/O capabilities to enable efficient, massively parallel access to data.

The key contributions of this paper are the principles of the FAWN architecture and the design and implementation of FAWN-KV—a consistent, replicated, highly available, and high-performance key-value storage system built on a FAWN prototype. Our design centers around purely log-structured datastores that provide the basis for high performance on flash storage, as well as for replication and consistency obtained using chain replication on a consistent hashing ring. Our evaluation demonstrates that FAWN clusters can handle roughly 350 key-value queries per Joule of energy—two orders of magnitude more than a disk-based system.

2009
RouteBricks: exploiting parallelism to scale software routers

We revisit the problem of scaling software routers, motivated by recent advances in server technology that enable high-speed parallel processing—a feature router workloads appear ideally suited to exploit. We propose a software router architecture that parallelizes router functionality both across multiple servers and across multiple cores within a single server. By carefully exploiting parallelism at every opportunity, we demonstrate a 35Gbps parallel router prototype; this router capacity can be linearly scaled through the use of additional servers. Our prototype router is fully programmable using the familiar Click/Linux environment and is built entirely from off-the-shelf, general-purpose server hardware.

2009
The multikernel: a new OS architecture for scalable multicore systems

Commodity computer systems contain more and more processor cores and exhibit increasingly diverse architectural tradeoffs, including memory hierarchies, interconnects, instruction sets and variants, and IO configurations. Previous high-performance computing systems have scaled in specific cases, but the dynamic nature of modern client and server workloads, coupled with the impossibility of statically optimizing an OS for all workloads and hardware variants, poses serious challenges for operating system structures.

We argue that the challenge of future multicore hardware is best met by embracing the networked nature of the machine, rethinking OS architecture using ideas from distributed systems. We investigate a new OS structure, the multikernel, that treats the machine as a network of independent cores, assumes no inter-core sharing at the lowest level, and moves traditional OS functionality to a distributed system of processes that communicate via message-passing.

We have implemented a multikernel OS to show that the approach is promising, and we describe how traditional scalability problems for operating systems (such as memory management) can be effectively recast using messages and can exploit insights from distributed systems and networking. An evaluation of our prototype on multicore systems shows that, even on present-day machines, the performance of a multikernel is comparable with a conventional OS, and can scale better to support future hardware.

Device Drivers
2009
Fast byte-granularity software fault isolation

Bugs in kernel extensions remain one of the main causes of poor operating system reliability despite proposed techniques that isolate extensions in separate protection domains to contain faults. We believe that previous fault isolation techniques are not widely used because they cannot isolate existing kernel extensions with low overhead on standard hardware. This is a hard problem because these extensions communicate with the kernel using a complex interface and they communicate frequently. We present BGI (Byte-Granularity Isolation), a new software fault isolation technique that addresses this problem. BGI uses efficient byte-granularity memory protection to isolate kernel extensions in separate protection domains that share the same address space. BGI ensures type safety for kernel objects and it can detect common types of errors inside domains. Our results show that BGI is practical: it can isolate Windows drivers without requiring changes to the source code and it introduces a CPU overhead between 0 and 16%. BGI can also find bugs during driver testing. We found 28 new bugs in widely used Windows drivers.
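
A rough Python sketch of the idea of byte-granularity protection domains follows: a per-domain table of access rights consulted before each memory write. The table layout and names are illustrative, not BGI's actual data structures, which are engineered for low overhead.

    NO_ACCESS, READ, WRITE = 0, 1, 2

    class ProtectionFault(Exception):
        pass

    class Domain:
        def __init__(self, name):
            self.name = name
            self.rights = {}                    # byte address -> access right

        def grant(self, addr, size, right):     # kernel grants rights on an object
            for a in range(addr, addr + size):
                self.rights[a] = right

        def check_write(self, addr, size):      # check inserted before each write
            if any(self.rights.get(a, NO_ACCESS) < WRITE
                   for a in range(addr, addr + size)):
                raise ProtectionFault("%s: illegal write at %#x" % (self.name, addr))

    drv = Domain("nic_driver")
    drv.grant(0x1000, 64, WRITE)    # a 64-byte buffer passed to the driver
    drv.check_write(0x1000, 64)     # allowed
    # drv.check_write(0x1040, 4)    # would raise: outside the granted range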

2009
Tolerating hardware device failures in software

Hardware devices can fail, but many drivers assume they do not. When confronted with real devices that misbehave, these assumptions can lead to driver or system failures. While major operating system and device vendors recommend that drivers detect and recover from hardware failures, we find that many drivers will crash or hang when a device fails. Such bugs cannot easily be detected by regular stress testing because the failures are induced by the device and not the software load. This paper describes Carburizer, a code-manipulation tool and associated runtime that improves system reliability in the presence of faulty devices. Carburizer analyzes driver source code to find locations where the driver incorrectly trusts the hardware to behave. Carburizer identified almost 1000 such bugs in Linux drivers with a false positive rate of less than 8 percent. With the aid of shadow drivers for recovery, Carburizer can automatically repair 840 of these bugs with no programmer involvement. To facilitate proactive management of device failures, Carburizer can also locate existing driver code that detects device failures and insert missing failure-reporting code. Finally, the Carburizer runtime can detect and tolerate interrupt-related bugs, such as stuck or missing interrupts.
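
The flavor of the repair can be pictured with the illustrative Python analogue below, not Carburizer's C transformation; read_status() and report_failure() are stand-ins for real device access and reporting hooks. An unbounded polling loop that trusts the device becomes a bounded loop that reports failure.

    import random, time

    def read_status():
        return random.choice([0x0, 0x1])      # pretend device status register

    def report_failure(msg):
        print("driver failure:", msg)         # stand-in for a real reporting hook

    READY, MAX_POLLS = 0x1, 1000

    def wait_for_device():
        # before hardening: while read_status() != READY: pass  (hangs if the device fails)
        for _ in range(MAX_POLLS):
            if read_status() == READY:
                return True
            time.sleep(0.001)
        report_failure("device did not become ready")
        return False

    print(wait_for_device())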

2009
Automatic device driver synthesis with Termite

Faulty device drivers cause significant damage through down time and data loss. The problem can be mitigated by an improved driver development process that guarantees correctness by construction. We achieve this by synthesising drivers automatically from formal specifications of device interfaces, thus reducing the impact of human error on driver reliability and potentially cutting down on development costs.

We present a concrete driver synthesis approach and tool called Termite. We discuss the methodology, the technical and practical limitations of driver synthesis, and provide an evaluation of non-trivial drivers for Linux, generated using our tool. We show that the performance of the generated drivers is on par with the equivalent manually developed drivers. Furthermore, we demonstrate that device specifications can be reused across different operating systems by generating a driver for FreeBSD from the same specification as used for Linux.

Debugging
2009
Automatically patching errors in deployed software

We present ClearView, a system for automatically patching errors in deployed software. ClearView works on stripped Windows x86 binaries without any need for source code, debugging information, or other external information, and without human intervention.

ClearView (1) observes normal executions to learn invariants that characterize the application's normal behavior, (2) uses error detectors to distinguish normal executions from erroneous executions, (3) identifies violations of learned invariants that occur during erroneous executions, (4) generates candidate repair patches that enforce selected invariants by changing the state or flow of control to make the invariant true, and (5) observes the continued execution of patched applications to select the most successful patch.
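
A toy Python illustration of steps (1) and (4) follows, assuming a simple range invariant; ClearView's real invariants, detectors, and binary patching are far richer than this.

    class RangeInvariant:
        def __init__(self):
            self.lo = None
            self.hi = None

        def observe(self, value):             # step (1): learn from normal executions
            self.lo = value if self.lo is None else min(self.lo, value)
            self.hi = value if self.hi is None else max(self.hi, value)

        def enforce(self, value):             # step (4): candidate repair restores the invariant
            if self.lo <= value <= self.hi:
                return value
            return min(max(value, self.lo), self.hi)   # clamp into the learned range

    inv = RangeInvariant()
    for v in [3, 7, 5, 9]:                    # values seen during normal runs
        inv.observe(v)
    print(inv.enforce(4096))                  # an out-of-range value is coerced back to 9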

ClearView is designed to correct errors in software with high availability requirements. Aspects of ClearView that make it particularly appropriate for this context include its ability to generate patches without human intervention, apply and remove patches to and from running applications without requiring restarts or otherwise perturbing the execution, and identify and discard ineffective or damaging patches by evaluating the continued behavior of patched applications.

ClearView was evaluated in a Red Team exercise designed to test its ability to successfully survive attacks that exploit security vulnerabilities. A hostile external Red Team developed ten code injection exploits and used these exploits to repeatedly attack an application protected by ClearView. ClearView detected and blocked all of the attacks. For seven of the ten exploits, ClearView automatically generated patches that corrected the error, enabling the application to survive the attacks and continue on to successfully process subsequent inputs. Finally, the Red Team attempted to make ClearView apply an undesirable patch, but ClearView's patch evaluation mechanism enabled ClearView to identify and discard both ineffective patches and damaging patches.

2009
Debugging in the (very) large: ten years of implementation and experience

Windows Error Reporting (WER) is a distributed system that automates the processing of error reports coming from an installed base of a billion machines. WER has collected billions of error reports in ten years of operation. It collects error data automatically and classifies errors into buckets, which are used to prioritize developer effort and report fixes to users. WER uses a progressive approach to data collection, which minimizes overhead for most reports yet allows developers to collect detailed information when needed. WER takes advantage of its scale to use error statistics as a tool in debugging; this allows developers to isolate bugs that could not be found at smaller scale. WER has been designed for large scale: one pair of database servers can record all the errors that occur on all Windows computers worldwide.
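
A small Python sketch of the bucketing idea: reports carrying the same coarse signature land in the same bucket, so bucket counts approximate per-bug volume. The signature fields below are illustrative, not WER's actual bucketing heuristics.

    import hashlib
    from collections import Counter

    def bucket_id(report):
        sig = "|".join([report["app"], report["version"], report["module"], hex(report["offset"])])
        return hashlib.sha1(sig.encode()).hexdigest()[:12]

    reports = [
        {"app": "foo.exe", "version": "1.2", "module": "bar.dll", "offset": 0x4F2},
        {"app": "foo.exe", "version": "1.2", "module": "bar.dll", "offset": 0x4F2},
        {"app": "foo.exe", "version": "1.2", "module": "baz.dll", "offset": 0x010},
    ]
    buckets = Counter(bucket_id(r) for r in reports)
    print(buckets.most_common(1))    # the largest bucket gets developer attention first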

2009
Detecting large-scale system problems by mining console logs

Surprisingly, console logs rarely help operators detect problems in large-scale datacenter services, for they often consist of the voluminous intermixing of messages from many software components written by independent developers. We propose a general methodology to mine this rich source of information to automatically detect system runtime problems. We first parse console logs by combining source code analysis with information retrieval to create composite features. We then analyze these features using machine learning to detect operational problems. We show that our method enables analyses that are impossible with previous methods because of its superior ability to create sophisticated features. We also show how to distill the results of our analysis to an operator-friendly one-page decision tree showing the critical messages associated with the detected problems. We validate our approach using the Darkstar online game server and the Hadoop File System, where we detect numerous real problems with high accuracy and few false positives. In the Hadoop case, we are able to analyze 24 million lines of console logs in 3 minutes. Our methodology works on textual console logs of any size and requires no changes to the service software, no human input, and no knowledge of the software's internals.
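
One way to picture the feature construction is sketched below in Python: parsed messages are grouped by an identifier and turned into per-group message-count vectors, and groups whose vectors deviate are flagged. The fixed baseline used here is a trivial stand-in for the statistical detection the abstract describes.

    from collections import Counter, defaultdict

    parsed = [                      # (identifier, message template) after log parsing
        ("blk_1", "allocate"), ("blk_1", "replicate"), ("blk_1", "delete"),
        ("blk_2", "allocate"), ("blk_2", "replicate"), ("blk_2", "delete"),
        ("blk_3", "allocate"), ("blk_3", "replicate"), ("blk_3", "replicate"),
    ]

    vectors = defaultdict(Counter)
    for ident, template in parsed:
        vectors[ident][template] += 1

    baseline = Counter({"allocate": 1, "replicate": 1, "delete": 1})
    for ident, vec in vectors.items():
        if vec != baseline:
            print("suspicious group:", ident, dict(vec))   # blk_3 was never deleted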

I/O
2009
Better I/O through byte-addressable, persistent memory

Modern computer systems have been built around the assumption that persistent storage is accessed via a slow, block-based interface. However, new byte-addressable, persistent memory technologies such as phase change memory (PCM) offer fast, fine-grained access to persistent storage.

In this paper, we present a file system and a hardware architecture that are designed around the properties of persistent, byte-addressable memory. Our file system, BPFS, uses a new technique called short-circuit shadow paging to provide atomic, fine-grained updates to persistent storage. As a result, BPFS provides strong reliability guarantees and offers better performance than traditional file systems, even when both are run on top of byte-addressable, persistent memory. Our hardware architecture enforces atomicity and ordering guarantees required by BPFS while still providing the performance benefits of the L1 and L2 caches.

Since these memory technologies are not yet widely available, we evaluate BPFS on DRAM against NTFS on both a RAM disk and a traditional disk. Then, we use microarchitectural simulations to estimate the performance of BPFS on PCM. Despite providing strong safety and consistency guarantees, BPFS on DRAM is typically twice as fast as NTFS on a RAM disk and 4-10 times faster than NTFS on disk. We also show that BPFS on PCM should be significantly faster than a traditional disk-based file system.

2009
Modular data storage with Anvil

Databases have achieved orders-of-magnitude performance improvements by changing the layout of stored data — for instance, by arranging data in columns or compressing it before storage. These improvements have been implemented in monolithic new engines, however, making it difficult to experiment with feature combinations or extensions. We present Anvil, a modular and extensible toolkit for building database back ends. Anvil's storage modules, called dTables, have much finer granularity than prior work. For example, some dTables specialize in writing data, while others provide optimized read-only formats. This specialization makes both kinds of dTable simple to write and understand. Unifying dTables implement more comprehensive functionality by layering over other dTables — for instance, building a read/write store from read-only tables and a writable journal, or building a general-purpose store from optimized special-purpose stores. The dTable design leads to a flexible system powerful enough to implement many database storage layouts. Our prototype implementation of Anvil performs up to 5.5 times faster than an existing B-tree-based database back end on conventional workloads, and can easily be customized for further gains on specific data and workloads.

2009
Operating System Transactions

Applications must be able to synchronize accesses to operating system resources in order to ensure correctness in the face of concurrency and system failures. System transactions allow the programmer to specify updates to heterogeneous system resources with the OS guaranteeing atomicity, consistency, isolation, and durability (ACID). System transactions efficiently and cleanly solve persistent concurrency problems that are difficult to address with other techniques. For example, system transactions eliminate security vulnerabilities in the file system that are caused by time-of-check-to-time-of-use (TOCTTOU) race conditions. System transactions enable an unsuccessful software installation to roll back without disturbing concurrent, independent updates to the file system.

This paper describes TxOS, a variant of Linux 2.6.22 that implements system transactions. TxOS uses new implementation techniques to provide fast, serializable transactions with strong isolation and fairness between system transactions and non-transactional activity. The prototype demonstrates that a mature OS running on commodity hardware can provide system transactions at a reasonable performance cost. For instance, a transactional installation of OpenSSH incurs only 10% overhead, and a non-transactional compilation of Linux incurs negligible overhead on TxOS. By making transactions a central OS abstraction, TxOS enables new transactional services. For example, one developer prototyped a transactional ext3 file system in less than one month.
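
The programming model can be pictured with the hedged Python analogue below: a context manager that stages file updates in a shadow copy and publishes them only on success. The directory path is hypothetical, and the final swap here is not truly atomic, which is exactly the guarantee a kernel-level system transaction adds.

    import os, shutil, tempfile
    from contextlib import contextmanager

    @contextmanager
    def system_transaction(root):
        # shadow copy next to the original so the final rename stays on one filesystem
        shadow = tempfile.mkdtemp(dir=os.path.dirname(os.path.abspath(root)))
        shutil.copytree(root, shadow, dirs_exist_ok=True)
        try:
            yield shadow                     # all updates are made in the shadow copy
        except Exception:
            shutil.rmtree(shadow)            # abort: no partial changes become visible
            raise
        else:
            backup = root + ".old"
            os.rename(root, backup)          # commit: swap the trees into place
            os.rename(shadow, root)          # (a real system transaction makes this atomic)
            shutil.rmtree(backup)

    # e.g., an installer that either fully succeeds or leaves the tree untouched:
    # with system_transaction("/opt/myapp") as txroot:
    #     with open(os.path.join(txroot, "VERSION"), "w") as f:
    #         f.write("2.0\n")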

Parallel Debugging
2009
PRES: probabilistic replay with execution sketching on multiprocessors

Bug reproduction is critically important for diagnosing a production-run failure. Unfortunately, reproducing a concurrency bug on multi-processors (e.g., multi-core) is challenging. Previous techniques either incur large overhead or require new non-trivial hardware extensions.

This paper proposes a novel technique called PRES (probabilistic replay via execution sketching) to help reproduce concurrency bugs on multi-processors. It relaxes the past (perhaps idealistic) objective of "reproducing the bug on the first replay attempt" to significantly lower production-run recording overhead. This is achieved by (1) recording only partial execution information (referred to as "sketches") during the production run, and (2) relying on an intelligent replayer during diagnosis time (when performance is less critical) to systematically explore the unrecorded non-deterministic space and reproduce the bug. With only partial information, our replayer may require more than one coordinated replay run to reproduce a bug. However, after a bug is reproduced once, PRES can reproduce it every time.

We implemented PRES along with five different execution sketching mechanisms. We evaluated them with 11 representative applications, including 4 servers, 3 desktop/client applications, and 4 scientific/graphics applications, with 13 real-world concurrency bugs of different types, including atomicity violations, order violations and deadlocks. PRES (with synchronization or system call sketching) significantly lowered the production-run recording overhead of previous approaches (by up to 4416 times), while still reproducing most tested bugs in fewer than 10 replay attempts. Moreover, PRES scaled well with the number of processors; PRES's feedback generation from unsuccessful replays is critical in bug reproduction.

2009
ODR: output-deterministic replay for multicore debugging

Reproducing bugs is hard. Deterministic replay systems address this problem by providing a high-fidelity replica of an original program run that can be repeatedly executed to zero in on bugs. Unfortunately, existing replay systems for multiprocessor programs fall short. These systems either incur high overheads, rely on non-standard multiprocessor hardware, or fail to reliably reproduce executions. Their primary stumbling block is data races — a source of nondeterminism that must be captured if executions are to be faithfully reproduced.

In this paper, we present ODR—a software-only replay system that reproduces bugs and provides low-overhead multiprocessor recording. The key observation behind ODR is that, for debugging purposes, a replay system does not need to generate a high-fidelity replica of the original execution. Instead, it suffices to produce any execution that exhibits the same outputs as the original. Guided by this observation, ODR relaxes its fidelity guarantees to avoid the problem of reproducing data-races altogether. The result is a system that replays real multiprocessor applications, such as Apache, MySQL, and the Java Virtual Machine, and provides low record-mode overhead.

Kernels
2009
seL4: formal verification of an OS kernel

Complete formal verification is the only known way to guarantee that a system is free of programming errors.

We present our experience in performing the formal, machine-checked verification of the seL4 microkernel from an abstract specification down to its C implementation. We assume correctness of compiler, assembly code, and hardware, and we used a unique design approach that fuses formal and operating systems techniques. To our knowledge, this is the first formal proof of functional correctness of a complete, general-purpose operating-system kernel. Functional correctness means here that the implementation always strictly follows our high-level abstract specification of kernel behaviour. This encompasses traditional design and implementation safety properties such as the kernel will never crash, and it will never perform an unsafe operation. It also proves much more: we can predict precisely how the kernel will behave in every possible situation.

seL4, a third-generation microkernel of L4 provenance, comprises 8,700 lines of C code and 600 lines of assembler. Its performance is comparable to other high-performance L4 kernels.

2009
Helios: heterogeneous multiprocessing with satellite kernels

Helios is an operating system designed to simplify the task of writing, deploying, and tuning applications for heterogeneous platforms. Helios introduces satellite kernels, which export a single, uniform set of OS abstractions across CPUs of disparate architectures and performance characteristics. Access to I/O services such as file systems is made transparent via remote message passing, which extends a standard microkernel message-passing abstraction to a satellite kernel infrastructure. Helios retargets applications to available ISAs by compiling from an intermediate language. To simplify deploying and tuning application performance, Helios exposes an affinity metric to developers. Affinity provides a hint to the operating system about whether a process would benefit from executing on the same platform as a service it depends upon.

We developed satellite kernels for an XScale programmable I/O card and for cache-coherent NUMA architectures. We offloaded several applications and operating system components, often by changing only a single line of metadata. We show up to a 28% performance improvement by offloading tasks to the XScale I/O card. On a mail-server benchmark, we show a 39% improvement in performance by automatically splitting the application among multiple NUMA domains.

2009
Surviving sensor network software faults

We describe Neutron, a version of the TinyOS operating system that efficiently recovers from memory safety bugs. Where existing schemes reboot an entire node on an error, Neutron's compiler and runtime extensions divide programs into recovery units and reboot only the faulting unit. The TinyOS kernel itself is a recovery unit: a kernel safety violation appears to applications as the processor being unavailable for 10-20 milliseconds.

Neutron further minimizes safety violation cost by supporting "precious" state that persists across reboots. Application data, time synchronization state, and routing tables can all be declared as precious. Neutron's reboot sequence conservatively checks that precious state is not the source of a fault before preserving it. Together, recovery units and precious state allow Neutron to reduce a safety violation's cost to time synchronization by 94% and to a routing protocol by 99.5%. Neutron also protects applications from losing data. Neutron provides this recovery on the very limited resources of a tiny, low-power microcontroller.

Clusters
2009
Distributed aggregation for data-parallel computing: interfaces and implementations

Data-intensive applications are increasingly designed to execute on large computing clusters. Grouped aggregation is a core primitive of many distributed programming models, and it is often the most efficient available mechanism for computations such as matrix multiplication and graph traversal. Such algorithms typically require non-standard aggregations that are more sophisticated than traditional built-in database functions such as Sum and Max. As a result, the ease of programming user-defined aggregations, and the efficiency of their implementation, is of great current interest.

This paper evaluates the interfaces and implementations for user-defined aggregation in several state of the art distributed computing systems: Hadoop, databases such as Oracle Parallel Server, and DryadLINQ. We show that: the degree of language integration between user-defined functions and the high-level query language has an impact on code legibility and simplicity; the choice of programming interface has a material effect on the performance of computations; some execution plans perform better than others on average; and that in order to get good performance on a variety of workloads a system must be able to select between execution plans depending on the computation. The interface and execution plan described in the MapReduce paper, and implemented by Hadoop, are found to be among the worst-performing choices.

2009
Quincy: fair scheduling for distributed computing clusters

This paper addresses the problem of scheduling concurrent jobs on clusters where application data is stored on the computing nodes. This setting, in which scheduling computations close to their data is crucial for performance, is increasingly common and arises in systems such as MapReduce, Hadoop, and Dryad as well as many grid-computing environments. We argue that data-intensive computation benefits from a fine-grain resource sharing model that differs from the coarser semi-static resource allocations implemented by most existing cluster computing architectures. The problem of scheduling with locality and fairness constraints has not previously been extensively studied under this resource-sharing model.

We introduce a powerful and flexible new framework for scheduling concurrent distributed jobs with fine-grain resource sharing. The scheduling problem is mapped to a graph data structure, where edge weights and capacities encode the competing demands of data locality, fairness, and starvation-freedom, and a standard solver computes the optimal online schedule according to a global cost model. We evaluate our implementation of this framework, which we call Quincy, on a cluster of a few hundred computers using a varied workload of data- and CPU-intensive jobs. We evaluate Quincy against an existing queue-based algorithm and implement several policies for each scheduler, with and without fairness constraints. Quincy gets better fairness when fairness is requested, while substantially improving data locality. The volume of data transferred across the cluster is reduced by up to a factor of 3.9 in our experiments, leading to a throughput increase of up to 40%.
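
A compact sketch of the graph formulation, using Python and the networkx min-cost-flow solver; the tasks, machines, costs, and capacities are invented for illustration rather than taken from Quincy.

    import networkx as nx

    tasks = {"t1": "m1", "t2": "m2", "t3": "m1"}   # task -> machine holding its input data
    machines = ["m1", "m2"]

    G = nx.DiGraph()
    G.add_node("sink", demand=len(tasks))          # all task flow must be absorbed
    for m in machines:
        G.add_edge(m, "sink", weight=0, capacity=2)   # at most 2 concurrent tasks per machine
    for t, local in tasks.items():
        G.add_node(t, demand=-1)                   # each task supplies one unit of flow
        for m in machines:
            cost = 1 if m == local else 10         # a cheap edge keeps the task near its data
            G.add_edge(t, m, weight=cost, capacity=1)

    flow = nx.min_cost_flow(G)
    schedule = {t: m for t in tasks for m in machines if flow[t].get(m, 0)}
    print(schedule)    # t1 and t3 land on m1, t2 on m2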

2009
Upright cluster services

The UpRight library seeks to make Byzantine fault tolerance (BFT) a simple and viable alternative to crash fault tolerance for a range of cluster services. We demonstrate UpRight by producing BFT versions of the Zookeeper lock service and the Hadoop Distributed File System (HDFS). Our design choices in UpRight favor simplifying adoption by existing applications; performance is a secondary concern. Despite these priorities, our BFT Zookeeper and BFT HDFS implementations have performance comparable with the originals while providing additional robustness.

Security
2009
Improving application security with data flow assertions

Resin is a new language runtime that helps prevent security vulnerabilities, by allowing programmers to specify application-level data flow assertions. Resin provides policy objects, which programmers use to specify assertion code and metadata; data tracking, which allows programmers to associate assertions with application data, and to keep track of assertions as the data flow through the application; and filter objects, which programmers use to define data flow boundaries at which assertions are checked. Resin's runtime checks data flow assertions by propagating policy objects along with data, as that data moves through the application, and then invoking filter objects when data crosses a data flow boundary, such as when writing data to the network or a file.
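
A minimal Python sketch of the pattern described above, with a policy object that travels with a value and a filter invoked at a data flow boundary; the class and field names are illustrative, not Resin's API.

    class PasswordPolicy:
        def __init__(self, owner):
            self.owner = owner

        def check_export(self, context):
            # only the owning user may see this value on an outgoing response
            if context.get("authenticated_user") != self.owner:
                raise PermissionError("password may not flow to this user")

    class Tainted(str):
        """A string value carrying policy objects."""
        def __new__(cls, value, policies):
            s = super().__new__(cls, value)
            s.policies = policies
            return s

    def network_filter(data, context):
        # filter object at the boundary: run every attached assertion
        for p in getattr(data, "policies", []):
            p.check_export(context)
        return data          # safe to send

    secret = Tainted("hunter2", [PasswordPolicy(owner="alice")])
    network_filter(secret, {"authenticated_user": "alice"})     # allowed
    # network_filter(secret, {"authenticated_user": "bob"})     # raises PermissionError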

Using Resin, Web application programmers can prevent a range of problems, from SQL injection and cross-site scripting, to inadvertent password disclosure and missing access control checks. Adding a Resin assertion to an application requires few changes to the existing application code, and an assertion can reuse existing code and data structures. For instance, 23 lines of code detect and prevent three previously-unknown missing access control vulnerabilities in phpBB, a popular Web forum application. Other assertions comprising tens of lines of code prevent a range of vulnerabilities in Python and PHP applications. A prototype of Resin incurs a 33% CPU overhead running the HotCRP conference management application.

2009
Heat-ray: combating identity snowball attacks using machine learning, combinatorial optimization and attack graphs

As computers have become ever more interconnected, the complexity of security configuration has exploded. Management tools have not kept pace, and we show that this has made identity snowball attacks into a critical danger. Identity snowball attacks leverage the users logged in to a first compromised host to launch additional attacks with those users' privileges on other hosts. To combat such attacks, we present Heat-ray, a system that combines machine learning, combinatorial optimization and attack graphs to scalably manage security configuration. Through evaluation on an organization with several hundred thousand users and machines, we show that Heat-ray allows IT administrators to reduce by 96% the number of machines that can be used to launch a large-scale identity snowball attack.

2009
Fabric: a platform for secure distributed computation and storage

Fabric is a new system and language for building secure distributed information systems. It is a decentralized system that allows heterogeneous network nodes to securely share both information and computation resources despite mutual distrust. Its high-level programming language makes distribution and persistence largely transparent to programmers. Fabric supports data-shipping and function-shipping styles of computation: both computation and information can move between nodes to meet security requirements or to improve performance. Fabric provides a rich, Java-like object model, but data resources are labeled with confidentiality and integrity policies that are enforced through a combination of compile-time and run-time mechanisms. Optimistic, nested transactions ensure consistency across all objects and nodes. A peer-to-peer dissemination layer helps to increase availability and to balance load. Results from applications built using Fabric suggest that Fabric has a clean, concise programming model, offers good performance, and enforces security.


21st SOSP — October 14-17, 2007 — Stevenson, Washington, USA
2007
Front Matter
Web Meets Operating Systems
2007
Protection and communication abstractions for web browsers in MashupOS

Web browsers have evolved from a single-principal platform on which one site is browsed at a time into a multi-principal platform on which data and code from mutually distrusting sites interact programmatically in a single page at the browser. Today's "Web 2.0" applications (or mashups) offer rich services, rivaling those of desktop PCs. However, the protection and communication abstractions offered by today's browsers remain suitable only for a single-principal system—either no trust through complete isolation between principals (sites) or full trust by incorporating third party code as libraries. In this paper, we address this deficiency by identifying and designing the missing abstractions needed for a browser-based multi-principal platform. We have designed our abstractions to be backward compatible and easily adoptable. We have built a prototype system that realizes almost all of our abstractions and their associated properties. Our evaluation shows that our abstractions make it easy to build more secure and robust client-side Web mashups and can be easily implemented with negligible performance overhead.

2007
AjaxScope: a platform for remotely monitoring the client-side behavior of web 2.0 applications

The rise of the software-as-a-service paradigm has led to the development of a new breed of sophisticated, interactive applications often called Web 2.0. While web applications have become larger and more complex, web application developers today have little visibility into the end-to-end behavior of their systems. This paper presents AjaxScope, a dynamic instrumentation platform that enables cross-user monitoring and just-in-time control of web application behavior on end-user desktops. AjaxScope is a proxy that performs on-the-fly parsing and instrumentation of JavaScript code as it is sent to users' browsers. AjaxScope provides facilities for distributed and adaptive instrumentation in order to reduce the client-side overhead, while giving fine-grained visibility into the code-level behavior of web applications. We present a variety of policies demonstrating the power of AjaxScope, ranging from simple error reporting and performance profiling to more complex memory leak detection and optimization analyses. We also apply our prototype to analyze the behavior of over 90 Web 2.0 applications and sites that use large amounts of JavaScript.

2007
Secure web applications via automatic partitioning

Swift is a new, principled approach to building web applications that are secure by construction. In modern web applications, some application functionality is usually implemented as client-side code written in JavaScript. Moving code and data to the client can create security vulnerabilities, but currently there are no good methods for deciding when it is secure to do so. Swift automatically partitions application code while providing assurance that the resulting placement is secure and efficient. Application code is written as Java-like code annotated with information flow policies that specify the confidentiality and integrity of web application information. The compiler uses these policies to automatically partition the program into JavaScript code running in the browser, and Java code running on the server. To improve interactive performance, code and data are placed on the client side. However, security-critical code and data are always placed on the server. Code and data can also be replicated across the client and server, to obtain both security and performance. A max-flow algorithm is used to place code and data in a way that minimizes client-server communication.

Byzantine Fault Tolerance
2007
Zyzzyva: speculative byzantine fault tolerance

We present Zyzzyva, a protocol that uses speculation to reduce the cost and simplify the design of Byzantine fault tolerant state machine replication. In Zyzzyva, replicas respond to a client's request without first running an expensive three-phase commit protocol to reach agreement on the order in which the request must be processed. Instead, they optimistically adopt the order proposed by the primary and respond immediately to the client. Replicas can thus become temporarily inconsistent with one another, but clients detect inconsistencies, help correct replicas converge on a single total ordering of requests, and only rely on responses that are consistent with this total order. This approach allows Zyzzyva to reduce replication overheads to near their theoretical minima.

2007
Tolerating byzantine faults in transaction processing systems using commit barrier scheduling

This paper describes the design, implementation, and evaluation of a replication scheme to handle Byzantine faults in transaction processing database systems. The scheme compares answers from queries and updates on multiple replicas which are unmodified, off-the-shelf systems, to provide a single database that is Byzantine fault tolerant. The scheme works when the replicas are homogeneous, but it also allows heterogeneous replication in which replicas come from different vendors. Heterogeneous replicas reduce the impact of bugs and security compromises because they are implemented independently and are thus less likely to suffer correlated failures.

The main challenge in designing a replication scheme for transaction processing systems is ensuring that the different replicas execute transactions in equivalent serial orders while allowing a high degree of concurrency. Our scheme meets this goal using a novel concurrency control protocol, commit barrier scheduling (CBS). We have implemented CBS in the context of a replicated SQL database, HRDB (Heterogeneous Replicated DB), which has been tested with unmodified production versions of several commercial and open source databases as replicas. Our experiments show an HRDB configuration that can tolerate one faulty replica has only a modest performance overhead (about 17% for the TPC-C benchmark). HRDB successfully masks several Byzantine faults observed in practice and we have used it to find a new bug in MySQL.

2007
Low-overhead byzantine fault-tolerant storage

This paper presents an erasure-coded Byzantine fault-tolerant block storage protocol that is nearly as efficient as protocols that tolerate only crashes. Previous Byzantine fault-tolerant block storage protocols have either relied upon replication, which is inefficient for large blocks of data when tolerating multiple faults, or a combination of additional servers, extra computation, and versioned storage. To avoid these expensive techniques, our protocol employs novel mechanisms to optimize for the common case when faults and concurrency are rare. In the common case, a write operation completes in two rounds of communication and a read completes in one round. The protocol requires a short checksum comprised of cryptographic hashes and homomorphic fingerprints. It achieves throughput within 10% of the crash-tolerant protocol for writes and reads in failure-free runs when configured to tolerate up to 6 faulty servers and any number of faulty clients.

Concurrency
2007
TxLinux: using and managing hardware transactional memory in an operating system

TxLinux is a variant of Linux that is the first operating system to use hardware transactional memory (HTM) as a synchronization primitive, and the first to manage HTM in the scheduler. This paper describes and measures TxLinux and discusses two innovations in detail: cooperation between locks and transactions, and the integration of transactions with the OS scheduler. Mixing locks and transactions requires a new primitive, cooperative transactional spinlocks (cxspinlocks), that allow locks and transactions to protect the same data while maintaining the advantages of both synchronization primitives. Cxspinlocks allow the system to attempt execution of critical regions with transactions and automatically roll back to use locking if the region performs I/O. Integrating the scheduler with HTM eliminates priority inversion. On a series of real-world benchmarks TxLinux has similar performance to Linux, exposing concurrency with as many as 32 concurrent threads on 32 CPUs in the same critical region.

2007
MUVI: automatically inferring multi-variable access correlations and detecting related semantic and concurrency bugs

Software defects significantly reduce system dependability. Among various types of software bugs, semantic and concurrency bugs are two of the most difficult to detect. This paper proposes a novel method, called MUVI, that detects an important class of semantic and concurrency bugs. MUVI automatically infers commonly existing multi-variable access correlations through code analysis and then detects two types of related bugs: (1) inconsistent updates—correlated variables are not updated in a consistent way, and (2) multi-variable concurrency bugs—correlated accesses are not protected in the same atomic sections in concurrent programs. We evaluate MUVI on four large applications: Linux, Mozilla, MySQL, and PostgreSQL. MUVI automatically infers more than 6000 variable access correlations with high accuracy (83%). Based on the inferred correlations, MUVI detects 39 new inconsistent update semantic bugs from the latest versions of these applications, with 17 of them recently confirmed by the developers based on our reports. We also implemented MUVI multi-variable extensions to two representative data race bug detection methods (lock-set and happens-before). Our evaluation on five real-world multi-variable concurrency bugs from Mozilla and MySQL shows that the MUVI-extension correctly identifies the root causes of four out of the five multi-variable concurrency bugs with 14% additional overhead on average. Interestingly, MUVI also helps detect four new multi-variable concurrency bugs in Mozilla that have never been reported before. None of the nine bugs can be identified correctly by the original race detectors without our MUVI extensions.

Software Robustness
2007
Bouncer: securing software by blocking bad input

Attackers exploit software vulnerabilities to control or crash programs. Bouncer uses existing software instrumentation techniques to detect attacks and it generates filters automatically to block exploits of the target vulnerabilities. The filters are deployed automatically by instrumenting system calls to drop exploit messages. These filters introduce low overhead and they allow programs to keep running correctly under attack. Previous work computes filters using symbolic execution along the path taken by a sample exploit, but attackers can bypass these filters by generating exploits that follow a different execution path. Bouncer introduces three techniques to generalize filters so that they are harder to bypass: a new form of program slicing that uses a combination of static and dynamic analysis to remove unnecessary conditions from the filter; symbolic summaries for common library functions that characterize their behavior succinctly as a set of conditions on the input; and generation of alternative exploits guided by symbolic execution. Bouncer filters have low overhead, they do not have false positives by design, and our results show that Bouncer can generate filters that block all exploits of some real-world vulnerabilities.

2007
Triage: diagnosing production run failures at the user's site

Diagnosing production run failures is a challenging yet important task. Most previous work focuses on offsite diagnosis, i.e. development site diagnosis with the programmers present. This is insufficient for production-run failures as: (1) it is difficult to reproduce failures offsite for diagnosis; (2) offsite diagnosis cannot provide timely guidance for recovery or security purposes; (3) it is infeasible to provide a programmer to diagnose every production run failure; and (4) privacy concerns limit the release of information (e.g. coredumps) to programmers.

To address production-run failures, we propose a system, called Triage, that automatically performs onsite software failure diagnosis at the very moment of failure. It provides a detailed diagnosis report, including the failure nature, triggering conditions, related code and variables, the fault propagation chain, and potential fixes. Triage achieves this by leveraging lightweight reexecution support to efficiently capture the failure environment and repeatedly replay the moment of failure, and dynamically—using different diagnosis techniques—analyze an occurring failure. Triage employs a failure diagnosis protocol that mimics the steps a human takes in debugging. This extensible protocol provides a framework to enable the use of various existing and new diagnosis techniques. We also propose a new failure diagnosis technique, delta analysis, to identify failure related conditions, code, and variables.

We evaluate these ideas in real system experiments with 10 real software failures from 9 open source applications including four servers. Triage accurately diagnoses the evaluated failures, providing likely root causes and even the fault propagation chain, while keeping normal-run overhead to under 5%. Finally, our user study of the diagnosis and repair of real bugs shows that Triage saves time (99.99% confidence), reducing the total time to fix by almost half.

2007
/*icomment: bugs or bad comments?*/

Commenting source code has long been a common practice in software development. Compared to source code, comments are more direct, descriptive and easy-to-understand. Comments and source code provide relatively redundant and independent information regarding a program's semantic behavior. As software evolves, they can easily grow out-of-sync, indicating two problems: (1) bugs - the source code does not follow the assumptions and requirements specified by correct program comments; (2) bad comments - comments that are inconsistent with correct code, which can confuse and mislead programmers to introduce bugs in subsequent versions. Unfortunately, as most comments are written in natural language, no solution has been proposed to automatically analyze comments and detect inconsistencies between comments and source code. This paper takes the first step in automatically analyzing comments written in natural language to extract implicit program rules and use these rules to automatically detect inconsistencies between comments and source code, indicating either bugs or bad comments. Our solution, iComment, combines Natural Language Processing (NLP), Machine Learning, Statistics and Program Analysis techniques to achieve these goals. We evaluate iComment on four large code bases: Linux, Mozilla, Wine and Apache. Our experimental results show that iComment automatically extracts 1832 rules from comments with 90.8-100% accuracy and detects 60 comment-code inconsistencies, 33 new bugs and 27 bad comments, in the latest versions of the four programs. Nineteen of them (12 bugs and 7 bad comments) have already been confirmed by the corresponding developers while the others are currently being analyzed by the developers.

Distributed Systems
2007
Sinfonia: a new paradigm for building scalable distributed systems

We propose a new paradigm for building scalable distributed systems. Our approach does not require dealing with message-passing protocols — a major complication in existing distributed systems. Instead, developers just design and manipulate data structures within our service called Sinfonia. Sinfonia keeps data for applications on a set of memory nodes, each exporting a linear address space. At the core of Sinfonia is a novel minitransaction primitive that enables efficient and consistent access to data, while hiding the complexities that arise from concurrency and failures. Using Sinfonia, we implemented two very different and complex applications in a few months: a cluster file system and a group communication service. Our implementations perform well and scale to hundreds of machines.
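
A single-process Python sketch of the minitransaction interface, with compare items, read items, and write items over (memory node, address) locations; Sinfonia executes these through a distributed commit protocol, which is omitted here.

    class MemoryNodes:
        def __init__(self, n):
            self.mem = [dict() for _ in range(n)]      # node id -> address -> value

        def minitransaction(self, compares, reads, writes):
            # compares and writes are lists of (node, addr, value); reads are (node, addr)
            if any(self.mem[n].get(a) != v for n, a, v in compares):
                return None                            # abort: a compare item did not match
            result = {(n, a): self.mem[n].get(a) for n, a in reads}
            for n, a, v in writes:
                self.mem[n][a] = v                     # all writes applied together on commit
            return result

    nodes = MemoryNodes(2)
    nodes.minitransaction([], [], [(0, "lock", 0), (1, "counter", 7)])
    # acquire the lock only if it is free, and read the counter, in one atomic step
    print(nodes.minitransaction([(0, "lock", 0)], [(1, "counter")], [(0, "lock", 1)]))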

2007
PeerReview: practical accountability for distributed systems

We describe PeerReview, a system that provides accountability in distributed systems. PeerReview ensures that Byzantine faults whose effects are observed by a correct node are eventually detected and irrefutably linked to a faulty node. At the same time, PeerReview ensures that a correct node can always defend itself against false accusations. These guarantees are particularly important for systems that span multiple administrative domains, which may not trust each other. PeerReview works by maintaining a secure record of the messages sent and received by each node. The record is used to automatically detect when a node's behavior deviates from that of a given reference implementation, thus exposing faulty nodes. PeerReview is widely applicable: it only requires that a correct node's actions are deterministic, that nodes can sign messages, and that each node is periodically checked by a correct node. We demonstrate that PeerReview is practical by applying it to three different types of distributed systems: a network filesystem, a peer-to-peer system, and an overlay multicast system.

2007
Attested append-only memory: making adversaries stick to their word

Researchers have made great strides in improving the fault tolerance of both centralized and replicated systems against arbitrary (Byzantine) faults. However, there are hard limits to how much can be done with entirely untrusted components; for example, replicated state machines cannot tolerate more than a third of their replica population being Byzantine. In this paper, we investigate how minimal trusted abstractions can push through these hard limits in practical ways. We propose Attested Append-Only Memory (A2M), a trusted system facility that is small, easy to implement and easy to verify formally. A2M provides the programming abstraction of a trusted log, which leads to protocol designs immune to equivocation — the ability of a faulty host to lie in different ways to different clients or servers — which is a common source of Byzantine headaches. Using A2M, we improve upon the state of the art in Byzantine-fault tolerant replicated state machines, producing A2M-enabled protocols (variants of Castro and Liskov's PBFT) that remain correct (linearizable) and keep making progress (live) even when half the replicas are faulty, in contrast to the previous upper bound. We also present an A2M-enabled single-server shared storage protocol that guarantees linearizability despite server faults. We implement A2M and our protocols, evaluate them experimentally through micro- and macro-benchmarks, and argue that the improved fault tolerance is cost-effective for a broad range of uses, opening up new avenues for practical, more reliable services.
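
A hedged Python sketch of the trusted-log abstraction: appends and lookups return attestations over a hash chain, so a host cannot present different histories to different parties. The method names and the MAC-based attestation are illustrative choices, not A2M's specified interface.

    import hashlib, hmac

    KEY = b"device-secret"            # stands in for the trusted component's signing key

    class A2MLog:
        def __init__(self):
            self.entries = []         # list of (value, digest)
            self.prev = b"\x00" * 32

        def _attest(self, seq, digest):
            return hmac.new(KEY, b"%d|%s" % (seq, digest), hashlib.sha256).hexdigest()

        def append(self, value):
            digest = hashlib.sha256(self.prev + value).digest()   # hash-chained entry
            self.entries.append((value, digest))
            self.prev = digest
            return self._attest(len(self.entries) - 1, digest)

        def lookup(self, seq):
            value, digest = self.entries[seq]
            return value, self._attest(seq, digest)

        def end(self):
            return len(self.entries) - 1, self._attest(len(self.entries) - 1, self.prev)

    log = A2MLog()
    print(log.append(b"PREPARE view=3 seq=17"))   # attestation the replica must present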

2007
Dynamo: amazon's highly available key-value store

Reliability at massive scale is one of the biggest challenges we face at Amazon.com, one of the largest e-commerce operations in the world; even the slightest outage has significant financial consequences and impacts customer trust. The Amazon.com platform, which provides services for many web sites worldwide, is implemented on top of an infrastructure of tens of thousands of servers and network components located in many datacenters around the world. At this scale, small and large components fail continuously and the way persistent state is managed in the face of these failures drives the reliability and scalability of the software systems.

This paper presents the design and implementation of Dynamo, a highly available key-value storage system that some of Amazon's core services use to provide an "always-on" experience. To achieve this level of availability, Dynamo sacrifices consistency under certain failure scenarios. It makes extensive use of object versioning and application-assisted conflict resolution in a manner that provides a novel interface for developers to use.
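
The object-versioning ingredient can be sketched in Python with vector clocks: a write supersedes the versions it causally descends from, and concurrent versions are kept as siblings for the application to reconcile. The structure below is illustrative, not Dynamo's implementation.

    def descends(a, b):
        """True if clock a includes all of clock b's history."""
        return all(a.get(node, 0) >= n for node, n in b.items())

    def put(store, key, value, coordinator, context=None):
        clock = dict(context or {})
        clock[coordinator] = clock.get(coordinator, 0) + 1
        # drop versions the new write supersedes, keep concurrent siblings
        siblings = [(v, c) for v, c in store.get(key, []) if not descends(clock, c)]
        store[key] = siblings + [(value, clock)]

    def get(store, key):
        return store.get(key, [])      # may return several concurrent versions

    store = {}
    put(store, "cart", ["book"], coordinator="A")
    v, ctx = get(store, "cart")[0]
    put(store, "cart", ["book", "pen"], coordinator="B", context=ctx)   # causal update
    put(store, "cart", ["lamp"], coordinator="C")                       # concurrent write
    print(get(store, "cart"))          # two siblings: the client must reconcile them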

System Maintenance
2007
Staged deployment in mirage, an integrated software upgrade testing and distribution system

Despite major advances in the engineering of maintainable and robust software over the years, upgrading software remains a primitive and error-prone activity. In this paper, we argue that several problems with upgrading software are caused by a poor integration between upgrade deployment, user-machine testing, and problem reporting. To support this argument, we present a characterization of software upgrades resulting from a survey we conducted of 50 system administrators. Motivated by the survey results, we present Mirage, a distributed framework for integrating upgrade deployment, user-machine testing, and problem reporting into the overall upgrade development process. Our evaluation focuses on the most novel aspect of Mirage, namely its staged upgrade deployment based on the clustering of user machines according to their environments and configurations. Our results suggest that Mirage's staged deployment is effective for real upgrade problems.

2007
AutoBash: improving configuration management with operating system causality analysis

AutoBash is a set of interactive tools that helps users and system administrators manage configurations. AutoBash leverages causal tracking support implemented within our modified Linux kernel to understand the inputs (causal dependencies) and outputs (causal effects) of configuration actions. It uses OS-level speculative execution to try possible actions, examine their effects, and roll them back when necessary. AutoBash automates many of the tedious parts of trying to fix a misconfiguration, including searching through possible solutions, testing whether a particular solution fixes a problem, and undoing changes to persistent and transient state when a solution fails. Our results show that AutoBash correctly identifies the solution to several CVS, gcc cross-compiler, and Apache configuration errors. We also show that causal analysis reduces AutoBash's search time by an average of 35% and solution verification time by an average of 70%.

Energy
2007
Integrating concurrency control and energy management in device drivers

Energy management is a critical concern in wireless sensornets. Despite its importance, sensor network operating systems today provide minimal energy management support, requiring applications to explicitly manage system power states. To address this problem, we present ICEM, a device driver architecture that enables simple, energy efficient wireless sensornet applications. The key insight behind ICEM is that the most valuable information an application can give the OS for energy management is its concurrency. Using ICEM, a low-rate sensing application requires only a single line of energy management code and has an efficiency within 1.6% of a hand-tuned implementation. ICEM's effectiveness questions the assumption that sensornet applications must be responsible for all power management and that sensornets cannot have a standardized OS with a simple API.

2007
VirtualPower: coordinated power management in virtualized enterprise systems

Power management has become increasingly necessary in large-scale datacenters to address costs and limitations in cooling or power delivery. This paper explores how to integrate power management mechanisms and policies with the virtualization technologies being actively deployed in these environments. The goals of the proposed VirtualPower approach to online power management are (i) to support the isolated and independent operation assumed by guest virtual machines (VMs) running on virtualized platforms and (ii) to make it possible to control and globally coordinate the effects of the diverse power management policies applied by these VMs to virtualized resources. To attain these goals, VirtualPower extends to guest VMs `soft' versions of the hardware power states for which their policies are designed. The resulting technical challenge is to appropriately map VM-level updates made to soft power states to actual changes in the states or in the allocation of underlying virtualized hardware. An implementation of VirtualPower Management (VPM) for the Xen hypervisor addresses this challenge by provision of multiple system-level abstractions including VPM states, channels, mechanisms, and rules. Experimental evaluations on modern multicore platforms highlight resulting improvements in online power management capabilities, including minimization of power consumption with little or no performance penalties and the ability to throttle power consumption while still meeting application requirements. Finally, coordination of online methods for server consolidation with VPM management techniques in heterogeneous server systems is shown to provide up to 34% improvements in power consumption.

Storage
2007
DejaView: a personal virtual computer recorder

As users interact with the world and their peers through their computers, it is becoming important to archive and later search the information that they have viewed. We present DejaView, a personal virtual computer recorder that provides a complete record of a desktop computing experience that a user can playback, browse, search, and revive seamlessly. DejaView records visual output, checkpoints corresponding application and file system state, and captures displayed text with contextual information to index the record. A user can then browse and search the record for any visual information that has been displayed on the desktop, and revive and interact with the desktop computing state corresponding to any point in the record. DejaView combines display, operating system, and file system virtualization to provide its functionality transparently without any modifications to applications, window systems, or operating system kernels. We have implemented DejaView and evaluated its performance on real-world desktop applications. Our results demonstrate that DejaView can provide continuous low-overhead recording without any user noticeable performance degradation, and allows browsing, search and playback of records fast enough for interactive use.

2007
Improving file system reliability with I/O shepherding

We introduce a new reliability infrastructure for file systems called I/O shepherding. I/O shepherding allows a file system developer to craft nuanced reliability policies to detect and recover from a wide range of storage system failures. We incorporate shepherding into the Linux ext3 file system through a set of changes to the consistency management subsystem, layout engine, disk scheduler, and buffer cache. The resulting file system, CrookFS, enables a broad class of policies to be easily and correctly specified. We implement numerous policies, incorporating data protection techniques such as retry, parity, mirrors, checksums, sanity checks, and data structure repairs; even complex policies can be implemented in less than 100 lines of code, confirming the power and simplicity of the shepherding framework. We also demonstrate that shepherding is properly integrated, adding less than 5% overhead to the I/O path.

2007
Generalized file system dependencies

Reliable storage systems depend in part on "write-before" relationships where some changes to stable storage are delayed until other changes commit. A journaled file system, for example, must commit a journal transaction before applying that transaction's changes, and soft updates and other consistency enforcement mechanisms have similar constraints, implemented in each case in system-dependent ways. We present a general abstraction, the patch, that makes write-before relationships explicit and file system agnostic. A patch-based file system implementation expresses dependencies among writes, leaving lower system layers to determine write orders that satisfy those dependencies. Storage system modules can examine and modify the dependency structure, and generalized file system dependencies are naturally exportable to user level. Our patch-based storage system, Featherstitch, includes several important optimizations that reduce patch overheads by orders of magnitude. Our ext2 prototype runs in the Linux kernel and supports asynchronous writes, soft updates-like dependencies, and journaling. It outperforms similarly reliable ext2 and ext3 configurations on some, but not all, benchmarks. It also supports unusual configurations, such as correct dependency enforcement within a loopback file system, and lets applications define consistency requirements without micromanaging how those requirements are satisfied.
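
A small Python sketch of the patch abstraction: each patch records the patches that must reach disk first, and a lower layer may pick any write order consistent with those dependencies (a plain topological order below). The journaling example is illustrative, not Featherstitch code.

    from graphlib import TopologicalSorter

    class Patch:
        def __init__(self, name, block, depends_on=()):
            self.name, self.block, self.depends_on = name, block, tuple(depends_on)

    # journaled update: the commit record must not hit disk before the journal data,
    # and the in-place inode update must not hit disk before the commit record
    journal_data = Patch("journal_data", block=100)
    commit       = Patch("commit_record", block=101, depends_on=[journal_data])
    inode        = Patch("inode_update",  block=12,  depends_on=[commit])

    ts = TopologicalSorter({p: p.depends_on for p in [journal_data, commit, inode]})
    for p in ts.static_order():                    # one legal write-before order
        print("write block %d (%s)" % (p.block, p.name))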

Operating System Security
2007
Information flow control for standard OS abstractions

Decentralized Information Flow Control (DIFC) is an approach to security that allows application writers to control how data flows between the pieces of an application and the outside world. As applied to privacy, DIFC allows untrusted software to compute with private data while trusted security code controls the release of that data. As applied to integrity, DIFC allows trusted code to protect untrusted software from unexpected malicious inputs. In either case, only bugs in the trusted code, which tends to be small and isolated, can lead to security violations.

We present Flume, a new DIFC model that applies at the granularity of operating system processes and standard OS abstractions (e.g., pipes and file descriptors). Flume was designed for simplicity of mechanism, to ease DIFC's use in existing applications, and to allow safe interaction between conventional and DIFC-aware processes. Flume runs as a user-level reference monitor on Linux. A process confined by Flume cannot perform most system calls directly; instead, an interposition layer replaces system calls with IPC to the reference monitor, which enforces data flow policies and performs safe operations on the process's behalf. We ported a complex web application (MoinMoin Wiki) to Flume, changing only 2% of the original code. Performance measurements show a 43% slowdown on read workloads and a 34% slowdown on write workloads, which are mostly due to Flume's user-level implementation.
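The core flow check behind a DIFC model of this kind can be stated in a few lines. The sketch below is a deliberate simplification of Flume's label model (real Flume also tracks integrity labels and per-tag capabilities that permit controlled declassification): data may flow from one process to another only if the receiver's secrecy label dominates the sender's.

    # Simplified DIFC flow check: data may move only toward equal-or-higher secrecy.
    def can_flow(sender_secrecy, receiver_secrecy):
        return sender_secrecy <= receiver_secrecy      # subset test on sets of secrecy tags

    worker = {"alice_private"}       # confined process that has read Alice's data
    auditor = {"alice_private", "audit"}
    network = set()                  # the unlabeled outside world

    assert can_flow(worker, auditor)         # flow to a more-restricted process is allowed
    assert not can_flow(worker, network)     # exporting Alice's data to the network is blocked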

2007
SecVisor: a tiny hypervisor to provide lifetime kernel code integrity for commodity OSes

We propose SecVisor, a tiny hypervisor that ensures code integrity for commodity OS kernels. In particular, SecVisor ensures that only user-approved code can execute in kernel mode over the entire system lifetime. This protects the kernel against code injection attacks, such as kernel rootkits. SecVisor can achieve this property even against an attacker who controls everything but the CPU, the memory controller, and system memory chips. Further, SecVisor can even defend against attackers with knowledge of zero-day kernel exploits.

Our goal is to make SecVisor amenable to formal verification and manual audit, thereby making it possible to rule out known classes of vulnerabilities. To this end, SecVisor offers a small code size and a small external interface. We rely on memory virtualization to build SecVisor and implement two versions, one using software memory virtualization and the other using CPU-supported memory virtualization. The code sizes of the runtime portions of these versions are 1739 and 1112 lines, respectively. The size of the external interface for both versions of SecVisor is 2 hypercalls. It is easy to port OS kernels to SecVisor. We port the Linux kernel version 2.6.20 by adding 12 lines and deleting 81 lines, out of a total of approximately 4.3 million lines of code in the kernel.

2007
Secure virtual architecture: a safe execution environment for commodity operating systems

This paper describes an efficient and robust approach to provide a safe execution environment for an entire operating system, such as Linux, and all its applications. The approach, which we call Secure Virtual Architecture (SVA), defines a virtual, low-level, typed instruction set suitable for executing all code on a system, including kernel and application code. SVA code is translated for execution by a virtual machine transparently, offline or online. SVA aims to enforce fine-grained (object level) memory safety, control-flow integrity, type safety for a subset of objects, and sound analysis. A virtual machine implementing SVA achieves these goals by using a novel approach that exploits properties of existing memory pools in the kernel and by preserving the kernel's explicit control over memory, including custom allocators and explicit deallocation. Furthermore, the safety properties can be encoded compactly as extensions to the SVA type system, allowing the (complex) safety checking compiler to be outside the trusted computing base. SVA also defines a set of OS interface operations that abstract all privileged hardware instructions, allowing the virtual machine to monitor all privileged operations and control the physical resources on a given hardware platform. We have ported the Linux kernel to SVA, treating it as a new architecture, and made only minimal code changes (less than 300 lines of code) to the machine-independent parts of the kernel and device drivers. SVA is able to prevent 4 out of 5 memory safety exploits previously reported for the Linux 2.4.22 kernel for which exploit code is available, and would prevent the fifth one simply by compiling an additional kernel library.


20th SOSP — October 23-26, 2005 — Brighton, United Kingdom
Keynote Address
2005
Untitled
Integrity and Isolation
2005
Pioneer: verifying code integrity and enforcing untampered code execution on legacy systems
We propose a primitive, called Pioneer, as a first step towards verifiable code execution on untrusted legacy hosts. Pioneer does not require any hardware support such as secure co-processors or CPU-architecture extensions. We implement Pioneer on an Intel Pentium IV Xeon processor. Pioneer can be used as a basic building block to build security systems. We demonstrate this by building a kernel rootkit detector.
2005
Labels and event processes in the Asbestos operating system
Asbestos, a new prototype operating system, provides novel labeling and isolation mechanisms that help contain the effects of exploitable software flaws. Applications can express a wide range of policies with Asbestos's kernel-enforced label mechanism, including controls on inter-process communication and system-wide information flow. A new event process abstraction provides lightweight, isolated contexts within a single process, allowing the same process to act on behalf of multiple users while preventing it from leaking any single user's data to any other user. A Web server that uses Asbestos labels to isolate user data requires about 1.5 memory pages per user, demonstrating that additional security can come at an acceptable cost.
2005
Mondrix: memory isolation for Linux using Mondriaan memory protection
This paper presents the design and an evaluation of Mondrix, a version of the Linux kernel with Mondriaan Memory Protection (MMP). MMP is a combination of hardware and software that provides efficient fine-grained memory protection between multiple protection domains sharing a linear address space. Mondrix uses MMP to enforce isolation between kernel modules, which helps detect bugs, limits their damage, and improves kernel robustness and maintainability. During development, MMP exposed two kernel bugs in common, heavily-tested code, and during fault injection experiments, it prevented three of five file system corruptions. The Mondrix implementation demonstrates how MMP can bring memory isolation to modules that already exist in a large software application. It shows the benefit of isolation for robustness and error detection and prevention, while validating previous claims that the protection abstractions MMP offers are a good fit for software. This paper describes the design of the memory supervisor, the kernel module which implements permissions policy. We present an evaluation of Mondrix using full-system simulation of large kernel-intensive workloads. Experiments with several benchmarks where MMP was used extensively indicate that the additional space taken by the MMP data structures reduces the kernel's free memory by less than 10%, and that the kernel's runtime increases by less than 15% relative to an unmodified kernel.
Distributed Systems
2005
BAR fault tolerance for cooperative services
This paper describes a general approach to constructing cooperative services that span multiple administrative domains. In such environments, protocols must tolerate both Byzantine behaviors, in which broken, misconfigured, or malicious nodes arbitrarily deviate from their specification, and rational behaviors, in which selfish nodes deviate from their specification to increase their local benefit. The paper makes three contributions: (1) it introduces the BAR (Byzantine, Altruistic, Rational) model as a foundation for reasoning about cooperative services; (2) it proposes a general three-level architecture to reduce the complexity of building services under the BAR model; and (3) it describes an implementation of BAR-B, the first cooperative backup service to tolerate both Byzantine users and an unbounded number of rational users. At the core of BAR-B is an asynchronous replicated state machine that provides the customary safety and liveness guarantees despite nodes exhibiting both Byzantine and rational behaviors. Our prototype provides acceptable performance for our application: our BAR-tolerant state machine executes 15 requests per second, and our BAR-B backup service can back up 100MB of data in under 4 minutes.
2005
Fault-scalable Byzantine fault-tolerant services
A fault-scalable service can be configured to tolerate increasing numbers of faults without significant decreases in performance. The Query/Update (Q/U) protocol is a new tool that enables construction of fault-scalable Byzantine fault-tolerant services. The optimistic quorum-based nature of the Q/U protocol allows it to provide better throughput and fault-scalability than replicated state machines using agreement-based protocols. A prototype service built using the Q/U protocol outperforms the same service built using a popular replicated state machine implementation at all system sizes in experiments that permit an optimistic execution. Moreover, the performance of the Q/U protocol decreases by only 36% as the number of Byzantine faults tolerated increases from one to five, whereas the performance of the replicated state machine decreases by 83%.
2005
Implementing declarative overlays
Overlay networks are used today in a variety of distributed systems ranging from file-sharing and storage systems to communication infrastructures. However, designing, building and adapting these overlays to the intended application and the target environment is a difficult and time-consuming process. To ease the development and the deployment of such overlay networks we have implemented P2, a system that uses a declarative logic language to express overlay networks in a highly compact and reusable form. P2 can express a Narada-style mesh network in 16 rules, and the Chord structured overlay in only 47 rules. P2 directly parses and executes such specifications using a dataflow architecture to construct and maintain overlay networks. We describe the P2 approach, how our implementation works, and show by experiment its promising trade-off point between specification complexity and performance.
History and Context
2005
Detecting past and present intrusions through vulnerability-specific predicates
Most systems contain software with yet-to-be-discovered security vulnerabilities. When a vulnerability is disclosed, administrators face the grim reality that they have been running software which was open to attack. Sites that value availability may be forced to continue running this vulnerable software until the accompanying patch has been tested. Our goal is to improve security by detecting intrusions that occurred before the vulnerability was disclosed and by detecting and responding to intrusions that are attempted after the vulnerability is disclosed. We detect when a vulnerability is triggered by executing vulnerability-specific predicates as the system runs or replays. This paper describes the design, implementation and evaluation of a system that supports the construction and execution of these vulnerability-specific predicates. Our system, called IntroVirt, uses virtual-machine introspection to monitor the execution of application and operating system software. IntroVirt executes predicates over past execution periods by combining virtual-machine introspection with virtual-machine replay. IntroVirt eases the construction of powerful predicates by allowing predicates to run existing target code in the context of the target system, and it uses checkpoints so that predicates can execute target code without perturbing the state of the target system. IntroVirt allows predicates to refresh themselves automatically so they work in the presence of preemptions. We show that vulnerability-specific predicates can be written easily for a wide variety of real vulnerabilities, can detect and respond to intrusions over both the past and present time intervals, and add little overhead for most vulnerabilities.
2005
Capturing, indexing, clustering, and retrieving system history
We present a method for automatically extracting from a running system an indexable signature that distills the essential characteristic from a system state and that can be subjected to automated clustering and similarity-based retrieval to identify when an observed system state is similar to a previously-observed state. This allows operators to identify and quantify the frequency of recurrent problems, to leverage previous diagnostic efforts, and to establish whether problems seen at different installations of the same site are similar or distinct. We show that the naive approach to constructing these signatures based on simply recording the actual ``raw'' values of collected measurements is ineffective, leading us to a more sophisticated approach based on statistical modeling and inference. Our method requires only that the system's metric of merit (such as average transaction response time) as well as a collection of lower-level operational metrics be collected, as is done by existing commercial monitoring tools. Even if the traces have no annotations of prior diagnoses of observed incidents (as is typical), our technique successfully clusters system states corresponding to similar problems, allowing diagnosticians to identify recurring problems and to characterize the ``syndrome'' of a group of problems. We validate our approach on both synthetic traces and several weeks of production traces from a customer-facing geoplexed 24 x 7 system; in the latter case, our approach identified a recurring problem that had required extensive manual diagnosis, and also aided the operators in correcting a previous misdiagnosis of a different problem.
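The retrieval step can be illustrated with a toy nearest-neighbor search over metric vectors. Note that this sketch uses raw metric values purely for brevity; as the abstract points out, the actual signatures are derived from statistical modeling and inference rather than raw measurements.

    # Nearest-neighbor retrieval over system-state signatures (raw values used only for brevity).
    import math

    def distance(a, b):
        keys = set(a) | set(b)
        return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

    def most_similar(current, history, k=3):
        """Return the k past (label, signature) pairs closest to the current signature."""
        return sorted(history, key=lambda item: distance(current, item[1]))[:k]

    history = [
        ("mon 09:00 incident", {"cpu": 0.9, "db_latency": 0.8, "resp_time": 2.1}),
        ("tue 14:00 normal",   {"cpu": 0.2, "db_latency": 0.1, "resp_time": 0.3}),
    ]
    now = {"cpu": 0.85, "db_latency": 0.9, "resp_time": 2.4}
    print(most_similar(now, history, k=1))    # retrieves the Monday incident for comparison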
2005
Connections: using context to enhance file search
Connections is a file system search tool that combines traditional content-based search with context information gathered from user activity. By tracing file system calls, Connections can identify temporal relationships between files and use them to expand and reorder traditional content search results. Doing so improves both recall (reducing false-negatives) and precision (reducing false-positives). For example, Connections improves the average recall (from 13% to 22%) and precision (from 23% to 29%) on the first ten results. When averaged across all recall levels, Connections improves precision from 17% to 28%. Connections provides these benefits with only modest increases in average query time (2 seconds), indexing time (23 seconds daily), and index size (under 1% of the user's data set).
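A minimal sketch of the context-expansion idea, assuming a hypothetical trace format of (timestamp, path) accesses: content hits are expanded with files accessed within a small time window of them. The real system builds and weights a relation graph from traced system calls rather than using a fixed window.

    # Expand content-search hits with temporally co-accessed files (fixed window for brevity).
    def temporally_related(trace, seeds, window=30.0):
        """trace: list of (timestamp, path) accesses; return files accessed near any seed hit."""
        seed_times = [t for t, path in trace if path in seeds]
        related = set()
        for t, path in trace:
            if any(abs(t - st) <= window for st in seed_times):
                related.add(path)
        return related

    trace = [(100.0, "thesis.tex"), (105.0, "results.csv"), (900.0, "budget.xls")]
    content_hits = {"thesis.tex"}                        # what keyword search alone returns
    print(content_hits | temporally_related(trace, content_hits))
    # {'thesis.tex', 'results.csv'} -- the data file co-accessed with the document is added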
Containment
2005
Vigilante: end-to-end containment of internet worms
Worm containment must be automatic because worms can spread too fast for humans to respond. Recent work has proposed network-level techniques to automate worm containment; these techniques have limitations because there is no information about the vulnerabilities exploited by worms at the network level. We propose Vigilante, a new end-to-end approach to contain worms automatically that addresses these limitations. Vigilante relies on collaborative worm detection at end hosts, but does not require hosts to trust each other. Hosts run instrumented software to detect worms and broadcast self-certifying alerts (SCAs) upon worm detection. SCAs are proofs of vulnerability that can be inexpensively verified by any vulnerable host. When hosts receive an SCA, they generate filters that block infection by analysing the SCA-guided execution of the vulnerable software. We show that Vigilante can automatically contain fast-spreading worms that exploit unknown vulnerabilities without blocking innocuous traffic.
2005
Scalability, fidelity, and containment in the potemkin virtual honeyfarm
The rapid evolution of large-scale worms, viruses and bot-nets has made Internet malware a pressing concern. Such infections are at the root of modern scourges including DDoS extortion, on-line identity theft, SPAM, phishing, and piracy. However, the most widely used tools for gathering intelligence on new malware — network honeypots — have forced investigators to choose between monitoring activity at a large scale or capturing behavior with high fidelity. In this paper, we describe an approach to minimize this tension and improve honeypot scalability by up to six orders of magnitude while still closely emulating the execution behavior of individual Internet hosts. We have built a prototype honeyfarm system, called Potemkin, that exploits virtual machines, aggressive memory sharing, and late binding of resources to achieve this goal. While still an immature implementation, Potemkin has emulated over 64,000 Internet honeypots in live test runs, using only a handful of physical servers.
2005
The Taser intrusion recovery system
Recovery from intrusions is typically a very time-consuming operation in current systems. At a time when the cost of human resources dominates the cost of computing resources, we argue that next generation systems should be built with automated intrusion recovery as a primary goal. In this paper, we describe the design of Taser, a system that helps in selectively recovering legitimate file-system data after an attack or local damage occurs. Taser reverts tainted, i.e. attack-dependent, file-system operations but preserves legitimate operations. This process is difficult for two reasons. First, the set of tainted operations is not known precisely. Second, the recovery process can cause conflicts when legitimate operations depend on tainted operations. Taser provides several analysis policies that aid in determining the set of tainted operations. To handle conflicts, Taser uses automated resolution policies that isolate the tainted operations. Our evaluation shows that Taser is effective in recovering from a wide range of intrusions as well as damage caused by system management errors.
Panel
2005
Peer-to-peer: still useless?
Filesystems
2005
Hibernator: helping disk arrays sleep through the winter
Energy consumption has become an important issue in high-end data centers, and disk arrays are one of the largest energy consumers within them. Although several attempts have been made to improve disk array energy management, the existing solutions either provide little energy savings or significantly degrade performance for data center workloads. Our solution, Hibernator, is a disk array energy management system that provides improved energy savings while meeting performance goals. Hibernator combines a number of techniques to achieve this: the use of disks that can spin at different speeds, a coarse-grained approach for dynamically deciding which disks should spin at which speeds, efficient ways to migrate the right data to an appropriate-speed disk automatically, and automatic performance boosts if there is a risk that performance goals might not be met due to disk energy management. In this paper, we describe the Hibernator design, and present evaluations of it using both trace-driven simulations and a hybrid system comprised of a real database server (IBM DB2) and an emulated storage server with multi-speed disks. Our file-system and on-line transaction processing (OLTP) simulation results show that Hibernator can provide up to 65% energy savings while continuing to satisfy performance goals (6.5-26 times better than previous solutions). Our OLTP emulated system results show that Hibernator can save more energy (29%) than previous solutions, while still providing an OLTP transaction rate comparable to a RAID5 array with no energy management.
2005
Speculative execution in a distributed file system
Speculator provides Linux kernel support for speculative execution. It allows multiple processes to share speculative state by tracking causal dependencies propagated through inter-process communication. It guarantees correct execution by preventing speculative processes from externalizing output, e.g., sending a network message or writing to the screen, until the speculations on which that output depends have proven to be correct. Speculator improves the performance of distributed file systems by masking I/O latency and increasing I/O throughput. Rather than block during a remote operation, a file system predicts the operation's result, then uses Speculator to checkpoint the state of the calling process and speculatively continue its execution based on the predicted result. If the prediction is correct, the checkpoint is discarded; if it is incorrect, the calling process is restored to the checkpoint, and the operation is retried. We have modified the client, server, and network protocol of two distributed file systems to use Speculator. For PostMark and Andrew-style benchmarks, speculative execution results in a factor of 2 performance improvement for NFS over local-area networks and an order of magnitude improvement over wide-area networks. For the same benchmarks, Speculator enables the Blue File System to provide the consistency of single-copy file semantics and the safety of synchronous I/O, yet still outperform current distributed file systems with weaker consistency and safety.
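The control flow of speculation can be caricatured at user level as checkpoint, predict, continue, then verify and roll back on a misprediction. The sketch below is only an analogy; Speculator itself checkpoints real processes inside the kernel and tracks causal dependencies propagated between them.

    # User-level caricature of speculate/verify/rollback; the real system checkpoints
    # processes in the kernel and propagates dependencies through IPC.
    import copy

    def speculate(state, predict, continue_with, verify):
        checkpoint = copy.deepcopy(state)           # stand-in for a process checkpoint
        predicted = predict()                       # e.g. "the cached file attributes are still valid"
        speculative_state = continue_with(state, predicted)
        actual = verify()                           # the remote server's real answer, arriving later
        if actual == predicted:
            return speculative_state                # speculation correct: keep the work already done
        return continue_with(checkpoint, actual)    # wrong: restore the checkpoint and redo with the truth

    state = {"opens": 0}
    result = speculate(
        state,
        predict=lambda: "cache-valid",
        continue_with=lambda s, answer: {**s, "opens": s["opens"] + 1, "last": answer},
        verify=lambda: "cache-valid",
    )
    print(result)    # {'opens': 1, 'last': 'cache-valid'}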
2005
IRON file systems
Commodity file systems trust disks to either work or fail completely, yet modern disks exhibit more complex failure modes. We suggest a new fail-partial failure model for disks, which incorporates realistic localized faults such as latent sector errors and block corruption. We then develop and apply a novel failure-policy fingerprinting framework, to investigate how commodity file systems react to a range of more realistic disk failures. We classify their failure policies in a new taxonomy that measures their Internal RObustNess (IRON), which includes both failure detection and recovery techniques. We show that commodity file system failure policies are often inconsistent, sometimes buggy, and generally inadequate in their ability to recover from partial disk failures. Finally, we design, implement, and evaluate a prototype IRON file system, Linux ixt3, showing that techniques such as in-disk checksumming, replication, and parity greatly enhance file system robustness while incurring minimal time and space overheads.
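Two of the IRON techniques, per-block checksums for detection and a replica for recovery, can be illustrated with a toy block store (the structure below is hypothetical; the real work integrates these mechanisms into the ext3 on-disk format).

    # Toy block store with per-block checksums (detection) and a replica (recovery).
    import hashlib

    def store(block_store, blockno, data):
        checksum = hashlib.sha256(data).hexdigest()
        block_store[blockno] = (data, checksum)
        block_store[("replica", blockno)] = (data, checksum)

    def read(block_store, blockno):
        for key in (blockno, ("replica", blockno)):
            data, checksum = block_store[key]
            if hashlib.sha256(data).hexdigest() == checksum:
                return data                            # first copy whose checksum verifies
        raise IOError("both copies corrupted: partial disk failure is unrecoverable here")

    disk = {}
    store(disk, 7, b"superblock contents")
    disk[7] = (b"garbled by a latent sector error", disk[7][1])    # silent corruption of one copy
    assert read(disk, 7) == b"superblock contents"                 # detected and recovered via the replica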
Bugs
2005
RaceTrack: efficient detection of data race conditions via adaptive tracking
Bugs due to data races in multithreaded programs often exhibit non-deterministic symptoms and are notoriously difficult to find. This paper describes RaceTrack, a dynamic race detection tool that tracks the actions of a program and reports a warning whenever a suspicious pattern of activity has been observed. RaceTrack uses a novel hybrid detection algorithm and employs an adaptive approach that automatically directs more effort to areas that are more suspicious, thus providing more accurate warnings for much less overhead. A post-processing step correlates warnings and ranks code segments based on how strongly they are implicated in potential data races. We implemented RaceTrack inside the virtual machine of Microsoft's Common Language Runtime (product version v1.1.4322) and monitored several major, real-world applications directly out-of-the-box, without any modification. Adaptive tracking resulted in a slowdown ratio of about 3x on memory-intensive programs and typically much less than 2x on other programs, and a memory ratio of typically less than 1.2x. Several serious data race bugs were revealed, some previously unknown.
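One classic ingredient that hybrid detectors of this kind build on is the lockset check: warn when no single lock has protected every access to a shared variable. The sketch below shows only that ingredient and is not RaceTrack's algorithm, which also tracks happens-before information and adapts its bookkeeping per object.

    # Minimal Eraser-style lockset check; hybrid detectors combine this kind of test
    # with happens-before information and adaptive per-object bookkeeping.
    candidate_locks = {}      # shared variable -> locks held on every access so far

    def on_access(var, locks_held):
        if var not in candidate_locks:
            candidate_locks[var] = set(locks_held)
        else:
            candidate_locks[var] &= locks_held
        if not candidate_locks[var]:
            print("possible race on", var, "- no common lock across accesses")

    on_access("counter", {"L1"})      # thread A holds lock L1
    on_access("counter", {"L2"})      # thread B holds only L2: warning is reported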
2005
Rx: treating bugs as allergies—a safe method to survive software failures
Many applications demand availability. Unfortunately, software failures greatly reduce system availability. Prior work on surviving software failures suffers from one or more of the following limitations: required application restructuring, inability to address deterministic software bugs, unsafe speculation on program execution, and long recovery time. This paper proposes an innovative safe technique, called Rx, which can quickly recover programs from many types of software bugs, both deterministic and non-deterministic. Our idea, inspired from allergy treatment in real life, is to roll back the program to a recent checkpoint upon a software failure, and then to re-execute the program in a modified environment. We base this idea on the observation that many bugs are correlated with the execution environment, and therefore can be avoided by removing the "allergen" from the environment. Rx requires few to no modifications to applications and provides programmers with additional feedback for bug diagnosis. We have implemented Rx on Linux. Our experiments with four server applications that contain six bugs of various types show that Rx can survive all six software failures and provide transparent fast recovery within 0.017-0.16 seconds, 21-53 times faster than the whole program restart approach for all but one case (CVS). In contrast, the two tested alternatives, a whole program restart approach and a simple rollback and re-execution without environmental changes, cannot successfully recover the three servers (Squid, Apache, and CVS) that contain deterministic bugs, and have only a 40% recovery rate for the server (MySQL) that contains a non-deterministic concurrency bug. Additionally, Rx's checkpointing system is lightweight, imposing small time and space overheads.
Optimization
2005
Idletime scheduling with preemption intervals
This paper presents the idletime scheduler, a generic, kernel-level mechanism for using idle resource capacity in the background without slowing down concurrent foreground use. Many operating systems fail to support transparent background use, and concurrent foreground performance can decrease by 50% or more. The idletime scheduler minimizes this interference by partially relaxing the work conservation principle during preemption intervals, during which it serves no background requests even if the resource is idle. The length of preemption intervals is a controlling parameter of the scheduler: short intervals aggressively utilize idle capacity; long intervals reduce the impact of background use on foreground performance. Unlike existing approaches to establish prioritized resource use, idletime scheduling requires only localized modifications to a limited number of system schedulers. In experiments, a FreeBSD implementation for idletime network scheduling maintains over 90% of foreground TCP throughput, while allowing concurrent, high-rate UDP background flows to consume up to 80% of remaining link capacity. A FreeBSD disk scheduler implementation maintains 80% of foreground read performance, while enabling concurrent background operations to reach 70% throughput.
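The scheduling rule itself is small. The sketch below (with illustrative parameter values, not the paper's) serves a background request only when no foreground work is queued and the resource has already been idle for a full preemption interval.

    # Decision rule sketch: background work runs only after a full preemption interval of idleness.
    def next_request(foreground, background, idle_since, now, preemption_interval=0.05):
        """Pick the next request for an otherwise work-conserving resource scheduler."""
        if foreground:
            return foreground.pop(0)                  # foreground always has priority
        if background and (now - idle_since) >= preemption_interval:
            return background.pop(0)                  # idle long enough: use spare capacity
        return None                                   # stay idle, relaxing work conservation

    # Right after foreground activity the resource stays idle even though background work is queued:
    assert next_request([], ["bg-read"], idle_since=10.00, now=10.01) is None
    assert next_request([], ["bg-read"], idle_since=10.00, now=10.06) == "bg-read"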
2005
FS2: dynamic data replication in free disk space for improving disk performance and energy consumption
Disk performance is increasingly limited by its head positioning latencies, i.e., seek time and rotational delay. To reduce the head positioning latencies, we propose a novel technique that dynamically places copies of data in the file system's free blocks according to the disk access patterns observed at runtime. As one or more replicas can now be accessed in addition to their original data block, choosing the "nearest" replica that provides the fastest access can significantly improve performance for disk I/O operations. We implemented and evaluated a prototype based on the popular Ext2 file system. In our prototype, since the file system layout is modified only by using the free/unused disk space (hence the name Free Space File System, or FS2), users are completely oblivious to how the file system layout is modified in the background; they will only notice performance improvements over time. For a wide range of workloads running under Linux, FS2 is shown to reduce disk access time by 41-68% (as a result of a 37-78% shorter seek time and a 31-68% shorter rotational delay), yielding a 16-34% overall user-perceived performance improvement. The reduced disk access time also leads to a 40-71% energy savings per access.
2005
THINC: a virtual display architecture for thin-client computing
Rapid improvements in network bandwidth, cost, and ubiquity combined with the security hazards and high total cost of ownership of personal computers have created a growing market for thin-client computing. We introduce THINC, a virtual display architecture for high-performance thin-client computing in both LAN and WAN environments. THINC virtualizes the display at the device driver interface to transparently intercept application display commands and translate them into a few simple low-level commands that can be easily supported by widely used client hardware. THINC's translation mechanism efficiently leverages display semantic information through novel optimizations such as offscreen drawing awareness, native video support, and server-side screen scaling. This is integrated with an update delivery architecture that uses shortest command first scheduling and non-blocking operation. THINC leverages existing display system functionality and works seamlessly with unmodified applications, window systems, and operating systems. We have implemented THINC in an X/Linux environment and compared its performance against widely used commercial approaches, including Citrix MetaFrame, Microsoft RDP, GoToMyPC, X, NX, VNC, and Sun Ray. Our experimental results on web and audio/video applications demonstrate that THINC can provide up to 4.8 times faster web browsing performance and two orders of magnitude better audio/video performance. THINC is the only thin client capable of transparently playing full-screen video and audio at full frame rate in both LAN and WAN environments. Our results also show for the first time that thin clients can even provide good performance using remote clients located in other countries around the world.

19th SOSP — October 19-22, 2003 — Bolton Landing, New York, USA
2003
Front Matter
Safely Executing Untrusted Code
2003
Upgrading transport protocols using untrusted mobile code
In this paper, we present STP, a system in which communicating end hosts use untrusted mobile code to remotely upgrade each other with the transport protocols that they use to communicate. New transport protocols are written in a type-safe version of C, distributed out-of-band, and run in-kernel. Communicating peers select a transport protocol to use as part of a TCP-like connection setup handshake that is backwards-compatible with TCP and incurs minimum connection setup latency. New transports can be invoked by unmodified applications. By providing a late binding of protocols to hosts, STP removes many of the delays and constraints that are otherwise commonplace when upgrading the transport protocols deployed on the Internet. STP is simultaneously able to provide a high level of security and performance. It allows each host to protect itself from untrusted transport code and to ensure that this code does not harm other network users by sending significantly faster than a compliant TCP. It runs untrusted code with low enough overhead that new transport protocols can sustain near gigabit rates on commodity hardware. We believe that these properties, plus compatibility with existing applications and transports, complete the features that are needed to make STP useful in practice.
2003
Model-carrying code: a practical approach for safe execution of untrusted applications
This paper presents a new approach called model-carrying code (MCC) for safe execution of untrusted code. At the heart of MCC is the idea that untrusted code comes equipped with a concise high-level model of its security-relevant behavior. This model helps bridge the gap between high-level security policies and low-level binary code, thereby enabling analyses which would otherwise be impractical. For instance, users can use a fully automated verification procedure to determine if the code satisfies their security policies. Alternatively, an automated procedure can sift through a catalog of acceptable policies to identify one that is compatible with the model. Once a suitable policy is selected, MCC guarantees that the policy will not be violated by the code. Unlike previous approaches, the MCC framework enables code producers and consumers to collaborate in order to achieve safety. Moreover, it provides support for policy selection as well as enforcement. Finally, MCC makes no assumptions regarding the inherent risks associated with untrusted code. It simply provides the tools that enable a consumer to make informed decisions about the risk that he/she is willing to tolerate so as to benefit from the functionality offered by an untrusted application.
File and Storage Systems
2003
The Google file system
We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.

While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points.

The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients.

In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.
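A highly simplified view of the read path, assuming the paper's 64 MB chunks: the master maps a (file, chunk index) pair to replica locations, and the client then reads the byte range directly from a chunkserver. All names below are hypothetical, and the real protocol adds leases, client-side location caching, and replica selection.

    # Hypothetical sketch of a GFS-style read path (names and structures invented here).
    CHUNK_SIZE = 64 * 2**20      # the paper's 64 MB chunk size

    class Master:
        def __init__(self, chunk_table):
            # (filename, chunk index) -> list of chunkservers holding a replica
            self.chunk_table = chunk_table

        def lookup(self, filename, offset):
            index = offset // CHUNK_SIZE
            return index, self.chunk_table[(filename, index)]

    def read(master, chunkservers, filename, offset, length):
        index, replicas = master.lookup(filename, offset)    # small metadata request to the master
        server = replicas[0]                                  # real clients pick a nearby replica
        chunk = chunkservers[server][(filename, index)]
        start = offset % CHUNK_SIZE
        return chunk[start:start + length]                    # bulk data bypasses the master

    master = Master({("web/crawl-00", 0): ["cs-a", "cs-b", "cs-c"]})
    chunkservers = {"cs-a": {("web/crawl-00", 0): b"<html>...crawled page...</html>"}}
    print(read(master, chunkservers, "web/crawl-00", offset=6, length=3))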

2003
Preserving peer replicas by rate-limited sampled voting
The LOCKSS project has developed and deployed in a world-wide test a peer-to-peer system for preserving access to journals and other archival information published on the Web. It consists of a large number of independent, low-cost, persistent web caches that cooperate to detect and repair damage to their content by voting in "opinion polls." Based on this experience, we present a design for and simulations of a novel protocol for voting in systems of this kind. It incorporates rate limitation and intrusion detection to ensure that even some very powerful adversaries attacking over many years have only a small probability of causing irrecoverable damage before being detected.
2003
Decentralized user authentication in a global file system
The challenge for user authentication in a global file system is allowing people to grant access to specific users and groups in remote administrative domains, without assuming any kind of pre-existing administrative relationship. The traditional approach to user authentication across administrative domains is for users to prove their identities through a chain of certificates. Certificates allow for general forms of delegation, but they often require more infrastructure than is necessary to support a network file system. This paper introduces an approach without certificates. Local authentication servers pre-fetch and cache remote user and group definitions from remote authentication servers. During a file access, an authentication server can establish identities for users based just on local information. This approach is particularly well-suited to file systems, and it provides a simple and intuitive interface that is similar to those found in local access control mechanisms. An implementation of the authentication server and a file server supporting access control lists demonstrate the viability of this design in the context of the Self-certifying File System (SFS). Experiments demonstrate that the authentication server can scale to groups with tens of thousands of members.
Probing the Black Box
2003
Performance debugging for distributed systems of black boxes
Many interesting large-scale systems are distributed systems of multiple communicating components. Such systems can be very hard to debug, especially when they exhibit poor performance. The problem becomes much harder when systems are composed of "black-box" components: software from many different (perhaps competing) vendors, usually without source code available. Typical solutions-provider employees are not always skilled or experienced enough to debug these systems efficiently. Our goal is to design tools that enable modestly-skilled programmers (and experts, too) to isolate performance bottlenecks in distributed systems composed of black-box nodes. We approach this problem by obtaining message-level traces of system activity, as passively as possible and without any knowledge of node internals or message semantics. We have developed two very different algorithms for inferring the dominant causal paths through a distributed system from these traces. One uses timing information from RPC messages to infer inter-call causality; the other uses signal-processing techniques. Our algorithms can ascribe delay to specific nodes on specific causal paths. Unlike previous approaches to similar problems, our approach requires no modifications to applications, middleware, or messages.
2003
Transforming policies into mechanisms with infokernel
We describe an evolutionary path that allows operating systems to be used in a more flexible and appropriate manner by higher-level services. An infokernel exposes key pieces of information about its algorithms and internal state; thus, its default policies become mechanisms, which can be controlled from user-level. We have implemented two prototype infokernels based on the Linux 2.4 and NetBSD kernels, called infoLinux and infoBSD, respectively. The infokernels export key abstractions as well as basic information primitives. Using infoLinux, we have implemented four case studies showing that policies within Linux can be manipulated outside of the kernel. Specifically, we show that the default file cache replacement algorithm, file layout policy, disk scheduling algorithm, and TCP congestion control algorithm can each be turned into base mechanisms. For each case study, we have found that infokernel abstractions can be implemented with little code and that the overhead and accuracy of synthesizing policies at user-level are acceptable.
2003
User-level internet path diagnosis
Diagnosing faults in the Internet is arduous and time-consuming, in part because the network is composed of diverse components spread across many administrative domains. We consider an extreme form of this problem: can end users, with no special privileges, identify and pinpoint faults inside the network that degrade the performance of their applications? To answer this question, we present both an architecture for user-level Internet path diagnosis and a practical tool to diagnose paths in the current Internet. Our architecture requires only a small amount of network support, yet it is nearly as complete as analyzing a packet trace collected at all routers along the path. Our tool, tulip, diagnoses reordering, loss and significant queuing events by leveraging well deployed but little exploited router features that approximate our architecture. Tulip can locate points of reordering and loss to within three hops and queuing to within four hops on most paths that we measured. This granularity is comparable to that of a hypothetical network tomography tool that uses 65 diverse hosts to localize faults on a given path. We conclude by proposing several simple changes to the Internet to further improve its diagnostic capabilities.
Scheduling and Resource Allocation
2003
Samsara: honor among thieves in peer-to-peer storage
Peer-to-peer storage systems assume that their users consume resources in proportion to their contribution. Unfortunately, users are unlikely to do this without some enforcement mechanism. Prior solutions to this problem require centralized infrastructure, constraints on data placement, or ongoing administrative costs. All of these run counter to the design philosophy of peer-to-peer systems. Samsara enforces fairness in peer-to-peer storage systems without requiring trusted third parties, symmetric storage relationships, monetary payment, or certified identities. Each peer that requests storage of another must agree to hold a claim in return—a placeholder that accounts for available space. After an exchange, each partner checks the other to ensure faithfulness. Samsara punishes unresponsive nodes probabilistically. Because objects are replicated, nodes with transient failures are unlikely to suffer data loss, unlike those that are dishonest or chronically unavailable. Claim storage overhead can be reduced when necessary by forwarding among chains of nodes, and eliminated when cycles are created. Forwarding chains increase the risk of exposure to failure, but such risk is modest under reasonable assumptions of utilization and simultaneous, persistent failure.
2003
SHARP: an architecture for secure resource peering
This paper presents Sharp, a framework for secure distributed resource management in an Internet-scale computing infrastructure. The cornerstone of Sharp is a construct to represent cryptographically protected resource claims—promises or rights to control resources for designated time intervals—together with secure mechanisms to subdivide and delegate claims across a network of resource managers. These mechanisms enable flexible resource peering: sites may trade their resources with peering partners or contribute them to a federation according to local policies. A separation of claims into tickets and leases allows coordinated resource management across the system while preserving site autonomy and local control over resources. Sharp also introduces mechanisms for controlled, accountable oversubscription of resource claims as a fundamental tool for dependable, efficient resource management. We present experimental results from a Sharp prototype for PlanetLab, and illustrate its use with a decentralized barter economy for global PlanetLab resources. The results demonstrate the power and practicality of the architecture, and the effectiveness of oversubscription for protecting resource availability in the presence of failures.
2003
Energy-efficient soft real-time CPU scheduling for mobile multimedia systems
This paper presents GRACE-OS, an energy-efficient soft real-time CPU scheduler for mobile devices that primarily run multimedia applications. The major goal of GRACE-OS is to support application quality of service and save energy. To achieve this goal, GRACE-OS integrates dynamic voltage scaling into soft real-time scheduling and decides how fast to execute applications in addition to when and how long to execute them. GRACE-OS makes such scheduling decisions based on the probability distribution of application cycle demands, and obtains the demand distribution via online profiling and estimation. We have implemented GRACE-OS in the Linux kernel and evaluated it on an HP laptop with a variable-speed CPU and multimedia codecs. Our experimental results show that (1) the demand distribution of the studied codecs is stable or changes smoothly. This stability implies that it is feasible to perform stochastic scheduling and voltage scaling with low overhead; (2) GRACE-OS delivers soft performance guarantees by bounding the deadline miss ratio under application-specific requirements; and (3) GRACE-OS reduces CPU idle time and spends more busy time in lower-power speeds. Our measurement indicates that compared to deterministic scheduling and voltage scaling, GRACE-OS saves energy by 7% to 72% while delivering statistical performance guarantees.
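The speed-selection idea can be sketched as follows, under the simplifying assumption that one speed is chosen per job from its profiled demand distribution; GRACE-OS additionally adjusts speed within a job as it consumes cycles.

    # Sketch of picking a CPU speed from a profiled cycle-demand distribution.
    def pick_speed(demand_samples, deadline_s, speeds_hz, percentile=0.95):
        """Lowest speed whose execution of the percentile demand meets the deadline."""
        samples = sorted(demand_samples)
        demand = samples[min(len(samples) - 1, int(percentile * len(samples)))]
        for speed in sorted(speeds_hz):
            if demand / speed <= deadline_s:
                return speed
        return max(speeds_hz)            # even full speed may miss; the guarantee is statistical

    # Hypothetical MPEG decoder: per-frame cycle demands profiled online, 30 fps deadline.
    demands = [8e6, 9e6, 9.5e6, 12e6, 18e6]
    print(pick_speed(demands, deadline_s=1 / 30, speeds_hz=[300e6, 600e6, 1000e6]))   # 600 MHz suffices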
Virtual Machine Monitors
2003
Xen and the art of virtualization
Numerous systems have been designed which use virtualization to subdivide the ample resources of a modern computer. Some require specialized hardware, or cannot support commodity operating systems. Some target 100% binary compatibility at the expense of performance. Others sacrifice security or functionality for speed. Few offer resource isolation or performance guarantees; most provide only best-effort provisioning, risking denial of service. This paper presents Xen, an x86 virtual machine monitor which allows multiple commodity operating systems to share conventional hardware in a safe and resource managed fashion, but without sacrificing either performance or functionality. This is achieved by providing an idealized virtual machine abstraction to which operating systems such as Linux, BSD and Windows XP can be ported with minimal effort. Our design is targeted at hosting up to 100 virtual machine instances simultaneously on a modern server. The virtualization approach taken by Xen is extremely efficient: we allow operating systems such as Linux and Windows XP to be hosted simultaneously for a negligible performance overhead — at most a few percent compared with the unvirtualized case. We considerably outperform competing commercial and freely available solutions in a range of microbenchmarks and system-wide tests.
2003
Implementing an untrusted operating system on trusted hardware
Recently, there has been considerable interest in providing "trusted computing platforms" using hardware — TCPA and Palladium being the most publicly visible examples. In this paper we discuss our experience with building such a platform using a traditional time-sharing operating system executing on XOM, a processor architecture that provides copy protection and tamper-resistance functions. In XOM, only the processor is trusted; main memory and the operating system are not trusted. Our operating system (XOMOS) manages hardware resources for applications that don't trust it. This requires a division of responsibilities between the operating system and hardware that is unlike previous systems. We describe techniques for providing traditional operating system services in this context. Since an implementation of a XOM processor does not exist, we use SimOS to simulate the hardware. We modify IRIX 6.5, a commercially available operating system, to create XOMOS. We are then able to analyze the performance and implementation overheads of running an untrusted operating system on trusted hardware.
2003
Terra: a virtual machine-based platform for trusted computing
We present a flexible architecture for trusted computing, called Terra, that allows applications with a wide range of security requirements to run simultaneously on commodity hardware. Applications on Terra enjoy the semantics of running on a separate, dedicated, tamper-resistant hardware platform, while retaining the ability to run side-by-side with normal applications on a general-purpose computing platform. Terra achieves this synthesis by use of a trusted virtual machine monitor (TVMM) that partitions a tamper-resistant hardware platform into multiple, isolated virtual machines (VM), providing the appearance of multiple boxes on a single, general-purpose platform. To each VM, the TVMM provides the semantics of either an "open box," i.e. a general-purpose hardware platform like today's PCs and workstations, or a "closed box," an opaque special-purpose platform that protects the privacy and integrity of its contents like today's game consoles and cellular phones. The software stack in each VM can be tailored from the hardware interface up to meet the security requirements of its application(s). The hardware and TVMM can act as a trusted party to allow closed-box VMs to cryptographically identify the software they run, i.e. what is in the box, to remote parties. We explore the strengths and limitations of this architecture by describing our prototype implementation and several applications that we developed for it.
Making Operating Systems More Robust
2003
Improving the reliability of commodity operating systems
Despite decades of research in extensible operating system technology, extensions such as device drivers remain a significant cause of system failures. In Windows XP, for example, drivers account for 85% of recently reported failures. This paper describes Nooks, a reliability subsystem that seeks to greatly enhance OS reliability by isolating the OS from driver failures. The Nooks approach is practical: rather than guaranteeing complete fault tolerance through a new (and incompatible) OS or driver architecture, our goal is to prevent the vast majority of driver-caused crashes with little or no change to existing driver and system code. To achieve this, Nooks isolates drivers within lightweight protection domains inside the kernel address space, where hardware and software prevent them from corrupting the kernel. Nooks also tracks a driver's use of kernel resources to hasten automatic clean-up during recovery. To prove the viability of our approach, we implemented Nooks in the Linux operating system and used it to fault-isolate several device drivers. Our results show that Nooks offers a substantial increase in the reliability of operating systems, catching and quickly recovering from many faults that would otherwise crash the system. In a series of 2000 fault-injection tests, Nooks recovered automatically from 99% of the faults that caused Linux to crash. While Nooks was designed for drivers, our techniques generalize to other kernel extensions, as well. We demonstrate this by isolating a kernel-mode file system and an in-kernel Internet service. Overall, because Nooks supports existing C-language extensions, runs on a commodity operating system and hardware, and enables automated recovery, it represents a substantial step beyond the specialized architectures and type-safe languages required by previous efforts directed at safe extensibility.
2003
Backtracking intrusions
Analyzing intrusions today is an arduous, largely manual task because system administrators lack the information and tools needed to understand easily the sequence of steps that occurred in an attack. The goal of BackTracker is to identify automatically potential sequences of steps that occurred in an intrusion. Starting with a single detection point (e.g., a suspicious file), BackTracker identifies files and processes that could have affected that detection point and displays chains of events in a dependency graph. We use BackTracker to analyze several real attacks against computers that we set up as honeypots. In each case, BackTracker is able to highlight effectively the entry point used to gain access to the system and the sequence of steps from that entry point to the point at which we noticed the intrusion. The logging required to support BackTracker added 9% overhead in running time and generated 1.2 GB per day of log data for an operating-system intensive workload.
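The backward pass can be sketched as a reverse reachability walk over an event log of "source affects target" edges, starting at the detection point. The sketch below is illustrative only and omits BackTracker's time-based filtering and its rules for pruning low-value dependencies.

    # Sketch of backward analysis: walk "affects" edges in reverse from a detection point.
    from collections import deque

    # Hypothetical event log: (time, source object, target object it may have affected).
    events = [
        (1, "proc:sshd", "proc:sh"),         # sshd forked a shell
        (2, "proc:sh",   "file:/tmp/x"),     # the shell wrote the suspicious file
        (3, "proc:vi",   "file:/etc/motd"),  # unrelated activity
    ]

    def backtrack(detection_point, events):
        graph = {}
        for t, src, dst in events:
            graph.setdefault(dst, []).append((t, src))
        worklist, seen = deque([detection_point]), {detection_point}
        while worklist:
            obj = worklist.popleft()
            for t, src in graph.get(obj, []):
                if src not in seen:
                    seen.add(src)
                    worklist.append(src)
        return seen

    print(backtrack("file:/tmp/x", events))   # {'file:/tmp/x', 'proc:sh', 'proc:sshd'}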
2003
RacerX: effective, static detection of race conditions and deadlocks
This paper describes RacerX, a static tool that uses flow-sensitive, interprocedural analysis to detect both race conditions and deadlocks. It is explicitly designed to find errors in large, complex multithreaded systems. It aggressively infers checking information such as which locks protect which operations, which code contexts are multithreaded, and which shared accesses are dangerous. It tracks a set of code features which it uses to sort errors both from most to least severe. It uses novel techniques to counter the impact of analysis mistakes. The tool is fast, requiring between 2-14 minutes to analyze a 1.8 million line system. We have applied it to Linux, FreeBSD, and a large commercial code base, finding serious errors in all of them.
Revising Old Friends
2003
Separating agreement from execution for Byzantine fault tolerant services
We describe a new architecture for Byzantine fault tolerant state machine replication that separates agreement that orders requests from execution that processes requests. This separation yields two fundamental and practically significant advantages over previous architectures. First, it reduces replication costs because the new architecture can tolerate faults in up to half of the state machine replicas that execute requests. Previous systems can tolerate faults in at most a third of the combined agreement/state machine replicas. Second, separating agreement from execution allows a general privacy firewall architecture to protect confidentiality through replication. In contrast, replication in previous systems hurts confidentiality because exploiting the weakest replica can be sufficient to compromise the system. We have constructed a prototype and evaluated it running both microbenchmarks and an NFS server. Overall, we find that the architecture adds modest latencies to unreplicated systems and that its performance is competitive with existing Byzantine fault tolerant systems.
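The replication-cost arithmetic behind the separation, restated from the abstract (f is the number of Byzantine faults tolerated): agreement still requires 3f+1 nodes, but only 2f+1 replicas must execute requests.

    # Replica counts for tolerating f Byzantine faults, per the abstract's argument.
    def replicas(f):
        return {
            "monolithic agreement+execution replicas": 3 * f + 1,
            "split design: agreement nodes":           3 * f + 1,
            "split design: execution replicas":        2 * f + 1,
        }

    print(replicas(1))   # f=1: 4 full replicas before, versus 4 lightweight agreement nodes plus 3 execution replicas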
2003
Capriccio: scalable threads for internet services
This paper presents Capriccio, a scalable thread package for use with high-concurrency servers. While recent work has advocated event-based systems, we believe that thread-based systems can provide a simpler programming model that achieves equivalent or superior performance. By implementing Capriccio as a user-level thread package, we have decoupled the thread package implementation from the underlying operating system. As a result, we can take advantage of cooperative threading, new asynchronous I/O mechanisms, and compiler support. Using this approach, we are able to provide three key features: (1) scalability to 100,000 threads, (2) efficient stack management, and (3) resource-aware scheduling. We introduce linked stack management, which minimizes the amount of wasted stack space by providing safe, small, and non-contiguous stacks that can grow or shrink at run time. A compiler analysis makes our stack implementation efficient and sound. We also present resource-aware scheduling, which allows thread scheduling and admission control to adapt to the system's current resource usage. This technique uses a blocking graph that is automatically derived from the application to describe the flow of control between blocking points in a cooperative thread package. We have applied our techniques to the Apache 2.0.44 web server, demonstrating that we can achieve high performance and scalability despite using a simple threaded programming model.
Overlay & Peer-to-Peer Networks
2003
Bullet: high bandwidth data dissemination using an overlay mesh
In recent years, overlay networks have become an effective alternative to IP multicast for efficient point to multipoint communication across the Internet. Typically, nodes self-organize with the goal of forming an efficient overlay tree, one that meets performance targets without placing undue burden on the underlying network. In this paper, we target high-bandwidth data distribution from a single source to a large number of receivers. Applications include large-file transfers and real-time multimedia streaming. For these applications, we argue that an overlay mesh, rather than a tree, can deliver fundamentally higher bandwidth and reliability relative to typical tree structures. This paper presents Bullet, a scalable and distributed algorithm that enables nodes spread across the Internet to self-organize into a high bandwidth overlay mesh. We construct Bullet around the insight that data should be distributed in a disjoint manner to strategic points in the network. Individual Bullet receivers are then responsible for locating and retrieving the data from multiple points in parallel. Key contributions of this work include: i) an algorithm that sends data to different points in the overlay such that any data object is equally likely to appear at any node, ii) a scalable and decentralized algorithm that allows nodes to locate and recover missing data items, and iii) a complete implementation and evaluation of Bullet running across the Internet and in a large-scale emulation environment that reveals up to a factor-of-two bandwidth improvement under a variety of circumstances. In addition, we find that, relative to tree-based solutions, Bullet reduces the need to perform expensive bandwidth probing. In a tree, it is critical that a node's parent delivers a high rate of application data to each child. In Bullet however, nodes simultaneously receive data from multiple sources in parallel, making it less important to locate any single source capable of sustaining a high transmission rate.
2003
SplitStream: high-bandwidth multicast in cooperative environments
In tree-based multicast systems, a relatively small number of interior nodes carry the load of forwarding multicast messages. This works well when the interior nodes are highly-available, dedicated infrastructure routers but it poses a problem for application-level multicast in peer-to-peer systems. SplitStream addresses this problem by striping the content across a forest of interior-node-disjoint multicast trees that distributes the forwarding load among all participating peers. For example, it is possible to construct efficient SplitStream forests in which each peer contributes only as much forwarding bandwidth as it receives. Furthermore, with appropriate content encodings, SplitStream is highly robust to failures because a node failure causes the loss of a single stripe on average. We present the design and implementation of SplitStream and show experimental results obtained on an Internet testbed and via large-scale network simulation. The results show that SplitStream distributes the forwarding load among all peers and can accommodate peers with different bandwidth capacities while imposing low overhead for forest construction and maintenance.
2003
Measurement, modeling, and analysis of a peer-to-peer file-sharing workload
Peer-to-peer (P2P) file sharing accounts for an astonishing volume of current Internet traffic. This paper probes deeply into modern P2P file sharing systems and the forces that drive them. By doing so, we seek to increase our understanding of P2P file sharing workloads and their implications for future multimedia workloads. Our research uses a three-tiered approach. First, we analyze a 200-day trace of over 20 terabytes of Kazaa P2P traffic collected at the University of Washington. Second, we develop a model of multimedia workloads that lets us isolate, vary, and explore the impact of key system parameters. Our model, which we parameterize with statistics from our trace, lets us confirm various hypotheses about file-sharing behavior observed in the trace. Third, we explore the potential impact of locality-awareness in Kazaa. Our results reveal dramatic differences between P2P file sharing and Web traffic. For example, we show how the immutability of Kazaa's multimedia objects leads clients to fetch objects at most once; in contrast, a World-Wide Web client may fetch a popular page (e.g., CNN or Google) thousands of times. Moreover, we demonstrate that: (1) this "fetch-at-most-once" behavior causes the Kazaa popularity distribution to deviate substantially from Zipf curves we see for the Web, and (2) this deviation has significant implications for the performance of multimedia file-sharing systems. Unlike the Web, whose workload is driven by document change, we demonstrate that clients' fetch-at-most-once behavior, the creation of new objects, and the addition of new clients to the system are the primary forces that drive multimedia workloads such as Kazaa. We also show that there is substantial untapped locality in the Kazaa workload. Finally, we quantify the potential bandwidth savings that locality-aware P2P file-sharing architectures would achieve.
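The fetch-at-most-once effect described above can be illustrated with a small simulation (not from the paper): clients draw requests from a Zipf distribution, but an immutable object is downloaded only the first time a given client requests it. The object counts, client counts, and Zipf parameter below are arbitrary illustrative choices.

    import random
    from collections import Counter

    def zipf_weights(n, alpha=1.0):
        # Unnormalized Zipf weights: rank r gets weight 1/r^alpha.
        return [1.0 / (r ** alpha) for r in range(1, n + 1)]

    def simulate(num_objects=1000, num_clients=500, requests_per_client=100,
                 fetch_at_most_once=True, seed=0):
        rng = random.Random(seed)
        weights = zipf_weights(num_objects)
        objects = list(range(num_objects))
        fetches = Counter()
        for _ in range(num_clients):
            seen = set()
            for obj in rng.choices(objects, weights=weights, k=requests_per_client):
                if fetch_at_most_once and obj in seen:
                    continue  # immutable object already fetched; no repeat download
                seen.add(obj)
                fetches[obj] += 1
        return fetches

    if __name__ == "__main__":
        web_like = simulate(fetch_at_most_once=False)
        kazaa_like = simulate(fetch_at_most_once=True)
        # With repeats suppressed, the most popular objects are fetched far fewer
        # times, so the head of the distribution flattens relative to the Zipf baseline.
        for label, f in [("zipf", web_like), ("fetch-at-most-once", kazaa_like)]:
            print(label, [count for _, count in f.most_common(5)])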

18th SOSP — October 21-24, 2001 — Lake Louise, Banff, Canada
2001
Front Matter
Trust and Dependability
2001
Untrusted hosts and confidentiality: secure program partitioning
This paper presents secure program partitioning, a language-based technique for protecting confidential data during computation in distributed systems containing mutually untrusted hosts. Confidentiality and integrity policies can be expressed by annotating programs with security types that constrain information flow; these programs can then be partitioned automatically to run securely on heterogeneously trusted hosts. The resulting communicating subprograms collectively implement the original program, yet the system as a whole satisfies the security requirements of participating principals without requiring a universally trusted host machine. The experience in applying this methodology and the performance of the resulting distributed code suggest that this is a promising way to obtain secure distributed computation.
2001
BASE: using abstraction to improve fault tolerance
Software errors are a major cause of outages and they are increasingly exploited in malicious attacks. Byzantine fault tolerance allows replicated systems to mask some software errors but it is expensive to deploy. This paper describes a replication technique, BASE, which uses abstraction to reduce the cost of Byzantine fault tolerance and to improve its ability to mask software errors. BASE reduces cost because it enables reuse of off-the-shelf service implementations. It improves availability because each replica can be repaired periodically using an abstract view of the state stored by correct replicas, and because each replica can run distinct or non-deterministic service implementations, which reduces the probability of common mode failures. We built an NFS service where each replica can run a different off-the-shelf file system implementation, and an object-oriented database where the replicas ran the same, non-deterministic implementation. These examples suggest that our technique can be used in practice — in both cases, the implementation required only a modest amount of new code, and our performance results indicate that the replicated services perform comparably to the implementations that they reuse.
2001
The costs and limits of availability for replicated services
As raw system and network performance continues to improve at exponential rates, the utility of many services is increasingly limited by availability rather than performance. A key approach to improving availability involves replicating the service across multiple, wide-area sites. However, replication introduces well-known tradeoffs between service consistency and availability. Thus, this paper explores the benefits of dynamically trading consistency for availability using a continuous consistency model. In this model, applications specify a maximum deviation from strong consistency on a per-replica basis. In this paper, we: i) evaluate availability of a prototype replication system running across the Internet as a function of consistency level, consistency protocol, and failure characteristics, ii) demonstrate that simple optimizations to existing consistency protocols result in significant availability improvements (more than an order of magnitude in some scenarios), iii) use our experience with these optimizations to prove tight upper bounds on the availability of services, and iv) show that maximizing availability typically entails remaining as close to strong consistency as possible during times of good connectivity, resulting in a communication versus availability trade-off.
Deconstructing the OS
2001
Information and control in gray-box systems
In modern systems, developers are often unable to modify the underlying operating system. To build services in such an environment, we advocate the use of gray-box techniques. When treating the operating system as a gray-box, one recognizes that not changing the OS restricts, but does not completely obviate, both the information one can acquire about the internal state of the OS and the control one can impose on the OS. In this paper, we develop and investigate three gray-box Information and Control Layers (ICLs) for determining the contents of the file-cache, controlling the layout of files across local disk, and limiting process execution based on available memory. A gray-box ICL sits between a client and the OS and uses a combination of algorithmic knowledge, observations, and inferences to garner information about or control the behavior of a gray-box system. We summarize a set of techniques that are helpful in building gray-box ICLs and have begun to organize a "gray toolbox" to ease the construction of ICLs. Through our case studies, we demonstrate the utility of gray-box techniques, by implementing three useful "OS-like" services without the modification of a single line of OS source code.
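As a rough illustration of the gray-box idea, the sketch below guesses whether a file region is resident in the OS file cache purely by timing a read issued from user level, with no OS modification. It assumes a POSIX pread and a hand-picked latency threshold; a real ICL would calibrate the threshold and account for the probe itself populating the cache.

    import os
    import time

    # Hypothetical latency threshold (seconds) separating cache hits from disk reads;
    # a gray-box layer would calibrate this per machine rather than hard-code it.
    CACHE_HIT_THRESHOLD = 200e-6

    def probably_cached(path, offset, length=4096):
        """Guess whether a file region is in the OS file cache by timing a read.
        Note: the probe itself brings the block into the cache as a side effect."""
        fd = os.open(path, os.O_RDONLY)
        try:
            start = time.perf_counter()
            os.pread(fd, length, offset)       # POSIX pread; no seek needed
            elapsed = time.perf_counter() - start
        finally:
            os.close(fd)
        return elapsed < CACHE_HIT_THRESHOLD

    # Example with a hypothetical file: the second probe of the same block is
    # expected to report True, because the first probe faulted the block in.
    # print(probably_cached("/tmp/example.dat", 0), probably_cached("/tmp/example.dat", 0))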
2001
Bugs as deviant behavior: a general approach to inferring errors in systems code
A major obstacle to finding program errors in a real system is knowing what correctness rules the system must obey. These rules are often undocumented or specified in an ad hoc manner. This paper demonstrates techniques that automatically extract such checking information from the source code itself, rather than the programmer, thereby avoiding the need for a priori knowledge of system rules. The cornerstone of our approach is inferring programmer "beliefs" that we then cross-check for contradictions. Beliefs are facts implied by code: a dereference of a pointer, p, implies a belief that p is non-null, a call to "unlock(l)" implies that l was locked, etc. For beliefs we know the programmer must hold, such as the pointer dereference above, we immediately flag contradictions as errors. For beliefs that the programmer may hold, we can assume these beliefs hold and use a statistical analysis to rank the resulting errors from most to least likely. For example, a call to "spin_lock" followed once by a call to "spin_unlock" implies that the programmer may have paired these calls by coincidence. If the pairing happens 999 out of 1000 times, though, then it is probably a valid belief and the sole deviation a probable error. The key feature of this approach is that it requires no a priori knowledge of truth: if two beliefs contradict, we know that one is an error without knowing what the correct belief is. Conceptually, our checkers extract beliefs by tailoring rule "templates" to a system — for example, finding all functions that fit the rule template "a must be paired with b." We have developed six checkers that follow this conceptual framework. They find hundreds of bugs in real systems such as Linux and OpenBSD. From our experience, they give a dramatic reduction in the manual effort needed to check a large system. Compared to our previous work [9], these template checkers find ten to one hundred times more rule instances and derive properties we found impractical to specify manually.
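A minimal sketch of the statistical cross-checking idea, under simplified assumptions: given per-function call sequences, count how often a call to a is eventually followed by a call to b, and rank the functions where the pairing almost always holds but occasionally fails. The ranking here is a plain ratio rather than the paper's statistic, and the trace format is invented for illustration.

    from collections import defaultdict

    def rank_pairing_violations(traces, a="spin_lock", b="spin_unlock"):
        """traces: mapping of function name -> list of call sequences observed in it.
        Returns candidate errors, ranked so that near-universal pairings with few
        violations come first (those rare deviations are the most suspicious)."""
        checked = defaultdict(int)   # times the a...b pairing was checked
        errors = defaultdict(int)    # times a appeared without a following b
        for func, sequences in traces.items():
            for seq in sequences:
                for i, call in enumerate(seq):
                    if call == a:
                        checked[func] += 1
                        if b not in seq[i + 1:]:
                            errors[func] += 1
        ranked = []
        for func in checked:
            n, err = checked[func], errors[func]
            if err == 0 or err == n:
                continue  # always paired (no bug) or never paired (rule likely absent)
            confidence = (n - err) / n  # fraction of the time the "must pair" belief held
            ranked.append((confidence, n, err, func))
        return sorted(ranked, reverse=True)

    # Example with made-up traces: f pairs lock/unlock three times and misses once,
    # so its single deviation is reported as a probable error.
    traces = {"f": [["spin_lock", "work", "spin_unlock"]] * 3 + [["spin_lock", "work"]],
              "g": [["spin_lock", "work"]]}
    print(rank_pairing_violations(traces))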
2001
An empirical study of operating systems errors
We present a study of operating system errors found by automatic, static, compiler analysis applied to the Linux and OpenBSD kernels. Our approach differs from previous studies that consider errors found by manual inspection of logs, testing, and surveys because static analysis is applied uniformly to the entire kernel source, though our approach necessarily considers a less comprehensive variety of errors than previous studies. In addition, automation allows us to track errors over multiple versions of the kernel source to estimate how long errors remain in the system before they are fixed. We found that device drivers have error rates up to three to seven times higher than the rest of the kernel. We found that the largest quartile of functions have error rates two to six times higher than the smallest quartile. We found that the newest quartile of files have error rates up to twice that of the oldest quartile, which provides evidence that code "hardens" over time. Finally, we found that bugs remain in the Linux kernel an average of 1.8 years before being fixed.
Invited Talk
2001
Why information security is hard — an economic perspective
Resource Management
2001
Real-time dynamic voltage scaling for low-power embedded operating systems
In recent years, there has been a rapid and wide spread of non-traditional computing platforms, especially mobile and portable computing devices. As applications become increasingly sophisticated and processing power increases, the most serious limitation on these devices is the available battery life. Dynamic Voltage Scaling (DVS) has been a key technique in exploiting the hardware characteristics of processors to reduce energy dissipation by lowering the supply voltage and operating frequency. The DVS algorithms are shown to be able to make dramatic energy savings while providing the necessary peak computation power in general-purpose systems. However, for a large class of applications in embedded real-time systems like cellular phones and camcorders, the variable operating frequency interferes with their deadline guarantee mechanisms, and DVS in this context, despite its growing importance, is largely overlooked/under-developed. To provide real-time guarantees, DVS must consider deadlines and periodicity of real-time tasks, requiring integration with the real-time scheduler. In this paper, we present a class of novel algorithms called real-time DVS (RT-DVS) that modify the OS's real-time scheduler and task management service to provide significant energy savings while maintaining real-time deadline guarantees. We show through simulations and a working prototype implementation that these RT-DVS algorithms closely approach the theoretical lower bound on energy consumption, and can easily reduce energy consumption 20% to 40% in an embedded real-time system.
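A minimal sketch of the static flavor of RT-DVS for an EDF-scheduled task set: the task set remains schedulable at frequency f as long as its worst-case utilization does not exceed f divided by the maximum frequency, so the scheduler can pick the lowest available frequency that satisfies this bound. The task parameters and frequency table below are hypothetical, and the paper's cycle-conserving and look-ahead variants reclaim further slack at run time.

    def static_rtdvs_frequency(tasks, freqs):
        """tasks: list of (wcet_at_fmax_seconds, period_seconds).
        freqs: available CPU frequencies in Hz.
        Under EDF the task set stays schedulable at frequency f if
        sum(wcet_i / period_i) <= f / f_max, so pick the lowest such f."""
        f_max = max(freqs)
        utilization = sum(c / t for c, t in tasks)
        for f in sorted(freqs):
            if utilization <= f / f_max:
                return f
        raise ValueError("task set not schedulable even at the maximum frequency")

    # Illustrative task set: 50% utilization at full speed, so roughly half speed suffices.
    tasks = [(0.002, 0.010), (0.009, 0.030)]          # 20% + 30% utilization
    freqs = [200e6, 400e6, 600e6, 800e6, 1000e6]      # hypothetical DVS settings
    print(static_rtdvs_frequency(tasks, freqs))        # -> 600000000.0 (0.5 <= 0.6)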
2001
Managing energy and server resources in hosting centers
Internet hosting centers serve multiple service sites from a common hardware base. This paper presents the design and implementation of an architecture for resource management in a hosting center operating system, with an emphasis on energy as a driving resource management issue for large server clusters. The goals are to provision server resources for co-hosted services in a way that automatically adapts to offered load, improve the energy efficiency of server clusters by dynamically resizing the active server set, and respond to power supply disruptions or thermal events by degrading service in accordance with negotiated Service Level Agreements (SLAs). Our system is based on an economic approach to managing shared server resources, in which services "bid" for resources as a function of delivered performance. The system continuously monitors load and plans resource allotments by estimating the value of their effects on service performance. A greedy resource allocation algorithm adjusts resource prices to balance supply and demand, allocating resources to their most efficient use. A reconfigurable server switching infrastructure directs request traffic to the servers assigned to each service. Experimental results from a prototype confirm that the system adapts to offered load and resource availability, and can reduce server energy usage by 29% or more for a typical Web workload.
2001
Anticipatory scheduling: a disk scheduling framework to overcome deceptive idleness in synchronous I/O
Disk schedulers in current operating systems are generally work-conserving, i.e., they schedule a request as soon as the previous request has finished. Such schedulers often require multiple outstanding requests from each process to meet system-level goals of performance and quality of service. Unfortunately, many common applications issue disk read requests in a synchronous manner, interspersing successive requests with short periods of computation. The scheduler chooses the next request too early; this induces deceptive idleness, a condition where the scheduler incorrectly assumes that the last request-issuing process has no further requests, and becomes forced to switch to a request from another process. We propose the anticipatory disk scheduling framework to solve this problem in a simple, general and transparent way, based on the non-work-conserving scheduling discipline. Our FreeBSD implementation is observed to yield large benefits on a range of microbenchmarks and real workloads. The Apache webserver delivers between 29% and 71% more throughput on a disk-intensive workload. The Andrew filesystem benchmark runs faster by 8%, due to a speedup of 54% in its read-intensive phase. Variants of the TPC-B database benchmark exhibit improvements between 2% and 60%. Proportional-share schedulers are seen to achieve their contracts accurately and efficiently.
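A minimal sketch of the anticipation decision, with invented thresholds and per-process statistics: after a synchronous read completes, the scheduler holds the disk idle briefly for the issuing process only if that process has historically come back quickly with a nearby request, so that waiting is expected to cost less than seeking away and back.

    def should_anticipate(last_proc_stats, pending_other_request,
                          max_wait=0.004, seek_cost=0.008):
        """last_proc_stats: per-process history with the expected think time (seconds)
        and the expected positioning cost of its next request (seconds).
        Returns True if the scheduler should keep the disk idle for the process
        whose request just completed instead of servicing another process."""
        if pending_other_request is None:
            return True  # nothing else to do; waiting costs nothing
        expected_wait = last_proc_stats["mean_think_time"]
        # Anticipate only if the process usually returns quickly with a nearby
        # request, so waiting beats paying a long seek now and another one later.
        return (expected_wait <= max_wait and
                last_proc_stats["mean_seek_cost"] + expected_wait < seek_cost)

    stats = {"mean_think_time": 0.0005, "mean_seek_cost": 0.001}  # hypothetical history
    print(should_anticipate(stats, pending_other_request="far-away read"))  # -> True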
Networking
2001
Resilient overlay networks
A Resilient Overlay Network (RON) is an architecture that allows distributed Internet applications to detect and recover from path outages and periods of degraded performance within several seconds, improving over today's wide-area routing protocols that take at least several minutes to recover. A RON is an application-layer overlay on top of the existing Internet routing substrate. The RON nodes monitor the functioning and quality of the Internet paths among themselves, and use this information to decide whether to route packets directly over the Internet or by way of other RON nodes, optimizing application-specific routing metrics. Results from two sets of measurements of a working RON deployed at sites scattered across the Internet demonstrate the benefits of our architecture. For instance, over a 64-hour sampling period in March 2001 across a twelve-node RON, there were 32 significant outages, each lasting over thirty minutes, over the 132 measured paths. RON's routing mechanism was able to detect, recover, and route around all of them, in less than twenty seconds on average, showing that its methods for fault detection and recovery work well at discovering alternate paths in the Internet. Furthermore, RON was able to improve the loss rate, latency, or throughput perceived by data transfers; for example, about 5% of the transfers doubled their TCP throughput and 5% of our transfers saw their loss probability reduced by 0.05. We found that forwarding packets via at most one intermediate RON node is sufficient to overcome faults and improve performance in most cases. These improvements, particularly in the area of fault detection and recovery, demonstrate the benefits of moving some of the control over routing into the hands of end-systems.
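A minimal sketch of one-hop overlay path selection in the spirit of RON: given measured latency and loss for each overlay link, compare the direct Internet path against routing through each single intermediate node and take the best. The scoring function below is an illustrative stand-in for RON's application-specific metrics.

    def best_overlay_path(src, dst, nodes, latency, loss):
        """latency[(a, b)]: measured round-trip time in seconds; loss[(a, b)]: loss
        rate in [0, 1]. Returns (path, score); lower is better. Only paths through
        at most one intermediate node are considered, which the RON measurements
        found to be sufficient in most cases."""
        def score(path):
            # Illustrative metric: total latency, inflated by the end-to-end loss rate.
            lat = sum(latency[(a, b)] for a, b in zip(path, path[1:]))
            delivery = 1.0
            for a, b in zip(path, path[1:]):
                delivery *= 1.0 - loss[(a, b)]
            return lat / max(delivery, 1e-9)

        candidates = [[src, dst]] + [[src, hop, dst]
                                     for hop in nodes if hop not in (src, dst)]
        return min(((p, score(p)) for p in candidates), key=lambda x: x[1])

    # Tiny made-up example: the direct A->C path is lossy, so routing via B wins.
    latency = {("A", "C"): 0.030, ("A", "B"): 0.020, ("B", "C"): 0.020}
    loss = {("A", "C"): 0.30, ("A", "B"): 0.0, ("B", "C"): 0.0}
    print(best_overlay_path("A", "C", ["A", "B", "C"], latency, loss))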
2001
Building efficient wireless sensor networks with low-level naming
In most distributed systems, naming of nodes for low-level communication leverages topological location (such as node addresses) and is independent of any application. In this paper, we investigate an emerging class of distributed systems where low-level communication does not rely on network topological location. Rather, low-level communication is based on attributes that are external to the network topology and relevant to the application. When combined with dense deployment of nodes, this kind of named data enables in-network processing for data aggregation, collaborative signal processing, and similar problems. These approaches are essential for emerging applications such as sensor networks where resources such as bandwidth and energy are limited. This paper is the first description of the software architecture that supports named data and in-network processing in an operational, multi-application sensor-network. We show that approaches such as in-network aggregation and nested queries can significantly affect network traffic. In one experiment aggregation reduces traffic by up to 42% and nested queries reduce loss rates by 30%. Although aggregation has been previously studied in simulation, this paper demonstrates nested queries as another form of in-network processing, and it presents the first evaluation of these approaches over an operational testbed.
2001
Mesh-based content routing using XML
We have developed a new approach for reliably multicasting time-critical data to heterogeneous clients over mesh-based overlay networks. To facilitate intelligent content pruning, data streams are comprised of a sequence of XML packets and forwarded by application-level XML routers. XML routers perform content-based routing of individual XML packets to other routers or clients based upon queries that describe the information needs of downstream nodes. Our PC-based XML router prototype can route an 18 Mbit per second XML stream. Our routers use a novel Diversity Control Protocol (DCP) for router-to-router and router-to-client communication. DCP reassembles a received stream of packets from one or more senders using the first copy of a packet to arrive from any sender. When each node is connected to n parents, the resulting network is resilient to (n − 1) router or independent link failures without repair. Associated mesh algorithms permit the system to recover to (n − 1) resilience after node and/or link failure. We have deployed a distributed network of XML routers that streams real-time air traffic control data. Experimental results show multiple senders improve reliability and latency when compared to tree-based networks.
File Systems
2001
A low-bandwidth network file system
Users rarely consider running network file systems over slow or wide-area networks, as the performance would be unacceptable and the bandwidth consumption too high. Nonetheless, efficient remote file access would often be desirable over such networks—particularly when high latency makes remote login sessions unresponsive. Rather than run interactive programs such as editors remotely, users could run the programs locally and manipulate remote files through the file system. To do so, however, would require a network file system that consumes less bandwidth than most current file systems. This paper presents LBFS, a network file system designed for low-bandwidth networks. LBFS exploits similarities between files or versions of the same file to save bandwidth. It avoids sending data over the network when the same data can already be found in the server's file system or the client's cache. Using this technique in conjunction with conventional compression and caching, LBFS consumes over an order of magnitude less bandwidth than traditional network file systems on common workloads.
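The bandwidth savings come largely from content-defined chunking: files are split at boundaries chosen by a rolling hash of the local contents, each chunk is identified by a cryptographic hash, and only chunks the other side does not already hold are transferred. LBFS uses Rabin fingerprints over a 48-byte window; the sketch below substitutes a simple polynomial rolling hash and SHA-256, purely for illustration.

    import hashlib
    import random

    WINDOW = 48
    MASK = (1 << 13) - 1      # expected average chunk size of ~8 KB, as in LBFS
    BASE = 257
    MOD = (1 << 61) - 1       # large prime modulus for the rolling hash
    POP = pow(BASE, WINDOW - 1, MOD)  # factor used to drop the oldest byte

    def chunk_boundaries(data):
        """Yield chunk end offsets; a boundary falls wherever the rolling hash of
        the previous WINDOW bytes has all-zero low 13 bits, so boundaries depend
        only on nearby content and survive insertions elsewhere in the file."""
        h = 0
        for i, byte in enumerate(data):
            if i >= WINDOW:
                h = (h - data[i - WINDOW] * POP) % MOD
            h = (h * BASE + byte) % MOD
            if i + 1 >= WINDOW and (h & MASK) == 0:
                yield i + 1
        yield len(data)

    def chunks(data):
        start = 0
        for end in chunk_boundaries(data):
            if end > start:
                yield data[start:end]
                start = end

    def chunks_to_send(data, remote_hashes):
        """Only chunks whose SHA-256 the other side does not already hold are sent."""
        return [c for c in chunks(data) if hashlib.sha256(c).hexdigest() not in remote_hashes]

    # Example: insert a few bytes into a 128 KB file; boundaries after the edit
    # realign, so typically only the chunk(s) around the edit cross the network.
    rng = random.Random(1)
    old = bytes(rng.randrange(256) for _ in range(1 << 17))
    new = old[:1000] + b"edit" + old[1000:]
    server = {hashlib.sha256(c).hexdigest() for c in chunks(old)}
    print(len(list(chunks(new))), "chunks,", len(chunks_to_send(new, server)), "to send")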
2001
Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility
This paper presents and evaluates the storage management and caching in PAST, a large-scale peer-to-peer persistent storage utility. PAST is based on a self-organizing, Internet-based overlay network of storage nodes that cooperatively route file queries, store multiple replicas of files, and cache additional copies of popular files. In the PAST system, storage nodes and files are each assigned uniformly distributed identifiers, and replicas of a file are stored at nodes whose identifier matches most closely the file's identifier. This statistical assignment of files to storage nodes approximately balances the number of files stored on each node. However, non-uniform storage node capacities and file sizes require more explicit storage load balancing to permit graceful behavior under high global storage utilization; likewise, non-uniform popularity of files requires caching to minimize fetch distance and to balance the query load. We present and evaluate PAST, with an emphasis on its storage management and caching system. Extensive trace-driven experiments show that the system minimizes fetch distance, that it balances the query load for popular files, and that it displays graceful degradation of performance as the global storage utilization increases beyond 95%.
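A minimal sketch of the placement rule described above, computed directly over a known node list (PAST itself locates the nodes by routing through Pastry): file and node names are hashed into one identifier space, and a file's k replicas go to the nodes whose identifiers are numerically closest to the file's identifier. The identifier width, hash, and node names below are illustrative.

    import hashlib

    ID_BITS = 128

    def node_id(address):
        return int.from_bytes(hashlib.sha1(address.encode()).digest(), "big") >> (160 - ID_BITS)

    def file_id(name, salt=""):
        return int.from_bytes(hashlib.sha1((name + salt).encode()).digest(), "big") >> (160 - ID_BITS)

    def replica_nodes(fid, nodes, k=3):
        """Store replicas on the k nodes whose identifiers are numerically closest
        to the file identifier (distance measured around the circular id space)."""
        ring = 1 << ID_BITS
        def distance(nid):
            d = abs(nid - fid)
            return min(d, ring - d)
        return sorted(nodes, key=lambda addr: distance(node_id(addr)))[:k]

    nodes = [f"node-{i}.example.org" for i in range(16)]   # hypothetical storage nodes
    fid = file_id("/music/track42.mp3")
    print(replica_nodes(fid, nodes, k=3))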
2001
Wide-area cooperative storage with CFS
The Cooperative File System (CFS) is a new peer-to-peer read-only storage system that provides provable guarantees for the efficiency, robustness, and load-balance of file storage and retrieval. CFS does this with a completely decentralized architecture that can scale to large systems. CFS servers provide a distributed hash table (DHash) for block storage. CFS clients interpret DHash blocks as a file system. DHash distributes and caches blocks at a fine granularity to achieve load balance, uses replication for robustness, and decreases latency with server selection. DHash finds blocks using the Chord location protocol, which operates in time logarithmic in the number of servers. CFS is implemented using the SFS file system toolkit and runs on Linux, OpenBSD, and FreeBSD. Experience on a globally deployed prototype shows that CFS delivers data to clients as fast as FTP. Controlled tests show that CFS is scalable: with 4,096 servers, looking up a block of data involves contacting only seven servers. The tests also demonstrate nearly perfect robustness and unimpaired performance even when as many as half the servers fail.
Invited Talk
2001
Software upgrades in distributed systems
Event-Driven Architectures
2001
Building a robust software-based router using network processors
Recent efforts to add new services to the Internet have increased interest in software-based routers that are easy to extend and evolve. This paper describes our experiences using emerging network processors—in particular, the Intel IXP1200—to implement a router. We show it is possible to combine an IXP1200 development board and a PC to build an inexpensive router that forwards minimum-sized packets at a rate of 3.47Mpps. This is nearly an order of magnitude faster than existing pure PC-based routers, and sufficient to support 1.77Gbps of aggregate link bandwidth. At lesser aggregate line speeds, our design also allows the excess resources available on the IXP1200 to be used robustly for extra packet processing. For example, with 8 × 100Mbps links, 240 register operations and 96 bytes of state storage are available for each 64-byte packet. Using a hierarchical architecture we can guarantee line-speed forwarding rates for simple packets with the IXP1200, and still have extra capacity to process exceptional packets with the Pentium. Up to 310Kpps of the traffic can be routed through the Pentium to receive 1510 cycles of extra per-packet processing.
2001
SEDA: an architecture for well-conditioned, scalable internet services
We propose a new design for highly concurrent Internet services, which we call the staged event-driven architecture (SEDA). SEDA is intended to support massive concurrency demands and simplify the construction of well-conditioned services. In SEDA, applications consist of a network of event-driven stages connected by explicit queues. This architecture allows services to be well-conditioned to load, preventing resources from being overcommitted when demand exceeds service capacity. SEDA makes use of a set of dynamic resource controllers to keep stages within their operating regime despite large fluctuations in load. We describe several control mechanisms for automatic tuning and load conditioning, including thread pool sizing, event batching, and adaptive load shedding. We present the SEDA design and an implementation of an Internet services platform based on this architecture. We evaluate the use of SEDA through two applications: a high-performance HTTP server and a packet router for the Gnutella peer-to-peer file sharing network. These results show that SEDA applications exhibit higher performance than traditional service designs, and are robust to huge variations in load.
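A minimal sketch of one SEDA-style stage: events arrive on an explicit queue, a small thread pool drains it, and a crude controller grows the pool when the queue backs up. Real SEDA controllers also handle batching, load shedding, and shrinking idle pools; the names, thresholds, and two-stage pipeline below are illustrative.

    import queue
    import threading
    import time

    class Stage:
        """One stage: events enter an explicit queue and are processed by a thread
        pool; a controller thread grows the pool while the queue stays long."""
        def __init__(self, name, handler, next_stage=None, max_threads=8):
            self.name, self.handler, self.next_stage = name, handler, next_stage
            self.events = queue.Queue()
            self.max_threads = max_threads
            self.workers = []
            self._add_worker()
            threading.Thread(target=self._controller, daemon=True).start()

        def enqueue(self, event):
            self.events.put(event)

        def _add_worker(self):
            t = threading.Thread(target=self._work, daemon=True)
            self.workers.append(t)
            t.start()

        def _work(self):
            while True:
                event = self.events.get()
                result = self.handler(event)
                if self.next_stage is not None and result is not None:
                    self.next_stage.enqueue(result)

        def _controller(self):
            # Crude thread-pool controller: add a thread whenever the queue is deep.
            while True:
                time.sleep(0.05)
                if self.events.qsize() > 20 and len(self.workers) < self.max_threads:
                    self._add_worker()

    # Example pipeline: parse -> respond, connected by explicit queues.
    respond = Stage("respond", lambda req: print("handled", req))
    parse = Stage("parse", lambda raw: raw.strip().upper(), next_stage=respond)
    for i in range(5):
        parse.enqueue(f"request-{i}\n")
    time.sleep(0.2)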
Invited Talk
2001
Contrasting the past — new contexts, new arguments

17th SOSP — December 12-15, 1999 — Kiawah Island, South Carolina, USA
Distributed Systems (1)
1999
Manageability, availability and performance in Porcupine: a highly scalable, cluster-based mail service
This paper describes the motivation, design, and performance of Porcupine, a scalable mail server. The goal of Porcupine is to provide a highly available and scalable electronic mail service using a large cluster of commodity PCs. We designed Porcupine to be easy to manage by emphasizing dynamic load balancing, automatic configuration, and graceful degradation in the presence of failures. Key to the system's manageability, availability, and performance is that sessions, data, and underlying services are distributed homogeneously and dynamically across nodes in a cluster.
1999
On the scale and performance of cooperative Web proxy caching
While algorithms for cooperative proxy caching have been widely studied, little is understood about cooperative-caching performance in the large-scale World Wide Web environment. This paper uses both trace-based analysis and analytic modelling to show the potential advantages and drawbacks of inter-proxy cooperation. With our traces, we evaluate quantitatively the performance-improvement potential of cooperation between 200 small-organization proxies within a university environment, and between two large-organization proxies handling 23,000 and 60,000 clients, respectively. With our model, we extend beyond these populations to project cooperative caching behavior in regions with millions of clients. Overall, we demonstrate that cooperative caching has performance benefits only within limited population bounds. We also use our model to examine the implications of future trends in Web-access behavior and traffic.
Client Systems
1999
The interactive performance of SLIM: a stateless, thin-client architecture
Taking the concept of thin clients to the limit, this paper proposes that desktop machines should just be simple, stateless I/O devices (display, keyboard, mouse, etc.) that access a shared pool of computational resources over a dedicated interconnection fabric — much in the same way as a building's telephone services are accessed by a collection of handset devices. The stateless desktop design provides a useful mobility model in which users can transparently resume their work on any desktop console. This paper examines the fundamental premise in this system design that modern, off-the-shelf interconnection technology can support the quality-of-service required by today's graphical and multimedia applications. We devised a methodology for analyzing the interactive performance of modern systems, and we characterized the I/O properties of common, real-life applications (e.g. Netscape, streaming video, and Quake) executing in thin-client environments. We have conducted a series of experiments on the Sun Ray™ 1 implementation of this new system architecture, and our results indicate that it provides an effective means of delivering computational services to a workgroup. We have found that response times over a dedicated network are so low that interactive performance is indistinguishable from a dedicated workstation. A simple pixel encoding protocol requires only modest network resources (as little as a 1Mbps home connection) and is quite competitive with the X protocol. Tens of users running interactive applications can share a processor without any noticeable degradation, and many more can share the network. The simple protocol over a 100Mbps interconnection fabric can support streaming video and Quake at display rates and resolutions which provide a high-fidelity user experience.
1999
Energy-aware adaptation for mobile applications
In this paper, we demonstrate that a collaborative relationship between the operating system and applications can be used to meet user-specified goals for battery duration. We first show how applications can dynamically modify their behavior to conserve energy. We then show how the Linux operating system can guide such adaptation to yield a battery-life of desired duration. By monitoring energy supply and demand, it is able to select the correct tradeoff between energy conservation and application quality. Our evaluation shows that this approach can meet goals that extend battery life by as much as 30%.
Networking (1)
1999
Active network vision and reality: lessons from a capsule-based system
Although active networks have generated much debate in the research community, on the whole there has been little hard evidence to inform this debate. This paper aims to redress the situation by reporting what we have learned by designing, implementing and using the ANTS active network toolkit over the past two years. At this early stage, active networks remain an open research area. However, we believe that we have made substantial progress towards providing a more flexible network layer while at the same time addressing the performance and security concerns raised by the presence of mobile code in the network. In this paper, we argue our progress towards the original vision and the difficulties that we have not yet resolved in three areas that characterize a "pure" active network: the capsule model of programmability; the accessibility of that model to all users; and the applications that can be constructed in practice.
1999
Building reliable, high-performance communication systems from components
Although building systems from components has attractions, this approach also has problems. Can we be sure that a certain configuration of components is correct? Can it perform as well as a monolithic system? Our paper answers these questions for the Ensemble communication architecture by showing how, with help of the Nuprl formal system, configurations may be checked against specifications, and how optimized code can be synthesized from these configurations. The performance results show that we can substantially reduce end-to-end latency in the already optimized Ensemble system. Finally, we discuss whether the techniques we used are general enough for systems other than communication systems.
File Systems
1999
File system usage in Windows NT 4.0
We have performed a study of the usage of the Windows NT File System through long-term kernel tracing. Our goal was to provide a new data point with respect to the 1985 and 1991 trace-based File System studies, to investigate the usage details of the Windows NT file system architecture, and to study the overall statistical behavior of the usage data. In this paper we report on these issues through a detailed comparison with the older traces, through details on the operational characteristics and through a usage analysis of the file system and cache manager. Next to architectural insights we provide evidence for the pervasive presence of heavy-tail distribution characteristics in all aspects of file system usage. Extreme variances are found in session inter-arrival time, session holding times, read/write frequencies, read/write buffer sizes, etc., which is of importance to system engineering, tuning and benchmarking.
1999
Deciding when to forget in the Elephant file system
Modern file systems associate the deletion of a file with the immediate release of storage, and file writes with the irrevocable change of file contents. We argue that this behavior is a relic of the past, when disk storage was a scarce resource. Today, large cheap disks make it possible for the file system to protect valuable data from accidental delete or overwrite. This paper describes the design, implementation, and performance of the Elephant file system, which automatically retains all important versions of user files. Users name previous file versions by combining a traditional pathname with a time when the desired version of a file or directory existed. Storage in Elephant is managed by the system using file-grain user-specified retention policies. This approach contrasts with checkpointing file systems such as Plan-9, AFS, and WAFL that periodically generate efficient checkpoints of entire file systems and thus restrict retention to be guided by a single policy for all files within that file system. Elephant is implemented as a new Virtual File System in the FreeBSD kernel.
1999
Separating key management from file system security
No secure network file system has ever grown to span the Internet. Existing systems all lack adequate key management for security at a global scale. Given the diversity of the Internet, any particular mechanism a file system employs to manage keys will fail to support many types of use. We propose separating key management from file system security, letting the world share a single global file system no matter how individuals manage keys. We present SFS, a secure file system that avoids internal key management. While other file systems need key management to map file names to encryption keys, SFS file names effectively contain public keys, making them self-certifying pathnames. Key management in SFS occurs outside of the file system, in whatever procedure users choose to generate file names. Self-certifying pathnames free SFS clients from any notion of administrative realm, making inter-realm file sharing trivial. They let users authenticate servers through a number of different techniques. The file namespace doubles as a key certification namespace, so that people can realize many key management schemes using only standard file utilities. Finally, with self-certifying pathnames, people can bootstrap one key management mechanism using another. These properties make SFS more versatile than any file system with built-in key management.
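A minimal sketch of the self-certifying pathname idea, with an invented encoding (SFS's actual HostID format differs): the pathname embeds a digest of the server's location and public key, so a client can check the key the server presents against the name itself, with no external key-management infrastructure.

    import hashlib

    def host_id(location, public_key_bytes):
        # Collision-resistant digest binding the server name to its public key.
        return hashlib.sha1(location.encode() + b"," + public_key_bytes).hexdigest()[:27]

    def self_certifying_path(location, public_key_bytes, rest="/home/alice/notes.txt"):
        return f"/sfs/{location}:{host_id(location, public_key_bytes)}{rest}"

    def verify_server(pathname, presented_key_bytes):
        """Client-side check: the key the server presents must hash to the HostID
        embedded in the pathname, so no trusted directory of keys is needed."""
        hostpart = pathname.split("/")[2]            # "location:hostid"
        location, hid = hostpart.split(":", 1)
        return host_id(location, presented_key_bytes) == hid

    key = b"-----hypothetical public key bytes-----"
    path = self_certifying_path("sfs.example.org", key)
    print(path)
    print(verify_server(path, key), verify_server(path, b"some other key"))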
OS Kernels
1999
Integrating segmentation and paging protection for safe, efficient and transparent software extensions
The trend towards extensible software architectures and component-based software development demands safe, efficient, and easy-to-use extension mechanisms to enforce protection boundaries among software modules residing in the same address space. This paper describes the design, implementation, and evaluation of a novel intra-address space protection mechanism called Palladium, which exploits the segmentation and paging hardware in the Intel X86 architecture and efficiently supports safe kernel-level and user-level extensions in a way that is largely transparent to programmers and existing programming tools. Based on the considerations on ease of extension programming and systems implementation complexity, Palladium uses different approaches to support user-level and kernel-level extension mechanisms. To demonstrate the effectiveness of the Palladium architecture, we built a Web server that exploits the user-level extension mechanism to invoke CGI scripts as local function calls in a safe way, and we constructed a compiled network packet filter that exploits the kernel-level extension mechanism to run packet-filtering binaries safely inside the kernel at native speed. The current Palladium prototype implementation demonstrates that a protected procedure call and return costs 142 CPU cycles on a Pentium 200MHz machine running Linux.
1999
Cellular Disco: resource management using virtual clusters on shared-memory multiprocessors
Despite the fact that large-scale shared-memory multiprocessors have been commercially available for several years, system software that fully utilizes all their features is still not available, mostly due to the complexity and cost of making the required changes to the operating system. A recently proposed approach, called Disco, substantially reduces this development cost by using a virtual machine monitor that leverages the existing operating system technology. In this paper we present a system called Cellular Disco that extends the Disco work to provide all the advantages of the hardware partitioning and scalable operating system approaches. We argue that Cellular Disco can achieve these benefits at only a small fraction of the development cost of modifying the operating system. Cellular Disco effectively turns a large-scale shared-memory multiprocessor into a virtual cluster that supports fault containment and heterogeneity, while avoiding operating system scalability bottlenecks. Yet at the same time, Cellular Disco preserves the benefits of a shared-memory multiprocessor by implementing dynamic, fine-grained resource sharing, and by allowing users to overcommit resources such as processors and memory. This hybrid approach requires a scalable resource manager that makes local decisions with limited information while still providing good global performance and fault containment. In this paper we describe our experience with a Cellular Disco prototype on a 32-processor SGI Origin 2000 system. We show that the execution time penalty for this approach is low, typically within 10% of the best available commercial operating system for most workloads, and that it can manage the CPU and memory resources of the machine significantly better than the hardware partitioning approach.
1999
EROS: a fast capability system
EROS is a capability-based operating system for commodity processors which uses a single level storage model. The single level store's persistence is transparent to applications. The performance consequences of support for transparent persistence and capability-based architectures are generally believed to be negative. Surprisingly, the basic operations of EROS (such as IPC) are generally comparable in cost to similar operations in conventional systems. This is demonstrated with a set of microbenchmark measurements of semantically similar operations in Linux. The EROS system achieves its performance by coupling well-chosen abstract objects with caching techniques for those objects. The objects (processes, nodes, and pages) are well-supported by conventional hardware, reducing the overhead of capabilities. Software-managed caching techniques for these objects reduce the cost of persistence. The resulting performance suggests that composing protected subsystems may be less costly than commonly believed.
Invited Speakers
1999
Coping with complexity
1999
Computer systems research: past and future
Distributed Systems (2)
1999
The design and implementation of an intentional naming system
This paper presents the design and implementation of the Intentional Naming System (INS), a resource discovery and service location system for dynamic and mobile networks of devices and computers. Such environments require a naming system that is (i) expressive, to describe and make requests based on specific properties of services, (ii) responsive, to track changes due to mobility and performance, (iii) robust, to handle failures, and (iv) easily configurable. INS uses a simple language based on attributes and values for its names. Applications use the language to describe what they are looking for (i.e., their intent), not where to find things (i.e., not hostnames). INS implements a late binding mechanism that integrates name resolution and message routing, enabling clients to continue communicating with end-nodes even if the name-to-address mappings change while a session is in progress. INS resolvers self-configure to form an application-level overlay network, which they use to discover new services, perform late binding, and maintain weak consistency of names using soft-state name exchanges and updates. We analyze the performance of the INS algorithms and protocols, present measurements of a Java-based implementation, and describe three applications we have implemented that demonstrate the feasibility and utility of INS.
1999
Design and implementation of a distributed virtual machine for networked computers
This paper describes the motivation, architecture and performance of a distributed virtual machine (DVM) for networked computers. DVMs rely on a distributed service architecture to meet the manageability, security and uniformity requirements of large, heterogeneous clusters of networked computers. In a DVM, system services, such as verification, security enforcement, compilation and optimization, are factored out of clients and located on powerful network servers. This partitioning of system functionality reduces resource requirements on network clients, improves site security through physical isolation and increases the manageability of a large and heterogeneous network without sacrificing performance. Our DVM implements the Java virtual machine, runs on x86 and DEC Alpha processors and supports existing Java-enabled clients.
Networking (2)
1999
The Click modular router
Click is a new software architecture for building flexible and configurable routers. A Click router is assembled from packet processing modules called elements. Individual elements implement simple router functions like packet classification, queueing, scheduling, and interfacing with network devices. Complete configurations are built by connecting elements into a graph; packets flow along the graph's edges. Several features make individual elements more powerful and complex configurations easier to write, including pull processing, which models packet flow driven by transmitting interfaces, and flow-based router context, which helps an element locate other interesting elements. We demonstrate several working configurations, including an IP router and an Ethernet bridge. These configurations are modular—the IP router has 16 elements on the forwarding path—and easy to extend by adding additional elements, which we demonstrate with augmented configurations. On commodity PC hardware running Linux, the Click IP router can forward 64-byte packets at 73,000 packets per second, just 10% slower than Linux alone.
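A minimal sketch of the configuration model: small elements, each doing one packet-processing job, connected into a graph along whose edges packets flow. Only push processing is shown, and the element names and packet representation are invented for illustration; the real element library, pull processing, and router context are omitted.

    class Element:
        """Base class: an element processes a packet and pushes it to its outputs."""
        def __init__(self, *outputs):
            self.outputs = list(outputs)
        def push(self, packet):
            for out in self.outputs:
                out.push(packet)

    class Classifier(Element):
        """Send packets matching a predicate to one output, the rest to another."""
        def __init__(self, predicate, match_out, other_out):
            super().__init__()
            self.predicate, self.match_out, self.other_out = predicate, match_out, other_out
        def push(self, packet):
            (self.match_out if self.predicate(packet) else self.other_out).push(packet)

    class Counter(Element):
        def __init__(self, *outputs):
            super().__init__(*outputs)
            self.count = 0
        def push(self, packet):
            self.count += 1
            super().push(packet)

    class Discard(Element):
        def push(self, packet):
            pass

    class ToDevice(Element):
        def __init__(self, name):
            super().__init__()
            self.name = name
        def push(self, packet):
            print(f"{self.name} <- {packet}")

    # Configuration graph: count IPv4 packets and emit them; drop everything else.
    out = ToDevice("eth1")
    ipv4_counter = Counter(out)
    classify = Classifier(lambda p: p.get("ethertype") == 0x0800, ipv4_counter, Discard())
    for pkt in [{"ethertype": 0x0800, "dst": "10.0.0.2"}, {"ethertype": 0x0806}]:
        classify.push(pkt)
    print("ipv4 packets:", ipv4_counter.count)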
1999
Soft timers: efficient microsecond software timer support for network processing
This paper proposes and evaluates soft timers, a new operating system facility that allows the efficient scheduling of software events at a granularity down to tens of microseconds. Soft timers can be used to avoid interrupts and reduce context switches associated with network processing without sacrificing low communication delays. More specifically, soft timers enable transport protocols like TCP to efficiently perform rate-based clocking of packet transmissions. Experiments show that rate-based clocking can improve HTTP response time over connections with high bandwidth-delay products by up to 89% and that soft timers allow a server to employ rate-based clocking with little CPU overhead (2-6%) at high aggregate bandwidths. Soft timers can also be used to perform network polling, which eliminates network interrupts and increases the memory access locality of the network subsystem without sacrificing delay. Experiments show that this technique can improve the throughput of a Web server by up to 25%.
Real-time
1999
Progress-based regulation of low-importance processes
MS Manners is a mechanism that employs progress-based regulation to prevent resource contention with low-importance processes from degrading the performance of high-importance processes. The mechanism assumes that resource contention that degrades the performance of a high-importance process will also retard the progress of the low-importance process. MS Manners detects this contention by monitoring the progress of the low-importance process and inferring resource contention from a drop in the progress rate. This technique recognizes contention over any system resource, as long as the performance impact on contending processes is roughly symmetric. MS Manners employs statistical mechanisms to deal with stochastic progress measurements; it automatically calibrates a target progress rate, so no manual tuning is required; it supports multiple progress metrics from applications that perform several distinct tasks; and it orchestrates multiple low-importance processes to prevent measurement interference. Experiments with two low-importance applications show that MS Manners can reduce the degradation of high-importance processes by up to an order of magnitude.
1999
Borrowed-virtual-time (BVT) scheduling: supporting latency-sensitive threads in a general-purpose scheduler
Systems need to run a larger and more diverse set of applications, from real-time to interactive to batch, on uniprocessor and multiprocessor platforms. However, most schedulers either do not address latency requirements or are specialized to complex real-time paradigms, limiting their applicability to general-purpose systems. In this paper, we present Borrowed-Virtual-Time (BVT) Scheduling, showing that it provides low-latency for real-time and interactive applications yet weighted sharing of the CPU across applications according to system policy, even with thread failure at the real-time level, all with a low-overhead implementation on multiprocessors as well as uniprocessors. It makes minimal demands on application developers, and can be used with a reservation or admission control module for hard real-time applications.
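A minimal sketch of the BVT dispatch rule: each thread accumulates virtual time in inverse proportion to its weight, a latency-sensitive thread may warp backwards in virtual time to be dispatched sooner, and the runnable thread with the smallest effective virtual time runs next. The quantum, weights, and warp value below are illustrative.

    class Thread:
        def __init__(self, name, weight=1.0, warp=0.0):
            self.name = name
            self.weight = weight      # larger weight -> larger CPU share
            self.warp = warp          # virtual-time credit for latency-sensitive threads
            self.avt = 0.0            # actual virtual time
            self.warp_enabled = False

        def evt(self):
            # Effective virtual time: warping moves a thread earlier in virtual time.
            return self.avt - (self.warp if self.warp_enabled else 0.0)

    def run(threads, quantum=0.005, slices=8):
        """Dispatch loop: always run the thread with the smallest effective virtual
        time, then charge it quantum/weight of virtual time for the slice it used."""
        schedule = []
        for _ in range(slices):
            current = min(threads, key=Thread.evt)
            schedule.append(current.name)
            current.avt += quantum / current.weight
        return schedule

    batch = Thread("batch", weight=2.0)                    # wants a large CPU share
    interactive = Thread("mpeg", weight=1.0, warp=0.02)    # wants low dispatch latency
    interactive.warp_enabled = True                        # e.g. while handling an event
    print(run([batch, interactive]))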
1999
EMERALDS: a small-memory real-time microkernel
EMERALDS (Extensible Microkernel for Embedded, ReAL-time, Distributed Systems) is a real-time microkernel designed for small-memory embedded applications. These applications must run on slow (15-25MHz) processors with just 32-128 kbytes of memory, either to keep production costs down in mass-produced systems or to keep weight and power consumption low. To be feasible for such applications, the OS must not only be small in size (less than 20 kbytes), but also have low-overhead kernel services. Unlike commercial embedded OSs which rely on carefully-crafted code to achieve efficiency, EMERALDS takes the approach of re-designing the basic OS services of task scheduling, synchronization, communication, and system call mechanism by using characteristics found in small-memory embedded systems, such as small code size and a priori knowledge of task execution and communication patterns. With these new schemes, the overheads of various OS services are reduced 20-40% without compromising any OS functionality.

16th SOSP — October 5-8, 1997 — Saint Malo, France
Keynote Address
1997
The Linux challenge
Performance and Correctness
1997
Continuous profiling: where have all the cycles gone?
This paper describes the DIGITAL Continuous Profiling Infrastructure, a sampling-based profiling system designed to run continuously on production systems. The system supports multiprocessors, works on unmodified executables, and collects profiles for entire systems, including user programs, shared libraries, and the operating system kernel. Samples are collected at a high rate (over 5200 samples/sec per 333-MHz processor), yet with low overhead (1-3% slowdown for most workloads).

Analysis tools supplied with the profiling system use the sample data to produce an accurate accounting, down to the level of pipeline stalls incurred by individual instructions, of where time is being spent. When instructions incur stalls, the tools identify possible reasons, such as cache misses, branch mispredictions, and functional unit contention. The fine-grained instruction-level analysis guides users and automated optimizers to the causes of performance problems and provides important insights for fixing them.

1997
System support for automatic profiling and optimization
The Morph system provides a framework for automatic collection and management of profile information and application of profile-driven optimizations. In this paper, we focus on the operating system support that is required to collect and manage profile information on an end-user’s workstation in an automatic, continuous, and transparent manner. Our implementation for a Digital Alpha machine running Digital UNIX 4.0 achieves run-time overheads of less than 0.3% during profile collection. Through the application of three code layout optimizations, we further show that Morph can use statistical profiles to improve application performance. With appropriate system support, automatic profiling and optimization is both possible and effective.
1997
Eraser: a dynamic data race detector for multi-threaded programs
Multi-threaded programming is difficult and error prone. It is easy to make a mistake in synchronization that produces a data race, yet it can be extremely hard to locate this mistake during debugging. This paper describes a new tool, called Eraser, for dynamically detecting data races in lock-based multi-threaded programs. Eraser uses binary rewriting techniques to monitor every shared memory reference and verify that consistent locking behavior is observed. We present several case studies, including undergraduate coursework and a multi-threaded Web search engine, that demonstrate the effectiveness of this approach.
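At Eraser's core is the lockset algorithm, sketched below under simplifying assumptions: each shared variable keeps the set of locks that have protected every access to it so far (seeded from the locks held at the first access), every subsequent access intersects that set with the locks the accessing thread currently holds, and an empty set signals a potential race. The real tool instruments binaries and tracks variable states to suppress false alarms during initialization and read sharing.

    class LocksetChecker:
        def __init__(self):
            self.held = {}        # thread -> set of locks currently held
            self.candidates = {}  # variable -> locks that have protected every access so far

        def lock(self, thread, lk):
            self.held.setdefault(thread, set()).add(lk)

        def unlock(self, thread, lk):
            self.held.get(thread, set()).discard(lk)

        def access(self, thread, var):
            locks_held = self.held.get(thread, set())
            if var not in self.candidates:
                self.candidates[var] = set(locks_held)   # first access: start with held locks
            else:
                self.candidates[var] &= locks_held       # refine: keep only locks always held
            if not self.candidates[var]:
                print(f"possible data race on {var!r} by {thread}")

    checker = LocksetChecker()
    checker.lock("t1", "mu")
    checker.access("t1", "shared_counter")   # candidate set: {"mu"}
    checker.unlock("t1", "mu")
    checker.access("t2", "shared_counter")   # no lock held -> empty set -> warning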
Kernels and OS Structure
1997
The Flux OSKit: a substrate for kernel and language research
Implementing new operating systems is tedious, costly, and often impractical except for large projects. The Flux OSKit addresses this problem in a novel way by providing clean, well-documented OS components designed to be reused in a wide variety of other environments, rather than defining a new OS structure. The OSKit uses unconventional techniques to maximize its usefulness, such as intentionally exposing implementation details and platform-specific facilities. Further, the OSKit demonstrates a technique that allows unmodified code from existing mature operating systems to be incorporated quickly and updated regularly, by wrapping it with a small amount of carefully designed “glue” code to isolate its dependencies and export well-defined interfaces. The OSKit uses this technique to incorporate over 230,000 lines of stable code including device drivers, file systems, and network protocols. Our experience demonstrates that this approach to component software structure and reuse has a surprisingly large impact in the OS implementation domain. Four real-world examples show how the OSKit is catalyzing research and development in operating systems and programming languages.
1997
Application performance and flexibility on exokernel systems
The exokernel operating system architecture safely gives untrusted software efficient control over hardware and software resources by separating management from protection. This paper describes an exokernel system that allows specialized applications to achieve high performance without sacrificing the performance of unmodified UNIX programs. It evaluates the exokernel architecture by measuring end-to-end application performance on Xok, an exokernel for Intel x86-based computers, and by comparing Xok’s performance to the performance of two widely-used 4.4BSD UNIX systems (FreeBSD and OpenBSD). The results show that common unmodified UNIX applications can enjoy the benefits of exokernels: applications either perform comparably on Xok/ExOS and the BSD UNIXes, or perform significantly better. In addition, the results show that customized applications can benefit substantially from control over their resources (e.g., a factor of eight for a Web server). This paper also describes insights about the exokernel approach gained through building three different exokernel systems, and presents novel approaches to resource multiplexing.
1997
The performance of μ-kernel-based systems
First-generation µ-kernels have a reputation for being too slow and lacking sufficient flexibility. To determine whether L4, a lean second-generation µ-kernel, has overcome these limitations, we have repeated several earlier experiments and conducted some novel ones. Moreover, we ported the Linux operating system to run on top of the L4 µ-kernel and compared the resulting system with both Linux running native, and MkLinux, a Linux version that executes on top of a first-generation Mach-derived µ-kernel.

For L4Linux, the AIM benchmarks report a maximum throughput which is only 5% lower than that of native Linux. The corresponding penalty is 5 times higher for a co-located in-kernel version of MkLinux, and 7 times higher for a user-level version of MkLinux. These numbers demonstrate both that it is possible to implement a high-performance conventional operating system personality above a µ-kernel, and that the performance of the µ-kernel is crucial to achieve this.

Further experiments illustrate that the resulting system is highly extensible and that the extensions perform well. Even real-time memory management including second-level cache allocation can be implemented at user-level, coexisting with L4Linux.

Network and Data Services
1997
Cluster-based scalable network services
We identify three fundamental requirements for scalable network services: incremental scalability and overflow growth provisioning, 24x7 availability through fault masking, and cost-effectiveness. We argue that clusters of commodity workstations interconnected by a high-speed SAN are exceptionally well-suited to meeting these challenges for Internet-server workloads, provided the software infrastructure for managing partial failures and administering a large cluster does not have to be reinvented for each new service. To this end, we propose a general, layered architecture for building cluster-based scalable network services that encapsulates the above requirements for reuse, and a service-programming model based on composable workers that perform transformation, aggregation, caching, and customization (TACC) of Internet content. For both performance and implementation simplicity, the architecture and TACC programming model exploit BASE, a weaker-than-ACID data semantics that results from trading consistency for availability and relying on soft state for robustness in failure management. Our architecture can be used as an “off the shelf” infrastructural platform for creating new network services, allowing authors to focus on the “content” of the service (by composing TACC building blocks) rather than its implementation. We discuss two real implementations of services based on this architecture: TranSend, a Web distillation proxy deployed to the UC Berkeley dialup IP population, and HotBot, the commercial implementation of the Inktomi search engine. We present detailed measurements of TranSend’s performance based on substantial client traces, as well as anecdotal evidence from the TranSend and HotBot experience, to support the claims made for the architecture.
1997
Free transactions with Rio Vista
Transactions and recoverable memories are powerful mechanisms for handling failures and manipulating persistent data. Unfortunately, standard recoverable memories incur an overhead of several milliseconds per transaction. This paper presents a system that reduces transaction overhead by a factor of 2000 for working sets that fit in main memory. Of this factor of 2000, a factor of 20 is due to the Rio file cache, which absorbs synchronous writes to disk without losing data during system crashes. The remaining factor of 100 is due to Vista, a 720-line, recoverable-memory library tailored for Rio. Vista lowers transaction overhead to 5 µsec by using no redo log, no system calls, and only one memory-to-memory copy. This drastic reduction in overhead leads to an overall speedup of 150-556x for benchmarks based on TPC-B and TPC-C.
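
A minimal sketch of the undo-log discipline behind this design, assuming both the data region and the undo log live in a crash-surviving store such as the Rio file cache; the names and layout are illustrative, not Vista's actual library interface.

    # Minimal undo-log recoverable memory in the spirit of Vista. Assumes the
    # data region and the undo log both live in a crash-surviving store such as
    # the Rio file cache; names are illustrative, not Vista's API.

    persistent = bytearray(1024)   # recoverable memory region
    undo_log = []                  # (offset, old_bytes) records for the open transaction

    def begin_transaction():
        undo_log.clear()

    def set_range(offset, new_bytes):
        """Save the old contents once, then update in place (one memory copy)."""
        undo_log.append((offset, bytes(persistent[offset:offset + len(new_bytes)])))
        persistent[offset:offset + len(new_bytes)] = new_bytes

    def commit():
        undo_log.clear()           # updates are already in place; nothing to redo

    def abort_or_recover():
        for offset, old in reversed(undo_log):
            persistent[offset:offset + len(old)] = old
        undo_log.clear()

    begin_transaction()
    set_range(0, b"balance=90")
    commit()
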
1997
HAC: hybrid adaptive caching for distributed storage systems
This paper presents HAC, a novel technique for managing the client cache in a distributed, persistent object storage system. HAC is a hybrid between page and object caching that combines the virtues of both while avoiding their disadvantages. It achieves the low miss penalties of a page-caching system, but is able to perform well even when locality is poor, since it can discard pages while retaining their hot objects. It realizes the potentially lower miss rates of object-caching systems, yet avoids their problems of fragmentation and high overheads. Furthermore, HAC is adaptive: when locality is good it behaves like a page-caching system, while if locality is poor it behaves like an object-caching system. It is able to adjust the amount of cache space devoted to pages dynamically so that space in the cache can be used in the way that best matches the needs of the application.

The paper also presents results of experiments that indicate that HAC outperforms other object storage systems across a wide range of cache sizes and workloads; it performs substantially better on the expected workloads, which have low to moderate locality. Thus we show that our hybrid, adaptive approach is the cache management technique of choice for distributed, persistent object systems.
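
The hybrid idea can be sketched in a few lines: when the cache must shrink, a cold page is freed as a unit, but its hot objects are compacted and retained. The usage-count threshold and data layout below are illustrative assumptions, not HAC's actual adaptive policy.

    # Hybrid page/object eviction: free a cold page as a unit but compact and
    # retain its hot objects. The usage threshold and layout are assumptions,
    # not HAC's actual adaptive algorithm.

    class Page:
        def __init__(self, objects):
            self.objects = objects             # object id -> usage count

    cache_pages = {}                           # page id -> Page
    retained_objects = {}                      # hot objects kept after eviction

    def evict(page_id, hot_threshold=3):
        page = cache_pages.pop(page_id)
        for oid, usage in page.objects.items():
            if usage >= hot_threshold:         # keep hot objects, discard the rest
                retained_objects[oid] = usage

    cache_pages["p1"] = Page({"obj_a": 5, "obj_b": 0, "obj_c": 1})
    evict("p1")
    print(retained_objects)                    # {'obj_a': 5}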

Security
1997
Extensible security architectures for Java
Mobile code technologies such as Java, JavaScript, and ActiveX generally limit all programs to a single restrictive security policy. However, software-based protection can allow for more extensible security models, with potentially significant performance improvements over traditional hardware-based solutions. An extensible security system should be able to protect subsystems and implement policies that are created after the initial system is shipped. We describe and analyze three implementation strategies for interposing such security policies in software-based security systems. Implementations exist for all three strategies: several vendors have adapted capabilities to Java, Netscape and Microsoft have extensions to Java’s stack introspection, and we built a name space management system as an add-on to Microsoft Internet Explorer. Theoretically, all these systems are equivalently secure, but many practical issues and implementation details favor some aspects of each system.
1997
A decentralized model for information flow control
This paper presents a new model for controlling information flow in systems with mutual distrust and decentralized authority. The model allows users to share information with distrusted code (e.g., downloaded applets), yet still control how that code disseminates the shared information to others. The model improves on existing multilevel security models by allowing users to declassify information in a decentralized way, and by improving support for fine-grained data sharing. The paper also shows how static program analysis can be used to certify proper information flows in this model and to avoid most run-time information flow checks.
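
The decentralized label model can be illustrated with a small sketch in which a label is a set of per-owner policies mapping each owner to its allowed readers; data may flow to a destination only if every policy stays at least as restrictive, and only an owner may relax (declassify) its own policy. The representation below is an illustrative simplification of the model, not its full formalization.

    # Sketch of decentralized information-flow labels: each label is a set of
    # policies {owner: allowed_readers}. Flow from src to dst is permitted only
    # if dst keeps every policy of src at least as restrictive. Illustrative only.

    def can_flow(src, dst):
        for owner, readers in src.items():
            if owner not in dst or not dst[owner] <= readers:
                return False
        return True

    def declassify(label, owner, new_readers):
        """Only `owner` may relax its own policy (decentralized declassification)."""
        relaxed = dict(label)
        relaxed[owner] = new_readers
        return relaxed

    applet_data = {"alice": {"alice", "bob"}}           # alice lets alice and bob read
    stricter    = {"alice": {"alice"}}                  # more restrictive label
    print(can_flow(applet_data, stricter))              # True: flow to stricter label
    print(can_flow(stricter, applet_data))              # False without declassification
    print(can_flow(declassify(stricter, "alice", {"alice", "bob"}), applet_data))  # True
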
Multiprocessing Support
1997
Disco: running commodity operating systems on scalable multiprocessors
In this paper we examine the problem of extending modern operating systems to run efficiently on large-scale shared memory multiprocessors without a large implementation effort. Our approach brings back an idea popular in the 1970s, virtual machine monitors. We use virtual machines to run multiple commodity operating systems on a scalable multiprocessor. This solution addresses many of the challenges facing the system software for these machines. We demonstrate our approach with a prototype called Disco that can run multiple copies of Silicon Graphics’ IRIX operating system on a multiprocessor. Our experience shows that the overheads of the monitor are small and that the approach provides scalability as well as the ability to deal with the non-uniform memory access time of these systems. To reduce the memory overheads associated with running multiple operating systems, we have developed techniques where the virtual machines transparently share major data structures such as the program code and the file system buffer cache. We use the distributed system support of modern operating systems to export a partial single system image to the users. The overall solution achieves most of the benefits of operating systems customized for scalable multiprocessors yet it can be achieved with a significantly smaller implementation effort.
1997
Towards transparent and efficient software distributed shared memory
Despite a large research effort, software distributed shared memory systems have not been widely used to run parallel applications across clusters of computers. The higher performance of hardware multiprocessors makes them the preferred platform for developing and executing applications. In addition, most applications are distributed only in binary format for a handful of popular hardware systems. Due to their limited functionality, software systems cannot directly execute the applications developed for hardware platforms. We have developed a system called Shasta that attempts to address the issues of efficiency and transparency that have hindered wider acceptance of software systems. Shasta is a distributed shared memory system that supports coherence at a fine granularity in software and can efficiently exploit small-scale SMP nodes by allowing processes on the same node to share data at hardware speeds.

This paper focuses on our goal of tapping into large classes of commercially available applications by transparently executing the same binaries that run on hardware platforms. We discuss the issues involved in achieving transparent execution of binaries, which include supporting the full instruction set architecture, implementing an appropriate memory consistency model, and extending OS services across separate nodes. We also describe the techniques used in Shasta to solve the above problems. The Shasta system is fully functional on a prototype cluster of Alpha multiprocessors connected through Digital’s Memory Channel network and can transparently run parallel applications on the cluster that were compiled to run on a single shared-memory multiprocessor. As an example of Shasta’s flexibility, it can execute Oracle 7.3, a commercial database engine, across the cluster, including workloads modeled after the TPC-B and TPC-D database benchmarks. To characterize the performance of the system and the cost of providing complete transparency, we present performance results for microbenchmarks and applications running on the cluster, including preliminary results for Oracle runs.

1997
Cashmere-2L: software coherent shared memory on a clustered remote-write network
Low-latency remote-write networks, such as DEC’s Memory Channel, provide the possibility of transparent, inexpensive, large-scale shared-memory parallel computing on clusters of shared memory multiprocessors (SMPs). The challenge is to take advantage of hardware shared memory for sharing within an SMP, and to ensure that software overhead is incurred only when actively sharing data across SMPs in the cluster. In this paper, we describe a “two-level” software coherent shared memory system—Cashmere-2L—that meets this challenge. Cashmere-2L uses hardware to share memory within a node, while exploiting the Memory Channel’s remote-write capabilities to implement “moderately lazy” release consistency with multiple concurrent writers, directories, home nodes, and page-size coherence blocks across nodes. Cashmere-2L employs a novel coherence protocol that allows a high level of asynchrony by eliminating global directory locks and the need for TLB shootdown. Remote interrupts are minimized by exploiting the remote-write capabilities of the Memory Channel network.

Cashmere-2L currently runs on an 8-node, 32-processor DEC AlphaServer system. Speedups range from 8 to 31 on 32 processors for our benchmark suite, depending on the application’s characteristics. We quantify the importance of our protocol optimizations by comparing performance to that of several alternative protocols that do not share memory in hardware within an SMP, and require more synchronization. In comparison to a one-level protocol that does not share memory in hardware within an SMP, Cashmere-2L improves performance by up to 46%.

Scheduling for Multimedia
1997
The design, implementation and evaluation of SMART: a scheduler for multimedia applications
Real-time applications such as multimedia audio and video are increasingly populating the workstation desktop. To support the execution of these applications in conjunction with traditional non-realtime applications, we have created SMART, a Scheduler for Multimedia And Real-Time applications. SMART supports applications with time constraints, and provides dynamic feedback to applications to allow them to adapt to the current load. In addition, the support for real-time applications is integrated with the support for conventional computations. This allows the user to prioritize across real-time and conventional computations, and dictate how the processor is to be shared among applications of the same priority. As the system load changes, SMART adjusts the allocation of resources dynamically and seamlessly. SMART is unique in its ability to automatically shed real-time tasks and regulate their execution rates when the system is overloaded, while providing better value in underloaded conditions than previously proposed schemes. We have implemented SMART in the Solaris UNIX operating system and measured its performance against other schedulers in executing real-time, interactive, and batch applications. Our results demonstrate SMART’s superior performance in supporting multimedia applications.
1997
CPU reservations and time constraints: efficient, predictable scheduling of independent activities
Workstations and personal computers are increasingly being used for applications with real-time characteristics such as speech understanding and synthesis, media computations and I/O, and animation, often concurrently executed with traditional non-real-time workloads. This paper presents a system that can schedule multiple independent activities so that:
  • activities can obtain minimum guaranteed execution rates with application-specified reservation granularities via CPU Reservations,
  • CPU Reservations, which are of the form “reserve X units of time out of every Y units”, provide not just an average case execution rate of X/Y over long periods of time, but the stronger guarantee that from any instant of time, by Y time units later, the activity will have executed for at least X time units (see the worked check after this abstract),
  • applications can use Time Constraints to schedule tasks by deadlines, with on-time completion guaranteed for tasks with accepted constraints, and
  • both CPU Reservations and Time Constraints are implemented very efficiently. In particular,
  • CPU scheduling overhead is bounded by a constant and is not a function of the number of schedulable tasks.
Other key scheduler properties are:
  • activities cannot violate other activities’ guarantees,
  • time constraints and CPU reservations may be used together, separately, or not at all (which gives a round-robin schedule), with well-defined interactions between all combinations, and
  • spare CPU time is fairly shared among all activities.
The Rialto operating system, developed at Microsoft Research, achieves these goals by using a precomputed schedule, which is the fundamental basis of this work.
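
The reservation guarantee quoted in the list above is sliding-window rather than average-case, which the following sketch makes concrete by checking a periodic slot schedule: from every starting slot, each window of Y slots must contain at least X slots of the activity. The slot encoding is an assumption for illustration, not Rialto's actual precomputed schedule representation.

    # Checks the "X units out of every Y" reservation guarantee over a periodic
    # slot schedule: every window of Y consecutive slots, starting at any
    # instant, must hold at least X slots of the activity.

    def meets_reservation(schedule, activity, x, y):
        n = len(schedule)                     # schedule repeats every n unit slots
        for start in range(n):                # "any instant" = any starting slot
            window = (schedule[(start + i) % n] for i in range(y))
            if sum(1 for slot in window if slot == activity) < x:
                return False
        return True

    # A period-8 schedule in which activity "A" holds 2 slots out of every 4.
    schedule = ["A", "B", "A", "C", "A", "B", "A", "C"]
    print(meets_reservation(schedule, "A", x=2, y=4))   # True
    print(meets_reservation(schedule, "B", x=2, y=4))   # False
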
1997
Distributed schedule management in the Tiger video fileserver
Tiger is a scalable, fault-tolerant video file server constructed from a collection of computers connected by a switched network. All content files are striped across all of the computers and disks in a Tiger system. In order to prevent conflicts for a particular resource between two viewers, Tiger schedules viewers so that they do not require access to the same resource at the same time. In the abstract, there is a single, global schedule that describes all of the viewers in the system. In practice, the schedule is distributed among all of the computers in the system, each of which has a possibly partially inconsistent view of a subset of the schedule. By using such a relaxed consistency model for the schedule, Tiger achieves scalability and fault tolerance while still providing the consistent, coordinated service required by viewers.
File Systems and I/O
1997
Frangipani: a scalable distributed file system
The ideal distributed file system would provide all its users with coherent, shared access to the same set of files, yet would be arbitrarily scalable to provide more storage space and higher performance to a growing user community. It would be highly available in spite of component failures. It would require minimal human administration, and administration would not become more complex as more components were added.

Frangipani is a new file system that approximates this ideal, yet was relatively easy to build because of its two-layer structure. The lower layer is Petal (described in an earlier paper), a distributed storage service that provides incrementally scalable, highly available, automatically managed virtual disks. In the upper layer, multiple machines run the same Frangipani file system code on top of a shared Petal virtual disk, using a distributed lock service to ensure coherence.

Frangipani is meant to run in a cluster of machines that are under a common administration and can communicate securely. Thus the machines trust one another and the shared virtual disk approach is practical. Of course, a Frangipani file system can be exported to untrusted machines using ordinary network file access protocols.

We have implemented Frangipani on a collection of Alphas running DIGITAL Unix 4.0. Initial measurements indicate that Frangipani has excellent single-server performance and scales well as servers are added.

1997
Improving the performance of log-structured file systems with adaptive methods
File system designers today face a dilemma. A log-structured file system (LFS) can offer superior performance for many common workloads such as those with frequent small writes, read traffic that is predominantly absorbed by the cache, and sufficient idle time to clean the log. However, an LFS has poor performance for other workloads, such as random updates to a full disk with little idle time to clean. In this paper, we show how adaptive algorithms can be used to enable LFS to provide high performance across a wider range of workloads. First, we show how to improve LFS write performance in three ways: by choosing the segment size to match disk and workload characteristics, by modifying the LFS cleaning policy to adapt to changes in disk utilization, and by using cached data to lower cleaning costs. Second, we show how to improve LFS read performance by reorganizing data to match read patterns. Using trace-driven simulations on a combination of synthetic and measured workloads, we demonstrate that these extensions to LFS can significantly improve its performance.
1997
Exploiting the non-determinism and asynchrony of set iterators to reduce aggregate file I/O latency
A key goal of distributed systems is to provide prompt access to shared information repositories. The high latency of remote access is a serious impediment to this goal. This paper describes a new file system abstraction called dynamic sets — unordered collections created by an application to hold the files it intends to process. Applications that iterate on the set to access its members allow the system to reduce the aggregate I/O latency by exploiting the non-determinism and asynchrony inherent in the semantics of set iterators. This reduction in latency comes without relying on reference locality, without modifying DFS servers and protocols, and without unduly complicating the programming model. This paper presents this abstraction and describes an implementation of it that runs on local and distributed file systems, as well as the World Wide Web. Dynamic sets demonstrate substantial performance gains — up to 50% savings in runtime for search on NFS, and up to 90% reduction in I/O latency for Web searches.
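
The benefit of set iteration comes from issuing member fetches in parallel and handing back whichever member finishes first. A small Python sketch of such an iterator follows; the fetch function is a hypothetical stand-in for a remote file or Web access.

    # Dynamic-set iterator: fetches of all members are issued up front and
    # results are yielded in completion order, overlapping remote latencies.
    # `fetch` is a hypothetical stand-in for a remote file or Web access.

    import concurrent.futures, random, time

    def fetch(name):
        time.sleep(random.uniform(0.05, 0.3))    # simulated variable remote latency
        return name, f"contents of {name}"

    def iterate_set(members, workers=8):
        with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
            futures = [pool.submit(fetch, m) for m in members]
            for done in concurrent.futures.as_completed(futures):
                yield done.result()              # completion order, not set order

    for name, data in iterate_set(["a.txt", "b.txt", "c.txt", "d.txt"]):
        print("processing", name)
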
Mobile Systems
1997
Automated hoarding for mobile computers
A common problem facing mobile computing is disconnected operation, or computing in the absence of a network. Hoarding eases disconnected operation by selecting a subset of the user’s files for local storage. We describe a hoarding system that can operate without user intervention, by observing user activity and predicting future needs. The system calculates a new measure, semantic distance, between individual files, and uses this to feed a clustering algorithm that chooses which files should be hoarded. A separate replication system manages the actual transport of data; any of a number of replication systems may be used. We discuss practical problems encountered in the real world and present usage statistics showing that our system outperforms previous approaches by factors that can exceed 10:1.
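
One simple way to realize a co-access-based distance is sketched below: files opened close together in a trace are treated as semantically near, and the resulting distances can feed a clustering step. The formula is an illustrative simplification, not the paper's semantic-distance measure.

    # Illustrative semantic-distance computation from a trace of file opens:
    # files opened close together in the trace are considered semantically near.
    # This simple formula is an assumption, not the paper's exact measure.

    from collections import defaultdict

    def semantic_distance(trace, window=5):
        """Return pairwise distances; co-accesses within `window` opens shrink them."""
        closeness = defaultdict(int)
        for i, f in enumerate(trace):
            for g in trace[max(0, i - window):i]:
                if g != f:
                    closeness[frozenset((f, g))] += 1
        return {pair: 1.0 / (1 + c) for pair, c in closeness.items()}

    trace = ["main.c", "util.h", "main.c", "Makefile", "notes.txt", "main.c", "util.h"]
    for pair, d in sorted(semantic_distance(trace).items(), key=lambda kv: kv[1]):
        print(sorted(pair), round(d, 2))      # nearest pairs first; feed a clusterer
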
1997
Agile application-aware adaptation for mobility
In this paper we show that application-aware adaptation, a collaborative partnership between the operating system and applications, offers the most general and effective approach to mobile information access. We describe the design of Odyssey, a prototype implementing this approach, and show how it supports concurrent execution of diverse mobile applications. We identify agility as a key attribute of adaptive systems, and describe how to quantify and measure it. We present the results of our evaluation of Odyssey, indicating performance improvements up to a factor of 5 on a benchmark of three applications concurrently using remote services over a network with highly variable bandwidth.
1997
Flexible update propagation for weakly consistent replication
Bayou’s anti-entropy protocol for update propagation between weakly consistent storage replicas is based on pair-wise communication, the propagation of write operations, and a set of ordering and closure constraints on the propagation of the writes. The simplicity of the design makes the protocol very flexible, thereby providing support for diverse networking environments and usage scenarios. It accommodates a variety of policies for when and where to propagate updates. It operates over diverse network topologies, including low-bandwidth links. It is incremental. It enables replica convergence, and updates can be propagated using floppy disks and similar transportable media. Moreover, the protocol handles replica creation and retirement in a light-weight manner. Each of these features is enabled by only one or two of the protocol’s design choices, and can be independently incorporated in other systems. This paper presents the anti-entropy protocol in detail, describing the design decisions and resulting features.
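
The core of pair-wise anti-entropy can be sketched with version vectors: the sending replica forwards, in accept order, exactly those writes the receiver's version vector shows it has not yet seen. The data structures below are illustrative, not Bayou's actual formats.

    # Sketch of pair-wise anti-entropy: replica A sends B, in accept order,
    # exactly those writes B's version vector shows it has not seen.

    class Replica:
        def __init__(self, rid):
            self.rid = rid
            self.log = []                      # (accept_ts, server_id, op) in order
            self.clock = 0
            self.version_vector = {}           # server_id -> highest accept_ts seen

        def local_write(self, op):
            self.clock += 1
            self._apply((self.clock, self.rid, op))

        def _apply(self, write):
            ts, sid, _ = write
            self.log.append(write)
            self.version_vector[sid] = max(self.version_vector.get(sid, 0), ts)
            self.clock = max(self.clock, ts)

        def anti_entropy_to(self, other):
            for write in self.log:             # propagate in order (closure constraint)
                ts, sid, _ = write
                if ts > other.version_vector.get(sid, 0):
                    other._apply(write)

    a, b = Replica("A"), Replica("B")
    a.local_write("add meeting")
    b.local_write("add lunch")
    a.anti_entropy_to(b)
    b.anti_entropy_to(a)
    print(len(a.log), len(b.log))              # both logs now hold both writes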

15th SOSP — December 3-6, 1995 — Copper Mountain, Colorado, USA
Reliability
1995
Hypervisor-based fault tolerance
Protocols to implement a fault-tolerant computing system are described. These protocols augment the hypervisor of a virtual-machine manager and coordinate a primary virtual machine with its backup. The result is a fault-tolerant computing system. No modification to hardware, operating system, or application programs is required. A prototype system was constructed for HP’s PA-RISC instruction-set architecture. The prototype was able to run programs about a factor of 2 slower than a bare machine would.
1995
Hive: fault containment for shared-memory multiprocessors
Reliability and scalability are major concerns when designing operating systems for large-scale shared-memory multiprocessors. In this paper we describe Hive, an operating system with a novel kernel architecture that addresses these issues. Hive is structured as an internal distributed system of independent kernels called cells. This improves reliability because a hardware or software fault damages only one cell rather than the whole system, and improves scalability because few kernel resources are shared by processes running on different cells. The Hive prototype is a complete implementation of UNIX SVR4 and is targeted to run on the Stanford FLASH multiprocessor.

This paper focuses on Hive’s solution to the following key challenges: (1) fault containment, i.e. confining the effects of hardware or software faults to the cell where they occur, and (2) memory sharing among cells, which is required to achieve application performance competitive with other multiprocessor operating systems. Fault containment in a shared-memory multiprocessor requires defending each cell against erroneous writes caused by faults in other cells. Hive prevents such damage by using the FLASH firewall, a write permission bit-vector associated with each page of memory, and by discarding potentially corrupt pages when a fault is detected. Memory sharing is provided through a unified file and virtual memory page cache across the cells, and through a unified free page frame pool.

We report early experience with the system, including the results of fault injection and performance experiments using SimOS, an accurate simulator of FLASH. The effects of faults were contained to the cell in which they occurred in all 49 tests where we injected fail-stop hardware faults, and in all 20 tests where we injected kernel data corruption. The Hive prototype executes test workloads on a four-processor four-cell system with between 0% and 11% slowdown as compared to SGI IRIX 5.2 (the version of UNIX on which it is based).

1995
Logged virtual memory
Logged virtual memory (LVM) provides a log of writes to one or more specified regions of the virtual address space. Logging is useful for applications that require rollback and/or persistence such as parallel simulations and memory-mapped object-oriented databases. It can also be used for output, debugging, and distributed consistency maintenance.

This paper describes logged virtual memory as an extension of the standard virtual memory system software and hardware, our prototype implementation and some performance measurements from this prototype. Based on these measurements and the experience with our prototype, we argue that logged virtual memory can be supported with modest extensions to standard virtual memory systems, provides significant benefit to applications and servers, and is faster than other log-generation techniques.

Distributed Computing
1995
U-Net: a user-level network interface for parallel and distributed computing (includes URL)
The U-Net communication architecture provides processes with a virtual view of a network interface to enable user-level access to high-speed communication devices. The architecture, implemented on standard workstations using off-the-shelf ATM communication hardware, removes the kernel from the communication path, while still providing full protection.

The model presented by U-Net allows for the construction of protocols at user level whose performance is only limited by the capabilities of the network. The architecture is extremely flexible in the sense that traditional protocols like TCP and UDP, as well as novel abstractions like Active Messages can be implemented efficiently. A U-Net prototype on an 8-node ATM cluster of standard workstations offers 65 microseconds round-trip latency and 15 Mbytes/sec bandwidth. It achieves TCP performance at maximum network bandwidth and demonstrates performance equivalent to Meiko CS-2 and TMC CM-5 supercomputers on a set of Split-C benchmarks.

1995
A highly available scalable ITV system
As part of Time Warner’s interactive TV trial in Orlando, Florida, we have implemented mechanisms for the construction of highly available and scalable system services and applications. Our mechanisms rely on an underlying distributed objects architecture, similar to Spring. We have extended a standard name service interface to provide selectors for choosing among service replicas and auditing to allow the automatic detection and removal of unresponsive objects from the name space. In addition, our system supports resource recovery, by letting servers detect client failures, and automated restart of failed services. Our experience has been that these mechanisms greatly simplify the development of services that are both highly available and scalable. The system was built in less than 15 months, is currently in a small number of homes, and will support the trial’s 4,000 users later this year.
1995
Object and native code thread mobility among heterogeneous computers (includes sources)
We present a technique for moving objects and threads among heterogeneous computers at the native code level. To enable mobility of threads running native code, we convert thread states among machine-dependent and machine-independent formats. We introduce the concept of bus stops, which are machine-independent representations of program points as represented by program counter values. The concept of bus stops can also be used for other purposes, e.g., to aid inspecting and debugging optimized code, garbage collection, etc. We also discuss techniques for thread mobility among processors executing differently optimized codes.

We demonstrate the viability of our ideas by providing a prototype implementation of object and thread mobility among heterogeneous computers. The prototype uses the Emerald distributed programming language without modification; we have merely extended the Emerald runtime system and the code generator of the Emerald compiler. Our extensions allow object and thread mobility among VAX, Sun-3, HP9000/300, and Sun SPARC workstations. The excellent intra-node performance of the original homogeneous Emerald is retained: migrated threads run at native code speed before and after migration; the same speed as on homogeneous Emerald and close to C code performance. Our implementation of mobility has not been optimized: thread mobility and trans-architecture invocations take about 60% longer than in the homogeneous implementation.

We believe this is the first implementation of full object and thread mobility among heterogeneous computers with threads executing native code.

File Systems
1995
Informed prefetching and caching
In this paper, we present aggressive, proactive mechanisms that tailor file system resource management to the needs of I/O-intensive applications. In particular, we show how to use application-disclosed access patterns (hints) to expose and exploit I/O parallelism, and to dynamically allocate file buffers among three competing demands: prefetching hinted blocks, caching hinted blocks for reuse, and caching recently used data for unhinted accesses. Our approach estimates the impact of alternative buffer allocations on application execution time and applies cost-benefit analysis to allocate buffers where they will have the greatest impact. We have implemented informed prefetching and caching in Digital’s OSF/1 operating system and measured its performance on a 150 MHz Alpha equipped with 15 disks running a range of applications. Informed prefetching reduces the execution time of text search, scientific visualization, relational database queries, speech recognition, and object linking by 20-83%. Informed caching reduces the execution time of computational physics by up to 42% and contributes to the performance improvement of the object linker and the database. Moreover, applied to multiprogrammed, I/O-intensive workloads, informed prefetching and caching increase overall throughput.
1995
The HP AutoRAID hierarchical storage system
Configuring redundant disk arrays is a black art. To properly configure an array, a system administrator must understand the details of both the array and the workload it will support; incorrect understanding of either, or changes in the workload over time, can lead to poor performance.

We present a solution to this problem: a two-level storage hierarchy implemented inside a single disk-array controller. In the upper level of this hierarchy, two copies of active data are stored to provide full redundancy and excellent performance. In the lower level, RAID 5 parity protection is used to provide excellent storage cost for inactive data, at somewhat lower performance.

The technology we describe in this paper, known as HP AutoRAID, automatically and transparently manages migration of data blocks between these two levels as access patterns change. The result is a fully-redundant storage system that is extremely easy to use, suitable for a wide variety of workloads, largely insensitive to dynamic workload changes, and that performs much better than disk arrays with comparable numbers of spindles and much larger amounts of front-end RAM cache. Because the implementation of the HP AutoRAID technology is almost entirely in embedded software, the additional hardware cost for these benefits is very small. We describe the HP AutoRAID technology in detail, and provide performance data for an embodiment of it in a prototype storage array, together with the results of simulation studies used to choose algorithms used in the array.

1995
Serverless network file systems
In this paper, we propose a new paradigm for network file system design, serverless network file systems. While traditional network file systems rely on a central server machine, a serverless system utilizes workstations cooperating as peers to provide all file system services. Any machine in the system can store, cache, or control any block of data. Our approach uses this location independence, in combination with fast local area networks, to provide better performance and scalability than traditional file systems. Further, because any machine in the system can assume the responsibilities of a failed component, our serverless design also provides high availability via redundant data storage. To demonstrate our approach, we have implemented a prototype serverless network file system called xFS. Preliminary performance measurements suggest that our architecture achieves its goal of scalability. For instance, in a 32-node xFS system with 32 active clients, each client receives nearly as much read or write throughput as it would see if it were the only active client.
1995
Performance of cache coherence in stackable filing
Stackable design of filing systems constructs sophisticated services from multiple, independently developed layers. This approach has been advocated to address development problems from code re-use, to extensibility, to version management.

Individual layers of such a system often need to cache data to improve performance or provide desired functionality. When access to different layers is allowed, cache incoherencies can occur. Without a cache coherence solution, layer designers must either restrict layer access and flexibility or compromise the layered structure to avoid potential data corruption. The value of modular designs such as stacking can be questioned without a suitable solution to this problem.

This paper presents a general cache coherence architecture for stackable filing, including a standard approach to data identification as a key component of layered coherence protocols. We also present a detailed performance analysis of one implementation of stack cache-coherence, which suggests that very low overheads can be achieved in practice.

Mobility
1995
Exploiting weak connectivity for mobile file access
Weak connectivity, in the form of intermittent, low-bandwidth, or expensive networks is a fact of life in mobile computing. In this paper, we describe how the Coda File System has evolved to exploit such networks. The underlying theme of this evolution has been the systematic introduction of adaptivity to eliminate hidden assumptions about strong connectivity. Many aspects of the system, including communication, cache validation, update propagation and cache miss handling have been modified. As a result, Coda is able to provide good performance even when network bandwidth varies over four orders of magnitude — from modem speeds to LAN speeds.
1995
Rover: a toolkit for mobile information access
The Rover toolkit combines relocatable dynamic objects and queued remote procedure calls to provide unique services for “roving” mobile applications. A relocatable dynamic object is an object with a well-defined interface that can be dynamically loaded into a client computer from a server computer (or vice versa) to reduce client-server communication requirements. Queued remote procedure call is a communication system that permits applications to continue to make non-blocking remote procedure call requests even when a host is disconnected, with requests and responses being exchanged upon network reconnection. The challenges of mobile environments include intermittent connectivity, limited bandwidth, and channel-use optimization. Experimental results from a Rover-based mail reader, calendar program, and two non-blocking versions of World-Wide Web browsers show that Rover’s services are a good match to these challenges. The Rover toolkit also offers advantages for workstation applications by providing a uniform distributed object architecture for code shipping, object caching, and asynchronous object invocation.
1995
Managing update conflicts in Bayou, a weakly connected replicated storage system
Bayou is a replicated, weakly consistent storage system designed for a mobile computing environment that includes portable machines with less than ideal network connectivity. To maximize availability, users can read and write any accessible replica. Bayou’s design has focused on supporting application-specific mechanisms to detect and resolve the update conflicts that naturally arise in such a system, ensuring that replicas move towards eventual consistency, and defining a protocol by which the resolution of update conflicts stabilizes. It includes novel methods for conflict detection, called dependency checks, and per-write conflict resolution based on client-provided merge procedures. To guarantee eventual consistency, Bayou servers must be able to rollback the effects of previously executed writes and redo them according to a global serialization order. Furthermore, Bayou permits clients to observe the results of all writes received by a server, including tentative writes whose conflicts have not been ultimately resolved. This paper presents the motivation for and design of these mechanisms and describes the experiences gained with an initial implementation of the system.
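
Dependency checks and merge procedures can be sketched concretely: each write carries an application-supplied check of the expected database state and a merge procedure to run if the check fails at some server. The calendar data model and field names below are illustrative, not Bayou's schema.

    # Sketch of application-specific conflict handling in the Bayou style: each
    # write carries a dependency check and a merge procedure. The calendar data
    # model and the 11:00 fallback are illustrative.

    calendar = {}                              # slot -> meeting name

    def apply_write(db, update, dep_check, mergeproc):
        if dep_check(db):                      # expected condition still holds?
            db.update(update)
        else:                                  # conflict: run the client's merge
            db.update(mergeproc(db))

    def reserve(slot, who):
        """A write that books `slot`, falling back to 11:00 on conflict."""
        apply_write(calendar,
                    update={slot: who},
                    dep_check=lambda db: slot not in db,
                    mergeproc=lambda db: {"11:00": who} if "11:00" not in db else {})

    reserve("10:00", "design review")          # succeeds: slot was free
    reserve("10:00", "staff meeting")          # conflicts: merged to 11:00
    print(calendar)                            # {'10:00': 'design review', '11:00': 'staff meeting'}
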
Virtual Memory
1995
A new page table for 64-bit address spaces
Most computer architectures are moving to 64-bit virtual address spaces. We first discuss how this change impacts conventional linear, forward-mapped, and hashed page tables. We then introduce a new page table data structure—clustered page table—that can be viewed as a hashed page table augmented with subblocking. Specifically, it associates mapping information for several pages (e.g., sixteen) with a single virtual tag and next pointer. Simulation results with several workloads show that clustered page tables use less memory than alternatives without adversely affecting page table access time.

Since physical address space use is also increasing, computer architects are using new techniques—such as superpages, complete-subblocking, and partial-subblocking—to increase the memory mapped by a translation lookaside buffer (TLB). Since these techniques are completely ineffective without page table support, we next look at extending conventional and clustered page tables to support them. Simulation results show clustered page tables support medium-sized superpage and subblock TLBs especially well.
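
A clustered page table can be sketched as a hashed table whose entries each cover an aligned block of pages with one virtual tag, one chain pointer, and a small array of per-page mappings. The constants and entry layout below are illustrative, not the paper's exact format.

    # Sketch of a clustered page table: a hashed table whose entries map an
    # aligned block of SUBBLOCK pages with a single virtual tag and a chain
    # pointer, so several pages share one entry. Constants are illustrative.

    SUBBLOCK = 16                              # pages described by one entry
    BUCKETS = 1 << 10

    class Entry:
        def __init__(self, tag):
            self.tag = tag                     # virtual block number
            self.ptes = [None] * SUBBLOCK      # per-page mapping information
            self.next = None                   # collision chain

    table = [None] * BUCKETS

    def lookup(vpn, insert=False):
        block, index = divmod(vpn, SUBBLOCK)
        bucket = hash(block) % BUCKETS
        entry = table[bucket]
        while entry is not None and entry.tag != block:
            entry = entry.next                 # follow the chain on collisions
        if entry is None and insert:
            entry = Entry(block)
            entry.next, table[bucket] = table[bucket], entry
        return entry, index

    entry, i = lookup(0x12345, insert=True)
    entry.ptes[i] = {"pfn": 0x9ABC, "valid": True}
    print(lookup(0x12345)[0].ptes[i])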

1995
Implementing global memory management in a workstation cluster
Advances in network and processor technology have greatly changed the communication and computational power of local-area workstation clusters. However, operating systems still treat workstation clusters as a collection of loosely-connected processors, where each workstation acts as an autonomous and independent agent. This operating system structure makes it difficult to exploit the characteristics of current clusters, such as low-latency communication, huge primary memories, and high-speed processors, in order to improve the performance of cluster applications.

This paper describes the design and implementation of global memory management in a workstation cluster. Our objective is to use a single, unified, but distributed memory management algorithm at the lowest level of the operating system. By managing memory globally at this level, all system- and higher-level software, including VM, file systems, transaction systems, and user applications, can benefit from available cluster memory. We have implemented our algorithm in the OSF/1 operating system running on an ATM-connected cluster of DEC Alpha workstations. Our measurements show that on a suite of memory-intensive programs, our system improves performance by a factor of 1.5 to 3.5. We also show that our algorithm has a performance advantage over others that have been proposed in the past.

1995
CRL: high-performance all-software distributed shared memory
The C Region Library (CRL) is a new all-software distributed shared memory (DSM) system. CRL requires no special compiler, hardware, or operating system support beyond the ability to send and receive messages. It provides a simple, portable, region-based shared address space programming model that is capable of delivering good performance on a wide range of multiprocessor and distributed system architectures. Each region is an arbitrarily sized, contiguous area of memory. The programmer defines regions and delimits accesses to them using annotations.

We have developed CRL implementations for two platforms: the Thinking Machines CM-5, a commercial multicomputer, and the MIT Alewife machine, an experimental multiprocessor offering efficient support for both message passing and shared memory. We present results for up to 128 processors on the CM-5 and up to 32 processors on Alewife. In a set of controlled experiments, we demonstrate that CRL is the first all-software DSM system capable of delivering performance competitive with hardware DSMs. CRL achieves speedups within 15% of those provided by Alewife’s native support for shared memory, even for challenging applications (e.g., Barnes-Hut) and small problem sizes.

Panel
1995
Going Threadbare: Sense or Sedition?
Kernels
1995
On micro-kernel construction
From a software-technology point of view, the µ-kernel concept is superior to large integrated kernels. On the other hand, it is widely believed that (a) µ-kernel based systems are inherently inefficient and (b) they are not sufficiently flexible. In contrast to this belief, we show and support by documentary evidence that inefficiency and inflexibility of current µ-kernels is not inherited from the basic idea but mostly from overloading the kernel and/or from improper implementation.

Based on functional reasons, we describe some concepts which must be implemented by a µ-kernel and illustrate their flexibility. Then, we analyze the performance critical points. We show what performance is achievable, that the efficiency is sufficient with respect to macro-kernels and why some published contradictory measurements are not evident. Furthermore, we describe some implementation techniques and illustrate why µ-kernels are inherently not portable, although they improve portability of the whole system.

1995
Exokernel: an operating system architecture for application-level resource management
Traditional operating systems limit the performance, flexibility, and functionality of applications by fixing the interface and implementation of operating system abstractions such as interprocess communication and virtual memory. The exokernel operating system architecture addresses this problem by providing application-level management of physical resources. In the exokernel architecture, a small kernel securely exports all hardware resources through a low-level interface to untrusted library operating systems. Library operating systems use this interface to implement system objects and policies. This separation of resource protection from management allows application-specific customization of traditional operating system abstractions by extending, specializing, or even replacing libraries.

We have implemented a prototype exokernel operating system. Measurements show that most primitive kernel operations (such as exception handling and protected control transfer) are ten to 100 times faster than in Ultrix, a mature monolithic UNIX operating system. In addition, we demonstrate that an exokernel allows applications to control machine resources in ways not possible in traditional operating systems. For instance, virtual memory and interprocess communication abstractions are implemented entirely within an application-level library. Measurements show that application-level virtual memory and interprocess communication primitives are five to 40 times faster than Ultrix’s kernel primitives. Compared to state-of-the-art implementations from the literature, the prototype exokernel system is at least five times faster on operations such as exception dispatching and interprocess communication.

1995
Extensibility, safety and performance in the SPIN operating system
This paper describes the motivation, architecture and performance of SPIN, an extensible operating system. SPIN provides an extension infrastructure, together with a core set of extensible services, that allow applications to safely change the operating system’s interface and implementation. Extensions allow an application to specialize the underlying operating system in order to achieve a particular level of performance and functionality. SPIN uses language and link-time mechanisms to inexpensively export fine-grained interfaces to operating system services. Extensions are written in a type safe language, and are dynamically linked into the operating system kernel. This approach offers extensions rapid access to system services, while protecting the operating system code executing within the kernel address space. SPIN and its extensions are written in Modula-3 and run on DEC Alpha workstations.
O.S. Performance
1995
The impact of architectural trends on operating system performance
Computer systems are rapidly changing. Over the next few years, we will see wide-scale deployment of dynamically-scheduled processors that can issue multiple instructions every clock cycle, execute instructions out of order, and overlap computation and cache misses. We also expect clock-rates to increase, caches to grow, and multiprocessors to replace uniprocessors. Using SimOS, a complete machine simulation environment, this paper explores the impact of the above architectural trends on operating system performance. We present results based on the execution of large and realistic workloads (program development, transaction processing, and engineering compute-server) running on the IRIX 5.3 operating system from Silicon Graphics Inc.

Looking at uniprocessor trends, we find that disk I/O is the first-order bottleneck for workloads such as program development and transaction processing. Its importance continues to grow over time. Ignoring I/O, we find that the memory system is the key bottleneck, stalling the CPU for over 50% of the execution time. Surprisingly, however, our results show that this stall fraction is unlikely to increase on future machines due to increased cache sizes and new latency hiding techniques in processors. We also find that the benefits of these architectural trends spread broadly across a majority of the important services provided by the operating system. We find the situation to be much worse for multiprocessors. Most operating system services consume 30-70% more time than their uniprocessor counterparts. A large fraction of the stalls are due to coherence misses caused by communication between processors. Because larger caches do not reduce coherence misses, the gap between uniprocessor and multiprocessor performance will increase unless operating system developers focus on kernel restructuring to reduce unnecessary communication. The paper presents a detailed decomposition of execution time (e.g., instruction execution time, memory stall time separately for instructions and data, synchronization time) for important kernel services in the three workloads.

1995
The measured performance of personal computer operating systems
This paper presents a comparative study of the performance of three operating systems that run on the personal computer architecture derived from the IBM-PC. The operating systems, Windows for Workgroups, Windows NT, and NetBSD (a freely available variant of the UNIX operating system), cover a broad range of system functionality and user requirements, from a single address space model to full protection with preemptive multi-tasking. Our measurements were enabled by hardware counters in Intel’s Pentium processor that permit measurement of a broad range of processor events including instruction counts and on-chip cache miss counts. We used both microbenchmarks, which expose specific differences between the systems, and application workloads, which provide an indication of expected end-to-end performance. Our microbenchmark results show that accessing system functionality is often more expensive in Windows for Workgroups than in the other two systems due to frequent changes in machine mode and the use of system call hooks. When running native applications, Windows NT is more efficient than Windows, but it incurs overhead similar to that of a microkernel since its application interface (the Win32 API) is implemented as a user-level server. Overall, system functionality can be accessed most efficiently in NetBSD; we attribute this to its monolithic structure, and to the absence of the complications created by hardware backwards compatibility requirements in the other systems. Measurements of application performance show that although the impact of these differences is significant in terms of instruction counts and other hardware events (often a factor of 2 to 7 difference between the systems), overall performance is sometimes determined by the functionality provided by specific subsystems, such as the graphics subsystem or the file system buffer cache.
1995
Optimistic incremental specialization: streamlining a commercial operating system
Conventional operating system code is written to deal with all possible system states, and performs considerable interpretation to determine the current system state before taking action. A consequence of this approach is that kernel calls which perform little actual work take a long time to execute. To address this problem, we use specialized operating system code that reduces interpretation for common cases, but still behaves correctly in the fully general case. We describe how specialized operating system code can be generated and bound incrementally as the information on which it depends becomes available. We extend our specialization techniques to include the notion of optimistic incremental specialization: a technique for generating specialized kernel code optimistically for system states that are likely to occur, but not certain. The ideas outlined in this paper allow the conventional kernel design tenet of “optimizing for the common case” to be extended to the domain of adaptive operating systems. We also show that aggressive use of specialization can produce in-kernel implementations of operating system functionality with performance comparable to user-level implementations.

We demonstrate that these ideas are applicable in real-world operating systems by describing a re-implementation of the HP-UX file system. Our specialized read system call reduces the cost of a single byte read by a factor of 3, and an 8 KB read by 26%, while preserving the semantics of the HP-UX read call. By relaxing the semantics of HP-UX read we were able to cut the cost of a single byte read system call by more than an order of magnitude.


14th SOSP — December 5-8, 1993 — Asheville, North Carolina, USA
Invited Talk
1993
Operating Systems in a Changing World
File Systems
1993
Extensible file systems in spring
In this paper we describe an architecture for extensible file systems. The architecture enables the extension of file system functionality by composing (or stacking) new file systems on top of existing file systems. A file system that is stacked on top of an existing file system can access the existing file system's files via a well-defined naming interface and can share the same underlying file data in a coherent manner. We describe extending file systems in the context of the Spring operating system. Composing file systems in Spring is facilitated by basic Spring features such as its virtual memory architecture, its strongly-typed well-defined interfaces, its location-independent object invocation mechanism, and its flexible naming architecture. File systems in Spring can reside in the kernel, in user-mode, or on remote machines, and composing them can be done in a very flexible manner.
1993
The logical disk: a new approach to improving file systems
The Logical Disk (LD) defines a new interface to disk storage that separates file management and disk management by using logical block numbers and block lists. The LD interface is designed to support multiple file systems and to allow multiple implementations, both of which are important given the increasing use of kernels that support multiple operating system personalities. A log-structured implementation of LD (LLD) demonstrates that LD can be implemented efficiently. LLD adds about 5% to 10% to the purchase cost of a disk for the main memory it requires. Combining LLD with an existing file system results in a log-structured file system that exhibits the same performance characteristics as the Sprite log-structured file system.
1993
The Zebra striped network file system
Zebra is a network file system that increases throughput by striping file data across multiple servers. Rather than striping each file separately, Zebra forms all the new data from each client into a single stream, which it then stripes using an approach similar to a log-structured file system. This provides high performance for writes of small files as well as for reads and writes of large files. Zebra also writes parity information in each stripe in the style of RAID disk arrays; this increases storage costs slightly but allows the system to continue operation even while a single storage server is unavailable. A prototype implementation of Zebra, built in the Sprite operating system, provides 4-5 times the throughput of the standard Sprite file system or NFS for large files and a 20%-3x improvement for writing small files.
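
Striping a client's log with parity can be sketched as follows: a log segment is split into one fragment per data server plus an XOR parity fragment, so any single missing fragment can be rebuilt from the survivors. Fragment sizing and layout here are illustrative.

    # Sketch of log striping with parity: a client's log segment is split into
    # fragments for the data servers plus one XOR parity fragment, so the
    # segment survives the loss of any single server.

    import math

    def xor_bytes(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    def stripe(segment, data_servers=3):
        frag_size = math.ceil(len(segment) / data_servers)
        frags = [segment[i*frag_size:(i+1)*frag_size].ljust(frag_size, b"\0")
                 for i in range(data_servers)]
        parity = frags[0]
        for f in frags[1:]:
            parity = xor_bytes(parity, f)
        return frags + [parity]                # one fragment per server

    def rebuild(frags, lost):
        survivors = [f for i, f in enumerate(frags) if i != lost]
        out = survivors[0]
        for f in survivors[1:]:
            out = xor_bytes(out, f)
        return out                             # equals the lost fragment

    frags = stripe(b"new file data heading to the log")
    assert rebuild(frags, lost=1) == frags[1]  # recover after one server is down
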
Models of Distributed Systems
1993
Understanding the limitations of causally and totally ordered communication
Causally and totally ordered communication support (CATOCS) has been proposed as important to provide as part of the basic building blocks for constructing reliable distributed systems. In this paper, we identify four major limitations to CATOCS, investigate the applicability of CATOCS to several classes of distributed applications in light of these limitations, and assess the potential impact of these facilities on communication scalability and robustness. From this investigation, we find limited merit and several potential problems in using CATOCS. The fundamental difficulty with CATOCS is that it attempts to solve problems at the communication level in violation of the well-known "end-to-end" argument.
1993
The Information Bus®: an architecture for extensible distributed systems
Research can rarely be performed on large-scale, distributed systems at the level of thousands of workstations. In this paper, we describe the motivating constraints, design principles, and architecture for an extensible, distributed system operating in such an environment. The constraints include continuous operation, dynamic system evolution, and integration with extant systems. The Information Bus, our solution, is a novel synthesis of four design principles: core communication protocols have minimal semantics, objects are self-describing, types can be dynamically defined, and communication is anonymous. The current implementation provides both flexibility and high performance, and has been proven in several commercial environments, including integrated circuit fabrication plants and brokerage/trading floors.
System Structure
1993
Subcontract: a flexible base for distributed programming
A key problem in operating systems is permitting the orderly introduction of new properties and new implementation techniques. We describe a mechanism, subcontract, that within the context of an object-oriented distributed system permits application programmers control over fundamental object mechanisms. This allows programmers to define new object communication mechanisms without modifying the base system. We describe how new subcontracts can be introduced as alternative communication mechanisms in the place of existing subcontracts. We also briefly describe some of the uses we have made of the subcontract mechanism to support caching, crash recovery, and replication.
1993
Interposition agents: transparently interposing user code at the system interface
Many contemporary operating systems utilize a system call interface between the operating system and its clients. Increasing numbers of systems are providing low-level mechanisms for intercepting and handling system calls in user code. Nonetheless, they typically provide no higher-level tools or abstractions for effectively utilizing these mechanisms. Using them has typically required reimplementation of a substantial portion of the system interface from scratch, making the use of such facilities unwieldy at best. This paper presents a toolkit that substantially increases the ease of interposing user code between clients and instances of the system interface by allowing such code to be written in terms of the high-level objects provided by this interface, rather than in terms of the intercepted system calls themselves. This toolkit helps enable new interposition agents to be written, many of which would not otherwise have been attempted. This toolkit has also been used to construct several agents including: system call tracing tools, file reference tracing tools, and customizable filesystem views. Examples of other agents that could be built include: protected environments for running untrusted binaries, logical devices implemented entirely in user space, transparent data compression and/or encryption agents, transactional software environments, and emulators for other operating system environments.
1993
Using threads in interactive systems: a case study
We describe the results of examining two large research and commercial systems for the ways that they use threads. We used three methods: analysis of macroscopic thread statistics, analysis of the microsecond spacing between thread events, and reading the implementation code. We identify ten different paradigms of thread usage: defer work, general pumps, slack processes, sleepers, one-shots, deadlock avoidance, rejuvenation, serializers, encapsulated fork and exploiting parallelism. While some, like defer work, are well known, others have not been previously described. Most of the paradigms cause few problems for programmers and help keep the resulting system implementation understandable. The slack process paradigm is both particularly effective in improving system performance and particularly difficult to make work well. We observe that thread priorities are difficult to use and may interfere in unanticipated ways with other thread primitives and paradigms. Finally, we glean from the practices in this code several possible future research topics in the area of thread abstractions.
Performance
1993
Protection traps and alternatives for memory management of an object-oriented language
Many operating systems allow user programs to specify the protection level (inaccessible, read-only, read-write) of pages in their virtual memory address space, and to handle any protection violations that may occur. Such page-protection techniques have been exploited by several user-level algorithms for applications including generational garbage collection and persistent stores. Unfortunately, modern hardware has made efficient handling of page protection faults more difficult. Moreover, page-sized granularity may not match the natural granularity of a given application. In light of these problems, we reevaluate the usefulness of page-protection primitives in such applications, by comparing the performance of implementations that make use of the primitives with others that do not. Our results show that for certain applications software solutions outperform solutions that rely on page-protection or other related virtual memory primitives.
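As a point of reference for the page-protection approach the paper evaluates, the sketch below (not code from the paper; plain POSIX mmap/mprotect/sigaction) write-protects a page and records the first write to it in a SIGSEGV handler, the way a user-level write barrier for generational garbage collection or a persistent store might.

/* A minimal page-protection "write barrier": protect a page read-only and
 * record the first write to it in a SIGSEGV handler.  Illustrative only. */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char *page;                      /* the protected page */
static size_t page_size;
static volatile sig_atomic_t dirty;     /* set when the page is first written */

static void on_fault(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)ctx;
    /* Only handle faults on our page; anything else is a real crash. */
    if ((char *)info->si_addr < page || (char *)info->si_addr >= page + page_size)
        abort();
    dirty = 1;
    mprotect(page, page_size, PROT_READ | PROT_WRITE);  /* let the write proceed */
}

int main(void)
{
    page_size = (size_t)sysconf(_SC_PAGESIZE);
    page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return 1;

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    mprotect(page, page_size, PROT_READ);    /* arm the barrier */
    page[0] = 'x';                           /* faults; handler records the write */
    printf("dirty = %d\n", dirty);           /* prints: dirty = 1 */
    return 0;
}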
1993
The impact of operating system structure on memory system performance
In this paper we evaluate the memory system behavior of two distinctly different implementations of the UNIX operating system: DEC's Ultrix, a monolithic system, and Mach 3.0 with CMU's UNIX server, a microkernel-based system. In our evaluation we use combined system and user memory reference traces of thirteen industry-standard workloads. We show that the microkernel-based system executes substantially more non-idle system instructions for an equivalent workload than the monolithic system. Furthermore, the average instruction for programs running on Mach has a higher cost, in terms of memory cycles per instruction, than on Ultrix. In the context of our traces, we explore a number of popular assertions about the memory system behavior of modern operating systems, paying special attention to the effect that Mach's microkernel architecture has on system performance. Our results indicate that many, but not all of the assertions are true, and that a few, while true, have only negligible impact on real system performance.
1993
Performance assertion checking
Performance assertion checking is an approach to automating the testing of performance properties of complex systems. System designers write assertions that capture expectations for performance; these assertions are checked automatically against monitoring data to detect potential performance bugs. Automatically checking expectations allows a designer to test a wide range of performance properties as a system evolves: data that meets expectations can be discarded automatically, focusing attention on data indicating potential problems. PSpec is a language for writing performance assertions together with tools for testing assertions and estimating values for constants in assertions. The language is small and efficiently checkable, yet capable of expressing a wide variety of performance properties. Initial experience indicates that PSpec is a useful tool for performance testing and debugging; it helped uncover several performance bugs in the runtime system of a parallel programming language.
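PSpec itself is a dedicated assertion language and tool set; as a rough illustration of the underlying idea only, the hypothetical checker below scans monitoring records and reports just the operations that violate an expected latency bound, silently discarding data that meets expectations.

/* Hypothetical checker in the spirit of performance assertions: flag
 * operations whose latency exceeds an expected bound.  This is not the
 * PSpec language or its tools. */
#include <stdio.h>
#include <string.h>

struct record { const char *op; double latency_ms; };

/* Assertion: every occurrence of `op` completes within bound_ms. */
static int check_latency_bound(const struct record *log, int n,
                               const char *op, double bound_ms)
{
    int violations = 0;
    for (int i = 0; i < n; i++) {
        if (strcmp(log[i].op, op) == 0 && log[i].latency_ms > bound_ms) {
            printf("assertion failed: %s took %.2f ms (bound %.2f ms)\n",
                   log[i].op, log[i].latency_ms, bound_ms);
            violations++;
        }
    }
    return violations;
}

int main(void)
{
    struct record log[] = {
        { "rpc_call", 1.8 }, { "rpc_call", 7.4 }, { "disk_read", 11.0 },
    };
    int bad = check_latency_bound(log, 3, "rpc_call", 5.0);
    printf("%d violation(s)\n", bad);   /* data meeting the assertion is ignored */
    return 0;
}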
Persistent Storage
1993
Lightweight recoverable virtual memory
Recoverable virtual memory refers to regions of a virtual address space on which transactional guarantees are offered. This paper describes RVM, an efficient, portable, and easily used implementation of recoverable virtual memory for Unix environments. A unique characteristic of RVM is that it allows independent control over the transactional properties of atomicity, permanence, and serializability. This leads to considerable flexibility in the use of RVM, potentially enlarging the range of applications that can benefit from transactions. It also simplifies the layering of functionality such as nesting and distribution. The paper shows that RVM performs well over its intended range of usage even though it does not benefit from specialized operating system support. It also demonstrates the importance of intra- and intertransaction optimizations.
1993
Concurrent compacting garbage collection of a persistent heap
We describe a replicating garbage collector for a persistent heap. The garbage collector cooperates with a transaction manager to provide safe and efficient transactional storage management. Clients read and write the heap in primary memory and can commit or abort their write operations. When write operations are committed they are preserved in stable storage and survive system failures. Clients can freely access the heap during garbage collection because the collector concurrently builds a compact replica of the heap. A log captures client write operations and is used to support both the transaction manager and the replicating garbage collector. Our implementation is the first to provide concurrent and compacting garbage collection of a persistent heap. Measurements show that concurrent replicating collection produces significantly shorter pause times than stop-and-copy collection. For small transactions, throughput is limited by the logging bandwidth of the underlying log manager. The results suggest that replicating garbage collection offers a flexible and efficient way to provide automatic storage management in transaction systems, object-oriented databases and persistent programming environments.
Communication I
1993
Improving IPC by kernel design
Inter-process communication (ipc) has to be fast and effective, otherwise programmers will not use remote procedure calls (RPC), multithreading and multitasking adequately. Thus ipc performance is vital for modern operating systems, especially μ-kernel based ones. Surprisingly, most μ-kernels exhibit poor ipc performance, typically requiring 100 μs for a short message transfer on a modern processor running with a 50 MHz clock rate. In contrast, we achieve 5 μs, a twentyfold improvement. This paper describes the methods and principles used, starting from the architectural design and going down to the coding level. There is no single trick to obtaining this high performance; rather, a synergetic approach in design and implementation on all levels is needed. The methods and their synergy are illustrated by applying them to a concrete example, the L3 μ-kernel (an industrial-quality operating system in daily use at several hundred sites). The main ideas are to guide the complete kernel design by the ipc requirements, and to make heavy use of the concept of virtual address space inside the μ-kernel itself. As the L3 experiment shows, significant performance gains are possible: compared with Mach, they range from a factor of 22 (8-byte messages) to 3 (4-Kbyte messages). Although hardware-specific details influence both the design and implementation, these techniques are applicable to the whole class of conventional general-purpose von Neumann processors supporting virtual addresses. Furthermore, the effort required is reasonably small; for example, the dedicated parts of the μ-kernel can be concentrated in a single medium-sized module.
1993
Fbufs: a high-bandwidth cross-domain transfer facility
We have designed and implemented a new operating system facility for I/O buffer management and data transfer across protection domain boundaries on shared memory machines. This facility, called fast buffers (fbufs), combines virtual page remapping with shared virtual memory, and exploits locality in I/O traffic to achieve high throughput without compromising protection, security, or modularity. The goal is to help deliver the high bandwidth afforded by emerging high-speed networks to user-level processes, both in monolithic and microkernel-based operating systems. This paper outlines the requirements for a cross-domain transfer facility, describes the design of the fbuf mechanism that meets these requirements, and experimentally quantifies the impact of fbufs on network performance.
1993
Efficient software-based fault isolation
One way to provide fault isolation among cooperating software modules is to place each in its own address space. However, for tightly-coupled modules, this solution incurs prohibitive context switch overhead. In this paper, we present a software approach to implementing fault isolation within a single address space. Our approach has two parts. First, we load the code and data for a distrusted module into its own fault domain, a logically separate portion of the application's address space. Second, we modify the object code of a distrusted module to prevent it from writing or jumping to an address outside its fault domain. Both these software operations are portable and programming language independent. Our approach poses a tradeoff relative to hardware fault isolation: substantially faster communication between fault domains, at a cost of slightly increased execution time for distrusted modules. We demonstrate that for frequently communicating modules, implementing fault isolation in software rather than hardware can substantially improve end-to-end application performance.
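The paper's transformation operates on object code and uses dedicated registers; the source-level sketch below only imitates the central address-sandboxing step, forcing every store address into the module's fault domain by overwriting its upper bits with the domain's segment identifier. The segment layout and sizes here are assumptions for illustration, not the paper's parameters.

/* Source-level imitation of software fault isolation's address sandboxing.
 * Real SFI rewrites machine instructions; this only shows the masking idea. */
#include <stdint.h>
#include <stdio.h>

#define SEG_BITS  20u                               /* 1 MiB fault domain (assumed) */
#define SEG_MASK  ((UINT32_C(1) << SEG_BITS) - 1)

/* Force any address into the fault domain identified by seg_id:
 * keep the low offset bits, overwrite the high bits with the segment id. */
static inline uint32_t sandbox(uint32_t addr, uint32_t seg_id)
{
    return (seg_id << SEG_BITS) | (addr & SEG_MASK);
}

/* A "distrusted" store: the write may land anywhere inside its own fault
 * domain, but never outside it, so other domains in the same address space
 * are protected without a context switch. */
static void sandboxed_store(uint8_t *base, uint32_t seg_id,
                            uint32_t addr, uint8_t value)
{
    base[sandbox(addr, seg_id)] = value;
}

int main(void)
{
    static uint8_t memory[1u << 22];                /* pretend address space: 4 domains */
    sandboxed_store(memory, 2, 0xDEADBEEF, 42);     /* wild address is clamped into domain 2 */
    printf("wrote into segment %u\n", 2u);
    return 0;
}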
Communication II
1993
Network objects
A network object is an object whose methods can be invoked over a network. This paper describes the design, implementation, and early experience with a network objects system for Modula-3. The system is novel for its overall simplicity. The paper includes a thorough description of realistic marshaling algorithms for network objects.
1993
Handling audio and video streams in a distributed environment
Handling audio and video in a digital environment requires timely delivery of data. This paper describes the principles adopted in the design of the Pandora networked multi-media system. They attempt to give the user the best possible service while dealing with error and overload conditions. Pandora uses a sub-system to handle the multi-media peripherals. It uses transputers and associated Occam code to implement the time critical functions. Stream implementation is based on self-contained segments of data containing information for delivery, synchronisation and error recovery. Decoupling buffers are used to allow concurrent operation of multiple processing elements. Clawback buffers are used to resynchronise streams at their destinations with minimum latency. The system has proved robust in normal use, under overload, and in the presence of errors. It has been in use for a number of years. The principles involved in this design are now being used in the development of two complementary systems. One approach explodes Pandora by having the camera, microphone, speaker and display as independent units linked only by the LAN. The other approach integrates these devices as peripheral cards in a powerful workstation.
1993
Protocol service decomposition for high-performance networking
In this paper we describe a new approach to implementing network protocols that enables them to have high performance and high flexibility, while retaining complete conformity to existing application programming interfaces. The key insight behind our work is that an application's interface to the network is distinct and separable from its interface to the operating system. We have separated these interfaces for two protocol implementations, TCP/IP and UDP/IP, running on the Mach 3.0 operating system and UNIX server. Specifically, library code in the application's address space implements the network protocols and transfers data to and from the network, while an operating system server manages the heavyweight abstractions that applications use when manipulating the network through operations other than send and receive. On DECstation 5000/200 systems connected by 10Mb/sec Ethernet, this approach to protocol decomposition achieves TCP/IP throughput of 1088 KB/second, which is comparable to that of a high-quality in-kernel TCP/IP implementation, and substantially better than a server-based one. Our approach achieves small-packet UDP/IP round trip latencies of 1.23 ms, again comparable to a kernel-based implementation and more than twice as fast as a server-based one.
Systems Issues
1993
Authentication in the Taos operating system
We describe a design and implementation of security for a distributed system. In our system, applications access security services through a narrow interface. This interface provides a notion of identity that includes simple principals, groups, roles, and delegations. A new operating system component manages principals, credentials, and secure channels. It checks credentials according to the formal rules of a logic of authentication. Our implementation is efficient enough to support a substantial user community.
1993
Providing location information in a ubiquitous computing environment
To take full advantage of the promise of ubiquitous computing requires the use of location information, yet people should have control over who may know their whereabouts. We present an architecture that achieves these goals for an interesting set of applications. Personal information is managed by User Agents, and a partially decentralized Location Query Service is used to facilitate location-based operations. This architecture gives users primary control over their location information, at the cost of making more expensive certain queries, such as those wherein location and identity closely interact. We also discuss various extensions to our architecture that offer users additional trade-offs between privacy and efficiency. Finally, we report some measurements of the unextended system in operation, focusing on how well the system is actually able to track people. Our system uses two kinds of location information, which turn out to provide partial and complementary coverage.

13th SOSP — October 13-16, 1991 — Pacific Grove, California, USA
File Systems I
1991
The design and implementation of a log-structured file system
This paper presents a new technique for disk storage management called a log-structured file system. A log-structured file system writes all modifications to disk sequentially in a log-like structure, thereby speeding up both file writing and crash recovery. The log is the only structure on disk; it contains indexing information so that files can be read back from the log efficiently. In order to maintain large free areas on disk for fast writing, we divide the log into segments and use a segment cleaner to compress the live information from heavily fragmented segments. We present a series of simulations that demonstrate the efficiency of a simple cleaning policy based on cost and benefit. We have implemented a prototype log-structured file system called Sprite LFS; it outperforms current Unix file systems by an order of magnitude for small-file writes while matching or exceeding Unix performance for reads and large writes. Even when the overhead for cleaning is included, Sprite LFS can use 70% of the disk bandwidth for writing, whereas Unix file systems typically can use only 5-10%.
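The cleaning policy can be summarized in a few lines. The sketch below encodes the commonly cited form of Sprite LFS's cost-benefit ratio, (1 - u) * age / (1 + u), where u is the segment's live-data utilization; the surrounding data structures and selection loop are illustrative rather than code from the system.

/* Sketch of LFS-style cost-benefit segment selection. */
#include <stdio.h>

struct segment {
    double utilization;   /* u: fraction of the segment still live */
    double age;           /* age of the youngest live data, e.g. in hours */
};

/* Benefit of cleaning: free space reclaimed (1 - u), weighted by how long
 * that data is likely to stay stable (age).  Cost: read the whole segment
 * (1) and write back the live fraction (u). */
static double cost_benefit(const struct segment *s)
{
    return (1.0 - s->utilization) * s->age / (1.0 + s->utilization);
}

static int pick_segment_to_clean(const struct segment *segs, int n)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (cost_benefit(&segs[i]) > cost_benefit(&segs[best]))
            best = i;
    return best;
}

int main(void)
{
    struct segment segs[] = {
        { 0.90, 100.0 },   /* nearly full, but cold and stable */
        { 0.30,   1.0 },   /* mostly empty, but hot */
    };
    /* Prefers the cold segment: cleaning cold data at high utilization
     * pays off because it stays clean longer. */
    printf("clean segment %d first\n", pick_segment_to_clean(segs, 2));
    return 0;
}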
1991
Semantic file systems
A semantic file system is an information storage system that provides flexible associative access to the system's contents by automatically extracting attributes from files with file type specific transducers. Associative access is provided by a conservative extension to existing tree-structured file system protocols, and by protocols that are designed specifically for content based access. Compatibility with existing file system protocols is provided by introducing the concept of a virtual directory. Virtual directory names are interpreted as queries, and thus provide flexible associative access to files and directories in a manner compatible with existing software. Rapid attribute-based access to file system contents is implemented by automatic extraction and indexing of key properties of file system objects. The automatic indexing of files and directories is called "semantic" because user programmable transducers use information about the semantics of updated file system objects to extract the properties for indexing. Experimental results from a semantic file system implementation support the thesis that semantic file systems present a more effective storage abstraction than do traditional tree structured file systems for information sharing and command level programming.
Shared Memory Multiprocessing
1991
The implications of cache affinity on processor scheduling for multiprogrammed, shared memory multiprocessors
In a shared memory multiprocessor with caches, executing tasks develop "affinity" to processors by filling their caches with data and instructions during execution. A scheduling policy that ignores this affinity may waste processing power by causing excessive cache refilling. Our work focuses on quantifying the effect of processor reallocation on the performance of various parallel applications multiprogrammed on a shared memory multiprocessor, and on evaluating how the magnitude of this cost affects the choice of scheduling policy. We first identify the components of application response time, including processor reallocation costs. Next, we measure the impact of reallocation on the cache behavior of several parallel applications executing on a Sequent Symmetry multiprocessor. We also measure the performance of these applications under a number of alternative allocation policies. These experiments lead us to conclude that on current machines processor affinity has only a very weak influence on the choice of scheduling discipline, and that the benefits of frequent processor reallocation (in response to the changing parallelism of jobs) outweigh the penalties imposed by such reallocation. Finally, we use this experimental data to parameterize a simple analytic model, allowing us to evaluate the effect of processor affinity on future machines, those containing faster processors and larger caches.
1991
Empirical studies of competitive spinning for a shared-memory multiprocessor
A common operation in multiprocessor programs is acquiring a lock to protect access to shared data. Typically, the requesting thread is blocked if the lock it needs is held by another thread. The cost of blocking one thread and activating another can be a substantial part of program execution time. Alternatively, the thread could spin until the lock is free, or spin for a while and then block. This may avoid context-switch overhead, but processor cycles may be wasted in unproductive spinning. This paper studies seven strategies for determining whether and how long to spin before blocking. Of particular interest are competitive strategies, for which the performance can be shown to be no worse than some constant factor times an optimal off-line strategy. The performance of five competitive strategies is compared with that of always blocking, always spinning, or using the optimal off-line algorithm. Measurements of lock-waiting time distributions for five parallel programs were used to compare the cost of synchronization under all the strategies. Additional measurements of elapsed time for some of the programs and strategies allowed assessment of the impact of synchronization strategy on overall program performance. Both types of measurements indicate that the standard blocking strategy performs poorly compared to mixed strategies. Among the mixed strategies studied, adaptive algorithms perform better than non-adaptive ones.
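A representative mixed strategy spins for roughly the cost of one block/resume before giving up the processor, which is known to keep its waiting cost within about a factor of two of an optimal off-line strategy. The sketch below (POSIX threads, with an assumed, machine-specific spin limit) shows the shape of such a strategy; it is an illustration, not the paper's measured implementation.

/* Spin-then-block lock acquisition: spin with trylock for roughly the cost
 * of one context switch, then fall back to a blocking lock. */
#include <pthread.h>
#include <stdio.h>

#define SPIN_LIMIT 2000   /* spin iterations ~ one block/resume; calibrated offline */

static void spin_then_block_lock(pthread_mutex_t *m)
{
    for (int i = 0; i < SPIN_LIMIT; i++) {
        if (pthread_mutex_trylock(m) == 0)
            return;                    /* acquired while spinning: no context switch */
    }
    pthread_mutex_lock(m);             /* give up the processor and block */
}

/* Example: two threads contending for a shared counter. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        spin_then_block_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);    /* 200000 */
    return 0;
}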
Multimedia Systems
1991
A high performance multi-structured file system design
File system I/O is increasingly becoming a performance bottleneck in large distributed computer systems. This is due to the increased file I/O demands of new applications, the inability of any single storage structure to respond to these demands, and the slow decline of disk access times (latency and seek) relative to the rapid increase in CPU speeds, memory size, and network bandwidth. We present a multi-structured file system designed for high bandwidth I/O and fast response. Our design is based on combining disk caching with three different file storage structures, each implemented on an independent and isolated disk array. Each storage structure is designed to optimize a different set of file system access characteristics such as cache writes, directory searches, file attribute requests or large sequential reads/writes. As part of our study, we analyze the performance of an existing file system using trace data from UNIX disk I/O-intensive workloads. Using trace driven simulations, we show how performance is improved by using separate storage structures as implemented by a multi-structured file system.
1991
Scheduling and IPC mechanisms for continuous media
Next-generation workstations will have hardware support for digital "continuous media" (CM) such as audio and video. CM applications handle data at high rates, with strict timing requirements, and often in small "chunks". If such applications are to run efficiently and predictably as user-level programs, an operating system must provide scheduling and IPC mechanisms that reflect these needs. We propose two such mechanisms: split-level CPU scheduling of lightweight processes in multiple address spaces, and memory-mapped streams for data movement between address spaces. These techniques reduce the number of user/kernel interactions (system calls, signals, and preemptions). Compared with existing mechanisms, they can reduce scheduling and I/O overhead by a factor of 4 to 6.
1991
Designing file systems for digital video and audio
We address the unique requirements of a multimedia file system such as continuous storage and retrieval of media, maintenance of synchronization between multiple media streams, and efficient manipulation of huge media objects. We present a model that relates disk and device characteristics to the recording rate, and derive storage granularity and scattering parameters that guarantee continuous access. In order for the file system to support multiple concurrent requests, we develop admission control algorithms for determining whether a new request can be accepted without violating the realtime constraints of any of the requests. We define a strand as an immutable sequence of continuously recorded media samples, and then present a multimedia rope abstraction which is a collection of individual media strands tied together by synchronization information. We devise operations for efficient manipulation of multi-stranded ropes, and develop an algorithm for maintaining the scattering parameter during editing so as to guarantee continuous playback of edited ropes. We have implemented a prototype multimedia file system, which serves as a testbed for experimenting with policies and algorithms for multimedia storage. We present our initial experiences with using the file system.
Panel
1991
Increasing the Effectiveness of Operating System Research
Kernel Structure
1991
Scheduler activations: effective kernel support for the user-level management of parallelism
Threads are the vehicle for concurrency in many approaches to parallel programming. Threads separate the notion of a sequential execution stream from the other aspects of traditional UNIX-like processes, such as address spaces and I/O descriptors. The objective of this separation is to make the expression and control of parallelism sufficiently cheap that the programmer or compiler can exploit even fine-grained parallelism with acceptable overhead. Threads can be supported either by the operating system kernel or by user-level library code in the application address space, but neither approach has been fully satisfactory. This paper addresses this dilemma. First, we argue that the performance of kernel threads is inherently worse than that of user-level threads, rather than this being an artifact of existing implementations; we thus argue that managing parallelism at the user level is essential to high-performance parallel computing. Next, we argue that the lack of system integration exhibited by user-level threads is a consequence of the lack of kernel support for user-level threads provided by contemporary multiprocessor operating systems; we thus argue that kernel threads or processes, as currently conceived, are the wrong abstraction on which to support user-level management of parallelism. Finally, we describe the design, implementation, and performance of a new kernel interface and user-level thread package that together provide the same functionality as kernel threads without compromising the performance and flexibility advantages of user-level management of parallelism.
1991
First-class user-level threads
It is often desirable, for reasons of clarity, portability, and efficiency, to write parallel programs in which the number of processes is independent of the number of available processors. Several modern operating systems support more than one process in an address space, but the overhead of creating and synchronizing kernel processes can be high. Many runtime environments implement lightweight processes (threads) in user space, but this approach usually results in second-class status for threads, making it difficult or impossible to perform scheduling operations at appropriate times (e.g. when the current thread blocks in the kernel). In addition, a lack of common assumptions may also make it difficult for parallel programs or library routines that use dissimilar thread packages to communicate with each other, or to synchronize access to shared data. We describe a set of kernel mechanisms and conventions designed to accord first-class status to user-level threads, allowing them to be used in any reasonable way that traditional kernel-provided processes can be used, while leaving the details of their implementation to user-level code. The key features of our approach are (1) shared memory for asynchronous communication between the kernel and the user, (2) software interrupts for events that might require action on the part of a user-level scheduler, and (3) a scheduler interface convention that facilitates interactions in user space between dissimilar kinds of threads. We have incorporated these mechanisms in the Psyche parallel operating system, and have used them to implement several different kinds of user-level threads. We argue for our approach in terms of both flexibility and performance.
1991
Using continuations to implement thread management and communication in operating systems
We have improved the performance of the Mach 3.0 operating system by redesigning its internal thread and interprocess communication facilities to use continuations as the basis for control transfer. Compared to previous versions of Mach 3.0, our new system consumes 85% less space per thread. Cross-address space remote procedure calls execute 14% faster. Exception handling runs over 60% faster. In addition to improving system performance, we have used continuations to generalize many control transfer optimizations that are common to operating systems, and have recast those optimizations in terms of a single implementation methodology. This paper describes our experiences with using continuations in the Mach operating system.
Distributed Memory Multiprocessing
1991
The robustness of NUMA memory management
The study of operating systems level memory management policies for nonuniform memory access time (NUMA) shared memory multiprocessors is an area of active research. Previous results have suggested that the best policy choice often depends on the application under consideration, while others have reported that the best policy depends on the particular architecture. Since both observations have merit, we explore the concept of policy tuning on an application/architecture basis. We introduce a highly tunable dynamic page placement policy for NUMA multiprocessors, and address issues related to the tuning of that policy to different architectures and applications. Experimental data acquired from our DUnX operating system running on two different NUMA multiprocessors are used to evaluate the usefulness, importance, and ease of policy tuning. Our results indicate that while varying some of the parameters can have dramatic effects on performance, it is easy to select a set of default parameter settings that result in good performance for each of our test applications on both architectures. This apparent robustness of our parameterized policy raises the possibility of machine-independent memory management for NUMA-class machines.
1991
Implementation and performance of Munin
Munin is a distributed shared memory (DSM) system that allows shared memory parallel programs to be executed efficiently on distributed memory multiprocessors. Munin is unique among existing DSM systems in its use of multiple consistency protocols and in its use of release consistency. In Munin, shared program variables are annotated with their expected access pattern, and these annotations are then used by the runtime system to choose a consistency protocol best suited to that access pattern. Release consistency allows Munin to mask network latency and reduce the number of messages required to keep memory consistent. Munin's multiprotocol release consistency is implemented in software using a delayed update queue that buffers and merges pending outgoing writes. A sixteen-processor prototype of Munin is currently operational. We evaluate its implementation and describe the execution of two Munin programs that achieve performance within ten percent of message passing implementations of the same programs. Munin achieves this level of performance with only minor annotations to the shared memory programs.
Networking and Authentication
1991
Authentication in distributed systems: theory and practice
We describe a theory of authentication and a system that implements it. Our theory is based on the notion of principal and a "speaks for" relation between principals. A simple principal either has a name or is a communication channel; a compound principal can express an adopted role or delegation of authority. The theory explains how to reason about a principal's authority by deducing the other principals that it can speak for; authenticating a channel is one important application. We use the theory to explain many existing and proposed mechanisms for security. In particular, we describe the system we have built. It passes principals efficiently as arguments or results of remote procedure calls, and it handles public and shared key encryption, name lookup in a large name space, groups of principals, loading programs, delegation, access control, and revocation.
1991
Automatic reconfiguration in Autonet
Autonet is a switch-based local area network using 100 Mbit/s full-duplex point-to-point links. Crossbar switches are interconnected to other switches and to host controllers in an arbitrary pattern. Switch hardware uses the destination address in each packet to determine the proper outgoing link for the next step in the path from source to destination. Autonet automatically recalculates these forwarding paths in response to failures and additions of network components. This automatic reconfiguration allows the network to continue normal operation without need of human intervention. Reconfiguration occurs quickly enough that higher-level protocols are not disrupted. This paper describes the fault monitoring and topology acquisition mechanisms that are central to automatic reconfiguration in Autonet.
File Systems II
1991
Measurements of a distributed file system
We analyzed the user-level file access patterns and caching behavior of the Sprite distributed file system. The first part of our analysis repeated a study done in 1985 of the BSD UNIX file system. We found that file throughput has increased by a factor of 20 to an average of 8 Kbytes per second per active user over 10-minute intervals, and that the use of process migration for load sharing increased burst rates by another factor of six. Also, many more very large (multi-megabyte) files are in use today than in 1985. The second part of our analysis measured the behavior of Sprite's main-memory file caches. Client-level caches average about 7 Mbytes in size (about one-quarter to one-third of main memory) and filter out about 50% of the traffic between clients and servers. 35% of the remaining server traffic is caused by paging, even on workstations with large memories. We found that client cache consistency is needed to prevent stale data errors, but that it is not invoked often enough to degrade overall system performance.
1991
Disconnected operation in the Coda file system
Disconnected operation is a mode of operation that enables a client to continue accessing critical data during temporary failures of a shared data repository. An important, though not exclusive, application of disconnected operation is in supporting portable computers. In this paper, we show that disconnected operation is feasible, efficient and usable by describing its design and implementation in the Coda File System. The central idea behind our work is that caching of data, now widely used for performance, can also be exploited to improve availability.
Reliability in Distributed Systems
1991
Replication in the Harp file system
This paper describes the design and implementation of the Harp file system. Harp is a replicated Unix file system accessible via the VFS interface. It provides highly available and reliable storage for files and guarantees that file operations are executed atomically in spite of concurrency and failures. It uses a novel variation of the primary copy replication technique that provides good performance because it allows us to trade disk accesses for network communication. Harp is intended to be used within a file service in a distributed network; in our current implementation, it is accessed via NFS. Preliminary performance results indicate that Harp provides equal or better response time and system capacity than an unreplicated implementation of NFS that uses Unix files directly.
1991
Experience with transactions in QuickSilver
All programs in the QuickSilver distributed system behave atomically with respect to their updates to permanent data. Operating system support for transactions provides the framework required to support this, as well as a mechanism that unifies reclamation of resources after failures or normal process termination. This paper evaluates the use of transactions for these purposes in a general purpose operating system and presents some of the lessons learned from our experience with a complete running system based on transactions. Examples of how transactions are used in QuickSilver and measurements of their use demonstrate that the transaction mechanism provides an efficient and powerful means for solving many of the problems introduced by operating system extensibility and distribution.

12th SOSP — December 3-6, 1989 — Litchfield Park, Arizona, USA
1989
Front Matter
Security
1989
A logic of authentication
Authentication protocols are the basis of security in many distributed systems, and it is therefore essential to ensure that these protocols function correctly. Unfortunately, their design has been extremely error prone. Most of the protocols found in the literature contain redundancies or security flaws. A simple logic has allowed us to describe the beliefs of trustworthy parties involved in authentication protocols and the evolution of these beliefs as a consequence of communication. We have been able to explain a variety of authentication protocols formally, to discover subtleties and errors in them, and to suggest improvements. In this paper, we present the logic and then give the results of our analysis of four published protocols, chosen either because of their practical importance or because they serve to illustrate our method.
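For readers unfamiliar with the notation, the following is a paraphrase, in LaTeX, of two representative inference rules in this style of belief logic (a message-meaning rule for shared keys and a nonce-verification rule); the paper's exact notation and formulation may differ.

% Message meaning (shared keys): if P believes K is a good key shared with Q,
% and P sees a message encrypted under K, then P believes Q once said it.
% Nonce verification: if P believes X is fresh and that Q once said X,
% then P believes Q currently believes X.
\[
\frac{P\ \textrm{believes}\ P \stackrel{K}{\leftrightarrow} Q
      \qquad P\ \textrm{sees}\ \{X\}_K}
     {P\ \textrm{believes}\ Q\ \textrm{once said}\ X}
\qquad\qquad
\frac{P\ \textrm{believes fresh}(X)
      \qquad P\ \textrm{believes}\ Q\ \textrm{once said}\ X}
     {P\ \textrm{believes}\ Q\ \textrm{believes}\ X}
\]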
1989
Reducing risks from poorly chosen keys
It is well-known that, left to themselves, people will choose passwords that can be rather readily guessed. If this is done, they are usually vulnerable to an attack based on copying the content of messages forming part of an authentication protocol and experimenting, e.g. with a dictionary, offline. The most usual counter to this threat is to require people to use passwords which are obscure, or even to insist on the system choosing their passwords for them. In this paper we show alternatively how to construct an authentication protocol in which offline experimentation is impracticable; any attack based on experiment must involve the real authentication server and is thus open to detection by the server noticing multiple attempts.
NUMA Memory Management
1989
Simple but effective techniques for NUMA memory management
Multiprocessors with non-uniform memory access times introduce the problem of placing data near the processes that use them, in order to improve performance. We have implemented an automatic page placement strategy in the Mach operating system on the IBM ACE multiprocessor workstation. Our experience indicates that even very simple automatic strategies can produce nearly optimal page placement. It also suggests that the greatest leverage for further performance improvement lies in reducing false sharing, which occurs when the same page contains objects that would best be placed in different memories.
1989
The implementation of a coherent memory abstraction on a NUMA multiprocessor: experiences with platinum
PLATINUM is an operating system kernel with a novel memory management system for Non-Uniform Memory Access (NUMA) multiprocessor architectures. This memory management system implements a coherent memory abstraction. Coherent memory is uniformly accessible from all processors in the system. When used by applications coded with appropriate programming styles it appears to be nearly as fast as local physical memory and it reduces memory contention. Coherent memory makes programming NUMA multiprocessors easier for the user while attaining a level of performance comparable with hand-tuned programs. This paper describes the design and implementation of the PLATINUM memory management system, emphasizing the coherent memory. We measure the cost of basic operations implementing the coherent memory. We also measure the performance of a set of application programs running on PLATINUM. Finally, we comment on the interaction between architecture and the coherent memory system. PLATINUM currently runs on the BBN Butterfly Plus Multiprocessor.
File Caching
1989
Spritely NFS: experiments with cache-consistency protocols
File caching is essential to good performance in a distributed system, especially as processor speeds and memory sizes continue to improve rapidly while disk latencies do not. Stateless-server systems, such as NFS, cannot properly manage client file caches. Stateful systems, such as Sprite, can use explicit cache consistency protocols to improve both cache consistency and overall performance. By modifying NFS to use the Sprite cache consistency protocols, we isolate the effects of the consistency mechanism from the other features of Sprite. We find dramatic improvements on some, although not all, benchmarks, suggesting that an explicit cache consistency protocol is necessary for both correctness and good performance.
1989
Exploiting read-mostly workloads in the FileNet file system
Most recent studies of file system workloads have focussed on loads imposed by general computing. This paper introduces a significantly different workload imposed by a distributed application system. The FileNet system is a distributed application system that supports document image processing. The FileNet file system was designed to support the workload imposed by this application. To characterize the read-mostly workload applied to the file system and how it differs from general computing environments, we present statistics gathered from live production installations. We contrast these statistics with previously published data for more general computing. We describe the key algorithms of the file system, focusing on the caching approach. A bimodal client caching approach is employed, to match the file modification patterns observed. Different cache consistency algorithms are used depending on usage patterns observed for each file. Under most conditions, files cached at workstations can be accessed without contacting servers. When a file is subject to frequent modification that causes excessive cache consistency traffic, caching is disabled for that file, and servers participate in all open and close activities. The data from production sites is examined to evaluate the success of the approach under its applied load. Contrasts with alternative approaches are made based on this data.
1989
Improving the efficiency of UNIX buffer caches
This paper reports on the effects of using hardware virtual memory assists in managing file buffer caches in UNIX. A controlled experimental environment was constructed from two systems whose only difference was that one of them (XMF) used the virtual memory hardware to assist file buffer cache search and retrieval. An extensive series of performance characterizations was used to study the effects of varying the buffer cache size (from 3 MB to 70 MB); I/O transfer sizes (from 4 bytes to 64 KB); cache-resident and non-cache-resident data; READs and WRITEs; and a range of application programs. The results: small READ/WRITE transfers from the cache (≤1 KB) were 50% faster under XMF, while larger transfers (≥8 KB) were 20% faster. Retrieving data from disk, the XMF improvement was 25% and 10% respectively, although OPEN/CLOSE system calls took slightly longer in XMF. Some individual programs ran as much as 40% faster on XMF, while an application benchmark suite showed a 7-15% improvement in overall execution time. Perhaps surprisingly, XMF had fewer translation lookaside buffer misses.
Remote Procedure Call
1989
Performance of Firefly RPC
In this paper, we report on the performance of the remote procedure call implementation for the Firefly multiprocessor and analyze the implementation to account precisely for all measured latency. From the analysis and measurements, we estimate how much faster RPC could be if certain improvements were made. The elapsed time for an inter-machine call to a remote procedure that accepts no arguments and produces no results is 2.66 milliseconds. The elapsed time for an RPC that has a single 1440-byte result (the maximum result that will fit in a single packet) is 6.35 milliseconds. Maximum inter-machine throughput of application program data using RPC is 4.65 megabits/second, achieved with 4 threads making parallel RPCs that return the maximum sized result that fits in a single RPC result packet. CPU utilization at maximum throughput is about 1.2 CPU seconds per second on the calling machine and a little less on the server. These measurements are for RPCs from user space on one machine to user space on another, using the installed system and a 10 megabit/second Ethernet. The RPC packet exchange protocol is built on IP/UDP, and the times include calculating and verifying UDP checksums. The Fireflies used in the tests had 5 Micro VAX II processors and a DEQNA Ethernet controller.
1989
RPC in the x-Kernel: evaluating new design techniques
This paper reports our experiences implementing remote procedure call (RPC) protocols in the x-kernel. This exercise is interesting because the RPC protocols exploit two novel design techniques: virtual protocols and layered protocols. These techniques are made possible because the x-kernel provides an object-oriented infrastructure that supports three significant features: a uniform interface to all protocols, a late binding between protocol layers, and a small overhead for invoking any given protocol layer. For each design technique, the paper motivates the technique with a concrete example, describes how it is applied to the implementation of RPC protocols, and presents the results of experiments designed to evaluate the technique.
1989
Lightweight remote procedure call
Lightweight Remote Procedure Call (LRPC) is a communication facility designed and optimized for communication between protection domains on the same machine. In contemporary small-kernel operating systems, existing RPC systems incur an unnecessarily high cost when used for the type of communication that predominates — between protection domains on the same machine. This cost leads system designers to coalesce weakly-related subsystems into the same protection domain, trading safety for performance. By reducing the overhead of same-machine communication, LRPC encourages both safety and performance. LRPC combines the control transfer and communication model of capability systems with the programming semantics and large-grained protection model of RPC. LRPC achieves a factor of three performance improvement over more traditional approaches based on independent threads exchanging messages, reducing the cost of same-machine communication to nearly the lower bound imposed by conventional hardware. LRPC has been integrated into the Taos operating system of the DEC SRC Firefly multiprocessor workstation.
Structural Issues
1989
The portable common runtime approach to interoperability
Operating system abstractions do not always reach high enough for direct use by a language or applications designer. The gap is filled by language-specific runtime environments, which become more complex for richer languages (CommonLisp needs more than C++, which needs more than C). But language-specific environments inhibit integrated multi-lingual programming, and also make porting hard (for instance, because of operating system dependencies). To help solve these problems, we have built the Portable Common Runtime (PCR), a language-independent and operating-system-independent base for modern languages. PCR offers four interrelated facilities: storage management (including universal garbage collection), symbol binding (including static and dynamic linking and loading), threads (lightweight processes), and low-level I/O (including network sockets). PCR is “common” because these facilities simultaneously support programs in several languages. PCR supports C, Cedar, Scheme, and CommonLisp intercalling and runs pre-existing C and CommonLisp (Kyoto) binaries. PCR is “portable” because it uses only a small set of operating system features. The PCR source code is available for use by other researchers and developers.
1989
Generic virtual memory management for operating system kernels
We discuss the rationale and design of a Generic Memory management Interface, for a family of scalable operating systems. It consists of a general interface for managing virtual memory, independently of the underlying hardware architecture (e.g. paged versus segmented memory), and independently of the operating system kernel in which it is to be integrated. In particular, this interface provides abstractions for support of a single, consistent cache for both mapped objects and explicit I/O, and control of data caching in real memory. Data management policies are delegated to external managers. A portable implementation of the Generic Memory management Interface for paged architectures, the Paged Virtual Memory manager, is detailed. The PVM uses the novel history object technique for efficient deferred copying. The GMI is used by the Chorus Nucleus, in particular to support a distributed version of Unix. Performance measurements compare favorably with other systems.
Multiprocessors
1989
Low-synchronization translation lookaside buffer consistency in large-scale shared-memory multiprocessors
Operating systems for most current shared-memory multiprocessors must maintain translation lookaside buffer (TLB) consistency across processors. A processor that changes a shared page table must flush outdated mapping information from its own TLB, and it must force the other processors using the page table to do so as well. Published algorithms for maintaining TLB consistency on some popular commercial multiprocessors incur excessively high synchronization costs. We present an efficient TLB consistency algorithm that can be implemented on multiprocessors that include a small set of reasonable architectural features. This algorithm has been incorporated in a version of the MACH operating system developed for the IBM Research Parallel Processor Prototype (RP3).
1989
The Amber system: parallel programming on a network of multiprocessors
This paper describes a programming system called Amber that permits a single application program to use a homogeneous network of computers in a uniform way, making the network appear to the application as an integrated multiprocessor. Amber is specifically designed for high performance in the case where each node in the network is a shared-memory multiprocessor. Amber shows that support for loosely-coupled multiprocessing can be efficiently realized using an object-based programming model. Amber programs execute in a uniform network-wide object space, with memory coherence maintained at the object level. Careful data placement and consistency control are essential for reducing communication overhead in a loosely-coupled system. Amber programmers use object migration primitives to control the location of data and processing.
1989
Process control and scheduling issues for multiprogrammed shared-memory multiprocessors
Shared-memory multiprocessors are frequently used in a time-sharing style with multiple parallel applications executing at the same time. In such an environment, where the machine load is continuously varying, the question arises of how an application should maximize its performance while being fair to other users of the system. In this paper, we address this issue. We first show that if the number of runnable processes belonging to a parallel application significantly exceeds the effective number of physical processors executing it, its performance can be significantly degraded. We then propose a way of controlling the number of runnable processes associated with an application dynamically, to ensure good performance. The optimal number of runnable processes for each application is determined by a centralized server, and applications dynamically suspend or resume processes in order to match that number. A preliminary implementation of the proposed scheme is now running on the Encore Multimax and we show how it helps improve the performance of several applications. In some cases the improvement is more than a factor of two. We also discuss implications of the proposed scheme for multiprocessor schedulers, and how the scheme should interface with parallel programming languages.
Performance
1989
A lazy buddy system bounded by two coalescing delays
The watermark-based lazy buddy system for dynamic memory management uses lazy coalescing rules controlled by watermark parameters to achieve low operational costs. The correctness of the watermark-based lazy buddy system is shown by defining a space of legal states called the lazy space and proving that the watermark-based lazy coalescing rules always keep the memory state within that space. In this paper we describe a different lazy coalescing policy, called the DELAY-2 algorithm, that focuses directly on keeping the memory state within the lazy space. The resulting implementation is simpler, and experimental data shows it to be up to 12% faster than the watermark-based buddy system and about 33% faster than the standard buddy system. Inexpensive operations make the DELAY-2 algorithm attractive as a memory manager for an operating system. The watermark-based lazy buddy policy offers fine control over the coalescing policy of the buddy system. However, applications such as the UNIX System kernel memory manager do not need such fine control. For these applications, the DELAY-2 buddy system provides an efficient memory manager with low operational costs and low request blocking probability. In the DELAY-2 buddy system, the worst-case time for a free operation is bounded by two coalescing delays per class, and when all blocks are returned to the system, the system memory is coalesced back to its original state. This ensures that the memory space can be completely shared.
1989
Analysis of transaction management performance
There is currently much interest in incorporating transactions into both operating systems and general-purpose programming languages. This paper provides a detailed examination of the design and performance of the transaction manager of the Camelot system. Camelot is a transaction facility that provides a rich model of transactions intended to support a wide variety of general-purpose applications. The transaction manager's principal function is to execute the protocols that ensure atomicity. The conclusions of this study are: a simple optimization to two-phase commit reduces logging activity of distributed transactions; non-blocking commit is practical for some applications; multithreaded design improves throughput provided that log batching is used; multi-casting reduces the variance of distributed commit protocols in a LAN environment; and the performance of transaction mechanisms such as Camelot depends heavily upon kernel performance.
1989
Threads and input/output in the Synthesis kernel
The Synthesis operating system kernel combines several techniques to provide high performance, including kernel code synthesis, fine-grain scheduling, and optimistic synchronization. Kernel code synthesis reduces the execution path for frequently used kernel calls. Optimistic synchronization increases concurrency within the kernel. Their combination results in significant performance improvement over traditional operating system implementations. Using hardware and software emulating a SUN 3/160 running SUNOS, Synthesis achieves several times to several dozen times speedup for UNIX kernel calls and context switch times of 21 microseconds or faster.
Time-based Distributed Coherency
1989
Leases: an efficient fault-tolerant mechanism for distributed file cache consistency
Caching introduces the overhead and complexity of ensuring consistency, reducing some of its performance benefits. In a distributed system, caching must deal with the additional complications of communication and host failures. Leases are proposed as a time-based mechanism that provides efficient consistent access to cached data in distributed systems. Non-Byzantine failures affect performance, not correctness, with their effect minimized by short leases. An analytic model and an evaluation for file access in the V system show that leases of short duration provide good performance. The impact of leases on performance grows more significant in systems of larger scale and higher processor performance.
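A lease can be captured in a few lines of client/server logic. The sketch below (illustrative names and an assumed 10-second term, not the V-system implementation) serves reads from the cache only while the lease is unexpired, and notes the server-side rule that bounds how long a failed client can delay writes.

/* Lease-based cache consistency sketch.  A lease is a time-bounded promise
 * from the server; reads are served from cache only while the lease holds. */
#include <stdio.h>
#include <time.h>

#define LEASE_TERM_SECS 10   /* short leases bound the wait after a client failure */

struct cached_file {
    time_t lease_expiry;     /* server-granted lease on this file's data */
    char   data[4096];
};

static int lease_valid(const struct cached_file *f)
{
    return time(NULL) < f->lease_expiry;
}

/* Client read path: hit the cache while the lease is valid; otherwise
 * refetch the data and a fresh lease from the server (stubbed out here). */
static const char *read_file(struct cached_file *f)
{
    if (!lease_valid(f)) {
        /* fetch_from_server(f) would refresh f->data ... */
        f->lease_expiry = time(NULL) + LEASE_TERM_SECS;   /* ... and the lease */
    }
    return f->data;
}

/* Server write path (not shown): before applying a write, the server must
 * either obtain approval from every lease holder or wait for their leases
 * to expire, so a crashed client delays writes for at most LEASE_TERM_SECS. */

int main(void)
{
    struct cached_file f = { 0, "hello" };
    printf("%s\n", read_file(&f));
    return 0;
}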
1989
Mirage: a coherent distributed shared memory design
Shared memory is an effective and efficient paradigm for interprocess communication. We are concerned with software that makes use of shared memory in a single site system and its extension to a multimachine environment. Here we describe the design of a distributed shared memory (DSM) system called Mirage developed at UCLA. Mirage provides a form of network transparency to make network boundaries invisible for shared memory and is upward compatible with an existing interface to shared memory. We present the rationale behind our design decisions and important details of the implementation. Mirage's basic performance is examined by component timings, a worst case application, and a “representative” application. In some instances of page contention, the tuning parameter in our design improves application throughput. In other cases, thrashing is reduced and overall system performance improved using our tuning parameter.

11th SOSP — November 8-11, 1987 — Austin, Texas, USA
1987
Front Matter
Distributed File Systems
1987
Scale and performance in a distributed file system
The Andrew File System is a location-transparent distributed file system that will eventually span more than 5000 workstations at Carnegie Mellon University. Large scale affects performance and complicates system operation. In this paper we present observations of a prototype implementation, motivate changes in the areas of cache validation, server process structure, name translation, and low-level storage representation, and quantitatively demonstrate Andrew’s ability to scale gracefully. We establish the importance of whole-file transfer and caching in Andrew by comparing its performance with that of Sun Microsystems’ NFS file system. We also show how the aggregation of files into volumes improves the operability of the system.
1987
Caching in the Sprite network file system
The Sprite network operating system uses large main-memory disk block caches to achieve high performance in its file system. It provides non-write-through file caching on both client and server machines. A simple cache consistency mechanism permits files to be shared by multiple clients without danger of stale data. In order to allow the file cache to occupy as much memory as possible, the file system of each machine negotiates with the virtual memory system over physical memory usage and changes the size of the file cache dynamically. Benchmark programs indicate that client caches allow diskless Sprite workstations to perform within 0-12 percent of workstations with disks. In addition, client caching reduces server loading by 50 percent and network traffic by 90 percent.
Processing on Multiple Computers
1987
Using idle workstations in a shared computing environment
The Butler system is a set of programs running on Andrew workstations at CMU that give users access to idle workstations. Current Andrew users use the system over 300 times per day. This paper describes the implementation of the Butler system and tells of our experience in using it. In addition, it describes an application of the system known as gypsy servers, which allow network server programs to be run on idle workstations instead of using dedicated server machines.
1987
Attacking the process migration bottleneck
Moving the contents of a large virtual address space stands out as the bottleneck in process migration, dominating all other costs and growing with the size of the program. Copy-on-reference shipment is shown to successfully attack this problem in the Accent distributed computing environment. Logical memory transfers at migration time with individual on-demand page fetches during remote execution allows relocations to occur up to one thousand times faster than with standard techniques. While the amount of allocated memory varies by four orders of magnitude across the processes studied, their transfer times are practically constant. The number of bytes exchanged between machines as a result of migration and remote execution drops by an average of 58% in the representative processes studied, and message-handling costs are cut by over 47% on average. The assumption that processes touch a relatively small part of their memory while executing is shown to be correct, helping to account for these figures. Accent's copy-on-reference facility can be used by any application wishing to take advantage of lazy shipment of data.
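A toy Python sketch of the copy-on-reference idea (hypothetical names; Accent actually does this in the kernel's fault path on memory objects): migration ships only control state, and each page is pulled from the source host the first time the migrated process touches it.

    class SourceHost:
        def __init__(self, pages):
            self.pages = pages               # page number -> page contents

        def fetch_page(self, n):
            return self.pages[n]             # served on demand after migration

    class MigratedProcess:
        def __init__(self, source, num_pages):
            self.source = source
            self.resident = {}               # pages copied so far
            self.num_pages = num_pages

        def touch(self, n):
            # "Page fault": copy the page across the network only on first reference.
            if n not in self.resident:
                self.resident[n] = self.source.fetch_page(n)
            return self.resident[n]

    # Migration itself ships only control state; a process that touches 3 of
    # 1,000 pages causes only 3 page transfers.
    src = SourceHost({i: bytes(4096) for i in range(1_000)})
    proc = MigratedProcess(src, 1_000)
    for n in (0, 7, 421):
        proc.touch(n)
    assert len(proc.resident) == 3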
Useful Techniques
1987
Hashed and hierarchical timing wheels: data structures for the efficient implementation of a timer facility
Conventional algorithms to implement an Operating System timer module take O(n) time to start or maintain a timer, where n is the number of outstanding timers: this is expensive for large n. This paper begins by exploring the relationship between timer algorithms, time flow mechanisms used in discrete event simulations, and sorting techniques. Next a timer algorithm for small timer intervals is presented that is similar to the timing wheel technique used in logic simulators. By using a circular buffer or timing wheel, it takes O(1) time to start, stop, and maintain timers within the range of the wheel. Two extensions for larger values of the interval are described. In the first, the timer interval is hashed into a slot on the timing wheel. In the second, a hierarchy of timing wheels with different granularities is used to span a greater range of intervals. The performance of these two schemes and various implementation trade-offs are discussed.
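A minimal Python sketch of a simple timing wheel (the hashed and hierarchical variants extend the same structure): the wheel is a circular array of buckets, one per tick; starting a timer appends it to bucket (current + interval) mod size, and each clock tick processes only the bucket under the cursor, giving O(1) start, stop, and per-tick maintenance for intervals within the wheel's range.

    class TimingWheel:
        def __init__(self, size):
            self.size = size
            self.buckets = [[] for _ in range(size)]   # one bucket per future tick
            self.current = 0                           # cursor advanced by the clock

        def start_timer(self, ticks, callback):
            # O(1): drop the timer into the bucket the cursor will reach after
            # `ticks` ticks (for a simple wheel, ticks must be < size).
            assert 0 < ticks < self.size
            slot = (self.current + ticks) % self.size
            entry = [callback, True]                   # second field: still armed?
            self.buckets[slot].append(entry)
            return entry

        def stop_timer(self, entry):
            entry[1] = False                           # O(1): just mark it cancelled

        def tick(self):
            # Called once per clock tick: fire everything in the bucket now under
            # the cursor.
            self.current = (self.current + 1) % self.size
            expired, self.buckets[self.current] = self.buckets[self.current], []
            for callback, armed in expired:
                if armed:
                    callback()

    fired = []
    wheel = TimingWheel(256)
    wheel.start_timer(3, lambda: fired.append("timeout"))
    for _ in range(4):
        wheel.tick()
    assert fired == ["timeout"]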
1987
The packet filter: an efficient mechanism for user-level network code
Code to implement network protocols can be either inside the kernel of an operating system or in user-level processes. Kernel-resident code is hard to develop, debug, and maintain, but user-level implementations typically incur significant overhead and perform poorly. The performance of user-level network code depends on the mechanism used to demultiplex received packets. Demultiplexing in a user-level process increases the rate of context switches and system calls, resulting in poor performance. Demultiplexing in the kernel eliminates unnecessary overhead. This paper describes the packet filter, a kernel-resident, protocol-independent packet demultiplexer. Individual user processes have great flexibility in selecting which packets they will receive. Protocol implementations using the packet filter perform quite well, and have been in production use for several years.
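A rough Python sketch of kernel-level demultiplexing (a hypothetical predicate format, far simpler than the paper's stack-machine filter language): each user process installs a small filter, the kernel evaluates every received packet against the installed filters, and a packet is queued only to the processes whose filters accept it.

    # A filter is a list of (offset, length, expected_bytes) predicates that must
    # all match; the kernel runs these tests instead of waking a user process
    # just to have it discard the packet.

    def matches(filter_spec, packet):
        return all(packet[off:off + ln] == want for off, ln, want in filter_spec)

    class Demultiplexer:
        def __init__(self):
            self.endpoints = []                  # (filter_spec, queue) per process

        def register(self, filter_spec):
            queue = []
            self.endpoints.append((filter_spec, queue))
            return queue

        def receive(self, packet):
            # In-kernel dispatch: one pass over the installed filters.
            for filter_spec, queue in self.endpoints:
                if matches(filter_spec, packet):
                    queue.append(packet)

    demux = Demultiplexer()
    # Hypothetical protocol: byte 0 is a protocol id; this process wants id 0x11.
    my_queue = demux.register([(0, 1, b"\x11")])
    demux.receive(b"\x11" + b"payload")
    demux.receive(b"\x22" + b"other")
    assert len(my_queue) == 1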
1987
A name service for evolving heterogeneous systems
A prototype implementation has been built as part of the Heterogeneous Computer Systems project at the University of Washington. This service supports RPC binding and other applications in our heterogeneous environment. Measurements of the performance of this prototype show that it is close to that of the underlying name services, due largely to the use of specialized caching techniques.
Operating Systems
1987
The duality of memory and communication in the implementation of a multiprocessor operating system
Mach is a multiprocessor operating system being implemented at Carnegie-Mellon University. An important component of the Mach design is the use of memory objects which can be managed either by the kernel or by user programs through a message interface. This feature allows applications such as transaction management systems to participate in decisions regarding secondary storage management and page replacement. This paper explores the goals, design and implementation of Mach and its external memory management facility. The relationship between memory and communication in Mach is examined as it relates to overall performance, applicability of Mach to new multiprocessor architectures, and the structure of application programs.
1987
Time warp operating system
This paper describes the Time Warp Operating System, under development for three years at the Jet Propulsion Laboratory for the Caltech Mark III Hypercube multi-processor. Its primary goal is concurrent execution of large, irregular discrete event simulations at maximum speed. It also supports any other distributed applications that are synchronized by virtual time. The Time Warp Operating System includes a complete implementation of the Time Warp mechanism, and is a substantial departure from conventional operating systems in that it performs synchronization by a general distributed process rollback mechanism. The use of general rollback forces a rethinking of many aspects of operating system design, including programming interface, scheduling, message routing and queueing, storage management, flow control, and commitment. In this paper we review the mechanics of Time Warp, describe the TWOS operating system, show how to construct simulations in object-oriented form to run under TWOS, and offer a qualitative comparison of Time Warp to the Chandy-Misra method of distributed simulation. We also include details of two benchmark simulations and preliminary measurements of time-to-completion, speedup, rollback rate, and antimessage rate, all as functions of the number of processors used.
1987
Synchronization primitives for a multiprocessor: a formal specification
Formal specifications of operating system interfaces can be a useful part of their documentation. We illustrate this by documenting the Threads synchronization primitives of the Taos operating system. We start with an informal description, present a way to formally specify interfaces in concurrent systems, give a formal specification of the synchronization primitives, briefly discuss the implementation, and conclude with a discussion of what we have learned from using the specification for more than a year.
Systems
1987
Managing stored voice in the etherphone system
The voice manager in the Etherphone system provides facilities for recording, editing, and playing stored voice in a distributed personal-computing environment. It provides the basis for applications such as voice mail, annotation of multimedia documents, and voice editing using standard text-editing techniques. To facilitate sharing, the voice manager stores voice on a special voice file server that is accessible via the local internet. Operations for editing a passage of recorded voice simply build persistent data structures to represent the edited voice. These data structures, implementing an abstraction called voice ropes, are stored in a server database and consist of lists of intervals within voice files. Clients refer to voice ropes solely by reference. Interests, additional persistent data structures maintained by the server, serve two purposes: First, they provide a sort of directory service for managing the voice ropes that have been created. More importantly, they provide a reliable reference-counting mechanism, permitting the garbage collection of voice ropes that are no longer needed. These interests are grouped into classes; for some important classes, obsolete interests can be detected and deleted by a class-specific algorithm that runs periodically.
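A rough Python sketch of the voice-rope abstraction (hypothetical API; the real server also stores ropes in a database and manages interests): a rope is an immutable list of (voice file, start, length) intervals, so editing builds new interval lists and never copies voice samples.

    class VoiceRope:
        """An immutable edit: a tuple of (voice_file_id, start, length) intervals."""
        def __init__(self, intervals):
            self.intervals = tuple(intervals)

        def length(self):
            return sum(ln for _, _, ln in self.intervals)

    def record(file_id, length):
        return VoiceRope([(file_id, 0, length)])

    def concatenate(a, b):
        # Editing never copies samples, only interval descriptors.
        return VoiceRope(a.intervals + b.intervals)

    def substring(rope, start, length):
        out, pos = [], 0
        for file_id, s, ln in rope.intervals:
            lo = max(start, pos)
            hi = min(start + length, pos + ln)
            if lo < hi:
                out.append((file_id, s + (lo - pos), hi - lo))
            pos += ln
        return VoiceRope(out)

    greeting = record("vf1", 800)          # 800 units of recorded voice
    body = record("vf2", 2400)
    message = concatenate(greeting, body)
    clip = substring(message, 600, 400)    # spans the boundary of the two files
    assert clip.intervals == (("vf1", 600, 200), ("vf2", 0, 200))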
1987
Fine-grained mobility in the emerald system
Emerald is an object-based language and system designed for the construction of distributed programs. An explicit goal of Emerald is support for object mobility; objects in Emerald can freely move within the system to take advantage of distribution and dynamically changing environments. We say that Emerald has fine-grained mobility because Emerald objects can be small data objects as well as process objects. Fine-grained mobility allows us to apply mobility in new ways but presents implementation problems as well. This paper discusses the benefits of fine-grained mobility, the Emerald language and run-time mechanisms that support mobility, and techniques for implementing mobility that do not degrade the performance of local operations. Performance measurements of the current implementation are included.
Transaction Facilities
1987
Recovery management in QuickSilver
This paper describes Quicksilver, developed at the IBM Almaden Research Center, which uses atomic transactions as a unified failure recovery mechanism for a client-server structured distributed system. Transactions allow failure atomicity for related activities at a single server or at a number of independent servers. Rather than bundling transaction management into a dedicated language or recoverable object manager, Quicksilver exposes the basic commit protocol and log recovery primitives, allowing clients and servers to tailor their recovery techniques to their specific needs. Servers can implement their own log recovery protocols rather than being required to use a system-defined protocol. These decisions allow servers to make their own choices to balance simplicity, efficiency, and recoverability.
1987
801 Storage: architecture and programming
Based on a novel architecture, the 801 minicomputer project has developed a low-level storage manager that can significantly simplify storage programming in subsystems and applications. The storage manager embodies three ideas: (1) large virtual storage, to contain all temporary data and permanent files for the active programs; (2) the innovation of database storage, which has implicit properties of access serializability and atomic update, similar to those of database transaction systems; and (3) access to all storage, including files, by the usual operations and types of a high-level programming language. The IBM RT PC implements the hardware architecture necessary for these storage facilities in its storage controller (MMU). The storage manager and language elements required, as well as subsystems and applications that use them, have been implemented and studied in a prototype operating system called CPR that runs on the RT PC. Low cost and good performance are achieved in both hardware and software. The design is intended to be extensible across a wide performance/cost spectrum.
1987
Implementation of Argus
Argus is a programming language and system developed to support the construction and execution of distributed programs. This paper describes the implementation of Argus, with particular emphasis on the way we implement atomic actions, because this is where Argus differs most from other implemented systems. The paper also discusses the performance of Argus. The cost of actions is quite reasonable, indicating that action systems like Argus are practical.
Reliability Methods
1987
Exploiting virtual synchrony in distributed systems
We describe applications of a virtually synchronous environment for distributed programming, which underlies a collection of distributed programming tools in the ISIS2 system. A virtually synchronous environment allows processes to be structured into process groups, and makes events like broadcasts to the group as an entity, group membership changes, and even migration of an activity from one place to another appear to occur instantaneously — in other words, synchronously. A major advantage to this approach is that many aspects of a distributed application can be treated independently without compromising correctness. Moreover, user code that is designed as if the system were synchronous can often be executed concurrently. We argue that this approach to building distributed and fault-tolerant software is more straightforward, more flexible, and more likely to yield correct solutions than alternative approaches.
1987
Log files: an extended file service exploiting write-once storage
A log service provides efficient storage and retrieval of data that is written sequentially (append-only) and not subsequently modified. Application programs and subsystems use log services for recovery, to record security audit trails, and for performance monitoring. Ideally, a log service should accommodate very large, long-lived logs, and provide efficient retrieval and low space overhead. In this paper, we describe the design and implementation of the Clio log service. Clio provides the abstraction of log files: readable, append-only files that are accessed in the same way as conventional files. The underlying storage medium is required only to be append-only; more general types of write access are not necessary. We show how log files can be implemented efficiently and robustly on top of such storage media—in particular, write-once optical disk. In addition, we describe a general application software storage architecture that makes use of log files.
Efficient Recovery Techniques
1987
A simple and efficient implementation of a small database
This paper describes a technique for implementing the sort of small databases that frequently occur in the design of operating systems and distributed systems. We take advantage of the existence of very large virtual memories, and quite large real memories, to make the technique feasible. We maintain the database as a strongly typed data structure in virtual memory, record updates incrementally on disk in a log and occasionally make a checkpoint of the entire database. We recover from crashes by restoring the database from an old checkpoint then replaying the log. We use existing packages to convert between strongly typed data objects and their disk representations, and to communicate strongly typed data across the network (using remote procedure calls). Our memory is managed entirely by a general purpose allocator and garbage collector. This scheme has been used to implement a name server for a distributed system. The resulting implementation has the desirable property of being simultaneously simple, efficient and reliable.
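A compact Python sketch of the scheme under stated assumptions (a hypothetical JSON-on-disk layout; the real system logs strongly typed objects using existing pickling packages): every update is appended to the log before being applied in memory, a checkpoint occasionally snapshots the whole database, and recovery reloads the last checkpoint and replays the log.

    import json, os

    class SmallDB:
        def __init__(self, dirname):
            self.dir = dirname
            os.makedirs(dirname, exist_ok=True)
            self.data = {}
            self._recover()
            self.log = open(os.path.join(dirname, "log"), "a")

        def put(self, key, value):
            # Write-ahead: the update reaches the log before the in-memory copy.
            self.log.write(json.dumps({"k": key, "v": value}) + "\n")
            self.log.flush()
            os.fsync(self.log.fileno())
            self.data[key] = value

        def checkpoint(self):
            # Snapshot the whole (small) database, then truncate the log.
            tmp = os.path.join(self.dir, "ckpt.tmp")
            with open(tmp, "w") as f:
                json.dump(self.data, f)
                f.flush()
                os.fsync(f.fileno())
            os.replace(tmp, os.path.join(self.dir, "ckpt"))
            self.log.truncate(0)

        def _recover(self):
            ckpt = os.path.join(self.dir, "ckpt")
            if os.path.exists(ckpt):
                with open(ckpt) as f:
                    self.data = json.load(f)
            logfile = os.path.join(self.dir, "log")
            if os.path.exists(logfile):
                with open(logfile) as f:
                    for line in f:            # replay updates made after the checkpoint
                        rec = json.loads(line)
                        self.data[rec["k"]] = rec["v"]

    db = SmallDB("/tmp/smalldb-demo")
    db.put("name-server/host-a", "10.0.0.7")   # hypothetical name-server entry
    db.checkpoint()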
1987
Reimplementing the Cedar file system using logging and group commit
The workstation file system for the Cedar programming environment was modified to improve its robustness and performance. Previously, the file system used hardware-provided labels on disk blocks to increase robustness against hardware and software errors. The new system does not require hardware disk labels, yet is more robust than the old system. Recovery is rapid after a crash. The performance of operations on file system metadata, e.g., file creation or open, is greatly improved. The new file system has two features that make it atypical. The system uses a log, as do most database systems, to recover metadata about the file system. To gain performance, it uses group commit, a concept derived from high performance database systems. The design of the system used a simple, yet detailed and accurate, analytical model to choose between several design alternatives in order to provide good disk performance.
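A small Python sketch of group commit (hypothetical interfaces, not the Cedar code): committing operations append their log records to a shared buffer and wait, and a single force of the buffer to disk makes a whole batch durable at once, amortizing the synchronous write over every operation that joined the group.

    import os, threading, time

    class GroupCommitLog:
        """Batch many commits into one synchronous disk write."""
        def __init__(self, path, interval=0.01):
            self.f = open(path, "ab")
            self.cond = threading.Condition()
            self.pending = []                     # records not yet on disk
            self.durable_count = 0                # records forced to disk so far
            self.seq = 0
            threading.Thread(target=self._flusher, args=(interval,), daemon=True).start()

        def commit(self, record):
            # Join the current group, then wait until a single force covers us.
            with self.cond:
                self.pending.append(record)
                self.seq += 1
                my_seq = self.seq
                while self.durable_count < my_seq:
                    self.cond.wait()

        def _flusher(self, interval):
            while True:
                time.sleep(interval)              # let a group of commits accumulate
                with self.cond:
                    batch, self.pending = self.pending, []
                    if not batch:
                        continue
                    for rec in batch:
                        self.f.write(rec + b"\n")
                    self.f.flush()
                    os.fsync(self.f.fileno())     # one disk force for the whole batch
                    self.durable_count += len(batch)
                    self.cond.notify_all()

    log = GroupCommitLog("/tmp/groupcommit-demo.log")
    log.commit(b"create /foo")   # many threads can commit concurrently; each
                                 # batch costs a single disk force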

10th SOSP — December 1-4, 1985 — Orcas Island, Washington, USA
1985
Front Matter
Distributed Operating Systems
1985
VAXclusters: a closely-coupled distributed system
A VAXcluster is a highly available and extensible configuration of VAX computers that operate as a single system. To achieve performance in a multicomputer environment, a new communications architecture, communications hardware, and distributed software were jointly designed. The software is a distributed version of the VAX/VMS operating system that uses a distributed lock manager to synchronize access to shared resources. The communications hardware includes a 70 megabit per second message-oriented interconnect and an interconnect port that performs communications tasks traditionally handled by software. Performance measurements show this structure to be highly efficient, for example, capable of sending and receiving 3000 messages per second on a VAX-11/780.
1985
Preemptable remote execution facilities for the V-system
A remote execution facility allows a user of a workstation-based distributed system to offload programs onto idle workstations, thereby providing the user with access to computational resources beyond that provided by his personal workstation. In this paper, we describe the design and performance of the remote execution facility in the V distributed system, as well as several implementation issues of interest. In particular, we focus on network transparency of the execution environment, preemption and migration of remotely executed programs, and avoidance of residual dependencies on the original host. We argue that preemptable remote execution allows idle workstations to be used as a “pool of processors” without interfering with use by their owners and without significant overhead for the normal execution of programs. In general, we conclude that the cost of providing preemption is modest compared to providing a similar amount of computation service by dedicated “computation engines”.
Measured Performance of Systems
1985
The integration of virtual memory management and interprocess communication in accent
The integration of virtual memory management and interprocess communication in the Accent network operating system kernel is examined. The design and implementation of the Accent memory management system is discussed and its performance, both on a series of message-oriented benchmarks and in normal operation, is analyzed in detail.
1985
A trace-driven analysis of the UNIX 4.2 BSD file system
We analyzed the UNIX 4.2 BSD file system by recording user-level activity in trace files and writing programs to analyze the traces. The tracer did not record individual read and write operations, yet still provided tight bounds on what information was accessed and when. The trace analysis shows that the average file system bandwidth needed per user is low (a few hundred bytes per second). Most of the files accessed are open only a short time and are accessed sequentially. Most new information is deleted or overwritten within a few minutes of its creation. We also wrote a simulator that uses the traces to predict the performance of caches for disk blocks. The moderate-sized caches used in UNIX reduce disk traffic for file blocks by about 50%, but larger caches (several megabytes) can eliminate 90% or more of all disk traffic. With those large caches, large block sizes (16 kbytes or more) result in the fewest disk accesses.
Distributed File Systems
1985
A caching file system for a programmer's workstation
This paper describes a file system for a programmer’s workstation that has access both to a local disk and to remote file servers. The file system is designed to help programmers manage their local naming environments and share consistent versions of collections of software. It names multiple versions of local and remote files in a hierarchy. Local names can refer to local files or be attached to remote files. Remote files also may be referred to directly. Remote files are immutable and cached on the local disk. The file system is part of the Cedar experimental programming environment at Xerox PARC and has been in use since late 1983.
1985
The ITC distributed file system: principles and design
This paper presents the design and rationale of a distributed file system for a network of more than 5000 personal computer workstations. While scale has been the dominant design influence, careful attention has also been paid to the goals of location transparency, user mobility and compatibility with existing operating system interfaces. Security is an important design consideration, and the mechanisms for it do not assume that the workstations or the network are secure. Caching of entire files at workstations is a key element in this design. A prototype of this system has been built and is in use by a user community of about 400 individuals. A refined implementation that will scale more gracefully and provide better performance is close to completion.
1985
A distributed file service based on optimistic concurrency control
The design of a layered file service for the Amoeba Distributed System is discussed, on top of which various applications can easily be implemented. The bottom layer is formed by the Amoeba Block Services, responsible for implementing stable storage and replicated, highly available disk blocks. The next layer is formed by the Amoeba File Service which provides version management and concurrency control for tree-structured files. On top of this layer, the applications, ranging from databases to source code control systems, determine the structure of the file trees and provide an interface to the users.
Panel
1985
Future Directions
Replication, Fault Tolerance
1985
Replicated distributed programs
A troupe is a set of replicas of a module, executing on machines that have independent failure modes. Troupes are the building blocks of replicated distributed programs and the key to achieving high availability. Individual members of a troupe do not communicate among themselves, and are unaware of one another's existence; this property is what distinguishes troupes from other software architectures for fault tolerance.

Replicated procedure call is introduced to handle the many-to-many pattern of communication between troupes. The semantics of replicated procedure call can be summarized as exactly-once execution at all replicas.

An implementation of troupes and replicated procedure call is described, and its performance is measured. The problem of concurrency control for troupes is examined, and a commit protocol for replicated atomic transactions is presented. Binding and reconfiguration mechanisms for replicated distributed programs are described.

1985
Replication and fault-tolerance in the ISIS system
The ISIS system transforms abstract type specifications into fault-tolerant distributed implementations while insulating users from the mechanisms used to achieve fault-tolerance. This paper discusses techniques for obtaining a fault-tolerant implementation from a non-distributed specification and for achieving improved performance by concurrently updating replicated data. The system itself is based on a small set of communication primitives, which are interesting because they achieve high levels of concurrency while respecting higher level ordering requirements. The performance of distributed fault-tolerant services running on this initial version of ISIS is found to be nearly as good as that of non-distributed, fault-intolerant ones.
1985
Consistency and recovery control for replicated files
We present a consistency and recovery control scheme for replicated files. The purpose of a replicated file is to improve the availability of a logical file in the presence of site failures and network partitions. The accessible physical copies of a replicated file will be mutually consistent and behave like a single copy as far as the user can tell. Our recovery scheme requires no manual intervention. The scheme tolerates any number of site failures and network partitions as well as repairs.
Facilities for Supercomputing
1985
Compiler directed memory management policy for numerical programs
A Compiler Directed Management Policy for numerical programs is described in this paper. The high level source codes of numerical programs contain useful information which can be used by a compiler to determine the memory requirements of a program. Using this information, the compiler can pass directives to the operating system for effective management of the memory hierarchy. During the program’s execution the operating system dynamically allocates to a program the space it requires as specified by the received directive. The new policy is compared with LRU and WS policies. Empirical results presented in this paper show that the CD policy can out-perform LRU and WS by a wide margin.
1985
A data-flow approach to multitasking on CRAY X-MP computers
The initial CRAY X-MP multitasking implementation supported multitasking of infrequently synchronizing jobs running in a dedicated environment. Running in a shared environment or using too small a granularity often had adverse effects on performance. The original mechanisms were not designed for general use and hence were neither flexible nor efficient. Additionally, multitasking as an integrated feature (user libraries and operating system) needed to vary the number of processors working on a problem quickly. We subsequently implemented simpler mechanisms in the user libraries and operating system. The new mechanisms exploit the speed of a set of hardware registers which are shared among cooperating CPUs to achieve the needed performance. In retrospect, we saw how neatly a data-flow paradigm fit our design.
Atomic Actions
1985
Transactions and synchronization in a distributed operating system
A fully distributed operating system transaction facility with fine-grain record level synchronization is described. Multiple member processes, remote resource access, dynamic process migration, and orderly interaction with concurrent non-transaction activities are all supported. An unusual logging strategy, based on shadow pages but supporting logical level locking, is used. This choice is justified on the basis of ease of implementation and performance analysis.

The design and implementation is done in the context of Locus, a high performance distributed Unix operating system for local area networks.

1985
Distributed transactions for reliable systems
Facilities that support distributed transactions on user-defined types can be implemented efficiently and can simplify the construction of reliable distributed programs. To demonstrate these points, this paper describes a prototype transaction facility, called TABS, that supports objects, transparent communication, synchronization, recovery, and transaction management. Various objects that use the facilities of TABS are exemplified and the performance of the system is discussed in detail. The paper concludes that the prototype provides useful facilities, and that it would be feasible to build a high performance implementation based on its ideas.
1985
Reliable object storage to support atomic actions
Maintaining consistency of on-line, long-lived, distributed data in the presence of hardware failures is a necessity for many applications. The Argus programming language and system, currently under development at M.I.T., provides users with linguistic constructs to implement such applications. Argus permits users to identify certain data objects as being resilient to failures, and the set of such resilient objects can vary dynamically as programs run. When resilient objects are modified, they are automatically copied by the Argus implementation to stable storage, storage that with very high probability does not lose information. The resilient objects are therefore guaranteed, with very high probability, to survive both media failures and node crashes.

This paper presents a method for implementing resilient objects, using a log-based mechanism to organize the information on stable storage. Of particular interest is the handling of a dynamic, user-controlled set of resilient objects, and the use of early prepare to minimize delays in user activities.

Alternative Architectures
1985
The S/Net's Linda kernel
Linda is a parallel programming language that differs from other parallel languages in its simplicity and in its support for distributed data structures. The S/Net is a multicomputer, designed and built at AT&T Bell Laboratories, that is based on a fast, word-parallel bus interconnect. We describe the Linda-supporting communication kernel we have implemented on the S/Net. The implementation suggests that Linda’s unusual shared-memory-like communication primitives can be made to run well in the absence of physically shared memory; the simplicity of the language and of our implementation’s logical structure suggest that similar Linda implementations might readily be constructed on related architectures. We outline the language, and programming methodologies based on distributed data structures; we then describe the implementation, and the performance both of the Linda primitives themselves and of a simple S/Net-Linda matrix-multiplication program designed to exercise them.
1985
An architecture for large scale information systems
A new type of system architecture is described that uses both duplex communication and wide-area simplex communication to implement a single service. A working community information system based on this architecture is discussed from a systems perspective, with an emphasis on the unique way in which processing is distributed among a confederation of shared servers and private personal systems.

In the community information system, each personal system maintains a local, user-defined subset of the databases stored on the shared servers. Database updates are transmitted to the personal systems via a broadcast packet radio system. This design allows many queries to be processed completely at users’ personal machines, and thus reduces the reliance on shared servers.

A unifying design principle is that the system is seen as a collection of independent shared and personal databases, as opposed to a single monolithic database. Query routing is used to hide the system’s division into component databases from a user.

Experience with Systems
1985
The structuring of systems using upcalls
When implementing a system specified as a number of layers of abstraction, it is tempting to implement each layer as a process. However, this requires that communication between layers be via asynchronous inter-process messages. Our experience, especially with implementing network protocols, suggests that asynchronous communication between layers leads to serious performance problems. In place of this structure we propose an implementation methodology which permits synchronous (procedure call) communication between layers, both when a higher layer invokes a lower layer and in the reverse direction, from lower layer upward. This paper discusses the motivation for this methodology, as well as the pitfalls that accompany it.
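A skeletal Python illustration of the upcall structure (hypothetical layer interfaces): the upper layer registers a handler with the layer below, and arriving data is delivered by an ordinary synchronous procedure call upward rather than by an asynchronous message.

    class NetworkLayer:
        """Lower layer: delivers arriving packets by calling upward, not by queueing."""
        def __init__(self):
            self.receive_upcall = None

        def register_upcall(self, handler):
            self.receive_upcall = handler          # the layer above hands down its handler

        def packet_arrived(self, packet):
            # Synchronous upcall: no message queue, no context switch.
            self.receive_upcall(packet)

    class TransportLayer:
        def __init__(self, network):
            network.register_upcall(self.handle_packet)   # registration is a downcall

        def handle_packet(self, packet):
            print("transport layer received", packet)

    net = NetworkLayer()
    TransportLayer(net)
    net.packet_arrived(b"hello")       # data flows upward as an ordinary procedure call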
1985
Supporting distributed applications: experience with Eden
The Eden distributed system has been running at the University of Washington for over two years. Most of the principles and implementation ideas of Eden have been adequately discussed in the literature. This paper presents some of the experience that has been gained from the implementation and use of Eden. Much of this experience is relevant to other distributed systems, even though they may be based on different assumptions.

9th SOSP — October 10-13, 1983 — Bretton Woods, New Hampshire, USA
1983
Front Matter
Communication
1983
Computation & communication in R*: a distributed database manager
This article presents and discusses the computation and communication model used by R*, a prototype distributed database management system. An R* computation consists of a tree of processes connected by virtual circuit communication paths. The process management and communication protocols used by R* enable the system to provide reliable, distributed transactions while maintaining adequate levels of performance. Of particular interest is the use of processes in R* to retain user context from one transaction to another, in order to improve the system performance and recovery characteristics.
1983
Implementing remote procedure calls
Remote procedure calls (RPC) appear to be a useful paradigm for providing communication across a network between programs written in a high-level language. This paper describes a package providing a remote procedure call facility, the options that face the designer of such a package, and the decisions we made. We describe the overall structure of our RPC mechanism, our facilities for binding RPC clients, the transport level communication protocol, and some performance measurements. We include descriptions of some optimizations used to achieve high performance and to minimize the load on server machines that have many clients.
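A minimal Python sketch of the stub structure behind such a package (hypothetical wire format; the actual package adds binding, a specialized transport protocol, and generated stubs): the client stub marshals the procedure name and arguments into a call packet, and the server stub unmarshals it, invokes the procedure, and marshals the result back.

    import json

    # --- server side ---------------------------------------------------------
    class Server:
        def __init__(self):
            self.procedures = {}                      # exported remote procedures

        def export(self, name, fn):
            self.procedures[name] = fn

        def dispatch(self, call_packet):
            # Server stub: unmarshal, invoke the real procedure, marshal the result.
            msg = json.loads(call_packet)
            result = self.procedures[msg["proc"]](*msg["args"])
            return json.dumps({"result": result})

    # --- client side ---------------------------------------------------------
    class ClientStub:
        def __init__(self, transport):
            self.transport = transport                # here: a direct call to dispatch

        def call(self, proc, *args):
            # Client stub: marshal, send, await the result packet, unmarshal.
            packet = json.dumps({"proc": proc, "args": list(args)})
            reply = self.transport(packet)
            return json.loads(reply)["result"]

    server = Server()
    server.export("add", lambda a, b: a + b)
    stub = ClientStub(server.dispatch)                # stands in for the network
    assert stub.call("add", 2, 3) == 5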
1983
An asymmetric stream communication system
Input and output are often viewed as complementary operations, and it is certainly true that the direction of data flow during input is the reverse of that during output. However, in a conventional operating system, the direction of control flow is the same for both input and output: the program plays the active role, while the operating system transput primitives are always passive. Thus there are four primitive transput operations, not two: the corresponding pairs are passive input and active output, and active input and passive output. This paper explores the implications of this idea in the context of an object oriented operating system.
Resource Management
1983
Resource management in a decentralized system
The heterogeneous collection of machines constituting the Processor Bank at Cambridge represents an important set of resources which must be managed. The Resource Manager incorporates these machines into a wider view of resources in a network. It will accept existing resources and specifications for constructing others from existing ones. It will then allocate these to clients on demand, combining existing resources as necessary. Resource management in a decentralized system presents interesting problems: the resources are varied and of a multi-level nature; they are available at different locations from where they are required; the stock of available resources varies dynamically; and the underlying system is in constant flux.
1983
A file system supporting cooperation between programs
File systems coordinate simultaneous access of files by concurrent client processes. Although several processes may read a file simultaneously, the file system must grant exclusive access to one process wanting to write the file. Most file systems consider processes to be antagonistic: they prevent one process from taking actions on a file that have any chance of harming another process already using the file. If several processes need to cooperate in a more sophisticated manner in their use of a file, they must communicate explicitly among themselves. The next three sections describe the file system procedures used by clients for using and sharing files. Section Two discusses how a client gains access to a file and how a client can respond if the file system asks it to give up a file it is using. Section Three discusses the mechanism by which a client might ask to be notified that a file is available for some access. Section Four discusses a controlled type of file access that lets clients read and write the same file at the same time. Section Five comprises three examples of the cooperative features of the file system taken from the Xerox Development Environment. Section Six discusses the subtleties of writing the call-back procedures that clients provide to the file system to implement interprocess cooperation. Section Seven discusses the implementation of this file system.
1983
New methods for dynamic storage allocation (Fast Fits)
A cartesian tree can be used to manage a pool of storage so that typically the allocation or release of a variable-length area takes time O(log M), where M = number of discontiguous available blocks. It neither incurs the spatial penalty of ‘Buddy’ methods nor has the restrictions associated with tags. It can support the same programming interface that is conventionally associated with ‘First Fit’, without using additional storage, and with much better performance when M is large.
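A Python sketch of the lookup that makes this fast (hypothetical node layout; insertion, deletion, and coalescing are omitted): free blocks form a cartesian tree ordered by address and heap-ordered by size, so the root is always the largest free block and a fitting block is found by descending from the root.

    class FreeBlock:
        """Node of a cartesian tree: BST by address, heap-ordered by size."""
        def __init__(self, addr, size, left=None, right=None):
            self.addr, self.size = addr, size
            self.left, self.right = left, right

    def find_fit(node, request):
        # Heap order on size means the root is the largest free block: if even it
        # is too small, nothing fits.  Otherwise "leftmost fit": keep moving to the
        # lower-address child as long as it is still big enough; the result is the
        # lowest-address free block that satisfies the request.
        if node is None or node.size < request:
            return None
        while node.left is not None and node.left.size >= request:
            node = node.left
        return node

    # Free blocks at addresses 0, 100, 200 with sizes 48, 96, 32 (root = largest).
    tree = FreeBlock(100, 96,
                     left=FreeBlock(0, 48),
                     right=FreeBlock(200, 32))
    assert find_fit(tree, 40).addr == 0      # the smaller left block still satisfies 40
    assert find_fit(tree, 64).addr == 100    # only the root is large enough
    assert find_fit(tree, 128) is None       # larger than the largest free block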
Special Session
1983
Hints for computer system design
Experience with the design and implementation of a number of computer systems, and study of many other systems, has led to some general hints for system design which are described here. They are illustrated by a number of examples, ranging from hardware such as the Alto and the Dorado to applications programs such as Bravo and Star.
LOCUS
1983
The LOCUS distributed operating system
LOCUS is a distributed operating system which supports transparent access to data through a network wide filesystem, permits automatic replication of storage, supports transparent distributed process execution, supplies a number of high reliability functions such as nested transactions, and is upward compatible with Unix. Partitioned operation of subnets and their dynamic merge is also supported.

The system has been operational for about two years at UCLA and extensive experience in its use has been obtained. The complete system architecture is outlined in this paper, and that experience is summarized.

1983
A nested transaction mechanism for LOCUS
Atomic transactions are useful in distributed systems as a means of providing reliable operation in the face of hardware failures. Nested transactions are a generalization of traditional transactions in which transactions may be composed of other transactions. The programmer may initiate several transactions from within a transaction, and serializability of the transactions is guaranteed even if they are executed concurrently. In addition, transactions invoked from within a given transaction fail independently of their invoking transaction and of one another, allowing use of alternate transactions to accomplish the desired task in the event that the original should fail. Thus nested transactions are the basis for a general-purpose reliable programming environment in which transactions are modules which may be composed freely. A working implementation of nested transactions has been produced for LOCUS, an integrated distributed operating system which provides a high degree of network transparency. Several aspects of our mechanism are novel. First, the mechanism allows a transaction to access objects directly without regard to the location of the object. Second, processes running on behalf of a single transaction may be located at many sites. Thus there is no need to invoke a new transaction to perform processing or access objects at a remote site. Third, unlike other environments, LOCUS allows replication of data objects at more than one site in the network, and this capability is incorporated into the transaction mechanism. If the copy of an object that is currently being accessed becomes unavailable, it is possible to continue work by using another one of the replicated copies. Finally, an efficient orphan removal algorithm is presented, and the problem of providing continued operation during network partitions is addressed in detail.
Recovery and Reconfiguration
1983
A message system supporting fault tolerance
A simple and general design uses message-based communication to provide software tolerance of single-point hardware failures. By delivering all interprocess messages to inactive backups for both the sender and the destination, both backups are kept in a state in which they can take over for their primaries. An implementation for the Auragen 4000 series of M68000-based systems is described. The operating system, Auros, is a distributed version of UNIX. Major goals have been transparency of fault tolerance and efficient execution in the absence of failure.
1983
Publishing: a reliable broadcast communication mechanism
Publishing is a model and mechanism for crash recovery in a distributed computing environment. Published communication works for systems connected via a broadcast medium by recording messages transmitted over the network. The recovery mechanism can be completely transparent to the failed process and all processes interacting with it. Although published communication is intended for a broadcast network such as a bus, a ring, or an Ethernet, it can be used in other environments. A recorder reliably stores all messages that are transmitted, as well as checkpoint and recovery information. When it detects a failure, the recorder may restart affected processes from checkpoints. The recorder subsequently resends to each process all messages which were sent to it since the time its checkpoint was taken, while ignoring duplicate messages sent by it. Message-based systems without shared memory can use published communications to recover groups of processes. Simulations show that at least 5 multi-user minicomputers can be supported on a standard Ethernet using a single recorder. The prototype version implemented in DEMOS/MP demonstrates that an error recovery can be transparent to user processes and can be centralized in the network.
1983
Process migration in DEMOS/MP
Process migration has been added to the DEMOS/MP operating system. A process can be moved during its execution, and continue on another processor, with continuous access to all its resources. Messages are correctly delivered to the process's new location, and message paths are quickly updated to take advantage of the process's new location. No centralized algorithms are necessary to move a process. A number of characteristics of DEMOS/MP allowed process migration to be implemented efficiently and with no changes to system services. Among these characteristics are the uniform and location independent communication interface, and the fact that the kernel can participate in message send and receive operations in the same manner as a normal process.
Debate
1983
Resolved: That network Transparency is a Bad Idea
Distributed File Access
1983
The TRIPOS filing machine, a front end to a file server
This paper discusses an experiment which sets out to improve the performance of a number of single user computers which rely on a general purpose file server for their filing systems. The background is described in detail in reference [1], but for completeness it is necessary to say something about it here. The Cambridge Distributed Computing System consists, at the time of writing, of between 50 and 60 machines of various types, connected by a digital communications ring. On the ring, there are two file servers [2], [3], which are general purpose (or "universal" [4]) in the sense that they have no commitment to a particular directory or access control structure. This is done in order that they may support several client systems, and so that new systems may be added without difficulty. We speak of a particular directory and access control structure implemented over the file server as "a filing system".
1983
The distributed V kernel and its performance for diskless workstations
The distributed V kernel is a message-oriented kernel that provides uniform local and network interprocess communication. It is primarily being used in an environment of diskless workstations connected by a high-speed local network to a set of file servers. We describe a performance evaluation of the kernel, with particular emphasis on the cost of network file access. Our results show that over a local network: 1. Diskless workstations can access remote files with minimal performance penalty. 2. The V message facility can be used to access remote files at comparable cost to any well-tuned specialized file access protocol. We conclude that it is feasible to build a distributed system with all network communication using the V message facility even when most of the network nodes have no secondary storage.
Experience
1983
Experience with Grapevine: the growth of a distributed system
Grapevine is a distributed, replicated system that provides message delivery, naming, authentication, resource location, and access control services in an internet of computers. The system, described in a previous paper, was designed and implemented several years ago. We now have had operational experience with the system under substantial load. In this paper we report on what we have learned from using Grapevine.
1983
Reflections on the verification of the security of an operating system kernel
This paper discusses the formal verification of the design of an operating system kernel's conformance to the multilevel security property. The kernel implements multiple protection structures to support both discretionary and nondiscretionary security policies. The design of the kernel was formally specified. Mechanical techniques were used to check that the design conformed to the multilevel security property. All discovered security flaws were then either closed or minimized. This paper considers the multilevel security model, the verification methodology, and the verification tools used. This work is significant for two reasons. First, it is for a complete implementation of a commercially available secure computer system. Second, the verification used off-the-shelf tools and was not done by the verification environment researchers.

8th SOSP — December 14-16, 1981 — Pacific Grove, California, USA
1981
Front Matter
Verifying Systems Properties
1981
Proving real-time properties of programs with temporal logic
Wirth [Wi77] categorized programs into three classes. The most difficult type of program to understand and write is a real-time program. Much work has been done in the formal verification of sequential programs, but much remains to be done for concurrent and real-time programs. The critical nature of typical real-time applications makes the validity problem for real-time programs particularly important. Owicki and Lamport [OL80] present a relatively new method for verifying concurrent programs using temporal logic. This paper presents an extension of their work to the area of real-time programs. A model and proof system are presented and their use demonstrated using examples from the literature.
1981
Design and verification of secure systems
This paper reviews some of the difficulties that arise in the verification of kernelized secure systems and suggests new techniques for their resolution. It is proposed that secure systems should be conceived as distributed systems in which security is achieved partly through the physical separation of its individual components and partly through the mediation of trusted functions performed within some of those components. The purpose of a security kernel is simply to allow such a 'distributed' system to actually run within a single processor; policy enforcement is not the concern of a security kernel. This approach decouples verification of components which perform trusted functions from verification of the security kernel. This latter task may be accomplished by a new verification technique called 'proof of separability' which explicitly addresses the security relevant aspects of interrupt handling and other issues ignored by present methods.
Systems
1981
A NonStop kernel
The Tandem NonStop System is a fault-tolerant [1], expandable, and distributed computer system designed expressly for online transaction processing. This paper describes the key primitives of the kernel of the operating system. The first section describes the basic hardware building blocks and introduces their software analogs: processes and messages. Using these primitives, a mechanism that allows fault-tolerant resource access, the process-pair, is described. The paper concludes with some observations on this type of system structure and on actual use of the system.
1981
Observations on the development of an operating system
The development of Pilot, an operating system for a personal computer, is reviewed, including a brief history and some of the problems and lessons encountered during this development. As part of understanding how Pilot and other operating systems come about, an hypothesis is presented that systems can be classified into five kinds according to the style and direction of their development, independent of their structure. A further hypothesis is presented that systems such as Pilot, and many others in widespread use, take about five to seven years to reach maturity, independent of the quality and quantity of the talent applied to their development. The pressures, constraints, and problems of producing Pilot are discussed in the context of these hypotheses.
Remote Data Storage
1981
The Felix File Server
This paper describes Felix - a File Server for an experimental distributed multicomputer system. Felix is designed to support a variety of file systems, virtual memory, and database applications with access being provided by a local area network. Its interface combines block oriented data access with a high degree of crash resistance and a comprehensive set of primitives for controlling data sharing and consistency. An extended set of access modes allows increased concurrency over conventional systems.
1981
A comparison of two network-based file servers
This paper compares two working network-based file servers, the Xerox Distributed File System (XDFS) implemented at the Xerox Palo Alto Research Center, and the Cambridge File Server (CFS) implemented at the Cambridge University Computer Laboratory. Both servers support concurrent random access to files using atomic transactions, both are connected to local area networks, and both have been in service long enough to enable us to draw lessons from them for future file servers.

We compare the servers in terms of design goals, implementation issues, performance, and their relative successes and failures, and discuss what we would do differently next time.

1981
A reliable object-oriented data repository for a distributed computer system
The repository described in this paper is a component of a distributed data storage system for a network of many autonomous machines that might run diverse applications. The repository is a server machine that provides very large, very reliable long-term storage for both private and shared data objects. The repository can handle both very small and very large data objects, and it supports atomic update of groups of objects that might be distributed over several repositories. Each object is represented as a history of its states; in the actual implementation, an object is a list of immutable versions. The core of the repository is stable append-only storage called Version Storage (VS). VS contains the histories of all data objects in the repository as well as all information needed for crash recovery. To maintain the current versions of objects online, a copying scheme was adopted that resembles techniques of real-time garbage collection. VS can be implemented with optical disks.
Computer-Computer Communication
1981
Sequencing computation steps in a network
It is sometimes necessary in the course of a distributed computation to arrange that a certain set of operations is carried out in the correct order and the correct number of times (typically once). If several sets of operations are performed on different machines on the network there is no obvious mechanism for enforcing such ordering constraints in a fully distributed way. This lack basically stems from the difficulty of preventing copying and repetition of messages by machines and from the impossibility of constraining externally the actions of machines in response to messages that come into their hands. This paper presents a possible method for ensuring the integrity of sequences of operations on different machines. The technique may be thought of as a means of enabling machines to ensure that requests made of them are valid and timely, not as means of centralized control of services.
1981
Accent: A communication oriented network operating system kernel
Accent is a communication oriented operating system kernel being built at Carnegie-Mellon University to support the distributed personal computing project, Spice, and the development of a fault-tolerant distributed sensor network (DSN). Accent is built around a single, powerful abstraction of communication between processes, with all kernel functions, such as device access and virtual memory management accessible through messages and distributable throughout a network. In this paper, specific attention is given to system supplied facilities which support transparent network access and fault-tolerant behavior. Many of these facilities are already being provided under a modified version of VAX/UNIX. The Accent system itself is currently being implemented on the Three Rivers Corp. PERQ.
1981
Performing remote operations efficiently on a local computer network
A communication model is described that can serve as a basis for a highly efficient communication subsystem for local networks. The model contains a taxonomy of communication instructions that can be implemented efficiently and can be a good basis for interprocessor communication. These communication instructions, called remote references, cause an operation to be performed by a remote process and, optionally, cause a value to be returned. This paper also presents implementation considerations for a communication system based upon the model and describes an experimental communication subsystem that provides one class of remote references. These remote references take about 150 microseconds or 50 average instruction times to perform on Xerox Alto computers connected by a 2.94 megabit Ethernet.
Memory Management
1981
Converting a swap-based system to do paging in an architecture lacking page-referenced bits
This paper discusses the modifications made to the UNIX operating system for the VAX-11/780 to convert it from a swap-based segmented system to a paging-based virtual memory system. Of particular interest is that the host machine architecture does not include page-referenced bits. We discuss considerations in the design of page-replacement and load-control policies for such an architecture, and outline current work in modeling the policies employed by the system. We describe our experience with the chosen algorithms based on benchmark-driven studies and production system use.
1981
WSCLOCK—a simple and effective algorithm for virtual memory management
A new virtual memory management algorithm WSCLOCK has been synthesized from the local working set (WS) algorithm, the global CLOCK algorithm, and a new load control mechanism for auxiliary memory access. The new algorithm combines the most useful feature of WS—a natural and effective load control that prevents thrashing—with the simplicity and efficiency of CLOCK. Studies are presented to show that the performance of WS and WSCLOCK are equivalent, even if the savings in overhead are ignored.
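A pencil sketch of the scan in Python (hypothetical per-frame bookkeeping; a real kernel folds this into the clock interrupt and fault paths, and must also handle dirty pages): the CLOCK hand sweeps the frames circularly, but a frame is reclaimed only if its owner has not referenced it within the last tau units of the owner's virtual time, which is the working-set test.

    TAU = 100                                 # working-set window, in virtual-time units

    class Frame:
        def __init__(self):
            self.referenced = False           # hardware/software reference bit
            self.last_use = 0                 # virtual time of last observed use
            self.process_vtime = lambda: 0    # owning process's virtual-time clock

    def wsclock_replace(frames, hand):
        """Scan from `hand`; return (victim_index, new_hand)."""
        n = len(frames)
        for step in range(n):
            i = (hand + step) % n
            f = frames[i]
            if f.referenced:
                # Used recently: record the use time, clear the bit, keep the page.
                f.referenced = False
                f.last_use = f.process_vtime()
                continue
            if f.process_vtime() - f.last_use > TAU:
                # Not referenced and outside the owner's working-set window: evict.
                return i, (i + 1) % n
        return None, hand                     # no victim: memory is over-committed,
                                              # which is WSCLOCK's load-control signal

    frames = [Frame() for _ in range(4)]
    for f in frames:
        f.process_vtime = lambda: 500         # owner's virtual time is 500
    frames[2].referenced = True               # frame 2 was just used
    frames[0].last_use = 450                  # frame 0 is still inside the window
    victim, hand = wsclock_replace(frames, hand=0)
    assert victim == 1                        # frame 1: unreferenced and stale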
1981
A study of file sizes and functional lifetimes
The performance of a file system depends strongly on the characteristics of the files stored in it. This paper discusses the collection, analysis and interpretation of data pertaining to files in the computing environment of the Computer Science Department at Carnegie-Mellon University (CMU-CSD). The information gathered from this work will be used in a variety of ways: 1. As a data point in the body of information available on file systems. 2. As input to a simulation or analytic model of a file system for a local network, being designed and implemented at CMU-CSD [1]. 3. As the basis of implementation decisions and parameters for the file system just mentioned. 4. As a step toward understanding how a user community creates, maintains and uses files.
Protection Techniques
1981
Hierarchical Take-Grant Protection systems
The application of the Take-Grant Protection Model to hierarchical protection systems is explored. The proposed model extends the results of Wu [7] and applies the results of Bishop and Snyder [2] to obtain necessary and sufficient conditions for a hierarchical protection graph to be secure. In addition, restrictions on the take and grant rules are developed that ensure the security of all graphs generated by these restricted rules.
1981
Cryptographic sealing for information secrecy and authentication
A new protection mechanism is described that provides general primitives for protection and authentication. The mechanism is based on the idea of sealing an object with a key. Sealed objects are self-authenticating, and in the absence of an appropriate set of keys, only provide information about the size of their contents. New keys can be freely created at any time, and keys can also be derived from existing keys with operators that include Key-And and Key-Or. This flexibility allows the protection mechanism to implement common protection mechanisms such as capabilities, access control lists, and information flow control. The mechanism is enforced with a synthesis of conventional cryptography, public-key cryptography, and a threshold scheme.
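A toy rendering of the sealing interface may help fix the idea: keys are opaque objects, Key-Or is satisfied by any component key and Key-And by all of them, and a sealed object yields its contents only when a satisfying set of keys is presented. The sketch below models the interface symbolically rather than cryptographically, with invented class names; it is not the paper's construction.

    # Toy model of sealing with composite keys (symbolic, not cryptographic).
    class Key:
        def unlocks(self, presented):          # presented: set of base keys held
            return self in presented

    class KeyOr(Key):
        def __init__(self, *parts):
            self.parts = parts
        def unlocks(self, presented):          # any component key suffices
            return any(k.unlocks(presented) for k in self.parts)

    class KeyAnd(Key):
        def __init__(self, *parts):
            self.parts = parts
        def unlocks(self, presented):          # every component key is required
            return all(k.unlocks(presented) for k in self.parts)

    class Sealed:
        def __init__(self, contents, key):
            self._contents, self._key = contents, key
        def unseal(self, presented):
            if not self._key.unlocks(presented):
                raise PermissionError("required keys were not presented")
            return self._contents

    owner, auditor = Key(), Key()
    record = Sealed("payroll record", KeyOr(owner, KeyAnd(owner, auditor)))
    assert record.unseal({owner}) == "payroll record"   # the owner key alone suffices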
The iMAX-432 Operating System
1981
A unified model and implementation for interprocess communication in a multiprocessor environment
This paper describes interprocess communication and process dispatching on the Intel 432. The primary assets of the facility are its generality and its usefulness in a wide range of applications. The conceptual model, supporting mechanisms, available interfaces, current implementations, and absolute and comparative performance are described. The Intel 432 is an object-based multiprocessor. There are two processor types: General Data Processors (GDPs) and Interface Processors (IPs). These processors provide several operating system functions in hardware by defining and using a number of processor-recognized objects and high-level instructions. In particular, they use several types of processor-recognized objects to provide a unified structure for both interprocess communication and process dispatching. One of the prime motivations for providing this level of hardware support is to improve efficiency of these facilities over similar facilities implemented in software. With greater efficiency, they become more practically useful [Stonebraker 81]. The unification allows these traditionally separate facilities to be described by a single conceptual model and implemented by a single set of mechanisms. The 432 model is based on using objects to play roles. The roles are those of requests and servers. In general, a request is a petition for some service and a server is an agent that performs the requested service. Various types of objects are used to represent role-players. The role played by an object may change over time. The type and state of an object determines what role it is playing at any given instant. For any particular class of request, based upon type and state, there is typically a corresponding class of expected server. The request/server model may be applied to a number of common communication situations. In the full paper, several situations are discussed: one-way requestor to server, two-way requestor to server to requestor, nondistinguished requestors, resource source selectivity, nondistinguished servers, and mutual exclusion. While the model embodies most of the essential aspects of the 432's interprocess communication and process dispatching facilities, it leaves a great many practical questions unanswered. The full paper describes our solutions to those problems which often stand between an apparently good model and a successful implementation, namely: binding, queue structure, queuing disciplines, blocking, vectoring, dispatching mixes, and hardware/software cooperation. With an understanding of the mechanisms employed, the paper then reviews the instruction interface to and potential uses of the port mechanism. This instruction interface is provided by seven instructions: SEND, RECEIVE, CONDITIONAL SEND, CONDITIONAL RECEIVE, SURROGATE SEND, SURROGATE RECEIVE, and DELAY. The implementations of the port mechanism are then discussed. The port mechanism is implemented in microcode on both the GDP and IP. Although the microarchitectures differ, in both cases the implementation requires between 600 and 800 lines of vertically encoded 16-bit microinstructions. The corresponding execution times are roughly comparable, with the IP about 20% slower even though most of its microinstructions are twice as slow. Both implementations resulted from the hand translation of the Ada-based algorithms that describe these operations. Finally, the paper characterizes the performance of the 432 port mechanisms and contrasts its performance to other implementations of similar mechanisms. 
Three recently implemented mechanisms were chosen: one implemented completely in software (i.e., the Exchange mechanism of RMX/86 [Intel 80]) and two implemented in a combination of hardware and software (i.e., the StarOS and Medusa mechanisms of Cm* [Jones 80]). To make the comparison as fair as possible, times for each system are normalized to account for differences in their underlying hardware. The normalization factor is called a “tick” (similar to [Lampson 80]). In the full paper, the absolute and normalized performance of these implementations is examined in six different cases: conditional send time, conditional receive time, minimum message transit time, send plus minimum dispatching latency time, non-blocking send time, and blocking receive time. These performance comparisons show a 3 to 7x normalized performance advantage over the software-implemented RMX/86. They show similar normalized performance to the Cm* implementations.
1981
iMAX: A multiprocessor operating system for an object-based computer
The Intel iAPX 432 is an object-based microcomputer which, together with its operating system iMAX, provides a multiprocessor computer system designed around the ideas of data abstraction. iMAX is implemented in Ada and provides, through its interface and facilities, an Ada view of the 432 system. Of paramount concern in this system is the uniformity of approach among the architecture, the operating system, and the language. Some interesting aspects of both the external and internal views of iMAX are discussed to illustrate this uniform approach.
1981
The iMAX-432 object filing system
iMAX is the operating system for Intel's iAPX-432 computer system. The iAPX-432 is an object-oriented multiprocessor architecture that supports capability-based addressing. The object filing system is that part of iMAX that implements a permanent reliable object store. In this paper we describe the key elements of the iMAX object filing system design. We first contrast the concept of an object filing system with that of a conventional file system. We then describe the iMAX design, paying particular attention to five problems that other object filing designs have either solved inadequately or failed to address. Finally, we discuss an effect of object filing on the programming semantics of Ada.
Distributed Systems
1981
The architecture of the Eden system
The University of Washington's Eden project is a five-year research effort to design, build and use an “integrated distributed” computing environment. The underlying philosophy of Eden involves a fresh approach to the tension between these two adjectives. In briefest form, Eden attempts to support both good personal computing and good multi-user integration by combining a node machine / local network hardware base with a software environment that encourages a high degree of sharing and cooperation among its users. The hardware architecture of Eden involves an Ethernet local area network interconnecting a number of node machines with bit-map displays, based upon the Intel iAPX 432 processor. The software architecture is object-based, allowing each user access to the information and resources of the entire system through a simple interface. This paper states the philosophy and goals of Eden, describes the programming methodology that we have chosen to support, and discusses the hardware and kernel architecture of the system.
1981
A distributed UNIX system based on a virtual circuit switch
The popular UNIX™ operating system provides time-sharing service on a single computer. This paper reports on the design and implementation of a distributed UNIX system. The new operating system consists of two components: the S-UNIX subsystem provides a complete UNIX process environment enhanced by access to remote files; the F-UNIX subsystem is specialized to offer remote file service. A system can be configured out of many computers which operate either under the S-UNIX or the F-UNIX operating subsystem. The file servers together present the view of a single global file system. A single-service view is presented to any user terminal connected to one of the S-UNIX subsystems. Computers communicate with each other through a high-bandwidth virtual circuit switch. Small front-end processors handle the data and control protocol for error and flow-controlled virtual circuits. Terminals may be connected directly to the computers or through the switch. Operational since early 1980, the system has served as a vehicle to explore virtual circuit switching as the basis for distributed system design. The performance of the communication software has been a focus of our work. Performance measurement results are presented for user process level and operating system driver level data transfer rates, message exchange times, and system capacity benchmarks. The architecture offers reliability and modularly growable configurations. The communication service offered can serve as the foundation for different distributed architectures.
1981
LOCUS: a network transparent, high reliability distributed system
LOCUS is a distributed operating system that provides a very high degree of network transparency while at the same time supporting high performance and automatic replication of storage. By network transparency we mean that at the system call interface there is no need to mention anything network related. Knowledge of the network and code to interact with foreign sites is below this interface and is thus hidden from both users and programs under normal conditions. LOCUS is application code compatible with Unix, and performance compares favorably with standard, single system Unix. LOCUS runs on a high bandwidth, low delay local network. It is designed to permit both a significant degree of local autonomy for each site in the network while still providing a network-wide, location independent name structure. Atomic file operations and extensive synchronization are supported.

Small, slow sites without local mass store can coexist in the same network with much larger and more powerful machines without larger machines being slowed down through forced interaction with slower ones. Graceful operation during network topology changes is supported.

User-Oriented Systems
1981
Grapevine: An exercise in distributed computing
Grapevine is a multicomputer system on the Xerox research internet. It provides facilities for the delivery of digital messages such as computer mail; for naming people, machines, and services; for authenticating people and machines; and for locating services on the internet. This paper has two goals: to describe the system itself and to serve as a case study of a real application of distributed computing. Part I describes the set of services provided by Grapevine and how its data and function are divided among computers on the internet. Part II presents in more detail selected aspects of Grapevine that illustrate novel facilities or implementation techniques, or that provide insight into the structure of a distributed system. Part III summarizes the current state of the system and the lessons learned from it so far.
1981
BRUWIN: An adaptable design strategy for window manager/virtual terminal systems
With only one process viewable and operational at any moment, the standard terminal forces the user to continually switch between contexts. Yet this is unnatural and counter-intuitive to the normal working environment of a desk where the worker is able to view and base subsequent actions on multiple pieces of information. The window manager is an emerging computing paradigm which allows the user to create multiple terminals on the same viewing surface and to display and act upon these simultaneous processes without loss of context. Though several research efforts in the past decade have introduced window managers, they have been based on the design or major overhaul of a language or operating system; the window manager becomes a focus of—rather than a tool of—the system. While many of the existing implementations provide wide functionality, most implementations and their associated designs are not readily available for common use; extensibility is minimal. This paper describes the design and implementation of BRUWIN, the BRown University WINdow manager, stressing how such a design can be adapted to a variety of computer systems and output devices, ranging from alphanumeric terminals to high-resolution raster graphics displays. The paper first gives a brief overview of the general window manager paradigm and existing examples. Next we present an explanation of the user-level functions we have chosen to include in our general design. We then describe the structure and design of a window manager, outlining the five important parts in detail. Finally, we describe our current implementation and provide a sample session to highlight important features.

7th SOSP — December 10-12, 1979 — Pacific Grove, California, USA
System Analysis and Prediction
1979
A virtual machine emulator for performance evaluation
Virtual machines have long been used for functional testing of operating systems and to obtain the services of multiple operating systems from a single machine. They have not been used for performance evaluation however, because the timing observed by a program executing in a virtual machine is unpredictable and dependent on such factors as system load, real operating system overhead, real scheduling, etc. System level performance evaluation of hardware and operating systems has typically been done by hardware prototyping and dedicated machine benchmarking. This paper introduces the notion of virtual hardware prototyping for system performance evaluation. The approach makes use of the virtual machine concept by adding timing simulation to virtual machine support. The resulting virtual machines reproduce both machine timing and machine architecture. They are then useful for system performance evaluation.
1979
Modelling and analysis of distributed software systems
The problem of modelling and analysing the performance of software structures for distributed computer systems is addressed. A modelling technique is proposed which has many striking analogies with current techniques for evaluating hardware systems and yet focuses attention on the system software. The analysis of such models will draw heavily on the substantial work already done in the analysis of hardware models. To illustrate the modelling technique proposed, it is applied to an investigation of the trade-offs associated with the configuration of critical sections in a distributed software system. Simple queueing techniques are used to model a number of alternative configurations. The study throws some light on the regions of optimum decomposition, and the impact on performance of some of the important design variables.
Shared Data in a Distributed System
1979
WFS: a simple shared file system for a distributed environment
WFS is a shared file server available to a large network community. WFS responds to a carefully limited repertoire of commands that client programs transmit over the network. The system does not utilize connections, but instead behaves like a remote disk and reacts to page-level requests. The design emphasizes reliance upon client programs to implement the traditional facilities (stream IO, a directory system, etc.) of a file system. The use of atomic commands and connectionless protocols nearly eliminates the need for WFS to maintain transitory state information from request to request. Various uses of the system are discussed and extensions are proposed to provide security and protection without violating the design principles.
1979
A client-based transaction system to maintain data integrity
This paper describes a technique for maintaining data integrity that can be implemented using capabilities typically found in existing file systems. Integrity is a property of a total collection of data. It cannot be maintained simply by using reliable primitives for reading and writing single units—the relations between the units are important also. The technique suggested in this paper ensures that data integrity will not be lost as a result of simultaneous access or as a result of crashes at inopportune times. The approach is attractive because of its relative simplicity and its modest demands on the underlying file system. The paper gives a detailed description of how consistent, atomic transactions can be implemented by client processes communicating with one or more file server computers. The discussion covers file structure, basic client operations, crash recovery, and includes an informal correctness proof.
System Construction Primitives
1979
Evaluating synchronization mechanisms
In recent years, many high-level synchronization constructs have been proposed. Each claims to satisfy criteria such as expressive power, ease of use, and modifiability. Because these terms are so imprecise, we have no good methods for evaluating how well these mechanisms actually meet such requirements. This paper presents a methodology for performing such an evaluation. Synchronization problems are categorized according to some basic properties, and this categorization is used in formulating more precise definitions of the criteria mentioned, and in devising techniques for assessing how well those criteria are met.
1979
Primitives for distributed computing
Distributed programs that run on nodes of a network are now technologically feasible, and are well-suited to the needs of organizations. However, our knowledge about how to construct such programs is limited. This paper discusses primitives that support the construction of distributed programs. Attention is focussed on primitives in two major areas: modularity and communication. The issues underlying the selection of the primitives are discussed, especially the issue of providing robust behavior, and various candidates are analyzed. The primitives will ultimately be provided as part of a programming language that will be used to experiment with construction of distributed programs.
1979
Experience with processes and monitors in Mesa
In early 1977 we began to design the concurrent programming facilities of Pilot, a new operating system for a personal computer [5]. Pilot is a fairly large program itself (25,000 lines of Mesa code). In addition, it supports some large applications, ranging from data base management to internetwork message transmission, which are heavy users of concurrency (our experience with some of these applications is discussed in the paper). We intended the new facilities to be used at least for the following purposes: Local concurrent programming: An individual application can be implemented as a tightly coupled group of synchronized processes to express the concurrency inherent in the application. Global resource sharing: Independent applications can run together on the same machine, cooperatively sharing the resources; in particular, their processes can share the processor. Replacing interrupts: A request for software attention to a device can be handled directly by waking up an appropriate process, without going through a separate interrupt mechanism (e.g., a forced branch, etc.).
Information Flow Control
1979
The transfer of information and authority in a protection system
In the context of a capability-based protection system, the term “transfer” is used (here) to refer to the situation where a user receives information when he does not initially have a direct “right” to it. Two transfer methods are identified: de jure transfer refers to the case when the user acquires the direct authority to read the information; de facto transfer refers to the case when the user acquires the information (usually in the form of a copy and with the assistance of others), without necessarily being able to get the direct authority to read the information. The Take-Grant Protection Model, which already models de jure transfers, is extended with four rewriting rules to model de facto transfer. The configurations under which de facto transfer can arise are characterized. Considerable motivational discussion is included.
1979
A mechanism for information control in parallel systems
Denning and Denning have shown how the information security of sequential programs can be certified by a compile-time mechanism [3]. This paper extends their work by presenting a mechanism for certifying parallel programs. The mechanism is shown to be consistent with an axiomatic description of information transmission.
1979
Specification and verification of the UCLA Unix security kernel
Data Secure Unix, a kernel structured operating system, was constructed as part of an ongoing effort at UCLA to develop procedures by which operating systems can be produced and shown secure. Program verification methods were extensively applied as a constructive means of demonstrating security enforcement. Here we report the specification and verification experience in producing a secure operating system. The work represents, to our knowledge, the first significant attempt to verify a large-scale, production level software system including all aspects from initial specification to verification of implemented code.
Distributed Systems: a Spectrum from Multiprocessors to Networks
Local Networks
1979
The behavior of Ethernet-like computer communications networks
Considering the widespread influence of Ethernet, a surprising amount of confusion exists concerning various important aspects of its design. Our objective in writing this paper is to spare future designers of local area networks the searching and speculation in which we were forced to engage. We begin by describing the policies common to Ethernet-like systems and by using an analytic model to study their behavior. We then precisely describe the mechanisms used in Ethernet itself, exploring its detailed behavior by means of a simulation model. Results from the two models, and particularly from their comparison, provide insight into the nature of low-level protocols in local area broadcast networks.
1979
Systems aspects of The Cambridge Ring
The Cambridge Ring is a local communication system developed in the Computer Laboratory of the University of Cambridge. It differs in various respects from some other local communication mechanisms such as Ethernet systems (Metcalfe & Boggs, 1976), and the purpose of the present paper is to describe the way in which the properties of the ring affect the general systems aspects of its exploitation.
Personal Computers and their User Interfaces
1979
Virtual terminal management in a multiple process environment
Rochester's Intelligent Gateway provides its users with the facilities for communicating simultaneously with a large number of processes spread out among various computer systems. We have adopted the philosophy that the user should be able to manage any number of concurrent tasks or jobs, viewing their output on his display device as he desires. To achieve this goal the Virtual Terminal Management System (VTMS) converts a single physical terminal into multiple virtual terminals, each of which may be written to or queried for user input. VTMS extends the features of the physical terminal by providing extensive editing facilities, the capacity to maintain all output in disk-based data structures, and sophisticated mechanisms for the management of screen space. Virtual terminals are device-independent; the specific characteristics of the physical terminal are known only to the lowest-level I/O handlers for that device. VTMS is currently running on a network of six minicomputers supporting various text and raster-graphics displays.
1979
An open operating system for a single-user machine
The file system and modularization of a single-user operating system are described. The main points of interest are the openness of the system, which establishes no sharp boundary between itself and the user's programs, and the techniques used to make the system robust.
1979
Pilot: An operating system for a personal computer
The Pilot operating system is designed for the personal computing environment. It provides a basic set of services within which higher-level programs can more easily serve the user and/or communicate with other programs on other machines. Pilot omits certain functions sometimes associated with “complete” operating systems, such as character-string naming or user-command interpretation; higher-level software provides such facilities as needed. On the other hand, Pilot provides a higher level of service than that normally associated with the “kernel” or “nucleus” of an operating system. Pilot is closely coupled to the Mesa programming language and runs on a rather powerful personal computer, which would have been thought sufficient to support a substantial timesharing system of a few years ago. The primary user interface is a high resolution bit-map display, with a keyboard and a pointing device. Secondary storage generally takes the form of a sizable local disk. A local packet network provides a high bandwidth connection to other personal computers, and to server systems offering such remote services as printing and shared file storage. Much of the design of Pilot stems from an initial set of assumptions and goals rather different from those underlying most timesharing systems. Pilot is a single-language, single-user system, with only limited features for protection and resource allocation. Pilot's protection mechanisms are defensive, rather than absolute, since in a single user system, errors are a more serious problem than maliciousness. Similarly, Pilot's resource allocation features are not oriented toward enforcing fair distribution of scarce resources among contending parties.
Distributed Operating Systems
1979
The Roscoe distributed operating system
Roscoe is an operating system implemented at the University of Wisconsin that allows a network of microcomputers to cooperate to provide a general-purpose computing facility. After presenting an overview of the structure of Roscoe, this paper reports on experience with Roscoe and presents several problems currently being investigated by the Roscoe project.
1979
Medusa: An experiment in distributed operating system structure
The paper is a discussion of the issues that arose in the design of an operating system for a distributed multiprocessor, Cm*. Medusa is an attempt to understand the effect on operating system structure of distributed hardware, and to produce a system that capitalizes on and reflects the underlying architecture. The resulting system combines several structural features that make it unique among existing operating systems.
1979
StarOS, a multiprocessor operating system for the support of task forces
StarOS is a message-based, object-oriented, multiprocessor operating system, specifically designed to support task forces, large collections of concurrently executing processes that cooperate to accomplish a single purpose. StarOS has been implemented at Carnegie-Mellon University for the 50 processor Cm* multi-microprocessor computer.
Domains and Capabilities
1979
In support of domain structure for operating systems
One approach advocated in the search for better designed and more reliable operating systems is to base the design on the use of small protection domains. This paper presents empirical evidence to show that, with a suitable architecture, the overheads associated with using small protection domains do not make this an impractical approach.
1979
Variable-length capabilities as a solution to the small-object problem
A capability system which supports very small objects can achieve flexible and efficient protection. This paper presents a scheme for representing both large and small entities in a computation, down to integers and character strings, as objects. This is achieved by a generalization of tagged memory to encompass extended data types; and by the use of variable-length capabilities, which can be very short if they are close to the object they reference. As developed here, the design assumes a single, systemwide virtual-address space, and a stack architecture; but it could probably be modified for use in other environments.
Distributed Data Management
1979
Polyvalues: A tool for implementing atomic updates to distributed data
The coordination of atomic updates to distributed data is a difficult problem in the design of a distributed information system. A common goal for solutions to this problem is that the failure of a site should not prevent any processing that does not require the data stored at that site. While this goal has been shown to be impossible to achieve all of the time, several approaches have been developed that perform atomic updates such that most site failures do not affect processing at other sites. This paper presents another such approach, one that provides a mechanism by which processing can proceed even if a failure occurs during a critical moment in an atomic update. The solution presented is based on the notion of maintaining several potential current values (a polyvalue) for each database item whose exact value is not known, due to failures interrupting atomic updates. A polyvalue represents the possible set of values that an item could have, depending on the outcome of transactions that have been delayed by failures. Transactions may operate on polyvalues, and in many cases a polyvalue may provide sufficient information to allow the results of a transaction to be computed, even though the polyvalue does not specify an exact value. An analysis and simulation of the polyvalue mechanism shows that the mechanism is suitable for databases with reasonable failure rates and recovery times. The polyvalue mechanism is most useful where prompt processing is essential, but the results that must be produced promptly depend only loosely on the database state. Many applications, such as electronic funds transfer, reservations, and process control, have these characteristics.
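One plausible reading of the polyvalue mechanism, with invented structure, is a value that carries one alternative per unresolved transaction outcome; operations map over the alternatives, and a result is definite whenever every alternative agrees. The Python sketch below illustrates that reading only, not the paper's implementation.

    # Hypothetical polyvalue: one alternative value per unresolved transaction outcome.
    class Polyvalue:
        def __init__(self, alternatives):
            # alternatives: list of (assumption, value), e.g. ("T7 commits", 120)
            self.alternatives = list(alternatives)

        def apply(self, fn):
            """Apply an operation to every alternative; the result is again a polyvalue."""
            return Polyvalue([(a, fn(v)) for a, v in self.alternatives])

        def definite(self):
            """If every alternative agrees, the exact value is known despite the failure."""
            values = {v for _, v in self.alternatives}
            return values.pop() if len(values) == 1 else None

    balance = Polyvalue([("T7 commits", 120), ("T7 aborts", 100)])
    print(balance.apply(lambda v: v + 50).definite())    # None: the outcome still matters
    print(balance.apply(lambda v: v >= 50).definite())   # True under either outcome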
1979
Weighted voting for replicated data
In a new algorithm for maintaining replicated data, every copy of a replicated file is assigned some number of votes. Every transaction collects a read quorum of r votes to read a file, and a write quorum of w votes to write a file, such that r+w is greater than the total number of votes assigned to the file. This ensures that there is a non-null intersection between every read quorum and every write quorum. Version numbers make it possible to determine which copies are current. The reliability and performance characteristics of a replicated file can be controlled by appropriately choosing r, w, and the file's voting configuration. The algorithm guarantees serial consistency, admits temporary copies in a natural way by the introduction of copies with no votes, and has been implemented in the context of an application system called Violet.
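The quorum rule stated above is concrete enough to sketch directly: with per-copy vote counts, any read quorum of r votes and write quorum of w votes with r + w greater than the total must intersect, and the highest version number within a read quorum identifies the current copy. The following minimal Python sketch (invented structure, with no networking, locking, or failure handling) illustrates just that rule.

    # Minimal sketch of weighted-voting reads and writes (structure invented).
    class Copy:
        def __init__(self, votes, version=0, value=None):
            self.votes, self.version, self.value = votes, version, value

    def gather(copies, quorum):
        """Collect copies (in practice, those that respond) until the quorum is met."""
        got, total = [], 0
        for c in copies:
            got.append(c)
            total += c.votes
            if total >= quorum:
                return got
        raise RuntimeError("quorum not available")

    def read(copies, r):
        reps = gather(copies, r)
        return max(reps, key=lambda c: c.version).value   # current copy: highest version

    def write(copies, w, value):
        reps = gather(copies, w)
        new_version = max(c.version for c in reps) + 1
        for c in reps:
            c.version, c.value = new_version, value

    copies = [Copy(2), Copy(1), Copy(1)]   # 4 votes in total
    r, w = 2, 3                            # r + w > 4, so every read quorum sees the write
    write(copies, w, "x = 1")
    print(read(copies, r))                 # -> x = 1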
1979
Implementing atomic actions on decentralized data
In this paper, a new approach to coordinating accesses to shared data objects is described. We have observed that synchronization of accesses to shared data and recovering the state of such data in the case of failures are really two sides of the same problem—implementing atomic actions on multiple data items. We describe a mechanism that solves both problems simultaneously in a way that is compatible with requirements of decentralized systems. In particular, the correct construction of a new atomic action can be done without knowledge of all other atomic actions in the system that might execute concurrently. Further, the mechanisms degrade gracefully if parts of the system fail—only those atomic actions that require resources in failed parts of the system are prevented from executing.
Invited Talk
1979
Commentary

6th SOSP — November 16-18, 1977 — West Lafayette, Indiana, USA
Keynote Address
1977
The Birth and Death of Operating Systems
The CAP System
1977
The Cambridge CAP computer and its protection system
This paper gives an outline of the architecture of the CAP computer as it concerns capability-based protection and then gives an account of how protected procedures are used in the construction of an operating system.
1977
The CAP filing system
The filing system for the CAP is based on the idea of preservation of capabilities: if a program has been able to obtain some capability then it has an absolute right to preserve it for subsequent use. The pursuit of this principle, using capability-oriented mechanisms in preference to access control lists, has led to a filing system in which a preserved capability may be retrieved from different directories to achieve different access statuses, in which the significance of a text name depends on the directory to which it is presented, and in which filing system 'privilege' is expressed by possession of directory capabilities.
1977
The CAP project - an interim evaluation
The CAP project has included the design and construction of a computer with an unusual and very detailed structure of memory protection, and subsequently the development of an operating system which fully exploits the protection facilities. The present paper passes the work in review and draws conclusions about good and bad aspects of the system. The basic architecture of the CAP machine is described in [1] and a largely prospective description of the protection system is given in [2]. The project was started as an experiment in hardware memory protection. A computer was to be designed in which operating system development was easy, in which ruggedness was produced by a much more fine-grained network of firewalls than was (or is) usual, and in which the full range of protection facilities was available to the writers of subsystems. Simplicity of mechanism was a very important goal, although some emphasis was placed on flexibility of protection policy.
The DEMOS System
1977
Task communication in DEMOS
This paper describes the fundamentals and some of the details of task communication in DEMOS, the operating system for the CRAY-1 computer being developed at the Los Alamos Scientific Laboratory. The communication mechanism is a message system with several novel features. Messages are sent from one task to another over links. Links are the primary protected objects in the system; they provide both message paths and optional data sharing between tasks. They can be used to represent other objects with capability-like access controls. Links point to the tasks that created them. A task that creates a link determines its contents and possibly restricts its use. A link may be passed from one task to another along with a message sent over some other link subject to the restrictions imposed by the creator of the link being passed. The link based message and data sharing system is an attractive alternative to the semaphore or monitor type of shared variable based operating system on machines with only very simple memory protection mechanisms or on machines connected together in a network.
1977
The DEMOS file system
This paper discusses the design of the file system for DEMOS, an operating system being developed for the CRAY-1 computer at Los Alamos Scientific Laboratory. The goals to be met, in particular the performance and usability considerations, are outlined. A description is given of the user interface and the general structure of the file system and the file system routines. A simple model of program behavior is used to demonstrate the effect of buffering data by the file system routines. A disk space allocation strategy is described which will take advantage of this buffering. The last section outlines how the performance mechanisms are integrated into the file system routines.
Panel
1977
Capability Systems: the Case for and Against
Issues in Computer Security
1977
The Multics kernel design project
We describe a plan to create an auditable version of Multics. The engineering experiments of that plan are now complete. Type extension as a design discipline has been demonstrated feasible, even for the internal workings of an operating system, where many subtle intermodule dependencies were discovered and controlled. Insight was gained into several tradeoffs between kernel complexity and user semantics. The performance and size effects of this work are encouraging. We conclude that verifiable operating system kernels may someday be feasible.
1977
Proving multilevel security of a system design
Two nearly equivalent models of multilevel security are presented. The use of multiple models permits the utilization of each model for purposes where that model is particularly advantageous. In this case, the more general model is simple and easily comprehensible, being more abstract, and is useful for exposition of the meaning of multilevel security. The less general model relates well to design specifications and permits straightforward proof of the security of a system design. The correspondence between the two models is easily demonstrated. The two models when applied appropriately are more useful for defining and proving the multilevel security of systems than existing models. The utility of the two models and their relationship to existing models is discussed and the proof of the security of one particular system design is illustrated. The technique for accomplishing the security proof is straightforward and can be extensively automated.
1977
Consistency and correctness of duplicate database systems
Solutions to the duplicate database update problem are considered, and a formal validation technique using the theory of L systems is developed and applied to the problem. The paper shows some particular solutions but is primarily concerned with general properties of the problem, convenient representational techniques, and formal proof procedures which are general enough to apply to this and to a number of other problems in parallel processing and synchronization.
Toward Distributed Systems
1977
Measurements of sharing in Multics
There are many good arguments for implementing information systems as distributed systems. These arguments depend on the extent to which interactions between machines in the distributed implementation can be minimized. Sharing among users of a computer utility is a type of interaction that may be difficult to provide in a distributed system. This paper defines a number of parameters that can be used to characterize such sharing. This paper reports measurements that were made on the M.I.T. Multics system in order to obtain estimates of the values of these parameters for that system. These estimates are upper bounds on the amount of sharing and show that although Multics was designed to provide active sharing among its users, very little sharing actually takes place. Most of the sharing that does take place is sharing of system programs, such as the compilers and editors.
1977
Synchronization with eventcounts and sequencers (Extended Abstract)
The design of computer systems to be concurrently used by multiple, independent users requires a mechanism that allows programs to synchronize their use of shared resources. Many such mechanisms have been developed and used in practical applications. Most of the currently favored mechanisms, such as semaphores and monitors are based on the concept of mutual exclusion. In this paper, we describe an alternative synchronization mechanism that is not based on the concept of mutual exclusion, but rather on observing the sequencing of significant events in the course of an asynchronous computation. Two kinds of objects are described, an eventcount, which is a communication path for signalling and observing the progress of concurrent computations, and a sequencer, which assigns an order to events occurring in the system.
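The interface of the two objects is small: an eventcount supports advance, read, and await, and a sequencer hands out totally ordered tickets. The threaded Python sketch below illustrates that interface only; for brevity it is built on a lock and condition variable internally, whereas the original mechanism is precisely an alternative to mutual exclusion.

    # Illustrative eventcount and sequencer interfaces (built on a lock internally).
    import threading

    class Eventcount:
        def __init__(self):
            self._value = 0
            self._cond = threading.Condition()
        def advance(self):
            """Signal that one more event has occurred."""
            with self._cond:
                self._value += 1
                self._cond.notify_all()
        def read(self):
            """Observe the current count (a lower bound by the time it is used)."""
            with self._cond:
                return self._value
        def await_(self, v):
            """Block until the eventcount reaches at least v."""
            with self._cond:
                while self._value < v:
                    self._cond.wait()

    class Sequencer:
        def __init__(self):
            self._next = 0
            self._lock = threading.Lock()
        def ticket(self):
            """Return a unique, totally ordered ticket number."""
            with self._lock:
                t = self._next
                self._next += 1
                return t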
1977
Metric (Extended Abstract): A kernel instrumentation system for distributed environments
Metric is a distributed software measurement system that communicates measurement data over the PARC computer network, the Ethernet. Metric is used to instrument stand-alone and distributed computer systems (it works in an environment of about 90 machines total and is used by about 15 machines). The system is divided into three parts: object system probes that transmit measurement events, the accountant that receives and stores those events, and the analyst that manipulates the data for the user. Measurement events, small packets of standardly formatted measurement data, are used in a way that emphasizes their independence, history and context in a running system. Events are not counts of some system activity; they are a mini-snapshot of the state of the system when some activity begins or ends. In this way they provide context about what is happening in the system, and a succession of events provides a rich history of what has occurred in the system under study. The contextual information intrinsic to an event supports its independence—the event carries with it the information necessary to describe what it is all about. Metric's robustness is a direct consequence of its simplicity: its simple communications protocols and the independence of its parts prevent failures in the Metric system from interfering with the user's object system. Most failures in the object system are unlikely to interfere with the functioning of the Metric system. The standard format of events enables the accountant to receive events from different environments in a straightforward fashion, and makes the job of data handling easier for the analyst. Another advantage of Metric's simplicity is its economy of use: object system probes use about 100 microseconds to transmit data to the analyst. Object systems that use Metric continuously transmit event data. This means the event history log maintained by the accountant can be examined after particularly mysterious crashes to determine what the system had been doing lately. The tripartite division of the analyst into the kernel, utility layer and applications layer simplifies the job of maintenance, use, and extension of the system. The kernel understands event format and acts on behalf of applications to examine data collected by the accountant. The utility layer understands global system structures and language constructs to simplify the job of data analysis and presentation. The application layer is specific code written to answer some particular questions about a system. It is usually quite small and simple. In summary, Metric is unusual because of the way it exploits the Ethernet, its insistence on standardized measurement information, its efforts to make information intelligible to its users, and its extensibility in the face of very different user environments. The isolation of Metric's parts into different machines that communicate over the Ethernet has proven to be a very effective way of achieving a remarkably robust, low cost measurement tool. Metric's emphasis upon the context and history associated with measurements facilitates the use of measurement data.
1977
A domain structure for distributed computer systems
The successful implementation of generalized multiple computer systems will require attention both to the form of physical architecture and to the choice and implementation of a suitable systems environment in which to construct and run applications. This paper argues for the use of a multi-computer physical architecture in preference to a multi-processor architecture, and for dynamic distribution of functions and control as opposed to static allocation of functions and hierarchical control. A systems environment which is based on a domain structure is then described. The domain structure restricts sharing of items. This alleviates the main problem in implementing a capability mechanism to support domains in a system without shared memory, which is that a central table of capabilities is required. It also makes the management of the non shared items easier since they can be required only at one computer at a time. Essential sharing is also handled without central control but at the cost of some complexity. Considerable attention is paid to the handling of interdomain jumps as they provide the opportunity for the dynamic allocation of functions. It is conjectured that the resulting system would be capable of smooth expansion in size from one to twenty five computers. In operation it would exhibit dynamic load balancing as well as having the protection advantages of domain structure.
Program and Memory Policy Performance
1977
Automatic and general solution to the adaptation of programs in a paging environment
The efficiency of replacement algorithms in paged virtual-storage systems depends on the locality of memory references. The restructuring of the blocks which compose the program may improve this locality [HATFIELD and GERALD 71] [MASUDA SHIOTA NOGUCHI and OHKI 74] [FERRARI 76]. By confining this restructuring to the link-editing operation, a general and completely automatic solution has been implemented, in the form of a self-adaptive system, on the SIRIS 8 operating system. A reduction of 40 to 70% in the page fault rate has been obtained.
1977
Effect of program localities on memory management strategies
Programs tend to reference pages unequally and cluster references to certain pages in short time intervals. These properties stem from program reference locality and program phase transitions. The significant effects on system performance arise from the phase transition behavior. However, the phase transition behavior of programs has rarely been taken into account in the analysis of memory management strategies. This paper investigates the effect of the phase transition behavior on total system performance. For this purpose, an elaborate simulation model of multiprogrammed memory management has been developed for a time-sharing environment. The working set strategy and the local LRU strategy are modeled in the simulation system. A simple phase transition model and the simple LRU stack model are used as the program paging behavior model. Two cases are analyzed: (1) locality variations exist and phase transitions occur, and (2) only locality variations exist and phase transitions do not occur. The relations between the phase transition rate and system performance are determined for the above memory management strategies.
1977
Analysis of demand paging policies with swapped working sets
The performance improvements brought by demand paging policies with swapped working-sets depend on several factors, among which the scheduling policy, the behaviour of the programs running in the system and the secondary memory latency characteristics are the most noticeable. We present in this paper a modelling approach to quantify the effects of these factors on the performance of a system running with a swapped working-sets policy. A preliminary analysis, conducted in the virtual time of the programs, shows their influence on the paging behaviour of programs. The results of this analysis are then used within a detailed queueing network of a multiprogrammed system. Computationally simple expressions for the CPU time spent in user state and in supervisor state are obtained for a class of paging policies ranging from pure demand paging to demand paging with swapped working-sets. Numerical examples illustrate the analysis, and these results are compared with measurements made on a real system running with swapped working-sets policies.
Panel
1977
The Role of Performance Modeling in Systems Design
Information Flow and Fault Tolerance
1977
Information transmission in computational systems
This paper presents Strong Dependency, a formalism based on an information theoretic approach to information transmission in computational systems. Using the formalism, we show how the imposition of initial constraints reduces variety in a system, eliminating undesirable information paths. In this way, protection problems, such as the Confinement Problem, may be solved. A variety of inductive techniques are developed useful for proving that such solutions are correct.
1977
On the synthesis and analysis of protection systems
The design of a protection system for an operating system is seen to involve satisfying the competing properties of richness and integrity. Achieving both requires the interplay of analysis and synthesis. Using a formal model from the literature, three designs are developed whose integrity (with the help of the model) can be shown.
1977
Process backup in producer-consumer systems
System state restoration after detection of an error is discussed for producer-consumer systems, with emphasis on the control of the domino effect. Recovery primitives MARK, RESTORE, and PURGE are proposed that, in conjunction with the use of SEND-RECEIVE interprocess communication primitives, allow bounds to be placed on the amount of unnecessary restoration that can occur as a result of system state restoration.
Language and Systems
1977
Indeterminacy, monitors, and dataflow
The work described in this paper began with a desire to include some linguistic concept of a resource manager within a dataflow language we have been designing [AGP76]. In doing so, we discovered that dataflow monitors (resource managers) provide a natural way of thinking about resources and especially their scheduling. Dataflow semantics are based upon a program composed of asynchronous operators interconnected by lines along which data tokens (messages) flow, such that when all of the input tokens for a given operator have arrived then that operator may fire (execute) by absorbing the input tokens, computing, and producing an output token as its result. These operations closely match one's intuitive model of resource managers (operators) passing signals (tokens) to one another for the purpose of synchronizing and scheduling resource usage. Previously though, dataflow languages [D73, K73, W75] have dealt only with the expression of highly asynchronous yet determinate computations; however, resource management characteristically involves indeterminate computation. The introduction here of dataflow monitors and an explicit nondeterministic merge operator for dataflow streams makes dataflow very well suited for expressing interprocess communication and operations on resources.
1977
Thoth, a portable real-time operating system (Extended Abstract)
Thoth is a portable real-time operating system which has been developed at the University of Waterloo. Various configurations of Thoth have been running since May 1976; it is currently running on two minicomputers with quite different architectures (Texas Instruments 990 and Data General NOVA). This research is motivated by the difficulties encountered when moving application programs from one system to another; these difficulties arise when interfacing with the hardware and support software of the target machine. The problems encountered interfacing with the new software support are usually more difficult than those of interfacing with new hardware because of the wide variety of abstract machines presented by the compilers, assemblers, loaders, file systems and operating systems of the various target machines. We have taken the approach of developing portable system software and porting it to “bare” hardware. Because the same system software is used on different hardware, the same abstract machine is available to application programs. Thus most application programs which use Thoth can be portable, if not machine independent. Most previous work on software portability has focused on problems of porting programs over different operating systems as well as hardware. To our knowledge, this is the first time an entire system has been ported. Our experience indicates that this approach is practical both in the cost of porting the system and its time and space performance. The design of Thoth strives for more than portability. A second design goal is to provide a system in which programs can be structured using many small concurrent processes. Thus we have aimed for efficient interprocess communication to make this structuring technique attractive. We have also provided safe dynamic process creation and destruction. A third design goal is that the system meet the demands of real-time applications. To meet this goal, the system guarantees that the worst-case time for response to certain external events (interrupt requests) is bounded by a small constant. A fourth design goal is that the system be adaptable to a wide range of real-time applications. A range of system configurations is possible: A stand-alone application program can use a stripped version of the Thoth kernel which supports dynamic memory allocation and interprocess communication. Such a configuration requires less than 2000 16-bit words of memory. Larger configurations can support process destruction, a device-independent input-output system, a tree-structured file system, and multiple teams of processes. (A team is a set of processes which share the same logical address space and therefore can share data.) Thoth is implemented in a high-level language called Eh (a descendant of BCPL) and a small amount of assembly language. The major job in porting the system seems to be in redesigning the code generation parts of the compiler. Since it appears impractical to design system software to be portable over all computers (even over all existing machines), we have aimed at making Thoth portable over a subset of machines. Machines in the set can be characterized by a set of properties such as: a word must be at least 16 bits in length, a pointer to a word must fit into a word, etc. Roughly, this set of machines includes most modern minicomputers. It is important that many machines which do not yet exist will be included in it. A number of application programs have been written using Thoth. 
In addition to software development tools, communications and real-time control programs have been written. All of these programs require few if any changes when ported to new hardware. Some of these programs have been developed by inexperienced programmers who were not planning on porting their program. Hence, it seems to take less skill to write portable software in this system than using conventional techniques. However, existing software written for other systems is incompatible with Thoth and usually difficult to port to the Thoth system. Although, at the time of this writing, we have limited experience with porting the system to new hardware, we feel that Thoth has been highly successful in terms of our original objectives. Among other things, it has partially demonstrated the feasibility of building a portable operating system for a specified class of machines.
1977
Beyond concurrent Pascal
We take the view that operating systems should not be written in assembly language. Alternatives are machine oriented high-level languages and “safe” languages in the style of Concurrent Pascal and MODULA. A serious drawback of the Concurrent Pascal approach is the fact that those very language features that pertain to operating systems must be implemented separately, using some other language. A technique is presented which solves this problem. This technique is based on user-defined trap handling. It is exhibited by demonstrating how virtual memory systems can be constructed using Concurrent Pascal and how process management can be moved from the kernel to the Concurrent Pascal program. We demonstrate that a fundamental solution of the difficulties with Concurrent Pascal, MODULA, and similar languages cannot be found in going back to classical implementation languages, but in designing languages that are not rich with special features, but powerful with respect to extension and shrinkage.
Invited Talk
1977
The left-handed least unlikely last page-removal selection algorithm

5th SOSP — November 19-21, 1975 — Austin, Texas, USA
Virtual Memory Algorithms
1975
How to evaluate page replacement algorithms
The designer of a virtual memory system can obtain accurate estimates of the average memory requirements of programs running in the system by weighting the average allocation during execution intervals with the average allocation during page waiting intervals. We show how to combine the averages, how to use the measure to determine the size of primary memory while achieving system balance between memory and processor demands, and how to partially order the performance of paging algorithms.
1975
MIN—an optimal variable-space page replacement algorithm
A criterion for comparing variable space page replacement algorithms is presented. An optimum page replacement algorithm, called VMIN, is described and shown to be optimum with respect to this criterion. The results of simulating VMIN, Denning's working set, and the page partitioning replacement algorithms on five virtual memory programs are presented to demonstrate the improvement possible over the known realizable variable space algorithms.
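For concreteness, here is a minimal Python sketch of the VMIN rule as it is usually described: at each reference a page is retained only if its next reference falls within a window of tau future references. VMIN needs full lookahead, so it is an unrealizable baseline rather than a practical policy; the function name and structure below are ours, not the paper's.

def vmin_faults(refs, tau):
    """Count page faults under the VMIN variable-space rule.

    refs: list of page identifiers (the full reference string).
    tau:  window parameter; a page is kept only until its next
          reference if that reference is at most tau steps away.
    """
    # Precompute, for each position, the time of the next reference
    # to the same page (or None if it is never referenced again).
    next_ref = [None] * len(refs)
    last_seen = {}
    for t in range(len(refs) - 1, -1, -1):
        p = refs[t]
        next_ref[t] = last_seen.get(p)
        last_seen[p] = t

    resident = {}          # page -> time until which it stays resident
    faults = 0
    for t, p in enumerate(refs):
        if resident.get(p, -1) < t:      # page not resident: fault
            faults += 1
        # Decide how long to keep the page after this reference.
        if next_ref[t] is not None and next_ref[t] - t <= tau:
            resident[p] = next_ref[t]    # keep until its next use
        else:
            resident.pop(p, None)        # discard immediately
    return faults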
1975
Analysis of the PFF replacement algorithm via a semi-Markov model
An analytical model is presented to estimate the performance of the Page Fault Frequency (PFF) replacement algorithm. In this model, program behavior is represented by the LRU stack distance model and the PFF replacement algorithm is represented by a semi-Markov model. Using these models, parameters such as the inter-page-fault interval distribution and the probability of the number of distinct pages referenced during an inter-page-fault interval, among others, can be determined analytically. Evaluating these parameter values permits study of the performance of the replacement algorithm by simulating the page fault events rather than every page reference event. This significantly reduces the computation time required to estimate the performance of the PFF algorithm.
1975
An analysis of the performance of the page fault frequency (PFF) replacement algorithm.
Most replacement algorithms devised and implemented depend heavily on program behavior; in other words, to select the parameters of these algorithms optimally, the program behavior, or at least a probability model of it, should be known. The page fault frequency (PFF) algorithm adapts to dynamic changes in program behavior during execution. Its performance is therefore expected to be less dependent on prior knowledge of the program behavior and input data. The PFF algorithm uses the measured page fault frequency (obtained by monitoring the inter-page-fault interval) as the basic parameter for the memory allocation decision process. In order to analyze the performance of the PFF algorithm, a mathematical model was developed. The resultant random process is the memory space allocation for a program as a function of processor time (virtual time). This random process can be analyzed using the method of imbedded Markov chains. The parameters obtained from this analysis are the distributions of the memory allocation during processing intervals and during page waiting intervals, the average page fault rate, and the expected space-time product accumulated by the program. The input parameters for the model were obtained from address traces of two programs. The results of the model were validated by simulation.
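A minimal sketch of the PFF control rule as it is commonly described (the parameter names and structure are ours, not the paper's model): on each fault the inter-fault interval is compared against a threshold; frequent faults grow the resident set, infrequent faults shrink it to the pages used since the previous fault.

def pff_simulate(refs, threshold):
    """Simulate a simple page-fault-frequency (PFF) policy.

    refs:      reference string (list of page identifiers), one
               reference per unit of virtual time.
    threshold: critical inter-fault interval T.
    Returns (number of faults, list of resident-set sizes over time).
    """
    resident = set()
    used_since_fault = set()
    last_fault_time = 0
    faults = 0
    sizes = []
    for t, p in enumerate(refs):
        if p not in resident:
            faults += 1
            if t - last_fault_time >= threshold:
                # Faults are infrequent: release pages not referenced
                # since the previous fault, then admit the new page.
                resident = set(used_since_fault)
            resident.add(p)
            used_since_fault = set()
            last_fault_time = t
        used_since_fault.add(p)
        sizes.append(len(resident))
    return faults, sizes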
Protection and Security
1975
On protection in operating systems
A model of protection mechanisms in computing systems is presented and its appropriateness is demonstrated. The “safety” problem for protection systems under our model is to determine in a given situation whether a subject can acquire a particular right to an object. In restricted cases, one can show that this problem is decidable, i.e., there is an algorithm to determine whether a system in a particular configuration is safe. In general, and under surprisingly weak assumptions, one cannot decide if a situation is safe. Various implications of this fact are discussed.
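To make the model concrete, here is a small Python sketch (our own illustration, not the paper's formalism) of an access matrix together with one guarded command that conditionally enters a right. The safety question asks whether any sequence of such commands can place a given right in a given cell; the paper shows that in general this question is undecidable.

# Access matrix: matrix[(subject, obj)] is the set of rights the
# subject holds for the object.
matrix = {
    ("alice", "file1"): {"own", "read"},
    ("bob",   "file1"): set(),
}

def enter_right(subject, obj, right):
    """Primitive operation: enter a right into one cell of the matrix."""
    matrix.setdefault((subject, obj), set()).add(right)

def confer_read(owner, friend, obj):
    """A guarded command: an owner of obj may grant 'read' on it."""
    if "own" in matrix.get((owner, obj), set()):
        enter_right(friend, obj, "read")

confer_read("alice", "bob", "file1")
print(matrix[("bob", "file1")])   # {'read'}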
1975
Security Kernel validation in practice
A security kernel is a software and hardware mechanism that enforces access controls within a computer system. The correctness of a security kernel on a PDP-11/45 is being proved. This paper describes the technique used to carry out the first step of the proof: validating a formal specification of the program with respect to axioms for a secure system.
1975
Engineering a security kernel for Multics
This paper describes a research project to engineer a security kernel for Multics, a general-purpose, remotely accessed, multiuser computer system. The goals are to identify the minimum mechanism that must be correct to guarantee computer enforcement of desired constraints on information access, to simplify the structure of that minimum mechanism to make verification of correctness by auditing possible, and to demonstrate by test implementation that the security kernel so developed is capable of supporting the functionality of Multics completely and efficiently. The paper presents the overall viewpoint and plan for the project and discusses initial strategies being employed to define and structure the security kernel.
Case Studies
1975
MERT - a multi-environment real-time operating system
MERT is a multi-environment real-time operating system for the Digital Equipment PDP-11/45 and 11/70 computers. It is a structured operating system built on top of a kernel which provides the basic services such as memory management, process scheduling, and trap handling needed to build various operating system environments. Real-time response to processes is achieved by means of preemptive priority scheduling. The file system structure is optimized for real-time response. Processes are built as modular entities with data structures that are independent of all other processes. Interprocess communication is achieved by means of messages, event flags, shared segments, and shared files. Process ports are used for communication between unrelated processes.
1975
Dynamic linking and environment initialization in a multi-domain process.
As part of an effort to engineer a security kernel for Multics, the dynamic linker has been removed from the domain of the security kernel. The resulting implementation of the dynamic linking function requires minimal security kernel support and is consistent with the principle of least privilege. In the course of the project, the dynamic linker was found to implement not only a linking function, but also an environment initialization function for executing procedures. This report presents an analysis of dynamic linking and environment initialization in a multi-domain process, isolating three sets of functions requiring different sets of access privileges. A design based on this decomposition of the dynamic linking and environment initialization functions is presented.
1975
Architecture of a real time operating system
Architecture is receiving increasing recognition as a major design factor in operating systems development, one which contributes to the clarity and modifiability of the completed system. The MOSS Operating System uses an architecture based on hierarchical levels of system functions overlaid dynamically by asynchronous cooperating processes carrying out the system activities. Since efficient operation in a real time environment requires that the number of processes and process switches be kept to a minimum, the MOSS system uses processes only where a truly asynchronous activity is identified. The layers of the MOSS Operating System do not represent a hierarchical structure of virtual machine processes, but rather a hierarchy of functions used to create the processes. This paper describes the layering concepts and process concepts defining the system architecture. It also presents an overview of the specific functions and processes of the MOSS Operating System.
1975
Reflections on an operating system design
The main features of a general purpose multiaccess operating system developed for the CDC 6400 at Berkeley are presented, and its good and bad points are discussed as they appear in retrospect. Distinctive features of the design were the use of capabilities for protection, and the organization of the system into a sequence of layers, each building on the facilities provided by earlier ones and protecting itself from the malfunctions of later ones. There were serious problems in maintaining the protection between layers when levels were added to the memory hierarchy; these problems are discussed and a new solution is described.
Network Operating Systems
1975
The network Unix system
A Network Interface Program (NIP) is that part of an operating system which interfaces with similar entities in a network. Normally, the NIP is a collection of software routines which implement interprocess communication, interhost protocols, data flow controls, and other necessary executive functions. This paper discusses the organization of the NIP currently being used with the Unix operating system on the ARPA network. The Network Unix system is noteworthy because of the natural way that network and local functions are merged. As a result the network appears as a logical extension to the local system - from the point of view of both the interactive terminal and user program.
1975
Some constraints and tradeoffs in the design of network communications
A number of properties and features of interprocess communication systems are presented, with emphasis on those necessary or desirable in a network environment. The interactions between these features are examined, and the consequences of their inclusion in a system are explored. Of special interest are the time-out feature which forces all system table entries to “die of old age” after they have remained unused for some period of time, and the insertion property which states that it is always possible to design a process which may be invisibly inserted into the communication path between any two processes. Though not tied to any particular system, the discussion concentrates on distributed systems of sequential processes (no interrupts) with no system buffering.
1975
An operational system for computer resource sharing
Users and administrators of a small computer often desire more service than it can provide. In a network environment additional services can be provided to the small computer, and in turn to the users of the small computer, by one or more other computers. An operational system for providing such “resource sharing” is described; some “fundamental principles” are abstracted from the experience gained in constructing the system; and some generalizations are suggested.
Virtual Machines
1975
Sharing data and services in a virtual machine system
Experimental additions have been made to a conventional virtual machine system (VM/370) in order to support a centralized program library management service for a group of interdependent users. These additions enable users to share read/write access to a data base as well as processing services. Although the primary motivation was the enhancement of performance, considerable attention was given to retaining the inherent advantages of the virtual machine system. Extended applications of the basic technique are described and the implications of such extensions on operating system design are considered.
1975
Formal properties of recursive Virtual Machine architectures.
A formal model of hardware/software architectures is developed and applied to Virtual Machine Systems. Results are derived on the sufficient conditions that a machine architecture must satisfy in order to support VM systems. The model deals explicitly with resource mappings (protection) and with I/O devices. Some previously published results are recovered, and other, more general results are obtained.
1975
The PDP-11 virtual machine architecture: A case study
At UCLA, a virtual machine system prototype has been constructed for the Digital Equipment Corporation PDP-11/45. In order to successfully implement that system, a number of hardware changes have been necessary. Some overcome basic inadequacies in the original hardware for this purpose, and others enhance the performance of the virtual machine software. Steps in the development of the modified hardware architecture, as well as relevant aspects of the software structure, are discussed. In addition, a case study of interactions between hardware and software developments is presented, together with conclusions motivated by that experience.
Correctness and Reliability
1975
Proving monitors
Interesting scheduling and sequential properties of monitors can be proved by using state variables which record the monitors' history and by defining extended proof rules for their wait and signal operations. These two techniques are defined, discussed, and applied to examples to prove properties such as freedom from indefinitely repeated overtaking or unnecessary waiting, upper bounds on queue lengths, and historical behavior.
1975
Verifying properties of parallel programs: an axiomatic approach
An axiomatic method for proving a number of properties of parallel programs is presented. Hoare has given a set of axioms for partial correctness, but they are not strong enough in most cases. This paper defines a more powerful deductive system which is in some sense complete for partial correctness. A crucial axiom provides for the use of auxiliary variables, which are added to a parallel program as an aid to proving it correct. The information in a partial correctness proof can be used to prove such properties as mutual exclusion, freedom from deadlock, and program termination. Techniques for verifying these properties are presented and illustrated by application to the dining philosophers problem.
1975
Error resynchronization in producer-consumer systems
This paper is concerned with error processing for parallel producer-consumer interactions such as encountered in the design of multi-process operating systems. Solutions to resynchronization problems that occur when a consumer process detects errors in information received from a producer process are presented. Fundamental properties of this error processing are discussed. It is shown that explicit error processing results in an increase in program complexity and a decrease in the ease of understanding a program.
System Design
1975
A multi-microprocessor computer system architecture
The development of microprocessors has suggested the design of distributed processing and multiprocessing computer architectures. A computer system design incorporating these ideas is proposed, along with its impact on memory management and process control aspects of the system's operating system. The key design feature is to identify system processes with microprocessors and interconnect them in a hierarchy constructed to minimize intercommunication requirements.
1975
Control of processes in operating systems: the Boss-Slave relation
This paper describes a boss-slave relationship between processes, different from the normal relationships between processes, which is useful for a number of purposes. These purposes include debugging of programs, analysis of process behavior, control of a process for security purposes, and simulation of a different operating system for a slave process. This mechanism was easily added to one operating system, and the ideas apply to a number of different operating systems.
1975
Modularization and hierarchy in a family of operating systems
This paper describes the design philosophy used in the construction of a family of operating systems. It is shown that the concepts of module and level do not coincide in a hierarchy of functions. Family members can share much software as a result of the implementation of run-time modules at the lowest system level.
The Hydra Operating System
1975
Overview of the Hydra operating system development
An overview of the hardware and philosophic context in which the Hydra design was done is presented. The construction methodology is discussed, together with some data which suggest the success of this methodological approach.
1975
Policy/mechanism separation in Hydra
The extent to which resource allocation policies are entrusted to user-level software determines in large part the degree of flexibility present in an operating system. In Hydra the determination to separate mechanism and policy is established as a basic design principle and is implemented by the construction of a kernel composed (almost) entirely of mechanisms. This paper presents three such mechanisms (scheduling, paging, protection) and examines how external policies which manipulate them may be constructed. It is shown that the policy decisions which remain embedded in the kernel exist for the sole purpose of arbitrating conflicting requests for physical resources, and then only to the extent of guaranteeing fairness.
1975
Protection in the Hydra Operating System
This paper describes the capability based protection mechanisms provided by the Hydra Operating System Kernel. These mechanisms support the construction of user-defined protected subsystems, including file and directory subsystems, which do not therefore need to be supplied directly by Hydra. In addition, we discuss a number of well known protection problems, including Mutual Suspicion, Confinement and Revocation, and we present the mechanisms that Hydra supplies in order to solve them.
Processor Scheduling
1975
Computational processor demands of Algol-60 programs
The characteristics of computational processor requirements of a sample of Algol-60 programs have been measured. Distributions are presented for intervals of processor activity as defined by input-output requests and segment allocation requests occurring within the Johnston contour model and within a stack model in which array allocations are treated separately. The results provide new empirical data concerning the behavior of this class of programs. Some implications of the empirical results which may influence computer system design and performance are presented and discussed.
1975
Scheduling partially ordered tasks with probabilistic execution times
The objective of this paper is to relate models of multi-tasking in which task times are known or known to be equal to models in which task times are unknown. We study bounds on completion times and the applicability of optimal deterministic schedules to probabilistic models. Level algorithms are shown to be optimal for forest precedence graphs in which task times are independent and identically distributed exponential or Erlang random variables. A time sharing system simulation shows that multi-tasking could reduce response times and that response time is insensitive to multi-tasking scheduling disciplines.
1975
Analysis of a level algorithm for preemptive scheduling
Muntz and Coffman give a level algorithm that constructs optimal preemptive schedules on identical processors when the task system is a tree or when there are only two processors. A variation of their algorithm is adapted for processors of different speeds. The algorithm is shown to be optimal on two processors for arbitrary task systems, but not on three or more processors, even for trees. Taking the algorithm as a heuristic on m processors and using the ratio of the lengths of the constructed and optimal schedules as a measure, we show that, on identical processors, its performance is bounded by 2 - 2/m. Moreover, 2 - 2/m is a best bound in that there exist task systems for which this ratio is approached arbitrarily closely. On processors of different speeds, we derive an upper bound on its performance in terms of the speeds of the given processor system and show that √(1.5m) is an upper bound over all processor systems. We also give an example of a system for which the bound √m/(2√2) can be approached asymptotically, thus establishing that the √(1.5m) bound can at most be improved by a constant factor.
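Restating the abstract's bounds in standard notation (the symbols are ours): with ω(A) the length of the schedule the level algorithm produces on m processors and ω(OPT) the length of an optimal schedule,

\[
\frac{\omega(A)}{\omega(\mathrm{OPT})} \;\le\; 2 - \frac{2}{m} \quad \text{(identical processors)},
\qquad
\frac{\omega(A)}{\omega(\mathrm{OPT})} \;\le\; \sqrt{1.5\,m} \quad \text{(processors of different speeds)};
\]

the first bound is approached arbitrarily closely by some task systems, and the second can be improved by at most a constant factor.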
1975
Selecting a scheduling rule that meets pre-specified response time demands
In this paper we study the problem of designing scheduling strategies when the demand on the system is known and waiting time requirements are pre-specified. This important synthesis problem has received little attention in the literature, and contrasts with the common analytical approach to the study of computer service systems. This latter approach contributes only indirectly to the problem of finding satisfactory scheduling rules when the desired (or required) response-time performance is specifiable in advance.
Security and Protection
1975
A comment on the confinement problem
The confinement problem, as identified by Lampson, is the problem of assuring that a borrowed program does not steal for its author information that it processes for a borrower. An approach to proving that an operating system enforces confinement, by preventing borrowed programs from writing information in storage in violation of a formally stated security policy, is presented. The confinement problem presented by the possibility that a borrowed program will modulate its resource usage to transmit information to its author is also considered. This problem is manifest by covert channels associated with the perception of time by the program and its author; a scheme for closing such channels is suggested. The practical implications of the scheme are discussed.
1975
The enforcement of security policies for computation
Security policies define who may use what information in a computer system. Protection mechanisms are built into a system to enforce security policies. In most systems, however, it is quite unclear what policies a mechanism can or does enforce. This paper defines security policies and protection mechanisms precisely and bridges the gap between them with the concept of soundness: whether a protection mechanism enforces a policy. Different sound protection mechanisms for the same policy can then be compared. We also show that the “union” of mechanisms for the same program produces a more “complete” mechanism. Although a “maximal” mechanism exists, it cannot necessarily be constructed.
1975
A lattice model of secure information flow
This paper investigates mechanisms that guarantee secure information flow in a computer system. These mechanisms are examined within a mathematical framework suitable for formulating the requirements of secure information flow among security classes. The central component of the model is a lattice structure derived from the security classes and justified by the semantics of information flow. The lattice properties permit concise formulations of the security requirements of different existing systems and facilitate the construction of mechanisms that enforce security. The model provides a unifying view of all systems that restrict information flow, enables a classification of them according to security objectives, and suggests some new approaches. It also leads to the construction of automatic program certification mechanisms for verifying the secure flow of information through a program.
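As a small illustration (ours, not the paper's notation), security classes can be modelled as pairs of a clearance level and a category set; information may flow from class a to class b exactly when a ⊑ b, and the join of two classes is the least class into which both may flow.

LEVELS = {"unclassified": 0, "confidential": 1, "secret": 2}

def can_flow(a, b):
    """True if information may flow from class a to class b (a <= b in the lattice)."""
    (lev_a, cats_a), (lev_b, cats_b) = a, b
    return LEVELS[lev_a] <= LEVELS[lev_b] and cats_a <= cats_b

def join(a, b):
    """Least upper bound: the lowest class both a and b may flow into."""
    (lev_a, cats_a), (lev_b, cats_b) = a, b
    lev = max(lev_a, lev_b, key=lambda l: LEVELS[l])
    return (lev, cats_a | cats_b)

src = ("confidential", frozenset({"nuclear"}))
dst = ("secret", frozenset({"nuclear", "crypto"}))
print(can_flow(src, dst))                   # True
print(join(src, ("secret", frozenset())))   # ('secret', frozenset({'nuclear'}))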
Memory Measurement and Modelling
1975
Characteristics of program localities
The term “locality” has been used to denote that subset of a program’s segments which are referenced during a particular phase of its execution. A program’s behavior can be characterized in terms of its residence in localities of various sizes and lifetimes, and the transitions between these localities. In this paper the concept of a locality is made more explicit through a formal definition of what constitutes a phase of localized reference behavior, and by a corresponding mechanism for the detection of localities in actual reference strings. This definition provides for the existence of a hierarchy of localities at any given time, and the reasonableness of the definition is supported by examples taken from actual programs. Empirical data from a sample of production Algol 60 programs is used to display distributions of locality sizes and lifetimes, and these results are discussed in terms of their implications for the modeling of program behavior and memory management in virtual memory systems.
1975
A study of program locality and lifetime functions
A program model can be regarded as decomposable into two main parts. The macromodel captures the phase-transition behavior by specifying locality sets and their associated reference intervals (phases). The micromodel captures the reference patterns within phases. A semi-Markov model can be used at the macro level, while one of the simple early models (such as the random-reference or LRU stack) can be used at the micro level. This paper shows that, even in simplest form, this type of model is capable of reproducing known properties of empirical lifetime functions. A micromodel, alone without a macromodel, is incapable of doing so.
1975
Models of memory scheduling
Queueing theoretic models of single and multi-processor computer systems have received wide attention in the computer science literature. Few of these models consider the effect of finite memory size of a machine and its impact on the memory scheduling problem. In an effort to formulate an analytical model for memory scheduling we propose four simple models and examine their characteristics using simulation. In this paper, we discuss some interesting results of these simulations.

4th SOSP — October 15-17, 1973 — Yorktown Heights, New York, USA
Systems I
1973
Interprocess communication in real-time systems
A variety of solutions have been proposed for ensuring data integrity in nonreal-time systems (i.e. batch or on-line systems). A brief review is made of some of the techniques employed in these solutions. It is indicated why the data integrity problem is different in a real-time system than in a nonreal-time system. Two models of interprocess communication are presented and it is demonstrated that the models are sufficient to preserve data integrity in a real-time system.
1973
An experimental implementation of the kernel/domain architecture
As part of its effort to periodically investigate various new promising concepts and techniques, the Digital Equipment Corporation has sponsored a research project whose purpose it was to effect a limited implementation of a protective operating system framework, based on the kernel/domain architecture which has increasingly been propounded in recent years. The project was carried out in 1972, and its successful completion has led to a substantial number of observations and insights. This paper reports on the more significant ones, specifically: 1) the techniques used in mapping a conceptual model onto commercially available hardware (the PDP-11/45 mini-computer), 2) the domain's memory mapping properties, and their impact on programming language storage-class semantics, 3) this architecture's impact on the apparent simplification of various traditionally-complex operating systems monitor functions, and 4) the promise this architecture holds in terms of increased functional flexibility for future-generation geodesic operating systems.
1973
On the structure and control of commands
An interactive command language, with its underlying data, defines a command environment. In general a command environment supports a number of commands which once issued perform non-interactively, and which when finished leave the old command environment in control. It also supports some special commands which move to other command environments, after which commands are interpreted according to a different set of rules. The usefulness of a command environment can be extended by programming it, i.e. by dynamically constructing and conditionally executing sequences of its commands; but, unlike a programming language, a command language does not usually contain any general-purpose variables or means for conditional execution. These facilities can however be provided by a command control language, which makes it possible to construct sequences of commands to be issued to the currently active command environment from a program. A command control language is described, and the usefulness, limitations and repercussions of command language programming are discussed.
Systems II
1973
The UNIX time-sharing system
UNIX is a general-purpose, multi-user, interactive operating system for the Digital Equipment Corporation PDP-11/40 and 11/45 computers. It offers a number of features seldom found even in larger operating systems, including 1. A hierarchical file system incorporating demountable volumes, 2. Compatible file, device, and inter-process I/O, 3. The ability to initiate asynchronous processes, 4. System command language selectable on a per-user basis, 5. Over 100 subsystems including a dozen languages. This paper discusses the usage and implementation of the file system and of the user command interface.
1973
ARGOS: An operating system for a computer utility supporting interactive instrument control
“ARGOS” (ARGonne Operating System), which runs on a Xerox Sigma 5 hardware configuration, provides a dynamic multiprogrammed environment which supports the following: data acquisition and interactive control for numerous (currently 19) independently running on-line laboratory experiments; three interactive graphics terminals; FORTRAN IV-H executing at each of 23 remote time-shared terminals; a jobstream from open-shop batch processing; long-term low priority computations (100 CPU hours). The system guarantees the protection of each user's interests by utilizing the hardware memory-protection feature and internal clocks and by disallowing the execution of privileged instructions by user programs. The system is interrupt-driven, with each task running to completion, contingent on its priority. System resources are provided on a first come first served basis, except that rotating memory is queued by request position. System CALLs provide users full access to hardware capability, thus providing user-directed file formats and insuring a minimum of system overhead. However, at some sacrifice in overhead, the user can make use of FORTRAN record-blocking. Core memory, disk space and magnetic tape usage are assigned dynamically. Parametrization of the system is such that terminal characteristics are specified (one parameter card/terminal) at boot-in time (once/week after preventive maintenance).
1973
Multiprocessor self diagnosis, surgery, and recovery in air terminal traffic control
The rapid growth of global aviation for business and pleasure has created the need for automated terminal systems of increasing complexity and capability. Continued increases in the aircraft population will require higher levels of automation. Sperry Univac is responding to this challenge with a multiprocessing system, including hardware and software, currently under development which will enable controllers to safely manage the crowded skies.
1973
Online system performance measurements with software and hybrid monitors
Two monitors were implemented to collect information about the behavior of the online system developed and run at Stanford. The response of this online system was slow and main memory was a critical resource. The goal was to extract desired information by a method that requires only a negligible amount of monitored system resources. Results presented in this paper indicate that this effort has been successful. A software monitor that requires less than 700 bytes of main memory collects statistics about utilization of special online system resources and about the scheduler mechanism, detects system deadlocks, and measures online executive overhead. This software monitor helped to discover various facts about the system behavior; however, to understand the reasons behind certain situations, it was necessary to learn more about properties of individual terminal tasks. Since a software monitor would cause an intolerable system degradation and hardware monitoring is not directly applicable for such measurements, a hardware/software monitor interface was implemented which enables recording of software events by a hardware monitor. The monitoring artifact is thus kept close to zero. This technique has been applied to measure time a task spends in various states and it has many other uses.
Memory Management
1973
Minimal-total-processing-time drum and disk scheduling disciplines
This article investigates the application of minimal-total-processing-time (MTPT) scheduling disciplines to rotating storage units when random arrival of requests is allowed. Fixed-head drum and moving-head disk storage units are considered and particular emphasis is placed on the relative merits of the MTPT scheduling discipline with respect to the shortest-latency-time-first (SLTF) scheduling discipline. The results of the simulation studies presented show that neither scheduling discipline is unconditionally superior to the other. For most fixed-head drum applications the SLTF discipline is preferable to MTPT, but for intra-cylinder disk scheduling the MTPT discipline offers a distinct advantage over the SLTF discipline. An implementation of the MTPT scheduling discipline is discussed and the computational requirements of the algorithm are shown to be comparable to SLTF algorithms. In both cases, the sorting procedure is the most time consuming phase of the algorithm.
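For reference, a minimal sketch (ours) of the SLTF discipline against which MTPT is compared: among the pending requests, serve the one whose start position is angularly closest ahead of the current head position on the rotating drum.

def sltf_next(head_pos, requests):
    """Shortest-latency-time-first selection on a fixed-head drum.

    head_pos: current angular position of the heads, in [0, 1).
    requests: list of (request_id, start_angle) pairs, start_angle in [0, 1).
    Returns the id of the pending request with the smallest rotational
    latency from the current position.
    """
    def latency(start_angle):
        return (start_angle - head_pos) % 1.0
    req_id, _ = min(requests, key=lambda r: latency(r[1]))
    return req_id

print(sltf_next(0.9, [("a", 0.10), ("b", 0.85), ("c", 0.95)]))  # 'c'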
1973
Optimal folding of a paging drum in a three level memory system
This paper describes a drum space allocation and accessing strategy called “folding”, whereby effective drum storage capacity can be traded off for reduced drum page fetch time. A model for the “folded drum” is developed and an expression is derived for the mean page fetch time of the drum as a function of the degree of folding. In a hypothetical three-level memory system of primary (directly addressable), drum, and tertiary (usually disk) memories, the tradeoffs among drum storage capacity, drum page fetch time, and page fetch traffic to tertiary memory are explored. An expression is derived for the mean page fetch time of the combined drum-tertiary memory system as a function of the degree of folding. Measurements of the MULTICS three-level memory system are presented as examples of improving multi-level memory performance through drum folding. A methodology is suggested for choosing the degree of folding most appropriate to a particular memory configuration.
1973
A page allocation strategy for multiprogramming systems
In a multiprogramming, virtual-memory computing system, many processes compete simultaneously for system resources, which include CPU's, main memory page frames, and the transmission capacity of the paging drum. (We define a “process” here as a program with its own virtual memory, requiring an allocation of real memory and a CPU in order to execute). This paper studies ways of allocating resources to processes in order to maximize throughput in systems which are not CPU-bound. As is customary, we define the multiprogramming “set” (MPS) as the set of processes eligible for allocation of resources at any given time. Each process in the MPS is allocated a certain number of page frames and allowed to execute, interrupted periodically by page faults. A process remains in the MPS until it finishes or exhausts its “time slice”, at which time it is demoted. We assume the existence of two resource managers within the operating system: The Paging Manager and the Scheduler. The function of the Paging Manager is to control the size of the MPS, and to allocate main storage page frames among those processes in the MPS. The function of the Scheduler is to assign time-slice lengths to the various processes, and to define a promotion order among those processes not currently in the MPS. The Scheduler must ensure that system responsiveness is adequate, while the Paging Manager is primarily concerned with throughput. This paper studies possible strategies for the Paging Manager. A strategy for the Scheduler is proposed in (2).
1973
Dynamic storage partitioning
A model of paged multiprogramming computer systems using variable storage partitioning is considered. A variable storage partitioning policy is one which allocates storage among the active tasks according to a sequence of fixed partitions of main storage. The basic result obtained is that mean processing efficiency is increased and mean fault-rate decreased under a variable partition, provided that the curves of efficiency and fault-rate as a function of allocated space are concave up.
1973
On reference string generation processes
Efficient memory management is important for optimizing computer usage. Intuition, simulation, experience, and analysis have contributed to the design of space management algorithms. Analytical models require accurate and concise descriptions of the system's environment. The referencing pattern, describing the sequence of memory addresses, is the environment for memory management problems. One referencing model assigns probabilities to positions in an LRU stack of memory pages. “Local” behavior is easily described using this model. However, the distinctly different behavior among instruction and data references is lost. In this paper we generalize the LRU stack model to an arbitrary number of memory spaces which are selected by the transitions of a Markov chain. The additional detail affects the behavior of the models for a working set management strategy.
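A minimal sketch (ours) of the basic LRU stack model described here: a reference string is generated by repeatedly drawing a stack distance from a fixed distribution and referencing the page currently at that depth, which then moves to the top of the stack.

import random

def generate_refs(pages, distance_probs, length, seed=0):
    """Generate a reference string from the LRU stack-distance model.

    pages:          initial page ordering (top of stack first).
    distance_probs: distance_probs[i] = probability of referencing the
                    page at stack depth i (0 = most recently used).
    length:         number of references to generate.
    """
    rng = random.Random(seed)
    stack = list(pages)
    refs = []
    for _ in range(length):
        depth = rng.choices(range(len(stack)), weights=distance_probs)[0]
        page = stack.pop(depth)   # reference the page at this depth...
        stack.insert(0, page)     # ...and move it to the top (LRU update)
        refs.append(page)
    return refs

# Strong locality: shallow stack depths are much more likely.
print(generate_refs(["A", "B", "C", "D"], [0.6, 0.2, 0.15, 0.05], 10))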
Scheduling Theory
1973
On the probability of deadlock in computer systems
As the number of processes and resources increases within a computer system, does the probability of that system's being in deadlock increase or decrease? This paper introduces Probabilistic Automata as a model of computer systems. This allows investigation of the above question for various scheduling algorithms. A theorem is proven which indicates that, within the types of systems considered, the probability of deadlock increases.
1973
Polynomial complete scheduling problems
We show that the problem of finding an optimal schedule for a set of jobs is polynomial complete even in the following two restricted cases. (1) All jobs require one time unit. (2) All jobs require one or two time units, and there are only two processors. As a consequence, the general preemptive scheduling problem is also polynomial complete. These results are tantamount to showing that the scheduling problems mentioned are intractable.
1973
Scheduling independent tasks to reduce mean finishing-time
In this paper we study the problem of scheduling a set of independent tasks on m ≥ 1 processors to minimize the mean finishing-time (mean time in system). The importance of the mean finishing-time criterion is that its minimization tends to reduce the mean number of unfinished tasks in the system. In the paper we give a reduction of our scheduling problem to a transportation problem and thereby extend the class of known non enumerative scheduling algorithms [1]. Next we show that the inclusion of weights (weighted mean finishing-time) complicates the problem and speculate that there may be no non enumerative algorithm for this case. For the special case of identical processors we study the maximum finishing-time properties of schedules which are optimal with respect to mean finishing-time. Finally we give a scheduling algorithm having desirable properties with respect to both maximum finishing-time and mean finishing-time.
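As a concrete special case (our sketch, not the paper's transportation-problem formulation), on identical processors the classical shortest-processing-time rule minimizes mean finishing time: sort the tasks by processing time and deal them out cyclically to the processors.

def spt_mean_finishing_time(times, m):
    """Schedule independent tasks on m identical processors with the
    shortest-processing-time (SPT) rule and return the mean finishing time.

    times: list of task processing times.
    m:     number of identical processors.
    """
    order = sorted(times)
    loads = [0.0] * m            # current finishing time on each processor
    finish = []
    for i, t in enumerate(order):
        p = i % m                # deal the sorted tasks out cyclically
        loads[p] += t
        finish.append(loads[p])
    return sum(finish) / len(finish)

print(spt_mean_finishing_time([4, 2, 7, 1, 3], 2))   # 4.8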
1973
Bounds on scheduling with limited resources
A number of authors (cf. [12], [6], [7], [3], [11], [4], [5], [9]) have recently been concerned with scheduling problems associated with a certain model of an abstract multiprocessing system (to be described in the next section) and, in particular, with bounds on the worst-case behavior of this system as a function of the way in which the inputs are allowed to vary. In this paper, we introduce an additional element of realism into the model by postulating the existence of a set of “resources” with the property that at no time may the system use more than some predetermined amount of each resource. With this extra constraint taken into consideration, we derive a number of bounds on the behavior of this augmented system. It will be seen that this investigation leads to several interesting results in graph theory and analysis.
1973
A task-scheduling algorithm for a multiprogramming computer system
This paper presents a description and analysis of a task scheduling algorithm which is applicable to third generation computer systems. The analysis is carried out using a model of a computer system having several identical task processors and a fixed amount of memory. The algorithm schedules tasks having different processor-time and memory requirements. The goal of the algorithm is to produce a task schedule which is near optimal in terms of the time required to process all of the tasks. An upper bound on the length of this schedule is the result of deterministic analysis of the algorithm. Computer simulations demonstrate the applicability of the algorithm in actual systems, even when some of the basic assumptions are violated.
Protection and Addressing
1973
Protection and control of information sharing in multics
This paper describes the design of mechanisms to control sharing of information in the Multics system. Seven design principles help provide insight into the tradeoffs among different possible designs. The key mechanisms described include access control lists, hierarchical control of access specifications, identification and authentication of users, and primary memory protection. The paper ends with a discussion of several known weaknesses in the current protection mechanism design.
1973
The case for capability based computers
The idea of a capability which acts like a ticket authorizing the use of some resource was developed by Dennis and Van Horn as a generalization of addressing and protection schemes such as the codewords of the Rice computer, the descriptors of the Burroughs machines, and the segment and page tables in computers such as the GE-645 and IBM 360/67. Dennis and Van Horn generalized the earlier schemes by extending them to include not just memory, but all system resources: memory, processes, input/output devices, and so on; and by stressing the explicit manipulation of access control by nonsystem programs. The idea is that a capability is a special kind of address for an object, that these addresses can be created only by the supervisor, and that in order to use any object, one must address it via one of these addresses. The name comes from the fact that having one of these special kinds of addresses for a resource provides one with the capability to use the resource. The use of capabilities as a protection mechanism has been the subject of considerable interest and is now fairly well understood. Access control schemes using capabilities and capability-like notions are, as a whole, the most flexible and general schemes available. It will in fact be assumed that the reader is familiar with the advantages of capabilities for protection purposes; a somewhat different advantage of capabilities will be developed here.
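The addressing idea can be illustrated with a small sketch (ours): a capability pairs a reference to an object with a set of access rights, only the supervisor can mint one, and every use of an object goes through a capability check. Python cannot make such tickets truly unforgeable; the token below merely stands in for the supervisor-only creation rule.

SUPERVISOR_TOKEN = object()   # stands in for the supervisor/kernel boundary

class Capability:
    """An unforgeable ticket pairing an object with access rights."""
    def __init__(self, obj, rights, _token=None):
        if _token is not SUPERVISOR_TOKEN:
            raise PermissionError("only the supervisor may create capabilities")
        self.obj = obj
        self.rights = frozenset(rights)

def supervisor_grant(obj, rights):
    """Supervisor entry point: mint a capability for an object."""
    return Capability(obj, rights, _token=SUPERVISOR_TOKEN)

def read(cap):
    """Use an object via a capability; the rights are checked on use."""
    if "read" not in cap.rights:
        raise PermissionError("capability lacks the 'read' right")
    return cap.obj["data"]

segment = {"data": "hello"}
cap = supervisor_grant(segment, {"read"})
print(read(cap))   # hello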
1973
Formal requirements for virtualizable third generation architectures
Virtual machine systems have been implemented on a limited number of third generation computer systems, for example CP-67 on the IBM 360/67. The value of virtual machine techniques to ease the development of operating systems, to aid in program transferability, and to allow the concurrent running of disparate operating systems, test and diagnostic programs has been well recognized. However, from previous empirical studies, it is known that many third generation computer systems, e.g. the DEC PDP-10, cannot support a virtual machine system. In this paper, the hardware architectural requirements for virtual machine systems are discussed. First, a fairly specific definition of a virtual machine is presented which includes the aspects of efficiency, isolation, and identical behavior. A model of third generation-like computer systems is then developed. The model includes a processor with supervisor and user modes, memory that has a simple protection mechanism, and a trap facility. In this context, instruction behavior is then carefully characterized.
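Although not restated in this abstract, the characterization developed in this paper is usually summarized by its main theorem: a virtual machine monitor can be constructed for a conventional third-generation machine if every sensitive instruction (one whose behavior depends on, or which modifies, the machine's configuration, such as the processor mode or memory mapping) is also privileged, so that it traps when executed in user mode. In set terms,

\[
\text{sensitive instructions} \;\subseteq\; \text{privileged instructions}.
\]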
1973
Limitations of Dijkstra's Semaphore Primitives and Petri nets
Recently various attempts have been made to study the limitations of Dijkstra's Semaphore Primitives for the synchronization problem of cooperating sequential processes [3,4,6,8]. Patil [8] proves that the semaphores with the P and V primitives are not sufficiently powerful. He suggests a generalization of the P primitive. We prove that certain synchronization problems cannot be realized with the above generalization and even with arrays of semaphores. We also show that even the general Petri nets will not be able to handle some synchronization problems, contradicting a conjecture of Patil (P.28 [7]).

3rd SOSP — October 18-20, 1971 — Palo Alto, California, USA
Systems, Real and Virtual
1971
TENEX, a paged time sharing system for the PDP-10
This paper appears in the March, 1972, issue of the Communications of the ACM. Its abstract is reproduced below. TENEX is a new time sharing system implemented on a DEC PDP-10 augmented by special paging hardware developed at BBN. This report specifies a set of goals which are important for any time sharing system. It describes how the TENEX design and implementation achieve these goals. These include specifications for a powerful multiprocess large memory virtual machine, intimate terminal interaction, comprehensive uniform file and I/O capabilities, and clean flexible system structure. Although the implementation described here required some compromise to achieve a system operational within six months of hardware checkout, TENEX has met its major goals and provided reliable service at several sites and through the ARPA network.
1971
The design of the Venus Operating System
The Venus Operating System is an experimental multiprogramming system which supports five or six concurrent users on a small computer. The system was produced to test the effect of machine architecture on complexity of software. The system is defined by a combination of micro-programs and software. The microprogram defines a machine with some unusual architectural features; the software exploits these features to define the operating system as simply as possible. In this paper the development of the system is described, with particular emphasis on the principles which guided the design.
1971
An operating system based on the concept of a supervisory computer
This paper appears in the March, 1972, issue of the Communications of the ACM. Its abstract is reproduced below. An operating system which is organized as a small supervisor and a set of independent processes is described. The supervisor handles I/O with external devices and the file and directory system, schedules active processes, manages memory, handles errors, and provides a small set of primitive functions which it will execute for a process. A process is able to specify a request for a complicated action on the part of the supervisor (usually a wait on the occurrence of a compound event in the system) by combining these primitives into a “supervisory computer program.” The part of the supervisor which executes these programs may be viewed as a software implemented “supervisory computer.” The paper develops these concepts in detail, outlines the remainder of the supervisor, and discusses some of the advantages of this approach.
1971
A multiprogramming system for process control
This paper describes an operational small computer multiprogramming system developed for the control of the Stanford Two Mile Linear Accelerator (SLAC). The system has many features of larger systems such as dynamic memory allocation and interprocess control, but does not have to handle typical batch type jobs which need large arrays and many other system resources. This difference results in a drastic reduction in the complexity of both design and implementation. The accelerator control problem is discussed in terms of what requirements are imposed on the system by the environment. Then the basic subsystems are described with sufficient examples to show the reader other areas where such a system may be applicable.
Implementation Techniques
1971
Extensible data features in the operating system language OSL/2
The extensible data facilities of OSL/2, an operating system language, are described. These facilities allow one to create new data types, such as queues, files, and tables, and describe complex access algorithms for these new types. When used in operating system codes, data type extension facilities help the programmer isolate macro and micro levels of complex data manipulation and provide logical places to insert and remove system measurement code. The implementation of these facilities is compatible with existing block structured languages like ALGOL 60 and PL/1 and can be accomplished in a straightforward manner.
1971
The Multics Input/Output system
An I/O system has been implemented in the Multics system that facilitates dynamic switching of I/O devices. This switching is accomplished by providing a general interface for all I/O devices that allows all equivalent operations on different devices to be expressed in the same way. Also particular devices are referenced by symbolic names and the binding of names to devices can be dynamically modified. Available I/O operations range from a set of basic I/O calls that require almost no knowledge of the I/O System or the I/O device being used to fully general calls that permit one to take full advantage of all features of an I/O device but require considerable knowledge of the I/O System and the device. The I/O System is described and some popular applications of it, illustrating these features, are presented.
1971
A hardware architecture for implementing protection rings
This paper appears in the March, 1972, issue of the Communications of the ACM. Its abstract is reproduced below. Protection of computations and information is an important aspect of a computer utility. In a system which uses segmentation as a memory addressing scheme, protection can be achieved in part by associating concentric rings of decreasing access privilege with a computation. The mechanisms allow cross-ring calls and subsequent returns to occur without trapping to the supervisor. Automatic hardware validation of references across ring boundaries is also performed. Thus, a call by a user procedure to a protected subsystem (including the supervisor) is identical to a call to a companion user procedure. The mechanisms of passing and referencing arguments are the same in both cases as well.
1971
Handling difficult faults in operating systems
It is commonplace to build facilities into operating systems to handle faults which occur in user-level programs. These facilities are often inadequate for their task; some faults or incidents are regarded as so bad that the user cannot be allowed to act on them and this makes it difficult or impossible to write subsystems which give proper diagnostics in all cases, or which are adequately secure, or which are adequately robust. This paper looks into why there is a need for very complete facilities and why there is a problem about providing them, and proposes an outline structure which could be used.
1971
Storage reallocation in hierarchical associative memories
Two recent trends in computing, namely parallelism and programming generality, imply that increasing importance will be placed on developing location-independent schemes for computer memories. This paper examines several issues that arise when hierarchical associative memories are used to accomplish this objective. First, it presents methods for constructing economical associative memories to be utilized as lower levels in the hierarchy. Then, it considers the ramifications of storage reallocation in the memory and develops a new unit of storage transfer, the paragraph. Finally, it demonstrates that difficulties that might arise due to duplicate storage of data words in the memory can be negated by a simple scheme.
Process Interactions and System Correctness
1971
Some deadlock properties of computer systems
First, a “meta theory” of computer systems is developed so that the terms “process” and “deadlock” can be defined. Next, “reusable resources” are introduced to model objects which are shared among processes, and “consumable resources” are introduced to model signals or messages passed among processes. Then a simple graph model of computer systems is developed, and its deadlock properties are investigated. This graph model is useful for teaching purposes, unifies a number of previous results, and leads to efficient deadlock detection and prevention algorithms.
1971
A concurrent algorithm for avoiding deadlocks in multiprocess multiple resource systems
In computer systems in which resources are allocated dynamically, algorithms must be executed whenever resources are assigned to determine if the allocation of these resources could possibly result in a deadlock, a situation in which two or more processes remain in an idle or blocked state indefinitely. In previous research, execution of the process requesting resources is suspended while an algorithm is executed to determine that the assignment could not cause a deadlock. In this paper, an algorithm is used to calculate all possible safe requests before they are made. This algorithm is executed concurrently with other processes between requests for resource allocations. If the determination of all safe requests has been completed and a process makes a request, the calculations required by the resource allocation are trivial.
1971
Synchronization of communicating processes
This paper appears in the March, 1972, issue of the Communications of the ACM. Its abstract is reproduced below. Formalization of a well-defined synchronization mechanism can be used to prove that concurrently running processes of a system communicate correctly. This is demonstrated for a system consisting of many sending processes which deposit messages in a buffer and many receiving processes which remove messages from that buffer. The formal description makes it very easy to prove that the buffer will neither overflow nor underflow, that senders and receivers will never operate on the same message frame in the buffer nor will they run into a deadlock.
1971
An approach to systems correctness
First, the problem of proving the correctness of an operating system is defined. Then a simple model is presented. Several examples are given to show how this model allows derivation of proofs about small systems.
Queueing and Scheduling
1971
Process synchronization without long-term interlock
A technique is presented for replacing long-term interlocking of shared data by the possible repetition of unprivileged code in case a version number (associated with the shared data) has been changed by another process. Four principles of operating system architecture (which have desirable effects on the intrinsic reliability of a system) are presented; implementation of a system adhering to these principles requires that long-term lockout be avoided.
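A minimal sketch (ours) of the version-number technique described here: readers and updaters take no long-term lock; an update is retried if the version number of the shared data changed while the unprivileged code was working. A short internal lock is used only for the commit step itself.

import threading

class Versioned:
    """Shared data guarded by a version number instead of a long-term lock."""
    def __init__(self, value):
        self.value = value
        self.version = 0
        self._lock = threading.Lock()   # held only for the short commit step

    def snapshot(self):
        with self._lock:
            return self.value, self.version

    def try_commit(self, new_value, expected_version):
        """Install new_value only if no one else updated in the meantime."""
        with self._lock:
            if self.version != expected_version:
                return False            # someone changed the data: caller retries
            self.value = new_value
            self.version += 1
            return True

def add_one(shared):
    while True:                         # repeat the unprivileged code on conflict
        value, version = shared.snapshot()
        if shared.try_commit(value + 1, version):
            return

counter = Versioned(0)
threads = [threading.Thread(target=add_one, args=(counter,)) for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()
print(counter.value)   # 8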
1971
Short-term scheduling in multiprogramming systems
This paper defines a set of scheduling primitives which have evolved from multiprogramming systems described by Dijkstra, Lampson, Saltzer, and the present author. Compared to earlier papers on the same subject, the present one illustrates a more concise description of operating system principles by means of algorithms. This is achieved by (1) describing the primitives as the instructions of an abstract machine which in turn is defined by its instruction execution algorithm; (2) introducing a notation which distinguishes between the use of synchronizing variables (semaphores) to achieve mutual exclusion of critical sections, and to exchange signals between processes which have explicit input/output relationships; (3) considering the influence of critical sections on preemption and resumption; and (4) using a programming language, Pascal, which includes natural data types (records, classes, and pointers) for the representation of process descriptions and scheduling queues. The algorithms are written at a level of detail which clarifies the fundamental problems of process scheduling and suggests efficient methods of implementation.
1971
Process selection in a hierarchical operating system
This paper presents a new model for use in scheduling processes for the sharing of a processor. The model may be used in various modes of operation, including multiprogramming, real time and time sharing. Because the model is applicable for various modes of operation and because it is significantly useful only within a well-defined system hierarchy, the context for discussion is Hansen's system for the RC 4000 [3,4,5]. Only the preliminary work has been reported in this paper, but the author believes that the model is worth investigating further, and, therefore, research into its effects and properties will continue.
1971
The dependence of computer system queues upon processing time distribution and central processor scheduling
The stationary distribution of the number of jobs being served by a processor-sharing central server is independent of both the distribution of service times and the distribution of interarrival times when those distributions have rational Laplace-Stieltjes transforms. This result holds for both finite source and infinite source models. The steady state is identical to the steady state when all distributions are exponential. The expected response time, queue size, and central processor idle time of the finite source model under processor-sharing and FCFS scheduling are compared. These measures of system performance are all larger under processor-sharing for a class of central processor service time distributions with a coefficient of variation less than one. The measures are all smaller under processor-sharing for a class of distributions with a coefficient of variation greater than one. Experiments with data collected from actual computer systems indicate that these results extend to more general models and have practical applications in existing computer systems.
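For the infinite-source case this insensitivity takes a familiar closed form; the following are the standard M/G/1 expressions (our notation, not the paper's), with $S$ the service time, $\lambda$ the arrival rate, and $\rho = \lambda\,\mathbb{E}[S] < 1$:

    P\{N = n\} = (1-\rho)\,\rho^{\,n}, \qquad n = 0, 1, 2, \ldots
    \mathbb{E}[T_{\mathrm{PS}}] = \frac{\mathbb{E}[S]}{1-\rho}, \qquad
    \mathbb{E}[T_{\mathrm{FCFS}}] = \mathbb{E}[S] + \frac{\lambda\,\mathbb{E}[S^{2}]}{2(1-\rho)}.

Writing $\mathbb{E}[S^{2}] = \mathbb{E}[S]^{2}(1 + c^{2})$, where $c$ is the coefficient of variation, shows that FCFS gives the smaller mean response time exactly when $c < 1$ and the larger one when $c > 1$, which is the comparison the abstract reports.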
1971
A comparative analysis of disk scheduling policies
This paper appears in the March, 1972, issue of the Communications of the ACM. Its abstract is reproduced below. Five well-known scheduling policies for movable head disks are compared using the performance criteria of expected seek time (system oriented) and expected waiting time (individual I/O request oriented). Both analytical and simulation results are obtained. The variance of waiting time is introduced as another meaningful measure of performance, showing possible discrimination against individual requests. Then the choice of a utility function to measure total performance including system oriented and individual request oriented measures is described. Such a function allows one to differentiate among the scheduling policies over a wide range of input loading conditions. The selection and implementation of a maximum performance two-policy algorithm are discussed.
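A toy version of such a comparison is easy to state (only FCFS and SSTF are shown, the request stream is synthetic, and waiting-time variance, which the paper emphasizes, is not measured here):

    import random

    def seek_total(requests, policy, start=0):
        """Total head movement (in cylinders) to serve all queued requests."""
        pos, pending, total = start, list(requests), 0
        while pending:
            if policy == "FCFS":
                nxt = pending.pop(0)
            else:                                   # SSTF: shortest seek first
                nxt = min(pending, key=lambda c: abs(c - pos))
                pending.remove(nxt)
            total += abs(nxt - pos)
            pos = nxt
        return total

    random.seed(1)
    reqs = [random.randrange(200) for _ in range(50)]   # 200-cylinder disk
    print("FCFS:", seek_total(reqs, "FCFS"), "SSTF:", seek_total(reqs, "SSTF"))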
Memory Management and System Performance
1971
A study of storage partitioning using a mathematical model of locality
This paper appears in the March, 1972, issue of the Communications of the ACM. Its abstract is reproduced below. Both fixed and dynamic storage partitioning procedures are examined for use in multiprogramming systems. The storage requirement of programs is modeled as a stationary Gaussian process. Experiments justifying this model are described. By means of this model dynamic storage partitioning is shown to provide substantial increases in storage utilization and operating efficiency over fixed partitioning.
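The intuition behind the result can be seen with a back-of-the-envelope calculation under the Gaussian model (independent demands and illustrative parameters are assumptions; the z value picks a 99th-percentile provisioning level):

    from math import sqrt

    k, mu, sigma, z = 6, 100.0, 25.0, 2.33    # 6 programs; demands in pages

    # Fixed partitioning: each program's region must absorb its own
    # 99th-percentile demand, so variability is paid for k times over.
    fixed_total = k * (mu + z * sigma)

    # Dynamic partitioning: one pool need only absorb the 99th percentile of
    # the *sum* of demands, whose standard deviation grows like sqrt(k).
    dynamic_total = k * mu + z * sigma * sqrt(k)

    expected_use = k * mu
    print("fixed:   %.0f pages, utilization %.2f" % (fixed_total, expected_use / fixed_total))
    print("dynamic: %.0f pages, utilization %.2f" % (dynamic_total, expected_use / dynamic_total))

Because the standard deviation of the summed demand grows only as the square root of the number of programs, a shared dynamic pool can be provisioned much closer to expected usage than the sum of individually provisioned fixed regions.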
1971
Properties of the working-set model
This paper appears in the March, 1972, issue of the Communications of the ACM. Its abstract is reproduced below. A program's working set W(t,T) at time t is the set of distinct pages among the T most recently referenced pages. Relations between the average working-set size, the missing-page rate, and the interreference-interval distribution may be derived both from time-average definitions and from ensemble-average (statistical) definitions. An efficient algorithm for estimating these quantities is given. The relation to LRU (least recently used) paging is characterized. The independent-reference model, in which page references are statistically independent, is used to assess the effects of interpage dependencies on working-set size observations. Under general assumptions, working-set size is shown to be normally distributed.
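The quantities named here can be computed directly from a page-reference string; a small sketch (the reference string and window sizes are illustrative):

    def working_set(refs, t, T):
        """W(t, T): distinct pages among the T references ending at time t."""
        return set(refs[max(0, t - T + 1): t + 1])

    def stats(refs, T):
        """Average working-set size and missing-page rate for window size T."""
        faults, sizes = 0, []
        for t, page in enumerate(refs):
            resident = working_set(refs, t - 1, T) if t else set()
            if page not in resident:       # referenced page not in the working set
                faults += 1
            sizes.append(len(working_set(refs, t, T)))
        return sum(sizes) / len(sizes), faults / len(refs)

    refs = [0, 1, 2, 0, 1, 3, 0, 3, 1, 2, 1, 0]     # toy reference string
    for T in (1, 2, 4, 8):
        s, m = stats(refs, T)
        print("T=%d  avg |W| = %.2f  missing-page rate = %.2f" % (T, s, m))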
1971
An algorithm for drum storage management in time-sharing systems
An algorithm for efficiently managing the transfer of pages of information between main and secondary memory is developed and analyzed. The algorithm applies to time-shared computer systems that use rotating magnetic drums as secondary storage devices. The algorithm is designed to provide efficient system performance when referenced pages are predominantly pre-loaded. However, the algorithm also provides optimum results for systems where all pages are loaded on demand. The nature of the improved performance which can be derived from page pre-loading strategy is discussed. Simulation results are presented which plot system performance as a function of main memory size.
1971
Simulation studies of a virtual memory, time-shared, demand paging operating system
SIM/61 is a large (3500 lines of Simscript code), highly detailed simulation model of a virtual memory, time-shared, demand paging operating system. SIM/61 provides the capability for parameterized modeling of both hardware and software. The current model contains algorithms for interrupt analysis, task scheduling, I/O scheduling and demand paging. This paper reports the results of studies made using SIM/61. The studies fall into two main categories: (1) load and configuration studies and (2) alternate algorithm studies. The approach taken for the former was to establish a fixed load, and measure its performance on various hardware configurations. The results are particularly interesting with respect to the paging capability of various paging device configurations, and various sizes of main memory.
1971
Experimental data on how program behavior affects the choice of scheduler parameters
A theory of combined scheduling of processor and main memory has begun to emerge in the last few years. It has been proposed that schedulers use the working set concept to avoid thrashing. To our knowledge, no data has been published on experimental measurements of working sets or on the influence that program behavior may have on the choice of quantum size. In this paper empirical data are presented and their influence on the choice of system parameters is discussed.

2nd SOSP — October 20-22, 1969 — Princeton, New Jersey, USA
1969
Front Matter
General principles of operating systems design
1969
The system design cycle
Technical competence in system design encompasses at least an appreciation of a number of issues that are often dismissed as being merely managerial in nature, since most system programmers seek a purely technical definition of what constitutes a system design, if they think about the problem at all. This paper reviews the area between the purely technical and the purely managerial as it pertains to the design portion of the system development cycle and finds this area non-empty. "System", in the author's lexicon, is an antonym of "chaos"; the two main sections of the paper discuss the components of the design activity and byways which lead to chaotic systems.
1969
Theory and practice in operating system design
In designing an operating system one needs both theoretical insight and horse sense. Without the former, one designs an ad hoc mess; without the latter, one designs an elephant in best Carrara marble (white, perfect, and immobile). We try in this paper to explore the provinces of the two needs, suggesting places where we think horse sense will always be needed, and some other places where better theoretical understanding than we now have would seem both possible and helpful.
1969
The role of motherhood in the pop art of system programming
Numerous papers and conference talks have recently been devoted to the affirmation or reaffirmation of various common-sense principles of computer program design and implementation, particularly with respect to operating systems and to large subsystems such as language translators. These principles are nevertheless little observed in practice, often to the detriment of the resulting systems. This paper attempts to summarize the most significant principles, to evaluate their applicability in the real world of large multi-access systems, and to assess how they can be used more effectively.
1969
Machine independent software
The techniques of abstract machine modelling and macro processing can be used to construct software which is easily moved from one computer to another. This paper describes a system developed at the University of Colorado which has been implemented on 8 different machines, with efforts ranging from 1 man-day to 1 man-week. We feel that these techniques offer a possible solution to the "software crisis" which is plaguing the computer industry today.
Virtual memory implementation
1969
Measurements of segment size
Distributions of segment sizes have been measured under routine operating conditions on a computer system which utilizes variable-sized segments (the Burroughs B5500). The most striking feature of the measurements is the large number of small segments - about 60% of the segments in use contain less than 40 words. Although the results are certainly not installation-independent, they should be relevant to the design of new computer systems, especially with respect to the organization of paging schemes.
1969
The multics virtual memory
As experience with use of on-line operating systems has grown, the need to share information among system users has become increasingly apparent. Many contemporary systems permit some degree of sharing. Usually, sharing is accomplished by allowing several users to share data via input and output of information stored in files kept in secondary storage. Through the use of segmentation, however, Multics provides direct hardware addressing by user and system programs of all information, independent of its physical storage location. Information is stored in segments each of which is potentially sharable and carries its own independent attributes of size and access privilege. Here, the design and implementation considerations of segmentation and sharing in Multics are first discussed under the assumption that all information resides in a large, segmented main memory. Since the size of main memory on contemporary systems is rather limited, it is then shown how the Multics software achieves the effect of a large segmented main memory through the use of the GE 645 segmentation and paging hardware.
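In outline, every reference names a segment and an offset, and each process resolves it through its own descriptor segment; a schematic of the two-level lookup follows (the table formats, field names, and page size are illustrative assumptions, not the GE 645's):

    class Fault(Exception):
        pass

    def translate(seg_no, offset, descriptor_segment, page_size=1024):
        """Map a (segment, offset) pair to a physical word address."""
        try:
            seg = descriptor_segment[seg_no]        # per-process descriptor segment
        except KeyError:
            raise Fault("no such segment")
        if offset >= seg["length"] or not seg["access"].get("read", False):
            raise Fault("bound or access violation")
        page, line = divmod(offset, page_size)
        frame = seg["page_table"].get(page)         # the segment is itself paged
        if frame is None:
            raise Fault("page fault: bring page in from secondary storage")
        return frame * page_size + line

    # A segment can appear in several processes' descriptor segments, each with
    # its own access attributes, which is how sharing with independent
    # privileges can be arranged.
    descriptors = {3: {"length": 4096, "access": {"read": True},
                       "page_table": {0: 17, 1: 42}}}
    print(translate(3, 1500, descriptors))          # -> 42*1024 + 476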
1969
Complementary replacement — a meta scheduling principle
A principle of scheduling is presented that includes a wide class of time and space allocation problems met in time sharing and virtual systems. Its essence is a method, based on symmetry, to use any of several rules for admission-to-service in a single server system to derive a space-replacement rule. This method, called complementary replacement, includes several known job dispatching rules as well as some page-replacement algorithms such as MIN and LRU (least-recently-used). A fundamental but unsolved problem with the principle is its range of applicability and the conditions under which it guarantees an optimum.
1969
Optimal segmentation points for programs
A program may be modeled as a directed graph on n "nodes," where the nodes are instructions, or data items, or contiguous groups of these. This paper discusses the problem of partitioning such sets of nodes into pages to minimize the number of transitions between pages during execution of the program. The nodes are assumed to have a given ordering which may not be changed. We require that the nodes on any page be contiguous, so the only degree of freedom is in selecting "break points" between the pages. We show that if the expected number of transitions between each node of the program graph and its successors is known, then there is an algorithm for selecting the optimal break points. The algorithm requires an execution time which grows linearly with the number of nodes in almost all cases.
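The structure of the optimization is easy to exhibit with a straightforward dynamic program over break positions; the sketch below is quadratic in the worst case rather than the near-linear algorithm the paper claims, and the page limit is expressed as a node count for simplicity.

    def optimal_breaks(n, edges, page_capacity):
        """Partition nodes 0..n-1, in order, into pages of at most page_capacity
        nodes, minimizing the expected number of transitions between pages.
        edges: {(u, v): expected transitions between nodes u and v}, with u < v."""
        INF = float("inf")
        best = [INF] * (n + 1)     # best[i]: minimum cost of paginating nodes 0..i-1
        prev = [0] * (n + 1)
        best[0] = 0.0
        for i in range(1, n + 1):
            for j in range(max(0, i - page_capacity), i):
                # Last page holds nodes j..i-1; pay once for every edge that
                # connects this page to an earlier one.
                crossing = sum(w for (u, v), w in edges.items() if u < j <= v < i)
                if best[j] + crossing < best[i]:
                    best[i], prev[i] = best[j] + crossing, j
        breaks, i = [], n
        while i > 0:               # recover the chosen break points
            breaks.append(prev[i])
            i = prev[i]
        return best[n], sorted(b for b in breaks if b)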
1969
Strategies for structuring two level memories in a paging environment
The strategies for effective structuring of data and program code in a two level, directly addressable paged memory system are presented along with the experience gained from deriving and implementing appropriate algorithms in one time-shared system. The experience indicated that a reduction in the paging rate can result through the addition of a slower second-level bulk memory thus leading to improved performance. While improved techniques for structuring these memories coupled with faster and/or cheaper bulk memories offer further improvements in performance, reduction in the paging overhead might offset some of these performance advantages.
Process management and communication
1969
Process control and communication
The paper contains a description of the structure of processes and the interprocess communication facility implemented within a general purpose operating system. Processes within the system operate asynchronously. The communication facility allows a process to signal another process, to send information to it, to cause it to be suspended or to cause it to be terminated.
1969
Process management and resource sharing in the multiaccess system “ESOPE”
This paper describes the main design principles of the multiaccess system ESOPE. Emphasis is placed on basic ideas underlying the design rather than on implementation details. The main features of the system include the ability given to any user to schedule his own parallel processes, using system primitive operations, and the allocation/scheduling policy, which dynamically takes into account recent information about user behaviour.
1969
Basic time-sharing: a system of computing principles
Basic Time-Sharing is an attempt to systematize the computing principles which should underlie the design of operating systems. These principles emphasize the fundamental separateness of work and resources and the parallelism inherent in the computing process. Primary constructions are described in which all computing activity can be realized.
1969
Structure of multiple activity algorithms
In conjunction with an earlier paper, this step toward machine design based on the structure of nested-declaration pure-procedure multiple-activity algorithms describes: the declaration, flow-of-control, and some data-addressing mechanisms of such algorithms; the dynamic Saguaro-Garden data-access-providing structure of their Records of Execution; and implementation in a segmented environment.
1969
The Multics interprocess communication facility
Essential to any multi-process computer system is some mechanism to enable coexisting processes to communicate with one another. The basic inter-process communication (IPC) mechanism is the exchange of messages among independent processes in a commonly accessible data base and in accordance with some pre-arranged convention. By introducing several system wide conventions for initiating communication, and by utilizing the Traffic Controller it is possible to expand the basic IPC mechanism into a general purpose IPC facility. The Multics IPC facility is an extension of the central supervisor which assumes the burden of managing the shared data base and of respecting the IPC conventions, thus providing a simple and easy way for the programmer to use the interface.
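In outline, the convention amounts to a mailbox in commonly accessible storage plus a block/wakeup discipline; a generic rendition follows (this is not the Multics facility's actual interface, only an illustrative shape):

    import collections, threading

    class Mailbox:
        """A shared, supervisor-managed message queue for one receiving process."""
        def __init__(self):
            self._messages = collections.deque()
            self._lock = threading.Lock()              # the pre-arranged convention
            self._wakeup = threading.Condition(self._lock)

        def send(self, message):                       # called by any cooperating process
            with self._wakeup:
                self._messages.append(message)
                self._wakeup.notify()                  # wake the waiting receiver

        def receive(self):                             # called by the owning process
            with self._wakeup:
                while not self._messages:
                    self._wakeup.wait()
                return self._messages.popleft()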
Systems and techniques
1969
The structure of the ILLIAC IV operating system
The paper outlines the structure of the operating system for the ILLIAC IV, a large array computer being built by the University of Illinois. The system is unique in that it resides primarily on a second control computer and distributes operating system functions among independent programs in a multi-programming, multi-processing mix, where inter-program communication is accomplished by buffer files. The job control language interpreter and hardware interrupt dispatcher are independent modules which may be user defined, invoked dynamically, and co-exist with the standard system modules.
1969
A program simulator by partial interpretation
As part of the ETSS project, a program simulator based on the idea of partial interpretation has been constructed; its principle and design are described in the paper. This approach was introduced to give the simulator high speed and high accuracy in simulation together with simplicity of implementation. The essence of partial interpretation is to combine direct execution of instructions by the hardware with simulation of instructions by an interpreter, with the hardware interrupt mechanism mediating between the two phases of the simulation. An interrupt occurs when a "privileged" instruction is executed, which triggers the simulation of that instruction; all other instructions are executed directly by the hardware. The simulation method for devices operating in parallel is also described with respect to timing control and scheduling. A program simulator of this type provides a powerful tool for debugging "supervisor" programs and opens a new approach to system expansion.
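The control flow of partial interpretation can be summarized schematically as follows; the instruction representation and handler names are assumptions made for illustration.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Instruction:
        privileged: bool                    # would cause a trap if run directly
        effect: Callable[[dict], None]      # the instruction's action on toy state

    def interpret_privileged(instruction, state):
        # The interpreter applies the instruction against the simulator's model
        # of supervisor state and devices rather than the real hardware.
        state["interpreted"] = state.get("interpreted", 0) + 1
        instruction.effect(state)

    def run(program, state):
        """Partial interpretation: unprivileged instructions execute 'directly';
        privileged ones take the interrupt path into the interpreter."""
        for instruction in program:
            if instruction.privileged:
                interpret_privileged(instruction, state)
            else:
                instruction.effect(state)   # direct execution by the hardware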
1969
HELPER: an interactive extensible debugging system
HELPER is a symbol oriented debugging system used for programs which have been compiled previously. It is easily extensible employing an incremental compiler to process commands given in a natural language. Programs are executed by simulation permitting a number of analysis actions to take place on each instruction. Debugging is both dynamic and static. Many program errors are caught before execution and a back up feature is provided to check code in the neighborhood of an error. Free access to more machine oriented debugging tools is possible and a complete record of each computer session is kept.
1969
Multiprogramming in a small-systems environment
This paper describes developments of the RAMP system, a multiprogramming system designed for use in small machines of the PDP-8 class for operation in real-time processing environments. Several of these systems are currently operating in process control, message switching and terminal control applications. The development of this system was largely the result of the need for relatively sophisticated data communications systems for use by remotely located process control and interactive graphics terminals. Early versions of this system, as described in [1, 2, and 3], sustained aggregate transmission rates of about 500-1000 characters-per-second for as many as twenty-four terminals.
1969
Basic I/O handling on Burroughs B6500
The approach to processing basic Input/Output in B6500 hardware design and software implementation is discussed in this paper. Hardware I/O structure necessary to the understanding of the approach is described first. The representation of the I/O queue and the algorithms used in handling I/O requests are described in detail to emphasize the ease with which the I/O handling portions of the executive system may be modified to suit any installation. Some of the I/O tables for coordinating I/O activity are discussed. The concept of asynchronous processes running as extended arms of the executive system is discussed and the implementation of it for updating status of peripheral units, for handling I/O errors, etc. is described. The usage and implementation of I/O related events and software interrupts is discussed. The locking concept is presented. Finally, the complete I/O initiation and completion cycle is described. The description throughout this paper is based on the actual working system implemented with language ESPOL - an extended ALGOL language used for writing executive systems. Some of the special language constructs pertinent to I/O handling are illustrated.
Instrumentation and measurement
1969
Measurement and performance of a multiprogramming system
This paper presents a brief description of a general multiprogramming (and time-sharing) system in operation at the University of Michigan. A number of time and space distributions obtained from a typical month of operation are presented. The performance is summarized by a load diagram. Efficiency as a comparison with ideal performance is discussed and a model suitable for idealization is proposed.
1969
The UTS time-sharing system: performance analysis and instrumentation
An approach is presented to analyzing and controlling hardware and software configurations of a time-sharing system given its load characteristics and desired performance. A mathematical model is developed which predicts response time to both interactive terminal requests and to compute bound programs. Estimates are obtained which relate performance to CPU load, file I/O load, and swap I/O load. Instrumentation contained within the UTS time-sharing system measures actual load and actual response during system operation. These measures suggest standards by which both load and response of time-shared systems could be compared. Extrapolation of the modeled and measured results allows prediction of performance for a variety of hardware configurations and user loads. Administrative controls allow dynamic tuning of the system during actual operation, utilizing displays of current performance.
1969
Two approaches for measuring the performance of time-sharing systems
There are two significantly different approaches that can be used for measuring the performance of time-sharing systems. The "stimulus" approach which conceptualizes the system as a "black box" containing a limited number of known functions, involves applying a controlled set of stimuli to the black box in order to activate its functions, and then observing the results. In the "analytic" approach, probes are inserted into the system to allow the recording of any level of the system's behavior. Both approaches are being developed for, and have been used to measure, System Development Corporation's ADEPT time-sharing system.
1969
The instrumentation of Multics
This paper reports an array of measuring tools devised to aid in the implementation of a prototype computer utility. These tools include special hardware clocks and data channels, general purpose programmed probing and recording tools, and specialized measurement facilities. Some particular measurements of interest in a system which combines demand paging with multi-programming are described in detail. Where appropriate, insight into effectiveness (or lack thereof) of individual tools is provided.
1969
Performance monitors for multi-programming systems
This paper proposes a scheme for improving the throughput of multi-programming systems. The basis of the proposal is the collection of a suitable set of measures of total system performance as well as corresponding per-process measures. Two types of control over system performances are suggested: (1) dynamic tuning of allocation policies to reflect the total system load, and (2) adjustment of the mix of processes competing for system resources in such a way as to keep the system within the range over which its allocation policies function properly.

1st SOSP — October 1-4, 1967 — Gatlinburg, Tennessee, USA
1967
Program
Virtual Memory
1967
Dynamic storage allocation systems
In many recent computer system designs, hardware facilities have been provided for easing the problems of storage allocation. This paper presents a method of characterizing dynamic storage allocation systems, according to the functional capabilities provided, and the underlying techniques used. The basic purpose of the paper is to provide a useful perspective from which the utility of various hardware facilities may be assessed. The paper includes as an appendix, a brief survey of storage allocation facilities in several representative computer systems.
1967
Virtual memory, processes, and sharing in Multics
The value of a computer system to its users is greatly enhanced if a user can, in a simple and general way, build his work upon procedures developed by others. The attainment of this essential generality requires that a computer system possess the features of equipment-independent addressing, an effectively infinite virtual memory, and provision for the dynamic linking of shared procedure and data objects. The paper explains how these features are realized in the Multics system.
1967
A study of the effect of user program optimization in a paging system
Much attention has been directed to paging algorithms and little to the role of the user in this environment. This paper describes an experiment which is an attempt to determine the significance of efforts by the user to improve the paging characteristics of his program. The problem of throughput in a computing system is primarily one of balancing the flow of data and programs through a hierarchy of storages. The problem is considered solved when for every available processor cycle there is a matching demand for that cycle in the primary (execution) store. Since programs and their data usually originate in a location other than the execution store, there is a delay associated with the movement of data and programs to the primary store. The delay has two components, the operational speed (data transfer time) and the positioning, or access time, of the secondary storage device. Since the access time usually exceeds the data transfer time by an order of magnitude, the problem of transferring information to the primary store has been named the “access gap” problem.
Memory Management
1967
An empirical study of the behavior of programs in a paging environment
This paper reports initial results from an empirical study directed at the measurement of program operating behavior in those multiprogramming systems in which programs are organized into fixed length pages. The data collected from the interpretive execution of a number of paged programs is used to describe the frequency of page faults; i.e. the frequency of those instants at which an executing program requires a page of data or instructions not in main (core) memory. These data are used also for the evaluation of two page replacement algorithms and for assessing the effects on performance of changes in the amount of storage allocated to executing programs.
1967
Resource management for a medium scale time sharing operating system
Task scheduling and resource balancing for a medium-size virtual memory paging machine are discussed in relation to a combined batch processing and time sharing environment. A synopsis is given of the task scheduling and paging algorithms that were implemented and the results of comparative simulation are given by tracing the development of the algorithms through six predecessor versions. Throughout the discussion particular emphasis is placed on balancing the system performance relative to the characteristics of all the system resources. Simulation results relative to alternate hardware characteristics and the effects of program mix and loading variations are also presented.
1967
The working set model for program behavior
Probably the most basic reason behind the absence of a general treatment of resource allocation in modern computer systems is the lack of an adequate model for program behavior. In this paper a new model is developed, the “working set model”, which enables us to decide which information is in use by a running program and which is not. Such knowledge is vital for dynamic management of paged memories. The working set of pages associated with a process, defined to be the collection of its most recently used pages, is a useful allocation concept. A proposal for an easy-to-implement allocation policy is set forth; this policy is unique, inasmuch as it blends into one decision function the heretofore independent activities of process-scheduling and memory-management.
Extended Core Memory Systems
1967
Consideration in the design of a multiple computer system with extended core storage
This paper discusses the recent innovation of the use of large quantities of addressable (but not executable) fast random access memory in order to heighten the multiprogramming performance of a multicomputer system. The general design of the hardware arrangement and the software components and functions of such a system are based on Brookhaven's future configuration of dual CDC 6600's sharing one million words of Extended Core Storage. In the generalization of such a design, special emphasis is placed on estimating expected gains compared to the traditional configuration of separate and independent computers without ECS. An observation is made on the use of conventional slower speed random access storage devices in place of the faster memory.
1967
A philosophy for computer sharing
The remarks which follow concern four design objectives for a shared computer system, and their relation to the selection of computing equipment and the design and programming of a time-sharing program at IDA-CRD during the last year. The principal objective was to bring as much computing power to our users as possible, consistent with budgetary bounds, and to provide as intimate a connection to the computer for each user as possible. A second objective was to provide a computing environment which allows each user the maximum freedom and control over the machine consistent with our third objective, namely, maximum reliability of the system. Our fourth objective was one which we choose for any program of a lasting nature: that the programs be as simple as possible while still achieving their purpose. Adhering to this maxim usually, although not always, reduces the time involved in realizing the programs; in addition, it holds promise for increasing their reliability, and in any event it reduces maintenance problems, if only by reducing the possible number of wrong symbols.
Philosophies of Process Control
1967
The structure of the “THE”-multiprogramming system
A multiprogramming system is described in which all activities are divided over a number of sequential processes. These sequential processes are placed at various hierarchical levels, in each of which one or more independent abstractions have been implemented. The hierarchical structure proved to be vital for the verification of the logical soundness of the design and the correctness of its implementation.
1967
A scheduling philosophy for multi-processing systems
One of the essential parts of any computer system is a mechanism for allocating the processors of the system among the various competitors for their services. These allocations must be performed in even the simplest system, for example, by the action of an operator at the console of the machine. In larger systems more automatic techniques are usually adopted; batching of jobs, interrupts and interval timers are the most common ones. As the use of such techniques becomes more frequent, it becomes increasingly difficult to maintain the conventional view of a computer as a system which does only one job at a time; even though it may at any given instant be executing a particular sequence of instructions, its attention is switched from one such sequence to another with such rapidity that it appears desirable to describe the system in a manner which accommodates this multiplexing more naturally. It is worthwhile to observe that these remarks apply to any large modern computer system and not just to one which attempts to service a number of on-line users simultaneously.
1967
An implementation of a multiprocessing computer system
A PDP-1 computer was donated (by the Digital Equipment Corporation) to the Electrical Engineering Department of the Massachusetts Institute of Technology in late 1961. In May, 1963 the first time-sharing system was operational. Since 1963 this PDP-1 has undergone substantial modifications (c.f. Appendix). Presently the machine has twelve thousand words (18-bit) of five microsecond memory arranged in pages of four thousand words. One of these pages is reserved for the system code and is protected from user references.
System Theory and Design
1967
Three criteria for designing computing systems to facilitate debugging
The designer of a computing system should adopt explicit criteria for accepting or rejecting proposed system features. Three possible criteria of this kind are input recordability, input specifiability, and asynchronous reproducibility of output. These criteria imply that a user can, if he desires, either know or control all of the influences affecting the content and extent of his computer's output. To define the scope of the criteria, the notion of an abstract machine of a programming language and the notion of a virtual computer are explained. Examples of applications of the criteria concern the reading of a time-of-day clock, the assignment of capability indices, and memory reading protection in multiprogrammed systems.
1967
Dynamic Supervisors - their design and construction
The paper demonstrates the technology necessary to bring the facilities of Supervisor construction and modification to the level at which a user can, without a great deal of research and analysis, modify his installation's Operating System. The Supervisor is seen to be a set of processes linked by a formalised control mechanism.
1967
Protection in an information processing utility
In this paper we will define and discuss a solution to some of the problems concerned with protection and security in an information processing utility. This paper is not intended to be an exhaustive study of all aspects of protection in such a system. Instead, we concentrate our attention on the problems of protecting both user and system information (procedures and data) during the execution of a process. We will give special attention to this problem when shared procedures and data are permitted.
Computer Networks and Communications
1967
Multiple computer networks and intercomputer communication
There are many reasons for establishing a network which allows many computers to communicate with each other to interchange and execute programs or data. The definition of a network within this paper will always be that of a network between computers, not including the network of typewriter consoles surrounding each computer. Attempts at computer networks have been made in the past; however, the usual motivation has been either load sharing or interpersonal message handling. Three other more important reasons for computer networks exist, at least with respect to scientific computer applications. Definitions of these reasons for a computer network follow.
1967
A digital communication network for computers giving rapid response at remote terminals
Those computer applications which involve rapid response to events at distant points create special problems in digital communication. Such applications are increasing in number, and could increase more rapidly if better solutions existed to the communication problems. The present-day methods for communication of data in rapid-response systems employ 'private wires' for the transmission paths or, where the available data rate and reliability is sufficient, employ voice channels from the switched telephone network. Given these rather arbitrary transmission facilities the user adds the terminal equipment necessary to make a communication system and sometimes integrates a number of paths into a private network.
1967
A position paper on computing and communications
The effective operation of free enterprise in creating the envisioned information service industry is dependent on three accomplishments: 1. The restructuring of our information processing industry to provide a clear separation among costs for computing, communications, and the development of information services. 2. The wide use of multi-access system concepts so that information services may share in the use of computer installations, and so their cost of construction is reasonable. 3. The development of public, message-switched communications services with adequate provisions for information security.