Birds of a Feather (BoFs) Program

There are four BoFs, split across two one-hour sessions. In each hour, two BoFs run in parallel in separate locations.

First Session: Tuesday, November 4, 6-7 PM

Title: Energy-efficiency and Resilience Challenges for Long-running Applications on Leadership-scale Machines

Scribe Notes: bof1.docx

Organizer: Devesh Tiwari (ORNL)

Location: Josephs Room

Description: The Oak Ridge Leadership Computing Facility (OLCF) enables faster scientific discovery through its supercomputers (currently, the Titan supercomputer is ranked No. 2 on the Top500 list). OLCF has helped scientists analyze scientific problems at a scale that was not even imaginable a few years ago. But we face new challenges as we prepare for the exascale era, energy efficiency and resilience being two of the most significant. The purpose of this BoF is to learn, share, and explore collaboration opportunities for improving the energy efficiency and resilience of long-running applications.

Through this BoF session, we aim to explore opportunities for joint activities to understand the computational, communication, and I/O requirements of emerging large-scale applications, e.g., next-generation sequencing and blood-flow simulations. How can we partition computer system resources in an energy-efficient way? How can these applications benefit from a supercomputer as opposed to small-scale clusters in the cloud? What are the challenges and trade-offs in scaling out these applications with respect to energy efficiency and resilience? What can systems researchers do to alleviate soft errors and data-movement costs in the supercomputer setting? So bring your favorite application. You are also welcome to present a 5-minute lightning talk about your potential use case or application requirements. We hope this BoF and follow-on community activities will help bridge the gap between traditional systems research and these HPC challenges.


Title: Should we build clouds with strong properties or stop wringing our hands?

Scribe Notes: bof2.txt

Organizer: Ken Birman (Cornell)

Location: Club Room

Description: This BoF will debate whether and how to scale up strong guarantees in the cloud. Practitioners are successfully building scalable cloud systems that work well practically all the time but offer no strong guarantees. Yet the SOSP community has a tradition of working with strong properties and assurances (about reliability, consistency, security, etc.) and building systems that work well and provably satisfy these properties. What's the right balance for the cloud? Is it time to rebuild the cloud with strong properties, or is it time for the SOSP community to realize that there are better ways to get to the same place?


Second Session: Tuesday, November 4, 7-8 PM

Title: Availability of PRObE: 1000 Nodes for Systems Research Experiments

Scribe Notes: bof3.docx

Organizers: Garth Gibson (chair) (CMU), Nitin Agrawal (NEC Labs), Jonathan Appavoo (Boston U.), Andree Jacobson (New Mexico Consortium), Wyatt Lloyd (Princeton), Jun Wang (U. Central Florida)

Location: Josephs Room

Description: NSF's PRObE (www.nmc-probe.org) operates four clusters to support systems research at scale. The largest is Kodiak (https://www.nmc-probe.org/wiki/Machines:Kodiak), which consists of 1000 nodes (two-core x86, 8 GB DRAM, two 1 TB disks, 1 GbE and 8 Gbps InfiniBand) donated by Los Alamos National Laboratory. Today Kodiak is hosting researchers from Georgia Tech, Carnegie Mellon, and Los Alamos. Princeton researchers have published results from Kodiak at the most recent NSDI (Wyatt Lloyd, "Stronger Semantics for Low-Latency Geo-Replicated Storage," NSDI 2013). Researchers from U. Central Florida, UT Austin, Georgia Tech, and Carnegie Mellon are working on the PRObE staging clusters.

PRObE resources are intended for (infrastructure) systems researchers committed to public release of their research results, typically publishing in distributed systems (e.g., OSDI or SOSP), cloud computing (e.g., SOCC), supercomputing (e.g., SC or HPDC), storage (e.g., FAST), or networking (e.g., NSDI). PRObE resources are managed by Emulab (www.emulab.org), a cluster manager for allocating physical nodes that has been in use for systems research for over a decade (Brian White et al., "An Integrated Experimental Environment for Distributed Systems and Networks," OSDI 2002). Users start by porting and demonstrating their code on a 100-node staging cluster such as Denali, built from the same Los Alamos equipment donation. With demonstrated success on a staging cluster and a compelling research goal, Kodiak can be requested and allocated, possibly exclusively, for hours to days.

A second style of large PRObE resource has recently come online. Susitna comprises 34 nodes with 64-core x86 processors, for a total of more than 2,000 x86 cores. Susitna also has NVIDIA-donated K20 GPU coprocessors with 2,496 CUDA cores each, for a total of 84,864 CUDA cores. Each node has 128 GB of DRAM, a hard disk, and an SSD, and the nodes are interconnected by 40 Gbps Ethernet, 40 Gbps InfiniBand, and 1 Gbps Ethernet.

To start using PRObE resources:

  • visit www.nmc-probe.org to learn about the resources
  • visit portal.nmc-probe.org to request a PRObE-specific Emulab account
  • have a research leader or faculty member get an account and define a project on portal.nmc-probe.org
  • use the Portal to get onto Denali: allocate a single-node experiment, log in to that node to customize and resave the OS image for your project, then launch a multi-node experiment to demonstrate your system at less than 100-node scale
  • use https://www.nmc-probe.org/request/ to request a large allocation on Kodiak (this is a HotCRP paper-review web site, where your "paper" is a short justification of your research, your preparedness for using Kodiak, and your credentials and appropriateness for using NSF resources)
  • PRObE managers will review, approve and schedule your use of large allocations of Kodiak time

NSF PRObE resources will be available for at least the next two years. All users of PRObE resources are obligated to publish their results, either in conferences or on their web sites, and to acknowledge the NSF PRObE resources used in those publications. See also our PRObE introduction article in USENIX ;login:, vol. 38, no. 3, June 2013 (www.cs.cmu.edu/~garth/papers/07_gibson_036-039_final.pdf).


Title: Cloud-based Network Services, Resilience, Virtual Machines, TPMs, and Key Management

Scribe Notes: bof4.txt

Organizers: Robert Broberg (Cisco), Herbert Bos (VU University Amsterdam), Chase Cotton (U. Delaware), Robbert van Renesse (Cornell), Jonathan Smith (U. Pennsylvania), Stacy Cannady (Digital Management Inc)

Location: Club Room

Description: Security for cloud services is an area ripe for research and development. Most current security and network services assume a closed network for the deployed application. As these services become more heterogeneous, one might assume that trust must be anchored in some basic root and propagated through an underlying infrastructure that mitigates competition for resources, misconfiguration, and "bad" behaviors, while also providing resilience and forensics. TCG, TPMs, vTPMs, and TNC are thrusts that have been working in this space for roughly ten years with varying degrees of success. We will discuss these efforts, how they relate to protected VMs, containers, etc., and how they might provide a flexible and secure system that lets non-infrastructure-aware applications run in a "safe" environment. Issues such as key management and distribution will be critical.

