Modern applications are increasingly generating and persisting data across geo-distributed data centers or edge clusters rather than a single cloud. This paradigm introduces challenges for traditional query execution due to increased latency when transferring data over wide-area network links. Join queries in particular are heavily affected, due to their large output sizes and the amount of data that must be shuffled over the network. Join sampling---computing a uniform sample from the join results---is a useful technique for reducing resource requirements. However, applying it to a geo-distributed setting is challenging, since acquiring independent samples from each location and joining the samples does not produce uniform and independent tuples from the join result. To address these challenges, we first generalize an existing join sampling algorithm to the geo-distributed setting. We then present our system, Plexus, which introduces three additional optimizations to further reduce the network overhead and handle network and data heterogeneity: (i) weight approximation, (ii) heterogeneity awareness, and (iii) sample prefetching. We evaluate Plexus on a geo-distributed system deployed across multiple AWS regions, with an implementation based on Apache Spark. Using three real-world datasets, we show that Plexus can reduce query latency by up to 80% over the default Spark join implementation on a wide class of join queries without substantially impacting sample uniformity.
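One standard way to obtain uniform samples of a join result is weighted sampling by fanout. The sketch below illustrates that textbook idea for two in-memory relations; it is a simplified illustration, not Plexus's geo-distributed algorithm, and it ignores the weight approximation, heterogeneity awareness, and prefetching optimizations described above.

```python
import random
from collections import defaultdict

def sample_join(R, S, key_r, key_s, k):
    """Draw k uniform samples from R JOIN S (two-relation case).

    Each tuple r in R is weighted by its fanout |{s in S : s[key_s] == r[key_r]}|,
    so every join result (r, s) is equally likely to be drawn.
    """
    index = defaultdict(list)
    for s in S:
        index[s[key_s]].append(s)

    weights = [len(index[r[key_r]]) for r in R]                  # fanout of each r
    candidates = [r for r, w in zip(R, weights) if w > 0]
    weights = [w for w in weights if w > 0]

    samples = []
    for _ in range(k):
        r = random.choices(candidates, weights=weights, k=1)[0]  # pick r proportional to fanout
        s = random.choice(index[r[key_r]])                       # then a matching s uniformly
        samples.append((r, s))
    return samples

# Usage: R and S are lists of dicts, e.g. sample_join(R, S, "id", "r_id", 100)
```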
To reduce their environmental impact, cloud datacenters are increasingly focused on optimizing applications' carbon-efficiency, or work done per mass of carbon emitted. To facilitate such optimizations, we present Carbon Containers, a simple system-level facility, which extends prior work on power containers, that automatically regulates applications' carbon emissions in response to variations in both their workload's intensity and their energy's carbon-intensity. Specifically, Carbon Containers enable applications to specify a maximum carbon emissions rate (in g·CO2e/hr), and then transparently enforce this rate via a combination of vertical scaling, container migration, and suspend/resume while maximizing either energy-efficiency or performance.
Carbon Containers are especially useful for applications that i) must continue running even during high-carbon periods, and ii) execute in regions with few variations in carbon-intensity. These low-variability regions also tend to have high average carbon-intensity, which increases the importance of regulating carbon emissions. We implement a Carbon Container prototype by extending Linux Containers to incorporate the mechanisms above and evaluate it using real workload traces and carbon-intensity data from multiple regions. We compare Carbon Containers with prior work that regulates carbon emissions by suspending/resuming applications during high/low carbon periods. We show that Carbon Containers are more carbon-efficient and improve performance while maintaining similar carbon emissions.
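As a rough illustration of the enforcement idea: a container's emission rate is its power draw times the carbon intensity of its energy, and a control loop can shrink its CPU allocation, and ultimately suspend it, whenever that rate exceeds the cap. The sketch below is a toy version under assumed callbacks (read_power, read_intensity, scale_cpu, suspend, and resume are hypothetical hooks), not the Carbon Containers implementation.

```python
import time

def carbon_rate_g_per_hr(power_watts, intensity_g_per_kwh):
    # power (W) * intensity (gCO2e/kWh) / 1000 = gCO2e per hour
    return power_watts * intensity_g_per_kwh / 1000.0

def enforce(cap_g_per_hr, read_power, read_intensity, scale_cpu, suspend, resume,
            min_share=0.1, step=0.1, period_s=60):
    """Toy control loop: shrink the container's CPU share while the emission
    rate exceeds the cap, grow it back when there is headroom, and fall back
    to suspend/resume at the minimum share. All callbacks are hypothetical."""
    share, suspended = 1.0, False
    while True:
        rate = carbon_rate_g_per_hr(read_power(), read_intensity())
        if rate > cap_g_per_hr:
            if share > min_share:
                share = max(min_share, share - step)   # vertical scaling down
                scale_cpu(share)
            elif not suspended:
                suspend()                              # last resort: pause the container
                suspended = True
        elif rate < 0.8 * cap_g_per_hr:                # headroom: grow back
            if suspended:
                resume()
                suspended = False
            share = min(1.0, share + step)
            scale_cpu(share)
        time.sleep(period_s)
```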
This paper introduces Golgi, a novel scheduling system designed for serverless functions, with the goal of minimizing resource provisioning costs while meeting function latency requirements. To achieve this, Golgi judiciously over-commits functions based on their past resource usage. To ensure overcommitment does not cause significant performance degradation, Golgi identifies nine low-level metrics to capture the runtime performance of functions, encompassing factors like request load, resource allocation, and contention on shared resources. These metrics enable accurate prediction of function performance using the Mondrian Forest, a classification model that is continuously updated in real time for optimal accuracy without extensive offline training. Golgi employs a conservative exploration-exploitation strategy for request routing. By default, it routes requests to non-overcommitted instances to ensure satisfactory performance. However, it actively explores opportunities for using more resource-efficient overcommitted instances, while maintaining the specified latency SLOs. Golgi also performs vertical scaling to dynamically adjust the concurrency of overcommitted instances, maximizing request throughput and enhancing system robustness to prediction errors. We have prototyped Golgi and evaluated it in both an EC2 cluster and a small production cluster. The results show that Golgi can meet the SLOs while reducing the resource provisioning cost by 42% (30%) in the EC2 cluster (our production cluster).
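A minimal sketch of the conservative exploration-exploitation routing described above might look as follows; predict_meets_slo stands in for Golgi's Mondrian Forest classifier over the nine runtime metrics, and the instance objects with an outstanding-request counter are hypothetical.

```python
import random

def route(request, safe_instances, overcommitted, predict_meets_slo, explore_prob=0.05):
    """Conservative exploration: prefer non-overcommitted instances, but
    occasionally try an overcommitted one when the model predicts it will
    still meet the latency SLO. predict_meets_slo(inst, request) is a stand-in
    for a continuously updated classifier."""
    if overcommitted and random.random() < explore_prob:
        candidates = [i for i in overcommitted if predict_meets_slo(i, request)]
        if candidates:
            return min(candidates, key=lambda i: i.outstanding)  # least-loaded overcommitted
    return min(safe_instances, key=lambda i: i.outstanding)      # default: safe instance
```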
The advances in virtualization technologies have sparked a growing transition from virtual machine (VM)-based to container-based infrastructure for cloud computing. From the resource orchestration perspective, containers' lightweight and highly configurable nature not only enables opportunities for more optimized strategies, but also poses greater challenges due to additional uncertainties and a larger configuration parameter search space. Towards this end, we propose Drone, a resource orchestration framework that adaptively configures resource parameters to improve application performance and reduce operational cost in the presence of cloud uncertainties. Built on Contextual Bandit techniques, Drone is able to achieve a balance between performance and resource cost on public clouds, and to optimize performance on private clouds where a hard resource constraint is present. We show that our algorithms achieve sub-linear growth in cumulative regret, a theoretically sound convergence guarantee, and our extensive experiments show that Drone achieves up to a 45% performance improvement and a 20% resource footprint reduction across batch processing jobs and microservice workloads.
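The abstract above does not spell out Drone's algorithms; as a hedged illustration of the contextual-bandit machinery it builds on, the sketch below implements textbook LinUCB, where each arm is a candidate resource configuration, the context vector encodes workload features, and the reward trades off performance against cost.

```python
import numpy as np

class LinUCB:
    """Textbook LinUCB: one ridge-regression model per arm (candidate
    configuration), with an upper-confidence exploration bonus."""
    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]      # X^T X + I per arm
        self.b = [np.zeros(dim) for _ in range(n_arms)]    # X^T rewards per arm

    def choose(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                              # ridge estimate
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)    # exploration bonus
            scores.append(theta @ x + bonus)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# Usage sketch: x encodes workload context; reward trades off performance vs cost.
# bandit = LinUCB(n_arms=len(configs), dim=8)
# arm = bandit.choose(x); deploy(configs[arm]); bandit.update(arm, x, observed_reward)
```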
State-of-the-art consensus protocols like Paxos reveal the values being agreed upon to all nodes, but some deployment scenarios involving a subset of nodes outsourced to public cloud providers motivate hiding the value. In this work, we present the primary-backup secret-shared state machine (PBSSM) architecture and an underlying consensus protocol Oblivious Paxos (OPaxos) that enable strong consistency, high availability, privacy, and fast common-case performance. OPaxos enables privacy-preserving consensus by allowing acceptors to safely agree on a secret-shared value without untrusted acceptors knowing the value. We also present Fast Oblivious Paxos (Fast-OPaxos), which enables consensus over secret-shares in three one-way delays under low concurrency settings. Our prototype-driven microbenchmarks and smarthome case study show that OPaxos induces a negligible latency overhead of at most 0.1 ms compared to Paxos while maintaining more than 85% of Paxos' capacity for small requests, and can provide lower latency and higher capacity compared to Paxos for large request sizes.
Function as a Service (FaaS) and the associated serverless computing paradigm relieve users of resource management and allow cloud platforms to optimize system infrastructure under the hood. Despite significant advances, FaaS infrastructure still leaves much room to improve performance and resource efficiency. We argue that both higher performance and resource efficiency are possible --- while maintaining secure isolation --- if we are willing to revisit the FaaS programming model and system software design. We propose Dandelion, a clean-slate FaaS system that rethinks the programming model by treating serverless functions as pure functions, thereby explicitly separating computation and I/O. This new programming model enables a lightweight yet secure function execution system. It also makes functions more amenable to hardware acceleration and enables dataflow-aware function orchestration. Our initial prototype of Dandelion achieves 45× lower tail latency for cold starts compared to Firecracker. For 95% hot function invocations, Dandelion achieves 5× higher peak throughput.
To improve resource utilization and reduce costs, many cloud providers adopt virtual machine (VM) overcommitment. While effective, this strategy may lead to adverse outcomes, significantly affecting a VM's IO performance when one virtual CPU (vCPU) is preempted by another vCPU within the same runqueue of the VM scheduler -- i.e., the same physical CPU (pCPU). Additionally, the responsiveness of a VM is reduced during the inactive time of a vCPU, since it takes an extra scheduling timeslice to react to any IO event. While such problems have been studied in academia and industry, no previous solution has been deployed in production. This is because, for example, certain solutions require modifications to the guest VM, which conflicts with industry requirements.
We propose Anubis, a new IO-aware VM scheduler targeting Linux KVM, the most popular VMM in today's clouds, that requires no guest VM modifications. Anubis shortens IO event pending time through lightweight monitoring of IO events, including interrupt delivery and KVM exits. For a vCPU running IO activity, Anubis provides an accurate boost that is active only while the vCPU has IO activity. While IO performance is maximized, Anubis still guarantees fairness among VMs: vCPUs of the same VM that have no IO activity voluntarily yield computing resources to counterbalance the unfairness created by the boosted vCPU. Overall, Anubis is a practical solution that provides close-to-non-overcommit performance for IO workloads in VM overcommitment scenarios.
With the rapid development of cloud computing, the increasing scale of clusters and degree of task parallelism place higher demands on scheduling capability at scale. To this end, the shared-state scheduler architecture has emerged as the popular solution for large-scale scheduling due to its high scalability and utilization. In such an architecture, a central resource state view periodically pushes the global cluster status to distributed schedulers for parallel scheduling. However, the schedulers obtain broader resource views at the cost of intermittently stale states, rendering newly released resources invisible to schedulers until the next view update. We refer to these fleeting resource fragments as shadow resources in this paper. Current shared-state solutions overlook or fail to systematically utilize shadow resources, leaving a void in fully exploiting these invisible resources.
In this paper, we present a thorough analysis of shadow resources through theoretical modeling and extensive experiments. In order to systematically utilize these resources, we propose Resource Miner (RMiner), a hybrid scheduling sub-system on top of the shared-state scheduler architecture. RMiner comprises three cooperative components: a shadow resource manager that efficiently manages shadow resources, an RM filter that selects suitable tasks as RM tasks, and an RM scheduler that allocates shadow resources to RM tasks. Overall, our design enhances the visibility of shared-state scheduling solutions by adding manageability to invisible resources. Through extensive trace-driven evaluation, we show that RMiner greatly outperforms current shared-state schedulers in terms of resource utilization, task throughput, and job wait time with only minor costs of scheduling conflicts and overhead.
Federated learning (FL) is an emerging machine learning (ML) paradigm that enables heterogeneous edge devices to collaboratively train ML models without revealing their raw data to a logically centralized server. However, beyond the heterogeneous device capacity, FL participants often exhibit differences in their data distributions, which are not independent and identically distributed (Non-IID). Many existing works present point solutions to address issues like slow convergence, low final accuracy, and bias in FL, all stemming from client heterogeneity.
In this paper, we explore an additional layer of complexity to mitigate such heterogeneity by grouping clients with statistically similar data distributions (cohorts). We propose Auxo to gradually identify such cohorts in large-scale, low-availability, and resource-constrained FL populations. Auxo then adaptively determines how to train cohort-specific models in order to achieve better model performance and ensure resource efficiency. Our extensive evaluations show that, by identifying cohorts with smaller heterogeneity and performing efficient cohort-based training, Auxo boosts various existing FL solutions in terms of final accuracy (2.1%--8.2%), convergence time (up to 2.2×), and model bias (4.8%--53.8%).
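As a toy illustration of the cohort idea (not Auxo's gradual, availability-aware algorithm), one could cluster clients whose label distributions are similar. The sketch below assumes each client reports a label histogram, which real FL deployments typically cannot expose directly for privacy reasons.

```python
import numpy as np
from sklearn.cluster import KMeans

def form_cohorts(client_label_dists, n_cohorts):
    """Toy cohort formation: cluster clients whose label distributions are
    similar. This one-shot k-means over label histograms only illustrates
    the goal of grouping statistically similar clients."""
    X = np.asarray(client_label_dists)            # shape: (n_clients, n_classes)
    labels = KMeans(n_clusters=n_cohorts, n_init=10, random_state=0).fit_predict(X)
    return [np.where(labels == c)[0].tolist() for c in range(n_cohorts)]

# Usage: form_cohorts([[0.9, 0.1], [0.85, 0.15], [0.1, 0.9]], n_cohorts=2)
# -> e.g. [[0, 1], [2]]: one cohort per group of similar clients
```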
Service meshes play a central role in the modern application ecosystem by providing an easy and flexible way to connect microservices of a distributed application. However, because of how they interpose on application traffic, they can substantially increase application latency and its resource consumption. We develop a tool called MeshInsight to help developers quantify the overhead of service meshes in deployment scenarios of interest and make informed trade-offs about their functionality vs. overhead. Using MeshInsight, we confirm that service meshes can have high overhead---up to 269% higher latency and up to 163% more virtual CPU cores for our benchmark applications---but the severity is intimately tied to how they are configured and the application workload. IPC (inter-process communication) and socket writes dominate when the service mesh operates as a TCP proxy, but protocol parsing dominates when it operates as an HTTP proxy. MeshInsight also enables us to study the end-to-end impact of optimizations to service meshes. We show that not all seemingly-promising optimizations lead to a notable overhead reduction in realistic settings.
Deep learning inference on streaming media data, such as object detection in video or LiDAR feeds and text extraction from audio waves, is now ubiquitous. To achieve high inference accuracy, these applications typically require significant network bandwidth to gather high-fidelity data and extensive GPU resources to run deep neural networks (DNNs). While the high demand for network bandwidth and GPU resources could be substantially reduced by optimally adapting the configuration knobs, such as video resolution and frame rate, current adaptation techniques fail to meet three requirements simultaneously: adapt configurations (i) with minimum extra GPU or bandwidth overhead, (ii) to reach near-optimal decisions based on how the data affects the final DNN's accuracy, and (iii) do so for a range of configuration knobs. This paper presents OneAdapt, which meets these requirements by leveraging a gradient-ascent strategy to adapt configuration knobs. The key idea is to embrace DNNs' differentiability to quickly estimate the gradient of accuracy with respect to each configuration knob, called AccGrad. Specifically, OneAdapt estimates AccGrad by multiplying two gradients: InputGrad (i.e., how each configuration knob affects the input to the DNN) and DNNGrad (i.e., how the DNN input affects the DNN inference output). We evaluate OneAdapt across five types of configurations, four analytic tasks, and five types of input data. Compared to state-of-the-art adaptation schemes, OneAdapt cuts bandwidth usage and GPU usage by 15-59% while maintaining comparable accuracy or improves accuracy by 1-5% while using equal or fewer resources.
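The decomposition above is essentially the chain rule: for a scalar knob k, AccGrad_k = InputGrad_k · DNNGrad, the knob's effect on the DNN input times the input's effect on an accuracy proxy. A hedged numeric sketch of one gradient-ascent step, with the rendering function and the DNN-gradient callback as stand-ins, is:

```python
import numpy as np

def acc_grad(knob_value, render_input, dnn_grad_fn, eps=1e-2):
    """Estimate AccGrad = InputGrad * DNNGrad for one scalar knob.

    render_input(knob) -> DNN input tensor (e.g., a resized frame), and
    dnn_grad_fn(x) -> d(accuracy proxy)/d(input); both are stand-ins for
    what a real system would obtain from its pipeline and the DNN's autodiff."""
    x0 = render_input(knob_value)
    x1 = render_input(knob_value + eps)
    input_grad = (x1 - x0) / eps          # finite-difference InputGrad
    dnn_grad = dnn_grad_fn(x0)            # DNNGrad at the current input
    return float(np.sum(input_grad * dnn_grad))

def adapt_step(knob_value, render_input, dnn_grad_fn, lr=0.1, lo=0.0, hi=1.0):
    """One gradient-ascent update of the knob, clipped to its valid range."""
    g = acc_grad(knob_value, render_input, dnn_grad_fn)
    return float(np.clip(knob_value + lr * g, lo, hi))
```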
Serverless computing is a new paradigm that aims to remove the burdens of cloud management from developers. Yet rightsizing serverless functions remains a pain point for developers. Choosing the right memory configuration is necessary to ensure cost and/or performance optimality for serverless workloads. In this work, we identify that using parametric regression can significantly simplify function rightsizing compared to black-box optimization techniques currently available. With this insight, we build a tool, called Parrotfish, which finds optimal configurations through an online learning process. It also allows users to communicate constraints on execution time, or to relax cost optimality to gain performance. Parrotfish achieves substantially lower exploration costs (1.81-9.96×) compared with the state-of-the-art tools, while delivering similar or better recommendations.
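As a hedged sketch of the parametric-regression idea (the exact model family and exploration policy in Parrotfish may differ), one can fit duration(m) ≈ a/m + b from a handful of sampled memory sizes and then pick the memory size that minimizes expected cost under Lambda-style GB-second pricing:

```python
import numpy as np

def fit_duration_model(mem_mb, duration_s):
    """Fit duration(m) ~ a/m + b by least squares -- one plausible parametric
    form for how function duration shrinks as memory (and CPU) grows."""
    A = np.column_stack([1.0 / np.asarray(mem_mb, float), np.ones(len(mem_mb))])
    (a, b), *_ = np.linalg.lstsq(A, np.asarray(duration_s, float), rcond=None)
    return a, b

def recommend_memory(samples, candidates_mb, price_per_gb_s=1.67e-5):
    """Pick the candidate memory size minimizing predicted cost per invocation.
    price_per_gb_s is an illustrative AWS-style GB-second price."""
    mem, dur = zip(*samples)                      # a handful of measured runs
    a, b = fit_duration_model(mem, dur)
    def cost(m):
        return (a / m + b) * (m / 1024.0) * price_per_gb_s
    return min(candidates_mb, key=cost)

# Usage: recommend_memory([(128, 4.1), (512, 1.2), (1024, 0.7)],
#                         candidates_mb=range(128, 3009, 64))
```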
Existing time-series anomaly detection (AD) pipelines for cloud monitoring at scale commonly rely on isolated training per cloud service or cloud infrastructure component. However, with the increasing volume of data generated from thousands of services and components, there is an untapped opportunity for a more effective approach to detect key performance indicator (KPI) anomalies by capitalizing on the abundance of data available. In this paper, we propose MADDoC, an unsupervised transfer learning framework for reconstruction-based anomaly detection on multivariate time-series data. We show how to efficiently leverage available KPIs in the realm of cloud infrastructure monitoring to generalize unsupervised time-series AD across infrastructure components. Compared to state-of-the-art approaches relying on isolated component-wise training, the MADDoC framework achieves superior Precision and F1 scores on public and internal time-series AD datasets, by learning a strong reconstruction backbone on the time-series data across many components before fine-tuning to a specific component. Moreover, MADDoC achieves substantial cost savings in model training, with reductions of 60% to 75% when monitoring thousands of storage infrastructure components. Further, the framework overcomes the trade-off between training efficiency and AD performance of previous AD transfer learning approaches.
Decision forests, including RandomForest, XGBoost, and LightGBM, dominate machine learning tasks over tabular data. Recently, several frameworks were developed for decision forest inference, such as ONNX, TreeLite from Amazon, TensorFlow Decision Forest from Google, HummingBird from Microsoft, Nvidia FIL, and lleaves. While these frameworks are fully optimized for inference computations, they are all decoupled from databases and general data management frameworks, which leads to cross-system performance overheads. We first provide a DICT model to understand the performance gaps between decoupled and in-database inference. We further identify that for in-database inference, in addition to the popular UDF-centric representation that encapsulates the ML model in one User Defined Function (UDF), there also exists a relation-centric representation that breaks down decision forest inference into several fine-grained SQL operations. The relation-centric representation can achieve significantly better performance for large models. We optimize both implementations and conduct a comprehensive benchmark comparing them with the aforementioned decoupled inference pipelines and existing in-database inference pipelines such as Spark-SQL and PostgresML. The evaluation results validate the DICT model and demonstrate the superior performance of our in-database inference design compared to the baselines.
Technological advancements in the past decades have substantially increased the capacity and performance of Solid State Drives (SSDs). Provisioning such high-capacity SSDs among tenants can reap multiple benefits, such as elevated performance, efficient resource utilization, and cost savings through reduced Total Cost of Ownership. However, workloads perform poorly when co-located with others on the same SSD due to IO interference, potentially violating Service Level Objectives (SLOs). High overprovisioning can address the SLO issue; however, it entails low utilization. Prior works proposed Machine Learning (ML) techniques to predict SSD performance in the presence of interfering tenants for optimizing workload placement. However, we find that these works suffer from two notable limitations. First, previous ML models do not capture interference impact due to the non-uniform workload characteristics and SSD internals. Second, they fail to compute the interference of an arbitrary number of workloads due to a lack of feature aggregation. As a result, these works still offer low utilization and can only enforce weak SLOs. To address these limitations, we propose a gray-box feature representation and aggregation technique to capture the IO interference impact of multiple non-uniform workloads based on internal SSD characteristics. Our technique improves prediction accuracy by 12× (lower mean absolute error) over prior works, resulting in up to 60% higher resource utilization or enforcing up to 2.5× stricter SLOs.
AMD's Secure Encrypted Virtualization (SEV) is a hardware-based Trusted Execution Environment (TEE) designed to secure tenants' data on the cloud, even against insider threats. The latest version of SEV, SEV-Secure Nested Paging (SEV-SNP), offers protection against most well-known attacks such as cold boot and hypervisor-based attacks. However, it remains susceptible to a specific type of attack known as Active DRAM Corruption (ADC), where attackers manipulate memory content using specially crafted memory devices. The in-memory key-value store (KVS) on SEV is a prime target for ADC attacks due to its critical role in cloud infrastructure and the predictability of its data structures. To counter this threat, we propose KVSEV, an in-memory KVS resilient to ADC attacks. KVSEV leverages SNP's Virtual Machine Management (VMM) and attestation mechanism to protect the integrity of key-value pairs, thereby securing the KVS from ADC attacks. Our evaluation shows that KVSEV secures in-memory KVSs on SEV with a performance overhead comparable to other secure in-memory KVS solutions.
Trusted execution environments (TEEs) have been proposed to protect GPU computation for machine learning applications operating on sensitive data. However, existing GPU TEE solutions either require CPU and/or GPU hardware modification to realize TEEs for GPUs, which prevents current systems from adopting them, or rely on untrusted system software such as GPU device drivers. In this paper, we propose using CPU secure enclaves, e.g., Intel SGX, to build GPU TEEs without modifications to existing hardware. To tackle the fundamental limitations of these enclaves, such as no support for I/O operations, we design and develop GEVisor, a formally verified security reference monitor software to enable a trusted I/O path between enclaves and GPU without trusting the GPU device driver. GEVisor operates in the Virtual Machine Extension (VMX) root mode, monitors the host system software to prevent unauthorized access to the GPU code and data outside the enclave, and isolates the enclave GPU context from other contexts during GPU computation. We implement and evaluate GEVisor on a commodity machine with an Intel SGX CPU and an NVIDIA Pascal GPU. Our experimental results show that our approach maintains an average overhead of 13.1% for deep learning and 18% for GPU benchmarks compared to native GPU computation while providing GPU TEEs for existing CPU and GPU hardware.
Cloud vendors are now providing cloud gaming services with GPUs. GPUs in cloud gaming experience idle periods because not every frame in a game keeps the GPU busy with rendering. Previous works temporally co-locate games with best-effort (BE) applications to harvest these idle cycles. However, these works ignore the spatial sharing of GPUs, limiting the achievable throughput improvement. The newly introduced RT (ray tracing) Cores inside GPU SMs for ray tracing exacerbate the situation.
This paper presents Combo, which efficiently leverages two-level spatial sharing: intra-SM and inter-SM sharing, for throughput improvement while guaranteeing the QoS of rendering games' frames. Combo is novel in two ways. First, based on the investigation of programming models for RT Cores, Combo devises a neat compilation method to convert the kernels that use RT Cores for fine-grained resource management. We utilize the fine-grained kernel management to construct spatial sharing schemes. Second, since the performance of spatial sharing varies with the actual co-located kernels, two efficient spatial sharing schemes are proposed: exact integrated SM sharing and relaxed intra-SM sharing. In order to maximize the throughput of BE applications, Combo identifies the best-fit scenarios for these two schemes by considering runtime rendering load. Our evaluation shows Combo can achieve up to 38.2% (14.0% on average) throughput improvement compared with the state-of-the-art temporal-only solution.
The scale of deep learning models has grown tremendously in recent years. State-of-the-art models have reached billions of parameters and terabyte-scale model sizes. Training these models demands memory bandwidth and capacity that can only be accommodated distributively over hundreds to thousands of GPUs. However, large-scale distributed training suffers from GPU memory inefficiency, such as memory under-utilization and out-of-memory events (OOMs). There is a lack of understanding of the actual GPU memory behavior of distributed training on terabyte-size models, which hinders the development of effective solutions to such inefficiency. In this paper, we present a systematic analysis of the GPU memory behavior of large-scale distributed training jobs in production at Meta. Our analysis is based on offline training jobs of multi-terabyte Deep Learning Recommendation Models from one of Meta's largest production clusters. We measure GPU memory inefficiency, characterize GPU memory utilization, and provide fine-grained GPU memory usage analysis. We further show how to build on this understanding to develop a practical GPU provisioning system in production.
We present a vision for the future of an emerging category of cloud service: the metaverse of 3D virtual worlds. Today, hundreds of millions of users are active daily in such worlds, but they are partitioned into small groups of at most a few hundred players. Each group joins a different virtual world instance, and players can only interact in 3D with other players in the same group during that session. Current platforms are designed in ways that simply cannot scale much further, and solutions from other cloud services do not generalize to the more interactive, bidirectional, and latency-sensitive 3D domain. We outline some of the technical challenges that currently stand in the way of a metaverse without inherent technical limitations on the number of users in a shared experience. We argue that, although these obviously touch on many other areas of Computer Science such as computer graphics and numerical simulation, the core challenges lie squarely within the systems domain.
Over the last few years, at ByteDance, our compute infrastructure scale has been expanding significantly due to expedited business growth. In this journey, to meet hyper-scale growth, some business groups resorted to managing their own compute infrastructure stacks running different scheduling systems such as Kubernetes and YARN, which created two major pain points: the increasing resource fragmentation across different business groups and the inadequate resource elasticity between workloads of different business priorities. Isolation across different business groups (and their compute infrastructure management) leads to inefficient compute resource utilization and prevents us from serving the business growth needs in the long run.
To meet these challenges, we propose a resource management and scheduling system named Gödel, which provides a unified compute infrastructure for all business groups to run their diverse workloads under a unified resource pool. It co-locates various workloads on every machine to achieve better resource utilization and elasticity. Gödel is built upon Kubernetes, the de facto open-source container orchestration system, but with significant components replaced or enhanced to accommodate various workloads at a large scale. In production, it manages clusters with tens of thousands of machines, achieves a high overall resource utilization of over 60%, and sustains a scheduling throughput of up to 5000 pods per second. This paper reports on our design and implementation of Gödel. Moreover, it discusses the lessons and best practices we learned in developing and operating it in production at ByteDance's scale.
Recent advances in deep learning (DL) have spawned various intelligent cloud services with well-trained DL models. Nevertheless, it is nontrivial to maintain the desired end-to-end latency under bursty workloads, raising critical challenges for building high-performance yet resource-efficient inference services. To handle burstiness, some inference services have migrated to the serverless paradigm for its rapid elasticity. However, they neglect the impact of the time-consuming and resource-hungry model-loading process when scaling out function instances, leading to considerable resource inefficiency when maintaining high performance under burstiness.
To address the issue, we open up the black box of DL models and find an interesting phenomenon that the sensitivity of each layer to the computing resources is mostly anti-correlated with its memory resource usage. Motivated by this, we propose asymmetric functions, where the original Body Function still loads a complete model to meet stable demands, while the proposed lightweight Shadow Function only loads a portion of resource-sensitive layers to deal with sudden demands effortlessly. By parallelizing computations on resource-sensitive layers, the surging demand can be well satisfied, though the rest of the layers are performed serially in Body Functions only. We implement asymmetric functions on top of Knative and build a high-performance and resource-efficient inference serving system named AsyFunc with a new auto-scaling and scheduling engine. Evaluation results driven by production traces show that compared with the state of the art, AsyFunc saves computing and memory resources by up to 31.1% and 32.5%, respectively, while providing consistent performance guarantees under burstiness.
Distributed machine learning approaches, including a broad class of federated learning (FL) techniques, present a number of benefits when deploying machine learning applications over widely distributed infrastructures. The benefits are highly dependent on the details of the underlying machine learning topology, which specifies the functionality executed by the participating nodes, their dependencies and interconnections. Current systems lack the flexibility and extensibility necessary to customize the topology of a machine learning deployment. We present Flame, a new system that provides the flexibility to configure the topology of distributed FL applications around the specifics of a particular deployment context, and is easily extensible to support new FL architectures. Flame achieves this via a new high-level abstraction, Topology Abstraction Graphs (TAGs). TAGs decouple the ML application logic from the underlying deployment details, making it possible to specialize the application deployment with reduced development effort. Flame is released as an open source project; its flexibility and extensibility support a variety of topologies and mechanisms, and can facilitate the development of new FL methodologies.
Machine learning (ML) computations commonly execute on expensive specialized hardware, such as GPUs and TPUs, which provide high FLOPs and performance-per-watt. For cost efficiency, it is essential to keep these accelerators highly utilized. This requires preprocessing input data at the rate at which the accelerators can ingest and perform ML computations on the data. The amount of host CPU and RAM needed per accelerator core to avoid data stalls varies across jobs, so the traditional approach of processing input data on ML accelerator hosts with a fixed hardware ratio leads to either under-utilizing the accelerators or the host CPU and RAM. In this paper, we address these concerns by building a disaggregated ML data processing system.
We present tf.data service, an open-source disaggregated input data processing service built on top of tf.data in TensorFlow. We show that disaggregating data preprocessing has three key advantages for large-scale ML training jobs. First, the service can scale out horizontally to right-size CPU/RAM host resources for data processing in each job, reducing training time by 32× and cost by 26×, on average. Second, the service can share ephemeral preprocessed data results across jobs, to optimize CPU usage and reduce redundant computations. Finally, the service supports coordinated reads, a technique that avoids stragglers due to different input sizes in distributed training, reducing training time by 2.2×, on average. Our design is inspired by lessons learned from deploying tf.data service in production, including relaxing data visitation guarantees without impacting model accuracy.
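Concretely, offloading a job's input pipeline to disaggregated workers with the open-source tf.data service API looks roughly like the following; the dispatcher address and the parse_and_augment function are placeholders.

```python
import tensorflow as tf

DISPATCHER = "grpc://tfdata-dispatcher.example.com:5000"   # hypothetical address

def parse_and_augment(record):                              # placeholder preprocessing
    return tf.io.parse_tensor(record, out_type=tf.float32)

dataset = tf.data.Dataset.list_files("/data/train/*.tfrecord")
dataset = dataset.interleave(tf.data.TFRecordDataset,
                             num_parallel_calls=tf.data.AUTOTUNE)
dataset = dataset.map(parse_and_augment, num_parallel_calls=tf.data.AUTOTUNE)
dataset = dataset.batch(256)

# Everything above runs on remote tf.data service workers, which can be
# scaled out independently of the GPU/TPU hosts; consumers that share a
# job_name read from (and thus share) the same preprocessed stream.
dataset = dataset.apply(tf.data.experimental.service.distribute(
    processing_mode="distributed_epoch",
    service=DISPATCHER,
    job_name="resnet_train"))
dataset = dataset.prefetch(tf.data.AUTOTUNE)
```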
Main memory dominates data center server cost, and hence data center operators are exploring alternative technologies such as CXL-attached and persistent memory to improve cost without jeopardizing performance. Introducing multiple tiers of memory introduces new challenges, such as selecting the appropriate memory configuration for a given workload mix. In particular, we observe that inefficient configurations increase cost by up to 2.6× for clients, and resource stranding increases cost by 2.2× for cloud operators. To address this challenge, we introduce TMC, a system for recommending cloud configurations according to workload characteristics and the dynamic resource utilization of a cluster. Whereas prior work utilized extensive simulation or costly machine learning techniques, incurring significant search costs, our approach profiles applications to reveal internal properties that lead to fast and accurate performance estimations. Our novel configuration-selection algorithm incorporates a new heuristic, packing penalty, to ensure that recommended configurations will also achieve good resource efficiency. Our experiments demonstrate that TMC reduces the search cost by up to 4× over the state-of-the-art, while improving resource utilization by up to 17% as compared to a naive policy that requests optimal tiered memory allocations in isolation.
Unexpectedly long query latency in a database system can cause domino effects on all upstream services and severely degrade end users' experience with long, unpredictable waits, resulting in an increasing number of users disengaging from the services and thus a high user disengagement ratio (UDR). A high UDR usually translates to reduced revenue for service providers. This paper proposes UTSLO, a UDR-oriented SLO guaranteed system, which enables a database system to support multi-tenant UDR targets in a cost-effective fashion through UDR-oriented capacity planning and dynamic UDR target enforcement. The former aims to estimate the feasibility of UDR targets while the latter dynamically tracks and regulates the per-connection query latency distribution needed for accurate UDR target guarantees. In UTSLO, the database service capacity can be fully exploited to efficiently accommodate tenants while minimizing the resources required for UDR target guarantees.
Our analysis of a large public cloud ML training service shows that resources remain unused, likely because users statically (over-)allocate resources for their jobs given a desire for predictable performance, and state-of-the-art schedulers do not exploit idle resources lest they slow down some jobs excessively. We consider whether an anticipatory scheduler, which schedules based on predictions of future job arrivals and durations, can improve over the state of the art. We find that realizing gains from anticipation requires dealing effectively with prediction errors, and even the best predictors have errors that do not conform to simple models (such as bounded or i.i.d. error). We devise a novel anticipatory scheduler called SIA that is robust to such errors. On real workloads, SIA reduces job latency by an average of 2.83× over the current production scheduler, while reducing the likelihood of job slowdowns by orders of magnitude relative to schedulers that naïvely share resources.
Modern web-facing applications such as e-commerce comprise tens or hundreds of distributed and loosely coupled microservices that promise to facilitate high scalability. While hardware resource scaling approaches [28] have been proposed to address response time fluctuations in critical microservices, little attention has been given to the scaling of soft resources (e.g., threads or database connections), which control hardware resource concurrency. This paper demonstrates that optimal soft resource allocation for critical microservices significantly impacts overall system performance, particularly response time. This suggests the need for fast and intelligent runtime reallocation of soft resources as part of microservices scaling management. We introduce μConAdapter, an intelligent and efficient framework for managing concurrency adaptation. It quickly identifies optimal soft resource allocations for critical microservices and adjusts them to mitigate violations of service-level objectives (SLOs). μConAdapter utilizes fine-grained online monitoring metrics from both the system and application levels and a Deep Q-Network (DQN) to quickly and adaptively provide optimal concurrency settings for critical microservices. Using six realistic bursty workload traces and two representative microservices-based benchmarks (SockShop and SocialNetwork), our experimental results show that μConAdapter can effectively mitigate large response time fluctuations and reduce the tail latency at the 99th percentile by 3× on average compared to hardware-only scaling strategies like Kubernetes Autoscaling and FIRM [28], and by 1.6× compared to the state-of-the-art concurrency-aware system scaling strategy ConScale [21].
This paper releases and analyzes two new Huawei cloud serverless traces. The traces span a period of over 7 months with over 1.4 trillion function invocations combined. The first trace is derived from Huawei's internal workloads and contains detailed per-second statistics for 200 functions running across multiple Huawei cloud data centers. The second trace is a representative workload from Huawei's public FaaS platform. This trace contains per-minute arrival rates for over 5000 functions running in a single Huawei data center. We present the internals of a production FaaS platform by characterizing resource consumption, cold-start times, programming languages used, periodicity, per-second versus per-minute burstiness, correlations, and popularity. Our findings show that there is considerable diversity in how serverless functions behave: requests vary by up to 9 orders of magnitude across functions, with some functions executed over 1 billion times per day; scheduling time, execution time and cold-start distributions vary across 2 to 4 orders of magnitude and have very long tails; and function invocation counts demonstrate strong periodicity for many individual functions and on an aggregate level. Our analysis also highlights the need for further research in estimating resource reservations and time-series prediction to account for the huge diversity in how serverless functions behave.
File systems that store metadata on a single machine or via a shared-disk abstraction face scalability challenges, especially in contexts demanding the management of billions of files. Recent work has shown that employing a shared-nothing, distributed database system (DDBMS) for metadata storage can alleviate these scalability challenges without compromising on high availability guarantees. However, for low-scale deployments -- where metadata can fit in memory on a single machine -- these DDBMS-based systems typically perform an order of magnitude worse than systems that store metadata in memory on a single machine. This has limited the impact of these distributed database approaches, since they are only currently applicable to file systems of extreme scale.
This paper describes FileScale, a three-tier architecture that incorporates a DDBMS as part of a comprehensive approach to file system metadata management. In contrast to previous approaches, FileScale performs comparably to the single-machine architecture at a small scale, while enabling linear scalability as the file system metadata increases.
With the emergence of the serverless computing paradigm in the cloud, researchers have explored many challenges of serverless systems and proposed solutions such as snapshot-based booting. However, we have noticed that some of these optimizations are based on oversimplified assumptions that lead to infeasibility and hide real-world issues. This paper aims to analyze the gap between current serverless research and real-world systems from an industry perspective, and to present new observations, challenges, opportunities, and insights that may address the discrepancies.
With the increasing adoption of containerization in cloud services, container networking has become a critical concern, as it enables the agile deployment of microservices but also introduces new vulnerabilities susceptible to network attacks, posing a threat to container environments. While several security solutions have been introduced to address this concern, they unfortunately exhibit significant shortcomings, including security vulnerabilities and limited performance. We thus propose Helios, a novel hardware-based network security extension that addresses the security and performance limitations of existing solutions. Leveraging a smartNIC, Helios enhances both the security and performance facets of container networking through two key mechanisms: (i) the establishment of physically isolated container communication channels and (ii) network security engines fully offloaded to the smartNIC. Our evaluation shows that Helios mitigates various network threats initiated from both the container and host sides while performing up to 3× faster than existing solutions in container communication.
End-to-end latency estimation in web applications is crucial for system operators to foresee the effects of potential changes, helping ensure system stability, optimize cost, and improve user experience. However, estimating latency in microservices-based architectures is challenging due to the complex interactions between hundreds or thousands of loosely coupled microservices. Current approaches either track only latency-critical paths or require laborious bespoke instrumentation, which is unrealistic for end-to-end latency estimation in complex systems.
This paper presents LatenSeer, a modeling framework for estimating end-to-end latency distributions in microservice-based web applications. LatenSeer proposes novel data structures to accurately represent causal relationships between services, overcoming the drawbacks of simple dependency representations that fail to capture the complexity of microservices. LatenSeer leverages distributed tracing data to practically and accurately model end-to-end latency at scale. Our evaluation shows that LatenSeer predicts latency within 5.35% error, outperforming the state of the art, which has an error rate of more than 9.5%.
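The essence of composing per-service latency distributions, stripped of LatenSeer's data structures, is that sequential calls add latencies while parallel fan-outs are dominated by the slowest branch. A Monte Carlo sketch of that composition over empirical samples is shown below; the service structure and numbers are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def seq(*samplers):
    """Latency of services called one after another: samples add up."""
    return lambda n: sum(s(n) for s in samplers)

def par(*samplers):
    """Latency of a parallel fan-out: the slowest branch dominates."""
    return lambda n: np.maximum.reduce([s(n) for s in samplers])

def svc(samples):
    """Empirical per-service latency model drawn from trace data."""
    arr = np.asarray(samples, float)
    return lambda n: rng.choice(arr, size=n)

# frontend -> (cache in parallel with [db then ranker]); latencies in ms
end_to_end = seq(svc([2, 3, 2.5]),
                 par(svc([1, 1.2]), seq(svc([5, 7]), svc([3, 4]))))
print(np.percentile(end_to_end(100_000), 99))   # estimated p99 end-to-end latency
```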
A key to video streaming systems is knowing how sensitive quality of experience (QoE) is to quality metrics (e.g., buffering ratio and average bitrate). Conventional wisdom holds that such quality sensitivity can be profiled by offline user studies, on the assumption that QoE is equally sensitive to quality metrics everywhere within an entire genre of videos. However, recent studies show that quality sensitivity varies substantially both across videos and within a video, giving rise to a new potential for improving QoE and serving more users without using more bandwidth. Unfortunately, offline profiling cannot capture the variability of quality sensitivity within a new video (e.g., a new TV show episode or live sports event) if users join to watch it within a short time window.
This short paper makes a case for a new architecture that profiles the quality sensitivity of a video online, by gathering and analyzing QoE-related feedback (e.g., exit or skip) from actual users while the video is being streamed to them. The key component is a QoE-driven feedback loop, called SensitiFlow, run by video content providers to make adaptive-bitrate (ABR) decisions for concurrent and future video sessions. We evaluated QoE in terms of user engagement (view time) using real traces of 7.6 million video sessions from a content provider. Our preliminary results show that SensitiFlow can realize up to 80% of the improvement obtained by a hypothetical "oracle" system that knows quality sensitivity in advance. Admittedly, our evaluation is not a real deployment by a large-scale commercial content provider, but we hope our preliminary results will inspire follow-up efforts to test similar ideas at scale.
Recent research has proposed the use of trusted execution environments (TEEs), such as SGX, in serverless computing to safeguard against threats from insecure system software, malicious co-located tenants, or suspicious cloud operators. However, integrating SGX, one of the most mature TEEs, with serverless computing results in significant performance degradation due to the function startup latency caused by enclave creation. This degradation arises because SGX was not designed with serverless function startup procedures in mind, where numerous application codes, libraries, and data are re-initialized upon each function invocation. The inherent limitations of SGX contribute to significant performance degradation, whether through the addition of every page into the enclave or the restriction of page permissions, which ultimately causes TLB flushes, context switches, and enclave re-entries. In this paper, we first make key observations about the intrinsic features of serverless functions and propose Cryonics, a snapshot-based enclave serving method that accelerates function instance startup by creating a future-proof working set for the instance. We consider the page locality and obsolete pages of the enclaved function instance to create a lightweight working set used for serving requests. Our evaluation shows that Cryonics achieves up to 100× faster startup than existing cold-start-based methods while keeping startup time stable.
Robust forecasts of future resource usage in cloud computing environments enable high efficiency in resource management solutions, such as autoscaling and overcommitment policies. Production-level systems use lightweight combinations of historical information to enable practical deployments. Recently, Machine Learning (ML) models, in particular Long Short-Term Memory (LSTM) neural networks, have been proposed in various works for their improved predictive capabilities. Following this trend, we train LSTM models and observe high levels of prediction accuracy, even on unseen data. Upon meticulous visual inspection of the results, however, we notice that although the predicted values seem highly accurate, they are nothing but versions of the original data shifted by one time step into the future. Yet this simple shift is enough to produce a robust forecast, because the values are highly correlated across time. We investigate time-series data of various resource usage metrics (CPU, memory, network, disk I/O) across different cloud providers and levels, such as the physical or virtual machine level and the application job level. We observe that resource utilization displays very small variations in consecutive time steps. This insight means that very simple methods, such as data shifts, can be used for cloud resource forecasting and deliver highly accurate predictions, which leads us to ask whether complex machine learning models are even necessary. We envision that practical resource management systems should first identify the extent to which simple solutions are effective, and resort to machine learning only to the extent that enables its practical use.
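The "data shift" the authors observe is exactly a one-step persistence forecast; a minimal sketch of this baseline, which any learned forecaster should be compared against first, is:

```python
import numpy as np

def persistence_forecast(series):
    """Predict each step as the previous observation (a one-step data shift)."""
    series = np.asarray(series, float)
    return series[:-1], series[1:]          # predictions, ground truth

def mae(pred, truth):
    return float(np.mean(np.abs(pred - truth)))

# Any learned forecaster should beat this trivial baseline to justify its cost.
cpu_util = np.array([0.41, 0.42, 0.40, 0.43, 0.44, 0.44, 0.45])   # illustrative trace
pred, truth = persistence_forecast(cpu_util)
print("persistence MAE:", mae(pred, truth))
```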
Modern computer systems are highly configurable, with hundreds of configuration options that interact, resulting in an enormous configuration space. As a result, optimizing performance goals (e.g., latency) in such systems is challenging due to frequent uncertainties in their environments (e.g., workload fluctuations). Recently, transfer learning has been used to tackle this issue, leveraging information obtained from configuration measurements in less expensive source environments, as opposed to the costly or sometimes impossible interventions required in the target environment. Recent empirical research showed that statistical models can perform poorly when the deployment environment changes because the behavior of certain variables in the models can change dramatically from source to target. To address this issue, we propose Cameo---a method that identifies invariant causal predictors under environmental changes, allowing the optimization process to operate in a reduced search space, leading to faster optimization of system performance. We demonstrate significant performance improvements over state-of-the-art optimization methods in MLperf deep learning systems, a video analytics pipeline, and a database system.
The sharing of clusters with various on-NIC offloads by high-level entities (users, containers, etc.) has become increasingly common. Performance isolation across these entities is desired because the offloads can become bottlenecks due to the limited capacity of the hardware. However, existing works that provide scheduling and resource management for NIC offloads all require customization of the NIC or the offloads, while commodity off-the-shelf NICs and offloads with proprietary implementations have been widely deployed in datacenters. This paper presents Yama, the first solution to enable per-entity isolation in the sharing of such black-box NIC offloads. Yama provides a generic framework that captures a common abstraction of the operation of most offloads, which allows operators to incorporate existing offloads. The framework proactively probes the performance of the offloads with auxiliary workloads and enforces isolation at the initiator side. Yama also accommodates chained offloads. Our evaluation shows that 1) Yama achieves per-entity max-min fairness for various types of offloads and in complicated offload chaining scenarios; 2) Yama quickly converges when the equilibrium changes; and 3) Yama adds negligible overhead to application workloads.
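Per-entity max-min fairness over a fixed offload capacity can be computed with the standard water-filling procedure sketched below; the capacity value is an input that Yama estimates online by probing, and the demand numbers here are illustrative.

```python
def max_min_fair(capacity, demands):
    """Standard water-filling: entities that demand less than an equal share
    are fully satisfied; the leftover is split among the rest, repeatedly."""
    allocation = {e: 0.0 for e in demands}
    remaining = dict(demands)
    while remaining and capacity > 1e-9:
        share = capacity / len(remaining)
        satisfied = [e for e, d in remaining.items() if d <= share]
        if not satisfied:                       # everyone is capped at an equal share
            for e in remaining:
                allocation[e] = share
            break
        for e in satisfied:
            allocation[e] = remaining.pop(e)
            capacity -= allocation[e]
    return allocation

# max_min_fair(10.0, {"tenant_a": 2.0, "tenant_b": 6.0, "tenant_c": 6.0})
# -> {'tenant_a': 2.0, 'tenant_b': 4.0, 'tenant_c': 4.0}
```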
As research and deployment of AI grows, the computational burden to support and sustain its progress inevitably does too. To train or fine-tune state-of-the-art models in NLP, computer vision, etc., some form of AI hardware acceleration is virtually a requirement. Recent large language models require considerable resources to train and deploy, resulting in significant energy usage, potential carbon emissions, and massive demand for GPUs and other hardware accelerators. However, this surge carries large implications for energy sustainability at the HPC/datacenter level. In this paper, we study the effects of power-capping GPUs at a research supercomputing center on GPU temperature and power draw; we show significant decreases in both temperature and power draw, reducing power consumption and potentially improving hardware life-span, with minimal impact on job performance. To our knowledge, our work is the first to conduct and make available a detailed analysis of the effects of GPU power-capping at the supercomputing scale. We hope our work will inspire HPCs/datacenters to further explore, evaluate, and communicate the impact of power-capping AI hardware accelerators for more sustainable AI.
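For reference, reading GPU power and temperature and applying a power cap can be done through NVML; the sketch below uses the pynvml bindings with an arbitrary 250 W cap, and setting the limit requires administrative privileges. This is generic NVML usage, not the paper's measurement harness.

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Observe current draw and temperature (power is reported in milliwatts).
power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0
temp_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
lo_mw, hi_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
print(f"power={power_w:.0f} W, temp={temp_c} C, "
      f"cap range=[{lo_mw // 1000}, {hi_mw // 1000}] W")

# Apply a 250 W cap (arbitrary illustration; needs root and an in-range value).
pynvml.nvmlDeviceSetPowerManagementLimit(handle, 250_000)

pynvml.nvmlShutdown()
```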
Serverless workflows are multi-stage computations in which downstream functions must access intermediate state or the output of upstream functions before they can run. Workflow performance is therefore easily degraded by inefficient data access. Prior studies accelerate data access with various policies, such as direct and indirect methods, but these methods may fail due to limitations such as resource availability.
In this paper, we propose asynchronous state replication pipelines (ASRP) to speed up workflows for general applications, replacing the sequential computing pattern of current workflows. We build Chitu around this insight, with three main points. First, differentiable data types (DDT) are provided at the programming-model level to support incremental state sharing and computation. Second, ASRP continuously delivers changes to DDT objects in real time so that downstream functions can consume the objects without waiting for upstream functions to finish. Third, we make a systematic design to support DDT and ASRP in the Chitu framework, including direct communication and change propagation. We implement Chitu atop OpenFaaS, compare it with popular serverless workflow frameworks, and evaluate it with three common use cases. The results show that Chitu accelerates data transmission in general serverless workflows by up to 1.7× and speeds up end-to-end applications by up to 57%.
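A hedged sketch of the pipeline idea behind ASRP (not Chitu's DDT implementation): an upstream stage streams incremental changes through a queue while the downstream stage folds them into its state without waiting for the upstream to finish.

```python
import queue
import threading

DONE = object()

def upstream(out_q):
    """Emit incremental changes (deltas) as they are produced."""
    for chunk in range(10):
        out_q.put({"delta": chunk})        # e.g., a partial aggregate or appended rows
    out_q.put(DONE)

def downstream(in_q):
    """Consume deltas as they arrive instead of waiting for the full output."""
    state = 0
    while True:
        msg = in_q.get()
        if msg is DONE:
            break
        state += msg["delta"]              # incremental computation over each delta
    print("downstream result:", state)

q = queue.Queue()
threads = [threading.Thread(target=upstream, args=(q,)),
           threading.Thread(target=downstream, args=(q,))]
for t in threads:
    t.start()
for t in threads:
    t.join()
```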