Transient Guarantees: Maximizing the Value of Idle Cloud Capacity
Session
Clouds & Job Scheduling
Session Chair
Ali R. Butt
Event Type
Paper
Clouds and Distributed Computing
Intermediate
Performance
Location
355-BC
Description
To reduce waste, platforms have begun to offer idle capacity in the form of transient servers, which they may unilaterally revoke, at much lower prices (∼70-90% less) than on-demand servers. However, the revocation characteristics of transient servers, namely their volatility and predictability, influence their performance, because they determine the overhead of the fault-tolerance mechanisms that applications use to handle revocations. Unfortunately, current cloud platforms offer no guarantees on revocation characteristics, which makes it difficult for users to configure (and value) transient servers optimally.
To address the problem, we propose the abstraction of a transient guarantee, which offers probabilistic assurances on revocation characteristics. We present policies for partitioning a variable amount of idle capacity into classes with different transient guarantees to maximize performance and value. We then implement and evaluate these policies on job traces from a production Google cluster. We show that our approach can increase the aggregate revenue from idle server capacity by ∼6.5× compared to existing approaches.
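
As a rough illustration of what partitioning a variable amount of idle capacity into classes could look like, the Python sketch below splits an idle-capacity trace into classes by availability level. It is a simplification under stated assumptions, not the paper's policy: the abstract does not specify how guarantees are defined, so the sketch treats a "guarantee" as the fraction of time steps in which a given number of servers is idle. The function names, the example trace, and the chosen levels are all hypothetical.

```python
import math


def capacity_at_availability(trace, p):
    """Largest capacity c such that at least a fraction p of the
    time steps in `trace` have c or more idle servers.

    Assumption: availability is measured per time step over a
    historical trace; this is an illustrative stand-in for a
    probabilistic guarantee, not the paper's definition.
    """
    if not trace or not (0.0 < p <= 1.0):
        raise ValueError("need a non-empty trace and 0 < p <= 1")
    ordered = sorted(trace, reverse=True)
    # The k-th largest value is available in at least k time steps.
    # The small epsilon guards against floating-point overshoot in p * n.
    k = max(1, math.ceil(p * len(ordered) - 1e-9))
    return ordered[k - 1]


def partition_idle_capacity(trace, levels):
    """Split idle capacity into classes, one per availability level.

    The strictest level is served first; each weaker level receives
    only the additional capacity that becomes usable at that level.
    Returns a dict mapping level -> number of servers in that class.
    """
    classes = {}
    allocated = 0
    for p in sorted(levels, reverse=True):
        usable = capacity_at_availability(trace, p)
        classes[p] = usable - allocated
        allocated = usable
    return classes


if __name__ == "__main__":
    # Hypothetical trace: idle servers observed in ten time steps.
    idle = [10, 12, 8, 15, 9, 11, 14, 7, 13, 10]
    print(partition_idle_capacity(idle, [1.0, 0.9, 0.5]))
```

For this hypothetical trace, 7 servers are idle in every step, 1 more is idle in at least 90% of steps, and 3 more in at least 50%. A real policy would also need to account for how often and how predictably servers in each class are revoked, which this sketch does not model.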








