Are long VM instance spin-up times in the cloud costing you money?
Are long VM instance spin-up times in the cloud costing you money? That's the question that immediately came to mind when James Urquhart, in an interview at the Strata Conference, made this thought-provoking comment: the faster you can get the resources into the hands of the people who use them, the more money you save overall.
One of the many superpowers of the cloud is elasticity, the ability to dynamically acquire and release resources in response to demand. But as with any good superhero, that strength also forms the basis of a not-quite-fatal flaw. Years and years of angsty episodes are usually required to explore this contradiction.
In the case of the cloud, the weakness reveals itself in slow VM spin-up times. Spinning up a VM in EC2 can take as little as 1-3 minutes, can average 5-10 minutes, or can take much longer if there's heavy usage in your availability zone. EC2 is not alone. A common complaint about Google App Engine is the cold-start problem: when a request comes in, an application instance must be initialized to handle it, which takes time, which means the end-user experiences increased latency.
This means that with VM-oriented systems--be they IaaS or PaaS--your ability to deal with bursty traffic is much more limited in the cloud than you might have expected. You could of course reserve capacity, but that rather defeats the point of elasticity, and really just moves the point where the problem occurs further down the curve. App Engine will have a feature to keep warm instances around, but you will pay for those too, which again defeats the point of on-demand, pay-for-what-you-use elasticity.
All this might not matter if it weren't for the idea that those spin-up times could cost you money. Joe Weinman has taken a more formal look at this problem in his paper Time is Money: The Value of “On-Demand”, and this is the paper James was referring to when he made his observation.
Joe Weinman is the founder of Cloudonomics, a rigorous analytical approach that leverages mathematics and Monte Carlo simulation to characterize the sometimes counterintuitive, multi-dimensional business of cloud computing and pay-per-use business models. He has written a string of interesting papers on his website, including: Smooth Operator: The Value of Demand Aggregation (PDF); Cloud Computing is NP-Complete (PDF); and Mathematical Proof of the Inevitability of Cloud Computing (PDF). At the core of these papers are many, many pages of rigorous mathematical analysis, but fortunately this creamy goodness is bookended with chocolatey cookies explaining what it all means.
From the abstract of Time is Money: The Value of “On-Demand”:
Cloud computing and related services offer resources and services "on demand." Examples include access to "video on demand" via IPTV or over-the-top streaming; servers and storage allocated on demand in "infrastructure as a service;" or "software as a service" such as customer relationship management or sales force automation. Services delivered "on demand" certainly sound better than ones provided "after an interminable wait," but how can we quantify the value of on-demand, and the scenarios in which it creates compelling value?
We show that the benefits of on-demand provisioning depend on the interplay of demand with forecasting, monitoring, and resource provisioning and de-provisioning processes and intervals, as well as likely asymmetries between excess capacity and unserved demand.
In any environment with constant demand or demand which may be accurately forecasted to an interval greater than the provisioning interval, on-demand provisioning has no value. However, in most cases, time is money. For linear demand, loss is proportional to demand monitoring and resource provisioning intervals. However, linear demand functions are easy to forecast, so this benefit may not arise empirically.
For exponential growth, such as found in social networks and games, any non-zero provisioning interval leads to an exponentially growing loss, underscoring the critical importance of on-demand in such environments.
For environments with randomly varying demand where the value at a given time is independent of the prior interval—similar to repeated rolls of a die—on-demand is essential, and generates clear value relative to a strategy of fixed resources, which in turn are best overprovisioned.
For demand where the value is a random delta from the prior interval—similar to a Random Walk—there is a moderate benefit from time compression. Specifically, reducing process intervals by a factor of n results in loss being reduced to 1/√n of its prior value. Thus, a two-fold reduction in cost requires a four-fold reduction in time.
Finally, behavioral economic factors and cognitive biases such as hyperbolic discounting, perception of wait times, neglect of probability, and normalcy and other biases modulate the hard dollar costs addressed here.
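The random walk result is easy to sanity-check empirically. Below is a minimal Monte Carlo sketch (my own construction, not code from the paper; the demand model and parameters are assumptions) in which capacity always trails a random-walk demand curve by a provisioning lag, and the average mismatch grows with the square root of that lag.

# Minimal Monte Carlo sketch of the random-walk result above. Illustrative
# only: the demand model and parameters are assumptions, not the paper's.
import math
import random

def average_loss(lag, steps=200000):
    """Demand follows a random walk; capacity is whatever demand was
    `lag` steps ago. Loss is the average |demand - capacity|."""
    demand = 0.0
    history = [0.0] * lag              # demand as seen `lag` steps in the past
    total_loss = 0.0
    for t in range(steps):
        capacity = history[t % lag]    # provisioned from stale observations
        total_loss += abs(demand - capacity)
        history[t % lag] = demand
        demand += random.gauss(0, 1)   # random delta from the prior interval
    return total_loss / steps

for lag in (1, 4, 16):
    # Loss should scale roughly with sqrt(lag): a 4x longer lag => ~2x loss,
    # which is the 1/sqrt(n) relationship from the abstract, read in reverse.
    print(lag, round(average_loss(lag), 2), "expected factor", math.sqrt(lag))

Running it shows the loss for a lag of 4 is about twice the loss for a lag of 1, and a lag of 16 about four times, matching the square-root law.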
The degree of effect is related to traffic patterns:
We have seen that not only is there a time value of money, there is a money value of time, specifically, increased agility and responsiveness lead to reduced loss, including a reduction in missed opportunities. Time is money.
From a business perspective, one has to ask whether the reduction in monitoring or provisioning time that potentially results in reduced loss due to unserved demand or unused resources is worth it. I believe in most cases the answer is yes. The reason is that the costs of implementing such on-demand strategies are largely fixed, are a relatively minor portion of the total cost, or are already incorporated, say, into a cloud provider's offerings. For example, the cost for an enterprise or cloud provider to acquire and deploy dynamic provisioning software compared to the losses associated with unserved demand or unutilized capacity make it an attractive proposition.
For linearly growing or declining demand, a reduction in time (monitoring cycle or resource provisioning) offers a proportional reduction in cost.
For exponential demand, the loss associated with even fixed interval provisioning grows exponentially, so on-demand provisioning is essential.
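To make the exponential case concrete, here is a back-of-the-envelope sketch (the doubling period and lag values are my illustrative assumptions, not Weinman's): if capacity always reflects demand as observed one provisioning interval ago, the shortfall is a fixed fraction of a demand curve that is itself growing exponentially, so the absolute loss grows exponentially too.

# Back-of-the-envelope sketch of the exponential-demand case. The doubling
# period and lag values are illustrative assumptions, not from the paper.
DOUBLING_DAYS = 30.0   # assume demand doubles every 30 days

def unserved_fraction(lag_days):
    """If capacity reflects demand as of `lag_days` ago, the shortfall
    D(t) - D(t - lag) is this constant fraction of current demand."""
    return 1 - 2 ** (-lag_days / DOUBLING_DAYS)

for lag in (10 / 1440.0, 1, 7):   # a 10-minute spin-up, a 1-day lag, a 1-week lag
    print("lag %.4f days -> %.2f%% of demand unserved" % (lag, 100 * unserved_fraction(lag)))

# Since demand grows as 2**(t / DOUBLING_DAYS), a constant unserved fraction
# means the absolute loss at time t grows exponentially with t.

Even the 10-minute spin-up leaves a sliver of demand permanently unserved; stretch the lag to a week and roughly 15% of an exponentially growing demand goes unmet.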
The VM spin-up interval is your period of lost opportunity. If your traffic is bursty and/or growing exponentially, then you may be losing out on more profitable opportunities than you thought, because cloud elasticity doesn't match demand elasticity. While not quite the cloud's kryptonite, it is a flaw worth considering in your architecture.
Reader Comments (6)
I couldn't agree with this post more. Not only does provisioning cost you money, it prevents new use cases that would be possible with real-time, instantaneous provisioning.
<relevant shameless plug>
We build technology that allows provisioning of running, stateful virtual machines in seconds, and it's amazing how people don't realize that the game changes when there is little or zero provisioning cost. When things take minutes, you have no choice but to do complex prediction and over-provisioning. I wrote a blog post on a similar theme just today.
</relevant shameless plug>
Spin-up time for newly added Google AppEngine instances can be reduced using initial state caching.
Usually the majority of the spin-up time for a newly created GAE instance is spent pre-populating the initial state, which is built from many pieces of data loaded from slow sources such as GAE's datastore. If the initial state is identical across GAE instances, the first instance created can serialize the entire state and store it in shared memory (either in memcache or in the datastore). Newly created instances can then load and quickly deserialize the state from a single blob read out of shared memory, instead of spending a lot of time reconstructing it from multiple pieces of data loaded from the datastore.
I reduced spin-up time for new instances of my GAE application from 15 seconds to 1.5 seconds using this technique.
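A minimal sketch of that technique on the Python App Engine runtime might look like the following; build_initial_state() and the cache key are hypothetical placeholders for an app's real logic, and the App Engine memcache API handles serialization of values up to 1 MB.

# Sketch of the warm-up trick described above. build_initial_state() and
# STATE_KEY are placeholders; substitute your app's real logic.
from google.appengine.api import memcache

STATE_KEY = 'initial_state_v1'   # bump the suffix when the state format changes

def build_initial_state():
    """Expensive path: assemble the state from many datastore reads.
    (Placeholder -- replace with your app's real queries.)"""
    return {}

def load_initial_state():
    state = memcache.get(STATE_KEY)      # one cheap read instead of many
    if state is None:                    # cache miss or eviction: rebuild
        state = build_initial_state()
        memcache.set(STATE_KEY, state)   # memcache serializes values <= 1 MB
    return state

A datastore entity can serve as a fallback for the blob when memcache evicts it, at the cost of one extra read on a miss.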
Theoretically the same approach could be used for VM-powered clouds such as Amazon EC2, if the cloud were able to fork() new VMs from a given initial state. Application developers could boot and pre-configure the required services in a 'golden' VM, store it as a snapshot in shared memory, and use that snapshot for fast fork()'ing of new VMs. A VM fork() can be much faster than the cold boot of a new VM with the required services.
Some organizations are choosing to use stateless VMs with JEOS and do a just-in-time installation of the application/platform software. Even if you do 'eager acquisition' and cache the base VM, you'll still have to deal with the delays associated with the installation routine. You can also speed up the launch by placing those resources on a higher-tier network/storage service: http://bit.ly/icWuaR
I don't think there is any silver bullet here - but plenty of ways to improve.
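For the EC2 side of that just-in-time pattern, here is a hedged sketch using the boto library; the AMI ID, package name, and install commands are hypothetical placeholders, not a tested recipe. It launches a stateless JEOS base image and lets a user-data script install the application at first boot.

# Sketch of the just-in-time-install pattern with boto. The AMI ID and the
# install commands below are hypothetical placeholders.
import boto.ec2

# cloud-init runs this shell script on the first boot of the JEOS image.
USER_DATA = """#!/bin/bash
# Just-in-time installation: fetch and start the application at boot.
yum -y install myapp            # hypothetical application package
service myapp start
"""

conn = boto.ec2.connect_to_region('us-east-1')
reservation = conn.run_instances(
    'ami-12345678',             # hypothetical JEOS base image
    instance_type='m1.small',
    user_data=USER_DATA)
print(reservation.instances[0].id)

The tradeoff noted above still applies: the instance isn't useful until the install script finishes, so faster storage or a partially baked image only shrinks the gap rather than eliminating it.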
That's a good approach, valyala, for application state that's under 1 megabyte. That state could also be evicted. And there's also the problem of allocating the instance and then loading large code libraries, which still needs to be solved on Google's side.
A bit out of date with respect to Google App Engine: they now support the Always On feature.
Thought I'd check back in. @Valyala, that's exactly what our technology does, so it's not so theoretical :).
The initial research, SnowFlock, was published in 2009 at EuroSys, where it won the best paper award. It's a great read if you're interested in exploring that direction.