End-To-End Performance Study of Cloud Services
Cloud computing promises a number of advantages for the deployment of data-intensive applications. Most prominently, these include reduced cost through a pay-as-you-go pricing model and (virtually) unlimited throughput by adding servers as the workload increases. At the Systems Group, ETH Zurich, we conducted an extensive end-to-end performance study to compare the major cloud offerings with regard to their ability to fulfill these promises and their implied cost.
The focus of the work is on transaction processing (i.e., read and update workloads) rather than analytics workloads. We used TPC-W, a standardized benchmark simulating a web shop, as the baseline for our comparison. TPC-W simulates users through emulated browsers (EBs) that issue page requests, called web interactions (WIs), against the system. As a major modification to the benchmark, we continuously increased the load from 1 to 9000 simultaneous users to measure the scalability and cost variance of the system. Figure 1 shows an overview of the different combinations of services we tested in the benchmark.
Figure 1: Systems Under Test
The main results are shown in Figure 2 and Tables 1 and 2, and they are surprising in several ways. Most importantly, it seems that each major vendor has adopted a different architecture for its cloud services (e.g., master-slave replication, partitioning, distributed control, and various combinations of these). As a result, the cost and performance of the services vary significantly depending on the workload. A detailed description of the architectures is provided in the paper.
Furthermore, only two architectures, the one implemented on top of Amazon S3 and MS Azure using SQL Azure as the database, were able to scale and sustain our maximum workload of 9000 EBs, resulting in over 1200 web interactions per second (WIPS). MySQL installed on EC2 and Amazon RDS were able to sustain a maximum load of approximately 3500 EBs. MySQL Replication performed similarly to standalone MySQL with EBS, so we left it out of the figure. Figure 2 shows that the WIPS of Amazon's SimpleDB grow up to about 3000 EBs and more than 200 WIPS. In fact, SimpleDB was already overloaded at about 1000 EBs and 128 WIPS in our experiments; at that point, all write requests to hot spots failed.
Google AppEngine already dropped out at 500 emulated browsers with 49 WIPS. This is mainly because Google's transaction model is not built for such high write workloads. When implementing the benchmark, our policy was to always use the strongest consistency guarantees on offer, which come closest to the TPC-W requirements. Thus, in the case of AppEngine, we used the transaction model offered inside an entity group. However, it turned out that this slows down overall performance considerably. We are now re-running the experiment without transaction guarantees and are curious about the new performance results.
Figure 2: Comparison of Architectures [WIPS]
Table 1 shows the total cost per web interaction in milli-dollars (m$) for the alternative approaches under a varying load (EBs). Google AE is cheapest for low workloads (below 100 EBs), whereas Azure is cheapest for medium to large workloads (more than 100 EBs). The three MySQL variants (MySQL, MySQL/R, and RDS) have (almost) the same cost as Azure for medium workloads (EB=100 and EB=3000), but they are not able to sustain large workloads.
Table 1: Cost per WI [m$], Vary EB
The success of Google AE for small loads has two reasons. First, Google AE is the only variant that has no fixed costs. There is only a negligible monthly fee to store the database. Second, at the time these experiments were carried out, Google gave a quota of six CPU hours per day for free. That is, applications which are below or slightly above this daily quota are particularly cheap.
Azure and the MySQL variants win for medium and large workloads because all these approaches can amortize their fixed cost over these workloads. SQL Azure has a fixed cost of USD 100 per month for a database of up to 10 GB, independent of the number of requests that the database needs to process. For MySQL and MySQL/R, EC2 instances must be rented in order to keep the database online. Likewise, RDS involves an hourly fixed fee, so the cost per WI decreases as the load increases. It should be noted that network traffic is cheaper with Google than with both Amazon and Microsoft.
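The amortization effect can be made concrete with a short calculation. The sketch below is only an illustration: the USD 100 monthly fee is the SQL Azure figure quoted above, while the 30-day month, the zero per-request charge, and the function name are assumptions made for the example, not numbers from the study.

```python
SECONDS_PER_DAY = 86_400

# USD 100 per month is taken from the text; a 30-day month is an assumption.
FIXED_DB_COST_PER_DAY = 100 / 30


def cost_per_wi_millidollars(wips, fixed_per_day=FIXED_DB_COST_PER_DAY,
                             variable_per_wi=0.0):
    """Cost per web interaction (m$) at a sustained rate of `wips`.

    `variable_per_wi` (USD) is a hypothetical per-request charge; it
    defaults to zero so that only the fixed fee is amortized.
    """
    wi_per_day = wips * SECONDS_PER_DAY
    total_per_day = fixed_per_day + variable_per_wi * wi_per_day
    return 1000 * total_per_day / wi_per_day


for wips in (1, 10, 100, 1000):
    print(f"{wips:>5} WIPS -> {cost_per_wi_millidollars(wips):.6f} m$ per WI")
```

With a pure fixed fee, a tenfold increase in sustained throughput cuts the database's share of the cost per WI by a factor of ten, which is why these variants look better as the load grows.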
Table 2 shows the total cost per day for the alternative approaches and a varying load (EBs). (A "-" indicates that the variant was not able to sustain the load.) These results confirm the observations made previously: Google wins for small workloads; Azure wins for medium and large workloads. All the other variants are somewhere in between. The three MySQL variants come close to Azure in the range of workloads that they sustain. Azure and the three MySQL variants roughly share the same architectural principles (replication with master copy architectures). SimpleDB is an outlier in this experiment. With the current pricing scheme, SimpleDB is an exceptionally expensive service. For a large number of EBs, the high cost of SimpleDB is particularly annoying because users must pay even though SimpleDB drops many requests and is not able to sustain the workload.
Table 2: Total Cost per Day [$], Vary EB
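The crossover between a service without fixed costs (cheapest at small loads) and a fixed-fee service (cheapest at larger loads) follows from a simple linear cost model. The following sketch computes the break-even load; all prices in it are hypothetical placeholders chosen for readability and are not taken from Table 2 or from any provider's price list.

```python
import math


def daily_cost(wi_per_day, fixed_per_day, price_per_wi):
    """Total daily cost (USD) under a linear model: fixed fee plus usage."""
    return fixed_per_day + price_per_wi * wi_per_day


def break_even_wi_per_day(fixed_a, price_a, fixed_b, price_b):
    """Daily load at which services A and B cost the same."""
    if math.isclose(price_a, price_b):
        raise ValueError("per-WI prices are equal; the cost lines never cross")
    return (fixed_b - fixed_a) / (price_a - price_b)


# Hypothetical prices: A has no fixed fee but a higher per-WI price,
# B charges a daily fixed fee but less per WI.
NO_FIXED_FEE = {"fixed_per_day": 0.0, "price_per_wi": 0.0005}
FIXED_FEE = {"fixed_per_day": 5.0, "price_per_wi": 0.0001}

crossover = break_even_wi_per_day(
    NO_FIXED_FEE["fixed_per_day"], NO_FIXED_FEE["price_per_wi"],
    FIXED_FEE["fixed_per_day"], FIXED_FEE["price_per_wi"],
)
print(f"break-even at roughly {crossover:.0f} WI per day")
```

Below the break-even load the service without a fixed fee is cheaper; above it, the fixed-fee service wins because its lower per-WI price dominates.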
Turning to the S3 cost in Table 2, the total cost grows linearly with the workload. This behavior is exactly what one would expect from a pay-as-you-go model. For S3, the high cost is matched by high throughput, so the high cost of S3 at high workloads is tolerable. This observation is in line with a good Cost/WI metric for S3 at high workloads (Table 1). Nevertheless, S3 is indeed more expensive than all the other approaches (except for SimpleDB) for most workloads. This phenomenon can be explained by Amazon's pricing model for EBS and S3. For instance, a write operation to S3 is a hundred times more expensive than a write operation to EBS, which is used in the MySQL variant. Amazon can justify this difference because S3 supports concurrent updates with an eventual consistency policy, whereas EBS only supports a single writer (and reader) at a time.
In addition to the results presented here, the paper compares the overload behavior of the services and breaks down the cost factors behind these numbers. If you are interested in these results and in more details about the test setup, the paper will be presented at this year's SIGMOD conference and can also be downloaded here.
Reader Comments (16)
I am no expert at TPC-W, but I believe this benchmark is not using it properly. From the TPC-W description: "For each of the emulated browsers, the database must maintain 2880 customer records and associated order information."
But instead this benchmark locked down the database size at only 100 EBs: "The TPC-W benchmark specifies that the size of the benchmark database grows linearly with the number of EBs. Since we were specifically interested in the scalability of the services with regard to the transactional workload and elasticity with changing workloads, we carried out all experiments with a fixed benchmark database which complies to a standard TPC-W database for 100 EBs. This database involved 10,000 items and had 315 MB of raw data which typically resulted in database sizes of about 1GB with indexes (Section 6.5)."
But then the scaling numbers are completely 'bogus'. A database of 1GB fits in-memory, so this is purely a cache-access test and not a DB test at all. And certainly not the more realistic benchmark that TPC-W is trying to be.
Further the correct TPC-W 9K EBs number would result in something like a 90GB DB size. And Azure wouldn't even be a contender beyond 1K EB: "SQL Azure provides two database sizes: 1 GB (Web Edition) or 10 GB (Business Edition). If the size of your database reaches its MAXSIZE, you will receive an error code 40544. When this happens, you cannot insert or update data, or create new objects, such as tables, stored procedures, views, and functions."
So Azure makes it no further than SimpleDB [assuming no effects of going from 100 -> 1K EB for either of them], and isn't even in the league of RDS or EC2+MySQL [or EC2+SQL Server for that matter].
Am I misreading something here?
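The sizing argument in this comment can be checked with a few lines of arithmetic. This is a rough sketch that assumes the indexed database size scales linearly with the number of EBs, starting from the roughly 1 GB at 100 EBs quoted above; the function names are made up for the example.

```python
# Figures quoted in the comment above.
BASE_EBS = 100                  # database populated for 100 EBs
BASE_DB_SIZE_GB = 1.0           # about 1 GB with indexes at 100 EBs
CUSTOMER_RECORDS_PER_EB = 2880  # from the TPC-W description
SQL_AZURE_MAX_GB = 10           # Business Edition limit


def projected_db_size_gb(ebs):
    """Indexed database size if it grew linearly with the number of EBs."""
    return BASE_DB_SIZE_GB * ebs / BASE_EBS


def max_ebs_for_size(limit_gb):
    """Largest EB count whose projected database still fits in `limit_gb`."""
    return int(limit_gb / BASE_DB_SIZE_GB * BASE_EBS)


print(projected_db_size_gb(9000), "GB at 9000 EBs")
print(9000 * CUSTOMER_RECORDS_PER_EB, "customer records at 9000 EBs")
print(max_ebs_for_size(SQL_AZURE_MAX_GB), "EBs fit within the SQL Azure limit")
```

Under this linear assumption the numbers match the comment: about 90 GB at 9000 EBs, and the 10 GB limit is reached at about 1000 EBs.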
I would be interested to see the change in performance of GAE if the db was sharded as per Google's recommendations. GAE was bound to perform poorly with the decision to use only one entity group, as this is expressly recommended against by Google.
I don't use any of the above services so I'm not a Google apologist, but would be keen to see comparisons using a test bed that actually leverages the designs of the underlying systems.
It's almost like the test was designed to fail on GAE.
Is the source code you used available?
Great study! However, did you compare the cloud setup to dedicated hardware at the low/medium scale?
I feel virtual hardware/clouds fill some needs wonderfully, however I would be interested to find that line in the sand when real hardware beats/meets virtual hardware.
I also need to take a look at the TPC-W suite to see if it includes how long it takes a page to render, since that is an important criterion (slow page loads are never good!).
Off to read more!
I'd be interested in seeing the GAE benchmarks use something like Objectify, since it more closely matches how data is actually stored in GAE. JDO/JPA is so slow! GAE can't be treated like an RDBMS, doesn't seem like this benchmark was optimized for GAE.
In the following I try to answer some of the comments above:
MS SQL Azure
Indeed, we fixed the database size to the size that TPC-W specifies for 100 EBs. We couldn't increase the data size with the workload, as the growing data would then have become the dominant factor in all experiments, and this would not have been realistic at all. Furthermore, significantly bigger data sizes are pretty hard and expensive to upload to the cloud.
Thus, the presented results show the scalability in the workload/transactions and not in the data size. For bigger data sizes, MS SQL Azure has a serious limitation, which we also stated in the paper (today, you get all these nice properties and comfort up to 10GB).
However, we are in contact with all the providers and for some we might be able to re-run the experiments with bigger data sizes in the near future.
Google App Engine
Regarding GAE: For the presented results, our policy was to always use the consistency level offered by the system that comes closest to the TPC-W requirements, and to use the standard out-of-the-box services without additional techniques to scale beyond their limitations (e.g., sharding, caches, …). For GAE, we already made an exception by using their memcache offering, which didn't help a lot. For the presented results we decided against sharding, as transaction support is only given inside a single entity group and there is no natural partitioning, since all products could end up in the same basket.
Nevertheless, as Google's transaction support seems to be the main bottleneck, we are now re-running the experiments to measure the scaling behavior and cost with sharding. However, it should be noted that sharding in general is a technique that could also be applied to all the other services, including MySQL, SimpleDB, and MS SQL Azure (e.g., to overcome the data size limitation), and that, if done manually, it just pushes the scalability complexity back to the developer instead of leaving it with the service provider.
Dedicated Hardware
Running the same experiments on dedicated hardware was beyond the scope of this study. However, we did run the experiments on different EC2 virtual machine sizes. On the TPC-W web page, as well as in various research papers, you can find the numbers for dedicated machines. This gives at least an idea of what the trade-off line looks like. However, this discussion quickly leads to questions of over-provisioning, administration cost, etc.
Source Code
The source code is not yet published, but a student is working on packaging everything up. So stay tuned.
Any chance to include rackspace into the tests? Some sources say they can be even faster than EC2.
You wrote that Google AppEngine already dropped out at 500 emulated browsers with 49 WIPS. Maybe that was because App Engine has a limit on maximum number of requests per minute:
Excellent paper, really enjoyed it.
I'm curious as to why you did not consider using another storage option for Azure, which is Azure Blob Storage. As far as I can tell it is not even mentioned in the paper as something you were aware of but didn't make use of (for whatever reason). ABS is quite analogous to S3, and in addition supports real transactional semantics (as opposed to eventual consistency) within a given partition (although utilizing that will likely be somewhat of a bottleneck). Among other reasons for looking at it, there are no data size limitations.
Thanks,
Mark
I'd be interested in learning more about your SimpleDB experiment. A few questions:
1. Did you design your experiment to fit within the paradigm of SDB? It sounds like you worked to adapt SDB to SQL instead of working within the constraints provided by SDB.
2. Did you spread your data across multiple domains? I've found this significantly improves performance at the cost of some extra complexity, of course.
3. How many connections were you maintaining to the SDB services?
4. Was your application running within the Amazon data center? Clearly, internal network traffic will be much higher quality than external.
steve
Tim - I work for the Windows Azure team. Any particular reason you didn't do a separate test using the Azure table storage service (apart from SQL Azure)?
Do you have any figures on response time as a function of the load?
Great Study,
Sabri.
It is not quite correct to say that the problem with GAE is related to sharding.
As you know, you can only perform transactions within a single entity group. The bottleneck is not that this restricts processing to a single machine; the bottleneck is that each entity group (and thus each transaction) has a single optimistic timestamp. By placing your entire data structure in a single entity group, you have guaranteed that *every single operation* contends for a single optimistic lock in the database.
The analogue to an RDBMS is not condensing all requests to a single machine; the analogue is putting an update of the exact same piece of data in *every* transaction. This simply won't work.
To properly implement the GAE test, you can't just "add sharding"; you must eliminate the transaction requirement and allow every piece of data to live in its own entity group. Transactions in GAE are useful in a few isolated cases but they are NOT equivalent to RDBMS transactions and cannot be treated as such. Transactions as you are accustomed to are not a supported feature of the GAE datastore.
I'm very interested in how GAE stacks up. You shouldn't need to rewrite your tests to support sharding in any explicit way, just remove the parent references and get rid of any transaction-related code. This doesn't offer the same consistency requirements as the other systems, but GAE doesn't support those kinds of consistency requirements.
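The contention argument made in this comment can be illustrated with a toy simulation. It is a deliberately simplified model and not a description of the actual GAE datastore: time is divided into rounds, every concurrent transaction writes to exactly one entity group chosen uniformly at random, and within a round only one transaction per entity group passes the optimistic check while the others have to retry. The function name and parameters are made up for the example.

```python
import random


def commits_in_one_round(num_txns, num_entity_groups, rng):
    """Number of transactions that commit in one round of the toy model.

    Each transaction touches one entity group picked uniformly at random.
    Only one transaction per entity group can commit in a round, so the
    number of commits equals the number of distinct groups touched.
    """
    touched = {rng.randrange(num_entity_groups) for _ in range(num_txns)}
    return len(touched)


rng = random.Random(42)  # fixed seed so that runs are repeatable
single_group = commits_in_one_round(100, 1, rng)
many_groups = commits_in_one_round(100, 10_000, rng)
print(f"one entity group:     {single_group} of 100 transactions commit")
print(f"10,000 entity groups: {many_groups} of 100 transactions commit")
```

With a single entity group exactly one transaction per round can succeed regardless of how many are issued, whereas spreading the data over many entity groups lets nearly all of them commit.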
In the following I try to answer the questions since Tim's last comment:
Rackspace
We currently have no plans to run the benchmark on Rackspace in the near future. Nonetheless, we are continuously following developments in the cloud market and are open to suggestions for new services to benchmark.
To keep people up to date, we are currently also looking for the best way to publish new results for already tested as well as new services. In addition, we are considering allowing others to upload their own results.
Google App Engine Quotas and Limits
Of course we checked the limits of the App Engine before running the test. In our benchmark, the bottleneck is clearly related to the limited throughput of the entity groups.
We are also in ongoing contact with Google and will certainly soon re-run the benchmark on their platform.
Azure Simple Data Storage
While setting up our benchmark on the Azure platform, using a service providing almost complete SQL support seemed the most natural choice to us. However, thanks for pointing out the table support of Azure Data Storage. It seems very interesting and maybe we can include it soon.
SimpleDB
For SimpleDB we used the same policy as for the App Engine (explained by Tim in his last comment).
We stored the data in several domains, but we always kept data that logically belongs together in the same domain (e.g., all products). As mentioned in the paper, once a domain is overloaded it starts returning errors and parts of the application stop working. It is therefore possible that splitting up the data further would increase performance. However, each additional partitioning step increases the application complexity needed to access the data, which raises the question of how many domains the data should be partitioned into. In addition, the more the data is partitioned, the more requests have to be executed to query it (e.g., for a range scan), and thus the cost increases.
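The trade-off described above can be sketched in a few lines. Note that this example assumes hash partitioning of items across domains, which is not what was done in the benchmark (logically related data was kept in one domain); the key format and function names are hypothetical, and the 10,000 items simply mirror the size of the item table mentioned earlier in the thread.

```python
import zlib


def domain_for(key, num_domains):
    """Map an item key to a domain index using a stable hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_domains


def items_per_domain(keys, num_domains):
    """Count how many keys land in each domain."""
    counts = [0] * num_domains
    for key in keys:
        counts[domain_for(key, num_domains)] += 1
    return counts


def requests_for_range_scan(num_domains):
    """With hash partitioning, matching items can live in any domain,
    so a range scan has to query every domain once."""
    return num_domains


keys = [f"item-{i}" for i in range(10_000)]
for num_domains in (1, 4, 16):
    counts = items_per_domain(keys, num_domains)
    print(f"{num_domains:>2} domains: largest domain holds {max(counts)} items, "
          f"range scan needs {requests_for_range_scan(num_domains)} requests")
```

More domains reduce the amount of data (and, presumably, load) per domain, but every query that cannot be routed by key has to be sent to all of them, so the number of billed requests grows with the number of domains.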
We used a connection pool with a varying number of connections to SDB. The application was running inside an Amazon data center.
Thanks for the work to benchmark all these solutions. On EC2 I have noticed significant differences between using mysql on top of the instance ephemeral disks (e.g. the 4 420 disks that come as part of an xlarge instance) vs mysql on top of EBS. I've also noticed improvements in performance by RAIDing those disks together instead of using them directly. I'm curious if you tried any of these alternatives and how they compare in your benchmark. Thanks again!
Matt Conway:
Paper:
In regards to GAE, they now do support cross-entity-group transactions, with a quota.
However, I would also like to mention, that from experience in using GAE to scale up processes over >50,000 concurrent tasks, you can't just dump your code on a fresh instance and expect it to immediately scale to your needs. It has a learning algorithm which adjusts how high you can scale, so if you have a bug in your code and suddenly start spiking through the roof, you will get shut down. There are specific error codes in the log messages which denote the difference between a datastore throughput error, and a failure due to quotas.
If you run the tests on a GAE backend, you won't hit the time quotas, and will be able to scale as high as the processor and your ram allow. Also, Google is using a new high replication system which has increased our throughput on single-entity group transactions > 3.5x.
You should really run these tests again, and perhaps offer up some source code for the community to optimize.