Entries in AWS (40)

Tuesday
Jul202010

Strategy: Consider When a Service Starts Billing in Your Algorithm Cost

At Monday's Cloud Computing Meetup, Paco Nathan gave an excellent Getting Started on Hadoop talk (slides). I found one of Paco's strategies particularly interesting: consider when a service starts charging in cost calculations. Depending on your use case it may be cheaper to go with a more expensive service that charges only for work accomplished rather than charging for both work + startup time.

Click to read more ...

Thursday
Jul082010

Cloud AWS Infrastructure vs. Physical Infrastructure

This is a guest post by Frédéric Faure (architect at Ysance) on the differences between using a cloud infrastructure and building your own. Frédéric was kind enough to translate the original French version of this article into English.

I’ve been noticing many questions about the differences inherent in choosing between a Cloud infrastructure such as AWS (Amazon Web Services) and a traditional physical infrastructure. Firstly, there are a certain number of preconceived notions on this subject that I will attempt to decode for you. Then, it must be understood that each infrastructure has its advantages and disadvantages: a Cloud-type infrastructure does not necessarily fulfill your requirements in every case, however, it can satisfy some of them by optimizing or facilitating the features offered by a traditional physical infrastructure. I will therefore demonstrate the differences between the two that I have noticed, in order to help you make up your own mind.

Click to read more ...

Monday
Aug312009

Scaling MySQL on Amazon Web Services

I've recently started working with a large company who is looking to take one of their heavily utilized applications and move it to Amazon Web Services. I'm not looking to start a debate on the merits of EC2, the decision to move to aws is already made (and is a much better decision than paying a vendor millions to host it).

I've done my reasearch and I'm comfortable with creating this environment with one exception, scaling MySQL. I havent done much work with MySQL, i'm more of an Oracle guy up to now. I'm struggling to determine a way to scale MySQL on the fly in a way so that replication works, the server takes its proper place in line for master candidacy, and the apache servers become aware of it.

So this is really three questions:

1. What are some proven methods of load balancing the read traffic going from apache to MySQL.
2. How do I let the load balancing mechanism know when I scale up / down a new Mysql Server?
3. How to alert the master of the new server and initiate replication in an automated environment?

Personally, I dont like the idea of scaling the databases, but the traffic increases exponentially for three hours a day, and then plummets to almost nothing. So this would provide a significant cost savings.

The only way I've read to manage this sort of scaling I read here on slides 18-25:
http://assets.en.oreilly.com/1/event/21/Tricks%20and%20Tradeoffs%20of%20Deploying%20MySQL%20Clusters%20in%20the%20Cloud%20Presentation
Has anyone tried this method and either had success or have scripts available to do this? I try not to remake the wheel when I dont have to. Thanks in advance.

Monday
Jun012009

HotPads on AWS

HotPads abandoned our managed hosting in December and took the leap over to EC2 and its siblings. The presentation has a lot of detail on costs and other things to watch out for, so if you're currently planning your "cloud" architecture, you'll find some of this really helpful.

Click to read more ...

Friday
Apr242009

Heroku - Simultaneously Develop and Deploy Automatically Scalable Rails Applications in the Cloud

Update 4: Heroku versus GAE & GAE/J

Update 3: Heroku has gone live!. Congratulations to the team. It's difficult right now to get a feeling for the relative cost and reliability of Heroku, but it's an impressive accomplishment and a viable option for people looking for a delivery platform.

Update 2: Heroku Architecture. A great interactive presentation of the Heroku stack. Requests flow into Nginx used as a HTTP Reverse Proxy. Nginx routes requests into a Varnish based HTTP cache. Then requests are injected into an Erlang based routing mesh that balances requests across a grid of dynos. Dynos are your application "VMs" that implement application specific behaviors. Dynos themselves are a stack of: POSIX, Ruby VM, App Server, Rack, Middleware, Framework, Your App. Applications can access PostgreSQL. Memcached is used as an application caching layer.

Update: Aaron Worsham Interview with James Lindenbaum, CEO of Heroku. Aaron nicely sums up their goal: Heroku is looking to eliminate all the reasons companies have for not doing software projects.


Adam Wiggins of Heroku presented at the lollapalooza that was the Cloud Computing Demo Night. The idea behind Heroku is that you upload a Rails application into Heroku and it automatically deploys into EC2 and it automatically scales using behind the scenes magic. They call this "liquid scaling." You just dump your code and go. You don't have to think about SVN, databases, mongrels, load balancing, or hosting. You just concentrate on building your application. Heroku's unique feature is their web based development environment that lets you develop applications completely from their control panel. Or you can stick with your own development environment and use their API and Git to move code in and out of their system.

For website developers this is as high up the stack as it gets. With Heroku we lose that "build your first lightsaber" moment marking the transition out of apprenticeship and into mastery. Upload your code and go isn't exactly a heroes journey, but it is damn effective...

I must confess to having an inherent love of Heroku's idea because I had a similar notion many moons ago, but the trendy language of the time was Perl instead of Rails. At the time though it just didn't make sense. The economics of creating your own "cloud" for such a different model wasn't there. It's amazing the niches utility computing will seed, fertilize, and help grow. Even today when using Eclipse I really wish it was hosted in the cloud and I didn't have to deal with all its deployment headaches. Firefox based interfaces are pretty impressive these days. Why not?

Adam views their stack as:
1. Developer Tools
2. Application Management
3. Cluster Management
4. Elastic Compute Cloud

At the top level developers see a control panel that lets them edit code, deploy code, interact with the database, see logs, and so on. Your website is live from the first moment you start writing code. It's a powerful feeling to write normal code, see it run immediately, and know it will scale without further effort on your part. Now, will you be able toss your Facebook app into the Heroku engine and immediately handle a deluge of 500 million hits a month? It will be interesting to see how far a generic scaling model can go without special tweaking by a certified scaling professional. Elastra has the same sort of issue.

Underneath Heroku makes sure all the software components work together in Lennon-McCartney style harmony. They take care (or will take care of) starting and stopping VMs, deploying to those VMs, billing, load balancing, scaling, storage, upgrades, failover, etc. The dynamic nature of Ruby and the development and deployment infrastructure of Rails is what makes this type of hosting possible. You don't have to worry about builds. There's a great infrastructure for installing packages and plugins. And the big hard one of database upgrades is tackled with the new migrations feature.

A major issue in the Rails world is versioning. Given the precambrian explosion of Rails tools, how does Heroku make sure all the various versions of everything work together? Heroku sees this as their big value add. They are in charge of making sure everything works together. We see a lot companies on the web taking on the role of curator ([1], [2], [3]). A curator is a guardian or an overseer. Of curators Steve Rubel says: They acquire pieces that fit within the tone, direction and - above all - the purpose of the institution. They travel the corners of the world looking for "finds." Then, once located, clean them up and make sure they are presentable and offer the patron a high quality experience. That's the role Heroku will play for their deployable Rails environment.

With great automated power comes great restrictions. And great opportunity. Curating has a cost for developers: flexibility. The database they support is Postgres. Out of luck if you wan't MySQL. Want a different Ruby version or Rails version? Not if they don't support it. Want memcache? You just can't add it yourself. One forum poster wanted, for example, to use the command line version of ImageMagick but was told it wasn't installed and use RMagick instead. Not the end of the world. And this sort of curating has to be done to keep a happy and healthy environment running, but it is something to be aware of.

The upside of curation is stuff will work. And we all know how hard it can be to get stuff to work. When I see an EC2 AMI that already has most of what I need my heart goes pitter patter over the headaches I'll save because someone already did the heavy curation for me. A lot of the value in services like rPath offers, for example, is in curation. rPath helps you build images that work, that can be deployed automatically, and can be easily upgraded. It can take a big load off your shoulders.

There's a lot of competition for Heroku. Mosso has a hosting system that can do much of what Heroku wants to do. It can automatically scale up at the webserver, data, and storage tiers. It supports a variery of frameworks, including Rails. And Mosso also says all you have to do is load and go.

3Tera is another competitor. As one user said: It lets you visually (through a web ui) create "applications" based on "appliances". There is a standard portfolio of prebuilt applications (SugarCRM, etc.) and templates for LAMP, etc. So, we build our application by taking a firewall appliance, a CentOS appliance, a gateway, a MySql appliance, glue them together, customize them, and then create our own template. You can specify down to the appliance level, the amount of cpu, memory, disk, and bandwidth each are assigned which let's you scale up your capacity simply by tweaking values through the UI. We can now deploy our Rails/Java hosted offering for new customers in about 20 minutes on our grid. AppLogic has automatic failover so that if anything goes wrong, it reploys your application to a new node in your grid and restarts it. It's not as cheap as EC2, but much more powerful. True, 3Tera won't help with your application directly, but most of the hard bits are handled.

RightScale is another company that combines curation along with load balancing, scaling, failover, and system management.

What differentiates Heroku is their web based IDE that allows you to focus solely on the application and ignore the details. Though now that they have a command line based interface as well, it's not as clear how they will differentiate themselves from other offerings.

The hosting model has a possible downside if you want to do something other than straight web hosting. Let's say you want your system to insert commercials into podcasts. That sort of large scale batch logic doesn't cleanly fit into the hosting model. A separate service accessed via something like a REST interface needs to be created. Possibly double the work. Mosso suffers from this same concern. But maybe leaving the web front end to Heroku is exactly what you want to do. That would leave you to concentrate on the back end service without worrying about the web tier. That's a good approach too.

Heroku is just getting started so everything isn't in place yet. They've been working on how to scale their own infrastructure. Next is working on scaling user applications beyond starting and stopping mongrels based on load. They aren't doing any vertical scaling of the database yet. They plan on memcaching reads, implementing read-only slaves via Slony, and using the automatic partitioning features built into Postgres 8.3. The idea is to start a little smaller with them now and grow as they grow. By the time you need to scale bigger they should have the infrastructure in place.

One concern is that pricing isn't nailed down yet, but my gut says it will be fair. It's not clear how you will transfer an existing database over, especially from a non-Postgres database. And if you use the web IDE I wonder how you will normal project stuff like continuous integration, upgrades, branching, release tracking, and bug tracking? Certainly a lot of work to do and a lot of details to work out, but I am sure it's nothing they can't handle.

Related Articles

  • Heroku Rails Podcast
  • Heroku Open Source Plugins etc
  • Tuesday
    Apr072009

    Six Lessons Learned Deploying a Large-scale Infrastructure in Amazon EC2 

    Lessons learned from OpenX's large-scale deployment to Amazon EC2:

  • Expect failures; what's more, embrace them
  • Fully automate your infrastructure deployments
  • Design your infrastructure so that it scales horizontally
  • Establish clear measurable goals
  • Be prepared to quickly identify and eliminate bottlenecks
  • Play wack-a-mole for a while, until things get stable

    Click to read more ...

  • Thursday
    Mar052009

    Strategy: In Cloud Computing Systematically Drive Load to the CPU

    Update 2: Linear Bloom Filters by Edward Kmett. A Bloom filter is a novel data structure for approximating membership in a set. A Bloom join conserves network bandwith by exchanging cheaper, more plentiful local CPU utilization and disk IO. Update: What are Amazon EC2 Compute Units?. Cloud providers charge for CPU time in voodoo units like "compute units" and "core hours." Geva Perry takes on the quest of figuring out what these mean in real life. I attended Sebastian Stadil's AWS Training Camp Saturday and during the class Sebastian brought up a wonderfully counter-intuitive idea: CPU (EC2) costs a lot less than storage (S3, SDB) so you should systematically move as much work as you can to the CPU. This is said to be the Client-Cloud Paradigm. It leverages the well pummeled trend that CPU power follows Moore's Law while storage follows The Great Plains' Law (flat). And what sane computing professional would do battle with Sir Moore and his trusty battle sword of a law? Embedded systems often make similar environmental optimizations. CPU rich and memory poor means operate on compressed serialized data structures. Deserialized data structures use a lot of memory, so why use them? It's easy enough to create an object wrapper around a buffer. Programmers shouldn't care how their objects are represented anyway. Yet we waste ginormous amounts of time and memory uselessly transforming XML in and out of different representations. Just transport compressed binary objects around and use them in place. Serialization and deserialization happen only on access (Pimpl Idiom). It never occurred to me that in the land of AWS plenty similar "tricks" would make sense. But EC2 is a loss leader in AWS. CPU is plentiful and cheap. It's IO and storage that costs you... The implication is that in your system design you should try and use EC2 as much as possible:

  • Compress data. Saves on bandwidth and storage (the expensive bits) and uses cheaper CPU to compress/decompress.
  • Slurp data. Latency cost is higher than performing operations locally. SDB can take up to 400 msecs between data centers and 200 msecs inside the same data center. This is very slow. It's usually faster, but it can take that long. Following the more traditional serial processing path of "get a record do a record" will take forever and cost more. Slurp up all your records from SDB and farm them out to your CPU nodes to be worked on in parallel.
  • Think parallel. Do multiple operations at once on your cheap CPUs rather than serially performing high latency operations on expensive storage. With enough nodes, total execution time approaches max latency.
  • Client side joins. Pull all data from the relatively expensive SDB and perform client side joins on relatively cheap EC2 nodes.
  • Leverage SQS. It's a relatively cheap part of the ecosystem. Keeping a work queue in SDB would be far more expensive. When all the implications are fully explored it's a little different take on designing a system. I found some interesting numbers in a Slashdot thread comparing values: No persistent storage; not great value: And it's still not a great value. It seems cheap. $72/mo for a 1.7GB RAM server. Well, look at Slicehost and you can get a 2GB RAM Xen instance (same virtualization software as EC2) for $140 WITH persistent storage and 800GB of bandwidth. That doesn't sound like a great deal UNTIL you calculate what EC2 bandwidth costs. 800GB would cost you $144 at $0.18 per GB bringing the total cost to $216 ($76 more than Slicehost). That 18 cents doesn't sound like much, but it adds up. The same situation happens with Joyent. For $250 you get a 2GB RAM server from them (running under Solaris' Zones) with 10TB of bandwidth. That would cost you $1,872 with EC2. Even if you assume that you'll only use 10% of what Joyent is giving you, EC2 still comes in at a cost of $252 - and without persistent storage!

    Click to read more ...

  • Wednesday
    Mar042009

    Its time for auto scaling – avoid peak load provisioning for web applications

    Many web applications, including eBanking, Trading, eCommerce and Online Gaming, face large, fluctuating loads. In this post will describe how to achieve Right Sizing using virtualization and cloud computing. Will use a standard JEE web application to demonstrate how auto-scaling works on AWS Cloud without changing your application code.

    Click to read more ...

    Saturday
    Feb212009

    Google AppEngine - A Second Look

    Update 6:: Back to the Future for Data Storage. We are in the middle of a renaissance in data storage with the application of many new ideas and techniques; there's huge potential for breaking out of thinking about data storage in just one way.
    Update 5: Building Scalable Web Applications with Google App Engine by Brett Slatkin.

    Update 4: Why Google App Engine is broken and what Google must do to fix it by Aral Balkan. We don't care that it can scale. We care that it does scale. And that it scales when you need it the most. Issues: 1MB limit on data structures; 1MB limit on data structures; the short-term high CPU quota; quotas in general; Admin? What's that?
    Update 3: BigTable Blues. Catherine Devlin couldn't port an application to GAE because it can't do basic filtering and can't search 5,000 records without timing out: "Querying from 5000 records - too much for the mighty BigTable, apparently." Followup: not the future database. "90% of the work of this project has been trying to figure out workarounds and kludges for its bizzare limitations."
    Update 2: Having doubts about AppEngine. Excellent and surprisingly civil debate on if GAE is a viable delivery platform for real applications. Concerns swirl over poor performance, lack of a roadmap, perpetual beta status, poor support, and a quota system as torture chamber model of scalability. GAE is obviously part of Google's grand plan (browser, gears, android, etc) to emasculate Microsoft, so the future looks bright, but is GAE a good choice now?
    Update: Here are a few experience reports of developers using GAE. Diwaker Gupta likes how easy it is to get started on the good documentation. Doesn't like all the limits and poor performance. James here and here also likes the ease of use but finds the data model takes some getting used to and is concerned the API limits won't scale for a real site. He doesn't like how external connections are handled and wants a database where the schema is easier to manage. These posts mirror some of my own concerns. GAE is scalable for Google, but it may not be scalable for my application.

    It's been a few days now since GAE (Google App Engine) was released and we had our First Look. It's high time for a retrospective. Too soon? Hey, this is Internet time baby. So how is GAE doing? I did get an invite so hopefully I'll have a more experience grounded take a little later. I don't know Python and being the more methodical type it may take me a while. To perform our retrospective we'll take a look at the three sources of information available to us: actual applications in the AppGallery, blogspew, and developer issues in the forum.

    The result: a cautious thumbs up. The biggest issue so far seems to be the change in mindset needed by developers to use GAE. BigTable is not MySQL. The runtime environment is not a VM. A service based approach is not the same as using libraries. A scalable architecture is not the same as one based on optimizing speed. A different approach is needed, but as of yet Google doesn't give you all the tools you need to fully embrace the red pill vision.

    I think this quote by Brandon Smith in a thread on how to best implement sessions in GAE nicely sums up the new perspective:

    Consider the lack of your daddy's sessions a feature. It's what will make your app scale on Google's infrastructure.

    In other words: when in Rome. But how do we know what the Romans do when the Romans do what they do?

    Brett Morgan expands our cultural education in a thread on slow GAE databases performance when he talks about why MySQL thinking won't work on BigTable:

    It might look almost look like a sql db when you squint, but it's
    optimized for a totally different goal. If you think that each
    different entity you retrieve could be retrieving a different disk
    block from a different machine in the cluster, then suddenly things
    start to make sense. avg() over a column in a sql server makes sense,
    because the disk accesses are pulling blocks in a row from the same
    disk (hopefully), or even better, all from the same ram on the one
    computer. With DataStore, which is built on top of BigTable, which is
    built on top of GFS, there ain't no such promise. Each entity in
    DataStore is quite possibly a different file in gfs.

    So if you build things such that web requests are only ever pulling a
    single entity from DataStore - by always precomputing everything -
    then your app will fly on all the read requests. In fact, if that
    single entity gets hot - is highly utilized across the cluster - then
    it will be replicated across the cluster.

    Yes, this means that everything that we think we know about building
    web applications is suddenly wrong
    . But this is actually a good thing.
    Having been on the wrong side of trying to scale up web app code, I
    can honestly say it is better to push the requirements of scaling into
    the face of us developers so that we do the right thing from the
    beginning. It's easier to solve the issues at the start, than try and
    retrofit hacks at the end of the development cycle.


    A truly excellent explanation of the differences between MySQL thinking and GAE thinking.

    Now, if you can't use MySQL's avg feature, how can an average be calculated using BigTable? Brett advises:

    Instead of calculating the results at query time, calculate them when
    you are adding the records. This means that displaying the results is
    just a lookup, and that the calculation costs are amortized over each
    record addition.

    Clearly this is more work for the programmer and at first blush doesn't seem worth the effort, especially when you are used to the convenience of MySQL. That's why in the same thread Barry Hunter insightfully comments that GAE may not be for everyone:

    This might be a very naive observation, but I perhaps wonder then if
    GAE is the tool for you.

    As I see it the App Engine is for applications that are meant to
    scale, scale and really scale. Sounds like an application with a few
    hundred hits daily could easily run on traditional hosting platforms.

    It's a completely different mindset.
    ...
    Again maybe I am missing something, but the DataStore isn't designed to
    be super fast at the small scale, but rather handle large amounts of
    data, and be distributed
    (and because its distributed it can appear
    very fast at large scale).

    So you break down your database access into very simple processes.
    Assume your database access is VERY slow, and rethink how you do
    things. (Of course the piece in the puzzle 'we' are missing is
    MapReduce! - the 'processing' part of the BigTable mindset)


    Before developers can take full advantage of GAE these types of lessons need to be extracted and popularized with the same ferocity the multi-tier RDBMS framework has been marketed. It will be a long difficult transition.

    Interestingly, many lessons from AWS are not transferable to GAE. AWS has a VM model whereas GAE has an application centric model. They are inverses of each other.

    In AWS you have a bag of lowish level components out of which you architect your application. You can write all the fine low level implementations bits you desire. A service layer is then put in front of everything to hide the components. In GAE you have a high level application component and you build out your application using services. You can't build any low level components in GAE. In AWS the goal is to drive load to the CPU because CPU and bandwidth are plentiful. In GAE you get very limitted CPU, certainly none to burn on useless activities like summing up an average over a whole slice of data returned from SimpleDB. And in GAE the amount of data returnable from the database is small so your architecture needs to be very smart about how data is stored and accessed.

    Very different approaches that lead to very different applications.

    Applications

    The number of applications has exploded. I am always amazed at how enthusiastic and productive people can be when they are actually interested in what they are doing. It happens so rarely. True, most applications aren't even up to Facebook standards yet, but it's early days. What's impressive is how fast they were created and deployed. That speaks volumes about the efficacy of the application centric development model.Will it be as effective delivering "real" apps? That's a question I'm not sure about.

    So far application performance is acceptable. Certainly nothing spectacular. What can you do about it? Nada.

    I like the sketch application because people immediately and quite predictably drew lewd depictions of various body parts. I also like this early incarnation of a forum app. A forum is one of the ideas I thought might work well on AppEngine because the scalable storage problem is solved. I do wonder how the performance will be with a fine tuned caching layer? Vorby is a movie quote site showing a more realistic level of complexity. It has tabs, long lists of text, some graphical elements, some more complex screens, and ratings. It shows you can make applications you wouldn't mind people using.

    An option I'd like to see in the App Gallery is a view source link. Developers could indicate when adding an application if others can view their application source. Then when browsing the gallery we could all learn by looking at real working code. This is how html spread so quickly. Anyone could view the source for any page, copy paste, and you're on your way! With an application centric model the view source viral spread approach would also work.

    Blogspew

    As expected there's lots of blog activity on GAE:

  • As to all those people complaining their favorite language isn't available, take a chill pill, Urubatan asks us When will programmers learn that a language is just a tool?. I mostly agree with this take, but I also agree with a commenter who observed that it's a lot harder for a team of developers to turn on a dime and adopt a whole new everything.
  • Garrick Van Buren says Free & Open Is Its Own Lock-in. The idea being it's worthing paying something you know works, allows you to experiment, and you are aligned with their zeitgeist. Leaving that for "free" isn't a good deal.
  • evan_tech: google app engine limitations. Don't focus on minor problems. The big problems are: all code runs only in response to HTTP fetches, No long connections means no "comet" (server-push messaging), playing around with your data is hard as there's no way to perform operations on your data except by uploading code to the server, Table scans are slow and you can't cache because it's so slow you hit your CPU limit, bulk operations are hard, and no arbitrary queries.
  • RedMonk Clouds Rolling In: The Google App Engine Q&A gives covers a lot of GAE territory. List some of the cons: Python only, not database export, lock-in, and no cron. "...all of the current offerings have limitations that throttle their usage. Many of which are related to the lack of open standards. Apart from the mostly standard Python implementation, App Engine is decidedly non-standard."
  • Alex Bosworth pits AWS vs Google App Engine in a death match. Alex thinks: To be succinct, based on where the Google App Engine is today, I would say AWS still has a strong lead in application hosting, and I would not currently consider writing an application for Google’s current platform. Cons: Lockin, The page-view limitation is quite low, no memcache, No long running pages, or cron jobs, Storage size limitation, One language, No requests unless they are through Google’s API. Pros: it’s free, looks pretty rocking, integrates with Google accounts.
  • Joyent is countering by offering free infrastructure for high volume python applications. Joyent only asks "that you provide Joyent unlimited access to your customer information and clickstream data." Your data has a lot of value. Google is also very aware of that. More in my Why Does Google Do What Google Does? post. Though the Joyent's building blocks approach is very different than Google's application centric approach. We'll see which matters more: the model or facilities?
  • Niall Kennedy in Google App Engine for developers does a great job contrasting the complexity of your normal website setup with an application approach. Normally you: purchase dedicated servers or virtualized slices, capacity plan, configure web server, install Python, Apache, setup MySQL in scalable fault tolerant configuration, insert caching layer, add monitoring layer, add static file serving and bulk file serving, make it all work together, spend your life keeping it working and responding to failures. Nicely drawn contrast to upload and go.
  • TechCrunch's AppEngine test application couldn't handle a TechCrunch level of load, which is a little concerning. This means usage limits are set a bit low and with no pricing model to work from it's reasonable to be concerned about the cost. Nobody wants a cell phone overage nightmare for their website costs.
  • Groovy: Google Datastore and the shift from a RDBMS. An excellent comparison of how BigTable differs from a RDBMS. The conclusion: The end result of this, is that the standard way a developer writes out the table schema for a RDBMS should be dumped almost entirely when considering an app using Google Datastore.
  • Service Level Automation in the Datacenter: What Google App Engine is NOT. It's a web play only, it's not a cloud in the sense of datacenter infrastructure IT can move to. You can't implement: Portal Services, SOA architectures, Business Process Automation, Enterprise integration, HPC, and Server and desktop virtualization.

    A lot has been made of the risk of lock-in. I don't really agree with this as everything is based around services, which you can port to another infrastructure. What's more the problem is developers will be acquiring a sort of learned helplessness. It's not that developers can't port to another environment, they simply won't know how to anymore because they will have never had to do it themselves. Their system design and infrastructure muscles will have atrophied so much from disuse that they'll no longer be able to walk without the aide of their Google crutches. More in another post.

    Developer Forum

    The best way to figure out how a system is doing is to read the developer support forum. What problems and successes are real developers experiencing trying to get real work done? The forum is a hoppin'. As of this writing over 1300 developers have registered and nearly 400 topics are active. What are developers talking about?

  • Please support my favorite language: PHP, Ruby, etc. Hey, they had to start somewhere and Python is as good as anything else. A language is just a tool you know :-)
  • The usual this doesn't work in my environment type of questions. Far fewer than I would expect though.
  • The switch away from RDBMS thinking isn't coming naturally. A lot of questions wondering how to access BigTable like MySQL and that won't work. There are no joins in GQL, so how do you do normal things like get all the comments for a blog post?
  • Lots of how do use this or that API questions. Lack of certain commonly used APIs, like XML parsers is being being encountered.
  • Concern there's no database export. You can bulk upload data, but you have to write your own program to get it out again.
  • People are hitting limits like the 1MB upload limit on all requests. The 1000 database return limit is mentioned a lot. This is very different than the AWS model which advocates moving work to your CPU so it makes sense to return large sets of data. Google limits your CPU usage and the amount of data you can return so you have to be smart how you store and query data.
  • The pure service model has profound limitations for certain application types. An issue of how to do image processing came up. Usually a compiled class is used because using pure Python is slow. But you can't load these types of classes in AppEngine. And you can't parallelize the work by farming it out to other CPUs. You are stuck. Here's were a .Net type managed object model might help.
  • Surprisingly, fulltext search is not supported.
  • Sessions are another how do I it on GAE question. People are used to frameworks handling session storage.
  • One user was surprised at how slow database access was with BigTable. It takes GAE almost 3 seconds to save 50 of dummy records (consisting of just 2 text fields). A nice thread about how best to use BigTable developed. BigTable is meant to scale and you have to do things differently than you do in a MySQL world.

    Many "how do I" questions come up because of the requirement for service level interfaces. For example, something as simple as a hostname to IP mapping can't be done because you don't have socket level access. Someone, somewhere must make a service out of it. Make an external service is a common response to problems. You must make a service external to the GAE environment to get things to work which means you have to develop in multiple environments. This sort of sucks. To get cron functionality do I really need to create an external service outside of GAE?

    The outcome of all this is probably an accelerated servicifaction of everything. What were once simple library calls must now be exposed with service level interfaces. It's not that I think HTTP is too heavy, but as development model it is extremely painful. You are constantly hitting road blocks instead of getting stuff done.
  • Tuesday
    Jan202009

    Product: Amazon's SimpleDB

    Update 35: How and Why Glue is Using Amazon SimpleDB instead of a Relational Database. Discusses a key design decision that required duplicating data in order to mimic RDBMS joins: Given the trade off between potential inconsistencies and scalability, social services have to choose the latter. Update 34: Apparently Amazon pulled this article. I'm not sure what that means. Maybe time went backwards or something? Amazon dramatically drops SimpleDB pricing to $0.25 per GB per month from $1.50 per GB. This puts SimpleDB on par with Google App Engine. They also announced a few new features: a SQL-like SELECT API as well as a Batch Put operation to streamline uploading of multiple items or attributes. One of the complaints against SimpleDB is that programmers end up writing too much code to do simple things. These features and a much cheaper price should help considerably. And you can store lots of data now. GAE is still capped. Update 33: Amazon announces Elastic Block Store (EBS), which provides lots of normal looking disk along with value added features like snapshots and snapshot copying. But database's may find EBS too slow. RightScale tells us Why Amazon’s Elastic Block Store Matters. Update 32: You can now get all attributes for a property when querying. Previously only the ID was returned and the attributes had to be returned in separate calls. This makes the programmer's job a lot simpler. Artificial levels of parallelization code can now be dumped. Update 31: Amazon fixes a major hole in SimpleDB by adding the ability to sort query results. Previously developers had to sort results by hand which was a non-starter for many. Now you can do basic top 10 type queries with ease. Update 30: Amazon SimpleDB - A distributed, highly-scalable, light-weight, query-able, attribute store by Sebastian Stadil. It introduces the CAP theorem and the basics of SimpleDB. Sebastian does a lot of great work in the AWS world and in what must be his limited free time, runs the AWS Meetup group. Update 29: A stroll down the history of a previous RDBMS killer, object databases. Lots of fond memories of the new kid on the block showing us how objects and code were one, the endless OO vs. relational wars, writing a OODBMS training course, dealing with object migration and querying etc, and the slow decline followed by groveling in front of the old master. It would be a terrible irony if a hash table succeeded where OODBMSs failed. Update 28: I didn't make the beta program :-( Update 27: IBM has hired CouchDB creator Damien Katz as their player in the game. Teams Microsoft, IBM, and Amazon have all entered the race. Amazon is 10 furlongs ahead, but watch for team Google, a fast finisher on the outside. Update 26: Red Monk says Microsoft's Astoria project is SDBish, but developers are afraid of lock-in. Update 25: Nati Shalom thinks SDB isn't even a database. Update 24: Igvita asks why do you need SDB when Thrudb is faster and cheaper? It provides a memcached layer in front of a database storing data in S3. And even better, all its service names start with "thru" instead of "S". Update 23: For all you Perl haters, the Perl interface to SDB is clean and beautiful. Update 22: On an Erlang email list Jim Larson says the proper model is to store bulk data in S3 and indexable metadata in SimpleDB. The cost of SimpleDB is 10x for storing data versus S3. We are supposed to build our own inverted index for text searching, which is one of those decisions that sounds good in the meeting room (yay, we don't have to do all that work), but is not a good decision in the real world. Update 21: Sensepost is already creating attack models to drain your bank account through repeated queries. Update 20: Grow some stones, smoothspan says Eventual Consistency Is Not That Scary. Update 19: Jacob Harris in A First Look at Amazon SimpleDB offers up some beta Ruby libraries for accessing SDB. Update 18: Erlang folks hope to get some run, but Erlang the language is too different to go mainstream, though Erlang's concurrency model rocks. A while back I talked about how The Solution to C++ Threading is Erlang and how Java's concurrency approach is fundamentally broken. Update 17: Subbu tirelessly provides a A RESTful version of Amazon's SimpleDB. Update 16: Snarfed sees it as a sort of tuplespace implementation. Compare it to Facebook's API. Ning also has a data API. Update 15: Uncom thinks Winer & Scoble Fail In Tandem. SDB's XML response has 1,755% transmission overhead, which is genius for a per byte pricing model. And I love this one: if you are starting a business whose success hinges on scalability of a data store, you had best figure out how to shard across N machines before you launch. Using a single instance of MySQL for the whole thing is a strong indicator that you have failed at life. Update 14: Styled Bits sees SDB as more of a way to add metadata to S3 objects. Update 13: Bex Huff makes the point you'll still need a caching layer in front of SDB. Update 12: Shahzad Bhatti has been coding for SimpleDB for a few months and gives us a cool Java and Erlang API for basic CRUD operations. Update 11: DBA4Life says Amazon has just flux capacited us back to 1980s style database management. Update 10: Bob Warfield of SmoothSpan explains Why the Amazon SimpleDB is a Huge Next Step. It helps achieve the necessary "16:1 operations cost advantages over conventional software." Update 9: SimpleDB is berkleyDB and 90% of all computing will live in cloud city. Will the Troglyte's revolt? Update 8: Dave Winer says Amazon removes the database scaling wall by adding a storage ramp that scales up when needed and scales down when unneeded. You no longer need to buy expensive VC funded database talent to take your product to the next level. Update 7: Kevin Burton in Google vs Amazon in Open Infrastructure has doubts about the entire hosted model. Bandwidth costs too much, it might hurt your acquisition chances, and you can't trust 'em. He just wants to lease managed raw machine power. Update 6: Amazon SimpleDB and CouchDB compared. Some key differences: SimpleDB is hosted. CouchDB is REST/JSON and SimpleDB is REST/SOAP/XML. In SimpleDB attribute updates are atomic in CouchDB record updates are atomic. CouchDB supports JSON data types and SimpleDB thinks everything is a string. CouchDB has much more flexible indexing and queries. Update 5: Sriram Krishnan gives a more technical overview of SimpleDB. He likes the big hash table approach and brings up how the query language allows for parallelization. Update 4: Mark from areyouwatchingthis.com makes a really insightful point: I run a startup that gets 75% of our traffic from our API. The ability to move that processing and storage into a cloud _might_ save me a lot on hosting. Update 3: Marcelo Calbucci thinks SimpleDB is more of a directory service than a database because records can contain different attributes (no schema) and attributes can have multiple values. Update 2: Smug Mugs' Don MacAskill likes the service, but is concerned that field sizes are limited to 1024 characters and latency from far away datacenters. He thinks most queries will be easy to convert as they are predominantly hash like lookups anyway. Update: Scoble asks if SimpleDB kills MySQL, Oracle, et al. The answer is no. Google has a similar service internally and they are still major users of and contributors to MySQL. Sometimes you just need structured data. So RDBMSs aren't dead. They just may not be the starting point as the barrier to entry for doing the simplest thing to start a website has plummeted. No more setup or admin. Just code and go. The cherry missing from Amazon's AWS hot fudge sundae was a database service. They had a CPU scoop with EC2, they had storage scoop with S3, they had a work distribution scoop with their queue, but the database cherry was missing. Now they've added it and it's dessert time. News of SimpleDB is everywhere. Apparently it's been in development for a while. You can read about it inside looking out, GIGAOM, Innowave, SimpleDB Developer's Guide, and the SimpleDB Home Page. It seems to be a simple properties like store implemented on Erlang (as is CouchDB). It has simple query capabilities on attributes. It's fast and scalable. And At $0.14 per hour it's quite competitive with other options. What it doesn't have is a text search or complex RDBMS style queries for structured data. It's not clear if the data are geographically distributed, in case you are interested in fast response times from different parts of the world. I would be very curious on the relationship between SimpleDB and Dynamo. Even with these limitations it's a disruptive service. Most high speed websites use a property store for unstructured data and that's been hard for smaller groups to implement at scale. But if you're losing your mind trying to figure out how to store your data at scale, maybe you can now turn your attention to more productive problems.

    Click to read more ...