Entries in Grid (19)

Thursday
Sep 25, 2008

Is your cloud as scalable as you think it is?

An unstated assumption is that clouds are scalable. But are they? Stick thousands upon thousands of machines together and there are a lot of potential bottlenecks just waiting to choke off your scalability supply. And if the cloud is scalable, what are the chances that your application is really linearly scalable? At 10 machines all may be well. Even at 50 machines the seas look calm. But at 100, 200, or 500 machines all hell might break loose. How do you know? You know through real life testing. These kinds of tests are brutally hard and complicated. Who wants to do all the incredibly precise and difficult work of producing cloud scalability tests? GridDynamics has stepped up to the challenge and has just released their Cloud Performance Reports. The report is quite detailed, so I'll just cover what I found most interesting. In this report GridDynamics tests three configurations:

  • GridGain running a Monte Carlo simulation on EC2. This is a CPU-only test; no data grid is accessed. The scenario tests the scalability of EC2 and GridGain.
    - GridGain provided near linear scalability end-to-end on a 512 node EC2 hosted grid.
    - EC2 is ready for production usage on large-scale stateless computations, exhibiting good price for performance and a strong linear scaling curve.
  • GigaSpaces running a risk management simulation on EC2. This is a data-driven test. GigaSpaces is used in a configuration where the compute grid and the data grid are separated, even though GigaSpaces supports an in-memory data grid.
    - GigaSpaces provided near linear scalability from 16 to 256 nodes. There was a 28% degradation from 256 to 512 nodes because only four data grid servers were used; more were needed. The compute grid and data grid must each be sized independently to scale properly.
    - EC2 is ready for production usage for classes of large-scale data-driven applications.
  • Windows HPC Server and Velocity running an analytics application in Microsoft's grid testbed. This test was more complicated than the others. It tested the performance implications of keeping data "in the cloud" vs. "outside the cloud" for data-intensive analytics applications.
    - Keeping data close to the business logic matters. Performance improved up to 31 times over the "outside the cloud" configuration.
    - Velocity failed on 50 node clusters with 200 concurrent clients.
    - Local caches provided significant performance gains over distributed caches because the local cache took load off the distributed cache.

They are currently running more tests with different configurations. Hopefully we'll see those results later. All-in-all it's a generally optimistic report. EC2 scales. Most of the tested grid frameworks scaled. What's also clear is that it may take a while before deploying cloud based grids becomes an easy process. It still takes a lot of work to install, configure, start, stop, monitor, and debug bottlenecks in cloud based grids. Thanks to GridDynamics for putting in all this work and I look forward to their next set of reports.


  • Thursday
    Sep 25, 2008

    GridGain: One Compute Grid, Many Data Grids

    GridGain was kind enough to present at the September 17th instance of the Silicon Valley Cloud Computing Group. I've been curious about GridGain so I was glad to see them there. In short, GridGain is: an open source computational grid framework that enables Java developers to improve general performance of processing intensive applications by splitting and parallelizing the workload. GridGain can also be thought of as a set of middleware primitives for building applications. GridGain's peer group of competitors includes GigaSpaces, Terracotta, Coherence, and Hadoop. The speaker for GridGain was the President and Founder, Nikita Ivanov. He has a very pleasant down-to-earth way about him that contrasts nicely with a field given to religious discussions of complex taxonomic definitions. Nikita first talked about cloud computing in general. He feels Java is the perfect gateway for cloud computing, which is good because GridGain only works with Java. The Java centricity of GridGain may be an immediate deal killer or a non-issue for a Java shop. Being so close to the language does offer a lot of power, but it sure sucks in a multi-language environment. Nikita gave a few definitions which are key to understanding where GridGain stands in the grid matrix:

  • Compute Grids: parallel execution.
  • Data Grids: parallel data storage.
  • Grid Computing: Compute Grids + Data Grids
  • Cloud Computing: datacenter + API. The key is automation via programmability as a way to deploy applications. The advantage is a unified programming model: build an application on one node and you can run it on many nodes without code change. Moving peak loads to the cloud can give you a 10x-100x cost reduction.

    Cloud computing poses a number of challenges: deployment, data sharing, load balancing, failover, discovery (nodes, availability), provisioning (add, remove), management, monitoring, development process, debugging, and internal and external clouds (syncing data, syncing code, failing over jobs). Nikita talked some about these issues, but he didn't go in-depth. He showed a good understanding of the issues involved, though, so I would be inclined to think GridGain handles them well. The cloud computing section is new to the standard GridGain presentation. GridGain is moving their grid into the cloud with new features like a cloud management layer available in Q1 2009. This move competes with GigaSpaces' early move to the cloud with their RightScale partnership. It's a good move. Like peanut butter and chocolate, grids and clouds go better together. Grids have been underutilized largely because of infrastructure issues. A cloud platform makes it affordable to grow and manage grids, so we might see an uptick in grid adoption as clouds and grids hook up. GridGain positions themselves as a developer centric framework according to their analysis of cloud computing in Java:
  • Heavy UI oriented. These types of applications or frameworks usually provide UI-based consoles, management applications, plugins, etc. that are the only way to manage resources on the cloud, such as starting and stopping the image. The key characteristic of this approach is that it requires substantial user input and human interaction, and thus these systems tend to be less dynamic and less on-demand. Good examples would be RightScale, GigaSpaces, and ElasticGrid.
  • Heavy framework oriented. This approach strongly emphasizes dynamism of resource management on the cloud. The key characteristic of this approach is that it requires no human interaction and all resource management can be done programmatically by the grid/cloud middleware - and thus it is more dynamic, automated, and truly on-demand. Google App Engine (for Python) and GridGain would be good examples.

    I think there's a misunderstanding of RightScale here. The UI is there to configure the automated system, not to manage the system. The automated system monitors and responds to events without human interaction. Won't their automated cloud layer have to do something similar? To bootstrap any complex system out of the mud of complexity a helpful UI is needed. The framework approach of GridGain's infrastructure is developer friendly, but that won't fly for external management within the cloud.

    GridGain's True Nature: One Compute Grid, Many Data Grids

    With these definitions in place we can now learn the secret of GridGain: One Compute Grid, Many Data Grids. Ding! Ding! Ding! Once I understood this I understood GridGain's niche. GridGain has focused on making it dead simple to distribute work across a compute grid. It's a job management mechanism. GridGain doesn't include a data grid. It will work against any data grid. For some reason this fact was something I'd never pulled out of the noise before. And when I would read Nikita's blog with all the nifty little code samples I never really appreciated what was happening. Yes, I'm just that dumb, but I also think GridGain should expose the magic of what's going on behind the scenes more, rather than push the simple 30-second let's-write-code-live style demo. Seeing the mechanics would make it easier to build a mental model of the value being added by GridGain.

    Transparent and Low Configuration Implementation of Key Features

    A compute grid is just a bunch of CPUs that calculations/jobs/work can be run on. As a developer you break your problem up into smaller tasks and spread them across all your nodes so the result is calculated faster because the work happens in parallel. GridGain enthusiastically supports the MapReduce model of computation. When deploying a grid a few key problems come up:
  • How do you get your code to all nodes? Not just the first time, but every time a JAR file changes, how is it distributed across all nodes?
  • How do all the other nodes find each other when they come up? Clearly for work to be sent to nodes someone must know about them.
  • How are jobs distributed to the nodes? Somehow jobs must be sent to a node, the calculations made, and the results assembled.
  • How are failures handled? Somehow when a node goes down and new nodes come on-line work must be rescheduled.
  • How does each node get the data it needs to do its work? Scalable computation without scalable data doesn't work for most problems.

    Much of the drama is lost with GridGain because most of these capabilities are implemented almost transparently. Discovery happens automatically. When nodes come up they communicate with each other and transparently form a grid. You don't see this, it just happens. In fact, this was one of GridGain's issues when porting to the cloud. They used multicast for discovery and Amazon doesn't support multicast. So they had to use another messaging service, which GridGain supports doing out-of-the-box, and they are now working on their own unicast version of the discovery service. Deploying new code is always a frustrating problem. Over the same transparently formed grid, code updates are transparently auto-deployed to every node. Again, this is one of those things you see happen from Eclipse and it loses most of the impact. It just looks like how it's supposed to work, but rarely does. With GridGain you do a build and your code changes are automatically sent through to each node in the grid. Very nice. To mark a method as gridified, an annotation (or an API call) is used:
    @Gridify(taskClass = GridifyHelloWorldTask.class, timeout = 3000)
    public static int sayIt(String phrase) {
        // Simply print out the argument.
        System.out.println(">>> Printing '" + phrase + "' on this node from grid-enabled method.");
        return phrase.length();
    }
    
    The task class is responsible for splitting method execution into sub-jobs. For a full example go here. The @Gridify annotation uses AOP (aspect-oriented programming) to automatically "gridify" the method. I assume this registers the method with the job scheduling system. When the application comes up and triggers execution the method is then scheduled through the job scheduling system and allocated to nodes. Again, you don't see this and they really don't talk enough about how this part works. Notice how much complexity is nicely hidden by GridGain with very little configuration on the developer's part. There aren't a billion different XML files where every single part of the system has to be defined ahead of time. The dynamic, transparent nature of the core features makes it simple to use.
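    To make the split/reduce flow concrete, here is a minimal plain-Java sketch of the pattern a task class implements: split a phrase into one sub-job per word, run the sub-jobs in parallel, and reduce the partial results into a single answer. This is not GridGain code; the class name is invented for illustration and a local thread pool stands in for the grid nodes.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Hypothetical illustration of the split/execute/reduce pattern a gridify
    // task class implements. A real compute grid ships each sub-job to a remote
    // node; here a local thread pool stands in for the grid.
    public class SplitReduceSketch {
        public static int sayIt(String phrase) throws Exception {
            ExecutorService grid = Executors.newFixedThreadPool(4); // stand-in for grid nodes
            try {
                // Split: one sub-job per word, each returning a partial result.
                List<Callable<Integer>> subJobs = new ArrayList<>();
                for (String word : phrase.split(" ")) {
                    subJobs.add(() -> {
                        System.out.println(">>> Processing '" + word + "' on a worker.");
                        return word.length();
                    });
                }
                // Execute the sub-jobs in parallel, then reduce the partial results.
                int total = 0;
                for (Future<Integer> partial : grid.invokeAll(subJobs)) {
                    total += partial.get();
                }
                return total;
            } finally {
                grid.shutdown();
            }
        }

        public static void main(String[] args) throws Exception {
            System.out.println("Total length: " + sayIt("Hello World"));
        }
    }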

    Integrating with the Data Grid

    We haven't talked about data at all. If you are just concerned with a program like a Monte Carlo simulation then the compute grid is all you need. But most calculations require data. Where does your massive compute grid pull the data from? That's where the data grid comes in. A data grid is the controlled sharing and management of large amounts of distributed data. GridGain leaves the data grid up to other software by integrating with packages like JBoss Cache, Oracle Coherence, and GigaSpaces. Remember: One Compute Grid, Many Data Grids. GridGain accesses the data grid through an API so you can plug in any data grid you want to support with a little custom code. Google and Hadoop use a distributed file system (DFS) as their data grid. This makes sense. When you need to feed lots of CPUs the data can't come from a centralized store. The data must be parallelized and that's what a DFS does. A DFS splats data across a lot of spindles so it can be pulled relatively quickly by lots of CPUs in parallel. Other products like Coherence and GigaSpaces store data in an in-memory data grid instead of a filesystem. Serving data from memory is faster, but you are limited by the amount of memory you have. If you have a petabyte of data your choice is clear, but if your problem is a bit smaller then maybe an in-memory solution would work. The closer data is to the business logic the better performance will be. GridGain controls job execution while the data grid is responsible for the availability and integrity of the data. GridGain doesn't care what data grid you use, but your choice has implications for performance. A compute grid and an in-memory data grid in the same cloud will smoke configurations where the data grid comes from disk or is located outside the cloud.
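    The talk didn't show GridGain's actual integration SPI, so here is a hypothetical sketch of what "plug in any data grid through an API" looks like in practice: jobs code against a small interface, and each data grid (JBoss Cache, Coherence, GigaSpaces, a DFS) gets its own adapter behind it. All names below are invented for illustration.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical data-grid abstraction: jobs read and write through this
    // interface, so the choice of data grid is a deployment decision rather
    // than something baked into the business logic.
    interface DataGrid {
        byte[] get(String key);
        void put(String key, byte[] value);
    }

    // Trivial in-memory stand-in; real adapters would delegate to Coherence,
    // GigaSpaces, JBoss Cache, a distributed file system, and so on.
    class InMemoryDataGrid implements DataGrid {
        private final Map<String, byte[]> store = new ConcurrentHashMap<>();
        public byte[] get(String key) { return store.get(key); }
        public void put(String key, byte[] value) { store.put(key, value); }
    }

    public class DataGridExample {
        public static void main(String[] args) {
            DataGrid grid = new InMemoryDataGrid(); // swap in a different adapter here
            grid.put("portfolio:42", "AAPL,GOOG".getBytes());
            System.out.println(new String(grid.get("portfolio:42")));
        }
    }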

    GridGain is Linearly Scalable for a Pure CPU Benchmark

    The good folks at GridDynamics are doing some serious cloud testing of different products and different clouds. They ran a test, Scalability Benchmark of Monte Carlo Simulation on Amazon EC2 with GridGain Software, which found GridGain was linearly scalable to 512 nodes in Amazon's EC2. A Monte Carlo simulation is a CPU-only test; it does not use a data grid. A data grid based test would be more useful to me as everything changes once large amounts of data start flying around, but it does indicate the core of GridGain is quite scalable.

    Wrapping Up

    Grid products like Coherence and GigaSpaces include both compute grid and data grid features. Why choose a compute grid only system like GridGain when other products include both capabilities? GridGain might say they win business on the quality of their compute grid, excellent support and documentation, and the ability to cleanly integrate into almost any existing ecosystem through their well thought out API abstraction layer and their out-of-the-box support for almost every important Java framework. Others may counter that performance is far better when the business logic and the job management are integrated. All interesting issues to trade off in your own decision-making process. GridGain is free, as their business model is based on providing support and consultation. A non-starter for many is the Java-only restriction. What is unique about GridGain is how easy and transparent they made it to use and deploy. That's some thoughtful engineering.

    Related Articles

  • Gridify Blog
  • Ten Useful Gridgain How-To Tips
  • 10 Reasons to Use GridGain
  • What is Grid Gain?
  • Developers Productivity: Unsung Hero of GridGain
  • GridGain vs Hadoop
  • Cameron Purdy: Defining a Data Grid
  • Compute Grids vs. Data Grids


  • Sunday
    May 25, 2008

    Product: Condor - Compute Intensive Workload Management

    From their website: Condor is a specialized workload management system for compute-intensive jobs. Like other full-featured batch systems, Condor provides a job queueing mechanism, scheduling policy, priority scheme, resource monitoring, and resource management. Users submit their serial or parallel jobs to Condor, Condor places them into a queue, chooses when and where to run the jobs based upon a policy, carefully monitors their progress, and ultimately informs the user upon completion. While providing functionality similar to that of a more traditional batch queueing system, Condor's novel architecture allows it to succeed in areas where traditional scheduling systems fail. Condor can be used to manage a cluster of dedicated compute nodes (such as a "Beowulf" cluster). In addition, unique mechanisms enable Condor to effectively harness wasted CPU power from otherwise idle desktop workstations. For instance, Condor can be configured to only use desktop machines where the keyboard and mouse are idle. Should Condor detect that a machine is no longer available (such as a key press detected), in many circumstances Condor is able to transparently produce a checkpoint and migrate a job to a different machine which would otherwise be idle. Condor does not require a shared file system across machines - if no shared file system is available, Condor can transfer the job's data files on behalf of the user, or Condor may be able to transparently redirect all the job's I/O requests back to the submit machine. As a result, Condor can be used to seamlessly combine all of an organization's computational power into one resource.

    Related Articles

  • High Throughput Computing by Miron Livny
  • Condor Presentations
  • (my) Principles of Distributed Computing by Miron Livny


  • Friday
    Dec 28, 2007

    Amazon's EC2: Pay as You Grow Could Cut Your Costs in Half

    Update 2: Summize Computes Computing Resources for a Startup. Lots of nice graphs showing Amazon is hard to beat for small machines but becomes less cost efficient for well-used larger machines. Long term storage costs may eat your savings away. And out-of-cloud bandwidth costs are high.

    Update: via ProductionScale, a nice Digital Web article on how to set up S3 to store media files and how Blue Origin was able to handle 3.5 million requests and 758 GB of bandwidth in a single day for very little $$$. Also a RightScale article on network performance within Amazon EC2 and to Amazon S3: 75 MB/s between EC2 instances, 10.2 MB/s between EC2 and S3 for download, 6.9 MB/s for upload.

    Now that Amazon's S3 (storage service) is out of beta and EC2 (elastic compute cloud) has added new instance types (the class of machine you can rent) with more CPU and more RAM, I thought it would be interesting to take a look at how their pricing stacks up. The quick conclusion: the more you scale, the more you save. A six node configuration in Amazon is about half the cost of a similar setup using a service provider. But cost may not be everything... EC2 gets a lot of positive pub, so if you would like a few other perspectives take a look at Jason Hoffman of Joyent's blog post on Why EC2 isn't yet a platform for "normal" web applications and Hostingfu's Short Comings of Amazon EC2. Both are well worth reading and tell a much needed cautionary tale. The upshot is that batch operations clearly work well within EC2 and S3, but the jury is still out on deploying large database centric websites completely within EC2. The important sticky issues seem to be: static IP addresses, load balancing/failover, lack of data center redundancy, lack of custom OS building, and problematic persistent block storage for databases. A lack of large RAM and CPU machines has been solved with the new instance types. Assuming you are OK with all these issues, will EC2 cost less? Cost isn't the only issue of course. If dynamically scaling VMs is a key feature, SQS (message queue service) looks attractive, or S3's endless storage is critical, then weigh accordingly. My two use cases are my VPS, for selfish reasons, and a quote from a leading service provider for a 6 node setup for a startup. Six nodes is small, but since the architecture featured horizontal scaling, the cost of expanding was pretty linear and incremental. Here's a quick summary of Amazon's pricing:

  • Data transfer: $0.10 per GB for all data transfer in; $0.18 per GB for the first 10 TB/month of data transfer out; $0.16 per GB for the next 40 TB/month of data transfer out; $0.13 per GB for data transfer out over 50 TB/month. You don't pay for data transfer between EC2 and S3, so that's an advantage of using S3 within EC2.
  • S3: $0.15 per GB-Month, $0.01 per 1,000 PUT or LIST requests, $0.01 per 10,000 GET and all other requests. I have no idea how many requests I would use.
  • Small Instance at 10 cents/hour: 1.7 GB of memory, 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit), 160 GB of instance storage, 32-bit platform. The CPU capacity is that of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.
  • Large Instance at 40 cents/hour: 7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each), 850 GB of instance storage, 64-bit platform.
  • Extra Large Instance at 80 cents/hour: 15 GB of memory, 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each), 1690 GB of instance storage, 64-bit platform.

    You don't have to run these numbers by hand. To calculate the Amazon costs I used their handy dandy calculator. When performing calculations, per Amazon, I used 732 hours per month.

    Single VPS Configuration

    I was very curious about the economics of moving this simple site (http://highscalability.com) from a single managed VPS to EC2. Currently my plan provides:
  • 1GB RAM (with no room for expansion).
  • 50 GB of storage. I use about 4 GB.
  • 800 GB monthly transfer. Of which I use 1 GB/month in and 10 GB/month out.
  • 8 IP addresses. Very nice for virtual hosts.
  • 100Mbps uplink speed.
  • Very responsive support. Very poor system monitoring.
  • 1 VM backup image. Would prefer two.
  • The CPU usage is hard to characterize, but it's been more than sufficient for my needs.
  • Cost: $105 per month.

    From Amazon:
  • The small instance looks good to me. What I need is more memory, not more CPU, so that's attractive. VPS memory pricing is painfully high.
  • 10 GB storage, 1 GB transfer in, 10 GB transfer out, 2000 requests.
  • Cost: about $80 per month (see the rough check below).

    Will I switch? Probably not. I don't know how well Drupal runs in EC2/S3 and it's not really worth it for me to find out. Drupal isn't horizontally scalable so that feature of EC2 holds little attraction. But the extra RAM and affordable disk storage are attractive. So for very small deployments using your standard off the shelf software, there's no compelling reason to switch to EC2.
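    As a rough sanity check on the "about $80 per month" figure, here is the back-of-the-envelope arithmetic using the rates listed above and Amazon's 732-hour month (this is my own estimate, not output from Amazon's calculator):

    // Rough check of the single-VPS estimate using the rates listed above:
    // small instance $0.10/hr, storage $0.15/GB-month, transfer in $0.10/GB,
    // transfer out $0.18/GB, 732 hours per month. Request charges add pennies.
    public class VpsEstimate {
        public static void main(String[] args) {
            double instance = 0.10 * 732;            // $73.20
            double storage  = 0.15 * 10;             // $1.50 for 10 GB
            double transfer = 0.10 * 1 + 0.18 * 10;  // $1.90 for 1 GB in, 10 GB out
            System.out.printf("~$%.2f/month%n", instance + storage + transfer); // ~$76.60
        }
    }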

    Six Node Configuration for Startup

    This configuration is targeted at a real-life Web 2.0ish startup needing about 300GB of fast, highly available database storage. Currently there are no requirements for storing large quantities of BLOBs. There are 6 nodes overall: two database servers in failover configuration, two load balanced web servers, and two application servers. From the service provider:
  • Two database servers are $1500/month total for dual processor quad core Xeon 5310, 4GB of RAM, and 4x 300GB 15K SCSI HDD in RAID 10 configuration, with 5 IP addresses, 10 Mbps public and private networks, and 2000 GB of public bandwidth.
  • The other 4 servers have 2GB RAM each, a single Quad Core Xeon 5310, and 1x 73GB SAS 10k RPM drive, for about $250 each.
  • For backup, 500 GB costs $200/month.
  • I'm not including load balancer or firewall services as these don't apply to Amazon, which may be a negative depending on your thinking. Plus the provider has an excellent service and management infrastructure.
  • Cost: $2700/month.

    From Amazon:
  • Two extra large instances for the database servers. Your architecture here is more open and could take some rethinking. You could just rent one and bring another online from the pool on failure, which would save about $500 a month. I'll assume we load balance read and write traffic here so we'll have two. Using one extra large instance is about the same price as two large instances.
  • Four small instances for the other servers. Here is another place the architecture could be rethought. It would be easy enough to buy one or two servers upfront and then add servers in response to demand. That might save about $140/month under low load conditions. Adding another 4 servers adds about $300.
  • 300 GB of storage. Doubling to 600 GB only adds about $50/month. If storing large amounts of data does become a requirement this could be a big win.
  • 200 GB transfer in, 1800 GB transfer out. This is a guesstimate. Doubling the numbers adds another $400.
  • 40,000 requests. No idea, but these are cheap so being wrong isn't that expensive.
  • Cost: about $1300/month using two large instances for the database.
  • Cost: about $1900/month using two extra large instances for the database.

    The cost numbers really stand out (a rough arithmetic check follows below). You pay half for a similar setup and the cost of incrementally scaling along any dimension is relatively inexpensive. You could in fact start much smaller and much cheaper and simply pay as you grow. The comparison is not apples to apples however. All the potential problems with EC2 have to be factored in as well. As someone said, architecting for EC2/S3 takes a different way of thinking about things. And that's really true. Many of the standard tricks don't apply anymore. Deploying customer facing production websites in a grid is not a well traveled path. If you don't want to walk the bleeding edge then EC2 may not be for you. For example, in the service provider scenario you have blisteringly fast disks in a very scalable RAID 10 setup. That will work. Now, how will your database work over S3? Is it even possible to deploy your database over S3 with confidence? Do you need to add a ton of caching nodes? Will you have to radically change your architecture in a way that doesn't fit your skill set or schedule? Will the extra care and monitoring needed by EC2 be unacceptable? Is the single data center model a problem? Does the lack of a hardware firewall and a load balancer seem like too big a weakness? Can you have any faith in Amazon as a grid provider? Only the shadow may know the answers to these questions, but the potential cost savings and the potential ease of scaling make the questions worth answering.
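    As with the single VPS case, a quick back-of-the-envelope check (assuming the rates listed earlier and a 732-hour month; again my own arithmetic, not Amazon's calculator) lands close to the quoted $1300 and $1900 figures:

    // Rough check of the six node estimates quoted above, using the Amazon
    // rates listed earlier and 732 hours per month.
    public class SixNodeEstimate {
        public static void main(String[] args) {
            double hours = 732;
            double smallNodes    = 4 * 0.10 * hours;         // $292.80
            double storage       = 0.15 * 300;               // $45.00
            double transfer      = 0.10 * 200 + 0.18 * 1800; // $344.00
            double twoLarge      = 2 * 0.40 * hours;         // $585.60
            double twoExtraLarge = 2 * 0.80 * hours;         // $1171.20
            System.out.printf("Two large DB nodes:       ~$%.0f/month%n",
                    smallNodes + storage + transfer + twoLarge);      // ~$1267
            System.out.printf("Two extra large DB nodes: ~$%.0f/month%n",
                    smallNodes + storage + transfer + twoExtraLarge); // ~$1853
        }
    }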

    Related Articles

  • Build an Infinitely Scalable Infrastructure for $100 Using Amazon
  • An Unorthodox Approach to Database Design : The Coming of the Shard


  • Tuesday
    Oct 23, 2007

    Hire Facebook, Ning, and Salesforce to Scale for You

    One of the premier scaling strategies is always: get someone else to do the work for you. But unlike Tom Sawyer, you won't have to trick anyone into whitewashing a fence for you. Times have changed. Companies like Ning, Facebook, and Salesforce are more than happy to help. Their price: lock-in. Previously you had few options when building a "real" website. You needed to do everything yourself. Infrastructure and application were all yours. Then companies stepped in by commoditizing parts of the infrastructure, but the application was still yours. The next step is full on Borg take-no-prisoners assimilation, where the infrastructure and application are built as one collective. What you have to decide as someone faced with building a scalable website is if these new options are worth the price. Feeding this explosion of choice is one of the new strategy games on the intertubes: the Internet Platform Game. Ning's Marc Andreessen defines a platform as: a system that can be programmed and therefore customized by outside developers -- users -- and in that way, adapted to countless needs and niches that the platform's original developers could not have possibly contemplated, much less had time to accommodate. The idea is you'll win great rewards in exchange for coding to someone else's internet platform. From Ning you'll win a featureful and customizable social networking platform that they are completely responsible for scaling. The cost ranges from free to very reasonable. From Facebook you'll win prime space on the profile page of over 40 million virally infected customers. It's free, but you must make your application scalable enough to handle all those millions. By coding to the Salesforce platform you'll win the same infrastructure that executes 100 million Salesforce transactions a day. The cost of their service is unknown at this time.

    The Three Levels of Internet Platforms

    Mr. Andreessen then went a step further and defined a three level platform categorization scheme:
  • Level 1: Access API. A platform provided in the form of a REST/SOAP web services API. Examples: eBay, Paypal, Flickr, Digg. Your application lives outside the service and their API is your only access point to the system. Scalability is completely up to you. You are basically building a mashup from distributed parts in your own data center.
  • Level 2: Plug-In API. A platform provided in the form of a system for embedding your application inside another application. Examples: Facebook, Eclipse, Firefox. You still use an API, but the user sees an integrated application because your application is using their screen real estate, log in, user accounts and so on. For internet plug-ins scalability is still up to you. The millions of Facebook users running your application must run completely on your servers.
  • Level 3: Deep hosting. A platform provided in the form of an API, plug-in, and fully hosted runtime environment. Examples: Ning, Salesforce, and Second Life. Your application is completely integrated with a host application framework and runs completely on the host servers. They are responsible for adding machines, maintenance, and management. You are free to just write your application. Amazon is on his original list, but I don't put it there. If Amazon exposed their Dynamo service I would, but since with EC2 you are stuck worrying about database storage they really don't belong here. Like the typical depiction of human ascent from amoeba to weapon-wielding, art-appreciating primate, the levels are meant to indicate progress. While in reality evolution isn't about progress at all. It's all about survival through adaptation to local ecological niches. And that's how I look at the levels. At each level you gain something and you lose something. You need to select your niche by looking at your talents and needs.

    Why Use an Access API?

    Using open APIs to access services is what has made the internet great. APIs provide the most flexibility at the greatest cost. You get access to a huge number of wonderful services for virtually nothing. The linkage between websites is a relatively simple API and a data definition. You can do anything you want, but you have to build the infrastructure to do it. Yet that's a lot better than building your own map service, your own SMS service, or your own photo sharing service. Still, there's so much work left to do. Grid services make the job easier, but the level of expertise it takes to create a scalable site is still very high.
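    The access API model is easy to see in code: your servers call someone else's HTTP API and stitch the responses together, and scaling whatever you do with the results is entirely your problem. Below is a generic sketch of such a call in Java; the endpoint URL is a placeholder, not a real service.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // Generic "Level 1" access-API call: a plain HTTP GET against someone
    // else's web service. The endpoint below is hypothetical.
    public class AccessApiCall {
        public static String fetch(String endpoint) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                StringBuilder body = new StringBuilder();
                String line;
                while ((line = in.readLine()) != null) {
                    body.append(line).append('\n');
                }
                return body.toString();
            } finally {
                conn.disconnect();
            }
        }

        public static void main(String[] args) throws Exception {
            // Placeholder standing in for a photo/search/payment API.
            System.out.println(fetch("https://api.example.com/photos?tag=cloud"));
        }
    }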

    Why Use a Plug-In?

    Since Facebook is the only internet company in this category the answer is clear why you want to be a Facebook plug-in: to get access to a lot of users, connected by an exploitable social graph, for the purpose of exponentially propagating your application along the graph. Most would be ecstatic to get to hundreds of thousands of regular users on their own standalone site. With Facebook that's very possible. The reward is great, but the costs are great too. Your application must be something that can be deconstructed onto Facebook. I don't see gmail making it as a Facebook app. You must subject yourself to a lot of restrictions to use the Facebook infrastructure. You must trust yourself to a poorly documented system in which it is hard to get anything done. And to top it off: Facebook does not host your application. This really blew me away when I first heard about it. When someone says they are offering a platform my immediate assumption is they are hosting your application. That's what a platform is, isn't it? But your application must run on your own hardware. Imagine going from 0 to millions of users in the space of a few days. How would you handle that? Well that's exactly the problem iLike (a popular music sharing site) had when they released their Facebook app. Mr. Andreessen gives a wonderful if somewhat self-serving account of iLike's troubles with viral growth. After launching they posted this on their blog: In our first 20 hours of opening doors we had 50,000 users sign up, and it is only accelerating. (10,000 users joined in the first 12 hrs. 10,000 more users in the next 3 hrs. 30,000 more users in the next 5 hrs!!) We started the system not knowing what to expect, with only 2 servers, but ready with backup. Facebook's rabid userbase chewed up our 2 servers almost instantly. We doubled our capacity to catch up. And then we doubled it again. And again. And again. Oh crap - we ran out of servers!! Although iLike.com has a very healthy level of Web traffic, and even though about half of all the servers in our datacenter were sitting unused, idle, as backup capacity, we are now completely maxed out. We just emailed everybody we know across over a dozen Bay Area startups, corporations, and venture firms in a desperate plea to find spare servers so we can triple our capacity for the continued onslaught. Tomorrow we are picking up over 100 servers from different companies to have them installed just to handle the weekend's traffic. (For those who responded to our late night pleas, thank you!) iLike says they now have over 3 million Facebook users and are growing at an astonishing rate of 300,000 users per day. That number of users and growth rate will make almost anyone salivate. Yet how many can afford the hundreds and hundreds of servers it would take to handle all those users, especially if you have an unclear monetization strategy? Which brings us to Deep Hosting and Mr. Andreessen's end game for the internet's evolution.

    Why Use Deep Hosting?

    The trouble with handling application growth under Facebook's large user base has an obvious solution: host your application on their infrastructure. This is exactly what Mr. Andreessen has done with Ning. Out of the Ning box you get an exceptionally functional social networking package. So functional in fact it makes almost anyone think "do I really need to reinvent all this stuff when they've already done it? Can't I just tweak a few things and make it my own?" And that's exactly what Ning wants to hear. They've made it so you can completely rebrand their software, add your own features using normal programming tools, yet still host your application on their platform, on their servers, in their datacenter. So you don't have to worry about scaling. It's Ning's job to scale the database, back it up, manage the infrastructure, add servers, and do all the other nasty bits that keep so many people away from deploying successful websites. So the temptation is clear. Go with Ning and you immediately get a cool system that will scale and that you can still program if you feel the need. But with all that power comes a price, as usual. You are locked inside a gilded cage. If your application slows down there's not much you can do about it. I found their documentation better than Facebook's, but not very useful for someone looking to get going quickly, and that makes me very nervous when adopting a platform. Yet when they add features, as they frequently do, your app gets them for free. You see some of the same effects here that all Google apps get when the Google stack is improved. And not having to worry about scalability is very attractive, especially at such a reasonable cost.

    Problems with Deep Hosting

    Mr. Andreessen thinks that "in the long run, all credible large-scale Internet companies will provide Level 3 platforms." There are three problems with this argument.
  • One: Ning has the same problem as Salesforce: only their part of the application infrastructure is scalable. What if I want to add a new service that is specific to my application? Let's say I want to send mass emailings for an invitation feature, for example. How do I make my infrastructure for this run inside their platform? I don't. Which means I have to be able to build a scalable infrastructure anyway. Which means I might as well do the whole thing. But Ning might say their functionality is so compelling that it's worth the trade off. You can always make those external services. Which brings us back to: if I have to do one part I might as well do it all. And it also brings us to the second problem with the L3 platform model.
  • Two: How compelling will each L3 domain be? You have to be very very attractive to even get someone to consider assimilating into a platform. Ning has done an excellent job at this. But how many other companies in how many other domains will do as good a job? Precious few I would think.
  • Three: Mr. Andreessen maintains it is "really easy to learn how to program -- in fact, it's never been easier." So centering the L3 platform definition around programmability is not seen as a concern. But programming is not easy. It's very hard. Especially with such poorly documented systems. The more code you have to write the further you are away from your goal and the further you are away from adoption. This is why we see systems like Drupal with well defined plug-in architectures being very popular. Most people can't and won't ever program, so building things from pre-existing parts (like how our bodies evolved) allows people to get a lot of core functionality with the chance for specialization and expandability.

    What does this mean for you?

    I've found it difficult to reconcile all the different pros and cons of each approach. There is a definite value in all these alternatives. If you have a vision for an application then building it yourself is the only way you'll achieve that vision. So do it yourself. But what good is a vision without users? So go Facebook. But I could get something going very quickly in Ning and then expand over time with much less hassle, even if it's not exactly what I want. So go Ning. What to do? The point of this post isn't to come to a conclusion. The point has been to cover some new and different approaches to scalability so you can spend a few sleepless nights pondering your options too :-)

    Related Articles

  • The three kinds of platforms you meet on the Internet by Marc Andreessen
  • Analyzing the Facebook Platform, three weeks in by Marc Andreessen
  • Q&A with iLike’s Ali Partovi, on Facebook By Eric Eldon
  • I want to understand Ning's architecture and how it works
  • Response to Three Platforms You Meet by Joshua
  • Ning's Developer Documentation
  • Facebook's Application Architecture
  • Salesforce's On-Demand Computing Platform
  • Building a Business on Virtual Infrastructure, Using Google and salesforce.com


  • Saturday
    Oct 20, 2007

    Should you build your next website using 3tera's grid OS?

    Update 2: 3tera has added Dynamic Appliances, which are "packaged data center operations like backup, migration or SLAs that users can add to their applications to provide functionality."

    Update: in an effort to help cross the chasm of how to start building a website using their grid OS, 3tera is offering their Assured Success Plan. The idea is to provide training, consulting, and support so you can get started with some confidence you'll end up succeeding.

    If you are starting or extending a website you have a problem: what technologies should you use? Now there are more answers to that question than ever. One new and refreshingly innovative answer is 3tera's grid OS. In this podcast interview with Bert Armijo from 3tera, we'll learn how 3tera wants to change how you build websites. How? By transforming the physical into the virtual and then allowing the virtual to be manipulated as if it were real. Could I possibly be more abstract? Not really. But when I think of what they are doing that's the mental model I see whirling around in my mind. Don't worry, I promise we'll drill down to how it can help you in the real world. Let's see how. I think of 3tera's product as like staying at a nice hotel. At home you are in charge. If something needs doing you must do it. If something breaks you must fix it. But at a nice hotel everything just happens for you. Your room is cleaned, beds are made, outrageously expensive candy bars are replaced in the mini-bar, food arrives when you order it and plates disappear when you are done, and the courtesy mint is placed just so on your pillow. You are free to simply enjoy your stay. All the other details of living just happen. That's the same sort of experience 3tera is trying to provide for your website. You can concentrate on your application and 3tera, through their GUI on the front-end and their AppLogic grid operating system on the back-end, worries about all the housekeeping. I think Bert summed up their goal wonderfully when he said their aim is to:

    Get people's hands off physical boxes and give them a way to define complex infrastructures in a reusable way that they can then instantiate, trade, sell, replicate, back up, and manage as individual units. This is what AppLogic does incredibly well.
    What they are doing is taking hard physical resources like CPU and storage and decoupling them from their physical sources so you can just order and use them on demand without worrying how it's done under the covers. This is a trend that has been happening for a while, but their grid OS takes that process to the next level. Your physical co-lo cage is now a private virtual data center. Physical boxes, once lovingly spec'ed, bought, and installed, are now allocated on demand from a phalanx of preconfigured and separately maintained servers. Physical storage, once lovingly pieced together from disks, controllers, and networks, is now allocated from a vast unending sea of virtual storage. Physical load balancers are now programs you can create. What this means for you is you can take a website architecture you've drawn up on your whiteboard and simply and quickly create it in a data center. It's all configurable from a GUI. You can bring on 10 new web servers with a simple drag and drop operation. It's basically your whiteboard diagram come to life, only you get to skip all the nasty implementation bits. In the virtual world the nasty non application related implementation bits are someone else's problem. 3tera's value proposition is pretty easy to understand:
  • Simplify the data center. You no longer need to locate, outfit, staff, maintain, and support a co-lo space.
  • Simplify operations. A few people can manage a lot machines.
  • Simplify disaster recovery. Failover is complicated and often doesn't work as planned. With AppLogic your redundant data center is always the same because the virtual data center is copied as a unit. You can pick it up and move it anywhere you want.
  • Simplify the cost model for growth. If you grow how are you going to fund your hardware? Growing on a grid is more agile, incremental, and requires less upfront investment.
  • Simplify your architecture. The grid OS provides a powerful implementation model of how you should structure, grow, and maintain your system. You don't need to code it from scratch or think it up yourself.

    In short: customers don't care about your servers. Hardware and the data center do not add value. Your core competency is in your application and running your business, not playing with servers. Well, that's it for the overview. Please listen to this podcast for all the nitty-gritty details. Download audio file (1:16 minutes, mp3).

    Podcast Notes

    I know what you are probably saying. You are saying: "But Todd, the podcast is over an hour long, couldn't you have please made it longer? I have nothing else to do today and I need to waste more time!" What can I say, Bert was very knowledgeable and helpful, and this is a new model for building scalable websites, so I was trying to figure out how I could physically make a website using their product. That takes a lot of questions. I am happy with the result though. I think I have a good picture of how their system works and I think it's well worth investigating if you are in the market for creating or expanding a website. Here are some notes taken from the podcast.
  • They started 3 years ago. At that time nobody could understand what they were trying to build. They have just now been able to build the higher level features, like Smart Appliances, that they wanted to build originally. They've been concentrating on making all the plumbing work.
  • The AppLogic grid operating system allows you to take hard infrastructure - servers, load balancers, firewalls, VPNs, all the boxes you need to make a website - and deploy them in a virtual data center.
  • A virtual data center (VDC) is like a cage you would buy from a co-location service except you operate and manage it through a browser. You can be anywhere in the world and you can use hosting services anywhere in the world.
  • An entry level package ranges from $500 to a few thousand a month. The starting point is 4 - 32 CPUs, some amount of storage, and some amount of bandwidth. You add resources as you need to. Overage charges are passed through to you from the data center provider. They don't mark them up.
  • They don't own any servers. They contract with hosting providers' data centers, like SoftLayer and LayeredTech, for a uniform set of resources.
  • They offer templates for a scalable virtualized LAMP infrastructure as a starting point for building your own applications.
  • Their GUI shows you the architecture. You don't have to think of physical boxes.
  • There's a controller for the VDC through which you can provision your system.
  • You can still login to any physical or logical service. You have root access. You can install anything and manage the system, but you don't have to worry about where it physically resides.
  • To create an application:
    - You use the controller to provision a LAMP cluster.
    - Then you log into the Apache server and configure it how you wish.
    - Then restart and it begins to serve.
    - Say you want 10 front-end web servers. The load balancer is a virtual load balancer you program.
    - You use a virtual NAS. Upload code to the NAS, then have all Apache servers run off the NAS, so you don't have to log into each one and upload code.
  • Shared storage is part of the virtual data center by definition. You can create as many volumes as you wish. All are mirrored for high availability. If a virtual server goes down AppLogic will simply restart it on another available resource in the data center.
  • Partners build the grid backbone to which nodes and other resources are attached. AppLogic runs that grid backbone. When you sign up and provision your virtual data center, nodes on the backbone are assigned to your VDC. A controller allows you to provision your VDC. Anything you can do in a co-lo cage you can do here, but there's nothing physical. AppLogic carries out your commands on the grid.
  • They provide standards for the hosting service. A variety of machine classifications are available. They have customers with 50 TB of storage. The largest number of CPUs in a single VDC is over 450.
  • To see if the VDC meets your requirements you run a test on the VDC. Once you have resources in your VDC they are not shared with anyone else so you can be confident the performance will be as tested. It's not a VPS. Their customers run production systems. They are all running a business of some sort.
  • Pricing is designed to be attractive for startups, but not artificially low to over-subscribe.
  • Currently there's no data center API. It's scriptable from the CLI. Smart Appliances can package up data center operations into a drag-and-drop package. You can drag them into any application. Their first Smart Appliance is "follow me," which can move your application to a data center that is close to you. If you are in Asia you can move your data center to Asia. So your data center can follow you around. No coding is needed on your part. Just drag it into your VDC.
  • With AppLogic instead of managing a bunch of different things you manage your application. You do it once. AppLogic maintains the infrastructure for you.
  • In an upgrade of 10 Apache servers you don't upgrade standing infrastructure. You take a copy of your application and upgrade the copy.
  • Let's say you have an Apache server you want to patch. You create one prototype, which they call a class volume. Then when the application restarts all the new changes will be picked up everywhere.
  • The power of what it means to be virtual can be seen in their rollback model. You don't upgrade in place; you upgrade a copy. Because everything is virtual it's easy to make copies of your entire data center. So you can copy your data center, keep the original running, and switch to the upgraded version. If the upgraded version doesn't work you can roll back to the original version of your VDC. This would be almost impossible using traditional methods. An application is the full state of the application with all its data. So you are operating a full complete copy of the application with all of its data. You can roll back to a complete running instance of the application. You just restart the old version.
  • For upgrades that require transformations, like database upgrades, you can write a script to run a database transformation.
  • They don't over-automate. They don't want to have only their way of doing things.
  • They model an application as having two parts: the appliance and the content. For a web server this means the web server itself and the content it's serving.
  • You first create a prototype of what you want your system to look like. This becomes a class from which you later can create instances. There are templates, like the Linux appliance, to build from. Through their on-line system you configure your system, install packages, etc. When it works the way you want you can drag into your catalog as a template for building new instances. You can create hundreds of copies if you choose.
  • Content would be served off a mount location from inside the VDC.
  • You can upgrade the catalog element and restart the appliance and it will automatically upgrade for you. It's not transactional; it's done on an individual basis.
  • You can pin machines. You can get the environment to make machine specific configurations. You can put appliances into standby so you can quickly add additional resources on demand.
  • Their load balancer is Pound. No spam detection, but it is session aware. You can use others if you want.
  • They specialize in the code that runs the grid. They aren't specialists in load balancers and routers, etc.
  • In the VDC you can share infrastructure. You can email each other a clustered database, for example. You can save and package up an integration effort as an assembly. Save it. Sell it. Share it.
  • You can create an active-active redundancy scheme and pay for only resources you need because you can bring on resources like the front-end when you need them.
  • Many companies periodically make a local copy of their VDC and move it to their disaster center.
    - Remember, with a VDC it's easy to pick up your whole data center and move it somewhere else. The catalog doesn't have to be copied each time. Just the data for applications can be copied over. Not so bad with a fast backbone.
    - Disaster recovery can be triggered by a 3rd party or scripts.
    - This model is sufficient for companies that can accept some down time.
    - If no data loss can be tolerated you need to replicate in an active-active architecture.
  • Some companies maintain fungible data centers. They constantly copy their data center over to backup locations. If an app goes down they can fire up a replacement.
  • With AppLogic you can create a stub that can start an application on demand if it's not already running. This allows you to share resources. You can shut it down at night and save those resources for other applications.
  • Here's how they would handle a TechCrunch traffic spike:
    - Let's say you have an 8 CPU grid data center and your normal load takes 20-30% of that.
    - First thing you'll do is use more resources from within the grid.
    - Then reconfigure appliances with more resources and restart them.
    - Then call your provider to add more resources. Softlayer, for example, has a 500-1000 server inventory, so you can add servers to your grid within an hour or two. Currently this process requires human intervention.
  • Finding good OPs people is difficult. So with the VDC you can automate most of it and you don't need a big OPs team.
  • In your VDC the data center configuration is in the metadata, so it's not kept as tacit knowledge. One or two people can run a thousand servers because you aren't really running servers, you are running applications.
  • Monitoring:
    - AppLogic is in control of all resources, so you can build dashboards right off the bat.
    - You can plug your monitored variables into their monitoring system.
    - The data are available over the web.
    - Widgets are available for the display of live stats.
  • A different way of thinking about your system:
    - Typically you put the database on the fastest server. Instead, they recommend allocating high end machines to everything so your database can run anywhere.
    - Same with a SAN. You don't need a SAN with the storage in the VDC. You are locking yourself into certain ways of thinking that don't apply in the VDC. The concept of using a SAN is just another lock-in.

    Some Observations and Conclusions

  • I think the grid/virtualization approach, in one form or another, is the wave of the future. It simply makes it easier for companies to scale applications. And as applications themselves are structured to run natively on a grid, it will become even easier.
  • Reaching the full potential of the virtual data center depends on having a more granular billing strategy and more fine grained control over resource management. For example, say I have a 6 CPU grid and I want to upgrade. I don't want to pay for a 12 CPU data center just so I can upgrade a copy. I don't need 12 normally, I just transiently need 12. So during my upgrade I want my script to trigger allocating a copy of my VDC, do the upgrade, switch to it, and then decommission the old VDC. I want 6 extra servers only for the time it takes to upgrade. Then the old VDC should go away. And I should only be billed for the resources I am using while I am using them. This would also give a more satisfactory solution to the TechCrunch scenario.
  • You need to architect your system to take advantage of the grid. To me this means a shared-nothing architecture that can be grown horizontally by adding more machines on demand. Applications should read their configuration off shared storage so it doesn't need to be set up on each machine and you can bring up new machines based on a template. If you need to scale, a new machine should come up and automatically start handling load. Queuing architectures, for example, have this attribute (see the sketch after this list).
  • They need a data center API so you can treat the data center like an object. This would allow you to orchestrate various data centers around the world as a single cooperating unit.
  • Operations within a grid would benefit from standardization. I know this enters the application realm, but operations like upgrade and failover are common and hard. So it would be useful if common processes could be developed and easily deployed.
  • They need turnkey options for those new to the game. As it stands, the path from signing up for their service to deploying a web service is a little scary. They are very honest in saying they do only one part of the overall picture. But many people need a painting, not a brush and paint. It would be helpful to have out-of-the-box plans for solving the most common problems people face.

    I would like to thank Bert again for taking the time for this interview! May the grid be with you, always.
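    To make the queuing point above concrete, here is a minimal sketch of the pattern: stateless workers pull jobs from a shared queue, so adding capacity is just starting more workers. The queue here is an in-process BlockingQueue standing in for a shared or distributed queue, and all names are illustrative.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Sketch of the shared-nothing, queue-driven pattern: workers hold no local
    // state and take everything they need from the shared queue, so scaling out
    // is just starting more worker processes. A real deployment would use a
    // shared or distributed queue rather than an in-process BlockingQueue.
    public class QueueWorkerSketch {
        public static void main(String[] args) {
            BlockingQueue<String> jobs = new LinkedBlockingQueue<>();
            for (int i = 1; i <= 10; i++) {
                jobs.add("job-" + i);
            }

            int workerCount = 3; // grow this number (or start more machines) to scale
            for (int w = 1; w <= workerCount; w++) {
                final int id = w;
                new Thread(() -> {
                    String job;
                    while ((job = jobs.poll()) != null) {
                        System.out.println("worker-" + id + " handling " + job);
                    }
                }).start();
            }
        }
    }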

    Related Sites and Articles

  • http://www.3tera.com/
  • On-Demand Infinitely Scalable Database Seed the Amazon EC2 Cloud


  • Monday
    Jul 30, 2007

    Product: Sun Utility Computing

    The Sun Grid Compute Utility is a simple to use, simple to access data center-on-demand. Sun Grid delivers enterprise computing power and resources over the Internet, enabling developers, researchers, scientists and businesses to optimize performance, speed time to results, and accelerate innovation without investment in IT infrastructure. No matter the size of your business or the size of your job -- there is no barrier to entry and exit. This is the future of computing available today: IT as a service.

    Click to read more ...

    Monday
    Jul302007

    Product: GridLayer. Utility computing for online applications

    TGL delivers Virtual Private Datacenters and virtual private servers from grids of commodity servers. Each TGL grid consists of a pool of HP servers connected with a Gigabit backbone network and running 3Tera's AppLogic grid operating system. With a Virtual Private Datacenter, you get complete control of your own private grid. Using our visual interface, you set up and assemble disposable virtual infrastructure, including firewalls, load balancers, web servers, database servers, NAS boxes, etc. visually, by pointing and clicking. You can build advanced clusters, deploy large and small applications, and save them as templates that can be provisioned in minutes. You can even build your own virtual servers and appliances.

    Click to read more ...

    Monday
    Jul302007

    Build an Infinitely Scalable Infrastructure for $100 Using Amazon Services

    Can you really create an infinitely scalable infrastructure for less than $100 using Amazon's storage, grid, and queuing services platform? It appears so, at least for the right application. Amazon shines a spotlight on the future battle between the roll-your-own and the connect-the-dots approaches to building next generation websites on core external services. Their argument is strong. Using Amazon's platform you can quickly build an infrastructure that would otherwise take an eternity to make, a pile of money to create, and an unbounded mass of people to implement and maintain. Yet Amazon doesn't provide SLAs, so can you really trust them with your crown jewels? Facebook recently leapfrogged Amazon's vision with an even more comprehensive set of services. The battle for the future is on. Site: http://aws.amazon.com/

    Information Sources

  • Slides: Building Highly Scalable Web Applications
  • Podcast: Technometria: Amazon Web Services
  • Amazon Services Home.

    Platform

  • Amazon ECS (E-Commerce Service)
  • Amazon S3 (simple storage service)
  • Amazon SQS (simple queuing service)
  • Amazon EC2 (grid service)
  • Amazon Web Search Service
  • Amazon Flexible Payments Service (Amazon FPS)
  • REST and SOAP Service Interfaces

    What's Inside?

    Why use external services?

  • Amazon's services replace the boxes, wires, and disk drives part of the application stack.
  • Amazon has spent ten years and over $1 billion developing a world-class web service that millions of customers use every day. Maybe you can leverage that experience for your site?
  • Focus on the customer. 70% of web development isn't about providing customer value; it's about building and managing data centers. Your effort is better spent on your customers than on plumbing.
  • Quicker to market. Scaling is hard. Let someone else worry about that while you concentrate on adding user value.
  • Designing for peak load is expensive, so turn fixed costs into variable costs. Say you want to handle high traffic flows from Slashdot or Digg, or you have high seasonal demand; keeping the infrastructure in place to handle those loads is a high fixed cost, and you could use that money better elsewhere. It makes sense to create an infrastructure where you can automatically and temporarily scale resources to handle peak demand.
  • High reliability and availability. A dedicated service may be more reliable than a service you could create yourself. I say "may" because Amazon doesn't provide an SLA, so you won't get any guarantees. The idea is that Amazon is cheap enough and reliable enough that the few failures will be acceptable. Besides, SLAs usually just refund some money when things go wrong; they don't really guarantee anything.
  • It's a cheap CDN. Amazon's storage network could serve as a relatively inexpensive content delivery network. This option is discussed in Reducing Your Website's Bandwidth Usage. The idea is that even the frequent downloading of a simple favicon.ico file can use a significant portion of your bandwidth. Using S3 for $2/month to offload 90% of your bandwidth to an external host is a good deal. However, without an SLA, S3 can't be thought of as a proper CDN.

    Amazon ECS (E-Commerce Service)

  • This service exposes Amazon's product data and e-commerce functionality: Detailed Product Information on all Amazon.com Products, Access to Product Images, All Customer Reviews associated with a Product, etc.
  • Amazon products are aggressively priced.
  • I found this service disappointing. If you want to build a store on top of Amazon it seems great, but I didn't see a way to add your own products to the store, so I don't think it's generally useful.

    Amazon S3 (simple storage service)

  • This service stores data in Amazon's storage network.
  • $.15 per GB per month for storage.
  • $.01 for 1000 to 10000 requests.
  • $.10 - $.17 per GB data transfer.
  • The service is: fast, reliable, scalable, redundant, dispersed.
  • You can have per-object URLs. This means you can reference an image or other file directly with a URL, so it's usable in a web page (see the sketch after this list).
  • Typical use: CDN and backup storage.
  • Storage is distributed to multiple locations so you get a level of geographical distribution.
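    As a minimal sketch of the per-object URL idea above, here is how you might push a static asset to S3 and reference it directly. It uses the boto3 SDK, which postdates this article; the bucket name is made up, and it assumes the bucket allows public-read ACLs.

        # Minimal sketch: push a static asset to S3 and reference it by URL.
        # Uses the boto3 SDK (newer than this article); the bucket name is hypothetical,
        # and the bucket must be configured to allow public-read ACLs.
        import boto3

        s3 = boto3.client("s3")
        bucket = "example-static-assets"   # hypothetical bucket

        # Upload with a public-read ACL so the object is directly addressable.
        with open("favicon.ico", "rb") as f:
            s3.put_object(
                Bucket=bucket,
                Key="favicon.ico",
                Body=f,
                ACL="public-read",
                ContentType="image/x-icon",
            )

        # The object now has a stable per-object URL usable from any web page:
        print(f"https://{bucket}.s3.amazonaws.com/favicon.ico")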

    Amazon SQS (simple queuing service)

  • This service provides an internet scale queuing service for storing messages. Distributed actors put work on the queue and take work off the queue.
  • $.10 per 1000 messages.
  • $.10 - $.18 per GB data transfer.
  • This service is: scalable, elastic, reliable, simple, secure.
  • Typical use: a centralized work queue. You put jobs on the queue and different actors pop work off the queue and process it when they get CPU time (see the sketch after this list).
  • Expected message latency, as of 2007, was 2-10 seconds. This is horrible for many applications, not bad for many others.
  • Part of scalability. Have any number of producers and consumers. You don't worry about it.
  • Queues are spread across multiple machines and multiple data centers.
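    As a concrete illustration of the work-queue pattern, here is a minimal producer/consumer sketch against SQS. It uses the boto3 SDK, which postdates this article; the queue name and message format are made up.

        # Minimal work-queue sketch over SQS using boto3 (newer than this article).
        # Queue name and message format are hypothetical.
        import json
        import boto3

        sqs = boto3.client("sqs")
        queue_url = sqs.create_queue(QueueName="transcode-jobs")["QueueUrl"]

        # Producer: enqueue a job description.
        sqs.send_message(
            QueueUrl=queue_url,
            MessageBody=json.dumps({"source_key": "episode42.mp2", "target": "mp3"}),
        )

        # Consumer: poll for work, process it, then delete the message.
        resp = sqs.receive_message(
            QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=10
        )
        for msg in resp.get("Messages", []):
            job = json.loads(msg["Body"])
            print("processing", job["source_key"])   # transcoding would happen here
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])

    Note that the consumer must explicitly delete the message after processing; otherwise it reappears on the queue, which is how SQS handles worker failures.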

    Amazon EC2 (grid service)

  • This service provides resizable compute capacity in the cloud. It is designed to make web-scale computing easier for developers.
  • Basically you create a Xen image of your Linux distro and upload it into their "elastic compute cloud." Using an API you can then start as many instances as you like (see the sketch after this list).
  • Typical use: transcoding, audio work, load testing.
  • Root level access to the server and full control over the machine.
  • Can scale up and scale down on a minute-by-minute basis.
  • For real-time processing one criticism has been slow CPUs (1.75 GHz Xeon). This probably won't be a problem if your application is written to linearly scale.
  • An EC2 instance is not persistent so you can't store a database there. You have some local storage, but it goes away when the instance goes away.
  • Takes a few minutes to start and stop images, so it's not really on demand.
  • You can add anything you want to an image. If you want a database you can add it in.
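    To show what "start as many instances as you like" looks like in practice, here is a minimal sketch using the boto3 SDK, which postdates this article. The AMI ID and instance type are placeholders, not values from the article.

        # Minimal sketch: launch and later terminate EC2 instances via the API.
        # Uses boto3 (newer than this article); AMI ID and instance type are placeholders.
        import boto3

        ec2 = boto3.client("ec2")

        # Start three workers from a prebuilt machine image.
        resp = ec2.run_instances(
            ImageId="ami-0123456789abcdef0",   # hypothetical image with the worker baked in
            InstanceType="t3.micro",
            MinCount=3,
            MaxCount=3,
        )
        instance_ids = [i["InstanceId"] for i in resp["Instances"]]
        print("launched", instance_ids)

        # Scale back down when the burst of work is done.
        ec2.terminate_instances(InstanceIds=instance_ids)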

    GigaVox Media Example Web-Scale Architecture

  • You can start to see how Amazon's services work together. Let's say you have a large batch of MP2s you would like to transcode to MP3s. You would store the original media in S3, queue the work request in SQS, and have instances running in EC2 take work off the queue, perform the transcoding, and store the results back into S3. And this is exactly what GigaVox does.
  • GigaVox is a podcasting company. - They take original recordings and transcode them, say from MP2 to MP3; many other transcodings are also performed. - These chunks of media are then assembled into a delivery format based on building a show. For example, old podcasts can be reassembled each night with up-to-date advertisements. - Doing this at scale in-house would take a lot of costly resources.
  • Using Amazon's services GigaVox gets geographic redundancy and failover for relatively inexpensive CPU, storage, and bandwidth charges. There are no boxes or wires and no data center to manage, and you can grow with small fixed costs.
  • Messages are timestamped on the queue. If a message has waited in the queue too long, they can start more EC2 instances, so you can balance cost against latency (see the sketch after this list). You could also layer in a customer-based priority mechanism.
  • Each instance has its own message queue for command and control.
  • For security reasons they upload files through ftp to instances rather than going through S3.
  • All bandwidth within the Amazon cloud is free. This is an important business consideration for making the services work together.
  • Another set of instances and queues handles assembling the delivered media.
  • Allows GigaVox to deliver value to their customer at a low startup cost.
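    The "start more instances when messages sit too long" idea can be sketched by reading each message's SentTimestamp attribute. This is not GigaVox's actual code, just an illustration using boto3; the queue URL, threshold, and AMI are all made up.

        # Sketch of queue-age-based scaling, in the spirit of the approach described above.
        # Not GigaVox's code; uses boto3, and all names and thresholds are hypothetical.
        import time
        import boto3

        sqs = boto3.client("sqs")
        ec2 = boto3.client("ec2")

        QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/transcode-jobs"
        MAX_AGE_SECONDS = 300          # if work waits longer than this, add capacity
        WORKER_AMI = "ami-0123456789abcdef0"

        def maybe_scale_out():
            resp = sqs.receive_message(
                QueueUrl=QUEUE_URL,
                MaxNumberOfMessages=1,
                AttributeNames=["SentTimestamp"],   # epoch milliseconds when enqueued
                VisibilityTimeout=0,                # just peeking; leave the message available
            )
            for msg in resp.get("Messages", []):
                age = time.time() - int(msg["Attributes"]["SentTimestamp"]) / 1000.0
                if age > MAX_AGE_SECONDS:
                    # Work is backing up: launch another transcoding worker.
                    ec2.run_instances(ImageId=WORKER_AMI, InstanceType="t3.micro",
                                      MinCount=1, MaxCount=1)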

    Lessons Learned

  • Build or buy is always a difficult decision. If a service doesn't work, you may lose your customers and there's nothing more you can do other than send yet another urgent email to nobody in particular. This is a horrible feeling. Yet, if it does work you could be way ahead of the game. How to choose? That would be telling :-)
  • Build a layer of virtualization so you can switch to another provider when one becomes available, or replace the service with your own (see the sketch after this list). This lessens your dependency on Amazon in the event they get tired of offering services or their performance deteriorates.
  • As a startup using Amazon services isn't a big risk because you are already in a risky situation. And any risk is moderated by the very low cost of starting up and money is always an issue for startups.
  • For many use cases buying your own dedicated servers may still be a better approach as you get more control, lower latency, and the same hardware is usable for multiple purposes.
  • Software as a service is a powerful and practical idea. It changes how you build software. It forces you to layer your software around interfaces, and once your software is composed of interfaces you have loosely coupled components that can be easily replaced. You also have the basis for a platform API should you ever want to provide an API to your customers. The highest level of development would be to use the same API you give your customers to build your own service.
  • Loosely coupled, message-based architectures combined with service interfaces allow you to think several levels up the abstraction layer. You don't have to wallow in the muck, which frees you to structure your application using large-scale blocks of behavior.
  • Designing a UI for an asynchronous interactive interface poses some challenges. It may take a while to perform an operation, so how do you interact with the user to handle that?
  • Instinctively I doubted Amazon could deliver. But if you have the right type of problem, you really can do a lot of work cheaply using Amazon services.
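    One way to build the virtualization layer mentioned above is a thin storage interface your application codes against, with the provider-specific implementation hidden behind it. This is a minimal sketch; the class and method names are invented for illustration, and the S3 backend uses the boto3 SDK, which postdates this article.

        # Minimal sketch of a provider-abstraction layer for blob storage.
        # Class and method names are invented; the S3 backend uses boto3.
        from abc import ABC, abstractmethod
        import os
        import boto3

        class BlobStore(ABC):
            """The interface the application codes against."""
            @abstractmethod
            def put(self, key: str, data: bytes) -> None: ...
            @abstractmethod
            def get(self, key: str) -> bytes: ...

        class S3BlobStore(BlobStore):
            def __init__(self, bucket: str):
                self.bucket = bucket
                self.s3 = boto3.client("s3")

            def put(self, key: str, data: bytes) -> None:
                self.s3.put_object(Bucket=self.bucket, Key=key, Body=data)

            def get(self, key: str) -> bytes:
                return self.s3.get_object(Bucket=self.bucket, Key=key)["Body"].read()

        class LocalBlobStore(BlobStore):
            """Drop-in replacement if you later move off Amazon (flat keys only)."""
            def __init__(self, root: str):
                self.root = root
                os.makedirs(root, exist_ok=True)

            def put(self, key: str, data: bytes) -> None:
                with open(os.path.join(self.root, key), "wb") as f:
                    f.write(data)

            def get(self, key: str) -> bytes:
                with open(os.path.join(self.root, key), "rb") as f:
                    return f.read()

    Because the application only sees BlobStore, swapping providers is a construction-time decision rather than a rewrite.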

    See Also

  • Flickr and YouTube also deal with service level APIs.
  • Running Hadoop MapReduce on Amazon EC2 and Amazon S3

    Click to read more ...
