Wednesday
Oct012008

Joyent - Cloud Computing Built on Accelerators

Kent Langley was kind enough to create a profile template for Joyent, Kent's new employer. Joyent is an infrastructure and development company that has put together a multi-site, multi-million dollar hosting setup for their own applications and for the use of others. Joyent competes with the likes of Amazon and GoGrid in the multi-player cloud computing game and hosts Bumper Sticker: A 1 Billion Page Per Month Facebook RoR App. The template was originally created with web services in mind, not cloud providers, but I think it still works in an odd sort of way. Remember, anyone can fill out a profile template for their system and share their wonderfulness with the world.

Getting to Know You

  • What is the name of your system and where can we find out more about it? Joyent Accelerator, cloud computing IaaS. My name is Kent Langley, Sr. Director, Joyent, Inc. (www.productionscale.com). The Joyent website is located at www.joyent.com. The scope of this exercise is the Joyent Accelerator product: http://www.joyent.com/accelerator/
  • What is your system for? It is essentially a system that provides infrastructure primitives as a service (IaaS) for building cloud computing applications, migrating enterprise data center operations to secure private clouds, or just hosting your blog. There is a page on the site called what scales on Joyent: http://www.joyent.com/accelerator/what-scales-on-joyent/ Java, PHP, Ruby, Erlang, Perl, and Python all work beautifully on Joyent. There is no lock-in. Ever. We try to run an open cloud. It's also a "loving cloud" if you ask our CTO. We have some of the largest Rails applications in the world, very high volume ejabberd XMPP infrastructure, exceptionally large Drupal installations, commerce sites in private clouds, .NET with Mono, Tomcat, Resin, Glassfish, and much more all running on Accelerators. Joyent Accelerators are the perfect building blocks for almost any PaaS (Platform as a Service) play as well. Of particular note, Java runs exceptionally well on Accelerators because Accelerators are 64-bit, so you can run a 64-bit JVM that can address as much as 32 GiB of RAM. This gives excellent vertical scalability for any running JVM.
  • Why did you decide to build this system? There is demand for a high-end but reasonably priced elastic computing infrastructure.
  • How is your project financed? Self-Funded at this time.
  • What is your revenue model? We sell Joyent Accelerators, do scale consulting, and some related services. We also have a growing partner channel.
  • How do you market your product? Website, word of mouth, blogs, email lists, Twitter, event sponsorships, open source participation, forums, FriendFeed, and more...
  • How long have you been working on it? I, Kent Langley, have been working with Joyent for about 2.5 years as a client. I've been with the company as an employee for about 2 months. Joyent has existed formally for about 4 years.
  • How big is your system? Try to give a feel for how much work your system does. We have hundreds and hundreds of servers representing significant compute power across 1000's of cores in multiple locations.
  • Number of monthly page views? Billions and Billions (multiple billion+ page view per month clients)
  • What is your in/out bandwidth usage? That's a secret.
  • How many documents do you serve? How many images? How much data? Billions per month.
  • How fast are you growing? Fast enough to give me grey hairs.
  • What is your ratio of free to paying users? Very low. Most of our users have paid accounts. We do have some free offerings to help people get started. But, the demand for those services has been high so the lines are a little long.
  • What is your user churn? About Average for the industry we think.
  • How many accounts have been active in the past month? Thousands.

    How is your system architected?

  • What is the architecture of your system? Talk about how your system works in as much detail as you feel comfortable with. Our technology stack is predicated on something we call a Pod. We have several pods and plans to add more. From top to bottom you'd find: BigIP F5 application switches; Force10 switching (1 Gb and 10 Gb); custom Dell hardware with some secret sauce; Tier 1 hosting providers; and, essentially, a custom Solaris Nevada-based OS core with a pkgsrc install system.
  • What particular design/architecture/implementation challenges does your system have? Automation. Automation. Automation. Self-Service.
  • What did you do to meet these challenges? We have an amazing team of Systems Developers that work very hard to improve our ability to grow and manage systems each day. We have some great updates on our Roadmap coming up that should be very exciting for existing and potential customers.
  • How does your system evolve to meet new scaling challenges? Our system is by its nature evolutionary. As technology grows and changes, we grow with it. A recent example is when a client needed a private cloud computing environment to achieve PCI compliance in a cloud environment. So, we worked with the client to create this. While it is in production for two clients already, we consider this a beta product. But you should expect to see it as a formal offering soon. This is an example of a way we have evolved our systems to respond to the changing cloud computing marketplace.
  • Do you use any particularly cool technologies or algorithms? ZFS, BigIP, DTrace, OpenSolaris Nevada, and an in-house custom provisioning system we call MCP (hats off to Tron)
  • What do you do that is unique and different that people could best learn from? We know our approach to things is a little different. But, we think that helps us inhabit a space that is different enough from other vendors in the Cloud Computing space that we offer a significant value proposition to a large cross-section of the IT industry. From the lone developer with a great idea who comes in and picks up a $199 per year 1/4 GiB Accelerator, to a deployment that has literally hundreds of Accelerators running the largest Rails applications on the planet. We are able to take good care of them both.
  • What lessons have you learned? If at first you don't succeed, try, try again. Get over your mistakes and move on.
  • Why have you succeeded? We care. Our clients care. That's a nice fit.
  • How are you thinking of changing your architecture in the future? MORE secret sauce... But seriously, we have some great additions coming up soon. I'll be in touch.

    How is your team setup?

  • How many people are in your team? Joyent has a small employee to client ratio. But, that's because we do what we do well. We are divided into several of the normal divisions you might expect like client support, marketing, sales, development, operations, and the business units.
  • Where are they located? Our corporate office is in Sausalito, CA. We have a development team in Seattle. We have a support organization that follows the sun and spans the globe. IM is a big deal at Joyent.
  • Is there anything that you would do different or that you have found surprising? I'd say managing expectations is the most challenging thing. I think that's where we stand to improve the most and where most of the surprises come from.

    What infrastructure do you use?

  • Which languages do you use to develop your system? Ruby
  • How many servers do you have? Not as many as Google!
  • How is functionality allocated to the servers?
  • How are the servers provisioned? We have a custom cloud provisioning system called MCP.
  • What operating systems do you use? Customized Sun Solaris Nevada
  • Which web server do you use? Apache and Nginx are the work horses in Joyent Accelerators
  • Which database do you use? MySQL and PostgreSQL are included w/ every Accelerator. Oracle works if you bring your own licenses. CouchDB works. We are certifying more all the time.
  • Do you use a reverse proxy? Well, our clients often use Nginx, and now that there is a viable port of Varnish to OpenSolaris we are seeing more of that. Some of our clients use Squid as well. Most popular reverse proxy software will run fine on our setup.
  • Do you colocate, use a grid service, use a hosting service, etc? We are that.
  • What is your storage strategy? DAS/SAN/NAS/SCSI/SATA/etc/other? We provide NFS to our clients for $0.15/GiB. 1 GiB = 1024 MiB.
  • How much capacity do you have? Many Terabytes
  • How do you grow capacity? Add hardware
  • Do you use a storage service? We are a storage service.
  • Do you use storage virtualization? Not really. It's been and continues to be tested. But, you can't beat the real thing still in many cases.
  • How do you handle session management? Our clients do this depending on their development platform of choice at the application layer. Also, we can of course use our BigIP load balancing infrastructure to help out with that also.
  • How is your database architected? Master/slave? Shard? Other? All of the above, client by client. Master-master MySQL, master-slave MySQL, Oracle clusters, MySQL Cluster, PostgreSQL, etc. all work fine.
  • How do you handle load balancing? We have F5 BigIP's and we do what we call a managed load balancing service. For example, if you have two application servers, you need to load balance. Just ask us to set you up a VIP and we'll add the nodes you specify for a cost per node. All the pricing information is here. http://www.joyent.com/accelerator/pricing/
  • Which web framework/AJAX Library do you use? We have clients that use just about everything you can think of.
  • Which real-time messaging frameworks do you use? We have very large clients running ejabberd. Erlang works great on our systems.
  • Which distributed job management system do you use? This is client by client. We do not offer this out of the box.
  • How do you handle ad serving? This is up to the client. We've seen just about all of them.
  • What is your object and content caching strategy? We usually recommend memcached; it's pre-installed and ready to turn on (a minimal usage sketch follows this list).
  • What is your client side caching strategy? I'd say most of our clients use cookies.
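
    For the object caching item above, here is a minimal cache-aside sketch. It assumes the python-memcached client and a memcached daemon on localhost; the key format and the load_from_db callback are made-up illustrations, not Joyent code.

        import memcache

        mc = memcache.Client(["127.0.0.1:11211"])

        def get_user_profile(user_id, load_from_db):
            """Cache-aside: check memcached first, fall back to the database, then populate."""
            key = "user-profile-%d" % user_id
            profile = mc.get(key)
            if profile is None:
                profile = load_from_db(user_id)   # expensive read
                mc.set(key, profile, time=300)    # cache for five minutes
            return profile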

    How do you handle customer support?

    We have a customer support team that is dedicated to helping our customers. Our services pretty much assume that you will have some degree of ability with building and deploying systems. However, if you don't, we have standard and extended plans, and partners, that can all be combined in various ways to help our clients. Our support follows the sun around the world.

    How is your data center setup?

  • How many data centers do you run in? Several. :) Currently only domestic on both coasts and elsewhere.
  • How is your system deployed in data centers? In-House Automated provisioning systems
  • Are your data centers active/active, active/passive? Everything is always on. Our clients often co-locate in multiple locations so that they can have solid DR scenarios to keep investors happy and recover quickly should a truck hit a telephone pole or something.
  • How do you handle syncing between data centers and fail over and load balancing? This is a complex topic and can be very simple or very complex. It's a bit out of scope for this document.
  • Which DNS service do you use? We run our own based on PowerDNS
  • Which switches do you use? Force10
  • Which email system do you use? Mostly Postfix
  • How do you handle spam? Filter at a variety of levels
  • How do you backup and restore your system? High-level snapshots; clients are primarily responsible for their own data. However, we have ways to help them.
  • How are software and hardware upgrades rolled out? We do quarterly releases of key software and our Accelerators. Sometimes we get a little behind but try to roll with it. You get root on your Accelerator so you are not dependent on the Joyent release cycle at all.
  • How do you handle major changes in database schemas on upgrades? This is up to the clients and highly platform and applications specific.
  • What is your fault tolerance and business continuity plan? Lots of redundancy.
  • Do you have a separate operations team managing your website? No. We do it ourselves.
  • Do you use a content delivery network? If so, which one and what for? Yes. We are currently partnered with Limelight.
  • How much do you pay monthly for your setup? Accelerator plans range from $199 per year to $4000 per month. Significant discounts can be had if you pay ahead. But, it's very important to note that we do not require or even want contracts. Some companies try to force us into contracts, and if you just MUST lock yourself in for years, we'll tie you down. But, we don't recommend it at all. In essence, pay for what you need when you need it, at month-to-month granularity. http://www.joyent.com/accelerator/pricing/

    SUMMARY

    The Joyent Accelerator is an extremely flexible tool for building and deploying all manner of infrastructure. If you have questions, please just contact us at sales@joyent.com. Email at that address is usually the best way to reach us.

    Related Articles

  • Scaling in the Cloud with Joyent's Jason Hoffman (podcast)
  • Amazon Web Services or Joyent Accelerators: Reprise by Jason Hoffman


    Wednesday
    Oct012008

    The Pattern Bible for Distributed Computing

    Software design patterns are an emerging tool for guiding and documenting system design. Patterns usually describe software abstractions used by advanced designers and programmers in their software. Patterns can provide guidance for designing highly scalable distributed systems. Let's see how! Patterns are in essence solutions to problems. Most of them are expressed in a format called Alexandrian form which draws on constructs used by Christopher Alexander. There are variants but most look like this:

    • The pattern name
    • The problem the pattern is trying to solve
    • Context
    • Solution
    • Examples
    • Design rationale: This tells where the pattern came from, why it works, and why experts use it
    Patterns rarely stand alone. Each pattern works within a context, and transforms the system in that context to produce a new system in a new context. New problems arise in the new system and context, and the next "layer" of patterns can be applied. A pattern language is a structured collection of such patterns that build on each other to transform needs and constraints into an architecture. The latest POSA book, Pattern-Oriented Software Architecture Volume 4: A Pattern Language for Distributed Computing, guides readers through the best practices and introduces them to key areas of building distributed software systems using patterns. The book pulls together 114 patterns and shows how to use them in the context of distributed software architectures. Although somewhat theoretical, it is still a great resource for practicing distributed-systems architects. It is as close as you're going to get to a one-stop "encyclopedia" of patterns relevant to distributed computing. However, it is not a true encyclopedia, since "over 150" patterns are referenced but not described in POSA Volume 4. The book does not go into the details of the patterns' implementations, so the reader should already be familiar with the patterns, or be prepared to spend some time researching. The pattern language for distributed computing includes patterns such as:
    • Broker
    • Client-Dispatcher-Server
    • Pipes and Filters
    • Leaders/Followers
    • Reactor
    • Proactor
    Patterns can indeed be useful in designing highly scalable systems and solving various problems related to concurrency, synchronization, resource management, and other topics. Wikipedia has more details on pattern languages to check out.
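
    To give a flavor of one of the patterns listed above, here is a minimal Pipes and Filters sketch (my own illustration, not an example from the book): each filter is an independent stage that reads records from its input and writes to its output, so stages can be rearranged, reused, or distributed without changing each other.

        def lines(path):
            """Source: read raw records from a file."""
            with open(path) as f:
                for line in f:
                    yield line

        def strip_blank(records):
            """Filter: drop empty records."""
            for r in records:
                if r.strip():
                    yield r

        def to_upper(records):
            """Filter: transform each record."""
            for r in records:
                yield r.upper()

        def sink(records):
            """Sink: consume the end of the pipeline."""
            for r in records:
                print(r, end="")

        # Compose the pipeline by connecting stages; each stage only knows its ports.
        # sink(to_upper(strip_blank(lines("input.txt"))))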


    Tuesday
    Sep302008

    Scalability Worst Practices

    Brian Zimmer, architect at travel startup Yapta, highlights some worst practices jeopardizing the growth and scalability of a system:
    • The Golden Hammer. Forcing a particular technology to work in ways it was not intended is sometimes counter-productive. Using a database to store key-value pairs is one example. Another example is using threads to program for concurrency.
    • Resource Abuse. Manage the availability of shared resources, because when they fail, by definition, their failure is experienced pervasively rather than in isolation. For example, connection management to the database through a thread pool (a hedged sketch of a bounded, timeout-aware pool follows this list).
    • Big Ball of Mud. Failure to manage dependencies inhibits agility and scalability.
    • Everything or Something. In both code and application dependency management, the worst practice is not understanding the relationships and formulating a model to facilitate their management. Failure to enforce diligent control is a contributing scalability inhibitor.
    • Forgetting to check the time. To properly scale a system it is imperative to manage the time allotted for requests to be handled.
    • Hero Pattern. One popular solution to the operations issue is a Hero who can and often will manage the bulk of the operational needs. For a large system of many components this approach does not scale, yet it is one of the most frequently-deployed solutions.
    • Not automating. A system too dependent on human intervention, frequently the result of having a Hero, is dangerously exposed to issues of reproducibility and hit-by-a-bus syndrome.
    • Monitoring. Monitoring, like testing, is often one of the first items sacrificed when time is tight.
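
    As a sketch of the resource-management and timeout points above (my own illustration of the idea, not Yapta's code), here is a bounded database connection pool where a saturated pool surfaces as a fast, local timeout instead of a pervasive stall:

        import queue
        import sqlite3  # stand-in connection factory; any DB-API driver works

        class BoundedConnectionPool:
            """Cap shared-resource usage and fail fast when the pool is exhausted."""

            def __init__(self, factory, size=4, checkout_timeout=2.0):
                self._timeout = checkout_timeout
                self._pool = queue.Queue(maxsize=size)
                for _ in range(size):
                    self._pool.put(factory())

            def acquire(self):
                # Wait at most checkout_timeout for a free connection.
                try:
                    return self._pool.get(timeout=self._timeout)
                except queue.Empty:
                    raise TimeoutError("no connection available within %.1fs" % self._timeout)

            def release(self, conn):
                self._pool.put(conn)

        # Usage
        pool = BoundedConnectionPool(lambda: sqlite3.connect(":memory:"), size=2)
        conn = pool.acquire()
        try:
            conn.execute("SELECT 1")
        finally:
            pool.release(conn)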


    Sunday
    Sep282008

    Product: Happy = Hadoop + Python

    Has a Java only Hadoop been getting you down? Now you can be Happy. Happy is a framework for writing map-reduce programs for Hadoop using Jython. It files off the sharp edges on Hadoop and makes writing map-reduce programs a breeze. There's really no history yet on Happy, but I'm delighted at the idea of being able to map-reduce in other languages. The more ways the better. From the website:

    Happy is a framework that allows Hadoop jobs to be written and run in Python 2.2 using Jython. It is an 
    easy way to write map-reduce programs for Hadoop, and includes some new useful features as well. 
    The current release supports Hadoop 0.17.2.
    
    Map-reduce jobs in Happy are defined by sub-classing happy.HappyJob and implementing a 
    map(records, task) and reduce(key, values, task) function. Then you create an instance of the 
    class, set the job parameters (such as inputs and outputs) and call run().
    
    When you call run(), Happy serializes your job instance and copies it and all accompanying 
    libraries out to the Hadoop cluster. Then for each task in the Hadoop job, your job instance is 
    de-serialized and map or reduce is called.
    
    The task results are written out using a collector, but aggregate statistics and other roll-up 
    information can be stored in the happy.results dictionary, which is returned from the run() call.
    
    Jython modules and Java jar files that are being called by your code can be specified using 
    the environment variable HAPPY_PATH. These are added to the Python path at startup, and 
    are also automatically included when jobs are sent to Hadoop. The path is stored in happy.path 
    and can be edited at runtime. 
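
    Sketched from the description above, a minimal word-count style job might look like the following. The happy.HappyJob base class, the map(records, task)/reduce(key, values, task) signatures, the collector, and run() are as described; the exact attribute names for inputs and outputs and the collect() call are my assumptions, and the paths are made up.

        import happy

        class WordCount(happy.HappyJob):
            def __init__(self, inputpath, outputpath):
                happy.HappyJob.__init__(self)
                # Job parameters; attribute names are illustrative.
                self.inputpaths = inputpath
                self.outputpath = outputpath

            def map(self, records, task):
                # records yields key/value pairs from the input
                for key, value in records:
                    for word in value.split():
                        task.collect(word, "1")

            def reduce(self, key, values, task):
                total = 0
                for v in values:
                    total = total + int(v)
                task.collect(key, str(total))

        job = WordCount("books/", "wordcount-output/")
        results = job.run()   # returns the happy.results dictionary described above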
    


    Friday
    Sep262008

    Lucasfilm: The Real Magic is in the Data Center

    Kevin Clark, director of IT operations for Lucasfilm, discusses how their data center works:
    • Linux-based platform, SUSE (looking to change), and a lot of proprietary open source applications for content creation.
    • 4,500-processor render farm in the datacenter. Workstations are used off hours.
    • Developed their own proprietary scheduler to schedule their 5,500 available processors.
    • Render nodes, the blade racks (from Verari), run dual-core dual Opteron chips with 32GB of memory on board, but are expanding those to quad-core. They are an AMD shop.
    • 400TB of storage online for production.
    • Every night they write out 10-20TB of new data on a render. A project will use up to a hundred-plus terabytes of storage.
    • Incremental backups are a challenge because the data changes up to 50 percent over a week.
    • NetApps used for storage. They like the global namespace in the virtual file system.
    • Foundry Networks architecture shop. One of the larger 10-GbE-backbone facilities on the West coast. 350-plus 10 GbE ports are used for distribution throughout the facility and the backend.
    • Grid computing used for over 4 years.
    • A 10-Gig dark fiber connection connects San Rafael and their home office. Enables them to co-render and co-store between the two facilities. No difference in performance in terms of where they went to look for their data and their shots.
    • Artists get server class machines: HP 9400 workstations with dual-core dual Opteron processors and 16GB of memory.
    • Challenge now is to better segment storage to not continue to sink costs into high-cost disks.
    • VMware used to host a lot of development environments. Allows the quick turn up of testing as the tests can be allocated across VMs.
    • Provides PoE (power-over-ethernet) out from the switch to all of their Web farms.
    • Next push is on the facilities side: how to be more efficient at airflow management and power utilization.


    Thursday
    Sep252008

    HighScalability.com Rated 16th Best Blog for Development Managers

    Jurgen Appelo of Noop.nl asked for nominations for top blogs and then performed a sophisticated weighting of their popularity based on page rank, trust authority, Alexa rank, Google hits, and number of comments. When all the results poured out of the blender, HighScalability was ranked 16th. Not bad! Joel on Software was number one, of course. The next few were: 2) Coding Horror by Jeff Atwood, 3) Seth's Blog by Seth Godin, and 4) Essays by Paul Graham. All excellent blogs. Very cool.


    Thursday
    Sep252008

    Is your cloud as scalable as you think it is?

    An unstated assumption is that clouds are scalable. But are they? Stick thousands upon thousands of machines together and there are a lot of potential bottlenecks just waiting to choke off your scalability supply. And if the cloud is scalable, what are the chances that your application is really linearly scalable? At 10 machines all may be well. Even at 50 machines the seas look calm. But at 100, 200, or 500 machines all hell might break loose. How do you know? You know through real life testing. These kinds of tests are brutally hard and complicated. Who wants to do all the incredibly precise and difficult work of producing cloud scalability tests? GridDynamics has stepped up to the challenge and has just released their Cloud Performance Reports. The report is quite detailed so I'll just cover what I found most interesting. In this report GridDynamics tests three configurations:

  • GridGain running a Monte-Carlo simulation on EC2. This test is CPU only; a data grid is not accessed. This scenario tests the scalability of EC2 and GridGain (a small sketch of why this kind of workload parallelizes so cleanly follows this list). * GridGain provided near linear scalability end-to-end on a 512-node EC2 hosted grid. * EC2 is ready for production usage on large-scale stateless computations, exhibiting good price for performance and a strong linear scaling curve.
  • GigaSpaces running a risk management simulation on EC2. This is a data-driven test. GigaSpaces is used in a configuration where the compute grid and the data grid are separated, even though GigaSpaces supports an in-memory data grid. * GigaSpaces provided near linear scalability from 16 to 256 nodes. There was a 28% degradation from 256 to 512 nodes because only four data grid servers were used. More were needed. The compute grid and data grid must each be sized independently to scale properly. * EC2 is ready for production usage for classes of large-scale data-driven applications.
  • Windows HPC Server and Velocity running an analytics application in Microsoft's grid testbed. This test was more complicated than the others. It tested the performance implications of data "in the cloud" vs. "outside the cloud" for data-intensive analytics applications. * Keeping data close to the business logic matters. Performance improved up to 31 times over "outside the cloud." * Velocity failed on 50 node clusters with 200 concurrent clients. * Local caches provided significant performance gains over distributed caches. The local cache took load off the distributed cache. They are currently running more tests with different configurations. Hopefully we'll see those results later. All-in-all a generally optimistic report. EC2 scales. Most of the tested grid frameworks scaled. What's also clear is that it may take a while before deploying cloud based grids is an easy process. It still takes a lot of work to install, configure, start, stop, monitor, and debug bottlenecks in cloud based grids. Thanks to GridDynamics for putting in all this work and I look forward to their next set of reports.
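
    A hedged illustration (plain Python, not GridGain code) of why the Monte Carlo case scales so linearly: each chunk of samples is independent and needs no shared data, so adding workers or nodes simply divides the work.

        import random
        from multiprocessing import Pool

        def hits_in_circle(samples):
            # Count random points that fall inside the unit quarter-circle.
            hits = 0
            for _ in range(samples):
                x, y = random.random(), random.random()
                if x * x + y * y <= 1.0:
                    hits += 1
            return hits

        if __name__ == "__main__":
            chunks = [250000] * 8                # eight independent tasks
            with Pool() as pool:                 # stand-in for a compute grid
                total = sum(pool.map(hits_in_circle, chunks))
            print("pi is roughly", 4.0 * total / sum(chunks))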


    Thursday
    Sep252008

    GridGain: One Compute Grid, Many Data Grids

    GridGain was kind enough to present at the September 17th instance of the Silicon Valley Cloud Computing Group. I've been curious about GridGain so I was glad to see them there. In short, GridGain is: an open source computational grid framework that enables Java developers to improve general performance of processing-intensive applications by splitting and parallelizing the workload. GridGain can also be thought of as a set of middleware primitives for building applications. GridGain's peer group of competitors includes GigaSpaces, Terracotta, Coherence, and Hadoop. The speaker for GridGain was the President and Founder, Nikita Ivanov. He has a very pleasant down-to-earth way about him that contrasts nicely with a field given to religious discussions of complex taxonomic definitions. Nikita first talked about cloud computing in general. He feels Java is the perfect gateway for cloud computing. Which is good because GridGain only works with Java. The Java centricity of GridGain may be an immediate deal killer or a non-issue for a Java shop. Being so close to the language does offer a lot of power, but it sure sucks in a multi-language environment. Nikita gave a few definitions which are key to understanding where GridGain stands in the grid matrix:

  • Compute Grids: parallel execution.
  • Data Grids: parallel data storage.
  • Grid Computing: Compute Grids + Data Grids
  • Cloud Computing: datacenter + API. The key is automation via programmability as a way to deploy applications. The advantage is a unified programming model. Build an application on one node and you can run on many nodes without code change. Moving peak loads to the cloud can give you a 10x-100x cost reduction. Cloud computing poses a number of challenges: deployment, data sharing, load balancing, failover, discovery (nodes, availability), provisioning (add, remove), management, monitoring, development process, debugging, inter and external clouds (syncing data, syncing code, failover jobs). Nikita talked some about these issues, but he didn't go in-depth. But he showed a good understanding of the issues involved so I would be inclined to think GridGain handles them well. The cloud computing section is new to the standard GridGain presentation. GridGain is moving their grid into the cloud with new features like a cloud management layer available in Q1 2009. This move competes with GigaSpaces' early move to the cloud with their RightScale partnership. It's a good move. Like peanut butter and chocolate, grids and clouds go better together. Grids have been underutilized largely because of infrastructure issues. A cloud platform makes it affordable to grow and manage grids, so we might see an uptick in grid adoption as clouds and grids hook up. GridGain positions themselves as a developer centric framework according to their analysis of cloud computing in Java:
  • Heavy UI oriented. These types of applications or frameworks usually provide UI-based consoles, management applications, plugins, etc. that provide the only way to manage resources on the cloud, such as starting and stopping the image. The key characteristic of this approach is that it requires substantial user input and human interaction, and thus these tend to be less dynamic and less on-demand. Good examples would be RightScale, GigaSpaces, ElasticGrid.
  • Heavy framework oriented. This approach strongly emphasizes dynamism of resource management on the cloud. The key characteristic of this approach is that it requires no human interaction and all resource management can be done programmatically by the grid/cloud middleware - and thus it is more dynamic, automated, and truly on-demand. Google App Engine (for Python) and GridGain would be good examples. I think there's a misunderstanding of RightScale here. The UI is to configure the automated system, not manage the system. The automated system monitors and responds to events without human interaction. Won't their automated cloud layer have to do something similar? To bootstrap any complex system out of the mud of complexity a helpful UI is needed. The framework approach of GridGain's infrastructure is developer friendly, but that won't fly for external management within the cloud.

    GridGain's True Nature: One Compute Grid, Many Data Grids

    With these definitions in place we can now learn the secret of GridGain: One Compute Grid, Many Data Grids. Ding! Ding! Ding! Once I understood this I understood GridGain's niche. GridGain has focused on making it dead simple to distribute work across a compute grid. It's a job management mechanism. GridGain doesn't include a data grid. It will work against any data grid. For some reason this fact was something I'd never pulled out of the noise before. And when I would read Nikita's blog with all the nifty little code samples I never really appreciated what was happening. Yes, I'm just that dumb, but I also think GridGain should expose the magic of what's going on behind the scenes more rather than push the simple 30-second-let's-write-code-live style demo. Seeing the mechanics would make it easier to build a mental model of the value being added by GridGain.

    Transparent and Low Configuration Implementation of Key Features

    A compute grid is just a bunch of CPUs that calculations/jobs/work can be run on. As a developer, your problem is broken up into smaller tasks and spread across all your nodes so the result is calculated faster because it is happening in parallel. GridGain enthusiastically supports the MapReduce model of computation. When deploying a grid a few key problems come up:
  • How do you get your code to all nodes? Not just the first time, but every time a JAR file changes, how is it distributed across all nodes?
  • How do all the other nodes find each other when they come up? Clearly for work to be sent to nodes someone must know about them.
  • How are jobs distributed to the nodes? Somehow jobs must be sent to a node, the calculations made, and the results assembled.
  • How are failures handled? Somehow when a node goes down and new nodes come on-line work must be rescheduled.
  • How does each node get the data it needs to do its work? Scalable computation without scalable data doesn't work for most problems. Much of the drama is lost with GridGain because most of these capabilities are implemented almost transparently. Discovery happens automatically. When nodes come up they communicate with each other and transparently form a grid. You don't see this, it just happens. In fact, this was one of GridGain's issues when porting to the cloud. They used multicast for discovery and Amazon doesn't support multicast. So they had to use another messaging service, which GridGain supports doing out-of-the-box, and are now working on their own unicast version of the discovery service. Deploying new code is always a frustrating problem. Over the same transparently formed grid, code updates are transparently auto deployed on the grid. Again, this is one of those things you see happen from Eclipse and it loses most of the impact. It just looks like how it's supposed to work, but rarely does. With GridGain you do a build and your code changes are automatically sent through to each node in the grid. Very nice. To mark a method as gridified, an annotation (or an API call) is used:
    @Gridify(taskClass = GridifyHelloWorldTask.class, timeout = 3000)
    public static int sayIt(String phrase) {
        // Simply print out the argument.
        System.out.println(">>> Printing '" + phrase + "' on this node from grid-enabled method.");
        return phrase.length();
    }
    
    The task class is responsible for splitting method execution into sub-jobs. For a full example go here. The @Gridify annotation uses AOP (aspect-oriented programming) to automatically "gridify" the method. I assume this registers the method with the job scheduling system. When the application comes up and triggers execution the method is then scheduled through the job scheduling system and allocated to nodes. Again, you don't see this and they really don't talk enough about how this part works. Notice how so much complexity is nicely hidden by GridGain with very little configuration on the developer's part. There aren't a billion different XML files where every single part of the system has to be defined ahead of time. The dynamic transparent nature of the core features make it simple to use.

    Integrating with the Data Grid

    We haven't talked about data at all. If you are just concerned with a program like a Monte Carlo simulation then the compute grid is all you need. But most calculations require data. Where does your massive compute grid pull the data from? That's where the data grid comes in. A data grid is the controlled sharing and management of large amounts of distributed data. GridGain leaves the data grid up to other software by integrating with packages like JBoss Cache, Oracle Coherence, and GigaSpaces. Remember: one compute grid, many data grids. GridGain accesses the data grid through an API so you can plug in any data grid you want to support with a little custom code. Google and Hadoop use a distributed file system (DFS) as their data grid. This makes sense. When you need to feed lots of CPUs the data can't come from a centralized store. The data must be parallelized and that's what a DFS does. A DFS splats data across a lot of spindles so it can be pulled relatively quickly by lots of CPUs in parallel. Other products like Coherence and GigaSpaces store data in an in-memory data grid instead of a filesystem. Serving data from memory is faster, but you are limited by the amount of memory you have. If you have a petabyte of data your choice is clear, but if your problem is a bit smaller then maybe an in-memory solution would work. The closer data is to the business logic the better performance will be. GridGain controls job execution while the data grid is responsible for the availability and integrity of the data. GridGain doesn't care what data grid you use, but your choice has implications for performance. A compute grid and an in-memory data grid in the same cloud will smoke configurations where the data grid comes from disk or is located outside the cloud.

    GridGain is Linearly Scalable for a Pure CPU Benchmark

    The good folks at GridDynamics are doing some serious cloud testing of different products and different clouds. They ran a test, Scalability Benchmark of Monte Carlo Simulation on Amazon EC2 with GridGain Software, that found GridGain was linearly scalable to 512 nodes in Amazon's EC2. A Monte Carlo simulation is a CPU-only test; it does not use a data grid. A data grid based test would be more useful to me as everything changes once large amounts of data start flying around, but it does indicate the core of GridGain is quite scalable.

    Wrapping Up

    Grid products like Coherence and GigaSpaces include both compute grid and data grid features. Why choose a compute grid only system like GridGain when other products include both capabilities? GridGain might say they win business on the quality of their compute grid, excellent support and documentation, and the ability to cleanly integrate into almost any existing ecosystem through their well thought out API abstraction layer and their out-of-the-box support for almost every important Java framework. Others may counter performance is far better when the business logic and the job management are integrated. All interesting issues to tradeoff in your own decision making process. GridGain is free as their business model is based on providing support and consultation. A non-starter for many is the Java-only restriction. What is unique about GridGain is how easy and transparent they made it to use and deploy. That's some thoughtful engineering.

    Related Articles

  • Gridify Blog
  • Ten Useful Gridgain How-To Tips
  • 10 Reasons to Use GridGain
  • What is Grid Gain?
  • Developers Productivity: Unsung Hero of GridGain
  • GridGain vs Hadoop
  • Cameron Purdy: Defining a Data Grid
  • Compute Grids vs. Data Grids


    Wednesday
    Sep242008

    Building a Scalable Architecture for Web Apps

    By Bhavin Turakhia, CEO, Directi. Covers:
    • Why scalability is important. Viral marketing can result in instant success. With RSS/Ajax/SOA the number of requests grows exponentially with the user base. The goal is to build a web 2.0 app that can serve millions of users with zero downtime.
    • Introduction to the variables. Scalability, performance, responsiveness, availability, downtime impact, cost, maintenance effort.
    • Introduction to the factors. Platform selection, hardware, application design, database architecture, deployment architecture, storage architecture, abuse prevention, monitoring mechanisms, etc.
    • Building our own scalable architecture in incremental steps: vertical scaling, vertical partitioning, horizontal scaling, horizontal partitioning, etc. First buy bigger. Then deploy each service on a separate node. Then increase the number of nodes and load balance. Deal with session management. Remove single points of failure. Use a shared nothing cluster. Choose between a master-slave and a multi-master setup. Replication can be synchronous or asynchronous. (A minimal sketch of horizontal partitioning by user follows this list.)
    • Platform selection considerations. Use a global redirector for serving multiple datacenters. Add object, session API, and page cache. Add reverse proxy. Think about CDNs and grid computing.
    • Tips. Use a horizontal DB architecture from the beginning. Loosely couple all modules. Use a REST interface for easier caching. Perform application sizing on an ongoing basis to ensure optimal hardware utilization.
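
    As promised above, a hedged sketch of horizontal partitioning (sharding) by user: route each user to one of N database shards by hashing a stable key. The shard host names and the modulo scheme are illustrative only; real deployments often prefer consistent hashing or a directory service so shards can be added without remapping everyone.

        import hashlib

        SHARDS = [
            "db-shard-0.example.internal",
            "db-shard-1.example.internal",
            "db-shard-2.example.internal",
            "db-shard-3.example.internal",
        ]

        def shard_for(user_id):
            # Hash a stable key so the same user always lands on the same shard.
            digest = hashlib.md5(str(user_id).encode("utf-8")).hexdigest()
            return SHARDS[int(digest, 16) % len(SHARDS)]

        print(shard_for(42))   # every request for user 42 goes to the same shard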


    Tuesday
    Sep232008

    How to Scale with Ruby on Rails

    By George Palmer of 3dogsbark.com. Covers:
    • How you start out: shared hosting, web server and DB on the same machine. Move to 2 machines. Minimal code changes.
    • Scaling the database. Add read slaves on their own machines. Then a master-master setup. Still minimal code changes. (A minimal read/write-splitting sketch follows this list.)
    • Scaling the web server. Load balance against multiple application servers. Application servers scale but the database doesn't.
    • User clusters. Partition and allocate users to their own dedicated cluster. Requires substantial code changes.
    • Caching. A large percentage of hits are read only. Use a reverse proxy, memcached, and a language specific cache.
    • Elastic architectures. Based on Amazon EC2. Start and stop instances on demand. For global applications keep a cache on each continent, assign users to clusters by location, maintain app servers on each continent, and use transaction replication software if you must replicate your site globally.
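
    The read-slave idea above, sketched in a language-agnostic way (shown here in Python rather than Ruby; the connections and tables are toy stand-ins): send writes to the master and spread reads across replica connections.

        import random
        import sqlite3  # stand-in for any database driver

        master = sqlite3.connect("app.db")
        # In this toy the "replicas" are just extra connections to the same file;
        # in production they would be separate read-slave hosts.
        read_slaves = [sqlite3.connect("app.db") for _ in range(2)]

        def execute_write(sql, params=()):
            cur = master.execute(sql, params)
            master.commit()
            return cur

        def execute_read(sql, params=()):
            # Naive load balancing; a real setup must also handle replication lag.
            return random.choice(read_slaves).execute(sql, params).fetchall()

        execute_write("CREATE TABLE IF NOT EXISTS users (id INTEGER, name TEXT)")
        execute_write("INSERT INTO users VALUES (?, ?)", (1, "alice"))
        rows = execute_read("SELECT name FROM users WHERE id = ?", (1,))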
