High Scalability -

Video Interview with Manik Surtani, Founder & Project Lead at JBoss Cache, Infinispan Data Grid

Product: Infinispan - Open Source Data Grid

Monday, September 7, 2009 at 5:40AM

Infinispan is a highly scalable, open source licensed data grid platform in the style of GigaSpaces and Oracle Coherence.

From their website:

The purpose of Infinispan is to expose a data structure that is highly concurrent, designed ground-up to make the most of modern multi-processor/multi-core architectures while at the same time providing distributed cache capabilities. At its core Infinispan exposes a JSR-107 (JCACHE) compatible Cache interface (which in turn extends java.util.Map). It is also optionally backed by a peer-to-peer network architecture to distribute state efficiently around a data grid.

Offering high availability via making replicas of state across a network as well as optionally persisting state to configurable cache stores, Infinispan offers enterprise features such as efficient eviction algorithms to control memory usage as well as JTA compatibility.

In addition to the peer-to-peer architecture of Infinispan, on the roadmap is the ability to run farms of Infinispan instances as servers and connecting to them using a plethora of clients - both written in Java as well as other popular platforms.
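
To make the eviction idea concrete, here is a minimal sketch, written in Python rather than Java and not Infinispan's actual API. It assumes a hypothetical map-like cache with `put` and `get` methods and an assumed `max_entries` limit, and it evicts the least recently used entry once that limit is exceeded. Real data grid eviction algorithms are considerably more sophisticated.

```python
from collections import OrderedDict


class LruCache:
    """Map-like cache that evicts the least recently used entry when full."""

    def __init__(self, max_entries):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self.max_entries = max_entries
        self._entries = OrderedDict()  # oldest entry first

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)          # mark as most recently used
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)   # evict the least recently used

    def get(self, key, default=None):
        if key not in self._entries:
            return default
        self._entries.move_to_end(key)          # a read also counts as use
        return self._entries[key]

    def __len__(self):
        return len(self._entries)
```

With `max_entries=2`, putting `a` and `b`, reading `a`, and then putting `c` evicts `b`, because `a` was touched more recently.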

A few observations:

Open source is an important consideration, depending on your business model. As you scale out your costs don't go up. The downside is you'll likely put in more programming effort to implement capabilities the commercial products have already solved.

It's from the makers of JBoss Cache so it's likely to have a solid implementation, even so early in its development cycle. The API looks very well thought out.

Java only. Plan is to add more bindings in the future.

Distributed hash table only. Commercial products have very advanced features like distributed query processing which can make all the difference during implementation. We'll see how the product expands from its caching roots into a full fledged data manipulation platform.

MVCC and a STM-like approach provide lock- and synchronization-free data structures. This means dust off all those non-blocking algorithms you've never used before. It will be very interesting to see how this approach performs under real-life loads programmed by real-life programmers not used to such techniques.

Data is made safe using a configurable degree of redundancy. State is distributed across a cluster. And it's peer-to-peer, there's no central server.

API based (put and get operations). XML, bytecode manipulation and JVM hooks aren't used.

Future plans call for adding a compute-grid for map-reduce style operations.

Distributed transactions across multiple objects are supported. It also offers eviction strategies to ensure individual nodes do not run out of memory and passivation/overflow to disk. Warm-starts using preloads are also supported.
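
A few of the observations above (distributed hash table, configurable redundancy, no central server, plain put and get) can be sketched together. This is only an illustration under stated assumptions, not how Infinispan is implemented: `TinyGrid`, `num_owners`, and `fail` are hypothetical names, the "nodes" are dictionaries in one process, and there is no rebalancing after a failure.

```python
import hashlib


class TinyGrid:
    """Toy peer-to-peer hash table: each key is stored on `num_owners` nodes."""

    def __init__(self, node_names, num_owners=2):
        if not 1 <= num_owners <= len(node_names):
            raise ValueError("num_owners must be between 1 and the node count")
        self.node_names = list(node_names)
        self.num_owners = num_owners
        # Every node holds only its own dict; there is no central server.
        self.stores = {name: {} for name in self.node_names}

    def owners(self, key):
        """Pick num_owners consecutive nodes, starting from a stable hash of the key."""
        digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
        start = int(digest, 16) % len(self.node_names)
        return [self.node_names[(start + i) % len(self.node_names)]
                for i in range(self.num_owners)]

    def put(self, key, value):
        for name in self.owners(key):
            if name in self.stores:          # skip nodes that have failed
                self.stores[name][key] = value

    def get(self, key):
        for name in self.owners(key):
            store = self.stores.get(name)
            if store is not None and key in store:
                return store[key]            # first surviving replica wins
        return None

    def fail(self, name):
        """Simulate a crash: the node's in-memory state is lost."""
        self.stores.pop(name, None)


grid = TinyGrid(["node-a", "node-b", "node-c"], num_owners=2)
grid.put("user:42", "Ada")
grid.fail(grid.owners("user:42")[0])   # lose the primary owner
print(grid.get("user:42"))             # still served from the second copy
```

Raising `num_owners` trades memory for safety: with two owners the data survives any single node loss, which is the "configurable degree of redundancy" idea in miniature.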

It's exciting to have an open source grid alternative. It will be interesting to see how Infinispan develops in quality and its feature set. Making a mission critical system of this type is no simple task.

I don't necessarily see Infinispan as just a competitor for obvious players like GigaSpaces and Coherence. It may play even more strongly in the NoSQL space. For people looking for a reliable, highly performant, scalable, transaction aware hash storage system, Infinispan may look even more attractive than a lot of the disk based systems.

Infinispan Interview by Mark Little on InfoQ.

Are Cloud Based Memory Architectures the Next Big Thing?

Infinispan - data grids meets open source on TheServerSide.com

Technical FAQs

Anti-RDBMS: A list of distributed key-value stores

Infinispan Wiki

Distribution instead of Buddy Replication

Implementation Focus: Squarespace

Squarespace Architecture - A Grid Handles Hundreds of Millions of Requests a Month

Monday, August 31, 2009 at 12:19AM

I first heard an enthusiastic endorsement of Squarespace streaming from the ubiquitous Leo Laporte on one of his many Twit Live shows. Squarespace as a fully hosted, completely managed environment for creating and maintaining a website, blog or portfolio was of interest to me because they promise scalability and this site doesn't have enough of that. But sadly, since they don't offer a link preserving Drupal import our relationship was not meant to be.

When a fine reader of High Scalability, Brian Egge, (and all my readers are thrifty, brave, and strong) asked me how Squarespace scaled I said I didn't know, but I would try and find out. I emailed Squarespace a few questions and founder Anthony Casalena and Director of Technical Operations Rolando Berrios were kind enough to reply in some detail. The questions were both from Brian and myself. Answers can be found below.

Two things struck me most about Squarespace's approach:

They based their system on a memory grid, in this case Oracle Coherence. I'm not aware of too many customer facing systems that have moved to a grid as the backbone of their scalability strategy. It's good to see a successful system visible out in the wild.

They use a sort of Private Cloud internally. Everything is highly automated and easy to expand. They scale by adding additional resources like CPUs and disks and the system just adapts without a lot of human fussing involved. Now that's scaling with gas.

Read on to see how Squarespace has learned to scale to tens of thousands of customers, hundreds of thousands of signups, and hundreds of millions of hits per month.

Site: http://www.squarespace.com

The Stats

Tens of thousands of customers.

Hundreds of thousands of signups.

Serves hundreds of millions of hits per month.

Platform

Java - well supported and an advanced language to work in, and the components out there (Apache Foundation, etc.) are second to none.

Tomcat - the stability of the server is extremely impressive.

Grid - Oracle Coherence for the re-balancing and caching layers.

Storage - Isilon Cluster. This allows them to treat their storage like another "grid" as the storage pool is easily scaled by adding more diskspace.

Monetization Strategy - charge money. No free customers. Pricing starts at $8/month.

Uptime - 99.98%

Hosting - Peer1, they do not yet operate in multiple datacenters.

Competitors - TypePad and WordPress

Hardware - they don't use "commodity nodes" or low cost hardware units. These end up costing more in the long run as datacenter power is extremely expensive.

Cacti - a cacti instance is used to graph statistical data which helps see trends over time, predict when a hardware upgrade is necessary, and troubleshoot any problems that do show up.

Lessons Learned

Cache as much as you can and load balance requests intelligently across a cluster.

Use an infrastructure that scales automatically merely by adding more resources (CPU, disk).

Build a scalable design up front. Make scaling easy by designing the application and infrastructure with scaling in mind.

Build a hands-off capable maintenance system. Automate processes. Make them as simple as possible. Monitor programmatically so people don't have to.

Release code early and often. Running on the latest code means problems can be detected quickly while the problems are still small.

Keep things simple. Apply simplicity to every part of your infrastructure, including both your own software and that of your outside vendors. Examples of this are: Grid for the application infrastructure, Isilon cluster for storage, automation, creating their own tools.

Use as few technologies as possible by selecting or building simple, powerful and robust tools.

Don't be afraid to implement your own code to ensure simplicity. Build or buy is a huge balancing act.

Don't be afraid to spend money on technology that helps you get where you need to go. It can save you months and months of headaches that would have prevented you from working on core functionality.

Interview Questions and Responses

1. They say they run on a grid. I'd be interested to know if they built their own grid?

Partially. We rely on Oracle's Coherence product for the re-balancing
and caching layers of our system -- which we consider a real workhorse
for the "grid" aspects of the system. Each node in our infrastructure
can handle a hit for any single site on the system. This means that in order to increase capacity, we just increase node count. No site is handled by a single node.

2. How much traffic can they really handle?

We've had several customer sites on the front page of Digg on multiple
occasions, and didn't notice any performance degradation for any of our
sites. In fact, we didn't even realize the surge happened until we reviewed our traffic reports a few hours later. For 99% of sites out there, Squarespace is going to be sufficient. Even larger sites with millions of inbound hits per day are servable, as the bulk of the traffic serving on those sites is in the media being served.

3. How do they scale up, and allow for certain sites to become quite busy?

We've tried to make scaling easy, and the application and infrastructure
have been designed with scaling in mind. Because of this, we're luckily not
in a situation where we need to keep getting bigger and beefier hardware to handle more and more traffic -- we try to scale out by supplementing the
grid. Since we try to cache as much as we can and every server
participates in handling requests for every site, it's generally just a
matter of adding another node to the environment.

We try to apply this simplicity to every part of our infrastructure, both
with our own software and when deciding on purchases from outside vendors. For instance, we just increased the amount of available storage another few terabytes by adding another node to our Isilon cluster.

4. Are there any stats you can share about how many customers, how many users, how many requests served, how many servers, how much disk, how fast, how reliable?

We, unfortunately, can't share these numbers as we're a private company
-- but we can say we have tens of thousands of customers, hundreds
of thousands of signups, and serve hundreds of millions of hits per
month. The server types and disk configurations (RAID, etc) are a bit
irrelevant, as the clustering we implement provides redundancy -- not
anything implemented into a particular single machine. Nothing in
hardware is too particular to our setup. I will say we don't purchase
"commodity nodes" or other low cost hardware units, as we find these
end up costing more in the long run as datacenter power is extremely
expensive.

5. What technology stack are you using and why did you make the choices you made?

We currently use Java along with Tomcat as our web server. After
trying a few other solutions, we really appreciated the ability to use
as few technologies as possible, and have those always remain things
that are understandable for us. Java is an incredibly well supported
and advanced language to work in, and the components out there (Apache
Foundation, etc.) are second to none. As for Tomcat, the stability of
the server is extremely impressive. We've implemented our own
controller mechanisms on top of Tomcat (instead of going with some
other library) in order to ensure extreme simplicity.

6. How are you handling...

Multi-tenancy?

As mentioned above, every web node handles traffic for all sites, so a
customer doesn't have to worry about an underpowered server unable to handle their traffic, or a node going down.

Backups?

Backups are obviously important to us, and we have several copies of user
and server data stored in multiple locations. We gather backups with a
combination of various home-grown scripts customized for our environment.

Failover? Monitoring?

Since this company originally was solely maintained by Anthony when he
first started it, things needed to be as simple and automated as possible.
This includes failover and monitoring. Our monitoring systems check every
aspect of our environment we can think of several times a minute, and can
restart obviously dead services, or alert us if it's something an
actual person needs to handle.

Additionally, we've set up a cacti instance to graph as much statistical
data as we can pull out of our servers, so we can see trends over time.
This allows us to easily predict when a hardware upgrade is necessary. It also helps us troubleshoot any problems that do show up.

Operations? Releases? Upgrades? Add new hardware?

With our customer base constantly growing, it's getting tough to manage our systems and still keep our workload under control. There are some projects on the road map to move to a much more hands-off maintenance of our environment, including automatic code deployments and system software upgrades. Most operations can be done without taking the grid offline.

Multiple data centers?

We do not have multiple data centers, but have some plans in the works to
roll one out within the next year.

Development?

This is a really broad question, so it's a bit hard to succinctly
answer. One thing (amongst many) that has consistently served us very
well is trying to ensure our development environment is always
releasable into production. By ensuring we're always out there with
our latest code, we can usually detect problems very rapidly, and
as a result, those problems are generally extremely small. Everyone on our development team tends to be responsible for wide, sweeping aspects of the system -- which gives them a lot of flexibility to determine how
their components should work as a whole. It's incredibly important
that everything fits seamlessly together in the end, so we spend a lot
of time iterating on things that other groups might consider finished.

Support?

Support is something we take extremely seriously. As we've grown from
the ground up without an external investor, most of our team members
are versed in support, and understand how critical this component is.
Our support staff is completely hired from our community, and is
incredibly passionate about their jobs. We try and get every single
customer support inquiry answered within 15 minutes or less, and have all sorts of metrics related to our goals here.

7. What have you done that's really cool that you think other people could learn from?

We spend a lot of time internally writing scripts and other
applications that simply run our business. For instance, our
persistence layer configuration files are generated by applications
we've written that read our database model directly from the database.
We develop a lot of these programs and rely on a lot of "standard naming". This, again, means that we can move very rapidly, as we have fewer monotonous tasks and less searching to think about.

While this sort of thing is appropriate for small tasks, for the big
ones, we also aren't afraid to spend money on well developed
technology. Some of our choices for load balancing and storage are
very costly, but end up saving us months and months of time in the
long haul, as we've avoided having to "put out fires" generated by
untested home grown solutions. It's a huge balancing act.

The End

Often the best way to judge a product is to peruse the developer forums. It's these people who know what's really happening. And when I look I see an almost complete absence of threads about performance, scalability, or reliability problems. Take a look at other CMSs and you'll see a completely different tenor of questions. That says something good about the strength of their scalability strategy.

I'd really like to thank Squarespace for taking the time and making the effort to share what they've learned with the larger community. It's an effort we all benefit from. If you would also like to share your knowledge and wisdom with the world please get in touch and let's get started!

Are Cloud Based Memory Architectures the Next Big Thing?

Up and running on Squarespace by Peter Efland

Kevin Rose Comes to Squarespace by D. Atkinson

Squarespace Vs Wordpress a thread in their developer forum.

Data grid comparison: Oracle Coherence vs Gigaspaces XAP

Monday, June 1, 2009 at 1:08AM

A short summary of differences between Oracle Coherence and GigaSpaces XAP.

gojko

GridGain: One Compute Grid, Many Data Grids

Are Cloud Based Memory Architectures the Next Big Thing?

Monday, March 16, 2009 at 10:54AM

We are on the edge of two potent technological changes: Clouds and Memory Based Architectures. This evolution will rip open a chasm where new players can enter and prosper. Google is the master of disk. You can't beat them at a game they perfected. Disk based databases like SimpleDB and BigTable are complicated beasts, typical last gasp products of any aging technology before a change. The next era is the age of Memory and Cloud which will allow for new players to succeed. The tipping point will be soon.

Let's take a short trip down web architecture lane:

It's 1993: Yahoo runs on FreeBSD, Apache, Perl scripts and a SQL database

It's 1995: Scale-up the database.

It's 1998: LAMP

It's 1999: Stateless + Load Balanced + Database + SAN

It's 2001: In-memory data-grid.

It's 2003: Add a caching layer.

It's 2004: Add scale-out and partitioning.

It's 2005: Add asynchronous job scheduling and maybe a distributed file system.

It's 2007: Move it all into the cloud.

It's 2008: Cloud + web scalable database.

It's 20??: Cloud + Memory Based Architectures

You may disagree with the timing of various innovations and you would be correct. I couldn't find a history of the evolution of website architectures, so I just made stuff up. If you have any better information please let me know.

Why might cloud based memory architectures be the next big thing? For now we'll just address the memory based architecture part of the question, the cloud component is covered a little later.

Behold the power of keeping data in memory:

Google query results are now served in under an astonishingly fast 200ms, down from 1000ms in the olden days. The vast majority of this great performance improvement is due to holding indexes completely in memory. Thousands of machines process each query in order to make search results appear nearly instantaneously.

This text was adapted from notes on Google Fellow Jeff Dean's keynote speech at WSDM 2009.

Google isn't the only one getting a performance bang from moving data into memory. Both LinkedIn and Digg keep the graph of their social network in memory. Facebook has northwards of 800 memcached servers creating a reservoir of 28 terabytes of memory enabling a 99% cache hit rate. Even little guys can handle 100s of millions of events per day by using memory instead of disk.

With their new Unified Computing strategy Cisco is also entering the memory game. Their new machines "will be focusing on networking and memory" with servers crammed with 384 GB of RAM, fast processors, and blazingly fast processor interconnects. Just what you need when creating memory based systems.

Memory is the System of Record

What makes Memory Based Architectures different from traditional architectures is that memory is the system of record. Typically disk based databases have been the system of record. Disk has been King, safely storing data away within its castle walls. Disk being slow we've ended up wrapping disks in complicated caching and distributed file systems to make them perform.

Sure, memory is used all over the place as cache, but we're always supposed to pretend that cache can be invalidated at any time and old Mr. Reliable, the database, will step in and provide the correct values. In Memory Based Architectures memory is where the "official" data values are stored.

Caching also serves a different purpose. The purpose behind cache based architectures is to minimize the data bottleneck through to disk. Memory based architectures can address the entire end-to-end application stack. Data in memory can be of higher reliability and availability than traditional architectures.

Memory Based Architectures initially developed out of the need in some application spaces for very low latencies. The dramatic drop of RAM prices along with the ability of servers to handle larger and larger amounts of RAM has caused memory architectures to verge on going mainstream. For example, someone recently calculated that 1TB of RAM across 40 servers at 24 GB per server would cost an additional $40,000, which is really quite affordable given the cost of the servers. Projecting out, 1U and 2U rack-mounted servers will soon support a terabyte or more of memory.
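
The cost estimate above is easy to check with a few lines of arithmetic. This sketch only restates the quoted figures (40 servers, 24 GB each, $40,000 extra); the per-gigabyte and per-server numbers are derived from them and are not from the original calculation.

```python
servers = 40
gb_per_server = 24
extra_cost_usd = 40_000

total_gb = servers * gb_per_server           # 960 GB, close to the quoted 1 TB
cost_per_gb = extra_cost_usd / total_gb      # about 41.67 USD per GB
cost_per_server = extra_cost_usd / servers   # 1000 USD of extra RAM per server

print(f"{total_gb} GB total, ${cost_per_gb:.2f} per GB, ${cost_per_server:.0f} per server")
```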

RAM = High Bandwidth and Low Latency

Why are Memory Based Architectures so attractive? Compared to disk, RAM is a high bandwidth and low latency storage medium. Depending on who you ask, the bandwidth of RAM is about 5 GB/s. The bandwidth of disk is about 100 MB/s. By those figures RAM bandwidth is roughly 50 times higher. RAM wins. Modern hard drives have latencies under 13 milliseconds. When many applications are queued for disk reads, latencies can easily be in the many second range. Memory latency is in the 5 nanosecond range. That makes memory latency millions of times lower (13 ms versus 5 ns is a factor of about 2.6 million). RAM wins again.
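
Working the arithmetic on the figures quoted above is a useful sanity check. The sketch below takes those numbers at face value (5 GB/s and 5 ns for RAM, 100 MB/s and 13 ms for disk); real hardware varies widely, so treat the ratios as order-of-magnitude estimates only.

```python
ram_bandwidth_mb_s = 5_000      # 5 GB/s expressed in MB/s
disk_bandwidth_mb_s = 100       # 100 MB/s

disk_latency_ns = 13_000_000    # 13 milliseconds expressed in nanoseconds
ram_latency_ns = 5              # 5 nanoseconds

bandwidth_ratio = ram_bandwidth_mb_s / disk_bandwidth_mb_s   # 50.0
latency_ratio = disk_latency_ns / ram_latency_ns             # 2600000.0

print(f"RAM bandwidth is about {bandwidth_ratio:.0f}x that of one disk")
print(f"RAM latency is about {latency_ratio:,.0f}x lower than disk")
```

The bandwidth gap is about fifty to one, so it can be closed by striping across many disks. The latency gap is in the millions, which is why random access workloads favor memory so strongly.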

RAM is the New Disk

The superiority of RAM is at the heart of the RAM is the New Disk paradigm. As an architecture it combines the holy quadrinity of computing:

Performance is better because data is accessed from memory instead of through a database to a disk.

Scalability is linear because as more servers are added data is transparently load balanced across the servers so there is an automated in-memory sharding.

Availability is higher because multiple copies of data are kept in memory and the entire system reroutes on failure.

Application development is faster because there’s only one layer of software to deal with, the cache, and its API is simple. All the complexity is hidden from the programmer which means all a developer has to do is get and put data.

Accessing disk on the critical path of any transaction limits both throughput and latency. Committing a transaction over the network in-memory is faster than writing through to disk. Reading data from memory is also faster than reading data from disk. So the idea is to skip disk, except perhaps as an asynchronous write-behind option, archival storage, and for large files.
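
The asynchronous write-behind option can be sketched as follows. This is an illustration under assumptions, not any product's implementation: `WriteBehindStore` and its `persist` callback are hypothetical, a background thread stands in for the archival writer, and error handling and retries are left out.

```python
import queue
import threading


class WriteBehindStore:
    """Memory is the system of record; slower storage is updated asynchronously."""

    def __init__(self, persist):
        self._memory = {}
        self._lock = threading.Lock()
        self._pending = queue.Queue()
        self._persist = persist  # hypothetical callback: persist(key, value)
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def put(self, key, value):
        with self._lock:
            self._memory[key] = value      # the caller only waits for this
        self._pending.put((key, value))    # the slow write happens later

    def get(self, key):
        with self._lock:
            return self._memory.get(key)   # reads never touch slow storage

    def flush(self):
        """Block until every queued write has been persisted."""
        self._pending.join()

    def _drain(self):
        # Background loop. A real system would batch writes and retry
        # failures; here an exception in persist simply ends the worker.
        while True:
            key, value = self._pending.get()
            try:
                self._persist(key, value)
            finally:
                self._pending.task_done()


archive = {}                               # stands in for a disk or database
store = WriteBehindStore(archive.__setitem__)
store.put("order:1", "shipped")
print(store.get("order:1"))                # answered from memory immediately
store.flush()                              # wait for the write-behind to catch up
```

The key design choice is that `put` returns as soon as memory is updated, so slow storage is off the critical path and only `flush` ever waits for it.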

Or is Disk the New RAM?

To be fair there is also a Disk is the New RAM, RAM is the New Cache paradigm. This somewhat counterintuitive notion is that a cluster of about 50 disks has the same bandwidth as RAM, so the bandwidth problem is taken care of by adding more disks.

The latency problem is handled by reorganizing data structures and low level algorithms. It's as simple as avoiding piecemeal reads and organizing algorithms around moving data to and from memory in very large batches and writing highly parallelized programs. While I have no doubt this approach can be made to work by very clever people in many domains, a large chunk of applications spend more of their time in the random access domain, for which RAM based architectures are a better fit.

Grids and a Few Other Definitions

There's a constellation of different concepts centered around Memory Based Architectures that we'll need to understand before we can understand the different products in this space. They include:

Compute Grid - parallel execution. A Compute Grid is a set of CPUs on which calculations/jobs/work is run. Problems are broken up into smaller tasks and spread across nodes in the grid. The result is calculated faster because it is happening in parallel.

Data Grid - a system that deals with data — the controlled sharing and management of large amounts of distributed data.

In-Memory Data Grid (IMDG) - parallel in-memory data storage. Data Grids are scaled horizontally, that is by adding more nodes. Data contention is removed by partitioning data across nodes.

Colocation - Business logic and object state are colocated within the same process. Methods are invoked by routing to the object and having the object execute the method on the node it was mapped to. Latency is low because object state is not sent across the wire.

Grid Computing - Compute Grids + Data Grids

Cloud Computing - datacenter + API. The API allows the set of CPUs in the grid to be dynamically allocated and deallocated.
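
The Compute Grid definition above (break a problem into smaller tasks, spread them across nodes, combine the results) can be sketched in a few lines. This is only an analogy: worker threads in one process stand in for grid nodes, and `split` and `grid_sum_of_squares` are hypothetical helper names.

```python
from concurrent.futures import ThreadPoolExecutor


def split(items, parts):
    """Break a list into at most `parts` roughly equal chunks (the smaller tasks)."""
    size = max(1, -(-len(items) // parts))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def sum_of_squares(chunk):
    """The work one node performs on its share of the problem."""
    return sum(n * n for n in chunk)


def grid_sum_of_squares(numbers, workers=4):
    """Map each chunk to a worker, then reduce the partial results."""
    chunks = split(numbers, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    return sum(partials)


print(grid_sum_of_squares(list(range(10))))  # 285
```

A real compute grid adds what this sketch leaves out: shipping tasks over the network, handling node failure, and placing work next to the data it needs.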

Who are the Major Players in this Space?

With that bit of background behind us, there are several major players in this space (in alphabetical order):

Coherence - is a peer-to-peer, clustered, in-memory data management system. Coherence is a good match for applications that need write-behind functionality when working with a database and that require multiple applications to have ACID transactions on the database. Java, JavaEE, C++, and .NET.

GemFire - an in-memory data caching solution that provides low-latency and near-zero downtime along with horizontal & global scalability. C++, Java and .NET.

GigaSpaces - GigaSpaces attacks the whole stack: Compute Grid, Data Grid, Message, Colocation, and Application Server capabilities. This makes for greater complexity, but it means there's less plumbing that needs to be written and developers can concentrate on writing business logic. Java, C, or .Net.

GridGain - A compute grid that can operate over many data grids. It specializes in the transparent and low configuration implementation of features. Java only.

Terracotta - Terracotta is network-attached memory that allows you to share memory and do anything across a cluster. Terracotta works its magic at the JVM level and provides: high availability, an end of messaging, distributed caching, a single JVM image. Java only.

WebSphere eXtreme Scale. Operates as an in-memory data grid that dynamically caches, partitions, replicates, and manages application data and business logic across multiple servers.

This class of products has generally been called In-Memory Data Grids (IMDG), though not all the products fit snugly in this category. There's quite a range of different features amongst the different products.

I tossed the IMDG acronym in favor of Memory Based Architectures because the "in-memory" part seems redundant, the "grid" part has given way to the cloud, and the "data" part really can include both data and code. And there are other architectures that will exploit memory yet won't be classic IMDG. So I just used Memory Based Architecture as that's the part that counts.

Given the wide differences between the products there's no canonical architecture. As an example here's a diagram of how GigaSpaces In-Memory-Data-Grid on the Cloud works.

Some key points to note are:

A POJO (Plain Old Java Object) is written through a proxy using a hash-based data routing mechanism to be stored in a partition on a Processing Unit. Attributes of the object are used as a key. This is straightforward hash based partitioning like you would use with memcached.

You are operating through GigaSpaces' framework/container so they can automatically handle things like messaging, sending change events, replication, failover, master-worker pattern, map-reduce, transactions, parallel processing, parallel query processing, and write-behind to databases.

Scaling is accomplished by dividing your objects into more partitions and assigning the partitions to Processing Unit instances which run on nodes-- a scale-out strategy. Objects are kept in RAM and the objects contain both state and behavior. A Service Grid component supports the dynamic creation and termination of Processing Units.

Not conceptually difficult and familiar to anyone who has used caching systems like memcached. Only in this case memory is not just a cache, it's the system of record.
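
The key points above (hash-based routing on an object attribute, objects that carry both state and behavior, logic executed where the object lives) can be sketched like this. It is not the GigaSpaces API: `PartitionedSpace`, `Account`, and the choice of `account_id` as the routing attribute are all hypothetical, and the partitions are plain dictionaries in one process.

```python
import zlib


class Account:
    """A plain object holding both state (balance) and behavior (deposit)."""

    def __init__(self, account_id, balance):
        self.account_id = account_id   # assumed routing attribute
        self.balance = balance

    def deposit(self, amount):
        self.balance += amount
        return self.balance


class PartitionedSpace:
    """Proxy that routes each object to a partition by hashing its routing key."""

    def __init__(self, partition_count):
        self.partitions = [{} for _ in range(partition_count)]

    def _partition_for(self, routing_key):
        # crc32 is stable across runs, unlike Python's built-in hash() for strings.
        index = zlib.crc32(routing_key.encode("utf-8")) % len(self.partitions)
        return self.partitions[index]

    def write(self, obj):
        self._partition_for(obj.account_id)[obj.account_id] = obj

    def read(self, account_id):
        return self._partition_for(account_id).get(account_id)

    def execute(self, account_id, method_name, *args):
        """Colocation: run the method inside the partition that owns the object."""
        obj = self._partition_for(account_id)[account_id]
        return getattr(obj, method_name)(*args)


space = PartitionedSpace(partition_count=4)
space.write(Account("acct-7", 100))
print(space.execute("acct-7", "deposit", 25))   # 125
```

Because `execute` sends the method call to the data rather than pulling the object to the caller, no object state crosses the wire, which is the latency argument made for colocation earlier.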

Obviously there are a million more juicy details at play, but that's the gist of it. Admittedly GigaSpaces is on the full featured side of the product equation, but from a memory based architecture perspective the ideas should generalize. When you shard a database, for example, you generally lose the ability to execute queries and have to do all the assembly yourself. By using the GigaSpaces framework you get a lot of very high-end features like parallel query processing for free.

The power of this approach certainly comes in part from familiar concepts like partitioning. But the speed of memory versus disk also allows entire new levels of performance and reliability in a relatively simple and easy to understand and deploy package.

NimbusDB - the Database in the Cloud

Jim Starkey, President of NimbusDB, is not following the IMDG gang's lead. He's taking a completely fresh approach based on thinking of the cloud as a new platform unto itself. Starting from scratch, what would a database for the cloud look like?

Jim is in a position to answer this question as he has created a transactional database engine for MySQL named Falcon and added multi-versioning support to InterBase, the first relational database to feature MVCC (Multiversion Concurrency Control).

What defines the cloud as a platform? Here are some thoughts from Jim I copied out of the Cloud Computing group. You'll notice I've quoted Jim way way too much. I did that because Jim is an insightful guy, he has a lot of interesting things to say, and I think he has a different spin on the future of databases in the cloud than anyone else I've read. He also has the advantage of course of not having a shipping product, but we shall see.

I've probably said this before, but the cloud is a new computing platform that some have learned to exploit, others are scrambling to master, but most people will see as nothing but a minor variation on what they're already doing. This is not new. When time sharing was invented, the batch guys considered it as remote job entry, just a variation on batch. When departmental computing came along (VAXes, et al), the timesharing guys considered it nothing but timesharing on a smaller scale. When PCs and client/server computing came along, the departmental computing guys (i.e. DEC), considered PCs to be a special case of smart terminals. And when the Internet blew into town, the client server guys considered it as nothing more than a global scale LAN. So the batch guys are dead, the timesharing guys are dead, the departmental computing guys are dead, and the client server guys are dead. Notice a pattern?

The reason that databases are important to cloud computing is that virtually all applications involve the interaction of client data with a shared, persistent data store. And while application processing can be easily scaled, the limiting factor is the database system. So if you plan to do anything more than play Tetris in the cloud, the issue of database management should be foremost in your mind.

Disks are the limiting factors in contemporary database systems. Horrible things, disk. But conventional wisdom is that you build a clustered database system by starting with a distributed file system. Wrong. Evolution is faster processors, bigger memory, better tools. Revolution is a different way of thinking, a different topology, a different way of putting the parts together.

What I'm arguing is that a cloud is a different platform, and what works well for a single computer doesn't work at all well in a cloud, and things that work well in a cloud don't work at all on the single computer system. So it behooves us to re-examine a lot of ancient and honorable assumptions to see if they make any sense at all in this brave new world.

Sharing a high performance disk system is fine on a single computer, troublesome in a cluster, and miserable on a cloud.

I'm a database guy who's had it with disks. Didn't much like the IBM 1301, and disks haven't gotten much better since. Ugly, warty, slow things that require complex subsystems to hide their miserable characteristics. The alternative is to use the memory in a cloud as a distributed L2 cache. Yes, disks are still there, but they're out of the performance loop except for data so stale that nobody has it in memory.

Another machine or set of machines is just as good as a disk. You can quibble about reliable power, etc, but write queuing disks have the same problem.

Once you give up the idea of logs and page caches in favor of asynchronous replication, life gets a great deal brighter. It really does make sense to design to the strengths of clouds (redundancy) rather than their weaknesses (shared anything).

And while one guy is fetching his 100 MB per second, the disk is busy and everyone else is waiting in line contemplating existence. Even the cheapest of servers have two gigabit ethernet channels and a switch. The network serves everyone in parallel while the disk is single threaded.

I favor data sharing through a formal abstraction like a relational database. Shared objects are things most programmers are good at handling. The fewer the things that application developers need to manage the more likely it is that the application will work.

I buy the model of object level replication, but only as a substrate for something with a more civilized API. Or in other words, it's a foundation, not a house.

I'd much rather have a pair of quad-core processors running as independent servers than contending for memory on a dual socket server. I don't object to more cores per processor chip, but I don't want to pay for die size for cores perpetually stalled for memory.

The object substrate worries about data distribution and who should see what. It doesn't even know it's a database. SQL semantics are applied by an engine layered on the object substrate. The SQL engine doesn't worry or even know that it's part of a distributed database -- it just executes SQL statements. The black magic is MVCC.

I'm a database developer building a database system for clouds. Tell me what you need. Here is my first approximation: A database that scales by adding more computers and degrades gracefully when machines are yanked out; A database system that never needs to be shut down; Hardware and software fault tolerance; Multi-site archiving for disaster survival; A facility to reach into the past to recover from human errors (drop table customers; oops;); Automatic load balancing.

MySQL scales with read replication which requires a full database copy to start up. For any cloud relevant application, that's probably hundreds of gigabytes. That makes it a mighty poor candidate for on-demand virtual servers.

Do remember that the primary function of a database system is to maintain consistency. You don't want a dozen people each draining the last thousand buckets from a bank account or a debit to happen without the corresponding credit.
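To make the debit/credit point concrete, here is a minimal sketch of an atomic transfer using Python's built-in sqlite3 module. The `accounts` table, the account names, and the `open_bank` and `transfer` helpers are assumptions made up for illustration; this is a single-node example of the guarantee being described, not how a cloud database would implement it.

```python
import sqlite3


def open_bank():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 1000), ("bob", 0)]
    )
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Debit src and credit dst as one unit; neither happens if either fails."""
    with conn:  # commits on success, rolls back if an exception escapes
        debit = conn.execute(
            "UPDATE accounts SET balance = balance - ? "
            "WHERE id = ? AND balance >= ?",
            (amount, src, amount),
        )
        if debit.rowcount != 1:
            raise ValueError("insufficient funds or unknown source account")
        credit = conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?",
            (amount, dst),
        )
        if credit.rowcount != 1:
            # The debit above is undone by the rollback.
            raise ValueError("unknown destination account")


def balance(conn, account_id):
    row = conn.execute(
        "SELECT balance FROM accounts WHERE id = ?", (account_id,)
    ).fetchone()
    return row[0]
```

The `balance >= ?` guard in the debit statement is what stops an account from being overdrawn, and the surrounding transaction is what stops a debit from landing without its credit.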

Whether the data moves to the work or the work moves to the data isn't that important as long as they both end up at the same place with as few intermediate round trips as possible.

In my area, for example, databases are either limited by the biggest, ugliest machine you can afford *or* you have to learn to operate without consistent, atomic transactions. A bad rock / hard place choice that sends the cost of scalable application development through the ceiling. Once we solve that, applications that serve 20,000,000 users will be simple and cheap to write. Who knows where that will go?

To paraphrase our new president, we must reject the false choice between data consistency and scalability.

Cloud computing is about using many computers to scale problems that were once limited by the capabilities of a single computer. That's what makes clouds exciting, at least to me. But most will argue that cloud computing is a better economic model for running many instances of a single computer. Bah, I say, bah!

Cloud computing is a wonderful new platform. Let's not let the dinosaurs waiting for extinction define it as a minor variation of what they've been doing for years. They will, of course, but this (and the dinosaurs) will pass.

The revolutionary idea is that applications don't run on a single computer but an elastic cloud of computers that grows and contracts by demand. This, in turn, requires an applications infrastructure that can a) run a single application across as many machines as necessary, and b) run many applications on the same machines without any of the cross talk and software maintenance problems of years past. No, the software infrastructure required to enable this is not mature and certainly not off the shelf, but many smart folks are working on it.

There's nothing limiting in relational except the companies that build them. A relational database can scale as well as BigTable and SimpleDB but still be transactional. And, unlike BigTable and SimpleDB, a relational database can model relationships and do exotic things like transferring money from one account to another without "breaking the bank." It is true that existing relational database systems are largely constrained to a single CPU or a cluster with a shared file system, but we'll get over that.

Personally, I don't like masters any more than I like slaves. I strongly favor peer to peer architectures with no single point of failure. I also believe that database federation is a work-around rather than a feature. If a database system had sufficient capacity, reliability, and availability, nobody would ever partition or shard data. (If one database instance is a headache, a million tiny ones is a horrible, horrible migraine.)

Logic does need to be pushed to the data, which is why relational database systems destroyed hierarchical (IMS), network (CODASYL), and OODBMS. But there is a constant need to push semantics higher to further reduce the number of round trips between application semantics and the database systems. As for I/O, a database system that can use the cloud as an L2 cache breaks free from dependencies on file systems. This means that bandwidth and cycles are the limiting factors, not I/O capacity.

What we should be talking about is trans-server application architecture, trans-server application platforms, both, or whether one will make the other unnecessary.

If you scale, you don't/can't worry about server reliability. Money spent on (alleged) server reliability is money wasted.

If you view the cloud as a new model for scalable applications, it is a radical change in computing platform. If, like most people, you see the cloud through the lens of EC2, which is just another way to run a server that you have to manage and control, then the cloud is little more than a rather boring business model. When clouds evolve to the point that applications and databases can utilize whatever resources they need to meet demand without the constraint of single machine limitations, we'll have something really neat.

On MVCC: Forget about the concept of master. Synchronizing slaves to a master is hopeless. Instead, think of a transaction as a temporal view of database state; different transactions will have different views. Certain critical operations must be serialized, but that still doesn't require that all nodes have identical views of database state.

Low latency is definitely good, but I'm designing the system to support geographically separated sub-clouds. How well that works under heavy load is probably application specific. If the amount of volatile data common to the sub-clouds is relatively low, it should work just fine provided there is enough bandwidth to handle the replication messages.

MVCC tracks multiple versions to provide a transaction with a view of the database consistent with the instant it started while preventing a transaction from updating a piece of data that it could not see. MVCC is consistent, but it is not serializable. Opinions vary between academia and the real world, but most database practitioners recognize that the consistency provided by MVCC is sufficient for programmers of modest skills to produce robust applications.
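The two rules in that description (read from a snapshot taken at start, refuse to update a version you could not see) can be sketched in a few lines of Python. This is a single-node, single-threaded toy under assumed names (`Store`, `Transaction`, `WriteConflict`); it is not NimbusDB's design and it ignores the distributed bookkeeping that is the hard part.

```python
import itertools


class WriteConflict(Exception):
    """Raised when a transaction tries to update a version it could not see."""


class Store:
    def __init__(self):
        self._clock = itertools.count(1)  # monotonically increasing timestamps
        self._versions = {}               # key -> [(commit_ts, value)], oldest first

    def begin(self):
        return Transaction(self, next(self._clock))

    def visible(self, key, snapshot_ts):
        # Newest version committed at or before the snapshot, else None.
        for commit_ts, value in reversed(self._versions.get(key, [])):
            if commit_ts <= snapshot_ts:
                return value
        return None

    def latest_ts(self, key):
        versions = self._versions.get(key)
        return versions[-1][0] if versions else 0


class Transaction:
    def __init__(self, store, snapshot_ts):
        self.store = store
        self.snapshot_ts = snapshot_ts
        self.writes = {}  # uncommitted changes, private to this transaction

    def read(self, key):
        if key in self.writes:
            return self.writes[key]
        return self.store.visible(key, self.snapshot_ts)

    def write(self, key, value):
        # A version newer than our snapshot exists: we never saw it, so refuse.
        if self.store.latest_ts(key) > self.snapshot_ts:
            raise WriteConflict(key)
        self.writes[key] = value

    def commit(self):
        # Re-check, since another transaction may have committed since write().
        for key in self.writes:
            if self.store.latest_ts(key) > self.snapshot_ts:
                raise WriteConflict(key)
        commit_ts = next(self.store._clock)
        for key, value in self.writes.items():
            self.store._versions.setdefault(key, []).append((commit_ts, value))
```

Readers never block writers here: an older transaction keeps seeing the value that was current when it began, and only a colliding update is rejected.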

MVCC, heretofore, has been limited to single node databases. Applied to the cloud with suitable bookkeeping to control visibility of updates on individual nodes, MVCC is as close to black magic as you are likely to see in your lifetime, enabling concurrency and consistency with mostly non-blocking, asynchronous messaging. It does, however, dispense with the idea that a cloud has at any given point of time a single definitive state. Serializability implemented with record locking is an attempt to make a distributed system march in lock-step so that the result is as if there were no parallelism between nodes. MVCC recognizes that parallelism is the key to scalability. Data that is a few microseconds old is not a problem as long as updates don't collide.

Jim certainly isn't shy with his opinions :-)

My summary of what he wants to do with NimbusDB is:

Make a scalable relational database in the cloud where you can use normal everyday SQL to perform summary functions, define referential integrity, and all that other good stuff.

Transactions scale using a distributed version of MVCC, which I do not believe has been done before. This is the key part of the plan and a lot depends on it working.

The database is stored primarily in RAM which makes cloud level scaling of an RDBMS possible.

The database will handle all the details of scaling in the cloud. To the developer it will look like just a very large highly available database.

I'm not sure if NimbusDB will support a compute grid and map-reduce type functionality. The low latency argument for data and code collocation is a good one, so I hope it integrates some sort of extension mechanism.

Why might NimbusDB be a good idea?

Keeps simple things simple. Web scale databases like BigTable and SimpleDB make simple things difficult. They are full of quotas, limits, and restrictions because by their very nature they are just a key-value layer on top of a distributed file system. The database knows as little about the data as possible. If you want to build a sequence number for a comment system, for example, it takes complicated sharding logic to remove write contention. Developers are used to SQL and are comfortable working within the transaction model, so the transition to cloud computing would be that much easier. Now, to be fair, who knows if NimbusDB will be able to scale under high load either, but we need to make simple things simple again.
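For a sense of what that sharding logic looks like, here is a minimal sketch of a sharded counter, the usual workaround for write contention on a single key. The `ShardedCounter` class is hypothetical, and a plain dict stands in for the rows of a key-value store; real stores add their own transaction and quota rules on top.

```python
import random


class ShardedCounter:
    """Spread increments over several rows so writers rarely hit the same one."""

    def __init__(self, num_shards=8, rng=None):
        # Each dict entry stands in for one row in a key-value store.
        self._shards = {i: 0 for i in range(num_shards)}
        self._rng = rng or random.Random()

    def increment(self):
        # Pick a shard at random; concurrent writers mostly touch different rows.
        shard = self._rng.randrange(len(self._shards))
        self._shards[shard] += 1

    def total(self):
        # Reading the count now means fetching and summing every shard.
        return sum(self._shards.values())
```

Note what was given up: this yields a total, not an ordered sequence, and every read touches all shards. In SQL the same job is a single `UPDATE` or a sequence object.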

Language independence. Notice that IMDG products are all language specific. They support some combination of .Net/Java/C/C++. This is because they need low level object knowledge to transparently implement their magic. This isn't bad, but it does mean if you use Python, Erlang, Ruby, or any other unsupported language then you are out of luck. As many problems as SQL has, one of its great gifts is programmatic universal access.

Separates data from code. Data is forever, code changes all the time. That's one of the common reasons for preferring a database instead of an objectbase. This also dovetails with the language independence issue. Any application can access data from any language and any platform from now and into the future. That's a good quality to have.

The smart money has been that cloud level scaling requires abandoning relational databases and distributed transactions. That's why we've seen an epidemic of key-value databases and eventually consistent semantics. It will be fascinating to see if Jim's combination of Cloud + Memory + MVCC can prove the insiders wrong.

Are Cloud Based Memory Architectures the Next Big Thing?

We've gone through a couple of different approaches to deploying Memory Based Architectures. So are they the next big thing?

Adoption has been slow because it's new and different and that inertia takes a while to overcome. Historically tools haven't made it easy for early adopters to make the big switch, but that is changing with easier to deploy cloud based systems. And current architectures, with a lot of elbow grease, have generally been good enough.

But we are seeing a wide convergence on caching as a way to make slow disks perform. Truly enormous amounts of effort are going into adding cache and then trying to keep the database and applications all in-sync with cache as bottom up and top down driven changes flow through the system.
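The bookkeeping in question is easy to see in a minimal cache-aside sketch. The `CacheAside` class is made up for illustration, and a dict stands in for the slow disk-backed database; the point is only to show the extra read and invalidate paths that every application ends up carrying.

```python
class CacheAside:
    """Read through a cache, and invalidate on write to keep it in sync."""

    def __init__(self, database):
        self.database = database  # dict standing in for a disk-backed store
        self.cache = {}
        self.db_reads = 0         # counts trips to the slow store

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        # Cache miss: pay for the slow read, then remember the result.
        self.db_reads += 1
        value = self.database[key]
        self.cache[key] = value
        return value

    def put(self, key, value):
        self.database[key] = value
        # Drop the cached copy so the next read cannot return stale data.
        self.cache.pop(key, None)
```

Forget the invalidation in `put`, or have another process update the database directly, and readers see stale data. That synchronization burden is the effort described above.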

After all that work it's a simple step to wonder why that extra layer is needed when the data could have just as well been kept in memory from the start. Now add the ease of cloud deployments and the ease of creating scalable, low latency applications that are still easy to program, manage, and deploy. Building multiple complicated layers of application code just to make the disk happy will make less and less sense over time.

We are on the edge of two potent technological changes: Clouds and Memory Based Architectures. This evolution will rip open a chasm where new players can enter and prosper. Google is the master of disk. You can't beat them at a game they perfected. Disk based databases like SimpleDB and BigTable are complicated beasts, typical last gasp products of any aging technology before a change. The next era is the age of Memory and Cloud which will allow for new players to succeed. The tipping point is soon.

GridGain vs Hadoop

Cameron Purdy: Defining a Data Grid

Compute Grids vs. Data Grids

Performance killer: Disk I/O by Nathanael Jones

RAM is the new disk... by Steven Robbins

Talk on disk as the new RAM by Greg Linden

Disk-Based Parallel Computation, Rubik's Cube, and Checkpointing by Gene Cooperman, Northeastern Professor, High Performance Computing Lab - Disk is the new RAM and RAM is the new cache

Disk is the new disk by David Hilley.

Latency lags bandwidth by David A. Patterson

InfoQ Article - RAM is the new disk... by Nati Shalom

Tape is Dead Disk is Tape Flash is Disk RAM Locality is King by Jim Gray

Product: ScaleOut StateServer is Memcached on Steroids

Latency is Everywhere and it Costs You Sales - How to Crush it

Virtualization for High Performance Computing by Shai Fultheim

Multi-Multicore Single System Image / Cloud Computing. A Good Idea? (part 1) by Greg Pfister

How do you design and handle peak load on the Cloud ? by Cloudiquity.

Defining a Data Grid by Cameron Purdy

The Share-Nothing Architecture by Zef Hemel.

Scaling memcached at Facebook

Cache-aside, write-behind, magic and why it sucks being an Oracle customer by Stefan Norberg.

Introduction to Terracotta by Mike

The five-minute rule twenty years later, and how flash memory changes the rules by Goetz Graefe

Handle 1 Billion Events Per Day Using a Memory Grid

Monday, February 16, 2009 at 1:58AM

Moshe Kaplan of RockeTier shows the life cycle of an affiliate marketing system that starts off as a cub handling one million events per day and ends up a lion handling 200 million to even one billion events per day. The resulting system uses ten commodity servers at a cost of $35,000. Mr. Kaplan's paper is especially interesting because it documents a system architecture evolution we may see a lot more of in the future: database centric --> cache centric --> memory grid. As scaling and performance requirements for complicated operations increase, leaving the entire system in memory starts to make a great deal of sense. Why use cache at all? Why shouldn't your system be all in memory from the start?
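As a back-of-the-envelope check on those numbers, the daily totals can be converted into average per-second rates. The helper below is just arithmetic; it assumes load is spread evenly over the day and across servers, which real traffic is not, so peak rates will be higher.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate(events_per_day, servers=1):
    """Average events per second, optionally per server, assuming even load."""
    return events_per_day / SECONDS_PER_DAY / servers


for label, events in [
    ("1 million/day", 1_000_000),
    ("200 million/day", 200_000_000),
    ("1 billion/day", 1_000_000_000),
]:
    total = average_rate(events)
    per_server = average_rate(events, servers=10)
    print(f"{label}: {total:,.0f}/s overall, {per_server:,.0f}/s per server")
```

One billion events per day averages to roughly 11,600 events per second, or a little under 1,200 per second on each of the ten servers, at about $3,500 of hardware per server.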

General Approach to Evolving the System to Scale

Analyze the system architecture and the main business processes. Detect the main hardware bottlenecks and the related business process causing them. Focus efforts on points of greatest return.

Rate the bottlenecks by importance and provide immediate and practical recommendation to improve performance.

Implement the recommendations to provide immediate relief to problems. Risk is reduced by avoiding a full rewrite and spending a fortune on more resources.

Plan a road map for meeting next generation solutions.

Scale up and scale out when redesign is necessary.

One Million Event Per Day System

The events are common advertising system operations like: ad impressions, clicks, and sales.

Typical two tier system. Impressions and banner sales are written directly to the database.

The immediate goal was to process 2.5 million events per day so something needed to be done.

2.5 Million Event Per Day System

PerfMon was used to check web server and DB performance counters. CPU usage was at 100% at peak usage.

Immediate fixes included: tuning SQL queries, implementing stored procedures, using a PHP compiler, removing include files and fixing other programming errors.

The changes successfully doubled the performance of the system within 3 months. The next goal was to handle 20 million events per day.

20 Million Event Per Day System

To make this scaling leap a rethinking of how the system worked was in order.

The main load of the system was validating inputs in order to prevent forgery.

A cache was maintained in the application servers to cut unnecessary database access. The result was a 50% reduction in CPU utilization.

An in-memory database was used to accumulate transactions over time (impression counting, clicks, sales recording).

A periodic process was used to write transactions from the in-memory database to the database server.

This architecture could handle 20 million events using existing hardware.

Business projections required a system that could handle 200 million events.
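The accumulate-then-flush step in the 20 million event design can be sketched as follows. The paper's actual code is not shown in this post, so everything here is an assumption: the `EventAccumulator` class, the `event_counts` table, and the use of sqlite3 as a stand-in for the database server.

```python
import sqlite3
from collections import Counter


class EventAccumulator:
    """Count events in memory and write them to the database in batches."""

    def __init__(self, conn):
        self.conn = conn
        self.pending = Counter()  # (banner_id, event_type) -> count since last flush
        conn.execute(
            "CREATE TABLE IF NOT EXISTS event_counts ("
            " banner_id TEXT, event_type TEXT, total INTEGER NOT NULL,"
            " PRIMARY KEY (banner_id, event_type))"
        )

    def record(self, banner_id, event_type):
        # Hot path: a dictionary update, no database round trip.
        self.pending[(banner_id, event_type)] += 1

    def flush(self):
        """Write accumulated counts in one transaction; meant to run periodically."""
        batch, self.pending = self.pending, Counter()
        with self.conn:  # one commit for the whole batch
            for (banner_id, event_type), count in batch.items():
                updated = self.conn.execute(
                    "UPDATE event_counts SET total = total + ? "
                    "WHERE banner_id = ? AND event_type = ?",
                    (count, banner_id, event_type),
                )
                if updated.rowcount == 0:
                    self.conn.execute(
                        "INSERT INTO event_counts VALUES (?, ?, ?)",
                        (banner_id, event_type, count),
                    )
        return len(batch)
```

Thousands of impressions on one banner collapse into a single row update per flush, which is where the database relief comes from. The trade-off is that anything still in memory is lost if the process dies before the next flush.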

200 Million Event Per Day System

The next architectural evolution was to a scale out grid product. It's not mentioned in the paper but I think GigaSpaces was used.

A Layer 7 load balancer is used to route requests to sharded application servers. Each app server supports a different set of banners.

Data is still stored in the database as the data is used for statistics, reports, billing, fraud detection and so on.

Latency was slashed because logic was separated out of the HTTP request/response loop into a separate process and database persistence is done offline. At this point the architecture supports near-linear scaling and it's projected that it can easily scale to a billion events per day.
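The routing rule behind "each app server supports a different set of banners" might look something like the sketch below. The post does not say how banners were assigned to servers, so hashing the banner ID is an assumption, as are the server names and the `route` function.

```python
import hashlib

# Hypothetical application server pool.
APP_SERVERS = ["app-0", "app-1", "app-2", "app-3"]


def route(banner_id, servers=APP_SERVERS):
    """Map a banner to one server so all of its events land in the same place."""
    digest = hashlib.sha256(banner_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer and fold it onto the server list.
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]
```

Because every event for a given banner reaches the same server, that server can keep the banner's counters in memory without coordinating with its peers. Plain modulo hashing reshuffles most banners when the pool size changes; a lookup table or consistent hashing would avoid that.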
