Entries in Product (120)

Thursday
Oct082009

Riak - web-shaped data storage system

Update: Short presentation in NYC by Bryan Fink demonstrating the Riak web-shaped data storage engine.

Riak is another new and interesting key-value store entrant. Some of the features it offers are:

  • Document-oriented
  • Scalable, decentralized key-value store
  • Standard get, put, and delete operations (see the HTTP sketch after this list).
  • Distributed, fault-tolerant storage solution.
  • Configurable levels of consistency, availability, and partition tolerance
  • Support for Erlang, Ruby, PHP, Javascript, Java, Python, HTTP
  • Open source and NoSQL
  • Pluggable backends
  • Eventing system
  • Monitoring
  • Inter-cluster replication
  • Links between records that can be traversed.
  • Map/Reduce. Functions are executed on the data node. One interesting difference is that a list of keys is required to specify which values are operated on, as opposed to running calculations on all values.
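As a rough illustration of the HTTP interface, here is a minimal sketch of a put/get round trip in Java. It assumes the REST layout /riak/<bucket>/<key> on the default port 8098; exact paths and ports vary by version, so treat this as illustrative rather than as Riak's documented API.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class RiakSketch {
    public static void main(String[] args) throws Exception {
        // Assumed URL layout: http://host:port/riak/<bucket>/<key>
        URL url = new URL("http://127.0.0.1:8098/riak/articles/riak-intro");

        // PUT a JSON document under the key.
        HttpURLConnection put = (HttpURLConnection) url.openConnection();
        put.setRequestMethod("PUT");
        put.setDoOutput(true);
        put.setRequestProperty("Content-Type", "application/json");
        OutputStream out = put.getOutputStream();
        out.write("{\"title\":\"Riak - web-shaped data storage\"}".getBytes("UTF-8"));
        out.close();
        System.out.println("PUT status: " + put.getResponseCode());

        // GET it back.
        HttpURLConnection get = (HttpURLConnection) url.openConnection();
        get.setRequestMethod("GET");
        InputStream in = get.getInputStream();
        StringBuilder body = new StringBuilder();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            body.append(new String(buf, 0, n, "UTF-8"));
        }
        in.close();
        System.out.println("GET body: " + body);
    }
}
```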

Related Articles

  • Hacker News Thread. More juicy details on how Riak compares to Cassandra, MongoDB, CouchDB, etc. 

 

Monday
Sep072009

Product: Infinispan - Open Source Data Grid

Infinispan is a highly scalable, open source licensed data grid platform in the style of GigaSpaces and Oracle Coherence.

From their website:

The purpose of Infinispan is to expose a data structure that is highly concurrent, designed ground-up to make the most of modern multi-processor/multi-core architectures while at the same time providing distributed cache capabilities. At its core Infinispan exposes a JSR-107 (JCACHE) compatible Cache interface (which in turn extends java.util.Map). It is also optionally backed by a peer-to-peer network architecture to distribute state efficiently around a data grid.

Offering high availability via making replicas of state across a network as well as optionally persisting state to configurable cache stores, Infinispan offers enterprise features such as efficient eviction algorithms to control memory usage as well as JTA compatibility.

In addition to the peer-to-peer architecture of Infinispan, on the roadmap is the ability to run farms of Infinispan instances as servers and connecting to them using a plethora of clients - both written in Java as well as other popular platforms.
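Since the core API is described as a JSR-107-style Cache that extends java.util.Map, a minimal embedded-mode sketch looks roughly like the following. Class names are from the early Infinispan 4.x API; check the current docs before relying on them.

```java
import org.infinispan.Cache;
import org.infinispan.manager.DefaultCacheManager;

public class InfinispanSketch {
    public static void main(String[] args) {
        // Start a cache manager with the default (local) configuration;
        // a clustered setup would be driven by an XML or programmatic config instead.
        DefaultCacheManager manager = new DefaultCacheManager();
        Cache<String, String> cache = manager.getCache();

        // The Cache interface behaves like a ConcurrentMap.
        cache.put("user:42", "Alice");
        String name = cache.get("user:42");
        cache.putIfAbsent("user:42", "Bob");   // no-op, key already present
        System.out.println(name + ", entries: " + cache.size());

        manager.stop();
    }
}
```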

A few observations:

  • Open source is an important consideration, depending on your business model. As you scale out your costs don't go up. The downside is you'll likely put in more programming effort to implement capabilities the commercial products have already solved.
  • It's from the makers of JBoss Cache, so it's likely to have a solid implementation even this early in its development cycle. The API looks very well thought out.
  • Java only. Plan is to add more bindings in the future.
  • Distributed hash table only. Commercial products have very advanced features like distributed query processing which can make all the difference during implementation. We'll see how the product expands from its caching roots into a full fledged data manipulation platform.
  • MVCC and an STM-like approach provide lock- and synchronization-free data structures. This means dusting off all those non-blocking algorithms you've never used before. It will be very interesting to see how this approach performs under real-life loads programmed by real-life programmers not used to such techniques.
  • Data is made safe using a configurable degree of redundancy. State is distributed across a cluster. And it's peer-to-peer; there's no central server.
  • API based (put and get operations). XML, bytecode manipulation and JVM hooks aren't used.
  • Future plans call for adding a compute-grid for map-reduce style operations.
  • Distributed transactions across multiple objects are supported. It also offers eviction strategies to ensure individual nodes do not run out of memory and passivation/overflow to disk. Warm-starts using preloads are also supported.

    It's exciting to have an open source grid alternative. It will be interesting to see how Infinispan develops in quality and its feature set. Making a mission critical system of this type is no simple task.

    I don't necessarily see Infinispan as just a competitor for obvious players like GigaSpaces and Coherence; it may play even more strongly in the NoSQL space. For people looking for a reliable, highly performant, scalable, transaction-aware hash storage system, Infinispan may look even more attractive than a lot of the disk-based systems.

    Related Articles

  • Video Interview with Manik Surtani, Founder & Project Lead at JBoss Cache, Infinispan Data Grid
  • Infinispan Interview by Mark Little on InfoQ.
  • Are Cloud Based Memory Architectures the Next Big Thing?
  • Infinispan - data grids meets open source on TheServerSide.com
  • Technical FAQs
  • Anti-RDBMS: A list of distributed key-value stores
  • Infinispan Wiki
  • Distribution instead of Buddy Replication
    Thursday
    Aug132009

    Reconnoiter - Large-Scale Trending and Fault-Detection

    One of the top recommendations from the collective wisdom contained in Real Life Architectures is to add monitoring to your system. Now! Loud is the lament for not adding monitoring early and often. The reason is easy to understand. Without monitoring you don't know what your system is doing which means you can't fix it and you can't improve it. Feedback loops require data.

    Some popular monitor options are Munin, Nagios, Cacti and Hyperic. A relatively new entrant is a product called Reconnoiter from Theo Schlossnagle, President and CEO of OmniTI, leading consultants on solving problems of scalability, performance, architecture, infrastructure, and data management. Theo's name might sound familiar. He gives lots of talks and is the author of the very influential Scalable Internet Architectures book.

    So right away you know Reconnoiter has a good pedigree. As Theo says, their products are born of pain, from the fire of solving real-life problems and that's always a harbinger of good things to come.

    The problem Reconnoiter is trying to solve is monitoring thousands of nodes across many datacenters where the nodes can vary widely in power, architecture, and software configuration. With that kind of problem what they really want is the ability to:

     

  • Configure everything from one place.
  • Run cheap checks at the specified time interval that aren't late and don't cause a heavy load on the machine.
  • Change the configuration from any datacenter without coordination.
  • Add checks in the field.
  • Separate data collection from visualization and fault-detection.
  • Analyze trends for long-term capacity planning and postmortem analysis.
  • Detect when faults have happened and when they are about to happen.
  • Support trending: intelligent data correlation, regression analysis/curve fitting, and looking into the past to see how you got where you are now so you can do better next time.
  • Create a monitoring system that doesn't require a separate powerful network and its own set of hosts on which to run.

    If you've ever used or written a distributed stats collection system the architecture of Reconnoiter will look somewhat familiar:


    Some of the more interesting bits of the architecture are:
  • PostgreSQL stores all the data. The data isn't stuck in funky little files.
  • Fault-detection is based on Esper, a streaming complex event processing system. It's not clear how well this approach will work, but the hooks are there (a hedged sketch of what an Esper-style check might look like follows this list).
  • A Comet-style web server is used to feed real-time updates. Much better than your traditional polling cycle.
  • Although the web console is PHP based, PHP is used mainly to execute JSON calls. Rendering happens in the browser in an AJAX client.
  • Canvas is used for real time graphics. No images are created on the fly.
  • Data is transferred securely over SSL.
  • The system is robust against failures.
  • Data is not thrown away as it is with some systems so you can check against history.
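Reconnoiter's actual fault-detection rules aren't shown here, but to give a feel for what an Esper-based check might look like, here is a hedged sketch using Esper's Java API and EPL. The MetricEvent class and the threshold query are made up for illustration; they are not Reconnoiter's real event schema or rules.

```java
import com.espertech.esper.client.Configuration;
import com.espertech.esper.client.EPServiceProvider;
import com.espertech.esper.client.EPServiceProviderManager;
import com.espertech.esper.client.EPStatement;
import com.espertech.esper.client.EventBean;

public class FaultDetectionSketch {
    // Hypothetical event type; a real monitoring system's schema will differ.
    public static class MetricEvent {
        private final String host;
        private final String name;
        private final double value;
        public MetricEvent(String host, String name, double value) {
            this.host = host; this.name = name; this.value = value;
        }
        public String getHost() { return host; }
        public String getName() { return name; }
        public double getValue() { return value; }
    }

    public static void main(String[] args) {
        Configuration config = new Configuration();
        config.addEventType("MetricEvent", MetricEvent.class);
        EPServiceProvider ep = EPServiceProviderManager.getDefaultProvider(config);

        // Alert when the 5-minute average load on any host exceeds a threshold.
        EPStatement stmt = ep.getEPAdministrator().createEPL(
            "select host, avg(value) as avgLoad " +
            "from MetricEvent(name='load').win:time(5 min) " +
            "group by host having avg(value) > 4.0");

        stmt.addListener((EventBean[] newEvents, EventBean[] oldEvents) -> {
            if (newEvents == null) return;
            for (EventBean e : newEvents) {
                System.out.println("ALERT " + e.get("host") + " avg load " + e.get("avgLoad"));
            }
        });

        // Feed an event in; in Reconnoiter these would come from the collectors.
        ep.getEPRuntime().sendEvent(new MetricEvent("web01", "load", 6.2));
    }
}
```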

    Reconnoiter isn't completely pain free. Lua for an extension language is an interesting choice. The installation and configuration process is very complex. There are a lot of separate steps and bits to configure. Another potential problem is that monitoring produces a lot of real-time data. I have to wonder if PostgreSQL can handle that flow for very large systems. The data is partitioned by month, but a large number of machines and a large number of events can be crushing. And I wasn't sure if graph data could be correlated with released features or other system changes. In the video Theo mentions seeing in the graphs that using deflate improved performance, but I'm not sure, just looking at the graph, how you would be able to correlate system data with system changes.

    It's droolingly clear that Reconnoiter shines at creating complex graphs, charts, and other visualizations. The graphs look useful and quick to render. The real-time visualizations are spectacular and extremely difficult to do in other systems.

    Related Articles



  • OmniTI Reconnoiter: Web Management and Analysis by Eric J. Bruno
  • Reconnoiter Update by Theo Schlossnagle
  • Reconnoiter Project Home Page
  • Video: Reconnoiter: a whirlwind tour
  • Big Picture of the Overall System
  • Reconnoiter: Monitoring and Trend Analysis from OSCON
  • OmniTI Unveils Open Source Monitoring Tool, Reconnoiter by Jayashree Adkoli
  • The sad state of open source monitoring tools by Grig Gheorghiu
  • How to Succeed at Capacity Planning Without Really Trying : An Interview with Flickr's John Allspaw on His New Book.
  • New open source IT management tool: Lighter-weight than Nagios, more granular than Cacti by Matt Stansberry
    Thursday
    Jul022009

    Product: Hbase

    Update 3: Presentation from the NoSQL Conference: slides, video.
    Update 2: Jim Wilson helps with the Understanding HBase and BigTable by explaining them from a "conceptual standpoint."
    Update: InfoQ interview: HBase Leads Discuss Hadoop, BigTable and Distributed Databases. "MapReduce (both Google's and Hadoop's) is ideal for processing huge amounts of data with sizes that would not fit in a traditional database. Neither is appropriate for transaction/single request processing."

    Hbase is the open source answer to BigTable, Google's highly scalable distributed database. It is built on top of Hadoop, which implements functionality similar to Google's GFS and Map/Reduce systems.

    Both Google's GFS and Hadoop's HDFS provide a mechanism to reliably store large amounts of data. However, there is not really a mechanism for organizing the data and accessing only the parts that are of interest to a particular application.

    Bigtable (and Hbase) provide a means for organizing and efficiently accessing these large data sets.
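To make "organizing and efficiently accessing" concrete, here is a rough sketch of writing and reading a row with the HBase Java client, written against the 0.20-era API (class names may have shifted in later releases). The table and column family names are assumptions for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HbaseSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Assumes a table 'webtable' with a column family 'contents' already exists.
        HTable table = new HTable(conf, "webtable");

        // Rows are keyed by an arbitrary byte[]; columns live inside column families.
        Put put = new Put(Bytes.toBytes("com.example/index.html"));
        put.add(Bytes.toBytes("contents"), Bytes.toBytes("html"),
                Bytes.toBytes("<html>hello</html>"));
        table.put(put);

        Get get = new Get(Bytes.toBytes("com.example/index.html"));
        Result result = table.get(get);
        byte[] html = result.getValue(Bytes.toBytes("contents"), Bytes.toBytes("html"));
        System.out.println(Bytes.toString(html));

        table.close();
    }
}
```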

    Hbase is still not ready for production, but it's a glimpse into the power that will soon be available to your average website builder.

    Google is of course still way ahead of the game. They have huge core competencies in data center roll out and they will continually improve their stack.

    It will be interesting to see how these sorts of tools along with Software as a Service can be leveraged to create the next generation of systems.

    Thursday
    Jul022009

    Hypertable is a New BigTable Clone that Runs on HDFS or KFS

    Update 3: Presentation from the NoSQL conference: slides, video 1, video 2.

    Update 2: The folks at Hypertable would like you to know that Hypertable is now officially sponsored by Baidu, China’s Leading Search Engine. As a sponsor of Hypertable, Baidu has committed an industrious team of engineers, numerous servers, and support resources to improve the quality and development of the open source technology.

    Update: InfoQ interview on Hypertable Lead Discusses Hadoop and Distributed Databases. Hypertable differs from HBase in that it is a higher performance implementation of Bigtable.

    Skrentablog gives the heads up on Hypertable, Zvents' open-source BigTable clone. It's written in C++ and can run on top of either HDFS or KFS. Performance looks encouraging: 28M rows of data inserted at a per-node write rate of 7 MB/sec.

    Thursday
    Jul022009

    Product: Project Voldemort - A Distributed Database

    Update: Presentation from the NoSQL conference: slides, video 1, video 2.

    Project Voldemort is an open source implementation of the basic parts of Dynamo (Amazon’s Highly Available Key-value Store) distributed key-value storage system. LinkedIn is using it in their production environment for "certain high-scalability storage problems where simple functional partitioning is not sufficient."

    From their website:

  • Data is automatically replicated over multiple servers.
  • Data is automatically partitioned so each server contains only a subset of the total data
  • Server failure is handled transparently
  • Pluggable serialization is supported to allow rich keys and values including lists and tuples with named fields, as well as to integrate with common serialization frameworks like Protocol Buffers, Thrift, and Java Serialization
  • Data items are versioned to maximize data integrity in failure scenarios without compromising availability of the system
  • Each node is independent of other nodes with no central point of failure or coordination
  • Good single node performance: you can expect 10-20k operations per second depending on the machines, the network, and the replication factor
  • Support for pluggable data placement strategies to support things like distribution across data centers that are geographically far apart.

    They also have a nice design page going over some of their architectural choices: key-value store only, no complex queries or joins; consistent hashing is used to assign data to nodes; JSON is used for schema definition; versioning and read-repair for distributed consistency; a strict layered architecture with put, get, and delete as the interface between layers.
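The put/get/delete interface mentioned above is exposed directly by the Java client. A minimal sketch against the stock "test" store from the Voldemort quickstart might look like this; the bootstrap URL and store name are assumptions taken from that quickstart.

```java
import voldemort.client.ClientConfig;
import voldemort.client.SocketStoreClientFactory;
import voldemort.client.StoreClient;
import voldemort.client.StoreClientFactory;
import voldemort.versioning.Versioned;

public class VoldemortSketch {
    public static void main(String[] args) {
        // Bootstrap from any node in the cluster; the metadata tells the client about the rest.
        StoreClientFactory factory = new SocketStoreClientFactory(
                new ClientConfig().setBootstrapUrls("tcp://localhost:6666"));
        StoreClient<String, String> client = factory.getStoreClient("test");

        // put/get/delete is the whole interface; values come back with a vector-clock version.
        client.put("some_key", "some_value");
        Versioned<String> value = client.get("some_key");
        System.out.println(value.getValue() + " @ " + value.getVersion());
        client.delete("some_key");
    }
}
```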

    Just a hint when naming a project: don't name it after one of the most popular key words in muggledom. The only way someone will find your genius via search is with a dark spell. As I am a Good Witch I couldn't find much on Voldemort in the real world. But the idea is great and is very much in line with current thinking on scalable database design. Worth a look.

    Related Articles

  • The CouchDB Project
    Monday
    Jun152009

    Large-scale Graph Computing at Google

    To continue the graph theme Google has got into the act and released information on Pregel. Pregel does not appear to be a new type of potato chip. Pregel is instead a scalable infrastructure...

    ...to mine a wide range of graphs. In Pregel, programs are expressed as a sequence of iterations. In each iteration, a vertex can, independently of other vertices, receive messages sent to it in the previous iteration, send messages to other vertices, modify its own and its outgoing edges' states, and mutate the graph's topology.

    Currently, Pregel scales to billions of vertices and edges, but this limit will keep expanding. Pregel's applicability is harder to quantify, but so far we haven't come across a type of graph or a practical graph computing problem which is not solvable with Pregel. It computes over large graphs much faster than alternatives, and the application programming interface is easy to use. Implementing PageRank, for example, takes only about 15 lines of code. Developers of dozens of Pregel applications within Google have found that "thinking like a vertex," which is the essence of programming in Pregel, is intuitive.
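Pregel's code isn't public, but the description above suggests something like the following vertex-centric PageRank sketch. Every class and method name here is hypothetical, invented only to illustrate "thinking like a vertex"; none of it comes from Google.

```java
// Hypothetical vertex-centric framework, modeled on the description above.
abstract class Vertex {
    abstract double getValue();
    abstract void setValue(double value);
    abstract int getNumOutEdges();
    abstract void sendMessageToAllNeighbors(double message);
    abstract long superstep();
    abstract void voteToHalt();

    // Called once per superstep with the messages sent to this vertex in the previous superstep.
    abstract void compute(Iterable<Double> messages);
}

// Abstract because the (hypothetical) framework would supply the state and messaging methods.
abstract class PageRankVertex extends Vertex {
    @Override
    void compute(Iterable<Double> messages) {
        if (superstep() >= 1) {
            double sum = 0.0;
            for (double m : messages) {
                sum += m;
            }
            setValue(0.15 + 0.85 * sum);   // standard damping factor of 0.85
        }
        if (superstep() < 30) {
            // Spread this vertex's rank evenly over its outgoing edges.
            sendMessageToAllNeighbors(getValue() / getNumOutEdges());
        } else {
            voteToHalt();   // stop once all vertices have halted and no messages are in flight
        }
    }
}
```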

    Pregel does not appear to be publicly available, so it's not clear what the purpose of the announcement could be. Maybe it will be a new gmail extension :-)

    Saturday
    Jun132009

    Neo4j - a Graph Database that Kicks Buttox

    Update: Social networks in the database: using a graph database. A nice post on representing, traversing, and performing other common social network operations using a graph database.

    If you are Digg or LinkedIn you can build your own speedy graph database to represent your complex social network relationships. For those of more modest means Neo4j, a graph database, is a good alternative.

    A graph is a collection of nodes (things) and edges (relationships) that connect pairs of nodes. Slap properties (key-value pairs) on nodes and relationships and you have a surprisingly powerful way to represent most anything you can think of. In a graph database "relationships are first-class citizens. They connect two nodes and both nodes and relationships can hold an arbitrary amount of key-value pairs. So you can look at a graph database as a key-value store, with full support for relationships."

    A graph looks something like:


    For more lovely examples take a look at the Graph Image Gallery.
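As a concrete (if hedged) illustration of nodes, relationships, and properties, here is roughly what creating a tiny social graph looks like with Neo4j's embedded Java API around the 1.0 timeframe; package and class names may differ in other releases, and the store path and property names are just examples.

```java
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.Transaction;
import org.neo4j.kernel.EmbeddedGraphDatabase;

public class Neo4jSketch {
    public static void main(String[] args) {
        GraphDatabaseService graphDb = new EmbeddedGraphDatabase("var/graphdb");
        Transaction tx = graphDb.beginTx();    // everything is transactional
        try {
            Node alice = graphDb.createNode();
            alice.setProperty("name", "Alice");
            Node bob = graphDb.createNode();
            bob.setProperty("name", "Bob");

            // Relationships are first-class: typed, directed, and they can hold properties too.
            Relationship knows = alice.createRelationshipTo(
                    bob, DynamicRelationshipType.withName("KNOWS"));
            knows.setProperty("since", 2009);

            tx.success();
        } finally {
            tx.finish();
        }
        graphDb.shutdown();
    }
}
```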

    Here's a good summary by Emil Eifrem, founder of Neo4j, making the case for why graph databases rule:

    Most applications today handle data that is deeply associative, i.e. structured as graphs (networks). The most obvious example of this is social networking sites, but even tagging systems, content management systems and wikis deal with inherently hierarchical or graph-shaped data.

    This turns out to be a problem because it’s difficult to deal with recursive data structures in traditional relational databases. In essence, each traversal along a link in a graph is a join, and joins are known to be very expensive. Furthermore, with user-driven content, it is difficult to pre-conceive the exact schema of the data that will be handled. Unfortunately, the relational model requires upfront schemas and makes it difficult to fit this more dynamic and ad-hoc data.

    A graph database uses nodes, relationships between nodes and key-value properties instead of tables to represent information. This model is typically substantially faster for associative data sets and uses a schema-less, bottom-up model that is ideal for capturing ad-hoc and rapidly changing data.

    So relational databases can't handle complex relationships. Graph systems are opaque, unmaintainable, and inflexible. OO databases lose flexibility by combining logic and data. Key-value stores require the programmer to maintain all relationships. There, everybody sucks :-)

    Neo4j's Key Characteristics

  • Dual license: open source and commercial.
  • Well suited for many web use cases such as tagging, metadata annotations, social networks, wikis and other network-shaped or hierarchical data sets.
  • An intuitive graph-oriented model for data representation. Instead of static and rigid tables, rows and columns, you work with a flexible graph network consisting of nodes, relationships and properties.
  • Decent documentation, active and responsive email list, a few releases, good buzz. All a good sign for something that has a chance to last a while.
  • Has bindings for a number of languages: Python, Jython, Ruby, and Clojure. No binding for .NET yet; the recommendation is to access it using a REST interface.
  • Disk-based, native storage manager completely optimized for storing graph structures for maximum performance and scalability. SSD ready.
  • Massive scalability. Neo4j can handle graphs of several billion nodes/relationships/properties on a single machine.
  • Frequently outperforms relational backends by >1000x for many increasingly important use cases.
  • Powerful traversal framework for high-speed traversals in the node space.
  • Small footprint. Neo4j is a single <500k jar with one dependency (the Java Transaction API).
  • Simple and convenient object-oriented API.
  • Retrieving children is trivial in a graph database.
  • No need to flatten and serialize an object graph as graphs are native to a graph database.
  • Fully transactional like a real database. Supports JTA/JTS, XA, 2PC, Tx recovery, deadlock detection, etc.
  • Current implementation is built to handle large graphs that don't fit in memory with durability. It's not a cache, it's a fully persistent transactional store.
  • No events or triggers. Planned in a future release.
  • No sharding. A suggestion for how one might shard is here.
  • Some common graph calculations are missing. For example, in a social network finding a common friend for a set of users.
  • Separates data and logic with a more "natural" representation than tables. This makes it easy to use Neo4j as the storage tier for OO code while keeping behaviour and state separate.
  • Neo4j traverses depths of 1000 levels and beyond at millisecond speed. That's many orders of magnitude faster than relational systems.

    Neo4j vs Hadoop

    This post makes an illuminating comparison between Neo4j vs Hadoop:

    In principle, Hadoop and other Key-Value stores are mostly concerned with relatively flat data structures. That is, they are extremely fast and scalable regarding retrieval of simple objects, like values, documents or even objects.

    However, if you want to do deeper traversal of e.g. a graph, you will have to retrieve the nodes for every traversal step (very fast) and then match them yourself in some manner (e.g. in Java or so) - slow.

    Neo4j in contrast is built around the concept of "deep" data structures. This gives you almost unlimited flexibility regarding the layout of your data and domain object graph and very fast deep traversals (hops over several nodes), since they are handled natively by the Neo4j engine down to the storage layer and not your client code. The drawback is that for huge data amounts (>1 billion nodes) the clustering and partitioning of the graph becomes non-trivial, which is one of the areas we are working on.

    Then of course there are differences in the transaction models, consistency and others, but I hope this gives you a very short philosophical answer :)

    It would have never occurred to me to compare the two, but the comparison shows why we need multiple complementary views of data. Hadoop scales the data grid and the compute grid and is more flexible in how data are queried and combined. Neo4j has far lower latencies for complex navigation problems. It's not a zero-sum game.
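To show what "handled natively by the engine" means in practice, here is a hedged sketch of a deep traversal using the old Traverser API, again written against the 1.0-era interfaces and continuing the hypothetical KNOWS graph from the earlier sketch.

```java
import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.ReturnableEvaluator;
import org.neo4j.graphdb.StopEvaluator;
import org.neo4j.graphdb.Traverser;

public class TraversalSketch {
    // Walks the KNOWS network breadth-first from a start node, to arbitrary depth,
    // without any joins: the engine follows relationship records directly.
    static void printReachableFriends(Node start) {
        Traverser traverser = start.traverse(
                Traverser.Order.BREADTH_FIRST,
                StopEvaluator.END_OF_GRAPH,
                ReturnableEvaluator.ALL_BUT_START_NODE,
                DynamicRelationshipType.withName("KNOWS"),
                Direction.OUTGOING);
        for (Node person : traverser) {
            System.out.println(person.getProperty("name"));
        }
    }
}
```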

    Related Articles

  • Neo4j -- or why graph dbs kick ass
  • The current database debate and graph databases by Anders Nawroth
  • On Building a Stupidly Fast Graph Database by Scott Wheeler and the Hacker News Thread
  • Network Model from wikipedia
  • Databases as a service: FathomDB
  • Using Neo4J to load and query OWL ontologies by Sujit Pal
  • Graph Databases and the Future of Large-Scale Knowledge Management by Marko A. Rodriguez
  • Memo To The Semantic Web: Drop “Semantic” And Become The “Graph Web” by Hank Williams
  • Is the Relational Database Doomed? by Tony Bain
  • Neo Database Introduction
  • Neo4j Email List
  • flare Data Visualization for the Web
  • Giant Global Graph by Tim Berners-Lee
  • Tim Berners-Lee -- Linked Data at TED
  • Drop ACID and Think About Data by Bob Ippolito
  • Analyzing and adapting graph algorithms for large persistent graphs by Larsson, Patrik
    Friday
    May292009

    Is Eucalyptus ready to be your private cloud?


    Update: Eucalyptus Goes Commercial with $5.5M Funding Round. This removes my objection that it's an academic project only. Go team go!

    Rich Wolski, professor of Computer Science at the University of California, Santa Barbara, gave a spirited talk on Eucalyptus to a large group of very interested cloudsters at the Eucalyptus Cloud Meetup. If Rich could teach computer science at every school, the state of the computer science industry would be stratospheric. Rich is dynamic, smart, passionate, and visionary. It's that vision that prompted him to create Eucalyptus in the first place. Rich and his group are experts in grid and distributed computing, having a long and glorious history in that space. When he saw cloud computing on the rise he decided the best way to explore it was to implement what everyone accepted as a real cloud, Amazon's API. In a remarkably short time they implemented Eucalyptus and have been improving it and tracking Amazon's changes ever since.

    The question I had going into the meetup was: should Eucalyptus be used to make an organization's private cloud? The short answer is no. Wait wait, it's now yes, see the update at the beginning of the article.

    The project is of high quality, the people are of the highest quality, but in the end Eucalyptus is a research project from a university. As an academic project Eucalyptus is subject to changes in funding and the research interests of the team. When funding sources dry up so does the project. If the team finds another research area more interesting, or if they get tired of chasing a continuous stream of new Amazon features, or no new grad students sign on, which will happen in a few years, then the project goes dark.

    Fears over continuity have at least two solutions: community support and commercial support. Eucalyptus could become a community-supported open source project. This is unlikely to happen though, as it conflicts with the research intent of Eucalyptus. The Eucalyptus team plans to control the core for research purposes and encourage external development of add-on services like SQS. Eucalyptus won't go commercial as university projects must steer clear of commercial pretensions. Amazon is "no comment" on Eucalyptus, so it's not clear what they would think of commercial development should it occur.

    Taken together these concerns imply Eucalyptus is not a good base for an enterprise quality private cloud. Which they readily admit. It's not enterprise ready, Rich repeats. It's not that the quality isn't there. It is and will be. And some will certainly base their private cloud on Eucalyptus, but when making a decision of this type you have to be sure your cloud infrastructure will be around for the long haul. With Eucalyptus that is not necessarily the case. Eucalyptus is still a good choice for its original research purpose, or as a cheap staging platform for Amazon, or as a base for temporary clouds, but as your rock solid private cloud infrastructure of the future Eucalyptus isn't the answer.

    The long answer is a little more nuanced and interesting.

    The primary purpose for Eucalyptus is research. It was never meant to be our little untethered private Amazon cloud. But if it works, why not?

    Eucalyptus is Not a Full Implementation of the Amazon Stack

    Eucalyptus implements most of EC2 and a little of S3. They hope to get community support for the rest. That of course makes Eucalyptus far less interesting as a development platform. But if your use for Eucalyptus is as an instant provisioning framework you are still in the game. Their emulation of EC2 is so good RightScale was able to operate on top of Eucalyptus. Impressive.

    But even in the EC2 arena I have to wonder for how long they'll track Amazon development. If you are a researcher implementing every new Amazon feature is going to get mighty old after a while. It will be time to move on and if you are dependent on Eucalyptus you are in trouble. Sure, you can move to Amazon but what about that $1 million data center buildout?

    If you're developing software not tied to the Amazon service stack, Eucalyptus would work great.

    As an Amazon developer I would want my code to work without too much trouble in both environments. Certainly you can mock the different services for testing or create a service layer to hide different implementations, but that's not ideal and makes Eucalyptus as an Amazon proxy less attractive.

    One of the uses for Eucalyptus is to make Amazon cheaper and easier by testing code locally without having to deploy into Amazon all the time. Given the size of images, the bandwidth and storage costs add up after a while, so this could make Eucalyptus a valuable part of the development process.

    Eucalyptus is Not as Scalable as Amazon

    No kidding. Amazon has an army of sysadmins, network engineers, and programmers to make their system work at such ginormous scales. Eucalyptus was built on smarts, grit and pizza. It will never scale as well as Amazon, but Eucalyptus is scalable to 256 nodes right now. Which is not bad.

    Rich thinks that, with some work they already know about, it could scale to 5000 nodes. Not exactly Amazon scale, but good enough for many data center dreams.

    One big limit Eucalyptus has is the self-imposed requirement to work well in any environment. It's just a tarball you can install on top of any network. They rightly felt this was necessary for adoption. Telling potential customers they need to set up a special network before they can test your software tends to slow down adoption. By making Eucalyptus work as an overlay they soothed a lot of early adopter pain.

    But by giving up control of the machines, the OS, the disk, and the network they limited how scalable they can be. There's more to scalability than just software. Amazon has total control and that gives them power. Eucalyptus plans to make more invasive and more scalable options available in the future.

    Lacks Some Private Cloud Features

    Organizations interested in a private cloud are often interested in:

  • Control
  • Privacy and Security
  • Utility Chargeback System
  • Instant Provisioning Framework
  • Multi-tenancy
  • Temporary Infrastructure for Proof of Concept for "Real" Provisioning
  • Cloud Management Infrastructure

    Eucalyptus satisfies many of these needs, but a couple are left wanting:
  • The Utility Chargeback System allows companies to bill back departments for the resources they use and is a great way to get around a rigid provisioning process while still providing accountability back to the budgeting process. Eucalyptus won't do this for you.
  • A first class Cloud Management Infrastructure is not part of Eucalyptus because it's not part of Amazon's API. Amazon doesn't expose their internal management process. Eucalyptus is adding some higher level management tools, but they'll be pretty basic.

    These features may or may not be important to you.

    Clouds vs Grids

    Endless pixels have been killed defining clouds, grids, and how they are different enough that there's really a whole new market to sell into. Rich actually makes a convincing argument that grids and clouds are different and do require a completely different infrastructure. The differences:

    Cloud

  • Full private cluster is provisioned
  • Individual user can only get a tiny fraction of the total resource pool
  • No support for cloud federation except through the client interface
  • Opaque with respect to resources

    Grid

  • Built so that individual users can get most, if not all of the resources in a single request
  • Middleware approach takes federation as a first principle
  • Resources are exposed, often as bare metal

    Related Articles

  • Get Off of My Cloud by M. Jagger and K. Richards.
  • Rich Wolski's Home Page
  • Enomaly
  • Nimbus
    Sunday
    May172009

    Product: Hadoop

    Update 5: Hadoop Sorts a Petabyte in 16.25 Hours and a Terabyte in 62 Seconds, and has its green cred questioned because it took 40 times the number of machines Greenplum used to do the same work.

    Update 4: Introduction to Pig. Pig allows you to skip programming Hadoop at the low map-reduce level. You don't have to know Java. Using the Pig Latin language, which is a scripting data flow language, you can think about your problem as a data flow program. 10 lines of Pig Latin = 200 lines of Java.

    Update 3: Scaling Hadoop to 4000 nodes at Yahoo!. 30,000 cores with nearly 16 PB of raw disk; sorted 6 TB of data in 37 minutes; 14,000 map tasks write (read) 360 MB (about 3 blocks) of data into a single file, with a total of 5.04 TB for the whole job.

    Update 2: Hadoop Summit and Data-Intensive Computing Symposium Videos and Slides. Topics include: Pig, JAQL, Hbase, Hive, Data-Intensive Scalable Computing, Clouds and ManyCore: The Revolution, Simplicity and Complexity in Data Systems at Scale, Handling Large Datasets at Google: Current Systems and Future Directions, Mining the Web Graph, and Sherpa: Hosted Data Serving.

    Update: Kevin Burton points out Hadoop now has a blog and an introductory video starring Beyonce. Well, the Beyonce part isn't quite true.

    Hadoop is a framework for running applications on large clusters of commodity hardware using a computational paradigm named map/reduce, where the application is divided into many small fragments of work, each of which may be executed on any node in the cluster. It replicates much of Google's stack, but it's for the rest of us. Jeremy Zawodny has a wonderful overview of why Hadoop is important for large website builders:

    For the last several years, every company involved in building large web-scale systems has faced some of the same fundamental challenges. While nearly everyone agrees that the "divide-and-conquer using lots of cheap hardware" approach to breaking down large problems is the only way to scale, doing so is not easy. The underlying infrastructure has always been a challenge. You have to buy, power, install, and manage a lot of servers. Even if you use somebody else's commodity hardware, you still have to develop the software that'll do the divide-and-conquer work to keep them all busy. It's hard work. And it needs to be commoditized, just like the hardware has been...

    Hadoop also provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both map/reduce and the distributed file system are designed so that node failures are automatically handled by the framework. Hadoop has been demonstrated on clusters with 2000 nodes. The current design target is 10,000-node clusters.

    The obvious question of the day is: should you build your website around Hadoop? I have no idea. There seem to be a few types of things you do with lots of data: process, transform, and serve. Yahoo literally has petabytes of log files, web pages, and other data they process. Process means to calculate on, that is: figure out affinity, categorization, popularity, click-throughs, trends, search terms, and so on. Hadoop makes great sense for them for the same reasons it does for Google. But does it make sense for your website?

    If you are YouTube and you have petabytes of media to serve, do you really need map/reduce? Maybe not, but the clustered file system is great. You get high bandwidth with the ability to transparently extend storage resources. Perfect for when you have lots of stuff to store. YouTube would seem like it could use a distributed job mechanism, like you can build with Amazon's services. With that you could create thumbnails, previews, transcode media files, and so on. When they have Hbase up and running that could really spike adoption. Everyone needs to store structured data in a scalable, reliable, highly performing data store. That's an exciting prospect for me. I can't wait for experience reports about "normal" people, familiar with a completely different paradigm, adopting this infrastructure. I wonder what animal O'Reilly will use on their Hadoop cover?
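For a feel of what "small fragments of work" means in code, here is the canonical word-count job sketched against the 0.20-era org.apache.hadoop.mapreduce API, trimmed to its mapper and reducer (the job-driver boilerplate is omitted).

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: emit (word, 1) for every token in the input line.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts for each word; the framework groups values by key.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```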

    See Also

  • Open Source Distributed Computing: Yahoo's Hadoop Support by Jeremy Zawodny
  • Yahoo!'s bet on Hadoop by Tim O'Reilly
  • Hadoop Presentations
  • Running Hadoop MapReduce on Amazon EC2 and Amazon S3

