In Memory Data Grid Technologies
After winning a CSC Leading Edge Forum (LEF) research grant, I (Paul Colmer) wanted to publish some of the highlights of my research to share with the wider technology community.
What is an In Memory Data Grid?
It is not an in-memory relational database, a NoSQL database or a traditional relational database. It is a different breed of software datastore.
In summary, an IMDG is an ‘off the shelf’ software product that exhibits the following characteristics:
- The data model is distributed across many servers in a single location or across multiple locations. This distribution is known as a data fabric, and this distributed model is known as a ‘shared nothing’ architecture.
- All servers can be active in each site.
- All data is stored in the RAM of the servers.
- Servers can be added or removed non-disruptively, to increase the amount of RAM available.
- The data model is non-relational and is object-based.
- Distributed applications written on the .NET and Java application platforms are supported.
- The data fabric is resilient, allowing non-disruptive automated detection and recovery of a single server or multiple servers.
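To make the distributed, non-relational model above concrete, here is a minimal sketch in Python (a toy model, not any vendor's API) of how an IMDG spreads key/value entries across the RAM of several servers:

```python
import hashlib

class GridSketch:
    """Toy illustration of how an IMDG spreads key/value entries across nodes."""

    def __init__(self, node_names):
        # Each "node" stands in for the RAM of one server in the fabric.
        self.node_names = list(node_names)
        self.ram = {name: {} for name in self.node_names}

    def _owner(self, key):
        # Hash the key so entries spread evenly across the cluster.
        digest = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
        return self.node_names[digest % len(self.node_names)]

    def put(self, key, value):
        # The value is an arbitrary object, not a relational row.
        self.ram[self._owner(key)][key] = value

    def get(self, key):
        return self.ram[self._owner(key)].get(key)

grid = GridSketch(["node-a", "node-b", "node-c"])
grid.put("order:42", {"customer": "acme", "total": 99.5})
print(grid.get("order:42"))   # {'customer': 'acme', 'total': 99.5}
```

Real products layer replication, rebalancing and client APIs on top of this basic idea, but the hashing of keys to nodes is the essence of the data fabric.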
There are also hardware appliances that exhibit all these characteristics. I use the term ‘in-memory data grid appliance’ to describe this group of products; they were excluded from my research.
There are six products in the market that I would consider for a proof of concept, or as a starting point for a product selection and evaluation:
- VMware Gemfire (Java)
- Oracle Coherence (Java)
- Alachisoft NCache (.Net)
- Gigaspaces XAP Elastic Caching Edition (Java)
- Hazelcast (Java)
- Scaleout StateServer (.Net)
And here are the rest of the products available on the market now that I consider IMDGs:
- IBM eXtreme Scale
- Terracotta Enterprise Suite
- JBoss (Red Hat) Infinispan
Relative newcomers to this space, and worth watching closely, are Microsoft and Tibco.
Why would I want an In Memory Data Grid?
Let’s compare this with our old friend the traditional relational database:
- Performance – using RAM is faster than using disk. There is no need to try to predict what data will be used next; it’s already in memory, ready to use.
- Data Structure – using a key/value store gives the application developer greater flexibility. The data model and application code are more closely linked than they would be with a relational structure.
- Operations – Scalability and resiliency are easy to provide and maintain. Software / hardware upgrades can be performed non-disruptively.
How does an In Memory Data Grid map to real business benefits?
- Competitive Advantage – businesses will make better decisions faster.
- Safety – businesses can improve the quality of their decision-making.
- Productivity – improved business process efficiency reduces waste and is likely to improve profitability.
- Improved Customer Experience – provides the basis for a faster, more reliable web service, which is a strong differentiator in the online business sector.
How do you use an In Memory Data Grid?
- Simply install your servers in a single site or across multiple sites. Each group of servers within a site is referred to as a cluster.
- Install the IMDG software on all the servers and choose the appropriate topology for the product. For multi-site operations I always recommend a partitioned and replicated cache.
- Set up your APIs or GUI interfaces to allow replication between the various servers.
- Develop your data model and the business logic around the model.
With a partitioned and replicated cache, you simply partition the cache across the servers in whatever way best suits the business need you are trying to fulfil, and the replication ensures there are sufficient copies across all the servers. This means that if a server dies, there is no effect on the business service, provided of course that you have provisioned enough capacity.
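That partition-plus-replica behaviour can be sketched as follows (a toy Python model, not any vendor's API): each key gets a primary node and one backup node, so losing a single server loses no data:

```python
import zlib

class PartitionedReplicatedCache:
    """Toy model: each key lives on a primary node plus one backup node."""

    def __init__(self, node_names):
        self.nodes = {name: {} for name in node_names}

    def _placement(self, key):
        # Stable hash picks the primary; the next node in order is the backup.
        names = sorted(self.nodes)
        i = zlib.crc32(key.encode()) % len(names)
        return names[i], names[(i + 1) % len(names)]

    def put(self, key, value):
        for name in self._placement(key):
            self.nodes[name][key] = value      # write both copies

    def get(self, key):
        for name in self._placement(key):
            if key in self.nodes.get(name, {}):
                return self.nodes[name][key]
        return None

    def kill(self, name):
        del self.nodes[name]                   # simulate a server dying

cache = PartitionedReplicatedCache(["node-a", "node-b", "node-c"])
cache.put("booking:1", "LHR-DUB")
cache.kill("node-a")                           # one server lost
print(cache.get("booking:1"))                  # still served: LHR-DUB
```

Because the two copies always live on different nodes, any single failure leaves at least one copy reachable, which is exactly the "no effect on the business service" property described above.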
The key here is to design a topology that mitigates the business risk, so that if a server or a site becomes inoperable, the service keeps running seamlessly in the background.
There are also some tough decisions you may need to make regarding data consistency versus performance: you can trade performance for stronger data consistency, and vice versa.
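As a toy illustration of that trade-off (a sketch with hypothetical names, not any product's API): a synchronous write updates the replica before returning, giving consistency at the cost of latency, while an asynchronous write returns immediately and replicates later:

```python
class TradeoffCache:
    """Toy cache with one primary and one replica copy of each entry."""

    def __init__(self):
        self.primary, self.replica, self.pending = {}, {}, []

    def put(self, key, value, synchronous=True):
        self.primary[key] = value
        if synchronous:
            self.replica[key] = value            # slower write, replica always consistent
        else:
            self.pending.append((key, value))    # fast write, replica briefly stale

    def replicate_pending(self):
        # Background replication catching up (eventual consistency).
        for key, value in self.pending:
            self.replica[key] = value
        self.pending.clear()

cache = TradeoffCache()
cache.put("a", 1, synchronous=True)        # replica sees it immediately
cache.put("b", 2, synchronous=False)       # replica lags until replication runs
stale = "b" not in cache.replica           # True: async write not yet replicated
cache.replicate_pending()
consistent = cache.replica["b"] == 2       # True after the background catch-up
```

Most IMDG products expose this as a per-cache or per-region setting, so the right balance can be chosen per business need.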
Are there any proven use cases for In Memory Data Grid adoption?
Oh yes, and if you’re a competitor in these markets, you may want to rethink your solution.
Financial Services: Improve decision-making, profitability and market competitiveness through increased performance in financial stock-trading markets. Reduction in processing times from 60 minutes to 60 seconds.
Online Retailer: Providing a highly available, easily maintainable and scalable solution for 3+ million visitors per month in the online card retailer market.
Aviation: Three-site active / active / active flight booking system for a major European budget-airline carrier. Three sites are London, Dublin and Frankfurt.
Check out the VMware Gemfire and Alachisoft NCache websites for more details on these proven use cases.
About the Author:
Paul Colmer is a technology consultant working for CSC and director and active professional musician for Music4Film.net. He specialises in Cloud Computing, Social Business and Solution Architecture. He is based in Brisbane, Australia. http://www.linkedin.com/pub/paul-colmer/6/894/539
Reader Comments (25)
What is this post about?
Your article is an excellent summary of in-memory data grids and their benefits; thanks for sharing the results of your work. Near the end of your post you mention that an IMDG can provide a competitive edge due to its enabling faster application performance. We agree, and at ScaleOut Software we’ve extended the performance benefit of an IMDG by integrating a computational engine with our product. This enables fast analysis of stored data using the popular "map/reduce" style of computation. Users across a wide range of vertical applications find that this can give them a further competitive advantage by speeding up data mining and decision making.
I’m sure you’re aware of the huge buzz around Hadoop’s map/reduce model for analyzing big data. We use the same powerful approach, but by storing data in memory in the IMDG instead of in a distributed file system, we have observed much faster performance. (Many interesting datasets fit within the memory of an IMDG.) A recent test of a financial "stock back-testing" application demonstrated a 16X improvement over Hadoop. It also turns out that an IMDG is easier to use for data analysis because the IMDG's object-oriented view of data integrates naturally into Java and C#.
One final note: you list ScaleOut StateServer as a .NET solution. That’s correct, but it also provides equivalent support for Java. We believe that ScaleOut may be the only true cross platform IMDG since our underlying engine is written in C and includes both .NET and Java libraries. In fact, you can mix Java and .NET objects on both Windows and Linux IMDG servers operating as a single IMDG.
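The in-memory map/reduce style described in this comment can be sketched generically (plain Python, not ScaleOut's actual API): the map step runs against each node's local in-memory partition, and the reduce step merges the partial results:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Trade records partitioned across nodes, as an IMDG would hold them in RAM.
partitions = [
    [{"symbol": "ABC", "qty": 100}, {"symbol": "XYZ", "qty": 50}],
    [{"symbol": "ABC", "qty": 25}],
]

def map_partition(records):
    """Map step: runs locally against one node's in-memory partition."""
    out = {}
    for record in records:
        out[record["symbol"]] = out.get(record["symbol"], 0) + record["qty"]
    return out

def merge(left, right):
    """Reduce step: combine the partial results from each node."""
    for symbol, qty in right.items():
        left[symbol] = left.get(symbol, 0) + qty
    return left

with ThreadPoolExecutor() as pool:
    partials = list(pool.map(map_partition, partitions))
totals = reduce(merge, partials, {})
print(totals)   # {'ABC': 125, 'XYZ': 50}
```

The speed-up over file-system-based map/reduce comes from the data already being in RAM and the map step never leaving the node that owns the partition.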
There is no mention of memcached. Can you also please compare the access APIs/protocols for each product?
I think this post misses the point.
Why do you mix languages with technologies?
You put support for .NET and Java among the requisites of an IMDG?? What?
This post is only a list of well known best practices and some advertising for products.
I think this post misses the point, and by a wide margin too.
I get the overall impression that the author has spent more time reading the marketing blurbs and hype than actually understanding the requirements in this space and using the products. It's so limited on technical information and heavy on simplistic marketing commentary that it's almost disinformation.
One simple point... as soon as you go distributed you must consider the network. Getting data out of distributed memory via the network is going to be orders of magnitude slower than retrieving data from local memory / disk. Obviously this can be solved by processing data locally; Hadoop and some of the products mentioned are great examples of this, but it's not mentioned once.
Was this a sponsored post?
You really ought to mention TIBCO's ActiveSpaces as one of the products to seriously consider! Especially the new version 2.0 which has some unique features.
This post was quite painful to read. No mention of the downsides with regard to the trade-offs around CAP? How about operational durability during cascading network events? There's a place and time for everything, and IMDGs can be quite useful, but treating them like a drop-in replacement for the other technologies borders on the irresponsible.
Great posts guys. A couple of important points. It is NOT a sponsored post, I decided to share some of my research highlights, as many people outside of the specialist IT community have not heard of IMDGs and even those in the community may not know where they should start.
The technical detail is contained within the actual research paper (over 150 pages in 2 documents). What you see here is just a summary. I'll try and post some more of the detail over the next few weeks / months, so watch this space. If you're genuinely interested in those products, please contact the vendors. Many of them offer 30 day trials. Please post any genuine problems you find. Vendors are great at listening and fixing things up.
Keep the feedback coming. :-)
Paul, your article is a very good overview of IMDGs for non-techies. Thanks for posting it.
Some of the differentiators among the products you mentioned are partitioning model, indexing, remote execution and distributed transaction support.
You might want to know that GigaSpaces XAP also comes in a .Net version, not just Java. The premium edition provides further scalability for components like the web server, so it can basically replace the old generation of app servers.
For financial trading, 1 second is considered an eternity, not to mention 60, but for large batch or analytics jobs (e.g. reconciliation or risk calculations) you may get this type of reduction in computation time.
Other interesting existing use cases from the real world are a social search engine, telco OSS software and online gaming.
Combining IMDG and big-data can also add some interesting solutions (imagine both of them deployed elastically on the cloud...).
I have read the specification; it looks like you have reimplemented Erlang + Mnesia...
Another popular use case is using an In Memory Data Grid with a Big Data solution to speed up data processing time.
You can read more details in that regard in another post on this site:
Build Your Own Twitter Like Real Time Analytics - A Step By Step Guide
Here is a great link to a real use case that shows the transactional throughput potential of this new breed of products.
http://www.csc.com/features/stories/48346-controlling_the_swiss_rail_system
Well, I would tend to disagree.
While it is well known that the fastest SQL query is no query at all (or something like that), I still have trouble seeing the real-life interest of these in-memory DBs.
First there is the cost: to be effective you need a cluster of computers dedicated to it (compare with MySQL or PostgreSQL, which are free and often bundled with the OS of your main server).
Second, while I would agree that speed is paramount, real-time data (I do mean REAL TIME) is not (repeat, not) the key for most of the businesses out there. A latency of a few seconds is usually acceptable (except for stock exchanges and banking, but people there do not need a paper like this to be aware of the potential of the proposed solution).
Third, the speed of an internet application can be considered as the sum of: time for the request to travel over the network, time for the server-side script to process, time for the DB to process, time for the response to travel to the client, and time for the client to process. In day-to-day business, a DB that provides the needed data within about 200 ms (say, for a join of 5 tables each having 61 million rows) is considered acceptable, and while it is true an IMDB may (or may not, we will come to that soon) be faster, in the long run and with network latency such a gain will be negligible.
Now there are 2 cases you should have spoken of:
1) Typically a server with low memory will be slow, so ideally, to avoid that, the IMDB has to live on distant hardware (a cluster of them would almost be needed). So we simply double the network latency, and there is no way to avoid that if we need to keep the IMDB scalable.
2) The use of de-normalized (or un-normalized) tables is the typical approach these days for any SQL DB. The table holds only what you need, avoiding joins, grouping and sorting. While it increases the work of the tech in charge, it also reduces the work of the DB (not only faster, but better scalability too).
Concretely, we tested an IMDB against MySQL, and the gains (yes, there were some) were so small with a huge DB (about 50 tables, each over 61 million rows) that we did not pursue the idea.
@Eric:
You will not get a join of 5 tables each containing 61 million rows in 200ms in ANY relational database. Period. Even in the most highly optimized of relational database designs, big data enterprise apps can push them to their limits quite easily and data interactions often become the bottleneck.
An in-memory data grid should not be confused with an "in-memory database", as it is not a database whatsoever. Think of it as a distributed cache, but fault-tolerant. Reads/writes to memory are always multitudes faster than performing the same actions against a database (even with in-memory tables, as there is still a network transaction required to perform operations).
While I do agree that true real-time data is not a common requirement of most smaller client-server scenarios, it is quickly becoming a demand in most large-scale scenarios today. The question isn't whether they really need it; it's whether they want it.
As for concerns relating to costs, memory is cheaper than ever now, and an incredible feature of not just most IMDGs but almost all of the emerging cutting-edge high-performance/availability technologies is that they are designed to run on very cheap hardware to be cost-effective. One could argue about the memory cost for big-data-type solutions, but the same could be said for the relational databases that would otherwise be used. A little research will reveal that literally all of the biggest players on the net are either already using, or transitioning to, high-availability implementations in any phase of the application cycle they can. The technologies available are just incredible, and they are just getting started...
Great article Paul. Thanks for shedding a spotlight on IMDGs, and explaining things at a level that can be understood by a broader audience.
I see that you mentioned the Alachisoft NCache solution for the .Net space. I would just like to mention that Alachisoft has now launched JvCache, their native Java In-Memory Data Grid, which brings the rich functionality and reliability of NCache to the Java world.
You can read more about it at JvCache- In-Memory Data Grid
Great post Paul, helped a lot.
I recently got a chance to explore TIBCO's ActiveSpaces, a whole new product offering in-memory grid computing with a shared-nothing persistence approach. It's ActiveSpaces that made me dig more into IMDG software capabilities.
But I'm still not convinced about the scalability part.
If we have the flexibility to add or remove servers from the grid at runtime, and each node holds a replica of the same data, then how is scalability achieved?
It's like you have 10 million records in node 1 with 8 GB of RAM, then you add another node with 8 GB of RAM, which essentially joins the in-memory grid, but the data is still replicated, which means we have 16 GB of RAM holding the same 10 million records.
Can anybody give me any insight into it?
Reetesh,
To answer your question about scalability: in the case of TIBCO's ActiveSpaces, replication is a degree that you define (per space), and it is active/active distributed replication.
For your example of 10 M rows representing 8 Gigs of data:
- replication degree 0 means no replication: over 2 nodes it means 5 M rows and 4 Gigs (50% of the data) is stored on each node
- replication degree 1 means one replication: over 2 nodes it means each node 'owns' 5 M rows and 4 Gigs (50% of the data) and also 'replicates' 5 M rows and 4 Gigs (the remaining 50% of the data, so each node will use 8 Gigs of RAM)
Now if you use 4 nodes rather than 2 with the same replication degree of 1, each node would 'own' 2.5M rows and 2 Gigs (25% of the data) and 'replicate' another 2.5M rows and 2 Gigs (25% of the data).
And so on... you can deploy as many nodes as you want (and can use any degree of replication).
Hope this answers your question.
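The arithmetic in that answer can be captured in a few lines (a sketch; `per_node_footprint` is a hypothetical helper, not an ActiveSpaces API):

```python
def per_node_footprint(total_rows, total_gb, nodes, replication_degree):
    """Rows and GB each node holds: its owned partition plus replicated partitions."""
    copies = 1 + replication_degree       # one owned copy plus N replicas
    owned_rows = total_rows / nodes
    owned_gb = total_gb / nodes
    return owned_rows * copies, owned_gb * copies

# 10M rows / 8 GB with replication degree 1:
print(per_node_footprint(10_000_000, 8, 2, 1))   # (10000000.0, 8.0) - full copy per node
print(per_node_footprint(10_000_000, 8, 4, 1))   # (5000000.0, 4.0)  - half per node
```

So scalability comes from adding nodes while holding the replication degree fixed: the per-node footprint shrinks even though every entry still has a backup copy.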
I want to comment on In-Memory distributed Database vs In-Memory distributed data grid point. While I agree that Data Grid is not a Database, there are Data Grid implementations out there that do support Ad Hoc SQL queries, including distributed joins, aggregations, etc.
For example, Apache Ignite has a scalable distributed JCache compliant In-Memory Data Grid with full support for SQL queries (ANSI 99 compliant). It also has one of the fastest ACID transaction implementations today.
There is also GridGain In-Memory Data Fabric which provides enterprise features on top of Apache Ignite, as well as C++ and .NET support.
Why wouldn't I use an in-memory database? Runs in memory too.
Hi, maybe this is a stupid question: does this mean that the data is always kept in memory and never saved to disk storage?
Hahd, it flushes data to disk from time to time.
GigaSpaces XAP delivers a hybrid data storage architecture that leverages both on-heap and off-heap memory. Off-heap may use RAM or SSD. Using SSD allows users to form a very large, cost-effective data grid with a small HW footprint that can perform very fast key-value data access as well as complex aggregations and computations across large data sets.
With XAP you may use local SSD (PCIe/SATA) or central SSD (a flash array device) for maximum HA and reuse of existing IT storage resources.
See more: http://docs.gigaspaces.com/xap102adm/memoryxtend-rocksdb-ssd.html
This post is 6 years old now. I'm researching data grid tech for my current employer and some of these are not relevant anymore. Gemfire was bought by Pivotal but they aren't actively developing it anymore; consider dumping it. Also, GridGain is one of the main players in this arena these days, not mentioned in your article.
Time keeps on slipping into the future...
Is there any tutorial comparing an IMDG with a relational database?
We implemented our data model on MySQL using Spring Data JPA. We would like to move to Pivotal Gemfire (IMDG).
Just to clarify, can't we add foreign keys in Gemfire (IMDG)?