Entries in nosql (54)

Monday
Jul092012

Data Replication in NoSQL Databases

This is the third guest post (part 1, part 2) of a series by Greg Lindahl, CTO of blekko, the spam-free search engine. Previously, Greg was Founder and Distinguished Engineer at PathScale, where he was the architect of the InfiniPath low-latency InfiniBand HCA, used to build tightly-coupled supercomputing clusters.

blekko's home-grown NoSQL database was designed from the start to support a web-scale search engine, with 1,000s of servers and petabytes of disk. Data replication is a very important part of keeping the database up and serving queries. Like many NoSQL database authors, we decided to keep R=3 copies of each piece of data in the database, and not use RAID to improve reliability. The key goal we were shooting for was a database which degrades gracefully when there are many small failures over time, without needing human intervention.
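As a rough illustration of what R=3 placement can look like, here's a short Python sketch using rendezvous hashing to pick three servers in distinct racks. This is an assumption-laden toy, not blekko's actual placement logic; the server names and the rack-awareness rule are made up.

```python
import hashlib

# Toy R=3 replica placement via rendezvous (highest-random-weight) hashing.
# This is an illustrative sketch, not blekko's actual scheme.
R = 3

def pick_replicas(key, servers):
    """servers: list of (hostname, rack). Prefer distinct racks so a single
    rack failure can't take out all R copies of a piece of data."""
    def weight(host):
        return hashlib.md5((key + "/" + host).encode()).hexdigest()
    ordered = sorted(servers, key=lambda s: weight(s[0]), reverse=True)
    chosen, used_racks = [], set()
    # First pass: one replica per rack; second pass: fill any remaining slots.
    for require_new_rack in (True, False):
        for host, rack in ordered:
            if len(chosen) == R:
                return chosen
            if host in chosen:
                continue
            if require_new_rack and rack in used_racks:
                continue
            chosen.append(host)
            used_racks.add(rack)
    return chosen

replicas = pick_replicas("url:example.com/page",
                         [("s1", "r1"), ("s2", "r1"), ("s3", "r2"),
                          ("s4", "r2"), ("s5", "r3")])
```

Because the placement is deterministic per key, any node can recompute where the copies live, and losing one server invalidates only that server's share of the replicas rather than a whole RAID group.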

Why don't we like RAID for big NoSQL databases?

Click to read more ...

Monday
Apr302012

Masstree - Much Faster than MongoDB, VoltDB, Redis, and Competitive with Memcached

The EuroSys 2012 system conference has an excellent live blog summary of their talks for: Day 1, Day 2, Day 3 (thanks Henry at the Paper Trail blog). Summaries for each of the accepted papers are here.

One of the more interesting papers from a NoSQL perspective was Cache Craftiness for Fast Multicore Key-Value Storage, a wonderfully detailed description of the low level techniques used to implement Masstree:

A storage system specialized for key-value data in which all data fits in memory, but must persist across server restarts. It supports arbitrary, variable-length keys. It allows range queries over those keys: clients can traverse subsets of the database, or the whole database, in sorted order by key. On a 16-core machine Masstree achieves six to ten million operations per second on parts A–C of the Yahoo! Cloud Serving Benchmark, more than 30 times as fast as VoltDB [5] or MongoDB [2].

If you are looking for innovative detailed high performance design, this paper is for you. An example from the section on writer-writer coordination: 

Masstree writers coordinate using per-node spinlocks. A node’s lock is stored in a single bit in its version counter. Any modification to a node’s keys or values requires holding the node’s lock. Some data is protected by other nodes’ locks, however. A node’s parent pointer is protected by its parent’s lock, and a border node’s prev pointer is protected by its previous sibling’s lock. This minimizes the simultaneous locks required by split operations; when an interior node splits, for example, it can assign its children’s parent pointers without obtaining their locks.
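To make the version-counter trick concrete, here's a toy, single-threaded Python sketch of a lock bit embedded in a node's version word. The bit layout and names are hypothetical; the real Masstree is C++ and manipulates this word with atomic compare-and-swap.

```python
# Toy sketch of a per-node version word whose low bit doubles as a spinlock.
# Bit layout and names are hypothetical; real Masstree uses atomic CAS.
LOCK_BIT      = 1 << 0    # held while a writer is modifying the node
SPLITTING_BIT = 1 << 1    # set while the node is being split
COUNT_SHIFT   = 2         # higher bits: count of completed modifications

class Node:
    def __init__(self):
        self.version = 0  # one machine word in the real implementation

    def try_lock(self):
        # Stand-in for an atomic compare-and-swap on self.version.
        if self.version & LOCK_BIT:
            return False          # another writer holds the lock: spin
        self.version |= LOCK_BIT
        return True

    def unlock(self):
        # Bump the modification count so optimistic readers that sampled
        # the version before our write can detect the change and retry,
        # then release the lock bit.
        count = (self.version >> COUNT_SHIFT) + 1
        self.version = count << COUNT_SHIFT  # lock/splitting bits now clear
```

Packing the lock into the version word means readers and writers contend on a single cache line per node, which is the kind of cache craftiness the paper's title refers to.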

Here's the live blog writeup:

Click to read more ...

Monday
Feb202012

Berkeley DB Architecture - NoSQL Before NoSQL was Cool

After the filesystem and simple library packages like dbm, Berkeley DB was the original luxury embedded database, widely used by applications as their core database engine. NoSQL before NoSQL was cool. The hidden secret making complex applications sing. If you want to dispense with all the network overhead of a server-based system, it's still a good choice.

There's a great writeup of the architecture behind Berkeley DB in the book The Architecture of Open Source Applications. If you want to understand more about how a database works, or if you are pondering how to build your own, it's rich in detail, explanations, and lessons. Here's the Berkeley DB chapter from the book. It covers topics like: Architectural Overview; The Access Methods: Btree, Hash, Recno, Queue; The Library Interface Layer; The Buffer Manager: Mpool; Write-ahead Logging; The Lock Manager: Lock; The Log Manager: Log; The Transaction Manager: Txn.
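For a feel of the embedded, no-server model, here's a minimal sketch using the bsddb3 Python bindings to the C library. The file and key names are made up, and the exact binding API can vary by version; treat this as illustrative rather than definitive.

```python
# Minimal embedded-database sketch using the bsddb3 bindings to Berkeley DB.
# No server process, no network round trips: the database is a local file
# opened in-process. Names are illustrative; binding details may vary.
from bsddb3 import db

d = db.DB()
d.open("app.db", dbtype=db.DB_BTREE, flags=db.DB_CREATE)  # Btree access method
d.put(b"user:42", b"alice")
assert d.get(b"user:42") == b"alice"

# With the Btree access method, a cursor walks keys in sorted order.
cur = d.cursor()
rec = cur.first()
while rec is not None:
    key, value = rec
    print(key, value)
    rec = cur.next()
cur.close()
d.close()
```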

Click to read more ...

Tuesday
Feb072012

Hypertable Routs HBase in Performance Test -- HBase Overwhelmed by Garbage Collection

This is a guest post by Doug Judd, original creator of Hypertable and the CEO of Hypertable, Inc.

Hypertable delivers 2X better throughput in most tests -- HBase fails the 41 billion and 167 billion record insert tests, overwhelmed by garbage collection -- both systems deliver similar results for the random read uniform test

We recently conducted a test comparing the performance of Hypertable (@hypertable) version 0.9.5.5 to that of HBase (@HBase) version 0.90.4 (CDH3u2) running Zookeeper 3.3.4.  In this post, we summarize the results and offer explanations for the discrepancies. For the full test report, see Hypertable vs. HBase II.

Introduction

Hypertable and HBase are both open source, scalable databases modeled after Google's proprietary Bigtable database.  The primary difference between the two systems is that Hypertable is written in C++, while HBase is written in Java.  We modeled this test after the one described in section 7 of the Bigtable paper and tuned both systems for maximum performance.  The test was run on a total of sixteen machines connected together with gigabit Ethernet.  The machines had the following configuration:

Click to read more ...

Tuesday
Jan242012

The State of NoSQL in 2012

This is a guest post by Siddharth Anand, a senior member of LinkedIn's Distributed Data Systems team. 

Preamble Ramble

If you’ve been working in the online (i.e., internet) space over the past 3 years, you are no stranger to terms like “the cloud” and “NoSQL”.

In 2007, Amazon published a paper on Dynamo. The paper detailed how Dynamo, employing a collection of techniques to solve several problems in fault-tolerance, provided a resilient solution to the online shopping cart problem. A few years go by while engineers at AWS toil in relative obscurity, standing up their public cloud.

It’s December 2008 and I am a member of Netflix’s Software Infrastructure team. We’ve just been told that there is something called the “CAP theorem” and because of it, we are to abandon our datacenter in hopes of leveraging Cloud Computing.

Huh?

Click to read more ...

Thursday
Jan052012

Shutterfly Saw a Speedup of 500% With Flashcache

In the "should I or shouldn't I" debate around deploying SSD, it always helps to have real-world data. Fiesta! with a live-blog summary of a presentation by Kenny Gorman on Shutterfly on MongoDB Performance Tuning.

What if you still need more performance after doing all of this tuning? One option is to use SSDs. Shutterfly uses Facebook’s flashcache: kernel module to cache data on SSD. Designed for MySQL/InnoDB. SSD in front of a disk, but exposed as a single mount point. This only makes sense when you have lots of physical I/O. Shutterfly saw a speedup of 500% w/ flashcache. A benefit is that you can delay sharding: less complexity.
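As a back-of-the-envelope check on why an SSD cache pays off only when you are bound by physical I/O, here's a small Python sketch. The latency figures are illustrative assumptions, not Shutterfly's measurements.

```python
# Effective read latency with an SSD cache in front of a disk.
# Latency figures are illustrative assumptions, not Shutterfly's numbers.
disk_ms, ssd_ms = 10.0, 0.2   # per random read

def effective_latency_ms(hit_ratio):
    return hit_ratio * ssd_ms + (1.0 - hit_ratio) * disk_ms

for hr in (0.0, 0.5, 0.9, 0.95):
    eff = effective_latency_ms(hr)
    print(f"hit ratio {hr:4.0%}: {eff:5.2f} ms  "
          f"({disk_ms / eff:4.1f}x faster than disk alone)")
# Around a 90% hit ratio the cache alone gives a speedup in the same
# ballpark as the 500% figure above, with no application changes.
```

If the working set already fits in RAM there is little physical I/O to accelerate, which is why the tuning steps come first and the SSD comes last.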

The whole series of posts has a lot of great information and is worth a longer look, especially if you are considering using MongoDB. 

Click to read more ...

Tuesday
Nov012011

Finding the Right Data Solution for Your Application in the Data Storage Haystack

The InfoQ article Finding the Right Data Solution for Your Application in the Data Storage Haystack makes a series of concrete recommendations for a user who wants to find the right storage solution for his application.  

A few years back, SQL RDBMSs were the solution for almost all storage needs, but we all know how scaling came along and shattered that perfect dream. Then NoSQL happened, and now we have ended up with a haystack of solutions. For example, local memory, relational databases, files, distributed caches, column-family storage, document storage, name-value pairs, graph DBs, service registries, queues, and tuple spaces are some classes of such solutions.

We often discuss how to find the right storage solution, and we make these choices whenever we design a system. But when it comes to describing how to select the right one, we usually end up giving only very high-level guidelines. The article argues that the way to make more concrete recommendations is to drill down into a bit more detail and consider the choices case by case.

To that end, the article takes four parameters that describe an application or use case (Scale, Consistency, Type of Data, and Queries needed), enumerates some 40+ cases that arise from different combinations of those parameter values, and makes one or more concrete recommendations on the right storage solution for each case, as sketched below.
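Here's a hypothetical Python sketch of that approach, treating the four parameters as a lookup key. The specific mappings below are illustrative stand-ins, not the article's actual table of cases.

```python
from typing import NamedTuple

# Hypothetical sketch of the article's method: four parameters form a case,
# and each case maps to a recommendation. These mappings are illustrative,
# not the article's actual 40+ case table.
class Case(NamedTuple):
    scale: str        # "single-node" | "cluster" | "web-scale"
    consistency: str  # "strong" | "eventual"
    data_type: str    # "structured" | "semi-structured" | "unstructured"
    queries: str      # "key-lookup" | "range" | "ad-hoc"

RECOMMENDATIONS = {
    Case("single-node", "strong", "structured", "ad-hoc"):
        "relational database",
    Case("web-scale", "eventual", "semi-structured", "key-lookup"):
        "key-value or column-family store",
    Case("web-scale", "eventual", "unstructured", "key-lookup"):
        "distributed file/blob storage",
}

def recommend(case: Case) -> str:
    return RECOMMENDATIONS.get(case, "no single answer: drill down further")

print(recommend(Case("web-scale", "eventual", "semi-structured", "key-lookup")))
```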

What follows are the four parameters and potential values they can take and the recommendations for structured, semi-structured, and unstructured data: 

Click to read more ...

Tuesday
Sep132011

Must see: 5 Steps to Scaling MongoDB (Or Any DB) in 8 Minutes

Jared Rosoff concisely, effectively, entertainingly, and convincingly gives an 8 minute tutorial on scaling MongoDB at Scale Out Camp. The ideas aren't just limited to MongoDB; they work for most any database: Optimize your queries; Know your working set size; Tune your file system; Choose the right disks; Shard. Below is a quick taste of the first strategy, followed by an explanation of all 5:
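As a taste of "optimize your queries", here's a small pymongo sketch of checking a query plan and adding the index it calls for. The collection and field names are made up for illustration.

```python
# Sketch of "optimize your queries" with pymongo: inspect the plan, then
# add the index the query needs. Collection and field names are made up.
from pymongo import ASCENDING, MongoClient

coll = MongoClient()["shop"]["orders"]

plan = coll.find({"customer_id": 42}).explain()
print(plan)  # a collection scan here means every document is examined

coll.create_index([("customer_id", ASCENDING)])
# Re-running explain() should now show an index scan instead.
```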

Click to read more ...

Tuesday
Sep062011

Big Data Application Platform

It's time to think about the architecture and application platforms surrounding "Big Data" databases. Big Data is often centered around new database technologies, mostly from the emerging NoSQL world. The main challenge these databases solve is how to handle massive amounts of data at a reasonable cost and without poor performance. Distributed databases emerged to address this challenge, and today we're seeing high adoption rates and quite impressive success stories, such as Netflix's use of the Cassandra/DataStax solution. All of this indicates how quickly this market is evolving.

The need for a Big Data Application Platform

Click to read more ...

Thursday
Aug042011

Jim Starkey is Creating a Brave New World by Rethinking Databases for the Cloud

Jim Starkey, founder of NuoDB, in this thread on the Cloud Computing group, delivers a masterful post on why he thinks the relational model is the best overall compromise amongst the different options, why NewSQL can free itself from the limitations of legacy SQL architectures, and how this creates a brave new lock-free world...

I'll [Jim Starkey] go into more detail later in the post for those who care, but the executive summary goes like this:  Network latency is relatively high and human attention span is relatively low.  So human-facing computer systems have to perform their work in a small number of trips between the client and the database server.  But the human condition leads inexorably to data complexity.  There are really only two strategies to manage this problem. One is to use coarse granularity storage, glomming together related data into a single blob and letting intelligence on the client make sense of it.  The other is storing fine granularity data on the server and using intelligence on the server to aggregate data to be returned to the client.

NoSQL uses the former for a variety of reasons...
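Here's a small sketch of the two strategies with a shopping cart as the running example; the names and schema are illustrative, not anything from NuoDB.

```python
# The two strategies from the quote, sketched with a shopping cart.
# Names and schema are illustrative.

# Strategy 1: coarse granularity. Store the cart as one blob; one round
# trip fetches it, and intelligence on the client makes sense of it.
cart_blob = {"user": 42,
             "items": [{"sku": "A1", "qty": 2, "price": 9.99},
                       {"sku": "B7", "qty": 1, "price": 4.50}]}
total = sum(i["qty"] * i["price"] for i in cart_blob["items"])  # client-side

# Strategy 2: fine granularity. Store items as individual rows; one round
# trip asks the server to aggregate and return only the answer.
AGGREGATE_SQL = """
SELECT SUM(qty * price) AS total
FROM cart_items
WHERE user_id = 42;
"""
```

Either way the work fits in one client-server round trip; the difference is whether the intelligence that makes sense of the data lives on the client or on the server.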

Click to read more ...