Entries in key-value store (19)

Wednesday
Aug052009

Anti-RDBMS: A list of distributed key-value stores

Update 8: Introducing MongoDB by Eliot Horowitz

Update 7: The Future of Scalable Databases by Robin Mathew.

Update 6: NoSQL : If Only it Was that Easy. BJ Clark lays down the law on which databases are scalable: Tokyo - NO, Redis - NO, Voldemort - YES, MongoDB - Not Yet, Cassandra - Probably, Amazon S3 - YES * 2, MySQL - NO. The real thing to point out is that if you are being held back from making something super awesome because you can’t choose a database, you are doing it wrong.
Update 5: Exciting stuff happening in Japan at this Key-Value Storage meeting in Tokyo. Presentations on Groonga, Senna, Lux IO, Tokyo-Cabinet, Tx, repcached, Kai, Cagra, kumofs, ROMA, and Flare.

Update 4: NoSQL and the Relational Model: don’t throw the baby out with the bathwater by Matthew Willson. So my key point is, this kind of modelling is WORTH DOING, regardless of which database tool you end up using for physical storage.
Update 3: Choosing a non-relational database; why we migrated from MySQL to MongoDB. An illuminating article explaining why Boxed Ice moved to MongoDB over MySQL and other NoSQL options: easy install, PHP support, replication and master-master support, good docs, and auto-sharding on the road map. They still use MySQL for billing.
Update 2: They are now called NoSQL databases. So keep up! Eric Lai wrote a good article in Computerworld No to SQL? Anti-database movement gains steam about the phenomena. There was even a NoSQL conference. It was unfortunately full by the time I wanted to sign up, but there are presentations by all the major players. Nice Hacker News thread too.
Update: Some Notes on Distributed Key Stores by Leonard Lin. What's the best way to handle a fast-growing system with 100M items that requires low latency and lots of inserts? Leonard takes a trip through several competing systems. The winner was: Tokyo Cabinet.

Richard Jones has put together a very nice list of various key-value stores around the internets. The list includes: Project Voldemort, Ringo, Scalaris, Kai, Dynomite, MemcacheDB, ThruDB, CouchDB, Cassandra, HBase, and Hypertable. Richard also includes some commentary and their basic components (language, fault tolerance, persistence, client protocol, data model, docs, community).

There's an excellent discussion in the comments of Paxos vs Vector Clock techniques for synchronizing writes in the face of network failures.

Friday
Jul172009

Against all the odds

This article is not about Mariah Carey or her song. It's about storage systems and databases.

First, let's describe what I mean by "the odds": in my social network, I found that 93% of mainstream developers sanctify the database, or at least consider it the ultimate, superhero, undefeatable solution to any data persistence challenge.

Personally, I think this problem comes from education, and I think some companies are also involved in it.

To start fixing this mindset, we should all agree on the following points:

  • Every challenge has its own solutions, so whatever you want to save or persist, there are always many options. For example, web search engines such as Google, Kngine, Yahoo, and Bing don't use a database at all; instead they use indexes (index files) for better performance.
  • The database in general, whatever the vendor, is slow compared with other solutions such as key-value stores, index files, and DHTs.
  • Databases currently employ the relational data model (or the object-relational data model), so don't convince yourself to force non-relational data into a relational store such as a database.
  • The database system architecture hasn't changed much in the last 30 years; it carries a lot of limitations and falls short on performance and scalability. If you don't believe me, check out these papers:
  1. The End of an Architectural Era (It's Time for a Complete Rewrite)

  2. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks

I hope you agree with me on the previous points. So the question is: do we really need a database in every application?

There are many scenarios that shouldn't use a database, such as web search engines, caching, file sharing systems, and DNS. On the other hand, there are many scenarios that should use a database, such as customer databases, address books, and ERP systems.

A tiny URL service, for example, shouldn't use a database at all, because its needs are very simple: just map a short/tiny URL to the real/big URL. If you're starting to agree with me, you'll likely ask: but what can we use beside, or instead of, a database?
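To make the tiny URL case concrete, here is a minimal sketch of the whole service as a key-value mapping in Python; the in-memory dict stands in for whatever key-value store you choose, and the 7-character hash code is an arbitrary illustrative scheme, not any particular product's:

```python
import hashlib

# In-memory stand-in for a key-value store (e.g. memcached, Tokyo Cabinet).
store = {}

def shorten(long_url):
    # Derive a short code from a hash of the URL; 7 hex characters is
    # an arbitrary choice for this sketch.
    code = hashlib.sha1(long_url.encode("utf-8")).hexdigest()[:7]
    store[code] = long_url          # a single put: no schema, no joins
    return code

def expand(code):
    return store.get(code)          # a single get by key

code = shorten("http://example.com/some/very/long/path")
```

The entire "data model" is one put and one get, which is why a relational database is overkill here.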

There are a lot of tools that follow the CAP/BASE model instead of the ACID model. But first, let's describe ACID:

  • Atomicity: A transaction is all or nothing
  • Consistency: Only valid data is written to the database
  • Isolation: Pretend all transactions are happening serially and the data is correct
  • Durability: what you write is what you get
  1. The problem with ACID is that it gives you too much; it trips you up when you are trying to scale a system across multiple nodes.

  2. Down time is unacceptable. So your system needs to be reliable. Reliability requires multiple nodes to handle machine failures.

  3. To make scalable systems that can handle lots and lots of reads and writes you need many more nodes.

  4. Once you try to scale ACID across many machines you hit problems with network failures and delays. The algorithms don't work in a distributed environment at any acceptable speed.
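To see what those single-node ACID guarantees look like before we give them up, here is a small sketch using Python's built-in SQLite module; the two-account schema and the hand-rolled balance check are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('a', 100), ('b', 0)")
conn.commit()

def transfer(amount):
    """Move money between accounts: either both updates happen or neither."""
    try:
        with conn:  # the with-block is one transaction: commit on success, rollback on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = 'a'", (amount,))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = 'b'", (amount,))
            (bal,) = conn.execute("SELECT balance FROM accounts WHERE name = 'a'").fetchone()
            if bal < 0:
                raise ValueError("insufficient funds")
    except ValueError:
        pass  # rolled back: the partial debit never becomes visible

transfer(60)    # succeeds, both rows updated atomically
transfer(60)    # fails the balance check and rolls back atomically
```

On one machine this atomicity is cheap; the point above is that coordinating it across many machines over an unreliable network is where it breaks down.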

On the other hand, the CAP model is about:

  • Consistency: Your data is correct all the time. What you write is what you read.

  • Availability: You can read and write your data all the time.

  • Partition Tolerance: If one or more nodes fail, the system still works and becomes consistent when the system comes back on-line.
  1. Systems built around CAP trade-offs are easy to scale and distribute; they are scalable by nature.

  2. Everyone who builds big applications builds them on CAP trade-offs: Google, Yahoo, Facebook, Kngine, Amazon, eBay, etc.

For example, an in-memory or on-disk caching system will never need all of a database's features; you just need a CAP-like system. Today there are a lot of column-oriented and key-value-oriented systems. But first, let's describe what column-oriented means:

A column-oriented database management system (DBMS) stores its content by column rather than by row. This has advantages for workloads such as data warehouses and library catalogues, where aggregates are computed over large numbers of similar data items. This approach contrasts with row-oriented databases and with correlation databases, which use a value-based storage structure. For more information, see the Wikipedia page.
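A toy sketch in Python makes the row/column distinction concrete; the three-record "sales" table is an illustrative assumption:

```python
# Three records of a hypothetical "sales" table.
rows = [
    {"id": 1, "region": "us", "amount": 10},
    {"id": 2, "region": "eu", "amount": 20},
    {"id": 3, "region": "us", "amount": 30},
]

# Row-oriented: each record is stored contiguously -- good for
# "fetch the whole record for id 2".
row_store = {r["id"]: r for r in rows}

# Column-oriented: each column is stored contiguously -- good for
# aggregates that scan one column and skip the rest.
col_store = {
    "id":     [r["id"] for r in rows],
    "region": [r["region"] for r in rows],
    "amount": [r["amount"] for r in rows],
}

# An aggregate touches only the "amount" column:
total = sum(col_store["amount"])

# A point lookup touches only one row:
record = row_store[2]
```

Similar values stored adjacently are also why column stores compress so well.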

Distributed key-value stores:

Distributed column stores (Bigtable-like systems):

Something a little different:

Resource:

Thursday
Jul022009

Product: Facebook's Cassandra - A Massive Distributed Store

Update 2: Presentation from the NoSQL conference: slides, video.
Update: Why you won't be building your killer app on a distributed hash table by Jonathan Ellis. Why I think Cassandra is the most promising of the open-source distributed databases --you get a relatively rich data model and a distribution model that supports efficient range queries. These are not things that can be grafted on top of a simpler DHT foundation, so Cassandra will be useful for a wider variety of applications.

James Hamilton has published a thorough summary of Facebook's Cassandra, another scalable key-value store for your perusal. It's open source and is described as a "BigTable data model running on a Dynamo-like infrastructure." Cassandra is used in Facebook as an email search system containing 25TB and over 100m mailboxes.

  • Google Code for Cassandra - A Structured Storage System on a P2P Network
  • SIGMOD 2008 Presentation.
  • Video Presentation at Facebook
  • Facebook Engineering Blog for Cassandra
  • Anti-RDBMS: A list of distributed key-value stores
  • Facebook Cassandra Architecture and Design by James Hamilton
  • Thursday
    Jul022009

    Product: Project Voldemort - A Distributed Database

    Update: Presentation from the NoSQL conference: slides, video 1, video 2.

    Project Voldemort is an open source implementation of the basic parts of Dynamo (Amazon’s Highly Available Key-value Store) distributed key-value storage system. LinkedIn is using it in their production environment for "certain high-scalability storage problems where simple functional partitioning is not sufficient."

    From their website:

  • Data is automatically replicated over multiple servers.
  • Data is automatically partitioned so each server contains only a subset of the total data
  • Server failure is handled transparently
  • Pluggable serialization is supported to allow rich keys and values including lists and tuples with named fields, as well as to integrate with common serialization frameworks like Protocol Buffers, Thrift, and Java Serialization
  • Data items are versioned to maximize data integrity in failure scenarios without compromising availability of the system
  • Each node is independent of other nodes with no central point of failure or coordination
  • Good single node performance: you can expect 10-20k operations per second depending on the machines, the network, and the replication factor
  • Support for pluggable data placement strategies to support things like distribution across data centers that are geographically far apart.

    They also have a nice design page going over some of their architectural choices: key-value store only, no complex queries or joins; consistent hashing is used to assign data to nodes; JSON is used for schema definition; versioning and read-repair for distributed consistency; a strict layered architecture with put, get, and delete as the interface between layers.
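The versioning piece of that design is usually done with vector clocks; here is a minimal sketch of how two versions of a value might be compared (representing a clock as a plain dict of node name to counter is an illustrative simplification, not Voldemort's actual implementation):

```python
def descends(a, b):
    """True if clock a has seen every event clock b has (a >= b)."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def compare(a, b):
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a-newer"       # a supersedes b: b can be discarded
    if descends(b, a):
        return "b-newer"
    return "conflict"          # concurrent writes: the reader must reconcile

# Two replicas diverged: node X wrote again on one, node Y on the other.
v1 = {"X": 2, "Y": 1}
v2 = {"X": 1, "Y": 2}
```

A "conflict" result is exactly the case where read-repair kicks in: the client picks (or merges) a winner and writes it back.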

    Just a hint when naming a project: don't name it after one of the most popular key words in muggledom. The only way someone will find your genius via search is with a dark spell. As I am a Good Witch I couldn't find much on Voldemort in the real world. But the idea is great and is very much in line with current thinking on scalable database design. Worth a look.

    Related Articles

  • The CouchDB Project
  • Tuesday
    May052009

    Drop ACID and Think About Data

    The abstract for the talk given by Bob Ippolito, co-founder and CTO of Mochi Media, Inc:

    Building large systems on top of a traditional single-master RDBMS data storage layer is no longer good enough. This talk explores the landscape of new technologies available today to augment your data layer to improve performance and reliability. Is your application a good fit for caches, bloom filters, bitmap indexes, column stores, distributed key/value stores, or document databases? Learn how they work (in theory and practice) and decide for yourself.
    Bob does an excellent job highlighting different products and the key concepts to understand when pondering the wide variety of new database offerings. It's unlikely you'll be able to say "oh, this is the database for me" after watching the presentation, but you will be much better informed about your options. And I imagine slightly confused as to what to do :-)

    An interesting observation in the talk is that the more robust products are either internal to large companies like Amazon and Google or are commercial. A lot of the open source products aren't yet considered ready for prime-time, and Bob encourages developers to join a project and make patches rather than start yet another half-finished key-value store clone. From my monitoring of the interwebs this does seem to be happening, and existing products are starting to mature.

    Of all the choices discussed, the column database Vertica seems closest to Bob's heart, and it's the product they use. It supports clustering, column storage, compression, bitmapped indexes, bloom filters, grids, and lots of other useful features. And most importantly: it works, which is always a plus :-) Here's a summary of some of the points talked about in the presentation:
  • Video Presentation of Drop ACID and Think About Data

    ACID

  • The claim to fame for relational databases is that they make the ACID promise:
    * Atomicity - a transaction is all or nothing
    * Consistency - only valid data is written to the database
    * Isolation - pretend all transactions are happening serially and the data is correct
    * Durability - what you write is what you get
  • The problem with ACID is that it gives you too much, it trips you up when you are trying to scale a system across multiple nodes.
  • Down time is unacceptable. So your system needs to be reliable. Reliability requires multiple nodes to handle machine failures.
  • To make scalable systems that can handle lots and lots of reads and writes you need many more nodes.
  • Once you try to scale ACID across many machines you hit problems with network failures and delays. The algorithms don't work in a distributed environment at any acceptable speed.

    CAP

  • If you can't have all of the ACID guarantees, it turns out you can have two of the following three characteristics:
    * Consistency - your data is correct all the time. What you write is what you read.
    * Availability - you can read and write your data all the time.
    * Partition Tolerance - if one or more nodes fail, the system still works and becomes consistent when the system comes back on-line.

    BASE

  • The types of large systems based on CAP aren't ACID, they are BASE (har har):
    * Basically Available - the system seems to work all the time
    * Soft State - it doesn't have to be consistent all the time
    * Eventually Consistent - becomes consistent at some later time
  • Everyone who builds big applications builds them on CAP and BASE: Google, Yahoo, Facebook, Amazon, eBay, etc

    Google's BigTable

  • Google BigTable - manages data across many nodes.
  • Paxos (Chubby) - distributed transaction algorithm that manages locks across systems.
  • BigTable Characteristics:
    * Stores data in tablets using GFS, a distributed file system.
    * Compression - great gains in throughput; you can store more and you reduce the IO bottleneck, because storing less means talking to the disks less, so performance improves.
    * Single master - one node knows everything about all the other nodes (backed up and cached).
    * Hybrid between row and column database:
      ** Row database - stores objects together.
      ** Column database - stores attributes of objects together. Makes sequential retrieval very fast, allows very efficient compression, and reduces disk seeks and random IO.
    * Versioning.
    * Bloom filters - allow data to be distributed across a bunch of nodes. A calculation on the data probabilistically maps the data to the nodes it can be found on.
    * Eventually consistent - an append-only system using a row time stamp. When a client queries, it gets several versions and is in charge of picking the most recent.
  • Pros:
    * Compression is awesome. He really thinks compression is an important attribute of a system.
    * Clients are probably simple.
    * Integrates with map-reduce.
  • Cons:
    * Proprietary to Google - you can't use it on your system.
    * Single-master - could be a downside, but not sure.

    Amazon's Dynamo

  • A giant distributed hash table, called a key-value store.
  • Uses consistent hashing to distribute data to one or more nodes for redundancy and performance.
    * Consistent hashing - a ring of nodes; a hash function picks which node(s) store the data.
    * Consistency between nodes is based on vector clocks and read repair.
    * Vector clocks - a time stamp on every row for every node that has written to it.
    * Read repair - when a client does a read and the nodes disagree on the data, it's up to the client to select the correct data and tell the nodes the new correct state.
  • Pros:
    * No master - eliminates the single point of failure.
    * Highly available for writes - this is the partition tolerance aspect of CAP. You can write to many nodes at once, so depending on the number of replicas maintained (which is configurable) you should always be able to write somewhere. Users will never see a write failure.
    * Relatively simple, which is why we see so many clones.
  • Cons:
    * Proprietary.
    * Clients have to be smart to handle read-repair, rebalancing a cluster, hashing, etc. Client proxies can handle these responsibilities, but that adds another hop.
    * No compression, so IO isn't reduced.
    * Not suitable for column-like workloads; it's just a key-value store, so it's not optimized for analytics. Aggregate queries, for example, aren't in its wheelhouse.
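The consistent-hashing scheme described above can be sketched in a few lines of Python; MD5 as the ring hash, three nodes, and two replicas are all illustrative choices for this sketch, not Dynamo's actual parameters:

```python
import bisect
import hashlib

def ring_position(value):
    # Position on the ring: an integer derived from an MD5 digest.
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

nodes = ["node-a", "node-b", "node-c"]
ring = sorted((ring_position(n), n) for n in nodes)   # the hash ring
positions = [p for p, _ in ring]

def preference_list(key, replicas=2):
    """The first `replicas` nodes walking clockwise from the key's position."""
    start = bisect.bisect(positions, ring_position(key)) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(replicas)]

owners = preference_list("user:1234")
```

Because node positions are fixed by the hash, adding or removing a node only remaps the keys between it and its ring neighbor, instead of rehashing everything.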

    Facebook's Cassandra

  • Peer-to-peer so no master like Dynamo
  • Storage model more like BigTable
  • Pros:
    * Open source. You can use it.
    * Incrementally scalable - as data grows you can add more nodes to the storage mesh.
    * Minimal administration - because it's incremental, you don't have to do a lot of up-front planning for migration.
  • Cons:
    * Not polished yet. It was built for in-box searching, so it may not work well for other use cases.
    * No compression yet.

    Distributed Database Musings

  • Distributed databases are the new web framework.
  • None are awesome yet. No obvious winners.
  • There are many clones with partial features implemented.
    * For example, Project Voldemort has no rebalancing and no garbage collection.
  • Pick one and start submitting patches. Don't start another half-baked clone.

    Simple Key-Value Store

  • Some people are using simple key-value stores to replace relational databases.
  • A key (array of bytes) maps using a hash to a value (a BLOB). It's like an associative array.
  • They are really fast and simple.

    Memcached Key-Value Stores

  • Is a key-value store that people use as a cache.
  • No persistence
  • RAM only
  • LRU so it throws data away on purpose when there's too much data
  • Lightning fast
  • Everyone uses it, so it's well supported.
  • A good first strategy in removing load from the database.
  • Dealing with mutable data in a cache is really hard. Adding a cache to an ACID system is something you'll probably get wrong, and it's difficult to debug because it does away with several ACID properties:
    * Isolation is gone with multiple writers. Everyone sees the current written value, whereas in a database you see a consistent view of the data.
    * On a transaction failure the cache may reflect the new data after it has been rolled back in the database.
    * Dependent cache keys are difficult to program correctly because they aren't transactional. Consistency is impossible to keep. Update one key, and what happens to the dependent keys?
    * It's complicated, and you'll get it wrong and lose some of the consistency that the database had given you.
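That "first strategy" is the classic cache-aside pattern; in this sketch a plain Python dict stands in for memcached and the database read is a stub (a real deployment would use a memcached client and set an expiry time):

```python
cache = {}                      # stand-in for memcached
db_reads = 0                    # counts how often we hit the "database"

def slow_db_read(key):
    # Placeholder for the expensive relational query being offloaded.
    return "row-for-" + key

def get(key):
    """Cache-aside read: try the cache first, fall back to the database."""
    global db_reads
    if key in cache:
        return cache[key]       # cache hit: the database is never touched
    value = slow_db_read(key)   # cache miss: read through and populate
    db_reads += 1
    cache[key] = value
    return value

def update(key, new_value):
    # After writing new_value to the database (omitted in this sketch),
    # invalidate rather than update the cache entry; invalidation sidesteps
    # some of the mutable-data races described above.
    cache.pop(key, None)

get("42")
get("42")                       # second call is served from the cache
```

Even this tiny version shows the fragility: if `update` ran between a reader's miss and its cache fill, the cache could hold stale data, which is the isolation problem the bullet list warns about.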

    Tokyo Cabinet/Tyrant Key-Value Store

  • Similar use cases as for BerkeleyDB.
  • Disk persistence. Can store data larger than RAM.
  • Performs well.
  • Actively developed. Lots of developers adding new features (but not bug fixes).
  • Similar replication strategy to MySQL. Not useful for scalability as it limits the write throughput to one node.
  • Optional compressed pages so has some compression advantages.

    Redis Data Structure Store

  • Very new.
  • It's a data structure store not a key-value store, which means it understands your values so you can operate on them. Typically in a key-value store the values are opaque.
  • Can match on key spaces. You can look for all keys that match an expression.
  • Understands lists and sets, so you can do list and set operations in-process on the database server, which is much more efficient because all the data doesn't have to be paged to the client. Updates can then be done atomically on the server side, which is difficult to do on the client side.
  • Big downside: it requires the full data set to fit in RAM. It can't keep data only on disk.
  • It is disk backed so it is reliable over a reboot, but still RAM limited.
  • Maybe useful as a cache that supports higher level operations than memcache.
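The server-side set operations are the interesting part; the sketch below mimics the tag-intersection use case with plain Python sets standing in for Redis SETs (a real client would issue SADD and SINTER commands to the server; the keys and tags here are made up for illustration):

```python
# Tag sets for three stored items, as Redis would hold them in SET keys.
tags = {
    "post:1": {"redis", "nosql", "cache"},
    "post:2": {"redis", "nosql", "queue"},
    "post:3": {"mysql", "sql"},
}

def common_tags(*keys):
    """What SINTER does on the server: intersect the sets without
    shipping each full set across the network to the client."""
    result = None
    for key in keys:
        s = tags.get(key, set())
        result = set(s) if result is None else result & s
    return result if result is not None else set()

shared = common_tags("post:1", "post:2")
```

The win with Redis is that this intersection happens inside the server process, so only the (usually small) result travels over the wire.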

    Document Databases

  • Schema-free. You don't have to say which attributes are in the values that are stored.
  • Makes schema migration easier because it doesn't care what fields you have. Applications must be coded to handle the different versions of documents.
  • Great for storing documents.

    CouchDB Document Database

  • Apache project so you can use it.
  • Written in Erlang
  • Asynchronous replication. Still limited to the write speed of one node.
  • JSON based so easy to use on the web.
  • Queries are done in a map-reduce style using views.
    - A view is created by writing a Javascript function that is applied to all documents in the document store. This creates a matching list of documents.
    - Views are materialized on demand. Once you hit the view once, it saves the list until an update occurs.
  • Neat admin UI.

    MongoDB Document Database

  • Written in C++
  • Significantly faster than CouchDB
  • JSON and BSON (binary JSON-ish) formats.
  • Asynchronous replication with auto-sharding coming soon.
  • Supports indexes. Querying on a property is quick because an index is automatically kept on updates. Trades off some write speed for more consistent read speed.
  • Documents can be nested, unlike CouchDB, which requires applications to keep track of relationships. The advantage is that the whole object doesn't have to be written and read, because the system knows about the relationship. An example is a blog post and its comments: in CouchDB the post and comments are stored together, and creating a view walks through all the comments even when you are only interested in the blog post. Nesting gives better write and query performance.
  • More advanced queries than CouchDB.
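The nesting difference can be made concrete with the blog post example; here the BSON document is shown as a plain Python dict purely for illustration (the fields are made up, not MongoDB's schema, since document stores have none):

```python
# One nested document: the post and its comments travel together,
# so fetching or querying the post never requires a join.
post = {
    "_id": "post-1",
    "title": "Drop ACID and Think About Data",
    "comments": [
        {"author": "alice", "text": "Great talk."},
        {"author": "bob",   "text": "Vertica looks interesting."},
    ],
}

# Appending a comment touches just this one document:
post["comments"].append({"author": "carol", "text": "What about Redis?"})

comment_count = len(post["comments"])
```

Because the store understands the nesting, an update like the append above can target the embedded array without rewriting the whole object.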

    Column Databases

  • Some of this model is implemented by BigTable, Cassandra, and HyperTable.
  • Sequential reads are fast because data in a column is stored together.
  • Columns compress better than rows because the data is similar.
  • Each column is stored separately, so IO is efficient: only the columns of interest are scanned. When using a column database you are almost always scanning the entire column.
  • Bitmap indexes for fast sequential scans.
    * Turn cell values into 1 or more bits.
    * Compression reduces IO even further.
    * Indexes can be logically ANDed and ORed together to know which rows to select.
    * Used for big queries and for performing joins of multiple tables. When a row is 2 bits (for example) there's a lot less IO than working on uncompressed, unbitmapped values.
  • Bloom Filters
    * Used by BigTable, Cassandra, and other projects.
    * A probabilistic data structure.
    * Lossy, so you can lose data.
    * In exchange for losing data you can store all the information in constant space.
    * Gives you false positives at a known error rate.
    * Store a bloom filter for a bunch of nodes. Squid uses this for its cache protocol. Knowing a bloom filter, you can locally perform a computation to know which nodes the data may be on. If a bit is not set in the filter, the data isn't on the node. If a bit is set, it may or may not be on the node.
    * Used for finding stuff and approximating counts in constant space.
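A bloom filter is small enough to sketch whole; this toy version uses two hash positions per item and a fixed 1024-bit array (both illustrative choices), and answers only "definitely not present" or "maybe present":

```python
import hashlib

SIZE = 1024                      # bits; sized for the expected item count

def _bit_positions(item):
    # Two hash positions per item, both derived from one MD5 digest.
    digest = hashlib.md5(item.encode("utf-8")).digest()
    return [int.from_bytes(digest[:8], "big") % SIZE,
            int.from_bytes(digest[8:16], "big") % SIZE]

class BloomFilter:
    def __init__(self):
        self.bits = [False] * SIZE

    def add(self, item):
        for b in _bit_positions(item):
            self.bits[b] = True

    def might_contain(self, item):
        # False means definitely absent; True means *maybe* present,
        # with false positives at a rate set by SIZE and the item count.
        return all(self.bits[b] for b in _bit_positions(item))

bf = BloomFilter()
for key in ("row-1", "row-2", "row-3"):
    bf.add(key)
```

This is exactly the "is the row possibly in this tablet/node?" check: a negative answer lets a reader skip a disk seek or a network hop entirely.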

    MonetDB Column Database

  • Research project which crashes a lot and corrupts your data.

    LucidDB Column Database

  • Java/C++ open source data warehouse
  • No clustering so only single node performance, but that can be enough for the applications column stores are good at.
  • No experience so can't speak to it.

    Vertica Column Database

  • The product they use.
  • Commercial column store based on C-store.
  • Clustered
  • Actually works.

    Related Articles

  • Availability & Consistency by Werner Vogels
  • BASE: An Acid Alternative by Dan Pritchett
  • MongoDB - a high-performance, open source, schema-free document-oriented data store that's easy to deploy, manage and use.
  • Vertica - blazing-fast data warehousing software
  • LucidDB - the first and only open-source RDBMS purpose-built entirely for data warehousing and business intelligence.
  • CouchDB - a distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API.
  • Memcached Tag at High Scalability
  • Key-Value Store Tag at High Scalability
  • BigTable Tag at High Scalability
  • Dynamo.


  • Thursday
    Apr232009

    Which key-value pair database should be used?

    My table has 2 columns. Column 1 is an id; column 2 contains information given by users about the item in column 1. A user can give 3 types of information about an item. I separate the opinions of a single user by commas, and the opinions of different users by semicolons. Example: 23-34,us,56;78,in,78. I need to calculate the opinions of all users very fast. My idea is to have an index on the key so searching would be very fast. Currently I'm using MySQL. My problem is that the maximum column size is below my requirement: if any overflow occurs I make a new row with the same id and insert the data into the new row. Practically I would have around 5-10 rows maximum for each id. I'm looking for a database which removes this application code. I just learned about key-value pair databases, which is exactly what I need, but one which doesn't put a constraint on column size (I mean, is much better than an RDBMS in this respect). This application is not in production.


    Thursday
    Mar192009

    Product: Redis - Not Just Another Key-Value Store

    With the introduction of Redis, your options in the key-value space just grew, and your choice of which to pick just got a lot harder. But when you think about it, that's not a bad position to be in at all. Redis (REmote DIctionary Server) is a key-value database. It's similar to memcached, but the dataset is not volatile, and values can be strings, exactly as in memcached, but also lists and sets with atomic operations to push/pop elements. The key points are: open source; speed (benchmarked performing 110,000 SET operations and 81,000 GETs per second); persistence, done asynchronously while the whole dataset is kept in memory; and support for higher-level data structures and atomic operations. The home page is well organized, so I'll spare the excessive-copying-to-make-this-post-longer. For a good overview of Redis take a look at Antonio Cangiano's article: Introducing Redis: a fast key-value database. If you are looking for a way to understand how Redis is different from something like Tokyo Cabinet/Tyrant, Ezra Zygmuntowicz has done a good job explaining their different niches:

    Redis has a different use case than Tokyo does. I think Tokyo is better for long-term persistent data storage. But Redis is much better as a state/data structure server. Redis is faster than Tokyo by quite a bit, but is not immediately durable, as the writes to disk happen in the background at certain trigger points, so it is possible to lose a little bit of data if the server crashes.

    But the power of Redis comes in its data types. Having LISTS and SETS as well as string values for keys means you can do O(1) push/pop/shift/unshift as well as indexing/slicing into lists. And with the SET data type you can do set intersection in the server. This allows for very cool things like storing a set of tags for each key and then querying for the intersection of the sets of tags of multiple keys to find the common set of tags. I'm using this in Nanite (an agent-based messaging system) for the persistent storage of agent state and routing info. Using the SET data types makes this faster than keeping the state in memory in my Ruby processes, since I can do the SET intersection routing inside of Redis rather than iterating and comparing in Ruby: http://gist.github.com/77314.

    You can also use Redis LISTS as a queue to distribute work between multiple processes. Since pushing and popping are atomic you can use it as a shared tuple space. Also you can use a LIST as a circular log buffer by pushing to the end of the list and then doing an LTRIM to trim the list to the max size: redis.list_push_tail('logs', 'some log line..'); redis.list_trim('logs', 0, 99). That will keep your circular log buffer at a max of 100 items.

    With Tokyo you can store lists if you use Lua on the server, but to push or pop from a list you have to pull the entire list down, alter it, and then push it back up to the server. There is no atomic push/pop of a single item because Tokyo cannot store real lists as values, so you have to marshal your list to/from a string.
    A demo application called Retwis shows how to use Redis to make a scalable Twitter clone, at least of the message posting aspects. There's a lot of good low level detail on how Redis is used in a real application.


    Friday
    Mar062009

    Product: Lightcloud - Key-Value Database

    Lightcloud is a distributed and persistent key-value database from Plurk.com. Performance is said to be comparable to memcached. It's different than memcachedb because it scales out horizontally by adding new nodes. It's different than memcached because it persists to disk, it's not just a cache. Now you have one more option in the never ending quest to ditch the RDBMS. Their website does a nice job explaining the system:

  • Built on Tokyo Tyrant. One of the fastest key-value databases [benchmark]. Tokyo Tyrant has been in development for many years and is used in production by Plurk.com, mixi.jp and scribd.com (to name a few)...
  • Great performance (comparable to memcached!)
  • Can store millions of keys on very few servers - tested in production
  • Scale out by just adding nodes
  • Nodes are replicated via master-master replication. Automatic failover and load balancing is supported from the start
  • Ability to script and extend using Lua. Included extensions are incr and a fixed list
  • Hot backups and restore: Take backups and restore servers without shutting them down
  • LightCloud manager can control nodes, take backups and give you a status on how your nodes are doing
  • Very small footprint (the lightcloud client is around ~500 lines and the manager about ~400)
  • Python only, but LightCloud should be easy to port to other languages


  • Wednesday
    Dec172008

    Ringo - Distributed key-value storage for immutable data

    Ringo is an experimental, distributed, replicating key-value store based on consistent hashing and immutable data. Unlike many general-purpose databases, Ringo is designed for a specific use case: For archiving small (less than 4KB) or medium-size data items (<100MB) in real-time so that the data can survive K - 1 disk breaks, where K is the desired number of replicas, without any downtime, in a manner that scales to terabytes of data. In addition to storing, Ringo should be able to retrieve individual or small sets of data items with low latencies (<10ms) and provide a convenient on-disk format for bulk data access. Ringo is compatible with the map-reduce framework Disco and it was started at Nokia Research Center Palo Alto.

