Friday, Jul 17, 2009

Against all the odds

This article is not about Mariah Carey or her song. It's about storage systems and databases.

First, let's describe what I mean by odds: in my social network, I found that 93% of mainstream developers sanctify the database, or at least treat it as the ultimate, undefeatable superhero solution to any data persistence challenge.

Personally, I think this problem comes from education, and I think some companies are involved in it too.

To start fixing this bad thinking, we should all agree on the following points:

  • Every challenge has its own solutions, so whatever you want to save or persist, there are always many options. For example, Web search engines such as Google, Kngine, Yahoo, and Bing don't use a database at all; instead, we use indexes (index files) for better performance.
  • The database in general, whatever the vendor, is slow compared with other solutions such as key-value storage systems, index files, and DHTs.
  • Databases currently employ the relational data model or the object-relational data model, so don't convince yourself to save non-relational data in a relational store such as a database.
  • The database system architecture hasn't changed much in the last 30 years, and it has a lot of limits and failings in performance and scalability. If you don't believe me, check out these papers:
  1. The End of an Architectural Era (It's Time for a Complete Rewrite)

  2. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks
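To make the first point more concrete, here is a minimal sketch of the kind of index a search engine might keep instead of a database table. It is an in-memory inverted index; the function names (`build_index`, `search`) and the sample documents are made up for illustration and are not how any of the engines named above actually work.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Start from the postings of the first word, then intersect with the rest.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "key value stores scale well",
    2: "relational databases store relational data",
    3: "search engines use index files",
}
index = build_index(docs)
```

A lookup such as `search(index, "index files")` only touches the postings for those two words, which is why this layout suits keyword search better than scanning rows.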

I hope you agree with me on the previous points. So the question is: do we really need a database in every application?

There are many scenarios that shouldn't use a database at all, such as Web search engines, caching, file sharing systems, and DNS. On the other hand, there are many scenarios that should use a database, such as customer databases, address books, and ERP.

Tiny URL services, for example, shouldn't use a database at all because their needs are very simple: just map a small/tiny URL to the real/big URL. If you are starting to agree with me, you will likely want to ask: what can we use besides or instead of databases?
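As a rough illustration of how little a tiny URL service needs, here is a sketch that uses nothing but a key-value mapping. The class name `TinyUrlStore`, the base-62 codes, and the in-memory dictionaries are all assumptions made for this example; a real service would put the same two mappings in a persistent key-value store.

```python
import string

# 62 symbols: 0-9, a-z, A-Z.
ALPHABET = string.digits + string.ascii_letters


def encode(n):
    """Encode a non-negative integer as a base-62 string."""
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n > 0:
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))


class TinyUrlStore:
    """Hypothetical in-memory store: one key-value map in each direction."""

    def __init__(self):
        self._by_code = {}  # tiny code -> real URL
        self._by_url = {}   # real URL -> tiny code

    def shorten(self, url):
        """Return the existing code for url, or assign the next one."""
        if url in self._by_url:
            return self._by_url[url]
        code = encode(len(self._by_code))
        self._by_code[code] = url
        self._by_url[url] = code
        return code

    def resolve(self, code):
        """Return the real URL for a code, or None if it is unknown."""
        return self._by_code.get(code)
```

Both operations are single key lookups, with no joins, schemas, or transactions involved.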

There are a lot of tools that follow the CAP and BASE models instead of the ACID model. But first, let's describe ACID:

  • Atomicity: A transaction is all or nothing
  • Consistency: Only valid data is written to the database
  • Isolation: Pretend all transactions are happening serially and the data is correct
  • Durability: Once a transaction is committed, what you wrote stays written, even after a failure
  1. The problem with ACID is that it gives you too much; it trips you up when you are trying to scale a system across multiple nodes.

  2. Downtime is unacceptable, so your system needs to be reliable. Reliability requires multiple nodes to handle machine failures.

  3. To make scalable systems that can handle lots and lots of reads and writes you need many more nodes.

  4. Once you try to scale ACID across many machines you hit problems with network failures and delays. The algorithms don't work in a distributed environment at any acceptable speed.
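For reference, this is roughly what the "all or nothing" guarantee looks like on a single node, sketched with Python's built-in `sqlite3` module. The `accounts` table, the names, and the `transfer` helper are invented for the example. On one machine this is cheap; the points above are about what happens when the same guarantee has to hold across many machines.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move amount from src to dst; both updates apply or neither does."""
    try:
        # The connection context manager commits on success and rolls back
        # if an exception escapes the block.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False


def balances(conn):
    """Return the current balances as a name -> balance dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


# Hypothetical sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)]
)
conn.commit()
```

A transfer that would overdraw the source account leaves both balances exactly as they were, because the first update is rolled back together with the second.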

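On the other hand, the CAP model is about three properties. Brewer's CAP theorem says that a distributed system can guarantee only two of the three at the same time: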

  • Consistency: Your data is correct all the time. What you write is what you read.

  • Availability: You can read and write your data all the time.

  • Partition Tolerance: If one or more nodes fail, the system still works and becomes consistent again when the failed nodes come back online.
  1. CAP is easy to scale, distribute. CAP is scalable by nature.

  2. Everyone who builds big applications builds them on CAP. Who uses CAP? Google, Yahoo, Facebook, Kngine, Amazon, eBay, etc.
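The partition tolerance behaviour described above (keep working while nodes are cut off, become consistent once they are back) can be sketched with two toy replicas. Everything here is an assumption for illustration: the `Replica` class, the caller-supplied integer timestamps, and the last-write-wins merge rule. Real systems use more careful conflict resolution.

```python
class Replica:
    """Toy replica that stores key -> (timestamp, value)."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, key, value, timestamp):
        """Accept a write unless a newer one for the key is already stored."""
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        """Return the stored value for key, or None if it is missing."""
        entry = self.data.get(key)
        return None if entry is None else entry[1]

    def merge_from(self, other):
        """Pull every entry from another replica, keeping the newest."""
        for key, (timestamp, value) in other.data.items():
            self.write(key, value, timestamp)


def sync(a, b):
    """Exchange state in both directions once the partition heals."""
    a.merge_from(b)
    b.merge_from(a)


a, b = Replica("a"), Replica("b")

# During a partition both replicas stay available and accept writes.
a.write("cart", ["book"], timestamp=1)
b.write("cart", ["book", "pen"], timestamp=2)
diverged = a.read("cart") != b.read("cart")

# When the nodes can talk again, they converge on the newest write.
sync(a, b)
```

While the replicas are apart, readers can see different values; after `sync` both return the same one. That gap is the consistency being traded for availability.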

For example, in any in-memory or on-disk caching system you will never need all of the database features; you just need a CAP-like system. Today there are a lot of column-oriented and key-value-oriented systems. But first, let's describe column-oriented systems:

A column-oriented DBMS is a database management system that stores its content by column rather than by row. This has advantages for databases such as data warehouses and library catalogues, where aggregates are computed over large numbers of similar data items. This approach is contrasted with row-oriented databases and with correlation databases, which use a value-based storage structure. For more information, check the Wikipedia page.
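A small sketch of the difference in layout, using a made-up library catalogue. The sample records and the `to_columns` helper are illustrative only; real column stores add compression and on-disk formats on top of this idea.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "title": "Dune", "year": 1965, "copies": 4},
    {"id": 2, "title": "Emma", "year": 1815, "copies": 2},
    {"id": 3, "title": "Ulysses", "year": 1922, "copies": 1},
]


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


# Column-oriented layout: each field is stored as one contiguous list.
columns = to_columns(rows)

# The same aggregate over both layouts. The row version has to visit every
# record; the column version reads only the one list it needs.
total_row_store = sum(row["copies"] for row in rows)
total_column_store = sum(columns["copies"])
```

Both totals are equal, but the column layout touches only the `copies` values, which is where the advantage for aggregates over many similar items comes from.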

Distributed key-value stores:

Distributed column stores (Bigtable-like systems):

Something a little different:

Resource:

Reader Comments (9)

Actually, "Against all odds" is a Phil Collins song which was covered (poorly) by Mariah Carey :)

December 31, 1999 | Unregistered Commenter Anonymous

Could you explain a little more precisely why CouchDB is different from the two other categories?
Anyway, thanks for the post!

December 31, 1999 | Unregistered Commenter khigia

s/it's song/its song/
s/To start fix/To start to fix/
s/butter performance/better performance/
etc.

December 31, 1999 | Unregistered Commenter Leo Petr

I'm a fan of the High Scalability blog, so I'm surprised by the poor quality of the writing in this post; it prevents me from finding the value of the author's thoughts.

December 31, 1999 | Unregistered Commenter Aaron White

Hmm... Eric Brewer's CAP theorem means you can have only two out of the three properties. So I am not so sure how a system can give you all three, and what kind of systems you call "CAP systems".

December 31, 1999 | Unregistered Commenter Bernd Eckenfels

Not everyone speaks English as their first language ... maybe some of us can offer to copy edit?

December 31, 1999 | Unregistered Commenter Jim

How is redis "distributed"? Redis server does nothing on its own to "distribute" the data or the key space. Y

December 31, 1999 | Unregistered Commenter Anonymous

I'm not usually picky about small grammar mistakes (not least because I'm not a native English speaker myself), but this post is both very badly written and poor in terms of information:

CAP is easy to scale

Please do not publish this kind of article here or I'll have to stop recommending this blog to people.

December 31, 1999 | Unregistered Commenter Gustavo Niemeyer

I think I speak for a lot of us when I say Please don't publish any more garbage like this or we will stop reading this website. Did someone root your website and publish this? Did you think it would get hits because it's controversial?

"CAP is easy to scale, distribute. CAP is scalable by nature."

No, CAP is a theorem that says you can only have 2 of the 3. It means distributed systems are HARD to scale because (if you want partition tolerance) you must choose between Consistency and Availability.

"Consistency: Only valid data is written to the database"

No, that's not what it means at all.

"Something a little different:"

If you read this website, you would know it's called a Docstore.

I could go on, but my IQ is starting to fall from re-reading the article.

December 31, 1999 | Unregistered Commenter Anonymous
