Entries in BigData (24)

Tuesday
Mar 30 2010

Running Large Graph Algorithms - Evaluation of Current State-of-the-Art and Lessons Learned

On the surface, nothing appears more different than soft data and hard raw materials like iron. Then isn’t it ironic, in the Alanis Morissette sense, that in this Age of Information, great wealth still lies hidden deep beneath piles of stuff? It's strange how directly digging for dollars in data parallels the great wealth-producing models of the Industrial Revolution.

The pile of stuff is the Internet. It takes lots of prospecting to find the right stuff. Mighty web crawling machines tirelessly collect stuff, bringing it into their huge maws, then depositing load after load into rack after rack of distributed file system machines. Then armies of still other machines take this stuff and strip out the valuable raw materials, which, in the Information Age, are endless bytes of raw data. Link clicks, likes, page views, content, headlines, searches, inbound links, outbound links, search clicks, hashtags, friends, purchases: anything and everything you do on the Internet is a valuable raw material.

By itself, data is no more useful than a truckload of iron ore. Data must be brought to a factory. It must be purified, processed, and formed. That’s the job of a new field of science called Data Science. Yes, while you weren't looking, a whole new branch of science was created. It makes sense in a way. Since data is a new kind of material, we need a new profession paralleling that of the Material Scientist, someone who seeks to deeply understand data: the Data Scientist. We aren't so much in the age of data as in the age of data inference.


Thursday
Feb 25 2010

Paper: High Performance Scalable Data Stores 

The world of scalable databases is not a simple one. They come in every race, creed, and color. Rick Cattell has brought some harmony to that world by publishing High Performance Scalable Data Stores, a nicely detailed one-stop-shop paper comparing scalable databases solely on the content of their character. Ironically, the first step in that evaluation is dividing the world into four groups:

  • Key-value stores: Redis, Scalaris, Voldemort, and Riak.
  • Document stores: CouchDB, MongoDB, and SimpleDB.
  • Record stores: BigTable, HBase, Hypertable, and Cassandra.
  • Scalable RDBMSs: MySQL Cluster, ScaleDB, Drizzle, and VoltDB.

The paper describes each system and then compares them on the dimensions of Concurrency Control, Data Storage, Replication, Transaction Model, General Comments, Maturity, K-hits, License, and Language.

And the winner is: there are no winners. Yet. Rick concludes by pointing to a great convergence:

I believe that a few of these systems will gain critical mass and key players, and will pull away from the others by next year.  At that point, open source contributors will likely migrate to those players.

From the paper:

 


Monday
Nov 23 2009

Big Data on Grids or on Clouds? 

 Contributed by Wolfgang Gentzsch:

Now that we have a new computing paradigm, Cloud Computing, how can Clouds help our data? Replace our internal data vaults as we hoped Grids would? Are Grids dead now that we have Clouds? Despite all the promising developments in the Grid and Cloud computing space, and the avalanche of publications and talks on this subject, many people still seem to be confused about internal data and compute resources, versus Grids versus Clouds, and they are hesitant to take the next step. I think there are a number of issues driving this uncertainty.

read more at: BigDataMatters.com

Thursday
Oct 22 2009

Paper: The Case for RAMClouds: Scalable High-Performance Storage Entirely in DRAM 

Stanford Info Lab is taking pains to document a direction we've been moving in for a while now: using RAM not just as a cache, but as the primary storage medium. Many quality products have been built on this model. Even if the vision isn't radical, the paper does produce a lot of data backing up the transition, which is in itself helpful. From the abstract:
Disk-oriented approaches to online storage are becoming increasingly problematic: they do not scale gracefully to meet the needs of large-scale Web applications, and improvements in disk capacity have far outstripped improvements in access latency and bandwidth. This paper argues for a new approach to datacenter storage called RAMCloud, where information is kept entirely in DRAM and large-scale systems are created by aggregating the main memories of thousands of commodity servers. We believe that RAMClouds can provide durable and available storage with 100-1000x the throughput of disk-based systems and 100-1000x lower access latency. The combination of low latency and large scale will enable a new breed of data-intensive applications.
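
To make the "aggregating the main memories of many servers" idea concrete, here is a toy sketch in Python. It is not RAMCloud's design or API: the class `ToyRamCluster`, its methods, and the hash-partitioning scheme are all assumptions made up for illustration, and each "server" is just an in-process dictionary. It only shows how keys might be spread across several memory-only nodes so that the cluster's capacity is the sum of its members' memories.

```python
import hashlib


class ToyRamCluster:
    """Hypothetical toy model: every 'server' is a plain in-memory dict."""

    def __init__(self, num_servers):
        self.servers = [{} for _ in range(num_servers)]

    def _server_for(self, key):
        # Use a stable hash so a given key always maps to the same server.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.servers)

    def put(self, key, value):
        self.servers[self._server_for(key)][key] = value

    def get(self, key, default=None):
        return self.servers[self._server_for(key)].get(key, default)

    def total_items(self):
        # Total capacity used is the sum over all servers' memories.
        return sum(len(server) for server in self.servers)


cluster = ToyRamCluster(num_servers=4)
for i in range(1000):
    cluster.put(f"user:{i}", {"id": i})
```

The sketch deliberately leaves out the durability and availability that the abstract says RAMClouds can provide; a dictionary in one process loses everything on a crash, so a real system would need some form of replication or logging on top of this.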
