High Scalability -

Python,

Web Server,

real time,

tornado

Building Scalable Databases: Denormalization, the NoSQL Movement and Digg

Thursday, September 10, 2009 at 6:27AM

Database normalization is a technique for designing relational database schemas so that the data is well suited to ad-hoc querying and so that modifications such as deleting or inserting data do not lead to inconsistency. Database denormalization is the process of optimizing a database for reads by storing redundant copies of data. A consequence of denormalization is that insertions, updates, or deletions can cause inconsistency unless they are applied uniformly to every redundant copy of the data within the database.
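To make the trade-off concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The schema (`users`, `stories`, and the copied `author_name` column) and all function names are hypothetical, invented for illustration; they are not taken from the linked article or from Digg's actual design.

```python
import sqlite3

def build_db():
    """Creates an in-memory database with a denormalized stories table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        -- Denormalized: author_name duplicates users.name so reads need no join.
        CREATE TABLE stories (
            id INTEGER PRIMARY KEY,
            author_id INTEGER NOT NULL REFERENCES users(id),
            author_name TEXT NOT NULL,
            title TEXT NOT NULL
        );
        INSERT INTO users VALUES (1, 'alice');
        INSERT INTO stories VALUES (10, 1, 'alice', 'First story');
        INSERT INTO stories VALUES (11, 1, 'alice', 'Second story');
    """)
    return conn

def rename_user_naive(conn, user_id, new_name):
    """Updates only the canonical row; the redundant copies go stale."""
    with conn:
        conn.execute("UPDATE users SET name = ? WHERE id = ?", (new_name, user_id))

def rename_user_consistent(conn, user_id, new_name):
    """Updates the canonical row and every redundant copy in one transaction."""
    with conn:
        conn.execute("UPDATE users SET name = ? WHERE id = ?", (new_name, user_id))
        conn.execute(
            "UPDATE stories SET author_name = ? WHERE author_id = ?",
            (new_name, user_id),
        )

def stale_rows(conn):
    """Returns ids of stories whose copied author_name disagrees with users.name."""
    rows = conn.execute("""
        SELECT s.id FROM stories s JOIN users u ON u.id = s.author_id
        WHERE s.author_name <> u.name ORDER BY s.id
    """).fetchall()
    return [r[0] for r in rows]
```

Reading a story with its author's name needs no join, which is the point of the redundant column. The cost shows up on writes: the naive rename leaves both story rows stale, while the consistent version has to touch every copy.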

Read more on the Carnage4life blog...

mg1313 | 2 Comments | digg, nosql, scalable

Building a Data Intensive Web Application with Cloudera, Hadoop, Hive, Pig, and EC2

Monday, August 3, 2009 at 11:18AM

This tutorial will show you how to use Amazon EC2 and Cloudera's Distribution for Hadoop to run batch jobs for a data-intensive web application.

During the tutorial, we will perform the following data processing steps... Read more on the Cloudera website.

mg1313 | EC2, Hadoop, cloudera, data, hive, web application

Scalable Web Architectures and Application State

Thursday, July 16, 2009 at 9:22AM

In this article we follow a hypothetical programmer, Damian, on his quest to make his web application scalable.

Read the full article on Bytepawn

mg1313 | 1 Comment | LAMP

eHarmony.com describes how they use Amazon EC2 and MapReduce

Monday, June 29, 2009 at 7:31AM

This slide show presents the experience of eHarmony.com (one of the biggest dating sites out there) in using Amazon EC2 and MapReduce to scale its service.

Go to the Slideshare presentation

mg1313 | EC2, amazon, eharmony, mapreduce

Yahoo! Distribution of Hadoop

Thursday, June 11, 2009 at 3:14PM

Many people in the Apache Hadoop community have asked Yahoo! to publish the version of Apache Hadoop it tests and deploys across its large Hadoop clusters. As a service to the Hadoop community, Yahoo! is releasing the Yahoo! Distribution of Hadoop, a source code distribution based entirely on code found in the Apache Hadoop project.

This source distribution includes code patches that they have added to improve the stability and performance of their clusters. In all cases, these patches have already been contributed back to Apache, but they may not yet be available in an Apache release of Hadoop.

Read more and get the Hadoop distribution from Yahoo

mg1313 | Hadoop, Java, distribution, open source, yahoo

Facebook, Hadoop, and Hive

Monday, May 11, 2009 at 2:41AM

Facebook has the second-largest installation of Hadoop (a software platform that lets one easily write and run applications that process vast amounts of data); Yahoo! has the largest.

Learn how they do it and what the challenges are on the DBMS2 blog, a blog for people who care about database and analytic technologies.

mg1313 | 1 Comment | Hadoop, facebook, hive, yahoo

Map-Reduce for Machine Learning on Multicore

Sunday, April 26, 2009 at 6:53PM

We are at the beginning of the multicore era. Computers will have increasingly many cores (processors), but there is still no good programming framework for these architectures, and thus no simple and unified way for machine learning to take advantage of the potential speed up.

In this paper, we develop a broadly applicable parallel programming method, one that is easily applied to many different learning algorithms. Our work is in distinct contrast to the tradition in machine learning of designing (often ingenious) ways to speed up a single algorithm at a time.

Specifically, we show that algorithms that fit the Statistical Query model can be written in a certain “summation form,” which allows them to be easily parallelized on multicore computers. We adapt Google’s map-reduce paradigm to demonstrate this parallel speed up technique on a variety of learning algorithms including locally weighted linear regression (LWLR), k-means, logistic regression (LR), naive Bayes (NB), SVM, ICA, PCA, gaussian discriminant analysis (GDA), EM, and backpropagation (NN). Our experimental results show basically linear speedup with an increasing number of processors.
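As a rough illustration of what "summation form" means, the sketch below fits a straight line by ordinary least squares using only sums over the data. This is a simplified stand-in chosen for brevity, not the paper's code or one of its benchmark implementations, and every function name here is hypothetical. Each chunk is reduced to five partial sums (the map step), the partial sums are added together (the reduce step), and the fit is computed from the totals.

```python
from functools import reduce

def partial_sums(chunk):
    """Map step: sufficient statistics for one chunk of (x, y) pairs."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in chunk:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)

def add_sums(a, b):
    """Reduce step: element-wise addition of two partial results."""
    return tuple(p + q for p, q in zip(a, b))

def fit_line(chunks, map_fn=map):
    """Fits y = slope * x + intercept by least squares from summed statistics."""
    n, sx, sy, sxx, sxy = reduce(add_sums, map_fn(partial_sums, chunks))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

def split(data, parts):
    """Splits data into at most `parts` roughly equal chunks, e.g. one per core."""
    size = (len(data) + parts - 1) // parts
    return [data[i:i + size] for i in range(0, len(data), size)]

if __name__ == "__main__":
    data = [(float(x), 2.0 * x + 1.0) for x in range(10)]
    print(fit_line(split(data, 4)))
```

The built-in `map` runs the chunks serially here. Because the chunks are independent, a parallel map such as `multiprocessing.Pool.map` could be passed as `map_fn` instead; whether that yields a speedup depends on the chunk size and on process start-up cost, which this sketch does not measure.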

Read more about this study here (PDF, also available for download)

mg1313 | 3 Comments

Scale-up vs. Scale-out: A Case Study by IBM using Nutch/Lucene

Sunday, April 26, 2009 at 6:42PM

Scale-up solutions in the form of large SMPs have represented the mainstream of commercial computing for the past several years. The major server vendors continue to provide increasingly larger and more powerful machines. More recently, scale-out solutions, in the form of clusters of smaller machines, have gained increased acceptance for commercial computing.

Scale-out solutions are particularly effective in high-throughput web-centric applications. In this paper, we investigate the behavior of two competing approaches to parallelism, scale-up and scale-out, in an emerging search application. Our conclusions show that a scale-out strategy can be the key to good performance even on a scale-up machine.

Furthermore, scale-out solutions offer better price/performance, although at an increase in management complexity.

Read more about scaling up versus scaling out, and about the paper's conclusions, here (PDF, also available for download)

mg1313