Entries by mg1313 (24)

Thursday
Sep102009

The technology behind Tornado, FriendFeed's web server

Today, we are open sourcing the non-blocking web server and the tools that power FriendFeed under the name Tornado Web Server. We are really excited to open source this project as a part of Facebook's open source initiative, and we hope it will be useful to others building real-time web services.

You can download Tornado at tornadoweb.org.

Read more on Brett Taylor's blog (co-founder of FriendFeed)

Thursday
Sep102009

Building Scalable Databases: Denormalization, the NoSQL Movement and Digg

Database normalization is a technique for designing relational database schemas that ensures that the data is optimal for ad-hoc querying and that modifications such as deletion or insertion of data does not lead to data inconsistency. Database denormalization is the process of optimizing your database for reads by creating redundant data. A consequence of denormalization is that insertions or deletions could cause data inconsistency if not uniformly applied to all redundant copies of the data within the database.

Read more on Carnage4life blog...

Monday
Aug032009

Building a Data Intensive Web Application with Cloudera, Hadoop, Hive, Pig, and EC2

This tutorial will show you how to use Amazon EC2 and Cloudera's Distribution for Hadoop to run batch jobs for a data intensive web application.

During the tutorial, we will perform the following data processing steps.... read more on Cloudera website

Thursday
Jul162009

Scalable Web Architectures and Application State

In this article we follow a hypothetical programmer, Damian, on his quest to make his web application scalable.

Read the full article on Bytepawn

Monday
Jun292009

eHarmony.com describes how they use Amazon EC2 and MapReduce 

This slide show presents eHarmony.com experience (one of the biggest dating sites out there) in using Amazon EC2 and MapReduce to scale their service.

Go to the Slideshare presentation

Thursday
Jun112009

Yahoo! Distribution of Hadoop

Many people in the Apache Hadoop community have asked Yahoo! to publish the version of Apache Hadoop they test and deploy across their large Hadoop clusters. As a service to the Hadoop community, Yahoo is releasing the Yahoo! Distribution of Hadoop -- a source code distribution that is based entirely on code found in the Apache Hadoop project.

This source distribution includes code patches that they have added to improve the stability and performance of their clusters. In all cases, these patches have already been contributed back to Apache, but they may not yet be available in an Apache release of Hadoop.

Read more and get the Hadoop distribution from Yahoo

Monday
May112009

Facebook, Hadoop, and Hive

Facebook has the second largest installation of Hadoop (a software platform that lets one easily write and run applications that process vast amounts of data), Yahoo being the first.

Learn how they do it and what are the challenges on DBMS2 blog, which is a blog for people who care about database and analytic technologies.

Sunday
Apr262009

Map-Reduce for Machine Learning on Multicore

We are at the beginning of the multicore era. Computers will have increasingly many cores (processors), but there is still no good programming framework for these architectures, and thus no simple and unified way for machine learning to take advantage of the potential speed up.
In this paper, we develop a broadly applicable parallel programming method, one that is easily applied to many different learning algorithms. Our work is in distinct contrast to the tradition in machine learning of designing (often ingenious) ways to speed up a single algorithm at a time.
Specifically, we show that algorithms that fit the Statistical Query model can be written in a certain “summation form,” which allows them to be easily parallelized on multicore computers. We adapt Google’s map-reduce paradigm to demonstrate this parallel speed up technique on a variety of learning algorithms including locally weighted linear regression (LWLR), k-means, logistic regression (LR), naive Bayes (NB), SVM, ICA, PCA, gaussian discriminant analysis (GDA), EM, and backpropagation (NN). Our experimental results show basically linear speedup with an increasing number of processors.

Read more about this study here (PDF - you can download also)

Click to read more ...

Sunday
Apr262009

Scale-up vs. Scale-out: A Case Study by IBM using Nutch/Lucene

Scale-up solutions in the form of large SMPs have represented the mainstream of commercial computing for the past several years. The major server vendors continue to provide increasingly larger and more powerful machines. More recently, scale-out solutions, in the form of clusters of smaller machines, have gained increased acceptance for commercial computing.
Scale-out solutions are particularly effective in high-throughput web-centric applications. In this paper, we investigate the behavior of two competing approaches to parallelism, scale-up and scale-out, in an emerging search application. Our conclusions show that a scale-out strategy can be the key to good performance even on a scale-up machine.
Furthermore, scale-out solutions offer better price/performance, although at an increase in management complexity.

Read more about scaling out/up and about the conclusions here (PDF - you can also download it)

Click to read more ...

Wednesday
Apr222009

Gear6 Web cache - the hardware solution for working with Memcache

The Gear6 Web Cache hybrid DRAM-flash memory architecture allows for 5-10 times more memcache memory per unit of rack space than DRAM-only configurations, and cuts memory costs by 50%. Other software enhancements include a slab allocator that is more efficient than traditional memcache implementations due to its fine-grained bucket sizing. Gear6 Web Cache also supports object sizes greater than 1 megabyte and manages evictions based on the cost of replacing objects, depending on the size and frequency of object access. It intelligently places cache instances across DRAM and flash, taking into account their different characteristics, while at the same time monitoring their health and detecting and de‐allocating faulty or failing memory.

Gear6 Web Cache is a Memcached protocol compliant solution that scales and accelerates web applications, reduces memory footprint, enhances availability and implements comprehensive Memcached management features. Designed to work with all popular memcache clients, Gear6 Web Cache integrates seamlessly into existing deployments and immediately provides a scalable, high density caching solution for your web application environment.

Some of the web services which are using Gear6 are Answers.com (wiki answers), Veoh.com (online video), myYearBook.com (social network).

Read more about Gear6 hardware and customer cases studies on Gear6 website