Entries in Map Reduce (16)

Thursday
Jan 12, 2012

Peregrine - A Map Reduce Framework for Iterative and Pipelined Jobs

The Peregrine falcon is a bird of prey, famous for its high speed diving attacks, feeding primarily on much slower Hadoops. Wait, sorry, it is Kevin Burton of Spinn3r's new Peregrine project--a new FAST modern map reduce framework optimized for iterative and pipelined map reduce jobs--that feeds on Hadoops.

If you don't know Kevin, he does a lot of excellent technical work which he's kind enough to share on his blog. Only he hasn't been blogging much lately; he's been heads down working on Peregrine. Now that Peregrine has been released, here's a short email interview with Kevin on why you might want to take up falconry, the ancient sport of MapReduce.

What does Spinn3r do, and why is Peregrine important to you?


Wednesday
Jul 27, 2011

Making Hadoop 1000x Faster for Graph Problems

Dr. Daniel Abadi, author of the DBMS Musings blog and co-founder of Hadapt, which offers a product improving Hadoop performance by 50x on relational data, is now taking his talents to graph data in Hadoop's tremendous inefficiency on graph data management (and how to avoid it), a post that shares the secrets of getting Hadoop to perform 1000x better on graph data.


Tuesday
Jan 4, 2011

Map-Reduce With Ruby Using Hadoop

A demonstration, with repeatable steps, of how to quickly fire up a Hadoop cluster on Amazon EC2, load data onto the HDFS (Hadoop Distributed File System), write map-reduce scripts in Ruby, and use them to run a map-reduce job on your Hadoop cluster. You will not need to ssh into the cluster, as all tasks are run from your local machine. Below I am using my MacBook Pro as my local machine, but the steps I have provided should be reproducible on other platforms running bash and Java.
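
The post writes its mapper and reducer in Ruby, but the Hadoop Streaming contract is language-agnostic: a mapper reads lines on stdin and emits tab-separated key/value pairs, and a reducer receives those pairs sorted by key. Here is a minimal word count sketch of the same contract in Python (the script and file names are illustrative):

    #!/usr/bin/env python
    # wordcount_streaming.py: minimal Hadoop Streaming word count sketch.
    # Try it locally, with sort standing in for the shuffle:
    #   cat input.txt | python wordcount_streaming.py map \
    #       | sort | python wordcount_streaming.py reduce
    import sys

    def run_map():
        # Emit one "word<TAB>1" line per word read from stdin.
        for line in sys.stdin:
            for word in line.strip().split():
                print("%s\t1" % word)

    def run_reduce():
        # Input arrives sorted by key, so equal words are adjacent.
        current, total = None, 0
        for line in sys.stdin:
            word, _, count = line.rstrip("\n").partition("\t")
            if word != current and current is not None:
                print("%s\t%d" % (current, total))
                total = 0
            current = word
            total += int(count)
        if current is not None:
            print("%s\t%d" % (current, total))

    if __name__ == "__main__":
        run_map() if sys.argv[1:] == ["map"] else run_reduce()

On a real cluster the same two commands are passed to the Hadoop streaming jar as the -mapper and -reducer programs, and the framework supplies the sort between the two phases.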

Thursday
Jul 30, 2009

Learn How to Think at Scale

Aaron Kimball of Cloudera gives a wonderful 23 minute presentation titled Cloudera Hadoop Training: Thinking at Scale, which talks about "common challenges and general best practices for scaling with your data." As a company Cloudera offers "enterprise-level support to users of Apache Hadoop." Part of that offering is a really useful series of tutorial videos on the Hadoop ecosystem.

Like TV lawyer Perry Mason (or is it Harmon Rabb?), Aaron gradually builds his case. He opens with the problem of storing lots of data. Then a blistering cross examination of the problem of building distributed systems to analyze that data sets up a powerful closing argument. With so much testimony behind him, on closing Aaron really brings it home, showing why shared nothing systems like map-reduce are the right way to query lots of data. The jury loved it.

Here's the video Thinking at Scale. And here's a summary of some of the lessons learned from the talk:

Lessons Learned

  • We can process data much faster than we can read it and much faster than we can write results back to disk.
    * Say a machine has 32 GB of RAM available and 1-2 TB of data on disk. The amount of data a machine can store is far greater than the amount it can manipulate in memory, so you have to swap data through RAM.
    * With an average job size of 180 GB it would take 45 minutes to read that data off of disk sequentially. Random access would be much slower.
    * An individual SATA drive can read at about 75 MB/sec. At that rate, reading 180 GB keeps a single drive busy for the better part of an hour while the CPU sits mostly idle waiting on I/O (see the arithmetic sketch after this list).
  • The solution is to parallelize the reads. Have 1000 hard drives working together and you can read 75 GB/sec.
  • With a parallel system in place the next step is to move computation to where the data is already stored.
    * Grids moved data to computation. Data was typically stored on a large filer/SAN.
    * The new large scale computing approach is to move computation to where the data is already stored. A filer has limited processing power relative to its storage size, so it's not much use for computation.
    * So move processing to individual nodes that store only a small amount of the data at a time.
    * This gets around implementation complexity and bandwidth limitations of a centralized filer. Distributed systems can drown themselves if they start sharing data.
  • Large distributed systems must be able to support partial failure and adapt to additional capacity.
    * Failure with large systems is inevitable so partial progress must be kept for long jobs and jobs must be restarted when a failure is detected. Complex distributed systems make job restarting difficult because of the state that must be maintained.
    * Processing should be close to linear with the number of nodes. Losing 5% of nodes should not end up with a 50% loss in throughput. Doubling the size of the cluster should double the number of jobs that can be processed. No job should be able to nuke the system.
    * Workload should be transferred as new nodes are added and failures occur.
    * Node changes (failures, additions, new hard drives, more memory, etc) should be transparent to jobs. Users shouldn't have to deal with changes, the system should handle them transparently.
  • The solution to large scale data processing problems is to build a shared nothing architecture.
    * To get around the limits faced by MPI (Message Passing Interface) based systems, nothing is shared.
    * In map-reduce (MR) systems data is read locally and processed locally. Results are written back locally.
    * Nodes do not talk to each other.
    * Data is partitioned onto machines in advance and computations happen where data is stored.
    * In MPI communication is explicit. Programs know who they are talking to and what they are talking about. In MR communication is implicit. It is taken care of by the system. Data is routed where it needs to go. This simplifies programs by removing the complexity of explicit coordination, and allows developers to concentrate on solving their problem without knowing low-level network stack and programming details.
    * On multi-core computers each core would be treated as a separate node.
    * Goal is to have locality of reference. Tasks are processed on the same node where the data is stored, or at least on the same rack. This removes a load step. The data isn't loaded onto a filer and then onto processing machines. It's already spread around the cluster where it needs to be used.
    * Standard MR processes large files of data, typically 1 GB or more, which allows streaming reads from disk. Typical file system block sizes are 4K; for MR they are 64MB to 256MB, which allows writing large linear chunks and reduces seeks on reading.
    * Tasks are restarted transparently on failure because tasks are independent of each other.
    * Data is replicated across nodes for fault tolerance purposes.
    * Task independence allows speculative task execution. The same task can be started on different nodes and the fastest result used. This allows problems like broken disk controllers to be worked around.
    * If necessary, inputs can be processed on another machine, though there's a big penalty for going off node and off rack.
    * Nodes have no identity to the programmer. Nodes can run multiple jobs.
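
To make the read-bandwidth argument concrete, here is a small back-of-the-envelope sketch in Python using the figures quoted in the talk (75 MB/sec per drive, 180 GB jobs, 1000 drives):

    # Back-of-the-envelope arithmetic for the read-bandwidth argument above.
    DRIVE_MB_PER_SEC = 75   # sequential read rate of one SATA drive
    JOB_SIZE_GB = 180       # average job size quoted in the talk
    NUM_DRIVES = 1000       # drives reading in parallel

    def seconds_to_read(size_gb, drives=1):
        return (size_gb * 1024) / (DRIVE_MB_PER_SEC * drives)

    print("one drive: %.0f minutes" % (seconds_to_read(JOB_SIZE_GB) / 60))
    # -> about 41 minutes, in line with the ~45 minutes quoted above
    print("%d drives: %.1f seconds" % (NUM_DRIVES, seconds_to_read(JOB_SIZE_GB, NUM_DRIVES)))
    # -> about 2.5 seconds, because aggregate bandwidth is now ~75 GB/sec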

    Related Articles

  • Researchers: Databases still beat Google's MapReduce by Eric Lai
  • Relational Database Experts Jump The MapReduce Shark by Greg Jorgensen
  • Hadoop and HBase vs RDBMS by Jonathan Gray
  • Database Technology for the Web: Part 1 – The MapReduce Debate by Colin White
    Sunday
    May 17, 2009

    Product: Hadoop

    Update 5: Hadoop Sorts a Petabyte in 16.25 Hours and a Terabyte in 62 Seconds, and has its green cred questioned because it took 40 times the number of machines Greenplum used to do the same work.

    Update 4: Introduction to Pig. Pig allows you to skip programming Hadoop at the low map-reduce level. You don't have to know Java. Using the Pig Latin language, a scripting data flow language, you can think about your problem as a data flow program. 10 lines of Pig Latin = 200 lines of Java.

    Update 3: Scaling Hadoop to 4000 nodes at Yahoo!. 30,000 cores with nearly 16PB of raw disk; a 6TB sort completed in 37 minutes; 14,000 map tasks write (read) 360 MB (about 3 blocks) of data into a single file with a total of 5.04 TB for the whole job.

    Update 2: Hadoop Summit and Data-Intensive Computing Symposium Videos and Slides. Topics include: Pig, JAQL, HBase, Hive, Data-Intensive Scalable Computing, Clouds and ManyCore: The Revolution, Simplicity and Complexity in Data Systems at Scale, Handling Large Datasets at Google: Current Systems and Future Directions, Mining the Web Graph, and Sherpa: Hosted Data Serving.

    Update: Kevin Burton points out Hadoop now has a blog and an introductory video starring Beyonce. Well, the Beyonce part isn't quite true.

    Hadoop is a framework for running applications on large clusters of commodity hardware using a computational paradigm named map/reduce, in which the application is divided into many small fragments of work, each of which may be executed on any node in the cluster. It replicates much of Google's stack, but it's for the rest of us. Jeremy Zawodny has a wonderful overview of why Hadoop is important for large website builders:

    For the last several years, every company involved in building large web-scale systems has faced some of the same fundamental challenges. While nearly everyone agrees that the "divide-and-conquer using lots of cheap hardware" approach to breaking down large problems is the only way to scale, doing so is not easy. The underlying infrastructure has always been a challenge. You have to buy, power, install, and manage a lot of servers. Even if you use somebody else's commodity hardware, you still have to develop the software that'll do the divide-and-conquer work to keep them all busy. It's hard work. And it needs to be commoditized, just like the hardware has been...

    Hadoop also provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both map/reduce and the distributed file system are designed so that node failures are automatically handled by the framework. Hadoop has been demonstrated on clusters with 2000 nodes. The current design target is 10,000 node clusters.

    The obvious question of the day is: should you build your website around Hadoop? I have no idea. There seem to be a few types of things you do with lots of data: process, transform, and serve. Yahoo literally has petabytes of log files, web pages, and other data they process. Process means to calculate on: figure out affinity, categorization, popularity, click throughs, trends, search terms, and so on. Hadoop makes great sense for them for the same reasons it does for Google. But does it make sense for your website? If you are YouTube and you have petabytes of media to serve, do you really need map/reduce? Maybe not, but the clustered file system is great. You get high bandwidth with the ability to transparently extend storage resources. Perfect for when you have lots of stuff to store. YouTube would seem like it could use a distributed job mechanism, like you can build with Amazon's services. With that you could create thumbnails, previews, transcode media files, and so on. When they have HBase up and running that could really spike adoption. Everyone needs to store structured data in a scalable, reliable, highly performing data store. That's an exciting prospect for me. I can't wait for experience reports about "normal" people, familiar with a completely different paradigm, adopting this infrastructure. I wonder what animal O'Reilly will use on their Hadoop cover?

    See Also

  • Open Source Distributed Computing: Yahoo's Hadoop Support by Jeremy Zawodny
  • Yahoo!'s bet on Hadoop by Tim O'Reilly
  • Hadoop Presentations
  • Running Hadoop MapReduce on Amazon EC2 and Amazon S3


    Sunday
    Jan 4, 2009

    Paper: MapReduce: Simplified Data Processing on Large Clusters

    Update: MapReduce and PageRank Notes from Remzi Arpaci-Dusseau's Fall 2008 class. Collects interesting facts about MapReduce and PageRank. For example, the history of the solution to searching for the term "flu" is traced through multiple generations of technology.

    With Google entering the cloud space with Google AppEngine and a maturing Hadoop product, the MapReduce scaling approach might finally become a standard programmer practice. This is the best paper on the subject and is an excellent primer on a content-addressable memory future. Some interesting stats from the paper: Google executes 100k MapReduce jobs each day; more than 20 petabytes of data are processed per day; more than 10k MapReduce programs have been implemented; machines are dual processor with gigabit ethernet and 4-8 GB of memory. One common criticism ex-Googlers have is that it takes months to get up to speed and be productive in the Google environment. Hopefully a way will be found to lower the learning curve and make programmers productive faster.

    From the abstract: MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day. Thanks to Kevin Burton for linking to the complete article.

    Related Articles

  • MapReducing 20 petabytes per day by Greg Linden
  • 2004 Version of the Article by Jeffrey Dean and Sanjay Ghemawat


    Saturday
    Nov 22, 2008

    Google Architecture

    Update 2: Sorting 1 PB with MapReduce. PB is not peanut-butter-and-jelly misspelled. It's 1 petabyte or 1000 terabytes or 1,000,000 gigabytes. It took six hours and two minutes to sort 1PB (10 trillion 100-byte records) on 4,000 computers and the results were replicated thrice on 48,000 disks.

    Update: Greg Linden points to a new Google article MapReduce: simplified data processing on large clusters. Some interesting stats: 100k MapReduce jobs are executed each day; more than 20 petabytes of data are processed per day; more than 10k MapReduce programs have been implemented; machines are dual processor with gigabit ethernet and 4-8 GB of memory.

    Google is the King of scalability. Everyone knows Google for their large, sophisticated, and fast searching, but they don't just shine in search. Their platform approach to building scalable applications allows them to roll out internet scale applications at an alarmingly high, competition crushing rate. Their goal is always to build a higher performing, higher scaling infrastructure to support their products. How do they do that?

    Information Sources

  • Video: Building Large Systems at Google
  • Google Lab: The Google File System
  • Google Lab: MapReduce: Simplified Data Processing on Large Clusters
  • Google Lab: BigTable.
  • Video: BigTable: A Distributed Structured Storage System.
  • Google Lab: The Chubby Lock Service for Loosely-Coupled Distributed Systems.
  • How Google Works by David Carr in Baseline Magazine.
  • Google Lab: Interpreting the Data: Parallel Analysis with Sawzall.
  • Dare Obasanjo's Notes on the scalability conference.

    Platform

  • Linux
  • A large diversity of languages: Python, Java, C++

    What's Inside?

    The Stats

  • Estimated 450,000 low-cost commodity servers in 2006
  • In 2005 Google indexed 8 billion web pages. By now, who knows?
  • Currently there are over 200 GFS clusters at Google. A cluster can have 1000 or even 5000 machines. Pools of tens of thousands of machines retrieve data from GFS clusters that run as large as 5 petabytes of storage. Aggregate read/write throughput can be as high as 40 gigabytes/second across the cluster.
  • Currently there are 6000 MapReduce applications at Google and hundreds of new applications are being written each month.
  • BigTable scales to store billions of URLs, hundreds of terabytes of satellite imagery, and preferences for hundreds of millions of users.

    The Stack

    Google visualizes their infrastructure as a three layer stack:
  • Products: search, advertising, email, maps, video, chat, blogger
  • Distributed Systems Infrastructure: GFS, MapReduce, and BigTable.
  • Computing Platforms: a bunch of machines in a bunch of different data centers
  • Make it easy for folks in the company to deploy at low cost.
  • Look at price performance data on a per application basis. Spend more money on hardware to not lose log data, but spend less on other types of data. Having said that, they don't lose data.

    Reliable Storage Mechanism with GFS (Google File System)

  • Reliable scalable storage is a core need of any application. GFS is their core storage platform.
  • Google File System - a large distributed log structured file system into which they throw a lot of data.
  • Why build it instead of using something off the shelf? Because they control everything and it's the platform that distinguishes them from everyone else. They required: high reliability across data centers; scalability to thousands of network nodes; huge read/write bandwidth; support for large blocks of data, gigabytes in size; and efficient distribution of operations across nodes to reduce bottlenecks.
  • System has master and chunk servers. - Master servers keep metadata on the various data files. Data are stored in the file system in 64MB chunks. Clients talk to the master servers to perform metadata operations on files and to locate the chunk server that contains the data they need on disk (a sketch of this read path follows this list). - Chunk servers store the actual data on disk. Each chunk is replicated across three different chunk servers to create redundancy in case of server crashes. Once directed by a master server, a client application retrieves files directly from chunk servers.
  • A new application coming on line can use an existing GFS cluster or they can make their own. It would be interesting to understand the provisioning process they use across their data centers.
  • Key is enough infrastructure to make sure people have choices for their application. GFS can be tuned to fit individual application needs.
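
    A minimal sketch of that master/chunk-server read path, in Python. All class and method names here are invented for illustration; the real GFS client API isn't public:

        # Hypothetical sketch of the GFS read path described above. The client
        # asks the master for metadata only, then streams data directly from
        # a chunk server. Single-chunk reads only, for brevity.
        CHUNK_SIZE = 64 * 1024 * 1024  # files are stored as 64MB chunks

        class Master:
            """Keeps metadata only: which chunk servers hold each chunk."""
            def __init__(self, chunk_table):
                # chunk_table: (file_name, chunk_index) -> list of replica hosts
                self.chunk_table = chunk_table

            def locate(self, file_name, offset):
                index = offset // CHUNK_SIZE
                return index, self.chunk_table[(file_name, index)]

        class ChunkServer:
            """Stores the actual chunks on local disk (a dict here)."""
            def __init__(self, chunks):
                self.chunks = chunks  # (file_name, chunk_index) -> bytes

            def read(self, file_name, index, start, length):
                return self.chunks[(file_name, index)][start:start + length]

        def gfs_read(master, servers, file_name, offset, length):
            # 1. Ask the master for metadata only, never for file data.
            index, replicas = master.locate(file_name, offset)
            # 2. Stream the data directly from any live replica (3 exist).
            for host in replicas:
                if host in servers:
                    return servers[host].read(file_name, index, offset % CHUNK_SIZE, length)
            raise IOError("all replicas unavailable")

        servers = {"cs1": ChunkServer({("web.log", 0): b"GET /index.html ..."})}
        master = Master({("web.log", 0): ["cs1", "cs2", "cs3"]})
        print(gfs_read(master, servers, "web.log", 0, 15))  # b'GET /index.html'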

    Do Something With the Data Using MapReduce

  • Now that you have a good storage system, how do you do anything with so much data? Say you have many TBs of data stored across 1000 machines. Databases don't scale to those levels, or don't do so cost effectively. That's where MapReduce comes in.
  • MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
  • Why use MapReduce? - Nice way to partition tasks across lots of machines. - Handle machine failure. - Works across different application types, like search and ads. Almost every application has map reduce type operations. You can precompute useful data, find word counts, sort TBs of data, etc. - Computation can automatically move closer to the IO source.
  • The MapReduce system has three different types of servers. - The Master server assigns user tasks to map and reduce servers. It also tracks the state of the tasks. - The Map servers accept user input and perform map operations on it. The results are written to intermediate files. - The Reduce servers accept intermediate files produced by map servers and perform reduce operations on them.
  • For example, say you want to count the number of words in all web pages. You would feed all the pages stored on GFS into MapReduce. This would all be happening on 1000s of machines simultaneously and all the coordination, job scheduling, failure handling, and data transport would be done automatically. - The steps look like: GFS -> Map -> Shuffle -> Reduction -> Store Results back into GFS. - In MapReduce a map maps one view of data to another, producing a key value pair, which in our example is word and count. - Shuffling aggregates key types. - The reduction sums up all the key value pairs and produces the final answer (a minimal sketch of this flow appears after this list).
  • The Google indexing pipeline has about 20 different map reductions. A pipeline stage looks at the data as a whole bunch of records and aggregates keys. A second map-reduce comes along, takes that result, and does something else. And so on.
  • Programs can be very small. As little as 20 to 50 lines of code.
  • One problem is stragglers. A straggler is a computation that is going slower than others, which holds up everyone. Stragglers may happen because of slow IO (say a bad controller) or from a temporary CPU spike. The solution is to run multiple copies of the same computation and when one is done kill all the rest.
  • Data transferred between map and reduce servers is compressed. The idea is that because servers aren't CPU bound it makes sense to spend on data compression and decompression in order to save on bandwidth and I/O.
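
    A minimal sketch of that word count flow in Python, with a single process standing in for the 1000s of machines (the function names are illustrative, not Google's API):

        # Word count expressed as the GFS -> Map -> Shuffle -> Reduce flow
        # described above, with plain single-process Python standing in for
        # the distributed runtime.
        from collections import defaultdict

        def map_fn(document):
            # Map: turn one view of the data into (word, 1) key/value pairs.
            return [(word, 1) for word in document.split()]

        def shuffle(pairs):
            # Shuffle: aggregate all values emitted under the same key.
            groups = defaultdict(list)
            for key, value in pairs:
                groups[key].append(value)
            return groups

        def reduce_fn(key, values):
            # Reduce: sum the counts for a word to produce the final answer.
            return key, sum(values)

        documents = ["the quick brown fox", "the lazy dog"]  # stand-in for GFS input
        mapped = [pair for doc in documents for pair in map_fn(doc)]
        counts = dict(reduce_fn(k, v) for k, v in shuffle(mapped).items())
        print(counts)  # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}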

    Storing Structured Data in BigTable

  • BigTable is a large scale, fault tolerant, self managing system that includes terabytes of memory and petabytes of storage. It can handle millions of reads/writes per second.
  • BigTable is a distributed hash mechanism built on top of GFS. It is not a relational database. It doesn't support joins or SQL type queries.
  • It provides a lookup mechanism to access structured data by key. GFS stores opaque data, but many applications need data with structure.
  • Commercial databases simply don't scale to this level, and they don't work across 1000s of machines.
  • By controlling their own low level storage system Google gets more control and leverage to improve their system. For example, if they want features that make cross data center operations easier, they can build it in.
  • Machines can be added and deleted while the system is running and the whole system just works.
  • Each data item is stored in a cell which can be accessed using a row key, column key, or timestamp (a toy sketch of this cell model appears after this list).
  • Each row is stored in one or more tablets. A tablet is a sequence of 64KB blocks in a data format called SSTable.
  • BigTable has three different types of servers: - The Master servers assign tablets to tablet servers. They track where tablets are located and redistribute tasks as needed. - The Tablet servers process read/write requests for tablets. They split tablets when they exceed size limits (usually 100MB - 200MB). When a tablet server fails, 100 tablet servers each pick up 1 new tablet and the system recovers. - The Lock servers form a distributed lock service. Operations like opening a tablet for writing, Master arbitration, and access control checking require mutual exclusion.
  • A locality group can be used to physically store related bits of data together for better locality of reference.
  • Tablets are cached in RAM as much as possible.
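
    A toy sketch of that cell model (row key, column key, timestamp). It is purely illustrative; real BigTable adds tablets, SSTables, locality groups, and much more:

        # Toy model of the cell-addressing scheme above:
        # (row key, column key, timestamp) -> value.
        import bisect

        class ToyBigTable:
            def __init__(self):
                self.cells = {}  # (row, column) -> sorted list of (ts, value)

            def write(self, row, column, timestamp, value):
                versions = self.cells.setdefault((row, column), [])
                bisect.insort(versions, (timestamp, value))

            def read(self, row, column, timestamp=None):
                # Newest value at or before `timestamp` (newest overall if None).
                versions = self.cells.get((row, column), [])
                if not versions:
                    return None
                if timestamp is None:
                    return versions[-1][1]
                i = bisect.bisect_right(versions, (timestamp, chr(0x10FFFF)))
                return versions[i - 1][1] if i else None

        t = ToyBigTable()
        t.write("com.example/index.html", "contents", 1, "<html>v1</html>")
        t.write("com.example/index.html", "contents", 5, "<html>v2</html>")
        print(t.read("com.example/index.html", "contents"))     # newest: v2
        print(t.read("com.example/index.html", "contents", 3))  # as of ts 3: v1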

    Hardware

  • When you have a lot of machines how do you build them to be cost efficient and use power efficiently?
  • Use ultra cheap commodity hardware and build software on top to handle their death.
  • A 1,000-fold computer power increase can be had for a 33 times lower cost if you use a failure-prone infrastructure rather than an infrastructure built on highly reliable components. You must build reliability on top of unreliability for this strategy to work (the replication arithmetic sketched after this list shows why that's feasible).
  • Linux, in-house rack design, PC class mother boards, low end storage.
  • Price per watt on a performance basis isn't getting better. They have huge power and cooling issues.
  • Use a mix of collocation and their own data centers.
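
    To see why reliability can be built on top of unreliability, here is a rough replication sketch (the 5% annual drive failure rate is an illustrative assumption, not a number from the talk):

        # Why 3-way replication (as GFS uses above) makes unreliable parts
        # dependable. The 5% annual failure rate is an illustrative assumption.
        p_fail = 0.05        # assumed chance one replica's drive dies this year
        replicas = 3

        p_all_lost = p_fail ** replicas   # all replicas lost before any repair
        print("P(losing all %d replicas): %.4f%%" % (replicas, p_all_lost * 100))
        # -> 0.0125%. And since the system re-replicates within hours of a
        # failure, the real window for losing the remaining copies is far
        # smaller than a year, so the true loss probability is lower still.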

    Misc

  • Push changes out quickly rather than wait for QA.
  • Libraries are the predominant way of building programs.
  • Some applications are provided as services, like crawling.
  • An infrastructure handles versioning of applications so they can be released without fear of breaking things.

    Future Directions for Google

  • Support geo-distributed clusters.
  • Create a single global namespace for all data. Currently data is segregated by cluster.
  • More and better automated migration of data and computation.
  • Solve consistency issues that happen when you couple wide area replication with network partitioning (e.g. keeping services up even if a cluster goes offline for maintenance or due to some sort of outage).

    Lessons Learned

  • Infrastructure can be a competitive advantage. It certainly is for Google. They can roll out new internet services faster, cheaper, and at a scale few others can compete with. Many companies take a completely different approach, treating infrastructure as an expense. Each group will use completely different technologies and there will be little planning and commonality of how to build systems. Google thinks of themselves as a systems engineering company, which is a very refreshing way to look at building software.
  • Spanning multiple data centers is still an unsolved problem. Most websites are in one and at most two data centers. How to fully distribute a website across a set of data centers is, shall we say, tricky.
  • Take a look at Hadoop (product) if you don't have the time to rebuild all this infrastructure from scratch yourself. Hadoop is an open source implementation of many of the same ideas presented here.
  • An under appreciated advantage of a platform approach is junior developers can quickly and confidently create robust applications on top of the platform. If every project needs to create the same distributed infrastructure wheel you'll run into difficulty because the people who know how to do this are relatively rare.
  • Synergy isn't always crap. By making all parts of a system work together an improvement in one helps them all. Improve the file system and everyone benefits immediately and transparently. If every project uses a different file system then there's no continual incremental improvement across the entire stack.
  • Build self-managing systems that work without having to take the system down. This allows you to more easily rebalance resources across servers, add more capacity dynamically, bring machines off line, and gracefully handle upgrades.
  • Create a Darwinian infrastructure. Perform time consuming operations in parallel and take the winner.
  • Don't ignore the Academy. Academia has a lot of good ideas that don't get translated into production environments. Most of what Google has done has prior art, just not prior large scale deployment.
  • Consider compression. Compression is a good option when you have a lot of CPU to throw around and limited IO.


    Friday
    Nov 14, 2008

    Paper: Pig Latin: A Not-So-Foreign Language for Data Processing

    Yahoo has developed a new language called Pig Latin that fits in a sweet spot between high-level declarative querying in the spirit of SQL and low-level procedural programming à la map-reduce, combining the best of both worlds. The accompanying system, Pig, is fully implemented, and compiles Pig Latin into physical plans that are executed over Hadoop, an open-source map-reduce implementation. Pig has just graduated from the Apache Incubator and joined Hadoop as a subproject. The paper has a few examples of how engineers at Yahoo! are using Pig to dramatically reduce the time required for the development and execution of their data analysis tasks, compared to using Hadoop directly. References: Apache Pig Wiki


    Sunday
    Sep 28, 2008

    Product: Happy = Hadoop + Python

    Has a Java-only Hadoop been getting you down? Now you can be Happy. Happy is a framework for writing map-reduce programs for Hadoop using Jython. It files off the sharp edges on Hadoop and makes writing map-reduce programs a breeze. There's really no history yet on Happy, but I'm delighted at the idea of being able to map-reduce in other languages. The more ways the better. From the website:

    Happy is a framework that allows Hadoop jobs to be written and run in Python 2.2 using Jython. It is an 
    easy way to write map-reduce programs for Hadoop, and includes some new useful features as well. 
    The current release supports Hadoop 0.17.2.
    
    Map-reduce jobs in Happy are defined by sub-classing happy.HappyJob and implementing a 
    map(records, task) and reduce(key, values, task) function. Then you create an instance of the 
    class, set the job parameters (such as inputs and outputs) and call run().
    
    When you call run(), Happy serializes your job instance and copies it and all accompanying 
    libraries out to the Hadoop cluster. Then for each task in the Hadoop job, your job instance is 
    de-serialized and map or reduce is called.
    
    The task results are written out using a collector, but aggregate statistics and other roll-up 
    information can be stored in the happy.results dictionary, which is returned from the run() call.
    
    Jython modules and Java jar files that are being called by your code can be specified using 
    the environment variable HAPPY_PATH. These are added to the Python path at startup, and 
    are also automatically included when jobs are sent to Hadoop. The path is stored in happy.path 
    and can be edited at runtime. 
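
    Based on that description, a Happy job would look roughly like the sketch below. Only happy.HappyJob, the map(records, task) and reduce(key, values, task) signatures, and run() come from the text above; the record format, the task.collect() call, and the job-parameter attribute names are assumptions.

        # Sketch of a word count job per the Happy description above. The
        # task.collect() calls, (key, value) record format, and attribute
        # names for inputs/outputs are assumptions, not documented API.
        import happy

        class WordCount(happy.HappyJob):
            def __init__(self, inputpath, outputpath):
                happy.HappyJob.__init__(self)
                self.inputpaths = inputpath   # assumed job parameter names
                self.outputpath = outputpath

            def map(self, records, task):
                for _, value in records:      # assumed (key, value) records
                    for word in value.split():
                        task.collect(word, "1")

            def reduce(self, key, values, task):
                task.collect(key, str(sum(int(v) for v in values)))

        job = WordCount("input", "output")
        results = job.run()   # returns the happy.results dictionary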
    


    Wednesday
    Apr 23, 2008

    Behind The Scenes of Google Scalability

    The recent Data-Intensive Computing Symposium brought together experts in system design, programming, parallel algorithms, data management, scientific applications, and information-based applications to better understand existing capabilities in the development and application of large-scale computing systems, and to explore future opportunities. Google Fellow Jeff Dean gave a very interesting presentation on Handling Large Datasets at Google: Current Systems and Future Directions. He discussed:

  • Hardware infrastructure
  • Distributed systems infrastructure: the scheduling system, GFS, BigTable, MapReduce
  • Challenges and future directions: infrastructure that spans all datacenters, more automation

    It is really a "How does Google work" presentation in ~60 slides. Check out the slides and the video!
