Entries in google (39)

Saturday
Sep 11, 2010

Google's Colossus Makes Search Real-time by Dumping MapReduce

As the kings of scaling, when Google changes its search infrastructure to do something completely different, it's news. In Google search index splits with MapReduce, an exclusive interview by Cade Metz with Eisar Lipkovitz, a senior director of engineering at Google, we learn a bit more about the secret scaling sauce behind Google Instant, Google's new faster, real-time search system.

The challenge for Google has been how to support a real-time world when the core of their search technology, the famous MapReduce, is batch oriented. Simple: they got rid of MapReduce. At least, they got rid of MapReduce as the backbone for calculating search indexes. MapReduce still excels as a general query mechanism against masses of data, but real-time search requires a very specialized tool, and Google built one. Internally, the successor to Google's famed Google File System was code-named Colossus.

Details are slowly coming out about their new goals and approach:


Friday
Jan 22, 2010

How BuddyPoke Scales on Facebook Using Google App Engine

How do you scale a viral Facebook app that has skyrocketed to a mind-boggling 65 million installs (the population of France)? That's the fortunate problem BuddyPoke co-founder Dave Westwood has, and he talked about his solution at Wednesday's Facebook Meetup. Slides for the complete talk are here. For those not quite sure what BuddyPoke is, it's a social network application that lets users show their mood and hug, kiss, and poke their friends through online avatars.

In many ways BuddyPoke is the quintessentially modern web application. It thrives off the energy of social network driven ecosystems. Game play mechanics, viral loops, and creative monetization strategies are all part of its everyday design. It mashes together different technologies, not in a dark Frankensteining sort of way, but in a smart way that gets the most bang for the buck. Part of it runs on Facebook's servers (free). Part of it runs in Flash in the browser (free). Part of it runs on a storage cloud (higher cost). And part of it runs on a Platform as a Service environment, Google App Engine (low cost). It also integrates tightly with other services like PayPal (which takes a slice). Real $$$ are made selling virtual goods like gold coins redeemable in pokes. Users can also have their avatars made into dolls, t-shirts, and a whole army of other Zazzle-powered gifts.


Monday
Jan 11, 2010

Strategy: Don't Use Polling for Real-time Feeds

Ivan Zuzak wrote a fascinating article on Real-time feed processing and filtering, using Google App Engine to build Feed-buster, a service that inserts MediaRSS tags into feeds that don't have them. He talks about using polling and PubSubHubbub (real-time) to process FriendFeed feeds. Ivan is trying to devise separate filtering and processing services where:

  1. Filtering services should be applied as close to the publisher as possible, so notifications that nobody wants don't waste network resources.
  2. Processing services should be applied as close to the subscriber as possible, so that the original update can be transported through the network as a single notification for as long as possible.

Besides being a generally interesting article, Ivan makes an insightful observation on the nature of using polling services in combination with metered Infrastructure/Platform services:

Polling is bad because AppEngine applications have a fixed free daily quota for consumed resources. When the number of feeds the service processed increased, the daily quota was exhausted before the end of the day, because FriendFeed polls the service for each feed every 45 minutes.
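
To put rough numbers on that (the feed count here is illustrative, not Ivan's): polling each feed every 45 minutes means 24 × 60 / 45 = 32 requests per feed per day, so 10,000 feeds generate around 320,000 requests a day against a fixed daily quota, whether or not anything actually changed.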

This fits directly in with the ideas in Cloud Programming Directly Feeds Cost Allocation Back into Software Design. My general preference is to poll a distributed queue for work items. It's robust and allows your system to control its own resource usage by deciding when to poll; otherwise you can easily be overwhelmed by fast pushers. Here the overwhelming is going the other way: your budget is being overwhelmed by the polling requests. And the more you try to approximate real-time with frequent polling, the more your budget is busted.
It's a cool example of how costs, algorithms, and platform choices all feed into and shape product architectures.
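
Below is a minimal sketch in Python of that pull model (illustrative only, not Feed-buster's actual code; the in-memory deque and the process() stub stand in for a real distributed queue and real work):

    import time
    from collections import deque

    work_queue = deque(["item-1", "item-2"])  # stand-in for a distributed queue

    def process(item):
        print("processing", item)  # stand-in for the real work

    def poll_loop(queue, min_delay=1.0, max_delay=60.0):
        """Pull work at a rate the consumer chooses, not the producer."""
        delay = min_delay
        while True:
            if queue:
                process(queue.popleft())
                delay = min_delay  # there's work: poll again promptly
            else:
                time.sleep(delay)  # idle: back off to protect the budget
                delay = min(delay * 2, max_delay)  # capped exponential backoff

The backoff cap is the knob that trades latency against cost: tighten it to approximate real-time, loosen it to protect the quota.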


Friday
Oct 30, 2009

Hot Scalability Links for October 30, 2009

Sunday
Jun 28, 2009

Google Voice Architecture

Hi High Scalability community!

Do you have any information on the architecture behind Google Voice, the new service from Google that offers one Google Number for all your calls and SMS? It is based on GrandCentral, which was acquired by Google two years ago.

Thanks!

Friday
Jun 5, 2009

Google Wave Architecture

Update: Good Vibrations by Radovan Semančík. Lots of interesting questions about how Wave works: scalability, security, RESTfulness, and so on.

Google Wave is a new communication and collaboration platform based on hosted XML documents (called waves) supporting concurrent modifications and low-latency updates. This platform enables people to communicate and work together in new, convenient and effective ways. We will offer these benefits to users of Google Wave and we also want to share them with everyone else by making waves an open platform that everybody can share. We welcome others to run wave servers and become wave providers, for themselves or as services for their users, and to "federate" waves, that is, to share waves with each other and with Google Wave. In this way users from different wave providers can communicate and collaborate using shared waves. We are introducing the Google Wave Federation Protocol for federating waves between wave providers on the Internet.

Here are the initial white papers that are available to complement the Google Wave Federation Protocol:

  • Google Wave Federation Architecture

  • Google Wave Data Model and Client-Server Protocol

  • Google Wave Operational Transform

  • General Verifiable Federation

The Google Wave APIs are documented here.
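
The operational transform paper is the heart of how waves support concurrent modification. Here is a toy sketch of the core idea in Python (handling only insertions into a plain string; the real system transforms a much richer operation set over XML documents): each replica applies its own edit first, then the remote edit transformed against it, and both converge to the same document:

    def transform(op, other, other_wins_ties):
        """Shift op so it still applies after other has been applied.
        Ops are (position, text) insertions into the same original string."""
        pos, text = op
        opos, otext = other
        if opos < pos or (opos == pos and other_wins_ties):
            pos += len(otext)
        return (pos, text)

    def apply_op(s, op):
        pos, text = op
        return s[:pos] + text + s[pos:]

    # Two users edit "hello" concurrently:
    doc = "hello"
    a = (5, " world")  # user A appends
    b = (0, "oh, ")    # user B prepends

    # The tie-break flags must be opposite so both replicas order
    # equal-position inserts the same way.
    replica_a = apply_op(apply_op(doc, a), transform(b, a, other_wins_ties=False))
    replica_b = apply_op(apply_op(doc, b), transform(a, b, other_wins_ties=True))
    assert replica_a == replica_b == "oh, hello world"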

Thursday
Jun 4, 2009

New Book: Even Faster Web Sites: Performance Best Practices for Web Developers

Performance is critical to the success of any web site, and yet today's web applications push browsers to their limits with increasing amounts of rich content and heavy use of Ajax. In his new book Even Faster Web Sites: Performance Best Practices for Web Developers, Steve Souders, web performance evangelist at Google and former Chief Performance Yahoo!, provides valuable techniques to help you optimize your site's performance.

Souders' previous book, the bestselling High Performance Web Sites, shocked the web development world by revealing that 80% of the time it takes for a web page to load is on the client side. In Even Faster Web Sites, Souders and eight expert contributors provide best practices and pragmatic advice for improving your site's performance in three critical categories:

  • JavaScript - Get advice for understanding Ajax performance, writing efficient JavaScript, creating responsive applications, loading scripts without blocking other components, and more.

  • Network - Learn to share resources across multiple domains, reduce image size without loss of quality, and use chunked encoding to render pages faster.

  • Browser - Discover alternatives to iframes, how to simplify CSS selectors, and other techniques.

Speed is essential for today's rich media web sites and Web 2.0 applications. With this book, you'll learn how to shave precious seconds off your sites' load times and make them respond even faster.
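
To make the chunked-encoding tip concrete, here is a minimal sketch as a generic Python WSGI app (our example, not one from the book): when no Content-Length is set, an HTTP/1.1 server can stream the response with chunked transfer encoding, so the browser receives the <head> and starts fetching CSS before the slow part of the page is computed:

    import time
    from wsgiref.simple_server import make_server

    def slow_body():
        time.sleep(1)  # stand-in for slow backend work (queries, rendering)
        return b"<body>page content</body>"

    def app(environ, start_response):
        # No Content-Length header: an HTTP/1.1 server may stream this
        # response chunk by chunk as the generator yields.
        start_response("200 OK", [("Content-Type", "text/html")])
        yield b"<html><head><link rel='stylesheet' href='/site.css'></head>"
        yield slow_body()
        yield b"</html>"

    if __name__ == "__main__":
        make_server("", 8000, app).serve_forever()

(wsgiref itself speaks HTTP/1.0 and will simply stream and close the connection; production servers apply real chunked encoding.)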

About the Author

Steve Souders works at Google on web performance and open source initiatives. His book High Performance Web Sites explains his best practices for performance along with the research and real-world results behind them. Steve is the creator of YSlow, the performance analysis extension to Firebug. He is also co-chair of Velocity 2008, the first web performance conference sponsored by O'Reilly. He frequently speaks at such conferences as OSCON, Rich Web Experience, Web 2.0 Expo, and The Ajax Experience.

Steve previously worked at Yahoo! as the Chief Performance Yahoo!, where he blogged about web performance on Yahoo! Developer Network. He was named a Yahoo! Superstar. Steve worked on many of the platforms and products within the company, including running the development team for My Yahoo!.

Wednesday
Apr 29, 2009

Presentations: MySQL Conference & Expo 2009

The presentations from the MySQL Conference & Expo 2009, held April 20-23 in Santa Clara, are available at the link above.

They include:

  • Beginner's Guide to Website Performance with MySQL and memcached by Adam Donnison

  • Calpont: Open Source Columnar Storage Engine for Scalable MySQL DW by Jim Tommaney

  • Creating Quick and Powerful Web Applications with MySQL, GlassFish, and NetBeans by Arun Gupta

  • Deep-inspecting MySQL with DTrace by Domas Mituzas

  • Distributed Innodb Caching with memcached by Matthew Yonkovit and Yves Trudeau

  • Improving Performance by Running MySQL Multiple Times by MC Brown

  • Introduction to Using DTrace with MySQL by Vince Carbone

  • MySQL Cluster 7.0 - New Features by Johan Andersson

  • Optimizing MySQL Performance with ZFS by Allan Packer

  • SAN Performance on an Internal Disk Budget: The Coming Solid State Disk Revolution by Matthew Yonkovit

  • This is Not a Web App: The Evolution of a MySQL Deployment at Google by Mark Callaghan

Thursday
Mar 12, 2009

Google TechTalk: Amdahl's Law in the Multicore Era

Over the last several decades computer architects have been phenomenally successful turning the transistor bounty provided by Moore's Law into chips with ever increasing single-threaded performance. During many of these successful years, however, many researchers paid scant attention to multiprocessor work. Now, as vendors turn to multicore chips, researchers are reacting with more papers on multi-threaded systems. While this is good, we are concerned that further work on single-thread performance will be squashed. To help understand future high-level trade-offs, we develop a corollary to Amdahl's Law for multicore chips [Hill & Marty, IEEE Computer 2008]. It models fixed chip resources for alternative designs that use symmetric cores, asymmetric cores, or dynamic techniques that allow cores to work together on sequential execution. Our results encourage multicore designers to view the performance of the entire chip rather than focus on core efficiencies. Moreover, we observe that obtaining optimal multicore performance requires further research both in extracting more parallelism and in making sequential cores faster. This talk is based on an HPCA 2008 keynote address.

Speaker: Mark D. Hill

Mark D. Hill (http://www.cs.wisc.edu/~markhill) is a professor in both the computer sciences department and the electrical and computer engineering department at the University of Wisconsin-Madison, where he also co-leads the Wisconsin Multifacet project (http://www.cs.wisc.edu/multifacet/) with David Wood. His research interests include parallel computer system design, memory system design, computer simulation, and recently transactional memory. He earned a PhD from the University of California, Berkeley. He is an ACM Fellow and a Fellow of the IEEE.
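
For the curious, the model from the Hill & Marty paper is easy to play with in a few lines of Python (the formulas follow the paper; the resource budget and parallel fraction below are example numbers, not from the talk). A chip has n base-core-equivalents (BCEs), and a core built from r BCEs runs at perf(r) = sqrt(r):

    import math

    def perf(r):
        # Hill & Marty's assumption: r base-core-equivalents (BCEs)
        # fused into one core give sqrt(r) times base-core performance.
        return math.sqrt(r)

    def speedup_symmetric(f, n, r):
        # n/r identical cores of r BCEs each. The sequential fraction
        # (1 - f) runs on one core; the parallel fraction f on all n/r.
        return 1 / ((1 - f) / perf(r) + f * r / (perf(r) * n))

    def speedup_asymmetric(f, n, r):
        # One big core of r BCEs plus (n - r) base cores; the parallel
        # fraction uses the big core and all the small ones together.
        return 1 / ((1 - f) / perf(r) + f / (perf(r) + n - r))

    # Example: 256 BCEs, 97.5% parallel code.
    for r in (1, 4, 16, 64, 256):
        print(f"r={r:3d}  sym={speedup_symmetric(0.975, 256, r):6.1f}  "
              f"asym={speedup_asymmetric(0.975, 256, r):6.1f}")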


Sunday
Jan 4, 2009

Paper: MapReduce: Simplified Data Processing on Large Clusters

Update: MapReduce and PageRank Notes from Remzi Arpaci-Dusseau's Fall 2008 class. Collects interesting facts about MapReduce and PageRank. For example, the history of the solution to searching for the term "flu" is traced through multiple generations of technology.

With Google entering the cloud space with Google App Engine and a maturing Hadoop product, the MapReduce scaling approach might finally become a standard programmer practice. This is the best paper on the subject and is an excellent primer on a content-addressable memory future.

Some interesting stats from the paper: Google executes 100k MapReduce jobs each day; more than 20 petabytes of data are processed per day; more than 10k MapReduce programs have been implemented; machines are dual processor with gigabit ethernet and 4-8 GB of memory.

One common criticism from ex-Googlers is that it takes months to get up to speed and be productive in the Google environment. Hopefully a way will be found to lower the learning curve and make programmers productive faster.

From the abstract: MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.

Thanks to Kevin Burton for linking to the complete article.
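
For anyone who hasn't read the paper yet, here is the user-visible contract in miniature in Python, using the paper's own word-count example (a single-process toy; the real runtime shards both phases across thousands of machines and handles their failures):

    from collections import defaultdict

    def map_fn(document):
        # Emit an intermediate (key, value) pair for every word.
        for word in document.split():
            yield word, 1

    def reduce_fn(word, counts):
        # Fold together all values that share a key.
        return word, sum(counts)

    def run_mapreduce(documents):
        groups = defaultdict(list)
        for doc in documents:               # map phase
            for key, value in map_fn(doc):
                groups[key].append(value)   # shuffle: group by key
        return dict(reduce_fn(k, v) for k, v in groups.items())  # reduce

    print(run_mapreduce(["the flu season", "searching for flu"]))
    # {'the': 1, 'flu': 2, 'season': 1, 'searching': 1, 'for': 1}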

Related Articles

  • MapReducing 20 petabytes per day by Greg Linden
  • 2004 Version of the Article by Jeffrey Dean and Sanjay Ghemawat
