How Google Taught Me to Cache and Cash-In

Saturday

Sep122009

How Google Taught Me to Cache and Cash-In

Saturday, September 12, 2009 at 2:31AM

A user named Apathy on how Reddit scales some of their features, shares some advice he learned while working at Google and other major companies.

To be fair, I [Apathy] was working at Google at the time, and every job I held between 1995 and 2005 involved at least one of the largest websites on the planet. I didn't come up with any of these ideas, just watched other smart people I worked with who knew what they were doing and found (or wrote) tools that did the same things. But the theme is always the same:

Cache everything you can and store the rest in some sort of database (not necessarily relational and not necessarily centralized).
Cache everything that doesn't change rapidly. Most of the time you don't have to hit the database for anything other than checking whether the users' new message count has transitioned from 0 to (1 or more).
Cache everything--templates, user message status, the front page components--and hit the database once a minute or so to update the front page, forums, etc. This was sufficient to handle a site with a million hits a day on a couple of servers. The site was sold for $100K.
Cache the users' subreddits. Blow out the cache on update.
Cache the top links per subreddit. Blow out cache on update.
Combine the previous two steps to generate a menu from cached blocks.
Cache the last links. Blow out the cache on each outlink click.
Cache the user's friends. Append 3 characters to their name.
Cache the user's karma. Blow out on up/down vote.
Filter via conditional formatting, CSS, and an ajax update.
Decouple selection/ranking algorithm(s) from display.
Use Google or something like Xapian or Lucene for search.
Cache "for as long as memcached will stay up." That depends on how many instances you're running, what else is running, how stable the Python memcached hooks are, etc.
The golden rule of website engineering is that you don't try to enforce partial ordering simultaneously with your updates.
When running a search engine operate the crawler separately from the indexer.
Ranking scores are used as necessary from the index, usually cached for popular queries.
Re-rank popular subreddits or the front page once a minute. Tabulate votes and pump them through the ranker.
Cache the top 100 per subreddit. Then cache numbers 100-200 when someone bothers to visit the 5th page of a subreddit, etc.
For less-popular subreddits, you cache the results until an update comes in.
With enough horsepower and common sense, almost any volume of data can be managed, just not in realtime.
Never ever mix your reads and writes if you can help it.
Merge all the normalized rankings and cache the output every minute or so. This avoids thousands of queries per second just for personalization.
It's a lot cheaper to merge cached lists than build them from scratch. This delays the crushing read/write bottleneck at the database. But you have to write the code.
Layering caches is a clasisc strategy for milking your servers as much as possilbe. First look for an exact match. If that's not found, look for the components and build an exact match.
The majority of traffic on almost all websites comes from the default, un-logged-in front page or from random forum/comment/result pages. Make sure those are cached as much as possible.. If one or more of the components aren't found, regenerate those from the DB (now it's cached!) and proceed. Never hit the database unless you have to.
You (almost) always have to hit the database on writes. The key is to avoid hitting it for reads until you're forced to do so.

Todd Hoff |

5 Comments |

Permalink |

Print Article

Email Article

Caching,

Strategy

Reader Comments (5)

You don't have to hit the db for a write depending on how critical the update is. If you don't need (near) real-time then you can dump the updates out and then read them into the database later. We use a system like that for quite a few things. Keeps the request time down since the write only requires a file handle and not a remote socket call.

And yes, you must cache if you want to truly scale. But you also have to get good at building out logical blocks for your data so it's easy to access. Usually you can get to issues only when trying to merge large data sets into a view for the application. The choosing storage layer actually isn't the most important decision in my opinion since it should reflect what the application is going to require of the data layer.

Some of the biggest problems we've had with performance is trying to just use some system we already have because we have it and not trying new systems that might be more appropriate for the application requirements. Then trying to force the application to deal with the storage layer. We're working on that though.

---------------------------------------------------------------
http://blog.pe-ell.net http://wetnun.net

December 31, 1999 |

jstephens

What do you mean by "don't try to enforce partial ordering simultaneously with your updates."? Please elaborate.

December 31, 1999 |

Anonymous

I second the "don't try to enforce partial ordering simultaneously with your updates" question.

December 31, 1999 |

Anonymous

THINK WE'RE ALL GOT CONFUSED WITH THAT SENTENCE EHY?

-Nhoel

December 31, 1999 |

Nhoel

My impression when reading was that an update has to be sent to a set of services (database, search, log, etc). Don't try and make an update synchronous and transactional. Send it to each subsystem, let it be processed on its own, and when it's done it's done. Trying for a strong form of consistency or ordering across different services isn't practical or performant. Could be wrong though...

December 31, 1999 |

Todd Hoff

Post a New Comment

Enter your information below to add a new comment.

Author:

Author Email (optional):

Author URL (optional):

Post:

↓ | ↑

Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>