Strategy: Caching 404s Saved the Onion 66% on Server Time

In the article The Onion Uses Django, And Why It Matters To Us, a lot of interesting points are made about their ambitious infrastructure move from Drupal/PHP to Django/Python:
- The move wasn't that hard, just time and work, thanks to their previous experience moving the A.V. Club website.
- Churn in core framework APIs makes it more attractive to move than to stay.
- Supporting the structure of older versions of the site is an unsolved problem.
- The built-in Django admin saved a lot of work.
- Group development is easier with "fewer specialized or hacked together pieces."
- They use IRC for distributed development, Sphinx for full-text search, and Git for version control.
- nginx is the media server and reverse proxy; HAProxy made the launch a 5-second procedure; Capistrano handles deployment.
- Clean component separation makes moving easier.
- The ORM with complicated querysets is a performance problem; memcached caches rendered pages.
- The CDN checks for updates every 10 minutes; videos, articles, images, and 404 pages are all served by the CDN.
But the most surprising point had to be:
And the biggest performance boost of all: caching 404s and sending Cache-Control headers to the CDN on 404. Upwards of 66% of our server time is spent on serving 404s from spiders crawling invalid urls and from urls that exist out in the wild from 6-10 years ago. [Edit: We dropped our outgoing bandwidth by about 66% and our load average on our web server cluster by about 50% after implementing that change].
A minority of our links are from old content that I no longer know the urls for. We redirect everything from 5 years ago and up. Stuff originally published 6-10 years ago could potentially be redirected, but none of it came from a database and was all static HTML in its initial incarnation and redirects weren't maintained before I started working for The Onion.
Spiders make up the vast majority of my 404s. They request URIs that simply are not present in our markup. I can't fix a broken spider and tell it not to request these links that do not even exist, but I still have to serve their 404s.
Our 404 pages were not cached by the CDN. Allowing them to be cached reduced the origin penetration rate substantially enough to amount to a 66% reduction in outgoing bandwidth over uncached 404s.
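Here is a minimal sketch of what "sending Cache-Control headers to the CDN on 404" could look like in Django. This is an illustration, not The Onion's actual code, and the ten-minute max-age is an arbitrary choice:

```python
# Sketch only: mark 404 responses as cacheable so a CDN or reverse proxy
# can serve them from its own cache instead of hitting the origin again.
from django.utils.cache import patch_cache_control

class Cacheable404Middleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        response = self.get_response(request)
        if response.status_code == 404:
            # Emits "Cache-Control: public, max-age=600" so shared caches
            # may keep the 404 for ten minutes.
            patch_cache_control(response, public=True, max_age=600)
        return response
```

Adding the class to MIDDLEWARE in settings.py is all the origin has to do; honoring the header is the CDN's job.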
Edit: This is not to say that our 404s were not cached at our origin. Our precomputed 404 was cached and served out without a database hit on every connection; however, this still invokes the regular expression engine for URL pattern matching and taxes the machine's network I/O resources.
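For the origin side, a precomputed 404 might look like the hedged sketch below: the error page is rendered once and reused, so no template rendering or database query runs per request, though URL resolution (the regex matching mentioned above) still has to fail before the handler fires. The module path and template name are illustrative.

```python
# Sketch of a precomputed 404: render the error page once, reuse the result.
from django.http import HttpResponseNotFound
from django.template.loader import render_to_string

_PRERENDERED_404 = None

def prerendered_404(request, exception=None):
    global _PRERENDERED_404
    if _PRERENDERED_404 is None:
        # Rendered lazily on the first 404, then kept in memory.
        _PRERENDERED_404 = render_to_string("404.html")
    return HttpResponseNotFound(_PRERENDERED_404)

# In the root urls.py:
# handler404 = "myproject.views.prerendered_404"
```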
No joke, irony, or sarcasm intended. Most of this traffic is from spiders looking for made up pages so preserving URLs isn't the issue. The issue was reducing the impact of these poisonous spiders, and caching the 404 page was the antidote. Even if you haven't been on the web for over a decade like The Onion, there may be a big win within easy reach.
Related Articles
- Hacker News Thread on the Article in which John Onion patiently tries to explain why The Onion isn't pissing away ad revenue.
- HTTP 404 Response Code
- Fighting Linkrot by Jakob Nielsen
Reader Comments (5)
The reason 404s made up such a large percentage is because The Onion is CDNed: legit (non-404) requests are mostly served by the CDN. But since the 404s were not cached by the CDN, the CDN sent those requests to the origin server. Thus 404s were a disproportionate percentage of hits on the origin servers.
What about stopping some of these spiders? Does anyone have a good list of bad spiders?
On one blog I read, the Bing spider did repeated queries on some web pages in a short amount of time, putting a load on the site, so the owner had to ban it.
Bad move, Onions... if those crawlers with a long memory happen to be Googlebot, you're throwing away plenty of revenue in the form of SEO. Old pages from years ago equal a big embrace from an 800 lb. gorilla/spider/octopus/elephant-in-the-room.
"Most of this traffic is from spiders looking for made up pages so preserving URLs isn't the issue."
If by "most" you mean above 50%, then they are still throwing away up to 25% of their traffic. In addition, spiders will only crawl links from other websites, so they are throwing away the SEO benefits of their incoming links, too. It's a good thing they write funny stories. :-)
"bad move ONions..if those crawlers with a long memory happen to be googlebot, your throwing away plenty of revenue in the form of SEO. Old pages from years ago equals a big embrace from an 800lb. gorilla/spider/octopus/elephant-in-the-room."
They aren't throwing anything away, just moving the load of serving these 404 pages from their servers to the CDN.
So before this change, when you requested some nonexistent page from the server, it was like:
client -> CDN -> Origin Server and back to the CDN and to the client.
Now when somebody requests a page, the CDN checks its cache; if the page isn't there or isn't allowed to be served from cache, the CDN requests it from the Origin Server, which serves the page, and if that's a 404, it sends a header telling the CDN to cache that page.
So next time the request will be directly served from the CDN's local cache.
Tyrael
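A toy sketch of the flow Tyrael describes (illustrative names, not any real CDN's API): the edge serves from its own cache when it can, falls through to the origin on a miss, and stores any response, 404 included, whose Cache-Control header carries a positive max-age.

```python
import time

class ToyCDN:
    """Toy edge cache: origin is a callable path -> (status, headers, body)."""

    def __init__(self, origin):
        self.origin = origin
        self.cache = {}  # path -> (expires_at, response)

    def get(self, path):
        hit = self.cache.get(path)
        if hit and hit[0] > time.time():
            return hit[1]  # served from the edge; the origin is never touched
        status, headers, body = self.origin(path)
        cc = headers.get("Cache-Control", "")
        if "max-age=" in cc:
            max_age = int(cc.split("max-age=")[1].split(",")[0])
            if max_age > 0:
                # Store the response (even a 404) until max-age expires.
                self.cache[path] = (time.time() + max_age, (status, headers, body))
        return (status, headers, body)
```

With this in place, the second request for the same bad URL is answered entirely at the edge, which is the drop in origin load the article describes.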