Saturday, August 16, 2008
Strategy: Serve Pre-generated Static Files Instead Of Dynamic Pages

Pre-generating static files is an oldie but a goodie, and as Thomas Brox Røst says, it's probably an underused strategy today. At one time this was the dominant technique for structuring a web site. Then the age of dynamic web sites arrived and we spent all our time worrying about how to make the database faster and adding more caching to recover the speed we had lost in the transition from static to dynamic.
Static files have the advantage of being very fast to serve. Read from disk and display. Simple and fast. Especially when caching proxies are used. The issues are how to bulk generate the initial files, how to serve them, and how to keep the changed files up to date. This is the process Thomas covers in his excellent article "Serving static files with Django and AWS - going fast on a budget", where he explains how he converted 600K previously dynamic pages to static pages for his site Eventseer.net, a service for tracking academic events.
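To make the idea concrete, here is a minimal sketch of what pre-rendering a single page might look like in Django. The template name, context, and output path are invented for illustration; the article doesn't show Eventseer's actual code.

    # Render a Django template once and persist it as a plain HTML file that the
    # web server (lighttpd in Eventseer's case) can serve without touching Django.
    import os
    from django.template.loader import render_to_string

    def write_static_page(template_name, context, output_path):
        html = render_to_string(template_name, context)
        os.makedirs(os.path.dirname(output_path), exist_ok=True)
        with open(output_path, "w", encoding="utf-8") as f:
            f.write(html)

    # e.g. write_static_page("event_detail.html", {"event": event},
    #                        "/var/www/static/events/123.html")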
Eventseer.net was experiencing performance problems as search engines crawled their 600K dynamic pages. As a solution you could imagine scaling up, adding more servers, adding sharding, etc., all somewhat complicated approaches. Their solution was to convert the dynamic pages to static pages in order to keep search engines from killing the site. As an added bonus, non-logged-in users experienced a much faster site and were more likely to sign up for the service.
The article does a good job explaining what they did, so I won't regurgitate it all here, but I will cover the highlights and comment on some additional potential features and alternate implementations...
They estimated it would take 7 days on a single server to generate the initial 600K pages. Ouch. So what they did was use EC2 for what it's good for: spinning up a lot of boxes to process data. Their data is backed up on S3, so the EC2 instances could read the data from S3, generate the static pages, and write them to their deployment area. It took 5 hours, 25 EC2 instances, and a meager $12.50 to perform the initial bulk conversion. Pretty slick.
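Roughly, each EC2 worker in the bulk run would do something like the sketch below: pull the backed-up record from S3, render the page, and push the finished HTML to the deployment area. The bucket names, key layout, and render helper are assumptions, and the sketch uses boto3, which didn't exist in 2008.

    import json
    import boto3

    s3 = boto3.client("s3")

    def generate_page(event_key):
        # Read the backed-up event data from S3 instead of hitting the database.
        obj = s3.get_object(Bucket="eventseer-backup", Key=event_key)
        event = json.loads(obj["Body"].read())

        html = render_event_page(event)  # hypothetical Django rendering helper

        # Write the finished static page to the deployment area (here, another bucket).
        s3.put_object(
            Bucket="eventseer-static",
            Key="events/%s.html" % event["id"],
            Body=html.encode("utf-8"),
            ContentType="text/html",
        )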
The next trick is figuring out how to regenerate static pages when changes occur. When a new event is added to their system, hundreds of pages could be impacted, which would require the affected static pages to be regenerated. Since it's not important to update pages immediately, they queued updates for processing later. An excellent technique. A local queue of changes is maintained and replicated to an AWS SQS queue. The local queue is used in case SQS is down.
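The enqueue side can stay tiny. Here is a sketch of what pushing regeneration work to SQS might look like; the queue name, message format, and the dependency lookup are all assumptions.

    import json
    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = sqs.get_queue_url(QueueName="page-regeneration")["QueueUrl"]

    def on_event_added(event):
        # The web request only records which pages are now stale;
        # nothing is regenerated inline.
        for page_id in pages_affected_by(event):  # hypothetical dependency lookup
            sqs.send_message(
                QueueUrl=QUEUE_URL,
                MessageBody=json.dumps({"page_id": page_id}),
            )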
Twice a day EC2 instances are started to regenerate pages. The instances read work requests from SQS, access data from S3, regenerate the pages, and shut down when the SQS queue is empty. In addition, they use AWS for all their background processing jobs.
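The consumer side is a drain loop. Something like the sketch below, where regenerate_page() is a hypothetical renderer along the lines of the bulk-generation code above.

    import json
    import boto3

    sqs = boto3.client("sqs")

    def drain_queue(queue_url):
        while True:
            resp = sqs.receive_message(
                QueueUrl=queue_url,
                MaxNumberOfMessages=10,
                WaitTimeSeconds=10,  # long polling; an empty response means we're done
            )
            messages = resp.get("Messages", [])
            if not messages:
                break  # queue drained; the instance can shut itself down
            for msg in messages:
                page_id = json.loads(msg["Body"])["page_id"]
                regenerate_page(page_id)  # hypothetical renderer
                sqs.delete_message(QueueUrl=queue_url,
                                   ReceiptHandle=msg["ReceiptHandle"])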
I like their approach a lot. It's a very pragmatic solution and rock solid in operation. For very little money they offloaded the database by moving work to AWS. If they grow to millions of users (knock on wood), nothing much will have to change in their architecture. The same process will still work, and it still won't cost very much. Far better than trying to add machines locally to handle the load or moving to a more complicated architecture.
Using the backups on S3 as a source for the pages rather than hitting the database is inspired. Your data is backed up and the database is protected. Nice.
Using batched asynchronous work queues rather than synchronously loading the web servers and the database for each change is a good strategy too.
As I was reading I originally thought you could optimize the system so that a page only needed to be generated once. Maybe by analyzing the events or some other magic. Then it hit me that this was old-style thinking. Don't be fancy. Just keep regenerating each page as needed. If a page is regenerated a thousand times versus only once, who cares? There's plenty of cheap CPU available.
The local queue of changes still bothers me a little because it adds a complication to the system. The local queue and the AWS SQS queue must be kept in sync. I understand that missing a change would be a disaster because the dependent pages would never be regenerated and nobody would ever know. The page would only be regenerated the next time an event happened to impact it. If pages are regenerated frequently this isn't a serious problem, but seldom-touched pages may never be regenerated again.
Personally I would drop the local queue. SQS goes down infrequently. When it does go down I would record that fact and regenerate all the pages when SQS comes back up. This is a simpler and more robust architecture, assuming SQS is mostly reliable.
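In code that could be as simple as the sketch below: try SQS directly, and if the send fails, drop a flag so the next maintenance run rebuilds everything instead of a subset. The flag file path and helper names are arbitrary choices for illustration.

    import pathlib
    import botocore.exceptions

    FULL_REBUILD_FLAG = pathlib.Path("/var/run/regenerate-all-pages")

    def enqueue_or_flag(send_to_sqs, message):
        try:
            send_to_sqs(message)
        except (botocore.exceptions.BotoCoreError, botocore.exceptions.ClientError):
            # SQS is unavailable, so some changes will be lost; remember that fact
            # and regenerate every page once SQS is reachable again.
            FULL_REBUILD_FLAG.touch()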
Another feature I have implemented in similar situations is to set up a rolling page regeneration schedule, where a subset of pages is periodically regenerated even if no event was detected that would cause a page to be regenerated. This protects against any event drops that may cause data to become undetectably stale. Over a few days, weeks, or whatever, every page is regenerated. It's a relatively cheap way to make a robust system resilient to failures.
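A sketch of what such a rolling schedule might look like, assuming a daily cron job; the cycle length and the all_page_ids()/regenerate_page() helpers are made up.

    import datetime
    import zlib

    CYCLE_DAYS = 14  # example: every page gets rebuilt at least once every two weeks

    def rolling_regeneration(today=None):
        today = today or datetime.date.today()
        bucket = today.toordinal() % CYCLE_DAYS
        for page_id in all_page_ids():  # hypothetical page inventory
            # crc32 is stable across processes, unlike Python's built-in hash().
            if zlib.crc32(str(page_id).encode()) % CYCLE_DAYS == bucket:
                regenerate_page(page_id)  # hypothetical renderer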
Reader Comments (14)
If search engine crawlers are already killing your site, then you really have application performance problems.
This strategy is one of the most basic ones and if you didn't think about that when you designed the platform, then you deserve to fail.
I think that's an excellent idea. I think it mirrors nature in some ways. I imagine a little worker bee whose job is to trawl through the quieter parts of the hive and keep them clean. If you need the hive to be cleaner, put more bees on it. If it's too clean, put fewer bees on it.
Ok, enough of the metaphors. This is a strategy I've long been mulling over, but until now I hadn't read about it anywhere. It's nice to see that it's being used and that it actually works.
With a rolling regeneration schedule, where every page is regenerated at some point, I think you have a very robust system.
Callum (http://www.callum-macdonald.com/)
PS> Todd, the title of my comment defaulted to 71 characters, but it can only be 64. This is a long standing error / bug.
> until now I hadn't read about it anywhere.
It's a fairly common technique in network management systems. One study that I can't find now indicated performing continual error reconciliation runs was one of the better ways of keeping data consistent in an error prone world. The page regeneration is just a variation of that idea. It seems a bit like cheating, but I think of it like an immune system cleaning up after anything that made it through the skin barrier.
And yah, the max subject length sure sucks.
It is indeed not a new concept. See "Improving Web Server Performance by Caching Dynamic Data" from 1997 (http://is.gd/1HCD). The authors discuss a system that was in place for the 1996 Olympics. Even in 1997, the web site for the Olympics generated a lot of traffic.
I am wondering if statically generated HTML pages would work well for a bug tracking site where each bug is pretty much distinct from another. We keep a Google Search Appliance index to enable full text search across the bugs. Currently we have a memcached/Java/Oracle infrastructure. When a bug is updated we update the GSA with a new XML file. If we instead uploaded a static HTML file to GSA, or let GSA crawl a static site, would that make more sense? We would have to generate a new static HTML file every time the bug changes. I am wondering if it would be faster than rendering everything via JSP and memcached every time.
How is this different or better than simply using a caching system -- either at the network level (like SQUID) or at the App Server level (like APC or memcached)? Obviously pregenerated static files will always perform better, but is it significantly better?
This "pre-generated Static files" idea is the way Apple's handling MobileMe--using the SproutCore framework. Code in Ruby then hit "generate" and it builds a site of static files complete with necessary javascript.
http://sproutcore.com
Røst's solution only addresses unauthenticated visitors, which means the site will still be slow for the people most interested in it. The next level of optimization could be dynamic inclusion of user-specific data using Ajax calls from the browser.
All that would need to be done is move the user-specific page generation to a different URL and always let those URLs hit Django instead of lighttpd. Then add some quick Javascript (in your choice of framework) to hit those URLs and include the results in the static page. Voila!
Even better, the Javascript could fetch some minimal JSON data and format it into HTML during insertion. Now only the bare minimum of dynamic page generation is occurring...
I've been saying for a long time that most people don't really need dynamic pages. Take this very page for example; the only dynamic piece here is user comments. If there is no expectation from the user to see changes reflected immediately, then why bother reconstructing the whole thing? Something like a JSON-based way of updating the dynamic pieces, or even IFRAMEs if you are not a bikeshed (http://www.bikeshed.com/) person, works.
I agree that dynamic pages are simply not required for many small companies' websites. If you are starting a small business, I doubt that dynamic pages would be that useful in the short term.
I agree with this strategy. It may not be new, but it works.