Statistics Logging Scalability
Monday, October 1, 2007 at 2:26AM
My company is developing a centralized web platform to service our clients. We currently use about 3Mb/s on our uplink at our ISP serving web pages for about 100 clients. We'd like to offer them statistics that mean something to their businesses and have been contemplating writing our own statistics code to handle the task.
All statistics would be gathered at the page view level, and we're implementing an HttpModule in ASP.Net 2.0 to handle the gathering of the data. That said, I'm curious to hear comments on writing this data (~500 bytes of log data per page request). We need to write this data somewhere and then build a process to aggregate it into a warehouse application used in our reporting system. Google Analytics is out of the question because we do not want our hosting infrastructure dependent upon a remote server. Web Trends et al. are too expensive for our clients.
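Here's roughly what the module looks like. StatsWriter is just a placeholder for whichever storage option we end up picking, and the module would be registered in the <httpModules> section of web.config:

    using System;
    using System.Web;

    public class StatsModule : IHttpModule
    {
        public void Init(HttpApplication app)
        {
            // EndRequest fires once per page request, after the response is produced.
            app.EndRequest += new EventHandler(OnEndRequest);
        }

        private void OnEndRequest(object sender, EventArgs e)
        {
            HttpApplication app = (HttpApplication)sender;
            HttpRequest req = app.Context.Request;

            // Roughly the ~500 bytes of per-request log data mentioned above.
            string record = string.Format("{0:yyyy-MM-dd HH:mm:ss}\t{1}\t{2}\t{3}",
                DateTime.UtcNow,
                req.Url.AbsolutePath,
                req.UserHostAddress,
                req.UrlReferrer == null ? "" : req.UrlReferrer.ToString());

            StatsWriter.Write(record); // placeholder sink; see options 1-4 below
        }

        public void Dispose() { }
    }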
I'm thinking of a few options.
1) Write log data directly to a SQL Server 2000 db and have a Windows Service come in periodically to summarize and aggregate the data to the reporting server. I'm not sure this will scale under higher load, and I worry the aggregation process will time out because of the number of inserts being sent to the table.
2) Write the log data to a structure in memory on the web server and periodically flush it to the db (a rough sketch of this appears after the list). The fear here is that if the web server goes down, we lose all the data in memory. The other fear is that IIS worker processes and threads might mangle one another when contending for the shared in-memory structure.
3) Don't use memory and write to a file instead. Save the file handle as an application variable and use it for all accesses to the file. I'm not sure about threading issues here either, and am reluctant to use anything which might corrupt a log file under load.
4) Add comment data to the IIS logs. This theoretically should remove the threading issues, but it leaves me thinking the data would not be terribly useful once it's in the IIS logs.
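To make option 2 concrete, here's the rough shape of the buffered writer I have in mind. StatsDb.BulkInsert is a placeholder for a single batched write to the db, and the 30-second flush interval is just a guess:

    using System;
    using System.Collections.Generic;
    using System.Threading;

    public static class BufferedStatsLog
    {
        private static readonly object Gate = new object();
        private static List<string> buffer = new List<string>();
        private static readonly Timer FlushTimer =
            new Timer(Flush, null, 30000, 30000);

        public static void Write(string record)
        {
            // One lock keeps IIS worker threads from mangling one another.
            lock (Gate) { buffer.Add(record); }
        }

        private static void Flush(object state)
        {
            List<string> batch;
            lock (Gate)
            {
                if (buffer.Count == 0) return;
                batch = buffer;              // swap buffers so the lock is
                buffer = new List<string>(); // held only for an instant
            }
            StatsDb.BulkInsert(batch); // placeholder: one batched db write
        }
    }

The worst case is an app recycle losing one flush interval of data, which is the trade-off this option accepts.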
The major driver here is that we do not want to use the canned web sites and reports built into 90% of the statistics platforms out there. Our users shouldn't have to "leave" the customer care portal we're creating just to see stats for their sites. IFrames are not an option. I'm looking for a solution that isn't overly complex or expensive and that gives us access to the data we need to record on page views. It has to scale with volume. Thoughts are appreciated.
Derek
Reader Comments (4)
I worked on a system that instrumented the web server to send all the log data directly over TCP to a Hadoop-type system (http://highscalability.com/product-hadoop). It's really the only way you can keep up with the constant flood of writes. You can then use the map/reduce approach to aggregate data for writing into your data warehouse, or you can simply write your reports on top of the map/reduce infrastructure. Losing a bit of log from one web server wasn't considered important enough to worry about, given how infrequently it happened and the small window of data involved.
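In spirit, the shipping end can be as simple as this; the collector host and port are made up, and failures are simply dropped, matching the loss tolerance above:

    using System;
    using System.IO;
    using System.Net.Sockets;
    using System.Text;

    public class TcpLogShipper
    {
        private TcpClient client;
        private StreamWriter writer;

        public TcpLogShipper(string host, int port) // e.g. ("10.0.0.5", 5170)
        {
            client = new TcpClient(host, port);
            writer = new StreamWriter(client.GetStream(), Encoding.UTF8);
            writer.AutoFlush = true;
        }

        public void Ship(string record)
        {
            // Best-effort delivery: a dropped record is an accepted loss.
            try { writer.WriteLine(record); }
            catch (IOException) { }
        }
    }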
One thing I wanted to try, but never got the chance, was to write most of the logging information during the page creation process rather than storing a ton of data each time something happened on a page. You can then just log "click xxxx" and match that back in the data warehouse with a complete description of the page. That would save a lot of logging for each event, and the page description would only have to be written once. Just a thought.
Thanks for the response. The key for us isn't logging from a web server performance standpoint but from an application event standpoint. Even with the decreased granularity of the application event writes versus the GET/POST writes in the IIS log, I felt a standard db was probably not going to keep up, though I may have to do some benchmarking to see what the scalability costs would be. We're more interested in keeping track of which pages a specific user/session hits during a session, and knowing which buttons they clicked and in what sequence. It's all fine to tell a client that they've had 30 hits to their shopping cart page, but how many of those sessions proceeded to the checkout page and then through to the confirmation/payment page?
Web stats track the GET/POST activity on a server. I really want to track application events, including the main request to the original aspx page.
> We're more interested in keeping track of what pages a specific user/session hits during a session
So you usually annotate each URL to include info like the session, operation name and categories, page position, location, user information, and so on: anything you can think of that you'll use in later analysis to measure engagement, affinity, abandonment, conversion, path analysis, and the like. All of it shows up in the log. For your Ajax code you need to do something similar.
My point is that much of this information is static for the page. If you have a template ID, the position of an ad block is fairly static. So as long as you log a template ID and a field ID, you can recover most of what you need in the analysis phase rather than trying to encode it all in the URL. The amount of data you have to log for each operation can then be much, much less.
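So an event record can shrink to something like this; the field names here are just for illustration, with the full page description written once per template into the warehouse:

    using System;

    public class PageEvent
    {
        public string SessionId;
        public int TemplateId; // joins to a page-description table written once
        public int FieldId;    // which button/link on that template was used
        public DateTime Utc;

        public override string ToString()
        {
            return string.Format("{0}\t{1}\t{2}\t{3:yyyy-MM-dd HH:mm:ss}",
                SessionId, TemplateId, FieldId, Utc);
        }
    }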
I can see that. For the most part our pages are static in layout and we know what's where so your comments are appropriate and well taken. Thanks. I hadn't thought about it that way...