Tuesday, February 5, 2008

SLA monitoring

Hi,

We're running an enterprise SaaS solution that currently serves about 700 customers with up to 50,000 users per customer (and growing quickly). Our customers have SLAs with us that guarantee uptime, response times, and other performance counters. With an increasing number of customers and more traffic, we find it difficult to provide our customers with actual SLA data. We could set up external probes that monitor certain parts of the application, but this is time consuming with 700 customers (we do it today for our biggest clients). We can also extract data from web logs, but those are now approaching 30-40 GB a day.

What we really need is monitoring software that not only focuses on internal performance counters but also lets us see the application from the customer's viewpoint and allows us to aggregate data in different ways. Would the best approach be to develop a custom solution (for instance, a distributed app that aggregates data from the different logs every night and stores it in a data warehouse, roughly as sketched below), or are there products out there that are suitable for a high-scalability environment?
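
To sketch what I mean by the custom route, here's a minimal nightly roll-up. The log layout (vhost first, HTTP status and response time in microseconds as the last two fields) is just an assumption for illustration:

    # sla_rollup.py - nightly roll-up of web logs into per-customer SLA
    # counters for loading into a warehouse. The log layout (vhost first,
    # status and response time in microseconds as the last two fields)
    # is an assumption for illustration.
    import csv
    import sys
    from collections import defaultdict

    def rollup(lines):
        stats = defaultdict(lambda: {"requests": 0, "errors": 0, "times": []})
        for line in lines:
            fields = line.split()
            if len(fields) < 3 or not fields[-1].isdigit():
                continue                       # skip malformed lines
            vhost, status, time_us = fields[0], fields[-2], fields[-1]
            s = stats[vhost]
            s["requests"] += 1
            if status.startswith("5"):
                s["errors"] += 1
            s["times"].append(int(time_us))
        return stats

    def percentile(values, pct):
        values = sorted(values)
        return values[int(len(values) * pct / 100)] if values else 0

    if __name__ == "__main__":
        stats = rollup(sys.stdin)
        out = csv.writer(sys.stdout)           # pipe this into the warehouse load
        out.writerow(["customer", "requests", "error_rate", "p95_us"])
        for vhost, s in sorted(stats.items()):
            out.writerow([vhost, s["requests"],
                          s["errors"] / s["requests"],
                          percentile(s["times"], 95)])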

Any input is greatly appreciated!

Reader Comments (5)

There are solutions out there both for analyzing logs and for tracking user activity on your sites (often through JavaScript on the client), but I've heard that most users of commercial log-analytics tools are troubled by the sheer amount of data, and the client-based approach seems really brittle. How about using something like Hadoop to store and analyze the TBs of data you already have? I don't have any personal experience with this, but it does seem like a good fit for your problem. And from what I've heard, it supports ad-hoc queries as well, giving you the added benefit of being able to get valuable information within a reasonable time (and the ability to easily scale up when the time spent is no longer reasonable).
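
As a rough illustration of what I mean (I haven't run this myself): with Hadoop Streaming the analysis can be plain scripts, e.g. a mapper/reducer pair that averages response time per customer. The field positions are assumptions:

    # mapper.py - emit (customer, response_time) per request line.
    # Assumed layout: customer id first, response time last.
    import sys

    for line in sys.stdin:
        fields = line.split()
        if len(fields) >= 2 and fields[-1].isdigit():
            print("%s\t%s" % (fields[0], fields[-1]))

    # reducer.py - average response time per customer.
    # Hadoop Streaming hands the reducer lines sorted by key.
    import sys

    current, total, count = None, 0.0, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current and current is not None:
            print("%s\t%.1f" % (current, total / count))
            total, count = 0.0, 0
        current = key
        total += float(value)
        count += 1
    if current is not None:
        print("%s\t%.1f" % (current, total / count))

You'd then kick it off with the streaming jar, something like: hadoop jar hadoop-streaming.jar -input /logs/day -output /sla/day -mapper mapper.py -reducer reducer.py (plus -file to ship the scripts).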

December 31, 1999 | Unregistered Commenter Kyrre

For an example of the Hadoop angle, take a look at http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data. The numbers aren't quite as large as yours, but they are doing what you want to do. And it's basically the same sort of problem Yahoo wants to solve internally using Hadoop on their own log stream.

December 31, 1999 | Unregistered Commenter Todd Hoff

Thanks a lot for the input. First, I have to apologize for my brain going a bit megalomaniac on me; we store 30-40 GB of logfiles a day. (I really hope we'll have solved this by the time we reach a few terabytes a day in logfiles...)

I can see that the Rackspace approach is very relevant. Being able to run MapReduce queries on logfiles would allow us to aggregate all sorts of interesting performance statistics into a database/data warehouse. But given that I have limited developer resources on my hands (customers want new buttons, not log parsers), is there any commercial software that you think would be up for the job?

Another headache is that analyzing the web server's logfiles doesn't really give relevant downtime information, because there's plenty of hardware in front of the web server that could potentially fail (load balancers, etc.). Also, a web server that is running into a problem tends not to write to its log file. I guess this could be solved by setting up external probes that ping different resources/customers... But it just doesn't seem very scalable if adding a customer means setting up new probes...
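
If we did go the probe route, I suppose the probes could at least be driven off the customer list, so provisioning a customer automatically provisions its probe. A rough sketch; the customers.txt file and the per-customer health URL convention are made up:

    # probe.py - end-to-end checks driven from the customer list, so a new
    # customer automatically gets a probe. The customers.txt source and the
    # https://<customer>.example.com/health URL convention are assumptions.
    import time
    import urllib.request

    def check(url, timeout=10):
        start = time.time()
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                up = (resp.status == 200)
        except Exception:
            up = False
        return up, time.time() - start

    if __name__ == "__main__":
        with open("customers.txt") as f:          # one customer id per line
            customers = [c.strip() for c in f if c.strip()]
        for customer in customers:
            up, elapsed = check("https://%s.example.com/health" % customer)
            # in real life this would go to the monitoring store, not stdout
            print("%s up=%s response=%.2fs" % (customer, up, elapsed))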

December 31, 1999 | Unregistered Commenter jab

In the products list there are a few monitoring solutions. Take a look at http://highscalability.com/product-hyperic and http://highscalability.com/product-collectl-performance-data-collector.

December 31, 1999 | Unregistered Commenter Todd Hoff

In my opinion, the whole topic of SLA monitoring can be pretty wide open. A number of years ago I developed a solution for a customer based on the Tru64 tool collect, which is what I later based collectl on when I wrote it. The general idea was that since collect was sampling system counters every 10 seconds and process information every 60 seconds, it was pretty easy to post-process the log and watch the timestamps on the samples increment. As long as they kept incrementing, you knew at least the OS was up; holes indicated a reboot, and the time in between was downtime that could easily be measured to within the 10-second window.
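
The post-processing pass boils down to something like this. A minimal sketch, assuming one epoch timestamp at the start of each sample line and the 10-second interval described above:

    # gaps.py - turn holes in the sample timestamps into downtime windows.
    # Assumes one epoch timestamp at the start of each sample line and a
    # 10-second sampling interval.
    import sys

    INTERVAL = 10   # seconds between samples

    def downtime(timestamps, interval=INTERVAL):
        outages = []
        for prev, cur in zip(timestamps, timestamps[1:]):
            if cur - prev > 2 * interval:     # one missed sample is noise
                outages.append((prev, cur, cur - prev))
        return outages

    if __name__ == "__main__":
        ts = sorted(int(line.split()[0]) for line in sys.stdin if line.strip())
        for start, end, length in downtime(ts):
            print("down ~%ds (last sample %d, next sample %d)" % (length, start, end))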

As for application availability, or in my particular case the availability of services like ftp, ntp, and several others: if the daemon processes showed up in the process log with the same pid, we could infer they were still running and hadn't crashed. If they did have a different pid, you could again tell from the timestamps the amount of downtime. Just because the daemon is running doesn't mean the service is necessarily there, but it's at least a first pass at looking at the problem.
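
And the pid check is just a diff between consecutive process samples. A rough illustration; the (timestamp, {name: pid}) sample structure here is made up for the example, as it would really come from parsing the process log:

    # restarts.py - infer daemon restarts/outages by diffing the pids seen
    # in consecutive process samples.
    def find_restarts(samples):
        events = []
        for (t0, procs0), (t1, procs1) in zip(samples, samples[1:]):
            for name, pid in procs0.items():
                if name not in procs1:
                    events.append((name, t0, t1, "gone"))       # crashed, not yet back
                elif procs1[name] != pid:
                    events.append((name, t0, t1, "restarted"))  # new pid => bounced
        return events

    # samples taken a minute apart: ntpd got a new pid in between
    samples = [(0,  {"ftpd": 101, "ntpd": 102}),
               (60, {"ftpd": 101, "ntpd": 250})]
    print(find_restarts(samples))   # [('ntpd', 0, 60, 'restarted')]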

Just a few thoughts...

-mark

December 31, 1999 | Unregistered Commenter Mark Seger
