« ThePort Network Architecture | Main | 13 Scalability Best Practices »
Thursday
Aug132009

Reconnoiter - Large-Scale Trending and Fault-Detection

One of the top recommendations from the collective wisdom contained in Real Life Architectures is to add monitoring to your system. Now! Loud is the lament for not adding monitoring early and often. The reason is easy to understand. Without monitoring you don't know what your system is doing which means you can't fix it and you can't improve it. Feedback loops require data.

Some popular monitor options are Munin, Nagios, Cacti and Hyperic. A relatively new entrant is a product called Reconnoiter from Theo Schlossnagle, President and CEO of OmniTI, leading consultants on solving problems of scalability, performance, architecture, infrastructure, and data management. Theo's name might sound familiar. He gives lots of talks and is the author of the very influential Scalable Internet Architectures book.

So right away you know Reconnoiter has a good pedigree. As Theo says, their products are born of pain, from the fire of solving real-life problems and that's always a harbinger of good things to come.

The problem Reconnoiter is trying to solve is monitoring thousands of nodes across many datacenters where the nodes can vary widely in power, architecture, and software configuration. With that kind of problem what they really want is the ability to:

 

  • Configure everything from one place.
  • Cheap checks that are made on the specified time interval and aren't late and don't cause a heavy load on the machine.
  • Change the configuration from any datacenter without coordination.
  • Add checks in the field.
  • Separate data collection from visualization and fault-detection.
  • Analyze trends for long-term capacity planning and postmortem analysis.
  • Detect when faults have happened and when they are about to happen.
  • Support trending: the intelligent data correlation, regression analysis/curve fitting and looking into the past to see how much you go where you are now so you can do better next time.
  • Create a monitoring system that doesn't require a separate powerful network and its own set of hosts on which to run.

    If you've ever used or written a distributed stats collection system the architecture of Reconnoiter will look somewhat familiar:


    Some of the more interesting bits of the architecture are:
  • PostgresSQL stores all the data. The data isn't stuck in funky little files.
  • Fault-detection is based on Esper, a streaming complex event processing system. It's not clear how well this approach will work but the hooks are there.
  • A Comet-style web server is used to feed real-time updates. Much better than your traditional polling cycle.
  • Although the web console is PHP based, PHP is used mainly to execute Json calls. Rendering happens in the browser in an AJAX client.
  • Canvas is used for real time graphics. No images are created on the fly.
  • Data is transferred securely over SSL.
  • The system is robust against failures.
  • Data is not thrown away as it is with some systems so you can check against history.

    Reconnoiter isn't completely pain free. Lua for an extension language is an interesting choice. The installation and configuration process is very complex. There are a lot of separate steps and bits to configure. Another potential problem is monitoring produces a lot of real-time data. I have to wonder if PostgresSQL can handle that flow for very large systems. The data is partitioned by month, but a large number of machines and a large number of events can be crushing. And I wasn't sure if graph data could be correlated with released features or other system changes. In the video Theo mentions seeing in the graphs that using deflate improved performance, but I'm not sure just looking at the graph how you would be able correlate system data with system changes.

    It's droolingly clear where Reconnoiter shines is on creating complex graphs, charts, and other visualizations. The graphs look useful and quick to render. The real time visualizations are spectacular and extremely are difficult to do in other systems.

    Related Articles



  • OmniTI Reconnoiter: Web Management and Analysis by Eric J. Bruno
  • Reconnoiter Update by Theo Schlossnagle
  • Reconnoiter Project Home Page
  • Video: Reconnoiter: a whirlwind tour
  • Big Picture of the Overall System
  • Reconnoiter: Monitoring and Trend Analysis from OSCON
  • OmniTI Unveils Open Source Monitoring Tool, Reconnoiter by Jayashree Adkoli
  • The sad state of open source monitoring tools by Grig Gheorghiu
  • How to Succeed at Capacity Planning Without Really Trying : An Interview with Flickr's John Allspaw on His New Book.
  • New open source IT management tool: Lighter-weight than Nagios, more granular than Cacti by Matt Stansberry
  • Reader Comments (2)

    Nagios already supports Distributed Monitoring ( http://nagios.sourceforge.net/docs/1_0/distributed.html )

    I recommend Opsview ( One of many Nagios enhancement ).
    You can easlily build distributed monitoring structure with it.

    December 31, 1999 | Unregistered Commenterpassenger

    We've been keeping tabs on Reconnoiter. It looks like it is coming along nicely. Thanks for this review. Go Theo.

    December 31, 1999 | Unregistered Commentertabkeeper

    PostPost a New Comment

    Enter your information below to add a new comment.
    Author Email (optional):
    Author URL (optional):
    Post:
     
    Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>