Todd had originally posted an entry on collectl here at Collectl - Performance Data Collector. Collectl collects real-time data from a large number of subsystems like buddyinfo, cpu, disk, inodes, infiniband, lustre, memory, network, nfs, processes, quadrics, slabs, sockets and tcp, all using one tool and in one consistent format.
Since then a lot has happened. It's now part of both the Fedora and Debian distros, not to mention several others. Joe Brockmeier wrote up a pretty good summary, it's also pretty well documented (I like to think) on sourceforge, and Martin Bach has published a few postings about it on his blog.
Anyhow, a while back I released a new version of collectl-utils and gave a complete face-lift to one of the utilities, colmux, which is a collectl multiplexor. This tool has the ability to run collectl on multiple systems, which in turn send all their output back to colmux. Colmux then sorts the output on a user-specified column and reports the 'top-n' results.
For example, here are the top users of slab memory from a 41-node sample:
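I don't have the original output handy here, but as a sketch, the invocation looks something like the following. The host range and sort column are illustrative; check colmux's help for the exact switches in your version:

```shell
# Run collectl's slab summary (-sy) on each node in the range and have
# colmux sort the merged output on a chosen column, showing the top-n.
# Host list and column number are illustrative, not from the original post.
colmux -addr node[1-41] -command "-sy" -column 2
```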
In fact I used this very command to track down a very strange problem on a large cluster running the lustre file system. Several nodes were running slower than others and nobody knew why. I ran collectl on each, comparing virtually everything I could think of, from cpu loads to interrupts, context switches, disks, and networks (both ethernet and infiniband), as well as other subsystems.
It wasn't until I stumbled on the fact that some machines were using a lot more slab memory that I tried unmounting/remounting lustre on them, and sure enough, the slab sizes dropped and their performance immediately improved. Drilling down into the individual slab allocations with collectl, I discovered that a single slab called ll_async_page seemed to be the culprit, and a little digging with google turned up a known problem with lustre memory leaks. With that factoid in mind, I could then use colmux to identify the top slab memory consumers, and sure enough, a small number of nodes had significantly higher values than the bulk of them.
From there it was simply a matter of unmounting/remounting lustre on just those nodes, and the problem was resolved. While that didn't fix the memory leak itself, which is a slow one, it at least got the cluster operating at full efficiency for a couple of months, at which point the process had to be repeated. This was an older version of lustre, so the leak may well have been fixed since.
I've also used colmux on a large disk farm to track down disks with high I/O wait times. The possibilities are limitless, but naturally the commands/columns you choose to look at are highly dependent on the problem you're trying to solve.
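For the disk-farm case, a command along these lines is the sort of thing I mean. Again, the host names and column number are hypothetical; the wait-time column position depends on your collectl version and options:

```shell
# Collect per-disk detail (-sD) from a set of storage servers and sort
# on the column holding I/O wait time, surfacing the worst disks first.
# Hosts and the sort column are illustrative; verify against your output.
colmux -addr store1,store2,store3 -command "-sD" -column 10
</imports>
```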
One other use that was pretty cool: a colleague combined colmux with collectl's ability to monitor temperatures to track down 'hot' systems on a 2000+ node cluster during a linpack run.
You can also change columns dynamically by typing in the column number or using the arrow keys (if you installed the perl module TermReadKey). You can even reverse the sort order!
Furthermore, if you have historical data you've collected over several days, you can instruct colmux to play it back and sort it. So let's say you had some sort of hang on the cluster yesterday at 2PM. Just play back ALL the data across all the nodes and look at the top processes or maybe the network or anything else that could cause a hang.
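As a rough sketch of that playback scenario, you pass collectl's playback switch through colmux's -command. The log path, time window, and sort column below are all assumptions, so adjust them to match where your collectl logs actually live:

```shell
# Play back yesterday's recorded logs on every node, limit collectl to the
# window around the hang, and sort the process data (-sZ) on a cpu column.
# Paths, times, and column number are illustrative.
colmux -addr node[1-41] \
       -command "-p /var/log/collectl/* -from 13:30 -thru 14:30 -sZ" \
       -column 8
```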
Anyhow, if you think this might be worth a look, install collectl-utils and take it for a spin. If you haven't tried collectl yet, perhaps this would be a good reason to do so.
There is an alternative output format I call single-line, in which a small number of columns are all reported on the same line, making it real easy to spot change. If you look at the bottom of the colmux page, there's a cool picture of monitoring close to 200 systems on a single line; of course, it takes three 30" monitors to see them all.