Colmux - Finding Memory Leaks, High I/O Wait Times, and Hotness on 3000 Node Clusters

Todd had originally posted an entry on collectl here at Collectl - Performance Data Collector. Collectl collects real-time data from a large number of subsystems like buddyinfo, cpu, disk, inodes, infiniband, lustre, memory, network, nfs, processes, quadrics, slabs, sockets and tcp, all using one tool and in one consistent format.
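
If you've never run collectl, here's a minimal sketch of an invocation (the -s subsystem letters and -i interval flag are from the collectl docs; the particular subsystem mix is just an example):

    # sample cpu (c), disk (d), memory (m) and network (n) every 5 seconds
    collectl -scdmn -i 5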

Since then a lot has happened. It's now part of both the Fedora and Debian distros, not to mention several others. Joe Brockmeier has written up a pretty good summary, and it's also pretty well documented (I like to think) on sourceforge. Martin Bach has posted about it a few times on his blog as well.

Anyhow, a while back I released a new version of collectl-utils and gave a complete face-lift to one of the utilities, colmux, which is a collectl multiplexor. This tool has the ability to run collectl on multiple systems, which in turn send all their output back to colmux. Colmux then sorts the output on a user-specified column and reports the 'top-n' results.

For example, you can pull up the top users of slab memory from a 41-node sample.
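
A sketch of what that invocation might look like, assuming hosts named n1 through n41 (the host list and sort column here are hypothetical; -addr, -command and -column are per the colmux documentation):

    # run collectl's slab subsystem (-sy) on every node, sort on a chosen
    # column, and let colmux report the top consumers; if your version
    # doesn't take host ranges, a comma-separated list of hostnames works too
    colmux -addr n[1-41] -command "-sy" -column 2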

