Product: Collectl - Performance Data Collector

Sunday, February 3, 2008 at 1:11PM

From their website:
There are a number of times in which you find yourself needing performance data. These can include benchmarking, monitoring a system's general health or trying to determine what your system was doing at some time in the past. Sometimes you just want to know what the system is doing right now. Depending on what you're doing, you often end up using different tools, each designed for that specific situation. Features include:

You are able to run with non-integral sampling intervals.

Collectl uses very little CPU. In fact it has been measured to use <0.1% when run as a daemon using the default sampling interval of 60 seconds for process and slab data and 10 seconds for everything else.

Brief, verbose, and plot formats are supported.

You can report aggregated performance numbers on many devices such as CPUs, Disks, interconnects such as Infiniband or Quadrics, Networks or even Lustre file systems.

Collectl will align its sampling on integral second boundaries.

Supports process and slab monitoring.

New to the 2.4.0 release is the monitoring of process i/o statistics.

Unlike most monitoring tools that either focus on a small set of statistics, format their output in only one way, run either interactively or as a daemon but not both, collectl tries to do it all. You can choose to monitor any of a broad set of subsystems which currently include cpu, disk, inodes, infiniband, lustre, memory, network, nfs, processes, quadrics, slabs, sockets and tcp. The following is an example of simply running the collectl command with no arguments and using its default settings. Below we see what the cpu, network and disk were doing while writing a large file:

#<--------CPU--------><-----------Disks-----------><-----------Network---------->
#cpu sys inter  ctxsw KBRead  Reads KBWrit Writes netKBi pkt-in netKBo pkt-out
  37  37   382    188      0      0  27144    254     45     68      3      21
  25  25   366    180     20      4  31280    296      0      1      0       0
  25  25   368    183      0      0  31720    275      2     20      0       1
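
For readers who want to feed output like this into another tool, here is a minimal sketch of turning the sample into named fields. It assumes the brief format is plain whitespace-separated integers under a `#`-prefixed column header, as in the lines shown above. The function name `parse_brief` is hypothetical and not part of collectl.

```python
# Sample lines copied from the output above.
SAMPLE = """\
#<--------CPU--------><-----------Disks-----------><-----------Network---------->
#cpu sys inter ctxsw KBRead Reads KBWrit Writes netKBi pkt-in netKBo pkt-out
37 37 382 188 0 0 27144 254 45 68 3 21
25 25 366 180 20 4 31280 296 0 1 0 0
25 25 368 183 0 0 31720 275 2 20 0 1
"""


def parse_brief(text):
    """Return one dict per data row, keyed by the column names in the header."""
    columns = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # The banner line starts with "#<"; the column header does not.
            if not line.startswith("#<"):
                columns = line.lstrip("#").split()
            continue
        if columns is None:
            raise ValueError("data row seen before a column header")
        values = [int(v) for v in line.split()]
        if len(values) != len(columns):
            raise ValueError("row width does not match header")
        rows.append(dict(zip(columns, values)))
    return rows


rows = parse_brief(SAMPLE)
# Highest disk write figure seen across the three samples.
peak_write = max(row["KBWrit"] for row in rows)
```

Keying each row by column name keeps later code readable (`row["KBWrit"]` rather than `row[6]`) and makes a changed column layout fail loudly instead of silently shifting values.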

Output can also be saved in a rolling set of logs for later playback or displayed interactively in a variety of formats. If all that isn't enough there are additional mechanisms for supplying data to external tools via a socket interface or by generating its output as s-expressions, a format of choice for some tools such as supermon. You can even create files in space-separated formats for plotting with external packages like the one below which was done with gnuplot using 1 second samples.

Todd Hoff |


Product,

performance monitor

Reader Comments (21)

use sar.

December 31, 1999 |

Anonymous

I use Munin a lot lately for collecting data.

December 31, 1999 |

Kent

I'm the author of collectl and in response to the 2 word comment of 'use sar' I have to say that before I wrote collectl I looked very closely at sar. It does some things very well but I also think it's gotten a little long in the tooth. The problem with some of the older tools is that so many scripts depend on their output formats that they can't be changed and are locked into an older way of doing things. For example:
- can sar display multiple types of data on a single line?
- can it report Infiniband stats?
- how about lustre? nfs? tcp? slabs?
- is it possible to load sar data directly into a spreadsheet?
- how about plotting sar data without having to manipulate the data?
- what about sub-second monitoring intervals?

enough rambling. the list is much longer...
-mark

December 31, 1999 |

Mark Seger

thanks for your answer, mark

put in a point-to-point comparison on your site, or people will just "use sar" and never complain to you. Also you may consider supporting windows os. Remember these are the OSS freaks, they would rather sell their mom on ebay than pay for commercial software. The slashdot clique isn't the most distinguished customer base.

bye,
have fun competing with the community that does things for free (i.e. $0).

December 31, 1999 |

Anonymous

Also you may consider nicer graphing output or support for RRDTool or nagios

December 31, 1999 |

Anonymous

re: side-by-side comparison to SAR
I didn't want to get too deeply into that as I've noticed many people who use SAR are quite happy with the default 10 minute monitoring intervals, which I find relatively useless for any kind of analytical troubleshooting. After all, if you're told the cpu was 25% busy for a given 10 minute period, how could you ever tell it was idle for 7:30 and pegged for the other 2:30?
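
The arithmetic behind that example can be checked with a short sketch. The two segments below are simply the idle and pegged periods named above, and "pegged" is assumed to mean 100% busy.

```python
# Breakdown of one 10-minute interval, matching the example above:
# idle for 7:30, then pegged (assumed to mean 100% busy) for 2:30.
segments = [
    (7 * 60 + 30, 0.0),    # (duration in seconds, cpu busy %)
    (2 * 60 + 30, 100.0),
]

total_seconds = sum(duration for duration, _ in segments)
# Time-weighted average, which is all a single 10-minute sample can report.
average_busy = sum(duration * busy for duration, busy in segments) / total_seconds
peak_busy = max(busy for _, busy in segments)

print(f"interval: {total_seconds}s  average: {average_busy:.0f}%  peak: {peak_busy:.0f}%")
```

The single averaged figure comes out at 25%, identical to what a steady 25% load would report, so the burst is invisible at that interval.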

As I said before the main differences are in the types of data collected. SAR collects a lot but collectl collects a lot more.

re: windows - ain't gonna happen! collectl is based on the very light-weight /proc interface to get its data and doing something similar in windows would be very painful.

re: 'competing with the community that does things for free'
what gives anyone the impression collectl costs something? it is open source and free!

re: graphing
You can generate data in plottable form and load it into a spreadsheet and use its graphing features or call something like gnuplot to do it for you. If you have a particular set of data you want to plot over and over again, you could always script it.

re: rrd
Collectl can generate data in rrd format. I was contemplating trying to actually load an rrd database directly from collectl and asked if anyone wanted to work with me, but I didn't get any takers. I also did some experimenting with rrd and found it didn't really meet my plotting needs, which require 100% accurate plotting data. You can't get that with rrd since it normalizes multiple data points into a single one and you therefore lose information. Since rrd was never intended to be a highly accurate plotting package but rather to focus on trends, this is fine for it to do, but it's not ok when you're trying to use that data to diagnose a system problem.
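
To make the information loss concrete, here is a rough sketch of what folding several data points into one does to a short spike. The `consolidate` function is a simplified stand-in that averages fixed-size buckets. It is not RRDtool's actual algorithm, and the readings are made up.

```python
def consolidate(samples, bucket_size):
    """Average each run of `bucket_size` consecutive samples into one value."""
    result = []
    for i in range(0, len(samples), bucket_size):
        bucket = samples[i:i + bucket_size]
        result.append(sum(bucket) / len(bucket))
    return result


# Made-up 1-second readings: mostly quiet, with a 2-second spike.
fine = [5, 5, 5, 5, 100, 100, 5, 5, 5, 5]
# Keep one value per 5 seconds instead of one per second.
coarse = consolidate(fine, 5)

print("peak in original data:    ", max(fine))
print("peak after consolidation: ", max(coarse))
```

The trend survives but the spike does not, which is the trade-off described above: fine for watching long-term behaviour, not for diagnosing what happened in a particular second.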

-mark

December 31, 1999 |

Mark Seger

@mark
interesting little hack you made there but I cannot imagine why anyone would want to use it.
for "getting numbers fast" there is sar, which is timetested and does pretty much everything
one would need in such a use-case.

for larger scale or longtime monitoring people commonly use munin, nagios or other
plugin-based solutions. seems like your tool has nothing to offer on that front.

December 31, 1999 |

John

@mark, I am sorry but these 2 are non-arguments
- is it possible to load sar data directly into a spreadsheet?
Why would you want a specialized program to output into a specialized format (spreadsheet.. wasn't this the tool for the accounting types?)

- how about plotting sar data without having to manipulate the data?
Similar to the one above.

PS: I have not looked at collectd yet, just commenting as knee-jerk.

December 31, 1999 |

atif.ghaffar

re: Interesting little hack
It may be a hack to you but in the world of High Performance Computing, something many people may not be all that familiar with, it's proven to be invaluable as some of the largest computers in the world run collectl on a daily basis. Ever hear of the top500 list? The majority of the systems listed are HP and many of those run collectl. People who have used it recognize its worth. While sar provides a lot of data, it does not include information on Infiniband and Lustre and they are far too important to not have at your fingertips.

As for tools like nagios, etc I find they don't scale. Can you send hundreds of performance counters to them every 10 seconds (or less) from over 1K nodes and not choke them? Furthermore, what happens when you're trying to debug a network problem and you can't get the data to nagios to display?
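
A back-of-the-envelope sketch of the load in question. All three figures are assumptions picked to match the rough numbers above ("hundreds" is taken as 200), so treat the result as an order of magnitude only.

```python
# All three figures are assumptions chosen to match the rough numbers above.
counters_per_node = 200   # "hundreds of performance counters"
nodes = 1000              # "over 1K nodes"
interval_seconds = 10     # "every 10 seconds (or less)"

# Values arriving at the central collector each sampling interval.
values_per_interval = counters_per_node * nodes
values_per_second = values_per_interval / interval_seconds
# Number of intervals in a day times values per interval.
values_per_day = values_per_interval * (24 * 60 * 60 // interval_seconds)

print(f"{values_per_interval:,} values per interval")
print(f"{values_per_second:,.0f} values per second")
print(f"{values_per_day:,} values per day")
```

Under these assumptions a central collector has to absorb tens of thousands of values per second, sustained, before any graphing or alerting work is done.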

re: output to a spreadsheet
I tried to choose my words carefully but perhaps not carefully enough 9-)
What I was trying to say is collectl can generate data in space-separated format (or in fact let you choose your own separator) and as such can be easily imported into any tool that recognizes such a format. Spreadsheets are the main ones that come to mind. However, more importantly, you can also run gnuplot directly.
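
As an illustration of why a configurable separator makes importing easy, here is a minimal sketch that reads separator-delimited text with Python's standard `csv` module. The helper name `load_table` is hypothetical, and the two-column data is made up for the example rather than real collectl output.

```python
import csv
import io


def load_table(text, separator=" "):
    """Read separator-delimited text whose first row is a header."""
    reader = csv.reader(io.StringIO(text), delimiter=separator)
    header = next(reader)
    # Skip empty rows; pair every remaining row with the header names.
    return [dict(zip(header, row)) for row in reader if row]


# Made-up two-column data, not real collectl output.
space_separated = "time kbwrit\n13:11:01 27144\n13:11:02 31280\n"
# The same data with a different separator, as a spreadsheet might prefer.
comma_separated = space_separated.replace(" ", ",")

table = load_table(space_separated)
same_table = load_table(comma_separated, separator=",")
```

Both calls produce the same rows, so the choice of separator only needs to match whatever the receiving tool expects.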

tough crowd, but I like to hear all feedback 8-)
btw - at least Kevin Closson agrees with me
see http://kevinclosson.wordpress.com/2007/12/18/its-your-choice-collectl-or-some-odd-collection-of-sundry-commands/

-mark

December 31, 1999 |

Mark Seger

I for one will definitely have a look into Collectl.

We run a bunch of clusters and some SMP-machines, most of them doing HPC.
Right now we primarily use Ganglia and what I find useful is that you can define and implement your own metrics with gmetric - just get your data in any way you want (e.g. shell script that gets the temperature via IPMI) and gmetric will feed it into Ganglia, which produces all charts, statistics, etc.
I wonder if we could use Collectl to get some data into Ganglia.
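
One possible shape for that glue, as a sketch only: take a value that collectl has already reported and build the matching gmetric command line. The metric name and the units string are assumptions made for the example, and the helper `gmetric_command` is hypothetical. The code only builds the command and does not run it.

```python
import shlex


def gmetric_command(name, value, units, metric_type="uint32"):
    """Build (but do not run) a gmetric command line for one metric."""
    return [
        "gmetric",
        f"--name={name}",
        f"--value={value}",
        f"--type={metric_type}",
        f"--units={units}",
    ]


# Hypothetical metric name and assumed units; the value is the first KBWrit
# figure from the sample output in the article.
cmd = gmetric_command("collectl_disk_kbwrit", 27144, "KB/s")
print(shlex.join(cmd))
```

On a node with Ganglia installed, the list could be handed to `subprocess.run(cmd, check=True)` once per sample. Keeping the command as a list avoids shell quoting problems with metric names.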

BTW - for a moment I thought your tool was not free (as in beer ;-)) but it is, and even if it wasn't, that really wouldn't matter given the hardware it is meant for (this is in respect to one of the previous comments).

-marek

P.S.
I run a blog about building and administering clusters - http://clusteradmin.blogspot.com - perhaps somebody here will be interested.

December 31, 1999 |

clusteradmin.blogspot.com

Actually beer isn't free, but collectl is 9-) so by all means give it a shot. I suspect it's fairly easy to pass data from collectl to ganglia, but just be aware that when you plot data you lose accuracy and so you really want to keep collectl data local as well. Perhaps the best place to have that discussion is on your blog so I'll enter a few comments there...
-mark

December 31, 1999 |

Mark Seger

Sounds great. I'll cover some monitoring aspects (including your tool) over the weekend. Your comments will be very valuable.

--
http://clusteradmin.blogspot.com/ :: blog about building and administering clusters.

December 31, 1999 |

clusteradmin.blogspot.com

"This blog is open to invited readers only" -- I don't know why you are advertising your blog when you apparently don't want anyone to read it....

December 31, 1999 |

Anonymous

Also you may consider nicer graphing output or support for RRDTool or nagios

December 31, 1999 |

Culture - Bilisim

The short answers to the last questions are yes and yes. Collectl's mission is to be the best data collector and logger around. To that end it doesn't try to aggregate data from multiple nodes, load it into databases or do fancy graphics. Those are jobs for other tools like Ganglia, Nagios or RRD.

However what collectl does do is make data available for importation through a number of mechanisms. What I don't want to do is dictate how that data should be loaded and I therefore leave it to others. Is that a cop-out on my part? I say no, because whatever implementation I might choose there will be others who either disagree with that mechanism or simply know the external tools better and have more efficient ways to implement those mechanisms. Quite frankly I'm waiting for someone to raise their hand and say they want to import collectl data into another tool and are looking for help. I'd be more than happy to hear what they have to say and help where I can.

-mark

December 31, 1999 |

Mark Seger

It's been awhile but I thought I'd post an update on collectl. A couple of weeks ago collectl was added to the Fedora 10 release and quickly back-ported to releases 8 & 9 so it looks like it's starting to gain some traction.

I was also looking through some previous discussions in this thread and, as a more detailed comparison of collectl to other utilities, I have developed a chart which shows a subset of collectl's commands mapped against existing tools like sar. If I've missed any sar (or other tool) options let me know and I'll be happy to update the table. It's at - http://collectl.sourceforge.net/Matrix.html

I also thought I'd take the opportunity to mention that I have released collectl 3.0.0 which I think is pretty cool because of a key new feature I added, specifically the --top switch which makes collectl sort of work like top, only better! With this switch you not only can display processes sorted by cpu, you can also display top processes by I/O (assuming your kernel supports that). Furthermore you can simultaneously display other stats such as disk traffic, network, etc. In fact, you can even include process threads. But wait - there's more! Since collectl has a highly integrated set of capabilities, if you've had it running as a daemon and writing statistics to a file, you can play back that file with --top, multiple times if you like, and see who the top processes were at different times in the past! More on process monitoring here - http://collectl.sourceforge.net/Process.html

Something new I'm currently working on is adding ipmi data such as fan/temp data. That tends to be a bit more challenging because every system reports its ipmi data in a different format!

If anyone has tried collectl since my last posting and has any feedback (both good and bad) I'd be interested in hearing what you think.

marek - I finally got around to trying to access your blog but apparently I need to be authorized by you to access it, or am I missing something? If only those who are given permission are allowed in and they don't know your email address, isn't that self-defeating?

-mark

December 31, 1999 |

Mark Seger

Thanks for the update Mark. All sounds good. You aren't missing anything on the new users. Because of spam I approve new users in batches. Sorry for the awkwardness of the process but I'm not sure what else to do.

December 31, 1999 |

Todd Hoff

Actually I'm talking about the pointer to clusteradmin.blogspot.com that was mentioned by Marek. He invited people to join in but you can't unless you're a member and without an email to ask him it's kind of a catch-22.
-mark

December 31, 1999 |

Mark Seger

Interesting...perhaps a way to correlate to actual end-user performance experience? I use an appliance to stitch all the http packets together. I guess if I knew what to correlate against, collectl would be good for deep-dive forensics?

December 31, 1999 |

Tim

The whole point is you never know ahead of time what to correlate and so you collect everything including process data. When the time comes to analyze a problem at a particular time, you just play back all different types of data around the time and start trying to correlate it with what else you might know such as a message in /var/log/messages or perhaps an entry in a web log that might have caused a failure or perhaps resulted in a slow response time. Does that help?
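
The "play back the data around the time" step can be sketched as a simple time-window filter. Everything below is illustrative: the samples, the event time and the helper name `samples_near` are made up, and real collectl playback is done with collectl itself rather than with code like this.

```python
from datetime import datetime, timedelta


def samples_near(samples, event_time, window_seconds=30):
    """Return the (timestamp, data) samples within `window_seconds` of the event."""
    window = timedelta(seconds=window_seconds)
    return [
        (ts, data) for ts, data in samples
        if abs(ts - event_time) <= window
    ]


# Made-up samples at 10-second spacing; the "cpu" values are invented.
start = datetime(2008, 2, 3, 13, 0, 0)
samples = [(start + timedelta(seconds=10 * i), {"cpu": 20 + i}) for i in range(60)]

# Hypothetical time of an error seen in /var/log/messages.
event = datetime(2008, 2, 3, 13, 5, 2)
nearby = samples_near(samples, event)

for ts, data in nearby:
    print(ts.time(), data)
```

Because everything was recorded, the same window can be applied to each type of data in turn (cpu, disk, network, processes) without having decided in advance which one would matter.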
-mark

December 31, 1999 |

markseger

We have been having strange problems with our processes on a large server (>60T disk, 700GB ram). The problem with sar is trying to correlate all the data points. It's very difficult. Anything that can make this job easier would be welcome. I'm very interested to see what collectl can do.

October 28, 2016 |

GreyGnome
