Colmux - Finding Memory Leaks, High I/O Wait Times, and Hotness on 3000 Node Clusters

Thursday, August 25, 2011 at 9:01AM

Todd had originally posted an entry on collectl here at Collectl - Performance Data Collector. Collectl collects real-time data from a large number of subsystems like buddyinfo, cpu, disk, inodes, infiniband, lustre, memory, network, nfs, processes, quadrics, slabs, sockets and tcp, all using one tool and in one consistent format.

Since then a lot has happened. Collectl is now part of both the Fedora and Debian distros, not to mention several others. Joe Brockmeier has written up a pretty good summary, and Martin Bach has written a few postings on his blog. It's also pretty well documented (I like to think) on SourceForge.

Anyhow, a while back I released a new version of collectl-utils and gave a complete face-lift to one of the utilities, colmux, which is a collectl multiplexor. This tool runs collectl on multiple systems, which in turn send all their output back to colmux. Colmux then sorts the output on a user-specified column and reports the 'top-n' results.

For example, here are the top users of slab memory from a 41 node sample:

> colmux -addr cn[10-50] -command "-sm" -column 5

# Thu Aug 25 06:15:41 2011 Connected: 41 of 41
# <-----------Memory----------->
#Host  Free  Buff  Cach  Inac  Slab   Map
cn23    60G     0  174M   87M  109M  190M
cn28    60G     0  177M   89M  107M  177M
cn27    60G     0  186M  101M  105M  139M
cn24    60G     0  103M   48M  105M  175M
cn21   123G     0   43M   27M  105M   90M
cn17   123G     0   42M   27M  104M   48M
cn35    60G     0  102M   54M  104M  173M
cn25    60G     0  125M   63M  104M  176M
cn18   123G     0   42M   27M  103M   49M
cn36    60G     0  103M   54M  103M  135M
cn34    60G     0   76M   54M  103M  174M
cn31    60G     0  103M   54M  103M  174M
cn19   123G     0   42M   27M  102M   49M
cn32    60G     0  103M   54M  102M  135M
cn22   123G     0   43M   27M  102M   90M
cn30    60G     0  103M   54M  101M  176M
cn26    60G     0  110M   55M  101M  175M
cn20   123G     0   42M   27M  101M   49M
cn15   123G     0   42M   26M  100M   47M
cn14   123G     0   42M   26M  100M   54M
cn13   123G     0   42M   27M  100M   50M
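
To make the sort-and-report step concrete, here is a minimal Python sketch of the idea: take rows like the ones above, sort them on one column and keep the top n. This is not colmux's actual code (colmux is written in perl); the names parse_size and top_n are made up for illustration, and two things are assumptions on my part: that the K/M/G suffixes are powers of 1024, and that the host name counts as column 0, which is what makes column 5 line up with Slab in the sample.

```python
# Multipliers for the size suffixes seen in the sample output.
# Assumption: suffixes are powers of 1024.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}


def parse_size(field):
    """Convert a field such as '109M' or '0' to a number of bytes."""
    suffix = field[-1].upper()
    if suffix in UNITS:
        return float(field[:-1]) * UNITS[suffix]
    return float(field)


def top_n(lines, column, n):
    """Return the n rows with the largest value in `column` (0 = host name)."""
    rows = [
        line.split()
        for line in lines
        if line.strip() and not line.startswith("#")  # skip header lines
    ]
    rows.sort(key=lambda row: parse_size(row[column]), reverse=True)
    return rows[:n]


# A few rows copied from the sample output, deliberately out of order.
sample = [
    "#Host Free Buff Cach Inac Slab Map",
    "cn13 123G 0 42M 27M 100M 50M",
    "cn23 60G 0 174M 87M 109M 190M",
    "cn28 60G 0 177M 89M 107M 177M",
    "cn21 123G 0 43M 27M 105M 90M",
]

for row in top_n(sample, column=5, n=3):
    print(" ".join(row))
```

With these four rows the three largest slab users come out as cn23, cn28 and cn21, in the same relative order as in the full listing.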

Debugging a Memory Leak Across a 64 Node Cluster

In fact I used this very command to track down a very strange problem on a large cluster running the lustre file system. Several nodes were running slower than others and nobody knew why. I ran collectl on each, comparing virtually everything I could think of from cpu loads, to interrupts, context switches, disks, networks (both ethernet and infiniband) and other subsystems as well.

It wasn't until I stumbled on the fact that some machines were using a lot more slab memory than others that I tried unmounting/remounting lustre on them and, sure enough, the slab sizes dropped and their performance immediately improved. Drilling down into the individual slab allocations with collectl, I then discovered that a single slab called ll_async_page seemed to be the culprit, and digging deeper with Google I found a known problem with lustre memory leaks. With this in mind, I could then use colmux to identify the top slab memory consumers across the cluster and, sure enough, a small number of nodes had significantly higher values than the bulk of them.

Therefore, it was simply a matter of unmounting/remounting lustre on just those nodes and the problem was resolved. While this didn't fix the memory leak itself, which is a slow one, it at least got the cluster operating at full efficiency for a couple of months, after which the process had to be repeated. This was an older version of lustre, so the problem may have been fixed since.
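
Spotting "a small number of nodes with significantly higher values than the bulk" can also be done mechanically. The Python sketch below flags hosts whose slab usage is well above the cluster median. It is only an illustration: I found the nodes by eye from the sorted colmux output, the function name slab_outliers and the factor of 2 are arbitrary choices, and the readings are invented rather than taken from that cluster.

```python
from statistics import median


def slab_outliers(slab_mb, factor=2.0):
    """Return hosts whose slab usage exceeds `factor` times the cluster median,
    largest first."""
    typical = median(slab_mb.values())
    return sorted(
        (host for host, mb in slab_mb.items() if mb > factor * typical),
        key=lambda host: slab_mb[host],
        reverse=True,
    )


# Hypothetical slab sizes in MB, one entry per node.
readings = {
    "n01": 102,
    "n02": 99,
    "n03": 2900,
    "n04": 105,
    "n05": 1450,
    "n06": 101,
}

print(slab_outliers(readings))
```

The median is used rather than the mean so that the handful of leaking nodes don't drag the "typical" value up with them.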

Tracking Down High I/O Wait Times

I've also used colmux on a large disk farm to track down disks with high I/O wait times. The possibilities are limitless, but naturally the commands/columns you choose to look at are highly dependent on the problem you're trying to solve.

Finding a Hot Needle in a 2000+ Node Haystack

One other use that was pretty cool: a colleague used colmux with collectl's ability to monitor temperatures to track down 'hot' systems on a 2000+ node cluster during a linpack run.

Reliving History

You can also change columns dynamically by typing in the column number or using the arrow keys (if you installed the perl module TermReadKey). You can even reverse the sort order!

Furthermore, if you have historical data you've collected over several days, you can instruct colmux to play it back and sort it. So let's say you had some sort of hang on the cluster yesterday at 2PM. Just play back ALL the data across all the nodes and look at the top processes or maybe the network or anything else that could cause a hang.
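
The playback idea boils down to merging each node's recorded samples back into a single time-ordered stream and then looking only at the window you care about. Here is a rough Python sketch of that idea. It says nothing about how colmux or collectl actually store or replay data; the replay function, the (timestamp, value) record layout and the sample numbers are all hypothetical, and it assumes each node's samples are already in time order.

```python
import heapq
from datetime import datetime


def replay(per_node, start, end):
    """Yield (timestamp, host, value) in time order, restricted to [start, end].

    `per_node` maps a host name to a list of (timestamp, value) pairs that is
    already sorted by timestamp.
    """
    streams = (
        [(ts, host, value) for ts, value in samples]
        for host, samples in per_node.items()
    )
    # heapq.merge combines already-sorted inputs without re-sorting everything.
    for ts, host, value in heapq.merge(*streams):
        if start <= ts <= end:
            yield ts, host, value


# Hypothetical recorded values (say, a load figure) for two nodes.
recorded = {
    "cn01": [(datetime(2011, 8, 24, 13, 59), 12), (datetime(2011, 8, 24, 14, 0), 97)],
    "cn02": [(datetime(2011, 8, 24, 13, 59), 15), (datetime(2011, 8, 24, 14, 0), 14)],
}

window = list(
    replay(recorded, datetime(2011, 8, 24, 14, 0), datetime(2011, 8, 24, 14, 5))
)
busiest = max(window, key=lambda rec: rec[2])
print(busiest[1])
```

In this made-up data, cn01 is the node that stands out at 2PM.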

Take a Look

Anyhow, if you think this might be worth a look, install collectl-utils and take it for a spin. If you haven't tried collectl yet, perhaps this would be a good reason to do so.

There is an alternative output format I call single line, in which a small number of columns are all reported on the same line, making it really easy to spot change. If you look at the bottom of the colmux page, there's a cool picture of close to 200 systems being monitored on a single line. Of course, it takes three 30" monitors to see them all.

Mark Seger

Reader Comments (6)

"2000K+ node cluster"

Really? That's 2 million+ nodes. That's more servers than all of Google worldwide combined.

Is it even possible to have a cluster with that many nodes??

If that's really the case, I'd definitely want to learn more about it.

August 25, 2011 | Andy

Oops, I suppose I'd like to think colmux could do that, but of course you're right. But even looking at 2000 nodes once a second is still pretty impressive in my opinion. How would you go about tracking down virtually any performance counter on a 2K node cluster in real time? How about something as obscure as an nfs client doing too many commits? Or how about the one node getting excessive interrupts, by exact interrupt number? You can look at some pretty bizarre stuff you never even thought of in this way. And remember - the nodes you're monitoring aren't even breaking a sweat, as all the work is on the machine running colmux.
-mark

August 26, 2011 | Mark Seger

Nice, I didn't know about collectl. Could probably hook it up to OpenTSDB to persist the data points it collects. At StumbleUpon we're now collecting about 10000 data points per second, and persisting them all forever in OpenTSDB. Ultimately my goal is to get almost every metric exposed by the kernel and our apps into OpenTSDB, and collect them all every few seconds.

August 26, 2011 | Benoit Sigoure

CLI looks inspired by xCAT (that is written in perl too). Are these projects related in any way?

August 27, 2011 | Nikolay

re OpenTSDB - I had never heard of it before but it sounds pretty cool, especially since it does plotting. When you talk about 10K data points/sec, can I assume that's the aggregate across a cluster as opposed to a single node? In the case of collectl I never counted but suspect it collects on the order of hundreds of counters every 10 seconds, not counting slab or process data, which it only collects every minute. I'm sure one can collect more with lower overhead but I'm not sure how much more, as this is basically a problem of reading MANY different data structures in /proc. One can certainly crank up the monitoring frequency to as fine-grained a level as you like, say 100ths of a second, but then you're starting to use real cpu time.

On the other hand if you can gather that much data across a cluster I'm sure there are many uses such as feeding it with collectl data. The current collectl model is to collect data locally for 2 reasons:
- I've always felt, and still do, that the major flaw with remote collection is that if you lose your network during a network problem, you lose the very data you need to diagnose it. My solution is to do both - log locally as well as send it off to a remote 'catcher'. This is a core capability of collectl.
- centralized DB's have always been a great concept but I've yet to see one that could handle heavy loads or high numbers of variables. RRD, for example, is a great tool but it can't deal with volume, at least not that I know of. Sounds like OpenTSDB could be the answer everyone has been looking for!

As for collectl, it has the ability to send data to a remote collector. We have an HP product called CMU or Cluster Management Utility, that can optionally use collectl to remotely collect data centrally from thousands of nodes every 5 seconds and display the output in real time. I'd think with OpenTSDB you could use the same methodology. I'd be more than happy to have a discussion, perhaps on collectl's mailing list or a forum on SourceForge, primarily so others can participate if they like.

re xCAT - sorry but I'm not familiar with it. Collectl is solely based on the Tru64 utility, collect. Collectl gets its 'l' from Linux, as in 'collect for linux'. ;)

-mark

August 27, 2011 | Mark Seger

Is it the same as collectd, ended in 'd', in Ubuntu repositories?
I can't find colmux there.

June 1, 2012 | pepe
