This is going to be one of my rambling sysadmin-y entries talking about stuff that's probably of little general interest. Fair warning given.
There are two components that I would consider essential to proper system administration: GOOD Monitoring and Issue/Incident Tracking. It is imperative that you know when a problem arises (preferably before anyone else notices) and that you keep track of the problems you have encountered in order to spot troublesome systems and redesign them to stop bugging you.
Those of you who have worked with me know I have my prejudices in both of these areas, and that for the last few years I've settled on two pieces of software to fill these roles: InterMapper for monitoring and RT for issue tracking.
The major caveat of this pairing is that the two have no formal integration: InterMapper will happily send emails, and RT will happily accept emails and turn them into tickets, but RT doesn't know when InterMapper is telling it about the same problem twice, or that a previous issue has been cleared. The end result of this lack of integration is that you have a bunch of RT tickets for the same issue which need to be manually merged and resolved, and this manual bit bugged me enough that I actually took the time to fix it!
A little trolling in the RT Wiki shows that people have gotten RT and Nagios to talk to each other with some degree of success. InterMapper has a bit more complexity in that its states are more granular than PROBLEM and RECOVERY, but the principle is essentially the same: create a global scrip that looks for mail from the monitoring system and takes appropriate action.
Gory Details:
The actual integration is dirt simple: there is almost nothing that needs to be done to make this work, so there's no excuse for not doing it.
- Create the scrip with the following options:
  - Condition: On Create
  - Action: User Defined
  - Template: Blank
  - Stage: TransactionCreate
  You should probably give it a nice name too.
- Enter 1; for the Custom Action Preparation Code (there's nothing to prepare)
- Use this code as the Custom Action Cleanup Code
  Note that you may have to change the $from =~ /..../ regex to suit your environment.
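If you want to try out a sender pattern before pasting it into the scrip, a minimal Python sketch of the same kind of check might look like the following. This is not the scrip code (RT scrips are written in Perl), and both the address pattern and the name is_monitoring_mail are hypothetical; substitute whatever address your InterMapper server actually sends from.

```python
import re
from email.utils import parseaddr

# Hypothetical sender pattern: adjust it to match the address your
# InterMapper server uses when it sends notifications.
MONITOR_SENDER = re.compile(r"^intermapper@", re.IGNORECASE)


def is_monitoring_mail(from_header):
    """Return True if the From header belongs to the monitoring system."""
    # parseaddr splits 'Display Name <user@host>' into (name, address).
    _name, address = parseaddr(from_header)
    return bool(MONITOR_SENDER.search(address))


print(is_monitoring_mail("InterMapper <intermapper@example.com>"))
print(is_monitoring_mail("Some User <someone@example.com>"))
```

Anchoring the pattern at the start of the address keeps ordinary users whose addresses merely contain the word from being treated as the monitoring system.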
Note that the supplied code is not perfect. In particular:
- It only captures {Warning<-->Alarm<-->Critical}<--->OK transitions; it doesn't know what to do with Up or Down messages.
- It assumes that you will let InterMapper's "OK" message resolve the case. (If you resolve the case yourself, RT will have no "problem" to resolve and will create a new case for the spurious "OK" message.)
- It assumes that the subject line is of the form "Severity: Identifier" (e.g. "Warning: " with a descriptive device name).
- It assumes that the messages will all go to the same queue (and that you won't move them).
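To make those assumptions concrete, here is a rough Python model of the decision logic they describe. It is a sketch only, not the scrip itself: TicketStore is a toy stand-in for RT, the names are invented for illustration, and in RT the new ticket already exists by the time an On Create scrip runs, so "merged" here stands for merging the fresh ticket into the older open one.

```python
import re

# Subject form assumed by the scrip: "Severity: Identifier".
SUBJECT_RE = re.compile(r"^(Warning|Alarm|Critical|OK):\s*(.+)$")


class TicketStore:
    """Toy stand-in for RT, just enough to show the merge/resolve logic."""

    def __init__(self):
        self.tickets = []  # each ticket is a plain dict

    def _find_open(self, queue, identifier):
        # Only look in the same queue, so unrelated tickets elsewhere
        # are never pulled in.
        for ticket in self.tickets:
            if (ticket["queue"] == queue
                    and ticket["identifier"] == identifier
                    and ticket["status"] == "open"):
                return ticket
        return None

    def handle(self, queue, subject):
        match = SUBJECT_RE.match(subject)
        if match is None:
            return "untouched"  # not a monitoring subject; leave it alone
        severity = match.group(1)
        identifier = match.group(2).strip()

        existing = self._find_open(queue, identifier)
        if existing is not None:
            existing["history"].append(severity)
            if severity == "OK":
                existing["status"] = "resolved"
                return "resolved"
            return "merged"

        # No open ticket for this identifier: a new one is created,
        # even when the message is a stray "OK".
        self.tickets.append({
            "queue": queue,
            "identifier": identifier,
            "status": "open",
            "history": [severity],
        })
        return "created"


store = TicketStore()
for subject in ["Warning: router1", "Critical: router1",
                "OK: router1", "OK: router1"]:
    print(subject, "->", store.handle("monitoring", subject))
```

The final "OK: router1" in the loop comes back as "created", which is the spurious-ticket case described above: once the problem ticket is no longer open, there is nothing left for the "OK" message to attach to.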
Up/Down messages could be handled by modifying this scrip's action code, but this muddies up otherwise clean logic. It's probably easier to make a different (mostly identical) scrip specifically to handle Up/Down messages.
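As a rough illustration of how small that second scrip's extra logic could be, here is a Python sketch that classifies Up/Down subjects. It assumes those messages use the same "State: Identifier" subject shape as the severity messages, which nothing above confirms, and the name classify_updown is hypothetical.

```python
import re

# Assumed subject form: "Down: Identifier" or "Up: Identifier".
# Check this against the mail your InterMapper server really sends.
UPDOWN_RE = re.compile(r"^(Up|Down):\s*(.+)$")


def classify_updown(subject):
    """Return (kind, identifier), or None if this is not an Up/Down mail."""
    match = UPDOWN_RE.match(subject)
    if match is None:
        return None
    # Down opens (or adds to) a problem; Up is the recovery that closes it.
    kind = "problem" if match.group(1) == "Down" else "recovery"
    return kind, match.group(2).strip()


print(classify_updown("Down: router1"))
print(classify_updown("Up: router1"))
print(classify_updown("Warning: router1"))
```

Because a Warning subject yields None here, the two scrips never fight over the same message.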
The "messages are all in the same queue" restriction is a design choice: this prevents RT from glomming together potentially unrelated stuff from other queues while letting you handle all the InterMapper mail with one global scrip. You could also create per-queue scrips (with hardcoded queue names), but this approach seems more reasonable.
Finally a reverse-Knuth disclaimer: beware of bugs in the above code: I have only tried it, not proven it correct.