I'm setting up alerting for a number of APC UPS units in our environment and getting a bit bogged down by the list of conditions considered anomalous, and by the apparent need to set up a separate alert for each and every one of them:
- Battery needs replacement
- UPS on battery (i.e. loss of wall or external power)
- Battery too hot. (May indicate the ambient temp being too high.)
- UPS on battery and remaining runtime is too low. (As in, shut it down gracefully now!)
- Output load too high (approaching capacity, i.e. the unit will not have sufficient runtime if the power goes out)
- Input voltage too low or too high (i.e. a problem with external power)
Sure, I could create a single alert that combines a bunch of these conditions. The problem is that such an alert wouldn't tell me about a new condition arising on top of an existing one. E.g. if the battery becomes too hot in a UPS that already has an active alert for the battery needing replacement, that's a valuable piece of info I'd want to know right away, yet it would go undetected by a catch-all alert that evaluates a number of anomalous conditions and fires on any of them.
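To make the concern concrete, here is a rough Python sketch of the difference (not SolarWinds code). The field names and thresholds are made up for illustration; in particular the temperature, runtime, load, and voltage limits are assumptions, with the voltage range assuming a 120 V nominal supply.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set


# Hypothetical snapshot of one UPS; the field names are illustrative only.
@dataclass
class UpsReading:
    battery_needs_replacement: bool
    on_battery: bool
    battery_temp_c: float
    runtime_remaining_min: float
    output_load_pct: float
    input_voltage: float


# One check per condition from the list above. All thresholds are assumptions.
CONDITIONS: Dict[str, Callable[[UpsReading], bool]] = {
    "battery_replace": lambda r: r.battery_needs_replacement,
    "on_battery": lambda r: r.on_battery,
    "battery_hot": lambda r: r.battery_temp_c > 40.0,
    "runtime_low": lambda r: r.on_battery and r.runtime_remaining_min < 10.0,
    "load_high": lambda r: r.output_load_pct > 80.0,
    "input_voltage_bad": lambda r: not (108.0 <= r.input_voltage <= 132.0),
}


def active_conditions(reading: UpsReading) -> Set[str]:
    """Names of all conditions that are currently true for this reading."""
    return {name for name, check in CONDITIONS.items() if check(reading)}


def catch_all_fires(prev: UpsReading, curr: UpsReading) -> bool:
    """A catch-all alert triggers only when the UPS goes from healthy to unhealthy."""
    return not active_conditions(prev) and bool(active_conditions(curr))


def per_condition_fires(prev: UpsReading, curr: UpsReading) -> Set[str]:
    """Separate alerts trigger for every condition that has newly become true."""
    return active_conditions(curr) - active_conditions(prev)


# A UPS that already needs a new battery, then also overheats.
before = UpsReading(True, False, 30.0, 45.0, 40.0, 120.0)
after = UpsReading(True, False, 45.0, 45.0, 40.0, 120.0)

print(catch_all_fires(before, after))      # False: the alert was already active
print(per_condition_fires(before, after))  # {'battery_hot'}
```

The catch-all stays silent because it was already triggered, while the per-condition version reports the new problem.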
But... does this mean I need to create (and maintain) a separate alert for each of the above conditions? (Having spent 8+ years with NPM and SAM and constructed quite a few alerts, I've learned that maintaining alerts is much harder than developing them.)
What works for you, or has worked for you in the past?
(One thing I could think of is maintaining alerts externally via Git, which would hopefully simplify and organize their maintenance, with the added benefit of treating monitoring as IaC and applying coding practices to it. But I'm not there yet, and I'm not sure SolarWinds is either.)
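What I have in mind for the Git idea is roughly the following Python sketch: the shared settings live in one place, each condition is a small data entry, and the full set of alert definitions is generated from them. Everything here is hypothetical (the keys, severities, and the notification address are invented), and it deliberately stops short of pushing anything into SolarWinds, since I don't know what import mechanism would be appropriate.

```python
import json
from typing import Dict, List

# Hypothetical settings shared by every UPS alert; change once, applies to all.
COMMON: Dict[str, object] = {
    "object_type": "APC UPS",
    "notify": "ops-team@example.com",
    "check_interval_s": 60,
}

# One small entry per condition; the severities are assumptions.
CONDITIONS: List[Dict[str, str]] = [
    {"id": "battery_replace", "severity": "warning", "message": "Battery needs replacement"},
    {"id": "on_battery", "severity": "warning", "message": "UPS on battery"},
    {"id": "battery_hot", "severity": "warning", "message": "Battery too hot"},
    {"id": "runtime_low", "severity": "critical", "message": "On battery and runtime too low"},
    {"id": "load_high", "severity": "warning", "message": "Output load too high"},
    {"id": "input_voltage_bad", "severity": "warning", "message": "Input voltage out of range"},
]


def build_alert_definitions(
    common: Dict[str, object], conditions: List[Dict[str, str]]
) -> List[Dict[str, object]]:
    """Expand the shared settings into one full definition per condition."""
    definitions = []
    for cond in conditions:
        definition = dict(common)   # copy so the shared settings are not mutated
        definition.update(cond)     # condition-specific fields override or extend
        definition["name"] = f"UPS: {cond['message']}"
        definitions.append(definition)
    return definitions


def render(definitions: List[Dict[str, object]]) -> str:
    """Serialize with sorted keys so the file diffs cleanly in Git."""
    return json.dumps(definitions, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(render(build_alert_definitions(COMMON, CONDITIONS)))
```

With something like this, there would still be six alerts, but only one template to maintain.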
Thanks!