All,
I was lucky enough to recently experience a high-CPU issue (pegged at 100%!) on one of my core routers. I'm still not certain what caused it, but after troubleshooting for an hour, I had to reboot the router, and the issue is now "resolved." At least until the CPU climbs back up to 100% and I can troubleshoot further (I'm leaning toward a code bug).
However, this issue also pointed out a gap in my monitoring: I'm not alerting on high CPU (or memory) for network gear at all. After all, most network issues involve connectivity and routing, so CPU and memory are usually the last things I look at. So, I'm wondering if there is a way to alert on CPU and memory, and what thresholds I should consider as a "rule of thumb."
I found this posting that seems like a great solution, but I prefer to let NPM handle this.
Is anyone else alerting on high CPU and memory on their network gear? What thresholds are you using? What do your alerts look like? I've not upgraded to NPM 10.7 yet, and even if I had, I don't know whether baselining for CPU or memory is configurable.
D