I recently had a switch start malfunctioning, and it caused a lot of disruption. Finding it wasn't intuitive or easy because it wasn't down and wasn't showing up in any of my NPM front page widgets. A reboot fixed the immediate problem, but preventing a recurrence may take an IOS upgrade or a hardware replacement.
It's a simple switch, but it faces some critical equipment and areas. Its latency has always been low: often 1 ms or less, and never as much as 20 ms.
Twelve days ago its latency began climbing, and four days ago it started dropping a few packets. The loss was never enough to show up on NPM's front page, but the latency was enough to generate a growing number of help desk complaints, and the loss was enough to drop many customers' voice, video, and Citrix sessions. I want to avoid being behind this 8-ball in the future by building alerts and notifications that react to changing patterns of latency and loss, without getting inundated when other devices have temporary-but-normal increases in either. In short, I want the right actionable alert, as adatole would say, but I'm not certain how to build it.
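To make "changing patterns" concrete, here is a rough Python sketch of the kind of comparison I have in mind. It is not NPM alert syntax. The function name, the 3x ratio, and the 4 ms floor are my own assumptions for illustration, and the history is assumed to be a plain list of latency samples in ms, oldest first.

```python
from statistics import median


def drifted_from_baseline(history, recent_n, ratio=3.0, floor_ms=4.0):
    """Return True when recent latency has drifted well above its own baseline.

    history  -- latency samples in ms, oldest first (assumed input shape)
    recent_n -- how many of the newest samples count as "recent"
    ratio    -- assumed multiplier: recent median must exceed ratio * baseline
    floor_ms -- assumed floor so tiny changes (0.2 ms -> 1 ms) never alert
    """
    # Need at least one sample older than the recent window to form a baseline.
    if recent_n <= 0 or len(history) <= recent_n:
        return False
    baseline = median(history[:-recent_n])
    recent = median(history[-recent_n:])
    # Medians ignore a single spike, so one bad poll does not trigger this.
    return recent > floor_ms and recent > ratio * baseline
```

Using medians instead of averages is what keeps a one-off spike on some other device from tripping it: `drifted_from_baseline([1] * 20 + [1, 1, 40, 1, 1], recent_n=5)` stays quiet, while a run of 25 ms samples after a long stretch at 1 ms does not.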
Here's how the switch's latency and packet loss history looks:
I'd like to have started getting alerts when the latency changed from less than 4 ms to consistently more than 20 ms. I'd also like this particular class of device to get higher priority for packet loss, so that at 3% loss it would have shown up in bright red on NPM's front page.
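Expressed as logic rather than as an NPM alert definition, those two rules might look something like the Python sketch below. The 20 ms and 3% figures are the ones above. Everything else is assumed for illustration: the function names, the "critical-edge" class label, the 10% default for other devices, and the choice of how many consecutive polls count as "consistently".

```python
from typing import Sequence

# Loss thresholds in percent, per device class.
# "critical-edge" and the 10.0 default are hypothetical; only 3.0 comes from my case.
LOSS_THRESHOLD_PCT = {"critical-edge": 3.0, "default": 10.0}


def sustained_breach(samples: Sequence[float], threshold: float,
                     min_consecutive: int) -> bool:
    """True when the newest `min_consecutive` samples all exceed `threshold`."""
    if min_consecutive <= 0 or len(samples) < min_consecutive:
        return False
    return all(s > threshold for s in samples[-min_consecutive:])


def loss_breach(loss_pct: float, device_class: str) -> bool:
    """True when loss reaches the threshold for this class of device."""
    limit = LOSS_THRESHOLD_PCT.get(device_class, LOSS_THRESHOLD_PCT["default"])
    return loss_pct >= limit


# Latency samples in ms, oldest first; a single spike should not alert.
spiky = [1, 2, 25, 1, 30, 22]
climbing = [1, 21, 25, 30]
print(sustained_breach(spiky, threshold=20, min_consecutive=3))     # False
print(sustained_breach(climbing, threshold=20, min_consecutive=3))  # True
print(loss_breach(3.0, "critical-edge"))                            # True
print(loss_breach(3.0, "access"))                                   # False
```

Requiring several consecutive polls above the line is what separates "consistently more than 20 ms" from the temporary-but-normal bumps I don't want to hear about.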
What type of alert & threshold is appropriate to capture these slowly-growing changes over time?