Lower time frame monitoring

We are having issues at our call centers where calls get dropped for several agents at the same time. When this happens, the MSP that helps us sees on Cisco RTM that the CUCM servers lose communication with each other for just a couple of seconds. They also see SDLLinkOutOfService in the logs between our call manager servers. We have three CUCM servers at our HQ and three at our colo, but we don't have anything in place that monitors at a short enough interval to catch such a brief event.

Can VNQM be configured to do more real-time monitoring of specific connections between devices and discard anything older than a configured interval, say 30 minutes? We just need the data held long enough to go and research right after an event happens. Or is this something that VNQM does automatically? We don't have VNQM on our SW server currently, but if it can accomplish this, I will make a case for adding it. And if it can't, does anyone know of another product that can?

  • For transient, short-lived errors like that, I think you would be better off with a packet capture approach. That will give you a lot more detail about exactly what your servers are doing when they drop communication. Anything polling-based is unlikely to catch an event that lasts only a couple of seconds, and even if it does, all it will do is confirm what you already have in the logs. The servers say they can't communicate, so the real problem you are trying to solve is understanding what causes that. Capturing on the impacted servers, or at the switch port if you prefer, would let you see whether packets are actually being dropped, whether there is some kind of application hiccup where the servers just aren't answering correctly, or whether something more complicated is happening.
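
To tie the capture idea back to the "keep about 30 minutes" requirement in the question, here is a minimal sketch of a rolling (ring buffer) capture. It assumes a separate capture host that sees the traffic (for example on a mirrored switch port) with Wireshark's dumpcap installed. The interface name, the IP addresses, the output path, and the helper name build_ring_capture_cmd are all made up for illustration. The script only builds and prints the command; it does not start a capture.

```python
import shlex


def build_ring_capture_cmd(interface, hosts, out_file,
                           keep_minutes=30, slice_seconds=60):
    """Build a dumpcap command that keeps roughly the last keep_minutes.

    dumpcap's ring buffer mode (-b) starts a new file every
    slice_seconds and overwrites the oldest file once the file
    limit is reached, so disk usage stays bounded.
    """
    if not hosts:
        raise ValueError("at least one host is required")
    # Number of slices needed to cover the retention window.
    files = max(2, (keep_minutes * 60) // slice_seconds)
    # Capture filter limiting traffic to the servers of interest.
    capture_filter = " or ".join(f"host {h}" for h in hosts)
    return [
        "dumpcap",
        "-i", interface,
        "-f", capture_filter,
        "-b", f"duration:{slice_seconds}",
        "-b", f"files:{files}",
        "-w", out_file,
    ]


if __name__ == "__main__":
    # Hypothetical values: replace with your interface and CUCM node IPs.
    cmd = build_ring_capture_cmd(
        "eth1",
        ["192.0.2.11", "192.0.2.12"],
        "/var/tmp/cucm_ring.pcapng",
    )
    print(shlex.join(cmd))
```

With one-minute slices and 30 files, the newest files always cover roughly the last half hour, so after a drop you can copy out the slices around the SDLLinkOutOfService timestamp before they are overwritten.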
