VoIP Quality Issues: Are free technologies good enough for troubleshooting?

VoIP has been widely adopted by enterprises for the cost savings it provides, but it is also one of the most challenging applications for a network administrator to support. Some enterprises choose to run VoIP on their existing IP infrastructure with no additional investment, bandwidth upgrades or preferential marking for voice packets. But because VoIP is a delay-sensitive application, even a small increase in latency, jitter or packet loss can affect the quality of a VoIP call.

The Story:

A medium-sized business with its HQ in Austin, US and a branch office in Chennai, India used VoIP for sales and customer support as well as for communication between offices. IP phones and VoIP gateways were deployed at both Austin and Chennai, and the call manager and the trunk to the PSTN for external calls were at Austin. Austin and Chennai were connected over the WAN, and voice calls from Chennai used the same path as data.

[Figure: network diagram]

The Problem:

Tickets were raised by users in Chennai about VoIP issues such as poor call quality and even call drops when calling Austin and customers around the globe.

The network admin had the NOC team check the health and performance of the network. The network devices in the path of the call were analyzed for health issues, route flaps, etc., with the help of an SNMP-based monitoring tool. After confirming that the network health was fine, the team turned to a few free Cisco technologies for VoIP troubleshooting.

The Solution:

  1. Analysis with Call Detail Records (CDR) and Cisco VoIP IPSLA
  2. Root cause with Cisco NetFlow
  3. Resolution with Cisco QoS

Analysis with Call Detail Records (CDR) and Cisco VoIP IPSLA

When call drops were first reported, the NOC team quickly set up a tool with which they could analyze both Call Detail Records (CDRs) and Cisco IPSLA operations. The Cisco call manager was configured to export CDR data to the tool and the edge Cisco routers at both locations were added to the tool for IPSLA monitoring. CDR data was analyzed to find details about all failed calls and IPSLA was used to measure MOS, jitter and latency for VoIP traffic between the locations. IPSLA reports were correlated with CDR information to confirm the affected location, subnet and set of users.

[Figure: failed calls]

[Figure: MOS score]
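To give a feel for how latency, jitter and packet loss combine into a MOS figure, here is a minimal Python sketch. It uses a widely circulated simplified R-factor heuristic followed by the R-to-MOS mapping from ITU-T G.107. This is an assumption for illustration only: it is not the algorithm Cisco IPSLA uses, and the function names and sample values are hypothetical.

```python
def r_factor(latency_ms: float, jitter_ms: float, loss_pct: float) -> float:
    """Simplified R-factor heuristic (illustrative, not Cisco's IPSLA method)."""
    # Jitter is weighted double and 10 ms is added as a rough codec delay.
    effective_latency = latency_ms + 2.0 * jitter_ms + 10.0
    if effective_latency < 160.0:
        r = 93.2 - effective_latency / 40.0
    else:
        r = 93.2 - (effective_latency - 120.0) / 10.0
    # In this heuristic each percent of packet loss costs 2.5 R points.
    r -= 2.5 * loss_pct
    return max(0.0, min(100.0, r))


def mos_from_r(r: float) -> float:
    """Map an R-factor to an estimated MOS using the ITU-T G.107 formula."""
    if r <= 0.0:
        return 1.0
    if r >= 100.0:
        return 4.5
    mos = 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6
    return max(1.0, mos)


def estimate_mos(latency_ms: float, jitter_ms: float, loss_pct: float) -> float:
    return mos_from_r(r_factor(latency_ms, jitter_ms, loss_pct))


if __name__ == "__main__":
    # Hypothetical measurements, not figures from the case described here.
    print("healthy path :", round(estimate_mos(20, 2, 0.0), 2))
    print("degraded path:", round(estimate_mos(300, 40, 5.0), 2))
```

The point of the sketch is only that loss and delay pull the score down together, which is why the MOS trend is a convenient single number to correlate against failed calls in the CDR data.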

Root cause with Cisco NetFlow

IPSLA confirmed high packet loss, jitter and latency for VoIP conversations originating from Chennai, which cast suspicion on the available WAN bandwidth. The network admin verified the link utilization using SNMP. Although the WAN link was heavily utilized, the utilization was not high enough to account for the packet drops and high latency.
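For readers who want to see what an SNMP utilization check boils down to, the sketch below computes percentage utilization from two samples of an interface octet counter (for example ifHCInOctets from IF-MIB) and the interface speed. The function name and the sample numbers are hypothetical, and which counters the team's monitoring tool polled is not stated here.

```python
def link_utilization_pct(octets_start: int, octets_end: int,
                         interval_s: float, if_speed_bps: float,
                         counter_bits: int = 64) -> float:
    """Average utilization (%) between two SNMP octet counter samples."""
    if interval_s <= 0 or if_speed_bps <= 0:
        raise ValueError("interval and interface speed must be positive")
    # The modulo handles a single counter wrap between the two samples.
    delta_octets = (octets_end - octets_start) % (2 ** counter_bits)
    return delta_octets * 8 * 100.0 / (interval_s * if_speed_bps)


if __name__ == "__main__":
    # Hypothetical 2 Mbps link polled 60 seconds apart.
    print(link_utilization_pct(0, 7_500_000, 60, 2_000_000))
```

Note that this is an average over the polling interval, so short bursts inside the interval are smoothed out.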

The second free technology to be used was NetFlow. Most routing and switching devices from major vendors support NetFlow or similar flow formats such as J-Flow, sFlow, IPFIX and NetStream. NetFlow was enabled on the WAN interfaces at both Austin and Chennai and set to export every minute to a centralized flow analysis tool that provided real-time bandwidth analysis.

The network admin checked the top applications in use and, contrary to expectation, did not find VoIP among them. ToS analysis from NetFlow data showed that VoIP conversations from India did not have the preferred QoS priority. A configuration change on the router had given backup traffic a higher priority than VoIP traffic. As a result, backup traffic was being delivered while VoIP traffic was dropped or buffered whenever WAN link utilization was high. The admin also found that a few scavenger applications had high priority as well.

[Figure: top applications]  [Figure: top applications marked EF]
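The ToS analysis described above can be sketched in a few lines of Python. The DSCP value is the upper six bits of the ToS byte carried in each flow record, so EF (DSCP 46) appears as ToS 184. The flow records, application names and byte counts below are hypothetical and only illustrate the kind of mismatch the admin found; a real flow analyzer would read these fields from exported NetFlow data.

```python
from collections import defaultdict

# A few well-known DSCP code points (EF per RFC 3246, CS1 commonly used for scavenger).
DSCP_NAMES = {0: "Best Effort", 8: "CS1", 34: "AF41", 46: "EF"}


def dscp_from_tos(tos: int) -> int:
    """Return the DSCP value (upper 6 bits) of an 8-bit ToS byte."""
    if not 0 <= tos <= 255:
        raise ValueError("ToS must fit in one byte")
    return tos >> 2


def bytes_by_dscp(flows):
    """Sum bytes per DSCP value across flow records with 'tos' and 'bytes' keys."""
    totals = defaultdict(int)
    for flow in flows:
        totals[dscp_from_tos(flow["tos"])] += flow["bytes"]
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical flow records: backup marked EF, voice left at best effort.
    flows = [
        {"app": "backup", "tos": 184, "bytes": 900_000},
        {"app": "voip-rtp", "tos": 0, "bytes": 40_000},
        {"app": "web", "tos": 0, "bytes": 120_000},
    ]
    for dscp, total in sorted(bytes_by_dscp(flows).items()):
        print(DSCP_NAMES.get(dscp, f"DSCP {dscp}"), total)
```

Grouping by DSCP rather than by application makes a mis-marking stand out quickly, because the traffic sitting in the EF class is not the traffic that should be there.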

Resolution with Cisco QoS

With reports from the flow analyzer tool, the network admin identified the applications and IP addresses hogging the WAN bandwidth and redesigned the QoS policies to give preferential marking to VoIP and mission-critical applications and to put everything else under “Best Effort”. Bandwidth-hogging applications were either policed or set to be dropped. Traffic analysis with NetFlow confirmed that VoIP now had the required DSCP priority (EF) and that other applications were no longer hogging the WAN bandwidth. Because Cisco devices support QoS reporting over SNMP, the QoS policies on the edge Cisco devices were monitored to confirm that the QoS drops and queuing were as desired.

[Figure: EF priority for VoIP]  [Figure: CBQoS drops]
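As a conceptual aid for what "policed" means above, here is a minimal single-rate token bucket in Python. It is a textbook sketch, not Cisco's implementation, and the class name, rate and burst values are hypothetical. Packets that find enough tokens conform and are forwarded; packets that do not are treated as exceeding and would be dropped or re-marked.

```python
class TokenBucketPolicer:
    """Single-rate, two-color token bucket (conceptual sketch only)."""

    def __init__(self, rate_bps: float, burst_bytes: int):
        self.rate_bytes_per_s = rate_bps / 8.0
        self.burst_bytes = float(burst_bytes)
        self.tokens = float(burst_bytes)  # bucket starts full
        self.last_time = 0.0

    def allow(self, packet_bytes: int, now_s: float) -> bool:
        """Return True if the packet conforms, False if it exceeds the rate."""
        elapsed = max(0.0, now_s - self.last_time)
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst_bytes,
                          self.tokens + elapsed * self.rate_bytes_per_s)
        self.last_time = max(self.last_time, now_s)
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True
        return False


if __name__ == "__main__":
    # Hypothetical policer: 8 kbps (1000 bytes/s) with a 1500-byte burst.
    policer = TokenBucketPolicer(rate_bps=8000, burst_bytes=1500)
    for t, size in [(0.0, 1000), (0.0, 1000), (0.5, 1000)]:
        print(t, size, "conform" if policer.allow(size, t) else "exceed")
```

The burst size controls how much short-term excess is tolerated, which is the main knob to tune when a policed application should be slowed down rather than starved.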

Cisco IPSLA and CDR analysis confirmed that VoIP call performance was back to normal: no more VoIP calls had a poor MOS score or were being dropped. We had a smart network admin, and that was the day we were taught to be proactive rather than reactive.

The questions I now have are:

Have you been in a similar soup?

Are there alternative methods we could have used, and how would you have gone about it?
