On the interface statistics gathered by Orion, are the Error and Discards counters reflecting the current status of the link?
If this is the case, you could set up an alert that notifies you whenever either the error or discard counters reach a predefined threshold.
You can review the error and discard statistics by creating a report in Report Writer.
Start > All Programs > SolarWinds Orion > Alerting, Reporting and Mapping > Report Writer.
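For what it's worth, here is a rough Python sketch of the threshold idea outside of Orion. It is not how Orion's alert engine works internally; the function names, the dictionary keys and the threshold of 50 are made up for illustration. It just compares two polls of the error and discard counters and reports which ones grew by at least the threshold.

```python
def counter_delta(previous, current, counter_bits=32):
    """Growth of an SNMP-style counter between two polls.

    The modulo handles a single wrap of the counter (assumed 32-bit here).
    """
    return (current - previous) % (2 ** counter_bits)


def breached_counters(previous_poll, current_poll, threshold):
    """Return the names of counters that grew by at least `threshold`.

    Both polls are dicts with hypothetical keys 'errors' and 'discards'.
    """
    return [
        name
        for name in ("errors", "discards")
        if counter_delta(previous_poll[name], current_poll[name]) >= threshold
    ]


if __name__ == "__main__":
    # Invented sample values, not real data from any device.
    earlier = {"errors": 10, "discards": 5}
    later = {"errors": 60, "discards": 7}
    print(breached_counters(earlier, later, threshold=50))
```

Alerting on the change between polls rather than the raw counter value matters because these counters only ever increase until they wrap or the device reboots.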
What type of circuits are these? Is the problem that the circuit is too highly utilized and it starts dropping packets? If so, I would create an alert to let you know when a circuit has reached something like 75% utilization. If not, and you are dropping large packets even during times of low utilization, you have another problem on your hands.
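If it helps to see the arithmetic behind a utilization alert, here is a minimal sketch. The function names and the sample numbers are my own assumptions, and the 75% figure is just the example threshold mentioned above. It turns two samples of an interface octet counter into percent utilization.

```python
def utilization_percent(octets_start, octets_end, interval_seconds,
                        link_bps, counter_bits=64):
    """Percent utilization of a link from two octet counter samples.

    octets_start / octets_end: counter values at the start and end of the
    interval. link_bps: link speed in bits per second.
    """
    if interval_seconds <= 0 or link_bps <= 0:
        raise ValueError("interval_seconds and link_bps must be positive")
    # Modulo handles a single counter wrap during the interval.
    delta_octets = (octets_end - octets_start) % (2 ** counter_bits)
    return delta_octets * 8 * 100.0 / (interval_seconds * link_bps)


def should_alert(percent, threshold=75.0):
    """True when utilization is at or above the threshold."""
    return percent >= threshold


if __name__ == "__main__":
    # Invented example: 56,250,000 octets in 5 minutes on a 2 Mbps circuit.
    pct = utilization_percent(0, 56_250_000, 300, 2_000_000)
    print(pct, should_alert(pct))
```

Inbound and outbound directions have separate counters, so on an asymmetric circuit like DSL you would run this once per direction with the matching link speed.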
There are other ways to go about this besides using just Orion. Many of today's vendors implement IP SLA features, which can be used to test for this sort of thing from the routers themselves. You can use this in conjunction with Orion (or the equipment's built-in "event manager" application) to notify you of high usage, packet loss, etc.
Also, you should think about implementing QoS for your links if there is critical data that needs to get through. If your office LAN shares the WAN link with important applications that remote users need to access, it would make sense to prioritize the important stuff and de-prioritize your office LAN traffic. Jason in accounting really doesn't need to watch the entirety of yesterday's women's Wimbledon match in HD on two different computers, you know?
Yann, I double-checked our Errors/Discards graph for the circuit we had problems with recently and it doesn't look like Orion saw anything - I'll explain further.
bleearg13, the circuits we're having trouble with are managed DSL circuits. The carrier's equipment onsite is the router and all we have onsite for managed equipment is a Cisco switch, so we are specifically monitoring the port between the carrier equipment and our switch. The problem we have with catching errors, etc. is that the problems are happening on the carrier's equipment and not on ours - the interface between our equipment and the carrier's is clean. The circuits are not being maxed out during these problems. We do max them out from time to time, but then even regular polling starts to hiccup, so it's easier to anticipate what might be happening in those cases.
I was hoping maybe there was a way to adjust polling to "catch" this problem, as pinging with 1500 byte packets reveals the problem, but otherwise it's hard to tell the problem is happening. This may not be as realistic as I first hoped, but I'm always open to suggestions.
The comment about IP SLA with the carrier does bring up a good point though - if Orion doesn't have a realistic way to monitor this, we may be able to get the carrier to monitor this problem more closely by requesting it or, worst case, paying a little more (just thinking out loud here).
Thanks for everyone's feedback so far.
That's definitely a sticky situation to be in. If the problem always shows up when pinging with 1500-byte sized packets, and the loss is not 100%, at least you're likely not dealing with an MTU issue. If you are seeing 100% loss on 1500-bytes, then you might be dealing with an MTU issue within your carrier's network.
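To put rough numbers on that, here is a small sketch. The names and return strings are my own, and it assumes plain IPv4 with no IP options. One thing worth knowing: if the 1500 is the ping payload size, the packet on the wire is 1528 bytes once the 20-byte IPv4 header and 8-byte ICMP header are added, so it gets fragmented on a 1500-byte MTU link unless the don't-fragment bit is set.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def max_unfragmented_payload(mtu=1500):
    """Largest ICMP echo payload that fits in one packet for a given MTU."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def interpret_large_ping(loss_percent):
    """Rule of thumb from the post above, not a diagnosis."""
    if loss_percent >= 100:
        return "possible MTU issue"
    if loss_percent > 0:
        return "intermittent loss, MTU issue unlikely"
    return "no loss seen"


if __name__ == "__main__":
    print(max_unfragmented_payload())   # payload size for a 1500-byte MTU
    print(interpret_large_ping(20.0))
```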
If the carrier won't let you manage the DSL routers at your locations, then it's their responsibility to find the cause of the packet loss. I would provide to them as much information as you can:
- Ping your LAN gateway with 1500-byte sized packets.
- Ping your WAN gateway with 1500-byte sized packets.
- Perform a traceroute to some website and attempt to ping one of the hops in the path with 1500-byte sized packets.
- You could run MTR on a Linux box, or pathping from Windows and gather similar information.
- As mentioned before, you could set up a pair of routers or switches on each end of your circuits that do IP SLA between them. Orion can be configured to monitor the IP SLA MIB for threshold failures, or depending on the capabilities of the equipment, something like Cisco's Embedded Event Manager can notify you via SNMP or syslog. Juniper has a similar feature called Real-Time Performance Monitoring (RPM), if you use their equipment.
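If you want to collect the ping results from the steps above on a schedule rather than by hand, something like this could work as a starting point. It is only a sketch: the function names are made up, the address in the example is a documentation placeholder, and the parsing assumes the usual English summary lines printed by Linux and Windows ping. The flags used are `-c`/`-s` on Linux and `-n`/`-l` on Windows.

```python
import re
import subprocess


def build_ping_command(host, size=1500, count=5, platform="linux"):
    """Build a ping command line with a large payload for the given platform."""
    if platform == "windows":
        return ["ping", "-n", str(count), "-l", str(size), host]
    return ["ping", "-c", str(count), "-s", str(size), host]


def parse_loss_percent(ping_output):
    """Extract the packet loss percentage from ping's summary line.

    Returns a float, or None if no loss figure is found.
    """
    match = re.search(r"(\d+(?:\.\d+)?)%\s*(?:packet )?loss", ping_output)
    return float(match.group(1)) if match else None


def measure_loss(host, size=1500, count=5, platform="linux"):
    """Run ping and return the loss percentage (None if it can't be parsed)."""
    result = subprocess.run(
        build_ping_command(host, size, count, platform),
        capture_output=True, text=True, timeout=60,
    )
    return parse_loss_percent(result.stdout)


if __name__ == "__main__":
    # 192.0.2.1 is a placeholder; substitute your LAN or WAN gateway.
    print(measure_loss("192.0.2.1"))
```

Logging the result for the LAN gateway, the WAN gateway and a hop further out gives you timestamps to hand to the carrier when you open a ticket.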
As a carrier, the above information is what I would like to see from a customer when they report a problem. While ICMP is not a foolproof way to identify a problem, it's a darn good start.
Finally, I'd like to mention that DSL circuits are a pain in the a** to deal with. DSL circuits are by far the most unstable circuits we provide. They are extremely prone to errors, since they are limited by distance and physical copper quality. Good luck and hopefully some of this rambling was useful to you.
1500 byte packets make it through without issue on a "regular" day, so there doesn't seem to be an MTU issue that we can see - we just use it as a test when our users start complaining about slowness, network applications not working, etc.
For the most part I think we're lucky with our carrier for problem resolution - when we call in with these sorts of issues, they are usually able to identify the problem reasonably quickly (once they pick up the ticket, of course) and usually it's just a reset of their equipment or an adjustment of settings and things are back to normal. All in all this inquiry came from a manager's request for us to be more "proactive" than "reactive" in these cases - in pretty much all other cases NPM has us covered, it's just this scenario we have problems with. Fortunately it doesn't happen that frequently.
Thanks for the additional troubleshooting steps bleearg13 - this could definitely come in handy in the future if we're unable to get a resolution to a DSL-related problem.