
New To NPM With Issue!

Hello,

I installed SolarWinds NPM a few days ago.

I have set up some nodes (7 HP switches) and I can see them in the Top Ten reports, etc.

I need help with an issue I am having. On Friday and again today my network almost came to a stop in the morning; on Friday this was from around 9:15 am until 10:15 am, and today it was from 9:00 am until around 12:30 pm.

When I look at the graphs, they show the switches at about 70% traffic transmitting but nothing receiving. I am trying to find the source of this, but I am not sure of my way around the tool yet.

If anyone can point me in the right direction I would be very grateful, as this is bringing the network down.

Thank you for any help

John

  • John,

    Can you post a screenshot of what you are seeing?  If you are collecting SNMP statistics from your switches and you see activity in the transmit direction, then you should also be seeing data in the receive direction.  Did you configure the switches for SNMP?  Are you monitoring all of your switches and all of the ports in and out of your network?
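    If you want to confirm outside of NPM that a switch is actually answering SNMP for both directions, a quick walk of the standard interface counters will show it.  This is only a sketch: it assumes the Net-SNMP command-line tools are installed and SNMP v2c is enabled on the switch, and the community string and IP address below are placeholders you would replace with your own.

        # ifInOctets (received) and ifOutOctets (transmitted) from the standard IF-MIB
        snmpwalk -v2c -c YOUR_COMMUNITY 192.0.2.10 1.3.6.1.2.1.2.2.1.10
        snmpwalk -v2c -c YOUR_COMMUNITY 192.0.2.10 1.3.6.1.2.1.2.2.1.16

    If both walks return counters that climb on the busy ports but NPM only graphs transmit, the problem is likely in the NPM interface configuration; if the inbound counters never move, look at the switch or its SNMP setup.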

  • Hello Johnny,

    Thank you for your reply.

    This is SNMP, but in the monitor I can only see the transmit traffic; the receive side is not active.

    I am very disappointed, as I bought this software on the understanding that when this happens, the port with the device causing the traffic would be easy to see. Today when this happened, all of the switches showed traffic, but I was unable to see the one that started it.

    I hope this makes sense.

    John

  • NPM will show you the Transmit and Receive traffic statistics if you configure it to do so.

    What you've written sounds like you may have already isolated the problem to a particular switch, and you'd like to find which port on that switch is showing a traffic pattern that may be associated with the outages.

    If this is true, it may be that the switch is being affected by something that's preventing it from reporting the data to NPM.  It could be that the switch's CPU is overwhelmed by a broadcast storm from something attached to one of its ports.

    It might also be possible that the network outage is affecting your NPM's ability to collect data.  There are some steps to take to help better understand the issue.

    Review your NPM setup first on whatever switch or router sits highest in your network.  Verify it shows TX and RX on the ports connecting to other routers and switches.  If it is not showing that information, adjust the monitoring configuration until it does.

    Once you have it reporting this info from the top end switch or router, proceed to duplicate that monitoring on all ports between all switches and routers.  If your licensed element count allows, monitor all ports, or at least all active ports.  You'll be able to see transmit and receive patterns, port errors, and Layer 2 and Layer 3 traffic information, assuming you selected L2 and L3 monitoring.

    In your type of outage NPM can help you first by informing you which, if any, devices have gone unreachable to ICMP polling.  They'll appear red, or "down" in NPM.  But remember that they're only up or down from the physical and logical viewpoint of the NPM server.  It's a reason why the NPM server should be attached to the network at a central location instead of an edge location.

    Your job during an outage is to discover what systems are down and determine how to best recover them quickly.  Knowing what their commonalities and inter-connections are via the great network documentation and drawings you've built (You DO have an up-to-date drawing of all physical connections between switches and routers, right?) helps here a lot!

    Eventually you must make a choice:  Either connect to the highest devices and troubleshoot them and do whatever it may take to recover them, or perform data capture and troubleshooting to help identify the cause of the outage and prevent it from reoccurring.  Troubleshooting may take more time than Management prefers, especially if the outage is costing a lot of money.  But it is key to preventing future outages by understanding the actual causes of the problems.


    Basic troubleshooting can involve many different instructions and tasks.  A rough CLI sketch follows this list, but you might start by:

    1. Looking at CPU and memory utilization, uplink and downlink utilization, and errors, and comparing them against your baselines.  If something is maxed out, that's a red flag that may yield helpful results when you troubleshoot it.

    2. Capturing and reviewing the devices' logs.  Sometimes they'll tell you exactly what failed and why.  Other times they'll tell you about events that may be unusual.  Most often you'll find logs are full of routine, unremarkable activity, and you'll need to filter them to find the more important information buried among the common items.

    3. If your devices are Cisco:

         A. Set your terminal emulator to capture and save all output of the CLI session

         B. Then set the Cisco device to not paginate (terminal length 0 works for most Cisco devices)

         C. Issue a show tech command (this works better from an SSH session than from a direct console CLI session due to the amount of data that will be presented, and due to SSH being able to handle that throughput better than a standard console connection).

         D. Once the show tech output is complete, stop capturing the data.

         E. Open a TAC case with Cisco and share with them the captured show tech, along with the device model and IOS version and serial number.  Define the problems and ask for help identifying causes, correcting problems, and preventing them from reoccurring.
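    As a rough illustration of steps 1 through 3, here is the kind of CLI session you might run.  Treat it as a sketch only: the first block assumes HP ProCurve-style switches (what the original poster has), the second assumes a Cisco IOS device, and exact commands vary by model and software version, so verify them against your documentation.

        # HP ProCurve-style quick checks (steps 1 and 2)
        show cpu                 # current CPU utilization
        show system              # uptime, memory, and general system status
        show interfaces brief    # per-port status and basic counters
        show logging             # event log; look for entries around the outage window

        # Cisco IOS capture for a TAC case (step 3); start your terminal
        # emulator's session capture before issuing these commands
        terminal length 0        # disable pagination so nothing scrolls away
        show tech-support        # very long output; let it finish, then stop the capture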

    There are MANY paths to follow when troubleshooting, but perform basic triage first to determine what steps will provide the most return on getting the network going again for the least amount of time and effort.  Remember that you will probably find it best to eliminate problems at the lowest levels first, then work your way up the OSI model.  Once you've proven Layer 1 is good, troubleshoot Layer 2, then 3, etc.

    NPM can be a great tool for showing paths that are up and down, RX and TX traffic, trends for utilization of bandwidth and CPU.  But NPM won't necessarily tell you WHY things are down; it'll tell you WHAT is down; it's your job to interpret that data and focus on the likely causes.

    Monitoring all ports can be particularly helpful if you have an outage caused by a NIC or interface going into a broadcast storm. Network CPU stats should also match up with the start and end of an outage caused by a broadcast storm.

    When you're seriously under pressure to recover things, you may find it necessary to actually disconnect switches and routers to isolate and eliminate the paths of potential causes.  Sort of like cutting off your leg at the knee to save your life from the gangrene in your foot.  Not a pleasant experience, but if you find that your network recovers instantly when you disconnect a downlink to a particular switch, you'll be able to focus on what may be attached to that switch that is causing the issue.

    It's a huge can of worms you've postulated, but these ideas can get you started.

    Swift Packets!

    Rick S.

  • mcmasterj

    You can SPAN the uplink and capture the upstream traffic using Wireshark.  From the packet capture you should be able to find the culprit easily.
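    For what it's worth, here is a minimal sketch of that approach with placeholder port numbers.  The SPAN syntax below is Cisco IOS; HP switches (which the original poster has) call the equivalent feature port mirroring and use different commands, so check your model's documentation.

        ! Mirror the uplink to a spare port where the capture PC is connected
        monitor session 1 source interface GigabitEthernet1/0/24 both
        monitor session 1 destination interface GigabitEthernet1/0/48

    With Wireshark listening on the mirrored port, Statistics > Conversations will show the top talkers, and a display filter like eth.dst == ff:ff:ff:ff:ff:ff will show whether a single host is flooding broadcasts.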