
The network is fine.. or is it?

It's amazing,


I used to think my network was running pretty well. There were a few hiccups now and then, but by and large I got by and thought everything was business as usual. The boss would ask me, "Craig, how is the network today?" and I would say, "It's fine, boss." Flash ahead to the fiscal year's end: there is some money left in the budget, and I am given fifteen minutes to decide what tools to buy, otherwise the money doesn't get spent. Most people wouldn't be prepared for this, but I was ready. I pulled out my wish list and said I wanted SolarWinds Orion NPM. After a few frantic calls to the vendor of choice, it was all set. I was thinking: it's great that I got this, but I probably won't use it that much, because we have no problems, and it's going to take forever to install, and the learning curve will be immense. In reality, the software was really easy and intuitive to install: point here, click there, answer a few questions, and it was done.

But why was there so much red everywhere? This must be a software bug, because my network has no problems... I spent the rest of the day tweaking and twiddling about, and I have to say, it was like turning the light on in a dark room. I was able to solve a lot of long-standing problems, some of which I didn't even know I had.

There was the switch with a bad blade that had intermittent problems but never failed outright, so no alarm was tripped. After being alerted to this, I had the blade replaced and things began to run clean. Sometimes the network was slow, but I never could attribute it to any single cause. It usually coincided with a home game at the local ballpark. Turns out a lot of non-work-related web streaming was going on, and some other folks were enjoying Netflix.

There was the router that went down even though it had redundant power supplies: no one ever saw when the first one went down, but people sure noticed when the second one failed. I set up an alert to monitor this and several other things. The major cost of IT where I work is not so much the hardware or software; it's the cost of actually scheduling time with the union and paying for the lift truck. Just the logistics were mind-boggling and intensive, and nobody wanted downtime. I am now able to easily automate and monitor my network, and do a lot more proactive monitoring and forecasting. I am just as busy as I was before; the difference is that now I have a better view of what is going on with the network, and I can act proactively instead of reactively. I have a lot less stress. I have lost fifty pounds and I have a corner office... lol, just kidding. But I do get to sleep through the weekends without the pager going off at 3am, and I still go to the same number of meetings, but now they are more about future planning than postmortems.
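For anyone wondering what the redundant power supply alert boils down to, here is a rough Python sketch of the logic. It is not how Orion implements it; the device names and the input format (a list of healthy/unhealthy flags per device) are made up for illustration. The point is simply to warn when redundancy is lost, instead of waiting for the last supply to die.

```python
def redundancy_alerts(power_supplies):
    """Map each device with a power problem to an alert message.

    power_supplies: dict of device name -> list of booleans,
    one per power supply (True means the supply is healthy).
    This input format is an assumption made for the example.
    """
    alerts = {}
    for device, supplies in power_supplies.items():
        healthy = sum(supplies)  # True counts as 1, False as 0
        if healthy == 0:
            alerts[device] = "critical: no healthy power supply"
        elif healthy < len(supplies):
            # Device is still up, but one more failure takes it down.
            alerts[device] = "warning: redundancy lost"
    return alerts


# Hypothetical inventory: router-a has already lost one of its two supplies.
status = {
    "router-a": [True, False],
    "router-b": [True, True],
}
print(redundancy_alerts(status))
```

The useful part is the middle branch: a device that is still "up" but running on its last supply gets flagged while there is still time to schedule the fix.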

What about you guys? Can anyone share a general process of things you might monitor and proactively forecast? Any tips and tricks pertaining to procedure are greatly appreciated!

I am now a believer in network performance management. It has really paid for itself many times over.

29 Comments
Level 13

We gained a great deal of visibility once we had NPM monitoring our gear.

Some of the things we suddenly could see:

     Switches A,B,C have been running out of memory.  That explains the random reboot every couple months.

     Power supplies (redundant luckily) in switches D, E, F are dead.  Better check breakers and cords.

     Redundant uplinks in switches G, H, I are down.

     WAN links to sites 1, 2, 3 get saturated every day.  Time to upgrade.

     WAN links to sites 4, 5, 6 are not saturated.  Users' slowness complaints can be justifiably sent to desktop support.

     The WAN link to site 7 shows errors during the workday.  Call the provider and tell them the circuit is NOT running clean.

     Gear in switch closet c1 is running hot.  Better check HVAC.

There's all sorts of other stuff that we could suddenly monitor and produce reports for but this is just a short list that shows immediate ROI to managers.  And managers like being able to open a web page and see a bunch of green icons.
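As a rough illustration of how "saturated" can be judged for a WAN link, here is a small Python sketch that turns two readings of an interface octet counter into an average utilization percentage. This is the generic counter-delta arithmetic, not a description of what NPM does internally; the function name, the 64-bit counter default, and the sample numbers are assumptions for the example.

```python
def link_utilization_percent(octets_start, octets_end, interval_seconds,
                             link_speed_bps, counter_bits=64):
    """Average utilization of a link over one polling interval, in percent."""
    if interval_seconds <= 0 or link_speed_bps <= 0:
        raise ValueError("interval and link speed must be positive")
    # Octet counters wrap around at 2**counter_bits, so take the
    # difference modulo the counter size to survive a single wrap.
    delta_octets = (octets_end - octets_start) % (2 ** counter_bits)
    bits_transferred = delta_octets * 8
    return 100.0 * bits_transferred / (interval_seconds * link_speed_bps)


# Hypothetical 10 Mbps link polled every 5 minutes (300 seconds).
usage = link_utilization_percent(0, 37_500_000, 300, 10_000_000)
print(f"{usage:.1f}% utilized")
```

A link that sits near 100% for the same hours every day is an upgrade candidate; one that stays low while users complain points the investigation somewhere else.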

Level 10

There are a ton of things my NMS gives me visibility into that I would not get otherwise.

For network:

-link traffic patterns, peaks, failures and times.

-mem/CPU usage and creep

-Syslog/Traps for general health and goings-on

For Server:

-CPU/Mem/Volume usage and creep

-Service status and health

-General load and usage of network

-Event logs

If you watch these daily and/or weekly, you can start to get a general feeling for what "normal" is in your network and then quickly pick out faults when something does go south.
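To make the idea of "normal" a bit more concrete, here is a minimal Python sketch of a baseline check. It assumes you already have a list of past readings for one metric (the numbers below are invented) and uses a simple mean and standard deviation; real products use their own baselining methods, so treat this only as an illustration of the principle.

```python
from statistics import mean, stdev


def build_baseline(history):
    """Return (average, spread) for a list of past readings."""
    if len(history) < 2:
        raise ValueError("need at least two readings to build a baseline")
    return mean(history), stdev(history)


def is_abnormal(value, baseline, tolerance=3.0):
    """True if value is more than `tolerance` standard deviations from normal.

    The check is symmetric, so an unusual drop is flagged as well as a peak.
    """
    average, spread = baseline
    return abs(value - average) > tolerance * spread


# Hypothetical CPU readings (percent) from previous days at the same hour.
baseline = build_baseline([40, 42, 38, 41, 39])
for reading in (41, 90, 5):
    print(reading, "abnormal" if is_abnormal(reading, baseline) else "normal")
```

The longer the history, the more trustworthy the baseline, which is one reason an NMS gets more useful the longer it runs.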

MVP

NMS becomes more useful the longer it runs. I'm always interested in identifying the use patterns of various network and server resources, so you can make decisions on whether you have sufficient capacity for peak use times. The historical data, the baselines... that's where you can turn information about your network into knowledge about your operations.

Level 10

I came into this position with NPM already implemented. I don't know what we would do without it, though. We are definitely much more proactive than reactive, which makes managing the huge infrastructure possible with a much smaller team. We still get the occasional router or switch that takes a dive in the middle of the night, but it's a lot more manageable than at other places I've worked that didn't have this level of insight into their environment.

Level 11

We have a fairly large SolarWinds deployment, and I think it's really invaluable to our mission as a healthcare organization.  Having a tool that monitors our infrastructure lets us know of an issue before the helpdesk even gets a call.  Oftentimes, we are able to resolve a potential problem before any users are impacted.  Other times, we've already self-dispatched to a problem and arrive on site shortly after (or before) the user places the trouble ticket.

We're finally getting management buy-in for care and feeding of our solarwinds implementation, so that's a plus.  Previously we only had a few minutes a week to work on it... unless it was broken.  Then it was a chorus of "WHY DIDNT YOU FORSEE THIS FAILURE!!!!!111!1!!ONE"

Level 15

The real trick is when you start warehousing your SQL data and getting multi-year trends. I've seen environments that have SSRS reports set up to trend failure rates down to specific modules within the network gear.

Oh to be a fly on the wall during the meetings with those vendors... "So you sold me ABC 3 years ago for my enterprise with promises of X,Y,and Z... Take a look at this and tell me what you would like to try to sell me now..."

Level 21

The catch with any successful NMS/APM/Monitoring is the whole "Good Data In = Good Data Out".  You need to be constantly tuning and updating your NMS to provide useful data; otherwise it slowly (and more quickly in some cases) becomes more and more useless over time.  The larger the deployment the more true this becomes.  Unfortunately my company often wants to treat it like a Ronco and "set it and forget it" and that simply doesn't work.

Level 21

I think our biggest "Aha!" moment was when we set up the new AppInsight for SQL on a customer database that was experiencing problems.  Immediately we saw severe index fragmentation and some queries that were taking forever to run.  It didn't take long to fix the problems, and we saw some serious performance improvements.

MVP

I've got a customer that came to us regarding some performance issues. I suspect it's their SQL server so I've been trying to get permission to run AppInsight for SQL in their environment for further analysis. Slow progress but I think they'll come around eventually and approve it.

Level 11

As belthasarx stated, it's the change from being reactive to proactive that is the biggest benefit to us: to give a recent example, we had a switch that regularly had a response time of over 200ms. Without NPM alerting us we'd have been none the wiser (as the switch stayed "up" throughout) but the alerts allowed us to investigate and rectify the cause before the end users noticed or it became a much bigger issue.
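For illustration, the kind of check behind an alert like this can be sketched in a few lines of Python. This is a simplified stand-in, not NPM's alert engine: the function name, the 200 ms default, and the "three polls in a row" rule are assumptions chosen for the example, the idea being to alert on a sustained problem and ignore a single slow poll.

```python
def sustained_breach(samples_ms, threshold_ms=200.0, consecutive_polls=3):
    """True if the most recent polls all exceeded the response-time threshold.

    samples_ms: response times in milliseconds, oldest first.
    """
    if consecutive_polls <= 0:
        raise ValueError("consecutive_polls must be positive")
    recent = samples_ms[-consecutive_polls:]
    # Require a full window so a device with too little data never alerts.
    return (len(recent) == consecutive_polls
            and all(sample > threshold_ms for sample in recent))


# Hypothetical response times for a switch that stays "up" but is slow.
print(sustained_breach([20, 250, 260, 310]))  # last three polls all above 200 ms
print(sustained_breach([250, 260, 30]))       # latest poll recovered
```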

Of course, this brings the danger that much more of our work now goes "unnoticed" - as far as the management is concerned there are rarely any issues so they start wondering what we actually do all day

Level 8

Relatively new here, so please bear with me! I would say that most of what I have learnt, I have learnt from having a good, steady baseline which I can review and compare against. This baseline also gets reviewed via our monitoring software (can we plug brand names here, okay??), which fires out alerts notifying us of oddities, including drops in traffic, not just peaks.

Level 10

Very true. Also, don't forget that the same is true for getting team buy-in and keeping them using it regularly.

The best NMS in the world is useless when your teammates don't know how to log in and check status during a core meltdown and you're on vacation.

Level 8

It's truly amazing how one believes that just because things are working fine, there is no issue on the network. I have come across some issues that my users called my attention to, where surface checking showed that things were working as expected, whereas a critical service for remote users was not fine.

It can be embarrassing when it is your user notifying you of an issue instead of you, the admin, being proactive.

Level 9

Our deployment of Orion NPM is a little different than most in that my responsibilities include management of a dispatch system at a mine site and the radio network that is required to make that operate. NPM has given me the ability to improve our network coverage, make real-time adjustments, monitor the status of the network and clearly determine whether the issue(s) we see are related to the network or the dispatch system. Before, we would have both vendors blaming each other's products as the cause. But now, I can clearly show stats of the network performance and narrow our attention to the correct issue instead of wasting time with the blame game.

We have set up alerting that goes to a group that includes the dispatch controllers, so they instantly know if there is a network issue that will cause their dispatch system to behave erratically. In a couple of instances, we have used this to notify the electricians that a section of the mine has lost power (no one was working off those power feeds at the time) before anyone knew about it. Talk about proactive! Great product with lots of hidden "bonuses" customers never can anticipate.

Level 7

Nice one thanks for sharing.

Level 12

As someone who implements Orion for numerous clients, I concur with what @craig schnarrs said, specifically as it relates to hardware monitoring.  It is not uncommon at all to identify a dozen or more hardware issues that clients had no idea existed prior to installing portions of the Orion suite.

Level 10

Great write-up. Really mirrored what I experienced as well. In my case there was zero monitoring and very low accountability across all departments, so when there was a "what's wrong with the network?" problem it was always my team digging postmortem, trying to find what was causing the issue. After SolarWinds was installed, we were able to hold our upstream provider more accountable and, through logging, able to see when people were rebooting equipment without letting anyone else know. This has been HUGE!

NPM and NTA were quick to set up and show so much granular information that I'm still amazed after a year of running the software (10.7 was worth the software renewal!!).

Level 10

It still baffles me that in today's world IT departments still go without any monitoring. To me even basic ping testing is not acceptable but you still see it...

Level 13

In my SolarWinds training classes, I try to enforce some small tasks folks can do on a daily basis to increase their impact while using SolarWinds products.  This would be something you do every morning right when you get to work.  It is meant to last just a few minutes in total. 

First, check the Top 10 page.  The goal would be to get this to be usually green by removing/fixing/filtering nodes that are not, setting good thresholds, and using auto thresholds.  Skim this and look for network devices with high CPU and memory as well as servers with high interface traffic.  The top offenders are here and may indicate you need to check them out first thing.

Next, check the alerts.  The top offenders on the Top 10 may have or should have triggered an alert.  This is also a good place to see if your alerts need tuning.  IMO, the Top 10 offenders should likely have alerts too.

If you find anything and want more info, Message Center is next.  Filter the view to show one of the offenders you found on the Top 10 and/or alerts views and get a chronological history of events/alerts from it.  I like to enable many warning-level alerts that send no emails just to populate the Message Center.  Then I also create a bunch of critical alerts that populate the event log and email.

If you spend a few minutes doing this each morning, you will start to recognize the devices and start to also recognize patterns.
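The cross-check between the Top 10 and the alerts can be pictured with a short Python sketch. It works on plain data you would have to export or type in yourself; the node names, the "cpu_percent" field, and the function are hypothetical and not part of any SolarWinds API. It just lists top offenders that have no active alert, which is a hint that an alert needs tuning.

```python
def offenders_missing_alerts(nodes, alerted_names, metric="cpu_percent", limit=10):
    """Names of the top `limit` nodes by `metric` that have no active alert.

    nodes: list of dicts, each with a "name" key and the metric key.
    alerted_names: set of node names that currently have an active alert.
    """
    top = sorted(nodes, key=lambda node: node[metric], reverse=True)[:limit]
    return [node["name"] for node in top if node["name"] not in alerted_names]


# Hypothetical morning snapshot.
nodes = [
    {"name": "sw1", "cpu_percent": 95},
    {"name": "sw2", "cpu_percent": 20},
    {"name": "sw3", "cpu_percent": 88},
]
# sw1 already has an alert; with limit=2, sw3 is the only offender without one.
print(offenders_missing_alerts(nodes, {"sw1"}, limit=2))
```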

Sohail Bhamani

Director of Public Training

Solarwinds Training and Professional Services

Level 12

Thank you for the ... Sohail Bhamani

Level 9

I find it interesting that the OP talks about how he expected it to be hard to set up and hard to learn, but it really wasn't. I've always had some of the same reservations. I feel like setting something like this up would require a lot of configuration on each device and setting up SNMP traps and all that stuff. Is it really as easy as you describe? I've never really worked with SNMP before, and every time I look into it, it just sounds like a pain to set up.

Level 11

We have been running NPM since v7 and I agree with all of the points made so far.

The reactive to proactive paradigm shift is immensely healthy in an environment where you have tight tolerances on outages. 

For someone who will take the time to immerse themselves into the data that has been collected over the course of years you start to see patterns and can more accurately predict possible causes of issues.

Being able to deliver reports on the Nodes Down for the last 24 hours or Availability for the last 30 days helps to also identify what could be sporadic issues that don't get reported.
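As a simple illustration of what sits behind an availability report, here is a Python sketch that computes availability from up/down poll results and lists nodes under a target. The input format, node names, and 99.9% target are assumptions for the example; an actual Orion report is built from its own stored statistics.

```python
def availability_percent(poll_results):
    """Percentage of polls that were answered.

    poll_results: iterable of booleans, True when the node responded.
    """
    results = list(poll_results)
    if not results:
        raise ValueError("no poll results to evaluate")
    return 100.0 * sum(results) / len(results)


def nodes_below_target(polls_by_node, target_percent=99.9):
    """Sorted names of nodes whose availability is under the target."""
    return sorted(
        name for name, polls in polls_by_node.items()
        if availability_percent(polls) < target_percent
    )


# Hypothetical poll history: edge-rtr2 missed one poll out of 100.
history = {
    "core-sw1": [True] * 100,
    "edge-rtr2": [True] * 99 + [False],
}
print(nodes_below_target(history))
```

A node that never stays down long enough for anyone to complain still shows up here, which is how the sporadic, unreported issues get found.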

We also utilize many of the additional Orion modules and find value added for each one (IPAM needs to support infoblox and then we would be happy all around).

Level 12

Thanks for sharing.

Level 9

I recently bought NPM, NCM, SAM and Kiwi. The monitors are great; however, I still cannot figure out how to monitor the physical hard drives on the farm of Dell R710s we have. Support has mentioned that we need to set up a Dell OpenManage server to get this data; however, we don't have the resources to set that up. Does anyone have any ideas on this, or has anyone run into this issue?

As a side note we have found monitoring UPS for smaller clients quite handy. It allows us to know there is an issue ahead of time and shut down servers and process before they get damaged.

Level 7

Unfortunately, our environment is so big (we have 40 sites internationally) and has so many administrators that alerts sometimes get ignored. I have taken it upon myself to email all the admins at least once a week with all of the nodes that are down or interfaces that have problems. I usually get a good response from the admins, and either I make the changes/corrections in Orion or the problem is taken care of on their end. Most times they didn't even know they had a problem because they have Outlook rules that file away the alerts unseen. I know, this is a personnel issue and not a system issue.

Level 9

We purchased NPM, WPM, SAM, NCM, and NTA to replace several monitoring tools we were using. The visibility and granularity that SW provides exceed those of the tools we replaced. Along with resolving outstanding issues, we discovered new issues that were previously unknown, thanks to the increased visibility of our network devices. With the added support of SW techs and the thwack community, we love our SolarWinds products.

Level 9

NPM definitely makes it easier to be proactive

Level 11

Nice Info.

Level 15

Sort of the experience I had when I got deeply into SolarWinds.  I knew a lot of things that were going on in the network, but SolarWinds brought a lot more to the surface, and it has also brought ROI to some of the future projects that I have been involved with.  Thanks for the post.