Sorry to hear you're having problems.
1) Sadly this is going to be a manual process; it would be difficult to say how the issue occurred in the first instance, but resolving it should be fairly simple. Start by running a report of all nodes showing the Caption and IP Address. I would tag the troublesome devices with a custom property to help filter them out later, then work through each issue one at a time.
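As a starting point, a minimal sketch of that report query, assuming the standard Orion `Nodes` table (the same one the duplicate-finding queries later in this thread use):

```sql
-- List every node's caption and IP, sorted so likely duplicates sit together
SELECT Caption, IP_Address
FROM Nodes
ORDER BY IP_Address, Caption
```

Running it ordered by IP makes duplicate addresses easy to spot by eye before you start tagging them with a custom property.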
2) This is something that can be set up with NPM alone. You can create a report, or an alert (or both). The system would check the last polled date of each of your devices and inform you if that timestamp hasn't been updated in the last X polling cycles (the X is determined by you).
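A rough sketch of such a report, assuming the Orion `Nodes` table's `LastSync` column holds the last successful poll time and a 10-minute polling interval (adjust both to your environment):

```sql
-- Nodes whose last sync is older than 3 polling cycles (3 x 10 minutes here).
-- LastSync as the "last polled" timestamp is an assumption; verify against
-- your Orion database before alerting on it.
SELECT Caption, IP_Address, LastSync
FROM Nodes
WHERE LastSync < DATEADD(MINUTE, -30, GETUTCDATE())
ORDER BY LastSync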
3) AppInsight for SQL, whilst brilliant and informative, is very component heavy. This isn't an uncommon issue. You could either load balance the polling with additional polling engines, or, if you think the monitoring you are getting back from AppInsight is excessive, download or create smaller, more focused templates that monitor just what you require from the SQL environment.
4) Again, it's hard to be specific; many things could cause SolarWinds to perform slowly, but I would start with the database, as that is the most common pinch point. If your SQL Server isn't set up properly and is struggling for CPU or I/O, then your entire environment will suffer.
5) SolarWinds can be set up to run defragmentation during the nightly database maintenance, but you should also set up some basic SQL maintenance. There is a good video here: How to create a SQL Server maintenance plan - Video - SolarWinds Worldwide, LLC. Help and Support
6) This could be an issue with the Pub/Sub or RabbitMQ; it also depends on the version of SolarWinds you are running. There is a good tip for splitting the Business Layer into individual processes, which can help identify whether you've got a problem with one specific module: Forcing the Business Layer to load plugins in a separate or 64-bit process in NPM 12 - SolarWinds Worldwide, LLC. Help … Again, if the SQL DB is suffering, it will have a ripple effect.
Sorry, I can't give anything more specific without a little more information, but if this has been going on for this long, you should open a support case with SolarWinds directly from your Customer Portal. They will be able to work through basic troubleshooting and give you some tips.
1. Duplicate IPs and hostnames, try a bit of SQL:
--Find duplicate IP addresses--
SELECT IP_Address, COUNT(*)
FROM Nodes
GROUP BY IP_Address
HAVING COUNT(*) > 1
--Find duplicate node names--
SELECT Caption, COUNT(*)
FROM Nodes
GROUP BY Caption
HAVING COUNT(*) > 1
And then a manual process of sorting it out.
I agree that this does seem quite odd, as I grew up with OpenView, and NNM could resolve these problems with ease.
But maybe understanding layer 3 was easier back then... (showing my age).
2. We have similar issues where customers update device credentials and don't tell us. For SNMP, I set up a common sysUpTime poll on every SNMP device, with an alert fired if it does not return a value.
For WMI, there is a node "Minutes from last sync" value; if this value goes over your polling rate and the node is not down (i.e. you still have ICMP), it raises an alert warning that WMI is failing.
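A sketch of that alert condition as SQL, assuming the Orion `Nodes` columns `ObjectSubType`, `Status`, and `LastSync` (check these against your schema, and swap in your own polling rate):

```sql
-- WMI-polled nodes that are still up (so ICMP works) but haven't synced
-- within the polling interval, i.e. WMI is likely failing.
SELECT Caption, LastSync
FROM Nodes
WHERE ObjectSubType = 'WMI'
  AND Status = 1                                     -- 1 = Up
  AND DATEDIFF(MINUTE, LastSync, GETUTCDATE()) > 10  -- > your polling rate
```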
3. AppInsight for SQL is a busy old template. We found that the DBAs didn't want that much detail, at least not until they were troubleshooting problems. So we stripped the SQL monitoring down to the SQL services plus some tests to ensure that the database was responding correctly (using the SAM SQL Server User Experience monitors), at a fraction of the load.
4. SQL 2016 offers some performance increases over previous versions, so if you aren't on it already, it might be worth a look.
I managed to talk our platform team into allowing me SSDs for most of my SolarWinds systems; is that an option? We split the OS, applications, logs, swap and web server onto separate drives.
And we added an additional web server (AWS) to lighten the load further.
Also have you set up the antivirus exclusions on your Orion servers?
What is your element count and polling completion looking like?
Check out the report called "SAM Component & Element Count Per-Polling Engine" and the Polling Engines (/Orion/Admin/Details/Engines.aspx).
I seem to recall it's about 10,000 elements per poller; beyond that, it's time to start thinking about another Additional Polling Engine (APE).
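If you'd rather pull those per-poller numbers straight from the database, a rough sketch, assuming the standard Orion `Engines`, `Nodes`, `Interfaces` and `Volumes` tables (column names are assumptions, so verify before relying on it):

```sql
-- Approximate element count (nodes + interfaces + volumes) per polling engine
SELECT e.ServerName,
       (SELECT COUNT(*) FROM Nodes n
         WHERE n.EngineID = e.EngineID)
     + (SELECT COUNT(*) FROM Interfaces i
         JOIN Nodes n2 ON n2.NodeID = i.NodeID
        WHERE n2.EngineID = e.EngineID)
     + (SELECT COUNT(*) FROM Volumes v
         JOIN Nodes n3 ON n3.NodeID = v.NodeID
        WHERE n3.EngineID = e.EngineID) AS Elements
FROM Engines e
ORDER BY Elements DESC
```

Anything approaching the ~10,000-element mark per engine is a candidate for rebalancing or an additional poller.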
Are the servers up to spec? Orion multi-module system guidelines - SolarWinds Worldwide, LLC. Help and Support
The queries helped greatly. We've removed all duplicates from the system and are better organized now. As for the stale polling, we've had help from Loop1 to build a report that uses the CPU resource: if it hasn't been updated in X days, the node shows up in our report as a potential polling failure. It still needs to be cleaned up, and I haven't figured out how to do that yet, but in the meantime, with some manual work involved, we're managing to get things done. As for AppInsight for SQL, we are trying to move our 2005 servers into their own template that doesn't use so many resources, and we've been experimenting with creating a new template. But our SQL admins like the resources in AppInsight, which can't be added to a custom template, as they're proprietary to that template: it uses queries and methods to gather data that SolarWinds doesn't want us to know about, so they lock it down.
As for our SQL server, we've bumped RAM to 100 GB, which improved the efficiency of cache use. We've spoken with the storage group, and they gave us a larger portion of our LUN in SSD space and reconfigured our VMDKs to use a fast policy they have available. This helped us reduce buffer spill to disk, and sped things up when it does happen. All in all, a great performance improvement. But we've hit another wall: semaphore timeouts, among other timeouts. Checking NIC use, we see that during database maintenance, or a diagnostics run for example, we max out the available 1 Gb we have on the NIC. So once again we're working with our VM guys to see if a bigger NIC is possible on that server.
Slow progress, but better than no progress at all. We also found that four of our six pollers have over 20,000 components and over 10k apps each. (I wish SAM were not rate limited, as our hardware can handle the requests to an extent.) This is why we are trying to clean up our AppInsight for SQL; it's killing us to monitor our huge SQL infrastructure.
I welcome any further feedback, if any is available. Thanks!
yaquaholic hit all the main points pretty well. As you indicated, it is hard to trim back AppInsight if the users actually want the data. Everything in there *could* be recreated using SQL User Experience monitors, but it would be a painful manual process, which was the whole use case for AppInsight in the first place: using scripts to figure out everything you needed and pull it in. You could trim the load on your system by asking the DBAs for a list of any databases they really don't need monitored that aggressively, so you can unmanage those databases specifically. Just as an example, you could possibly get away with unmanaging all the instances of master, model, and msdb, since those are defaults that tend not to have performance issues. Unmanaging just those databases across 300 instances could add up to a pretty decent number of components overall that aren't terribly important. (Quick math: each database in an AppInsight template tracks about 40 components; remove 3 databases from each of 300 instances and you might be looking at over 30,000 fewer components to poll.)
Another trick that could yield a small reduction in workload is to edit the AppInsight for SQL template and change the Minimum Size of Indexes to Retrieve. Small indexes are usually not a big deal if they fragment, because SQL can scan the whole thing in a few milliseconds anyway. The default is 1 MB, but if you talk to your DBAs, they might be able to offer some guidance on what counts as a "large" index in your environment that is worth keeping an eye on. Depending on how good their maintenance jobs are, they might honestly tell you something to the effect of "we don't need real-time monitoring on indexes because we handle that in the scheduled jobs; as long as those are completing, we are as good as we are going to be", and then you could disable that component completely and eliminate another chunk of data from the polls (which, by extension, keeps the Orion DB smaller and improves performance overall).
If things are really dire, you can also slow the polling intervals from 5 minutes to 10, which would cut your load in half in one swoop, at the cost of less granularity; but in the long run, I bet most people wouldn't even notice the change.
Also, once you are in the 7+ APE range, it is probably worth taking a look at the APM licensing instead of the traditional licenses. It includes unlimited APEs and might actually save you money overall. With unlimited APE licenses, you can stack a couple onto your pollers and maybe get more use out of your hardware. According to the latest docs, you would be able to do 20,000 component polls per server, and if you are running the latest versions, the overall maximum component count is very high now.
Indeed. As mentioned, I'm actually happy to see that progress is happening. Sometimes when you're in the thick of it all, you get a one-track mind and narrow vision, and it can be hard to see other options when you're focused so hard on trying to get things done. This is why I always make use of Thwack: throw an idea and some pain points out there and see what the community has to say.
I've been the administrator for several environments prior to this one, and I've never seen one that needs so much work. Most others had clean logs, specs up to date, everything in harmony and purring like a cat, all nice and happy. This one has compounding issues: you fix one problem, another ten replace it, and it feels never-ending.
But I'm getting there.