Hi
It is taking a really long time to get a "steady" NPM and APM setup whose output we can actually trust, and I am getting to the point of giving up!
I make sure I have the latest versions/SPs installed and that the recently released templates have been applied (again) to each server/node.
But all day long I get events showing so many problems. Or are they really problems?
Since applying the new Windows 2003-2008 templates (it was a lot worse before), I am seeing the following:
a) Servers that show "%Processor Time for xxxx server = down". This happens throughout the day and night.
When I check on the actual server, it might be that CPU peaked for a second and this event was logged. This goes on all day on many different application servers, as you would expect, since they use CPU while the app is being used.
I can't find anywhere in NPM or APM to change the event thresholds so that the event only fires when high CPU lasts for, say, 20 or 30 seconds.
b) Servers that show Pages/Sec as critical. Again, when I check the server, the counters in Performance Monitor only show "blips", and there is no way of setting a duration on the thresholds.
c) I have a few servers where I cannot for the life of me get application templates to show anything, as they can't connect to WMI using either a WMI or an RPC connection.
I have followed all the docs on Thwack about troubleshooting and connecting remotely to WMI: changed the rights/permissions in security, tried different credentials as well as the ones set in Orion, rebuilt WMI on the servers in question, rebooted, and so on. The only thing left is rebuilding the servers, which is not possible as they are live production. I am lost as to what to try next.
d) I have domain controllers whose status randomly becomes "unknown", and 5 minutes later they are "up" again. Nothing changes on the network or the servers and nothing is bottlenecked, but I just can't see why this happens, and all my searching on this forum hasn't brought any answers.
e) I also have servers where the Network Connections, Remote Registry, and DTC services all "drop out" for no apparent reason. I remote desktop onto the servers in question and go into Services, and they are fine; 2 minutes later NPM/APM reports them as fine too. This can happen once or twice a day, though not every day, and on different servers each time.
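To make clear what I mean by a threshold with a duration for the CPU and Pages/Sec events above, here is a rough sketch of the logic in Python. It is not anything taken from NPM or APM; the function name, the sample readings, and the 10-second poll interval are all made up purely for illustration.

```python
from typing import Iterable, List


def sustained_breaches(samples: Iterable[float],
                       threshold: float,
                       min_consecutive: int) -> List[int]:
    """Return the sample indices at which an alert would fire.

    An alert fires only once a value has stayed above `threshold`
    for `min_consecutive` samples in a row, so one-sample "blips"
    are ignored. (Hypothetical helper, not a SolarWinds API.)
    """
    alerts = []
    run = 0  # length of the current run of samples above the threshold
    for i, value in enumerate(samples):
        if value > threshold:
            run += 1
            if run == min_consecutive:
                alerts.append(i)  # fire once per sustained run
        else:
            run = 0  # any sample back under the threshold resets the run
    return alerts


# Made-up CPU readings (%), assumed to be one reading every 10 seconds.
cpu = [12, 95, 14, 10, 96, 97, 98, 99, 20, 91]

# Require 3 consecutive readings (about 30 seconds) above 90%.
print(sustained_breaches(cpu, threshold=90, min_consecutive=3))  # prints [6]
```

With `min_consecutive=1`, which is effectively what I seem to have now, the same readings would fire three times instead of once.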
Our network team has looked at the network using traffic management, and I have checked the loads on the servers; all seems to be OK.
Is anyone else getting this type of problem with their install, or does anyone have any suggestions? Installed at the moment:
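For the servers that will not connect over WMI, one basic check that can be run from the Orion server is whether TCP port 135 (the RPC endpoint mapper, which remote WMI/DCOM connections start from) is reachable at all. Below is a minimal Python sketch of that check; the function name is hypothetical and it only uses the standard `socket` module. A successful connect only rules out a basic block on port 135: WMI then moves to dynamically assigned ports, which this does not test.

```python
import socket


def tcp_port_open(host: str, port: int = 135, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper for a quick reachability check. It says nothing
    about WMI permissions or the dynamic RPC ports used after port 135.
    """
    try:
        # create_connection resolves the name and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False
```

Calling `tcp_port_open("servername")` with the name of one of the problem servers returns `False` if the port is blocked, refused, or the name does not resolve.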
SolarWinds Orion Core 2011.1.0 SP1, APM 4.0.2 SP3, NPM 10.1.2 SP1, IVIM 1.1.1