Events for Application Monitor

Hi

It seems to be taking a really long time to get a "steady" NPM and APM system in place whose output we can actually trust, and I am getting to the point of giving up!!

I make sure I have the latest versions, SPs etc. installed, and that the recently released templates have been applied (again) to each server/node.

But all day long I just seem to get events showing so many problems. Or are they really problems?

Since applying the new Windows 2003-2008 templates (it was a lot worse before):

A) Servers show "%Processor Time for xxxx server = down". This happens throughout the day and night.

When I check on the actual server, it might be that CPU peaked for a second and this event was logged. This goes on all day on many different application servers, as you would expect, since they use CPU while the app is being used.

I can't find anywhere in NPM or APM to change event thresholds so that an event is only raised when the CPU condition lasts for, say, 20 or 30 seconds.

B) Servers show Pages/Sec as critical. Again, when I check the server, the counters in Performance Monitor only show "blips", and there is no way of changing the duration thresholds.

C) I have a few servers where I cannot for the life of me get application templates to show anything, as they can't connect to WMI using a WMI or RPC connection.

I have followed all the docs on Thwack about troubleshooting and connecting remotely to WMI, changed the rights/permissions in security, tried different credentials as well as the ones set in Orion, rebuilt WMI on the servers in question, rebooted, etc. The only thing left is rebuilding the servers, which is not possible as they are live production. I am lost as to what to try next.
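One basic thing that can be ruled out from the Orion server before going further: remote WMI runs over DCOM, which first contacts the RPC endpoint mapper on TCP port 135 and then moves to dynamically assigned ports. The sketch below only tests whether a TCP connection to port 135 can be opened. The function name and the hostname in the usage note are made up for illustration, and a successful check does not prove WMI itself works (the dynamic ports, DCOM permissions or credentials can still be the problem).

```python
import socket


def rpc_port_reachable(host: str, port: int = 135, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Port 135 is the RPC endpoint mapper, the first hop for remote WMI/DCOM.
    This checks basic reachability only, not WMI permissions or credentials.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For example, `rpc_port_reachable("server01.example.local")` (a hypothetical hostname) returning False from the Orion server would point at a firewall or routing issue rather than at WMI itself.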

D) I have domain controllers whose status just randomly becomes "unknown", and 5 minutes later they are "up" again. Nothing changes on the network or servers and nothing is bottlenecked, but I just can't see why this happens, and all the searching on this forum hasn't brought any answers.

E) I also have servers where the Network Connections service, Remote Registry and DTC all "drop out" for no reason. I remote desktop onto the servers in question and go into Services, and they are fine; 2 minutes later NPM/APM reports them as fine. This can happen once or twice a day, or sometimes not even once a day, and on different servers all the time.

Our network team has looked at the network using traffic management, and I have checked the loads on the servers, and all seems to be OK.

Is anyone else getting this type of problem with their install, or does anyone have any suggestions? Installed at the moment is:

SolarWinds Orion Core 2011.1.0 SP1, APM 4.0.2 SP3, NPM 10.1.2 SP1, IVIM 1.1.1



  • D) I have domain controllers whose status just randomly becomes "unknown", and 5 minutes later they are "up" again. Nothing changes on the network or servers and nothing is bottlenecked, but I just can't see why this happens, and all the searching on this forum hasn't brought any answers.



    If it's a Windows 2003 domain controller, install the WMI hotfix from Microsoft KB 941084.

  • I can't find anywhere in NPM or APM to change Event thresholds to say only do this when CPU event lasts for like 20 or 30 seconds....

    Edit the alert and look on the 'Trigger Condition' tab; at the bottom is a box "Do not trigger this action until condition exists for more than X seconds, minutes, or hours." Is that what you were looking for?

    E) I also have servers where the Network Connections service, Remote Registry and DTC all "drop out" for no reason. I remote desktop onto the servers in question and go into Services, and they are fine; 2 minutes later NPM/APM reports them as fine. This can happen once or twice a day, or sometimes not even once a day, and on different servers all the time.

    I stopped checking the Network Connections service because I would see it going down on servers for no reason. You also mention the Remote Registry service; I just ran across this problem, and if this is happening on your SQL servers, look at KB2159286.
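To illustrate what the "condition exists for more than X" setting mentioned above does to one-second CPU blips, here is a rough sketch of that kind of sustained-threshold logic. It is not how Orion implements it; the function name, the sample data and the 10-second polling interval are all assumptions made for the example, and the sketch fires once a breach has lasted at least the given duration.

```python
from typing import Iterable, List, Tuple


def sustained_breaches(
    samples: Iterable[Tuple[float, float]],
    threshold: float,
    min_duration: float,
) -> List[float]:
    """Return the timestamps at which an alert would fire.

    samples is a time-ordered iterable of (timestamp_seconds, value).
    An alert fires once per breach, at the first sample where the value
    has stayed above threshold for at least min_duration seconds.
    """
    alerts: List[float] = []
    breach_start = None  # timestamp of the first sample in the current breach
    fired = False        # whether the current breach has already alerted
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts
                fired = False
            if not fired and ts - breach_start >= min_duration:
                alerts.append(ts)
                fired = True
        else:
            # Value dropped back under the threshold, so the breach is over.
            breach_start = None
            fired = False
    return alerts


# Hypothetical CPU readings polled every 10 seconds: one short blip at t=10,
# then a breach that lasts from t=30 to t=60.
cpu = [(0, 20), (10, 95), (20, 30), (30, 92), (40, 96), (50, 91), (60, 94), (70, 40)]
print(sustained_breaches(cpu, threshold=90, min_duration=30))  # [60]
print(sustained_breaches(cpu, threshold=90, min_duration=0))   # [10, 30]
```

With no minimum duration both the blip and the real breach raise an event; with a 30-second minimum only the sustained breach does.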

  • I must admit it's mainly Windows 2008 R2 servers, with a few Windows 2008, on the site, so the link probably won't work in this case :-(

    I did start taking Network Connections out, but I have found that if I set the service to autostart on each server it becomes less of an issue on its own.

    But I still get the overall Unknown status from the services on DCs or normal servers.

    Interestingly, our network guy mentioned that 9Gb of data was transmitted to our Orion server over an MPLS link in the last 10 days from 1 server that was a DC running Windows 2008 R2. If this is the type of load on the network throughout the day for 100-odd servers, that's quite a lot!
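To put that figure in perspective, here is a quick back-of-the-envelope calculation. It assumes "9Gb" means 9 gigabytes (decimal) rather than gigabits, that the traffic is spread evenly over the 10 days, and that all 100 servers generate the same amount as that one DC. None of that is confirmed, so treat the result as a rough estimate only.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate_bits_per_second(total_bytes: float, days: float) -> float:
    """Average transfer rate in bits per second over the given number of days."""
    return total_bytes * 8 / (days * SECONDS_PER_DAY)


# Assumption: "9Gb" is 9 gigabytes, decimal (9 * 10**9 bytes).
one_server_bytes = 9 * 10**9
per_server_bps = average_rate_bits_per_second(one_server_bytes, days=10)

# Assumption: every one of the 100 servers generates the same load.
fleet_bps = per_server_bps * 100

print(f"{per_server_bps / 1000:.1f} kbit/s per server")   # 83.3 kbit/s per server
print(f"{fleet_bps / 1e6:.2f} Mbit/s for 100 servers")    # 8.33 Mbit/s for 100 servers
```

Under those assumptions that is about 0.9 GB per server per day, or roughly 90 GB per day across 100 servers.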