SolarWinds Environment issues and false alarms

Hey everyone! I am very new to SolarWinds, and our current enterprise SolarWinds environment is very unstable and giving false alarms. I would like to know what we are doing wrong, and whether rebuilding our environment would be a good solution for us. We are also looking to set up high availability, disaster recovery, and a lab/test environment in Q1 next year. We are having resource issues all the time with our additional polling engine and our main poller/Orion server. Any ideas would be helpful! Thank you! Here is a list of our current products and an overview of our setup:

Orion Platform 2017.3

WPM 2.2.1

IPAM 4.5.2

VNQM 4.4.1

NCM 7.7

CloudMonitoring 1.0.0

NPM 12.2

DPAIM 11.0.0

QoE 2.4

NTA 4.2.3

VIM 8.0.0

UDT 3.3.0

SAM 6.4.0

Toolset 11.0.6

NetPath 1.1.2

Screen Shot 2017-11-02 at 1.05.44 PM.png

screencapture-ussl-swnpm1-usanainc-Orion-Admin-Details-Engines-aspx-1510856880417.png

  • 1. What sort of false alarms?

    2. Why is it unstable? Is it resources, and if so, what sort of resource issues? If it has to do with resources on the SolarWinds servers, you will need to define a regular health check for those servers (disk space, memory, etc.).

    3. Always try to remove unwanted nodes, interfaces, volumes, monitors, applications, etc. that no longer require monitoring, as they add load to the tool.

    4. Keep a healthy DB, and check the DB response time between your SolarWinds servers and the SolarWinds DB.

    5. Network latency: I can't help you much in this area, but keep an eye on it as well if possible.

    6. Also revisit your alert definitions, including how often each alert condition is evaluated, and refine them if possible. Do you really need a 1-minute check for all your alerts?

    7. Likewise for polling intervals: if you are using one generic polling interval for all devices in your environment, revisit it. Do you really need 5-minute polling on all nodes, monitors, etc.?

    Hope this helps to kick things off.
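
To put rough numbers on points 6 and 7, here is a minimal back-of-the-envelope sketch of how the polling interval scales the work a poller has to do. The element count of 10,000 and the function name `polls_per_hour` are made up for illustration; this is plain arithmetic, not a model of how Orion actually schedules its jobs.

```python
def polls_per_hour(element_count: int, interval_seconds: int) -> float:
    """Poll operations per hour when every element is polled at one fixed interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return element_count * 3600 / interval_seconds


# Hypothetical environment size; substitute your own element count.
elements = 10_000

for interval in (60, 300, 600):
    rate = polls_per_hour(elements, interval)
    print(f"{interval:>4}s interval: {rate:,.0f} polls/hour")
```

Under these assumptions, moving from a 60-second to a 300-second interval cuts the poll volume by a factor of five, which is why it is worth keeping short intervals only for the devices that really need them.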

  • First of all, thank you so much for your quick reply with these great questions!

    1. What sort of false alarms?

    Other departments are reporting false positives for nodes being down, which trigger our PagerDuty. WPM says that some logins are taking over 15 seconds when in reality they probably take under 2 seconds. Orion always seems to say that new UDT jobs are still pulling. It's always little things being weird all over the place.

    2. Why is it unstable? Is it resources, and if so, what sort of resource issues? If it has to do with resources on the SolarWinds servers, you will need to define a regular health check for those servers (disk space, memory, etc.).

    Lately we have been seeing very high CPU usage on our servers, which even now have a ridiculous amount of CPU (10 cores @ 2.3 GHz). Memory seems fine at 50% (20 GB). But CPU usage still gets pegged at 100% on our main SolarWinds server, causing a crash after a while.

    3. Always try to remove unwanted nodes, interfaces, volumes, monitors, applications, etc. that no longer require monitoring, as they add load to the tool.

    We are always working to remove unwanted or unused nodes, applications, agents, and such, thanks to a previous tip from technical support on this issue.

    4. Keep a healthy DB, and check the DB response time between your SolarWinds servers and the SolarWinds DB.

    Our DBA team says they are not seeing any issues or latency, which is why they are the team most upset about our situation.

    5. Network latency: I can't help you much in this area, but keep an eye on it as well if possible.

    We have been monitoring our network and have not seen any latency since our company-wide network hardware refresh earlier this year.

    6. Also revisit your alert definitions, including how often each alert condition is evaluated, and refine them if possible. Do you really need a 1-minute check for all your alerts?

    I will definitely be revisiting all our settings, since we are still using all the defaults, as recommended by the third-party engineer we hired to set it up originally.

    7. Likewise for polling intervals: if you are using one generic polling interval for all devices in your environment, revisit it. Do you really need 5-minute polling on all nodes, monitors, etc.?

    Thank you so much for all these ideas! I will have to get all our Admins, Network Engineers, DBAs, and Architects together for a think tank.

  • Addressing your false alarms:

    Nodes: Are you monitoring response/status by ICMP or SNMP? Use List Resources to check this option; if it is SNMP, change it to ICMP. It will respond better.

    If using ICMP, build timing into your node-down alert: the node must be down for 30 seconds, 1 minute, etc. before the alert fires. If you need to catch quick reboots, you can use a separate alert for that. A down/status change event (even to a warning) should still show in your events.

    Transactions: Increase your threshold timing. There is a lot at play with WPM, and a service spike on a web server you are running a transaction against can increase your playback time more than it would actually affect your users' experience. Find stability by increasing those thresholds to start, then slowly bring them back down. If you have steps failing, you may need to build a wait into your steps at certain points to allow page modules or resources to load.

    Also, dynamic resources on the page can be problematic. If your screenshot does not show properly, try a wait. If you still have issues, try using static links on the page, or tab to the resource and press Enter instead of using the mouse. Image Verification and Text Verification can be very useful in these cases as well.

    Good Luck!
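
The "must be down for 30 seconds, 1 minute, etc." idea above can be sketched in a few lines. This is only an illustration of the logic, assuming a stream of timestamped up/down samples; the class name `DownDebouncer` and its fields are hypothetical and are not part of any SolarWinds API. In Orion itself this would be configured in the alert's trigger condition rather than written as code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DownDebouncer:
    """Fire a node-down alert only after the node has stayed down for hold_seconds."""

    hold_seconds: float
    down_since: Optional[float] = None
    alerted: bool = False

    def observe(self, is_up: bool, now: float) -> bool:
        """Record one status sample; return True only when the alert should fire."""
        if is_up:
            # Any successful response resets the timer and re-arms the alert.
            self.down_since = None
            self.alerted = False
            return False
        if self.down_since is None:
            self.down_since = now
        if not self.alerted and now - self.down_since >= self.hold_seconds:
            self.alerted = True
            return True
        return False


# Hypothetical samples as (seconds, is_up): a 10-second blip, then a real outage.
samples = [(0, True), (10, False), (20, True), (30, False),
           (60, False), (90, False), (100, False)]

debouncer = DownDebouncer(hold_seconds=60)
fired = [t for t, is_up in samples if debouncer.observe(is_up, t)]
print(fired)
```

With these samples the short blip at 10 seconds never pages anyone, and the sustained outage that starts at 30 seconds fires once, after it has lasted a full minute.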

  • I would also add that your pollers (if not regional or dedicated to certain devices) should be load balanced.

    Even out the element count, and mainly watch the Job Weight (a comparative value showing how hard your servers are working) and see if you can bring those closer together between the two APEs.

    You are doing well by keeping a light load on the primary engine.
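
As a rough planning aid for evening out the load, here is a minimal sketch of a greedy split: take the heaviest nodes first and hand each one to whichever poller currently carries the least. The node names, the per-node weights, and the poller labels `APE1`/`APE2` are all invented for illustration; the weights stand in for element counts, not for Orion's actual Job Weight, and the result is a suggestion to review, not something Orion does for you.

```python
import heapq
from typing import Dict, List


def balance(node_weights: Dict[str, int], pollers: List[str]) -> Dict[str, List[str]]:
    """Greedily assign nodes (heaviest first) to the currently lightest poller."""
    heap = [(0, poller) for poller in pollers]
    heapq.heapify(heap)
    assignment: Dict[str, List[str]] = {poller: [] for poller in pollers}
    # Sort by descending weight, then by name so the result is deterministic.
    for node, weight in sorted(node_weights.items(), key=lambda kv: (-kv[1], kv[0])):
        load, poller = heapq.heappop(heap)
        assignment[poller].append(node)
        heapq.heappush(heap, (load + weight, poller))
    return assignment


# Hypothetical nodes with element counts as a stand-in for load.
nodes = {"core-sw1": 400, "core-sw2": 350, "fw1": 200, "app1": 120, "app2": 80}

plan = balance(nodes, ["APE1", "APE2"])
loads = {poller: sum(nodes[n] for n in assigned) for poller, assigned in plan.items()}
print(plan)
print(loads)
```

Regional pollers or pollers dedicated to specific devices would be left out of the `pollers` list, since those assignments are fixed for other reasons.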