Alerting.

This isn't as dumb as my other posts recently, I can promise you.

I am changing the monitoring infrastructure for the company I work for. We monitor about 400 servers, and I have run a report that gives me their CDM values (CPU, disk usage and memory). I can change the CDM values for every server, but it has just occurred to me that the alerting might be a problem. Some CPUs run at high capacity all the time (e.g. some run hospital imaging software that keeps them at 95-97%), others are idle for most of their life (peaking at about 5-10%), and the rest fall everywhere in between. Because of this, am I going to have to write alerts for every server individually, or for their respective smaller groups? I can't leave the standard CPU/memory alert as it is, because it is set to the generic Orion threshold levels: it would either fire a lot more than it currently does or simply never fire at all.

Does anyone have any ideas for a quick fix solution?  I feel though that I don't have one and will have to do it the long and difficult way.

  • Set SolarWinds NPM thresholds

    Alternatively, you could set a custom property, CPU_threshold (name it whatever you like), and give it a value of, for example, 98. Configure the alert as CPU utilization > custom property CPU_threshold. You would have to set that property value on every node where it mattered. Perhaps you could make a group (hospital imaging software) and leverage that. Reference that in your alert. You may find that there is benefit in having the group for other purposes in the future--a summary page of their own, or showing them together on a summary page.

    Depending on whether you want the entire alert experience (alert action, acknowledge, etc.), you could make a view or report instead. Do you intend for emails to be sent or any other actions to occur, or is being visible on a view/report acceptable?
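
The per-node custom property idea above can be sketched in a few lines of Python. This is only an illustration of the comparison logic, not SolarWinds code: the `Node` class, the `cpu_threshold` field (standing in for the CPU_threshold custom property), the 90% fallback and the sample values are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed fallback for nodes without a custom value; not an Orion default.
DEFAULT_CPU_THRESHOLD = 90.0


@dataclass
class Node:
    name: str
    cpu_load: float                        # current CPU utilization in percent
    cpu_threshold: Optional[float] = None  # stands in for the custom property


def should_alert(node: Node, default: float = DEFAULT_CPU_THRESHOLD) -> bool:
    """Alert when CPU load exceeds the node's own threshold, else the default."""
    limit = node.cpu_threshold if node.cpu_threshold is not None else default
    return node.cpu_load > limit


# Hypothetical sample nodes.
nodes = [
    Node("imaging-01", cpu_load=96.0, cpu_threshold=98.0),  # busy by design
    Node("idle-01", cpu_load=40.0, cpu_threshold=25.0),     # normally idle
    Node("generic-01", cpu_load=50.0),                      # uses the default
]

alerting = [n.name for n in nodes if should_alert(n)]
print(alerting)
```

With these sample values only the normally idle node is flagged: the imaging server at 96% stays under its own limit of 98, while 40% on a box that usually peaks far lower is treated as unusual.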

  • I would recommend grouping them by server function. As an example, some server instances run at a very high "default" CPU (such as Exchange, some DB servers, etc) as they intentionally hog the CPU via the store.exe or whatever (depending on the server). If you take a look at server function, this may give you a better approach to setting your thresholds.

  • At present the alerts are sent to us and the customer, and then we either resolve the issue if it's customer impacting (during out of hours and it's systems critical) or we simply wait for it to resolve itself. E.g. an SQL job will run, which will indicate that the service is up. Once the job is finished we will receive an email saying the service is now down.

    Eventually I'd like us to have very little to do, so that SolarWinds simply does a lot of stuff itself and we are disturbed a lot less. Plus a lot fewer emails. I feel the email thing might always have to be like that, but we will see.

    I'd love to set new thresholds, but at the moment my boss doesn't want me to touch them for some reason.

  • I think it might be your job to explain to your boss why you might need to set new thresholds.

    I feel your pain though, I've been there.

  • I would do a combination of the suggestions. Add a custom property for server type (as in, what is the server's generic purpose), and then if you have critical and non-critical servers in the same type I would add another custom property like Response Tier (and set each device to whatever classifications for response you have). Then going through and modifying the node thresholds is really the only way to get better alerting without a large mass of alerts that will just cause a lot more work and will be more error prone. Since the thresholds don't affect the servers at all, I can't see why there would be any issue changing the thresholds and then basing your alerts off the threshold reached field.

    If changing thresholds is completely out of the question, you should be able to create alerts based on a combination of the custom properties.
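
As a rough sketch of how the two custom properties could combine, the Python below picks the CPU limit from the server type and the response from the tier. Everything in it is hypothetical: the type names, tier names, percentages and actions are placeholders for whatever classifications you actually use, and none of it is SolarWinds configuration.

```python
from typing import Optional

# Hypothetical CPU limits per server type (percent).
THRESHOLD_BY_TYPE = {
    "imaging": 98.0,
    "database": 95.0,
    "file": 80.0,
}

# Hypothetical responses per response tier.
ACTION_BY_TIER = {
    "critical": "page on-call",
    "standard": "email",
    "low": "dashboard only",
}

FALLBACK_THRESHOLD = 90.0  # assumed limit for types not listed above
FALLBACK_ACTION = "email"  # assumed response for tiers not listed above


def evaluate(server_type: str, response_tier: str, cpu_load: float) -> Optional[str]:
    """Return the action to take, or None if the load is within the limit."""
    limit = THRESHOLD_BY_TYPE.get(server_type, FALLBACK_THRESHOLD)
    if cpu_load <= limit:
        return None
    return ACTION_BY_TIER.get(response_tier, FALLBACK_ACTION)


print(evaluate("imaging", "critical", 99.0))  # over the imaging limit
print(evaluate("imaging", "critical", 96.0))  # normal for an imaging server
print(evaluate("file", "low", 85.0))          # over the limit, low tier
```

Keeping the limit and the response in separate tables means a server can move between tiers without anyone touching the CPU values, which is roughly the benefit of using two properties rather than one.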

  • Have a dig through the various advanced Node variables and you'll find "Critical Value Reached (CPU Load Threshold)" and "Threshold Name (CPU Load Threshold)".

    The Critical Value Reached (CPU Load Threshold) condition is triggered when the node's critical CPU threshold (as defined on the node's "Edit Node" page) is reached.

    There is no need for additional custom properties when these already exist.

  • I'm once again going to sound stupid. I have a year's worth of data in terms of peak CPU and peak memory usage. If some CPUs never even reach 25% usage, would I still need to set them to 90/98%? Or can I set these values to what I feel would be necessary for the VM?

    Once again thank you all for your assistance.
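
For what it's worth, one possible way to turn a year of peak readings into starting values is to add some headroom to each server's observed peak. The Python below is a sketch under that assumption only: the 10 and 20 point headroom figures and the caps are arbitrary picks for illustration, not SolarWinds guidance, and the results would still need a sanity check per server.

```python
from typing import Tuple


def suggest_thresholds(
    peak_cpu: float,
    warn_headroom: float = 10.0,  # assumed margin above the peak (percentage points)
    crit_headroom: float = 20.0,  # assumed margin above the peak (percentage points)
) -> Tuple[float, float]:
    """Suggest (warning, critical) CPU thresholds from an observed yearly peak."""
    warning = min(peak_cpu + warn_headroom, 99.0)
    critical = min(peak_cpu + crit_headroom, 100.0)
    return warning, critical


# Hypothetical peaks taken from a year of history.
print(suggest_thresholds(25.0))  # a mostly idle VM
print(suggest_thresholds(97.0))  # an imaging server that always runs hot
```

For a server that already sits at 95-97%, the suggested values end up pinned at the caps, which is a hint that a CPU threshold carries little signal there and another metric may be more useful for that group.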

  • Shaun,

    If you use the alert definition that I supplied, it uses each individual host's CPU thresholds.

    I have uploaded my actual trigger condition XML, if it will help: CPU_critical_threshold_exceeded_condition.xml

    You can then import it into your alert trigger definition (look down at the bottom of the page).

    Then just use the individual nodes' property pages to define their unique thresholds.

    I hope it helps,

    Rich

  • That pretty much sums up what I need. From talking (er, reading?) to people on here I managed to find my way around.

    I find SolarWinds very simplistic, yet utterly frustrating to use. It could be that I have zero experience with monitoring solutions, that I'm very thick, or a combination of the two.

    Thank you all for your advice and information.