This discussion has been locked. The information referenced herein may be inaccurate due to age, software updates, or external references.

Questions about Alerts

respec over 3 years ago

I'm pretty new to Solarwinds Orion. I have the SAM and NPM modules and we are trying to get alerting set up. Our needs are pretty simple: just tell us when a server or network device is down or has a problem, and email us. I am aware of the concepts of alerting as I've used other monitoring software, but Orion monitoring has me confused. I have a few questions.

When setting up a new node in SAM, you can specify thresholds for CPU, RAM, response time and packet loss. Are these supposed to generate alerts when the threshold is reached? How are these related to alerting if at all?
I have site groups that I've set up. I set up each site group to depend on a router for that site. We recently had an outage of a router (power issue) and I got alerts for every object being monitored at that site, including ping latency and ping loss. Am I missing something here, or was the group supposed to not alert if the main dependency went down? I double-checked and it seems to be set up correctly.
I haven't changed any of the default alerts, but I am getting a lot of generic alerts for things. For example, I see a few "Node is in warning or critical state", but I have no idea what that is as there is no information in the Alerts page. This isn't very useful as it doesn't provide any information, especially for an email notification. Is there anything I can do about this?
Is there any way to set alerts for only a group of devices without manually adding the devices to a group? For instance, I want ping latency monitoring for all my network devices and some servers we have. I don't want to include everything. The list of network devices may change and I don't want to have to update the group every time we replace a device. I looked into dynamic groups, but didn't see a way to tell it to only monitor certain devices.
I seem to keep getting alerts for devices where we've already acknowledged there is a problem. Can I tell Solarwinds to not alert again for that device? Also, is there a way I can tell Solarwinds not to alert on that device if it is going up and down? One notice that it is not working right is enough.

I've looked through documentation and watched some of the labs videos, but they don't explain a whole lot and the Alerting in Solarwinds is confusing coming from something else.

ryan88 over 3 years ago

Hi,
I will try to answer your questions in order:
1: Yes. These each have their own warning and critical thresholds, which can be overridden by editing each node or by changing the global settings through the All Settings menu. I tend to build two alerts for memory, CPU, etc.: one with a trigger condition where the warning threshold breach = true, and another with the same for critical. You will see these on the parameters page when building the alerts.
2: If you have built the dependency rules correctly, it will set the devices in the group into an unreachable state. If your alerts for the other systems have a trigger condition which doesn't exclude unreachable, it is logical that they would trigger when the state changes. I would check your alert to make sure that it isn't set to 'where node status is not up' or something similar.
3: The warning or critical state is the health rollup of the overall node, which would suggest that the warning or critical thresholds for response time, CPU, or memory have been breached. This URL explains it:
https://documentation.solarwinds.com/en/Success_Center/orionplatform/Content/Core-Status-Rollup-Mode-sw2114.htm
I tend to turn off all out-of-the-box alerts and enable one only if it is suitable, or copy it and edit it to my needs.
4: Yes. In the trigger conditions you can change the scope at the very top to point to your group name. This ensures that the alert only evaluates against items within that specific group.
5: If you acknowledge an alert for a specific device in the console, it will sit there until it has reset. I would check the reset conditions for the alerts to make sure the node isn't evaluating as healthy and the state flipping. As for not alerting, if you go to the Manage Nodes page you can mute alerts or unmanage (stop polling). This will suppress all alerts from a specific device. You can also set a SAM template to unmanaged through the SAM settings page to stop that specific application from polling or alerting.
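
To illustrate point 2, here is a minimal Python sketch of why a trigger condition like 'node status is not up' still fires for nodes that a dependency has marked unreachable. It is only a model of the logic, not Orion's API: the `Status` enum, the node names, and both trigger functions are hypothetical names made up for the example.

```python
from enum import Enum


class Status(Enum):
    # Hypothetical status values for illustration; not Orion's internal codes.
    UP = "Up"
    DOWN = "Down"
    UNREACHABLE = "Unreachable"


def trigger_not_up(status: Status) -> bool:
    """Broad trigger: fires for anything that is not Up, including Unreachable."""
    return status is not Status.UP


def trigger_down_only(status: Status) -> bool:
    """Narrow trigger: fires only for Down, so Unreachable children stay quiet."""
    return status is Status.DOWN


# Assumed scenario: the site router loses power, and the dependency marks
# everything behind it as Unreachable instead of Down.
site = {
    "site-router": Status.DOWN,
    "switch-01": Status.UNREACHABLE,
    "server-01": Status.UNREACHABLE,
}

# Names of the nodes each trigger condition would alert on.
broad = sorted(name for name, s in site.items() if trigger_not_up(s))
narrow = sorted(name for name, s in site.items() if trigger_down_only(s))

print(broad)   # every node at the site
print(narrow)  # only the router that actually failed
```

With the broad condition every object at the site alerts, which matches what happened during the router outage. With the narrow condition only the router itself does.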
respec over 3 years ago in reply to ryan88

Thank you for the answers.
One of the issues I see now is that an alert is showing on the Node Management page that a node has a problem. When I look at that particular node it is reporting that a threshold has been reached. Unfortunately, there is no indication of which threshold or what the value is.
ryan88 over 3 years ago in reply to respec

It will normally mention on the node management page, under the status column, which threshold has been breached. If you can't see it there, I believe you can click on that node and then drill down into it.
Then on that page you can check the values being captured manually.
Or you can click on the 'vital stats' page on the left-hand side. This shows you every stat being captured, including the response times etc.
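
As a rough illustration of what checking the values manually amounts to, the Python sketch below compares each captured statistic against its warning and critical thresholds and reports which ones are breached. The metric names, the threshold numbers, and the `breached` helper are all assumptions made up for the example; they are not Orion defaults or an Orion API.

```python
def breached(stats, thresholds):
    """Return {metric: "critical" | "warning"} for every breached threshold."""
    result = {}
    for metric, value in stats.items():
        warning, critical = thresholds[metric]
        if value >= critical:
            result[metric] = "critical"
        elif value >= warning:
            result[metric] = "warning"
    return result


# Hypothetical thresholds as (warning, critical) pairs; not Orion defaults.
thresholds = {
    "cpu_percent": (80, 90),
    "memory_percent": (80, 90),
    "response_time_ms": (500, 1000),
    "packet_loss_percent": (30, 50),
}

# Hypothetical values as they might be read off a node's statistics page.
stats = {
    "cpu_percent": 35,
    "memory_percent": 92,
    "response_time_ms": 620,
    "packet_loss_percent": 0,
}

report = breached(stats, thresholds)
print(report)
```

In this made-up case the node would roll up to a critical state because of memory, with response time also in warning, even though CPU and packet loss are fine.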