
Orion Platform 2019.2 - Enhanced Node Status

Product Manager

Status is arguably one of the most important aspects of any monitoring solution. It's a key component for visually notifying you that something is amiss in your environment, as well as being an important aid in the troubleshooting process. When used properly, status is also the engine that powers alerting, making it an absolutely essential ingredient for both proactive and reactive notifications aimed at ensuring your entire IT environment runs smoothly.

Orion® Node Status, in particular, has long been somewhat unique compared to other entities in the Orion Platform. Most other entities have a fairly simple, straightforward, and easy-to-understand hierarchy of status based upon severity. These include things like Up, Warning, Critical, and Down, but can also include other statuses that denote an absence of state, such as Unknown, Unmanaged, etc. By comparison, a node managed in the Orion Platform today can have any of twenty-two unique statuses. Some of these statuses can, to the uninitiated, appear at best contradictory and at worst downright confusing.

This is the result of separating information about the node itself from its associated child objects (like interfaces and applications) into multiple colored balls. The larger ball represents the reachability of the node, usually via ICMP, while the much smaller ball in the bottom-right corner represents the worst state of any of the node's child objects.

Primary Node Status

Nodes With Child Status

pastedImage_6.pngpastedImage_0.png

It would be fair to say that this is neither obvious nor intuitive, so in this release we've sought to radically improve how node status is calculated and represented within the Orion Platform.

Node Thresholds

The first thing people usually notice after adding a few nodes to the Orion Platform is that node thresholds for things like CPU and memory utilization appear to have no effect on the overall status of the node, and they'd be right. Those thresholds can be used to define your alerts, but node status itself has historically represented only the reachability of the node. That, unfortunately, complicates troubleshooting by obfuscating legitimate issues and adding unnecessary confusion. For example, in the image below, I'm often asked why the node is "green" when the CPU load and memory utilization are obviously critical. A very fair and legitimate question.

With the release of Orion Platform 2019.2 comes the introduction of Enhanced Node Status. With Enhanced Node Status, thresholds defined either globally or on an individual node can now impact the overall status of the node. For example, if the memory utilization on a node is at 99% and your "Critical" threshold for that node is "greater than 90%," the node status will now reflect the appropriate "Critical" status. This should allow you to spot issues quickly without having to hunt for them in mouse hovers or by drilling into Node Details views.
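
To make the calculation concrete, here is a minimal sketch in Python of how a single polled value might map to a per-metric status. The function and status names are purely illustrative, not the platform's internals:

    def metric_status(value, warning=None, critical=None):
        # Map a polled metric value to a per-metric status against optional
        # warning/critical thresholds. Illustrative only; names and status
        # strings are hypothetical, not the Orion Platform's implementation.
        if critical is not None and value > critical:
            return "Critical"
        if warning is not None and value > warning:
            return "Warning"
        return "Up"

    # Memory at 99% with a "greater than 90%" critical threshold -> Critical
    print(metric_status(99, warning=80, critical=90))   # Critical
    print(metric_status(85, warning=80, critical=90))   # Warning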

CPU Load

Memory Utilization

pastedImage_4.pngpastedImage_5.png

Response Time

Packet Loss

pastedImage_7.pngpastedImage_8.png

Sustained Thresholds

Borrowing heavily from Server & Application Monitor, Orion Platform 2019.2 now includes support for sustained node threshold conditions. Being notified of every little thing that goes bump in the night can desensitize you to your alerts, potentially causing you to miss important service-impacting events. For alerts to be valuable, they should be actionable. A CPU spiking to 100% for a single poll probably doesn't mean you need to jump out of bed in the middle of the night and VPN into the office to fix something. After all, it's not that unusual for a CPU to spike temporarily, or for latency to vary from time to time over a transatlantic site-to-site VPN tunnel.

What you probably want to be notified of instead is if that CPU utilization remains higher than 80% for more than five consecutive polls, or if the latency across that site-to-site VPN tunnel remains greater than 300ms for 8 out of 10 polls. Those are likely more indicative of a legitimate issue occurring in the environment that requires some form of intervention to correct.

pastedImage_3.png

Sustained Thresholds can be applied to any node's existing CPU Load, Memory Usage, Response Time, or Percent Packet Loss thresholds. You can also mix and match “single poll,” “X consecutive polls,” and “X out of Y polls” between warning and critical thresholds for the same metric for even greater flexibility. Sustained Thresholds can even be used in combination with Dynamic Baselines to eliminate nuisance alerts and further reduce alert fatigue, allowing you to focus only on those alerts which truly matter.
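
As a rough illustration of the two sustained modes, the sketch below evaluates a breach over a window of recent polls. The function names and sample data are hypothetical, not product code:

    def breached_consecutive(history, limit, polls):
        # True if the metric exceeded `limit` on each of the last `polls` polls.
        recent = list(history)[-polls:]
        return len(recent) == polls and all(v > limit for v in recent)

    def breached_x_of_y(history, limit, x, y):
        # True if the metric exceeded `limit` on at least `x` of the last `y` polls.
        recent = list(history)[-y:]
        return len(recent) == y and sum(v > limit for v in recent) >= x

    cpu_history = [82, 85, 91, 88, 84, 90, 79, 87, 83, 86]        # last ten CPU samples (%)
    print(breached_consecutive(cpu_history, limit=80, polls=5))   # False: the 79 breaks the run
    print(breached_x_of_y(cpu_history, limit=80, x=8, y=10))      # True: 9 of the last 10 exceed 80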

Null Thresholds

A point of contention for some users has been the requirement that all node thresholds contain a value. There are nodes you may still want to monitor, report, and trend on, but not necessarily be alerted on, such as staging environments, machines running in a lab, decommissioned servers, etc.

Historically, there has been no way to say, "I don't care about thresholds on this node" or "I don't care about this particular metric." At best, you could set the warning and critical thresholds as high as possible in the hope of coming close to eliminating alerts for metrics on those nodes you don't particularly care about. Alternatively, some customers update and maintain their alert definitions to exclude metrics on those nodes they don't want to be alerted on. A fairly messy, but effective, solution, and one that is no longer necessary.

With the introduction of Enhanced Status in Orion Platform 2019.2, any node threshold can now be disabled simply by editing the node and unchecking the box next to the warning or critical threshold of the metric you're not interested in. Don't want a node to ever go into a "Critical" state as a result of high response time, to keep the boss off your back, but still want to receive a warning when things are really bad? No worries: just disable the "Critical" threshold, leave the "Warning" threshold enabled, and adjust the value to what constitutes "really bad" for your environment.
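
In terms of the earlier metric_status sketch, a disabled (unchecked) threshold is simply one that never participates in the evaluation, which we can represent by passing None:

    # Critical disabled; only the warning threshold can ever change the status.
    print(metric_status(450, warning=300, critical=None))   # Warning - never Critical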

pastedImage_5.png

If so inclined, you can even disable these individual warning and critical thresholds globally from [Settings > All Settings > Orion Thresholds] for each individual node metric.

pastedImage_10.png

Child Objects

In this new world of Enhanced Status, there are no longer confusing multi-status icons like "up-down" or "up-warning." Child objects can now influence the overall node status itself by rolling up status in a manner similar to Groups, or to how Server & Application Monitor rolls up the status of the individual component monitors that make up an Application. This provides a simple, consolidated status for the node and its related child entities. Those child objects can be things such as Interfaces, Hardware Health, and Applications monitored on the node, to name only a few.

Similar to Groups, we wanted to provide users with the ability to control how node status rollup was calculated on an individual, per-node basis for ultimate flexibility. When editing the properties of a single node or multiple nodes, you’ll now find a new option for “Status roll-up mode” where you can select from Best, Mixed, or Worst.

pastedImage_1.png

By altering how node status is calculated, you control how child objects influence the overall status of the node.

Best | Mixed | Worst
pastedImage_6.png | pastedImage_7.png | pastedImage_8.png

Best status, as one might guess, always reflects the best status across all entities contributing to the calculation. Setting the Node to “Best” status is essentially the equivalent of how status was calculated in previous releases, sans the tiny child status indicator in the bottom right corner of the status icon.

Worst status, you guessed it, represents the status of the object in the worst state. This can be especially useful for servers, where application status may be the single most important thing to represent for that node. For example, I'm monitoring my Domain Controller with Server & Application Monitor's new AppInsight for Active Directory. If Active Directory is "Critical," then I want the node status for that Domain Controller to reflect a "Critical" state.

Mixed status is essentially a blend of Best and Worst and is the default node status calculation. The following table provides several examples of how Mixed status is calculated.

Polled Status | Child 1 Status | Child 2 Status | Final Node Status
DOWN | ANY | ANY | DOWN
UP | UP | UP | UP
UP or WARNING | UP | WARNING | WARNING
UP or WARNING | UP | CRITICAL | CRITICAL
UP or WARNING | UP | DOWN | WARNING
UP or WARNING | UP | UNREACHABLE | WARNING
UP | UP | UNKNOWN | UP
WARNING | UP | UNKNOWN | WARNING
UP | UP | SHUTDOWN | UP
UP or WARNING | DOWN | WARNING | WARNING
UP or WARNING | DOWN | CRITICAL | CRITICAL
UP or WARNING | DOWN | UNKNOWN | WARNING
UP or WARNING | DOWN | DOWN | WARNING
UP | UNKNOWN | UNKNOWN | UP
WARNING | UNKNOWN | UNKNOWN | WARNING
UNMANAGED | ANY | ANY | UNMANAGED
UNREACHABLE | ANY | ANY | UNREACHABLE
EXTERNAL | ANY | ANY | Group Status

In case you overlooked it in the table above, yes, External Nodes can now reflect an appropriate status based upon applications monitored on those nodes.
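
For readers who think in code, here is a rough Python sketch of the decision procedure the table describes. The status names and rules are paraphrased from the table above, not taken from the product:

    def mixed_rollup(polled, children):
        # Approximate the Mixed roll-up from the table above (illustrative only).
        if polled in ("DOWN", "UNMANAGED", "UNREACHABLE"):
            return polled                                   # polled status wins outright
        # EXTERNAL nodes roll up purely from what is monitored on them (like a
        # group); that case is not sketched here.
        ignored = {"UNKNOWN", "SHUTDOWN"}                   # absence of state never degrades the node
        relevant = [c for c in children if c not in ignored]
        if "CRITICAL" in relevant:
            return "CRITICAL"
        if any(c in ("WARNING", "DOWN", "UNREACHABLE") for c in relevant):
            return "WARNING"                                # down/unreachable children degrade to Warning, not Down
        return polled                                       # UP stays UP, WARNING stays WARNING

    print(mixed_rollup("UP", ["UP", "CRITICAL"]))           # CRITICAL
    print(mixed_rollup("UP", ["UP", "DOWN"]))               # WARNING
    print(mixed_rollup("WARNING", ["UNKNOWN", "UNKNOWN"]))  # WARNING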

Child Object Contributors

Located under [Settings > All Settings > Node Child Status Participation], you now have fine-grained control over up to 27 individual child entity types that can contribute to the overall status of your nodes. Don't want Interfaces contributing to the status of your nodes? No problem! Simply click the slider to the "off" position and Interfaces will no longer influence your nodes' status. It's just that easy.
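
Conceptually, these sliders just decide which child types get fed into the roll-up. A tiny, purely illustrative sketch, reusing the mixed_rollup function from the previous sketch (contributor and entity names are examples only):

    # Hypothetical contributor switches; entity type names are examples only.
    contributors = {"Interfaces": False, "Applications": True, "Hardware Health": True}

    children = [("Interfaces", "DOWN"), ("Applications", "CRITICAL"), ("Hardware Health", "UP")]
    included = [status for entity_type, status in children if contributors.get(entity_type, True)]

    print(mixed_rollup("UP", included))   # CRITICAL - the down interface is ignored, the critical app is not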

pastedImage_1.png

Show me the Money!

You might be asking yourself: all these knobs, dials, and switches are great, but how exactly are they going to make my life better or simpler? A fair question, and one that no doubt has countless correct answers, but I'll try to point out a few of the most obvious examples.

Maps

One of the first places you're likely to notice Enhanced Status is in Orion Maps. The examples below show the exact same environment. The first image shows what this environment looked like in the previous release using Classic Status. Notice the absence of any obvious visual cues denoting issues in the environment. The image to the right shows the very same environment, captured at the exact same time, from a system running Orion Platform 2019.2 with Enhanced Node Status.

In both examples, the exact same issues are occurring in the environment, but those issues were obfuscated in previous releases. This made the troubleshooting process less intuitive and unnecessarily time-consuming. With Enhanced Status, it's now abundantly clear where the issues lie. And with the topology and relationship information from Orion Maps, it's now easier to assess the potential impact those issues are having on the rest of the environment.

Classic Status

Enhanced Status
pastedImage_9.pngpastedImage_8.png

Groups

Groups in the Orion Platform are incredibly powerful, but historically, in order for them to reflect an appropriate status or calculate availability accurately, you were required to add all relevant objects to the group. This meant you not only needed to add the nodes that make up the group, but also all child objects associated with those nodes, such as interfaces, applications, etc.

Even in the smallest of environments, this was nearly impossible to manage manually. Given the variety of entity types that could be associated with those nodes, even Dynamic Groups were of little assistance in this regard. Enhanced Status not only radically simplifies group management, it also empowers users to more easily utilize Dynamic Groups to make group management a completely hands-off experience.

The following demonstrates how Enhanced Node Status simplifies overall group management in the Orion Platform, reducing the total number of objects you need to manage inside those groups. The screenshot on the left shows a group of eight nodes using Enhanced Status, causing the group to reflect a Critical status. The image on the right shows all the objects required to reflect the same status using Classic Status. As you can see, you would need to add not only the same eight nodes but also their 43 associated child objects, for a total of 51 objects in the group. Yikes!

Enhanced Status (8 Objects)

Classic Status (51 Objects)

pastedImage_0.pngpastedImage_0.png

By comparison, the following demonstrates what that group would look like with just the eight nodes included, using both Classic Status and Enhanced Status. Using Classic Status, the group reflects a status of "Up," denoting no issues at all in the group. With Enhanced Status, it's abundantly clear that there are in fact issues, which nodes have them, and their respective severity. This significantly reduces time to resolution and aids in root cause analysis.

Enhanced Status

Classic Status
pastedImage_2.pngpastedImage_1.png

Alerts

Possibly the greatest benefit of Enhanced Status is that far fewer alert definitions are required to be notified of the exact same events. Because node thresholds and child objects now influence the status of the node, you no longer need separate alert definitions for individual node metrics like "Response Time" or related child entities like "Interfaces." In fact, of the alert definitions included out of the box with the Orion Platform, Enhanced Status eliminates the need for at least five, taking you from seven down to a scant two. That's a 71% reduction in the number of alert definitions that need to be managed and maintained.

Out-of-the-box Alerts Using Classic Status - x7

pastedImage_2.png

Out-of-the-box Alerts Using Enhanced Status - x2

pastedImage_3.png
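
As a loose illustration of how little logic is left once status carries the full picture, the same "which nodes are unhealthy right now?" question can also be asked ad hoc against the SolarWinds Information Service with the orionsdk Python client. The host name and credentials below are placeholders, and the assumption that status value 1 means "Up" should be verified against Orion.StatusInfo in your own install:

    import requests
    from orionsdk import SwisClient

    requests.packages.urllib3.disable_warnings()          # lab use only: silence self-signed cert warnings

    swis = SwisClient("orion.example.local", "admin", "password")   # hypothetical host and credentials

    # Nodes whose rolled-up status is anything other than Up.
    # Assumption: status value 1 corresponds to "Up" in this install.
    results = swis.query(
        "SELECT NodeID, Caption, Status FROM Orion.Nodes WHERE Status <> 1 ORDER BY Status DESC"
    )
    for row in results["results"]:
        print(row["NodeID"], row["Caption"], row["Status"])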

Alert Macros

I'm sure at this point many of you are probably shouting at your screen, "But wait! Don't I still need all those alert definitions if I want to know why the node is in whatever given state that it's in when the alert is sent? I mean, getting an alert notification telling me the node is “Critical” is cool and all, but I sorta need to know why."

We would be totally remiss if, in improving node status, we didn't also improve the level of detail we include in alerts for nodes. With the introduction of Enhanced Status come two new alert macros that can be used in your alert actions, such as email notifications, which list all items contributing to the status of that node. The two alert macros are listed below.

The first is intended to be used with simple text-only notification mechanisms, such as SMS, syslog, or SNMP traps. The second macro outputs HTML with hyperlinks to each child object's respective details page. This macro is ideally suited for email or any other alerting mechanism that can properly interpret HTML.

  • ${N=SwisEntity;M=NodeStatusRootCause}
  • ${N=SwisEntity;M=NodeStatusRootCauseWithLinks}
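
As an illustration, the HTML variant can be embedded in the message body of an email alert action along the lines of the snippet below; the surrounding wording is just an example, only the macro itself comes from the platform:

    Subject: Node status change
    Body:
    The node below has changed status. Items currently contributing to its status:
    ${N=SwisEntity;M=NodeStatusRootCauseWithLinks}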

pastedImage_5.png

The resulting output of the macro in the notification includes all relevant information pertaining to the node. This includes any node thresholds that have been crossed, as well as a list of all child objects associated with the node that are in a degraded state, all consolidated into a simple, easily digestible alert notification that pinpoints exactly where to begin troubleshooting.

pastedImage_10.png

Enabling Enhanced Status

If you're installing an Orion product module running Orion Platform 2019.2 or later for the first time, Enhanced Status is enabled by default. No additional steps are required. If you're upgrading from a previous release, however, you will need to enable Enhanced Status manually to appreciate the benefits it provides.

Because status is the primary trigger condition for alerts, we did not want customers who are upgrading to be unexpectedly inundated with alert storms because of how they had configured the trigger conditions of their alert definitions. Instead, we decided to let customers choose for themselves if and when to switch over to Enhanced Status.

The good news is that this is just a simple radio button located under [Settings > All Settings > Polling Settings].

pastedImage_4.png

Conversely, if you decide to rebuild your Orion server and have a preference for Classic status, you can use this same setting to disable Enhanced Status mode on new Orion installations and revert to Classic status.

Cautionary Advice

If you plan to enable Enhanced Status in an existing environment after upgrading to Orion Platform 2019.2 or later, it's recommended that you disable alert actions in the Alert Manager before doing so. This should allow you to identify alerts whose trigger conditions may need tweaking, without inadvertently causing a flood of alert notifications or other alert actions to fire. Your coworkers will thank you later.

pastedImage_11.png

Feedback

Enhanced status represents a fairly significant, but vitally important, change for the Orion Platform. We sincerely hope you enjoy the additional level of customization and reduced management overhead it provides. As with any new feature, we'd love to get your feedback on these improvements. Will you be switching to Enhanced Status with your next upgrade? If not, why? Be sure to let us know in the comments below!

51 Comments

This looks like lots of fun!

MVP

Excellent feature; we enabled it on our server this week and took about a day fine-tuning it.

It would be good to have the Group Root Cause as a popup when hovering over a group with issues, to save clicking into the group.

Level 13

A very welcomed addition!

I think over the last 18 months to 2 years, SolarWinds has made more great steps in amazing directions than they did in the 5 years before that. Please keep this coming.

MVP

This is just unbelievably awesome, thanks aLTeReGo

In fact, believe it or not, I already had most of the above implemented by means of reporting, scripts, workarounds, leveraging SAM components, etc... I am super pleased to see this coming. Thanks guys, and looking forward to our upgrade.

With Gratitude,

Alex

MVP

aLTeReGo​, below are a few more ideas that I currently have implemented in a custom way, but it would be great to have them OOB at some point:

(1)

Node/Interface Down Status to have X polls

- We have a way to define number of polls object needs to be down before firing alert. Very similar to what I see here with CPU and Memory. If this can be extended to Status itself - would be awesome. It is quite common to not want alerts to be sent unless object was down for a certain number of polls. Configuring delay in the alert is NOT an option - it needs to be managed on the node level to allow a single alert to handle all different cases

(2)

Volume threshold must include % and MB check (both!)

- Right now volume threshold overrides are only available for %. However, what I found is that by defining BOTH % and MB it is possible to have a single alert that can handle ALL disks at once - big and small. Thresholds can be set to 95% AND 10GB. Small disks will usually have the MB value breached pretty soon, at around half capacity, but % will pretty much determine the correct time to trigger the alert. Conversely, for big disks it works the other way round - a disk with xxxTB may have the % value breached, but the actual space is still more than sufficient. Therefore - both values must be breached to fire the alert, AND both need to be able to be overridden.

(3)

Interface Error and Discards

- Please split the overrides and global thresholds here. Right now it is only possible to set 2 values (Critical and Warning) covering both of them, not individually. However, Discards and Errors are not the same thing and as such must be treated separately. Besides, Transmit and Receive are often not treated the same either, so a further split is needed. I propose the following global thresholds and overrides:

  • Rx Errors Warning
  • Rx Errors Critical
  • Tx Errors Warning
  • Tx Errors Critical
  • Rx Discards Warning
  • Rx Discards Critical
  • Tx Discards Warning
  • Tx Discards Critical

Yes, it is 8! different values, globally AND with overrides for each interface, not 2 as we have now... With further ability to un-tick individually (as discussed in blog as a new feature) AND ability to set over how many polls as well (same, new highly welcomed approach as per blog above)

(4)

Unknown Status for Objects

- This needs to be further improved. A while ago I developed a set of reports, which I have improved over time (I can share the latest version on request), to identify polling issues. We still heavily rely on those, because so often an object will not give any indication whatsoever that SolarWinds is not able to poll some aspects of it and will remain green, giving us a false negative. The reports are working great and we even trigger alerts for such issues, but it would be great to see the "Unknown" status improved, so that we don't need that additional reporting. A typical example is an SNMP node where the community string has been changed - PING still works, the node is green, but all stats are no longer collected.

...

I may have some more...

Once again - thank you to everyone on the SolarWinds Engineering Team for bringing this up. It is a MASSIVE improvement of the existing workflow.

With Gratitude,

Alex Soul

MVP

aLTeReGo​ ... last question

Why does having UNKNOWN children have no impact on the overall (final) status?

Polled Status | Child 1 Status | Child 2 Status | Final Node Status
UP | UP | UNKNOWN | UP
UP | UNKNOWN | UNKNOWN | UP

Also, back to my point (4) previously - I believe there needs to be a visual for polling issues at the parent level itself. Currently "UNKNOWN" is not even listed under the Polled Status column. Maybe a new status name is needed for such cases, something like "POLLING ISSUES", or "PARTIAL", or "UNKNOWN"... or whatever you deem appropriate to indicate such things. I call them "GREMLINS"... they are sort of healthy-ish, but actually not so.

Product Manager

alexslv  wrote:

aLTeReGo  ... last question

Why does having UNKNOWN children have no impact on the overall (final) status?

Polled Status | Child 1 Status | Child 2 Status | Final Node Status
UP | UP | UNKNOWN | UP
UP | UNKNOWN | UNKNOWN | UP

'Unknown' itself is the absence of status. It's not necessarily indicative of a problem with the monitored node, but rather may be an issue with the configuration of your monitoring solution. In other words, Orion doesn't know what the status is, and therefore we can't assume it's bad. For example, this can occur when credentials used for monitoring have expired. It's unlikely anyone would appreciate being woken up at 3am by an alert telling them there is a serious issue with a mission-critical system, only to find there was nothing wrong with that system at all.

If, as the monitoring engineer, you're concerned about entities going into an 'unknown' state, you can certainly configure alerts to notify you as such. However, the people on the ground responsible for the performance and availability of those systems often only wish to be alerted when there is a legitimate, confirmed issue they can take action on. For those people, alerting on 'unknown' status is often just noise.

alexslv  wrote:

Also, back to my point (4) previously - I believe there needs to be a visual for polling issues at the parent level itself. Currently "UNKNOWN" is not even listed under the Polled Status column. Maybe a new status name is needed for such cases, something like "POLLING ISSUES", or "PARTIAL", or "UNKNOWN"... or whatever you deem appropriate to indicate such things. I call them "GREMLINS"... they are sort of healthy-ish, but actually not so.

Polled status can never be 'Unknown'. It's derived either from the ICMP state that node status in previous releases was based upon, or from an override which excludes status, such as 'unmanaged' or 'external'. So there is no scenario under which the node's polled status could ever be 'Unknown'. That is why it is not included in the table.

Product Manager

alexslv  wrote:

aLTeReGo , below are a few more ideas that I currently have implemented in a custom way, but it would be great to have them OOB at some point:

(1)

Node/Interface Down Status to have X polls

- We have a way to define number of polls object needs to be down before firing alert. Very similar to what I see here with CPU and Memory. If this can be extended to Status itself - would be awesome. It is quite common to not want alerts to be sent unless object was down for a certain number of polls. Configuring delay in the alert is NOT an option - it needs to be managed on the node level to allow a single alert to handle all different cases

This is already possible today under [Settings > All Settings > Polling Settings > 'Calculations & Thresholds' section > Node Warning Level] (the minimum is 10 seconds).

pastedImage_4.png

alexslv  wrote:

(2)

Volume threshold must include % and MB check (both!)

- Right now volume threshold overrides are only available for %. However, what I found is that by defining BOTH % and MB it is possible to have a single alert that can handle ALL disks at once - big and small. Thresholds can be set to 95% AND 10GB. Small disks will usually have the MB value breached pretty soon, at around half capacity, but % will pretty much determine the correct time to trigger the alert. Conversely, for big disks it works the other way round - a disk with xxxTB may have the % value breached, but the actual space is still more than sufficient. Therefore - both values must be breached to fire the alert, AND both need to be able to be overridden.

Improving volume status is something we're currently working on, so this feedback is quite timely.

alexslv  wrote:

(3)

Interface Error and Discards

- Please split the overrides and global thresholds here. Right now it is only possible to set 2 values (Critical and Warning) covering both of them, not individually. However, Discards and Errors are not the same thing and as such must be treated separately. Besides, Transmit and Receive are often not treated the same either, so a further split is needed. I propose the following global thresholds and overrides:

  • Rx Errors Warning
  • Rx Errors Critical
  • Tx Errors Warning
  • Tx Errors Critical
  • Rx Discards Warning
  • Rx Discards Critical
  • Tx Discards Warning
  • Tx Discards Critical

Yes, it is 8! different values, globally AND with overrides for each interface, not 2 as we have now... With further ability to un-tick individually (as discussed in blog as a new feature) AND ability to set over how many polls as well (same, new highly welcomed approach as per blog above)

Interfaces are not a function of the Orion Platform, but I agree that their status and thresholds should operate in a similar fashion to nodes and other entities in Orion. I will be sure to make jason.carrier​, our new Product Manager for NPM, aware of your request.

alexslv  wrote:

(4)

Unknown Status for Objects

- This needs to be further improved. A while ago I developed a set of reports, Uncovering Polling Problems and Issues in Your Environment, which I have improved over time (I can share the latest version on request), to identify polling issues. We still heavily rely on those, because so often an object will not give any indication whatsoever that SolarWinds is not able to poll some aspects of it and will remain green, giving us a false negative. The reports are working great and we even trigger alerts for such issues, but it would be great to see the "Unknown" status improved, so that we don't need that additional reporting. A typical example is an SNMP node where the community string has been changed - PING still works, the node is green, but all stats are no longer collected.

This is a great set of reports you've created and I appreciate you sharing them with the community. I agree that we need to improve visibility into these types of issues throughout the product, not just for nodes and volumes in the Orion Platform.

MVP

I would argue that if the node isn't properly collecting statistics (i.e., up, but only by way of ICMP), then the responsible engineer isn't going to get woken up at all, even when there is an issue, because the monitoring solution won't know about performance problems. I'd rather get woken up to be told it's a community string change than get a call from my boss saying critical system X is down and why didn't we know about it!

But - I am really excited about the changes overall. As Alex mentioned, some of these have been a long time coming, and as with most things I am sure they will get tuned over time based on community / customer feedback.

Level 13

It's been a long-standing issue that it's hard to tell when a device has actually stopped polling through SNMP or WMI but is still responding to ping. It takes a keen eye to spot the unknowns or the lack of fresh data when the main node status is green, especially when you have a day job to be doing at the same time.


I use hardware unknown alerts and reports to help me find them, but it's still an issue, like you pointed out, dgsmith80.



MVP

Yeah, I had a feeling it was not going to be easy. I agree with all the comments above, thank you aLTeReGo.

However, polling issues (such as the inability to collect CPU/memory) while the node shows as Up (PING working properly) are a real problem here, which leads to false negatives (for example, a CPU runs hot, but we don't even have this information in SolarWinds). I disagree that flagging it with engineering is perceived to be noise. Quite the opposite - we flag it with them to tell them that their system is not being polled correctly, and should anything happen, we may not be able to alert on it. One of the biggest questions we discuss on a weekly basis is "What went wrong with infrastructure that led to problems and downtime that SolarWinds did not notify us about in advance?" (part of our continuous improvement program).

Therefore, it seems we will still need to rely on that additional reporting to highlight such objects.

Interestingly, this set of reports is usually the #1 thing I present to customers. They are amazed to find out how many Gremlins they didn't even think they had. So, my point here: there needs to be a way to identify such issues, whether through status or some other means, and ideally it should be available out of the box as a visual indication. The reports do a good job... but having hundreds of them available OOB means that many customers very likely won't even know they exist, or perhaps won't appreciate their importance if they do manage to find them.

I will update my previous thread with the latest reports - hope those will assist you.

...

Here is some more feedback related to the current status indication, in this case for volumes. At a glance, all seems good here... green and clean...

pastedImage_0.png

...however, it is very deceptive. Notice those are all C:/ drives on the same box... clearly an issue... albeit with no visual indication. By hovering over them it becomes clear that some of them are down.

pastedImage_1.png

I suggest making the default icon for a disk the icon of its actual status, not the icon of the disk type (the same as is done in the Interfaces resource, for example).

MVP

You are a superstar. Thanks a million!

On the first point - this needs to be available at the object level (node, interface, etc.) as an override, in addition to the global setting you mentioned above.

Product Manager

All very fair points, alexslv​. For your example above, an additional feature added in this release would likely help with this situation, as well as improve the maintainability of the Orion system: automatically removing volumes (or interfaces) which have been in an 'unknown' state for longer than a user-definable time period.

pastedImage_0.png

MVP

pastedImage_0.png

Level 9

I really like the overall status change. I must say, when I first saw it I got the typical "Wow, there's a lot of red," but it has actually helped me right from the off to cut through a lot of the crap and get right to the source, so that's really helpful. Setting those sustained polling thresholds is really, really helpful as well.

This is MASSIVE. I'll miss my favorite "Up Down" status, but I guess we can live without it.

Product Manager

Speaking to item 3 - I agree! Incorporating status and threshold improvements for interface metrics is something I'm hoping to address in the near future. It's important to note, though, that it's not something we're currently working on.

Level 9

You can change it back by going to Polling Settings > Node Status = Classic. Unfortunately, we had to do this as we discovered some inconsistencies with the statuses displayed, which we are investigating.

Level 9

We had the same problem... so we changed ours back, and I have just had time to go back to it... what, may I ask, were you seeing? Our nodes/interfaces were showing up as red (down/critical), but I could not figure out why... this would happen after an outage, but all devices would be back up and would have been up for some time, so the memory, CPU, etc. would be back to normal, and yet the interfaces/nodes would still show up as red... I was so confused, so I just changed it back until I had more time to look into it... any help would be great.

Product Manager

liammiller  wrote:

You can change it back by going to Polling Settings > Node Status = Classic. Unfortunately, we had to do this as we discovered some inconsistencies with the statuses displayed, which we are investigating.

Can you describe those inconsistencies? Did you by chance open a support case?

Product Manager

cathsheh1  wrote:

We had the same problem... so we changed ours back, and I have just had time to go back to it... what, may I ask, were you seeing? Our nodes/interfaces were showing up as red (down/critical), but I could not figure out why... this would happen after an outage, but all devices would be back up and would have been up for some time, so the memory, CPU, etc. would be back to normal, and yet the interfaces/nodes would still show up as red... I was so confused, so I just changed it back until I had more time to look into it... any help would be great.

Were the interfaces also showing as 'down' on the interface detail page?

Level 12

I really like this direction. However, I had to change back to Classic for a couple of reasons. Firstly, I've got to review all my alerts that trigger when Status != Up. I read the warning on the potential for a ticket storm and did it anyway... my bad. Secondly, I have about 1,500 switches on the WorldMap that all went from green to yellow because a couple of interfaces/ports are down... and will be until someone plugs something into them (though no one ever should). In fact, I have an alert if these ports become "Up". I need to slow down and reread this entire blog.

Level 9

I think this has something to do with the parent/child dependencies... as this happened twice. Both times the parent (coresw) went down, sending all child nodes/interfaces into either down or unresponsive... then when the parent came back up, the child nodes/interfaces stayed down (according to SolarWinds, but not physically down)... I first thought this was happening because the CPU/memory was just not coming back under threshold levels yet and I needed to wait it out... however, one Sunday after maintenance I waited for several hours and got a message from a co-worker asking why all the nodes/interfaces were still showing down when in fact they were not... I tried to adjust the polling on the thresholds but that did not help either... so I just went back to Classic for now, either to see if others had the issue or until I had more time to look into it.

Crazy complex in description, but MUCH appreciated in use!

MVP

One word of advice for people activating this might be to activate the feature, but slowly add contributors as you go rather than turning them all on at the same time.

Start with something simple like CPU or Memory, making sure you disable any individual alerts at the same time. Then work on other contributors that you are currently alerting for individually, such as HWH or Application Status, etc.

Big bang is never ideal unless you're starting a new environment.

Level 8

Excellent feature has come in; will try to explore more on this.

Level 12

What are these "contributors"?

MVP

Hey James, I'm referring to the section above titled:  CHILD OBJECT CONTRIBUTORS

To generalise, it is how SolarWinds determines which objects are factored in when the status calculation is made:

pastedImage_1.png

Components of the node that can "contribute" to its overall [enhanced] status - for example, an application, an interface, or a volume.

Level 12

Thank you. I should have read a little further. One of my issues is that I have a WorldMap populated with about 1,500 switches where there will always be a couple of interfaces down. When I first turned it on, my whole map went yellow. Can I define contributors per node, or is it global?

MVP

At the moment it's global, but I hope they make it per-node. You could just not include Interfaces in your status contributors.

Level 20

This is awesome stuff aLTeReGo​ finally granular node status!!!

Level 11

Are the AppInsight templates the only SAM items that will count as a child item today or will any/all SAM templates/components that are monitored be picked up as a child object?

Product Manager

nglynn  wrote:

Are the AppInsight templates the only SAM items that will count as a child item today or will any/all SAM templates/components that are monitored be picked up as a child object?

Any application can influence the status of a node. The screenshot I posted above is only a partial listing.

Level 12

This is pretty awesome. I started using the new map application. I still need to figure out the best way to present the data.

Any more hints would be greatly appreciated!

Level 12

How do we set a 'default' number of 'consecutive polls' for the new 'sustained thresholds' feature? I do not see an entry under 'Orion General Thresholds', only when I go into node properties.

My use case is to specify a default and build an alert to alarm if CPU is over threshold for a specified number of 'consecutive polls'. This allows overrides of CPU usage (presently) and hopefully of 'consecutive polls' with this new feature.

Product Manager

monitoringlife  wrote:

How do we set a 'default' number of 'consecutive polls' for the new 'sustained thresholds' feature? I do not see an entry under 'Orion General Thresholds', only when I go into node properties.

My use case is to specify a default and build an alert to alarm if CPU is over threshold for a specified number of 'consecutive polls'. This allows overrides of CPU usage (presently) and hopefully of 'consecutive polls' with this new feature.

There is currently no mechanism that allows for defining a 'default' sustained condition, though that is an excellent suggestion.

Level 12

aLTeReGo

Where is the best place to send the feedback/post to get this added?

Product Manager

The Network Performance Monitor Feature Requests​ forum would be the best possible place.

Level 12

Created

Level 9

In the meantime, you can set sustained thresholds for multiple nodes in bulk. Just select multiple nodes on the Node Management page, click Edit Properties, and override the global threshold settings.

MVP

Voted up! I believe this feature must be included together with this new enhancement. Global default thresholds must be in line with what the overrides provide.

Just LOADS O' STUFF!  Good stuff, that is.

I find myself constantly wanting more options to customize the hover details. I know I've spoken with meech​ and jreves​ about it at times for NTA, but this is another example. I wonder if it's waiting on the UI redesign.

Product Manager

designerfx  wrote:

I find myself constantly wanting more options to customize the hover details. I know I've spoken with meech  and jreves  about it at times for NTA, but this is another example. I wonder if it's waiting on the UI redesign.

What specifically would you like to see added to the hovers?

I wanted many things:

  1. ability to customize the hover in many ways, some of which are covered in concept with - and I know progress has been made on that effort under the internal FR UIF-4748
    1. remove/select custom properties to show up on the hover. i.e.: I have one custom property I've made which I want to see on every node, and a bunch I don't. Alphabetical is not the way we should do that. It would be an easy custom property creation checkbox of "display on object hover" to have it be applicable to various types of Orion objects, from IPAM subnets to Nodes to Interfaces, et al.
    2. select which module's information you want on hover based on view/module view/page. Maybe I want NTA top talkers on my nodes view from the perspective of the node I hover on (I do!). Maybe I want NTA top applications for the node. Maybe I want the UDT VLANs applicable to the node, not just the interface. Maybe it's IPAM subnets, etc.? This would tie back to enabling me to fine-tune user profiles to let people see specifically what they want to see.
  2. more context-specific information based on the modules installed. If it's NTA, I want to see the top 5 talkers for the node I hover on, assuming no critical statuses such as CPU, memory, etc. This gets to:
    1. Ability to start removing stuff from the hover we don't need outright. Example: Do I want a metric when it's at 100%? Not for packet loss. Do I want it at anything less? Yes, naturally. This would be nice to specify by something, either a custom property or machine type or anything else we can start getting really granular about.
    2. I want to customize the order here. Node name at top, IP, remove machine type. See the paragraph above and an example: pastedImage_17.png

You might want to open this up to another roundtable if it's of interest; I would think you'll get a *lot* of feedback on this. It ties back to what alexslv​ is asking for as well. I would appreciate volume status with that hover criteria, for example.

MVP

It would be so awesome if we could finally be able to change the Node status visually in a map, based on BGP neighbor status.

Something we have been waiting for patiently, for 12 years.

Product Manager

Have you considered creating a feature request for this?

Network Performance Monitor Feature Requests

Product Manager

MVP

Somehow I think that adding one more FR for this will not help