10 Replies Latest reply on Oct 31, 2012 12:22 PM by Leon Adato

    How to get hardware failure alerts

    krfitzgerald

      I'm running NPM and want to set up alerts for hardware failures. I was thinking of setting up the SolarWinds syslog server, finding event logs about hardware failures, and having the syslog server send me email alerts. I'm also running other things like UCS and a few different kinds of SANs, so I'm hoping I'll be able to forward logs from those systems as well. Does anyone have a good solution for setting up hardware failure alerts in NPM?

          • Re: How to get hardware failure alerts
            Zak Kahl

            If the devices are capable of SNMP, then you could send traps and use the Trap Server software to send alerts when the type of trap you are looking for arrives. Alternatively, you could see if there are any OIDs that NPM could poll on the device, then have NPM alert on the custom poller value.

             

            Your syslog solution could also work if the devices send the syslog msgs you are looking for.

             

            Zak Kahl

            http://www.loop1systems.com

              • Re: How to get hardware failure alerts
                krfitzgerald

                How do I set up alerts for the custom pollers? I've set up several custom pollers, but I don't see them in the drop-downs when I'm in the Advanced Alert Manager.

                  • Re: How to get hardware failure alerts
                    Leon Adato

                    From the Alert Editor, Trigger Condition tab, "Type of Property to Monitor" drop-down:

                    For alerts on UnDP (custom OID pollers), you would use either "Custom Node Poller" (for a UnDP that collects a "get" or "get next" value) or "Custom Node Table Poller" (for a UnDP that does a "get table" operation). If your UnDP is interface-related, then you would use "Custom Interface Poller".

                     

                    On the other hand, if you are still talking about a trap or syslog, you would have to set up the alert in that utility (Trap or Syslog).

                     

                    Hope that helps.

                    - Leon

                      • Re: How to get hardware failure alerts
                        krfitzgerald

                        This sounds like what I'm looking for. I downloaded the NetApp UnDP from thwack and I can see the disk drive status on my Orion web page. I tried all the drop-downs you mentioned for the custom node poller and table poller, but I do not see an option for a hard drive status set to failed. Could someone post a screenshot of a hardware failure alert from a UnDP? Thank you for your help.

                • Re: How to get hardware failure alerts
                  Leon Adato

                  Ah... I see where you are confused. You won't see something that says "Hard drive" (or whatever). The alert setup will look something like:

                  • "Poller Name" is equal to <your poller name>
                  • Value/Rate/Total is greater than (or equal to) <your threshold>

                   

                  Depending on the specifics of the display, you may also need to add lines for Column number, Row ID, etc.

                   

                  You can get specifics (i.e., find out whether you need Value, Rate, etc.) by creating an email to yourself, putting ALL the variables in the email (e.g., "Value is ${Value}"), and seeing which one(s) have the actual information you want.

                    • Re: How to get hardware failure alerts
                      krfitzgerald

                      OK, I'm getting closer, thanks for your help. I'm using a custom poller named diskfailed message with OID = 1.3.6.1.4.1.789.1.6.4.10

                      I set the alert's type drop-down to "Custom Node Poller" because it is a "get next" poller.

                      I set the condition to

                      Poller Name is equal to diskfailedMessage

                      Status contains fail

                       

                      The alert sent me an email right away even though I don't have any failed hard drives right now. How do I find out what the condition needs to be for a failed hard drive?
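One way to see what the OID returns outside NPM is to query it directly. The sketch below assumes the Net-SNMP command-line tools are installed and that the device answers SNMP v2c; neither is stated in the thread, and the host name and community string in the usage note are hypothetical placeholders.

```python
import shutil
import subprocess

# OID quoted in the post above.
DISK_FAILED_MESSAGE_OID = "1.3.6.1.4.1.789.1.6.4.10"


def build_getnext_command(host, community, oid=DISK_FAILED_MESSAGE_OID):
    """Return the argv for a Net-SNMP get-next on one OID (SNMP v2c is an assumption)."""
    return ["snmpgetnext", "-v", "2c", "-c", community, host, oid]


def poll_once(host, community):
    """Run the query and return its raw output, or None if snmpgetnext is not on PATH."""
    cmd = build_getnext_command(host, community)
    if shutil.which(cmd[0]) is None:
        return None
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
    return result.stdout.strip()
```

Calling `poll_once("filer01.example.com", "public")` (both values hypothetical) would return whatever text the device currently reports for that OID, which is the same text the alert condition has to match.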

                    • Re: How to get hardware failure alerts
                      Leon Adato

                      Make sure the poller name matches in the UnDP system and the alert (case sensitive, etc.).

                       

                      Now in your alert email, add a whole load of variables so you can see what is getting detected and returned:

                      Assignmentname is ${AssignmentName}

                      OID is ${CustomPollers.OID}

                      Uniquename is ${CustomPollers.UniqueName}

                      Rate is ${CustomPollerStatus.Rate}

                      Rawstatus is ${CustomPollerStatus.RawStatus}

                      status is ${CustomPollerStatus.Status}

                      total is ${CustomPollerStatus.Total}

                       

                      That way you can see EXACTLY what is being triggered, and then re-formulate your actual triggers based on what you are seeing.
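To picture what such a debug email looks like once the variables are filled in, here is a minimal sketch of `${Name}` substitution. It is not how Orion expands variables internally; the variable names come from the post above, and the values are made-up placeholders.

```python
import re

# Hypothetical values for illustration only.
SAMPLE_VALUES = {
    "CustomPollers.UniqueName": "examplePoller",
    "CustomPollerStatus.Status": "example status text",
    "CustomPollerStatus.Rate": "",
}

VARIABLE = re.compile(r"\$\{([^}]+)\}")


def render(template, values):
    """Replace each ${Name} with its value; names without a value are left as-is."""
    return VARIABLE.sub(lambda m: values.get(m.group(1), m.group(0)), template)


body = "\n".join([
    "Uniquename is ${CustomPollers.UniqueName}",
    "Rate is ${CustomPollerStatus.Rate}",
    "status is ${CustomPollerStatus.Status}",
])
print(render(body, SAMPLE_VALUES))
```

A field that renders blank (like Rate here) simply has no value for this poller, which tells you it is not the property to build the trigger on.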

                        • Re: How to get hardware failure alerts
                          krfitzgerald

                          I copied and pasted what you had in the last post and the email returned these results.

                           

                          OID is 1.3.6.1.4.1.789.1.6.4.10

                          Uniquename is diskFailedMessage

                          Rate is 

                          Rawstatus is 

                          status is There are no failed disks.

                          total is

                           

                          How do I know what the status will be when a disk fails? If I'm understanding this correctly, the alert is triggering because the status contains "fail", even though the status is "There are no failed disks."?
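The false positive can be reproduced in a few lines, assuming the "contains" condition behaves like a case-insensitive substring test (an assumption about NPM, though it is consistent with the behavior described in the thread).

```python
def contains(status, needle):
    """Case-insensitive substring test, standing in for a 'contains' condition."""
    return needle.lower() in status.lower()


healthy = "There are no failed disks."  # status text reported in the thread

# "fail" is a substring of "failed", so the healthy message also matches.
print(contains(healthy, "fail"))
```

Any condition keyed on the word "fail" will therefore match the healthy message too.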

                        • Re: How to get hardware failure alerts
                          Leon Adato

                          So your alert should be tweaked so it triggers when:

                          UniqueName = "diskFailedMessage"

                          (no threshold yet)

                           

                          Now the hard part. You need to make something actually break so you can get a positive confirmation on the error. You *might* be able to figure out the message from the MIB document, but in reality, until you see it live all bets are off.

                           

                          We have a policy at my company that no alert goes live (ie: cuts a ticket, etc) until we've seen at least one alert "in the wild" so we have that level of comfort that we know exactly what we'll be seeing.

                           

                          If you absolutely can't fabricate an alert, then your next best option is to set the second trigger as:

                          ${CustomPollerStatus.Status} IS NOT EQUAL TO "There are no failed disks."

                           

                          Then you won't get "I'm OK" messages, but you won't know what you WILL get until you start getting it. I'd leave those other variables in until you've completely worked it out.
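The two-part trigger described above can be sketched as a simple predicate. The poller name and healthy message are taken from the thread; the failure text used in the examples is invented, since the real wording is unknown until a disk actually fails.

```python
POLLER_NAME = "diskFailedMessage"
HEALTHY_STATUS = "There are no failed disks."


def should_trigger(unique_name, status):
    """Fire only for this poller, and only when the status is not the healthy text."""
    return unique_name == POLLER_NAME and status != HEALTHY_STATUS


# Healthy message: no alert.
print(should_trigger("diskFailedMessage", "There are no failed disks."))
# Hypothetical failure text: alert.
print(should_trigger("diskFailedMessage", "one disk is broken"))
# An empty or slightly different healthy message also alerts.
print(should_trigger("diskFailedMessage", "There are no failed disks"))
```

Because the comparison is an exact match, any change in the healthy wording (a missing period, extra whitespace, an empty poll result) will also fire the alert, which is another reason to keep the debug variables in the email until the real failure text has been seen.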