6 Replies Latest reply on Oct 4, 2010 3:59 PM by jspanitz

    Node Status Problem

    jspanitz

      We had a bunch of nodes using some old community strings, so we went in and changed the strings and restarted the snmp services on the nodes.  We thought we would have heard Orion chime in and tell us the nodes were down but alas, nothing.   So we opened up the Orion console expecting to see the nodes in some sort of error state, but again, nothing.  We then selected a node and looked at the stats.  Interesting!

      CPU Load stopped collecting data when the change was made - as expected.  Network Interfaces were in an unknown state - as expected. 

      But Availability is showing 100%, and Node Status = One or more interfaces are in an Unknown state.  Running Poll Now, a Rediscover, and then another Poll Now produced no change in status.

      Can someone explain to us why it is behaving this way?  This is very bad and we have some serious concerns about Orion's ability to report the truth.  Help Please!

        • Re: Node Status Problem
          kb1lxm

          Jspanitz,

          If you take one of the nodes and select "List Resources" what happens?  Do you get an error saying the node is down or not responding to SNMP?

          You did go to "Edit Settings" and change the community string on these nodes right?  Orion has no clue if you change the community string on the server.  You have to tell it the new community string to use.

            • Re: Node Status Problem
              jspanitz

              Selecting List Resources on the node results in "SERVERXYZ is currently down, unreachable, or SNMPCOMMUNITYXYZ is not a valid SNMP community string.".  Which is EXACTLY what we expect.  We just expect it to happen as soon as Orion finds the node in this state.

              We did not go into Edit Settings because we deliberately wanted to see what would happen in Orion when it could no longer see the node via SNMP.  We understand it has no way of knowing the string changed, but if Orion is unable to pull data with the string it has, it should flag that and spit out a warning.  So yes, when we went through the final steps of moving the nodes to the new strings, we did go into Edit Settings and update the node.  Orion then picked up right where it left off as if nothing had happened.

              This is a huge issue, in our opinion, as a server could be unreachable via SNMP and Orion will never tell you that it can't be reached.

                • Re: Node Status Problem
                  kb1lxm

                  Well I set up a report that I get every morning that shows me which volumes are "unknown".  This usually helps me find the following problems:

                  1.) Servers where the SNMP service has failed/hung
                  2.) Servers where the SNMP settings are misconfigured
                  3.) Misconfigured firewall rules
                  4.) Servers that are hung or locked up
                  5.) Occasionally if there is a lot of network traffic some responses to SNMP Queries get dropped

                  You could also set it up as an alert.  Another option is to configure the SNMP Agent to send out Authentication Traps and set up alerts around those.  Basically, if something queries the SNMP Agent with an invalid community string or from the wrong host, the agent sends out a trap.

                  I understand this is a huge issue, but there are a variety of reasons that an SNMP query could fail.  Orion can't say for sure why a host is no longer responding, so it rightfully puts the affected items into an unknown state.  Heck, if you don't bother to set up an alert for a node going down, it won't send an email.
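
                  As a rough illustration of the morning report idea, here is a minimal Python sketch that groups "unknown" items by node.  It does not touch Orion at all: the `PolledItem` record and the `unknown_by_node` function are hypothetical names, and the sketch assumes the statuses have already been exported from the monitoring system by some other means.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class PolledItem:
    """Hypothetical record for one monitored item (volume, interface, ...)."""
    node: str
    name: str
    status: str  # assumed values: "up", "down", "unknown"


def unknown_by_node(items: Iterable[PolledItem]) -> Dict[str, List[str]]:
    """Group the names of all items whose status is 'unknown' by node."""
    report: Dict[str, List[str]] = {}
    for item in items:
        if item.status.strip().lower() == "unknown":
            report.setdefault(item.node, []).append(item.name)
    return report


if __name__ == "__main__":
    sample = [
        PolledItem("SERVERXYZ", "C:\\", "Unknown"),
        PolledItem("SERVERXYZ", "D:\\", "unknown"),
        PolledItem("SERVERABC", "C:\\", "up"),
    ]
    for node, names in unknown_by_node(sample).items():
        print(f"{node}: {', '.join(names)}")
```

                  A node that shows up in this report with every item unknown is a good candidate for one of the five causes listed above.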

                    • Re: Node Status Problem
                      jspanitz

                      We will probably set up an alert, as we need to resolve an unmonitored node asap.  Auth trapping can get messy.

                      I appreciate all your help and insight.  It looks like others on the forum are having similar "unknown" node issues.  Hopefully someone from SW will chime in on this.  We'd like to know that they are aware of the issue, that they are working on fixing it, and that it will be resolved soon.  Or they can tell us we're nuts and it's all in our heads.  At least we'll know where we stand.

                        • Re: Node Status Problem
                          njoylif

                          The node is reporting up because (I expect) the pings are still responding; thus it is up.  Orion just can't get stats via SNMP until you update it with the new community string.
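
                          To make the distinction concrete, here is a small Python sketch of that reasoning.  It is only my guess at the logic, not Orion's actual implementation: `NodeState` and `classify` are made-up names, and the sketch assumes up/down is decided by ping alone while SNMP is a separate check.

```python
from enum import Enum


class NodeState(Enum):
    """Hypothetical states; the middle one is what this thread asks to be flagged."""
    UP = "up"
    UP_SNMP_FAILING = "up, but SNMP is not answering"
    DOWN = "down"


def classify(ping_ok: bool, snmp_ok: bool) -> NodeState:
    """Combine two independent checks into one node state.

    Assumption: reachability (ping) alone decides up/down, so an SNMP
    failure on a pingable node never makes the node 'down'.
    """
    if not ping_ok:
        return NodeState.DOWN
    if not snmp_ok:
        return NodeState.UP_SNMP_FAILING
    return NodeState.UP


if __name__ == "__main__":
    # The situation in this thread: ping answers, SNMP does not.
    print(classify(ping_ok=True, snmp_ok=False).value)
```

                          If status only looks at the first check, a wrong community string leaves the node "up" with 100% availability, which matches what was observed.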

                            • Re: Node Status Problem
                              jspanitz

                              I guess I didn't clearly state that we acknowledge the fact that Orion can't collect stats until the string is updated.  The piece we feel is missing is that Orion doesn't flag this as a problem.  In our opinion, this should be some type of failure event.  What if the SNMP strings were accidentally changed / deleted / corrupted (or deliberately changed for malicious purposes)?

                              Feature / Bug Fix request!