5 Replies Latest reply on Oct 24, 2010 12:13 AM by netlogix

    Overview of RC Testing

    bwiechman

      Quick Rundown on new features and functionality testing/issues noted

      • Active Directory authentication integration with group support
        • Not tested
      • Dependencies 2.0 / Basic Root Cause Analysis
        • Basic groups/dependencies set up. Status of group updated as the underlying status changes.
          • Would it be possible to include the reason for the child node's degraded state in the event log? Currently it says something like: "Group X is in a warning state because Node Y is in a warning state". You have to drill into the node to find the underlying cause. It would be much more useful if the event said something like "Group X is in a warning state because Node Y is in a warning state due to <Insert triggered event here>".
        • Child objects are marked down when a parent node or interface goes down. I have several parents defined as groups (typically wireless point-to-point links: if one end goes down the link is down, so it works). It would be service impacting to trigger a down condition on one node to test this. In this case, if one node in the parent group is marked down, will the children be marked unreachable or down? What would the expected behavior be?
      • VMware infrastructure monitoring
        • Basic functionality looks like it works. Some issues with object references. Sounds like the root cause has been identified. Waiting for fix.
        • Unable to fully test. The vCenter server suffers from the problem noted above, so datacenters/clusters are not retrieved at this time. Will do additional testing once those issues are resolved.
      • Dynamic Service Level Groups
      • Ability to PDF Reports and Web Console Views via Report Schedule
        • Tested. No issues noted. 
      • Cisco UCS Support (Unified Computing System)
        • Not tested. No supported hardware
      • Meru Wireless Controller Support
        • Not tested. No supported hardware
      • Ability to create custom SQL advanced alerts
        • Not tested
      • Mobile Alert View
        • Not tested
      • Node and interface details on Summary Views
        • No issues noted
      • Enable, disable, and delete Advanced Alerts in the web console
        • No issues noted
      • Notes on alerts
        • Works. No issues noted. 
      • Acknowledge Alerts via link in email alerts
        • This works. Caveat noted: it requires DNS resolution, or backend database workarounds if you are unable to resolve the hostname of your device using DNS. It would be more usable if there were a global option allowing the end user to configure the hostname, SSL, and port settings of the Orion box used in alerts, to ensure that the link works properly.
      • Platform Support - SQL 2008 R2
        • Not tested
      • Stacked Charts
        • The yellow color can sometimes be very difficult to see
        • Nice option to have the same UnDP on every page. It would be nice to do this per hardware vendor or similar, instead of a single global stacked interface chart that shows up on EVERY node, i.e. this set of interfaces on all devices of type X, this set on all devices of type Y, etc.
          • In other words... works, but greater flexibility in generating and assigning stacked charts would be excellent.
        • When do we get to include different pollers in the same chart? Several UnDP pollers not from the same table for example.
          • Not every hardware vendor is all that intelligent about combining like information in the same table.
          • Sometimes related information for an interface for example is contained in different tables. For example three tables for total, registered, and offline modems, with rows for each interface in each table. I want to graph the three values in a single graph on the related interface.
          • ETC
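
To illustrate the multi-table request above, here is a minimal Python sketch of combining several per-interface poller tables into one series set per interface. This is not an Orion feature or API; the function name `merge_pollers`, the series names, and the interface names are all hypothetical.

```python
def merge_pollers(tables):
    """Combine {series_name: {interface: value}} into {interface: {series_name: value}}.

    Each input table stands for one UnDP-style table with a row per interface
    (hypothetical data layout, chosen only for illustration).
    """
    merged = {}
    for series_name, rows in tables.items():
        for interface, value in rows.items():
            # Create the per-interface entry on first sight, then add this series.
            merged.setdefault(interface, {})[series_name] = value
    return merged


# Hypothetical sample: three separate tables keyed by interface name.
tables = {
    "total": {"Cable1/0": 120, "Cable1/1": 80},
    "registered": {"Cable1/0": 110, "Cable1/1": 75},
    "offline": {"Cable1/0": 10, "Cable1/1": 5},
}
per_interface = merge_pollers(tables)
```

Each entry of `per_interface` then holds the three values that would be drawn together on that interface's chart.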

       

      Basic issues noted:

      • Restarting Orion services causes erroneous node rebooted alerts to be triggered.
      • Hostname/SSL in the URL included in alert acknowledgment emails may not be valid
      • Trap source OIDs appear to be translated/displayed in a different manner. This affects alerts configured to use the old source OID. In my case this was the majority of the alerts.
      • Node status pop-up displays 100% packet loss for nodes that had a small spike of packet loss. Often marked as warning for 24+ hours after packet loss incident. Also recorded as 100% packet loss in Top X Packet Loss listings such as those on the Network Summary pages. (194151)
      • Volume details not displayed correctly: volumes with 0 bytes consumed are not displayed properly (194187)
      • Incorrect chart selected when editing some node/interface charts for the first time
      • VMWare polling issues. 'Object reference not set to an instance of an object' (194104)
      • VMWare interface bandwidth consumption summary displays N/A instead of actual throughput.
      • Trap alerts sent with multiple addresses in the CC field are not sent if a semicolon is used as the address separator. This was an issue prior to 10.1, but still exists in 10.1RC so I note it here (192818)
      • The Solarwinds Trap Service crashed twice since installing the 10.1 RC on my Orion system. I have not noted any service failures in the past.
      • When starting all services in the Service Manager, the process is halted when attempting to start the Job Scheduler service because of the dependency on the Job Engine Service. Re-order service start-up in the service manager to automatically start services in the proper order without hanging on end user input.
        • Re: Overview of RC Testing
          Tomas.Mlcoch

          Regarding:

          • Trap alerts sent with multiple addresses in the CC field are not sent if a semicolon is used as the address separator. This was an issue prior to 10.1, but still exists in 10.1RC so I note it here (192818)

          This should already be fixed in 10.1 RC1. Is it still an issue in your environment?

            • Re: Overview of RC Testing
              bwiechman


              Regarding:

              • Trap alerts sent with multiple addresses in the CC field are not sent if a semicolon is used as the address separator. This was an issue prior to 10.1, but still exists in 10.1RC so I note it here (192818)

              This should already be fixed in 10.1 RC1. Is it still an issue in your environment?

               



              It appears I was incorrect. This is resolved in 10.1 RC1. Thanks for following up.

            • Re: Overview of RC Testing
              Karlo.Zatylny

              With regards to:

            • Child objects are marked down when a parent node or interface goes down. I have several parents defined as groups (typically wireless point-to-point links: if one end goes down the link is down, so it works). It would be service impacting to trigger a down condition on one node to test this. In this case, if one node in the parent group is marked down, will the children be marked unreachable or down? What would the expected behavior be?
            • First, when a parent is down, the children will be unreachable.  Please let us know if you are observing something different or unexpected.

              When a dependency parent is a group, we use the status based on the selection for Group Roll Up Status Mode.  When you are creating a group there are advanced options that set how the group calculates its status.  This is called "Status Rollup Mode" in the UI.  The options are Best, Mixed, and Worst.  How you want the dependency to behave determines which mode you choose.

              If you are creating a parent where only one of the objects in the group needs to be up, then select "best" or "mixed".  The parent will show up or warning as long as at least one of the objects in the group is not down, and child objects will not go unreachable.  For example, if you have a node that is attached to two switches, then as long as one of them is up your node is reachable.  If both are down, then your node is unreachable.

              If you are creating a parent where one object going down should cause the children to be unreachable, then select "worst".
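
The behavior described above can be sketched roughly in Python. This is only an illustration, not Orion's actual implementation: the `Status` enum, `rollup`, and `child_status` are hypothetical names, only three states are modeled, and the rule used for "mixed" (warning whenever members disagree) is an assumption.

```python
from enum import IntEnum


class Status(IntEnum):
    # Ordered from best to worst so min() and max() pick best and worst.
    UP = 0
    WARNING = 1
    DOWN = 2


def rollup(member_statuses, mode):
    """Return a group's status for the rollup mode "best", "mixed", or "worst"."""
    statuses = list(member_statuses)
    if not statuses:
        raise ValueError("group has no members")
    if mode == "best":
        return min(statuses)
    if mode == "worst":
        return max(statuses)
    if mode == "mixed":
        # Assumption: identical member statuses pass through unchanged,
        # any disagreement is reported as WARNING.
        return statuses[0] if len(set(statuses)) == 1 else Status.WARNING
    raise ValueError(f"unknown rollup mode: {mode}")


def child_status(parent_status, polled_status):
    """A child of a down parent is reported as unreachable instead of down."""
    if parent_status == Status.DOWN:
        return "unreachable"
    return polled_status.name.lower()


# A node attached to two switches, one of which has failed.
switches = [Status.UP, Status.DOWN]
```

With this sketch, `rollup(switches, "best")` gives `Status.UP` and `rollup(switches, "mixed")` gives `Status.WARNING`, so the child stays reachable, while `rollup(switches, "worst")` gives `Status.DOWN` and the child becomes unreachable.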

              Please let me know if I can describe this better.

                • Re: Overview of RC Testing
                  bwiechman

                  Your explanation makes sense and is basically what I did. In cases where both ends of the link needed to be up for the group to be up, I configured the rollup status to be "worst". If one side is down they are both down, so essentially the two units function as a single entity anyway. Your explanation confirms what I hoped would happen in the case where a parent is a group.

                   

                  I am working to test this, but that is maintenance window work.

                   

                  One set of features I think would be nice:

                  - Custom Properties for Groups

                  - Allow at least two levels of sorting based on custom properties on Groups page.

                  Or (a little more funky) root level groups to be sorted by regex in the group name. (This may be a little less workable)

                     - for example allow me to provide a list of key words/phrases and categorize based on that.

                     - For example. Site A - Transport, Site B - Transport, etc would be placed under a dynamic group called transport

                     - Site A - Workstations, Site B - Workstations, etc.... does that make sense? Then by following a simple naming convention the end user can create a dynamic group display.
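
As a rough Python sketch of the naming-convention idea above (not an existing Orion feature): the pattern assumes group names of the form "<site> - <category>", and `NAME_PATTERN`, `group_by_category`, and the "Uncategorized" bucket are hypothetical.

```python
import re
from collections import defaultdict

# Assumed convention: "<site> - <category>", e.g. "Site A - Transport".
NAME_PATTERN = re.compile(r"^(?P<site>.+?)\s+-\s+(?P<category>.+)$")


def group_by_category(group_names):
    """Bucket group names by the category part of their name."""
    buckets = defaultdict(list)
    for name in group_names:
        match = NAME_PATTERN.match(name)
        # Names that do not follow the convention land in a catch-all bucket.
        category = match.group("category").strip() if match else "Uncategorized"
        buckets[category].append(name)
    return dict(buckets)


names = [
    "Site A - Transport",
    "Site B - Transport",
    "Site A - Workstations",
    "Site B - Workstations",
    "Core",
]
dynamic_groups = group_by_category(names)
```

Here `dynamic_groups["Transport"]` would hold both Transport groups, and `"Core"` would fall under "Uncategorized" because it does not match the convention.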

                • Re: Overview of RC Testing
                  netlogix


                  • Restarting Orion services causes erroneous node rebooted alerts to be triggered.
                  • Hostname/SSL in the URL included in alert acknowledgment emails may not be valid



                  I have both of these issues as well.  Let me know if you are given any fixes.

                   

                  VMWare polling issues. 'Object reference not set to an instance of an object' (194104)


                  I had a similar issue with the polling too, but I removed the 2nd NIC, then removed and re-added the node in Orion, and that fixed it. (I also saw someone else say that assigning an IP to the NIC fixed it for them.)

                   

                  The Solarwinds Trap Service crashed twice since installing the 10.1 RC on my Orion system. I have not noted any service failures in the past.


                  My syslog service kept crashing on me, but I rebuilt the OS and re-installed Orion and it has been fine since. (So easy! Everything is held in the DB, so I didn't lose any Orion configs except for some weirdness with NCM integration. It also gave me an excuse to upgrade to 2008 R2.)

                  I have another issue: all my volumes changed to unknown.