12 Replies Latest reply on Nov 14, 2008 8:20 AM by bwiechman

    Worried about reliability...

    bwiechman

       I am somewhat concerned about the long-term reliability of Orion at this point. I installed it last week and have spent the time since configuring it. In that time I have opened 6 trouble tickets and noticed that over 1000 tickets have been opened since last Wednesday. Has everyone had a fairly stable experience, or have you experienced issues with performance and reliability?

      - Availability reports don’t report correct values. We had some network nodes go down last week and the availability percentages on the node details page are correct, but the availability report lists 100%. Great way to demonstrate to management that Orion helps ensure 100% uptime!
      - A couple of processes on the server were dying and being restarted. Apparently this is a known issue with APM that was supposedly resolved in APM2.0 SP2 (which we are running).
      - When configuring a trap alert, enclosing a match string in single quotes disallows future editing of that value. Somewhere along the line those quotes aren’t getting parsed in or out correctly and it is messing with the interface. Not a huge deal since I don’t see anywhere that actually requires you to include single quotes. Not sure what it would do with a single apostrophe though.
      - Trap alerts at least do not appear to correctly match a “not equal to” IP address. They do successfully and correctly match an item “equal to” an IP address, however. So there is a workaround, just a cumbersome one. I don’t know if this is present in the standard alerts or syslog alerts, but I have verified it with SNMP trap alerts.
      - For some network nodes with asynchronous TX/RX interface speeds the utilization percentage graphs don’t correctly show the actual percentage for the lower speed interface. Both percentages are based on the larger value which obviously results in erroneous graphs.
      - Discarding a trap as an alert action doesn’t appear to immediately discard the trap. Alerts that follow a discard action still appear to process the trap (an alert that forwards the remaining traps forwards all traps, with the ones I wanted discarded tagged “marked for discard”). They do not show up in the trap viewer or web interface, however. I’m not sure if this same thing would apply to syslog messages.
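
To make the last item concrete, here is a minimal sketch of the behavior I expected from a discard action: rules are checked in order, and a matching discard stops any later rule from seeing the trap. This is not how Orion is implemented; the `Rule` class, the `process` function, the rule names and the IP addresses are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    """A hypothetical trap rule: a match predicate plus an action name."""
    name: str
    matches: Callable[[Dict[str, str]], bool]
    action: str  # "discard" or "forward" in this sketch


def process(trap: Dict[str, str], rules: List[Rule]) -> List[str]:
    """Apply rules in order and return the names of the rules that fired.

    A matching "discard" rule ends processing, so later rules never
    see the trap.
    """
    fired = []
    for rule in rules:
        if not rule.matches(trap):
            continue
        fired.append(rule.name)
        if rule.action == "discard":
            break  # expected behavior: nothing after a discard runs
    return fired


# Hypothetical rule set: drop traps from one noisy source, forward the rest.
rules = [
    Rule("drop-noisy", lambda t: t["source"] == "10.0.0.5", "discard"),
    Rule("forward-rest", lambda t: True, "forward"),
]

discarded = process({"source": "10.0.0.5"}, rules)
forwarded = process({"source": "10.0.0.9"}, rules)
print(discarded)
print(forwarded)
```

With this ordering the noisy trap only ever reaches `drop-noisy`, which is the opposite of what I am seeing, where the forward alert still receives it.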

      Also, not a bug per se, but the fact that changes to SNMP trap alerts don't take effect immediately was... a painful realization. I am told that there used to be a note somewhere to inform users of that fact, but I have yet to see it (I may be blind). No matter what, this is definitely counterintuitive and leads to a lot of wasted time (especially when coupled with the IP address bug above).

        • Re: Worried about reliability...
          bwiechman

          An update...

          - I forgot a bug I noticed late on Friday until now as well... APM graphs sporadically show response times of less than 0ms.

           

          - A minor bug that is more an irritant than anything else (I'm nitpicking now, but after the list so far and the cost of the application I think I have the right): upgrading to 9.1 SP1 replaced all the default alerts that I had dumped because they aren't needed.

          • Re: Worried about reliability...
            aLTeReGo

            Your list is long, and quite possibly well founded. You sound like you know what you're talking about, and since I do very little with traps I'll have to take your word for all the issues you're having with that. I can certainly confirm the bug you're seeing with APM 2.0, even with SP2. As for this issue however..


             

            For some network nodes with asynchronous TX/RX interface speeds the utilization percentage graphs don’t correctly show the actual percentage for the lower speed interface. Both percentages are based on the larger value which obviously results in erroneous graphs.



            I'm not sure if you are aware, but you can configure independent transmit and receive bandwidth on a per interface basis. I hope this helps 

              • Re: Worried about reliability...
                bwiechman



                  Yes, that is exactly what I did (through the web interface... but it accomplishes the same feat). The Percent Utilization line chart, the second one down on the default interface details page, shows the percentage calculated based on the greater of the two configured interface speeds. I noticed this when I was checking out interface details on an interface that was heavily utilized. The graph only shows about 45% utilization even though the actual utilization was 90% or so. It's a wireless access point set up for a 67/33 downlink/uplink ratio, so that works out about right.
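
The arithmetic can be checked with a quick sketch. The speeds below are assumptions chosen only to give a 67/33 split (they are not the real values from this access point), and `utilization_pct` is a hypothetical helper, not anything from Orion.

```python
def utilization_pct(traffic_bps: float, capacity_bps: float) -> float:
    """Percent of the given capacity that the traffic consumes."""
    return 100.0 * traffic_bps / capacity_bps


# Assumed speeds for a link split 67/33 between the two directions.
fast_capacity = 6_700_000  # bits/s, the faster direction (assumption)
slow_capacity = 3_300_000  # bits/s, the slower direction (assumption)

# The slower direction is running at 90% of its own capacity.
slow_traffic = 0.90 * slow_capacity

# Correct: divide by the capacity of the direction being measured.
correct = utilization_pct(slow_traffic, slow_capacity)

# What the line chart appears to do: divide by the larger speed.
skewed = utilization_pct(slow_traffic, max(fast_capacity, slow_capacity))

print(f"per-direction denominator: {correct:.1f}%")
print(f"larger-speed denominator:  {skewed:.1f}%")
```

Dividing by the larger speed roughly halves the figure for the slower direction (about 44% instead of 90% with these assumed numbers), which is in line with the roughly 45% the chart shows.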

                  • Re: Worried about reliability...
                    aLTeReGo

                    It's not that I don't believe you, but if you can post a screenshot I'd like to see what you're looking at and compare it to my own asynchronous interfaces.

                      • Re: Worried about reliability...
                        bwiechman

                         No problem. :)

                         These shots were all from the same time. Just for shits and giggles I also went to the System Manager and verified that the interface speeds listed there matched what was showing up in the graphs and in the node details in the web interface. They all match up.

                         Here you can see that utilization is calculated correctly for the gauge.

                         

                        Here you can see that the graph reports the total bandwidth available. You can also see that of the 1.17 Mbps available, over 1 Mbps is being consumed, which matches up with the 90% utilization on the gauge.

                         However, on the line graph the average is reported incorrectly in the legend, and the graph itself is also incorrect at 41% or so.

                         

                         Note: graphs are accurate in the System Manager... 

                  • Re: Worried about reliability...
                    jonchill

                    I can't comment on APM or Traps as I don't use either.

                    But I wouldn't have a complaint with the overall reliability or stability of Orion NPM. It has its problems but has come a hell of a long way since we first installed the product in 2003. I would say that 95% of products out today have some sort of problem with them, however big or small it is.

                      • Re: Worried about reliability...
                        uscallesen

                        It took us about a year to get all of our bugs fixed (one remains, where Orion's reports show incorrect bandwidth utilization for DOCSIS cable interfaces). Upgrading to 9.1 and 9.1 SP1 was the first really positive experience I've had since we started with Orion almost 2 years ago.


                         But our current installation (9.1 SP1) works really well. It's stable, faster than previous releases and generally seems much more mature.


                         What I hope for at this point is that Solarwinds will focus more on fixing the remaining bugs than on releasing a new major release.

                          • Re: Worried about reliability...
                            casey.schmit

                            We are working on the next service pack for 9.1, which contains some fixes from issues raised here and through our support department.  I'm working on a couple of these items this morning.  We're also working on a new major release, which will contain some features that I think a fair number of you will really like. 

                            I can't say anything about dates though, for fear of incurring Denny's wrath. :)

                          • Re: Worried about reliability...
                            bwiechman


                            I can't comment on APM or Traps as I don't use either.

                            But I wouldn't have a complaint with the overall reliability or stability of Orion NPM. It has its problems but has come a hell of a long way since we first installed the product in 2003. I would say that 95% of products out today have some sort of problem with them, however big or small it is.

                             



                            I expect some bugs. What I don't expect is bugs in basic core functionality: inaccurate graphs for example - found two cases where the graphs are inaccurate. With the lack of useful legend information a user is forced to interpret the graph to determine what it is showing, which is obviously going to be a problem. Basic pattern matching not working correctly on a simple IP address is also troubling. I spent over half a day fighting with that one. It's a waste of my time.

                            The biggest reason we went this direction is that there was a push within our organization to go with a product that had a support structure. Everyone that has been working in IT more than 2 or 3 days knows that bugs exist; how they are resolved, and how well the software developer appears to test its software before it hits the end user, are the major differentiators. Unfortunately it is difficult to gauge either of those until you have invested in the product (as everyone's friend Oracle would say). After uncovering what appear to be a series of bugs within a week of really pushing the software I think I am rightly concerned. I don't think anything I am doing is all that unique. Essentially I was trying to get a feel for whether my experiences have been representative, or if most people never see a bug and I am that special case.

                            I am hopeful at this point. After submitting my first ticket and hearing nothing for a week (apparently my account was linked to some email address that doesn't exist and responses aren't sent to the email address that is given, and confirmed, in the ticket report) Solarwinds appears to be actively working on my tickets. The test will be to see how quickly bug fixes are integrated into the product line.

                              • Re: Worried about reliability...
                                denny.lecompte

                                We'll definitely look into these issues.  Some of these are known bugs, and as Casey said, we're working on getting them into a service pack.  I don't want to comment about the other issues because they haven't gone through the full support process.  Sometimes, there are environmental issues that expose bugs that most users don't see.

                                If you didn't get a response for a week, I can see why you'd be frustrated.  And while we always have some bugs, having one customer hit that many so quickly is atypical.  Please rest assured that we will actively work your issues and get your system running.

                                Please keep the comments coming.  Once you get past these few bumps, I believe you'll find Orion is worth the investment.

                                • Re: Worried about reliability...

                                   Did you not discover these issues when you evaluated the product, or did you install directly to production? It might take you more than one week of product usage to get into sync with any customized environment. Hang in there.

                                    • Re: Worried about reliability...
                                      bwiechman



                                      We did trial the app but didn't turn up all these issues then, for several reasons. We actually looked at Orion about a year ago and ran into some configuration difficulties, but nothing major. We then took another look at Orion after APM 2.0 came out, as this added a bunch of functionality whose absence had kept us from seriously considering Orion to start with. Of course there is always a delay between the time that you trial something and the time that the PO is actually signed by management, and in that timeframe we acquired some additional network hardware and made some modifications to the specific items we were monitoring and alerting on. I did configure a large number of SNMP monitors for our trial, but didn't work extensively with the alerting, or trap/syslog alerting, as our old system didn't support it and there was some amount of digging around to determine what traps were important, etc. And I'm only one guy. If I had time to sit and configure a monitoring system for a month that would be great (which is what this thing really takes), but that is time I just didn't have.

                                      We'll see what happens.