11 Replies Latest reply on Aug 20, 2012 9:42 AM by akrekic11

    Monitoring Gotchyas

    Jeremy Stretch

      I noticed in NPM the other day that a port-channel connecting to a critical server was showing down. This was odd, not only because nobody was screaming, but because the server itself was reachable. I logged into the upstream switch, a Cisco Catalyst 6500, and noticed that there were now two port-channel interfaces sharing a common numeric ID: Port-channel200, and Port-channel200A. This was new.


      It turns out that the port-channel, which had LACP negotiation enabled, encountered an error at some point, and its two members were moved to this newly spawned alternate port-channel. Traffic continued to flow, but the new interface came with a new SNMP index identifier, which of course was not being monitored. This caused quite a bit of confusion in the NOC until someone logged into the core switch to investigate.
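A minimal sketch of one way to catch this kind of surprise: compare the interfaces currently being monitored against a fresh walk of interface names, and flag any unmonitored interface whose name is a lettered sibling of a monitored one. The function names and the index values are hypothetical, and the walk that would produce the `current` table is assumed to happen elsewhere.

```python
import re

def base_name(if_name: str) -> str:
    """Strip one trailing letter that follows a digit,
    so 'Port-channel200A' maps to 'Port-channel200'."""
    return re.sub(r"(?<=\d)[A-Za-z]$", "", if_name)

def unmonitored_siblings(monitored: dict, current: dict) -> list:
    """Return sorted (ifIndex, ifName) pairs that exist on the device,
    are not monitored, and share a base name with a monitored interface."""
    monitored_bases = {base_name(name) for name in monitored.values()}
    return sorted(
        (idx, name)
        for idx, name in current.items()
        if idx not in monitored and base_name(name) in monitored_bases
    )

# Hypothetical data: the index values below are made up for illustration.
monitored = {200: "Port-channel200", 10101: "GigabitEthernet1/1"}
current = {
    200: "Port-channel200",
    10101: "GigabitEthernet1/1",
    873: "Port-channel200A",   # the newly spawned alternate port-channel
    900: "Vlan10",             # new, but unrelated to anything monitored
}
found = unmonitored_siblings(monitored, current)
print(found)
```

Matching on a name pattern rather than on the index is the point here: the index is exactly the thing that changed.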


      What other oddball monitoring "gotchyas" have people run into?

        • Re: Monitoring Gotchyas

          Off the top of my head:


          1. SNMP service wigging out on Windows causes interface status to show "unknown".
          2. SNMP service restarting on Windows causes bogus reboot alerts because the system start time changed.
          3. People mistakenly configuring "Physical Memory" as a monitored volume and causing unexpected volume utilization alerts.
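For item 2, a rough sketch of how a restart of the SNMP agent could be told apart from a real reboot. It assumes the device exposes both `sysUpTime` (time since the agent was re-initialized) and `hrSystemUptime` from HOST-RESOURCES-MIB (time since the host was initialized), and that both are polled as TimeTicks. The function name and sample values are hypothetical.

```python
def classify_uptime_drop(prev_sys: int, cur_sys: int,
                         prev_host: int, cur_host: int) -> str:
    """Classify what happened between two polls.

    All values are TimeTicks (hundredths of a second).
    sys  = sysUpTime      (resets when the SNMP agent restarts)
    host = hrSystemUptime (resets when the host itself restarts)
    Note: counter wrap after roughly 497 days is not handled here.
    """
    if cur_host < prev_host:
        return "host reboot"
    if cur_sys < prev_sys:
        return "agent restart"
    return "no change"

# Hypothetical samples: the agent counter dropped, the host counter did not.
result = classify_uptime_drop(prev_sys=500_000, cur_sys=1_200,
                              prev_host=900_000, cur_host=906_000)
print(result)
```

Only the "host reboot" case would then be worth a reboot alert.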
          • Re: Monitoring Gotchyas

            The issue I currently face most often involves SolarWinds monitoring of Windows-based servers and their disk drives.  For whatever reason, a drive is discovered as, for example, "C:\System Serial Number 12345" (the serial number does not appear anywhere in the physical drive description, but it is returned via SNMP).  As a result, if the server gets rebuilt and a new C:\ drive is configured (which happens quite often in our VM environment when we mount a bigger drive), the new drive will have a different description, "C:\System Serial Number 67890".


            Since SolarWinds sees this as a different drive (even though the drives have identical labels) two things happen.

            1.  Polling on the old volume begins to fail

            2.  The new volume isn't monitored until a discovery or a List Resources is run to add it to monitoring.


            There are far too many gotchas to cover as to why I no longer use SW's built-in discovery feature, so I'm left not knowing when a volume has changed.  I have had to resort to writing an alert that fires when a drive monitor fails, so I get tipped off to the change.  This does nothing for me, however, if a completely new drive is added; in that case I am completely in the dark.
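One possible way to spot both cases (a replaced volume and a brand new one) is to key volumes on the drive letter instead of the full description. The sketch below assumes the monitored and freshly polled descriptions are already available as plain lists of strings; the helper names and the `D:` entry are hypothetical.

```python
import re

def drive_key(caption: str) -> str:
    """Reduce a volume description to its drive letter, e.g. 'C:'.
    Descriptions without a drive letter are returned unchanged."""
    text = caption.strip()
    match = re.match(r"([A-Za-z]:)", text)
    return match.group(1).upper() if match else text

def diff_volumes(monitored: list, discovered: list) -> tuple:
    """Compare volumes by drive letter.
    Returns (replaced, added, missing) as sorted lists."""
    mon = {drive_key(c): c for c in monitored}
    disc = {drive_key(c): c for c in discovered}
    replaced = sorted((mon[k], disc[k])
                      for k in mon.keys() & disc.keys() if mon[k] != disc[k])
    added = sorted(disc[k] for k in disc.keys() - mon.keys())
    missing = sorted(mon[k] for k in mon.keys() - disc.keys())
    return replaced, added, missing

# The C: descriptions come from the post above; the D: entry is made up.
monitored = [r"C:\System Serial Number 12345"]
discovered = [r"C:\System Serial Number 67890", r"D:\Data Serial Number 22222"]
replaced, added, missing = diff_volumes(monitored, discovered)
print(replaced, added, missing)
```

Here `replaced` holds the old and new `C:` descriptions as a pair, and `added` holds the new `D:` volume that a failed-poll alert alone would never reveal.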

            • Re: Monitoring Gotchyas

              One of the SolarWinds Orion-specific monitoring gotchas that bit me today is trailing spaces in Custom Properties.  A trailing space is treated as a character, so if you use Custom Properties for automated processes such as sorting, reporting, and view limitations, it can bite you.  Several of my coworkers copy and paste custom properties when setting up nodes, and the pasted value often includes a trailing space.
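A small sketch of why this bites and how it could be audited, assuming the custom property values have been exported into a plain dictionary. The property names and values are hypothetical.

```python
def find_untrimmed(properties: dict) -> dict:
    """Return only the properties whose value has leading or trailing whitespace."""
    return {key: value for key, value in properties.items()
            if value != value.strip()}

# Hypothetical custom properties for one node.
node_props = {"Team": "Network ", "Site": "HQ", "Owner": " jsmith"}

# An exact-match filter silently misses the padded value.
matches_filter = node_props["Team"] == "Network"

bad = find_untrimmed(node_props)
cleaned = {key: value.strip() for key, value in node_props.items()}
print(matches_filter, bad, cleaned)
```

Trimming on the way in (the `cleaned` dictionary) is usually simpler than teaching every report and view limitation to tolerate padding.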

              • Re: Monitoring Gotchyas
                Sohail Bhamani

                Just today I edited an existing discovery to remove the SNMP strings so I could run an ICMP-only sonar scan.  Strangely enough, the scan used some sort of cached SNMP credential and discovered the node just as it would have had I put SNMP strings in the first steps of the discovery.  When I created a new discovery without any SNMP credentials, it worked as described.  This is my gotcha for this week's client.


                Sohail Bhamani


                • Re: Monitoring Gotchyas

                  In the current NPM release, the ability to put whatever you want into custom properties causes problems (it looks like this will be resolved in 10.4 with administratively defined and locked custom properties, but I can't test it yet in the 10.4 beta).


                  Not sure what the cause was, but I had 2 ESX servers alert because Orion reported their memory at about 4000%.


                  In general, approach alerts strategically.  You can write very few alerts that cover almost everything, but if you're not careful and everyone has access to create them, you will end up with maybe hundreds of active alerts (we are kind of stuck creating more specific alerts because of how everything was just mass imported... might it be better to lay out the alert structure first and then take a slower approach to importing nodes?)


                  Oh yeah, one more thing I ran into this morning: unmanaging nodes is sometimes flaky.  Maybe this doesn't really apply to the conversation, but I thought I'd throw it out there. I had to dive into the database and unmanage a node from there. It's something we've run into several times now, and it's probably something people should be aware of when deploying Orion.

                  • Re: Monitoring Gotchyas

                    It's like a dog chasing a bus.


                    Your instinct is to monitor everything. Then when you get hit with all that information it is overwhelming and no one knows what to do with it.


                    I would say trying to fulfill the requirement to "monitor everything" is a gotcha.

                    • Re: Monitoring Gotchyas

                      Here's one that gets us every time.


                      When you add a new node and apply a poller or template, any alert tied to those pollers or templates usually throws a bogus alert right away, due to the initial values being outside the threshold.  The biggest offender is usually a disk space alert.
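One common way to damp this kind of first-poll noise is to require several consecutive out-of-range samples before an alert fires. The sketch below is a generic illustration of that idea, not a description of how Orion evaluates alerts; the class name, threshold, and sample values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ThresholdAlert:
    """Fire only after `required_breaches` consecutive samples exceed `threshold`."""
    threshold: float
    required_breaches: int = 3
    _streak: int = 0  # consecutive breaching samples seen so far

    def observe(self, value: float) -> bool:
        """Record one polled value and return True if the alert should fire."""
        if value > self.threshold:
            self._streak += 1
        else:
            self._streak = 0
        return self._streak >= self.required_breaches

# Hypothetical disk-usage samples (percent). The first value is a bogus
# initial reading; the last three are a sustained, real breach.
alert = ThresholdAlert(threshold=90.0, required_breaches=3)
results = [alert.observe(v) for v in [100.0, 42.0, 41.0, 95.0, 96.0, 97.0]]
print(results)
```

The bogus first sample is absorbed because the streak resets on the next normal reading, while a sustained breach still fires on its third poll. The cost is a delay of a few polling intervals on real alerts.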

                        • Re: Monitoring Gotchyas

                          I had to alter our process for adding a node for the same reason as jspanitz.  All of our alerts require a custom property to be set in order to route the alerts to the proper team.  When we add a node, we first have to wait until the hard drives show status before we can tag them.  This makes discoveries virtually impossible, because people complain about the false alerts when we turn up a large group of nodes.