13 Replies Latest reply on Dec 5, 2015 5:10 PM by humejo

    Anyone ever have trouble with NPM not monitoring itself well?

    xpowels

      Here are a couple of examples.  Recently I found this when looking at a UPS:

      [Image: poller.png]

       

      Um, it's not September anymore. So what is happening is that Orion is no longer getting data back when polling this node.  Am I getting an error?  Does the Node have a red flashing box?  Am I getting a daily report that something is amiss?  No on all counts. This is frustrating beyond my ability to express it.

       

      So, how many nodes have this issue? According to tech support, there is no way to run a report to find out. I suppose I could just pull up all 2000 of my nodes one at a time to check.  Seriously.

       

      Last night I had another issue that is in the same vein but somewhat different. There was some amount of service stopping and starting on my main server, and several nodes just stopped being polled.  How many?  Again, no way to know.  They just appear green, but there are no statistics being collected. No Application stats in SAM.  No CPU.  No Memory.  No Disk.  No network stats including latency, which leads me to believe no polling (not even ping) was occurring.  Error?  Down or unknown node?  Nope.  All happy green.  Nothing wrong here.  *SIGH*  Rebooting an additional poller fixed the issue, but I was just lucky to stumble upon it before the Thanksgiving holiday.

       

      So there's the rant.  Now let's talk solutions:

      For UnDP issues, Orion should 1) Change the Node graphic to have a flashing red box, as if an interface were down (or make it a global check box option), and 2) Create a report to show the Top 10 (or all) nodes with UnDPs that have not updated in 12 hours or some other arbitrary value.

      For the times when Orion mysteriously stops polling, I'm open to suggestions. Maybe each poller should have a separate process that checks all nodes on all pollers to be sure data is populating. Run it every hour, once a day, whatever. It's not complicated, just check and see if there is SOMETHING from a node in the last hour or so.
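
      To illustrate how small that kind of check could be, here is a minimal sketch in Python. It assumes you can already get each node's most recent sample timestamp from somewhere; the node names, the timestamps, and the `find_stale_nodes` helper are all hypothetical and not part of Orion.

```python
from datetime import datetime, timedelta, timezone


def find_stale_nodes(last_seen, now, max_age=timedelta(hours=1)):
    """Return the sorted names of nodes with no sample newer than max_age.

    last_seen maps a node name to the timestamp of its newest sample,
    or to None if nothing has ever been collected for it.
    """
    stale = []
    for node, timestamp in last_seen.items():
        if timestamp is None or now - timestamp > max_age:
            stale.append(node)
    return sorted(stale)


# Hypothetical sample data: one node silent for two months, one healthy,
# and one that has never reported anything.
now = datetime(2015, 11, 25, 12, 0, tzinfo=timezone.utc)
last_seen = {
    "ups-01": now - timedelta(days=60),
    "core-sw-01": now - timedelta(minutes=5),
    "app-srv-07": None,
}

stale_nodes = find_stale_nodes(last_seen, now)
print(stale_nodes)
```

      The point of the sketch is only that the check looks at data freshness and ignores the node's reported status, so a node that still shows green would be flagged anyway.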

       

      Am I alone in seeing these issues?

        • Re: Anyone ever have trouble with NPM not monitoring itself well?
          lynchnigel

          So the issue is on the server you have assessed? Are the device settings still set up OK?

          What's the polling method: SNMP, WMI, or ping?

          Are all the services on Orion OK? Have you checked that they are all running using the Orion Service Manager?

          • Re: Anyone ever have trouble with NPM not monitoring itself well?
            humejo

            Yeah, I wish it did this out of the box, but it isn't super hard to do.  Check out these posts for a few different methods on how to report on nodes that are no longer responding to SNMP:

             

             

            Alert on Nodes that stopped responding to SNMP

             

            Nodes not responding to SNMP or WMI

             

             

             

            Another way to go about this is changing all of your SNMP nodes to base their status upon SNMP (the default is ICMP). Just go into Manage Pollers (Settings > Manage Pollers), choose "Status & Response Time SNMP" and click "Assign" at the top, Group by "Polling Method" on the left-hand side, select "SNMP", select all of the nodes and click "Enable Poller".  This will cause Orion to base the Up/Down status of nodes on their responding to SNMP.  This means that if a node goes "Down" then it could be because it doesn't respond to ping, or it could just be that the node isn't responding to SNMP.

             

            I would recommend creating a report like in the posts above, since basing status on SNMP has a few downsides.  One is that you won't know why a node is down until you investigate it (is it Ping or just SNMP communication?).  Another is that SNMP queries take a bit longer to respond than a ping request.  Not a big deal on a per-node basis, but multiply that by all of your SNMP nodes and you can end up adding a significant amount of time to your overall polling cycle, possibly resulting in sub-100% Polling Completion.

              • Re: Anyone ever have trouble with NPM not monitoring itself well?
                xpowels

                Excellent answers.  Some are a bit specific to devices that have CPUs, but the more generic ones worked for me.  Also, Tech Support finally sent me a report that will show every single object that is not responding:

                 

                On web reports:

                1. Create a new report
                2. Select custom table
                3. In the Selection method drop-down menu, choose Advanced Database Query, then set the Query type to SQL.

                Use this custom script:

                SELECT DISTINCT
                    n.Caption,
                    cpa.AssignmentName,
                    DATEADD(mi, DATEDIFF(mi, GETUTCDATE(), GETDATE()), cps.DateTime) AS LastPolled
                FROM CustomPollerStatus cps
                JOIN CustomPollerAssignment cpa
                    ON (cps.CustomPollerAssignmentID = cpa.CustomPollerAssignmentID)
                JOIN Nodes n
                    ON (n.NodeID = cpa.NodeID)
                WHERE cps.DateTime < DATEADD(hh, -24, GETUTCDATE())
                ORDER BY 1, 2, 3

                4. Click on preview results and add to layout.
                5. After adding to layout click on Add column and select all objects

                -AssignmentName
                -Caption
                -LastPolled

                6. Click on Submit


                Works perfectly!
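
                For anyone wondering what the query is actually doing, here is a rough Python sketch of the same logic, run against made-up rows instead of the Orion database. It assumes, as the query itself appears to, that the stored poll time is in UTC. The row values, the `stale_assignments` helper, and the -6 hour local offset are all hypothetical.

```python
from datetime import datetime, timedelta


def stale_assignments(rows, now_utc, local_offset, max_age=timedelta(hours=24)):
    """Mirror the report query on in-memory rows.

    rows holds (caption, assignment_name, last_polled_utc) tuples.
    Keeps rows polled before now_utc - max_age (the WHERE clause),
    shifts the timestamp to local time (the DATEADD/DATEDIFF expression),
    drops duplicates (DISTINCT) and sorts by all three columns (ORDER BY).
    """
    cutoff = now_utc - max_age
    result = {
        (caption, assignment, polled_utc + local_offset)
        for caption, assignment, polled_utc in rows
        if polled_utc < cutoff
    }
    return sorted(result)


# Hypothetical rows: a duplicated stale UPS poller and one fresh poller.
now_utc = datetime(2015, 11, 25, 12, 0)
rows = [
    ("UPS-01", "upsBatteryStatus", datetime(2015, 9, 14, 8, 0)),
    ("UPS-01", "upsBatteryStatus", datetime(2015, 9, 14, 8, 0)),
    ("SW-02", "fanSpeed", datetime(2015, 11, 25, 11, 30)),
]

# In the SQL this offset is computed as GETDATE() minus GETUTCDATE();
# here it is simply passed in.
report = stale_assignments(rows, now_utc, local_offset=timedelta(hours=-6))
for caption, assignment, last_polled in report:
    print(caption, assignment, last_polled)
```

                Changing `max_age` corresponds to changing the -24 in the `WHERE` clause of the report.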

                  • Re: Anyone ever have trouble with NPM not monitoring itself well?
                    humejo

                    Yes, that will work, but keep in mind that it will only show you nodes that have custom pollers assigned to them.  If all of your nodes have a custom poller assigned, then good; otherwise, you'll want something a bit more comprehensive.  I have a SWQL query that will tell you when any SNMP node has not had a successful uptime poll in the last X minutes (I use 60 minutes, but you could change it to whatever time frame you want).  I'm surprised support told you to use SQL, since SWQL is really the way to go: any major schema change they make could break custom SQL reports you have.  I don't have the query available to me right now, but I will add it to this response as soon as I do.