
Anyone ever have trouble with NPM not monitoring itself well?

Here's a couple examples.  Recently I found this when looking at a UPS:

[Screenshot: poller.png]

Um, it's not September anymore. So what is happening is that Orion is no longer getting data back when polling this node.  Am I getting an error?  Does the Node have a red flashing box?  Am I getting a daily report that something is amiss?  No on all counts. This is frustrating beyond my ability to express it.

So, how many nodes have this issue? According to tech support, there is no way to run a report to find out. I suppose I could just pull up all 2000 of my nodes one at a time to check.  Seriously.

Last night I had another issue in the same vein but somewhat different. There was some amount of service stopping and starting on my main server, and several nodes just stopped being polled. How many? Again, no way to know. They just appear green, but there are no statistics being collected. No Application stats in SAM. No CPU. No Memory. No Disk. No network stats including latency, which leads me to believe no polling (not even ping) was occurring. Error? Down or unknown node? Nope. All happy green. Nothing wrong here. *SIGH* Rebooting an additional poller fixed the issue, but I was just lucky to stumble upon it before the Thanksgiving holiday.

So there's the rant.  Now let's talk solutions:

For UnDP issues, Orion should 1) change the node graphic to show a flashing red box, as it does when an interface is down (or make this a global check box option), and 2) provide a report showing the Top 10 (or all) nodes with UnDP pollers that have not updated in 12 hours or some other arbitrary interval.

For the times when Orion mysteriously stops polling, I'm open to suggestions. Maybe each poller should have a separate process that checks all nodes on all pollers to be sure data is populating. Run it every hour, once a day, whatever. It's not complicated: just check whether there is SOMETHING from a node in the last hour or so.
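The "is there SOMETHING from this node in the last hour" check could be sketched roughly as below. This is only an illustration in Python, not anything Orion ships: the function name `find_stale_nodes`, the node names, and the idea of feeding it a dictionary of last-seen timestamps are all assumptions. How you would actually collect those timestamps from Orion is left open.

```python
from datetime import datetime, timedelta, timezone


def find_stale_nodes(last_seen, now, max_age=timedelta(hours=1)):
    """Return the names of nodes whose newest data point is older than max_age.

    last_seen maps a node name to the datetime of its most recent statistic,
    or to None if nothing has ever been collected for that node.
    """
    cutoff = now - max_age
    stale = [name for name, ts in last_seen.items() if ts is None or ts < cutoff]
    return sorted(stale)


if __name__ == "__main__":
    # Hypothetical data: node names and timestamps are made up for illustration.
    now = datetime.now(timezone.utc)
    last_seen = {
        "ups-01": now - timedelta(days=60),      # silently stopped long ago
        "switch-01": now - timedelta(minutes=5), # healthy
        "router-01": None,                       # never reported
    }
    for name in find_stale_nodes(last_seen, now):
        print("No recent data from", name)
```

Treating a missing timestamp as stale is deliberate: a node that shows green but has never produced data is exactly the case that goes unnoticed.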

Am I alone in seeing these issues?

13 Replies

Yeah, I wish it did this out of the box, but it isn't super hard to do.  Check out these posts for a few different methods on how to report on nodes that are no longer responding to SNMP:

Alert on Nodes that stopped responding to SNMP

Nodes not responding to SNMP or WMI

Another way to go about this is changing all of your SNMP nodes to base their status on SNMP (the default is ICMP). Just go into Manage Pollers (Settings > Manage Pollers), choose "Status & Response Time SNMP" and click "Assign" at the top, Group by "Polling Method" on the left-hand side, select "SNMP", select all of the nodes and click "Enable Poller". This will cause Orion to base the Up/Down status of nodes on whether they respond to SNMP. This means that if a node goes "Down", it could be because it doesn't respond to ping, or it could just be that the node isn't responding to SNMP.

I would recommend creating a report like in the posts above, since basing status on SNMP has a few downsides. One is that you won't know why a node is down until you investigate it (is it ping or just SNMP communication?). Another is that SNMP queries take a bit longer to respond than a ping request. Not a big deal on a per-node basis, but multiply that by all of your SNMP nodes and you can end up adding a significant amount of time to your overall polling cycle, possibly resulting in sub-100% Polling Completion.

Excellent answers. Some are a bit specific to devices that have CPUs, but the more generic ones worked for me. Also, Tech Support finally sent me a report that will show every single object that is not responding:

On web reports:

1. Create a new report
2. Select custom table
3. On the Selection method drop down menu > Choose > Advanced Database Query > Query type SQL.

Use this custom script:

SELECT DISTINCT
    n.Caption,
    cpa.AssignmentName,
    DATEADD(mi, DATEDIFF(mi, GETUTCDATE(), GETDATE()), cps.DateTime) LastPolled
FROM CustomPollerStatus cps
JOIN CustomPollerAssignment cpa ON (cps.CustomPollerAssignmentID = cpa.CustomPollerAssignmentID)
JOIN Nodes n ON (n.NodeID = cpa.NodeID)
WHERE cps.DateTime < DATEADD(hh, -24, GETUTCDATE())
ORDER BY 1, 2, 3

4. Click on preview results and add to layout.
5. After adding to layout click on Add column and select all objects

-AssignmentName
-Caption
-LastPolled

6. Click on Submit


Works perfectly!
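For anyone wondering what the two date expressions in that query do: the WHERE clause compares the stored timestamp (which the query treats as UTC) against "24 hours before now, in UTC", and the DATEADD/DATEDIFF pair in the SELECT list shifts the stored value by the server's current UTC offset so LastPolled displays in local time. A rough Python sketch of the same arithmetic follows. The function names are made up for illustration, and the rounding is only an approximation of how DATEDIFF counts minutes.

```python
from datetime import datetime, timedelta


def shift_utc_to_local(stored_utc, utc_now, local_now):
    """Approximate DATEADD(mi, DATEDIFF(mi, GETUTCDATE(), GETDATE()), stored_utc).

    The offset is the whole-minute difference between the server's local
    clock and its UTC clock at the moment the query runs.
    """
    offset_minutes = round((local_now - utc_now).total_seconds() / 60)
    return stored_utc + timedelta(minutes=offset_minutes)


def is_stale(stored_utc, utc_now, hours=24):
    """Approximate cps.DateTime < DATEADD(hh, -24, GETUTCDATE())."""
    return stored_utc < utc_now - timedelta(hours=hours)


if __name__ == "__main__":
    # Both clocks are read at run time, as GETUTCDATE() and GETDATE() would be.
    utc_now = datetime.utcnow()
    local_now = datetime.now()
    last_poll_utc = utc_now - timedelta(hours=30)  # hypothetical stored value
    if is_stale(last_poll_utc, utc_now):
        print("LastPolled (local):", shift_utc_to_local(last_poll_utc, utc_now, local_now))
```

Note that the offset is the one in effect when the report runs, so a poll recorded on the other side of a daylight-saving change will display an hour off.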


Yes, that will work, but keep in mind that it will only show you nodes that have custom pollers assigned to them. If all of your nodes have a custom poller assigned, then good; otherwise, you'll want something a bit more comprehensive. I'm surprised support told you to use SQL when SWQL is really the way you should go, since any major schema change they make could potentially break custom SQL reports you have. I have a SWQL query that will tell you when any SNMP node has not had a successful uptime poll in the last X minutes (I use 60 minutes, but you could change it to whatever time frame you want). I don't have it available to me right now, but I will add it to this response as soon as I do.


So the issue is on the server you have assessed? Are the device settings still set up OK?

What's the polling method: SNMP, WMI, or ping?

Are all the services on Orion OK? Have you checked that they are all running using the Orion Service Manager?


In the first case, yes, everything is running perfectly. The problem is that I moved from a physical to a virtual platform and changed the IP of the server. The devices that are failing to be polled need their SNMP settings changed to allow requests from the new IPs. Easy change. But how do I determine which systems I need to fix? Orion gives me no way to know which UnDP pollers are failing without manually looking at the last successful poll of each system.

As for the other case, I'm not sure of the root cause of the polling failure. When I find that polling hasn't been happening for hours, I try to quickly get it fixed and don't sit on the phone for hours with support. A reboot got the polling going again, so that's done. The Windows server team were the ones who shut down the services in the first place to stop getting alerts during maintenance. I have corrected them by letting them know how to disable the alerts, but this isn't the only instance of the polling starting to fail.


What you need to do is analyse which nodes have not responded to SNMP since you changed the IP address. If you have the web reports in your version, constructing this report will give you the information you need to pinpoint which devices need their SNMP modified:

1. Click 'Reports' under the Home tab (default menu configuration assumed).

2. Click 'Manage Reports'

3. Create a new report.

4. In the first window, choose the advanced selector, and select 'nodes' as what you are reporting on. Then in the where clause, search for the field 'Last Database Sync', choose 'is less than', put in today's date, and click 'add to layout'.

5. Add in a custom table, edit it, selecting at least 'caption' and 'ip address' as the fields you want to list in the table.

6. Select the other report options as required.

When you run this report, it'll show you only the nodes which have not responded via SNMP today.

If you have issues with the report, PM me with your email address and I'll send you a template you can import into your environment.

- Jez Marsh

Well, when I tried it, I only got back nodes that were unmanaged.  I guess since ping is still working, the database sync field must be getting updated. Great idea though!


Hmm... that should have worked for SNMP, not just ping updates.

Last time I did something similar was in NPM 10.x. It's possible they've altered the field names in 11.x. I'll have another think!

- Jez Marsh

There's a table called Orion.NPM.CustomPollerStatus with a DateTime field in it.

This is the select statement for it:

SELECT CustomPollerAssignmentID, DateTime, Rate, Total, RawStatus, Status, RowID, Description

FROM Orion.NPM.CustomPollerStatus


Obviously you would have to marry that up with the custom poller name, which I think this will help with:

SELECT CustomPollerID, UniqueName, Description, OID, MIB, SNMPGetType, NetObjectPrefix, GroupName, PollerType, CustomPollerParserID, Format, Enabled, IncludeHistoricStatistics, Unit, TimeUnitID, TimeUnitQuantity, DefaultDisplayTimeUnitID, LastChange, PollInterval, ColumnNumber

FROM Orion.NPM.CustomPollers
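One way to "marry up" the two result sets, once you have them exported as rows, is sketched below in Python. The status table only carries CustomPollerAssignmentID, so this sketch assumes you also pull the assignment rows (the CustomPollerAssignment table used in the SQL report earlier in the thread) and that each assignment row carries a CustomPollerID linking it to the poller definition. That linking column is an assumption here, as is the function name `name_stale_pollers`.

```python
from datetime import datetime


def name_stale_pollers(status_rows, assignment_rows, poller_rows, cutoff):
    """List (NodeID, poller UniqueName, last DateTime) for statuses older than cutoff.

    Each argument is a list of dicts keyed by column name. Assumes the
    assignment rows contain CustomPollerAssignmentID, CustomPollerID and NodeID.
    """
    names = {p["CustomPollerID"]: p["UniqueName"] for p in poller_rows}
    assignments = {a["CustomPollerAssignmentID"]: a for a in assignment_rows}

    stale = []
    for status in status_rows:
        if status["DateTime"] >= cutoff:
            continue  # polled recently enough
        assignment = assignments.get(status["CustomPollerAssignmentID"])
        if assignment is None:
            continue  # orphaned status row, nothing to report against
        name = names.get(assignment["CustomPollerID"], "<unknown poller>")
        stale.append((assignment["NodeID"], name, status["DateTime"]))
    return sorted(stale)
```

Doing the join in the query itself would be tidier. The sketch is only meant to show which keys tie the three sets of rows together.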


If you go into the node and edit its settings, you can test polling for the device from there, which should help you determine whether it is working.

[Screenshot: pastedImage_0.png]


For 2000 nodes?  Surely, you jest......


Create a report first and check. Then, depending on how many there are, adjust your strategy to suit.
