
Node is Up. Overall Hardware Status (Node) 'Overall Hardware Status' has state: Unknown.

We'd previously been using a different tool to monitor the hardware status of our network devices but are in the process of migrating that function to Solarwinds NPM. Out of the ~1k nodes that we currently have onboarded into Solarwinds, just under half are showing an overall hardware status of "Node is Up. Overall Hardware Status (Node) 'Overall Hardware Status' has state: Unknown." All of the nodes showing an "Unknown" hardware status are showing valid CPU, memory and interface utilization information. The vendor name (Cisco) and model are also recognized.

I have read through a number of posts with similar topics and have tried all of the suggestions that I could find but nothing thus far seems to have helped.  There doesn't seem to be a common theme among the devices that are showing in an unknown state.  So far only deleting and re-adding the nodes to Solarwinds seems to resolve the issue (which isn't practical given the number of devices involved and the loss of historical data).  Below are some of the things that I have tried.

  • Rediscovered the device
  • Repolled the device
  • Verified that the SNMP community string passes the test in node settings
  • Verified that the OID is viewable via the MIB browser
  • Compared the SNMP configuration against known working devices of the same model & IOS version
  • Compared the node configuration in Solarwinds against known working devices of the same model & IOS version
  • Restarted services on all pollers
  • Changed the Hardware Health Polling Method from "Use global setting" to "CISCO-ENVMON-MIB" (The default is set to CISCO-ENTITY-SENSOR-MIB) for the node
  • Installed the most current set of MIBs from the Customer Portal (restarted services after install)
  • Verified that the MSMQ directory wasn't >1GB
  • Deleted and re-added the node in Solarwinds <--- This is the only step that I have found which resolves the issue thus far.

Any help would be greatly appreciated!

  • You mentioned pollers. Are the nodes that are having the issue all on the same poller?

  • The issue spans multiple polling engines. I did change the preferred Cisco MIB to CISCO-ENVMON-MIB, which gradually brought the number of affected nodes down by ~100 (just shy of 400 affected now). Many of the affected devices have a low delay to the poller, so it doesn't seem to be an SNMP timeout issue.

  • Sounds like you have done your due diligence on this matter. I would open a ticket with support.

  • Hi, Have you tried disabling the Hardware sensors for one of the devices to see if that helps to resolve the issue?

  • Just some ideas

    1) Polling Interval may need to be adjusted for these devices

    2) Have you tried reloading/restarting snmp services on the device?

    3) Are they all going to the same poller? If they don't have one of the pollers' IPs defined in their configs, it acts goofy

  • We had exactly the same issue.

    New customer - they had been using Solarwinds for some years and called us in to do upgrades etc.

    I was then assigned to the customer and, in poking around to see how things were set up, I noticed around half (1400) of their devices showed "Undefined" in the Hardware Health pie chart.

    Investigating I tried all the things you did - and indeed the only thing that seemed to "work" was to delete the node and re-add it. Or, adding the node again on a different Polling Engine showed the Hardware sensors for the "new" node but the original still had no sensors and Node Status showed "Overall Hardware Status Has State: Unknown".

    Opened a support ticket.

    No real help there. They suggested rebooting the Main Poller and providing diagnostics. Then they came back with instructions to delete and re-install a bunch of things, based on "errors" seen in files, most of which were from up to 12 months prior and not related at all (they pointed to issues for which we had already opened tickets and resolved). By the time they asked for the "delete and re-install", we had already resolved the problem.

    We now have all nodes showing HW sensors and none showing "Undefined" or "Overall Hardware Status Has State: Unknown".

    How? And what was the issue?

    Disabled Hardware Sensors.

    All HW sensors on all 1400 devices were disabled (Manage Hardware Sensors).

    I found this by:

    1) Choosing a node that showed "Overall Hardware Status Has State: Unknown".

    2) Trying everything to get HW showing for that node (list of steps in 1st post + extras - like "List Resources", deselecting HW Sensors, saving, then "List Resources" and re-selecting HW Sensors and saving etc.).

    3) Added same node (IP) on another polling engine. Selected HW Sensors when adding.

    4) Checked "new" node and HW sensors all there and working

    5) Checked every table in the NPM (Orion) Database that looked vaguely related to HW and compared entries for the "old" node to entries for the "new" node.

    6) Found 1 significant difference in 2 tables - field labelled "IsDisabled" in HWH_HardwareItem and  APM_HardwareItem (This last one is a View table). In both tables, the sensor entries for the "old" node were set to "1" - or on/disabled.

    7) Went to Manage Hardware Sensors - filtered to see all sensors for "old" node - sure enough they were all Disabled.

    8) Re-enable all HW Sensors for "old" node.

    9) Magic! "Old" (original) node now shows all sensors and status and Node no longer has "Overall Hardware Status Has State: Unknown". Undefined count on Hardware Health pie chart reduced by 1.

    10) In Manage Hardware Sensors - re-enabled all Disabled sensors.

    11) Problem solved - Hardware Health "Undefined" number went from ~1400 to below 100 (and these when checked were all "valid" - nodes down or unmanaged)
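    The database comparison in steps 5 and 6 can be sketched as a query that lists nodes whose recorded sensors are all flagged as disabled. This is only an illustration: it uses an in-memory SQLite table as a stand-in for the real Orion database, and apart from the table name HWH_HardwareItem and the IsDisabled field mentioned above, the columns (NodeID, Name) and the sample rows are assumptions made up for the example.

```python
import sqlite3

# Simplified, hypothetical stand-in for the table named in the post.
# Only "HWH_HardwareItem" and "IsDisabled" come from the thread;
# NodeID and Name are assumed columns, and the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE HWH_HardwareItem (NodeID INTEGER, Name TEXT, IsDisabled INTEGER)"
)
conn.executemany(
    "INSERT INTO HWH_HardwareItem VALUES (?, ?, ?)",
    [
        (1, "Fan 1", 1),           # "old" node: every sensor disabled
        (1, "Power Supply 1", 1),
        (2, "Fan 1", 0),           # "new" node: sensors enabled
        (2, "Power Supply 1", 0),
        (3, "Fan 1", 1),           # mixed node: one sensor disabled
        (3, "Temperature", 0),
    ],
)


def nodes_with_all_sensors_disabled(connection):
    """Return NodeIDs whose sensor rows are all flagged IsDisabled = 1."""
    rows = connection.execute(
        """
        SELECT NodeID
        FROM HWH_HardwareItem
        GROUP BY NodeID
        HAVING SUM(IsDisabled) = COUNT(*)
        ORDER BY NodeID
        """
    ).fetchall()
    return [node_id for (node_id,) in rows]


print(nodes_with_all_sensors_disabled(conn))
```

    With the sample rows above, only node 1 is reported, which mirrors the "old" node whose sensors all had IsDisabled set to 1. Against a real installation the column names and SQL dialect would need to be checked first.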

    Checking back through Audit logs I found that someone had bulk disabled all the sensors on these 1400 nodes in one hit (all Audit entries had same date and timestamp). This was done 8 months ago and no-one could explain why this would have been done or how. My guess is the person was trying to stop a particular sensor or sensor type from being monitored/flagged (there are a few that are problematical) and did not realise they had clicked on the "Select all" box or missed the "do you want to apply to all" prompt or something similar.
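    The audit-log check described above (many entries sharing one timestamp) can be sketched as a simple grouping. The record layout, account name, action text and timestamps below are hypothetical and exist only to make the example runnable; they are not taken from the actual audit table.

```python
from collections import Counter
from datetime import datetime

# Hypothetical audit records: (timestamp, account, action).
# All values are invented for illustration.
audit_events = [
    (datetime(2000, 1, 1, 10, 30, 0), "someuser", "sensor disabled"),
    (datetime(2000, 1, 1, 10, 30, 0), "someuser", "sensor disabled"),
    (datetime(2000, 1, 1, 10, 30, 0), "someuser", "sensor disabled"),
    (datetime(2000, 2, 5, 9, 0, 0), "someuser", "sensor disabled"),
    (datetime(2000, 2, 5, 9, 0, 0), "someuser", "node added"),
]


def find_bulk_changes(events, action, threshold):
    """Return {timestamp: count} for timestamps with at least
    `threshold` events of the given action."""
    counts = Counter(ts for ts, _account, act in events if act == action)
    return {ts: n for ts, n in counts.items() if n >= threshold}


bulk = find_bulk_changes(audit_events, "sensor disabled", threshold=3)
for ts, n in bulk.items():
    print(f"{n} sensors disabled at {ts}")
```

    A single timestamp with a large count is the pattern that pointed to a one-hit bulk disable in this case.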

    So - check if the sensors for your device(s) are disabled. This is not as straightforward as it seems as (in my opinion) the Manage Hardware Sensors doesn't show what you might expect. If you select a grouping from the dropdown panel/frame on left, you don't seem to get any view that includes all the sensors on a particular device - I found the only way to see ALL sensors on a device (and therefore their status) was to select No Grouping and then plug in Parent (Node) name in search box.

    I have updated our ticket saying that:

    1) Why does “Hardware Health Overview” list the device as “Undefined” – patently it is not – it has just had all “defined”/Discovered/selectable sensors disabled and the “Undefined” state is sending one down a path looking why Solarwinds suddenly cannot “see” any sensors that previously it could “see”.
    2) Why does the Node status list the device as “Hardware Status Unknown”? Once again – the status really is known – we knew we had sensors that were being reported on, we know those sensors “exist” (they are in the DB) and we “know” that their status is “disabled” (the DB field entry tells us this). The Node status here is misleading (and has no pointer/flag etc. that would prompt one to check the “Manage Hardware Sensors”).
    3) All statuses/errors seem to imply that Solarwinds has an issue “finding” sensors or polling sensors – when in fact it “knows” sensors are there but disabled. The statuses/errors need to be changed to reflect this.

    Also I realised - when you execute "List Resources" and deselect Hardware Monitoring, Solarwinds is not deleting the "discovered" sensors related to the node from the DB (or the "state" of them - i.e. disabled) and when you List Resources again and select the Hardware Monitoring, it "rediscovers" the DB entries and the "old" values/states of those - so if you thought that turning off and turning on here would "start fresh" it appears we are misled again.

    Anyhow. Hope the above helps with your issues and (hopefully) is the same cause - then it's an easy fix.

  • Update to my post.

    The response to my questions from Support was:

    1) Basically the status of 'Overall Hardware Status Has State: Unknown' is "working as designed" and "it seems to be an expected behaviour that all the sensors are disabled, as there is no separate status for it."

    2) "In current implementation the hardware health status is not a status of HWH poller, but the overall status of all monitored sensors. If all sensors are disabled, then there is no way how to calculate the status and it is Unknown.

    If Hardware Health poller is disabled, we don't remove any data as we do not want to lose history when it will be re-enabled later.

    With the above, I will proceed to move this ticket to a Feature Request, and will be archiving it from my end."

    So - ticket is now closed and (apparently) the request to have the status correctly show something (like - "Hardware Monitored but all sensors disabled") more meaningful instead of "Overall Hardware Status Has State: Unknown" is a Feature Request.

    No idea how I can see/track "Feature Requests" or if indeed this means it will ever be looked at/changed/fixed.
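    Support's explanation (the overall status is derived from the monitored sensors, so with every sensor disabled there is nothing to calculate from) can be modelled roughly as follows. This is a sketch of the described behaviour, not SolarWinds' actual implementation; the Sensor class, the status names and the "worst status wins" roll-up are assumptions for illustration.

```python
from dataclasses import dataclass

# Assumed severity ordering for the example; not the product's real status codes.
SEVERITY = {"Up": 0, "Warning": 1, "Critical": 2}


@dataclass
class Sensor:
    name: str
    status: str
    is_disabled: bool = False


def overall_hardware_status(sensors):
    """Roll sensor statuses up into one node-level status.

    Disabled sensors are ignored. If nothing is left to evaluate,
    the result is "Unknown", matching the behaviour support described.
    """
    monitored = [s for s in sensors if not s.is_disabled]
    if not monitored:
        return "Unknown"
    # Assumption: the most severe monitored sensor determines the result.
    return max(monitored, key=lambda s: SEVERITY[s.status]).status


all_disabled = [
    Sensor("Fan 1", "Up", is_disabled=True),
    Sensor("Power Supply 1", "Up", is_disabled=True),
]
re_enabled = [
    Sensor("Fan 1", "Up"),
    Sensor("Power Supply 1", "Warning"),
]

print(overall_hardware_status(all_disabled))
print(overall_hardware_status(re_enabled))
```

    In this model a node with healthy but disabled sensors is indistinguishable from a node with no usable sensor data, which is the ambiguity the feature request is asking to have removed.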

  • This helped me but my problem was the opposite. I had virtual machines with no hardware and "Hardware Health Sensors" was enabled. Unchecking it fixed the hardware status undefined message.