Acknowledging hardware failure

Question

Recently one of our Dell blades had a fault in one of its memory modules. SAM picked that up nicely and showed a yellow sign on that node. The server has since been fixed, but I still have the yellow sign in SAM. I also chatted with Dell support to check whether anything is still wrong with this server, and there isn't. For normal alerts I have seen an "acknowledge alert" button, but there is none for hardware failures.

So the question is: how can I now reset the node to be healthy again?

DONDERKA · Accepted Answer

Hi,

I presume that your polling method for the blade chassis is SNMP, right?

In that case you can check which status Hardware Health is receiving from the server. The health status of a sensor is updated on every poll, so if the memory is still yellow, the server is most likely still reporting a status that SolarWinds treats as a warning.

Please take a look at these documents, which list the OIDs that SolarWinds monitors on Dell servers for Hardware Health status:

Dell monitored OIDs for Hardware Health

Dell Blade Chassis monitored OIDs for Hardware Health.

You can run an SNMP walk with the tool located in the SolarWinds installation directory: ..\SolarWinds\Orion\SnmpWalk.exe. It collects the values of all accessible OIDs (please use the same community string as in Orion) and stores them in a text file. There you can look up the OID and see for yourself which value the server returns for the memory module.
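If the walk file is large, a small script can do the lookup for you. The following is only a minimal sketch: it assumes each line of the exported file looks like "<oid> = <value>" (the real SnmpWalk.exe layout may differ, so adjust the split accordingly), and the OID used here is an illustrative placeholder that should be replaced with the memory status OID from the documents linked above.

```python
def find_oid_values(walk_lines, oid_prefix):
    """Return (oid, value) pairs for every line whose OID falls under oid_prefix."""
    prefix = oid_prefix.lstrip(".")
    matches = []
    for line in walk_lines:
        # Assumed line layout: "<oid> = <value>". Lines without " = " are skipped.
        oid, sep, value = line.partition(" = ")
        if not sep:
            continue
        oid = oid.strip().lstrip(".")
        # Match the OID itself or any instance below it, but not a sibling
        # such as "...1.50" when the prefix is "...1.5".
        if oid == prefix or oid.startswith(prefix + "."):
            matches.append((oid, value.strip()))
    return matches


# Illustrative data only; in practice read the lines from the exported file, e.g.
#   with open("walk.txt") as fh: walk_lines = fh.read().splitlines()
sample_walk = [
    "1.3.6.1.2.1.1.5.0 = blade01",
    "1.3.6.1.4.1.674.10892.1.1100.50.1.5.1.1 = 3",
    "1.3.6.1.4.1.674.10892.1.1100.50.1.5.1.2 = 4",
]

# Placeholder OID (assumption): take the real one from the linked OID lists.
MEMORY_STATUS_OID = "1.3.6.1.4.1.674.10892.1.1100.50.1.5"

for oid, value in find_oid_values(sample_walk, MEMORY_STATUS_OID):
    print(oid, value)
```

Each printed row is one instance under the OID together with the raw status value the server returned for it.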

The meaning of the status value is here:

[HardwareHealthStatus(Name = "Undefined", OrionStatus = Undefined)] = 0,

[HardwareHealthStatus(Name = "Other", OrionStatus = Unknown)] = 1,

[HardwareHealthStatus(Name = "Unknown", OrionStatus = Unknown)] = 2,

[HardwareHealthStatus(Name = "Ok", OrionStatus = Up)] = 3,

[HardwareHealthStatus(Name = "NonCritical", OrionStatus = Warning)] = 4,

[HardwareHealthStatus(Name = "Critical", OrionStatus = Critical)] = 5,

[HardwareHealthStatus(Name = "NonRecoverable", OrionStatus = Critical)] = 6

The Name is the Dell status name. The OrionStatus is the status level assigned to the sensor in SolarWinds Hardware Health monitoring. The number at the end of each row is the value you will see in the SNMP walk. According to this table, a sensor in the warning state returns the value 4, so that is the value to look for on the memory status OID.
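To translate a value from the walk quickly, the table above can be written as a lookup. This is just a sketch of the same mapping in Python; the function name describe_status is made up for illustration and is not part of any SolarWinds or Dell tool.

```python
# Raw SNMP value -> (Dell status name, Orion status), copied from the table above.
HARDWARE_HEALTH_STATUS = {
    0: ("Undefined", "Undefined"),
    1: ("Other", "Unknown"),
    2: ("Unknown", "Unknown"),
    3: ("Ok", "Up"),
    4: ("NonCritical", "Warning"),
    5: ("Critical", "Critical"),
    6: ("NonRecoverable", "Critical"),
}


def describe_status(raw_value):
    """Return (dell_name, orion_status) for a walk value, or None if it is not in the table."""
    try:
        return HARDWARE_HEALTH_STATUS.get(int(str(raw_value).strip()))
    except ValueError:
        # The value was not a plain integer, so it cannot be looked up.
        return None


print(describe_status("4"))
print(describe_status(3))
```

A value of 4 on the memory status OID therefore explains the yellow sign, while 3 means the server itself reports the module as healthy.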

I hope this will help.

Dalibor