
Hardware Health Monitoring for Cisco UCS C-series

Cisco does not offer an OS agent for monitoring hardware health. Instead, Cisco recommends polling the CIMC (the BMC) via SNMP with the CISCO-UNIFIED-COMPUTING-MIB.

ftp://ftp.cisco.com/pub/mibs/ucs-mibs/CISCO-UNIFIED-COMPUTING-COMPUTE-MIB.my

11 Comments
Level 7

Could you please explain how to use this to monitor a Cisco UCS server?

Thanks

Level 11

These OIDs actually have to be polled through the CIMC (Cisco's remote management card, similar to a DRAC or iLO). The idea is that you assign these Universal Device Pollers to the CIMC nodes.

While this will give you some information, I've found it incredibly tedious to try to build custom alerts around this MIB.

Instead, I'm looking at ways to use CIMC fault reporting. There's a separate MIB that reports current fault information. The problem is that it stops reporting a fault once the fault is gone, so SolarWinds never actually sees the fault "clear." I had a call with Cisco a couple of days ago to discuss ways this could be improved.
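Because cleared faults simply disappear from the table, one workaround is to remember the previous poll and treat any row that vanished as a clear. A minimal sketch, assuming each poll has already been collected as a mapping of fault-table row index to severity code; the function name and data shape are hypothetical and not part of SolarWinds or the MIB.

```python
from typing import Dict, Set, Tuple

CLEARED = 0  # severity code the fault table uses for "Cleared"


def diff_fault_polls(
    previous: Dict[str, int], current: Dict[str, int]
) -> Tuple[Set[str], Set[str]]:
    """Compare two polls (row index -> severity) and return (new, cleared) row indexes.

    Hypothetical helper: a row present in the previous poll but missing from the
    current one is treated as cleared, because the CIMC stops returning it.
    """
    # Rows that appeared since the last poll with a non-cleared severity.
    new_faults = {
        idx for idx, sev in current.items() if idx not in previous and sev != CLEARED
    }
    # Rows that vanished from the table entirely.
    cleared = {idx for idx in previous if idx not in current}
    # Rows still present but now explicitly reporting 0 (Cleared).
    cleared |= {
        idx
        for idx, sev in current.items()
        if sev == CLEARED and previous.get(idx, CLEARED) != CLEARED
    }
    return new_faults, cleared
```

For example, going from `{"101": 6, "102": 5}` to `{"102": 5, "103": 4}` reports row 103 as new and row 101 as cleared. The catch is that the previous poll has to be stored somewhere between runs, which is exactly the state the poller itself does not keep.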

Alternatively, I'm also looking at using the fault SNMP traps that the CIMC can generate.  The trick then is configuring NPM alerting around traps.  Refer to the comments in this thread for more info on that.  https://thwack.solarwinds.com/ideas/3580#start=50

Level 11

What sort of headway have you all made on this? Just like you, I'm stuck trying to develop a process to watch for hardware failures. We have several C220s, and just the other day a couple of drives went bad. I'm fairly certain these are standalone boxes not managed by UCS Manager. Here is where I am so far:

1 - I am polling the OS (Windows) and have SNMP configured at that level.

2 - I am also polling the CIMC via SNMP as a separate node, with UCS polling enabled, and most of the OIDs in the picture below test successfully.

[Image: CIMC_OID.JPG, the CIMC OIDs being tested]

Are you guys getting any good data with accurate triggers/resets on hardware? These fault OIDs are overwhelming!

I've also read a thread or two about folks installing some IBM tools at the OS level, but I can't seem to find that thread this morning.

Level 11

Hey Shack,

Just reading your comments and train of thought, I can tell you're running through the same sequence of attempts that I did.  Tragedy + Time = Comedy?

I wrote a thread about how we ultimately solved this.

Cisco UCS trap-based monitoring

Cliff notes: you need to configure your CIMCs to send traps to SolarWinds. (I recommend using a community string just for Cisco CIMC traps.) You then configure a custom SQL alert to trigger based on the most recent occurrence of those traps.
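The "most recent occurrence" idea boils down to one grouped query: for each node, find the latest trap on the dedicated community string since some cutoff. A minimal sketch using Python's built-in sqlite3 against a hypothetical `traps(node, community, received_at, message)` table; this is NOT the Orion database schema, and the real custom SQL alert would have to target whatever tables your NPM version stores traps in.

```python
import sqlite3

# Hypothetical stand-in for the trap store; the real Orion tables differ.
SCHEMA = """
CREATE TABLE traps (
    node        TEXT NOT NULL,
    community   TEXT NOT NULL,
    received_at TEXT NOT NULL,  -- ISO 8601 text, so string comparison orders by time
    message     TEXT
);
"""

# Latest trap per node on one community string, received at or after a cutoff.
LATEST_TRAP_SQL = """
SELECT node, MAX(received_at) AS last_seen
FROM traps
WHERE community = ? AND received_at >= ?
GROUP BY node
ORDER BY node;
"""


def nodes_to_alert(conn: sqlite3.Connection, community: str, cutoff: str):
    """Return (node, last_seen) pairs that received a matching trap since cutoff."""
    return conn.execute(LATEST_TRAP_SQL, (community, cutoff)).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO traps VALUES (?, ?, ?, ?)",
        [
            ("cimc-01", "cimc-traps", "2024-05-01T10:00:00", "drive fault"),
            ("cimc-01", "cimc-traps", "2024-05-01T10:05:00", "drive fault"),
            ("cimc-02", "public", "2024-05-01T10:06:00", "unrelated"),
            ("cimc-03", "cimc-traps", "2024-04-30T09:00:00", "old fault"),
        ],
    )
    print(nodes_to_alert(conn, "cimc-traps", "2024-05-01T00:00:00"))
    # → [('cimc-01', '2024-05-01T10:05:00')]
```

Filtering on the community string is where the dedicated CIMC community pays off: unrelated traps (like `cimc-02` above) never match, and old traps fall outside the cutoff window.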

Hope that helps!

Level 11

I may have had a breakthrough today. If I poll OID 1.3.6.1.4.1.9.9.719.1.1.1.1.20 as a table and apply it to the CIMC node in NPM, I get this:

[Image: pastedImage_0.png, the table of values returned for that OID]
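To check the same column outside NPM, the output of a table walk can be turned into row-index/value pairs. A minimal sketch, assuming the lines come from something like net-snmp's `snmpwalk -On` and look like `.1.3.6.1.4.1.9.9.719.1.1.1.1.20.<index> = INTEGER: <value>`; the function and constant names are hypothetical.

```python
import re
from typing import Dict

# The column polled in the post above.
SEVERITY_COLUMN = "1.3.6.1.4.1.9.9.719.1.1.1.1.20"

# Assumed line format: ".1.3.6.1.4.1.9.9.719.1.1.1.1.20.42 = INTEGER: 6"
LINE_RE = re.compile(r"^\.?(?P<oid>[\d.]+)\s*=\s*INTEGER:\s*(?P<value>-?\d+)\s*$")


def parse_walk(output: str, column: str = SEVERITY_COLUMN) -> Dict[str, int]:
    """Map each row index under `column` to its integer value."""
    prefix = column + "."
    rows: Dict[str, int] = {}
    for line in output.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip blank lines and non-INTEGER values
        oid = m.group("oid")
        if oid.startswith(prefix):
            # Whatever follows the column OID is the row index.
            rows[oid[len(prefix):]] = int(m.group("value"))
    return rows
```

Keying on the row index keeps each fault row distinct, so the same data can be compared from one poll to the next.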

Then I found this below. In my case, the 6 above corresponds to the actual bad drive reported by the CIMC. My question is why the rest of the rows show 5s. Anyway, I believe the new drive is going in today, so I'll keep an eye on this to see whether that 6 resets to 0. I hope it does. Any experience down this specific path?

UCS Fault Table fault severity codes are as follows:

0: Cleared
1: Info
3: Warning
4: Minor
5: Major
6: Critical
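Turning the raw numbers into labels makes a poll like the one above easier to read (one Critical row, several Major ones). A minimal sketch using exactly the codes listed above; anything not in that list is reported as unknown rather than guessed. The names are hypothetical.

```python
from typing import Dict

# Fault severity codes as listed above.
FAULT_SEVERITY = {
    0: "Cleared",
    1: "Info",
    3: "Warning",
    4: "Minor",
    5: "Major",
    6: "Critical",
}


def severity_label(code: int) -> str:
    """Return the label for a severity code, or 'Unknown (n)' if it is not listed."""
    return FAULT_SEVERITY.get(code, f"Unknown ({code})")


def summarize(rows: Dict[str, int]) -> Dict[str, int]:
    """Count fault-table rows (row index -> severity code) per severity label."""
    counts: Dict[str, int] = {}
    for code in rows.values():
        label = severity_label(code)
        counts[label] = counts.get(label, 0) + 1
    return counts
```

For example, `summarize({"101": 6, "102": 5, "103": 5})` gives `{"Critical": 1, "Major": 2}`.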

Level 8

Can we filter the results? For example, I want to display only critical and major faults in SolarWinds.
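Whatever the SolarWinds side ends up looking like, the underlying logic is a filter on the severity code: keep only 5 (Major) and 6 (Critical) from the list in this thread. A minimal sketch over row-index/severity pairs; the function name is hypothetical and this is not a built-in SolarWinds feature.

```python
from typing import Dict, Iterable

# Severity codes for Major and Critical in the UCS fault table.
MAJOR, CRITICAL = 5, 6


def filter_faults(
    rows: Dict[str, int], keep: Iterable[int] = (MAJOR, CRITICAL)
) -> Dict[str, int]:
    """Return only the rows whose severity code is in `keep`."""
    wanted = set(keep)
    return {idx: sev for idx, sev in rows.items() if sev in wanted}
```

Passing a different `keep` (for example `[CRITICAL]`) narrows the view further without changing the function.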

Level 20

Out-of-the-box support in Orion NPM for monitoring UCS, DNA Center, ACI, and Firepower would be a nice addition.

Product Manager

We have a SAM beta starting soon; you can see some of the items that may be included here. Please ping me or James Barnes if you are interested and available to participate and give us input and feedback.

Product Manager

Yep! Just let one of us know!

Product Manager

SAM 6.8 provides updated UCS monitoring; see the SAM 6.8 Release Notes. Please let us know how you like the improvements!