12 Replies Latest reply on Oct 31, 2017 1:19 PM by i3scottw

    Cisco UCS trap-based monitoring

    jest4kicks

      Hey All,

       

      I've had a few different discussions regarding Cisco UCS monitoring, and I wanted to share the solution we ultimately came up with.  It builds on content that a lot of different folks have contributed to these forums.

       

      First, the challenge here is that Cisco does not provide any kind of hardware health agent that allows health polling from within the OS.  Your only option is to poll the CIMC (Cisco's equivalent of iLO/DRAC, for the uninitiated).  You can poll the CIMC with a bunch of UnDPs (you can find info on that elsewhere in this forum), but I didn't want to go through and establish warning/error thresholds for every poller.

       

      Enter SNMP traps and the world of Cisco Faults.

       

      Within the CIMC, Cisco maintains a fault table for all hardware issues that have occurred.  The CIMC can also be configured to send SNMP traps whenever a fault is triggered or changed.

       

      Now, before you shy away from SNMP traps and start wondering about polling that fault table (as Cisco actually says you can do), let me stop you.  While a fault will appear in the table when it is active, the table becomes unqueryable once the fault clears.  This means your poller will never see a reset message for that fault.  I actually spoke to Cisco about this, and they agreed it was a bad design.  No word on if or when it will be improved.

       

      So, we're left with SNMP traps, and our desire to alert on them.  To do this, we looked at how the traps come in, and how they are stored in the traps table.  What we found is that a "clear" trap always gets sent when the fault clears.  Knowing this, we designed a database query that looks at the most recent trap for each fault and treats the fault as active unless that last trap was a "clear."
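To make that rule concrete, here is a minimal Python sketch of the same idea, run against a handful of made-up trap rows rather than the real Orion tables. The tuple layout, the sample descriptions, and the `active_fault_nodes` helper are all hypothetical; the one detail borrowed from the query further down is that a clear trap is recognized by a description ending in "Cleared".

```python
from datetime import datetime

# Hypothetical sample rows: (node_id, fault_id, received_at, description).
traps = [
    (1, "1001", datetime(2017, 10, 1, 9, 0), "PSU1 failure"),
    (1, "1001", datetime(2017, 10, 1, 9, 30), "PSU1 failure Cleared"),
    (2, "2002", datetime(2017, 10, 1, 10, 0), "DIMM A1 degraded"),
    (2, "2002", datetime(2017, 10, 1, 10, 5), "DIMM A1 inoperable"),
]


def active_fault_nodes(rows):
    """Return the node IDs whose newest trap for some fault is not a clear."""
    latest = {}  # fault_id -> (received_at, node_id, description)
    for node_id, fault_id, received_at, description in rows:
        # Keep only the most recent trap seen for each fault ID.
        if fault_id not in latest or received_at > latest[fault_id][0]:
            latest[fault_id] = (received_at, node_id, description)
    # A fault counts as active unless its newest trap ends in "Cleared".
    return {
        node_id
        for _, node_id, description in latest.values()
        if not description.endswith("Cleared")
    }


print(active_fault_nodes(traps))  # prints {2}
```

Node 1 drops out because the last trap for its fault is a clear; node 2 stays in because its newest trap still describes a problem.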

       

      Here it is!  The only customization you should need is the community string.  I would recommend using type-specific communities to keep your incoming traps separate.  The reset condition is simply when the query is no longer true.

       

      SQL Condition: Node (SELECT Nodes.NodeID, Nodes.Caption FROM Nodes)

       

      WHERE Nodes.NodeID IN (
        /* Nodes whose most recent trap for a fault is not a "Cleared" trap */
        SELECT tr.nodeId
        FROM [dbo].[Traps] tr
        JOIN [dbo].[TrapVarbinds] trv
          ON tr.trapID = trv.trapID
          AND trv.OIDName LIKE 'cucsFaultDescription%'
          AND trv.OIDValue NOT LIKE '%Cleared'

        /* latest: newest trap time per fault ID (the text after the last '.' in OIDName) */
        JOIN (
          SELECT RIGHT(trv2.OIDName, CASE WHEN CHARINDEX('.', REVERSE(trv2.OIDName)) > 0
                   THEN CHARINDEX('.', REVERSE(trv2.OIDName)) - 1 ELSE 0 END) AS faultID,
                 MAX(tr2.DateTime) AS DateTime
          FROM [dbo].[Traps] tr2
          JOIN [dbo].[TrapVarbinds] trv2
            ON tr2.trapID = trv2.trapID
            AND trv2.OIDName LIKE 'cucsFaultDescription%'
          WHERE tr2.Community = 'CIMCtrap'
          GROUP BY RIGHT(trv2.OIDName, CASE WHEN CHARINDEX('.', REVERSE(trv2.OIDName)) > 0
                     THEN CHARINDEX('.', REVERSE(trv2.OIDName)) - 1 ELSE 0 END)
        ) latest
          ON latest.faultID = RIGHT(trv.OIDName, CASE WHEN CHARINDEX('.', REVERSE(trv.OIDName)) > 0
               THEN CHARINDEX('.', REVERSE(trv.OIDName)) - 1 ELSE 0 END)
          AND latest.DateTime = tr.DateTime

        /* Replace 'CIMCtrap' (here and in the subquery above) with your own community string */
        WHERE tr.Community = 'CIMCtrap'

        GROUP BY
          tr.nodeId,
          RIGHT(trv.OIDName, CASE WHEN CHARINDEX('.', REVERSE(trv.OIDName)) > 0
            THEN CHARINDEX('.', REVERSE(trv.OIDName)) - 1 ELSE 0 END)
      )
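The `RIGHT(... CHARINDEX('.', REVERSE(...)) - 1 ...)` expression that repeats throughout the query is the hardest part to read, so here is a small Python sketch of what it appears to compute: the text after the last `.` in the varbind's `OIDName`, which the query treats as the fault ID. The function name and the sample OID names are made up for illustration; check a few rows in your own TrapVarbinds table to see what the real names look like.

```python
def fault_id_from_oid_name(oid_name: str) -> str:
    """Return the text after the last '.', mirroring the query's RIGHT/CHARINDEX/REVERSE."""
    # When there is no '.', the SQL CASE falls through to length 0,
    # so RIGHT() yields an empty string.
    if "." not in oid_name:
        return ""
    return oid_name.rsplit(".", 1)[1]


# Hypothetical OID names, for illustration only.
print(fault_id_from_oid_name("cucsFaultDescription.12345"))  # prints 12345
print(repr(fault_id_from_oid_name("cucsFaultDescription")))  # prints ''
```

Because both the raise and the clear trap for a given fault carry the same suffix, grouping on it is what lets the query pair them up.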