Displaying Cause of Component Events

This has been a thorn in my side with SAM forever, and I posted about it in 2014. There still doesn't seem to be a way to simply show why a component is in a "Critical", "Down", or "Warning" state. When looking at a component's page, we can see the event list:

pastedImage_1.png

You can also display this on another page with the Application Status resource. What you can't do is show WHY the component is in that state. The example shot above shows that SQL Server is the component, and in my case this is usually because the SQL Server has high memory or CPU usage. The only way to see this is to drill down to the component page itself, where it actually shows that. It would be extremely helpful if you could display this info with the component event, i.e.:

Component "SQL Server" 98% memory. Telling me something is Warning or Critical is not helpful if I have to take extra steps to find out why. If anyone knows of a way to do this, I would love the solution.

  • We had trouble with that too. We changed all of our alerts from Alert on Application to Alert on Component, then dug through the available variables to find more valuable statistics. This is an example of a SQL lock alert. We put the desired information in three places: the alert that goes to the teams, the "Message displayed when the alert is triggered" field, and the action that writes the event to the NetPerfMon event log. This is part of the alert message text.

    AppInsight for SQL Component Details:

    Node Name:  ${N=SwisEntity;M=Application.Node.DisplayName}

    Instance Name: ${N=SwisEntity;M=Application.ApplicationAlert.ApplicationName}

    IP Address:  ${N=SwisEntity;M=Application.Node.IP_Address}

    Component Name:  ${N=SwisEntity;M=ComponentAlert.ComponentName}

    Current Value:  ${N=SwisEntity;M=ComponentAlert.StatisticData}

    Current State:  ${N=SwisEntity;M=ComponentAlert.ApplicationAvailability}.

    Component Description:  ${N=SwisEntity;M=ComponentAlert.UserDescription}

    The component description is key for us in the AppInsight alerts, as the descriptions for the component and the recommendations are all in there. But the current value and state may be what you are asking about. Write them to the log and you can find them in Message Center. Write them in the alert and they are right there as well.

    Disclaimer:

    When we do this, we have to specify which components we want to alert on in the trigger condition, but since our SQL team only wanted alerts on a subset of the components, and the ability to look at charts of the others, this worked for us.

    pastedImage_0.png

  • This is the SQL custom variable I wrote a while back to tackle this situation in my alert messages. It works for most of the "standard" component types, but I haven't gone back to make it cover all the edge cases with AppInsight or components with multiple statistics. It could be more efficient, but I didn't know SQL as well at the time and haven't been motivated to rewrite it since.

    The tricky bit is that, depending on the type of component, there are a LOT of reasons it might be showing as yellow/red, and all those scenarios live in different tables, so it gets tangled fast.

    pastedImage_0.png

    Node ${N=SwisEntity;M=Application.Node.Caption} Application ${N=SwisEntity;M=Application.ApplicationAlert.ApplicationName} Component ${N=SwisEntity;M=ComponentAlert.ComponentName} is ${N=SwisEntity;M=Status;F=Status}

    The following Thresholds have been breached:

    ${SQL: select isnull((
        SELECT cast(concat(tbc.thresholdname, ' '
            , case
                when tbc.thresholdoperator = 0 then 'greater than '
                when tbc.thresholdoperator = 1 then 'greater than or equal to '
                when tbc.thresholdoperator = 2 then 'equal to '
                when tbc.thresholdoperator = 3 then 'less than or equal to '
                when tbc.thresholdoperator = 4 then 'less than '
                when tbc.thresholdoperator = 5 then 'not equal to '
              end
            , cast(tbc.critical as varchar)
            , ' for '
            , isnull(isnull(ovr.criticalpolls,t.criticalpolls),1)
            , case when isnull(ovr.criticalpolls,t.criticalpolls) != isnull(ovr.criticalpollsinterval,t.criticalpollsinterval) then concat(' of ',isnull(ovr.criticalpollsinterval,t.criticalpollsinterval)) else '' end
            , case when isnull(ovr.criticalpolls,t.criticalpolls) > 1 then ' polls' else ' poll' end, CHAR(13)
            ) as xml) as CriticalDescription
        --, *
        FROM [dbo].[APM_ThresholdsByComponent] tbc
        left join [dbo].[APM_Threshold] t on tbc.componentid=t.id and t.thresholdname=tbc.thresholdname and t.istemplate=1
        left join [dbo].[APM_Threshold] ovr on tbc.componentid=ovr.id and ovr.thresholdname=tbc.thresholdname and ovr.istemplate=0
        join [dbo].[APM_CurrentStatistics] cs on cs.componentid=tbc.componentid
        where tbc.componentid=${N=SwisEntity;M=ComponentID} and
            (tbc.critical != '1.7976931348623157E+308')
            and (
                (tbc.thresholdname = 'StatisticData' and (
                    (tbc.thresholdoperator = 0 and (cs.componentstatisticdata > statisticcritical)) or
                    (tbc.thresholdoperator = 1 and (cs.componentstatisticdata >= statisticcritical)) or
                    (tbc.thresholdoperator = 2 and (cs.componentstatisticdata = statisticcritical)) or
                    (tbc.thresholdoperator = 3 and (cs.componentstatisticdata <= statisticcritical)) or
                    (tbc.thresholdoperator = 4 and (cs.componentstatisticdata < statisticcritical)) or
                    (tbc.thresholdoperator = 5 and (cs.componentstatisticdata != statisticcritical))))
                or
                (tbc.thresholdname = 'Response' and (
                    (tbc.thresholdoperator = 0 and (cs.componentresponcetime > responsetimecritical)) or
                    (tbc.thresholdoperator = 1 and (cs.componentresponcetime >= responsetimecritical)) or
                    (tbc.thresholdoperator = 2 and (cs.componentresponcetime = responsetimecritical)) or
                    (tbc.thresholdoperator = 3 and (cs.componentresponcetime <= responsetimecritical)) or
                    (tbc.thresholdoperator = 4 and (cs.componentresponcetime < responsetimecritical)) or
                    (tbc.thresholdoperator = 5 and (cs.componentresponcetime != responsetimecritical))))
                or
                (tbc.thresholdname = 'CPU' and (
                    (tbc.thresholdoperator = 0 and (cs.componentpercentcpu > cpucritical)) or
                    (tbc.thresholdoperator = 1 and (cs.componentpercentcpu >= cpucritical)) or
                    (tbc.thresholdoperator = 2 and (cs.componentpercentcpu = cpucritical)) or
                    (tbc.thresholdoperator = 3 and (cs.componentpercentcpu <= cpucritical)) or
                    (tbc.thresholdoperator = 4 and (cs.componentpercentcpu < cpucritical)) or
                    (tbc.thresholdoperator = 5 and (cs.componentpercentcpu != cpucritical))))
                or
                (tbc.thresholdname = 'PMem' and (
                    (tbc.thresholdoperator = 0 and (cs.componentpercentmemory > physicalmemorycritical)) or
                    (tbc.thresholdoperator = 1 and (cs.componentpercentmemory >= physicalmemorycritical)) or
                    (tbc.thresholdoperator = 2 and (cs.componentpercentmemory = physicalmemorycritical)) or
                    (tbc.thresholdoperator = 3 and (cs.componentpercentmemory <= physicalmemorycritical)) or
                    (tbc.thresholdoperator = 4 and (cs.componentpercentmemory < physicalmemorycritical)) or
                    (tbc.thresholdoperator = 5 and (cs.componentpercentmemory != physicalmemorycritical))))
                or
                (tbc.thresholdname = 'VMem' and (
                    (tbc.thresholdoperator = 0 and (cs.componentpercentvirtualmemory > virtualmemorycritical)) or
                    (tbc.thresholdoperator = 1 and (cs.componentpercentvirtualmemory >= virtualmemorycritical)) or
                    (tbc.thresholdoperator = 2 and (cs.componentpercentvirtualmemory = virtualmemorycritical)) or
                    (tbc.thresholdoperator = 3 and (cs.componentpercentvirtualmemory <= virtualmemorycritical)) or
                    (tbc.thresholdoperator = 4 and (cs.componentpercentvirtualmemory < virtualmemorycritical)) or
                    (tbc.thresholdoperator = 5 and (cs.componentpercentvirtualmemory != virtualmemorycritical))))
                or
                (tbc.thresholdname = 'IOReadOperationsPerSec' and (
                    (tbc.thresholdoperator = 0 and (cs.componentioreadoperationspersec > ioreadoperationsperseccritical)) or
                    (tbc.thresholdoperator = 1 and (cs.componentioreadoperationspersec >= ioreadoperationsperseccritical)) or
                    (tbc.thresholdoperator = 2 and (cs.componentioreadoperationspersec = ioreadoperationsperseccritical)) or
                    (tbc.thresholdoperator = 3 and (cs.componentioreadoperationspersec <= ioreadoperationsperseccritical)) or
                    (tbc.thresholdoperator = 4 and (cs.componentioreadoperationspersec < ioreadoperationsperseccritical)) or
                    (tbc.thresholdoperator = 5 and (cs.componentioreadoperationspersec != ioreadoperationsperseccritical))))
                or
                (tbc.thresholdname = 'IOWriteOperationsPerSec' and (
                    (tbc.thresholdoperator = 0 and (cs.componentiowriteoperationspersec > iowriteoperationsperseccritical)) or
                    (tbc.thresholdoperator = 1 and (cs.componentiowriteoperationspersec >= iowriteoperationsperseccritical)) or
                    (tbc.thresholdoperator = 2 and (cs.componentiowriteoperationspersec = iowriteoperationsperseccritical)) or
                    (tbc.thresholdoperator = 3 and (cs.componentiowriteoperationspersec <= iowriteoperationsperseccritical)) or
                    (tbc.thresholdoperator = 4 and (cs.componentiowriteoperationspersec < iowriteoperationsperseccritical)) or
                    (tbc.thresholdoperator = 5 and (cs.componentiowriteoperationspersec != iowriteoperationsperseccritical))))
                or
                (tbc.thresholdname = 'IOTotalOperationsPerSec' and (
                    (tbc.thresholdoperator = 0 and (cs.componentiototaloperationspersec > iototaloperationsperseccritical)) or
                    (tbc.thresholdoperator = 1 and (cs.componentiototaloperationspersec >= iototaloperationsperseccritical)) or
                    (tbc.thresholdoperator = 2 and (cs.componentiototaloperationspersec = iototaloperationsperseccritical)) or
                    (tbc.thresholdoperator = 3 and (cs.componentiototaloperationspersec <= iototaloperationsperseccritical)) or
                    (tbc.thresholdoperator = 4 and (cs.componentiototaloperationspersec < iototaloperationsperseccritical)) or
                    (tbc.thresholdoperator = 5 and (cs.componentiototaloperationspersec != iototaloperationsperseccritical))))
            )
        FOR XML PATH('') ),'Appinsight/Other');  }
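
The query above repeats the same six-way operator test for every metric. As a rough sketch of that logic outside the database (not the poster's code, and not an Orion API), the Python below maps the operator codes 0 to 5 used in the query to comparisons and phrases, then reports which critical thresholds a set of readings breaches. The `Threshold` class, the `breached` function, the sample values, and the reading of `1.7976931348623157E+308` as "no threshold set" are all assumptions made for illustration.

```python
import operator
import sys
from dataclasses import dataclass

# Operator codes as they appear in the query's CASE expression.
OPERATORS = {
    0: (operator.gt, "greater than"),
    1: (operator.ge, "greater than or equal to"),
    2: (operator.eq, "equal to"),
    3: (operator.le, "less than or equal to"),
    4: (operator.lt, "less than"),
    5: (operator.ne, "not equal to"),
}

# Assumption: the query skips this value because it marks an unset threshold.
# It is the largest finite double, 1.7976931348623157e+308.
UNSET = sys.float_info.max


@dataclass
class Threshold:
    """Hypothetical stand-in for one row of threshold data."""
    name: str        # e.g. "CPU", "PMem", "Response"
    op_code: int     # 0..5, see OPERATORS
    critical: float  # critical limit, or UNSET


def breached(thresholds, readings):
    """Return one description per critical threshold the readings violate."""
    messages = []
    for t in thresholds:
        value = readings.get(t.name)
        if value is None or t.critical == UNSET:
            continue  # nothing polled for this metric, or no limit configured
        compare, phrase = OPERATORS[t.op_code]
        if compare(value, t.critical):
            messages.append(f"{t.name} {phrase} {t.critical:g} (current: {value:g})")
    return messages


if __name__ == "__main__":
    sample = [
        Threshold("PMem", 0, 90.0),
        Threshold("CPU", 1, 80.0),
        Threshold("VMem", 0, UNSET),
    ]
    for line in breached(sample, {"PMem": 98.0, "CPU": 35.0, "VMem": 99.0}):
        print(line)
```

Keeping the operator codes in one lookup table means a new metric only needs a new row of data, not another six-line comparison block.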

  • Appreciate the answers. These solutions are a lot of legwork and I'm not sure they address what I'm after. Also, I'm not using AppInsight at all.

    If MSSQLServer was using 98% memory, it would show on the component page like my screenshot (note: I wasn't able to reproduce a problem for the shot).

    What I'd like is to show this detail on, say, the home page, but there is no resource for components that will show that. I can add the "Applications With Problems" resource, which is what I'm currently using, but it ONLY tells me a component is Warning or Critical, etc. What it doesn't do is show me that memory use is high, which is the cause of the app being Critical or Warning. I've never understood why a monitoring system like Orion can't do this out of the box. Something as simple as a "Components With Problems" resource that actually showed you this might help.

    It's starting to look like this is just not possible without some heavy customization of alerts or some complicated SQL query. Mind-boggling, considering this was an annoying issue five years ago and still is.

    pastedImage_0.png

  • Yeah, it's a pain point and has no easy solution, because each type of component has widely different situations that can cause it to display as critical, and nothing built in tracks them for us. So at this point the best that can be done is to build up a mess of SQL/SWQL to test for each scenario and basically say "if the component is a process, check these 6 things; if it's a script, check these."
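
The per-type branching described in the reply above could look roughly like the sketch below: a table that maps a component type to the checks worth running for it. Everything here is hypothetical (the type names, the check functions, the field names, and the limits). It only illustrates the "if it's a process check these, if it's a script check those" shape, not anything Orion provides.

```python
from typing import Callable, Dict, List, Optional

Component = Dict[str, object]
Check = Callable[[Component], Optional[str]]


def over_limit(field: str, limit: float, label: str) -> Check:
    """Build a check that reports when a numeric field exceeds its limit."""
    def check(component: Component) -> Optional[str]:
        value = component.get(field)
        if isinstance(value, (int, float)) and value > limit:
            return f"{label} {value:g}% (limit {limit:g}%)"
        return None
    return check


def script_failed(component: Component) -> Optional[str]:
    """Report a non-zero exit code from a script-style component."""
    code = component.get("exit_code")
    if code not in (None, 0):
        return f"script exit code {code}"
    return None


# Hypothetical mapping: which checks apply to which kind of component.
CHECKS_BY_TYPE: Dict[str, List[Check]] = {
    "process": [over_limit("cpu", 80, "CPU"), over_limit("memory", 90, "memory")],
    "script": [script_failed],
}


def explain(component: Component) -> str:
    """Summarise why a component is unhealthy, or say that no cause was found."""
    checks = CHECKS_BY_TYPE.get(str(component.get("type")), [])
    reasons = [msg for msg in (check(component) for check in checks) if msg]
    cause = ", ".join(reasons) if reasons else "no known cause found"
    return f'Component "{component.get("name")}": {cause}'


if __name__ == "__main__":
    print(explain({"type": "process", "name": "SQL Server", "cpu": 35, "memory": 98}))
```

With the sample input, the script prints `Component "SQL Server": memory 98% (limit 90%)`, which is close to the one-line summary the original post asks for.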

  • So I discovered this and there was a glimmer of hope. But counterintuitively, it shows the memory usage of the NODE in red. I'm assuming the CPU load is also a node value. When I clicked through to the Application page, it clearly showed SQL Server using a lot of memory. If it just showed the COMPONENT value of those things, that would be some improvement.

    pastedImage_0.png

  • t0ta11ed74  wrote:

    So I discovered this and there was a glimmer of hope. But counterintuitively, it shows the memory usage of the NODE in red. I'm assuming the CPU load is also a node value. When I clicked through to the Application page, it clearly showed SQL Server using a lot of memory. If it just showed the COMPONENT value of those things, that would be some improvement.

    pastedImage_0.png

    Great feedback. I'm tracking the improvement request internally under SAM-10716.