Apologies in advance for the length here.
What prompts this question\discussion is the 100+ alerts I have inherited in a production NPM environment. The enterprise infrastructure spans 25+ autonomous Districts with 250+ unique US and international sites. There are 6 pollers covering 3k+ nodes\12k+ interfaces\8k+ volumes\20k+ elements, with all major SW modules.
At present, the NPM production alerts I am trying to manage are primarily 'simple', not 'custom', alerts. It would benefit me (and I presume others as well) to have some way to easily identify\track the details\specifics across the full scope of production alerts, for elements buried beneath multiple drill downs, such as:
- Alert recipients (due to consistent, ongoing MACs, I need to frequently revisit\revise this element across Alerts)
- Trigger conditions (identify constructs and logic, including 'Property to Monitor' specific conditions, etc.)
- Variable strings in use (per alert! These can change with Orion releases and 'break' alerts; it would be best to be able to globally find\replace when such changes occur.)
- Alert suppression (i.e., details as above)
- Comments field (Used by us to track changes by Admin with date\timestamps, since there is no high level Admin logging details\timestamp in the toolkit at large.)
- Alert times (manage Weekday and Off Hours\Weekend time coverages)
- Alert Trigger Actions (in detail)
- Alert Reset Actions (in detail)
Basically: deconstruct Alerts into a flat interface, or at the least, display the full unique Alert details in a single screen.
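To make the variable string item above concrete, here is a minimal sketch of the kind of find\replace I have in mind. It assumes the alert text has already been pulled out of Orion somehow (for example from an exported definition like the sample at the end of this post) and works on plain strings only. The function names and the replacement variable `NodeCaption` are hypothetical, made up purely for illustration; this is not an Orion feature or API.

```python
import re

# Matches ${...} style variables such as ${NodeName}.
VARIABLE_PATTERN = re.compile(r"\$\{([^}]+)\}")


def find_variables(text):
    """Return the sorted, de-duplicated variable names used in text."""
    return sorted(set(VARIABLE_PATTERN.findall(text)))


def rename_variable(text, old_name, new_name):
    """Replace every ${old_name} with ${new_name}; all other text is untouched."""
    return text.replace("${" + old_name + "}", "${" + new_name + "}")


# Message body taken from the sample alert at the end of this post.
sample = "Component ${ComponentName} on Node ${NodeName} is ${ComponentStatus}"

print(find_variables(sample))
# 'NodeCaption' is a hypothetical new variable name, for illustration only.
print(rename_variable(sample, "NodeName", "NodeCaption"))
```

Run across every alert's message text, something like this would at least show which alerts use a given variable before a release changes it.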
Questions\Discussion point
- Has anyone created a unified Alert management interface to track details\specifics out of an enterprise level NPM Alert population?
- Or even a report to try and easily track Alert details?
- The benefit I am after is to avoid manually deconstructing the full scope of alerts to track details in, say, Access or Excel.
- Have I missed the toolkit GUI alert manager?
- I have reviewed Thwack for relevant content, and have an open ticket for this same request.
- Included here are possibly Alert (feature) enhancements, but I will let SW dig these out.
Feature enhancements
- Alert GUI interface check box for "Include the generating alert name in the message body."
- Some way to track Admin changes: author, date of last change, etc. (Some way to log the changes made.)
- Some way to re-run the "Alert manager GUI" for updating alert changes, or better, be addressable in real time.
- Interact with Alerts globally or individually with a search constraint to make relevant variable changes.
- Some form of search function against the full scope of active alerts per detail field.
- Some central customer portal URL listing feature requests\features\enhancements in development for future releases, to remove any question about past requests already tabled.
- (nice to have) Some form of boolean tracking to generate a compilation Alert for repetitive alerts, based on the number of unique alert conditions per timeframe, with resets.
A bit of background
I have already visited the SQL Server DB to see if this request is (easily?) actionable in a complex report.
From what I can interpret, an NPM alert as constructed is a SQL query assembled in parts from the Designer inputs, with client side table joins, targeting stored procedures on the DB side. For example, here is the alert definition (including its SQL statements) pulled from a very simple, basic alert: "Alert me when a component goes into warning or critical state." Parsing this back into the details of the Alert tab selections and drill downs (e.g., the Trigger action message body) is pretty cryptic (to me).
<?xml version="1.0" encoding="UTF-8"?>
<AlertDefinitions><AlertDefinition AlertDefID="{5EB4B441-F3D6-4DF1-AA39-A59B4B5191AB}" AlertName="Alert me when a component goes into warning or critical state" AlertDescription="This alert will write to the event log when an component goes into warning or critical state and when an component comes back up again." Enabled="True" StartTime="12:00:00 AM" EndTime="11:59:59 PM" DOW="1,2,3,4,5,6,7" TriggerQuery="SELECT APM_AlertsAndReportsData.ComponentID AS NetObjectID, APM_AlertsAndReportsData.ComponentName AS Name
FROM Nodes INNER JOIN APM_AlertsAndReportsData ON (Nodes.NodeID = APM_AlertsAndReportsData.NodeId)
WHERE
(
(Nodes.Status <> '2') AND
(
(APM_AlertsAndReportsData.ComponentStatus = 'Critical') OR
(APM_AlertsAndReportsData.ComponentStatus = 'Warning')
)
)" TriggerQueryDesign="<QUERY><KIND>1</KIND><COMPLEX><TAG></TAG><CONNECTIVE>1</CONNECTIVE><CHECKED>1</CHECKED><SIMPLE><TAG></TAG><ALIAS></ALIAS><ADVANCED>0</ADVANCED><COMPARISON>5</COMPARISON><FUNCTION>0</FUNCTION><SORT>0</SORT><CHECKED>1</CHECKED><LEFTSIDEKIND>2</LEFTSIDEKIND><RIGHTSIDEKIND>1</RIGHTSIDEKIND><COMPARISONATTRIBUTES></COMPARISONATTRIBUTES><FUNCTIONATTRIBUTES></FUNCTIONATTRIBUTES><LEFTFIELDPATH>Network Nodes.Node Status.Node Status</LEFTFIELDPATH><RIGHTFIELDPATH></RIGHTFIELDPATH><LEFTVALUETYPE>0</LEFTVALUETYPE><LEFTVALUE></LEFTVALUE><LEFTCAPTION>Node Status</LEFTCAPTION><RIGHTVALUETYPE>8</RIGHTVALUETYPE><RIGHTVALUE>2</RIGHTVALUE><RIGHTCAPTION>Down</RIGHTCAPTION></SIMPLE><COMPLEX><TAG></TAG><CONNECTIVE>2</CONNECTIVE><CHECKED>1</CHECKED><SIMPLE><TAG></TAG><ALIAS></ALIAS><ADVANCED>0</ADVANCED><COMPARISON>0</COMPARISON><FUNCTION>0</FUNCTION><SORT>0</SORT><CHECKED>1</CHECKED><LEFTSIDEKIND>2</LEFTSIDEKIND><RIGHTSIDEKIND>1</RIGHTSIDEKIND><COMPARISONATTRIBUTES></COMPARISONATTRIBUTES><FUNCTIONATTRIBUTES></FUNCTIONATTRIBUTES><LEFTFIELDPATH>APM Component Monitors.Component Status</LEFTFIELDPATH><RIGHTFIELDPATH></RIGHTFIELDPATH><LEFTVALUETYPE>0</LEFTVALUETYPE><LEFTVALUE></LEFTVALUE><LEFTCAPTION>Component Status</LEFTCAPTION><RIGHTVALUETYPE>8</RIGHTVALUETYPE><RIGHTVALUE>Critical</RIGHTVALUE><RIGHTCAPTION>Critical</RIGHTCAPTION></SIMPLE><SIMPLE><TAG></TAG><ALIAS></ALIAS><ADVANCED>0</ADVANCED><COMPARISON>0</COMPARISON><FUNCTION>0</FUNCTION><SORT>0</SORT><CHECKED>1</CHECKED><LEFTSIDEKIND>2</LEFTSIDEKIND><RIGHTSIDEKIND>1</RIGHTSIDEKIND><COMPARISONATTRIBUTES></COMPARISONATTRIBUTES><FUNCTIONATTRIBUTES></FUNCTIONATTRIBUTES><LEFTFIELDPATH>APM Component Monitors.Component Status</LEFTFIELDPATH><RIGHTFIELDPATH></RIGHTFIELDPATH><LEFTVALUETYPE>0</LEFTVALUETYPE><LEFTVALUE></LEFTVALUE><LEFTCAPTION>Component Status</LEFTCAPTION><RIGHTVALUETYPE>8</RIGHTVALUETYPE><RIGHTVALUE>Warning</RIGHTVALUE><RIGHTCAPTION>Warning</RIGHTCAPTION></SIMPLE></COMPLEX></COMPLEX></QUERY>" 
ResetQuery="SELECT APM_AlertsAndReportsData.ComponentID AS NetObjectID, APM_AlertsAndReportsData.ComponentName AS Name
FROM Nodes INNER JOIN APM_AlertsAndReportsData ON (Nodes.NodeID = APM_AlertsAndReportsData.NodeId)
WHERE NOT
(
(Nodes.Status <> '2') AND
(
(APM_AlertsAndReportsData.ComponentStatus = 'Critical') OR
(APM_AlertsAndReportsData.ComponentStatus = 'Warning')
)
)" ResetQueryDesign="SIMPLE" SuppressionQuery="" SuppressionQueryDesign="<QUERY><KIND>1</KIND><COMPLEX><TAG></TAG><CONNECTIVE>1</CONNECTIVE><CHECKED>1</CHECKED></COMPLEX></QUERY>" TriggerSustained="0" ResetSustained="0" LastExecuteTime="9/8/2011 2:14:48 PM" ExecuteInterval="60" BlockUntil="9/8/2011 2:14:49 PM" ResponseTime="0" LastErrorTime="9/3/2011 5:39:07 PM" LastError="System.Data.SqlClient.SqlException: Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection)
at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection)
at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj)
at System.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj)
at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString)
at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async)
at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method, DbAsyncResult result)
at System.Data.SqlClient.SqlCommand.InternalExecuteNonQuery(DbAsyncResult result, String methodName, Boolean sendToPipe)
at System.Data.SqlClient.SqlCommand.ExecuteNonQuery()
at AlertingEngine.CheckAlert.UpdateRowsThatAreReset()" ObjectType="APM: Component" IgnoreTimeout="True"><AlertActions><AlertAction ActionDefID="{4449897B-5293-4402-86E9-CB3A59E386FF}" AlertDefID="{5EB4B441-F3D6-4DF1-AA39-A59B4B5191AB}" TriggerAction="True" ExecuteIfAcknowledged="True" TimeOffset="0" RepeatInterval="0" StartTime="12:00:00 AM" EndTime="11:59:59 PM" DOW="1,2,3,4,5,6,7" SortOrder="1" ActionType="NPMEventLog" Title="NetPerMon Event Log : Component ${ComponentName} on Application ${ApplicationName} on Node ${NodeName} is ${ComponentStatus}" Target="" Parameter1="NetPerMon Event Log: Component ${ComponentName} on Application ${ApplicationName} on Node ${NodeName} is ${ComponentStatus}" Parameter2="" Parameter3="" Parameter4="" NetObjectType=""/><AlertAction ActionDefID="{DB86F4B0-1372-4EE8-B887-4A2F9F5E1AE1}" AlertDefID="{5EB4B441-F3D6-4DF1-AA39-A59B4B5191AB}" TriggerAction="False" ExecuteIfAcknowledged="True" TimeOffset="0" RepeatInterval="0" StartTime="12:00:00 AM" EndTime="11:59:59 PM" DOW="1,2,3,4,5,6,7" SortOrder="1" ActionType="NPMEventLog" Title="NetPerMon Event Log : Component ${ComponentName} on Application ${ApplicationName} on Node ${NodeName} is ${ComponentStatus}" Target="" Parameter1="NetPerMon Event Log: Component ${ComponentName} on Application ${ApplicationName} on Node ${NodeName} is ${ComponentStatus}" Parameter2="" Parameter3="" Parameter4="" NetObjectType=""/></AlertActions></AlertDefinition></AlertDefinitions>
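
For what it is worth, here is a rough sketch of the "flat" view I am after, built from an export like the one above. It assumes the real export file is well-formed XML (i.e., the markup embedded in attributes such as TriggerQueryDesign is escaped in the file itself, unlike the copy pasted here), so the sample inside the script is a trimmed, hand-cleaned version. The column names, the function names, and my reading that TriggerAction="False" marks a reset action are my own assumptions, not anything documented by SW.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Trimmed, well-formed stand-in for a real export (assumption: real exports
# parse the same way once the embedded markup is properly escaped).
SAMPLE = """<AlertDefinitions>
  <AlertDefinition AlertName="Alert me when a component goes into warning or critical state"
      Enabled="True" StartTime="12:00:00 AM" EndTime="11:59:59 PM" DOW="1,2,3,4,5,6,7">
    <AlertActions>
      <AlertAction TriggerAction="True" ActionType="NPMEventLog"
          Parameter1="Component ${ComponentName} on Node ${NodeName} is ${ComponentStatus}"/>
      <AlertAction TriggerAction="False" ActionType="NPMEventLog"
          Parameter1="Component ${ComponentName} on Node ${NodeName} is ${ComponentStatus}"/>
    </AlertActions>
  </AlertDefinition>
</AlertDefinitions>"""

# Hypothetical column layout for the flat report.
FIELDS = ["AlertName", "Enabled", "StartTime", "EndTime", "DOW",
          "ActionKind", "ActionType", "Message", "Variables"]

VARIABLE_PATTERN = re.compile(r"\$\{([^}]+)\}")


def flatten_alerts(xml_text):
    """Return one dict per alert action (or one per alert if it has no actions)."""
    rows = []
    for alert in ET.fromstring(xml_text).iter("AlertDefinition"):
        base = {name: alert.get(name, "") for name in FIELDS[:5]}
        actions = list(alert.iter("AlertAction"))
        if not actions:
            # Keep alerts without actions visible in the report.
            rows.append({**base, "ActionKind": "", "ActionType": "",
                         "Message": "", "Variables": ""})
        for action in actions:
            message = action.get("Parameter1", "")
            # Assumption: TriggerAction="False" means this is a reset action.
            kind = "Trigger" if action.get("TriggerAction") == "True" else "Reset"
            variables = sorted(set(VARIABLE_PATTERN.findall(message)))
            rows.append({**base, "ActionKind": kind,
                         "ActionType": action.get("ActionType", ""),
                         "Message": message,
                         "Variables": ";".join(variables)})
    return rows


def to_csv(rows):
    """Render the flat rows as CSV text that opens directly in Excel."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = flatten_alerts(SAMPLE)
print(to_csv(rows))
```

This only covers times and actions. Recipients, trigger conditions and suppression would need more columns, and I do not know which attributes hold recipients for e-mail actions, which is partly why I am asking.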