
Alert / Event Tracking for troubleshooting

For my sins I am being seen as a 'sort of' expert on Solarwinds. I know some bits very well, but I confess I am far from an expert and not even close to being good with lots of it.

So can someone kindly point me at documentation that shows a methodology for investigating alerts from end to end from a troubleshooting perspective? Either that, or please try to talk me through how you go about this.

For example, I'm trying to find and then follow a specific alert from its arrival in the system, to it raising an incident in our 3rd party tool, and ultimately to the up alert / event. This is for a trap, if it makes any difference, but I'm not having much luck.

Crazy thing is I can see our external ticket side of things being fired off, and this can only have come from an alert trigger in Solarwinds causing an entry to be written to an external log file, but I can't see the Solarwinds side of it.

Any thoughts?

I guess I'm asking - how do I see all the events going through the system? If this means trawling the database then any help on the relevant commands would be welcome. Or pointers to a 3rd party tool (or Solarwinds addon) that interrogates the log without constant screen refreshes messing up what I'm looking for / at.

  • I would start with Message Center.  I have taken a ticket from our 3rd party ticket service, BMC Remedy, and searched message center for all events related to the node (or interface or whatever). 

    On an active alert in SolarWinds, I can see a History of Alerts on this Object resource that at least for us can show when an active alert triggered and the alert action taken.  In our case sending an email to Remedy.

  • I have had similar issues with tracking alerts and whether or not emails were generated etc.

    One of the (I regard as significant) issues with this is, once an Alert has been reset (i.e. is no longer "Active") there is really no easy way to find if/when it triggered in the "Message Center" view.

    There is an SQL report that I "borrowed" and modified (badly - still tweaking it) that pulls the Alert/Trigger Actions etc. from the SolarWinds SQL DB. It has records for Actions triggered and their "results". A "successful" result is not necessarily "true": it can mean only that SolarWinds (for example) sent to the SMTP server and considers that as "I'm done - it's a success", without actually checking anything to ensure the email was received/forwarded etc.

    If one of the Actions on the Alert Trigger is to log something to the NetPerfMon log, then that Event record is searchable via the "Events" option in Message Center.

    What I ended up doing to have a searchable (view Message Center, Reports etc.) "trail" is:

    1) For every alert - in Trigger Actions - add a "NetPerfMon Event Log - Alert Trigger Logging" action

    2) For every alert - in Reset Actions - add a "NetPerfMon Event Log - Alert Reset Logging" action

    For the Trigger Action - the Message is:

    Alert -- ${N=Alerting;M=AlertName} -- triggered for Node ${N=SwisEntity;M=Caption}

    For the Reset Action - the Message is:

    Alert -- ${N=Alerting;M=AlertName} -- reset for Node ${N=SwisEntity;M=Caption}

    So - when an Alert is triggered and/or reset, then there is a corresponding Event Log record for it - and the assumption is, if the "NetPerfMon Event Log" action is performed then the alert was triggered and/or reset and any other Actions (email, sms, run script etc.) should also have triggered. So the Alert Triggering worked (for that Node/Event/etc.) and at that time etc.

    Easy to setup - one Trigger Action and one Reset Action and assign each to every alert you want (there are a couple we don't bother with).

    Any new alert setup - select the existing Action(s) to assign.

    I have a report setup that pulls all the "Alert -- " triggered and reset events and orders them by Node, then time for the last 24 hours, 7 days etc. and can be easily and quickly run to see what Alert (M=AlertName) was triggered/reset at what time for the Node in question.

    I have found this approach extremely helpful to track if/what occurs when someone says "I didn't receive any alerts/emails for Node xx last night - alerting isn't working". Run the report and see very quickly whether the expected alert was actually triggered/reset; if something is there, I know straight away not to start trawling through the quagmire to find what alerting was doing.

    Every time (so far) this has saved me a huge amount of time - 95% of the time the alerts were triggered, which quickly pointed to the problem being in SMTP/Exchange land (i.e. email system cactus), and the other 5% led me (much more) rapidly to see that there was a general alerting issue traced to MSMQ queues clogging (which "stopped" status changes/alerting).

    Above may not help with your complete issue (end-to-end) but might help with a large chunk perhaps.
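The report described in the reply above can be approximated outside SolarWinds once the events are exported. The following is only a minimal sketch: it assumes the events are already available as (timestamp, message) pairs, it relies on the "Alert -- ... -- triggered/reset for Node ..." message templates quoted above, and the sample node and alert names are hypothetical. It does not query the SolarWinds database itself.

```python
import re
from datetime import datetime

# Matches the trigger/reset message templates quoted in the reply above.
EVENT_RE = re.compile(
    r"^Alert -- (?P<alert>.+?) -- (?P<kind>triggered|reset) for Node (?P<node>.+)$"
)

def build_trail(events):
    """Return (node, timestamp, alert, kind) rows ordered by node, then time.

    `events` is an iterable of (timestamp, message) pairs; messages that do
    not follow the "Alert -- " template are ignored.
    """
    rows = []
    for ts, msg in events:
        m = EVENT_RE.match(msg)
        if m is None:
            continue
        rows.append((m["node"], ts, m["alert"], m["kind"]))
    rows.sort(key=lambda r: (r[0], r[1]))
    return rows

# Hypothetical sample data standing in for exported event log records.
sample_events = [
    (datetime(2020, 1, 2, 3, 10), "Alert -- Node Down -- reset for Node core-sw-01"),
    (datetime(2020, 1, 2, 3, 0), "Alert -- Node Down -- triggered for Node core-sw-01"),
    (datetime(2020, 1, 2, 2, 0), "Alert -- High CPU -- triggered for Node app-srv-02"),
    (datetime(2020, 1, 2, 2, 30), "Node rebooted"),  # unrelated event, skipped
]

trail = build_trail(sample_events)
for node, ts, alert, kind in trail:
    print(f"{node}  {ts:%Y-%m-%d %H:%M}  {alert}  {kind}")
```

Sorting by node first and time second mirrors the ordering described in the reply, so the trigger and reset for one node sit next to each other.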

  • Thanks John,

    >> would start with Message Center. I have taken a ticket from our 3rd party ticket service, BMC Remedy, 

    We are also using Remedy, and it may be of interest to understand how you integrate with it, as I remain unconvinced that our method is the best. Unfortunately the BMC side is done by a separate team that also used to manage (install, support, troubleshoot, etc) all our monitoring tools until about 6yrs ago when they gave us 3 months notice that they would no longer support our tools, just the integration.  So what we have is a hangover from then... it may be the best, but I simply don't know - anyway, perhaps better a private message to discuss this.

    Back on Topic: I've been using message center <sic> but find it so hit and miss with what it provides and thus my query about how to track 'a down alert to incident raised to up alert to incident closed'. The incident and integration side I can do, and despite above reservations works and is traceable by us without the aid of our BMC team. It's understanding the Solarwinds side.

  • It looks like we mirror what you do except for the report.

    I'll need to double check every alert has the netperfmon event log aspect though. I'm pretty sure they do, but it doesn't hurt to double check - which makes it hard to understand why I see 'multiple downs' for a device but not the intervening up that clearly must have occurred to generate a fresh down.

    Now, the report sounds like a good idea, and I wonder if you'd be willing to share its makeup, as me and the reporting module are not the best of chums?

    >> general alerting issue traced to MSMQ queues clogging

    Thankfully we don't use emails for alerting (we do for some reporting I believe) so that is one less aspect in our chain.
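The 'multiple downs without an intervening up' symptom mentioned in the reply above can be checked mechanically once the trigger and reset events are exported. This is only a sketch under assumptions: the rows are hypothetical (node, alert, timestamp, kind) tuples, with kind being "triggered" or "reset", and nothing here reads from SolarWinds directly.

```python
from datetime import datetime

def find_missing_resets(rows):
    """Return repeated triggers that arrived while the same alert was still open.

    Each result is (node, alert, earlier_trigger_time, repeated_trigger_time),
    meaning no reset event was logged between the two triggers.
    """
    open_alerts = {}   # (node, alert) -> time of the trigger still awaiting a reset
    suspicious = []
    for node, alert, ts, kind in sorted(rows, key=lambda r: r[2]):
        key = (node, alert)
        if kind == "triggered":
            if key in open_alerts:
                suspicious.append((node, alert, open_alerts[key], ts))
            open_alerts[key] = ts
        elif kind == "reset":
            open_alerts.pop(key, None)
    return suspicious

# Hypothetical sample: two downs in a row, then a reset, then a fresh down.
sample_rows = [
    ("core-sw-01", "Node Down", datetime(2020, 1, 2, 1, 0), "triggered"),
    ("core-sw-01", "Node Down", datetime(2020, 1, 2, 2, 0), "triggered"),
    ("core-sw-01", "Node Down", datetime(2020, 1, 2, 2, 30), "reset"),
    ("core-sw-01", "Node Down", datetime(2020, 1, 2, 3, 0), "triggered"),
]

missing = find_missing_resets(sample_rows)
for node, alert, first, second in missing:
    print(f"{node}: '{alert}' triggered at {first:%H:%M} and again at {second:%H:%M} with no reset logged")
```

A hit from this check does not say why the reset is missing; it only narrows the search to whether the reset action failed to log or the reset never fired.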

  • Every alert we trigger we write an event to the NetPerfMon log so it should show up in Message Center.  This helps us prove the alert triggered.   I can see the Alert Triggered event, but I like to also see the log of the trigger as another verification.

    As for our BMC integration, it isn't the best.  We email a BMC mailbox and they pull it into Remedy that way.  The problem with that is it is dependent on Exchange working and working efficiently to get the email there on time.  As a result, I am occasionally called on to verify an alert triggered via the Message Center.

    I think the report is a good idea too.  I have a report that can pull up all alerts that triggered in the last 24 hours.  I am not sure anyone on the other teams uses it though.

  • >> As for our BMC integration, it isn't the best. 

    Intriguing - so it isn't just us, but at least I believe we are slightly better off than this.

    As BMC is also monitoring our servers it does two things - checks for a 'heartbeat' from the Solarwinds platform (this is an every 5m check), and also monitors a specific folder where we generate a simple text file from the alerts.

    The text file is created by the alert itself and takes the following (loose) format:

    'execute external prog'
    the command is then: path to BATch file along with some arguments (8 of them to be precise)

    The batch file then takes the 8 arguments, inserts the result into a single line text file and appends the same line to a log file. The arguments include time / date / name of device / type of device / what's wrong (i.e. Int G0/1 is down) and so on. One argument specifies the down in the trigger action and one argument specifies UP in the reset action. The arguments rely heavily on custom properties but it seems to work.

    As I say, my issue is believing, trusting or following through on the Solarwinds side.
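For readers unfamiliar with the pattern, the batch-file step described above (take 8 arguments, write a single-line text file, append the same line to a log) can be sketched as follows. This is an illustration in Python, not the poster's actual batch file: the file names, the " | " separator, the argument order, and the two trailing custom-property values are all assumptions.

```python
import tempfile
from pathlib import Path

EXPECTED_ARGS = 8  # the post says the batch file receives 8 arguments

def write_alert_files(args, drop_file, log_file):
    """Write one alert line to the monitored file and append it to the log."""
    if len(args) != EXPECTED_ARGS:
        raise ValueError(f"expected {EXPECTED_ARGS} arguments, got {len(args)}")
    line = " | ".join(args)  # separator is an assumption
    # The single-line file is overwritten each time, so it holds only the latest alert.
    Path(drop_file).write_text(line + "\n", encoding="utf-8")
    # The log file keeps the full history.
    with open(log_file, "a", encoding="utf-8") as log:
        log.write(line + "\n")
    return line

# Demonstration in a temporary folder with hypothetical argument values.
with tempfile.TemporaryDirectory() as tmp:
    drop = Path(tmp) / "current_alert.txt"   # hypothetical file name
    log = Path(tmp) / "alert_history.log"    # hypothetical file name
    down = ["2020-01-02", "03:00", "core-sw-01", "Switch", "DOWN",
            "Int G0/1 is down", "custom-prop-a", "custom-prop-b"]
    up = ["2020-01-02", "03:10", "core-sw-01", "Switch", "UP",
          "Int G0/1 is up", "custom-prop-a", "custom-prop-b"]
    write_alert_files(down, drop, log)   # trigger action
    write_alert_files(up, drop, log)     # reset action
    drop_text = drop.read_text(encoding="utf-8")
    log_lines = log.read_text(encoding="utf-8").splitlines()

print(drop_text.strip())
print(len(log_lines), "lines in log")
```

Because the append-only log records every DOWN and UP line, it can serve as an independent trail to compare against what Message Center shows for the same device.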