How Can I Alert on a Backup Failure of a node

So, I am aware of this thread: https://thwack.solarwinds.com/product-forums/network-configuration-manager-ncm/f/forum/55296/backup-failure-alerts. I am also aware the job logs can be emailed, but that is a very crude email-or-don't-email approach.

I also like the idea of the mop-up action shown there, and I can see that working in some cases for us but not for all. We have >100 weekly backup schedules which attempt to back up >6k devices. We don't want folks to manually trawl through log files, as we already get blindsided enough by emails, and so we need an automated method to do this.

Simply stated:

We need to alert on any specific node that fails to back up so that it can be raised as an incident.

- Personally speaking, I know this is likely to create more manual work, either triggering a fresh backup or investigating why it is failing, but in this glorious new world of automation, this is what management wants. And, let's face it, there is only so much sense you can talk to management before they start demanding stuff.

  • I see where you are headed, and it makes sense.

    My NOC has a dashboard with these stats (and links to a report called NCM Backup Audit). 



    The report is powered by the SWQL below, I don't remember where this came from. Might be default, I might have found it here on Thwack, heck, I occasionally even write it myself. 

    SELECT DISTINCT
    CASE WHEN CT.NodeID IS NOT NULL THEN 'BackedUp' ELSE 'NotBackedUp' END AS NCMStatus,
    N.Caption,
    N.IPAddress,
    N.DetailsUrl,
    N.Vendor,
    N.Status,
    N.Icon,
    NP.LoginStatus
    FROM NCM.NodeProperties AS NP
    JOIN Orion.Nodes N ON N.NodeID = NP.CoreNodeID
    INNER JOIN Orion.Vendors OV ON N.Vendor = OV.Name
    LEFT JOIN NCM.ConfigArchive CT ON NP.NodeID = CT.NodeID
     

    So I 'know' that NCM will blank out a record from the ConfigArchive when the backup fails, which is why the report works. At least I hope it works and somebody smarter than me won't point out a huge gap in my understanding. It might happen.

    To make the query above work in an Orion alert, I bridge from the Orion nodes to the NCM nodes, then to the ConfigArchive. I look only at nodes that exist in NCM but are missing archives. This alert assumes you back up everything in NCM, so you might need more conditions in your WHERE clause. The full SWQL I tried is below.

    SELECT TOP 100 Nodes.Uri, Nodes.DisplayName
    FROM Orion.Nodes AS Nodes
    LEFT JOIN NCM.Nodes NCM ON Nodes.NodeID = NCM.CoreNodeID
    LEFT JOIN NCM.ConfigArchive CA ON NCM.NodeID = CA.NodeID
    WHERE CA.NodeID IS NULL AND NCM.NodeID IS NOT NULL
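
For anyone unsure what the `LEFT JOIN ... IS NULL` pattern in that query selects, here is a minimal Python sketch of the same logic on in-memory data. The node names and IDs are made up for illustration and are not real Orion data; it only shows the idea of "in NCM, but no matching archive row".

```python
# Hypothetical stand-ins for NCM.Nodes and NCM.ConfigArchive rows.
ncm_nodes = [
    {"NodeID": "a1", "CoreNodeID": 1, "DisplayName": "core-sw-01"},
    {"NodeID": "a2", "CoreNodeID": 2, "DisplayName": "edge-rtr-02"},
    {"NodeID": "a3", "CoreNodeID": 3, "DisplayName": "dist-sw-03"},
]
config_archive = [{"NodeID": "a1"}, {"NodeID": "a3"}]


def nodes_without_archive(nodes, archive):
    """Return display names of NCM nodes that have no archive row at all."""
    archived_ids = {row["NodeID"] for row in archive}
    return [n["DisplayName"] for n in nodes if n["NodeID"] not in archived_ids]


missing = nodes_without_archive(ncm_nodes, config_archive)
print(missing)
```

Note that this only catches nodes with no archive row whatsoever, which is the same assumption the SWQL relies on.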

    And the alert might look something like this. 

    I haven't done more than a brief test, but it seemed right here. Let me know what you think.

  • Thanks Jake ...

    I can see where the above is heading and would be useful if we had a NOC view. Job shifts and the pandemic essentially got rid of any NOC that we did have.

    What I had originally started working with was the 'Cirrus.Audit' table in SWQL Studio, trying to find a way to show any node that hadn't backed up in the last 7 days (which doesn't seem to work either and, to be fair, I did mangle this code together). The best I came up with was to show the jobs that had failed attempts.

    That code is:

    SELECT TOP 1000 UserName, ModuleName, Type, Action, Details, DateTime
    FROM Cirrus.Audit
    WHERE Action LIKE '%Download%'
    AND DateTime < ADDDATE('DAY', -7, GETUTCDATE())
    AND Type = 'Failed'

    As this wasn't getting anywhere for me I segued over to alerts and mashed this together:

    The downside of this, as I see it, is that the date is fixed at a specific time/date, whereas I need it to dynamically cover the last 7 days. So that I could see what this was grabbing, I copied the SWQL it generated into SWQL Studio.

    The result is good: it returns a plausible number of nodes that failed to back up since my arbitrary date above. However, those nodes are identified purely by 'Uri', which looks something like:

    SolarWinds_Server/Orion/Cirrus.Audit/ID=DF3838A4-EE67-4850-BF85-020558D97D65

    If I put that Uri into my browser I get nothing back. I think what I did (it's been a while since I started looking at this) was take one of those IDs and adjust the code below to include " AND ID LIKE '%blah%' ", which narrowed it down to the specific job. Then I looked at the job and could indeed see a failure.

    Essentially, if I can link that ID to a node somehow, I guess via a join somewhere, then the last 7 days is the least of my worries.

    For completeness' sake, the code for the above trigger is:

    SELECT E0.[Uri], E0.[DisplayName]
    FROM Cirrus.Audit AS E0 
    WHERE ( ( E0.[Action] = 'Download Config requested by job' ) 
    AND ( E0.[DateTime] > '20210613 23:00:00' ) 
    AND ( E0.[Type] = 'Failed' ) )

    I may of course be barking completely up the wrong tree here.

  • Getting the hard-coded date out of there isn't hard, take a look at:

    SELECT A.ID, A.UserName, A.ModuleName, A.Type, A.Action, A.Details, A.[DateTime]
    FROM Cirrus.Audit A
    WHERE DateTime > ADDDAY(-7, DateTrunc('day', Getdate()))

    I do not see a good way yet to jump out to the node information. I assume it can be done, but haven't figured that out yet. 
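
To make the rolling window in that query concrete, here is a small Python sketch of the same arithmetic. It assumes `DateTrunc('day', ...)` truncates to midnight and `ADDDAY(-7, ...)` subtracts seven days; the function name `window_start` is just for illustration.

```python
from datetime import datetime, timedelta


def window_start(now, days_back=7):
    """Midnight of the current day, moved back by days_back days."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight - timedelta(days=days_back)


# Example: evaluated mid-afternoon on 20 June 2021.
example = window_start(datetime(2021, 6, 20, 14, 30))
print(example)
```

Because the start is anchored to midnight, the window covers seven full days plus whatever has elapsed of the current day.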

  • Right ... so expanding on that slightly, I can do this:

    SELECT A.ID, A.UserName, A.ModuleName, A.Type, A.Action, A.Details, A.[DateTime]
    FROM Cirrus.Audit A
    WHERE DateTime > ADDDAY(-1, DateTrunc('day', Getdate()))
    AND Type = 'Failed'
    AND UserName LIKE '%Weekly%'

    That gives me just the Weekly backup jobs that failed in the last day.

    As you point out, the link to the specific node and its details is the key part of this for us. I can see that one client had two failed nodes in yesterday's backup, and if I look at the log for that job, the job list concurs with that.

    It's getting there.

  • @stuartd I have 2 errors in the Details field. I can pull the IP out of one of them and use that to link to the nodes (this works for successful downloads too). I still don't see a way to use the ID to get an IP for the other type. Do you have any other failure messages that I haven't seen?

    Connection Refused by 10.666.666.666
    Connectivity issues, discarding configuration (or configuration is too short)
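
As a rough illustration of pulling the IP out of the Details text, here is a minimal Python sketch. It assumes the address appears as a plain dotted quad somewhere in the message, as in the first example above; the pattern is deliberately loose and does not validate octet ranges, so placeholder addresses still match. Messages without an address simply return `None`, which is exactly the unsolved case.

```python
import re

# Loose dotted-quad pattern; does not check that each octet is 0-255.
IP_PATTERN = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3})\b")


def extract_ip(details):
    """Return the first dotted-quad found in an audit Details string, or None."""
    match = IP_PATTERN.search(details or "")
    return match.group(1) if match else None


with_ip = extract_ip("Connection Refused by 10.666.666.666")
without_ip = extract_ip(
    "Connectivity issues, discarding configuration (or configuration is too short)"
)
print(with_ip, without_ip)
```
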
  • Yes, lots of failure messages.

    In addition to yours above, I've seen:

    - Connection refused by
    - Error downloading config to TFTP Host
    - Protocol is not supported for binary config
    - Error downloading config to SCP Host
    - Transfer failure due to timeout
    - Config "2jhfzr3dgx2.config" not found on SCP/TFTP
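
With this many message variants, one option is to bucket the Details text before raising incidents. The sketch below is a minimal Python example that matches only the message fragments listed in this thread; the category names are invented for illustration, and anything unrecognised falls through to "unknown" rather than being guessed at.

```python
# (fragment to look for, hypothetical category label)
KNOWN_FAILURES = [
    ("connection refused by", "connection_refused"),
    ("connectivity issues", "connectivity"),
    ("error downloading config to tftp host", "tftp_transfer"),
    ("error downloading config to scp host", "scp_transfer"),
    ("protocol is not supported for binary config", "unsupported_protocol"),
    ("transfer failure due to timeout", "timeout"),
    ("not found on scp/tftp", "config_missing"),
]


def classify_failure(details):
    """Map an audit Details string to a coarse failure category."""
    text = (details or "").lower()
    for fragment, category in KNOWN_FAILURES:
        if fragment in text:
            return category
    return "unknown"


print(classify_failure('Config "2jhfzr3dgx2.config" not found on SCP/TFTP'))
```

Matching is case-insensitive because the thread shows both "Connection Refused by" and "Connection refused by".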