How Can I Alert on a Backup Failure of a node

So, I am aware of this thread, and I am also aware that the job logs can be emailed, but that is a very crude email-or-don't-email approach.

I also like the idea of the mop up action that showed, and I can see that working in some cases for us but not for all. We have >100 weekly backup schedules which attempt to back up >6k devices. We don't want folks manually trawling through log files, as we already get blindsided enough by emails, so we need an automated method to do this.

Simply stated:

We need to alert on any specific node that fails to back up so that it can be raised as an incident.

Personally speaking, I know this is likely to create more manual work, either triggering a fresh backup or investigating why it is failing, but in this glorious new world of automation, this is what management want. And, let's face it, there is only so much sense you can talk to management before they start demanding stuff.

  • I see where you are headed, and it makes sense.

    My NOC has a dashboard with these stats (and links to a report called NCM Backup Audit). 

    The report is powered by the SWQL below. I don't remember where this came from: it might be default, I might have found it here on Thwack, and, heck, I occasionally even write it myself.

    SELECT
        -- A node counts as backed up when it has at least one ConfigArchive row
        CASE WHEN CT.NodeID IS NOT NULL THEN 'BackedUp' ELSE 'NotBackedUp' END AS NCMStatus
    FROM NCM.NodeProperties AS NP
    JOIN Orion.Nodes N ON N.NodeID = NP.CoreNodeID
    INNER JOIN Orion.Vendors OV ON N.Vendor = OV.Name
    LEFT JOIN NCM.ConfigArchive CT ON NP.NodeID = CT.NodeID

    So I 'know' that NCM will blank out a record from the ConfigArchive when the backup fails, which is why the report works. At least I hope it works and somebody smarter than me won't point out a huge gap in my understanding. It might happen.

    To adapt the query above to make Orion alerts happy, I bridge to the NCM nodes, then to the ConfigArchive. I look only at things that are missing archives but do exist in NCM. This alert assumes you back up everything in NCM, so you might need more conditions in your WHERE clause. The whole SWQL I tried is below.

    SELECT top 100 Nodes.Uri, Nodes.DisplayName 
    FROM Orion.Nodes AS Nodes
    Left Join NCM.Nodes NCM ON Nodes.NodeID = NCM.CoreNodeID
    LEFT JOIN NCM.ConfigArchive CA ON NCM.NodeID = CA.NodeID
    Where CA.NodeID IS NULL and NCM.NodeID is Not Null 

    And the alert might look something like this. 

    I haven't done more than a brief test, but it seemed right here. Let me know what you think.

  • Thanks Jake ...

    I can see where the above is heading and would be useful if we had a NOC view. Job shifts and the pandemic essentially got rid of any NOC that we did have.

    What I had originally started working with was the 'Cirrus.Audit' table in SWQL Studio, trying to find a way to show any node that hadn't backed up in the last 7 days (which doesn't seem to work either and, to be fair, I did mangle this code together). The best I came up with was to show the jobs that had failed attempts.

    That code is:

    SELECT TOP 1000 UserName, ModuleName, Type, Action, Details, DateTime
    FROM Cirrus.Audit
    WHERE Action LIKE '%Download%'
    AND DateTime < ADDDATE('DAY', -7, GETUTCDATE())
    AND Type = 'Failed'

    As this wasn't getting anywhere for me I segued over to alerts and mashed this together:

    The downside of this, as I see it, is that the date is fixed at a specific time/date, whereas I need it to dynamically cover the last 7 days. To see what this was grabbing, I copied the SWQL it generated into SWQL Studio.

    A good result: it shows a relevant number of nodes that failed to back up since my arbitrary date above. However, those nodes are identified purely by 'Uri', which looks something like:


    If I put that Uri into my browser I get nothing back. I think what I did (it's been a while since I started looking at this) was take one of those IDs and adjust the code below to include " AND ID LIKE '%blah%' ", which narrowed it down to the specific job. Then I looked at the job and could indeed see a failure.

    Essentially, if I can link (somehow) that ID to a node, I guess via a join somewhere/somehow then the last 7 days is the least of my worries.

    For completeness' sake, the code for the above trigger is:

    SELECT E0.[Uri], E0.[DisplayName]
    FROM Cirrus.Audit AS E0 
    WHERE ( ( E0.[Action] = 'Download Config requested by job' ) 
    AND ( E0.[DateTime] > '20210613 23:00:00' ) 
    AND ( E0.[Type] = 'Failed' ) )

    I may of course be barking completely up the wrong tree here.

  • Getting the hard coded date out of there isn't hard, take a look at: 

    SELECT A.ID, A.UserName, A.ModuleName, A.Type, A.Action, A.Details, A.[DateTime]
    FROM Cirrus.Audit A
    WHERE DateTime > ADDDAY(-7, DATETRUNC('day', GETDATE()))

    I do not see a good way yet to jump out to the node information. I assume it can be done, but haven't figured that out yet. 

  • Right ... so expanding on that slightly, I can do this:

    SELECT A.ID, A.UserName, A.ModuleName, A.Type, A.Action, A.Details, A.[DateTime]
    FROM Cirrus.Audit A
    WHERE DateTime > ADDDAY(-1, DATETRUNC('day', GETDATE()))
    AND Type = 'Failed'
    AND UserName LIKE '%Weekly%'

    To give me just the Weekly backup jobs that failed in the last day.

    As you point out, the link to the specific node and its details is the key part of this for us. So I can see that one client had two failed nodes in yesterday's backup, and if I look at the log for that job I can see the job list concurs with that.

    It's getting there.

  • @stuartd I have 2 errors in the Details field. I can pull the IP out of one of them and use that to link to the nodes (this works for successful downloads too). I still don't see a way to use the ID to get an IP for the other type. Do you have any other failure messages that I haven't seen?

    Connection Refused by 10.666.666.666
    Connectivity issues, discarding configuration (or configuration is too short)
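
Pulling the IP out of the Details text could also be done outside Orion, for example in a script that post-processes exported audit rows. This is only a minimal Python sketch under that assumption: `extract_ip` is a hypothetical helper name, and the pattern accepts any dotted quad without validating octet ranges, because the sample address above is anonymised.

```python
import re
from typing import Optional

# Matches any dotted quad; octet ranges are deliberately not validated
# (assumption: the address appears in plain dotted form in Details).
DOTTED_QUAD = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3})\b")


def extract_ip(details: str) -> Optional[str]:
    """Return the first dotted-quad address in an audit Details string, or None."""
    match = DOTTED_QUAD.search(details)
    return match.group(1) if match else None


samples = [
    "Connection Refused by 10.666.666.666",
    "Connectivity issues, discarding configuration (or configuration is too short)",
]
for text in samples:
    print(extract_ip(text))
```

The second message yields `None`, which is exactly the case that still needs another route (such as the audit ID) back to the node.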

  • Love the NOC Widget! Can you share the config behind your NCM Backup Summary?

  • Hi

    I use the query below as a custom SWQL alert; it seems to work:

    SELECT Nodes.Uri, Nodes.DisplayName
    FROM Orion.Nodes AS Nodes
    INNER JOIN Cirrus.Nodes AS NCM ON Nodes.NodeID = NCM.CoreNodeID
    INNER JOIN (
        -- Most recent download attempt per NCM node
        SELECT CA.NodeID AS NodeID,
               MAX(CA.AttemptedDownloadTime) AS LastBackup
        FROM Cirrus.ConfigArchive AS CA
        GROUP BY CA.NodeID
        HAVING MAX(CA.AttemptedDownloadTime) < ADDDAY(-2, GETDATE()) -- adjust the maximum acceptable backup age (days)
    ) AS A ON NCM.NodeID = A.NodeID
    WHERE NCM.Status = 1

    Like this:

  • Thank you but that didn't quite work for me.

    First, and it took me a while as I don't code, the Select statement had one too many commas. Once it was removed it ran and that was great.

    So, I have to presume that 'NCM.Status=1' means it failed. But AttemptedDownloadTime doesn't seem to be right in terms of what we are trying to achieve. It doesn't work for us because what we really need to see is when a job fails, not when it was last attempted, which is what yours does if I read it right. Please do correct me if I'm wrong.

    Going back to my little script:

    SELECT A.ID, A.UserName, A.ModuleName, A.Type, A.Action, A.Details, A.[DateTime]
    FROM Cirrus.Audit A
    WHERE DateTime > ADDDAY(-1, DATETRUNC('day', GETDATE()))
    AND Type = 'Failed'
    AND UserName LIKE '%Weekly%'

    This works to a degree, and gives the result we need, only without the flexibility of pulling in node names, etc. from Orion.Nodes or similar. It also works as something we could run once a day, looking for jobs that failed in the past 24 hours. Currently that sits at zero, so no incidents would get raised. If I set your query to the same time period I get 1,927 returns.

    There is no way the team could handle an influx of almost 2k incidents to investigate failed backups that, at a quick glance, mostly aren't failing; it is just that more than a day has passed since the last attempt.

  • Yes, lots of failure messages.

    In addition to your one above, I've seen:

    - Connection refused by
    - Error downloading config to TFTP Host
    - Protocol is not supported for binary config
    - Error downloading config to SCP Host
    - Transfer failure due to timeout
    - Config "2jhfzr3dgx2.config" not found on SCP/TFTP
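
If the Details text ends up being processed in a script, the messages collected in this thread could be grouped into categories before incidents are raised. This is a rough Python sketch only: `classify_failure` is a hypothetical helper, and it assumes each Details value starts with one of the messages listed here, which has not been verified against NCM.

```python
# Message prefixes taken from this thread; other failure texts may exist.
KNOWN_PREFIXES = (
    "Connection refused by",
    "Connectivity issues, discarding configuration",
    "Error downloading config to TFTP Host",
    "Error downloading config to SCP Host",
    "Protocol is not supported for binary config",
    "Transfer failure due to timeout",
)


def classify_failure(details: str) -> str:
    """Map an audit Details string to a known failure category."""
    lowered = details.strip().lower()
    for prefix in KNOWN_PREFIXES:
        if lowered.startswith(prefix.lower()):
            return prefix
    # The "not found" message carries a variable file name in the middle.
    if lowered.startswith('config "') and lowered.endswith("not found on scp/tftp"):
        return "Config not found on SCP/TFTP"
    return "Unrecognised"


print(classify_failure("Connection Refused by 10.666.666.666"))
print(classify_failure('Config "2jhfzr3dgx2.config" not found on SCP/TFTP'))
```

Anything that comes back as "Unrecognised" is a candidate for adding to the list.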

  • Hi Stuartd

    Sorry about the comma.

    The alert script above assumes we are trying to do a backup every night, alerting if the backup hasn't worked for 2 days. It doesn't check the job status, only whether we have a successful backup attempt (it could be that no backup is taken when there is no change, which is why it checks for a successful attempt).

    NCM.Status=1 is the node status, so we exclude nodes that are down.

    If you do weekly backups you could change the "-2" to "-8" or so. That's probably why you get so many returns.
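
The effect of the threshold can be illustrated with a small Python sketch. The function `is_stale` and the timestamps are hypothetical and not part of NCM; they only mirror the "last attempt older than N days" comparison in the alert query.

```python
from datetime import datetime, timedelta


def is_stale(last_attempt: datetime, now: datetime, max_age_days: int) -> bool:
    """True when the last backup attempt is older than max_age_days."""
    return last_attempt < now - timedelta(days=max_age_days)


now = datetime(2021, 6, 21, 8, 0)
# Hypothetical node on a weekly schedule, last attempted about 5.4 days ago.
last_weekly_backup = datetime(2021, 6, 15, 23, 0)

print(is_stale(last_weekly_backup, now, 2))  # nightly threshold flags the node
print(is_stale(last_weekly_backup, now, 8))  # weekly threshold does not
```

With a 2-day threshold, every node on a weekly schedule looks overdue for most of the week, which would explain a count in the thousands.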