How Can I Alert on a Backup Failure of a node

So, I am aware of this thread: https://thwack.solarwinds.com/product-forums/network-configuration-manager-ncm/f/forum/55296/backup-failure-alerts. I am also aware that the job logs can be emailed, but that is a very crude email-or-don't-email approach.

I also like the idea of the mop-up action shown there, and I can see that working in some cases for us, but not for all. We have >100 weekly backup schedules which attempt to back up >6k devices. We don't want folks to manually trawl through log files, as we already get blindsided enough by emails, so we need an automated method to do this.

Simply stated:

We need to alert on any specific node that fails to back up so that it can be raised as an incident.

Personally speaking, I know this is likely to create more manual work to go and trigger a fresh backup or investigate why it is failing, but in this glorious new world of automation, this is what management want. And, let's face it, there is only so much sense you can talk to management before they start demanding stuff ;)

Parents
  • Hi

    I use the query below as a custom SWQL alert, and it seems to work:

    SELECT
    Nodes.Uri, Nodes.DisplayName,
    
    FROM Orion.Nodes AS Nodes
    INNER JOIN Cirrus.Nodes AS NCM ON Nodes.Nodeid=NCM.CoreNodeID
    INNER JOIN
    (
    SELECT
    CA.NodeID AS NodeID,
    MAX(CA.AttemptedDownloadTime) as LastBackup
    FROM Cirrus.ConfigArchive AS CA
    GROUP BY CA.NodeID
    HAVING MAX(CA.AttemptedDownloadTime)<ADDDAY(-2,GETDATE()) -- Adjust how old the last backup attempt may be before alerting
    ) AS A ON NCM.NodeID=A.Nodeid
    
    WHERE NCM.Status=1

Children
  • Thank you but that didn't quite work for me.

    First, and it took me a while as I don't code, the Select statement had one too many commas. Once it was removed it ran and that was great.

    So, I have to presume that 'NCM.Status=1' means it failed. But the latest AttemptedDownloadTime doesn't seem to be right in terms of what we are trying to achieve. It doesn't work for us because what we really need to see is when a job fails, not when it was last attempted, which is what yours does if I read it right. Please do correct me if I'm wrong.

    Going back to my little script:

    SELECT A.ID, A.UserName, A.ModuleName, A.Type, A.Action, A.Details, A.[DateTime]
    FROM Cirrus.Audit A
    WHERE A.[DateTime] > ADDDAY(-1, DateTrunc('day', GETDATE()))
    AND A.Type = 'Failed'
    AND A.UserName LIKE '%Weekly%'

    This works to a degree and gives the result we need, but without the flexibility of pulling in node names, etc. from Orion.Nodes or similar. It also works as something we could run once a day, looking for jobs that failed in the past 24 hours. Currently that sits at zero, so no incidents would get raised. If I set your query to the same time period I get 1,927 results.

    No way could the team handle an influx of almost 2k incidents to investigate failed backups that, at a quick glance, mostly aren't failing; they are just more than a day past their last attempt.

  • Hi Stuartd

    Sorry about the comma.

    The alert script above assumes we are trying to do a backup every night, and it alerts if a backup hasn't worked for 2 days. It doesn't check the job status, only whether we have a recent successful backup attempt (it could be that we don't do a backup if there is no change, hence checking for a successful attempt).

    NCM.Status=1 is the node status, so we exclude nodes that are down.

    If you do weekly backups you could change the "-2" to "-8" or so. That's probably why you get so many results.
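
    A minimal sketch of that weekly variant, assuming the same entities and columns as the query above and a threshold of 8 days (one weekly cycle plus a day of slack; the exact number is an assumption to tune per schedule). The stray comma after Nodes.DisplayName is removed here:

    ```sql
    -- Nodes whose most recent backup attempt is older than the threshold.
    SELECT Nodes.Uri, Nodes.DisplayName
    FROM Orion.Nodes AS Nodes
    INNER JOIN Cirrus.Nodes AS NCM ON Nodes.NodeID = NCM.CoreNodeID
    INNER JOIN
    (
        -- Latest attempt per NCM node, kept only if it is too old.
        SELECT CA.NodeID AS NodeID, MAX(CA.AttemptedDownloadTime) AS LastBackup
        FROM Cirrus.ConfigArchive AS CA
        GROUP BY CA.NodeID
        HAVING MAX(CA.AttemptedDownloadTime) < ADDDAY(-8, GETDATE()) -- assumed weekly threshold
    ) AS A ON NCM.NodeID = A.NodeID
    WHERE NCM.Status = 1 -- skip nodes that are down
    ```

    Because both joins are inner joins, a node with no rows at all in Cirrus.ConfigArchive would not be returned, so devices that have never been backed up would need a separate check.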