How Can I Alert on a Backup Failure of a node

So, I am aware of this thread: https://thwack.solarwinds.com/product-forums/network-configuration-manager-ncm/f/forum/55296/backup-failure-alerts and I am also aware that the job logs can be emailed, but that is a very crude email-or-don't-email approach.

I also like the idea of the mop up action that @RichardLetts showed, and I can see that working in some cases for us but not for all. We have >100 weekly backup schedules which attempt to back up >6k devices. We don't want folks to manually trawl through log files, as we already get blindsided enough by emails, and so we need an automated method to do this.

Simply stated:

We need to alert on any specific node that fails to back up so that it can be raised as an incident.

- personally speaking I know this is likely to raise more manual work to go and trigger a fresh backup or investigate why it is failing, but in this glorious new world of automation, this is what management want. And, let's face it, there is only so much sense you can talk to management before they start demanding stuff

Accepted answers

stuartd

And replying to myself because I can.

As we do a rolling 7 day backup process then I just added:

AND AttemptedDownloadTime >= GETDATE()-7

Which only shows me any failures in the last 7 days.

All comments

jm_sysadmin

I see where you are headed, and it makes sense.

My NOC has a dashboard with these stats (and links to a report called NCM Backup Audit).

The report is powered by the SWQL below, I don't remember where this came from. Might be default, I might have found it here on Thwack, heck, I occasionally even write it myself.

SELECT DISTINCT
CASE WHEN CT.NodeID IS NOT NULL THEN 'BackedUp' ELSE 'NotBackedUp' END AS NCMStatus,     
N.Caption,         
N.IPAddress,
N.DetailsUrl,
N.Vendor,
N.Status,
N.Icon,
NP.LoginStatus
FROM NCM.NodeProperties as NP
JOIN Orion.Nodes N ON N.NodeID = NP.CoreNodeID
Inner JOIN Orion.Vendors OV ON N.Vendor = OV.Name
LEFT JOIN NCM.ConfigArchive CT ON NP.NodeID = CT.NodeID

So I 'know' that NCM will blank out a record from the ConfigArchive when the back up fails, which is why the report works. At least I hope it works and somebody smarter than me won't point out a huge gap in my understanding. It might happen.

To fix the query above to make Orion alerts happy, I bridge to the NCM nodes, then to the ConfigArchive. I look only at things that are missing archives but do exist in NCM. This alert assumes you back up everything in NCM, so you might need more conditions in your WHERE clause. The whole SWQL I tried is below.

SELECT top 100 Nodes.Uri, Nodes.DisplayName 
FROM Orion.Nodes AS Nodes
Left Join NCM.Nodes NCM ON Nodes.NodeID = NCM.CoreNodeID
LEFT JOIN NCM.ConfigArchive CA ON NCM.NodeID = CA.NodeID
Where CA.NodeID IS NULL and NCM.NodeID is Not Null

And the alert might look something like this.

I haven't done more than a brief test, but it seemed right here. Let me know what you think.

stuartd

Thanks Jake ...

I can see where the above is heading and would be useful if we had a NOC view. Job shifts and the pandemic essentially got rid of any NOC that we did have.

What I had originally started working with was the 'Cirrus.Audit' table in SWQL Studio, trying to find a way to show any node that hadn't backed up in the last 7 days (which doesn't seem to work either and, to be fair, I did mangle this code together). The best I came up with was to show the jobs that had failed attempts.

That code is:

SELECT TOP 1000 UserName, ModuleName, Type, Action, Details, DateTime
FROM Cirrus.Audit
WHERE Action LIKE '%Download%'
AND DateTime < ADDDATE('DAY', -7, GETUTCDATE())
AND Type = 'Failed'

As this wasn't getting anywhere for me I segued over to alerts and mashed this together:

The downside of this, as I see it is that the date is fixed from a specific time/date, whereas I need it to dynamically see the last 7 days. So that I could see what this was grabbing I copied the SWQL it generated into SWQL Studio.

A good result that sees a relevant number of nodes that failed to back up since my arbitrary date above. Only, those nodes are identified purely by 'Uri', which looks something like:

SolarWinds_Server/Orion/Cirrus.Audit/ID=DF3838A4-EE67-4850-BF85-020558D97D65

If I put that Uri into my browser I get nothing back. I think what I did (it's been a while since I started looking at this) is that I took one of those IDs and adjusted the code below to include " AND ID LIKE '%blah%' ", which narrowed it down to the specific job. Then I looked at the job and could indeed see a failure.

Essentially, if I can link (somehow) that ID to a node, I guess via a join somewhere/somehow then the last 7 days is the least of my worries.

For completeness sake, the code for the above trigger is:

SELECT E0.[Uri], E0.[DisplayName]
FROM Cirrus.Audit AS E0 
WHERE ( ( E0.[Action] = 'Download Config requested by job' ) 
AND ( E0.[DateTime] > '20210613 23:00:00' ) 
AND ( E0.[Type] = 'Failed' ) )

I may of course be barking completely up the wrong tree here.

jm_sysadmin

Getting the hard coded date out of there isn't hard, take a look at:

SELECT A.ID,    A.UserName, A.ModuleName, A.Type, A.Action, A.Details, A.[DateTime]
FROM Cirrus.Audit A 
Where  DateTime > ADDDAY(-7,DateTrunc('day',Getdate()))

I do not see a good way yet to jump out to the node information. I assume it can be done, but haven't figured that out yet.

stuartd

Right ... so expanding on that slightly, I can do this:

SELECT A.ID,    A.UserName, A.ModuleName, A.Type, A.Action, A.Details, A.[DateTime]
FROM Cirrus.Audit A 
Where  DateTime > ADDDAY(-1,DateTrunc('day',Getdate()))
AND Type = 'Failed'
AND UserName LIKE '%Weekly%'

To give me just the Weekly backup jobs that failed in the last day.

As you point out, the link to the specific node and its details is the key part of this for us. So I can see that one client had two failed nodes in yesterday's backup, and if I look at the log for that job I can see the job list concurs with that.

It's getting there.

stevenstadel

Love the NOC Widget! Can you share the config behind your NCM Backup Summary?

Seashore

I use the query below as a custom SWQL alert; it seems to work:

SELECT
Nodes.Uri, Nodes.DisplayName,

FROM Orion.Nodes AS Nodes
INNER JOIN Cirrus.Nodes AS NCM ON Nodes.Nodeid=NCM.CoreNodeID
INNER JOIN
(
SELECT
CA.NodeID AS NodeID,
MAX(CA.AttemptedDownloadTime) as LastBackup
FROM Cirrus.ConfigArchive AS CA
GROUP BY CA.NodeID
HAVING MAX(CA.AttemptedDownloadTime)

Like this:

stuartd

Thank you @Seashore but that didn't quite work for me.

First, and it took me a while as I don't code, the Select statement had one too many commas. Once it was removed it ran and that was great.

So, I have to presume that 'NCM.Status=1' means it failed. But LastAttemptedDownloadTime doesn't seem to be right in terms of what we are trying to achieve. It doesn't work for us because what we really need to see is when a job fails, not when it was last attempted, which is what yours does if I read it right. Please do correct me if I'm wrong.

Going back to my little script:

SELECT A.ID,    A.UserName, A.ModuleName, A.Type, A.Action, A.Details, A.[DateTime]
FROM Cirrus.Audit A 
Where  DateTime > ADDDAY(-1,DateTrunc('day',Getdate()))
AND Type = 'Failed'
AND UserName LIKE '%Weekly%'

This works to a degree and gives the result we need, only without the flexibility of pulling in node names, etc. from Orion.Nodes or similar. It also works as something we could run once a day looking for jobs that failed in the past 24hrs. Currently that sits at zero so no incidents would get raised. If I set your query to the same time period I get 1,927 returns.

No way the team could handle an influx of almost 2k incidents to investigate failed backups that, at a quick glance, mostly aren't failing; they are just older than a day since the last attempt.

Seashore

Hi Stuartd

Sorry about the comma.

The alert script above assumes we are trying to do a backup every night, and alerts if a backup hasn't worked for 2 days. It doesn't check the job status, only whether we have a successful backup attempt (it could be that we don't do a backup if there is no change, therefore it checks for a successful attempt).

NCM.status=1 is the node status. So we exclude nodes that are down.

If you do weekly backups you could change the "-2" to "-8" or so. That's probably why you get so many returns.
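
For reference, the alert described above might look roughly like the sketch below. This is a reconstruction, not the original post: the `HAVING` comparison, the `ADDDAY(-2, GETDATE())` cutoff, the join back on `NCM.NodeID`, and the `NCM.Status = 1` filter are assumptions pieced together from the "-2" and node-status remarks and from the other queries in this thread.

```sql
-- Sketch only: the cutoff, the final join and the WHERE clause are assumptions.
SELECT Nodes.Uri, Nodes.DisplayName
FROM Orion.Nodes AS Nodes
INNER JOIN Cirrus.Nodes AS NCM ON Nodes.NodeID = NCM.CoreNodeID
INNER JOIN
(
    -- Most recent download attempt per NCM node
    SELECT CA.NodeID AS NodeID,
           MAX(CA.AttemptedDownloadTime) AS LastBackup
    FROM Cirrus.ConfigArchive AS CA
    GROUP BY CA.NodeID
    -- Keep only nodes whose newest attempt is older than the cutoff.
    -- -2 suits nightly backups; something like -8 suits weekly ones.
    HAVING MAX(CA.AttemptedDownloadTime) < ADDDAY(-2, GETDATE())
) AS A ON NCM.NodeID = A.NodeID
-- Status 1 means the node is up, so nodes that are down are excluded
WHERE NCM.Status = 1
```

Note that this flags nodes by the age of their newest archive entry, not by job outcome, so the cutoff needs to be longer than the backup interval or every node will match.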

stuartd

Circling back round to this, I've done some more work and got not a lot further.

What is stumping me at present is that I can't find where the DB stores the date of the last actual backup be it a manually instigated one or one run via a job.

I have a little script (and my first ever successful written join from scratch - thanks @jm_sysadmin for your time today) which is:

SELECT TOP 1000 CNP.CoreNodeID, N.Caption, CNP.LoginStatus, CNP.ConfigTypes, CNP.LastTransferActionType, CNP.LastTransferDate, CNP.LastInventory, CNP.LastTransferMessage, CNP.IsTransferError, CNP.IsActiveTransfer, CNP.IsTransferCanceling
FROM Cirrus.NodeProperties CNP
JOIN  Orion.Nodes N ON CNP.CoreNodeID = N.NodeID
WHERE CNP.LastTransferMessage <> 'Complete'

Now, a lot of those fields are not required but I left them in just in hopes / for visibility, but the one that looks most promising is: LastTransferDate - sadly, this only ever shows the last manually instigated backup via the Config Management screen and not the ones from jobs.

So, yet another plea, does anyone know where the DB stores the actual last backup date however it was instigated. PLEASE?

stuartd

So I've found the location of the last actual date of a backup or backup attempt. It's stored in 'Cirrus.ConfigArchive' as 'DownloadTime'

Which leads to my next query on this....

I need to JOIN this table into the prior query - and as I've only just worked out how to do a JOIN I have no idea how to join in a third table. Any help on this please?
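
A third table is joined the same way as the second: add another JOIN with its own ON clause. A minimal sketch, assuming `Cirrus.NodeProperties` exposes the NCM `NodeID` that `Cirrus.ConfigArchive` is keyed on (the other queries in this thread join the two that way):

```sql
-- Sketch only: the CNP.NodeID = CA.NodeID join key is an assumption.
SELECT TOP 1000 N.Caption, CNP.CoreNodeID, CNP.LoginStatus,
       CNP.LastTransferMessage, CA.DownloadTime
FROM Cirrus.NodeProperties CNP
-- Second table: Orion node details, matched on the Orion node ID
JOIN Orion.Nodes N ON CNP.CoreNodeID = N.NodeID
-- Third table: config archive, matched on the NCM node ID
JOIN Cirrus.ConfigArchive CA ON CA.NodeID = CNP.NodeID
WHERE CNP.LastTransferMessage <> 'Complete'
```

Because a node can have many archive rows, this returns one row per archived config rather than one row per node.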

mnsh_majumder

I have modified the query (based on SQL) to alert me on nodes whose last backup transfer message is not 'Complete'.

SELECT Nodes.CAPTION,NODES.VENDOR,NP.LASTTRANSFERMESSAGE,NCA.AttemptedDownloadTime,np.loginstatus
FROM [DBO].[Nodes] AS Nodes
INNER JOIN [dbo].[NCM_nodeproperties] AS NP ON Nodes.Nodeid=NP.coreNodeID
INNER JOIN [dbo].[NCM_ConfigArchive] AS NCA ON NP.NodeID=NCA.NODEID
/*INNER JOIN
(
SELECT CA.NodeID AS NodeID,MAX(CA.AttemptedDownloadTime) as LastBackup FROM [dbo].[NCM_ConfigArchive] AS CA
GROUP BY CA.NodeID
HAVING MAX(CA.AttemptedDownloadTime)<DATEADD(DAY,-2,GETDATE()) -- Adjust how old backups that are ok
)
AS A ON NP.NodeID=A.Nodeid */
WHERE NCA.AttemptedDownloadTime<DATEADD(DAY,-2,GETDATE())
--and NP.LOGINSTATUS NOT LIKE'%Login OK%'
-- NODES.VENDOR LIKE '%F5%' AND
AND LASTTRANSFERMESSAGE NOT LIKE '%Complete%'

stuartd

@mnsh_majumder nice.

Any idea how you'd narrow it down to only show the last attempted backup? Using DISTINCT doesn't work, so the query needs to include the most recent date that it didn't complete - not all of them.
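
One way to get a single row per node is to join to a grouped subquery that keeps only the newest `AttemptedDownloadTime` for each node, instead of joining to every archive row. The sketch below is written in SWQL rather than the T-SQL above. The entity and column names are taken from other posts in this thread, and `Latest` and `LastAttempt` are made-up aliases, so treat it as untested.

```sql
-- Sketch only: untested, names borrowed from the queries in this thread.
SELECT N.Caption, N.Vendor, NP.LastTransferMessage, NP.LoginStatus,
       Latest.LastAttempt
FROM Orion.Nodes N
INNER JOIN Cirrus.NodeProperties NP ON N.NodeID = NP.CoreNodeID
INNER JOIN
(
    -- Collapse the archive to one row per node: its newest attempt
    SELECT CA.NodeID AS NodeID,
           MAX(CA.AttemptedDownloadTime) AS LastAttempt
    FROM Cirrus.ConfigArchive CA
    GROUP BY CA.NodeID
) AS Latest ON NP.NodeID = Latest.NodeID
WHERE NP.LastTransferMessage NOT LIKE '%Complete%'
```

DISTINCT cannot do this because each archive row has a different timestamp, so every row is already distinct.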

stuartd

And replying to myself because I can.

As we do a rolling 7 day backup process then I just added:

AND AttemptedDownloadTime >= GETDATE()-7

Which only shows me any failures in the last 7 days.

mnsh_majumder

This query gives me a better result for my requirement.

INNER JOIN [dbo].[NCM_nodeproperties] AS NP ON Nodes.Nodeid=NP.coreNodeID
--INNER JOIN [dbo].[NCM_ConfigArchive] AS NCA ON NP.NodeID=NCA.NODEID
WHERE Np.lasttransferdate<DATEADD(DAY,-10,GETDATE())
AND NP.LASTTRANSFERMESSAGE NOT LIKE '%Complete%'
and nodes.vendor like'%F5%'