3 Replies Latest reply on Apr 4, 2017 8:23 PM by gcp

    Alerts for SQL Agent Jobs Missing thier run - due to downtime.

    dessalil

      All,

       

      I am new to DPA, hence would appreciate if someone could help.

       

      Consider that we have a SQL Server, which has a lot of jobs scheduled.

      Lets say we have a cluster failover or some downtime on this machine due to which some Jobs dint kick off during their scheduled time.

      How can we create an alert of report for the same? Alert is preferred.

       

      Thanks,

      SD

        • Re: Alerts for SQL Agent Jobs Missing thier run - due to downtime.
          gcp

          You can set up a custom alert such as "Custom SQL Alert - Multiple Numeric Return" using something like this:

           

          SELECT

              JOBS.NAME AS [Job Name]

              ,DATEDIFF(minute, MAX(msdb.DBO.AGENT_DATETIME(RUN_DATE, RUN_TIME)), getdate() )

          FROM msdb.dbo.SYSJOBS JOBS

          LEFT OUTER JOIN msdb.dbo.SYSJOBHISTORY JH

              ON JOBS.job_id = JH.job_id

                  WHERE JH.step_id = 0

          GROUP BY JOBS.name

           

          This just tells you the job name and how long (minutes) since the job last ran (not whether it was successful). Set the alert threshold to a suitable value to pick up jobs that haven't run in the relevant period.

          This SQL is not fully tested, so please verify the results before you use it in Prod

            • Re: Alerts for SQL Agent Jobs Missing thier run - due to downtime.
              fquintero

              This Query return count JOB Fail , I did create metric custom

               

              yuo try in dev

               

              select count(*) as JOBFail

              from (

              SELECT  distinct name AS [Job Name]

                     

                       ,CASE WHEN enabled=1 THEN 'Enabled' 

                             ELSE 'Disabled' 

                        END [Job Status]

                       ,CASE WHEN SJH.run_status=0 THEN 'Failed'

                                   WHEN SJH.run_status=1 THEN 'Succeeded'

                                   WHEN SJH.run_status=2 THEN 'Retry'

                                   WHEN SJH.run_status=3 THEN 'Cancelled'

                             ELSE 'Unknown' 

                        END [Job Outcome]

              FROM msdb..sysjobhistory SJH 

              JOIN msdb..sysjobs SJ  ON  SJH.job_id=sj.job_id 

              WHERE  step_id=0 

              and SJH.run_status=0

              AND    DATEADD(S, 

                (run_time/10000)*60*60 /* hours */ 

                +((run_time - (run_time/10000) * 10000)/100) * 60 /* mins */ 

                + (run_time - (run_time/100) * 100)  /* secs */, 

                CONVERT(DATETIME,RTRIM(run_date),113)) >= DATEADD(d,-1,GetDate()) )as s

                • Re: Alerts for SQL Agent Jobs Missing thier run - due to downtime.
                  gcp

                  That does give you a count of failed jobs, provided it is a single step job, or they actually catch success/failure for each step.

                  If the 'On failure action' is defaulted to 'Go to the next step' then you'll never see an error if a second or subsequent step fails.

                   

                  I have a vendor (*cough* airwatch *cough*) job that fails in step 9 of 51, but still shows overall as succeeded - because they only check the status of the last step! The standard DPA Administrative Alert for SQL Server Job Failure does actually flag it as failed, which is how I found out.

                   

                  If I change to "WHERE  step_id>=0 " in your SQL it does count the failure, but otherwise it does not.

                   

                  This also won't help in the original question to determine if a job has NOT run, as that is not a failure.

                   

                  I got caught a long time ago with a backup job that didn't fail. It was showing as running, but the scheduler had hung and we didn't get a backup for way too long.

                  After that we added checks for how long since the last successful backup job ran as well.