5 Replies Latest reply on May 13, 2015 1:30 PM by nicole pauls

    Help eliminate Alert Email Overload

    magicpercy

      We have a rule set up to alert us when the agent on one of our PCI workstations goes offline. It only alerts us if the agent has been offline for an hour, so we don't get false positives from maintenance reboots.

       

      We have some users at remote (home office) locations who are always connected to our organization over VPN.

       

      Over this past weekend one of those remote sites had a connectivity issue, and the line would go down and up about every 10 minutes.

       

      This caused the alert to trigger an email about every 10 minutes, which our security chief got very upset about. What we need is a way to send the email only once per set period. So if we set that period to an hour, only one email is sent whether the agent goes down once or multiple times.
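
      To make the desired behavior concrete, here is a minimal sketch of per-agent email suppression. It is only an illustration of the logic we are after, not something the product provides; the class name, the agent name, and the timestamps are all made up for the example.

```python
from datetime import datetime, timedelta


class AlertThrottle:
    """Allow at most one notification per agent per suppression window."""

    def __init__(self, window):
        self.window = window
        self._last_sent = {}  # agent name -> time of the last email sent

    def should_send(self, agent, now):
        last = self._last_sent.get(agent)
        if last is not None and now - last < self.window:
            return False  # still inside the window: suppress this email
        self._last_sent[agent] = now
        return True


def simulate_flapping(window, flap_every, duration):
    """Return the times at which an email would go out for one flapping agent."""
    throttle = AlertThrottle(window)
    start = datetime(2015, 5, 9, 0, 0)  # arbitrary start time for the example
    sent = []
    t = start
    while t < start + duration:
        # Hypothetical agent name; each loop iteration is one offline alert.
        if throttle.should_send("remote-pc-01", t):
            sent.append(t)
        t += flap_every
    return sent
```

      With a one-hour window, an agent that drops every 10 minutes for three hours would produce three emails instead of eighteen.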

       

      Here is the rule as it is today:

       

       

      The exclusions are for some test agents that we do not need this alert for.

        • Re: Help eliminate Alert Email Overload
          curtisi

           

          Maybe this will help.

            • Re: Help eliminate Alert Email Overload
              magicpercy

              Ok, this is great information.

               

              However, in our situation, how do we get just one alert per offline agent within a specified timeframe, say only one alert in a 12-hour window?

               

              The situation we had with the flaky VPN was that the agent went offline, then back online, about every 10 minutes. In the video, the reset correlation would be agent online, which in this case would reset constantly and therefore still send out emails every 10 minutes.

            • Re: Help eliminate Alert Email Overload
              nicole pauls

              Hmm... interesting case. I'm waffling on whether this is a bug or expected behavior; NOT EXISTS rules always get me confused.

               

              One thing you could try is adding

              InternalAgentOffline.DetectionTime < InternalAgentOnline.DetectionTime

              which will at least specify that the offline should come before the online.

               

              I think a new correlation instance in memory is getting triggered for each Offline per machine, which may not exactly be what we expect.

               

              Our thresholded rules do have the concept of "time over threshold" (which is something like "fire this rule if you see 10 of these in 30 seconds, then tell me again if the condition is still valid after 5 minutes"), but not on a single event.
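
              As a rough illustration of the "time over threshold" idea, here is a small sketch. It is not how LEM implements it; the class, the parameter names, and the choice to reset once the count drops below the threshold are all assumptions made for the example.

```python
from collections import deque


class ThresholdRule:
    """Fire when `count` events fall within `window` seconds, then fire again
    every `renotify` seconds for as long as the condition still holds."""

    def __init__(self, count=10, window=30.0, renotify=300.0):
        self.count = count
        self.window = window
        self.renotify = renotify
        self._times = deque()    # timestamps of events still inside the window
        self._last_fired = None  # time of the most recent notification

    def observe(self, t):
        """Record an event at time t (in seconds); return True if the rule fires."""
        self._times.append(t)
        # Drop events that have aged out of the sliding window.
        while self._times and t - self._times[0] > self.window:
            self._times.popleft()
        if len(self._times) < self.count:
            self._last_fired = None  # assumption: condition cleared, start over
            return False
        if self._last_fired is None or t - self._last_fired >= self.renotify:
            self._last_fired = t
            return True
        return False
```

              Fed one event per second for 400 seconds, this sketch fires at second 9 (the tenth event) and again at second 309, and stays quiet in between.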

                • Re: Help eliminate Alert Email Overload
                  magicpercy

                  Maybe I am not explaining the event very well, as I don't think adding InternalAgentOffline.DetectionTime < InternalAgentOnline.DetectionTime would help, since the offline already preceded the online.

                   

                  Think of it like this. A PC has an agent, and the connection is unplugged, triggering an offline event because there is no communication between the manager and the PC. A few minutes later the connection is plugged back in, bringing the agent back online. This resets the conditions. About 10 minutes later the whole process repeats, triggering a new alert every 10 minutes or so.

                   

                  Our understanding of the rule as it is set up is that the correlation is: if you see an agent offline for 60 minutes, trigger an alert; however, if it is back online within 60 minutes, do not trigger an alert. Therefore we should only see the alert once every hour if it keeps cycling up and down.
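
                  Here is a minimal sketch of that reading of the rule, just to pin down what we expect. It models our understanding, not what the rule engine actually does; the function name and the event tuple layout are made up for the example.

```python
from datetime import datetime, timedelta


def offline_alerts(events, hold=timedelta(minutes=60)):
    """Return (agent, offline_time) for every offline event that is NOT followed
    by an online event from the same agent within `hold`.

    `events` is a list of (timestamp, agent, kind) tuples, where kind is
    "offline" or "online". This layout is an assumption for the sketch.
    """
    alerts = []
    for ts, agent, kind in events:
        if kind != "offline":
            continue
        # Look for a later online event from the same agent inside the hold time.
        recovered = any(
            a == agent and k == "online" and ts < t <= ts + hold
            for t, a, k in events
        )
        if not recovered:
            alerts.append((agent, ts))
    return alerts
```

                  Under this reading, an agent that comes back within a few minutes on every cycle would never alert at all, while an agent that goes down and stays down would alert once.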

                    • Re: Help eliminate Alert Email Overload
                      nicole pauls

                      I agree, it won't eliminate them - but once you're IN the chain of events there's a small chance it might reduce them.

                       

                      LEM isn't deterministic unless you tell it to be, so unless you specify the EventA.DetectionTime < EventB.DetectionTime condition, the events could happen in any order, and you could actually have an online that came in BEFORE the offline that cancels it out (or doesn't). That 60 minutes is a sliding window before/after the first event that starts the clock.
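
                      A small sketch of the difference the ordering condition makes. This is a simplified model of the idea, not LEM's actual matching code; the function name and parameters are hypothetical.

```python
from datetime import datetime, timedelta


def is_cancelled(offline_ts, online_times, hold, ordered):
    """Return True if some online event cancels the offline event.

    ordered=False: any online within `hold` before OR after the offline counts,
                   mimicking a sliding window with no ordering constraint.
    ordered=True:  only an online strictly after the offline counts, like adding
                   Offline.DetectionTime < Online.DetectionTime to the rule.
    """
    for t in online_times:
        if ordered:
            if offline_ts < t <= offline_ts + hold:
                return True
        elif abs(t - offline_ts) <= hold:
            return True
    return False
```

                      For an agent that reported online at 00:00, went offline at 00:05, and stayed down, the unordered match treats the earlier online as cancelling the offline, while the ordered match does not.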

                       

                      What you described is definitely the ideal case, and I'm still duking it out with the development team as to whether it's a bug or just an artifact of the behavior of the rule. I've found a couple of other NOT EXISTS bugs myself which don't work the way I'd expect so there's still a pretty solid chance it's not working as intended either.