So, here's the scenario. We do a ridiculous amount of logging because we have to for compliance auditing. In a perfect world, the LEM logs and alerts just fine without any problems. The problem is that I work in the real world, where servers go down, DNS entries change, and so on, causing the agents to lose communication with the LEM. If the timing is bad, the agents (20+ servers) can end up buffering 24 hours or more of logs (close to 36 hours in my case). As it turns out, our LEM can't keep up when it gets flooded with logs from 26 servers all trying to send their backlog at once. PUMPSTATUS shows that we dropped about 3.2 million of 12 million when the flurry of events came in (what those values actually mean, I have no idea). Our system has been running fine since the flurry; it hasn't dropped anything more and has gone on to process over 37 million rules as of this request. For auditing, I cannot allow anything to be dropped.
Instead of dropping events, how about giving the manager the ability to tell the agents to back off, or to throttle them? The agents communicate over TCP, so two-way communication shouldn't be an issue.
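
To make the request a bit more concrete, here is a rough Python sketch of the kind of back-off I have in mind. This is not how the LEM or its agents actually work internally; `Manager`, `Agent`, `offer`, `simulate`, and every number in it are made up purely for illustration. The idea is that the manager only accepts what fits in its intake queue and tells the agent to wait, and the agent keeps whatever was not accepted in its own buffer instead of losing it.

```python
from collections import deque


class Manager:
    """Hypothetical manager with a bounded intake queue (not the real LEM)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()
        self.processed = 0

    def offer(self, events):
        """Accept as many events as fit; return (accepted_count, backoff)."""
        free = self.capacity - len(self.queue)
        accepted = events[:free]
        self.queue.extend(accepted)
        # Ask the sender to back off whenever its batch had to be cut short.
        backoff = len(accepted) < len(events)
        return len(accepted), backoff

    def process(self, n):
        """Drain up to n events from the intake queue."""
        for _ in range(min(n, len(self.queue))):
            self.queue.popleft()
            self.processed += 1


class Agent:
    """Hypothetical agent holding a backlog of buffered events."""

    def __init__(self, backlog):
        self.buffer = deque(range(backlog))
        self.wait = 0  # ticks to stay quiet after a back-off signal

    def tick(self, manager, batch_size):
        if self.wait > 0:
            self.wait -= 1
            return
        if not self.buffer:
            return
        size = min(batch_size, len(self.buffer))
        batch = [self.buffer[i] for i in range(size)]
        accepted, backoff = manager.offer(batch)
        # Only discard what the manager confirmed; the rest stays buffered.
        for _ in range(accepted):
            self.buffer.popleft()
        if backoff:
            self.wait = 2


def simulate(agents=26, backlog=1000, capacity=500, rate=300, batch_size=100):
    """Run until every agent buffer and the manager queue are empty."""
    manager = Manager(capacity)
    fleet = [Agent(backlog) for _ in range(agents)]
    ticks = 0
    while any(a.buffer for a in fleet) or manager.queue:
        for agent in fleet:
            agent.tick(manager, batch_size)
        manager.process(rate)
        ticks += 1
    return manager.processed, ticks


if __name__ == "__main__":
    processed, ticks = simulate()
    print(f"processed {processed} events in {ticks} ticks, none dropped")
```

In this toy model the flood just takes longer to drain, which is a trade I would happily make: for auditing, slow is acceptable and dropped is not.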