Windows devices stop "sending" events every few days; appliance reboot is the only resolution

Question

I put "sending" in quotes because the agents on about 20 domain controllers are all still running. Right after the appliance boots up, all of the Windows domain controllers begin sending events to SEM again, and all seems well.

However, every few days, I notice that the "node health" pane on the dashboard shows the "Last event" for ALL Windows devices increasing from "a few seconds ago" to minutes, then hours, then days. As soon as I log into the console and reboot the appliance, they all begin functioning again, immediately.

I've been fighting this for months with preventative reboots. We are currently on version 2023.2 of SEM, but this has been a problem on every version since 2022.2 (our initial deployment). Cisco devices do not have this problem.

It appears that SEM may be closing off the agent communication port, perhaps due to memory use? However, our appliance has plenty of resources available (8 GB of memory, 2 TB of storage), and both are still at less than 50% use. We average about 400 events per second with 30 nodes: 10 Cisco and 20 Windows.
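One way to test the "port is being closed" theory the next time the "Last event" age starts climbing is to probe the agent port from one of the domain controllers before rebooting the appliance. The snippet below is only a minimal sketch: the function name `port_is_open`, the hostname, and the port number in the usage comment are placeholders I made up for illustration, so substitute the agent port your SEM deployment actually uses.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Example usage (hypothetical host and port; replace with your own values):
# print(port_is_open("sem.example.local", 37890))
```

If this returns False while the agent service is still running on the domain controller, that would support the idea that the appliance has stopped accepting agent connections. If it returns True, the problem is more likely further along in event processing.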

Any insights or similar experiences?

npatterson · Answer

I had a support case open for a while that did not resolve the problem, but I finally found a solution that worked.

First, I upgraded to 2023.2.1, which fixed the issues. I also archived backups after 6 months using logmarchive over SSH (one time only, then I turned it off).

This purged some old records, and the issue now seems to be resolved. For reference, in my case only the non-agent nodes kept working while the agents stopped communicating, and I found that the appliance upgrade solved the agent communication issue.

aannavar · Answer

Did you work with support on this? If not, I would suggest working with them so they can escalate to the engineering team as appropriate.