System time changing on main polling engine

We've hit a strange condition twice in the last two weeks: the system time on our main polling engine randomly changed and then changed back a few minutes later. In the latest occurrence, on 4/20 at 10:06:21, the clock jumped to 8/18/2024 at 1:21:34am. This set off a chain reaction: SolarWinds components failed, reports and alerts ran and got out of sync, and database maintenance ran and purged large amounts of detail data that it thought it needed to remove. DB maintenance has been failing since the first time jump. The time changed back approximately 5 minutes after the initial switch.
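
For anyone who wants to catch a jump like this the moment it happens, here is a minimal sketch in Python. It compares how far the wall clock moved against how far a monotonic clock moved over the same interval; the monotonic clock is not affected by time sync, so a large difference means the system time was stepped. The function names, the 10-second interval and the 5-second tolerance are illustrative assumptions, not anything SolarWinds or Windows provides.

```python
import time


def detect_clock_jump(prev_wall, prev_mono, now_wall, now_mono, tolerance_s=5.0):
    """Return the wall-clock step in seconds, or 0.0 if within tolerance.

    A positive result means the wall clock jumped forward, negative means
    it jumped backward, relative to the elapsed monotonic time.
    """
    step = (now_wall - prev_wall) - (now_mono - prev_mono)
    return step if abs(step) > tolerance_s else 0.0


def watch(interval_s=10.0, tolerance_s=5.0, iterations=None):
    """Poll both clocks and print a line whenever the wall clock is stepped.

    iterations=None runs until interrupted (hypothetical helper for illustration).
    """
    prev_wall, prev_mono = time.time(), time.monotonic()
    count = 0
    while iterations is None or count < iterations:
        time.sleep(interval_s)
        now_wall, now_mono = time.time(), time.monotonic()
        step = detect_clock_jump(prev_wall, prev_mono, now_wall, now_mono, tolerance_s)
        if step:
            print(f"System time stepped by {step:+.0f}s, now {time.ctime(now_wall)}")
        prev_wall, prev_mono = now_wall, now_mono
        count += 1
```

Running `watch()` in a console or as a scheduled task on the polling engine would log each step with its size, which could then be lined up against the Windows Time service events.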

For the DB maintenance failures I do have a case open with support which is, to be generous, spinning its wheels. 

The PID that changed the clock clearly belonged to Windows Time. All our servers are set to sync to one of two DCs, and this issue has only happened on our main polling engine. Time didn't change on any of the DCs, and no other servers, including our additional poller, had their times adjusted randomly. Our OS engineering team has investigated and has a case open with Microsoft, but they can't find a definitive root cause. They suspect STS (Secure Time Seeding) and are recommending we disable that feature.

We upgraded to SolarWinds Platform 2024.1 on 3/20 and this hadn't happened prior to 4/12, so whether it is SolarWinds isn't clear, but as noted it hasn't happened on any other servers. Barring any other info, has anyone seen this happen with SolarWinds or just in general? Any thoughts anyone may have?

www.kaspersky.co.in/.../

  • We encountered this issue about 5 months ago.

    Time getting randomly reset (mostly on the MPE but also on APEs).

    The worst incidents ended up with the MPE System Time being reset to 51 days into the future for about 12 hours - had to "wait" for almost 2 months before some DB entries aged past and stopped confusing us with weird dates/results.

    Spent a lot of effort working out the issue. Trawling the Time Service logs and researching the issue seemed to point to STS (learnt more than I'll ever need to know about time syncing on Windows).

    Cases were opened, including one by our Server team with Microsoft (which was useless - they were no help and appeared to know less than us anyway).

    Ended up convincing org that we needed to disable STS which was done.

    No issues since.

    In the course of tracking this there was the odd THWACK post that mentioned STS, but no definitive "STS is the issue, turn it off", so they helped to expose STS as something to look at but not as the silver bullet.

    It's obvious that STS is an issue and can clearly cause chaos with SolarWinds (polluting the DB with invalid data/timestamps) and I'm surprised (not really) that:

    a) There is no KB related to this with instructions to ensure STS is disabled on SolarWinds servers

    b) The Install/Admin guides and (particularly) the Server etc. Requirements specs don't instruct to check if STS is enabled and ensure it is disabled.

    One could assume that admins who build Windows Servers won't have STS enabled (after all - if you look at why STS exists it's obvious that it's not designed/required for Enterprise Systems) but I have yet to work anywhere where admins actually understand what they're working with and they will just cook a Windows image/install without looking/changing the defaults (and STS in typical Windows style will be on for Windows Server - even Microsoft don't have a clue).

    STS is fundamentally broken (it is known that a "random" decision that it will use the SSL mechanisms can occur, but MS don't care and it won't be fixed), and for Servers where lots of SSL connections are being made (i.e. SolarWinds MPE) the chances increase significantly. Once it occurs it appears to happen regularly: once it started for us we started to see the "random" resets regularly. Most had no impact, but many (10 or so?) reset the clock by 51, 15, 10, 7 or 3 days (past and/or future).

    SolarWinds really needs to include advice about checking/setting this in the Requirements/Install doc.

    If left on it won't be an issue until it is.

  • These are all valid points. I'll have this discussed internally to see if we can make changes to our documentation to have this clearly stated. For the most part, I am glad that disabling STS has resolved the issue.
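
For anyone wanting to check whether STS (Secure Time Seeding) is currently on for a polling engine, here is a minimal read-only sketch in Python. It assumes STS is controlled by the `UtilizeSslTimeData` DWORD under `HKLM\SYSTEM\CurrentControlSet\Services\W32Time\Config` (1 = on, 0 = off), which is how it is commonly described; please verify that against Microsoft's documentation for your Windows Server version. The function names are illustrative, and a missing value is reported as "unknown" rather than guessed at.

```python
import sys

# Assumed location of the Secure Time Seeding switch (verify for your OS build).
W32TIME_CONFIG_KEY = r"SYSTEM\CurrentControlSet\Services\W32Time\Config"
STS_VALUE_NAME = "UtilizeSslTimeData"


def interpret_sts_value(raw):
    """Map the raw registry value to 'enabled', 'disabled' or 'unknown'."""
    if raw is None:
        return "unknown"  # value not present; do not assume a default
    return "disabled" if raw == 0 else "enabled"


def read_sts_value():
    """Read the DWORD from the local registry; returns None if it is absent."""
    import winreg  # Windows-only standard library module

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, W32TIME_CONFIG_KEY) as key:
        try:
            value, _value_type = winreg.QueryValueEx(key, STS_VALUE_NAME)
        except FileNotFoundError:
            return None
    return value


if __name__ == "__main__":
    if sys.platform != "win32":
        print("This check only applies to Windows hosts.")
    else:
        print(f"Secure Time Seeding: {interpret_sts_value(read_sts_value())}")
```

The sketch deliberately does not change anything. If the result is "enabled", turning it off is normally done by setting that value to 0 and then restarting the Windows Time service (or applying it via Group Policy), ideally through whatever change process your server team uses.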
