System time changing on main polling engine

We've got a strange condition that's happened twice in the last two weeks. The system time on our main polling engine has randomly changed and then changed back a few minutes later. In the latest occurrence, it changed on 4/20 at 10:06:21 to 8/18/2024 at 1:21:34am. This set off a chain reaction: SolarWinds components failed, reports and alerts ran and got out of sync, and database maintenance ran and purged large amounts of detail data that it thought it needed to remove. DB maintenance has been failing since the first time jump. The time changed back approximately 5 minutes after the initial switch.

For the DB maintenance failures I do have a case open with support which is, to be generous, spinning its wheels. 

The PID that made the change clearly belonged to the Windows Time service. All our servers are set to sync to one of two DCs, and this issue has only happened on our main polling engine. Time didn't change on any of the DCs and no other servers had their times adjusted randomly; not our additional poller, nor any other servers. Our OS engineering team has investigated and has a case open with Microsoft, but they can't find a definitive root cause. They are suggesting that it may be Secure Time Seeding (STS) and are recommending disabling that feature.
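
For anyone who wants to check whether STS is on for a given server, here is a minimal Python sketch. It assumes the switch is the `UtilizeSslTimeData` DWORD under `HKLM\SYSTEM\CurrentControlSet\Services\W32Time\Config`, with 0 meaning disabled and any other value enabled. The function names are made up for illustration, and the registry read only does anything on Windows.

```python
import sys

# Assumed location of the STS switch; verify against your own servers.
W32TIME_CONFIG = r"SYSTEM\CurrentControlSet\Services\W32Time\Config"
STS_VALUE_NAME = "UtilizeSslTimeData"


def describe_sts(value):
    """Map the raw DWORD (or None if absent) to a human-readable state."""
    if value is None:
        return "not set (OS default applies)"
    return "disabled" if value == 0 else "enabled"


def read_sts_value():
    """Return the DWORD from the registry, or None if it is unavailable."""
    if sys.platform != "win32":
        return None  # No Windows registry on this platform.
    import winreg  # Windows-only standard library module.
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, W32TIME_CONFIG) as key:
            value, _value_type = winreg.QueryValueEx(key, STS_VALUE_NAME)
            return value
    except FileNotFoundError:
        return None  # Key or value does not exist.


if __name__ == "__main__":
    print("STS is", describe_sts(read_sts_value()))
```

The commonly cited way to turn it off is `reg add HKLM\SYSTEM\CurrentControlSet\Services\W32Time\Config /v UtilizeSslTimeData /t REG_DWORD /d 0 /f`; confirm the exact steps, and whether a reboot or time service reload is needed, with your OS team before changing anything.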

We upgraded to SolarWinds Platform 2024.1 on 3/20 and this hadn't happened prior to 4/12, so whether it is SolarWinds isn't clear, but as noted it hasn't happened on any other servers. Barring any other info, has anyone seen this happen with SolarWinds or just in general? Any thoughts would be welcome.

www.kaspersky.co.in/.../

  • Are you perhaps using VMware to host this system? If so, on the VM go to Edit settings -> VM Options tab, expand VMware Tools, and check whether Synchronize Time with Host is checked.

    If this is the case, you may want to check the time settings on your host and sync them with your timeserver.

  • Thanks. The VMs in our environment are set to only sync on startup; "sync periodically" is disabled. The systems themselves sync with our DCs, and none of them changed, nor did any of the other systems. Just the Orion main polling engine.

    Barring any other info, we disabled STS and are now crossing our fingers that this never happens again.

  • We encountered this issue about 5 months ago.

    Time was getting randomly reset (mostly on the MPE but also on APEs).

    The worst incidents ended up with the MPE System Time being reset to 51 days into the future for about 12 hours - had to "wait" for almost 2 months before some DB entries aged past and stopped confusing us with weird dates/results.

    We spent a lot of effort working out the issue. Trawling the Time Service logs and researching the issue seemed to point to STS (learnt more than I'll ever need to know about time syncing on Windows).

    Cases were opened, including one by our Server team with Microsoft (which was useless: they were no help and appeared to know less than us anyway).

    Ended up convincing the org that we needed to disable STS, which was done.

    No issues since.

    In the course of tracking this down there was the odd THWACK post that mentioned STS, but no definitive "STS is the issue, turn it off", so they helped to expose STS as something to look at but not as the silver bullet.

    It's obvious that STS is an issue and can clearly cause chaos with SolarWinds (polluting the DB with invalid data/timestamps) and I'm surprised (not really) that:

    a) There is no KB related to this with instructions to ensure STS is disabled on SolarWinds servers

    b) The Install/Admin guides and (particularly) the server requirements specs don't tell you to check whether STS is enabled and make sure it is disabled.

    One could assume that admins who build Windows Servers won't have STS enabled (after all, if you look at why STS exists it's obvious that it's not designed or required for enterprise systems), but I have yet to work anywhere where admins actually understand what they're working with. They will just cook a Windows image/install without looking at or changing the defaults (and STS, in typical Windows style, will be on for Windows Server; even Microsoft don't have a clue).

    STS is fundamentally broken (it is known that a "random" decision that it will use the SSL mechanisms can occur, but MS don't care and it won't be fixed), and for servers where lots of SSL connections are being made (i.e. the SolarWinds MPE) the chances increase significantly. Once it occurs it appears to happen "regularly": once it started for us we started to see the "random" resets regularly. Most had no impact, but many (10 or so?) had reset by 51, 15, 10, 7, or 3 days (past and/or future).

    SolarWinds really needs to include advice about checking/setting this in the Requirements/Install doc.

    If left on it won't be an issue until it is.

  • Sort of glad to know we're not alone in having seen this. Did you or your enterprise see this only on SolarWinds servers, or also on any other servers? We only saw it on our Orion MPE, but so many other tools are time dependent. As we haven't seen it elsewhere (yet), I'm wondering if it is something within Orion itself that happened to cause STS to "randomly" wake up and do this. Good luck proving it though.

    That said, I am of the mind to have our admins disable STS on other servers in my purview, if not across the enterprise. It doesn't seem to have much/any value for us and has only caused us grief.

  • We saw it on our servers. Not aware of anyone else seeing it on other servers, but that does not mean it wasn't (isn't) happening.

    Chances are that it has caused an issue in our environment at some point, but unless you spend a lot of time digging and have the ability to go "I see something that indicates a time change, how does that work and what things are related to that", it will never be understood or resolved. Most issues are along the lines of "had a problem, funny, time seems to have jumped and then back, no idea why that happened, nothing pinpointing it, and I don't have a clue how time stuff works". Reboot, it's fixed, close ticket with "time issue - unknown cause", next.

    STS setting the time is (or can be) happening all the time. It only becomes an issue when it results in the time being set to some obviously out-of-range value. So it "may" be getting used, but the increments are small and get reset by the next round of sensible NTP syncing, so no one ever notices and the jumps aren't large enough or long enough to cause issues with Windows or apps.
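
    A rough way to catch these jumps, including the small ones nobody notices, is to compare the wall clock against a monotonic clock that cannot be stepped. The sketch below is a minimal Python illustration of that idea; `find_jumps`, `take_sample`, the 2-second tolerance, and the sample data are all made up for the example and are not part of Windows or SolarWinds.

```python
import time


def find_jumps(samples, tolerance=2.0):
    """Return (index, drift_seconds) for each sample where the wall clock
    moved differently from the monotonic clock by more than `tolerance`."""
    jumps = []
    for i in range(1, len(samples)):
        mono_prev, wall_prev = samples[i - 1]
        mono_now, wall_now = samples[i]
        drift = (wall_now - wall_prev) - (mono_now - mono_prev)
        if abs(drift) > tolerance:
            jumps.append((i, drift))
    return jumps


def take_sample():
    """One (monotonic, wall) reading; call this on a schedule and log it."""
    return (time.monotonic(), time.time())


# Synthetic readings taken 60 s apart: the wall clock leaps 51 days ahead
# at the third reading, then steps back at the fourth.
DAY = 86400
samples = [
    (0.0, 1_000_000.0),
    (60.0, 1_000_060.0),
    (120.0, 1_000_120.0 + 51 * DAY),
    (180.0, 1_000_180.0),
]
print(find_jumps(samples))
```

    Logging the drift rather than just a yes/no flag would also show the small corrections, which might help tell routine NTP adjustments apart from the large steps.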

    The issue gets bad when (due to the known random behaviour) STS returns a value from its computation that is wildly wrong and that value is large. The default settings for how far into the future or past the clock is allowed to be set is ~120 days, so the system will happily (like in our case) set it 51 days ahead. From memory, this particular instance didn't reset to normal time. I think that when an NTP sync occurred the time service couldn't reset (or, more likely, the NTP sync didn't happen at all because the local time was so far out that it caused a connection/auth issue or something).
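
    To make the window idea concrete, here is a small Python sketch of that check. The 120-day limit is just the from-memory figure quoted above, not a verified Windows default, the year for the 4/20 timestamp is assumed to be 2024, and `correction_allowed` is a hypothetical helper rather than anything the time service exposes.

```python
from datetime import datetime, timedelta

# The ~120-day limit is the figure quoted from memory in this thread,
# not a verified Windows default.
ASSUMED_MAX_CORRECTION = timedelta(days=120)


def correction_allowed(current, proposed, limit=ASSUMED_MAX_CORRECTION):
    """True if stepping the clock from `current` to `proposed` stays
    within `limit` in either direction."""
    return abs(proposed - current) <= limit


now = datetime(2024, 4, 20, 10, 6, 21)  # year assumed to be 2024

# A 51-day jump fits well inside the assumed window.
print(correction_allowed(now, now + timedelta(days=51)))

# The jump from the original post: 4/20 10:06:21 -> 8/18/2024 01:21:34.
reported = datetime(2024, 8, 18, 1, 21, 34)
print(reported - now, correction_allowed(now, reported))

# Something a year out would be rejected under this assumed limit.
print(correction_allowed(now, now + timedelta(days=365)))
```

    Under these assumptions the jump in the original post works out to just under 120 days, so it would sit right at the edge of that window. That is only an observation from the arithmetic, not a confirmed explanation.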

    It really should be disabled on all servers in an enterprise environment. It's not needed and is likely to cause issues whose cause will be hard to pinpoint. Enterprise environments have NTP servers, and VMs get their (at least initial) clocks from the hypervisor level (which itself usually has a robust NTP setup). STS is not needed there and is not serving the purpose for which it was "designed": modern devices (laptops etc.) with no battery-backed clock (RTC) that need some mechanism to "roughly" determine real time at boot, so the OS can start and connect with auth to then sync to an ongoing time source.

    Told you I learnt more than I'll ever need to know about time syncing on Windows :-)

    The issue you will likely have is convincing your admins that it should be disabled. From experience, the admins "know better" and you don't know what you are talking about etc.

    This is (generally) because they don't understand any of it and their brains start melting as soon as you begin explaining.

    Good luck - I have raised that we should get this disabled everywhere but it's gone into the abyss of "I don't understand this and it's a lot of work for me so I'll note it so he's happy and promptly forget about it".

  • These are all valid points. I'll have this discussed internally to see if we can make changes to our documentation to have this clearly stated. For the most part, I am glad that disabling STS has resolved the issue.