NPM 12.4 not stable?

Hi team, we recently upgraded our SolarWinds environment to NPM 12.4 (Orion 2018.4), along with the other modules: VNQM, IPAM, and NCM.

Since the upgrade we have been facing issues with the database: the transaction log grows to more than 1 TB and blocks further operations, to the point that we cannot even log in to the SolarWinds console. The DB team has tried every option they had, but the transaction log growth cannot be stopped.

The DB team has now found one command that is causing blocking and preventing further operations, and they suspect it might be the root cause. Could you please check and let us know whether this is a potential cause or whether it is something else, and what can be done to resolve it?

We were initially on the Full recovery model and have since changed to Simple, but the issue persists under the Simple recovery model.

The same delete operation has been running from two different hosts, CHIRDCSWPE-01 and CHIRDCSWPE-05, and it is causing the blocking.

329 Poller-01 2019-03-21 21:01:48.873

DELETE Wireless_Clients WHERE NodeID = @NodeID AND LastUpdate < @PollTi

430 Poller-05 2019-03-21 21:01:25.440

DELETE Wireless_Clients WHERE NodeID = @NodeID AND LastUpdate < @PollTi

aLTeReGoKMSigma

  • Team, this seems to be a potential bug in NPM 12.4, since we are being blocked very frequently because of the Wireless Controllers query. See below; we hit it again today.

    Spid-322  running from the host Poller-01 

          

    Currently running script with the session 322-

    (@NodeID int,@PollTime datetime)

    DELETE Wireless_Controllers WHERE NodeID = @NodeID AND LastUpdate < @PollTime

    DELETE Wireless_AccessPoints WHERE NodeID = @NodeID AND LastUpdate < @PollTime

    DELETE Wireless_Interfaces WHERE NodeID = @NodeID AND LastUpdate < @PollTime

    DELETE Wireless_Clients WHERE NodeID = @NodeID AND LastUpdate < @PollTime

    DELETE Wireless_Rogues WHERE NodeID = @NodeID AND LastUpdate < @PollTime

    Session 331 running from the host Poller-01        

    Currently running script with the session 331-

    (@NodeID int,@PollTime datetime)

    DELETE Wireless_Controllers WHERE NodeID = @NodeID AND LastUpdate < @PollTime

    DELETE Wireless_AccessPoints WHERE NodeID = @NodeID AND LastUpdate < @PollTime

    DELETE Wireless_Interfaces WHERE NodeID = @NodeID AND LastUpdate < @PollTime

    DELETE Wireless_Clients WHERE NodeID = @NodeID AND LastUpdate < @PollTime

    DELETE Wireless_Rogues WHERE NodeID = @NodeID AND LastUpdate < @PollTime

  • This will require some deeper troubleshooting.  I'd like to get the diagnostic file to look into.  Are you able to open a ticket on this and provide me with the ticket number?

  • Hi, we have a case open and the diagnostics have been uploaded. The case number is below.

    00282454

  • We have uploaded the diagnostics on the case, please check. Environment is very unstable and breaking again and again.

    cobrien

  • Is there an update on this? We are about to roll out a new environment and are planning an NPM 12.4 upgrade shortly after that.

  • A hotfix has been released for this issue; we are still validating it.

  • I am still having issues after upgrading to NPM 12.4 with NTA 4.4 to SQL at the end of Feb 2019. I ran the latest updates to see whether that would correct it, and it has not.

    Orion Platform 2018.4 HF3, NCM 7.9, NPM 12.4, DPAIM 11.1.1, NTA 4.5.0, SAM 6.7.1, NetPath 1.1.4

    I opened a case (Case # 00281166, web page timeout after upgrade to NPM 12.4). I am now on my third SW technical support person.

    SW had me do the following:

    * Said it's the NTA server causing the issue. NTA SQL Server: increased RAM to 64 GB and CPU to 12 cores.

    * Applied AV Exceptions to ALL SW boxes.

    Web timeouts continue when pulling up Node Details, editing NCM jobs, pulling up NetFlow stats, running reports, applying APM templates....

    A large number of long-running queries are seen.

    2019-04-15 10:20:50,607 [123] WARN SolarWinds.InformationService.Core.InformationService - (null) (null)  Support! -- LONG RUNNING QUERY: OperationContextId ffd3d3f8-520a-41f8-bac1-9afd6af5ab12 - Query took 22395.0432 ms: SELECT (SUM(IngressBytes) / (60.0 * 2)) as IngressTotalBytes, (SUM(EgressBytes) / (60.0 * 2)) as EgressTotalBytes, SourceIP, DestinationIP, InterfaceIDRx, InterfaceIDTx FROM Orion.Netflow.Flows WHERE TimeStamp > @startTime AND TimeStamp <= @endTime AND (InterfaceIDRx IN (107204,100560,87253,107203) OR InterfaceIDTx IN (107204,100560,87253,107203)) GROUP BY SourceIP, DestinationIP, InterfaceIDRx, InterfaceIDTx RETURN XML RAW
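
For anyone triaging the same symptom, here is a minimal sketch for pulling the LONG RUNNING QUERY entries out of the log and ranking them by duration, so the slowest queries can be attached to the support case. The regular expression is modelled only on the single log line quoted above, so the exact layout of other entries is an assumption, and the names `parse_long_query` and `slowest` are made up for illustration.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Pattern modelled on the one log line quoted in this thread (assumption:
# other LONG RUNNING QUERY entries follow the same layout).
LONG_QUERY = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r".*?LONG RUNNING QUERY: .*?"
    r"Query took (?P<ms>\d+(?:\.\d+)?) ms: (?P<query>.*)$"
)


class LongQuery(NamedTuple):
    timestamp: str      # timestamp text exactly as written in the log
    duration_ms: float  # value reported after "Query took"
    query: str          # query text following the duration


def parse_long_query(line: str) -> Optional[LongQuery]:
    """Return the parsed entry, or None if the line is not a long-query warning."""
    match = LONG_QUERY.match(line.strip())
    if match is None:
        return None
    return LongQuery(match["timestamp"], float(match["ms"]), match["query"])


def slowest(lines: Iterable[str], top: int = 10) -> List[LongQuery]:
    """Return up to `top` long-running queries, slowest first."""
    hits = [q for q in map(parse_long_query, lines) if q is not None]
    return sorted(hits, key=lambda q: q.duration_ms, reverse=True)[:top]
```

Feeding it the lines of the log file (for example `slowest(open(path), top=20)`, where `path` is wherever your log lives) gives a short list of the worst offenders, which makes it easier to see whether the NetFlow queries dominate or whether other queries are just as slow.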

  • We are also intermittently facing issues with the SolarWinds DB. After a week or so we start seeing DB issues such as very frequent locks caused by long-running queries, and the transaction logs fill up even though an automated script purges them on a schedule.

  • We are getting non-stop blocks due to the query below. Is that the right conclusion, or is it some kind of bug?

         (@NodeID int,@PollTime datetime)

         DELETE Wireless_Controllers WHERE NodeID = @NodeID AND LastUpdate < @PollTime

         DELETE Wireless_AccessPoints WHERE NodeID = @NodeID AND LastUpdate < @PollTime 

         DELETE Wireless_Interfaces WHERE NodeID = @NodeID AND LastUpdate < @PollTime 

         DELETE Wireless_Clients WHERE NodeID = @NodeID AND LastUpdate < @PollTime

         DELETE Wireless_Rogues WHERE NodeID = @NodeID AND LastUpdate < @PollTime
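
One way to sanity-check that conclusion is to confirm the DELETE session sits at the head of the blocking chain rather than being a victim of something else. Below is a rough sketch, assuming the DB team can export pairs of session id and blocking session id, with 0 meaning "not blocked" (the convention of `blocking_session_id` in SQL Server's `sys.dm_exec_requests`). The function names and the sample mapping in the usage note are hypothetical; only the session numbers 322 and 331 come from this thread, and the relationships between them are invented for illustration.

```python
from collections import Counter
from typing import Dict


def head_blocker(session_id: int, blocked_by: Dict[int, int]) -> int:
    """Follow blocked_by links until reaching a session that is not itself blocked."""
    seen = set()
    current = session_id
    while current in blocked_by and blocked_by[current] != 0:
        if current in seen:
            # A cycle means the sessions block each other; stop walking.
            break
        seen.add(current)
        current = blocked_by[current]
    return current


def count_victims(blocked_by: Dict[int, int]) -> Counter:
    """Count how many blocked sessions trace back to each head blocker."""
    return Counter(
        head_blocker(session, blocked_by)
        for session, blocker in blocked_by.items()
        if blocker != 0
    )
```

With a made-up mapping such as `{322: 0, 331: 322, 340: 331}`, `count_victims` attributes both blocked sessions to 322. If the head blocker in the real data turns out to be running something other than the Wireless DELETE, then the DELETE is more likely a victim than the root cause.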