4 Replies Latest reply on Oct 17, 2018 2:47 PM by whomademesme

    SQL block, lock, deadlock - System Down

    whomademesme

      Community,

       

      I have been experiencing a large number of system down incidents (29 as of yesterday) over the past couple months. These are caused by SQL deadlocks. I need assistance determining root cause. Before you ask, Orion is completely updated, SQL server 2016, 12 cpu cores per VM, and minimum 128MB RAM. These servers are fully updated and overpowered. The queries are the same every time, pulling from the events table. Any assistance or guidance would be appreciated. Please do not ask me to put in a support ticket, I already have two with no progress made, including one for system down right now (been on hold too long).

       

       

       

      Blocking Session (From Primary Server):

        Application name: SolarWinds.Collector.Service@domain-SWCollectorService-0

       

          First 2KB of SQL: (@engineId int,@eventType int,@eventDate int)SELECT MAX(e.EventTime), e.NetObjectID FROM Events e

                                                     INNER JOIN Nodes n ON n.NodeID = e.NetObjectID

                                                     WHERE n.EngineID = @engineId AND e.EventType = @eventType AND e.EventTime >= DATEADD(D, -@eventDate, GETDATE()) GROUP BY e.NetObjectID

       

       

       

       

      Blocked session (From additional poller):

        Application name: SolarWinds.Collector.Service@domain-SWCollectorService-0

       

          First 2KB of SQL:

      CREATE PROCEDURE [dbo].[apm_AddEvent]

      @NetworkNode int,

      @NetObjectID int,

      @NetObjectType char(10),

      @EventType int,

      @Message ntext,

      @EventTimeUtc datetime = NULL

                     AS

      BEGIN

                     IF ((@EventType IS NOT NULL) AND ((@Message IS NOT NULL) AND (DATALENGTH(@Message) > 0)) AND ((Select Record From EventTypes where EventType = @EventType) = 1))

                                    BEGIN

                                                   SET NOCOUNT ON;

       

                                                   DECLARE @EventTimeLocal datetime

       

                                                   IF @EventTimeUtc IS NULL

                                                                  SET @EventTimeLocal = GETDATE()

                                                   ELSE

                                                                  SET @EventTimeLocal = DATEADD(MINUTE, DATEDIFF(MINUTE, GETUTCDATE(), GETDATE()), @EventTimeUtc)

       

                                                   INSERT INTO Events (EventTime, NetworkNode, NetObjectID, NetObjectType, EventType, Message, Acknowledged)

                                                   VALUES (@EventTimeLocal, @NetworkNode, @NetObjectID, @NetObjectType, @EventType, @Message, 0)

                                    END

      END

        • Re: SQL block, lock, deadlock - System Down
          mesverrum

          Looking at these I would be curious, is your events table unusually large?  Has the database maintenance been running successfully?  There was a bug running around a bit ago where certain conditions would cause people to have a crazy amount of noise in their events table, I wonder if some variation on that issue is what's getting you.  Are there specific types of events or specific entities that show up in your system with way more events than others, in a way that doesn't seem right? 

            • Re: SQL block, lock, deadlock - System Down
              whomademesme

              Looking at these I would be curious, is your events table unusually large?

              -Define unusually large. Is it large by my standards? Yes. Number of devices in monitoring alone make this table large.

               

              Has the database maintenance been running successfully?

              -Tables were heavily fragmented, not the events table though. This has been corrected, incidents still occur.

               

              There was a bug running around a bit ago where certain conditions would cause people to have a crazy amount of noise in their events table, I wonder if some variation on that issue is what's getting you.

              -Can you elaborate on this bug?

               

              Are there specific types of events or specific entities that show up in your system with way more events than others, in a way that doesn't seem right?

              -None that I cannot explain or otherwise do not expect.

               

               

               

              I am wondering if it is possible that a cloned production environment that communicates through the API can make a request, and suddenly not be available to receive the response.This environment queries the events table. i.e. If the cloned environment was stood up, made an API quest for data in the events table, and was disconnected before receiving the response, could this cause a deadlock (someone goes whoops i meant to disconnect that first)?

                • Re: SQL block, lock, deadlock - System Down
                  mesverrum

                  In most environments I would say an events table that is larger than about a gigabyte would be weird and potentially cause issues.  Also a factor in this could be your event retention setting (you can see this under all settings > polling settings), I see a lot of people change this from the default 30 days to something like a year and SW can generate a LOT of events in a year and having to query against that dataset might take longer than what the devs had in mind when the tool was built, so I discourage that.

                   

                  I wouldn't expect an API call to lock the table indefinitely, but i could see how running a complex enough query against a large table could maybe interfere with normal operations, but this is entirely speculation and I've never personally seen such a thing in the wild. 

                  1 of 1 people found this helpful
                    • Re: SQL block, lock, deadlock - System Down
                      whomademesme

                      The events table is showing as the second largest at 1.4GB, with NCM_CacheDiffResults at 1.7GB. Default settings of 30 day retention. I am monitoring the subproduction to API enviornment change first (scripted to disconnect before future clonings). If there are more incidents after that change, I will reduce retention from 30 days to 25 and let you know.