This discussion has been locked. The information referenced herein may be inaccurate due to age, software updates, or external references.
You can no longer post new replies to this discussion. If you have a similar question you can start a new discussion in this forum.

SQL block, lock, deadlock - System Down

Community,

I have been experiencing a large number of system down incidents (29 as of yesterday) over the past couple months. These are caused by SQL deadlocks. I need assistance determining root cause. Before you ask, Orion is completely updated, SQL server 2016, 12 cpu cores per VM, and minimum 128MB RAM. These servers are fully updated and overpowered. The queries are the same every time, pulling from the events table. Any assistance or guidance would be appreciated. Please do not ask me to put in a support ticket, I already have two with no progress made, including one for system down right now (been on hold too long).

Blocking Session (From Primary Server):

  Application name: SolarWinds.Collector.Service@domain-SWCollectorService-0

    First 2KB of SQL: (@engineId int,@eventType int,@eventDate int)SELECT MAX(e.EventTime), e.NetObjectID FROM Events e

                                               INNER JOIN Nodes n ON n.NodeID = e.NetObjectID

                                               WHERE n.EngineID = @engineId AND e.EventType = @eventType AND e.EventTime >= DATEADD(D, -@eventDate, GETDATE()) GROUP BY e.NetObjectID

Blocked session (From additional poller):

  Application name: SolarWinds.Collector.Service@domain-SWCollectorService-0

    First 2KB of SQL:

CREATE PROCEDURE [dbo].[apm_AddEvent]

@NetworkNode int,

@NetObjectID int,

@NetObjectType char(10),

@EventType int,

@Message ntext,

@EventTimeUtc datetime = NULL

               AS

BEGIN

               IF ((@EventType IS NOT NULL) AND ((@Message IS NOT NULL) AND (DATALENGTH(@Message) > 0)) AND ((Select Record From EventTypes where EventType = @EventType) = 1))

                              BEGIN

                                             SET NOCOUNT ON;

                                             DECLARE @EventTimeLocal datetime

                                             IF @EventTimeUtc IS NULL

                                                            SET @EventTimeLocal = GETDATE()

                                             ELSE

                                                            SET @EventTimeLocal = DATEADD(MINUTE, DATEDIFF(MINUTE, GETUTCDATE(), GETDATE()), @EventTimeUtc)

                                             INSERT INTO Events (EventTime, NetworkNode, NetObjectID, NetObjectType, EventType, Message, Acknowledged)

                                             VALUES (@EventTimeLocal, @NetworkNode, @NetObjectID, @NetObjectType, @EventType, @Message, 0)

                              END

END

  • Looking at these I would be curious, is your events table unusually large?  Has the database maintenance been running successfully?  There was a bug running around a bit ago where certain conditions would cause people to have a crazy amount of noise in their events table, I wonder if some variation on that issue is what's getting you.  Are there specific types of events or specific entities that show up in your system with way more events than others, in a way that doesn't seem right? 

  • Looking at these I would be curious, is your events table unusually large?

    -Define unusually large. Is it large by my standards? Yes. Number of devices in monitoring alone make this table large.

    Has the database maintenance been running successfully?

    -Tables were heavily fragmented, not the events table though. This has been corrected, incidents still occur.

    There was a bug running around a bit ago where certain conditions would cause people to have a crazy amount of noise in their events table, I wonder if some variation on that issue is what's getting you.

    -Can you elaborate on this bug?

    Are there specific types of events or specific entities that show up in your system with way more events than others, in a way that doesn't seem right?

    -None that I cannot explain or otherwise do not expect.

    I am wondering if it is possible that a cloned production environment that communicates through the API can make a request, and suddenly not be available to receive the response.This environment queries the events table. i.e. If the cloned environment was stood up, made an API quest for data in the events table, and was disconnected before receiving the response, could this cause a deadlock (someone goes whoops i meant to disconnect that first)?

  • In most environments I would say an events table that is larger than about a gigabyte would be weird and potentially cause issues.  Also a factor in this could be your event retention setting (you can see this under all settings > polling settings), I see a lot of people change this from the default 30 days to something like a year and SW can generate a LOT of events in a year and having to query against that dataset might take longer than what the devs had in mind when the tool was built, so I discourage that.

    I wouldn't expect an API call to lock the table indefinitely, but i could see how running a complex enough query against a large table could maybe interfere with normal operations, but this is entirely speculation and I've never personally seen such a thing in the wild. 

  • The events table is showing as the second largest at 1.4GB, with NCM_CacheDiffResults at 1.7GB. Default settings of 30 day retention. I am monitoring the subproduction to API enviornment change first (scripted to disconnect before future clonings). If there are more incidents after that change, I will reduce retention from 30 days to 25 and let you know.

  • Did you ever resolve this?  We are trying to stop the tables growth but it won't shrink