This discussion has been locked. The information referenced herein may be inaccurate due to age, software updates, or external references.

You can no longer post new replies to this discussion. If you have a similar question you can start a new discussion in this forum.

SQL block, lock, deadlock - System Down

whomademesme over 5 years ago

Community,

I have been experiencing a large number of system down incidents (29 as of yesterday) over the past couple months. These are caused by SQL deadlocks. I need assistance determining root cause. Before you ask, Orion is completely updated, SQL server 2016, 12 cpu cores per VM, and minimum 128MB RAM. These servers are fully updated and overpowered. The queries are the same every time, pulling from the events table. Any assistance or guidance would be appreciated. Please do not ask me to put in a support ticket, I already have two with no progress made, including one for system down right now (been on hold too long).

Blocking Session (From Primary Server):

Application name: SolarWinds.Collector.Service@domain-SWCollectorService-0

First 2KB of SQL: (@engineId int,@eventType int,@eventDate int)SELECT MAX(e.EventTime), e.NetObjectID FROM Events e

INNER JOIN Nodes n ON n.NodeID = e.NetObjectID

WHERE n.EngineID = @engineId AND e.EventType = @eventType AND e.EventTime >= DATEADD(D, -@eventDate, GETDATE()) GROUP BY e.NetObjectID

Blocked session (From additional poller):

Application name: SolarWinds.Collector.Service@domain-SWCollectorService-0

First 2KB of SQL:

CREATE PROCEDURE [dbo].[apm_AddEvent]

@NetworkNode int,

@NetObjectID int,

@NetObjectType char(10),

@EventType int,

@Message ntext,

@EventTimeUtc datetime = NULL

BEGIN

IF ((@EventType IS NOT NULL) AND ((@Message IS NOT NULL) AND (DATALENGTH(@Message) > 0)) AND ((Select Record From EventTypes where EventType = @EventType) = 1))

BEGIN

SET NOCOUNT ON;

DECLARE @EventTimeLocal datetime

IF @EventTimeUtc IS NULL

SET @EventTimeLocal = GETDATE()

ELSE

SET @EventTimeLocal = DATEADD(MINUTE, DATEDIFF(MINUTE, GETUTCDATE(), GETDATE()), @EventTimeUtc)

INSERT INTO Events (EventTime, NetworkNode, NetObjectID, NetObjectType, EventType, Message, Acknowledged)

VALUES (@EventTimeLocal, @NetworkNode, @NetObjectID, @NetObjectType, @EventType, @Message, 0)

END

Top Replies

0 mesverrum over 5 years ago

Looking at these I would be curious, is your events table unusually large? Has the database maintenance been running successfully? There was a bug running around a bit ago where certain conditions would cause people to have a crazy amount of noise in their events table, I wonder if some variation on that issue is what's getting you. Are there specific types of events or specific entities that show up in your system with way more events than others, in a way that doesn't seem right?
Cancel
Vote Up 0 Vote Down

Cancel
0 whomademesme over 5 years ago in reply to mesverrum

Looking at these I would be curious, is your events table unusually large?
-Define unusually large. Is it large by my standards? Yes. Number of devices in monitoring alone make this table large.
Has the database maintenance been running successfully?
-Tables were heavily fragmented, not the events table though. This has been corrected, incidents still occur.
There was a bug running around a bit ago where certain conditions would cause people to have a crazy amount of noise in their events table, I wonder if some variation on that issue is what's getting you.
-Can you elaborate on this bug?
Are there specific types of events or specific entities that show up in your system with way more events than others, in a way that doesn't seem right?
-None that I cannot explain or otherwise do not expect.
I am wondering if it is possible that a cloned production environment that communicates through the API can make a request, and suddenly not be available to receive the response.This environment queries the events table. i.e. If the cloned environment was stood up, made an API quest for data in the events table, and was disconnected before receiving the response, could this cause a deadlock (someone goes whoops i meant to disconnect that first)?
Cancel
Vote Up 0 Vote Down

Cancel
0 mesverrum over 5 years ago in reply to whomademesme

In most environments I would say an events table that is larger than about a gigabyte would be weird and potentially cause issues. Also a factor in this could be your event retention setting (you can see this under all settings > polling settings), I see a lot of people change this from the default 30 days to something like a year and SW can generate a LOT of events in a year and having to query against that dataset might take longer than what the devs had in mind when the tool was built, so I discourage that.
I wouldn't expect an API call to lock the table indefinitely, but i could see how running a complex enough query against a large table could maybe interfere with normal operations, but this is entirely speculation and I've never personally seen such a thing in the wild.
Cancel
Vote Up +1 Vote Down

Cancel
0 whomademesme over 5 years ago in reply to mesverrum

The events table is showing as the second largest at 1.4GB, with NCM_CacheDiffResults at 1.7GB. Default settings of 30 day retention. I am monitoring the subproduction to API enviornment change first (scripted to disconnect before future clonings). If there are more incidents after that change, I will reduce retention from 30 days to 25 and let you know.
Cancel
Vote Up 0 Vote Down

Cancel
0 jxchappell over 4 years ago

Did you ever resolve this? We are trying to stop the tables growth but it won't shrink
Cancel
Vote Up +1 Vote Down

Cancel