12 Replies Latest reply on Nov 18, 2015 3:03 PM by prawij

    Last 50 events causing timeouts and DB blocking

    mdriskell

      So I had to open a case yesterday (Case #376623) because many of the views I have written are timing out.  Through the process of elimination I was able to determine that it was the Last 50 Events component as the issue.  As soon as I remove that from the views the pages load with very little delay.

       

      I had my DBA take a look at the health of our DB with a tool they use and they found one query was causing considerable blocking on the database.  The query in question is calling the events table (query included below for reference).  He also was able to pull a report showing the wait time this particular query is causing on my system and the results were disturbing.  The query shows wait times of close to 15 hours (per day).  I'm no SQL expert so I'm not sure why a query that should simply be pulling a TOP 50 could be killing our performance.

       

      I have submitted diagnostics to support and waiting for information.  If they come back and tell me disk IO (which is a known problem on my system until we upgrade) I might lose it.  Upgrading to NPM 10.3.1 took our average disk queue length down to 3 from over 7 which is better but still not ideal.  We are hoping to move to our new SQL server in the next month.

       

      (QUERY CAUSING THE BLOCKING)

      (@maxRecords int,@fromDate datetime,@toDate datetime)

      SELECT TOP (@maxRecords) NetObjectType, 

         NetObjectID, 

         NetObjectID2, 

         EventID, 

         Acknowledged, 

         EventTime, 

         Events.EventType as EventType, 

         Message, 

         ISNULL(EventTypes.BackColor, 0) as BackColor, 

         Nodes.NodeID as LinkNodeID 

      FROM Events 

      LEFT JOIN Nodes 

      ON Events.NetworkNode = Nodes.NodeID 

      LEFT JOIN EventTypes 

      ON Events.EventType = EventTypes.EventType 

      WHERE Acknowledged='false' 

      AND EventTime >= @fromDate 

      AND EventTime <= @toDate 

      ORDER BY EventID DESC

        • Re: Last 50 events causing timeouts and DB blocking
          mdriskell

          So I've written what I believe to be a more efficient query to grab the last 50 events.  I would just throw together a report but that doesn't give me the color coding so I'd have to figure out how to put this data into HTML.

           

          SELECT  NetObjectType,

             NetObjectID,

             NetObjectID2,

             EventID,

             Acknowledged,

             EventTime,

             Events.EventType as EventType,

             Message,

             ISNULL(EventTypes.BackColor, 0) as BackColor,

             Nodes.NodeID as LinkNodeID

          FROM Events

          LEFT JOIN Nodes

          ON Events.NetworkNode = Nodes.NodeID

          LEFT JOIN EventTypes

          ON Events.EventType = EventTypes.EventType

          WHERE EventID > (SELECT MAX(EventId) - 50 FROM Events)

          AND Acknowledged='false' 

           

          ORDER BY EventID DESC

            • Re: Last 50 events causing timeouts and DB blocking
              mdriskell

              Initial response from support.

              Hello Mike;

                        Thank you for the information, could you please send me the Hardware configuration for you SQL and for all three SolarWinds servers.  I will need the CPU configuration, Total RAM, HDD Configuration, and if the servers are Physical or Virtual.

              Also let's plan a down time and remote session at the same, 

               

              I explained I will need more info as to what might be causing the issue before I can schedule down time.  I can't down a prod system from monitoring our entire enterprise because one view is malfunctioning.

            • Re: Last 50 events causing timeouts and DB blocking
              mdriskell

              Thanks Steve,  I sent in my specs and I am awaiting their response.  I'm well aware my SQL HD configuration is a bottleneck and am expecting that to be fully blamed however considering I can pull up the main events page which by default shows 2000 events in a couple of seconds but the last 50 events component completely times out I don't see how you can blame hardware.  Something else is going on with that component.

                • Re: Last 50 events causing timeouts and DB blocking
                  netlogix

                  I didn't follow your response about the SQL HD stuff, was that directed towards me?

                   

                  I had also noticed that the "last 50" events are slow for my enviroment and I think it is because the clustered index is based on seemingly random incoming data (node ID - who can predict the order of nodes generating events).  This will cause the table to grow larger than it should and cause slow indexes because it is always needing to re-organize the table (a clustered index says what order you want the data to be stored in the database).  By switching to IX_EventTime as the clustered index, it keeps the database from massively fragment itself by just inserting new rows.

                   

                  I don't think it can be completely blamed on your hardware. (oh, there are other tables in the SW databases that have indexes that don't make sense to me)

                    • Re: Last 50 events causing timeouts and DB blocking
                      mdriskell

                      No that was directed toward support .  I know we have bottlenecks and I know that's part of the problem but to not be able to return 25 events when I can pull 2000 on another page is not a hardware issue.

                        • Re: Last 50 events causing timeouts and DB blocking
                          mdriskell

                          I emailed support this morning because it has been two days since I have heard anything.  I need some sort of answer back.

                            • Re: Last 50 events causing timeouts and DB blocking
                              mdriskell

                              So I had a support call scheduled on Thursday to look into this matter.  The window came and went with no call from support.  I emailed in about 20 minutes after my window began and didn't get a reply until the next morning asking if I had time to do a call that day.  I was out on Friday so I emailed in this morning saying I had availability this afternoon and we settled on 1PM CT.  I was expecting a call at 1PM (40 minutes ago) and I'm still sitting here waiting. 

                               

                              I'm beyond frustrated right now

                                • Re: Last 50 events causing timeouts and DB blocking
                                  mdriskell

                                  Support finally called yesterday around 3:15 2 + hours after our planned timeframe.  I reproduced the issue for him and the first thing the guy had me do was open up a connection to the database.  He provided a query he wanted me to run.  The query was a truncate table for the trapvarbinds table.  There was no explanation of what it would do or why just here run this.  I explained that the query he provided would dump an entire table worth of data that we require and I refused to allow it to be ran.  (keep in mind our issue is with the events table and has nothing to do with the varbinds table).  Had I not known what this query would do I probably would have ran it and ended up losing data.  Is it policy for support to just have customers purge data without informing the customer of the repercussions? 

                                   

                                  I showed him that both the canned Last 25 and Last 50 events fail but I can easily pull up the events page that shows 2000 events by default.  He is going to try and replicate these issues but I am very suspect anything viable will come back. 

                                   

                                  Most of the time I praise support for their good work but this time I am less than impressed.

                                    • Re: Last 50 events causing timeouts and DB blocking
                                      mdriskell

                                      I give up...no real troubleshooting done to see why this one view won't work and I'm told to do the standard remove/repair all components.  Sure this might work but it's just masking the problem. 

                                       

                                      I requested that the case be escalated to another engineer if that doesn't happen then I will just close it.  I don't like being asked to have an entire system outage while I repair all components when no one can tell me even what is causing the issue or any real troubleshooting has been done.  All i did for support was reproduce the issue nothing else was requested.

                                        • Re: Last 50 events causing timeouts and DB blocking
                                          prawij

                                          Encountered this issue today (running NPM 11.5.2, SAM 6.2.1, WPM etc with HotFix 4 applied), first indication of trouble was the DBAs noticing our TempDB volume was filling up rapidly. Searching for the long running query online found this thread, disabled the Last xx events resource view on our Home page and under our Network tab and the TempDB space was released. Planning on a gotomeeting with support tomorrow to find the root cause.... will update this when I have more info.

                                            • Re: Last 50 events causing timeouts and DB blocking
                                              prawij

                                              So recapping the solution for this one, the issue was we had over 26 million rows in the events table (event retention set to 21 days). After setting event retention to 7 days, that dropped the events table rows to approx 8 million. At which point we could re-enable the Last 25 Events resource successfully without having issues with TempDB filling up.

                                               

                                              Regarding why the Last 25/50 Events was failing yet the Events tab was easily showing the last 2000 Events, it was mentioned by support that the query to retrieve the Last xx Events works differently in the back end as compared to the Events tab.