19 Replies Latest reply on Apr 5, 2018 12:26 PM by ryan.davis26

    reporting out of SWIS API

    ryan.davis26

      We have a reporting platform that we're trying to leverage the solarwinds API to query data from.  It uses a homegrown tool that can use an API to ingest data.  This works great for availability, cpu, disk, memory and response time statistics.

       

      However, we're running into issues with interface related tables where the queries are timing out and the SWIS service itself starts to consume over 12GB of ram and then doesn't go back down until it is restarted, here is an example of one query:

       

      select Nodes.Caption as DEVICE_NAME, Nodes.IP_Address as DEVICE_IP,  'InPercentUtil' as PARAMETER_NAME, InterfaceTraffic.DateTime as CHECK_TIME
      ,InterfaceTraffic.InPercentUtil as OUTPUT_DATA,  interfaces.TypeDescription as 
      OUTPUT_DATA2, Interfaces.Caption as OUTPUT_DATA3
      from orion.Nodes, orion.npm.InterfaceTraffic, orion.npm.Interfaces
      where InterfaceTraffic.nodeID=Nodes.NodeID AND interfaces.nodeid=nodes.nodeid and MINUTEDIFF(InterfaceTraffic.DateTime,getdate()) = 1 

       

      tdanner, would you have any suggestions?

       

      Id say we have a medium to large deployment and interfaces are definitely the biggest portion of the environment:

      3 Polling Engines

      Network Elements 32814

      Nodes 2496

      Interfaces 21062

      Volumes 9256

       

      Thanks!

        • Re: reporting out of SWIS API
          tdanner

          First, you probably want "GETUTCDATE()" not "GETDATE()" in the WHERE clause - SWIS operates in UTC even when the underlying database tables use local time.

           

          My hunch is that account limitations are getting involved and making this query more complicated. Can you try this query with an account that has no limitations?

          • Re: reporting out of SWIS API
            ryan.davis26

            Another observation I made is a select * without a top x, will timeout in swql studio like I mentioned and I noticed the SWIS service will keep consuming RAM until there inst any left and will never go back down until it is restarted.  Not sure if this is a SWQL studio thing or not because I do get the 2 minute timeout warning from swql studio.

              • Re: reporting out of SWIS API
                mesverrum

                Yeah,  that's not specific to swql studio.  If you wrote a query that pulls too much data and stick it in the web consoles or a report you can cause the info service to balloon like that. I've learned this through painful experience in client environments. It seems like the info service doesn't actually give up on the query, just the ui gives up on waiting for the response.

              • Re: reporting out of SWIS API
                ryan.davis26

                So I was doing some troubleshooting in SQL Management studio and I can see the query come through as SolarWinds.InformationService.ServiceV3@domain-Orion.  It runs and actually executes rather quickly, doesn't even register as an "recent expensive query" however it does immediately go into an async_network_io wait state and then the SQL server suspends the query.  Based on what I read, that particular wait (async_network_io) occurs when SQL is sending the data back to the client faster than the client can ingest it.  Not knowing anything about how the SWIS service works, are there technical limitations to SWIS that would cause a large query like this to completely fail?  It seems as if our SQL server is not the issue, and is rather a problem with SWIS.  Do we need more ram on the poller where we're doing the API call on?  We only have 16GB, would 32GB be sufficient? will SWIS continue to use all available RAM and not scale back down?  Whats strange is I can do TOP 127,000's all day long and SWIS can handle that, the moment I take it out, it crashes.

                • Re: reporting out of SWIS API
                  ryan.davis26

                  So I've been experimenting with comparing dates in order to get data from the last hour with a different method instead of doing the hourdiff.  If I do a datetime > adddate('hour',-1,getdate()) the performance improvement is staggering.  But this doesn't appear to return data from the last hour like I think it should in my head.  I'm probably doing something wrong.  maybe tdanner you could offer a suggestion?

                  This is my query thus far, which to be honest, I'm not sure really what its doing as it returns 30 rows per server:

                   

                  select
                      N.Caption as DEVICE_NAME,
                      N.IP_Address as DEVICE_IP,
                      'InPercentUtil' as PARAMETER_NAME,
                      IT.DateTime as CHECK_TIME,
                      IT.InPercentUtil as OUTPUT_DATA,
                      I.TypeDescription as OUTPUT_DATA2,
                      I.Caption as OUTPUT_DATA3
                  from Orion.Nodes AS N
                  inner join orion.npm.InterfaceTraffic AS IT
                  on N.nodeid = IT.nodeid
                  inner join orion.npm.interfaces AS I
                  on N.nodeid = I.nodeid
                  where datetime > ADDDATE('hour',-1,GETDATE()) and nodeid = 1
                  • Re: reporting out of SWIS API
                    ryan.davis26

                    tdanner,  this is feeling hopeless now, almost to the point where we're considering abandoning the API...  We sort of confirmed this on another call with a tech yesterday but when is the API going to be available for support with our maintenance contracts?  I truly believe there is a problem with the SWIS service....

                     

                    We've increased ram on our pollers to 32GB and have retried all queries on the interfacetraffic tables.  None of them complete successfully.  Whats different now from 16GB and 32GB of ram is at least this time the SWIS service releases the memory and I get a different error from SWQL studio after it failes.  I still get an async_network_io wait from SQL as well.

                     

                    Another thing I did was look at the SQL query that is being executed on the SQL side and it looks totally bizarre and you can see that its actually performing some math which I suspect is part of the problem.  I think I get that the API exposes the inpercentutil field to you and then abstracts you from the math behind which is why I'm seeing it in the query but what is the point if it never works?

                     

                    SET DATEFIRST 7;
                    SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
                    SELECT [T1].[Caption] AS C1, [T1].[IP_Address] AS C2, 'InPercentUtil' AS C3, DateAdd(second,DateDiff(second,GetDate(),GetUtcDate()),[T2].[DateTime]) AS C4, [T2].[InPercentUtil] AS C5, [T3].[InterfaceTypeDescription] AS C6, [T3].[Caption] AS C7
                    FROM dbo.NodesData AS T1
                    INNER JOIN
                    (
                    SELECT it.*
                            ,CASE WHEN ISNULL(i.InBandwidth,0) = 0 THEN 0 ELSE ISNULL(it.In_Averagebps,0) / i.InBandwidth * 100 END InPercentUtil
                            ,CASE WHEN ISNULL(i.OutBandwidth,0) = 0 THEN 0 ELSE ISNULL(it.Out_Averagebps,0) / i.OutBandwidth * 100 END OutPercentUtil
                            FROM [dbo].[InterfaceTraffic] it
                            LEFT JOIN [dbo].[Interfaces] i ON it.InterfaceID = i.InterfaceID
                    ) AS T2 ON [T1].[NodeID] = [T2].[NodeID]
                    INNER JOIN dbo.Interfaces AS T3 ON [T1].[NodeID] = [T3].[NodeID]
                    WHERE DateDiff(hour,DateAdd(second,DateDiff(second,GetDate(),GetUtcDate()),[T2].[DateTime]),getdate()) = 1
                    • Re: reporting out of SWIS API
                      ryan.davis26

                      tdanner, I know this isn't official support... but any assistance you can offer would be greatly appreciated.  We're really concerned here that the API is just up to the task - we've tried modifying, adjusting what have you and nothing is working.  The bottle neck clearly resides with SWIS and we're going to have to revert to straight to SQL connection.

                       

                      The latest error message from SWIS:

                      ---------------------------

                      SWQL Studio

                      ---------------------------

                      The communication object, System.ServiceModel.Security.SecuritySessionClientSettings`1+ClientSecurityDuplexSessionChannel[System.ServiceModel.Channels.IDuplexSessionChannel], cannot be used for communication because it is in the Faulted state.

                      ---------------------------

                      OK  

                      ---------------------------

                        • Re: reporting out of SWIS API
                          tdanner

                          Sorry for the late reply; I was out for a few days.

                           

                          The extra math you are seeing relates to two things SWIS is doing: providing the illusion that all timestamps in the database are in UTC when in fact not all of them are, and synthesizing in/out percent utilization values out of bps and bandwidth settings. In our experience this is not a major problem for SQL performance. And it certainly shouldn't make SWIS use all the memory and fail!

                           

                          We need to find out what SWIS is actually doing that is chewing up memory. SWQL Studio has a feature that can help here. Append these two lines to your query:

                           

                          WITH QUERYPLAN

                          WITH QUERYSTATS

                           

                          The first line will cause SWIS to return a blob of XML that describes how SWIS is going to process the query. SWQL Studio will display this XML in a tab at the bottom of the window. On my system, it basically says "I'm going to run this SQL query and just stream the results back."

                           

                          The second line will cause SWIS to return a table of statistics listing all the operations it performed in executing the query, including the time taken and number of rows processed. This will also be displayed in a tab on the bottom.

                           

                          However, this only works if the query actually runs to completion. Your queries are failing, so you won't get this data. We will need to modify the query so that it actually completes. I would start by adding "TOP 10" to the SELECT clause. Assuming that completes, repeat with larger numbers.

                          3 of 3 people found this helpful
                            • Re: reporting out of SWIS API
                              mesverrum

                              How have i lived without knowing about this feature?

                              • Re: reporting out of SWIS API
                                ryan.davis26

                                Thanks tdanner.  I couldn't get those 2 options to work on my query because SWQL studio times out before I get results (I do have the timeout set to 10 minutes I believe).  But, it does run when I filter the query down:

                                 

                                select 
                                    N.Caption as DEVICE_NAME, 
                                    N.IP_Address as DEVICE_IP,  
                                    'InPercentUtil' as PARAMETER_NAME, 
                                    IT.DateTime as CHECK_TIME,
                                    IT.InPercentUtil as OUTPUT_DATA,  
                                    I.TypeDescription as OUTPUT_DATA2, 
                                    I.Caption as OUTPUT_DATA3
                                from Orion.Nodes AS N 
                                inner join orion.npm.InterfaceTraffic AS IT 
                                on N.nodeid = IT.nodeid
                                inner join orion.npm.interfaces AS I
                                on N.nodeid = I.nodeid
                                where  HOURDIFF(IT.DateTime,GETUTCDATE()) = 1 and nodeid =1
                                order by it.datetime desc
                                
                                
                                WITH QUERYPLAN
                                
                                
                                WITH QUERYSTATS
                                

                                 

                                 

                                - <queryPlan type="physical" xmlns="http://schemas.solarwinds.com/2007/08/informationservice">
                                - <op type="ProjectOp">
                                - <selectors>
                                 <selector name="T1.C1" type="String" expression="PropertyRef/T1.C1" />
                                 <selector name="T1.C2" type="String" expression="PropertyRef/T1.C2" />
                                 <selector name="N.C3" type="String" expression="PropertyRef/N.C3" />
                                 <selector name="T2.C4" type="DateTime" expression="PropertyRef/T2.C4" />
                                 <selector name="T2.C5" type="Single" expression="PropertyRef/T2.C5" />
                                 <selector name="T3.C6" type="String" expression="PropertyRef/T3.C6" />
                                 <selector name="T3.C7" type="String" expression="PropertyRef/T3.C7" />
                                 </selectors>
                                - <metaData>
                                 <field name="T1.C1" type="String" ordinal="0" xmlPath="N[EntityType=Orion.Nodes]/@DEVICE_NAME" />
                                 <field name="T1.C2" type="String" ordinal="1" xmlPath="N[EntityType=Orion.Nodes]/@DEVICE_IP" />
                                 <field name="N.C3" type="String" ordinal="2" xmlPath="N[EntityType=Orion.Nodes]/@PARAMETER_NAME" />
                                 <field name="T2.C4" type="DateTime" ordinal="3" xmlPath="IT[EntityType=Orion.NPM.InterfaceTraffic]/@CHECK_TIME" />
                                 <field name="T2.C5" type="Single" ordinal="4" xmlPath="IT[EntityType=Orion.NPM.InterfaceTraffic]/@OUTPUT_DATA" />
                                 <field name="T3.C6" type="String" ordinal="5" xmlPath="I[EntityType=Orion.NPM.Interfaces]/@OUTPUT_DATA2" />
                                 <field name="T3.C7" type="String" ordinal="6" xmlPath="I[EntityType=Orion.NPM.Interfaces]/@OUTPUT_DATA3" />
                                 </metaData>
                                - <children>
                                - <op type="ProviderPassThroughScanOp">
                                 <passThrough>SELECT [T1].[Caption] AS C1, [T1].[IP_Address] AS C2, 'InPercentUtil' AS C3, ToUtc([T2].[DateTime]) AS C4, [T2].[InPercentUtil] AS C5, [T3].[InterfaceTypeDescription] AS C6, [T3].[Caption] AS C7 FROM [dbo].[NodesData] AS T1 INNER JOIN [dbo].[InterfaceTraffic] AS T2 ON ([T1].[NodeID] = [T2].[NodeID]) INNER JOIN [dbo].[Interfaces] AS T3 ON ([T1].[NodeID] = [T3].[NodeID]) WHERE ((HOURDIFF(ToUtc([T2].[DateTime]),GETUTCDATE()) = 1) AND ([T1].[NodeID] = 1)) ORDER BY [T2].[DateTime] DESC WITH QUERYPLAN WITH QUERYSTATS RETURN XML Raw</passThrough>
                                - <metaData>
                                 <field name="T1.C1" type="String" ordinal="0" />
                                 <field name="T1.C2" type="String" ordinal="1" />
                                 <field name="N.C3" type="String" ordinal="2" />
                                 <field name="T2.C4" type="DateTime" ordinal="3" />
                                 <field name="T2.C5" type="Single" ordinal="4" />
                                 <field name="T3.C6" type="String" ordinal="5" />
                                 <field name="T3.C7" type="String" ordinal="6" />
                                 </metaData>
                                 <children />
                                 </op>
                                 </children>
                                 </op>
                                 </queryPlan>
                                

                                 

                                #Type
                                Query
                                Duration
                                Rows
                                1

                                SQL

                                SELECT * FROM Accounts WHERE AccountID=@AccountID5
                                2SQL

                                SET DATEFIRST 7;

                                SET TRANSACTION ISOLATION LEVEL READ COMMITTED;

                                SELECT [T1].[Caption] AS C1, [T1].[IP_Address] AS C2, 'InPercentUtil' AS C3, DateAdd(second,DateDiff(second,GetDate(),GetUtcDate()),[T2].[DateTime]) AS C4, [T2].[InPercentUtil] AS C5, [T3].[InterfaceTypeDescription] AS C6, [T3].[Caption] AS C7

                                FROM dbo.NodesData AS T1

                                INNER JOIN

                                (

                                SELECT it.*

                                        ,CASE WHEN ISNULL(i.InBandwidth,0) = 0 THEN 0 ELSE ISNULL(it.In_Averagebps,0) / i.InBandwidth * 100 END InPercentUtil

                                        ,CASE WHEN ISNULL(i.OutBandwidth,0) = 0 THEN 0 ELSE ISNULL(it.Out_Averagebps,0) / i.OutBandwidth * 100 END OutPercentUtil

                                        FROM [dbo].[InterfaceTraffic] it

                                        LEFT JOIN [dbo].[Interfaces] i ON it.InterfaceID = i.InterfaceID

                                ) AS T2 ON [T1].[NodeID] = [T2].[NodeID]

                                INNER JOIN dbo.Interfaces AS T3 ON [T1].[NodeID] = [T3].[NodeID]

                                WHERE DateDiff(hour,DateAdd(second,DateDiff(second,GetDate(),GetUtcDate()),[T2].[DateTime]),GETUTCDATE()) = 1 AND [T1].[NodeID] = 1

                                ORDER BY [T2].[DateTime] DESC

                                36836
                                3ProviderPassthroughScanOp36836
                                4ProjectionOp36836
                            • Re: reporting out of SWIS API
                              ryan.davis26

                              So I had quite a discovery tonight... and it appears the performance issues we've been seeing is related to the order of the joins in the query... yes I know, the order of inner joins in SQL shouldn't matter, but apparently with SWQL, it does!

                               

                               

                              Running the below query will cause SWIS to immediately start consuming all available RAM on the server and will never return data:

                              select
                                  N.Caption as DEVICE_NAME,
                                  N.IP_Address as DEVICE_IP, 
                                  'InPercentUtil' as PARAMETER_NAME,
                                  IT.DateTime as CHECK_TIME,
                                  IT.InPercentUtil as OUTPUT_DATA, 
                                  I.TypeDescription as OUTPUT_DATA2,
                                  I.Caption as OUTPUT_DATA3
                              from Orion.Nodes AS N
                              inner join orion.npm.InterfaceTraffic AS IT
                              on N.nodeid = IT.nodeid
                              inner join orion.npm.interfaces AS I
                              on N.nodeid = I.nodeid
                              where  HOURDIFF(IT.DateTime,getutcdate()) = 1

                               

                              However... running this query will NOT cause SWIS to immediately consume RAM and will actually return data in about 10-12 seconds in SWQL studio:

                               

                              SELECT 
                                  N.caption AS DEVICE_NAME,
                                  N.IP_Address AS DEVICE_IP,
                                  'InPercentUtil' AS PARAMETER_NAME,
                                  IT.DateTime AS CHECK_TIME, 
                                  IT.InPercentUtil AS OUTPUT_DATA,
                                  I.TypeDescription AS OUTPUT_DATA2,
                                  I.Caption AS OUTPUT_DATA3
                              FROM Orion.NPM.InterfaceTraffic AS IT
                              Inner Join Orion.Npm.Interfaces AS I
                              on IT.Interfaceid = I.InterfaceID
                              Inner Join Orion.Nodes AS N
                              on IT.NodeID = N.NodeID
                              where HOURDIFF(IT.DateTime, GETUTCDATE()) = 1
                                • Re: reporting out of SWIS API
                                  mesverrum

                                  You might not want to take it as a given that join order has no impact on query plans.  If everything goes well then it won't but there are definitely scenarios where it makes a difference.

                                  https://www.mssqltips.com/sqlservertutorial/3201/how-join-order-can-affect-the-query-plan/

                                   

                                  The tricky part about getting help when a query is timing out is most people with an environment big enough to time out on aren't willing to throw a query in them if they already expect it to bog things down.  One trick I've used when I'm testing risky queries is to point it at an APE or additional web server, that way if I make the Info service freak out I can bounce the services quickly and people will never know haha.  Every variation of the queries you've given ran fine in my lab with 500 elements and completed in about a second so I don't have enough meat in there to stress test the queries in any meaningful way.  To be honest, in SWQL you didn't actually need to do any joins by hand for your report, SWQL already has lots of the typical joins already in place and optimized, I'd be curious how the execution time of the below compares to what you've tried so far.   In my lab this runs about 5x faster (comparing 1-2 seconds to 0.2-.4 seconds) but like I said my whole lab is a drop in the bucket compared to your set so you'll have more useful benchmarks than mine.

                                   

                                  select n.Caption as DEVICE_NAME

                                  , n.IP AS DEVICE_IP

                                  , 'InPercentUtil' as PARAMETER_NAME

                                  , n.interfaces.Caption AS OUTPUT_DATA3

                                  , n.interfaces.TypeDescription AS OUTPUT_DATA2

                                  , n.Interfaces.Traffic.DateTime AS CHECK_TIME

                                  , n.Interfaces.Traffic.InPercentUtil AS OUTPUT_DATA

                                   

                                  from orion.nodes n

                                  where n.Interfaces.Traffic.DateTime > addhour(-1,GETUTCDATE())

                                   

                                  To know where the built in joins are and what to call them just watch for the chain link icons in SWQL studio

                                  Starting from nodes

                                  then from interfaces to the traffic history

                                   

                                   

                                  In most cases I try to build my queries from the table with the fewest entries to the ones with more.  You can also apply where conditions as part of the join instead of doing it with the where at the end, for inner joins this should be equal but I do a lot of outer joins and it makes a HUGE difference there because it cuts down the data sets earlier in the process.  Thinking about that actually prompted me to look at the SQL translated query you posted earlier and I noticed it had this bit of ugliness

                                  1. SET DATEFIRST 7; 
                                  2. SETTRANSACTIONISOLATIONLEVELREADCOMMITTED
                                  3. SELECT [T1].[Caption] AS C1, [T1].[IP_Address] AS C2, 'InPercentUtil'AS C3, DateAdd(second,DateDiff(second,GetDate(),GetUtcDate()),[T2].[DateTime]) AS C4, [T2].[InPercentUtil] AS C5, [T3].[InterfaceTypeDescription] AS C6, [T3].[Caption] AS C7 
                                  4. FROM dbo.NodesData AS T1 
                                  5. INNERJOIN
                                  6. SELECT it.* 
                                  7.         ,CASEWHENISNULL(i.InBandwidth,0) = 0 THEN 0 ELSEISNULL(it.In_Averagebps,0) / i.InBandwidth * 100 END InPercentUtil 
                                  8.         ,CASEWHENISNULL(i.OutBandwidth,0) = 0 THEN 0 ELSEISNULL(it.Out_Averagebps,0) / i.OutBandwidth * 100 END OutPercentUtil 
                                  9. FROM [dbo].[InterfaceTraffic] it 
                                  10. LEFTJOIN [dbo].[Interfaces] i ON it.InterfaceID = i.InterfaceID 
                                  11. ) AS T2 ON [T1].[NodeID] = [T2].[NodeID] 
                                  12. INNERJOIN dbo.Interfaces AS T3 ON [T1].[NodeID] = [T3].[NodeID] 
                                  13. WHERE DateDiff(hour,DateAdd(second,DateDiff(second,GetDate(),GetUtcDate()),[T2].[DateTime]),getdate()) = 1 

                                   

                                  Line 11 seems less than ideal and I can see why it is done, but I think it might be causing us to scan the interfaces table at least one extra time? Could hurt while reporting in larger environments. That's noteworthy because I the only times I can specifically remember having a hard time getting queries to complete is always when I pull in the interfacehistory, I had just figured that since it tends to be larger it was something I had to live with.  Seeing this makes me re-evaluate that, bet I could convert all my reports from SWQL to a cleaner SQL execution and things would work faster, or maybe even keep the SWQL but not use the derived percentutils, just pull the real bps from interfacetraffic then manually specify the math to make it a percentage in my query as part of the select... Would have to play with it a few different ways and see how it turns out.

                                   

                                  Maybe SWIS is counting on SQL to optimize that but it might be possible to preemptively improve the way the info service handles the interfacetraffic table.   In practice, I have never written a report against traffic history where I wasn't also specifying a join or intrinsic link to Interfaces and now I'm thinking there is a possibly avoidable performance penalty with the way its being done.

                                  1 of 1 people found this helpful
                                    • Re: reporting out of SWIS API
                                      ryan.davis26

                                      thanks @mesverrum for your input.  I hear what you're saying, but this new query with the join first on the interfacetraffic table works flawlessly everytime where the old one on the nodes table first would fail everytime.  My reporting guy is also reporting the same.  The interfacetraffic table really was the only one we were having trouble with.  We only have about 300 network elements but some of them have tons of interfaces; over 21,000 interfaces spread across 3 pollers with standard polling intervals (9 minutes)  So that comes out to about 5-6 data points per hour or just over 126,000 total per hour.  I'm with you, I was using our APE to do these queries as I was able to quickly restart swis with minimal impact (if any that I could tell...)   Our other poller is in Ireland so I didn't want to use that one!

                                       

                                      I didn't know about those (inherited?) joins!?!? I've seen those lines in SWQL studio before but never really knew what they meant or how to use them!  I'm definitely going to dig deeper into your query to see how it works - thanks again!