
reporting out of SWIS API

We have a reporting platform that we're trying to point at the SolarWinds API to query data from. It uses a homegrown tool that can use an API to ingest data. This works great for availability, CPU, disk, memory, and response time statistics.

However, we're running into issues with the interface-related tables: the queries time out, and the SWIS service itself starts to consume over 12GB of RAM and doesn't go back down until it is restarted. Here is an example of one query:

select Nodes.Caption as DEVICE_NAME, Nodes.IP_Address as DEVICE_IP, 'InPercentUtil' as PARAMETER_NAME, InterfaceTraffic.DateTime as CHECK_TIME,
    InterfaceTraffic.InPercentUtil as OUTPUT_DATA, Interfaces.TypeDescription as OUTPUT_DATA2, Interfaces.Caption as OUTPUT_DATA3
from Orion.Nodes, Orion.NPM.InterfaceTraffic, Orion.NPM.Interfaces
where InterfaceTraffic.NodeID = Nodes.NodeID and Interfaces.NodeID = Nodes.NodeID and MINUTEDIFF(InterfaceTraffic.DateTime, GETDATE()) = 1
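For context, our homegrown tool submits SWQL like this over the SWIS REST interface. A minimal sketch of how such a request is built, assuming the conventional JSON Query endpoint on port 17778; the host name, credentials, and the `build_query_payload` helper are placeholders of mine, not SolarWinds APIs:

```python
import json

# Hypothetical Orion host; the SWIS REST JSON Query endpoint conventionally
# listens on port 17778.
SWIS_QUERY_URL = "https://orion.example.com:17778/SolarWinds/InformationService/v3/Json/Query"

def build_query_payload(swql, parameters=None):
    """Build the JSON body the SWIS Query endpoint expects."""
    body = {"query": swql}
    if parameters:
        body["parameters"] = parameters
    return json.dumps(body)

swql = ("SELECT Nodes.Caption AS DEVICE_NAME, InterfaceTraffic.InPercentUtil AS OUTPUT_DATA "
        "FROM Orion.Nodes "
        "INNER JOIN Orion.NPM.InterfaceTraffic ON InterfaceTraffic.NodeID = Nodes.NodeID "
        "WHERE MINUTEDIFF(InterfaceTraffic.DateTime, GETUTCDATE()) = @minutes")
payload = build_query_payload(swql, {"minutes": 1})

# Actually sending it needs the third-party `requests` package and real credentials:
#   resp = requests.post(SWIS_QUERY_URL, data=payload, auth=("user", "pass"),
#                        headers={"Content-Type": "application/json"}, verify=False)
#   rows = resp.json()["results"]
```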

tdanner​, would you have any suggestions?

I'd say we have a medium-to-large deployment, and interfaces are definitely the biggest portion of the environment:

3 Polling Engines

Network Elements 32814

Nodes 2496

Interfaces 21062

Volumes 9256

Thanks!

  • Thanks tdanner, that did work; however, the query still times out, despite performance improvements when querying one node with this method vs. the DATEDIFF method.

    If I let this cook, SQL goes into that ASYNC_NETWORK_IO wait state and suspends the query; meanwhile, the SWIS service on the poller continues to eat up RAM and never backs down. Could this be a bug? Is SWIS not capable of handling large queries when SQL clearly can?

  • tdanner, this is feeling hopeless now, almost to the point where we're considering abandoning the API... We sort of confirmed this on another call with a tech yesterday, but when is the API going to be covered for support under our maintenance contracts? I truly believe there is a problem with the SWIS service...

    We've increased RAM on our pollers to 32GB and have retried all the queries on the InterfaceTraffic tables. None of them complete successfully. What's different between 16GB and 32GB of RAM is that at least this time the SWIS service releases the memory, and I get a different error from SWQL Studio after it fails. I still get an ASYNC_NETWORK_IO wait from SQL as well.

    Another thing I did was look at the query being executed on the SQL side. It looks totally bizarre - you can see that it's actually performing some math, which I suspect is part of the problem. I get that the API exposes the InPercentUtil field to you and abstracts away the math behind it, which is why I'm seeing it in the query, but what is the point if it never works?

    SET DATEFIRST 7;
    SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
    SELECT [T1].[Caption] AS C1, [T1].[IP_Address] AS C2, 'InPercentUtil' AS C3, DateAdd(second,DateDiff(second,GetDate(),GetUtcDate()),[T2].[DateTime]) AS C4, [T2].[InPercentUtil] AS C5, [T3].[InterfaceTypeDescription] AS C6, [T3].[Caption] AS C7
    FROM dbo.NodesData AS T1
    INNER JOIN
    (
    SELECT it.*
            ,CASE WHEN ISNULL(i.InBandwidth,0) = 0 THEN 0 ELSE ISNULL(it.In_Averagebps,0) / i.InBandwidth * 100 END InPercentUtil
            ,CASE WHEN ISNULL(i.OutBandwidth,0) = 0 THEN 0 ELSE ISNULL(it.Out_Averagebps,0) / i.OutBandwidth * 100 END OutPercentUtil
            FROM [dbo].[InterfaceTraffic] it
            LEFT JOIN [dbo].[Interfaces] i ON it.InterfaceID = i.InterfaceID
    ) AS T2 ON [T1].[NodeID] = [T2].[NodeID]
    INNER JOIN dbo.Interfaces AS T3 ON [T1].[NodeID] = [T3].[NodeID]
    WHERE DateDiff(hour,DateAdd(second,DateDiff(second,GetDate(),GetUtcDate()),[T2].[DateTime]),getdate()) = 1
  • tdanner, I know this isn't official support... but any assistance you can offer would be greatly appreciated. We're really concerned here that the API is just not up to the task - we've tried modifying and adjusting, what have you, and nothing is working. The bottleneck clearly resides with SWIS, and we're going to have to revert to a straight SQL connection.

    The latest error message from SWIS:

    ---------------------------

    SWQL Studio

    ---------------------------

    The communication object, System.ServiceModel.Security.SecuritySessionClientSettings`1+ClientSecurityDuplexSessionChannel[System.ServiceModel.Channels.IDuplexSessionChannel], cannot be used for communication because it is in the Faulted state.

    ---------------------------

    OK  

    ---------------------------

  • Sorry for the late reply; I was out for a few days.

    The extra math you are seeing relates to two things SWIS is doing: providing the illusion that all timestamps in the database are in UTC when in fact not all of them are, and synthesizing in/out percent utilization values out of bps and bandwidth settings. In our experience this is not a major problem for SQL performance. And it certainly shouldn't make SWIS use all the memory and fail!

    We need to find out what SWIS is actually doing that is chewing up memory. SWQL Studio has a feature that can help here. Append these two lines to your query:

    WITH QUERYPLAN

    WITH QUERYSTATS

    The first line will cause SWIS to return a blob of XML that describes how SWIS is going to process the query. SWQL Studio will display this XML in a tab at the bottom of the window. On my system, it basically says "I'm going to run this SQL query and just stream the results back."

    The second line will cause SWIS to return a table of statistics listing all the operations it performed in executing the query, including the time taken and number of rows processed. This will also be displayed in a tab on the bottom.

    However, this only works if the query actually runs to completion. Your queries are failing, so you won't get this data. We will need to modify the query so that it actually completes. I would start by adding "TOP 10" to the SELECT clause. Assuming that completes, repeat with larger numbers.
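That grow-the-limit loop is easy to script. A sketch: the `with_top` helper below is my own (not part of any SDK), and it assumes the query doesn't already carry a TOP clause:

```python
import re

def with_top(swql, n):
    """Prepend a TOP clause to the first SELECT of a SWQL query.
    Assumes the query does not already contain a TOP clause."""
    return re.sub(r"(?i)^\s*select\b", f"SELECT TOP {n}", swql, count=1)

# Grow the row limit until the query stops completing, to find the breaking point:
# for n in (10, 100, 1000, 10000):
#     run(with_top(query, n))   # run() = whatever submits SWQL in your tooling
```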

  • How have I lived without knowing about this feature?

  • Thanks tdanner. I couldn't get those two options to work on my query because SWQL Studio times out before I get results (I do have the timeout set to 10 minutes, I believe). But it does run when I filter the query down:

    select
        N.Caption as DEVICE_NAME,
        N.IP_Address as DEVICE_IP,
        'InPercentUtil' as PARAMETER_NAME,
        IT.DateTime as CHECK_TIME,
        IT.InPercentUtil as OUTPUT_DATA,
        I.TypeDescription as OUTPUT_DATA2,
        I.Caption as OUTPUT_DATA3
    from Orion.Nodes AS N
    inner join Orion.NPM.InterfaceTraffic AS IT on N.NodeID = IT.NodeID
    inner join Orion.NPM.Interfaces AS I on N.NodeID = I.NodeID
    where HOURDIFF(IT.DateTime, GETUTCDATE()) = 1 and NodeID = 1
    order by IT.DateTime desc
    WITH QUERYPLAN
    WITH QUERYSTATS

    <queryPlan type="physical" xmlns="http://schemas.solarwinds.com/2007/08/informationservice">
      <op type="ProjectOp">
        <selectors>
          <selector name="T1.C1" type="String" expression="PropertyRef/T1.C1" />
          <selector name="T1.C2" type="String" expression="PropertyRef/T1.C2" />
          <selector name="N.C3" type="String" expression="PropertyRef/N.C3" />
          <selector name="T2.C4" type="DateTime" expression="PropertyRef/T2.C4" />
          <selector name="T2.C5" type="Single" expression="PropertyRef/T2.C5" />
          <selector name="T3.C6" type="String" expression="PropertyRef/T3.C6" />
          <selector name="T3.C7" type="String" expression="PropertyRef/T3.C7" />
        </selectors>
        <metaData>
          <field name="T1.C1" type="String" ordinal="0" xmlPath="N[EntityType=Orion.Nodes]/@DEVICE_NAME" />
          <field name="T1.C2" type="String" ordinal="1" xmlPath="N[EntityType=Orion.Nodes]/@DEVICE_IP" />
          <field name="N.C3" type="String" ordinal="2" xmlPath="N[EntityType=Orion.Nodes]/@PARAMETER_NAME" />
          <field name="T2.C4" type="DateTime" ordinal="3" xmlPath="IT[EntityType=Orion.NPM.InterfaceTraffic]/@CHECK_TIME" />
          <field name="T2.C5" type="Single" ordinal="4" xmlPath="IT[EntityType=Orion.NPM.InterfaceTraffic]/@OUTPUT_DATA" />
          <field name="T3.C6" type="String" ordinal="5" xmlPath="I[EntityType=Orion.NPM.Interfaces]/@OUTPUT_DATA2" />
          <field name="T3.C7" type="String" ordinal="6" xmlPath="I[EntityType=Orion.NPM.Interfaces]/@OUTPUT_DATA3" />
        </metaData>
        <children>
          <op type="ProviderPassThroughScanOp">
            <passThrough>SELECT [T1].[Caption] AS C1, [T1].[IP_Address] AS C2, 'InPercentUtil' AS C3, ToUtc([T2].[DateTime]) AS C4, [T2].[InPercentUtil] AS C5, [T3].[InterfaceTypeDescription] AS C6, [T3].[Caption] AS C7 FROM [dbo].[NodesData] AS T1 INNER JOIN [dbo].[InterfaceTraffic] AS T2 ON ([T1].[NodeID] = [T2].[NodeID]) INNER JOIN [dbo].[Interfaces] AS T3 ON ([T1].[NodeID] = [T3].[NodeID]) WHERE ((HOURDIFF(ToUtc([T2].[DateTime]),GETUTCDATE()) = 1) AND ([T1].[NodeID] = 1)) ORDER BY [T2].[DateTime] DESC WITH QUERYPLAN WITH QUERYSTATS RETURN XML Raw</passThrough>
            <metaData>
              <field name="T1.C1" type="String" ordinal="0" />
              <field name="T1.C2" type="String" ordinal="1" />
              <field name="N.C3" type="String" ordinal="2" />
              <field name="T2.C4" type="DateTime" ordinal="3" />
              <field name="T2.C5" type="Single" ordinal="4" />
              <field name="T3.C6" type="String" ordinal="5" />
              <field name="T3.C7" type="String" ordinal="6" />
            </metaData>
            <children />
          </op>
        </children>
      </op>
    </queryPlan>

    # | Type                      | Query                                              | Duration | Rows
    1 | SQL                       | SELECT * FROM Accounts WHERE AccountID=@AccountID5 |          |
    2 | SQL                       | (translated query below)                           |          | 36836
    3 | ProviderPassthroughScanOp |                                                    |          | 36836
    4 | ProjectionOp              |                                                    |          | 36836

    Translated query from row 2:

    SET DATEFIRST 7;
    SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
    SELECT [T1].[Caption] AS C1, [T1].[IP_Address] AS C2, 'InPercentUtil' AS C3, DateAdd(second,DateDiff(second,GetDate(),GetUtcDate()),[T2].[DateTime]) AS C4, [T2].[InPercentUtil] AS C5, [T3].[InterfaceTypeDescription] AS C6, [T3].[Caption] AS C7
    FROM dbo.NodesData AS T1
    INNER JOIN
    (
    SELECT it.*
            ,CASE WHEN ISNULL(i.InBandwidth,0) = 0 THEN 0 ELSE ISNULL(it.In_Averagebps,0) / i.InBandwidth * 100 END InPercentUtil
            ,CASE WHEN ISNULL(i.OutBandwidth,0) = 0 THEN 0 ELSE ISNULL(it.Out_Averagebps,0) / i.OutBandwidth * 100 END OutPercentUtil
            FROM [dbo].[InterfaceTraffic] it
            LEFT JOIN [dbo].[Interfaces] i ON it.InterfaceID = i.InterfaceID
    ) AS T2 ON [T1].[NodeID] = [T2].[NodeID]
    INNER JOIN dbo.Interfaces AS T3 ON [T1].[NodeID] = [T3].[NodeID]
    WHERE DateDiff(hour,DateAdd(second,DateDiff(second,GetDate(),GetUtcDate()),[T2].[DateTime]),GETUTCDATE()) = 1 AND [T1].[NodeID] = 1
    ORDER BY [T2].[DateTime] DESC
  • So I had quite a discovery tonight... it appears the performance issues we've been seeing are related to the order of the joins in the query. Yes, I know, the order of inner joins in SQL shouldn't matter, but apparently with SWQL it does!

    Running the below query will cause SWIS to immediately start consuming all available RAM on the server and will never return data:

    select
        N.Caption as DEVICE_NAME,
        N.IP_Address as DEVICE_IP,
        'InPercentUtil' as PARAMETER_NAME,
        IT.DateTime as CHECK_TIME,
        IT.InPercentUtil as OUTPUT_DATA,
        I.TypeDescription as OUTPUT_DATA2,
        I.Caption as OUTPUT_DATA3
    from Orion.Nodes AS N
    inner join orion.npm.InterfaceTraffic AS IT
    on N.nodeid = IT.nodeid
    inner join orion.npm.interfaces AS I
    on N.nodeid = I.nodeid
    where  HOURDIFF(IT.DateTime,getutcdate()) = 1

    However... running this query will NOT cause SWIS to immediately consume RAM and will actually return data in about 10-12 seconds in SWQL studio:

    SELECT 
        N.caption AS DEVICE_NAME,
        N.IP_Address AS DEVICE_IP,
        'InPercentUtil' AS PARAMETER_NAME,
        IT.DateTime AS CHECK_TIME,
        IT.InPercentUtil AS OUTPUT_DATA,
        I.TypeDescription AS OUTPUT_DATA2,
        I.Caption AS OUTPUT_DATA3
    FROM Orion.NPM.InterfaceTraffic AS IT
    Inner Join Orion.Npm.Interfaces AS I
    on IT.Interfaceid = I.InterfaceID
    Inner Join Orion.Nodes AS N
    on IT.NodeID = N.NodeID
    where HOURDIFF(IT.DateTime, GETUTCDATE()) = 1
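    Even with the fast join order, pulling a full hour for 21,000+ interfaces is a big payload for one response. If that ever becomes a problem again, SWQL's paging clause (WITH ROWS x TO y) can split it into fixed-size windows. A sketch, where `paged` is my own helper and `run_swql` stands in for whatever submits the query in your tooling:

```python
def paged(swql, page_size=5000):
    """Generate successive page-windowed copies of a SWQL query.
    Assumes the query has a deterministic ORDER BY so pages don't overlap."""
    start = 1
    while True:
        end = start + page_size - 1
        yield f"{swql} WITH ROWS {start} TO {end}"
        start = end + 1

# Usage sketch: pull pages until one comes back short.
# for page_query in paged(base_query):
#     rows = run_swql(page_query)      # run_swql = your SWIS client call
#     ingest(rows)
#     if len(rows) < 5000:
#         break
```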
  • You might not want to take it as a given that join order has no impact on query plans. If everything goes well it won't, but there are definitely scenarios where it makes a difference.

    https://www.mssqltips.com/sqlservertutorial/3201/how-join-order-can-affect-the-query-plan/

    The tricky part about getting help when a query is timing out is that most people with an environment big enough to time out on aren't willing to throw a query at it if they already expect it to bog things down. One trick I've used when testing risky queries is to point them at an APE or an additional web server; that way, if I make the Information Service freak out, I can bounce the services quickly and people will never know, haha. Every variation of the queries you've given ran fine in my lab with 500 elements and completed in about a second, so I don't have enough meat in there to stress test the queries in any meaningful way.

    To be honest, in SWQL you didn't actually need to do any joins by hand for your report. SWQL already has lots of the typical joins in place and optimized, and I'd be curious how the execution time of the below compares to what you've tried so far. In my lab this runs about 5x faster (1-2 seconds vs. 0.2-0.4 seconds), but like I said, my whole lab is a drop in the bucket compared to your data set, so you'll have more useful benchmarks than mine.

    select n.Caption as DEVICE_NAME
        , n.IP AS DEVICE_IP
        , 'InPercentUtil' as PARAMETER_NAME
        , n.Interfaces.Caption AS OUTPUT_DATA3
        , n.Interfaces.TypeDescription AS OUTPUT_DATA2
        , n.Interfaces.Traffic.DateTime AS CHECK_TIME
        , n.Interfaces.Traffic.InPercentUtil AS OUTPUT_DATA
    from Orion.Nodes n
    where n.Interfaces.Traffic.DateTime > ADDHOUR(-1, GETUTCDATE())

    To know where the built-in joins are and what to call them, just watch for the chain-link icons in SWQL Studio.

    Starting from nodes

    [screenshot: pastedImage_1.png]

    then from interfaces to the traffic history

    [screenshot: pastedImage_2.png]

    In most cases I try to build my queries from the table with the fewest entries toward the ones with more. You can also apply where conditions as part of the join instead of doing it in the where clause at the end; for inner joins the result should be equal, but I do a lot of outer joins, and it makes a HUGE difference there because it cuts down the data sets earlier in the process. Thinking about that actually prompted me to look at the SQL-translated query you posted earlier, and I noticed it had this bit of ugliness:

    1. SET DATEFIRST 7;
    2. SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
    3. SELECT [T1].[Caption] AS C1, [T1].[IP_Address] AS C2, 'InPercentUtil' AS C3, DateAdd(second,DateDiff(second,GetDate(),GetUtcDate()),[T2].[DateTime]) AS C4, [T2].[InPercentUtil] AS C5, [T3].[InterfaceTypeDescription] AS C6, [T3].[Caption] AS C7
    4. FROM dbo.NodesData AS T1
    5. INNER JOIN (
    6. SELECT it.*
    7.         ,CASE WHEN ISNULL(i.InBandwidth,0) = 0 THEN 0 ELSE ISNULL(it.In_Averagebps,0) / i.InBandwidth * 100 END InPercentUtil
    8.         ,CASE WHEN ISNULL(i.OutBandwidth,0) = 0 THEN 0 ELSE ISNULL(it.Out_Averagebps,0) / i.OutBandwidth * 100 END OutPercentUtil
    9. FROM [dbo].[InterfaceTraffic] it
    10. LEFT JOIN [dbo].[Interfaces] i ON it.InterfaceID = i.InterfaceID
    11. ) AS T2 ON [T1].[NodeID] = [T2].[NodeID]
    12. INNER JOIN dbo.Interfaces AS T3 ON [T1].[NodeID] = [T3].[NodeID]
    13. WHERE DateDiff(hour,DateAdd(second,DateDiff(second,GetDate(),GetUtcDate()),[T2].[DateTime]),getdate()) = 1

    Line 11 seems less than ideal, and I can see why it is done, but I think it might be causing us to scan the Interfaces table at least one extra time, which could hurt while reporting in larger environments. That's noteworthy because the only times I can specifically remember having a hard time getting queries to complete are when I pull in the interface traffic history; I had just figured that since it tends to be larger, it was something I had to live with. Seeing this makes me re-evaluate that. I bet I could convert all my reports from SWQL to a cleaner SQL execution and things would work faster, or maybe even keep the SWQL but skip the derived percent-util fields: just pull the real bps from InterfaceTraffic, then manually specify the math to make it a percentage as part of the select. I'd have to play with it a few different ways and see how it turns out.

    Maybe SWIS is counting on SQL to optimize that, but it might be possible to preemptively improve the way the Information Service handles the InterfaceTraffic table. In practice, I have never written a report against traffic history where I wasn't also specifying a join or intrinsic link to Interfaces, and now I'm thinking there is a possibly avoidable performance penalty in the way it's being done.
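    For what it's worth, the derived InPercentUtil in that translated SQL boils down to simple math. Here's a plain-Python restatement of the CASE expression (my own function name; NULLs treated as 0, per the ISNULL calls):

```python
def in_percent_util(in_average_bps, in_bandwidth):
    """Mirror the injected CASE expression:
    0 when bandwidth is NULL or 0, otherwise average bps / bandwidth * 100."""
    bandwidth = in_bandwidth or 0                       # ISNULL(i.InBandwidth, 0)
    if bandwidth == 0:
        return 0
    return (in_average_bps or 0) / bandwidth * 100      # ISNULL(it.In_Averagebps, 0)

# 500 Mbps average on a 1 Gbps interface:
print(in_percent_util(500_000_000, 1_000_000_000))   # -> 50.0
# No bandwidth recorded: the guard avoids a divide-by-zero.
print(in_percent_util(500_000_000, None))            # -> 0
```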

  • Thanks @mesverrum for your input. I hear what you're saying, but this new query with the join on the InterfaceTraffic table first works flawlessly every time, where the old one starting from the Nodes table would fail every time. My reporting guy is seeing the same. The InterfaceTraffic table really was the only one we were having trouble with. We only have about 300 network elements, but some of them have tons of interfaces: over 21,000 interfaces spread across 3 pollers with standard polling intervals (9 minutes). That comes out to about 5-6 data points per interface per hour, or just over 126,000 total per hour. I'm with you - I was using our APE to do these queries, since I could quickly restart SWIS with minimal impact (if any that I could tell). Our other poller is in Ireland, so I didn't want to use that one!

    I didn't know about those (inherited?) joins! I've seen those chain-link lines in SWQL Studio before but never really knew what they meant or how to use them. I'm definitely going to dig deeper into your query to see how it works - thanks again!