This discussion has been locked. The information referenced herein may be inaccurate due to age, software updates, or external references.
You can no longer post new replies to this discussion. If you have a similar question you can start a new discussion in this forum.

Large increase in database connections using API (invoke/delete)

Hello,

We have noticed that when making a large number of API calls (invoke/delete), we see large increases in database connections. This is causing the entire SolarWinds web client to grind to a halt, unable to load pages.

For example, we run a script that refreshes the interfaces on a node using DELETE and INVOKE calls:

delete all interfaces:

query = swis.query(
    "SELECT Uri "
    "FROM Orion.NPM.Interfaces "
    f"WHERE NodeID = {node_id}"
)
uris = [x["Uri"] for x in query["results"]]
swis._req("POST", "BulkDelete", {"uris": uris})

add interfaces after filtering:

result = swis.invoke(
    "Orion.NPM.Interfaces", "DiscoverInterfacesOnNode", node_id
)
interfaces = result["DiscoveredInterfaces"]
interfaces = filter_interfaces(interfaces, apply_filter)
swis.invoke(
    "Orion.NPM.Interfaces",
    "AddInterfacesOnNode",
    node_id,
    interfaces,
    "AddDefaultPollers",
)
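`filter_interfaces` is our own helper, but for anyone reading along, a minimal sketch of what such a filter might look like (the `Caption` key and the regex-based matching are assumptions about the discovery result shape, not the exact code we run):

```python
import re

def filter_interfaces(interfaces, pattern):
    """Keep only discovered interfaces whose Caption matches the pattern.

    Hypothetical stand-in for the filter_interfaces helper referenced above;
    the 'Caption' key is an assumption about DiscoverInterfacesOnNode output.
    """
    regex = re.compile(pattern)
    return [i for i in interfaces if regex.search(i.get("Caption", ""))]

# Example with a fake discovery result:
discovered = [
    {"Caption": "GigabitEthernet0/1"},
    {"Caption": "Loopback0"},
]
uplinks = filter_interfaces(discovered, r"^GigabitEthernet")
```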

Running this on a single node can increase database connections by 200. This shows up under "User Connections" on the "AppInsight for SQL" page. If we run this code for 10 nodes, the number of connections spikes to ~1.5k, which severely hinders the responsiveness of the web client.

- Is this expected? Do invoke calls normally generate a lot of subsequent database calls? Is BulkDelete the culprit? (I tried individual deletes too.)
- Is there a way to rate limit our API calls? Happy to throw some sleeps between each node run if not.
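For what it's worth, a minimal client-side pacing sketch for the "throw some sleeps between each node run" idea (the delay value is an arbitrary starting point, and `refresh_fn` stands in for the delete/discover/add sequence above; the `sleep` parameter is injectable only to make the helper testable):

```python
import time

def refresh_nodes(node_ids, refresh_fn, delay_s=5.0, sleep=time.sleep):
    """Run refresh_fn for each node, pausing between runs to pace DB load.

    delay_s is an arbitrary guess; tune it while watching "User Connections"
    on the AppInsight for SQL page. 'sleep' is injectable for testing.
    """
    for i, node_id in enumerate(node_ids):
        refresh_fn(node_id)
        if i < len(node_ids) - 1:  # no pause needed after the last node
            sleep(delay_s)

# Example with recording stubs instead of real SWIS calls:
calls, sleeps = [], []
refresh_nodes([101, 102, 103], calls.append, delay_s=2.0, sleep=sleeps.append)
```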




  • All API calls get routed through the Information Service, which is also where the website gets most of its information, so they are essentially competing for resources.

    In the past I have found it beneficial to point API jobs at one of my APEs, since those usually have more unused resources. But I have also been bitten by that, because the Information Service on APEs and Additional Web Servers is not a 100% replica of what shows up when you connect to the core app server; it has been close enough for 98% of my needs, though.

    Another recommendation I almost always make to clients is to get an additional web server license. They are super cheap, around $500, and can be really helpful for offloading end users away from the core server. Many times I've done really abusive things to the app server without the end users noticing, because they only interacted with the AWS and it had what they needed cached.

    I've also done what you suggested and implemented sleep timers in my scripts to artificially rate limit myself, and that's worked okay too.

  • Thanks for the info. We have our main server (plus HA), 9x APEs (10 polling engines in total), plus 4x Additional Web Servers. All our end users point to the Additional Web Servers, and polling load is kept off the main server and on the 9x APEs.

    At the moment we have the API pointing to the main server, but we will indeed try pointing it at an APE, as we know the main server's SWIS service is quite busy doing all the alerting and other main admin work. I guess another option is to empty an APE (or the main server) of all polling jobs. Is it often a good idea to have zero nodes assigned to the main server and leave it for other purposes?

    The main issue we find is not an API <-> server bottleneck, but that the server <-> DB side opens ~1k pooled connections. maxpooledconnections = 1000 in the database config file (Support suggests not changing it, as that's already very high), but we see a massive spike in pooled database connections when the script runs. Our pooled DB connections usually sit around 600-650, but spike to around 1.5k when the API runs, so we presume there is no server-side rate limit. What we are unsure about is why the app keeps opening more and more connections while existing pooled connections sit idle. We only see this behaviour via API calls, too.

  • Letting an APE sit idle just for API calls can be an expensive proposition, depending on your licensing, if you don't have one of the licenses that allows unlimited APEs.

    I've known many orgs that prefer not to load their main poller if they can avoid it; with 9 APEs it's probably not hard for you to shuffle devices away from it.

    I haven't tried to mess with the connection pool limit before, I am going to talk to my DBA colleague and see what his thoughts are.

  • Thanks!

    I'm finding that hitting another APE is causing problems like this: Device discovery only works on primary polling engine on python API call

    I get "Cannot find assembly"; it looks like I have to hit the main polling engine for some API requests (e.g. invoke)?

    {"Message":"Verb Orion.NPM.Interfaces.DiscoverInterfacesOnNode: Cannot find assembly","ExceptionType":"SolarWinds.InformationService.Verb.VerbExecutorException","FullException":"SolarWinds.InformationService.Verb.VerbExecutorException: Verb Orion.NPM.Interfaces.DiscoverInterfacesOnNode: Cannot find assembly
       at SolarWinds.InformationService.Verb.VerbExecutor.EnsureCached(VerbExecutorContext verbExecContext)
       at SolarWinds.InformationService.Verb.VerbCache.GetVerbContext(String entity, String verb, VerbExecutor& executor, VerbExecutorContext& context)
       at SolarWinds.InformationService.Core.InformationService.Invoke[T](String entity, String verb, Action`1 setupParameters, Func`2 extractReturnValue)"}

  • Just an FYI regarding bulk interface deletions causing a large increase in database connections: I don't believe this is specific to the API, because we recently experienced the same issue when bulk deleting interfaces via Manage Nodes in the UI. In our case, Support ended up advising us to modify the "Program Files (x86)\SolarWinds\Orion\SWNetPerfMon.DB" file and increase Max Pool Size = 2000. We don't like this approach, since that file will likely be overwritten during upgrades (right?). It seems like a workaround that should not be necessary. Since we have SAM, we'll start monitoring this and likely set up some alerting.

    This also brings to mind the fact that I've done bulk interface deletions before in Manage Nodes on our other SolarWinds instances, and the UI has never become unresponsive, nor has Support ever advised us to increase the max pool connections. I'm wondering if this is specific to having the SAM module, since our other instances (where this has not happened) do not have SAM?
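    For anyone looking for the knob Support pointed at: Max Pool Size is a standard ADO.NET connection-string keyword, and it lives on the connection string inside SWNetPerfMon.DB. A rough sketch of the relevant line only (the file's exact layout varies by version, and the server, catalog, and credential values below are placeholders, not real settings):

```ini
; SWNetPerfMon.DB fragment (sketch) - only Max Pool Size is the point here.
; Data Source, Initial Catalog, and User ID values are placeholders.
ConnectionString=Data Source=SQLSERVER01;Initial Catalog=SolarWindsOrion;User ID=SolarWindsOrionDatabaseUser;Password=...;Max Pool Size=2000
```

    As noted above, this file can be overwritten by upgrades, so treat any change as something to re-verify after every hotfix.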

  • Sounds like your call is falling into those edge cases where the APE doesn't have the same stuff as the core server. You can validate which verbs do and don't exist by launching SWQL Studio, connecting to the APE, and seeing if what you need shows up; if it's not there, then this might not be a usable workaround in your case.
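    The same check can be scripted instead of done by hand in SWQL Studio: SWIS exposes its verb catalog through the Metadata.Verb entity, so you can query the APE before deciding where to send invoke calls. A sketch, assuming an orionsdk-style client (the FakeSwis class exists only to make the example self-contained):

```python
def verb_exists(swis, entity, verb):
    """Return True if the connected SWIS endpoint exposes the given verb.

    Metadata.Verb is SWIS's own verb catalog; querying it against an APE
    tells you whether e.g. DiscoverInterfacesOnNode is registered there.
    """
    res = swis.query(
        "SELECT Name FROM Metadata.Verb "
        "WHERE EntityName = @entity AND Name = @verb",
        entity=entity, verb=verb,
    )
    return len(res["results"]) > 0

# Stand-in client so the sketch runs without a live server:
class FakeSwis:
    def query(self, q, **params):
        known = {("Orion.NPM.Interfaces", "DiscoverInterfacesOnNode")}
        hit = (params["entity"], params["verb"]) in known
        return {"results": [{"Name": params["verb"]}] if hit else []}

ok = verb_exists(FakeSwis(), "Orion.NPM.Interfaces", "DiscoverInterfacesOnNode")
```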

  • So much of this rings true! We too have SAM, and I get a huge "hang" on the web page if I bulk delete maybe 100+ interfaces, although I have never pushed it anywhere near the volume that the API is. Would you mind sharing your SR number with Support if you have it? We have one open and I would like to escalate this. If it's been evidenced in the GUI and not just the (unsupported) API, I'd like to test this and get it resolved. I agree the Max Pool setting is not ideal to change; we are forever applying hotfixes, and manually changing 24 different servers every time is just not workable for us. Did you get any better results with the max pool size of 2000? Our DB server can handle it. I am also wondering whether that 1,000 limit is per server/APE or a total across the whole platform.

  • Here is the Support Case#: 00364981

    And the last comment from Support:

    After reviewing with development, we have found that this is a bug in the application, and you have been helpful in identifying it for us. The development team is currently working on potentially fixing this in a future hotfix or update. Unfortunately, the only current workaround is the one provided: modifying the SWNetPerfMon file, or avoiding the deletion of more than 1,000 interfaces at the same time by separating it into smaller batches.

  • I am experiencing the same issue. Support case# 00384717

    I will update here as the case gets updated.

  • Glad to catch this thread. Has there been any update on this bug, guys?

    I've hit the same issue where I can't delete more than 1,000 interfaces, and it's not recommended to delete interfaces from the database directly, as the nightly job is what clears those interface references from the other tables in the DB.