
SolarWinds.InformationService.ServiceV3.exe - Railing a single processor?

jbiggley

I've started to notice that some of our alerts periodically fail to translate variables. This is usually caused by a misconfigured alert, but recently I noticed it happening on long-standing alerts that generally work. In this specific case, it is a random occurrence. In the past few days we generated 142 alerts for this particular configuration and 4 of the 142 failed to translate the variables. While not a huge number, it does mean that 2.8% of those particular alerts didn't work. That's a high failure rate in a business where the only thing that matters is people trusting their alerting!

So I went digging. For context, we run a single-instance SolarWinds environment with 17 additional polling engines across 3 data centers, with almost 98K elements and 22K application monitors. We have VMAN integrated and have deployed SRM alongside a lightly used NTA. Now, I know that the alerting service is SolarWinds.Alerting.Service.exe, and Process Explorer says that it is using about 3% of the total CPU on my 8 vCPU server. I also know that my processor queue length runs at about 5 for this server and that most of my CPUs are about 60% utilized at any given time.

Except for processor 2 (or in SolarWinds, processor 3!).

From the screenshots below you can see that SolarWinds.InformationService.ServiceV3.exe is railing this processor. All the time.

The SWISv3 exe is 2015.1.1.6134. No, I am not running NPM v12 yet.

Is anyone else seeing similar behaviour on this executable? Anyone else noticing problems with interpreting variables where this service is railing a processor? This SWISv3 service consumes far and away the most CPU time of any service on our primary poller. At the time of this posting, SWISv3 had consumed 65 hours of CPU time. The next closest was our Splunk agent (31 hours), and the next closest SolarWinds process was a BusinessLayerHost at 4.5 hours.

2016-09-23 15_16_23-WPOH0019SWPOL01 (srvsnmp01) (WPOH0019SWPOL01) - Remote Desktop Connection Manage.png 2016-09-23 15_36_05-WPOH0019SWPOL01 (srvsnmp01) (WPOH0019SWPOL01) - Remote Desktop Connection Manage.png

Tags: Performance, CPU, npm11.5

Accepted answers

jbiggley

I should also add that we found some of our custom-tuned entries in the SWNetPerfMon.db file in our \SolarWinds\Orion directory. Because we have such a large environment we modified this file (on the advice of support and the devs -- DO NOT CHANGE THIS WITHOUT TALKING TO THEM!) to ensure that we can talk to the DB without issues.

String must end with ;Max pool size=2000;Min Pool Size=20;Connection Timeout=300;Data Source={our_db_server_name};Initial Catalog={our_db_name}
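For anyone who wants to confirm that tuned values like these are still present after an upgrade or repair, here is a minimal sketch in Python. It only parses a semicolon-delimited key=value string and compares it against the three pool and timeout values quoted above. The function names and the `EXPECTED` table are my own illustration, not anything shipped with Orion, and how you obtain the string from the file is left to you.

```python
# Hypothetical helper: names and structure are illustrative only.
# The expected values are the ones quoted in the post above.
EXPECTED = {
    "max pool size": "2000",
    "min pool size": "20",
    "connection timeout": "300",
}


def parse_connection_string(conn):
    """Split a ';'-delimited key=value string into a dict with lower-cased keys."""
    settings = {}
    for part in conn.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate leading, trailing, or doubled semicolons
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError("segment without '=': %r" % part)
        settings[key.strip().lower()] = value.strip()
    return settings


def missing_or_changed(conn, expected=EXPECTED):
    """Return the expected keys whose actual value differs (None if absent)."""
    actual = parse_connection_string(conn)
    return {k: actual.get(k) for k, v in expected.items() if actual.get(k) != v}
```

An empty dict from `missing_or_changed` means all three tuned values are present as quoted. Anything else lists what was found instead.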

Between the account locking out, this entry, some objects that weren't growing as they needed to in the DB, etc. it was a *long* day of troubleshooting but I am calling this one resolved.

Thanks for the help designerfx!

All comments

ecklerwr1

I have on-and-off problems with a few of these services... I'll check some more. I'll see event log messages saying that one of them has been restarted something like 50 times.

jhandberg

I have recently seen some of the variables in our alerts not translating as well, but I had not looked into the reason yet, or at least not looked into this possible reason. It sounds like I need to look at what is happening on our CPUs. I know at my last position I had problems with my SolarWinds Information Service crashing and restarting itself a lot, but we didn't use as many variables in alerts there and relied more on the NOC.

jbiggley

Quick Monday morning update. Our primary poller in our production environment is still displaying the same CPU usage profile. However, it occurred to me on the weekend (What? You don't think about monitoring problems on the weekend too??) that I should check out our dev environment. Granted, dev is currently a single-poller environment with a fraction of the load of prod, but it seemed like we might see the same profile.

Checked this AM. Nope. Not even close to the same. In fact, CPU usage on that dev server for the SWISv3 process is <1%.

So, Monday morning status. Prod still being thrashed on a single CPU by the SWISv3 process and dev is not.

We do a scheduled reboot on Thursday mornings, so if nothing changes between now and then I'll be sure to post an update on Thursday AM to see if a reboot helps.

jbiggley

As promised, here is my 'post-reboot update'.

No change. The SWISv3 process still hovers around the 28-30% total CPU utilization on our 8 CPU server. It *did* switch to CPU 4, but that CPU is still running hot.

2016-09-29 10_38_25-WPOH0019SWPOL01 (srvsnmp01) (WPOH0019SWPOL01) - Remote Desktop Connection Manage.png
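A quick way to put a "percent of total CPU" figure next to "one processor is railed" is to convert it into core-equivalents. This is only a sketch, and it assumes the tool reports a process's CPU as a share of all logical processors combined. The function names are made up for illustration.

```python
def single_core_share(logical_cpus):
    """Percent of total CPU that one fully busy logical processor represents."""
    if logical_cpus <= 0:
        raise ValueError("logical_cpus must be positive")
    return 100.0 / logical_cpus


def core_equivalents(total_percent, logical_cpus):
    """Convert a percent-of-total figure into a number of fully busy cores."""
    if logical_cpus <= 0:
        raise ValueError("logical_cpus must be positive")
    return total_percent * logical_cpus / 100.0
```

On an 8 CPU server, one saturated processor is 12.5% of the total under that assumption, so a process shown at 28-30% of total works out to somewhat more than two cores' worth of CPU time, even if only one processor looks pegged in the per-CPU graph.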

designerfx

Hate to suggest it, but have you tried the repair options (Configuration Wizard / repair all)? Any sort of errors in your SolarWinds application event log?

jbiggley

Good call on the event log. We just discovered that an account that we use to connect to the Orion DB for another application (not the SolarWinds Orion DB user) was locked out. We've just unlocked the account and we're investigating the impact now.

Why does that matter?

We were seeing Event ID 4001, "Service was unable to open new database connection when requested", show up periodically in Application and Services Logs > SolarWinds.Net.

Why does that matter? While I don't know how the two are connected, take a look at our Process Explorer CPU now! Also no variables being passed without translation in the last 20 minutes. I'll keep monitoring, but this looks good.

2016-09-30 10_59_19-wpoh0019swpol01 - Remote Desktop Connection.png
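If you would rather check for these 4001 events from a script than click through Event Viewer, something like the following sketch could work. It shells out to the standard Windows `wevtutil qe` command with an XPath filter on the event ID. The log name passed in is an assumption: I am using the name as it appears in Event Viewer (SolarWinds.Net), and the underlying channel name on your server may differ. The function names are hypothetical.

```python
import subprocess


def build_wevtutil_query(log_name, event_id, max_events=20):
    """Build a 'wevtutil qe' command line for the newest matching events."""
    xpath = "*[System[(EventID=%d)]]" % event_id
    return [
        "wevtutil", "qe", log_name,
        "/q:" + xpath,           # XPath filter on the event ID
        "/f:text",               # human-readable output
        "/c:%d" % max_events,    # cap the number of events returned
        "/rd:true",              # newest events first
    ]


def recent_events(log_name, event_id, max_events=20):
    """Run the query on the local Windows host and return the raw text output."""
    cmd = build_wevtutil_query(log_name, event_id, max_events)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

Calling `recent_events("SolarWinds.Net", 4001)` on the poller should return the text of the most recent matching events, assuming the log name is correct for your install.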

designerfx

4001 errors were exactly what I figured you'd find, good deal.


jbiggley

I just wanted to point all pre-NPM12 customers to this great post by xtraspecialj

Re: Orion Alert History Table Retention (Post NPM 11.5)

We found that C:\ProgramData\Solarwinds\Logs\Orion\Alerting.Service.V2.log had a ton of LONG RUNNING QUERY entries... like more than 7000 in the last 24 hours! When we dug into the queries, they appeared to be tied to the Alert History table. See the linked post above, but we are preparing to delete a bunch of entries, as we confirmed that while maintenance was generally finishing, the truncation of the AlertHistory table was not.

Hopefully this helps someone else as they dig into SWISv3 service performance issues in pre-NPM12 installs.
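For a rough count of those entries without opening the log in an editor, here is a minimal sketch. It simply counts lines containing the text LONG RUNNING QUERY. It makes no assumption about the timestamp format in the log, so it counts the whole file rather than the last 24 hours, and the function names are my own.

```python
MARKER = "LONG RUNNING QUERY"


def count_marker_lines(lines, marker=MARKER):
    """Count lines that contain the marker text (case-sensitive)."""
    return sum(1 for line in lines if marker in line)


def count_in_file(path, marker=MARKER):
    """Count marker lines in a log file, tolerating odd encodings."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return count_marker_lines(handle, marker)
```

For example, `count_in_file(r"C:\ProgramData\Solarwinds\Logs\Orion\Alerting.Service.V2.log")` run on the poller would give the total for the current log file.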

cczech

jbiggley

Seeing something similar to this on my 12.x instance, specifically where my AlertHistory table is not clearing past the retention period. In this post you mention identifying a long-running query. Did you see this noted in the actual log file, or was it more from DB monitoring? I realize it's an old post, but thought I'd ask.

jbiggley

We used Database Performance Analyzer to see the long-running query. (I promise, I don't work for SolarWinds and I don't get kick-backs for helping sell their products!)

cczech

We've got DPA, so no hard sell there. Thanks for the info.