We are a relatively small environment, just over 8,000 total elements. Our SQL is shared with other software. Does anyone else run a similar environment where you aren't running dedicated SQL for SW? I have the SAM template on this SQL server and DPA is monitoring it as well, and between NPM, SAM, and DPA all available metrics show relatively small demand on the server. Paging does happen, and RAM and CPU I/O can be a tad high at times, but at a high level the averages for all metrics are tiny. Yet we see long-running queries piled up. We see deadlocks. We see a number of things all pointing to the database. Which leaves me as the admin scratching my head: if everything is low and there is no demand crushing the database, then why do we have problems pointing to the DB?

I threw all I had at the database. I forced 30-day diagnostics on all my pollers, decreased the thresholds to aggressively fast monitoring, increased the amount of data we monitor, and forced a database maintenance. I was purposely after errors; I wanted to see something break. And while I did cause spikes in charting and metrics, it wasn't anything that would kill a system. It slowed things to a near halt, but the interesting part is that it did not error. Largely it was just delayed, which again doesn't make sense to me.

Recently we increased vCPUs and RAM to the max available under our SQL license, and we moved our storage to a higher-tier SSD storage on our SAN. Our environment has LUNs carved out for the ESXi hosts, and the VM mounts and drives all come out of the ESXi infrastructure. Not sure if SQL should have more dedicated, direct access to the SAN.

Problem-wise, what we see is intermittent slow performance: logs littered with long-running queries, a large number of deadlocks during database maintenance, Macro Parser failures due to delays from the database, and things randomly failing because of what appears to be SQL performance.
A little more information on our setup. We have a primary HA pool and an APE HA pool, and on SQL we have a customized Always On HA pair. On the Always On SQL cluster we do have log rotation happening roughly every 6 hours to avoid the log partition filling up. I believe the primary pollers are in the same ESXi host, while the standbys are in a different data center. The main SQL server has 8 cores and 60 GB of RAM. While this seems low, considering our less aggressive monitoring, low data collection, and low element count, it should technically be able to support this. The drives are separated but probably running out of the same LUN, since the ESX host has one LUN for all the drives it hosts. I've been spinning my wheels on this for about a full month trying to figure it out and have come up empty-handed, so I'm turning to the community to see if anyone can help me out. It would be greatly appreciated, and I can provide more information if the above does not paint a full picture. Thanks!
Ah yes that sounds like a real pain.
For parallelism we run the cost threshold at 50, and 0 (unlimited) for MAXDOP, and it seems to work well for what we have.
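For reference, those two instance-level settings can be applied like this; a minimal sketch using the values from this thread (cost threshold 50, MAXDOP 0), which you'd want to validate against your own workload before adopting:

```sql
-- Both are "advanced" options, so they must be exposed first.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Cost threshold for parallelism: minimum estimated query cost
-- before the optimizer considers a parallel plan.
EXEC sp_configure 'cost threshold for parallelism', 50;

-- Max degree of parallelism: 0 means no per-query core limit.
EXEC sp_configure 'max degree of parallelism', 0;
RECONFIGURE;
```

Both take effect immediately after `RECONFIGURE`, no restart needed.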
I definitely think the memory-optimized objects being disabled is strange; over the last several releases, memory-optimized objects have become quite significant for Orion. Given a blank SQL install, the Orion installer definitely does turn them on and create them with no effort needed. I'd suspect that, in conjunction with your atypical replication situation, someone disabled them, since they are only compatible with SQL AAG and not with other legacy HA solutions.
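If you want to confirm what state your Orion database is actually in, a quick check like this (run in the Orion database; these are standard catalog views) shows whether a memory-optimized filegroup and any memory-optimized tables exist:

```sql
-- A memory-optimized data filegroup shows up with type 'FX'.
SELECT name, type_desc
FROM sys.filegroups
WHERE type = 'FX';

-- Any tables currently created as memory-optimized.
SELECT name
FROM sys.tables
WHERE is_memory_optimized = 1;
```

Empty results from both would line up with the "disabled/never created" theory.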
We don't run any kind of HA of our own on the SQL server. Every variation of HA that I have seen has some kind of negative performance impact, and at the end of the day my current employer is relying on Google Cloud's reliability to hold us up. If Google Cloud were offline, almost our entire company would be dead in the water, and there's not really anything we could do about it from our end, so that's the call that was made before I arrived. With the scripts and such that I use, I could rebuild my environment from a fresh server in a few hours if we had to. All my views and alerts and the most relevant data to rebuild are backed up to GitHub, so for us we are comfortable with the balance of risk versus improved day-to-day performance of just not bothering with SQL HA. With that said, we've not had an unscheduled downtime or performance issue for our database server since we built the cloud server, so far so good.
The issue presents itself as slowness: database maintenance that takes way longer than it should, among other performance-related problems. Messages hung up in RabbitMQ. Deadlocks. Etc. Plenty of operation-timeout error messages in nearly every log, plus commit errors and retry errors. I mean, there is more than enough evidence to prove the issue is backend-server related. Hubble shows it's not .NET and it's not SWIS; those respond near instantly. But the database shows higher times than I would like to see.
Previously I managed an environment for a different company, and they had near-instant SQL response; you could throw the most expensive queries and procedures that SQL could process at it, and it would crunch through them without breaking a sweat. I'm expecting similar here; even if there is a difference, it shouldn't stray too far.
If my memory serves me right, Hubble there showed SQL times of 300 ms or below, never above that. In our current environment I've seen it as high as 30 seconds. Running manual queries just to test response time, I give it small, easy queries and it sits there executing before returning a response, which again shows a delay in execution.
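One way to put hard numbers on that delay is to time a trivial query server-side, so network and client overhead are excluded. A sketch, assuming the standard Orion `Nodes` table (any small table works):

```sql
-- Report per-statement CPU time and elapsed time in the Messages tab.
SET STATISTICS TIME ON;

SELECT COUNT(*) FROM Nodes;   -- small, cheap test query

SET STATISTICS TIME OFF;
```

If elapsed time is much larger than CPU time, the statement spent most of its life waiting (I/O, blocking) rather than computing, which is a useful data point to bring to the DBAs.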
I've learned that our DB is not cache optimized and not memory optimized, and most settings are SQL defaults, meaning when it was built it was assumed the Configuration Wizard would set all that up, when it didn't.
Parallelism settings were default. Memory was default. And we noticed that when we changed parallelism, changed max memory, and moved the drives to SSD, there was a very evident performance increase. But we're not quite there yet.
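For anyone auditing the same things, the effective values are easy to read back, and max server memory can be capped so the OS keeps headroom. A sketch (the 55,296 MB figure is just an illustrative cap for a 60 GB box, not a recommendation):

```sql
-- What the engine is actually running with right now.
SELECT name, value_in_use
FROM sys.configurations
WHERE name IN ('max server memory (MB)',
               'cost threshold for parallelism',
               'max degree of parallelism');

-- Example: cap SQL at 54 GB so the OS and other services keep ~6 GB.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 55296;
RECONFIGURE;
```

The default max server memory is effectively unlimited, which on a shared box is often exactly what causes the paging you'd rather avoid.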
As an admin it bugs me when I see errors and I hear teams complain about response times, and then they look at me as if I'm the problem, when really it's not even SolarWinds but rather SQL that's not optimized correctly for SW use.
I hope this helps. And as always, I appreciate the helping hand. I've literally been burning out chasing this day and night trying to get it resolved.
My SQL instance right now is 16 vCPU, 600 GB RAM, and runs all of the applications associated with the monitoring platforms. I imagine that my predecessors wanted to avoid any kind of RAM bottleneck when they moved to GCP, so they just went hard with it. SQL drives are all split out per database, plus tempdb and logs on SSD; only the backup drive is bulk storage. We're down to 10 pollers with about 50k elements, but they had significantly more in the past, so I think that contributes to the monstrous DB. Shared isn't inherently a deal breaker, but you want to do what you can to slice out resources as best you can, and the ultimate bottleneck I usually ran into with SQL was almost always going to be disk latency.
Storage teams often struggled to explain why the infrastructure they put together was giving my clients intermittent massive bursts of latency. I have to say that's one of the nice things about GCP: I monitor the performance data, but ultimately we don't have to be responsible for it, and more often than not it performs exactly as advertised. No more back-and-forth haggling with someone trying to convince me I shouldn't be bothering them because I'm supposedly on the fastest available SSD storage, while I'm measuring latencies that would have been considered a red flag on spinning disks.
The help from everyone here, including yourself, has been very much appreciated and has helped me make progress. But if you don't mind, I need a little more help to get to the end goal.
Paging is something that happens constantly on our SQL server. From my understanding, paging happens when there isn't enough RAM, so the system has to page memory out to the hard drive instead. Correct?
While paging at a low level wouldn't really suggest a performance issue, high paging becomes a performance concern. Page faults are also high. I was reading somewhere that a page fault is when SQL goes to RAM thinking the data is there, then has to go searching for it because it's not where it was expected. Correct?
I was looking at our paging metrics and statistics, and I see that after database maintenance, page life expectancy drops from 15 days to 4 days and then increases fast throughout the day. Best practice states it shouldn't even be as high as a day, so the fact that we are going two to nearly three weeks out indicates possible paging problems. Hard page faults are also very high.
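To trend that number directly rather than through charts, PLE can be read from the DMVs; note it's reported in seconds (15 days is roughly 1,296,000 seconds), and it naturally drops right after index maintenance because rebuilds churn the buffer pool. A minimal query:

```sql
-- Page Life Expectancy from the Buffer Manager counters, in seconds.
SELECT object_name,
       counter_name,
       cntr_value AS ple_seconds
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Page life expectancy'
  AND object_name LIKE '%Buffer Manager%';
```

Logging this on a schedule gives you a baseline to show the DBAs instead of point-in-time screenshots.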
The memory and CPU I/O metrics are also high. Being that this is a VM, I'm assuming this happens due to ESX policing who gets access to the CPU, correct?
Network is the only bright spot, as the network stats never show overhead or delays. Ever.
Disk performance has improved since we made enhancements. We are connected to partitions on the ESX host, and ESX has one LUN per host from the SAN. Fast-cache policies are enabled to provide the best possible storage performance to all hosts.
Because we have an Always On active/active cluster on our SQL, tempdbs are cycled every 6 hours, and I believe replication also happens at predetermined times. I see that when replications happen, large delays follow. It's going from a DB in one datacenter to a DB in a different datacenter, and our interconnect link isn't the most reliable; it's prone to errors such as CRC errors and outages. It's something we believe could be causing the delays in replication.
Are there any best practices or high-level, bird's-eye-view advice you can provide given the additional information above? That way I can add it to the information already here and have a productive discussion with my DBAs.
Right now they are reluctant and hesitant to make changes, as they aren't seeing problems from their perspective.
I ran ours on a shared instance for years and it ran fine. The only time I ever ran into issues was when doing upgrades. On a shared instance it's probably not a good idea to run any type of installer with SQL SA, so on a couple of occasions I ran into rights issues during an install. Other than that it ran fine.
What you have is what I'm trying to accomplish, haha. I know it's possible, but proving it and getting approvals becomes an act of congress when the metrics do not clearly spell out trouble. More often I get the numbers thrown at me and people saying, "See, there are no problems. Not approved."
Typically these are also the people who do not understand how SolarWinds works and what its requirements are, haha.
I am calling on you all directly because you have all greatly helped me in the past and provided guidance when I couldn't find it anywhere through my own research.
Happy to provide any additional information needed. But as you can see from what's available, I have a very interesting and at the same time weird problem happening, and I can't figure out if it comes down to how our environment and SQL were deployed and set up, or if this is something in the configuration that needs to be changed. Confused is an understatement at this point. I've been spinning my wheels for a full month now trying to figure this out and haven't gained any traction.
We are hitting up against several constraints due to the licenses we have for SQL, and hardware-wise as well, and proving my point to get more resources has been a difficult uphill battle. I say this because all the metrics, charts, etc. simply do not show critical numbers that would warrant any approvals beyond what we already have.
But before I get there, I need to build this picture out a little more. In my research it appears to be deployment and design that's working against us, even though the numbers look good. We've made many SQL enhancements within the last month, and those took an act of congress to get approved.
Can any of you give me your opinion on what we can do? Should we separate drives into individual LUNs instead of housing it all under one roof? Should we use HBA fiber connections? Should we be on a fast-cache policy on the SAN? Are there any SQL configurations to look at and tune? Any SQL settings that would let SQL use the available resources more efficiently?
Again, I can provide more information if need be. I just really need help on this front so I can find the root cause of my problem and work from the root up.
Thanks in advance.
Most customers run Orion on a shared SQL instance, though it is recommended to separate Orion out to its own LUN for the database and transaction logs to minimize any impact it might have on other applications sharing the SQL server. Your tempdb would also, ideally, be on SSD for maximum performance. Virtually every query in Orion is processed through tempdb, so if you're suffering performance issues, this is a good place to start.
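A quick way to see where tempdb currently lives (and whether it has been split into multiple data files, a common first step when tempdb is a suspected bottleneck) is a standard catalog query like this:

```sql
-- tempdb file layout: names, physical paths, and sizes.
-- size is in 8 KB pages, so * 8 / 1024 converts to MB.
SELECT name,
       physical_name,
       size * 8 / 1024 AS size_mb
FROM tempdb.sys.database_files;
```

If the `physical_name` paths all land on the same shared LUN, that's worth raising with the storage team alongside the SSD recommendation.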
Without knowing exactly what the long-running queries are, or those queries that are deadlocked it's difficult to speculate as to the cause. I would recommend that you run Orion on SQL 2016 SP1 or later if you're not already, to take full advantage of column store indexes. Also, ensure that you're running the 2020.2 release or later, as there were enormous performance improvements made throughout various different areas of the product in this release.
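As a starting point for identifying exactly what those long-running and deadlocked queries are, two standard DMV queries help; the second pulls deadlock graphs from the built-in `system_health` Extended Events session, so no extra tracing needs to be set up:

```sql
-- 1) Top cached queries by average elapsed time since last restart.
--    Times in the DMV are microseconds; / 1000 converts to ms.
SELECT TOP 10
       qs.total_elapsed_time / qs.execution_count / 1000 AS avg_elapsed_ms,
       qs.execution_count,
       SUBSTRING(st.text, 1, 200) AS query_text
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
ORDER BY avg_elapsed_ms DESC;

-- 2) Recent deadlock reports captured by the system_health session
--    (ring buffer target, so only the most recent events are retained).
SELECT XEventData.XEvent.value('(data/value)[1]', 'nvarchar(max)') AS deadlock_graph
FROM (SELECT CAST(t.target_data AS XML) AS TargetData
      FROM sys.dm_xe_session_targets t
      JOIN sys.dm_xe_sessions s
        ON s.address = t.event_session_address
      WHERE s.name = 'system_health'
        AND t.target_name = 'ring_buffer') AS src
CROSS APPLY TargetData.nodes('RingBufferTarget/event[@name="xml_deadlock_report"]') AS XEventData(XEvent);
```

Posting the top offenders from the first query would make it much easier to speculate usefully about the cause.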
You, sir, are the man! Much appreciated. Basically, what you stated falls in line with what I had in mind: essentially it points to design and deployment that could be hampering our performance. Feels good to know I wasn't wrong in my assessment.
We are planning to switch over to the 2020.2 version; we are just trying to schedule it out and get the right approvals. Then we plan to deploy to our test environment before promoting to prod. We want to see if we can actually pinpoint the improvements so we can do a proof-of-concept type demonstration for management and showcase the improvements we should get with the update.
To give you a little better picture: our current setup basically has storage sliced up between ESX hosts, and then the hosts slice that up for the VMs. I'm thinking overhead in this setup could potentially be causing us performance hits. This alone is very useful information for me to put together a report of sorts. Constrained resources are another potential source of performance hits. All together, these little things add up to a large performance problem.
They recently switched us to higher-performance storage adapters on our VMs; Wintel stated that these adapters offer higher throughput than the ones we were set up with.
One question I have: running an entire deployment off of one ESX host is against best practices, right? Not recommended? Currently we have pollers, the additional web server, all SW databases, and practically everything SW running out of the same LUN assigned to that one ESX host. I'm sure that LUN is also divided among many other VMs running other software, meaning it's a pretty loaded LUN, I would assume.
The uphill battle for me is that Wintel has run several reports showing averages that are well below any threshold, and they push back hard, saying the numbers do not support any further upgrades, and they deny the request. Typically these same people don't understand SolarWinds or its requirements, or how something that looks fine could actually be causing problems.
Thanks so much in advance for all your help.