
Does anyone run SW on a shared SQL server?

We are a relatively small environment, just over 8,000 total elements, and our SQL Server is shared with other software. Does anyone else run a similar environment where you aren't running dedicated SQL for SW? I have a SAM template on this SQL server and DPA is monitoring it as well, and between NPM, SAM, and DPA every available metric shows relatively small demand on the server. Paging does happen, and RAM and CPU I/O can be a tad high at times, but at a high level the averages for all metrics are tiny.

Yet we see long-running queries piled up. We see deadlocks. We see a number of things all pointing to the database, which leaves me as the admin scratching my head: if everything is low and there is no demand crushing the database, why do we have problems pointing to the DB?

I threw everything I had at the database: forced 30-day diagnostics on all my pollers, decreased the monitoring thresholds to poll aggressively fast, increased the amount of data we monitor, and forced a database maintenance. I was purposely after errors; I wanted to see something break. And while I did cause spikes in the charts and metrics, it wasn't anything that would kill a system. It slowed things to a near halt, but the interesting part is that it did not error. Largely it was just delayed, which again doesn't make sense to me.

Recently we increased vCPUs and RAM to the maximum available under our SQL license, and we moved our storage to higher-tier SSD storage on our SAN. Our environment has LUNs carved out for the ESXi hosts, and the VM mounts and drives all come out of the ESXi infrastructure. I'm not sure whether SQL should have more dedicated, direct access to the SAN.

Problem-wise, what we see is intermittently slow performance, logs littered with long-running queries, a large number of deadlocks during database maintenance, Macro Parser failures due to delays from the database, and things randomly failing because of what appears to be SQL performance (a quick spot-check of those queries from the SQL side is sketched below).

A little more information on our setup: we have a primary HA pool and an APE HA pool, and on SQL we have a customized Always On HA pair. On the Always On SQL cluster we do have log rotation happening roughly every 6 hours to avoid the log partition filling up. I believe the primary pollers are on the same ESXi host, while the standbys are in a different data center. The main SQL server has 8 cores and 60 GB of RAM. While this seems low, given our less aggressive monitoring, low data collection, and low element count it should technically be able to support this. The drives are separated but are probably running out of the same LUN, since the ESXi host has one LUN for all the drives it hosts.

I've been spinning my wheels on this for about a full month and have come up empty-handed, so I'm turning to the community. I can provide more information if the above does not paint a full picture. Any help would be greatly appreciated. Thanks!
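Since the symptoms above (long-running queries, deadlocks, Macro Parser timeouts) all point at the database, here is a minimal sketch of how you might spot-check them from the SQL Server side. It only assumes access to the standard dynamic management views; the 30-second cutoff is an arbitrary threshold for illustration.

```sql
-- Sketch: list active requests running longer than 30 seconds, what they are
-- waiting on, and which session (if any) is blocking them.
-- The 30000 ms cutoff is arbitrary; adjust to taste.
SELECT
    r.session_id,
    r.blocking_session_id,
    r.status,
    r.wait_type,
    r.wait_time            AS wait_time_ms,
    r.total_elapsed_time   AS elapsed_ms,
    t.text                 AS sql_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id <> @@SPID
  AND r.total_elapsed_time > 30000
ORDER BY r.total_elapsed_time DESC;
```

For the deadlocks seen during maintenance, the default system_health Extended Events session also captures deadlock graphs (xml_deadlock_report), which may be worth pulling for the maintenance window before engaging the DBAs.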

Parents
  • My SQL instance right now is 16 vCPU, 600 GB RAM and runs all of the applications associated with the monitoring platforms.  I imagine that my predecessors wanted to avoid any kind of RAM bottleneck when they moved to GCP, so they just went hard with it.  SQL drives are all split out per database, plus tempdb and logs, on SSD; only the backup drive is bulk storage.  We're down to 10 pollers with about 50k elements, but they had significantly more in the past, so I think that contributes to the monstrous DB.  Shared isn't inherently a deal breaker, but you want to do what you can to slice out resources as best you can, and the ultimate bottleneck I usually ran into with SQL was almost always disk latency.

    Storage teams often struggled to explain why the infrastructure they put together was giving my clients intermittent massive bursts of latency.  That's one of the nice things about GCP: I still monitor the performance data, but ultimately we aren't responsible for it, and more often than not it performs exactly as advertised.  There's no back-and-forth haggling with someone trying to convince me I shouldn't be bothering them, being told I'm on the fastest available SSD storage while measuring latencies that would have been considered a red flag on spinning disks.
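    One way to settle that argument is to measure the latency from inside SQL Server rather than from the storage side. Below is a minimal sketch using the standard file-stats DMV; the stall counters are cumulative since the last restart, so treat the averages as long-run figures (or diff two samples for a point-in-time view).

    ```sql
    -- Sketch: average read/write latency per database file as SQL Server sees it.
    -- io_stall_*_ms values are cumulative since instance startup.
    SELECT
        DB_NAME(vfs.database_id)                              AS database_name,
        mf.physical_name,
        vfs.num_of_reads,
        vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)   AS avg_read_latency_ms,
        vfs.num_of_writes,
        vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0)  AS avg_write_latency_ms
    FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    JOIN sys.master_files AS mf
      ON mf.database_id = vfs.database_id
     AND mf.file_id     = vfs.file_id
    ORDER BY avg_read_latency_ms DESC;
    ```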

  • Very interesting! But I have to ask: what is GCP? Not sure I follow. Thanks!

Reply Children
  • Marc,

    The help from everyone here, including yourself, has been very much appreciated and has helped me make progress. But if you don't mind, I need a little more help to get to the end goal.

    Paging is something that happens constantly on our SQL server. From my understanding, paging happens due to a lack of RAM: the server has to page out to disk because there isn't enough RAM to keep the data in memory. Correct?

    While a low level of paging technically wouldn't suggest a performance issue, high paging becomes a performance concern. Page faults are also high. I read somewhere that a page fault is when SQL goes to RAM expecting the data to be there and then has to go searching for it because it isn't where it was expected. Correct?

    I was looking at our paging metrics and statistics, and I see that after database maintenance, page life expectancy drops from 15 days to 4 days and then climbs quickly throughout the day. Best practices state that it shouldn't even be as high as a day, so the fact that we are going out two to nearly three weeks indicates possible paging problems. Hard page faults are also very high.
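    In case it helps the conversation with the DBAs, here is a minimal way to read Page Life Expectancy straight from SQL Server, to sanity-check the values SAM/DPA is charting. Note that the counter is reported in seconds, and on NUMA hardware there is also a per-node value under the Buffer Node object.

    ```sql
    -- Sketch: current Page Life Expectancy (in seconds) from the performance
    -- counter DMV, instance-wide (Buffer Manager) and per NUMA node (Buffer Node).
    SELECT object_name, counter_name, cntr_value AS ple_seconds
    FROM sys.dm_os_performance_counters
    WHERE counter_name = 'Page life expectancy'
      AND (object_name LIKE '%Buffer Manager%' OR object_name LIKE '%Buffer Node%');
    ```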

    The memory and CPU I/O metrics are also high. Since this is a VM, I'm assuming this happens because ESX polices who gets access to the CPU. Correct?
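    One hedged way to check whether CPU scheduling (rather than memory or disk) is the pressure point is the signal-wait ratio: the share of total wait time spent waiting for a CPU after the blocking resource was already available. Ratios persistently above roughly 20-25% are commonly read as CPU pressure, though that threshold is a rule of thumb, not a hard limit. Comparing this with the VM's CPU Ready figures in vSphere would show whether the pressure comes from inside the guest or from the host.

    ```sql
    -- Sketch: percentage of total wait time that is signal (CPU) wait,
    -- cumulative since the last restart or wait-stats clear.
    SELECT
        CAST(100.0 * SUM(signal_wait_time_ms) / NULLIF(SUM(wait_time_ms), 0)
             AS DECIMAL(5, 2)) AS signal_wait_pct
    FROM sys.dm_os_wait_stats;
    ```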

    Network is the only bright spot: the network stats never show overhead or delays.

    Disk performance has improved since we made enhancements. We are connected to partitions on the ESX host; ESX has one LUN per host from the SAN, and fast cache policies are enabled to provide the best possible storage performance to all hosts.

    Because we have an Always On Active/Active cluster on our SQL, tempdbs are cycled every 6 hours, and I believe replication also happens at predetermined times. I see large delays when replication happens. It's going from a DB in one data center to another in a different data center, and our interconnect link isn't the most reliable; it's prone to errors such as CRC errors and outages, which we believe could be causing the replication delays.
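    Given the suspect inter-datacenter link, one more sketch that might be useful for the DBA discussion: the Always On DMVs expose the log send and redo queues per database replica, and watching those grow during the delay windows would tie the replication lag to the link rather than to the SolarWinds workload.

    ```sql
    -- Sketch: replication backlog per availability database.
    -- log_send_queue_size / redo_queue_size are in KB; the *_rate columns are KB/sec.
    SELECT
        ar.replica_server_name,
        DB_NAME(drs.database_id)        AS database_name,
        drs.synchronization_state_desc,
        drs.log_send_queue_size,
        drs.log_send_rate,
        drs.redo_queue_size,
        drs.redo_rate
    FROM sys.dm_hadr_database_replica_states AS drs
    JOIN sys.availability_replicas AS ar
      ON ar.replica_id = drs.replica_id;
    ```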

    Are there any best practices or high-level, bird's-eye-view advice you can provide given the additional information above? That way I can add to the information already here and have a productive discussion with my DBAs.

    Right now they are reluctant to make changes because they aren't seeing problems from their perspective.

    Thanks!