3 Replies Latest reply on Aug 6, 2019 8:50 PM by badgenes

    Currently evaluating DPA seeing lots of wait time caused by CPU/RAM but...


      Not seeing excessive CPU utilization and have allocated the max that SQL 2017 will allow for RAM and he'll have plenty of overhead for the OS. For instance I'm seeing a lot of wait time sourced from CPU/RAM this morning yet CPU utilization peaked at 51% and as averaging 26%. I'm not a database administrator so perhaps I'm missing the boat on how SQL can be waiting on CPU/RAM when I've got eight cores that are relatively quiet?




        • Re: Currently evaluating DPA seeing lots of wait time caused by CPU/RAM but...

          Hoping you are evaluating DPA 12.1 as it has a key feature that may help here.

          The simple answer is that you want the vast majority of time spent in cpu/memory - that's an indicator of a healthy system.

          cpu/memory means that the queries are executing/finding data in cache - and not doing other activities like waiting for a lock to be released or physical I/O or ...


          There are some good discussions out there about running/runnable/suspended processes (and a few other states) that may help frame this conversation.


          Now as of DPA 12.0, there's a feature called Table Tuning (under the Tuning tab) that looks for inefficient data access by queries.

          Just because it's finding the data in cache doesn't mean it's efficient. Use the table tuning as a heatmap to find inefficient queries aggregated at the table level. Video here


          Bottom line though is that I'd focus on the other waits like LCK* in your case to see what can be done to resolve that. Ideally, those waits would move up into more cpu/memory which is WAY more productive.

          • Re: Currently evaluating DPA seeing lots of wait time caused by CPU/RAM but...

            To add to mandevil's explanation, you are potentially misunderstanding what cpu/memory wait is showing.  It's not like cpu ready on a VM where the process is not running because it needs an available cpu core.  CPU/Memory wait basically is the counter of how much time the server is actually spending processing data.  If all your wait is CPU/Mem then that means the server is basically processing data as fast as the hardware allows it to work through that data set and you are basically good. SQL has built in capabilities to multithread where it thinks it would be helpful, but not all queries can be sliced up that way so on a multicore system you are pretty unlikely to max out CPU across all cores.  If there was actual CPU bottlenecks you'd probably also see a bigger amount cxpacket or threadpool or sos_scheduler_yield waits.  More resources won't really make this process go any faster, except maybe increasing the actual clock speed of the cpu and memory on the server since SQL would actually be able to use those to perform its calculations faster, but unless you have some freaky old hardware it's probably a really minor bump.


            As alluded to above, in cases where a query is taking longer than we want to run there is almost always significant performance gains to be had by improving the quality of the query itself, or improving the structure of the data with things like better use of indexes or even shuffling the contents of the db around and precalculating certain values instead of mathing them out on the fly every time the query needs to execute. You'd want to look into the DPA Query and Table Tuning Advisors and see if maybe there is an inefficient step that is complicating your query, for example a badly put together query that forces a full table scan when all it needs were a small subset of the results.  You can see in this screenshot from DPA that the tuning advisor is calling out a possible issue where a query ran over 5000 times over the course of a day, it had to process through 1.29 million rows each time, and in the end it finds just 2 rows of results. 




            Maybe there is no better way to get that data, maybe there is.  Database performance tuning is very much a struggle of balancing competing demands where optimizing for one query might slow you down even worse on another action in the same db.  Sometimes it is better to let a nightly report take a long time to run because the "fixes" to that report would slow down millions of day to day transactions.  On the other side, if the nightly report runs so badly that it keeps missing the scheduled window and is causing delays for whatever business processes needed that data the next morning then you might have to give up some performance on the small queries to help the big report finish in time.  Maybe things were optimized for when the data set was much smaller, but it's been 5 years since the last time anyone worked on it and now the tables have tripled in size and there needs to be a new way of handling the data.


            Once you have wrung EVERYTHING you can out of optimizing the query and data side of things then you will end up where the only thing left is faster hardware, but in your case you can see that a significant amount of time is still being spent on locks and latches (can't look up data because the table in question is in the middle of a change).  Since you aren't a DBA it is possible that you aren't going to be able to actually make any changes to this database or to the code that talks to it and you just have to accept those run times as being "normal" for that database.


            A big part of using DPA is just watching for the unexpected changes in wait times, more than trying to resolve every single type of wait.  As in, if a normal day for my database has 30 hours of wait but one random day I see it spiked to 80 hours of wait and users are complaining, i'd be interested in seeing if there were any new queries that popped up that day, or if the number of executions changed.  Did I see a shift in the typical IO latency to my storage back end during that time?  DPA is collecting all those kinds of metrics and can help you as a non-DBA to narrow down the scope of things that could have changed to cause this unusual situation. 

            1 of 1 people found this helpful