Hi, I have a slight issue in that I am not a DBA, so I'm not 100% sure what to look for. We had an issue last night on one of our main BI DBs where I/O to the tempdb drive went nuts:
Task Category: Server
Computer: Our DB Server

SQL Server has encountered 4657 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [L:\Data-Files\tempdb3.mdf] in database [tempdb] (2). The OS file handle is 0x00000000000016C0. The offset of the latest long I/O is: 0x0000032c950000

<Provider Name="MSSQLSERVER" />
<TimeCreated SystemTime="2019-08-19T20:25:00.446407000Z" />
<Computer>Our DB server</Computer>
11 minutes later, the Windows cluster service (RHS.exe) decided MS SQL had failed, shut the service down, and restarted it. The resource is already configured with a 30-minute heartbeat delay timer before it takes such action, so whatever happened sounds pretty severe.
I know we have DB structural issues (all of tempdb's data files are on a single LUN, and the storage provider has highlighted that this is outside their best-practice guidelines and suggested the files be split across LUNs). But using DPA (which we have, albeit on 11.1.457), is there a way to identify what queries were running around this time, so I can try to ID what could potentially be causing tempdb to get hammered?
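For context, this is roughly what I plan to run the next time it happens (a sketch pieced together from the standard DMVs, not something I've verified on our box, and it only sees activity that is live at the moment you run it):

```sql
-- Sketch: show which live sessions hold the most tempdb allocations.
-- sys.dm_db_task_space_usage reports tempdb pages per task; pages are 8 KB.
SELECT
    r.session_id,
    r.start_time,              -- when the request started
    r.status,
    u.tempdb_alloc_mb,
    st.text AS query_text
FROM (
    SELECT session_id,
           SUM(user_objects_alloc_page_count
             + internal_objects_alloc_page_count) * 8 / 1024 AS tempdb_alloc_mb
    FROM sys.dm_db_task_space_usage
    GROUP BY session_id
) AS u
JOIN sys.dm_exec_requests AS r
    ON r.session_id = u.session_id
OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS st
ORDER BY u.tempdb_alloc_mb DESC;
```

I realise this doesn't help with last night's event, since the DMVs don't keep history, which is why I'm hoping DPA has something retrospective.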
I've tried generating a report of the top 15 SQL queries within the time frame. It returns two queries, but it doesn't say when those queries started or whether they finished successfully (logically, the query that may be responsible for this event wouldn't have finished successfully).
Ideally I would like to see a list of queries that were running at the time of the event, preferably with start dates/times.
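If DPA can't give me that, my fallback plan is to build the history myself: a SQL Agent job that snapshots running requests every minute into a table, so after the next incident I can see what was active and when it started. A rough sketch (DBA_Monitor.dbo.RunningQueries is a made-up table name; you'd need to create the database and table first):

```sql
-- Sketch: periodic snapshot of currently running requests, assuming a
-- hypothetical DBA_Monitor.dbo.RunningQueries table exists. Schedule via
-- a SQL Agent job (e.g. every 60 seconds) to build up history.
INSERT INTO DBA_Monitor.dbo.RunningQueries
        (captured_at, session_id, start_time, status, wait_type,
         database_name, query_text)
SELECT
    SYSDATETIME(),               -- when this snapshot was taken
    r.session_id,
    r.start_time,                -- when the request itself started
    r.status,
    r.wait_type,
    DB_NAME(r.database_id),
    st.text
FROM sys.dm_exec_requests AS r
OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS st
WHERE r.session_id > 50;         -- skip most system sessions
```

Happy to be told DPA already does this and I'm reinventing the wheel.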
As I'm not a DBA (we don't have one) and the BI team themselves don't know what's running around these times, getting to the root cause of the problem is challenging. Management, as per normal, want answers and are asking me to prove that us not following the storage vendor's best practice is definitely the issue (the obvious answer being: just implement the best-practice guidelines, then see what happens...). We've had these issues for the past couple of months now.
Somewhat frustrated jack of all trades master of nothing ¬.¬