One of the most difficult things storage admins face on a day-to-day basis is the assumption that "it's always the storage's fault." Virtualization admins call you constantly to say there's something wrong with the storage, and application owners tell you their apps are slow because of the storage. It's a never-ending fight to prove that it's not a storage issue, and it wastes a lot of time in the work week.

Why is it always a storage issue? Could it be an application or compute issue? Absolutely. But these teams start pointing fingers because they don't have insight into each other's IT operations management tools. In a lot of environments, an application team can't see the IOPS, latency, and throughput metrics for the storage supporting their application. On the other hand, the storage team can't see application metrics such as paging, TTL, and memory consumption.

 

For example, consider the following scenario:

 

The application team notices that their database is running slow, so what comes to mind? "We'd better call the storage team, as there must be a storage issue." The storage team looks into it, doesn't find anything unusual, and verifies that they haven't made any changes recently. Hours go by, then a couple of days, and they still haven't gotten to the bottom of the issue. Both teams keep pointing fingers, lose trust in each other, and decide they must need more spindles to increase the performance of the application. A couple more days go by, and the virtualization admin comes to the application team and says, "Do you know you've over-allocated memory on your SQL server?"

So what happened here? An exorbitant amount of time was spent troubleshooting the wrong issue. Why? Because each of these teams had no insight into the other teams' operations management tools. This type of scenario is not uncommon and happens more than we would like. The result is a disruption to the business and a lot of wasted time that could have been spent on more valuable activities.

 

The point is, when evaluating operations management tools or processes, you must ensure they are transparent across the infrastructure groups and application teams. Shared visibility shortens time-to-resolution, which reduces the impact on the business.

 

I would love to hear whether other users in the community have run into these types of scenarios and how they have changed their processes to avoid them.