
It's Always The Storage

Level 9

One of the most difficult things storage admins face on a day-to-day basis is that "it's always the storage's fault". You have virtualization admins constantly calling to say there's something wrong with the storage, and application owners telling you their apps are slow because of the storage. It's a never-ending fight to prove that it's not a storage issue, which leads to a lot of wasted time in the work week.

Why is it always a storage issue? Could it possibly be an application or compute issue? Absolutely, but the reason these teams start pointing fingers is that they don't have insight into each other's IT operations management tools. In a lot of environments, the application team doesn't have insight into IOPS, latency, and throughput metrics for the storage supporting their application. On the other hand, the storage team doesn't have insight into application metrics such as paging, TTL, memory consumption, etc.

So, for example, let's look at the scenario below:

The application team starts noticing their database is running slow, so what comes to mind? We better call the storage team, as there must be a storage issue. The storage team looks into it; they don't find anything unusual and they've verified they haven't made any changes recently. So hours go by, then a couple of days go by, and they still haven't gotten to the bottom of the issue. Both teams keep finger-pointing, lose trust in each other, and just decide they must need more spindles to increase the performance of the application. A couple more days go by and the virtualization admin comes to the application team and asks, "Do you know you've over-allocated memory on your SQL server?" So what happened here? An exorbitant amount of time was spent troubleshooting the wrong issue. Why were they troubleshooting the wrong issue? Because each of these teams had no insight into the other teams' operations management tools. This type of scenario is not uncommon and happens more often than we would ever like; it disrupts the business and wastes a lot of time that could have been spent on valuable activities.
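
To make the scenario concrete, here is a minimal sketch of the kind of cross-domain check that shared visibility makes possible: pull the storage team's latency numbers and the virtualization team's memory numbers for the same workload and look at them side by side. The monitoring API base URL, endpoints, and field names below are hypothetical placeholders for illustration only, not any particular product's API.

```python
# Purely illustrative sketch of a cross-domain triage check.
# The API base URL, endpoints, and field names are hypothetical placeholders.
import json
from urllib.request import urlopen

MONITORING_API = "https://monitoring.example.local/api"  # hypothetical endpoint


def fetch(path):
    """Fetch a JSON document from the (hypothetical) monitoring API."""
    with urlopen(f"{MONITORING_API}/{path}") as resp:
        return json.load(resp)


def triage(vm_name, datastore):
    # Storage-side view: average read/write latency for the datastore.
    storage = fetch(f"datastores/{datastore}/latency")  # e.g. {"read_ms": 4.2, "write_ms": 5.1}
    # Compute-side view: configured vs. active/ballooned memory for the VM.
    memory = fetch(f"vms/{vm_name}/memory")  # e.g. {"configured_mb": 65536, "active_mb": 61000, "ballooned_mb": 8192}

    if storage["read_ms"] < 10 and storage["write_ms"] < 10:
        print("Datastore latency looks healthy; the bottleneck is probably not storage.")
    if memory["ballooned_mb"] > 0 or memory["active_mb"] > 0.9 * memory["configured_mb"]:
        print("Guest memory pressure detected; review the allocation before buying spindles.")


if __name__ == "__main__":
    triage("sql-prod-01", "ds-prod-01")
```

Had either team been able to run a check like this on day one, the memory over-allocation would likely have surfaced long before anyone priced out new spindles.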

So the point is, when looking at operations management tools or processes, you must ensure that these tools are transparent across the various infrastructure groups and application teams. By doing this we can improve time-to-resolution, which allows us to reduce the impact to the business.

I would love to hear whether other users in the community have run into these types of scenarios and how they have changed their processes to avoid these issues.

23 Comments
fitzy141
Level 12

We have been trying to get all our support teams, including the App Dev teams, looking at the tools we use. Some have bought into this, but most look at it the typical way: "someone else should be watching them." I have used the ability to add external links in Orion to bring these other tools into the dashboard, at least as a point-and-click start, to get them comfortable with seeing Orion views and their tools within the same Orion view. What makes it easier is how the SolarWinds tools pull the data together into a single pane of glass. I hope to see more of that, especially with DPA; it would be nice to see some if not all of that info in NPM or SAM.

mharvey
Level 17

Fortunately, since I've come on board we've never run into a scenario where someone is blaming the storage as the problem. So far it's all been requests for more storage rather than figuring out a problem. Rather than asking, "Hey, what could be filling up the drive, and is it needed?" it becomes, "Hey, just add more storage to it so we don't have to worry about it."

theflyingwombat
Level 9

Glad to know networking people aren't the only ones having to deal with it always being their fault. Most of what I have seen is teams pointing the finger because it is easier than looking into and troubleshooting their own systems. Any time it happens I do my due diligence and gather as many logs and stats as I can (usually from NPM) to prove the network was not the cause of the issue.

cahunt
Level 17

Well, this is the first I've heard that it's NOT THE NETWORK! So I'll go with it.

This is the very reason you need at least a few of every device in the same sandbox. If you can't even get an idea of what processes and instances are going on from one system to the next as they change layers through the OSI model, you may end up troubleshooting the wrong thing most days. It's an effort for large entities that have silo problems; for some smaller places it may be the cost of monitoring it 'ALL'. Once people start to see the potential, things change.

In the troubleshooting effort, having the correct data is key for finding not only the issue but also where the issue resides. It's tough to work in a world where everyone has their own sandbox that they cannot see past.

mdecima
Level 11

The first thought that crossed my mind was something like this...

"Hello Appstack...

Thanks for coming to make our lives a little easier...

/end of message"

Visibility is the key to unraveling all the mysteries between departments; no more "It's the app! It's the network! It's the storage! The firewall! It's the guy that cleans the toilets!" kind of stuff.

SolarWinds is taking baby steps on this; AppStack isn't even released yet, but it's the tool we need to avoid pointing fingers, nights without sleep, and a good amount of hair loss.

jkump
Level 15

Sometimes it is fortunate to have a test environment that is almost as active as the production environment, where each of the teams reviews and interacts with the others to verify storage, network, etc. activities and requirements.

This is the first time in the 30 years that I have been in IT that I have heard it is something other than the network. I definitely agree that more visibility between teams and environments will help defuse the debate and keep us focused on solving the root causes.

Jfrazier
Level 18

Ah...communication between teams.

Usually when we have a big issue of this sort, the various teams dogpile on it and look at it from various angles, and then we have a meeting to discuss what we have found.

tcbene
Level 11

I would have to agree.  It may end up being a storage problem, but the complaint always comes to the network team first regardless of the real problem.

goodzhere
Level 14

I would have to agree with @tcbene on this one.  I have found in almost all of my past experiences that people always go to the network team first.  Once that team PROVES there is no problem there, then it goes to the virtualization, storage, and server admins.  I have found this to be true even if the problem is in the virtual environment.  But overall, this is not going to go away.  Most won't admit mistakes on their own team and most won't give access to look at "their stuff".  Job security and lack of trust!

mikegrocket
Level 10

As a network guy, my experience has been, "The network is slow!" But it is basically the same thing: if we, as a whole, could have insight into each other's domains, then perhaps time to resolution could be shortened. But that only works if the person viewing the information has a clue about what they are seeing and isn't the proverbial monkey staring at a football.

jay.perry
Level 11

I love that tools like this exist to help troubleshoot database issues. I don't think it is always a storage issue, but storage tends to get blamed many times, and proper planning helps. Great reads from everyone; this always happens.

_stump
Level 12

I think the issue you describe is the result of a flawed organizational design. When you isolate and limit engineers to specific technology resources, you create a competitive environment in which the goal is not necessarily to resolve issues but to avoid blame. It's a human resources problem that sets up IT people for failure.

network_defender
Level 14

I agree with tcbene.  The network team will always hear about it first.

network_defender
Level 14

As much cross-training as possible.  The more each team understands about the network as a whole, the better equipped they are to troubleshoot and isolate problems.

crwchief6
Level 11

Oh yeah, we have had exactly what was described above. Our storage guy has a short fuse, and when programmers call the help desk saying storage is full because they can't save items, or they complain the network is slow, and he sees it has nothing to do with storage, he lights a fuse under his manager about these accusations. Pretty funny when it does happen.

strebeld
Level 9

This scenario is what leads to a lot of inefficiencies in the infrastructure. It's a "throw more resources at it, rather than find the underlying issue" mentality.

strebeld
Level 9

I actually almost wrote the article on "It's always the network", but I've had a lot of conversations lately that were more "it's a storage issue".

strebeld
Level 9

What you stated is the major issue with IT operations today. There has to be a "cultural shift" from above to alleviate most current issues.

strebeld
Level 9

This is absolutely a huge issue in IT operations, as a lot of different groups do not understand the other groups' metrics or baselines.

strebeld
Level 9

You hit the nail on the head... The major hurdle to overcome will be the "cultural shift" that will need to happen in IT operations.

gfsutherland
Level 14

True... understanding the whole environment is the key. It cuts down on the finger pointing and creates a team approach to solving the problem. (not always... but your odds definitely improve).

byrona
Level 21

michael stump I agree with you completely.  Unfortunately, most environments I have seen are set up this way, and the larger the organization, the more likely it seems that isolated silos exist.

jkump
Level 15

Working together instead of in silos is the way I have seen to get past the "It's the network" versus "It's the storage" debate.  Good article.