IT'S NOT ALWAYS THE NETWORK! OR IS IT? Part 3

If you haven't read the earlier posts, here's a chance to catch up on the story so far:

  1. It's Not Always The Network! Or is it? Part 1 -- by John Herbert (jgherbert)
  2. It's Not Always The Network! Or is it? Part 2 -- by John Herbert (jgherbert)

Now that you're up to speed with the chaotic life of the two characters whose jobs we are following, here's the third installment of the story, by Tom Hollingsworth (networkingnerd).

The View From Above: James (CEO)

I got another call about the network today. This time, our accounting department told us that their End of Year closeout was taking much too long. They have one of those expensive systems that scans in a lot of our paperwork and uploads it to the servers. I wasn't sure if the whole thing was going to be worth it, but we managed to pay for it with the savings from no longer renting warehouse space to store huge file boxes full of the old paper records. That's why I agreed to sign off on it.

It worked great last year, but this time around I'm hearing nothing but complaints. This whole process was designed to speed things up and make everyone's job easier. Now I have to deal with the CFO telling me that our reports are going to be late and that the shareholders and the SEC are going to be furious. And I also have to hear comments in the hallways about how the network team still isn't doing their job. I know that Amanda has done a lot recently to help fix things, but if this doesn't get worked out soon the end of the year isn't going to be a good time for anyone.

The View From The Trenches: Amanda (Sr Network Manager)

Fresh off my recent issues with the service provider in Austin, I was hoping the rest of the year was going to go smoothly. Until I got a hotline phone call from James. It seems that the network was to blame for the end of year reporting issues that the accounting department was running into. I knew this was a huge issue after sitting in on the meetings about the records scanning program before I took over the network manager role. The arguments about the cost of that thing made me glad I worked in this department. And now it was my fault the thing wasn't working? Time to get to the bottom of this.

I fired up SolarWinds NPM and started checking the devices that were used by the accounting department. Thankfully, there weren't very many switches to look at. NPM told me that everything was running at peak performance; all the links to the servers were green, as was the connection between the network and the storage arrays. I was sure that any misconfiguration of the network would have shown up as a red flag here and given me my answer, but alas the network wasn't the problem. I could run a report right now to show to James to prove that the network was innocent this time.

I stopped short, though. Proving that it wasn't the network was not the issue; the issue was that the scanning program wasn't working properly. I knew that if it ended up being someone else's bigger issue, they were going to be on the receiving end of one of those conference room conversations that got my predecessor Paul fired. I knew that I had the talent to help get this problem fixed and help someone keep their job before the holidays.

So, if the network wasn't the problem, then what about the storage array? I called one of the storage admins, Mike, and asked him about the performance on the array. Did anything change recently? Was the firmware updated? Or out of date? I went through my standard troubleshooting questions for network problems. The answers didn't fill me with a lot of confidence.

Mike knew his arrays fairly well. He knew what kind they were and how to access their management interfaces. But when I started asking about firmware levels or other questions about the layout of the storage, Mike's answers became less sure. He said he thought maybe some of the other admins were doing something but he didn't know for sure. And he didn't know if there was a way to find out.

As if by magic, the answer appeared in my inbox. SolarWinds emailed me about a free trial of their Storage Resource Monitor (SRM) product. I couldn't believe it! I told Mike about it and asked him if he'd ever tried it. He told me that he had never even heard of it. Given my luck with NPM and keeping the network running, I told Mike we needed to give this a shot.

Mike and I were able to install SRM alongside NPM with no issues. We gave it the addresses of the storage arrays that the accounting data was stored on and let it start collecting information. It only took five minutes before I heard Mike growling on the other end of the phone. He was looking at the same dashboard I was. I asked him what he was seeing and he started explaining things.

It seems that someone had migrated a huge amount of data onto the fast performance storage tier. Mike told me that data should have been sitting around in the near-line tier instead. The data in the fast performance tier was using up resources that the accounting department needed to store their scanned data. Since that data was instead being written to the near-line storage, the performance hit looked like the network was causing the problem when in fact the storage array wasn't working like it should.

I heard Mike cup his hand over the phone receiver and start asking some pointed questions in the background. No one immediately said anything until Mike was able to point out the exact time and date the data was moved into the performance tier. It turns out one of the other departments wanted to get their reports done early this year and talked one of the other storage admins into moving their data into a faster performance tier so their reports would be done quicker. That huge amount of data had caused lots of problems. Now, Mike was informing the admin that the data was going to be moved back ASAP and they were going to call the accounting department and apologize for the delay.

Mike told me that he'd take care of talking to James and telling him it wasn't the network. I thanked him for his work and went on with the rest of my day. Not only was it not the network (again), but we found the real problem with some help from SolarWinds.

I wouldn't have thought anything else about it, but Mike emailed me about a week later with an update. He kept the SRM trial running even after we used it to diagnose the accounting department issue. The capacity planning tool alerted Mike that they were going to run out of storage space on that array in about six more weeks at the rate it was being consumed. Mike had already figured out that he needed to buy another array to migrate data and now he knew he needed a slightly bigger one. He used the tool to plan out the consumption rate for the next two years and was able to convince James to get a bigger array that would have more than enough room. It's time to convert that SRM trial into a purchase, I think; it's great value and I'm sure Mike will be only too happy to pay.

>>> Continue reading this story in Part 4

  • Companies without Change Control processes, and monitoring solutions equal to the need, are spinning their wheels while their competitors advance.  Don't invest your career in a place that isn't providing the right processes, tools, and training budget to ensure the jobs are planned and done correctly.

  • We also just implemented SRM and it's an amazing product.  Storage systems are notorious for being challenging to work with and providing terrible monitoring interfaces.  Having this data in Orion where anybody (not just the storage admins) can access it has been a huge benefit.

  • Absolutely agreed. The funny thing is that in my (possibly biased) experience it is often the network team who pull together the silos that make up a service, and track down the issue. Maybe that's because Mean Time To Innocence (MTTI, as rschroeder mentioned above) seems to be a KPI for the network team, more than any other.

  • As ever, great comments rschroeder.

    You asked "Why wasn't some sort of resource management monitoring tool part of the original package and deployment?" That's an entirely reasonable thing to ask, but I've seen it happen in many places, where there's a management tool in place but it has limited monitoring/performance capabilities, and somebody says "oh, just add it to our usual SNMP polling system" where it doesn't get the attention it needs, the right MIBs loaded, or the right stats analyzed.

    Network management for so many people seems to be an afterthought, and I think it is often seen as a low-value overhead that's first on the chopping block if price is key. Plus there's the general issue that it's yet another system to learn, maintain and manage. Network gets its own system. Virtualization gets its own system. Storage gets its own system. Security gets its own systems (but don't tell anybody else about it, or share any data). And so on... So many silos, so much duplication of effort, and no collaboration across silos. I'll wager this is familiar to a lot of other Thwackers, unfortunately.

  • That assumes of course that this company has a change management process in place. I've been in many companies that don't (much to my horror), and doubly so on the office network side of things (rather than the data center). Not communicating within your own team though...bad show.
