Hey, guys! This week I’d like to share a very recent experience. I was troubleshooting, and the information I was receiving was great, but it was the context that saved the day! What I want to share is similar to the content in my previous post, Root Cause, When You're Neither the Root nor the Cause, but different enough that I thought I'd pass it along.
This tale of woe begins as they all do, with a relatively obscure description of the problem and little foundational evidence. In this particular case it was, “The internet wasn't working on the wireless, but once we rebooted, it worked fine.” How many of us have had to deal with that kind of problem before? Obviously, all answers lead to, “Just reboot and it’ll be fine." While that’s all fine and dandy, it is not acceptable, especially at the enterprise level, because it offers no real solution. Therefore, the digging began.
The first step was to figure out if I could reproduce the problem.
I had heard that it happened with some arbitrary mobile device, so I set up shop with my MacBook, an iPad, my iPhone and my Surface tablet. Once I was all connected, I started streaming content, particularly the live YouTube stream of The Earth From Space. It had mild audio and continuous video streaming that could not buffer much or for long.
The strangest thing happened in this initial wave of troubleshooting. I was able to REPRODUCE THE PROBLEM! That frankly was pretty awesome. I mean, who could ask for more than the ability to reproduce a problem! Though the symptoms were some of the stranger parts, if you want to play along at home, maybe you can try to solve this as I go. Feel free to chime in with something like, “Ha ha! You didn’t know that?" It's okay. I’m all for a resolution.
The weirdest part of this resolution was that for devices connecting on lower wireless bands, 802.11A, 802.11N, things were working like a champ, or seemingly working like a champ. They didn’t skip a beat and were working perfectly fine. I was able to reproduce it best with the MacBook connected at 802.11AC with the highest speeds available. But seemingly, when it would transfer from one APS channel to another AP on another channel, poof, I would lose internet access for five minutes. Later, it was proven to be EXACTLY five minutes (hint).
At the time though, like any problem in need of troubleshooting, there were other issues I needed to resolve because they could have been symptoms of this problem. Support even noted that these symptoms relate to a particular problem that was all fine and dandy when adjusted in the direction I preferred. Alas, they didn’t solve my overwhelming problem of, “Sometimes, I lose the internet for EXACTLY five minutes.” Strange, right?
So, I tuned up channel overlap, modified how frequent devices will roam to a new access point and find their new neighbor, cleaned up how much interference there was in the area, and got it working like a dream. I could walk through zones transferring from AP to AP over and over again, and life seemed like it was going great. But then, poof, it happened again. The problem would resurface, with its signature registering an EXACT five-minute timeout.
This is one of those situations where others might say, “Hey, did you check the logs?” That's the strange part. This problem was not in the logs. This problem transcended mere logs.
It wasn’t until I was having a conversation one day and said, “It’s the weirdest thing. The connection with a full wireless signal, with minimal to no interference and nothing erroneous showing in the logs would just die, for exactly five minutes.” My friend chimed in, “I experienced something similar once at an industrial yard. The problem would surface when transferring from one closet-stack to another closet-stack, and the tables for Mac Refresh were set to five minutes. You could shorten the Mac Refresh timeout, or simply tunnel these particular connections back to the controller."
That prompted an A-ha moment (not the band) and I realized, "OMG! That is exactly it." And it made sense. In the earlier phases of troubleshooting, I had noted that this was a condition of the problem occurring, but I had not put all of my stock in that because I had other things to resolve that seemed out of place. It’s not like I didn’t lean on first instincts, but it’s like when there’s a leak in a flooded basement. You see the flooding and tackle that because it’s a huge issue. THEN you start cleaning up the leak because the leak is easily a hidden signal within the noise.
In the end, not only did I take care of the major flooding damage, but I also took care of the leaks. It felt like a good day!
What makes this story particularly helpful is that not all answers are to be found within an organization and their tribal knowledge. Sometimes you need to run ideas past others, engineers within the same industry, and even people outside the industry. I can’t tell you the number of times I've talked through some arbitrary PBX problem with family members. Just talking about it out loud and explaining why I did certain things caused the resolution to suddenly jump to the surface.
What about you guys? Do you have any stories of woe, sacrifice, or success that made you reach deep within yourself to find an answer? Have you had the experience of answers bubbling to the surface while talking with others? Maybe you have other issues to share, or cat photos to share. That would be cool, too.
I look forward to reading your stories!