The story so far:
- It's Not Always The Network! Or is it? Part 1 -- by John Herbert (jgherbert)
- It's Not Always The Network! Or is it? Part 2 -- by John Herbert (jgherbert)
- It's Not Always The Network! Or is it? Part 3 -- by Tom Hollingsworth (networkingnerd)
- It's Not Always The Network! Or is it? Part 4 -- by Tom Hollingsworth (networkingnerd)
Easter is upon the team before they know it, and they're being pushed to make a major software change. Here's the fifth installment, by John Herbert (jgherbert).
The View From Above: James (CEO)
Earlier this week we pushed a major new release of our supply chain management (SCM) platform into production internally. The old version simply didn't have the ability to track and manage our inventory flows and vendor orders as efficiently as we wanted, and as a consequence we've missed completing a few large orders in their entirety because we were waiting for critical components to be delivered. Despite the importance of this upgrade to our reputation for on-time delivery (not to mention all the other cost savings and cashflow benefits we can achieve by managing our inventory on a near real-time basis), the CTO has been putting this off for months because the IT teams have been holding back on giving the OK. Finally, the Board of Directors had had enough of the CTO's pushback, and as a group we agreed that there had been more than enough time for testing. The directive was issued that unless there were documented faults or errors in the system, IT should proceed with the new software deployment within the month.
We chose to deploy the software over the Easter weekend. That's usually a quieter time for our manufacturing facilities, as many of our customers close down for the week leading up to Easter. I heard grumbling from the employees about having to work on Easter, but there's no way around it. The software has to launch, and we have to do whatever we need to do to make that happen, even if that means missing the Easter Bunny.
The deployment appeared to go smoothly, and the CTO was pleased to report to the Board on Monday morning that the supply chain platform had been upgraded successfully over the weekend. He reported that testing had been carried out from every location, and every department had provided personnel to test their top 10 or so most common activities after the upgrade so that we would know immediately if a mission-critical problem had arisen. Thankfully, every test passed with flying colors, and the software upgrade was deemed a success. And so it was, until Tuesday morning when we started seeing some unexplained performance issues, and things seemed to be getting worse as the day progressed.
The CTO reported that he had put together a "tiger team" to start troubleshooting, and opened an ongoing outage bridge. This had the Board's eyes on it, and he couldn't fail now. I asked him to make sure Amanda was on that team; she has provided some good wins for us recently, and her insight might just make the difference. I certainly hope so.
The View From The Trenches: Amanda (Sr Network Manager)
With big network changes I've always had a rule for myself that just because the change window has finished successfully, it doesn't mean the change was a success, regardless of what testing we might have done. I tend to wait a period of time before officially calling the change a success, all the while crossing my fingers for no big issues to arise. Some might call that paranoia, and perhaps they are right, but it's a technique that has kept me out of trouble over time. This week has provided another case study for why my rule has a place when we make more complex changes.
Obviously I knew about the change over the Easter weekend; I had the pleasure of being in the office watching over the network while the changes took place. Solarwinds NPM made that pretty simple for me; no red means a quiet time, and since there were no specific reports of issues, I really had nothing to do. On Monday the network looked just fine as well (not that anybody was asking), but by Tuesday afternoon it was clear that there were problems with the new software, and the CTO pulled me into a war room where a group of us were tasked with finding the cause of the performance issues being reported with the new application.
There didn't seem to be a very clear pattern to the performance issues, and reports were coming in from across the company. On that basis we agreed to eliminate the wide area network (WAN) from our investigations, except at the common points, e.g. the WAN ingress to our main data center. The server team was convinced it had to be a network performance issue, but when I got them to do some ping tests from the application servers to various components of the application and the data center, responses were coming back in 1 or 2ms. NPM also still showed the network as clean and green, but experience has taught me not to dismiss any potential cause until we can disprove it by finding what the actual problem is, so I shared that information cautiously but left the door open for it to still be a network issue that simply wasn't showing in these tests.
One of the server team suggested perhaps it was an MTU issue. A good idea, but when we issued some pings with large payloads to match the MTU of the server interface, everything worked fine. MTU was never really a likely cause--if we had MTU issues, you'd have expected the storage to fail early on--but there's no harm in quickly eliminating it, and that's what we were able to do. We double checked interface counters looking for drops and errors in case we had missed something in the monitoring, but those were looking clean too. We looked at the storage arrays themselves as a possible cause, but checking Solarwinds Storage Resource Monitor we confirmed that there were no active alerts, there were no storage objects indicating performance issues like high latency, and there were no capacity issues, thanks to Mike using the capacity planning tool when he bought this new array!
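For anyone repeating the large-payload ping test, the one fiddly part is the arithmetic: the payload size you ask for excludes the IP and ICMP headers, so it has to be smaller than the interface MTU. The sketch below is only an illustration, not the exact commands we ran. It assumes IPv4 with no IP options (a 20-byte header plus an 8-byte ICMP header) and the Linux iputils `ping` flags (`-M do` to forbid fragmentation, `-s` for payload size, `-c` for count). The function names and the host address are hypothetical.

```python
IPV4_HEADER_BYTES = 20  # assumption: IPv4 header with no options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def icmp_payload_for_mtu(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    overhead = IPV4_HEADER_BYTES + ICMP_HEADER_BYTES
    if mtu <= overhead:
        raise ValueError(f"MTU {mtu} leaves no room for a payload")
    return mtu - overhead


def ping_command(host: str, mtu: int, count: int = 3) -> list[str]:
    """Build a Linux iputils ping command that probes the given MTU.

    -M do  sets the don't-fragment behaviour, so an undersized path MTU
           causes an error instead of silent fragmentation.
    -s     sets the ICMP payload size in bytes.
    -c     sets the number of echo requests to send.
    """
    return [
        "ping", "-M", "do",
        "-s", str(icmp_payload_for_mtu(mtu)),
        "-c", str(count),
        host,
    ]


if __name__ == "__main__":
    # Hypothetical server address; 9000 is a common jumbo-frame MTU.
    print(" ".join(ping_command("10.0.0.5", 9000)))
```

If a ping built this way succeeds end to end, the path carries full-size frames without fragmentation; if it fails while a smaller payload works, something along the path has a lower MTU.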
We asked the supply chain software support expert about the software's dependencies. He identified the key dependencies as the servers the application ran on, the NFS mounts to the storage arrays and the database servers. We didn't know about the database servers, so we pulled in a database admin and began grilling him. We discovered pretty quickly that he was out of his depth. The new software had required a shift from Microsoft SQL Server to an Oracle database. This was the first Oracle instance the DB team had ever stood up, and while they were very competent monitoring and administering SQL Server, the admin admitted somewhat sheepishly that he really wasn't that comfortable with Oracle yet, and had no idea how to see if it was the cause of our problems. This training and support issue is something we'll need to work on later, but what we needed right then and there was some expertise to help us look into Oracle performance. I was already heading to the Solarwinds website because I remembered that there was a database tool, and I was hopeful that it would do what we needed.
I checked the page for Solarwinds' Database Performance Analyzer (DPA), and it said:
"Response Time Analysis shows you exactly what needs fixing - whether you are a database expert or not."
That sounded perfect given our lack of Oracle expertise, so I downloaded it and began the installation process. It wasn't long before I had DPA monitoring our Oracle database transactions (checking them every second!) and starting to populate data and statistics. Within an hour it became clear what the problem was: DPA identified that the main cause of the performance problems was database updates, where entire tables were being locked rather than using a more granular lock, like row-level locking. Update queries were being forced to wait while the previous query executed and released the lock on the table, and the latency in response was having a knock-on effect on the entire application. We had not noticed this at the weekend because the transaction loads were so low outside normal business hours that this problem didn't raise its head.
But why didn't this happen on Monday? On a hunch I dug into NPM and looked at the network throughput for the application servers. As I had suspected, the Monday after Easter showed the servers handling about half the traffic that hit them on the Tuesday. At a guess, a lot of people took a 4-day weekend, and when they returned to work on Tuesday, that tipped the scales on the locking/blocking issue.
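To see why the same locking behavior can be invisible at low load and painful at high load, here is a toy model. It is a sketch under simplifying assumptions, not a description of how Oracle or DPA works: every update takes the same fixed time, each update touches a different row, and a lock is granted in arrival order. The function names and the numbers are made up for illustration.

```python
from collections import defaultdict


def total_wait(updates, service_time, row_level):
    """Total time updates spend blocked waiting for a lock.

    updates:      list of (arrival_time, row_id), sorted by arrival_time
    service_time: how long each update holds its lock
    row_level:    True  -> one lock per row
                  False -> a single lock for the whole table
    """
    free_at = defaultdict(int)  # lock key -> time that lock is released
    waited = 0
    for arrival, row_id in updates:
        key = row_id if row_level else "table"
        start = max(arrival, free_at[key])  # block until the lock is free
        waited += start - arrival
        free_at[key] = start + service_time
    return waited


def workload(n, interval):
    """n updates, one every `interval` time units, each on a different row."""
    return [(i * interval, i) for i in range(n)]


if __name__ == "__main__":
    quiet = workload(4, interval=20)  # updates arrive slower than they finish
    busy = workload(4, interval=5)    # updates arrive faster than they finish
    for name, load in (("quiet", quiet), ("busy", busy)):
        table = total_wait(load, service_time=10, row_level=False)
        rows = total_wait(load, service_time=10, row_level=True)
        print(f"{name}: table lock waits {table}, row locks wait {rows}")
```

In this model the quiet workload never waits under either scheme, which matches what we saw over the weekend. Once updates arrive faster than they complete, the table lock makes each one queue behind the last, while row-level locks on unrelated rows never block each other.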
While we discussed this discovery, our supply chain software expert had been tapping away on his laptop. "You're not going to believe this," he said. "It turns out we are not the first people to find this problem. The vendor says that they posted a HotFix for the query code about a week after this release came out, but I just checked, and we definitely do not have that HotFix installed. I don't know how we missed that, but we can get it installed overnight while things are quiet, and maybe we'll get lucky." I checked my watch; I couldn't believe it was 7.30PM already. We really couldn't get much more done that night anyway, so we agreed to meet at 9AM and monitor the results of the application of the HotFix.
The next morning we met as planned, and watched nervously as the load ramped up as each time zone came on line. By 1PM we had hit a peak load exceeding Tuesday's peak, and not a single complaint had come in. Solarwinds DPA now indicated that the blocking issue had been resolved, and there were no other major alerts to deal with. Another bullet dodged, though this one was a little close for comfort. We prepared a presentation for the Board explaining the issues (though we tried not to throw the software expert under the bus for missing the HotFix), and presented a list of lessons learned / actions, which included:
- Set up a proactive post-change war-room for major changes
- Monitor results daily for at least one week for changes to key business applications
- Provide urgent Oracle training for the database team (the accelerated schedule driven by the Board meant this did not happen in time)
- Configure DPA to monitor our SQL Server installations too
We wanted to add another bullet saying "Don't be bullied by the Board of Directors into doing something when we know we aren't ready yet", but maybe that's a message best left for the Board to mull over for itself. OK, we aren't perfect, but we can get better each time we make mistakes, so long as we're honest with ourselves about what went wrong.