My IT Life in Monitoring

As a product manager in the monitoring software industry, my job is understandably focused on the future of monitoring and observability. It’s a never-ending battle to be one step ahead of what IT professionals will want and need in their day-to-day lives and then build it for them. Though thinking about the future is always exciting, I recently found myself reflecting on the past. I’ve spent nearly 25 years in IT, and monitoring has been there every step of the way, albeit in many different forms. Below is a look back at some of the tools and technologies I encountered along the way as well as some of the inherent pinnacles and pitfalls of how those tools were implemented.

As I was thinking about my past, I realized my journey through the realms of monitoring is probably not too different from your own. The request to do more with less is always a struggle for the IT professional because we’re bound by the old adage: “Good, fast, cheap. Choose two.”

External Tech Support/Internal IT for a Software Startup

TL;DR: No monitoring software, no budget, no desire, and no clue 

My first official “tech” job was working for a small software startup as a technical support rep. Though my role on paper was defined as technical support for our external customers, I was also the de facto internal IT. Under “other duties as assigned,” I was responsible for monitoring our modest infrastructure. The only problem with this last part was we had no budget for monitoring tools, and (quite frankly) at this point in my career, I had no idea such a thing existed. I was the monitoring solution.

Physical inspection of gear (green blinky lights good, red blinky lights bad), jumping from terminal to terminal with a KVM, and manually checking services took up a large chunk of my day. The work was mind-numbing, but I dutifully and cheerfully plodded along because a) it was part of what they were paying me to do and b) this being my first job in IT, I didn’t know any better.

This arrangement had several drawbacks for the organization—not the least of which was me being a single point of failure. If I had a day off or it was a holiday, our “monitoring solution” was offline. This is an excellent example of the “bus factor.” IT professionals frequently talk about reducing the number of failure domains, but at the time, this concept was lost on me. We were all young once.

The company viewed this as an acceptable risk, and I got to continue to hone my skills. Looking back, I think it came down to a few factors. First, the cost of downtime was relatively low, and we had backup options (phone and fax processing). Second, and probably more important, the actual dollar cost of any solution would have outweighed the benefit.

It’s hard to even comprehend this decision in light of today’s fast-paced, instant-gratification world, but as an example of the low urgency given to downtime, if our online order system was unavailable, most of our customers would simply call in or send a fax asking to enter an order directly. Someone would write down the details and push it through the system later when it came back online.

After a couple of years of this, I began to suspect we were doing things the wrong way and started researching better approaches. Then the company was acquired, and I made the decision to take my skills elsewhere.

Result: Good [ ], Fast [?], Cheap [X]

Network Admin for ISP Startup

TL;DR: Existing open-source solution requires significant care and feeding 

The next step on my career trajectory was also a small startup. This one was focused on the delivery of network services to hard-to-reach spots on the planet via satellite-based internet. An internet service provider (ISP) in the stars! As you can probably guess, being an ISP meant the cost of downtime was considerably more impactful than it was for my previous employer, and they had already implemented some monitoring. However, it was thrown together quickly using a couple of popular open-source packages.

Early on, my boss sheepishly admitted the monitoring setup wasn’t optimal and needed some significant love. Needless to say, I was overjoyed! The solution I long suspected to exist did, in fact, exist. Bonus: it was now part of my job to take ownership of it and make it better. I did what all new employees do and dove in headfirst. I started by cleaning up the configurations, adding devices missed during customer setups, upgrading the software to the latest stable builds, and educating myself on the nuts and bolts of the solution. This last bit (including customizations) was where things went slightly askew.

Because the package we were using was open source, I could do practically anything with it if I was willing to devote the time and resources. Luckily for my employer, I was willing to do it. I got so deep at times that some of my coworkers joked about my true employer. Did I work for our company or for this open-source project? Truthfully, some days it was hard to tell.

Balancing time between customizing with endless possibilities and all the other myriad tasks on my plate was a challenge. We were always operating on a shoestring budget. Purchasing a costly commercial package and paying consultants to customize it wasn’t in the cards. I learned a lot and gained some valuable experience there, but as is always the case, it was eventually time to move on.

Result: Good [?], Fast [ ], Cheap [X]

Systems Admin/Software Developer for the Manufacturing Industry

TL;DR: Commercial solution on a limited budget can’t cover everything 

My next position was interesting in many ways. I was responsible not only for the general IT infrastructure and helping end users but also for maintaining and writing a large amount of VB code used to integrate graphical applications running on Windows with data and services on a mainframe. The mainframe was where the enterprise resource planning (ERP) solution and other business-critical applications lived. Another interesting deviation from previous jobs was that this company’s main product wasn’t technology. The company was focused on manufacturing, and IT was simply part of how they ran their business. They ran a breakneck 24/7 operation where a hiccup in the IT infrastructure could cost tens of thousands of dollars by putting orders behind schedule.

This meant a couple of things from a monitoring perspective. I was responsible for monitoring the services and solutions driving the company and for coding and presenting the services themselves. I was a mixture of developer and operations. Sound familiar? This meant I was monitoring myself, and this increased the stakes.

Since this company was a) well-established, b) stable, and c) easily able to correlate downtime to dollars lost, they weren’t as interested in open-source solutions as the people at my previous company. Instead, they favored commercial products from well-known and established entities. This was great news, and I had to wonder when the other shoe would drop. The other shoe took the form of budget.

My budget for monitoring was modest compared to the entire infrastructure. This meant I could only afford to monitor the most critical and visible aspects of the environment with my commercial solution. Leaning on my previous experience, I brought in an open-source solution to cover the remainder of the environment. This doesn’t mean these secondary systems were unimportant. Quite the contrary; they just weren’t the most important.

Thankfully, most of the stuff covered by the open-source solution was “standard,” and I avoided the need to do so much time-intensive coding and customization of the monitoring package. I did, however, spend some time cobbling together the data from both solutions so uptime and other key performance indicators (KPIs) about our performance as an IT org were easy to read for the executives and others who cared about such things. The biggest lesson I learned at this company is that one size does not fit all.
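
For illustration, here is a minimal Python sketch of that kind of glue work. It assumes each tool can export per-system downtime minutes for a reporting period; the record fields, system names, and figures are all hypothetical and not taken from any real product.

```python
from dataclasses import dataclass

MINUTES_PER_DAY = 24 * 60


@dataclass
class DowntimeRecord:
    system: str              # hypothetical system name
    source: str              # which monitoring tool reported the outage
    downtime_minutes: float  # total downtime in the reporting period


def uptime_percent(downtime_minutes: float, period_days: int) -> float:
    """Return uptime as a percentage of the reporting period."""
    total = period_days * MINUTES_PER_DAY
    return 100.0 * (total - downtime_minutes) / total


def combined_report(records, period_days):
    """Merge records from every source into one uptime figure per system."""
    downtime = {}
    for rec in records:
        downtime[rec.system] = downtime.get(rec.system, 0.0) + rec.downtime_minutes
    return {
        name: round(uptime_percent(minutes, period_days), 3)
        for name, minutes in sorted(downtime.items())
    }


# Illustrative data only: one export per tool.
commercial = [DowntimeRecord("erp-mainframe", "commercial", 43.2)]
open_source = [DowntimeRecord("file-server", "open-source", 432.0)]

report = combined_report(commercial + open_source, period_days=30)
for system, pct in report.items():
    print(f"{system}: {pct}% uptime")
```

Keying the report on the system name rather than on the tool is the design choice that matters here: the executives reading it care about what was down, not which product noticed.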

Result: Good [X], Fast [?], Cheap [ ]

Network Admin in the Financial Industry

TL;DR: Existing commercial product doesn’t keep pace with change in the environment

Next on my adventure, I found myself working IT in the financial industry, where the infrastructure was my primary focus. I was, for the first time in quite a while, not expected to do any coding. I did a bit of scripting here or there, but “developer” wasn’t part of my defined role. This environment was one where caution and carefully thought-out actions in concert with high expectations of stability were always at the fore. It made sense, really. When you’re dealing with someone else’s money (or really your own, for that matter) you’d better be sure of what you’re doing.

In the world of monitoring, this led to using well-known and commercially supported tools. An incumbent solution was already in place and had staff to support it. It turns out the team that originally implemented the solution was incredibly defensive whenever it was questioned. It was a good solution in general, but it was perhaps showing its age, and releases from the vendor (even service and hotfix releases) were few and far between. To a certain extent, you could argue this was a good thing. It meant the product was stable and “just worked.” On the other hand, it meant that as certain parts of the environment matured and migrated to newer solutions and technology, the monitoring software sometimes struggled to deal with them or simply had no mechanism to speak the same language as the newly implemented systems. Couple this with the lack of flexibility in the solution, and its undoing was already in the making.

We started a lengthy process to evaluate new solutions, but this was an uphill battle for many of the reasons above. Folks were used to the existing solution and weren’t keen to learn something new, and they thought anything new would have a hefty price tag. I’ve seen this time and time again with beloved solutions/vendors and ardent supporters who want to “protect their fiefdom.”

At this company, open source was a relative nonstarter due to policy. Right when the tough discussions were beginning, I found a new opportunity and decided to take it. My biggest regret is not knowing the solution they chose and the reasons behind it.

Result: Good [?], Fast [X], Cheap [ ]

Systems Engineer in the Medical Industry

TL;DR: Fast-paced environment with multiple mergers and acquisitions leads to tool overlap 

My last stop in the operational IT world before landing here at SolarWinds was the largest environment to date. Size and scope bring their own challenges, but this was just the start. It turns out the company grew to this large size through aggressive expansion and aggressive acquisition. If you’ve ever dealt with a single merger/acquisition, you know that, along with the skills of new coworkers, the company also takes possession of their toolbox. This company had gone through not just one but many mergers, which exacerbated the toolbox nightmare while increasing the responsibility of the IT team.

Our problem? Tools as far as the eye could see. As the far-flung and previously independently operated pieces slowly merge, there are seemingly endless overlaps of the technology in place. Depending on the size of the acquired company, they could have any level of potential investment in monitoring. Each team is confident their solution is the best, and without a clear mandate from leadership, infighting can (and frequently does) begin.

Depending on the source company’s investment in a monitoring solution (and I’ve seen ranges from “none whatsoever” to “entire teams and budgets”), it can make for an interesting path. It was an exciting time, and I was exposed to all manner of solutions, some of which kept me up at night, while leadership phased out certain tools as teams adopted those from other groups.

Though some openly welcomed the increased vigilance of their environment, others fought tooth and nail to keep their own pieces in place, rejecting the corporate standard for any number of reasons (both good and bad). As you can probably tell, this turned into a “managing people and their expectations” project more than a pure technology project. Somewhere along the way, SolarWinds came a-calling, and I exited stage left gracefully—but I’m certain the “show” went on without me.

Result: Good [?], Fast [ ], Cheap [ ]

Lessons from My Journey

As much as I’d like to hope the problems of the past are relegated to the past, I’m confident these scenarios are still playing out to this day all around the world. It’s the company’s right to prioritize or deprioritize monitoring as needed based on their situation, but if I’ve learned anything in the intervening years, it’s twofold. First, monitoring is essential to a healthy IT organization and therefore essential to a healthy company. Second, there’s no silver bullet capable of fixing everything. This is where we rely on the IT professionals to help fill those gaps, guide the decision-making, and devote themselves to the craft they love.

IT pros of the world will continue to work their magic to make whatever tools they’re handed rise to the occasion, and I’ll continue to try to build some of those tools to make their lives better.

I’m interested to know if you’ve had a similar story or if you’ve encountered completely different situations as you’ve advanced along your own career path. Please feel free to share your thoughts in the comments, and I look forward to reading and absorbing it all.

Anonymous