A few years ago I was working on a project as a project manager and architect when a developer came up to me and said, "You need to denormalize these tables…" and he handed me a list of about 10 tables that he wanted collapsed into one big table. When I asked him why, he explained that his query was taking four minutes to run because the database was "overnormalized." Our database was small: our largest table had only 40,000 rows. His query was pulling from a lot of tables, but it was only pulling back data on one transaction. I couldn't even think of a way to write a query to do that and force it to take four minutes. I still can't.
I asked him to show me the data he had to show me the duration of his query against the database. He explained that he didn't have data, he had just timed his application from button push to results showing up on the screen. He believed that because there could be nothing wrong with his code, then it just *had* to be the database that was causing his problem.
I ran his query against the database, and the results set came back in just a few milliseconds. No change to the database was going to make his four-minute query run faster. I told him to go find the cause that was happening between the database and the application. It wasn't my problem.
He eventually discovered that the issue was a complex one involving duplicate IP addresses and other network configuration issues in the development lab.
Looking back on that interaction, I realize that this is how most of us in IT work: someone brings us a problem, ("the system is slow"), we look into our tools and our data and make a yes-or-no answer about whether we caused it. If we can't find a problem, we close the ticket or send the problem over to another IT group. If we are in the database group, we send it over to the network or storage guys. If they get the report, they send it over to us. These sort of silo-based responses take longer to resolve, often lead to a lot of chasing down and re-blaming. It costs time and money because we aren't responding as a team, just a loose collection of groups.
Why does this happen?
The main reason we do this is because typically we don't have insights into anyone else's systems' data and metrics. And even if we did, we wouldn't understand it. Then we throw in the fact that most teams have their own set of specialized tools and that we don't have access to. I had no access to network monitoring tools nor permissions to run any. It wasn't my job.
We are typically measured and rewarded based on working within our own groups, be it systems, storage, or networks, not on troubleshooting issues with other parts of infrastructure. It's like we build giant walls around our "stuff" and hope that someone else knows how to navigate around them. This "not my problem' response to complex systems issues doesn't help anyone.
What if it didn't have to be that way?
Another contributing factor is the intense complexity of the architecture of modern application systems. There are more options, more metadata, more metrics, more interfaces, more layers, more options than ever before. In the past, we attempted to build one giant tool to manage them all. What if we could still use specialty tools to monitor and manage all our components *and* pull the graph of resources and their data in one place so that we could analyze and diagnose issues using a common and sharable way?
True collaboration requires data that is:
- Traceable across teams and groups
That's exactly what SolarWinds' PerfStack does. PerfStack builds upon the Orion Platform to help IT pros troubleshoot problems in one place, using a common interface, to help cross-platform teams figure out where a bottleneck is, what is causing it and get on to fixing it.
PerfStack combines metrics you choose from across tools like Network Performance Monitor Release Candidate @network and Server & Applications Monitor Release Candidate from the Orion Platform into one easy-to-consume data visualization, matching them up by time. You can see in the figure above how it's easy to spot a correlated data point that is likely the cause of less-than-spectacular performance your work normally delivers. PerfStack allows you to highlight exactly the data you want to see, ignore the parts that aren't relevant, and get right to the outliers.
As a data professional, I'm biased, but I believe that data is the key to successful collaboration in managing complex systems. We can't manage by "feelings," and we can't manage by looking at silo-ed data. With PerfStack, we have an analytics system, with data visualizations, to help us get to the cause faster, with less pain-and-blame. This makes us all look better to the business. They become more confident in us because, as one CEO told me, "you all look like you know what you are doing." That helped when we went to ask for more resources
Do you have a story?
Later in this series, I'll be writing about the nature of collaboration and how you can benefit from shared data and analytics in delivering better and more confidence-instilling results to your organization. Meanwhile, do you have any stories of being sent on a chase to find the cause of a problem? Do you have any great stories of bizarre causes you've found to a systems issue?