
Better Metrics. Better Data. Better Analytics. Better IT.

Level 12

A few years ago I was working on a project as a project manager and architect when a developer came up to me and said, "You need to denormalize these tables…" and he handed me a list of about 10 tables that he wanted collapsed into one big table. When I asked him why, he explained that his query was taking four minutes to run because the database was "overnormalized." Our database was small: our largest table had only 40,000 rows. His query was pulling from a lot of tables, but it was only pulling back data on one transaction.  I couldn't even think of a way to write a query to do that and force it to take four minutes. I still can't.

I asked him to show me the data he had on how long his query took to run against the database. He explained that he didn't have data; he had just timed his application from button push to results showing up on the screen. He believed that because there could be nothing wrong with his code, it just *had* to be the database that was causing his problem.

I ran his query against the database, and the result set came back in just a few milliseconds. No change to the database was going to make his four-minute query run faster. I told him to go find the cause somewhere between the database and the application. It wasn't my problem.
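Here's a minimal sketch of the measurement that would have settled the argument on the spot: time the query by itself, then time the whole button-push path, and compare. Everything in it is an assumption for illustration. The `txn` table, the `handle_button_push` function, and the 50 ms of simulated overhead are made up, and an in-memory SQLite database stands in for the real one.

```python
import sqlite3
import time


def timed(fn):
    """Run fn() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def build_demo_db(rows=40_000):
    """Create an in-memory table about the size of the largest table in the story."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE txn (id INTEGER PRIMARY KEY, amount REAL)")
    conn.executemany(
        "INSERT INTO txn (id, amount) VALUES (?, ?)",
        ((i, i * 1.5) for i in range(rows)),
    )
    conn.commit()
    return conn


def fetch_transaction(conn, txn_id):
    """The database-only step: look up a single transaction."""
    return conn.execute(
        "SELECT id, amount FROM txn WHERE id = ?", (txn_id,)
    ).fetchone()


def handle_button_push(conn, txn_id, simulated_overhead=0.05):
    """Hypothetical end-to-end path: non-database delay, then the query."""
    time.sleep(simulated_overhead)  # stands in for network or application delay
    return fetch_transaction(conn, txn_id)


conn = build_demo_db()
row, db_seconds = timed(lambda: fetch_transaction(conn, 123))
_, total_seconds = timed(lambda: handle_button_push(conn, 123))

print(f"query only: {db_seconds * 1000:.2f} ms")
print(f"end to end: {total_seconds * 1000:.2f} ms")
```

If the query-only number is tiny and the end-to-end number is not, the time is going somewhere other than the database, and no amount of denormalizing will get it back.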


He eventually discovered that the issue was a complex one involving duplicate IP addresses and other network configuration issues in the development lab.

Looking back on that interaction, I realize that this is how most of us in IT work: someone brings us a problem ("the system is slow"), and we look into our tools and our data and give a yes-or-no answer about whether we caused it. If we can't find a problem, we close the ticket or send the problem over to another IT group. If we are in the database group, we send it over to the network or storage guys. If they get the report, they send it over to us. This sort of silo-based response takes longer to resolve the issue and often leads to a lot of chasing down and re-blaming. It costs time and money because we aren't responding as a team, just as a loose collection of groups.


Why does this happen?


The main reason we do this is that we typically don't have insight into anyone else's systems' data and metrics. And even if we did, we wouldn't understand it. On top of that, most teams have their own set of specialized tools that we don't have access to. I had no access to network monitoring tools nor permissions to run any.  It wasn't my job.

We are typically measured and rewarded based on working within our own groups, be it systems, storage, or networks, not on troubleshooting issues with other parts of infrastructure.  It's like we build giant walls around our "stuff" and hope that someone else knows how to navigate around them. This "not my problem" response to complex systems issues doesn't help anyone.

What if it didn't have to be that way?

Another contributing factor is the intense complexity of the architecture of modern application systems. There are more options, more metadata, more metrics, more interfaces, and more layers than ever before. In the past, we attempted to build one giant tool to manage them all. What if we could still use specialty tools to monitor and manage all our components *and* pull the graph of resources and their data into one place so that we could analyze and diagnose issues in a common, shareable way?

True collaboration requires data that is:

  • Integrated
  • Visualized
  • Correlated
  • Traceable across teams and groups
  • Understandable

That's exactly what SolarWinds' PerfStack does. PerfStack builds upon the Orion Platform to help IT pros troubleshoot problems in one place, using a common interface, so that cross-platform teams can figure out where a bottleneck is and what is causing it, and get on to fixing it.

PerfstackScreen.png

From <https://thwack.solarwinds.com/community/solarwinds-community/product-blog/blog>

PerfStack combines metrics you choose from across tools like Network Performance Monitor Release Candidate and Server & Applications Monitor Release Candidate from the Orion Platform into one easy-to-consume data visualization, matching them up by time. You can see in the figure above how easy it is to spot a correlated data point that is likely the cause of performance that is less spectacular than what your work normally delivers. PerfStack allows you to highlight exactly the data you want to see, ignore the parts that aren't relevant, and get right to the outliers.
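To make "matching them up by time" concrete, here's a rough sketch of the idea in plain Python. This is not how PerfStack is implemented, and it doesn't use any SolarWinds API. The metric names and sample values are invented for illustration: each series is keyed by a minute number, the series are aligned on the timestamps they share, and a Pearson correlation shows which pairs move together.

```python
from math import sqrt

# Hypothetical samples keyed by minute; these are made-up numbers, not real data.
db_wait_ms = {0: 5, 1: 6, 2: 5, 3: 40, 4: 42, 5: 6}
net_latency_ms = {0: 1, 1: 1, 2: 2, 3: 20, 4: 22, 5: 1, 6: 1}
cpu_percent = {0: 31, 1: 30, 2: 31, 3: 30, 4: 31, 5: 30}


def align(a, b):
    """Keep only the timestamps present in both series, in time order."""
    common = sorted(a.keys() & b.keys())
    return [a[t] for t in common], [b[t] for t in common]


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is flat."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)


r_network = pearson(*align(db_wait_ms, net_latency_ms))
r_cpu = pearson(*align(db_wait_ms, cpu_percent))

print(f"db wait vs. network latency: r = {r_network:.2f}")
print(f"db wait vs. CPU:             r = {r_cpu:.2f}")
```

In this invented data, the database wait spikes line up with the network latency spikes and not with CPU, which is the kind of signal that tells you which team to go talk to first. Correlation is only a pointer to where to look, not proof of cause.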

As a data professional, I'm biased, but I believe that data is the key to successful collaboration in managing complex systems. We can't manage by "feelings," and we can't manage by looking at siloed data. With PerfStack, we have an analytics system, with data visualizations, to help us get to the cause faster, with less pain-and-blame. This makes us all look better to the business. They become more confident in us because, as one CEO told me, "you all look like you know what you are doing." That helped when we went to ask for more resources.

Do you have a story?

Later in this series, I'll be writing about the nature of collaboration and how you can benefit from shared data and analytics in delivering better and more confidence-instilling results to your organization. Meanwhile, do you have any stories of being sent on a chase to find the cause of a problem?  Do you have any great stories of bizarre causes you've found to a systems issue?

20 Comments
vinay.by
Level 16

rschroeder
Level 21

A top IBM server hardware guru came to me, once upon a time, with a complaint that the network was slow.  He said "I can ping from one of my IBM interfaces to another address on that same box, and the latency is 20 milliseconds.  The same thing happens when I ping to outside systems.  You have a network problem."

I smiled and asked him to rethink what he'd just said.  I then tracked his source & destination IP ARP info down through the network to verify his belief, and yes, both MAC addresses were coming from the same IBM Big Iron server NIC and attached to the same switchport.

If his system is pinging itself through its own internal NIC, it never touches the L3 routing solution that NIC is plugged into.  It's not passing subnet/VLAN boundaries.  In fact, it's never even leaving his NIC (I proved this with a packet capture of that port).  The problem was within his IBM hardware, and it was extending that problem into the external network when he'd ping things outside his box.

He got a confused look, asked for more information--he believed that since the problem manifested itself when pinging to other systems, the issue couldn't be his box--it proved the issue was the network.  I explained further, as kindly and as gently as I could.  He eventually verified it was an IBM internal hardware or driver issue, not a network issue.

Sometimes the answer is right in front of us, but the problem can be that we're too advanced, that we've recently been working on very complex issues, which may lead us to think in terms of complex troubleshooting and complex answers.  It's a case of when you have a hammer, every problem must be a nail.

I've gone down blind alleys many times myself.  Troubleshooting VRRP or OTV or BGP, then someone comes to me with a network issue and I assume they've already done their basic troubleshooting.  After too long I'll have eliminated everything I can think of, and then I remember to ask if they've verified power and link, if they've rebooted at least once, if they've verified they can ping their gateway and outside their subnet, that they've gone over the device's IP address and mask and gateway settings to confirm there are no typos (if they have a static IP entry).

This IBM guru assumed all was well in his system, and neglected to first perform the basic and most-useful troubleshooting: verifying Layer 1, then Layer 2, then Layer 3--BEFORE coming to the Network Team.  He'd have saved us both time if he'd done this, but his skill level is at the top end of complexity, and he was accustomed to dealing with much tougher issues.  That was his "hammer."

I was interested to learn he had multiple addresses & MAC assigned virtually to his NICs, so I got something out of it, too.

vinay.by
Level 16

Eagerly waiting to try PerfStack ....

datachick
Level 12

That's both a perfect and awful story -- at the same time.  Thanks for sharing. 

datachick
Level 12

I can't wait, either. What I saw at Tech Field Day was great.

Jfrazier
Level 18

datachick, many of the siloed issues you wrote about exist because of separation of duties. In many cases it goes beyond protecting a resource from being altered by other people/teams to preventing other people/teams from having visibility into it.  Since many things today are vastly complicated, with so many options that can drastically impact things if the options are selected incorrectly or parameters fat-fingered, separation of duties helps to protect that from other people's fingers touching things.  But in many cases it seems it is taken too far, to mean you can't even see my environment because you don't have the need to see it since you are not allowed to alter it.

Now with that said, I will reiterate something I mentioned in a different thread.  These tools help to break down the silos and help to give a better holistic view of the business service and enterprise, "provided the culture" of the shop is open to such a thing.  In some shops that culture is embraced and they are better for it, and that is usually in a mature shop.  In some shops that are small and rapidly growing, where people do not want to give up control although the environment is spiralling out of control because they don't have the resources to manage it, they feel the spotlight is going to be used to spotlight their issues.  Again, culture.  We are all a team and our success is as a team. There has to be overall visibility so that everybody can see how it all interacts together.  All these different moving parts are a machine, but as also mentioned elsewhere, configuration issues and even individual bits of latency all add up.  While everything appears to be in spec individually, the tolerances add up and the system as a whole is out of spec.  Tolerance stacking is hard to find but it is a real thing.

goodzhere
Level 14

I'll be installing this coming week.  I am looking forward to it.

datachick
Level 12

That's a good point, too.  I wasn't advocating that cross-collaboration means that people should be making changes in these other environments, but that they should be able to see across team resources to have a better view into what's going on.

I've also been in situations where separation of duties meant I couldn't do my job.  I was an Accidental DBA (and the only DBA for a while) responsible for diagnosing performance and data quality issues in production, but I had zero access to production systems, not even to monitor them. When there was a data issue, I had to call a data centre person and try to walk them through SSMS to help me figure out what was going on.  Eventually he became so frustrated he asked me to walk to his building and sit beside him while we looked at the data together.  Which, of course, defeated the whole separation of duties thing.

Jfrazier
Level 18

datachick, exactly my point!!

I understand separation of duties and there is a case for it if implemented reasonably well.

In a number of cases I see it hampering people from doing their job.  I also understand the reasons behind least privilege access.

I find that in many cases they are implemented in a manner that is far overreaching and highly constrictive to one's ability to do their job.

Thus I find myself in the same boat as you, interrupting someone else's day to get them to enter commands that I need to enter but cannot. Eventually I waste so much of their time that a bit more access is granted so that I can administer the app I am responsible for.

Squeaky Wheel Syndrome at its best.

michael.kent
Level 13

Me too, just getting the test suite ready.

ecklerwr1
Level 19

Can't wait for this and the new ASA support when it goes normal release!

jkump
Level 15

I like the course this is taking.  I was not familiar with this line of products.  I can see the benefits.  May have to look at adding some additional modules.

zero_cool
Level 10

True collaboration requires data that is:

  • Integrated
  • Visualized
  • Correlated
  • Traceable across teams and groups
  • Understandable

This is exactly the thing my V.P. is looking to achieve.  We are pushing to use the PerfStack tool in anticipation that it will meet our organization's goals. Great post, datachick. Thank you!

datachick
Level 12

Happy to hear it helps.

shuckyshark
Level 13

loved the read!

byrona
Level 21

Great Story!

One of the challenges is getting people to bring us problems instead of solutions.  The guy in your story brought you a solution, not a problem.  The issue with this is that he was not equipped with enough data to be proposing a solution; before you jump to a solution, you need to fully understand the problem you are trying to solve.  Once he had enough info about the lab, he was able to implement a much better solution that would actually solve his problem.

tallyrich
Level 15

PerfStack is very promising - I'm hoping to get the team excited about this.

designerfx
Level 16

Same. I'm also happy that it also marks the EOS for Server 2008R2, because apparently I need to use that as leverage to upgrade.

datachick
Level 12

Excellent point.  Have you been reading the yet-to-be-published third post in this series?

datachick
Level 12

I like the way you think.

About the Author
Data Evangelist, Sr. Project Manager and Architect at InfoAdvisors. I'm a consultant, frequent speaker, trainer, and blogger. I love all things data. I'm a Microsoft MVP. I work with all kinds of databases in the relational and post-relational world. I'm a NASA 2016 Datanaut! I want you to love your data, too.