
Geek Speak

19 Posts authored by: datachick

Game tile spelling out "DATA"

Building a culture that favors protecting data can be challenging. In fact, most of us who love our data spend a huge amount of time standing up for our data when it seems everyone else wants to take the easiest route to getting stuff done. I can hear the pleas from here:


  • We don't have time to deal with SQL injection now. We will get to that later.
  • If we add encryption to this data, our queries will run longer. It will make the database larger, which will also affect performance. We can do that later if we get the performance issues fixed.
  • I don't want to keep typing out these long, complex passwords. They are painful.
  • Multi-factor authentication means I have to keep my phone near me. Plus, it's a pain.
  • Security is the job of the security team. They are a painful bunch of people.


…and so on. What my team members don't seem to understand is that these pain points are supposed to be painful. The locks on my house doors are painful. The keys to my car are painful. The PIN on my credit card is painful. All of these are set up, intentionally, as obstacles to access -- not my access, but unauthorized access. Why is it that team members who lock their doors, shred sensitive documents, and keep their collector action figures under glass don't want to protect the data we steward on behalf of customers? In my experience, these people don't want to protect data because they are measured, compensated, and punished in ways that take away almost all the incentives to do so. Developers and programmers are measured on the speed of delivery. DBAs are measured on uptime and performance. SysAdmins are measured on provisioning resources. And rarely have these roles been measured and rewarded for security and privacy compliance.
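Of all those pleas, the SQL injection one has the weakest excuse: parameterized queries cost essentially no extra development time. A minimal sketch (sqlite3 here purely for illustration; the same placeholder pattern applies to whatever database driver you use):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES ('Alice', 'alice@example.com')")

malicious = "' OR '1'='1"

# Vulnerable: string concatenation lets the input rewrite the query
vulnerable = conn.execute(
    "SELECT name FROM users WHERE name = '" + malicious + "'"
).fetchall()  # returns every row in the table

# Safe: the driver binds the value; it can never become SQL
safe = conn.execute(
    "SELECT name FROM users WHERE name = ?", (malicious,)
).fetchall()  # returns no rows -- nobody is literally named that
```

The "we'll get to it later" version is the first query; the version that takes the same amount of time to write is the second.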


To Reward, We Must Measure


How do we fix this? We start rewarding people for data protection activities. To reward people, we need to measure their deliverables.


  • An enterprise-wide security policy and framework that includes specific measures at the data category level
  • Encryption design, starting with the data models
  • Data categorization and modeling
  • Test design that includes security and privacy testing
  • Proactive recognition of security requirements and techniques
  • Data profiling testing that discovers unprotected or under-protected data
  • Data security monitoring and alerting
  • Issue management and reporting


As for the rewards, they need to focus on the early introduction of data protection features and service. This includes reviewing designs and user stories for security requirements.
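Data profiling that discovers unprotected data, one of the measurable deliverables above, can start very small. A hedged sketch, assuming column samples are available as plain strings; the patterns and column name here are illustrative, not a complete PII catalog:

```python
import re

# Illustrative patterns for data that should never sit in cleartext
PII_PATTERNS = {
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "card_number": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def profile_column(name, sample_values):
    """Return which PII-like patterns appear in a column's sampled values."""
    hits = set()
    for value in sample_values:
        for label, pattern in PII_PATTERNS.items():
            if pattern.search(str(value)):
                hits.add(label)
    return {"column": name, "pii_found": sorted(hits)}

# Free-text columns are a classic hiding place for sensitive data
report = profile_column("notes", ["call 555", "ssn 123-45-6789", "a@b.com"])
```

Run something like this against sampled values on a schedule, and "we didn't know that data was there" stops being an available excuse.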


Then we get to the hard part: I'm of the mind that specific rewards for doing what's already expected of me are over the top. But I recognize that this isn't always the best way to motivate positive actions. Besides, as I will get into later in this series, the organizational punishments for not protecting data may be so large that a company cannot afford the lack of data protection culture we currently have. Plus, we don't want to have to use prison time as the measurement that encourages data protection.


In this series, I'll be discussing data protection actions, why they are important, and how we can be better at data. Until then, I'd love to hear about what, if any, data protection reward (or punishment) systems your organization has in place today.

In this last post of my 5 More Ways I Can Steal Your Data series, I focus on my belief that all data security comes down to empathy. Yes, that one trait that we in technology stereotypically aren't known for displaying. But I know there are IT professionals out there who have and use it. These are the people I need on my teams to help guide them toward making the right decisions.


Empathy? That's Not a Technical Skill!

If we all recognize that the personal data we steward actually belongs to people who need to have their data treated securely, then we will make decisions that make that data more secure. But what about people who just don't have that feeling? We see attitudes like this:


"I know the data model calls for encryption, but we just don't have the time to implement it now. We'll do it later."


"Encryption means making the columns wider. That will negatively impact performance."


"We have a firewall to protect the data."


"Encryption increases CPU pressure. That will negatively impact performance."


"Security and privacy aren't my jobs. Someone needs to do those parts after the software is done."


"We don't have to meet European laws unless our company is in Europe." [I'm not a lawyer, but I know this isn't true.]


What's lacking in all those statements is empathy for the people whose data we are storing -- the people who will be forced to deal with the consequences of the bad data practices I've been writing about in the eBook and this series. The consequences might be as small as having to reset a password, but bad data practices can also lead to identity theft, financial losses, and personal safety issues.
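The "encryption makes the columns wider" objection is real, but it's also quantifiable, which makes it a design discussion instead of a veto. A back-of-the-envelope sketch, assuming AES in CBC mode with PKCS#7 padding and a stored IV (the exact overhead depends on the algorithm and mode your platform actually uses):

```python
def encrypted_column_bytes(plaintext_bytes, block=16, iv=16):
    """Estimate the stored size of an AES-CBC ciphertext: pad up to the
    next full block (PKCS#7 always adds at least one byte), plus the IV."""
    padded = (plaintext_bytes // block + 1) * block
    return padded + iv

# A 9-digit national ID: 9 cleartext bytes become 16 + 16 = 32 stored bytes
print(encrypted_column_bytes(9))
# A 200-byte address field: 208 + 16 = 224 stored bytes
print(encrypted_column_bytes(200))
```

Roughly 23 extra bytes on a 9-byte identifier is a number you can put in front of a customer and ask whether their identity is worth it.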


Hiring for Empathy


I rarely see any interview techniques that focus on screening candidates for empathy skills or experiences. Maybe we should be adding such items to our hiring processes. I believe the best way to do this is to ask candidates to talk about:

  • Examples of times they had to choose the right type of security to implement for Personally Identifiable Information (PII)
  • A time they had to trade performance in favor of meeting a requirement
  • The roles they think are responsible for data protection
  • The methods they would use in projects focused on protecting data
  • The times they have personally experienced having their own data exposed


If I were asking these questions of a candidate, I'd be looking not so much for their answers, but the attitude they convey while answering. Did they factor in risks? Trade-offs? How a customer might be impacted?  This is what Jerry Weinberg writes about in Secrets of Consulting when he says, "Words are useful, but always listen to the music."


By the way, this concept applies to consultants as well. Sure, we tend to retain consultants who can just get things done, but they also need to have empathy to help clients make the right decisions. Consultants who lack empathy tend to not care much about your customers, just their own.


Wrapping it Up

I encourage you to read the eBook, go back through the series, then take steps to help ensure data security and empathy. Empathy is about feeling your customers' pain and taking a stand to mitigate that pain as much as you can.


Oh, and as I said in a previous post, keeping your boss out of jail.  Do that.


UPDATE: My eBook, 10 Ways We Can Steal Your Data is now available.  Go download it.

10 Ways We Can Steal Your Data eBook cover: spaceship, robot, data center

Datachick LEGO at a SolarWinds Desk with a water cooler

In my recent post 5 More Ways I Can Steal Your Data - Work for You & Stop Working for You, I started telling the story of a security guard who helped a just-fired contractor take servers with copies of production data out of the building:


Soon after he was rehired, the police called to say they had raided his home and found servers and other computer equipment with company asset control tags on them. They reviewed surveillance video that showed a security guard holding the door for the man as he carried equipment out in the early hours of the morning. The servers contained unencrypted personal data, including customer and payment information. Why? These were development servers where backups of production data were used as test data.

Apparently, the contractor was surprised to be hired back by a company that had caught him stealing, so he decided since he knew about physical security weaknesses, he would focus not on taking equipment, but the much more valuable customer and payment data.


How the Heck Was He Able to Do This?


You might think he was able to get away with this by having insider help, right?  He did, sort of.  But it didn't come from the security guard.  It came from poor management practices, not enough resources, and more. I'm going to refer to the thief here as "Our Friend".


Not Enough Resources


Our Friend had insider information about how lax physical security was at this location.  There was only ever one security person working at a time.  When she took breaks, or had to deal with a security issue elsewhere, no one else was there to cover the entrance.  Staff could enter with badges and anyone could exit.  Badging systems were old and nearly featureless.  Printers and other resources available to the security group were old and nearly non-functioning.  Security staff weren't required or tested to be security-minded.


In this case, it was easy to figure out the weaknesses in this system.


Poor Security Practices


In the case of Our Friend, he was rehired by a different group who had no access to a "do not hire" list because he was a contractor, not an employee.  He was surprised at being rehired (as were others).  The culture of this IT group was very much "mind your own business" and "don't make waves".  I find that a toxic management culture plays a key role in security approaches.  When security issues were raised, the response was more often than not "we don't have time to worry about that" or "focus on your own job".


Poor Physical Security


Piggybacking or tailgating (following a person with access through a door without scanning a badge) is common, and commonly unenforced, in many facilities.  Sometimes employees would actually hold the door open for complete strangers.  This seems like being nice, but it's not. Another contractor, who had recently been let go, got in several times during off hours to wander the hallways looking for his former work laptop.  He wanted to remove traces of improper files and photos.  He accomplished this by tailgating his way into the building.  This happened just weeks before Our Friend carried out his acts.


When Our Friend was rehired, there was a printout of his old badge photo hanging on the wall at the security area.  It was a low-resolution photo printed on a cheap inkjet printer running low on ink.  The guard working that day couldn't even tell that this guy had a "no entry" warning.  The badge printing software had no checks for "no new badge".


After being rehired, Our Friend was caught again stealing networking equipment and was let go.  Security was notified and another poorly printed photo was put up in the security area. Then Our Friend came back in the early morning hours on the weekend, said he forgot his badge and was issued a new one.  Nothing in the system set up an alert.


He spent some time gathering computers that were installed in development and QA labs, then some running in other unsecured areas.  He got a cart, and the security guard held the door open while he took them out to his car.  How do we know this?  There were video tapes. How do we know this? The security guard sold the tapes to a local news station. News stations love when there is video.


Data Ignorance


As I mentioned in the previous post, the company didn't even know the items were missing. It took several calls from the local police to get a response.  And even then, the company denied anything was missing.  Because they didn't know.   Many of us knew that these computers would have production data on them because this organization used production data in their development and test processes.


But the company itself had no data inventory system. They had no way of knowing just what data was on those computers.  It was also common to find that these systems had virtually no security, or that a single login for the QA environment was written on the whiteboard in the QA labs.  No one knew just what data was copied where.  Anyone could deploy production data anywhere they could find room.  Requests for production data were granted to pretty much anyone in IT or the rest of the company, and requests could be made verbally.  There were no records of any request or of the data provided.  Employees were given no indication that any set of data held sensitive or otherwise protected data.
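A data inventory doesn't have to start as an enterprise project. Even a minimal record of who asked for which data, for which environment, closes the verbal-requests hole. A sketch, with field names that are entirely my own invention:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional

@dataclass
class DataProvisionRequest:
    requester: str
    dataset: str
    target_env: str               # e.g. "dev", "qa", "prod"
    contains_sensitive: bool      # forces the question to be asked at all
    approved_by: Optional[str] = None
    requested_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

LOG: List[DataProvisionRequest] = []

def request_data(req: DataProvisionRequest) -> bool:
    """Refuse sensitive data outside production; log every request."""
    LOG.append(req)
    if req.contains_sensitive and req.target_env != "prod":
        return False              # point the requester at synthetic test data
    return req.approved_by is not None
```

It's not a governance platform, but it already gives you the two things this company lacked: a record of every copy, and a place where "no" can happen.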


The lack of inventory let the company spokesperson say something like "These were just test devices; we have no indication that any customer data was involved in this theft".


Fixing It


I could go on with a list of tips on how to fix these issues. But the main fix, the one no one wants to embrace, is to stop using production data for dev and test.  I have more writing coming on this topic; it will be my agenda for 2018.  If this company had embraced this option, the theft would have been just of equipment and some test data with no value.


The main fix that no one wants to embrace is to stop using production data for dev and test.


If we as IT professionals started following the practice of having real test data, many of the breaches we know of would not have been breaches of real data.  Yes, we need to fix physical security issues.  But let's keep production data in production.  Unless we are testing a production migration, there's no need to use production data for any reason.  In fact, many data protection compliance schemes forbid it.
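Generating real test data is less work than it sounds. A minimal sketch using only the standard library (a dedicated library such as Faker does this much better; every name and field here is made up):

```python
import random
import string

random.seed(42)  # reproducible fixtures for repeatable tests

FIRST = ["Alex", "Dana", "Kim", "Lee", "Sam"]
LAST = ["Ng", "Rivera", "Smith", "Okafor", "Larsen"]

def fake_customer(customer_id):
    """A synthetic customer row: realistic shape, zero real people."""
    first, last = random.choice(FIRST), random.choice(LAST)
    return {
        "id": customer_id,
        "name": f"{first} {last}",
        "email": f"{first.lower()}.{last.lower()}{customer_id}@example.test",
        "card_last4": "".join(random.choices(string.digits, k=4)),
    }

customers = [fake_customer(i) for i in range(1000)]
```

If those servers had carried a thousand rows like these, the theft would have been a hardware loss, not a breach.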

Have you developed real test data, not based on just trying to obscure production data, for all your dev/test needs?

tiles spelling out DATA THEFT

In my eBook, 10 Ways We Can Steal Your Data, I reveal ways that people can steal or destroy the data in your systems. In this blog post, I'm focusing on un-monitored and poorly monitored systems.


Third-party Vendors


The most notorious case of this type is the 2013 Target data theft incident in which 40 million credit and debit cards were stolen from Target's systems. This data breach is a case study on the role of monitoring and alerting. It led to fines and costs in the hundreds of millions of dollars for the retailer. Target had security systems in place, but the company wasn't monitoring the security of their third-party supplier. And, among other issues, Target did not respond to their monitoring reports.


The third-party vendor, an HVAC services provider, had a public-facing portal for logging in to monitor their systems. Access to this system was breached via an email phishing attack. This information, together with a detailed security case study and architecture published by another Target vendor, gave the attackers the information they needed to successfully install malware on Target Point-of-Sale (POS) servers and systems.


Target listed their vendors on their website. This list provided a funnel for attackers to find and exploit vendor systems. The attackers found the right vulnerability to exploit with one of the vendors, then leveraged the details from the other vendor to do their work.


Misconfigured, Unprotected, and Unsecured Resources


The attackers used vulnerabilities (backdoors, default credentials, and misconfigured domain controllers) to work their way through the systems. These are easy things to scan for and monitor. So much so that "script kiddies" can do this without even knowing how their scripts work. Why didn't IT know about these misconfigurations? Why were default credentials left in enterprise data center applications?  Why was information about ports and other configurations published publicly? Any one of these issues alone might not have led to the same outcome, but as I'll cover below, together they formed the perfect storm of mismanaged resources that made the data breach possible.
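Scanning for default credentials really is script-kiddie simple, which is exactly why defenders should run the same check before the attackers do. A hedged sketch: the credential list and device inventory are illustrative, and a real scanner would attempt an actual login against each service rather than compare stored config:

```python
# Well-known factory defaults (a tiny, illustrative subset)
DEFAULT_CREDS = {("admin", "admin"), ("admin", "password"), ("root", "root")}

# Hypothetical inventory: device name -> configured (user, password)
inventory = {
    "hvac-portal": ("admin", "admin"),
    "domain-controller": ("svc_dc", "Xk9#mQ2w"),
    "pos-server-12": ("admin", "password"),
}

def find_default_credentials(devices):
    """Return the devices still running on factory-default logins."""
    return sorted(name for name, creds in devices.items()
                  if creds in DEFAULT_CREDS)

flagged = find_default_credentials(inventory)
```

Ten lines of code, run on a schedule with an alert on a non-empty result, removes one entire leg of the attack path described above.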



When all this was happening, Target's offsite monitoring team was alerted that unexpected activities were happening on a large scale. They notified Target, but there was no response.


Some of the reasons given were that there were too many false positives, so security staff had grown slow to respond to all reports. Alert tuning would have helped this issue. Other issues included having too few and undertrained security staff.
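Alert tuning can start with something as simple as suppressing repeats so humans only see new signal. A sketch of a deduplication window, with alert shapes that are entirely made up for illustration:

```python
class AlertDeduper:
    """Forward an alert only if this (source, rule) pair hasn't fired
    within the suppression window; repeats are counted, not sent."""

    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self.last_sent = {}       # (source, rule) -> timestamp last forwarded
        self.suppressed = 0       # visibility into how noisy the feed is

    def should_alert(self, source, rule, now):
        key = (source, rule)
        last = self.last_sent.get(key)
        if last is not None and now - last < self.window:
            self.suppressed += 1
            return False
        self.last_sent[key] = now
        return True

d = AlertDeduper(window_seconds=300)
print(d.should_alert("pos-12", "malware", now=0))    # True: first sighting
print(d.should_alert("pos-12", "malware", now=60))   # False: inside window
print(d.should_alert("pos-12", "malware", now=400))  # True: window expired
```

The `suppressed` counter matters as much as the filtering: it's the metric that tells you which rules need real tuning instead of a bigger window.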


Pulling it All Together


There were monitoring controls in place at Target, as well as security staff, third-party monitoring services, and up-to-date compliance auditing. But the system as a whole failed due to not having an integrated, system-wide approach to security and threat management.



How can we mitigate these types of events?


  • Don't use many separate monitoring and alerting systems
  • Follow data flows through the whole system, not just one system at a time
  • Tune alerts so that humans respond
  • Test responders to see if the alerts are working
  • Read the SANS case study on this breach
  • Don't let DevOps performance get in the way of threat management
  • Monitor for misconfigured resources
  • Monitor for unpatched resources
  • Monitor for rogue software installs
  • Monitor for default credentials
  • Monitor for open ports
  • Educate staff on over-sharing about systems
  • Monitor the press for reports about technical resources
  • Perform regular pen testing
  • Treat security as a daily operational practice for everyone, not just an annual review
  • Think like a hacker


I could just keep adding to this list.  Do you have items to add? List them below and I'll update.

AdventureWorks sample data

In my soon-to-be-released eBook, 10 Ways We Can Steal Your Data, we talk about The People Problem: how people who aren't even trying to be malicious end up exposing data to others without understanding how their actions put data at risk. But in this post, I want to talk about intentional data theft.


What happens when insiders value the data your organization stewards? There have been several newsworthy cases where insiders have recognized that they could profit from taking data and making it available to others. In today’s post, I cover two ways I can steal your data that fall under that category.

1. Get hired at a company where security is an afterthought

When working with one of my former clients (this organization is no longer in business, so I feel a bit freer to talk about this situation), an IT contractor with personal financial issues was hired to help with networking administration. From what I heard, he was a nice guy and a hard worker. One day, network equipment belonging to the company was found in his car and he was let go. However, he was rehired to work on a related project just a few months later. During this time, he was experiencing even greater financial pressures than before. 

Soon after he was rehired, the police called to say they had raided his home and found servers and other computer equipment with company asset control tags on them. They reviewed surveillance video that showed a security guard holding the door for the man as he carried equipment out in the early hours of the morning. The servers contained unencrypted personal data, including customer and payment information. Why? These were development servers where backups of production data were used as test data.

Apparently, the contractor was surprised to be hired back by a company that had caught him stealing, so he decided since he knew about physical security weaknesses, he would focus not on taking equipment, but the much more valuable customer and payment data. 

In another case, a South Carolina Medicaid worker requested a large number of patient records, then emailed that data to his personal address. This breach was discovered and he was fired. My favorite quotes from this story were:

Keck said that in hindsight, his agency relied too much on “internal relationships as our security system.”




Given his position in the agency, Lykes had no known need for the volume of information on Medicaid beneficiaries he transferred, Keck said.

How could this data breach be avoided?

It seems obvious to me, but rehiring a contractor who has already breached security seems like a bad idea. Having physical security that does not require paperwork to remove large quantities of equipment in the middle of the night also seems questionable. Don't let staffing pressures persuade you to make bad rehire decisions.

2. Get hired, then fired, but keep friends and family close


At one U.S. hospital, a staff member was caught stealing patient data for use in identity theft (apparently this is a major reason why health data theft happens) and let go. But his wife, who worked at the hospital in a records administration role, maintained her position after he was gone. Not surprisingly, at least in hindsight, the data thefts continued.

There have also been data breach scenarios in which one employee paid another employee or employees to gather small numbers of records to send to a third party who aggregated those records into a more valuable stockpile of sellable data.

In other data breach stories, shared logins and passwords have led to former employees stealing data, locking out onsite teams, or even destroying data. I heard a story about one employee who, swamped with work, provided his credentials to a former employee who had agreed to assist with the workload. That former employee used the information he was given to steal and resell valuable trade secrets to his new employer.

How can these data breaches be avoided?

In the previously mentioned husband and wife scenario, I'm not sure what the impact should have been regarding the wife’s job. There was no evidence that she had been involved in the previous data breach. That said, it would have been a good idea to ensure that data access monitoring was focused on any family members of the accused.

Sharing logins and passwords is a security nightmare when employees leave. They rarely get reset, and even when they do, they are often reset to a slight variation of the former password.


This reminds me of one more much easier way to steal data, one I covered in the 10 Ways eBook: If you use production data as test and development data, it’s likely there is no data access monitoring on that same sensitive data. And no “export controls” on it, either. This is a gaping hole in data security and it’s our job as data professionals to stop this practice.

What data breach causes have you heard about that allowed people to use unique approaches to stealing or leaking data? I'd love to hear from you in the comments below.



In my soon-to-be-released eBook, 10 Ways I Can Steal Your Data, I cover the not-so-talked-about ways that people can access your enterprise data. It covers things like you're just GIVING me your data, ways you might not realize you are giving me your data, and how to keep those things from happening.


The 10 Ways eBook was prepared to complement my upcoming panel during next week's ThwackCamp on the data management lifecycle. You've registered for ThwackCamp, right? In this panel, a group of fun and sometimes irreverent IT professionals, including Thomas LaRock sqlrockstar, Stephen Foskett sfoskett and me, talk with Head Geek Kong Yang kong.yang about things we want to see in the discipline of monitoring and systems administration. We also did a fun video about stealing data. I knew I couldn't trust that Kong guy!


In this blog series, I want to talk a bit more about other ways I can steal your data. In fact, there are so many ways this can happen that I could do a semi-monthly blog series from now until the end of the world. Heck, with so many data breaches happening, the end of the world might just be sooner than we think.


More Data, More Breaches

We all know that data protection is getting more and wider attention. But why is that? Yes, there are more breaches, but I also think legislation, especially the regulations coming out of Europe, such as the General Data Protection Regulation (GDPR), means we are getting more reports. In the past, organizations would keep quiet about failures in their infrastructure and processes because they didn't want us to know how poorly they treated our data. In fact, during the "software is eating the world" phase of IT professionals making software developers kings of the world, most data had almost no protection and was haphazardly secured. We valued performance over privacy and security. We favored developer productivity over data protection. We loved our software more than we loved our data.


But this is all changing due to an increased focus on the way the enterprise values data.


I have some favorite mantras for data protection:


  • Data lasts longer than code, so treat it right
  • Data privacy is not security, but security is required to protect data privacy
  • Data protection must begin at requirements time
  • Data protection cannot be an after-production add-on
  • Secure your data and secure your job
  • Customer data is valuable to the customers, so if you value it, your customers will value your company
  • Data yearns to be free, but not to the entire world
  • Security features are used to protect data, but they have to be designed appropriately
  • Performance desires should never trump security requirements



And my favorite one:


  • ROI also stands for Risk of Incarceration: Keeping your boss out of jail is part of your job description



So keep an eye out for the announcement of the eBook release and return here in two weeks when I'll share even more ways I can steal your data.


As we come to the end of this series on infrastructure and application data analytics, I thought I'd share my favorite quotes, thoughts, and images from the past few weeks of posts leading up to the PerfStack release.


SomeClown leads the way in The One Where We Abstract a Thing


"Mean time to innocence (MTTI) is a somewhat tongue-in-cheek metric in IT shops these days, referring to the amount of time it takes an engineer to prove that the domain for which they have responsibility is not, in fact, the cause of whatever problem is being investigated. In order to quantify an assessment of innocence you need information, documentation that the problem is not yours, even if you cannot say with any certainty who does own the problem. To do this, you need a tool which can generate impersonal, authoritative proof you can stand on, and which other engineers will respect. This is certainly helped if a system-wide tool, trusted by all parties, is a major contributor to this documentation."


Karen:  Mean Time To Innocence! I'm so stealing that. I wrote a bit about this effect in my post Improving your Diagnostic and Troubleshooting Skills. When there's a major problem, the first thing most of us think is, "PLEASE DON'T LET IT BE ME!"  So I love this thought.


demitassenz wrote in PerfStack for Multi-dimensional Performance Troubleshooting


"My favorite part was adding multiple different performance counters from the different layers of infrastructure to a single screen. This is where I had the Excel flashback, only here the consolidation is done programmatically. No need for me to make sure the time series match up. I loved that the performance graphs were re-drawing in real-time as new counters were added. Even better was that the re-draw was fast enough that counters could be added on the off chance that they were relevant. When they are not relevant, they can simply be removed. The hours I wasted building Excel graphs translate into minutes of building a PerfStack workspace."


Karen:  OMG! I had completely forgotten my days of downloading CSVs or other outputs of tools and trying to correlate them in Excel. As a data professional, I'm happy that we now have a way to quickly and dynamically bring metrics together to make data tell the story it wants to tell.


cobrien  NPM 12.1 Sneak Peek - Using Perfstack for Networks


"I was exploring some of the data the other day. It’s like the scientific method in real-time. Observe some data, come up with a hypothesis, drag on related data to prove or disprove your hypothesis, rinse, and repeat."


Karen:  Data + Science.  What's not to love?


SomeClown mentioned in Perfstack Changes the Game


"PerfStack can now create dashboards on the fly, filled with all of the pertinent pieces of data needed to remediate a problem. More than that, however, they can give another user that same dashboard, who can then add their own bits and bobs. You are effectively building up a grouping of monitoring inputs consisting of cross-platform data points, making troubleshooting across silos seamless in a way that it has never been before."


Karen: In my posts, I focused a lot on the importance of collaboration for troubleshooting. Here, Teren gets right to the point. We can collaboratively build analytics based on our own expertise to zero in on what we are trying to resolve.  And we have data to back it up.


aLTeReGo in a post demo-ing how it works, Drag & Drop Answers to Your Toughest IT Questions


"Sharing is caring. The most powerful PerfStack feature of all is the ability to collaborate with others within your IT organization; breaking down the silo walls and allowing teams to triage and troubleshoot problems across functional areas. Anything built in PerfStack is sharable. The only requirement is that the individual you're sharing with has the ability to login to the Orion web interface. Sharing is as simple as copying the URL in your browser and pasting it into email, IM, or even a help desk ticket."


Karen: Yes! I also wrote about how important collaboration is to getting problems solved fast.


demitassenz shared in Passing the Blame Like a Boss


"One thing to keep in mind is that collaborative troubleshooting is more productive than playing help desk ticket ping pong. It definitely helps the process to have experts across the disciplines working together in real time. It helps both with resolving the problem at hand and with future problems. Often each team can learn a little of the other team’s specialization to better understand the overall environment. Another underappreciated aspect is that it helps people to understand that the other teams are not complete idiots. To understand that each specialization has its own issues and complexity."


Karen: Help desk ticket ping pong. If you've ever suffered through this, especially when someone passes the ticket back to you right before the emergency "why haven't we fixed this yet" meeting with the CEO, you'll know the pain of it all.


SomeClown observed in More PerfStack - Screenshot Edition


"In a nutshell, what it allows you to do is to find all sorts of bits of information that you're already monitoring, and view it all in one place for easy consumption. Rather than going from this page to that, one IT discipline-domain to another, or ticket to ticket, PerfStack gives you more freedom to mix and match, to see only the bits pertinent to the problem at hand, whether those are in the VOIP systems, wireless, applications, or network. Who would have thought that would be useful, and why haven't we thought of that before?"


Karen: "Why haven't we thought of that before?" That last bit hit home for me. I remember working on a project for a client to do a data model about IT systems. This was at least 20 years ago. We were going to build an integrated IT management system so that admins could break through the silo-based systems and approaches to solve a major SLA issue for our end-users. We did a lot of work until the project was deferred when a legislative change meant that all resources needed to be redirected to meet those requirements. But I still remember how difficult it was going to be to pull all this data together. With PerfStack, we aren't building a new collection system.  We are applying analytics on top of what we are already collecting with specialized tools.


DataChick's Thoughts


This next part is cheating a bit, because the quotes are from my own posts. But hey, I also like them and want to focus on them again.


datachick in Better Metrics. Better Data. Better Analytics. Better IT.


"As a data professional, I'm biased, but I believe that data is the key to successful collaboration in managing complex systems. We can't manage by "feelings," and we can't manage by looking at silo-ed data. With PerfStack, we have an analytics system, with data visualizations, to help us get to the cause faster, with less pain-and-blame. This makes us all look better to the business. They become more confident in us because, as one CEO told me, "You all look like you know what you are doing." That helped when we went to ask for more resources."


Karen: We should all look good to the CEO, right?


datachick ranted in 5 Anti-Patterns to IT Collaboration: Data Will Save You


"These anti-patterns don't just increase costs, decrease team function, increase risk, and decrease organizational confidence, they also lead to employee dissatisfaction and morale. That leads to higher turnover (see above) and more pressure on good employees. Having the right data, at the right time, in the right format, will allow you to get to the root cause of issues, and better collaborate with others faster, cheaper, and easier.  Also, it will let you enjoy your 3:00 ams better."


I enjoyed sharing my thoughts on these topics and reading other people's posts as well. It seems bloggers here shared the same underlying theme of collaboration and teamwork. That made this Canadian Data Chick happy. Go, everyone. Solve problems together.  Do IT better.  And don't let me catch you trying to do any of that without data to back you up. Be part of #TeamData.



I've worked in IT for a long time (I stopped counting at twenty years, quite a while ago). This experience means that I generally do well troubleshooting in data-related areas. In other areas, like networking, I'm pretty much done at "Do I have an IP address?" and "Is it plugged in?"


This is why team collaboration on IT issues, as I posted before, is so important.


What Can Go Wrong?


One of the things I've noticed is that while people can be experts in deploying solutions, this doesn't mean they are great at diagnosing issues. You've worked with that guy.  He's great at getting things installed and working.  But when things go wrong, he just starts pulling out cables and grumbling about other people's incompetence.  He keeps making changes and does several at the same time.  He's a nightmare.  And when you try to step in to help him get back on a path, he starts laying blame before he starts diagnosing the issue. You don't have to be that guy, though, to have challenges in troubleshooting.


Some of the effects that can contribute to troubleshooting challenges:


Availability Heuristic


If you have recently solved a series of NIC issues, the next time someone reports slow response times, you're naturally going to first consider a NIC issue.  And many times, this will work out just fine.  But if it constrains your thinking, you may be slow to get to the actual cause.  The best way to fight this cognitive issue is to gather data first, then assess the situation based on your entire troubleshooting experience.


Confirmation Bias


Confirmation Bias goes hand in hand with the availability heuristic. Once you have narrowed down what you think is causing this response-time issue, your brain will want you to go look for evidence that the problem is indeed the network cards.  The best way to fight this is to recognize when you are looking for proof instead of looking for data.  Another way to overcome confirmation bias is to collaborate with others on what they are seeing.  While groupthink can be an issue, it's less likely for a group to share the same confirmation bias equally.


Anchoring Heuristic


So to get here: you have limited your guesses to recent issues, you have searched out data to prove the correctness of your diagnosis, and now you are anchored there.  You want to believe.  You may start rejecting and ignoring data that contradicts your assumptions. In a team environment, this can be one of the most frustrating group troubleshooting challenges. You definitely don't want to be that gal.  The one who won't look at all the data. Trust me on this.




I use intuition a lot when I diagnose issues.  It's a good thing, in general.  Intuition helps professionals take a huge amount of data and narrow it down to a manageable set of causes. It's usually based on having dealt with similar issues hundreds or thousands of times over the course of your career.  But intuition without follow-up data analysis can be a huge issue.  This often happens due to ego or lack of experience.  The Dunning-Kruger effect (not knowing what you don't know) can also be a factor here.


There are other challenges in diagnosing causes and effects of IT issues. I highly recommend reading up on them so you can spot these behaviours in others and yourself.


Improving Troubleshooting Skills


  1. Be Aware.
    The first thing you can do to improve the speed and accuracy of your troubleshooting is to recognize these behaviours when you are doing them.  Being self-aware, especially when you are under pressure to bring systems back online or have a boss pacing behind your desk asking "when will this be fixed?" will help you focus on the right things.  In a truly collaborative, high trust environment, team members can help others check whether they are having challenges in diagnosing based on the biases above.
  2. Get feedback.
    We are generally lucky in IT that, unlike other professions, we can almost always immediately see the impact of our fixes to see if they actually fixed the problem.  We have tools that report metrics and users who will let us know if we were wrong.  But even post-event analyses, documenting what we got right and what we got wrong, can help us improve our methods.
  3. Practice.
    Yes, every day we troubleshoot issues.  That counts as practice.  But we don't always test ourselves like other professions do.  Disaster Recovery exercises are a great way to do this, but I've always thought we needed troubleshooting code camps/hackathons to help us hone our skills. 
  4. Bring Data.
    Data is imperative to punching through the cognitive challenges listed above.  Imagine diagnosing a data-center wide outage and having to start by polling each resource to see how it's doing.  We must have data for both intuitive and analytical responses.
  5. Analyze.
    I love my data.  But it's only an input into a diagnostic process.  Metrics, considered in a holistic, cross-platform, cross-team view, are the next step.  A shared analysis platform makes combining and overlaying data to get to the real answers smoother and faster.
  6. Log What Happened. 
    This sounds like a lot of overhead when you are under pressure (is your boss still there?), but keeping a quick list of what was done, what your thought process was, and what others did can be an important part of professional practice.  Teams can even share the load of writing stuff down.  This sort of knowledge base is also important for when you run into the rare things that have a simple solution but you can't remember exactly what to do (or even what not to do).
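As a sketch of what such a log can look like (the file name and fields here are my own hypothetical choices, not a prescribed format), each entry is just a timestamped line appended while you work:

```python
# A minimal troubleshooting log: one JSON line per step, recording what
# was done, by whom, and the reasoning at the time.
import json
from datetime import datetime, timezone

def log_step(path, who, action, reasoning):
    entry = {
        "when": datetime.now(timezone.utc).isoformat(),
        "who": who,
        "action": action,
        "reasoning": reasoning,
    }
    with open(path, "a") as f:          # append-only: cheap under pressure
        f.write(json.dumps(entry) + "\n")

log_step("incident-2042.jsonl", "karen",
         "restarted app pool on WEB03",
         "CPU flat but memory climbing; suspected leak in the app tier")
```

Plain append-only lines keep the overhead tiny while the boss is pacing, and the file doubles as raw material for the post-event analysis in tip 2.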

A person with experience can be an experienced non-expert. But with data, analysis, and awareness of our biases and challenges in troubleshooting, we can solve problems faster and with better accuracy. The future of IT troubleshooting will be based more and more on analytical approaches.


Do you have other tips for improving your troubleshooting and diagnostic skills?  Do you think we should get formal training in troubleshooting?

In our pursuit of Better IT, I bring you a post on how important data is to functional teams and groups. Last week we talked about anti-patterns in collaboration, covering things like data mine-ing and other organizational dysfunctions. In this post we will be talking about the role shared data, information, visualizations, and analytics play in helping ensure your teams can avoid all those missteps from last week.


Data! Data! Data!

These days we have data. Lots and lots of data. Even Big Data, data so important we capitalize it! As much as I love my data, we can't solve problems with just raw data, even if we enjoy browsing through pages of JSON or log data. That's why we have products like Network Performance Monitor (NPM), Server & Applications Monitor (SAM), and Database Performance Analyzer (DPA) to help us collect and parse all that data.  Each of those products collects specialized metrics and provides visualizations to help specialized sysadmins leverage that data. These administrators probably don't think of themselves as data professionals, but they are. They choose which data to collect, which levels to be alerted on, and which to report upon. They are experts in this data and they have learned to love it all.

Shared Data about App and Infrastructure Resources

Within the SolarWinds product solutions, data about the infrastructure and application graph is collected and displayed on the Orion Platform. This means that cross-team admins share the same set of resources and components and the data about their metrics. Now we have PerfStack, with features to do cross-team collaboration via data. We can see entities we want to analyze, then see all the other entities related to them. This is what I call the Infrastructure and Application Graph, which I'll be writing about later. After choosing entities, we can discover the metrics available for each of them and choose the ones that make the most sense to analyze based on the troubleshooting we are doing now.




Metrics Over Time


Another data feature that's critical to analyzing infrastructure issues is the ability to see data *over time*. It's not enough to know how CPU is doing right now; we need to know what it was doing earlier today, yesterday, last week, and maybe even last month, on the same day of the month. By having a view into the status of resources over time, we can intelligently make sense of the data we are seeing today. End-of-month processing going on? Now we know why there might be a slight spike in CPU pressure.
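To make that concrete, here's a minimal sketch of my own (an illustration, not a SolarWinds feature): a reading only counts as a spike when it stands well outside samples taken at comparable times, such as previous month-ends.

```python
# Hypothetical sketch: judge a CPU reading against historical samples taken
# at comparable times (e.g., the same day of the month), not in isolation.
from statistics import mean, stdev

def is_spike(current, history, threshold=2.0):
    """Return True if `current` is more than `threshold` standard
    deviations above the mean of the historical samples."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current > mu
    return (current - mu) / sigma > threshold

# CPU% samples from the last few end-of-month batch runs:
end_of_month_cpu = [78, 82, 80, 79]

print(is_spike(81, end_of_month_cpu))  # False -- in line with month-end history
print(is_spike(99, end_of_month_cpu))  # True  -- something else is going on
```

With context like this, 81% CPU during month-end processing is business as usual, while the very same number on a quiet mid-month Tuesday might deserve a look.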


Visualizations and Analyses


The beauty of PerfStack is that by choosing these Entities and metrics we can easily build data visualizations of the metrics and overlay them to discover correlations and causes. We can then interact with the information we now have by working with the data or the visualizations. By overlaying the data, we can see how statuses of resources are impacting each other. This collaboration of data means we are performing "team troubleshooting" instead of silo-based "whodunits." We can find the issue, which until now might have been hiding in data in separate products.




So we've gone from data to information to analysis in just minutes. Another beautiful feature of PerfStack is that once we've built the analyses that show our troubleshooting results, we can copy the URL, send it off to team members, and they can see the exact same analysis -- complete with visualizations -- that we saw. If we've done similar troubleshooting before and saved projects, we might be doing this in seconds.


This is often hours, if not days, faster than how we did troubleshooting in our previous silo-ed, data mine-ing approach to application and infrastructure support. We accomplished this by having quick and easy access to shared information that united differing views of our infrastructure and application graph.


Data -> Information -> Visualization -> Analysis -> Action


It all starts with the data, but we have to love the data into becoming actions. I'm excited about this data-driven workflow in keeping applications and infrastructure happy.


As promised in my previous post on Better IT, in this series I will be talking about collaboration. Today I'm sharing with you anti-patterns in collaboration.

Anti-pattern - Things you shouldn't be doing because they get in the way of success in your work, or your organization's efforts.  Antonym of "pattern."

In my troubled project turnaround work, when I start to talk about collaboration, I usually get many eye rolls. People think we're going to start doing team-building exercises, install an arcade game, and initiate hourly group hugs. (Not that these would be so bad.)  But most collaboration missteps I see are the result of anti-patterns that show up in how teams work. So in this post, let's look at the not-so-great-things that will get your team and your organization into trouble.


IT admins who don't know who is responsible for what, or can't find them

This is often the case in geo-diverse teams, spread over several time zones, and teams with high staff turnover. Their process (their "pattern") is to go on a "responsibility safari" to find the person, and their contact information, for a resource. On one project, it took me almost a month to find the person, who lived on another continent, who was responsible for the new networks we were going to deploy to our retail locations. By the time I found him, he was planning on moving to another company within a week. Having to hunt down people first, then their tools, then their data, is both costly and time-consuming, which delays one's ability to resolve issues. Having to find people before you find data is not the right way to manage.


IT admins who collect duplicate data about resources and their metrics, often in difficult to integrate formats and units of measure

This is almost always the result of using a hodgepodge of tools across teams, many of them duplicates because one person has a preference for a particular toolset. This duplication of tools leads to duplication of data.  And many of these tools keep their data locked in, with no way to share it with other tools. This duplication of data and effort is a huge waste of time and money for everyone. The cost of incompatible tool sets producing data in incompatible formats and levels of granularity is large and often not measured. It slows down access to data and the sharing of data across resource types.


IT pros who want to keep their data "private" 

This dysfunction is one my friend Len Silverston calls "data mine-ing," keeping data to yourself for personal use only. This is derived from the fact that data is indeed power. Keeping information about the status of the resources you manage gives you control of the messaging about those systems. This is a terrible thing for collaboration.


Data mine-ing - Acting in a manner that says, "This data is mine."

- Len Silverston

Agile blocking is horrible

A famous Agilista wants people to report false statuses, pretend to do work, tell teams that "all is good" so he can keep doing what he is doing without interruption. He also advocates for sharing incorrect data and data that makes it look like other teams are to blame. I refuse to link to this practice, but if you have decent search skills, you can find it. Teams that practice blocking are usually in the worst shape possible, and also build systems that are literally designed to fail and send their CEO to jail.  It's that bad. Of all these anti-patterns, this is the most dangerous and selfish.


IT admins who use a person-focused process

We should ensure that all of our work is personable. And collaborative. But "person-focused" here means "sharing only via personal intervention." When I ask them how they solve a problem, they often answer with, "I just walk over to the guy who does it and ask him to fix it." This is seen as Agile, because it's reactionary, and needs no documentation or planning. It does not scale on real-life projects. It is the exact opposite of efficiency. "Just walking over" is an interruption to someone else, who may not even manage one of the actual resources related to the issue. Also, she might not even work in the same building or country.  Finally, these types of data-less visits increase the us-versus-them mentality that negatively impacts collaboration. Sharing data about an instance is just that: data. It's the status of a set of resources. We can blame a dead router without having to blame a person. Being able to focus on the facts allows us to depersonalize the blame game.


Data will save you

These anti-patterns don't just increase costs, decrease team function, increase risk, and decrease organizational confidence; they also lead to employee dissatisfaction and lower morale. That leads to higher turnover (see above) and more pressure on good employees. Having the right data, at the right time, in the right format, will allow you to get to the root cause of issues and collaborate with others faster, cheaper, and easier.  Also, it will make your 3:00 a.m.s more bearable.


Are there other anti-patterns related to collaboration that you've seen when you've tried to motivate cross-team collaboration?  Share one in the comments if you do.

A few years ago I was working on a project as a project manager and architect when a developer came up to me and said, "You need to denormalize these tables…" and he handed me a list of about 10 tables that he wanted collapsed into one big table. When I asked him why, he explained that his query was taking four minutes to run because the database was "overnormalized." Our database was small: our largest table had only 40,000 rows. His query was pulling from a lot of tables, but it was only pulling back data on one transaction.  I couldn't even think of a way to write a query to do that and force it to take four minutes. I still can't.


I asked him to show me the data he had on the duration of his query against the database. He explained that he didn't have data; he had just timed his application from button push to results showing up on the screen. He believed that because there could be nothing wrong with his code, it just *had* to be the database that was causing his problem.


I ran his query against the database, and the result set came back in just a few milliseconds. No change to the database was going to make his four-minute query run faster. I told him to go look for the cause somewhere between the database and the application. It wasn't my problem.
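That diagnosis step boils down to a simple principle: time the query at the database, not button-to-screen. Here's a sketch of the idea using Python's built-in sqlite3 as a stand-in for the real server (the table name and row count are made up to mirror the story):

```python
# Time the query itself, not the application round trip. An in-memory
# SQLite database stands in for the real database server here.
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.executemany("INSERT INTO orders (total) VALUES (?)",
                 [(i * 1.5,) for i in range(40_000)])  # our "largest table"

start = time.perf_counter()
row = conn.execute("SELECT total FROM orders WHERE id = ?", (12345,)).fetchone()
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"row = {row}, query time = {elapsed_ms:.2f} ms")
# A single-row lookup comes back in milliseconds. If the user still waits
# four minutes, the time is being spent elsewhere in the stack.
```

Measuring each layer separately is what lets you hand the problem to the right team instead of the nearest one.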

He eventually discovered that the issue was a complex one involving duplicate IP addresses and other network configuration issues in the development lab.


Looking back on that interaction, I realize that this is how most of us in IT work: someone brings us a problem ("the system is slow"), we look into our tools and our data, and we give a yes-or-no answer about whether we caused it. If we can't find a problem, we close the ticket or send the problem over to another IT group. If we are in the database group, we send it over to the network or storage guys. If they get the report, they send it over to us. These sorts of silo-based responses take longer to resolve and often lead to a lot of chasing down and re-blaming. It costs time and money because we aren't responding as a team, just as a loose collection of groups.

Why does this happen?

The main reason we do this is that we typically don't have insight into anyone else's systems' data and metrics. And even if we did, we wouldn't understand it. Then we throw in the fact that most teams have their own set of specialized tools that we don't have access to. I had no access to network monitoring tools, nor permissions to run any.  It wasn't my job.


We are typically measured and rewarded based on working within our own groups, be it systems, storage, or networks, not on troubleshooting issues with other parts of the infrastructure.  It's like we build giant walls around our "stuff" and hope that someone else knows how to navigate around them. This "not my problem" response to complex systems issues doesn't help anyone.



What if it didn't have to be that way?


Another contributing factor is the intense complexity of the architecture of modern application systems. There are more options, more metadata, more metrics, more interfaces, and more layers than ever before. In the past, we attempted to build one giant tool to manage them all. What if we could still use specialty tools to monitor and manage all our components *and* pull the graph of resources and their data into one place so that we could analyze and diagnose issues in a common and sharable way?


True collaboration requires data that is:

  • Integrated
  • Visualized
  • Correlated
  • Traceable across teams and groups
  • Understandable


That's exactly what SolarWinds' PerfStack does. PerfStack builds upon the Orion Platform to help IT pros troubleshoot problems in one place, using a common interface, to help cross-platform teams figure out where a bottleneck is, what is causing it and get on to fixing it.





PerfStack combines metrics you choose from tools like Network Performance Monitor and Server & Applications Monitor across the Orion Platform into one easy-to-consume data visualization, matching them up by time. You can see in the figure above how easy it is to spot a correlated data point that is likely the cause of less-than-spectacular performance. PerfStack allows you to highlight exactly the data you want to see, ignore the parts that aren't relevant, and get right to the outliers.


As a data professional, I'm biased, but I believe that data is the key to successful collaboration in managing complex systems. We can't manage by "feelings," and we can't manage by looking at silo-ed data. With PerfStack, we have an analytics system, with data visualizations, to help us get to the cause faster, with less pain-and-blame. This makes us all look better to the business. They become more confident in us because, as one CEO told me, "You all look like you know what you are doing." That helped when we went to ask for more resources.


Do you have a story?


Later in this series, I'll be writing about the nature of collaboration and how you can benefit from shared data and analytics in delivering better and more confidence-instilling results to your organization. Meanwhile, do you have any stories of being sent on a chase to find the cause of a problem?  Do you have any great stories of bizarre causes you've found to a systems issue?


In previous posts, I've written about making the best of your accidental DBA situation.  Today I'm going to give you my advice on the things you should focus on if you want to move from accidental DBA to full data professional and DBA.


As you read through this list, I know you'll be thinking, "But my company won't support this; that's why I'm an accidental DBA." You are 100% correct.  Most companies that use accidental DBAs don't understand the difference between a developer and a DBA, so many of these items will require you to take your own initiative. But I know, since you are reading this, that you are already a great candidate to be that DBA.




Formal Training

Your path to becoming a DBA has many forks, but I'm a huge fan of formal training. This can be virtual or in-person. By virtual I mean a formal distance-learning experience, with presentations, instructor Q&A, hands-on labs, exams and assignments. I don't mean watching videos of presentations. Those offerings are covered later.


Formal training gives you greater confidence and evidence that you learned a skill, not that you only understand it. Both are important, but when it comes to that middle-of-the-night call alerting you that databases are down, you want to know that you have personal experience in bringing systems back online.



Conferences

Conferences are a great way to learn, and not just from invited speakers. Speaking with fellow attendees, via the hallway conferences that happen in conjunction with the formal event, gives you the opportunity to network with people who boast a range of skill levels. Sharing resource tips with these folks is worth the price of admission.


User Groups and Meetups

I run the Toronto Data Platform and SQL Server Usergroup and Meetup, so I'm a bit biased on this point. However, these opportunities to network and learn from local speakers are often free to attend.  Such a great value! Plus, there is usually free pizza. Just saying. You will never regret having met other data professionals in your local area when you are looking for your next project.


Online Resources

Online videos are a great way to supplement your formal training. I like Pluralsight because it's subscription-based, not per video. They offer a free trial, and the annual subscription is affordable, given the breadth of content offered.


Webinars given by experts in the field are also a great way to get real-world experts to show and tell you about topics you'll need to know. Some experts host their own, but many provide content via software vendor webinars, like these from SolarWinds.



Blogs are a great way to read tips, tricks, and how-tos. It's especially important to validate the tips you read about. My recommendation is that you validate any rules of thumb or recommendations you find by going directly to the source: vendor documentation and guidelines, other experts, and asking for verification from people you trust. This is especially true if the post you are reading is more than three months old.


But another great way to become a professional DBA is to write content yourself.  As you learn something and get hands-on experience using it, write a quick blog post. Nothing makes you understand a topic better than trying to explain it to someone else.



Professional Tools

I've learned a great deal more about databases by using tools that are designed to work with them. This can be because the tools offer guidance on configuration, do validations, and/or give you error messages when you are about to do something stupid.  If you want to be a professional DBA, you should be giving Database Performance Analyzer a test drive.  Then when you see how much better it is at monitoring and alerting, you can get training on it and be better at databasing than an accidental DBA with just native database tools.



Hands-On Labs

The most important thing you can do to enhance your DBA career is to get hands-on with the actual technologies you will need to support. I highly recommend you host your labs via the cloud. You can get a free trial for most. I recommend Microsoft Azure cloud VMs because you likely already have free credits if you have an MSDN subscription. There's also a generous 30-day trial available.

I recommend you set up VMs with various technologies and versions of databases, then turn them off.  With most cloud providers, such as Azure, a VM that is stopped and deallocated incurs no compute charges, only storage charges, which are very inexpensive.  Then when you want to work with that version of the software, you turn on your VM, wait a few minutes, start working, then turn it off when you need to move on to another activity.


The other great thing about working with Azure is that you aren't limited to Microsoft technologies.  There are VMs and services available for other relational database offerings, plus NoSQL solutions. And, of course, you can run these on both Windows and Linux.  It's a new Microsoft world.


The next best thing about having these VMs ready at your fingertips is that you can use them to test new automation you have developed, test new features you are hoping to deploy, and develop scripts for your production environments.


Think Like a DBA, Be a DBA

The last step is to realize that a DBA must approach issues differently than a developer, data architect, or project manager would. A DBA's job is to keep the database up and running, with correct and timely data.  That goal requires different thinking and different methods.  If you don't alter your problem-management thinking, you will likely come to different cost, benefit, and risk decisions.  So think like a DBA, be a DBA, and you'll get fewer middle-of-the-night calls.


In my previous posts, I shared my tips on being an Accidental DBA - what things you should focus on first and how to prioritize your tasks.  Today at 1PM CDT, Thomas LaRock, HeadGeek and Kevin Sparenberg, Product Manager, will be talking about what Accidental DBAs should know about all the stuff that goes on inside the Black Box of a database.  I'm going to share with you some of the other things that Accidental DBAs need to think about inside the tables and columns of a database.


I'm sure you're thinking "But Karen, why should I care about database design if my job is keeping databases up and running?"  Accidental DBAs need to worry about database design because bad design has significant impacts on database performance, data quality, and availability. Even though an operational DBA didn't build it, they get the 3 AM alert for it.



People use tricks for all kinds of reasons: they don't fully understand the relational model or databases, they haven't been properly trained, they don't know a feature already exists, or they think they are smarter than the people who build database engines. All but the last one are easily fixed.  Tricky things are support nightmares, especially at 3 AM, because all your normal troubleshooting techniques are going to fail.  They impact the ability to integrate with other databases, and they are often so fragile no one wants to touch the design or the code that made all these tricks work. In my experience, my 3 AM brain doesn't want to see any tricks.




Tricky Things

Over my career I've been amazed by the variety and volume of tricky things I've seen done in database designs.  Here I'm going to list just 3 examples, but if you've seen others, I'd love to hear about them in the comments. Some days I think we need to create a Ted Codd Award for the worst database design tricks.  But that's another post...


Building a Database Engine Inside Your Database


You've seen these wonders… a graph database built in a single table.  A key-value pair (or entity-attribute-value) database in a couple of tables. Or my favourite, a relational database engine within a relational database engine.  Now doing these sorts of things for specific reasons might be a good idea.  But embracing these designs as your whole database design is a real problem.  More about that below.
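To show what I mean, here's a sketch of the entity-attribute-value flavour of this trick (using Python's sqlite3 and a made-up product example), and the pile of self-joins that even a trivial query turns into:

```python
# The "key-value database inside a relational database" trick, sketched in
# SQLite: every attribute of every entity lands in one generic table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE eav (
        entity_id INTEGER,
        attribute TEXT,
        value     TEXT   -- everything is text, whatever its real type
    )
""")
conn.executemany("INSERT INTO eav VALUES (?, ?, ?)", [
    (1, "name",  "Widget"), (1, "price", "9.99"),
    (2, "name",  "Gadget"), (2, "price", "19.99"),
])

# What should be "SELECT name, price FROM product" now needs one
# self-join per attribute -- and every query looks exactly like this:
rows = conn.execute("""
    SELECT n.value AS name, p.value AS price
    FROM eav n
    JOIN eav p ON p.entity_id = n.entity_id
    WHERE n.attribute = 'name' AND p.attribute = 'price'
    ORDER BY n.entity_id
""").fetchall()
print(rows)  # [('Widget', '9.99'), ('Gadget', '19.99')]
```

Two attributes means one self-join; twenty attributes means nineteen, all hammering the same giant table and the same index.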


Wrong Data Types


One of the goals of physical database design is to allocate just the right amount of space for data. Too little and you lose data (or customers); too much and performance suffers.  But some designers take this too far and reach for the smallest type possible, like INTEGER for a ZIP Code.  Ignoring that some postal codes have letters, this is a bad idea because ZIP Codes have leading zeros.  When you store 01234 as an INTEGER, you are storing 1234.  That means you need to do text manipulation to find data via postal code, and you need to "fix" the data to display it.
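You can watch the data loss happen in a couple of lines (a Python illustration of the same trap):

```python
# ZIP-code-as-INTEGER: the leading zero is gone the moment the value
# becomes a number.
zip_as_text = "01234"
zip_as_int = int(zip_as_text)

print(zip_as_int)   # 1234 -- five digits in, four digits out

# Every display and lookup now needs a text "fix", and it only works
# because we happen to know the intended width:
repaired = str(zip_as_int).zfill(5)
print(repaired)     # 01234
```

And that zero-padding repair falls apart the moment a Canadian postal code like M5V 1J1 shows up in the same column.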


Making Your Application Do the Hard Parts

It's common to see solutions architected to do all the data integrity and consistency checks in the application code instead of in the database.  Referential integrity (foreign key constraints), check constraints, and other database features are ignored and instead hundreds of thousands of lines of code are used to ensure these data quality features. This inevitably leads to data quality problems.  However, the worst thing is that these often lead to performance issues, too, and most developers have no idea why.
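For contrast, here's a minimal sketch of letting the database do the hard parts (SQLite syntax, with hypothetical customer and orders tables): a foreign key and a CHECK constraint reject bad rows no matter which application forgot to validate them.

```python
# Integrity enforced in the database, not the application: a foreign key
# plus a CHECK constraint, sketched with SQLite.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires opting in
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        total       REAL    NOT NULL CHECK (total >= 0)
    );
""")
conn.execute("INSERT INTO customer (customer_id) VALUES (1)")
conn.execute("INSERT INTO orders (customer_id, total) VALUES (1, 9.99)")  # fine

# Both of these fail in the engine, regardless of what the app code does:
try:
    conn.execute("INSERT INTO orders (customer_id, total) VALUES (99, 5.0)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)  # foreign key violation: no customer 99
try:
    conn.execute("INSERT INTO orders (customer_id, total) VALUES (1, -5.0)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)  # check violation: negative total
```

A few lines of declarative DDL replace what would otherwise be validation code repeated in every application that touches these tables.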

Why Do We Care?


While most of the sample tricks above are the responsibility of the database designer, the Accidental DBA should care because:


  • DBAs are on-call, not the designers
  • If there are Accidental DBAs, it's likely there are Accidental Database Designers
  • While recovery is job number one, all the other jobs involve actually getting the right data to business users
  • Making bad data move around faster isn't actually helping the business
  • Making bad data move around slower never helps the business
  • Keeping your bosses out of jail is still in your job description, even if they didn't write it down


But the most important reason production DBAs should care about this is that relational database engines are optimized to work a specific way - with relational database structures. When you build that fancy key-value structure for all your data, the optimizer has no idea how to handle all the different types of data. All your query tuning tricks won't help, because all the queries will look the same. Most of your data values will have to live in the same index. Your tables will be enormous, and full table scans will be very common. This means you, as the DBA, will be getting a lot of 3 AM calls. I hope you are ready.


When applications try to do data integrity checks, they are going to miss some. A database engine is optimized to do integrity checks quickly and completely; your developers' code may not. This means data is going to be mangled, and end users will lose confidence in the systems. The system may even harm customers or produce conflicting financial results. Downstream systems won't be able to accept the bad data. You will be getting a lot of 3 AM phone calls as integrations fail.


Incorrect data types will lead to running out of space for bigger values, slower performance as text manipulation must happen to process the data, and less confidence in data quality.  You will be getting a lot of 3 AM and 3 PM phone calls from self-serve end users.


In other words, doing tricky things with your database is tricky, and it often makes things much worse than you anticipate.


In Thwack Camp today, sqlrockstar Thomas and Kevin will be covering the mechanics of databases and how to think about troubleshooting all those 3 AM alerts.  While you are attending, I'd like you to also think about how design issues might have contributed to that phone call.  Database design and database configurations are both important.  A great DBA, accidental or not, understands how all these choices impact performance and data integrity.


Some tricks are proper responses to unique design needs. But when I see many of them, or overuse of a single trick, I know there will be lots and lots of alerts in some poor DBA's future. Take steps to ensure a good design lets you get more sleep. Let the database engine do what it is meant to do.




In my previous post, I wrote about becoming an Accidental DBA, whether or not you hold that title formally. I described the things a Minimalist DBA should focus on before jumping into performance tuning or renaming that table with the horribly long name (RETAIL_TRANSACTION_LINE_ITEM_MODIFIER_EVENT_REASON, I <3 you). In today's post, I want to cover how you, personally, should go about prioritizing your work as a new Accidental DBA.


Most accidental DBAs perform firefighter-like roles: find the fire, put it out, then rush off to the next fire and fight that one too, often without the tools and help they need to prevent fires in the first place. Firefighting jobs are tough and exhausting, even in IT. But these DBAs never allocate time to prevent fires, to maintain their shiny red fire trucks, or to practice sliding down that fire pole.


How to Prioritize your Accidental DBA Work


  1. Establish a good rule of thumb for how decisions are going to be made. On a recent project of mine, due to business priorities and the unique type of business, we settled on customer retention, legal compliance, and application flexibility as our priorities. Keep our customers, keep our CIO out of jail, and stay in business. Those may sound generic, but I've worked in businesses where customer retention was not the number one priority. In this business, which was membership and subscription based, we could not afford to lose customers over system issues. Legal was there to keep our CIO and CEO out of jail (that's what ROI stands for: Risk of Incarceration). Application flexibility was third because the whole reason for the project was to enable the business innovation needed to save the company.

    Once you have these business priorities, you can make technical and architectural decisions in that context. Customer retention sounds like a customer service issue, but it's a technical one as well. If the system is down, customers cannot be customers. If their data is wrong, they can't be customers. If their data is lost, they can't be customers. And so on. Every decision we made reflected back to those priorities first.

  2. Prioritize the databases and systems. Sure, all systems are important. But they have a priority based on business needs. Your core selling systems, whatever they might be, are usually very high priority, as are things like payroll and accounting. But maybe that system that tracks whether employees want a processed ham or a chunk of processed cheese for the holidays isn't that high on the list. This list should already exist, at least in someone's head. There might even be an auditor's report that says if System ABC's security and reliability issues aren't fixed, someone is going to jail. So I've heard. And experienced.

  3. Automate away the pain…and the stupid. The best way to honor those priorities is to automate all the things. In most cases, when an organization doesn't have experienced or dedicated DBAs, its data processes are mostly manual, mostly reactive, and mostly painful. This is the effect of not having enough knowledge or time to develop, test, and deploy enterprise-class tools and scripts. I understand this is the most difficult set of tasks to prioritize when all the databases are burning down around you. Yes, you must fight the fires, but you must also put a priority on fire reduction. Otherwise you'll just be fighting bigger and more painful fires.

    Recovery is the most important way we fight data fires. No amount of performance tuning, index optimization, or wizard running will bring back lost data. If backups are manual, or automated but never tested, or restores are manual-only, you have a fire waiting to happen. Head Geek Tom LaRock sqlrockstar says that "recovery is Job #1 for a DBA." It certainly is. A great DBA automates all backups and recovery. If you are recovering manually, you are doing it wrong.

    Other places where you want automation are monitoring and alerting. You want to know something is going on before anyone smells smoke, not when users are telling you the database is missing. If your hard drive is running out of space, it is generally much faster to provision more resources or free up space than it is to recover a completely down system. Eventually you'll want to get to the point where many of these issues are handled automatically. In fact, that's why they invented cloud computing.
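The "automate and test your restores" advice above can start very small. Here's an illustrative toy using Python's sqlite3 online-backup API and an invented invoices table - a sketch of the idea, not a production script:

```python
import sqlite3
import tempfile
import os

def backup_and_verify(source_path, backup_path):
    """Back up a SQLite database, then prove the copy restores and queries."""
    src = sqlite3.connect(source_path)
    dst = sqlite3.connect(backup_path)
    src.backup(dst)  # online backup API built into sqlite3
    src.close()
    dst.close()

    # A backup you have never restored is just a hope, not a backup.
    check = sqlite3.connect(backup_path)
    (count,) = check.execute("SELECT COUNT(*) FROM invoices").fetchone()
    check.close()
    return count

# Demo with a throwaway database (the 'invoices' table is invented).
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "prod.db")
backup = os.path.join(workdir, "prod.bak")

conn = sqlite3.connect(source)
conn.execute("CREATE TABLE invoices (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO invoices VALUES (?)", [(1,), (2,), (3,)])
conn.commit()
conn.close()

count = backup_and_verify(source, backup)
print(count)  # 3 rows survived the round trip
```

Your real script will use your DBMS's native backup tooling, but the shape is the same: back up, restore somewhere safe, query the copy, alert if the numbers don't match.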

Get Going, the Alarm Bell is Ringing!


Become the Best DBA: A Lazy DBA. Lazy DBAs automate the stuff out of everything, and lazy DBAs know that automation keeps databases from burning. They automate away the dumb mistakes that happen when the system is down, they automate test restores, and they automate away the pain of not knowing they missed setting a parameter before they hit ENTER. They know how to get to the fire, they know what to do, and they fix it.

The Best DBAs know when databases are getting into trouble, long before they start burning down.

The Best DBAs don't panic. They have a plan, they have tools, they have scripts. When smoke starts coming out of the database, they are there, ready to fight that fire. They are ready because they've written things down. They've trained. They've practiced. How many clicks would it take you to restore 10 databases? Would you need to hit up Google first to find out how to do a point-in-time restore? Do you know the syntax, and the order in which systems have to be restored? Who are the other people you'd have to work with to put out this fire?


As a new DBA, you should be working on automation every day, until all that work frees up so much of your time you can work on performance tuning, proper database design, and keeping your database fire truck shiny.




Congratulations! You are our new DBA!


Bad news: You are our new DBA!


I'm betting you got here by being really great at what you do in another part of IT. Likely you are a fantastic developer. Or a data modeler. Or a sysadmin. Or a networking guy (okay, maybe it's not likely you are one of those…but that's another post). Maybe you knew a bit about databases from working with the data in them, or because you had to install and deploy DBMSs. Then the regular DBA left. Or he is overwhelmed with exploding databases and needs help. Or got sent to prison (true story for one of my accidental DBA roles). I like to say that the previous DBA "won the lottery" because that's more positive than LEFT THIS WONDERFUL JOB BEHIND FOR A LIFE OF CRIME. Right?


I love writing about this topic because it's a role I have to play from time to time, too. I know about designing databases; can I help with installing, managing, and supporting them? Yes. For a while.


Anyway, now you have a lot more responsibility than just writing queries or installing Oracle a hundred times a week. So what sorts of things must a new accidental DBA know to be a great data professional? Most people want to jump right into performance tuning all those slow databases, right? Well, that's not what you should focus on first.


The Minimalist DBA


  1. Inventory: Know what you are supposed to be managing. Often when I step in to fill this role, I have to support more servers and instances than anyone realized were being used. I need to know what's out there to understand what I'm going to get a 3 AM call about. And I want to know that before the 3 AM call.
  2. Recovery: Know where the backups are, how to get to them, and how to do test restores. You don't want that 3 AM call to result in you having to call others to find out where the backups are. Or to find out that there are no backups, really. Or that they are actually backups of the same corrupt database you are trying to fix. Test that restore process. Script it. Test the script. Often. I'd likely find one backup and attempt to restore it on my first day on the job. I want to know about any issues with backups right away.
  3. Monitor and Baseline: You need to know BEFORE 3 AM that a database is having problems. In fact, you just don't want any 3 AM notifications. The way you do that is by knowing not only what is happening right now, but also what was happening last week and last month. You'll want to know about performance trends, downtime, deadlocks, slow queries, etc. You'll want to set up the right types of alerts, too.
  4. Security: Everyone knows that ROI stands for return on investment. But it also stands for risk of incarceration. I bet you think your only job is to keep that database humming. Well, your other job is to keep your CIO out of jail. And the CEO. Your job is to love and protect the data. You'll want to check how sensitive data is encrypted, where the keys are managed, and how other security features are configured. You'll want to check who and what has access to the data and how that access is implemented. While you are at it, check how the backups are secured. Then check whether the databases in development and test environments are secured as well.
  5. Write stuff down: I know, I know. You're thinking "but that's not AGILE!" Actually, it is. That inventory you did is something you don't want to have to repeat. Knowing how to get to backups and how to restore them is not something you want to be figuring out at 3 AM. Even if your shop is a "we just wing it" shop, having just the right amount of modeling and documentation is critical to responding to a crisis. We need blueprints for more than just building something.
  6. Manage expectations: If you are new to being a DBA, you have plenty to learn, plenty of things to put in place, plenty of work to do.  Be certain you have communicated what things need to be done to make sure that you are spending time on the things that make the most sense.  You'll want everyone to love their data and not even have to worry that it won't be accessible or that it will be wrong.
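The monitoring item above (number 3) can start as small as a single check. Here's a hypothetical sketch of a free-disk-space alert in Python; the 10% threshold, the path, and the alert wording are all invented placeholders for whatever your environment needs:

```python
import shutil

# Hypothetical threshold: alert when less than 10% of the volume is free.
FREE_SPACE_FLOOR = 0.10

def check_disk(path="/"):
    """Return an OK/ALERT status line for the volume holding `path`."""
    usage = shutil.disk_usage(path)
    free_fraction = usage.free / usage.total
    if free_fraction < FREE_SPACE_FLOOR:
        # In real life: page someone, open a ticket, grow the volume.
        return f"ALERT: {path} is down to {free_fraction:.0%} free"
    return f"OK: {path} has {free_fraction:.0%} free"

status = check_disk("/")
print(status)
```

Run it on a schedule, wire the ALERT branch to your paging tool, and you've traded one class of 3 AM surprise for a 3 PM ticket.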


These are the minimal things one needs to do right off the bat. In my next post, I'll talk about how to prioritize these and other tasks. I'd love to hear what other tasks you think should be the first to tackle when one has to jump into an accidental DBA role.
