If you've been following along, you know I started this series with Data Is Power. In my original graphic I listed QUESTIONS, DATA, ANSWERS as the information pipeline we need to keep systems humming. In Question Everything, I provided a list of questions I might ask, and a few of you jumped in with more great questions.
One comment provides the perfect segue to this week's post:
There is a mass of information, we try and teach our clients they don't need to know everything - otherwise will just be swamped and noise. Only the essentials, this saves a lot of unnecessary data - hulattp
I think the key word here is know. As a data evangelist I'm a bit biased towards making data-driven decisions. That means collecting data even before I need it. Once an incident is underway, we may not be able to understand what happened without the right data. And that's the hard part: how do you know what data to collect before you know you will need it?
Types of Data Collections
- Inventory data: The catalog of what resources our systems are using. Which servers? Databases? SANs? Other Storage? Cloud resources? How long have they been around?
- Log data: The who, when, where, how, why, to what, by what, from what data.
- External systems conditions: What else was going on? Is it year end? Are there power outages? Was a new deployment happening? Patches? New users? All kinds of things can be happening outside our systems. What is the weather doing right now (really!)?
- Internal conditions: What was/is resource utilization at a point in time? What is it normally? What is our baseline for those things? What about for today/this month/this date/this day of week? What systems have the most pain points?
That's a lot of data. Too much data for someone to know. But having that raw data lets us answer some of the questions that we collected in Question Everything.
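As a rough illustration of the "What is it normally?" question under internal conditions, here is a minimal Python sketch that builds a per-weekday baseline from historical samples and checks whether a new reading looks unusual. The function names, the sample CPU figures, and the two-standard-deviation tolerance are all assumptions made for the example, not a recommendation.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev

def weekday_baseline(samples):
    """Group (timestamp, value) samples by day of week and return
    {weekday: (mean, population std dev)}."""
    by_day = defaultdict(list)
    for ts, value in samples:
        by_day[ts.weekday()].append(value)  # Monday is 0
    return {day: (mean(vals), pstdev(vals)) for day, vals in by_day.items()}

def is_unusual(ts, value, baseline, tolerance=2.0):
    """True if value is more than `tolerance` std devs from that weekday's mean."""
    if ts.weekday() not in baseline:
        return False  # no history for this weekday, so nothing to compare against
    avg, spread = baseline[ts.weekday()]
    return abs(value - avg) > tolerance * spread

# Hypothetical CPU utilization history (percent) for four Mondays at 9 AM
history = [
    (datetime(2024, 1, 1, 9), 40.0),
    (datetime(2024, 1, 8, 9), 42.0),
    (datetime(2024, 1, 15, 9), 38.0),
    (datetime(2024, 1, 22, 9), 40.0),
]
baseline = weekday_baseline(history)
print(is_unusual(datetime(2024, 1, 29, 9), 85.0, baseline))  # far above a normal Monday
print(is_unusual(datetime(2024, 1, 29, 9), 41.0, baseline))  # within the usual range
```

The point is less the statistics than the prerequisite: without the earlier Mondays already on file, there is nothing to compare today against.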
When we are diagnosing an issue (and batting away our Pointy-Haired Boss asking "how is it going?"), having that data is going to help. Having historical data is going to help even more. If production files are missing, we can replace them with a backup. But if an automated process is deleting those files, we haven't fixed the problem. We've just entered into a whack-a-mole game with a computer. And I know who is going to win that one.
So we need to find ways to make that data work for us. But it can't do that if we aren't collecting it, and it can't do that if we rely only on data about the system as it is right now.
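To make the missing-files example concrete, here is a small Python sketch of how historical log data might point to whatever keeps deleting production files. The record layout (`when`, `who`, `action`, `what`), the actor names, and the paths are hypothetical; real logs will look different.

```python
from collections import Counter
from datetime import datetime

# Hypothetical log records; the field names are assumptions for this sketch.
log_records = [
    {"when": datetime(2024, 3, 1, 2, 0), "who": "cleanup_job",
     "action": "delete", "what": "/prod/reports/a.csv"},
    {"when": datetime(2024, 3, 1, 9, 30), "who": "alice",
     "action": "read", "what": "/prod/reports/a.csv"},
    {"when": datetime(2024, 3, 2, 2, 0), "who": "cleanup_job",
     "action": "delete", "what": "/prod/reports/b.csv"},
    {"when": datetime(2024, 3, 3, 14, 5), "who": "bob",
     "action": "delete", "what": "/tmp/scratch.txt"},
]

def deletions_by_actor(records, path_prefix):
    """Count delete events under path_prefix, grouped by who performed them."""
    return Counter(
        r["who"] for r in records
        if r["action"] == "delete" and r["what"].startswith(path_prefix)
    )

counts = deletions_by_actor(log_records, "/prod/")
print(counts.most_common())
```

With this sample data, the only actor deleting under `/prod/` is the made-up `cleanup_job`, which runs at 2 AM on consecutive days. That is the kind of pattern a restore from backup would never reveal.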
The task we are doing also affects the timeliness of the data we need. There's a huge difference in what data we need depending on whether we are doing operational or remediation work. We don't just sit down and start poring over all the data. We need to use it to solve the problems (questions) we have right now. I think of these as time zones in the data.
|Activity|Data Time Zone|
|Operations (plate spinning)|Now and future|
|Diagnostics (firefighting, restoration)|Recent and now|
|Strategic (process and governance)|Recent and past|
- Keeping the plates spinning: Our normal job, running around keeping everything going like clockwork. Keeping the plates spinning so they don't break. In these cases, we want data that is looking forward. We are looking for green across all our dashboards. We want to know if a resource is having issues (disk space, timeouts, CPU utilization, etc.). We aren't looking back at what happened last week. We won't actually have future data, but we can start predicting where problems are likely to pop up so that we can prioritize those activities.
- Firefighting: Ideally, we want to know there's a fire long before the spark, but we don't always have that luxury. We want to look at current data and recent data so that we know where to start saving systems (and sometimes even people). We aren't here to redesign the building code or architectural practices. We need to put out the fire and save production data. We need to get those plates spinning again. In database management, this might be rolling back changes, rebooting servers or restoring data. It's fixing the problem and making the data safe. We get systems back up and running. We need data to confirm we've done that. Maybe we put in place some tactical changes to head off more 3 AM calls. But we have to get up and do more plate spinning in another hour.
- Strategic responses: We can't be firefighting all day, every day. Keeping those plates spinning means having time to make strategic responses. Changing how, when, where, why, and who does things. Making improvements and keeping things going. This is where we really start mashing up the trends of the data collections. What is causing us pain and therefore user pain? What is costing the company money? What is costing your manager money?
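For the plate-spinning case, "predicting where problems are likely to pop up" can start very simply. The sketch below fits a straight line to recent disk usage and estimates how many days remain before the volume fills. The function name, the usage figures, and the 500 GB capacity are hypothetical, and a straight-line fit is only one simplistic assumption about how usage grows.

```python
def days_until_full(usage_by_day, capacity):
    """Fit a least-squares line to (day, used) points and estimate how many
    days after the last sample usage reaches capacity.
    Returns None if there is too little data or usage is flat or shrinking."""
    n = len(usage_by_day)
    if n < 2:
        return None
    xs = [day for day, _ in usage_by_day]
    ys = [used for _, used in usage_by_day]
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        return None  # all samples taken on the same day
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sxx
    if slope <= 0:
        return None  # no growth, so no projected fill date
    intercept = y_mean - slope * x_mean
    day_full = (capacity - intercept) / slope
    return max(0.0, day_full - xs[-1])

# Hypothetical disk usage in GB on days 0 through 4 for a 500 GB volume
usage = [(0, 400), (1, 410), (2, 420), (3, 430), (4, 440)]
print(days_until_full(usage, capacity=500))
```

With these made-up numbers the volume grows 10 GB a day, so the estimate is about six days of headroom. That is enough to put this disk ahead of the ones that are still green and flat.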
Questions for You
What other data time zone perspectives are there? Is there an International Date Line for data time zones? What about a Daylight Saving Time scheme for these time zones? Do these time zones vary by job title?
Next week I'll talk about how these data collections and data time zones affect how we use the data and how we consume it. In other words, how we take raw data and make it powerful.