
Geek Speak



(Zen Stones by Undeadstawa on DeviantArt)

 

Over the years, I've observed that despite running multiple element and performance management systems, most organizations still don't truly understand their IT infrastructure. In this post I'll examine how it's possible to have so much information on hand yet still have a large blind spot.

 

Discovery

 

What does discovery mean to you? For most of us I'm guessing that it involves ICMP pings, SNMP community strings, WMI, login credentials and perhaps more in an attempt to find all the manageable devices that make up our infrastructure: servers, hypervisors, storage devices, switches, routers and so forth. We spin up network management software, perhaps a storage manager, virtualization management, performance management, and finally we can sleep safely knowing that we have full visibility and alerting for our compute, storage and networking infrastructure.
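To make that concrete, here is a minimal sketch of the very first pass those tools make: sweep a subnet and note what answers. The management subnet is a hypothetical placeholder and the script assumes the Linux ping command; real discovery engines follow each responder with SNMP, WMI, or API credentials to classify what it actually is.

```python
#!/usr/bin/env python3
"""Minimal discovery sweep: ping every host in a subnet and note responders."""
import ipaddress
import subprocess
from concurrent.futures import ThreadPoolExecutor

SUBNET = "10.20.30.0/24"  # hypothetical management subnet

def is_alive(ip: str) -> bool:
    # One ICMP echo, one-second timeout (Linux ping syntax).
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def discover(subnet: str) -> list:
    hosts = [str(h) for h in ipaddress.ip_network(subnet).hosts()]
    with ThreadPoolExecutor(max_workers=32) as pool:
        alive = pool.map(is_alive, hosts)
    return [ip for ip, up in zip(hosts, alive) if up]

if __name__ == "__main__":
    for ip in discover(SUBNET):
        print(f"responds to ICMP: {ip}")
```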

 

At this point I'd argue that the infrastructure discovery is actually only about 50% complete. Why? Because the information gathered so far provides little or no data that can be used to generate a correlation between the elements. By way of an analogy you could say that at this point all of the trees have been identified, labeled and documented, but we've yet to realize that we're standing in the middle of a forest. To explain better, let's look at an example.

 

Geographical Correlation

Imagine a remote site at which we are monitoring servers, storage, printers, and network equipment. The site is connected back to the corporate network over a single WAN link, and—horrifyingly—that link is about to die. What do the monitoring systems tell us?

 

  • Network Management: I lost touch with the edge router and six switches.
  • Storage Management: I lost touch with the storage array.
  • Virtualization Management: I lost touch with these 15 VMs.
  • Performance Management: These elements (big list) are unresponsive.

 

Who monitors those systems? Do the alerts all appear in the same place, to be viewed by the same person? If not, that's the first issue, as spotting the (perhaps obvious) relationship between these events requires a meat-bag (human) to realize that if storage, compute and network all suddenly go down, there's likely a common cause. If this set of alerts went in different directions, in all likelihood the virtualization team, for example, might not be sure whether their hypervisor went down, a switch died, or something else, and they may waste time investigating all those options in an attempt to access their systems.

Centralize your alert feeds

Suppressing Alerts

If all the alerts are coming into a single place, the next problem is that the router failure almost certainly generated a flood of alerts at the same time. Looking at it holistically, it's pretty obvious that the real alert should be the loss of a WAN link; everything else is a consequence of losing the site's only link to the corporate network. Personally, in that situation I'd ideally like the alert to look like this:

 

2016/07/28 01:02:03.123 CRITICAL: WAN Node <a.b.c.d> is down. Other affected downstream elements include (list of everything else).

 

This isn't a new idea by any means; alert suppression based on site association is something we should all strive to achieve, yet so many of us fail to do so. One of the biggest challenges with alert monitoring is being overwhelmed by a large number of messages, where a poor signal-to-noise ratio makes it impossible to see the important information. This is a topic I'll come back to, but for now let's assume it's a necessary evil.
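For the curious, here is a minimal sketch of what that suppression logic can look like. The site-to-WAN-node mapping and the message format are hypothetical stand-ins; a real implementation would pull topology from the monitoring platform rather than a hard-coded dictionary.

```python
#!/usr/bin/env python3
"""Site-based alert suppression: if a site's only WAN node is down,
roll every other alert from that site up into one critical event."""
from collections import defaultdict
from datetime import datetime

# Hypothetical topology: each element is tagged with a site, and each site
# has exactly one WAN edge node.
SITE_OF = {"rtr-edge-01": "branch-42", "sw-01": "branch-42", "san-01": "branch-42",
           "vm-web-07": "branch-42", "sw-hq-01": "hq"}
WAN_NODE = {"branch-42": "rtr-edge-01", "hq": "rtr-hq-01"}

def correlate(down_nodes: list) -> list:
    by_site = defaultdict(list)
    for node in down_nodes:
        by_site[SITE_OF.get(node, "unknown")].append(node)

    messages = []
    for site, nodes in by_site.items():
        wan = WAN_NODE.get(site)
        if wan in nodes:
            others = sorted(n for n in nodes if n != wan)
            messages.append(
                f"{datetime.now():%Y/%m/%d %H:%M:%S} CRITICAL: WAN node {wan} "
                f"({site}) is down. Suppressed downstream elements: {', '.join(others)}"
            )
        else:
            messages.extend(f"WARNING: {n} is down" for n in nodes)
    return messages

if __name__ == "__main__":
    for line in correlate(["sw-01", "san-01", "vm-web-07", "rtr-edge-01"]):
        print(line)
```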

Suppress unnecessary alert noise

Always On The Move

In addition to receiving several hundred alerts from the devices impacted by the WAN failure, now it seems the application team is troubleshooting an issue with the e-commerce servers. The servers themselves seem fine, but the user-facing web site is generating an error when trying to populate shipping costs during the checkout process. For some reason the call to the server calculating shipping costs isn't able to connect, which is odd because it's based in the same datacenter as the web servers.

 

The security team is called in and begins running a trace on the firewall, only to confirm that the firewall is correctly permitting a session from the e-commerce server to an internal address on port tcp/5432 (postgres).

 

The network team is called in to find out why the TCP session to shipsrv01.ecomm.myco.corp is not establishing through the firewall, and they confirm that the server doesn't seem to respond to ping. Twenty minutes later, somebody finally notices that the IP returned for shipsrv01.ecomm.myco.corp is not in the local datacenter. Five minutes after that, the new IP is identified as belonging to the site that just went down; it looks like somebody had moved the VM to a hypervisor in the remote site, presumably by mistake, while trying to balance resources across the servers in the data center. Nobody realized that the e-commerce site had a dependency on a shipping service that was now located in a remote site, so nobody associated the WAN outage with the e-commerce issue. Crazy. How was anybody supposed to have known that?
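That twenty-five-minute hunt is exactly the kind of check that is trivial to automate once you know to look for it. Here is a small sketch using only the Python standard library; the subnets are hypothetical placeholders for "the ranges we expect each site to own."

```python
#!/usr/bin/env python3
"""Quick sanity check: does a dependency still resolve to the site we expect?"""
import ipaddress
import socket

# Hypothetical site-to-subnet mapping.
EXPECTED_SUBNETS = {
    "primary-dc": ipaddress.ip_network("10.10.0.0/16"),
    "branch-42": ipaddress.ip_network("10.42.0.0/16"),
}

def locate(hostname: str) -> str:
    ip = ipaddress.ip_address(socket.gethostbyname(hostname))
    for site, net in EXPECTED_SUBNETS.items():
        if ip in net:
            return f"{hostname} -> {ip} ({site})"
    return f"{hostname} -> {ip} (no known site!)"

if __name__ == "__main__":
    print(locate("shipsrv01.ecomm.myco.corp"))
```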

 

It seems that despite having all those management systems I'm still a way from having true knowledge of my infrastructure. When I post next, I'll look at some of the things I'd want to do in order to get a better and more holistic view of my network so that I can embrace the inner peace I desire so much.


The Actuator - July 27th

Posted by sqlrockstar Employee Jul 27, 2016

Just when you thought 2016 couldn't get crazier you wake up to find that Verizon has bought Yahoo and that you are more interested in reading about the drone that delivered a Slurpee. Welcome to my world.

 

Here are the items I found most amusing from around the Internet. Enjoy!

 

Verizon to Purchase Yahoo’s Core Business for $4.8 Billion

I'm shocked Yahoo is worth even that much. I'm also hoping that someone will give me $57 million to do nothing.

 

Canadian Football League Becomes First Pro Football Organization To Use Sideline Video During Games

As our technology advances at an ever-increasing pace and is applied in new situations, it is up to someone in IT to make it all work. It's all about the data, folks, as data is the most valuable asset any company (or team) can own.

 

Nearly Half of All Corporate Data is Out of IT Department’s Control

Honestly, I think that number is much higher.

 

GOP delegates suckered into connecting to insecure Wi-Fi hotspots

I am certain the GOP leaders were tech savvy enough not to fall for this trick, right?

 

Snowden Designs a Device to Warn if Your iPhone’s Radios Are Snitching

Showing what he's been doing with his free time while living in exile, Snowden reveals how our phones have been betraying us for years.

 

Status Report: 7 'Star Trek' Technologies Under Development

With the release of the new Star Trek movie last week I felt the need to share at least one Star Trek link. But don't get your hopes up for warp drive or transporters anytime soon.

 

I wanna go fast: HTTPS' massive speed advantage

"If you wanna go fast, serve content over HTTPS using HTTP/2."

 

Watch The First Slurpee Delivery By Drone

Because who doesn't love a Slurpee in the summertime?

 

Meanwhile, in Redmond:


I need to deep fry a turbaconducken.

 

This isn't a want, no. This is a primal need of mine.

 

I feel so strongly about this that it's on my bucket list. It is positioned right below hiring two private investigators to follow each other, and right above building an igloo with the Inuit.

 

Deep frying a turkey is a dangerous task. You can burn your house down if you are not careful. Why take the risk? Because the end result, a crispy-juicy turkey bathed in hot oil for 45 minutes, is worth the effort. Or so I've been told. Like I said, it's on my bucket list.

 

Being the good data professional that I am, I started planning out how to prepare for the day that I do, indeed, deep fry my own turkey. As I laid out my plans, it struck me that there was a lot of similarity between an exploding turkey and the typical "database is on fire" emergency many of us know all too well.

 

So here's my list for you to follow for any emergency, from exploding turkeys to databases catching fire and everything in between. You're welcome.

 

Don't Panic

 

People who panic are the same people who are not prepared. A little bit of planning and preparation go a long way to helping you avoid "panic mode" in any emergency situation. Whenever I see someone panicking (like ripping out all their network cables just because their mouse isn't working) it is a sure sign that they have little to no practical experience with the situation at hand.

 

Planning will keep you from feeling the need to panic. If your database is on fire, you can recover from backups, because you prepared for such a need. And if your turkey explodes, you can always go to a restaurant for a meal.

 

Rely on all your practice and training (you have practiced this before, right?). Emergency response personnel train in close-to-real-life situations, often. In fact, firefighters even pay people to burn down their spare barns.

 

Go to your checklist...you do have a checklist, right? And a process to follow? If not you may find yourself in a pile of rubble, covered in glitter.

 

Assess the Situation

 

Since you aren't panicking you are able to calmly assess the situation. A turkey on fire inside your oven would require a different response than a turkey that explodes in a fireball on your deck and is currently burning the side of your house. Likewise, an issue with your database that affects all users will require a different set of troubleshooting steps than an issue affecting only some users or queries.

 

In order to do a proper assessment of the situation you will be actively gathering data. For database servers you are likely employing some type of monitoring and logging tools. For turkeys, it's likely a thermometer to make certain it has completely thawed before you drop it into the hot oil.

 

You also need to know your final goal. Perhaps your goal is to stop your house from being engulfed in flames. Perhaps your goal is to get the systems back up and running, even if it means you may have some data loss.

 

Not every situation is the same. That's why a proper assessment is necessary when dealing with emergencies...and you can't do that while in a panic.

 

Know Your Options

 

Your turkey just exploded after you dropped it into a deep fryer. Do you pour water on the fire quickly? Or do you use a fire extinguisher?

 

Likewise, if you are having an issue with a database server should you just start rebooting it in the hopes that it clears itself up?

 

After your initial assessment is done, you should have a handful of viable options to explore. You need to know the pros and cons of each of these options. That's where the initial planning comes in handy, too. Proper planning will reduce panic and allow you to assess the situation, and then you can understand all your viable options along with their pros and cons. See how all this works together?

 

It may help for you to phone a friend here. Sometimes talking through things can help, especially when the other person has been practicing and helping all along.

 

Don't Make Things Worse

 

Pouring water on the grease fire on your deck is going to make the fire spread more quickly. And running 17 different DBCC commands isn't likely to make your database issue any better, either.

 

Don't be the person who makes things worse. If you are able to calmly assess the situation, and you know your options well, then you should be able to make an informed decision that doesn't make things worse. Also, don’t focus on blame right now; that will come later. If you focus on fault, you aren’t working on putting out the fire. You might as well grab a stick and some marshmallows for making s’mores while your house burns to the ground.

 

Also, a common mistake here, particularly with database issues, is trying to do many things at once. If you make multiple changes, you may never know what worked, or the changes may cancel each other out, leaving you with a system that is still offline. Know the order of the actions you want to take and do them one at a time.

 

And it wouldn't hurt you to take a backup now, before you start making changes, if you can.

 

Learn From Your Mistakes

 

Everyone makes mistakes, I don't care what their marketing department may tell you. Making mistakes isn't as big of a concern as not learning from your mistakes. If you burned your house down the past two Thanksgivings, don't expect a lot of people showing up for dinner this year.

 

Document what you’ve done, even if it is just a voice recording.  You might not remember all the details afterwards, so take time to document events while they are still fresh in your memory.

 

Review the events with others and gather feedback along the way as to how things could have been better or avoided. Be open to criticism, too. There's a chance the blame could be yours. If that's the case, accept that you are human and lay out a training plan that will help you to avoid making the same mistake in the future.

 

I'm thankful that my database server isn't on fire. But if it was, I know I'd be prepared.

 


Many agencies are already practicing excellent cyber hygiene; others are still in implementation phases. Regardless of where you are in the process, it is critical to understand that security is not a one-product solution. Having a solid security posture requires a broad range of products, processes and procedures.

 

Networks, for example, are a critical piece of the security picture; agencies must identify and react to vulnerabilities and threats in real time. You can implement automated, proactive security strategies that will increase network stability and have a profound impact on the efficiency and effectiveness of the overall security of the agency.

 

How can agencies leverage their networks to enhance security? Below are several practices you can begin to implement today, as well as some areas of caution.

 

Standardization. Standardizing network infrastructure is an often-overlooked method of enhancing network performance and security.

 

Start by reviewing all network devices and ensure consistency across the board. Next, make sure you’ve got multiple, well-defined networks. Greater segmentation will provide two benefits: greater security, as access will not necessarily be granted across each unique segment, and greater ability to standardize, as segments can mimic one another to provide enhanced control.

 

Change management. Good change management practices go a long way toward enhanced security. Specifically, software that requires a minimum of two unique approvals before changes can be implemented can prevent unauthorized changes. In addition, make sure you fully understand the effect changes will have across the infrastructure before granting approval.

 

Configuration database. It’s important to have a configuration database for backups, disaster recovery, etc. If you have a device failure, being able to recover quickly can be critical; implementing a software setup that can do this automatically can dramatically reduce security risks. Another security advantage of a configuration database is the ability to scan for security-policy compliance.
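As a rough illustration of the "recover quickly" piece, here is a minimal nightly config-backup sketch. It assumes Cisco IOS devices and the netmiko library, with a hard-coded device list standing in for whatever inventory or CMDB you would really pull from; commercial configuration-management tools layer versioning, diffing, and compliance scans on top of this same basic loop.

```python
#!/usr/bin/env python3
"""Minimal nightly running-config backup for a list of network devices."""
from datetime import date
from pathlib import Path

from netmiko import ConnectHandler  # pip install netmiko

# Hypothetical inventory; in practice, pull this from your CMDB or NCM tool.
DEVICES = [
    {"device_type": "cisco_ios", "host": "10.0.0.1", "username": "backup", "password": "secret"},
    {"device_type": "cisco_ios", "host": "10.0.0.2", "username": "backup", "password": "secret"},
]
BACKUP_DIR = Path("config-backups") / date.today().isoformat()

def backup_all() -> None:
    BACKUP_DIR.mkdir(parents=True, exist_ok=True)
    for device in DEVICES:
        conn = ConnectHandler(**device)                       # SSH to the device
        running_config = conn.send_command("show running-config")
        conn.disconnect()
        (BACKUP_DIR / f"{device['host']}.cfg").write_text(running_config)
        print(f"backed up {device['host']}")

if __name__ == "__main__":
    backup_all()
```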

 

Compliance awareness. Compliance can be a complicated business. Consider using a tool that automates vulnerability scanning and FISMA/DISA STIG compliance assessments. Even better? A tool that also automatically sends alerts of new risks by tying into the NIST NVD, then checking that information against your own configuration database.
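To show how that NVD tie-in might start, here is a hedged sketch that pulls recent CVEs matching keywords from a hypothetical inventory. It assumes the NIST NVD 2.0 REST API endpoint and response fields as I understand them; verify both against the current NVD documentation before relying on this.

```python
#!/usr/bin/env python3
"""Match recent NVD CVE entries against a simple platform inventory."""
import requests

NVD_URL = "https://services.nvd.nist.gov/rest/json/cves/2.0"   # assumed endpoint
INVENTORY_KEYWORDS = ["cisco ios xe", "juniper junos"]          # hypothetical inventory

def recent_cves(keyword: str, limit: int = 5) -> list:
    resp = requests.get(NVD_URL,
                        params={"keywordSearch": keyword, "resultsPerPage": limit},
                        timeout=30)
    resp.raise_for_status()
    # Field names below follow the NVD 2.0 JSON schema as assumed here.
    return [item.get("cve", {}).get("id", "unknown CVE")
            for item in resp.json().get("vulnerabilities", [])]

if __name__ == "__main__":
    for keyword in INVENTORY_KEYWORDS:
        matches = recent_cves(keyword)
        print(keyword, "->", ", ".join(matches) if matches else "no recent matches")
```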

 

Areas of caution:

Most security holes are related to inattention to infrastructure. In other words, inaction can be a dangerous choice. Some examples are:

 

Old inventory. Older network devices inherently have outdated security. Invest in a solution that will inventory network devices and include end-of-life and end-of-support information. This also helps forecast costs for new devices before they quit or become a security liability.

 

Not patching. Patching and patch management is critical to security. Choose an automated patching tool to be sure you’re staying on top of this important task.

 

Unrestricted bring-your-own-device policies. Allow BYOD, but with restrictions. Separate the unsecure mobile devices on the network and closely monitor bandwidth usage so you can make changes on the fly as necessary.

 

There is no quick-and-easy solution, but tuning network security through best practices will not only enhance performance, but will also go a long way toward reducing risks and vulnerabilities.

 

Find the full article on Government Computer News.

In my previous post, I listed some best practices for help desk IT pros to follow to save time resolving issues. The responses I received from that post made me realize that the best solution for one IT organization may not necessarily be the same for another. An organization’s size, business model, functional goals, organizational structure, etc. create unique challenges for those charged with running the help desk function, and these factors directly affect IT support priorities.

 

With this knowledge in mind, I decided to take a different approach for this post. Below, I have listed some of the easy ways that help desk organizations – irrespective of their differences – can improve their help desk operations through automation to create a chaos-free IT support environment.

 

  1. Switch to centralized help desk ticketing
    Receiving help desk requests from multiple channels (email, phone, chat, etc.) and manually transferring them onto a spreadsheet creates a dispersed and haphazard help desk environment. Switching to a centralized help desk ticketing system will help you step up your game and automate the inflow of incidents and service requests.
  2. Automate ticket assignment and routing
    Managing help desk operations manually can lead to needless delays in assigning tickets to the right technician, and to potential redundancy if you happen to send the same request to multiple technicians. To avoid this, use a ticketing system that assigns tickets to technicians automatically based on their skill level, location, availability, etc. (a toy routing sketch follows this list).
  3. Integrate remote support with help desk
    With more people working remotely, traditional help desk technicians have to adapt and begin to resolve issues without face-to-face interactions. Even in office settings, IT pros tend to spend about 30% of their valuable time visiting desks to work on issues. By integrating a remote support tool into your help desk, you can resolve issues remotely, taking care of on- and off-site problems with ease.
  4. Resolve issues remotely without leaving your desk
    A recent survey by TechValidate states that 77% of surveyed help desk technicians feel that using remote support decreased their time-to-resolution of trouble tickets. Using the right remote support tool helps you easily troubleshoot performance issues and resolve complex IT glitches without even leaving your desk.
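Here is the toy routing sketch referenced in item 2. All of the fields and matching rules are hypothetical; real help desk products add escalation paths, business hours, and SLA targets on top of this basic "match skills and location, then pick the least-loaded technician" idea.

```python
#!/usr/bin/env python3
"""Toy automated ticket routing based on skill, location, and current load."""
from dataclasses import dataclass
from typing import Optional

@dataclass
class Technician:
    name: str
    skills: set
    location: str = "HQ"
    open_tickets: int = 0
    available: bool = True

@dataclass
class Ticket:
    summary: str
    category: str          # e.g. "network", "email", "hardware"
    location: str = "HQ"

def route(ticket: Ticket, technicians: list) -> Optional[Technician]:
    """Return the least-loaded available technician matching skill and location."""
    candidates = [t for t in technicians
                  if t.available and ticket.category in t.skills
                  and t.location == ticket.location]
    if not candidates:
        return None  # queue for manual assignment or escalation
    return min(candidates, key=lambda t: t.open_tickets)

if __name__ == "__main__":
    techs = [Technician("Asha", {"network", "hardware"}, "HQ", 3),
             Technician("Ben", {"email"}, "HQ", 1),
             Technician("Carla", {"network"}, "HQ", 0)]
    winner = route(Ticket("VPN down for sales", "network"), techs)
    print("assign to:", winner.name if winner else "escalation queue")
```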

 

These are some of the simple yet powerful ways that organizations can create a user-friendly help desk. Are you managing your help desk the easy way or the hard way?

 

(Infographic: Four Easy Ways to Create a Chaos-Free Help Desk)

 

To download this infographic, click here. Share your thoughts on how you reduce workload and simplify help desk support in the comments section.

"Sore throat from talking, sore feet from walking, sore face from smiling. Must be @CiscoLive." (https://twitter.com/leonadato/status/753622503512117248)

 

Before I dig in to what I saw, as well as what I think about what I saw, I have to take a moment to shout out my gratitude and thanks to the amazing SolarWinds team that assembled for this convention. In fact, I had so much shouting to do that I wrote a whole separate post about it that you can read here. I hope you'll take a moment to share in the sheer joy of being part of this team.

 

But remember to come back here when you're done.

 

Okay, you're back now? Let's dive in!

 

CLUS is About Connections

As much as you may think Cisco Live is about networking (the IT kind), technology, trends, and techniques, the reality is that much of the attraction for a show like this is in making personal connections with groups of like-minded folks. While that's true of most conventions, Cisco Live, with 27,000 attendees this year, offers a larger quantity and wider variety of people with the same range of experience, focus, and background as you. You can sit at a table full of 33-year-old voice specialists who started off as Linux® server admins. It might be a bit of a trick to find them in the sea of humanity, but they are there. Usually they are attending the same sessions you are; you just have to look around.

 

Beyond the birds-of-a-feather aspect, Cisco Live gives you a chance to gather with people who share your particular passion - whether it's for a brand, technology, or technique - and learn what's in store in the coming months.

 

And what would geek-based IT culture be if all of that social interaction didn't include some completely off-the-wall goofiness? An offhand joke last year blossomed into the full-fledged #KiltedMonday, with dozens (if not hundreds) of attendees sporting clan colors and unshaved legs to great effect.

 

Speaking of legs, many people's legs were also festooned with a riot of colors as #SocksOfCLUS also started to take hold. You might even say it got a leg up on the convention this year. (I'll be here all week, folks.)

 

 

The Rise of DevNet

During the show, news broke that cloud-based development environment Cloud9 had been acquired by Amazon Web Services® (AWS), which prompted my fellow Head Geek Patrick Hubbard to tweet:

Next, @awscloud grabs @Cloud9. One day all will be #AWS and #Azure. Learn #DevOps fellow geeks.

 

That truth was already clearly being embraced by the folks at Cisco®.

 

Over the last three Cisco Live events I've attended, the area devoted to DevNet - the network engineer-flavored take on DevOps - has grown substantially. It’s expanded from a booth or two at Cisco Live 2015, to a whole section of the floor in Berlin 2016, to a huge swath of non-vendor floor space last week. Two dozen workstations were arranged around a model train, and attendees were encouraged to figure out ways to code the environment to change the speed and direction of the train. Fun!

 

Meanwhile, three separate theaters ran classes every hour on everything from programming best practices to Python deep dive tutorials.

 

I found this much more engaging and effective than the usual statements that network engineers need to learn to code because of the pressure of a SPECIFIC technology (say, SDN). While that might be true, I much prefer that the code be presented FIRST, and then let IT pros figure out what cool things we want to do with it.


 

 

Still Figuring IT Out

It is clear that the convention is both emotionally and financially invested in the evolving trends of SDN, IoT, and cloud/hybrid IT. Vast swaths of the show floor are dedicated to these trends, and to showing off the various ways they might materialize in real, actual data centers and corporate networks.

 

But the fact is that none of those things are settled yet. Settling, perhaps. Which is fine. I don't need a convention (much less a single vendor) to announce, "And lo, it was good" about any technology that could change the landscape of infrastructures everywhere.

 

For things like SDN and IoT, Cisco Live is the place you go once a year to check in, see how the narrative has changed, and elbow the person next to you at the session or display and say, "So, are you doing anything with this? No? Me, neither.”

 

The View from the Booth

Back at Starship SolarWinds (aka our booth), the story was undeniably NetPath™. People sought us out to see it for themselves, or were sent by others (either back at the office or on the show floor) to come check it out. The entire staff demonstrated the belle of the NPM 12 ball constantly throughout the day, until the second day, when we had booth visitors start demo-ing it THEMSELVES to show OTHER visitors (who sometimes turned out to be people they didn't even know). The excitement about NetPath was that infectious.

 

We also witnessed several interactions where one visitor would assure another that the upgrade was painless. We know that hasn’t always been the case, but seeing customers brag to other customers told us that all of our due diligence on NPM 12, NCM 7.5, NTA 4.2, and SRM 6.3 (all of which came out at the beginning of June) was worth all the effort.

 

Not that this was the only conversation we had. The new monitoring for stacked switches was the feature many visitors didn't know they couldn't live without, and they left texting their staff to schedule the upgrade. The same goes for Network Insight - the AppStack-like view that gives a holistic perspective on load balancers like F5®s and the pools, pool members, and services they provide.

 

We also had a fair number of visitors who were eager to see how we could help them solve issues with automated topology mapping, methods for monitoring VDI environments, and techniques to manage the huge volume of trap and syslog messages that larger networks generate.

 

And, yes, those are all very network-centric tools, but this is Cisco Live, after all. That said, many of us did our fair share of showing off the server and application side of the house, including SAM, WPM, and the beauty that is the AppStack view.

 

Even more thrilling for the SolarWinds staff were the people who came back a second time, to tell us they had upgraded THAT NIGHT after visiting us and seeing the new features. They didn’t upgrade to fix bugs, either. They couldn’t live another minute without NetPath, Network Insight (F5 views), switch stack monitoring, NBAR2 support, binary config backup, and more.

 

We all took this as evidence that this was one of the best releases in SolarWinds history.

 

CLUS vs SWUG, the Battle Beneath the Deep

In the middle of all the pandemonium, several of us ran off to the Shark Reef to host our first ever mini-SWUG, a scaled-down version of the full-day event we've helped kick off in Columbus, Dallas, Seattle, Atlanta, and, of course, Austin.

 

Despite the shortened time frame, the group had a chance to get the behind-the-scenes story about NetPath from Chris O'Brien; to find out how to think outside the box when using SolarWinds tools from Destiny Bertucci (giving them a chance to give a hearty SWUG welcome to Destiny in her new role as SolarWinds Head Geek); and hear a detailed description of the impact NetPath has had in an actual corporate environment from guest speaker Chris Goode.

 

The SolarWinds staff welcomed the chance to have conversations in a space that didn't require top-of-our-lungs shouting, and to have some in-depth and often challenging conversations with folks that had more than a passing interest in monitoring.

 

And the attendees welcomed the chance to get the inside scoop on our new features, as well as throw out curveballs to the SolarWinds team and see if they could stump us.

 

#NoRegrets

Cisco Live was a whirlwind three days of faces, laughter, ah-ha moments, and (SRSLY!) the longest walk from my room to the show floor INSIDE THE SAME HOTEL that I have ever experienced. I returned home completely turned around and unprepared to get back to work.

 

Which I do not regret even a little. I met so many amazing people, including grizzled veterans who’d earned their healthy skepticism, newcomers who were blown away by what SolarWinds (and the convention) had to offer, and the faces behind the Twitter personas who have, over time, become legitimate friends and colleagues. All of that was worth every minute of sleep I lost while I was there.

 

But despite the hashtag above, of course I have regrets. Nobody can be everywhere at once, and a show like Cisco Live practically requires attendees to achieve a quantum state to catch everything they want to see.

  • I regret not getting out of the booth more.
  • Of course, THEN I'd regret meeting all the amazing people who stopped in to talk about our tools.
  • I regret not catching Josh Kittle's win in his Engineering Deathmatch battle.
  • I regret not making it over to Lauren Friedman's section to record my second Engineers Unplugged session.
  • And I regret not hearing every word of the incredible keynotes.

 

Some of these regrets I plan to resolve in the future. Others may be an unavoidable result of my lack of mutant super-powers allowing me to split myself into multiple copies. Which is regrettable, but Nightcrawler was always my favorite X-Man, anyway.

 

#SquadGoals for CLUS17:

Even before we departed, several of us were talking about what we intended to do at (or before) Cisco Live 2017. Top of the list for several of us was to be ready to sit for at least one certification exam at Cisco Live.

 

Of course, right after that our top goal was to learn how to pace ourselves at the next show.

 

Somehow I think one of those goals isn't going to make it.

 


 

Congratulations! You are our new DBA!

 

Bad news: You are our new DBA!

 

I'm betting you got here by being really great at what you do in another part of IT.  Likely you are a fantastic developer. Or data modeler.  Or sysadmin. Or networking guy (okay, maybe not likely you are one of those…but that's another post).  Maybe you knew a bit about databases having worked with data in them, or you knew a bit because you had to install and deploy DBMSs.  Then the regular DBA left. Or he is overwhelmed with exploding databases and needs help. Or got sent to prison (true story for one of my accidental DBA roles). I like to say that the previous DBA "won the lottery" because that's more positive than LEFT THIS WONDERFUL JOB BEHIND FOR A LIFE OF CRIME.  Right?

 

I love writing about this topic because it's a role I have to play from time to time, too.  I know about designing databases, but can I help with installing, managing, and supporting them?  Yes. For a while.

 

Anyway, now you have a lot more responsibility than just writing queries or installing Oracle a hundred times a week.  So what sorts of things must a new accidental DBA know to be a great data professional?  Most people want to get right into performance tuning all those slow databases, right?  Well, that's not what you should focus on first.

 

The Minimalist DBA

 

  1. Inventory: Know what you are supposed to be managing.  Often when I step in to fill this role, I have to support more servers and instances than anyone realized were being used.  I need to know what's out there to understand what I'm going to get a 3 AM call for.  And I want to know that before that 3 AM call.
  2. Recovery: Know where the backups are, how to get to them, and how to do test restores. You don't want that 3 AM call to result in you having to call others to find out where the backups are. Or to find out that there are no backups, really.  Or that they actually are backups of the same corrupt database you are trying to fix.  Test that restore process.  Script it.  Test the script.  Often (a minimal verification sketch follows this list).  I'd likely find one backup and attempt to restore it on my first day of the job.  I want to know about any issues with backups right away.
  3. Monitor and Baseline: You need to know BEFORE 3 AM that a database is having a problem. In fact, you just don't want any 3 AM notifications.  The way you do that is by ensuring you know not only what is happening right now, but also what was happening last week and last month.  You'll want to know about performance trends, downtime, deadlocks, slow queries, etc.  You'll want to set up the right types of alerts, too.
  4. Security: Everyone knows that ROI stands for return on investment.  But it also stands for risk of incarceration.  I bet you think your only job is to keep that database humming.  Well, your other job is to keep your CIO out of jail.  And the CEO.  Your job is to love and protect the data.  You'll want to check to see how sensitive data is encrypted, where the keys are managed and how other security features are managed.  You'll want to check to see who and what has access to the data and how that access is implemented.  While you are at it, check to see how the backups are secured.  Then check to see if the databases in Development and Test environments are secured as well.
  5. Write stuff down: I know, I know.  You're thinking "but that's not AGILE!"  Actually, it is.  That inventory you did is something you don't want to have to repeat.  Knowing how to get to backups and how to restore them is not something you want to be tackling at 3 AM.  Even if your shop is a "we just wing it" shop, having just the right amount of modeling and documentation is critical to responding to a crisis.  We need the blueprints for more than just building something.
  6. Manage expectations: If you are new to being a DBA, you have plenty to learn, plenty of things to put in place, plenty of work to do.  Be certain you have communicated what things need to be done to make sure that you are spending time on the things that make the most sense.  You'll want everyone to love their data and not even have to worry that it won't be accessible or that it will be wrong.
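As promised in item 2, here is a minimal sketch of a scripted backup check. It assumes SQL Server, the pyodbc driver, and a hypothetical backup share; note that RESTORE VERIFYONLY only proves the backup file is readable, so a periodic full restore to a scratch instance is still the real test.

```python
#!/usr/bin/env python3
"""Run RESTORE VERIFYONLY against every .bak file on a backup share."""
import glob

import pyodbc  # pip install pyodbc

CONN_STR = ("DRIVER={ODBC Driver 17 for SQL Server};"
            "SERVER=restore-test-01;Trusted_Connection=yes")  # hypothetical test server
BACKUP_GLOB = r"\\backupshare\sql\*.bak"                      # hypothetical backup share

def verify_backups() -> None:
    # RESTORE statements cannot run inside a transaction, hence autocommit.
    conn = pyodbc.connect(CONN_STR, autocommit=True)
    cursor = conn.cursor()
    for backup in glob.glob(BACKUP_GLOB):
        try:
            cursor.execute(f"RESTORE VERIFYONLY FROM DISK = N'{backup}'")
            print(f"OK      {backup}")
        except pyodbc.Error as err:
            print(f"FAILED  {backup}: {err}")
    conn.close()

if __name__ == "__main__":
    verify_backups()
```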

 

These are the minimal things one needs to do right off the bat.  In my next post, I'll be talking about how to prioritize these and other tasks.  I'd love to hear about what other tasks you think should be the first things to tackle when one has to jump into an accidental DBA role.


The Actuator - July 20th

Posted by sqlrockstar Employee Jul 20, 2016

I'm back from the family London vacation and ready to get to work. I was unplugged for much of last week, focusing on the world around me. What I found was a LOT of healthy discussions about #Brexit, David Cameron leaving, and if you can eat a Scotch egg cold (HINT: You can, but you shouldn't, no matter what the clerk at Harrod's tells you.)

 

With almost 2,000 unread items in my RSS feeder I had lots of material to root through looking for some links to share with you this week. Here are the ones I found most amusing from around the Internet. Enjoy!

 

PokemonGO and Star Trek TNG's 'The Game'

Happy to see I wasn't the only one who made this connection, but what I'd like next is for someone to make an augmented reality game for finding bottlenecks in the data center.

 

Security Is from Mars, Application Delivery Is from Venus

I liked the spin this article took on the original book theme. Looking forward to the follow-up post where the author applies the business concepts of cost, benefits, and risk to their marriage.

 

Microsoft Wins Landmark Email Privacy Case

Reversing a decision from 2014 where the US government thought it was OK to get at data stored outside our borders, this ruling is more in line with current technology advances and, let's be honest, common sense.

 

How boobytrapped printers have been able to infect Windows PCs for over 20 years

I seem to recall this being a known issue for some time now, so I was shocked to see that the patch was only just released.

 

6 Workplace Rules that Drive Everyone Crazy

"Underwear must be worn at all times." Good to know, I guess, but is this really a problem for an office somewhere?

 

Microsoft Apologizes for Inviting "Bae" Interns to Night of "Getting Lit" in Lame Letter

Another public misstep for Microsoft with regard to social values. This letter is more than just poorly worded; it underlines what must be considered acceptable behavior for Microsoft employees.

 

High traffic hits the operations team

The Olympics are just around the corner!

 

Wonderful view for King Charles I last week, looking from Trafalgar Square down to Big Ben glowing from the setting sun:


As network engineers, administrators, architects, and enthusiasts, we are seeing a trend of relatively complicated devices that all strive to provide unparalleled visibility into the inner workings of applications or security. Inherent in these solutions is a level of complexity that challenges network monitoring tools; it seems that in many cases vendors are pitching proprietary tools that are capable of extracting the maximum amount of data out of a specific box. Just this afternoon I sat on a vendor call with a customer, doing a technical deep dive on a next-generation firewall with a very robust feature set. Inevitably the pitch was made to consider a manager of managers that could consolidate all of this data into one location. While valuable in its own right for visibility, this perpetuates the problem of many “single panes of glass”.

 

I couldn’t help but think that what we really need is the ability to follow certain threads of information across many boxes, regardless of manufacturer—these threads could be things like application performance or flows, security policies, etc. Standards-based protocols, and vendors that are open to working with others, are ideal because they foster the creation of ecosystems. Automation and orchestration tools offer this promise, but they add additional layers of intricacy: the need to know scripting languages, a willingness to work with open source platforms, and so on.

 

Additionally, any time we abstract or simplify a layer, we lose something in the process—call it generation loss. Compounding this loss across many devices or layers of management tends to result in data that is incomplete or, worse, inaccurate, yet this is the data that we intend to use to make our decisions.

 

Is it really too much to ask for simple and accurate? I believe this is where the art of simplicity comes into play. The challenge of creating an environment in which the simple is useful and obtainable requires creativity, attention to detail, and an understanding that no two environments are identical. In creating this environment, it is important to address what exactly will be made simple and by what means. With a clear understanding of the goals in mind, I believe it is possible to achieve them, but the decisions on equipment, management systems, vendors, partners, etc. need to be well thought through, and the right amount of time and effort must be dedicated to them.

IT pros face a near-constant deluge of trouble tickets every day, which leaves very little time to analyze where the workday actually goes. This post gives you a glimpse into a critical part of a day in the life of a support professional, where service request management and resolution take place.


 

I JUST FIX. I DON’T FORMULATE.

There are small organizations where ticketing management is pretty much nonexistent. In this instance, the IT admin ends up juggling multiple requests received through various disparate channels (phone, email, chat, in-person requests, etc.), trying to multitask and solve them all. This may sound superhuman, but everyone who’s been there and done that knows it’s extremely time-consuming and difficult. Without a system in place for managing and tracking these service requests, it takes a ridiculous amount of time to simply prioritize and tackle all the tickets at hand. At the end of the day, it’s just grappling with SLA delays, incomplete service fulfilment, dealing with irate customers, and being that overly-stretched IT pro who is lost in a maze of uncategorized ticket anarchy. This, of course, leads to technician and customer dissatisfaction in most cases.

 

Without proper IT service management processes and techniques in place, hiring additional staff to assist won’t help much. They will still get swamped with tickets and end up having to put out fires all over the place, too.

 

Where time is lost:

  • Incident management
  • Service request tracking
  • Ticket prioritization and categorization
  • Technician assignment and escalation
  • Communication and follow-up with end-users via multiple channels

 

VISIT-AND-ASSIST SUPPORT IS PRETTY MUCH MY DAY

Even if you have proper ticketing management practices in place, you could still be wasting time on actual problem resolution if there isn’t technology assistance for the support staff. When handling Level 2 and Level 3 support, all while attending to desktop support requests, IT pros usually lose a lot of time visiting end-user workstations and resolving issues there. This has been the traditional IT way, and it’s fairly simple in small companies with few employees. But, as your network and user base grows, your visit-and-assist method will prove less productive. When support agents aren’t equipped with the right tools for remote troubleshooting, you can definitely expect less productivity, specifically with the number of tickets closed per day.

Instead, consider implementing a self-service system to reduce the more frustrating requests to unlock end-users' accounts or reset their passwords. In these cases, even the least tech-savvy user can access a centralized knowledge management and self-service portal that empowers them to help themselves, freeing the IT pro to spend time on more important tickets.

 

Where time is lost:

  • Physically visiting end-user desks to resolve issues.
  • Repeatedly fixing simple and recurring tickets.

 

Time is definitely of the essence for the support department. Until we understand the value of process and technology, and implement steps to enhance both, we will continue to be burdened with productivity issues (for technicians), satisfaction issues (for customers), or worse, end up wasting time.

 

Help desk and remote support tools address both of these challenges. For smaller organizations where budgetary constraints compound the issues of time management, simple and cost-effective help desk and remote support tools can spare the budget while optimizing time for both effective service management and efficient resolution.

Application performance management continues to be so hard because applications are becoming increasingly complex, relying on multiple third-party components and services. These are all added to the already-complicated modern application delivery chain, which includes the applications and the backend IT infrastructure that supports them—plus all the software, middleware, and extended infrastructure required for performance. This modern application stack, or AppStack, is often managed the old-fashioned way — in silos, without holistic visibility of the entire stack at once.

 

When the cloud gets added to the mix, none of the traditional AppStack goes away, but it’s no longer managed by the hosting agency in an on-premises datacenter. Nevertheless, IT is still accountable for application performance. So, what can an application administrator do to ensure consistent top performance in a hybrid cloud environment? Here are a few things to consider:

 

Add a step. In a hybrid cloud system, instead of trying to pinpoint where and what the problem is, the first step is quickly and definitively determining who owns the problem -- IT or the cloud vendor.

 

Get involved in the development phase. Cloud monitoring and management decisions need to be made when the hybrid cloud environment is being created. Don’t get stuck trying to manage applications that weren’t designed for the cloud. By getting involved in the development phase, administrators can ensure they have control over application performance.

 

Manage cloud and on-premises performance data to determine the root of the application issue. IT needs to determine whether the application performance issue is with the software or the configuration. To achieve full visibility into the performance issue, a monitoring system must be in place to definitively determine where the problem lies.

 

Plan for the worst-case scenario to prevent it from happening. Early on, administrators should think through and plan for the worst-case scenarios in a hybrid cloud environment to spot problems before they arise and to be prepared should they actually occur. Make sure, for example, that critical systems can failover to standby systems and data is backed-up and archived according to agency policies.

 

And let’s not forget the network’s role. Application delivery is only as good as the network, and cloud or hybrid apps need to traverse a path across your network to the cloud provider and back. Visibility of the network path can assist in troubleshooting.

 

Every technological shift comes with a unique set of complexities and challenges – and the need for new understanding. The hybrid cloud is no different. However, gaining visibility into application performance can help agencies reap the benefits of hybrid cloud environments and ensure application performance remains consistently strong.

 

Find the full article on Government Computer News.


Cisco Love for Cisco Live

Posted by Leon Adato Expert Jul 18, 2016

"Sore throat from talking, sore feet from walking, sore face from smiling. Must be @Cisco Live!"

 

I have a whole post about what I saw at Cisco Live and what I think about what I saw, which will be posted in the next few days, but as I started writing that article, I realized that I had a lot of ink to spill about our amazing SolarWinds team and the experience I had working with them during the convention this year. I realized that it deserves its own dedicated space, which is what this post is about.

 

I hope you'll take the time to read this preface, share my gratitude, and maybe even leave thoughts of your own in the comments below if you were able to experience any of our shows.

 

The first thing that plucked at my heartstrings was the booth. Veteran event specialists Cara Prystowski and Helen Stewart executed masterfully, including a completely new (and jaw-droppingly awesome) booth design with the largest team ever assembled. Words cannot do it justice, so here are some pictures:


 

The team this year boasted an incredible mix of skills and perspectives. We had our usual complement of engineers and geeky types, including NPM Product Managers Chris O'Brien and Kevin Sparenberg; Sales Engineers Sean Martinez, Andrew Adams, Miquel Gonzalez, and David Byrd; Product Marketing Managers Robert Blair and Abigail Norman; and even some of our depth technical leads, including Product Strategist Michal Hrncirik, Principal Architect (and resident booth babe) Karlo Zatylny, and Technical Lead Developer Lan Li.

 

This was a powerhouse of a team, and there wasn't a single question that people brought to the booth that couldn't be addressed by someone on staff.

 

Of course, Head Geeks Patrick Hubbard and Destiny Bertucci (not to mention Yours Truly) were there, as well, adding a unique voice and vision for where technology is heading. I can’t stress this enough: Destiny, our newest addition to the Head Geek team, kicked serious geek ****, in the booth, on the trade show floor, and at our SWUG. She is a juggernaut.

 

But the awesome didn't stop there.

 

Wendy Abbott was working the crowd like the THWACK rock star she is, helping people understand the value of our online community, as well as spreading the word about our upcoming THWACKcamp, happening September 14th and 15th. She also split her time between the booth and helping the inimitable Danielle Higgins run our first-ever "mini-SWUG": a SolarWinds User Group for Cisco Live attendees and Las Vegas locals. It was a total blast.

 

Our Social Media Ninja Allie Eby was also on hand, helping direct visitors and field questions while capturing and promoting the ideas and images swirling around our booth on Twitter, Facebook, and LinkedIn.

 

Support can be an unsung and under-developed area in any tech organization. But Jennifer Kuvlesky was on hand to show off the new Customer Success Center, helping existing customers and people new to our company understand that we are committed to providing in-depth technical information about our products and the problems they solve.

 

We even had senior UX designer Tulsi Patel with us, to capture ideas and reactions about our new Orion UI, as well as the features and functions of our products. How's that for responsive? We actually were able to take user feedback from the booth as people were interacting with our products!

 

Jonathan Pfertner and Caleb Theimer, crack members of our video team, were tirelessly capturing video footage, testimonials, customer reactions, and action shots of the show. Look for some of those images coming soon.

 

Even our management team was firing on all cylinders. Jenne Barbour (Sr. Director of Marketing and un-official Head Geek wrangler) and Nicole Eversgerd (Sr. Director of Brand and Events) put in as many hours, logged as many carpeted miles, and radiated as many geek-friendly smiles as the rest of the team.

 

Finally, we had a top notch onsite setup crew. Patrick mentioned that it was the first time he was able to just walk in and turn on the PCs. No small thanks for that goes to Kong and his help provisioning systems last weekend.

 

It was, as Patrick mentioned on Twitter, the "best #CLUS ever.”

 

Now that you have an appreciation for just how awesome the SolarWinds crew was, sit tight for my thoughts about what I saw this year, and what it means for us as IT pros.


An Introduction to ELK Stack

Posted by jdgreen Jul 15, 2016

With the pace of business today, it’s easy to lose track of what’s going on. It’s also becoming increasingly difficult to derive value from data quickly enough that the data is still relevant. Oftentimes companies struggle with a situation where, by the time the data has been crunched and visualized in a meaningful way, the optimal window for taking action has already come and gone.

 

One of the strategies many organizations use to make sense of the vast amounts of helpful data their infrastructure generates is to collect all the logging information from various infrastructure components, crunch it by correlating timestamps and applying heuristics that take relationships between infrastructure entities into account, and present it in a report or dashboard that brings the important metrics to the surface.

 

ELK Stack is one way modern organizations choose to accomplish this. As the name (“stack”) implies, ELK is not actually a tool in itself, but rather a useful combination of three different tools – Elasticsearch, Logstash, and Kibana – hence ELK. All three are open source projects maintained by Elastic. The descriptions of each tool from Elastic on their website are great, so I have opted not to re-write them. Elastic says they are:

 

  • Elasticsearch: A distributed, open source search and analytics engine, designed for horizontal scalability, reliability, and easy management. It combines the speed of search with the power of analytics via a sophisticated, developer-friendly query language covering structured, unstructured, and time-series data.
  • Logstash: A flexible, open source data collection, enrichment, and transportation pipeline. With connectors to common infrastructure for easy integration, Logstash is designed to efficiently process a growing list of log, event, and unstructured data sources for distribution into a variety of outputs, including Elasticsearch.
  • Kibana: An open source data visualization platform that allows you to interact with your data through stunning, powerful graphics. From histograms to geomaps, Kibana brings your data to life with visuals that can be combined into custom dashboards that help you share insights from your data far and wide.

 

Put simply, the tools respectively provide fast searching over a large data set, collect and distribute large amounts of log data, and visualize the collected and processed data. Getting started with ELK stack isn’t too difficult, but there are ways that community members have contributed their efforts to make it even easier. Friend of the IT community Larry Smith wrote a really helpful guide to deploying a highly available ELK stack environment that you can use to get going. Given a little bit of determination, you can use Larry’s guide to get a resilient ELK stack deployment running in your lab in an evening after work!
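Once logs are flowing in, questions like "which ERROR lines did we log in the last hour?" collapse into a single query. Here is a minimal sketch against the Elasticsearch REST search API, assuming a local node on port 9200 and Logstash's default logstash-* index naming; field names like @timestamp and message depend on how your pipeline is configured.

```python
#!/usr/bin/env python3
"""Query Elasticsearch for recent ERROR log lines shipped by Logstash."""
import json

import requests

ES_URL = "http://localhost:9200/logstash-*/_search"

query = {
    "size": 5,                                   # return the 5 newest matches
    "sort": [{"@timestamp": {"order": "desc"}}],
    "query": {
        "bool": {
            "must": [{"match": {"message": "ERROR"}}],
            "filter": [{"range": {"@timestamp": {"gte": "now-1h"}}}],
        }
    },
}

resp = requests.get(ES_URL, json=query, timeout=10)
resp.raise_for_status()
hits = resp.json()["hits"]
# Note: the shape of hits["total"] varies between Elasticsearch versions.
print("matching log lines in the last hour:", hits["total"])
for hit in hits["hits"]:
    print(json.dumps(hit["_source"], indent=2)[:200])
```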

 

Alternatively, if you’re looking to get going on an enterprise-class deployment of these tools and don’t have time for fooling around, you could consider whether hosted ELK stack services would meet your needs. Depending on your budget and skills, it could make sense to let someone else do the heavy lifting, and that’s where services like Qbox come in. I’ve not used the service myself and I’m not necessarily endorsing this one, but I’ve seen managed services like this be very successful in meeting other pressing needs in the past.

 

If you check this out and ELK Stack doesn’t meet your data insight requirements, there are other awesome options as well. There’s also the ongoing debate about proprietary vs. open source software and you’ll find that there are log collection/search/visualization tools for both sides of the matter. If you’re looking for something different, you may want to consider:

War Story

 

In a network far, far away, on a sunny Friday afternoon, something was broken badly and nobody was there to repair it. I was called in as an external consultant to help the local IT team fix the problem. I had never seen that network before, and the description of the error said only "outage in the network." When I arrived at the car park, a lot of sad-looking employees were leaving the office building.

The first thing I always do in these situations is ask for the network documentation. There was an uncomfortably long silence after I asked the question. Finally somebody said: "Yeah, our documentation is completely outdated, and the guy who knew all the details about the network has just left the company..." The monitoring looked like an F1 race win in Monza: everything was blinking red. The monitoring would really have helped, but unfortunately it was down as well. When you don't know what to look for, it is like looking for a needle in a haystack.

In an outage situation like this, proper documentation and working monitoring would have reduced the time needed to find the actual problem. Instead of debugging the actual problem, you spend an enormous amount of time exploring the network, desperately trying to figure out in which general direction you should do further troubleshooting. You will probably also get sidetracked by minor misconfigurations, bad network designs, and other odd details—things that have been there for many years but are not causing the problem you are trying to fix right now. It is hard to figure out what the actual problem is in these situations, and you get constant pressure from management during the outage: while you are still exploring the network, you have to report what could have caused it. To summarize: without valid documentation and working monitoring, you face some time-consuming challenges in bringing the network back to life. It is not an ideal position, and you should do everything you can to avoid it.

 

 

Lessons to be learned

 

Having up-to-date documentation helps a lot. I know keeping documentation current is a lot of work; many network engineers are constantly firefighting and under pressure to roll out new boxes for a "high priority" project. It helps to build monitoring and documentation into the rollout workflow, so that no device can be added without both. In my little example from the beginning, somebody trying to troubleshoot the initial problem used the trial-and-error method and disconnected the production virtualization host that the monitoring was running on. To avoid these situations, it makes sense to run the monitoring system on separate infrastructure that works independently even when there is an outage in the production environment.

For documentation, sometimes less is more. You need a solid ground level—for example, a good diagram that shows the basic network topology. Because documentation is outdated the moment somebody finishes it, it is better to look at live data where possible. I am always unsure whether a possibly outdated document shows the correct switch types or interfaces; in the monitoring system, you can be sure this information was polled live and is updated automatically if somebody makes even a minor change like a software update.

Some problems can be fixed fast, and some are more of a long-term effort—for example, the mysterious "performance problem." These tickets circulate through all the IT departments and nobody can find anything. Here it helps to lay out the complete picture: find all the components involved and their dependencies on each other. This can be a very time-consuming job, but sometimes it is the only way to figure out what is really causing the "performance problems." With that knowledge integrated into the monitoring, you get live data for the systems involved. I have had great success with that method for fixing long-term problems, and afterwards I have the ability to keep monitoring the issue in case it shows up again.


The Actuator - July 13th

Posted by sqlrockstar Employee Jul 13, 2016

Still in London, and enjoying every minute of time well spent with family. Even on vacation I still find time to read the headlines.

 

So, here is this week's list of things I find amusing from around the Internet. Enjoy!

 

Savvy Hackers Don't Need Malware

Interesting article on how companies will be incorporating analytics into their security measures which makes me wonder "wait, why weren't you already doing this?"

 

A.I. Downs Expert Human Fighter Pilot In Dogfight Simulation

Forget self-driving cars, we are about to have self-flying fighter jets. I'm certain this won't lead to the end of humanity.

 

Security insanity: how we keep failing at the basics

Wonderful article by Troy here, showing more examples of basic security failures from some large companies and websites.

 

The U.S. MiseryMap of Flight Delays

Valuable data for anyone that travels, ever.

 

Here’s How Much Energy All US Data Centers Consume

It is surprisingly low, IMO. I'd like to see some data on how much capacity and throughput these data centers are providing over time as well.

 

CRM and Brexit: teachable moment

I enjoyed how the author broke down #Brexit and related it to software. We've all seen both sides of this equation.

 

Is Your Disaster Recovery Plan Up-To-Date?

I've always found summertime to be a good time to review your DR plans, and this article is a nice reminder.

 

Why yes, Lego Daughter loves Harry Potter, why do you ask?

