
Working smarter, not harder in IT

"Do you think the guys running Azure or AWS care if a server gets rebooted in the middle of the day?" I asked the Help Desk analyst when he protested my decision to reboot a VM just before lunch.

"Well, uhh. No. But we're not Azure," he replied.

"No, we're not. But we're closer today than we have ever been before. Also, I don't like working evenings," I responded as I restarted the VM.

The help desk guy was startled, with more than a little fear in his voice, but I reassured him I'd take the blame if his queue was flooded with upset user calls.

Such are the battles one has to fight in IT Environments that are stuck in what I call the Old Ways of IT. If you're in IT, you know the Old Ways because you grew up with them like I did, or because you're still stuck in them and you know of no other way.

The Old Ways of doing IT go something like this:

  • User 1 & User 2 call to complain that Feature A is broken
  • Help desk guy dutifully notes feature A is busted, escalates to Server Guy
  • Server Guy notices Feature A is broken on Server A tied to IP Address 192.168.200.35, which is how User 1 & User 2 access Feature A
  • Server Guy throws up his hands, says he can't fix Server A without a Reboot on Evening 1
  • Help Desk guy tells the user nothing can be done until Evening 1
  • User 1 & User 2 hang up, disappointed
  • Server Guy fixes problem that evening by rebooting Server A

I don't know about you, but working in environments stuck in the Old Ways of IT really sucks. Do you like working evenings & weekends? I sure don't. My evenings & weekends are dedicated to rearing the Child Partition and hanging out with the Family Cluster, not fixing broke old servers tied to RFC-1918 IP addresses.

As the VM rebooted, my help desk guy braced himself for a flood of calls. I was tempted to get all paternalistic with him, but I sat there, silent. 90 seconds went by, the VM came back online. The queue didn't fill up; the help desk guy looked at me a bit startled. "What?!? How did you...but you rebooted...I don't understand."

That's when I went to the whiteboard in our little work area. I wanted to impart The New Way of Doing IT upon him and his team while the benefits of the New Way were fresh in their minds.

"Last week, I pushed out a group policy that updated the URL of Feature A on Service 1. Instead of our users accessing Service 1 via IP Address 192.168.200.35, they now access the load-balanced FQDN of that service. Beneath the FQDN are our four servers and their old IP addresses," I continued, drawing little arrows to the servers.

"Because the load balancer is hosting the name, we can reboot servers beneath it at will," the help desk guy said, a smile spreading across his face. "The load balancer maintains the user's session...wow," he continued.

"Exactly. Now you know why I always nag you to use the FQDN rather than the IP address. I never want to hear you give out an IP address over the phone again, ok?"

"Ok," he said, a big smile on his face.

I returned to automating & building out The Stack, getting it closer to Azure or AWS.
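
For anyone who wants the whiteboard drawing in code form, here is a minimal Python sketch of the idea. It assumes a simple round-robin balancer that skips servers marked as down; a real load balancer does far more (health probes, session handling, connection draining). The class name, the FQDN, and every address other than 192.168.200.35 are made up for illustration.

```python
from itertools import cycle


class LoadBalancedService:
    """Toy round-robin pool behind a single FQDN (hypothetical names)."""

    def __init__(self, fqdn, backends):
        self.fqdn = fqdn
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._ring = cycle(self.backends)

    def mark_down(self, backend):
        # Take a server out of rotation, e.g. just before rebooting it.
        self.healthy.discard(backend)

    def mark_up(self, backend):
        # Put a known server back into rotation once it is online again.
        if backend in self.backends:
            self.healthy.add(backend)

    def route(self):
        """Return the next healthy backend, skipping any that are down."""
        for _ in range(len(self.backends)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError(f"no healthy backends behind {self.fqdn}")


# Assumed FQDN and addresses: only 192.168.200.35 appears in the story.
pool = LoadBalancedService(
    "feature-a.example.internal",
    ["192.168.200.35", "192.168.200.36", "192.168.200.37", "192.168.200.38"],
)
pool.mark_down("192.168.200.35")  # reboot Server A just before lunch
served_by = [pool.route() for _ in range(6)]
```

Users only ever see the FQDN, so while Server A is down their requests land on the other three servers. That is the whole trick: nobody depends on one address.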

The help desk guy went back to his queue, but with something of a bounce in his step. He must have realized, the same way I realized it some years back, that the New Way of IT offered so much more than the Old Way. Instead of spending the next 90 minutes putting out fires with users, he could invest in himself and his career and study up a bit more on load balancers. Instead of rebooting the VM that evening (as I would have had him do it), he could spend that evening doing whatever he liked.

As cliche as it sounds, the new way of IT is about working smarter, not harder, and I think my help desk guy finally understood it that day.

A week or two later, I caught my converted help desk guy correcting one of his colleagues. "No, we never hand out the IP address, only the FQDN."

Excellent.

13 Comments
MVP

Alternatively, the service desk hears about the server reboot from the change system which gives them details of which customer services are potentially affected by the change.

The same change record also lets them see the expected impacts on the services and any mitigation strategies or workarounds that have previously been worked out.

They also see the approval from the service owner so have no worries about any potential outages, because they can see it has already been properly vetted.

In the unlikely event of an issue, a user calling into the service desk is immediately given the workaround to allow them to carry on and they continue with their work, unaware of the IT dept continuing to work for them.

No surprises, no drama. Customer keeps working and Servicedesk has all the information they need to give to the customers.

Level 15

Nice anecdote!

I come across things like this every once in a while as I am travelling between clients. Usually, I can get someone to give me 5 minutes of their time to try and win them over with some flash of automation. But every so often, the true dinosaurs are found and are just not open to the new world order.

Fun times for sure. For me, I came up with a mentor who made it abundantly clear that the BEST IT admins are inherently lazy and always looking for a way to automate all the things.

MVP

Ahh...good post.  Hopefully you have a correct and robust DNS...

Level 9

Great alternate scenario mcam....

I'm assuming your scenario is based in whole or part on the ITIL framework?

I've only ever been in environments that are struggling to organize themselves along ITIL v3, so I'm biased...but isn't it interesting that in both scenarios the unspoken secondary (or maybe primary, really) goal is to minimize the risk & impact of IT guys making mistakes?

MVP

That is what I thought initially.

Now I realize it's really about helping IT folks understand that business is the point of IT, not the other way around.

Once you understand that, thinking of what you do as a service that your business uses is much easier.

Level 9

Nice post agnostic_node1

This article reminds me of why I am growing to love PowerShell more and more.

Level 12


....and correctly applied certs.

MVP

this is true too..

Level 9

DNSSEC?

Of course, an offlined Root with a Subordinate CA handing out certs, publishing revocations, and responding to queries is important...if I recall, this particular application was internal only, but that doesn't prevent a MITM attack, I'll grant you.

Also, I like your name Jwilson2013

MVP

Had a few clients recently with some wonky DNS - caused a few issues with SolarWinds. They're now going through and fixing things up.

MVP

I bet....a correct and robust DNS is so very important.  In the past, once we got DNS correct, we set up a caching DNS server on one of our monitoring servers so all the DNS calls are basically local and don't overstress the primary DNS.  It worked well.
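
Roughly what that caching layer buys you, as a minimal Python sketch. This is not how a real caching DNS server is built; it only assumes a fixed TTL per answer and an upstream lookup passed in as a plain function. The resolver class, hostname, and returned address are hypothetical.

```python
import time


class CachingResolver:
    """Answer repeat lookups from a local cache until the entry's TTL expires."""

    def __init__(self, upstream, ttl_seconds=300, clock=time.monotonic):
        self.upstream = upstream        # callable: hostname -> address
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._cache = {}                # hostname -> (address, expiry time)

    def resolve(self, hostname):
        now = self.clock()
        cached = self._cache.get(hostname)
        if cached and cached[1] > now:
            return cached[0]            # served locally, no upstream query
        address = self.upstream(hostname)
        self._cache[hostname] = (address, now + self.ttl_seconds)
        return address


# Hypothetical upstream that records how often the primary DNS is asked.
upstream_queries = []


def primary_dns(hostname):
    upstream_queries.append(hostname)
    return "192.168.200.35"


resolver = CachingResolver(primary_dns, ttl_seconds=300)
answers = [resolver.resolve("feature-a.example.internal") for _ in range(100)]
```

A monitoring server that polls the same names every minute only bothers the primary DNS once per TTL instead of on every poll.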

Level 7

Yes, you can say that again, now that most Microsoft solutions are built around PowerShell.

Level 12

These are all ideas for building fault-tolerant applications.  By reducing the single points of failure, you also increase the ability to handle failures on specific components.  In your example, Feature A being broken on Server A, but not Server B, means you can handle the issue on Server A without impacting user perception.  There are lots of other things "we" do in IT to facilitate this as well:

  • RAID controllers (disk failures can be handled without causing any down time)
  • Teamed NICs (bad cable? Upstream switch failure? Failed NIC? No problem)
  • Load balancers (or, as a lot of vendors are now calling them, Application Delivery Controllers {ADC}; they allow you to handle failure of single/multiple servers without a hiccup)
  • MPIO (multiple fiber/links to your SAN; nothing makes a server unhappy like pulling away its data stores)
  • Virtualization (nothing beats noticing a hardware fault that requires you to perform offline maintenance, and being able to move the workload off of the host without any visible downtime)
  • Clustering of services (such as MS SQL)

We do a lot of these to save the application in the event of failure, but they help us work smarter when we need to.  I regularly reboot machines outside the maintenance window as necessary (with proper change controls) to resolve issues.  Or a simple request to the network team to have the node removed from the load balancer allows us to do in-depth troubleshooting without impacting the rest of the application.

Work smarter, not harder. Build your environment to survive failures. Applications should try again (some issues are fleeting). Document as you go; you never know when the proverbial bus will arrive.
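
On the "try again" point, a minimal Python sketch of a retry with exponential backoff. It assumes the fleeting failure shows up as a ConnectionError; the function names, delays, and the fake flaky call are all hypothetical.

```python
import time


def call_with_retries(operation, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run operation(); on failure wait and retry, doubling the delay each time."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts:
                raise  # out of retries: surface the real error
            sleep(base_delay * 2 ** (attempt - 1))


# Hypothetical flaky call: fails twice (server rebooting), then succeeds.
calls = {"count": 0}


def flaky_feature_a():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("backend rebooting")
    return "ok"


# The sleep is stubbed out here so the example runs instantly.
result = call_with_retries(flaky_feature_a, sleep=lambda seconds: None)
```

Capping the number of attempts matters: an application that retries forever just hides a real outage.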

And remember to take vacations. A RAID controller can help you survive a drive failure, but it doesn't do much for your health if you're stressed all the time.

About the Author
A 16-year veteran of SMB IT shops, Jeff has seen it all and likes to share. Broadly skilled just like the IT Generalist he thinks of when writing, yet deep enough on virtualization & storage to hold his own, Jeff practices good IT at home so that he can excel at work.