
Calling All Super SysAdmins: When Have You Come to the Rescue?

Level 15

They say love is a battlefield, but as any SysAdmin knows, so is the data center! With an endless supply of "evil villains" like zombie VMs, data loss, performance bottlenecks, application downtime, service outages, and more lurking in the dark corners of a data center, you SysAdmins are the first line of defense for today's businesses and end users. Yes, SysAdmins are truly the unsung heroes of modern business. And as your role continues to evolve with the growth of hyperconvergence, hybrid IT, and cloud computing, among other new technology trends, so too do the threats and challenges you must defeat day in and day out.

So, to help recognize all of you hero SysAdmins and to start celebrating System Administrator Appreciation Day on Friday, July 29th, we’d like to hear about your greatest SysAdmin moment in history: a time YOU came to the rescue and saved the day. And don’t forget to tell us what your corresponding fictional superhero identity would be—we know you’re out there, Captain Uptime and Incredible Server Girl!

Use the comments section below to share your heroic tale and superhero identity by July 1, and we’ll give you the key to the city, er, I mean 250 thwack points as a token of our appreciation.

Level 13

There are so many to choose from, how do I write about just one?

How about the day I used a file recovery tool to recover personal pictures off a corporate iPhone 6 after it was securely wiped via a BlackBerry BES server?

Or the day (way back in 1996) I used Novell's DELETED.SAV directory to recover a client's POS database that they had accidentally deleted?

Another one is when I threw one of my network administrators into the deep end of the ocean to see if they could swim (aka, fix our Lotus Notes server, which wouldn't boot). I knew this individual could do it, but my predecessor never gave them the chance to prove themselves.

I think the best one is when I convinced a manufacturer that a pee-soaked (yes, that is not a typo) laptop was covered under our extended accidental water damage warranty.

Call me Captain-I've-Seen-It-All

How about the time I discovered our WAN provider had accidentally fat-fingered a VLAN port range on their WAN service switch, and a city's internal government network, and their state's remote connectivity, suddenly became reachable without authentication from my organization's internal network?  And they had no passwords on their routers!

Or perhaps more of an actual "saving the day" happened when a Nexus 5020 was being rebooted for a code upgrade and ran head first into a bad lot of Chinese chips that caused it to fail. The replacement came in 4 hours, and within hours the second VSS-paired 5020 ALSO failed, due to the same chipset issue.  That one's replacement arrived in 22 hours, and our data center limped along on one 5K until I got resilience restored.  I discovered during this event how many servers hadn't had resilience implemented: single-NIC devices that hadn't been retired yet, dual-NIC devices that hadn't had both NICs plugged in, and dual-NIC devices that didn't have LACP properly implemented to provide resilience.  I found and dealt with all of those issues for supposedly mission-critical systems.

What about the time I found and foiled a DDoS attack against our firewall that leveraged the NTP (UDP port 123) reflection vulnerability, using private businesses to overload a South American corporation's Internet presence?  NPM showed the rising ISP throughput was actually four times what we had contracted for, and the ISP wouldn't do anything to filter out the attack on their side.  Heads rolled when I pointed out the problem and their inability to help defeat it.

When I was little, I thought John Carter (of Barsoom) was pretty cool, so I decided "Agent Rick Carter" could be a fun handle for my secret super hero self.  Then I met someone named Rick Carter and quickly decided to stay myself.  Instead I invented a nom de plume based on a fictional Finnish Frisbee Olympics champion, and I smile when I sign that name to e-mails to my oldest friends, who knew me back in the day when I was a competitive Frisbee player.  A translation of his name might be "The Lion of White Water."

Anytime we help the end users, those folks are often so gracious and grateful to be included in the solution, or to have someone share the problem and the fix or an ETR with them.

I remembered a time when resurrecting Sneaker-Net was greatly appreciated by a group . . .

Once I worked where one department was dependent on an in-house network fax service for ordering critical materials.  An extended network maintenance window left that department stranded and frantic for supplies: how could they place their orders with the supply department when the process required computers?  They were in a panic.

I took a calm cleansing breath and said "They're just down the hall from you.  Could you carry a requisition over to them so your team can receive their supplies and continue doing their work?"

You could've heard a pin drop.

"YES, YES, that's a WONDERFUL idea--THANK YOU!!" and they hung up the phone quickly . . .

Sneaker-Net to the rescue, and I was their hero for the day.

Level 12

How about in limerick form?

There once was a user who liked to complain

"Hey, the network is slow!" went his tired refrain.

I laughed at him and gave a derisive snort,

"Quiet, you, or I'll admin down your switch port!"

And I never heard from that user again!

Or this one, maybe a touch closer to reality...

A long time ago, a user's laptop was dead

"It fell out of my car," that's what he said.

Mumbling and rolling my eyes,

I said, "You've won a prize!"

And gave that user an etch-a-sketch instead.

Okay, I give.  Perhaps I'm just too jaded from all the years.  In any case, I'll claim to be the mysterious Fifth Octet (5/8 for short).

Level 14


Probably the best was being out sick one day and having my boss call me to tell me our European HQ in London was unable to ship product and could I call and talk to them.

I did, and quickly learned that their VMS system had died and the system disk was toast. I asked if they had done a standalone backup of the system disk, as I had reminded them to do about 5 times before.... "Gee, no... we haven't".

It is now 3 PM... my boss asks if I can fix or rebuild it.... I can..... Next flight to London: 7:30 PM... Needless to say, a quick stop by the office to grab some notes and tapes (plus a bottle of Pepto!!!) and I was headed "across the pond". Four (4) days and two (2) DOA hard drives (shipped to London from the continent) later, we were moving product. (Got a huge attaboy from my boss and the CEO)...

Before boarding the plane back the European GM brought me to a pub.. and had me fully "relaxed" for the ride home!!!

There are some parts missing from that story, I suspect.  It would be nice to hear that "corrections" were applied to ensure your recommended standalone backup was regularly performed . . .

Level 14

they were....and reported to me on a scheduled basis...

and yes... there are parts of the story... that are better told over adult beverages...

Level 12

I like turtles

TMNT's?  Or for the points?

Level 9

Technology can do some amazing things, like letting you communicate with someone anywhere in the world. It sort of gives many the impression that, hey, if we send someone over there and test the application, we should all be good if we decide to move our office to that new location. No engineering required; "I tested it myself," said the vice president. So now I had inherited a performance problem, with hourly status calls, because a critical application was too slow to be usable with customers on the phone. Customer representatives were having to take notes on paper, then transcribe that information into the system after the call. Network-wise it wasn't a bandwidth issue, and it didn't even seem to be a latency issue, though that was my first thought. Gathering some information about the application: it was just a web form running in Internet Explorer. Then I watched an actual user performing the job: two web browser windows open, one for the application in question and the other for the intranet site with procedures. Add Lotus Notes, a messaging app called Sametime, the call center connector software, a Word or Excel document, and the typical antivirus, all installed on a desktop PC running Windows XP with 512 MB of memory. I asked the vice president what type of laptop he had; of course it was a high-end device with 2 GB of memory. I suggested: how about adding some memory to one of these machines?  No more issues with that one. It turned out to be a simple fix compared to the proposed network upgrades that were being planned.

Level 13

Maybe his hero name is Turtles?

Could be!

Level 9

Probably the time that a bitter web admin deleted all the DNS records for a client's websites before we took them on as a client of our own.

Cue a flurried few hours trawling through their decommissioned 2003 server's local Exchange for any email record of the details, and their sites were back online before the close of the day!


My favorites are always the ones where we are able to avoid outages by properly identifying a problem before it affects the end user.

Level 8

Came in with reports of the entire network being down in a building.  It was interesting to see the LED lights on the switches all dancing in sync to the broadcast storm that was in full effect.  Someone had installed a hub in a shared office, trying to help out by creating additional switch ports, and created a loop.  Couple this with systems that broadcast a significant amount of traffic just doing what they do, and it was a perfect storm.

Level 12

Yeah, it is hard to pick just one. I would have to say when a company was on site doing a presentation for the C-levels and some of the board members, after hours of course. I got a call that they were having problems, so I came right in. The company had a laptop and some software they use for their presentations. The laptop only had HDMI on it, no VGA or DisplayPort. They were all panicking because the TV was only set up with VGA and DisplayPort cables. I went and found an HDMI cable, pulled the TV off the wall, plugged it in, hooked the laptop up, and everyone was breathing a little easier. They thought I had worked a miracle, like I had beamed the cable down from the Enterprise or something and was willing to take something mounted to the wall down to hook the magical cable up to it. Just another day at the office though.


Well, just at the end of last month my boss decided to turn off the router we had used to connect to an MPLS provider in the past.  The reason we disconnected the MPLS is that we transitioned to high-speed DIA links and used that very same router to home the 125+ IPSEC/GRE VPN tunnels to those sites.  Of course, when I informed him of that fact and he turned it back on, it was toast: a known memory problem with Cisco 7206 routers.  The router was 6+ years past EOL and didn't have a service contract on it, so there was no way we could purchase one either.  I work remotely, about 2,000 miles from our datacenter, so the guys there pulled out a spare router, which also died.  We then tried to use a 7604 we had sitting around, but it didn't have the hardware needed to do crypto.  I had recently transitioned all but 5 tunnels from another 7206 we were using over to a DMVPN solution I was working on, so I quickly transitioned those over and we moved that 7206 over to take its place.  It died also...

So, about a month prior I had been told NOT to work on the DMVPN solution because someone would rather transition us to a Fortinet solution, which I'm not a fan of.  But I kept working on it in order to meet other deadlines I had in transitioning a recent acquisition of nearly a hundred sites off the 3rd-party provider they had.  So I had a choice: either transition them over to DMVPN, or transition them to another static IPSEC/GRE router we had up for some other sites...

Since doing it the "old" way with the statically configured IPSEC/GRE sites required modifying both the head-end router and the remote router quite a bit for each and every site and tunnel we put up, I didn't want to do this.  We would have had to generate and set crypto keys and maps, ACLs and such; this was all statically configured!  We would have been lucky to do 4 sites an hour per network engineer, and of the three not out on vacation, I was the only one who had been with the company for any length of time.  None of the new engineers had ever even rebuilt a tunnel, even though I'd been urging the boss to put them in the support queue to help out with problems.  So, at 120 sites with 3 engineers bringing online, let's say, 3 an hour, that would conservatively have taken us 2 days, and I'd guess probably even longer.

Luckily these sites had backup links, which are mainly used for Guest Wireless traffic and can get to be quite slow.  But they afforded us an easy way to get certificates from within our network.  So 2 guys set off on doing that, while another 2 worked on the DMVPN tunnels.  This problem had started late in the afternoon the day before, and it wasn't until about 8 PM EST that they figured out we didn't have a router to take the place of the failed one.  I had been working on some sites that only had a link to the failed router, manually bringing up a tunnel with them, getting certificates for them, and getting them onto DMVPN.  So we didn't really start work on the 120 or so routers until early the next morning.  We had tunnels to all the routers up about 6 hours into the next day.  Rather than days of work, we got most of it done in a fraction of a day...

The day after that, I took off early at noon due to working so much the previous couple of days.  About 15 minutes later they called saying that one of the sites had moved and they couldn't get the new router shipped out to connect.  The problem was, it had the old configuration on it without a cert.  So I told my boss that they just had to bring up a temporary tunnel the old way, get the cert, then deploy DMVPN.  If you remember, I said I'd hoped we could deploy 3 of these statically configured tunnels an hour if needed; they worked on it for almost 6 hours before they gave in and called me back in to fix it!  From 3 an hour to one every 6 hours really blows the time estimates I gave, huh?

So, just one example of saving the day!  Bringing 120 sites up on a DMVPN solution that wasn't even supposed to be in production, in 6 hours!!  🙂  Now it's passing terabytes of traffic daily without a hitch...  (and if you're wondering, they still want to replace it with Fortinet! smh...)

Level 13


Heaven save us from ignorant but well-meaning superiors!

I had one, long ago, who believed a hardened FreeBSD firewall was the cause of certain problems, and while I was out at lunch, he decided to reboot it.  Of course he had no passwords, no logical access.  But he had the key to the server room, and proceeded to power it off manually with the power button.  When he powered it back on, everything was corrupted; nothing flowed to the internet or through the DMZs.  Frantically calling me, he reported the firewall was down.  My initial troubleshooting questions weren't getting helpful responses from him, until he finally admitted what he'd done.

Fortunately I had great support from the vendor, and freshly backed-up tapes with which to rebuild the poor box.

The same person called me a different time while I was out of town on vacation--"Get in here immediately, the whole network has crashed!"   Really?  Thirty-three sites supporting 14,000 users, all went down at once?

More probing revealed his work computer was hung, and a reboot fixed it.  We never got around to renaming his user to "Chicken Little", but some folks in the department had suggested it as a fitting handle.

I guess I "saved the day" in both instances, but there was no joy or recognition or bonus for doing one's job, even if it was recovering from a Director's errors.

But it makes for good stories years later!


Level 8

I'd have to say rescuing a PC from a crypto-type malware which, somehow, had easy-to-follow removal instructions, saving all the data.  Not sure how it didn't encrypt, but I'm not asking questions.  Fairly nasty, but nonetheless, there was the walkthrough on Google, and it actually worked!!!

Level 9

I'm quite fond of reminiscing about the day I put an end to a Ransomware siege.

The only info I had to go on was that 30,000 critical files were encrypted, each with its own ransom note.  We had no idea what the source of the infection on the network was, or whether anything was happening that hadn't yet been reported.  Cue a shutdown of all servers across the network.

One server was brought up in isolation: a DC that hosted a local replica of the DFS filestore.  After spending (read: wasting) lots of time looking at the encrypted files and combing the web to find out whether a decryption tool was available, I realised that the key to this problem lay in the ransom notes.  Each ransom note had been created by one user.  We now had the source on the network.

A quick deletion of all the encrypted files and the associated ransom notes, followed by a restore of all the original data from the previous night's backup, and we were back in business.
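The pivot in that story, checking who created the ransom notes to find patient zero, can be sketched in a few lines. This is a hypothetical illustration, not the actual incident tooling; `likely_patient_zero` and the sample account names are invented, and on a real Windows DFS share you would read each note's ACL owner (e.g. with PowerShell's `Get-Acl`) rather than POSIX file owners:

```python
# Hypothetical sketch: ransomware writes its notes as the infected user,
# so the most common owner among the ransom-note files points at patient zero.
from collections import Counter

def likely_patient_zero(note_owners):
    """Return the most frequent owner among the ransom-note files."""
    (owner, _count), = Counter(note_owners).most_common(1)
    return owner

# Owners could be gathered on a POSIX filesystem like this (illustrative;
# the share path and note filename pattern are placeholders):
#   from pathlib import Path
#   owners = [Path(p).owner() for p in Path("/dfs").rglob("*DECRYPT*.txt")]
print(likely_patient_zero(["jsmith", "jsmith", "svc-backup", "jsmith"]))
# -> jsmith
```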

Malware Assassin.  Or... An IT superhero has no name.

Level 12

This one time at work....

How about having a local office tech pull drives out of an array without labeling them, and then not put them back in the correct sequence? Super-Admin comes in and has to restore the entire volume of data from tape.  2 days' worth of restores.

Level 9

The Friday afternoon of the Memorial Day 1998 weekend someone in a neighboring cage spilled a Big Gulp into our SAN. We spent the entire three day weekend restoring databases from tape. Client was happy, wife was not.

That's unfortunate, and possibly a sign of inexperience or insufficient training and testing.

Array manufacturers should have had this problem and its automatic resolution down pat long ago: a design where drive position is irrelevant, where the drive has an internal ID that the array recognizes automatically and reads/writes to/from that position, no matter the bay or slot the drive is installed in.

Expect human error, design to compensate for it and move forward.

Level 9

I was starting my second season running the Antarctic Muon And Neutrino Detector Array (AMANDA) at the South Pole when some of my German colleagues came down ready to upgrade their part of the experiment.  They were removing a slave processor from a VME card cage that read 700 high-speed analog data channels from the photomultiplier tubes buried in the ice 4000'-6500' below the building (the experiment worked by picking up flashes of light when neutrinos would occasionally collide with water molecules) and replacing it with a "bridge" card to directly map four dozen A-to-D cards into the server's memory space.  The critical part was that the old server was running RedHat Linux (they had tested on SuSE, back at their home institution) and they couldn't get the driver for the bridge card to compile against that kernel.  So I said, "Let me take a look at that driver..."  "We spent a month on it and it just won't work on RedHat."  "Let me look at it anyway."

I grabbed the code, compiled it, looked at the error it threw ("See!  There!") and read the driver source.  It looked like a minor change in an mmap call, which I fixed and wrapped in an #ifdef, then recompiled.  10 minutes of work and zero errors.  I insmodded the driver and said, "Try that."  It worked fine for the next 3 years, until they retired the experiment in favor of its successor.

I love Open Source.  You can't do that when you can't open the hood and see the guts.

We were in the middle of a production migration when the SQL DBA rang me up to tell me, "Oops, I ran your ODS scripts against your production database."  We were already hours into the migration and did not have time to do a database restore and still meet our deadline of being up and operational by the start of business hours.  A few Ctrl+Fs later, we were scripting the statements to fix the database, and we made the deadline with enough time to spare for 4 hours of sleep before business hours.

Star Lord Peter C Quill or Star Lord Peter SQL


Many moons ago, when I was working for Digex, the first managed website provider, I had a client that was having all sorts of issues with their new Cisco HA firewalls (circa 2007). The pair kept losing connection to each other, so the secondary node kept trying to become primary. My Network Security team (firewalls, IDS, DNS, etc.) spent hours trying to figure out why the firewalls were not liking each other. No hardware failures, no errors in the logs other than "heartbeat lost". Bupkis!

And then I remembered a little nugget a firewall engineer had told me about two years prior while speccing out a tech refresh for another client's ecommerce website: the heartbeat connection has to be at a speed equal to, or faster than, the fastest connection. I told the technician working the ticket this and voilà! The heartbeat was 100/Full, and the public and private interfaces were 1Gb/Full. That was it! We lucked out because these Cisco devices came with 12 ports standard, so the heartbeat cable was moved to a 1Gb port and everything came up roses. The Network Security team was embarrassed that a Technical Account Mgr like myself identified the fix, but I told them I didn't care. I just wanted the client to stop yelling at me.  🙂


As a field technician at the European Space Agency, I was called in to a conference room full of scientists and engineers. They couldn't work out how to use the fancy new LCD projector that sat on top of the overhead projector (this was in the mid-90s). After a couple of moments evaluating the new technology, I decided I would try plugging it in to the wall. Just goes to show that even rocket scientists have trouble with the tech we support.

Level 20

Stopping the Code Red worm back in the day was pretty neat... it blasted out so much traffic that one infected machine could pretty much bring down an entire segment.  Nothing like a hardware sniffer to find the MAC and IP address of the infected machine and then either shut down the interface (if you could even get logged into the switch and router, because we didn't have layer 3 switches back then) or, the old-fashioned way, start pulling user cables out at the switch until all the pretty green solid lights went away...

Level 20

Linux rocks!


Not sure this qualifies as saving the day... but it saved the company a bunch of $$$$ in penalties when the company was going through PCI audits for certification.

Back in a former life the company I worked for was working on their PCI compliance right after we finished up SOX compliance fun.

We had over 5000 5GT VPN appliances in our stores.  Each had a Unix-based OS and therefore had a root account.  PCI required each and every one to have a different root password, with the passwords changed every 90 days and also changed within 48 hours of being used.  No one person could know them, nor could they be stored in clear text, nor could the network team manage the passwords.

Now this was around 2006 and there wasn't any software to handle this.  It got dumped on our group as the most likely suspect for being able to deal with it.  We all scoffed at the idea, but after a few days of it bouncing around the back of my head with all the loose screws, a lesson from my Perl class kept popping up.

So I goofed around with the concept and crypt(3).  In the end we had a list of a store designator and a string that corresponded to the date of the last password change.  The script would take the store designator, prepend and append some string data, and using a specific salt string would create a 13-character mixed-case alphanumeric string.  Then the date would be prepended to that string and run through crypt with a different salt string to produce the password.

When a tech requested a password to do work, the request would come via email and I would provide the resulting password string, with no other info, via IM.  The generation of the password would create a flag file with the store designator.  A Perl script would run once a day looking for those flag files.  Once a file's age exceeded 48 hours, the script would log onto the 5GT and change the password, updating the data file with the current date and deleting the flag file.  Thus the log of when the password was last changed (the compliance log) was also the hidden key to create the password.

The best part came with updates that had to be made from time to time.  The network team would provide the command-line command to make the change.  I'd embed it into a Perl script and it would log onto each 5GT and make the changes.  Since no one was "hands on", no password change was required.  80% of the 5GTs could have a configuration update made within 4-5 hours.
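That two-stage derivation can be sketched roughly as follows. This is a hedged reconstruction in Python rather than the original Perl, with `hashlib` standing in for crypt(3); the salts, affixes, and truncation length are placeholders, not the real scheme (and the real output was 13-character mixed-case alphanumeric, where a hex digest is lowercase hex only):

```python
# Hypothetical reconstruction of the two-stage password scheme: only the
# store designator and the date of last change (the compliance log) are
# stored; the password itself is recomputed on demand, never written down.
import hashlib

SALT_A = "placeholder-salt-one"   # assumed values, not the real salts
SALT_B = "placeholder-salt-two"

def _mix(data, salt, length=13):
    """Stand-in for crypt(3): a deterministic, salted, fixed-length digest."""
    return hashlib.sha256((salt + data).encode()).hexdigest()[:length]

def store_password(designator, last_changed):
    # Stage 1: designator wrapped in fixed affixes -> 13-character string.
    stage1 = _mix("prefix-" + designator + "-suffix", SALT_A)
    # Stage 2: prepend the last-changed date and hash again with a new salt.
    return _mix(last_changed + stage1, SALT_B)

# The same inputs always regenerate the same password...
assert store_password("store-0042", "2006-03-15") == store_password("store-0042", "2006-03-15")
# ...while rotating the date (or changing the store) yields a different one.
assert store_password("store-0042", "2006-03-15") != store_password("store-0042", "2006-06-13")
```

A 48-hour rotation job like the one described would then just rewrite the stored date and push the newly derived password to the appliance.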

Impressive geek skills!  That's a job I'd just say "Ugh!" at and start looking for someone with skills similar to yours, to do the work.  Clint Eastwood said it, I took it to heart:  "A man's got to know his limitations."

Wow!  Food & beverages are strictly verboten from our data centers and network rooms.  Was such a policy in place at your location--even though it was from a neighboring cage?  If not, I hope that policy is deployed now.  How about a bill for recovery services & downtime impact--was it sent to the company of the Big Gulp spiller?


There was a time not that long ago...oh wait maybe it was, when smoking in the data center was okay.

My how time flies over the decades...

When "the good old days" were really "the bad old days."  Not being a smoker, I was always annoyed and irritated when my parents smoked.  Later, playing sax and electric bass in nightclubs, being exposed to that constant smoke was health-impacting.  Even living with a smoker for a few years definitely was bad for my health.

When Minnesota was an early-adopter of smoke free public places, I found it much more comfortable to dine out--no more smoke in the restaurants.  But at the same time, there were fewer paying gigs at clubs because bars were smoke free and some people stopped going out because they weren't comfortable without a cigarette in social situations, or they were just plain addicted.

Eventually that changed, bars and nightclubs saw more people coming back without need to smoke, and the gigs picked up again.

But smoke in the work place--and especially the data center--just doesn't make sense now that we know the impact it makes on health and equipment.

Level 7

My favorite superhero job started one night at 9:00 when I received a phone call at home saying that I had to rush out to a famous resort site 2 hours away and resolve a network-down situation. A film crew from Hollywood was filming a movie on the resort's famous golf course, and they needed to send the next day's script revisions back and forth to the writers back in Hollywood, the scripts being fairly large documents. I arrived on site and was taken to a conference room and introduced to the movie's head of production, who was in as foul a mood as I have ever seen a customer. She demanded to know why I hadn't fixed the problem yet (I had been on site for less than 5 minutes at this point). I assured her I would get the problem resolved as soon as possible, and placed a call to the AT&T support group (AT&T being the ISP and network monitoring vendor).

The tech informed me that he had multiple equipment failures spread throughout the resort (many square miles of property), and started giving me a list of equipment sites he wanted me to drive to so that I could start working on the various pieces of equipment. It seemed dubious to me that all of that equipment would have failed on the same evening, so I started to think of what else could cause these symptoms, including the apparently failed equipment. This was back in the early days, when a dual-DSL internet feed was considered a fast connection. I reasoned that an overload of the bandwidth would not only cause the failure of trying to transmit large files, but would also interfere with communications to the various pieces of network equipment, making it look like they had failed when it was actually a matter of not being able to reach them. So I asked the tech to look at the QoS and see if anything was eating up the bandwidth of the main DSL connection. He said that yes, he had one user downloading from a music site and 2 users downloading from pornographic sites. I asked him to blacklist those MAC addresses and to hold until I got back to him.

I then explained to the film's head of production (who had been standing right in front of me all this time with a very impatient expression and attitude) that we had discovered 3 users who were using up all of the bandwidth by downloading from music and pornographic sites. I told her that the ISP was in the process of cutting off their connection, and that the network would hopefully be up shortly. At that moment, we heard angry shouts from the conference room next door, which was also reserved for the film crew. I looked at the head of production, shouted out to the room next door, and asked them what the problem was. They replied that they had just been cut off from the network and had lost their connections. I then told the production manager to retry sending her script on her laptop. It went through in a flash. Her face got red, and she gave a mumbled thank you. She then excused herself, went into the adjoining conference room, and closed the door behind her. Needless to say, there were loud words emanating from the room, and the lady's language could have put any drunken sailor to shame. The resort's management was thankful beyond words.

I'll never forget, sometime between 1990 and 2000, having to troubleshoot exactly how my dad's PC got a virus. It was also my main gaming machine for all the telnet MUDs I played back then, so I felt quite strongly obligated to find and resolve the virus issues.  Many searches later over Lycos and Yahoo, I found the solution! I fixed the home PC's virus issues by downloading the fix from an AV vendor, which included some settings made via the Trend Micro tool that became HijackThis.

I suppose I was the original <PK>boy wonder.

Level 21

Well, I certainly don't know if this was my greatest moment or if I have even really had a greatest moment; however:

We had a client whose IIS server kept crashing, causing their site to go down.  Our technical folks started working with their tech folks to try and solve the problem, and many of the attempted solutions involved throwing more resources at the system to see if that would help; needless to say, it didn't.  Once I was pulled in, I spent about an hour going over all of the data I had from the system in Orion, seeing what I could correlate to the times the issues were happening, and I ultimately found huge spikes in current connections to IIS, thanks to an IIS SAM template we had applied to the system.  It turned out that the client's web code would choke if the number of current connections went above a specific level.

I guess you can call me "Super Sleuth" since I spend a lot of time sleuthing through data to find a root cause.

Level 8

Crypto virus attack. Those are almost built to prove SysAdmins know what they're doing. Small dental practice. We narrowed it down to patient zero very quickly and took it offline. It had already encrypted most of their file server, though. Restored the latest backup, which felt super bad at first because it was 2 weeks old, but they had been on hiatus for 10 days. Luckily, their practice management software wasn't affected. Prepare for the worst, hope for the best, and the end result was almost perfect.

Level 9

For me, one of the greatest SysAdmin moments happened almost a decade ago. I was working for a municipality with a staff of about 2800. At that time, all of our HR and Finance data was running on Informix on HP-UX. I was adding redundancy to our HP RX8600 server by adding two cells, i.e. 'blades' for RISC9000. Or rather, the HP technician was. We had made sure that everything was well backed up multiple times (15 LTO1 tapes per full backup), and shut everything down. The tech finished his job, powered it up, and all seemed OK. He left. The next day, our DBA started complaining about something called 'fuzzy checkpoints'. In essence, Informix was notifying him that write operations were taking too long. Soon users began to complain. Normally, those checkpoints took fractions of a second, but some were now 30 seconds, a minute or even more. The only change was the additional cell boards in the 8600 chassis.

I immediately called HP, and got what I would consider classic tier 1 support: "Nothing we can do - it wasn't us. You need to patch the OS, and ONLY after that will we look at it!" I tried my best, but they wouldn't budge. Something inside me set off alarm bells. I met with our CIO and gave him the update. I emphasized that I felt it a very real possibility that if I patched the server per HP's insistence, we could experience a major failure and severe outage. I got him to call HP as well, with a similar result. He asked for the worst-case scenario, and I speculated that a disk crash could leave us down for days - during a week when payroll had to be run. I got his sign-off to 'DO IT'.

I patched the OS without much difficulty, albeit a much slower than normal process. Then it was time to restart the system. HP-UX has an elaborate shutdown process, where it provides feedback on what is happening. It got to the point where it was stopping Informix... Nothing. A couple of minutes passed, and then the system forcefully shut Informix down - corrupting all tables. While the system rebooted and came back up to HP-UX, the entire HR/Payroll system was now destroyed.

No problem - we have backups, right? Made minutes before shutting down, and many other generations as well! 15 tapes were loaded into the tape library, and with 6 tape drives to restore from, the process was still running 5 hours later. Restoring log file 240-something and then - error reading tape... Job aborted. To make it worse, some of the key files we needed were on later tapes, so we couldn't get the database operational. Restore attempt #2 - failed at log 220-something. By that time it was evening, and the DBA spent the night trying to restore from backups. Different tape sets - same result. 5 attempts later, we were no closer to being operational. The next day produced similar results - interspersed with the same problems of system slowness. I took my turn the next night trying to find a good backup. By the next morning, we notified the CIO that we MIGHT NOT BE ABLE to restore the HR/Financial system. That depressing news left me sitting, eyes closed, thinking through everything that we had tried. My thoughts drifted to the tape drives. Metal heads pressing against tape, reading magnetic blips as electrical pulses. In my head, I saw it - perhaps a tape head was out of alignment, writing data in a way that caused the other drives to fail when they attempted to read the offending pattern. I shut down all drives except for one and kicked off a restore. 6 hours later, failure. The next drive turned on. Fail. Repeat. On the third attempt, after nearly 8 hours, the restore SUCCESSFULLY completed. It was now 3 days into the outage, and we were operational again - with the most current data. Nothing was lost!

We still had the slowness issue. I traced the steps of the HP tech. The rather large cells that he added slid in through the back of the chassis, dangerously close to the SAN fibre cables. I closely examined them, but they looked fine and I dropped them back into place. At that moment, the DBA burst into the datacenter. "WHAT DID YOU DO? IT'S WORKING NORMALLY AGAIN!" Now I knew, or at least suspected, something was wrong with the fiber. My boss came in asking what I did, and I gently grabbed the orange fiber and said, "I just moved these a bit." Enter DBA again - "It's broken again!" Now I knew. I replaced the fiber immediately, and everything returned to normal. I took a magnifying glass and examined the faulty fiber. A barely visible crush spot could be seen - apparently not severe enough to cause errors to be reported, but enough to cause packets to retransmit.

And the best part - everyone got paid on time!  I guess that makes me Captain FixIt.

Today I had two moments of recognition and appreciation.

Two weeks ago, a rush / high-priority project was pushed onto my team.

My boss:  "How long will it take to get a new secure network environment built and handed off to the SysAdmins?"

Me, thinking I'm going to be on vacation next week, but back the following week . . .  "We can have that for them two weeks from today, or sooner."

Boss:  "That's too long--our team can do it in a week!"

Me, thinking Yes, we should be able to, but I won't be the one doing it--another Analyst will, and he's going to be working it remotely from almost a 5-hour drive away.  IF all goes well, yes, we should be able to deliver in a week.  "OK, if you say so . . ."

Me back from vacation last week, talking to my remote peer:  "How's the new secure environment going?"

Him:  "The DC Ops crew has the new hardware in and powered and patched, but I can't get the HA going between two DC's, so I can't hand it off to the SA's yet."

Me:  "What can I do to help?"

Him:  "I don't think there's anything yet.  I'll call you if I have ideas."

A day later, Me:  "Any progress?"

Him:  "Not yet."

Today, me:  "What Layer 1 troubleshooting have you done, since HA connectivity is absent?"

Him:  "We've tried all the dark fiber between the two DC's, swapped out GBIC's, swapped out patch cables, still no link.  But all the other devices using that dark fiber are up and running, so it can't be the bundles of fiber."

Me: "I'll head up there and find the issue.  We've got nothing if we have no L1."

Two hours later I'd identified multiple failed dark fiber strands, but also one good pair.  Link came up and HA connectivity was established and the project began moving again.

Tomorrow will be the "two weeks" in which I said we could complete the project.  But I'm not an "I told you so" kind of guy.

A little help from a Fluke single-mode fiber test light showed most of the 24 dark strands were not passing light, and also showed the strands/pairs that were good.  Using the right tools and seeing the problem in person was the way to find and fix it immediately.  My good coworker 5 hours away had configured and managed everything correctly, but he couldn't do the L1 testing in person--he could only see that L1 had yet to be established--and the local DC staff weren't familiar with fiber and safely testing it.

When I returned to the office, I found another site 3 hours away had multiple medical wireless devices unresponsive after a late-night AP migration to new HA controllers.  Only one brand and model of device was affected.  The local support person was on the phone with the vendor, and I was called in to help.

Prime showed the history of the devices, and that they weren't connected.  The controller showed their MAC addresses and former IP addresses.  The APs showed when the devices went offline, and reported they were attempting to reconnect but had bad credentials.

I provided the correct credentials to the local support person, and he reentered them and got one running on the right SSID.  But it still wasn't pingable.  I had him blow away the DHCP static and dynamic settings and reboot, and just like that, the device got the right address from Infoblox and began responding to pings.

Surprisingly, more devices of that same brand and model suddenly began responding to the server wirelessly, without any action taken on anyone's part.  It was much like the bad old days when a Token was caught in the Ethernet converting router.  Somehow, resetting one problem wireless device got the rest of them working, and that proved to be the recovery process for the remainder that weren't up yet.  Happy support staff and local users were the result.  Yes, there's no explanation--YET!

Level 12

Well, back in 2009 I was in the Marine Corps and deployed to Iraq. I was mostly in charge of the base's network, but also anything else that was broken that other people couldn't fix.

We had been having power issues with our generators, so our UPS trailer and rack UPSes were all low on battery. One night the generators died again. By this time, shutting down all non-critical services could be done without much thought, but the whole datacenter went dark less than 5 minutes after the generator alarm tripped. We were dead in the water - no power and no phones for hours.

When the generators got fixed and we could turn everything back on, I found out that our primary classified SAN had apparently forgotten what a disk even was. The data was apparently all gone, and even the NetApp techs had no idea how to recover it. Just imagine losing information that could possibly mean life or death, with no hope of restoring it. We could restore most of the data from off-base storage, but that would take 48 hours or so.

After about 8 hours, an uncountable number of Rip It energy drinks, and threatening the SAN with a large wrench kept for that purpose, I was able to explain to the SAN that disks are things that hold data and it should try to read them. It came to its senses and started to rebuild itself with no reported data loss. I can't for the life of me remember how I did it, but I did receive a Navy Achievement Medal, was never without Rip Its again, and was called "The Brain" (as in Pinky and the Brain) for the rest of my enlistment.

When should we expect our 250 pts?

Level 11

Vaping also leaves a residue; I wouldn't recommend it in a data center or near sensitive equipment either.

Our organization treats vaping the same as smoking cigarettes--verboten on campus anywhere, and especially within data centers.


Vaping is only allowed in the smoking area outside the building, in the designated space.

Also known as conference room 1-S.

Level 15

Points have been awarded!

Dang, how did I miss this? Bummer.