Holiday Glitch Guide: We’re Making a List and Checking it Twice

Community Manager

The holidays are in full swing, and while we hope this time of year brings only good tidings and cheer, sometimes things don't go our way. Maybe the store sold out of the last item on your special someone's wish list, or the holiday roast overcooked. Regardless, the best we can do when our holiday plans go awry is to learn and prepare for next time so our holidays can continue without a hitch!

The same can be said for the technology solutions we develop and use—sometimes they don’t perform as perfectly as intended. So we’re making a list, and checking it twice. We want to hear from you about times this year when your hybrid or cloud applications acted more naughty than nice.

We want to know about the glitches that almost turned you into a grinch this past year. What web application problems did you spend the most time troubleshooting and debugging? Was it downtime? Slow speeds? Whatever it is, we want to know the common issues that impacted application performance.

Let us know the common glitches you experienced in 2019 by December 12, and we'll give you 250 THWACK points in exchange!

28 Comments

We use SAP, which is hosted by SAP in the SAP data center called HEC (HANA Enterprise Cloud); it's in Virginia. So I like to say we are all going to Hec in Virginia. Makes me chuckle. We continue to experience issues because we have no, and I mean zero, control over what SAP does. We didn't know this when we signed up, but if they make changes, reconfigure services, or do anything else they claim they need to do to maintain integrity, well, it's well within their service agreement. Nearly daily, at not-so-convenient times, we see a total loss of SAP; everything stops and then, just like that, starts again. I cannot even begin to count how many man-hours have been lost. We monitor what they will let us, but all in all it's been a Nightmare Before Christmas event....

What we need is a way to get a SolarWinds agent into the server farm they use for our SAP environment. But that is a topic for another day....

Merry Christmas, and happy holidays.  

Level 14

If I'm being honest, we've had one hell of a time keeping stability and performance up to par in our environment. Truth be told, it's more a story of outgrowing hardware than anything... but that was my biggest gripe over this past year. Thankfully, Christmas came early for me when we were able to upgrade our systems across the board (hardware and software), and things have improved so much!

Level 13

For us it was a cloud integration service we use to stuff data into our CRM from various data feeds.  I suppose it's really hybrid because there is an on-prem piece.  At any rate, it's been glitchy enough to be annoying without actually failing completely.  At the moment (knock on wood) it's fine but since we never did isolate the actual issue we're not completely sure we've got it fixed.  Spent way too much time, effort and frustration on it this year.

Level 9

The biggest "glitch" we had this year was spanning-tree misconfiguration.  The previous engineer did not configure spanning tree properly.  So when we added in new switches, it caused a spanning-tree convergence and took the network down briefly.

Moving the files and e-mail for 17,000 employees from internal DCs to the cloud proved to be a less-than-comfortable experience, due in great part to proofs of concept that were demonstrated with small numbers and sizes of files but failed when scaled to an enterprise-sized solution. We ran out of NAT/PAT pools on firewalls during the data transfers, ran out of physical firewall resources, and saw huge slowness in Outlook and Word performance compared to when we hosted it internally. Two years later it's still an issue, and we're looking to see if changing firewall platforms (brands and models) will improve performance to something closer to what we had pre-cloud. All that slowness is an unadvertised / unexpected result of gaining all the "benefits" of the cloud.

Level 12

At my work it is strictly forbidden to use any type of external HD or jump drive (USB), no matter what. You can get a memo approving said use, but it's practically nonexistent given all the workarounds you have to go through: meeting after meeting after meeting, with no end in sight.

For example:

iTunes is needed to reset iPhones that are locked out. You have to be able to connect the phone to the system to do a factory reset... and how do you connect a cell phone to a computer? Via USB. Still trying to move forward with this, but I know it's just going to be meetings, meetings, and even more meetings....

I see the light, but the journey has been "fun."

Level 10

We have had fun dealing with all the issues of implementing Cisco ISE. Not really ISE's glitches, but HP's! HP has no discernible mapping of their OUIs, so dynamic mapping via MAC addresses is impossible. Printers and laptops being dynamically added to the wrong groups has caused months of headaches. We're almost through it, though.

Level 13

VMware Datastore bug was the biggest pain!

With certain patch levels, your VMs start getting increased datastore latency. SolarWinds had been showing it for ages, but no one believed me, so they looked at everything apart from what I was telling them: network, storage, disk, VMs in vCenter, testing it all, and it came back clean every time.

It was only when one other person and I forced them to look directly at the VM-to-datastore path and reproduced the latency on the host that they listened. The slowness took over two months to resolve.

Level 8

The greatest challenge to land on my desk this year was very unstable Cisco video conferencing performance, which turned out to be caused by calls being routed to the other side of the world, even when you were making a call within the same site. A relatively simple migration to a cloud service resolved the issue, and finally my phone has gone quiet. (According to the execs, VC is the most critical service in the company!)

Earlier this year, our Meraki cloud dashboard decided that it needed a vacation and started randomly sending out notifications, which were anything from "Your license is expiring" to "Such and such switch has gone offline" (when it really hadn't).  Bad part about the Meraki cloud is that we have no control over it.  So, I sent a note to Meraki and they looked into it.  Turns out that there was a patch that had not been applied to our cloud and once that was done, everything was aces.  Haven't had the Gremlins rear their ugly heads since.  (Knock on wood, don't feed them after midnight, etc.)

Level 9

My biggest glitches were over the deployment of Versa SD-WAN across 20-odd sites in the US. Carrier issues, config issues, a general lack of understanding by our "Partner" on deploying the service... It was a challenge I'm glad I'm done with.

MVP

I take care of our monitoring solution, mostly SolarWinds Orion, though we do have a few other small pieces apart from it. I need to make sure everything is in place before I go on vacation (first thing on the list is to get a backup resource who can manage the basic stuff while I am away).

Health checks, stats, any ongoing issues, custom monitors, all new requests in the pipeline, no new upgrade changes, etc.; especially the scheduled tasks that need to trigger biweekly or monthly, and no last-minute dashboards, custom alerts, custom reports, or testing...

Ticketing and all third-party integrations are working as expected, the CMDB is fine, the business layer looks good.

After all this, if something really goes down, then LOGIN... hmmm.

Level 9

The biggest glitch this year was in the switching of our payroll system. The project wasn't run by IT; it was run by HR & Finance. We strongly suggested they run in parallel for a couple of pay periods to validate that everything was accurate, but the vendor told them it would be fine. Four paychecks later, they still don't have all the bugs worked out, but hey, the project was a success because it was implemented on time. Geez.

Worst part: most of the organization thinks it was IT's fault, even though we were stonewalled out of the project.

Level 11

The biggest glitch/headache was forgetting that our hub Meraki was accepting beta firmware. Early on we did need beta firmware for a feature, but we could have switched off the beta track once that release became stable. We forgot the checkbox and got a beta release that broke a few things. After trying and trying, we just had to roll back. Nothing that broke was critical, just annoying.

We had a vendor sell us a product that they knew wouldn't scale to our environment, but they took the "It's better to beg forgiveness" mindset in order to secure the sale. They also added hardware to the order that wasn't necessary and also included faulty hardware at every site that they won't replace. Their product took down our entire enterprise network three times and our only technical contact was a workstation technician who had been promoted to "engineer" for this project. He did not understand networking, virtualization, or enterprise technology at all and we felt terrible that he had been thrown under the bus by this company. I very nearly lost my mind.

Level 8

A certain ESXi patch, along with (supposedly) some specific Dell firmware versions, caused some of our ESXi hosts to become zombies. VMs ran, but the host was basically offline: no snapshots, no console... We had to hard-bounce the host when that happened. Not a fun few months.

Level 7

Company-wide emails disappearing out of everyone's inboxes... Turns out that if a user presses "Report Phish" in our email filter, it quarantines the email for that recipient and removes it from their mailbox... but on company-wide emails, everyone is a recipient, so the filter was quarantining the message from everyone's mailboxes if even one user hit "Report Phish"...

Level 8

Biggest "glitch" this year was user configs being deleted by our UCaaS provider and getting blamed for it for almost a month while they tried to fix it. I had to go in and completely restructure our account and when they found out a tech support agent had caused the issue, there was no apology. Needless to say they are on thin ice with us and we will be looking for another provider soon.

Users have Office 365 installed locally on their computers and connect to a fully on-premises Exchange environment. SharePoint is in the cloud, and a "latency" issue stopped Outlook from opening if they had previously connected to specific resource types.

Level 10

Purchased airline tickets to spend time with my daughter... booked the wrong flight (PM, not AM). I canceled it and bought another set, trying to save money, and ended up paying twice. Still working on reimbursement of the ticket/insurance and the seat assignment. Not good...

Level 11

This last weekend, Black Friday 2019, I flew to Europe to help move a data center to a new location. Flew all night, landed in London at 6:50 AM and in the old data center by 11 AM UK time. We finished reconstructing what was left of the office without a data center so they could continue to work with a new core switch setup. No problems.

Next, we traveled an hour to the new data center and started reconstructing the racks. Luckily, the movers were very professional and were able to move the racks as assembled units, which saved a massive amount of time on both ends.

However, the Unix system cluster wouldn't fire up. It had lost its profile. No worries, fire up the other half of the cluster, but it also gave the no-profile message. At 2 AM we finally got a tech on-site, and he immediately told us the error message was a red herring. He spent half an hour and found the issue: a cable inside the chassis, one that was never touched and that you can't even see, had shaken loose in transit. He powered on the system and it came up. The cluster partner also powered up.

The expert's final determination: The system will run on one power supply but it won't power up on one. The cluster will run on one chassis but it won't power up unless it is 100% healthy.

Level 16

My team is working on something like the Y2K glitch in one of our tools right now. It needs to be fixed before January 1st.

Level 12

Merging 4 different hospital systems into one.  All sorts of odd things, but we got it all sorted.

Level 9

The biggest issue we have experienced to date was Cyber Monday. The employees who were at work spent most of the morning shopping online. This impacted our Azure databases and slowed down the network. A quick modification of the firewall resolved the issue.

Level 8

I remember one time I wanted to upgrade to Thwack backpack 1.0, but I didn't have enough Thwack points. I saved for a long time, and just as I had enough, the Thwack backpack was out of stock! What was happening? Was there a new version coming? I foolishly spent my points elsewhere, a classic, well-documented mistake. But then Thwack backpack 2.0 was released! I started saving again. I finally had enough points, but then the error recurred, although this time it was different: the shop said I was outside the shipping area and that they currently do not deliver to the United Kingdom! No DR was available, what a disaster! There is no known fix currently; I have to wait for a new version of Thwack backpack, maybe 3.0? Lesson learnt, though: I will save my points and not adopt other new releases so quickly this time.

Level 11

Our biggest glitch was likely moving to a new ITSM tool. Not so much that there were application issues, but the project was on a short timeline, so not everything could be thought of or tested. Of course, with something like this there is an adjustment period for all users. There will be additional work and development done on an ongoing basis to relieve pain points. From an application standpoint, there is a slow connection to the application prior to the SAML authentication.

Level 12

There were some common basics: times of high CPU, memory, or disk space usage, etc. The alert thresholds were not set properly, so users were contacting us before our monitoring tools did.

Level 12

Moving to Amazon WorkDocs was a challenging task: early-morning data transfers where some of the data wouldn't upload as expected and would take hours to complete.