Issue 07 – Our servers crashed!

Our servers crashed 🤯

Yelled my colleague on a phone call at 5 AM.

If you have ever worked on a SaaS product, you have probably experienced this before.

There can be multiple reasons why this happens: a bug, server load, a bot attack, etc.

The problem?

Client patience and team unavailability.

It was 5 AM when this happened, and we had worked a night shift until 3 AM, so starting an early shift felt impossible for me.

It was a 2-hour virtual event on the SaaS product we had built, and the platform crashed 1 hour into the event, with 300k+ users left waiting 😭

We had a 6-person team where everyone was working 24-hour shifts and sleeping in 2-hour gaps. No one asked them to; it was pure passion and excitement for the first virtual event on our own baby platform 🤩

Yes, you could say we should have had a backup team in place to take the morning shift. But if you run a small business, you know you have lots of constraints and a tight budget to work with 🤦🏻‍♂️

I tried to call the lead engineer, but his phone was switched off, which made things even worse.

Luckily, our backup engineer, who knows some basic DevOps and backend, picked up the call and joined me on the mission.

There was another problem. 

The platform ran on a SQL-based database, and there was no issue with the code. To fix the problem, we needed to take the whole platform down, which meant another 30 mins of downtime that could frustrate the event organizers.

On top of that, we needed 1 experienced DevOps engineer right away to set up load balancers and some server config 😮

(Instant panic attack started)

How the hell do I hire a new engineer at 7 AM?

But a choice had to be made: either we postpone the event we had spent 3 months preparing for, or we fix the problem and resume it.

Postponing was not an option, as the client made clear, and we couldn't tell them we were short by 1 person, which could make things worse for us later. You know: contracts, legal notices, etc.

I took on the challenge and started posting job ads to my network and on social media. Most people were unavailable, and the ones who were available didn't want to take the risk.

After struggling for 45 mins, I found our guardian angel 😍

He agreed to help. There were minor issues, and we were live again within the next 20 mins. Phew!

After this incident, we never heard another server complaint on any of our SaaS projects, and our guardian angel did a phenomenal job 🤗

This incident taught us why one should not take DevOps lightly, along with other underrated services like QA, cybersecurity, and technical management, which can make or break a SaaS.

Most SaaS clients I have worked with give these services the lowest priority or leave them for last.

If you are working on your own SaaS product and want to be a unicorn someday, but don't feel like investing in QA, DevOps, cybersecurity, etc., then reconsider.

Often these minor issues take 2-3 hrs to solve, which can cost you customers.

Do you want that? 

So the next time your tech guy recommends something, think twice before brushing it off 🙂
