Tyler Ward ·
Dec 7, 2017 · 3 min read
At amazee.io we value transparency in all things. We’d like to take some time to explore why outages happen and explain the processes we follow to make sure they are resolved quickly.
There are three main categories of outage causes, and only one of them is directly within our control. But no matter why your site is down, we will be right next to you doing everything we can to get things up and running again.
The first thing that can cause an outage is a problem with the services that we, and therefore our clients, rely on. This could be something like a hardware problem at one of our infrastructure partners, intercontinental networking issues, or even DNS outages. In these cases we see the problem pretty quickly, thanks to our monitoring and notification systems. At this point we report the problem to the partner and notify any clients that might be affected. We also report it on our status page, which is broadcast on our status Twitter account. From there we focus on ongoing monitoring, keeping you informed, and being ready to jump into action if we can do anything to help get things back on track.
The second category of things that can cause an outage is things our clients do. We monitor every site we host, and our systems let us know quickly if there’s a problem. If we see that a site is down, we check for error codes and then check for routine activities that would explain it, such as a deployment or maintenance. If no routine updates are scheduled, we reach out to the client directly. Luckily, with our chat support, this is as easy as sending a message in Slack. If the outage wasn’t expected, we can move right into support mode. Perhaps the customer was trying to roll out a major new feature to their site or application and things didn’t go as smoothly as planned. If it doesn’t work, we are always happy to help.
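The triage order described above (error code first, then scheduled activity, then contacting the client) can be sketched in a few lines of Python. This is only an illustration and not our production monitoring code: the names `Action`, `Window`, and `triage` are hypothetical, and treating a missing response or a 5xx status as “down” is an assumption made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import List, Optional


class Action(Enum):
    NONE = "site is healthy, nothing to do"
    WAIT = "expected downtime, keep monitoring"
    CONTACT_CLIENT = "unexpected downtime, message the client"


@dataclass
class Window:
    """A scheduled deployment or maintenance window (hypothetical shape)."""
    start: datetime
    end: datetime


def triage(status_code: Optional[int], now: datetime, windows: List[Window]) -> Action:
    """Decide what to do about one monitoring result."""
    # Assumption: None means "no response at all"; 5xx codes count as down.
    down = status_code is None or status_code >= 500
    if not down:
        return Action.NONE
    # Downtime inside a scheduled window is routine, so we only keep watching.
    if any(w.start <= now <= w.end for w in windows):
        return Action.WAIT
    # Nothing was scheduled, so the next step is to reach out to the client.
    return Action.CONTACT_CLIENT
```

Keeping the decision in one small function with no network calls makes each branch easy to check on its own; the actual probing and the Slack message would live elsewhere.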
The third kind of outage-causing problem is the kind that is our fault. This doesn’t happen often, but when it does we bring all of our resources and expertise to bear to communicate with clients clearly and solve the problem quickly. Our first step in these situations is to post to our status page, which is also pushed to our status Twitter account. We create an incident on the page and update it continuously as we dig into what caused the problem. From there we go to work!
Sometimes the root of the problem is simple, such as a typo in some new code that was not picked up during testing, and can be easily solved by one engineer. If it is a more complicated issue, we can call everyone in. If necessary, there can be four or five engineers on a video call together, sharing screens and working it out: a virtual think tank of digital problem solving. Once we’ve sorted out the issue, we update the incident on our status page and then monitor our systems closely for stability.
If the problem was our fault and affected our clients, we conduct a post mortem and write it up in a document that clearly outlines what happened, when we let people know, how we addressed the problem, and what we are doing to prevent similar problems in the future.
One common way we work out what to change is by asking “why” up to five times. For example:
Nginx stopped working on a server. Why?
There was an error in a config file. Why?
There was a typo in the API. Why?
There was no test for that setting. Why?
From here, our course of action is clear: we implement a test for that setting. Sometimes the solution is to add another check to our monitoring systems. Whatever the problem is, we do our best to make sure it’s avoidable in the future.
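To make “implement a test for that setting” concrete, here is a minimal Python sketch of a check that would catch a mistyped setting before it reaches a config file. It is an illustration under stated assumptions, not the actual fix from any incident: `ALLOWED_KEYS`, `validate_settings`, and the setting names are all hypothetical.

```python
from typing import Dict, List

# Hypothetical set of settings that the config template understands.
ALLOWED_KEYS = {"server_name", "listen_port", "client_max_body_size"}


def validate_settings(settings: Dict[str, object]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    # A mistyped key would otherwise be silently ignored or break the config.
    for key in sorted(settings):
        if key not in ALLOWED_KEYS:
            problems.append(f"unknown setting: {key}")
    # Example of a per-field test: the port must be an integer in range.
    port = settings.get("listen_port")
    if port is not None and not (isinstance(port, int) and 1 <= port <= 65535):
        problems.append(
            f"listen_port must be an integer from 1 to 65535, got {port!r}"
        )
    return problems
```

A check like this could run in the test suite or just before a config file is written, so a typo is rejected at that point instead of stopping Nginx on a live server.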
While many companies have status pages and issue post mortems, we also reach out to our affected clients individually to make sure our communication and support are personal, accessible, and constant. Questions about support or offerings from amazee.io? Hop into our Slack, where we’re always there to help.