We would like to share an incident that happened to one of our clients. The server hosting his website crashed, and he later told us the story.
Amazon usually sends an email when it needs to perform maintenance on an instance. The email arrives with a subject line like:
“Amazon EC2 Instance scheduled for retirement [AWS Account: xxxxxxxx]”
It is generally a request to stop the server and start it again (a reboot is not enough), because the physical host it runs on is going to be retired. Stopping and starting the server makes it come up on a different physical host.
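As a rough illustration of the stop/start cycle (as opposed to a reboot), the sketch below uses the method names of the boto3 EC2 client (`stop_instances`, `describe_instances`, `start_instances`). The function names, the 10-minute timeout and the 15-second polling interval are assumptions made for this example, and it is not what the client actually ran.

```python
import time


def instance_state(ec2, instance_id):
    """Return the instance state name, e.g. 'running', 'stopping', 'stopped'."""
    resp = ec2.describe_instances(InstanceIds=[instance_id])
    return resp["Reservations"][0]["Instances"][0]["State"]["Name"]


def stop_then_start(ec2, instance_id, timeout_s=600, poll_s=15, sleep=time.sleep):
    """Stop the instance, wait until it is 'stopped', then start it again.

    `ec2` is assumed to behave like a boto3 EC2 client.
    Returns True if the instance was started again, or False if it had
    still not stopped after roughly `timeout_s` seconds.
    """
    ec2.stop_instances(InstanceIds=[instance_id])
    waited = 0
    while instance_state(ec2, instance_id) != "stopped":
        if waited >= timeout_s:
            # The instance is stuck; the caller decides what to do next.
            return False
        sleep(poll_s)
        waited += poll_s
    ec2.start_instances(InstanceIds=[instance_id])
    return True
```

With boto3 installed and credentials configured, `ec2` would be `boto3.client("ec2")`. A `False` result means the stop did not complete in time, which is the point at which a restore from backup becomes worth considering.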
“We received this email asking us to carry out this procedure on our website’s server and, to our surprise, even after several ‘shut down’ and ‘forced shutdown’ commands, the server did not stop. We waited 10 minutes and the server was still stuck in the ‘shutting down’ status.”
What can you do when the server is no longer accessible?
- If you have a paid support plan, you can open a ticket asking for an investigation into why the server did not stop. On ‘Developer’ support, the best response time is 12 hours (although every time we have needed support, the response came much sooner);
- If you don’t have support, the best option is to post on the forums;
- But since the site was critical and we could not afford to wait, we decided to restore the backup.
Our backups are run by the Cloud8 scheduler. Cloud8 organizes backups by date and saves the server settings.
To restore the backup, we simply selected the original Elastic IP and requested the creation of a new server. Cloud8 created the server with the same settings as before and then associated the IP. The server was back online within a few minutes.
More than 45 minutes later, the old server finally stopped and could then be started again.
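For readers without a tool like Cloud8, the restore described above can be approximated by hand. The following is only a sketch under assumptions: it uses boto3 EC2 client method names (`run_instances`, `get_waiter`, `associate_address`), it assumes the backup exists as an AMI, and the `settings` dictionary with its keys is a hypothetical structure invented for this example. It does not represent how Cloud8 works internally.

```python
def restore_from_backup(ec2, image_id, settings, allocation_id):
    """Launch a replacement server from a backup image and move the Elastic IP to it.

    `ec2` is assumed to behave like a boto3 EC2 client.
    `settings` is a hypothetical dict of previously saved server settings.
    `allocation_id` identifies the original Elastic IP.
    Returns the ID of the new instance.
    """
    resp = ec2.run_instances(
        ImageId=image_id,
        InstanceType=settings["instance_type"],
        KeyName=settings["key_name"],
        SecurityGroupIds=settings["security_group_ids"],
        SubnetId=settings["subnet_id"],
        MinCount=1,
        MaxCount=1,
    )
    new_id = resp["Instances"][0]["InstanceId"]

    # Wait until the new instance is running before moving the address.
    ec2.get_waiter("instance_running").wait(InstanceIds=[new_id])

    # Allow the address to move even if it is still attached to the old,
    # stuck server.
    ec2.associate_address(
        InstanceId=new_id,
        AllocationId=allocation_id,
        AllowReassociation=True,
    )
    return new_id
```

Having the settings saved ahead of time is what keeps this step short: nothing has to be looked up while the site is down.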
Lessons learned
- Always have a backup. It doesn’t matter whether it is a full server copy such as the ones Amazon makes, or an internal script or agent that copies the data to another location;
- Save the server configuration. In an urgent crisis, the last thing you want to do is hunt for the server settings (IP, Security Group, Access Key, VPC). The priority is to get the server back online, and to do it the RIGHT way;
- Keep an independent copy outside the server. In other words, do not rely on an extra disk attached to the server. If a problem occurs and you cannot access the disk, or the server fails to stop, the disk may remain in a ‘disconnecting…’ state for several minutes;
- Always separate web servers and databases. In this case, the client had no critical data on this server, so creating a new one did not cause any data loss. Even so, once the problem is resolved it is still possible to start the crashed server and resynchronize the data written since the last backup;
- In an ideal world, make your website/system redundant and resilient. There are several how-to articles (see Web Application Hosting: Best Practices).
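The advice about saving the server configuration can be made concrete with a small script. This is a minimal sketch, assuming the input is one instance entry in the shape returned by the EC2 DescribeInstances API (for example via boto3's `describe_instances`). The function names and the output keys are hypothetical choices for this example, and `key_name` refers to the key pair name.

```python
import json


def snapshot_settings(instance):
    """Pick out the settings needed to recreate a server.

    `instance` is assumed to be one entry of Reservations[].Instances[]
    from a DescribeInstances response.
    """
    return {
        "instance_id": instance["InstanceId"],
        "instance_type": instance["InstanceType"],
        "image_id": instance["ImageId"],
        "key_name": instance.get("KeyName"),
        "vpc_id": instance.get("VpcId"),
        "subnet_id": instance.get("SubnetId"),
        "availability_zone": instance["Placement"]["AvailabilityZone"],
        "security_group_ids": [
            group["GroupId"] for group in instance.get("SecurityGroups", [])
        ],
        "public_ip": instance.get("PublicIpAddress"),
    }


def save_settings(instance, path):
    """Write the settings to a JSON file kept somewhere outside the server."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(snapshot_settings(instance), fh, indent=2, sort_keys=True)
```

Running something like this on a schedule, and storing the file away from the server itself, means the settings are at hand when a restore is needed.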