Amazon has posted an announcement regarding what happened last weekend with their S3 storage service and its roughly eight hours of downtime. Our sister site CenterNetworks covered the outage extensively. All told, the downtime ran about 8 hours and 20 minutes, from 8:40am to 5:00pm Pacific Time. They call it an "availability event" – I need to add this to my list of synonyms for the words dead, down, outage and not working.
Here’s their final conclusion:
We’ve now determined that message corruption was the cause of the server-to-server communication problems. More specifically, we found that there were a handful of messages on Sunday morning that had a single bit corrupted such that the message was still intelligible, but the system state information was incorrect. We use MD5 checksums throughout the system, for example, to prevent, detect, and recover from corruption that can occur during receipt, storage, and retrieval of customers’ objects. However, we didn’t have the same protection in place to detect whether this particular internal state information had been corrupted. As a result, when the corruption occurred, we didn’t detect it and it spread throughout the system causing the symptoms described above. We hadn’t encountered server-to-server communication issues of this scale before and, as a result, it took some time during the event to diagnose and recover from it.
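To see why a checksum on internal messages would have caught this, here's a minimal sketch (not Amazon's actual code; the message format and function names are hypothetical) of MD5-guarding a message so a single flipped bit is detected on receipt:

```python
import hashlib

def checksummed(payload: bytes) -> bytes:
    """Prepend an MD5 digest so the receiver can verify integrity."""
    return hashlib.md5(payload).digest() + payload

def verify(message: bytes) -> bytes:
    """Return the payload if the checksum matches, else raise."""
    digest, payload = message[:16], message[16:]
    if hashlib.md5(payload).digest() != digest:
        raise ValueError("message corrupted in transit")
    return payload

# A single flipped bit still yields an intelligible payload,
# but the checksum no longer matches, so the receiver rejects it.
msg = checksummed(b"server state: healthy")
corrupted = bytearray(msg)
corrupted[16] ^= 0x01  # flip one bit inside the payload
try:
    verify(bytes(corrupted))
except ValueError:
    print("corruption detected")
```

The point of Amazon's post-mortem is that they did exactly this for customer objects but not for this particular piece of internal state, so the bad bit propagated unchecked.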
As cloud computing becomes more mainstream, will we see more downtime like this as developers move to this type of hosting solution over more traditional options?