We got our first (of many) client notifications at about 7 AM on Saturday. The calls and emails were simple -- "Our web site is not available. Any idea why?"
Our web servers are located in a building across town from our primary office, so we jumped in the SUV and drove over. The moment we walked into the server room and heard a terrible whining sound coming from our primary web server, we knew exactly what had happened: a hard disk crash.
That would not have been a major problem in most instances, because we had done a full backup at midnight. For most web developers, this would have been a non-event -- a matter of simply:
- Removing the failed hard drive and replacing it with an in-house spare
- Reformatting the new hard drive
- Reloading the system image from last night's backup
We did all that, but then we entered the beautiful "real world". Remember, our database back ends are the major differentiator between our sites and those of the competition. Once we got the server back up, we had to locate and reapply every update made to the affected databases since that backup. That sounds easy, but the hard part is making sure the replayed updates don't collide with transactions occurring in real time.
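For anyone wondering what that replay looks like in practice, here is a minimal sketch in Python with SQLite. It is not our actual code -- the file name, table names, and logged statements are hypothetical stand-ins -- but it shows the basic pattern: apply everything logged since the backup inside a single all-or-nothing transaction, so live traffic sees either the restored state or the fully replayed state, never something in between.

```python
import sqlite3

def replay_updates(db_path, logged_statements):
    """Reapply changes captured after the last full backup.

    BEGIN IMMEDIATE takes the write lock up front, so concurrent
    writers are blocked until every logged statement is applied;
    the whole replay commits or rolls back as one unit.
    """
    # isolation_level=None leaves transaction control to us
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        conn.execute("BEGIN IMMEDIATE")
        for stmt, params in logged_statements:
            conn.execute(stmt, params)
        conn.commit()    # all-or-nothing
    except Exception:
        conn.rollback()  # a failed replay leaves the restored backup untouched
        raise
    finally:
        conn.close()

# Hypothetical example: two changes logged since the midnight backup.
log = [
    ("UPDATE accounts SET balance = balance + ? WHERE id = ?", (250, 7)),
    ("INSERT INTO audit (note) VALUES (?)", ("replayed after restore",)),
]
replay_updates("orders.db", log)
```

The same idea carries over to a real database server: transactional engines such as PostgreSQL (or InnoDB under MySQL) let you wrap the whole replay in one explicit transaction, with the logged statements coming from whatever change log or audit trail the application keeps.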
That being the case, a few web sites were down -- at least for updates -- for several hours. We're looking at ways to prevent anything like this from happening in the future; it was far too traumatic an event to have it happen every weekend.
The sites we still host in house now run in a more secure and more redundant environment. We've also moved a few sites previously hosted in house to a server at a major ISP -- one with guaranteed redundancy, RAID, and a large support staff.