#TSQL2sDay Disaster Recovery lessons learned the hard way

T-SQL Tuesday is a great concept from Adam Machanic (Blog | Twitter), to have SQL bloggers all discuss a common theme on the same day once a month. This month the topic is Disaster Recovery. And yes, I do capitalize those two words, they are that important.

This month’s T-SQL is being hosted by Allen Kinsel (Blog | Twitter), and you can read about it here. This is my first T-SQL Tuesday entry, and I thought I’d share some things I learned the hard way.

My first disaster recover took place early in my DBA career. At that time, we had about 45 business offices with an instance of SQL Server 7 running in each, and we had just one user database. We were taking nightly full backups, plus transaction log backups every 4 hours. Our busiest time of the month was during the first week, when most of the financial procedures were run. One night close to quitting time (and why do disasters never happen between 9 to 5?) one of the servers began to show symptoms of corruption. After some troubleshooting and discussion we determined that the best course of action was to restore the last full backup and all subsequent log backups. We though we would only lose about a hour of data, not great but acceptable at the time.

The restore of the full backup went smoothly. However the first log backup failed, as did all later log backups. To our dismay we found out that the first log backup was also corrupted and we couldn’t restore it. We had to finalize the full backup and tell the office that they just lost a day’s work during the busiest part of the month. Needless to say, they were not happy.

Which brings me to my first DR hard lesson learned: TEST YOUR BACKUPS!!!

Backups are useless unless they can be used for restores. We had become complacent because we were not seeing issues. The backup jobs were completing without errors. We thought everything was OK because we didn’t know better at the time. I don’t need to tell you how wrong we were. Even a single lost day of data can be expensive to a company. Since then I make a point of testing our backups by restoring them on a testing server. I choose a server at random,  then restore all our databases. I try to do this at least once a week. Since we converted to SQL 2005 I also made sure we were using page checksums to verify the backups. See this article in SQL Server Magazine by Paul Randal (Blog | Twitter) for a better description on how to verify your backups.

The second lesson I learned from a disaster came about three years ago. We had just finished migrating those 45 individual SQL 7 servers into 4 SQL 2005 servers. The new data servers were more powerful than the old SQL 7 boxes, and now we had begun to use SAN storage for both the databases and the location of the backups. About 4 months after we migrated our last SQL 7 server into a regional SQL 2005 we performed a firmware upgrade to a SAN controller in one of our data centers.

As you can probably guess by now, the upgrade didn’t go as planned and it introduced corruption in all databases on the servers attached to the affected SAN. What made  matters worse were that the backups were also on the SAN, and the SAN was unusable. We were backing up the database backup files to tape, but the tape schedule only ran once a day and didn’t contain all the backups needed for restoring the servers. Luckily we were able to finally get the full backups off the tape restored to a USB drive, and differential database backups were on a second local storage drive. By shipping the USB drives to a secondary site we were able to bring the databases back online with only minimal loss of data.

And that’s the second DR hard lesson I learned: MAKE SURE YOU CAN ACCESS YOUR BACKUPS!!!

Having backups don’t help you if you can’t access them. In our case we had put most of our backups on the same SAN as the database files; when we lost one we lost them all. We we also negligent in backing up the backup files to tape; only scheduling the tape backup to run once a day meant we couldn’t get all the backup files we needed.

Now I’m  not going to tell you to how you should take your database backups. You’re all smart DBA’s, and I don’t know what your disaster recovery budget is or what your comfort level for loss of data is. But I will make a few recommendations, in case you’ve just been lucky up to now:

  • Verify those backup files. They’re no good if there’re unreadable.
  • Make sure you can access your backups. You can’t restore from a backup if you can’t get to the backup. Put them in multiple sites, even mirroring them to an offsite standby site if you can.
  • Get together with all teams and put together a comprehensive disaster recover plan if you don’t have one already. Come up with plans for as many scenarios as you can think of. Don’t worry about being too “off the wall”, things happen.
  • Hold a “Lesson Learned” meeting with everyone involved if you’re unlucky enough to experience a disaster, the sooner after the event the better. Think of everything that went right and why, and everything that went wrong, and how you can benefit from the experience.