Sunday, July 20, 2008

Disaster Recovery: Knowing what you know

Imagine an asteroid hits your place of work and you can't recover anything. What do you do?


After our fire we realized that we had a lot of documentation that was stored in electronic-only form. We had backup tapes off-site, but we did not have a tape drive to recover them. So whatever method you use to store your documentation, make sure it is accessible in the worst-case scenario. Tape or other electronic backups may not be enough.

Think about every service in your infrastructure and plan for what you would do if any service were unavailable.


Once we started bringing servers online, we discovered that there were situations we hadn't even considered. In our case we were using the Windows certificate authority to authenticate computers against the domain controller. This we knew, but what we didn't realize is that without the CA the other servers could not talk to each other. It was a tense four hours while we waited for the CA server to come out of the cleaning process. While we were waiting, I researched and documented the process for removing the CA from our environment, then set up some VMs and tested it so I would have at least a passing familiarity with the scenario. Luckily we didn't have to do it, but it is something we should have been aware of much sooner than this. Try taking different servers offline and seeing how much of your infrastructure survives.
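Before pulling any plugs, the same exercise can be rehearsed on paper with a small dependency map. The sketch below is only an illustration: the service names and the `DEPENDS_ON` map are hypothetical and not taken from the environment described here, and `impacted_by` is a made-up helper name. It walks the map to list every service that directly or indirectly needs the one that failed.

```python
from collections import deque

# Hypothetical dependency map (an assumption for illustration only):
# each service lists the services it needs in order to work.
DEPENDS_ON = {
    "certificate_authority": [],
    "domain_controller": ["certificate_authority"],
    "file_server": ["domain_controller"],
    "mail_server": ["domain_controller"],
    "nas": [],
}


def impacted_by(failed, depends_on):
    """Return the set of services that directly or indirectly need `failed`."""
    # Invert the map: for each service, record who depends on it.
    dependents = {name: [] for name in depends_on}
    for service, needs in depends_on.items():
        for needed in needs:
            dependents.setdefault(needed, []).append(service)

    # Breadth-first walk outward from the failed service.
    impacted = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in impacted:
                impacted.add(service)
                queue.append(service)

    # The failed service itself is not reported, even if the map has a cycle.
    impacted.discard(failed)
    return impacted


if __name__ == "__main__":
    # Print what would go down with each service, one line per outage.
    for name in DEPENDS_ON:
        print(name, "->", sorted(impacted_by(name, DEPENDS_ON)))
```

A map like this is only as good as what you already know about your environment, so treat it as a checklist for the real offline tests, not a replacement for them. Keep a printed copy with the rest of your recovery documentation.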

Communicate when you need to, do what you have to.


Some decisions can be made in a vacuum. There are huge lists of things to be done, and some should be common sense. You see a stack of empty boxes. Ask if they can be broken down and taken to the dumpster. You're in IT and you see servers stacked for testing. Nobody is around and you're done with your last task. Get to testing. I was bringing our NAS appliance online and it crashed, then came up with an error. I didn't run to the system admin. I researched my options for recovering, including looking up support information online and calling the vendor, then spent two hours reinstalling the OS. The point I'm trying to make is that if you need information, ask for it, but don't ask when the task is obvious. If you're not sure, find something that is obvious.

Speak with one voice.


One of our biggest problems was that everyone thought they were in charge. We had priorities for bringing servers online and getting users set up, and they were preempted at every turn. We established a chain of command and it was not adhered to. Our efforts were severely hampered by this lack of consistency. Everyone has to move in lockstep with each other or things fall apart quickly. This isn't the time for politics or empire building.

3 comments:

  1. I have sent both of your postings to my key IT and DR team members. That's 100 people that I have told that this is "must read material".

  2. I'm glad you're finding this useful. I am sharing this precisely because I hope some others can learn from our mistakes as well as our progress.

  3. Charles, this is an excellent series, and I, like the previous commenter, have also mailed it to our other senior operations people (we're all IT in one sense or another). Many thanks for sharing.
