Wednesday, October 24, 2007

Recipe for Exchange Flambe

A flambe consists of a filling of some sort, something flammable, and an ignition source.

Ingredients



The gooey filling
2 Windows 2003 Servers running Exchange 2003 SP2, clustered
1 shared external hard drive array (RAID 5)

The flammable parts
0 IT staff who were involved in setting up current environment
0 documentation of current environment

The ignition source
1 recently hired Network Administrator who is on his way out after three months because he's a dumbass
1 new Network Administrator (to replace the above)
1 incompetent support person

Assembly Instructions



We're going to assemble our flambe in layers. First, a bit of prep work:
  1. Old Network Administrator gives New Network Administrator access to the domain administrator account. On his first day.
  2. Old Network Administrator leaves New Network Administrator to poke around Active Directory to learn the structure of things.
  3. It is Old Network Administrator's last day, so he is gone. Old Network Administrator didn't set up any of the current environment, but he had some knowledge of it.
  4. New Network Administrator realizes most servers have some level of local security implemented, so he downloads and installs a software package on all the servers.
At this point we have an unstable environment, and nobody but the New Network Administrator has any idea that anything has changed. So now we need to start working on the gooey filling.
  1. Something in the software breaks the driver for the card that connects the Exchange server to the external RAID array.
  2. The RAID array starts flashing a light indicating a connectivity error.
There are a few directions to go from here:
  • Call support
  • Reboot the servers, then call support if it doesn't reconnect
  • Power down the servers, reboot the array, reboot the servers, then call support if it doesn't reconnect
Sometimes the obvious choice is the worst. Time to add some fuel.
  1. New Network Administrator calls HP Support.
  2. Support identifies one drive has a significant number of errors.
  3. New Network Administrator is instructed to upgrade the firmware on the drive.
  4. After the firmware is upgraded the drive is offline.
  5. The drive is pulled and reseated to allow the array to reinitialize it.
  6. The RAID controller doesn't recognize the new firmware, and the drive is flagged as failed.
Connectivity to the RAID array is spotty, and the servers are failing back and forth every few seconds. The Quorum, Transaction and Data drives are on different arrays and end up getting mismatched. Let's add another layer of fuel.
  1. New Network Administrator discovers the hardware failure in Device Manager and applies all outstanding Windows Updates.
  2. Because the servers are failing back and forth about every five seconds, connectivity to the array is sporadic, which causes the drives to start thrashing.
  3. With one drive already in a failed state, another one starts flashing a warning.
The only thing left to do is light it up!
  1. New Network Administrator pulls the drive flashing a warning. While another drive is offline.
And there you have it: one very dead Exchange 2003 cluster! All it takes is a little bit of knowledge, a lot of ignorance, and an incompetent support person.
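
For anyone wondering why that last step is the match: RAID 5 keeps one parity block per stripe, so it can rebuild exactly one missing drive and no more. Here is a minimal sketch of the idea in Python, assuming a toy stripe of three data blocks plus one XOR parity block. The helper names (`xor_blocks`, `rebuild_missing`) and the sample data are made up for illustration, and a real array rotates parity across the drives, which this ignores.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(stripe):
    """Rebuild the single missing block (marked None) in a data+parity stripe.

    Raises ValueError when more than one block is missing, because one
    parity block only carries enough information to recover one loss.
    """
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) > 1:
        raise ValueError("two or more drives missing: stripe is unrecoverable")
    if not missing:
        return list(stripe)
    survivors = [block for block in stripe if block is not None]
    repaired = list(stripe)
    # XOR of every surviving block (data and parity) yields the lost block.
    repaired[missing[0]] = xor_blocks(survivors)
    return repaired


# Hypothetical stripe: three data blocks plus their parity block.
data = [b"MAIL", b"BOX1", b"LOGS"]
stripe = data + [xor_blocks(data)]

# One drive offline: the array can still reconstruct its contents.
one_failed = [stripe[0], None, stripe[2], stripe[3]]
print(rebuild_missing(one_failed)[1])  # b'BOX1'
```

Pass the same function a stripe with two `None` entries and it raises `ValueError`, which is roughly what pulling the second drive did to us.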

An HP support rep finally came on site, dug through some logs, and determined that the firmware that was installed was completely wrong for our environment. Also, there is a known problem with one particular Windows update that breaks the installer for the drivers, so new drivers can't be installed and the old ones can't be reinstalled.

He replaced the failed drive, let the array rebuild, replaced the drive that had the bad firmware, and by some miracle we were able to recover three of our five Exchange storage groups. Now the New Network Administrator and the PC Tech are piecing together backups and trying to reconstruct messages from logs on our SMTP relay.

4 comments:

  1. Miss Domino yet? This is the point at which a "no shared" cluster like the one Domino implements is worth its weight in gold.

  2. Oh, man. I'd laugh, but it would be mean.

    People can screw up anything, especially in that sort of "perfect storm" scenario, but it's true (as you know) that recovery in a Domino environment would have been soooo much easier. Array having trouble? Ok, I'll take that cluster member offline and fix it, while the other cluster members handle the load.

    But as I said, people can screw up anything - don't get me started on people who cluster multiple Domino servers running all data drives off the same SAN. *rolls eyes*

  3. A lack of sufficiently educated admins, mixed with poor company procedures, including a lack of documentation, will get anyone into trouble on any platform and/or software flavour.

    Other than that, I think your flambe would have tasted better with a side of meringue ;-)

  4. @Sean - I struggle with Outlook daily, so I'm definitely missing Notes/Domino. I never thought much about the clustering in Domino because it's just a few clicks and it just works. I've never had to take a cluster member offline for maintenance other than upgrades. Failovers just work, too.

    @Rob - I've been laughing for the past three days just to keep from crying. ;-) I cluster Domino using the same SAN, but I'm not using the same drives. Microsoft claims Exchange 2003 uses a shared nothing architecture: "Exchange 2003 back-end clusters require the use of a shared-nothing architecture. In a shared-nothing architecture, although all nodes in the cluster can access shared storage, they cannot access the same disk resource of that shared storage simultaneously."

    It's "shared nothing" but uses "shared storage". Is anyone else confused by this choice of terminology? The bottom line is all Exchange 2003 servers in the same cluster use the same drives. Exchange 2007 introduced Cluster Continuous Replication (CCR), which promises to offer Domino-style replication for Exchange.

    @Francie - Yeah, I understand that it's not necessarily Exchange's fault, at least not directly. The New Network Administrator is an MCSE, supposedly with 20+ years of experience in both large and small environments. Now I'm questioning whether he made up a lot of stuff on his resume.
