Today I wanted to bring up one of the potential pitfalls when you're creating a fully virtualized environment. This past weekend we had to cut building power for an extended period of time, so the network administrator brought down everything in our server room. As he brought everything back online he realized that Virtual Center, the control console for VMware ESX, could not talk to the SAN because it required DNS resolution.
The Problem
Our DNS servers are virtualized with storage on the SAN. He ran into a chicken-and-egg situation: VMware needed DNS to reach the SAN, but the DNS servers needed the SAN to boot. It took him a while to realize that DNS was the issue. The logs on the SAN side simply said "Could not connect ISCSI LUN". On the VMware side the virtual machines said "storage not available". Figuring out why the two were unable to connect took some careful analysis. Solving it proved difficult because our departmental wiki also used SAN storage, so he had no access to our documentation. In a flash he found himself back in the same situation he was in after the fire, when he could not access critical documentation because the servers holding it were not available.
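A circular dependency like this is easy to miss until everything is powered off at once. As a rough sketch (not something we actually ran), you could write down what each service needs before it can start and let a few lines of Python look for a loop. The service names in DEPENDS_ON are made up for illustration and only loosely mirror our setup.

```python
# Hypothetical start-up dependency map: each service lists what it needs first.
# The names are illustrative, not taken from a real inventory.
DEPENDS_ON = {
    "virtual_center": ["dns_vm"],
    "dns_vm": ["san_connection"],      # the DNS guest's disk lives on the SAN
    "san_connection": ["dns_vm"],      # reaching the SAN needs name resolution
    "wiki": ["san_connection"],
}


def find_cycle(graph):
    """Return one dependency cycle as a list of names, or None if there is none."""
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {}
    path = []

    def visit(node):
        state[node] = IN_PROGRESS
        path.append(node)
        for dep in graph.get(node, []):
            dep_state = state.get(dep, UNSEEN)
            if dep_state == IN_PROGRESS:
                # dep is already on the current path, so we have looped back.
                return path[path.index(dep):] + [dep]
            if dep_state == UNSEEN:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for node in graph:
        if state.get(node, UNSEEN) == UNSEEN:
            found = visit(node)
            if found:
                return found
    return None


if __name__ == "__main__":
    cycle = find_cycle(DEPENDS_ON)
    print(" -> ".join(cycle) if cycle else "no circular dependencies")
```

Any cycle it reports is a pair (or chain) of services that cannot both come up from a cold start without outside help.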
The Solution
So how did he solve it? Luckily he still had the old primary domain controller hanging around, which had all the DNS information. He was extremely lucky, and he knows it. To keep from having to rely on luck, how should you configure your VMware environment so this doesn't happen to you? There are a few ways to tackle it.
Use local storage for your virtualized name servers.
Pros
- Name servers will load without SAN access.
- Resilient to SAN outages.
Cons
- Cannot mix in guest VMs that require SAN storage. The iSCSI initiator in VMware ESX loads when ESX boots, before the DNS guest is running. If your DNS server shares a physical host with another VM that requires SAN storage, the guest on SAN storage will not be able to start.
Use a non-virtualized DNS server.
Pros
- Resilient to SAN outages.
Cons
- If using a Windows server, also requires you to run Active Directory services.
Use hosts files.
Pros
- Resilient to SAN outages.
- May improve performance slightly, since lookups are always answered from the local hosts file.
Cons
- Requires you to add hosts files to the Virtual Center server, the SAN server, and every ESX host server.
- Can be a maintenance burden if your environment changes frequently and you constantly have to add or remove ESX hosts.
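One way to keep the maintenance burden down is to generate the hosts entries from a single list instead of editing each machine by hand. Here is a minimal sketch, assuming a small inventory of name-to-address pairs. The hostnames and the 192.0.2.x addresses in INVENTORY are placeholders (that range is reserved for documentation), and render_hosts is a helper name invented for this example.

```python
import ipaddress

# Hypothetical inventory: placeholder names and documentation-only addresses.
INVENTORY = {
    "virtualcenter.example.com": "192.0.2.10",
    "san01.example.com": "192.0.2.20",
    "esx01.example.com": "192.0.2.31",
    "esx02.example.com": "192.0.2.32",
}


def render_hosts(inventory):
    """Return hosts-file text with one 'IP<TAB>fqdn<TAB>shortname' line per entry."""
    lines = []
    for fqdn, ip in sorted(inventory.items()):
        # Fail loudly on a malformed address rather than writing a bad entry.
        ipaddress.ip_address(ip)
        short_name = fqdn.split(".")[0]
        lines.append(f"{ip}\t{fqdn}\t{short_name}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the block so it can be reviewed before being appended to a hosts file.
    print(render_hosts(INVENTORY), end="")
```

The output would still have to be copied to each machine's hosts file, but every machine then gets the same reviewed set of entries.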
We have opted for the last option. Our VMware host environment is fairly static, so maintaining hosts files will be a minimal maintenance issue. The resilience we gain from it makes it very worthwhile. Oh, and we printed a copy of our wiki page that has all the hostnames and IP addresses of every server we have, and put it in the safe. :-) You do have a similar list, and a fireproof safe... right?