Saturday, July 19, 2008

we had a fire at work

I'm not going to mention where I work since the insurance investigation is ongoing. The next few posts are going to be a mix of business and personal since there is a lot I need to vent about, but there is also a lot that I have learned going through this process. Hopefully you'll find some good information mixed with the frustration. :-)

Last Friday, July 11th, at 2:11 AM a faulty air conditioning unit located in the attic area of the administrative building at work started an electrical fire. Fire trucks arrived at 2:20 AM and power was cut to the building while they fought the blaze. By 2:40 AM all servers had exhausted their battery backups and were offline; some gracefully, most not. The fire department allowed people into the building at 4:00 AM to start the recovery effort. There was about 2 inches of water in the building by that point.

The fire swept along the roofline and down the exterior walls. Everything in the attic was either burned or had heavy smoke damage. We had some wireless networking equipment and repeater switches that melted. About 30 file cabinets were stored in the attic, in an area directly opposite where the fire started. Luckily they didn't catch on fire, but they were very smoky.

The Process


A forklift was brought in and the server racks (we had three) were lifted out whole and moved into a building across the street. A disaster cleanup service was contacted and had a crew on site Saturday to start cleaning the servers. More people were flown in, and by Saturday evening we had a team of about 10 people disassembling and cleaning servers. It was slow going, taking 3 - 4 hours per server.

All the PC's in the building suffered extensive smoke and/or water damage. We were told by the fire department that when PVC melts it releases a gas that, when it comes in contact with electrified components, causes the electronics to become unstable. In other words, even if a PC had no apparent smoke or water damage, it had likely come in contact with this gas and its electronics would fail over time. The CPU fans on the computers located nearest the start of the fire (including everyone in IT) had melted. The decision was made early on to replace every PC and pull hard drives from the old computers for those people who really needed their data.

The Good


When this happened we were in the process of establishing a comprehensive disaster recovery plan and had mapped out an order in which servers would need to be recovered to get people back working. We had also gotten managers to identify the order in which their direct reports would need to get new computers. A paperless initiative had been started about three months ago (nobody told IT) and about half the 30 file cabinets I mentioned were empty.

The company I work for started out in the late 90's renting a small area in the back of one building. As it grew, the owners bought the building, then three others adjacent, and finally one across the street. So we had a nearby place to go. What had been the wood shop where they crated things for shipping was converted into a new office environment. A supply closet became the new computer room. Electricians, cabling guys, carpet layers and painters were brought in and by Sunday evening you would never have known it wasn't always an office space.

Shortly after I started in May 2007 we began a PC refresh cycle and I was horrified to realize it was a completely manual process. One of the first things I did was convince them to invest in Ghost Enterprise Server so we could deploy standard PC images. This has proven invaluable since we first got it, and in this case it was an absolute life saver. In two days we got 51 computers up and running. There is absolutely no way we could have done this without a cloning solution, and Ghost worked flawlessly.

The Bad


TOO MANY CHIEFS!! I'll admit our DR plan wasn't fully fleshed out, but the parts we did have complete were ignored. Managers with no responsibility for IT were telling the people doing the server cleaning to switch around server priorities based on what the manager needed -- without considering that the server they wanted couldn't be put into production until its dependent servers were. Even the IT Manager was involved in shifting things around without consulting the Senior Network Administrator or me. This caused significant delays in getting our infrastructure back online.
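
As an aside, the dependency problem is easy to show. Below is a toy sketch in Python, with completely made-up server names, of what a recovery order boils down to: a server can only go back into production once everything it depends on is already up, so reshuffling priorities mid-stream doesn't actually get anyone working sooner.

    # Toy sketch (hypothetical server names): a recovery order is just a
    # topological sort of the dependency graph.
    from collections import deque

    # server -> servers it must wait on (all names are made up)
    depends_on = {
        "dc01":      [],                 # domain controller / DNS
        "sql01":     ["dc01"],           # database server
        "file01":    ["dc01"],           # file server
        "mail01":    ["dc01"],           # mail server
        "erp-app01": ["dc01", "sql01"],  # ERP application server
    }

    def recovery_order(graph):
        """Return servers in an order where dependencies always come first."""
        remaining = {s: set(deps) for s, deps in graph.items()}
        ready = deque(s for s, deps in remaining.items() if not deps)
        order = []
        while ready:
            server = ready.popleft()
            order.append(server)
            for other, deps in remaining.items():
                if server in deps:
                    deps.discard(server)
                    if not deps and other not in order and other not in ready:
                        ready.append(other)
        if len(order) != len(graph):
            raise ValueError("Circular dependency in the recovery plan")
        return order

    print(recovery_order(depends_on))
    # e.g. ['dc01', 'sql01', 'file01', 'mail01', 'erp-app01']

Bump the ERP application server to the top of the list and nothing changes: it still can't come up until the domain controller and the database server do.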

The blame game. The PC Tech was given a list of specs and called around to local retailers to find a suitable model of computer that we could get 40 - 60 of within a couple of days. The only thing he could find was the Dell Inspiron 531. If we waited three to four days we could get some additional models. He relayed this to our interim IT Manager, who is a consultant, and was told to go ahead and get them. The problem is these are AMD Sempron machines, lower end than what people had before. The IT Manager insisted he didn't know they were Semprons; the PC Tech was equally adamant that he had been very clear about this and had even questioned whether we should get them or wait for a better model. In a meeting with the two co-owners of the company, the IT Manager said he had chosen them because it was the only thing we could get in the quantity and timeframe we needed, and suggested that some may need to be replaced within a year.

The tunnel vision. Some IT staff proved to be too highly specialized. There are only four of us: Senior Network Administrator, PC Technician, Senior Programmer (me), and Junior Programmer. The Jr. Programmer, who reports to me, was nearly untrainable. His task for three days: unbox PC's, unbox UPS's, connect the computers to the UPS's, insert a Ghost boot CD I created, and initiate a GhostCast session. Once done, log in with the domain administrator account, change the computer name, and add the user as a Standard User. Then have the user log in and set up his or her e-mail.
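
For the record, most of the hand work after the Ghost restore finishes could have been scripted. Here is a rough sketch of the idea; it assumes the Windows Support Tools (for netdom.exe) are on the freshly imaged machine, and the computer name, domain, and user in it are placeholders, not our real ones.

    # Rough sketch: automate the manual steps after a Ghost restore.
    # Assumes netdom.exe (Windows Support Tools) is installed and this
    # runs in the domain administrator's session on the new PC.
    # NEWNAME, DOMAIN, and USERNAME are placeholders for illustration.
    import subprocess

    NEWNAME = "ACCT-PC-07"   # hypothetical new computer name
    DOMAIN = "EXAMPLE"       # hypothetical NetBIOS domain name
    USERNAME = "jsmith"      # hypothetical end user

    def run(cmd):
        """Run a shell command and stop if it fails."""
        print(">", cmd)
        if subprocess.call(cmd, shell=True) != 0:
            raise SystemExit("Command failed: " + cmd)

    # 1. Rename the computer (netdom prompts for the password via /passwordd:*).
    run(("netdom renamecomputer %COMPUTERNAME% /newname:{0} "
         "/userd:{1}\\administrator /passwordd:* /force").format(NEWNAME, DOMAIN))

    # 2. Put the end user in the local Users group, i.e. a Standard User
    #    rather than a local administrator.
    run("net localgroup Users {0}\\{1} /add".format(DOMAIN, USERNAME))

    # 3. Reboot so the new name takes effect; the user can then log in
    #    and set up e-mail.
    run('shutdown -r -t 30 -c "Rebooting to apply new computer name"')

Even with a script like that, you still have to connect batteries and plug in cables by hand.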

First things first: he didn't know you have to connect the battery in the UPS. The UPS's all had a bright yellow sticker telling you this, with pictures showing how to do it. I didn't tell him and he didn't read the instructions, but he did pull off the sticker. Out of a total of 51 PC's we Ghosted in two days, he did about 10, taking nearly 45 minutes per PC. The PC Tech and I were going through a PC every 20 minutes. Of the ones the Jr. Programmer set up, I had to either fix or offer assistance on about half. He struggled with one for 20 minutes before calling me over, and I pointed out he hadn't plugged in the network cable, even though I had suggested he check it 10 minutes prior. We ended up with two different models of PC's, and he installed the wrong Ghost image on the final two PC's he set up. That took him nearly an hour to troubleshoot.

This lack of flexibility wasn't limited to IT. Other people were sitting around waiting on us to do simple things like unbox their computers. Once we told their managers about this there was a flurry of activity, but it had wasted the better part of a day while the four of us in IT were killing ourselves. So much for an "all for one" mentality.

The second-guessing has already started. There is the PC specs issue I highlighted above, but it goes much deeper. The current IT staff have only been with the company for a little over a year. The last network manager was using the corporate network as a playground to test various theories he would then present at conferences such as Black Hat. The result is a highly convoluted infrastructure that took us the better part of a year to fully understand. Much of it makes absolutely no sense and we can find no documentation of it except in the previous admin's presentations. I am pretty confident saying that nobody has anything resembling our network infrastructure in production, and that's not because it's exceptionally good.

With this as a background, I was grievously offended when the consultant interim IT Manager sat in a meeting with the CxO's, the Sr. Network Admin, and me and chided us for not doing automated offline backups or implementing a redundant data center. Those were things on our radar and we were investigating them, but the reality is we had things to get done on a day-to-day basis. I have put out a new version of our ERP software every month for the last 12 months, and we have made significant improvements in reducing the complexity and maintenance of our network environment. Our boss went from congratulating us on these efforts last week to kneecapping us this week.

The Ugly


I'm burned out and really pissed off, and the Sr. Network Admin feels the same. In the last week we have both been thrown under the bus more times than we can count by our boss. Nearly every recommendation we made was overruled. We suggested that we get all the computers in and set up before we brought all the staff back in. Instead our boss caved in to management, who thought it would be better for morale if the staff were brought in and could see the progress we were making. So we had 50 people in the way asking questions while we set up equipment and brought servers back online. We arranged meetings at 9:00 AM and 4:00 PM with management to discuss our progress and plan of attack. Our boss couldn't be bothered to attend those, but he would call us for updates while we were trying to get stuff done or interrupt us when he did show up.

By the end of the day yesterday we had reached full-on mutiny. We're doing what needs to be done, we're telling upper management only as much as they need to know (and explained we'll come back and give them full details when things aren't as critical, and they're okay with that), and we have cut our boss out of the loop entirely.

3 comments:

  1. Charles, I commend you on your hard work and feel for the tough weekend you have had.

    Keep making good business decisions and working hard. The second guessing (and Monday morning quarterbacking) is normal during crisis response. Once everything is back to normal, your work will be considered heroic. People remember those they can count on and it sounds like you are one of them.

    Use of your blog to vent and clear your thoughts is also a good idea (as long as it doesn't get back to the office).

    Hang in there, and post some more. It is very interesting reading about the tactical and human side of such crisis situations.

  2. Damn, I remember when we had the fire, and the flood, and the hurricane.
    No wonder I do so much DRP work :-)
    It isn't easy, and an outside guy is never going to like the way you do anything in times of crisis because he can't accept blame; his as& is on the line.
    So he blames you.
    As long as you have everything documented, your ideas and proposals, you will be fine in the long run. Sucks to have to do it that way, but unless you get promoted, there's not much else to do.
    Mistakes happen for sure, but good people are hard to find; prove it to your boss's boss and all should be well.

  3. Wow Charles, sounds like a rough week and like you're ready for a vacation :-)
    I know you put thought into all your decisions while also listening to input, so from that perspective there isn't much else you can do.
    I can imagine feeling burned out, and people in general get really unpleasant when stuck in situations they can't control.
    Hang in there, do what you feel is right, even if that means taking a half day off in the middle of a crisis; it's sometimes better to refuel than to run on empty.
