GroupStudy Server crash

From: Paul Borghese <pborghese_at_groupstudy.com>
Date: Thu, 12 May 2011 16:40:27 -0400 (EDT)

On Monday one of the GroupStudy servers in Atlanta had a catastrophic disk
failure. This brought down the entire site and every mailing list. The
good news is this list is impacted only minimally by the failure and
should (may?) be back to normal shortly. I apologize for any
inconvenience.

Since this is a technical list I will go into the details for those that
are interested. Please comment if you have any suggestions. On Monday I
decided it had been too long since I backed up the Atlanta server, which
is now running as a VM on a VMWare ESXi 3.5 server. At one time I had an
elaborate backup mechanism where every evening critical database files were
copied to a directory, then rsynced to a Linux server at my house. But
over the last two years we moved a number of times and the remote server
is still packed away in a storage unit.
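
A minimal Python sketch of the kind of nightly job described above: stage the
critical files in one directory, then mirror that directory with rsync. The
file names, staging path, and remote target are hypothetical placeholders, not
the original setup, and the rsync call is only executed when dry_run is False.

```python
import subprocess
import shutil
from pathlib import Path


def stage_files(sources, staging_dir):
    """Copy each source file into staging_dir and return the staged paths."""
    staging = Path(staging_dir)
    staging.mkdir(parents=True, exist_ok=True)
    staged = []
    for src in sources:
        dest = staging / Path(src).name  # files with the same name would collide
        shutil.copy2(src, dest)  # copy2 also preserves modification times
        staged.append(dest)
    return staged


def rsync_command(staging_dir, remote):
    """Build the rsync invocation that mirrors staging_dir to the remote host."""
    # -a: archive mode, -z: compress in transit, --delete: mirror deletions
    return ["rsync", "-az", "--delete", f"{Path(staging_dir)}/", remote]


def nightly_backup(sources, staging_dir, remote, dry_run=True):
    """Stage the files, then push them; with dry_run=True only return the command."""
    stage_files(sources, staging_dir)
    cmd = rsync_command(staging_dir, remote)
    if not dry_run:
        subprocess.run(cmd, check=True)  # raises if rsync exits non-zero
    return cmd
```

Run from cron each evening, something like
nightly_backup(["/path/to/dump.sql"], "/path/to/staging", "user@host:/backups/")
would reproduce the copy-then-rsync pattern; all three arguments here are
illustrative only.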

I decided since it was a VM I could simply scp the .vmdk file to my house,
thus creating a perfect backup. The problem was the VM contained
multiple 200 GB files. I noticed part of the storage size was due to a
snapshot that had been taken on the GroupStudy server. To reduce
the size of the backup, I decided to delete the snapshot. This action
should have merged the snapshot with the primary. But instead it removed
all .vmdk disks and left me with a .vmdk file that was simply 26k (instead
of 200+ GB)!! Oops. The GroupStudy disk was totally destroyed.

If the VMWare server were running on another OS, I could have simply gone
in with an undelete program and tried to recover the files. But VMWare ESXi
runs a proprietary, locked-down OS with very few tools. I hate to say this,
but my last backup of the server was made two years ago, just before I
moved (yeah, I know, but hey, it is a hobby). I called a number of people
asking for advice. I would like to thank in particular Darby Weaver who
found a VMWare guy who was quite knowledgeable. But even he was stumped.

I decided to take it slow and not do anything that might prevent a recovery.
In the evenings (I still have a day job :-) ) I started reading
everything I could find about VMWare ESXi. I have now read more VMWare
knowledge base articles than I care to admit. I am thinking about seeing
what certifications VMWare offers and simply taking the test.

VMWare does offer support, for $300/call, which I really did not want to
spend. But it was obvious I was not getting anywhere trying to figure it
out myself. So I reluctantly plunked down my credit card. In my
research I found VMWare did offer an undelete program for the ESX platform. I
was hopeful my support request would, at minimum, give me access to the
undelete program for ESXi. Or maybe some internal-use-only magic VMWare
has in its back pocket. But in all honesty, they were not much help.
The VMWare tech support is not that great. Cisco TAC will escalate your
problem until it is fixed. VMWare support seems to be for people that
don't know how to read manuals. In all fairness, it could be that my
$300 support call is not going to the same people that handle high paying
corporate outages. VMWare suggested I call a data recovery service. The
data recovery service said this happens all the time and they could
recover the data for $3-$5k. I simply do not want to spend that kind of
money.

The VMWare server contains two 1 TB disks, a primary and an extra disk.
The original GroupStudy VM was running on the primary. I used the
two-year-old backup disks to create a new GroupStudy VM on the extra disk,
thus preserving the primary to the best of my ability. Of course the
backups were created before migrating to VMWare, so none of the kernel
drivers worked out of the box.

After fixing the kernel and initrd boot files, the GroupStudy website has
been restored literally to the date of the Obama Inauguration. So welcome
back to January 2009 (quick, buy Apple stock and gold!). With regard
to the CCIE List, this actually has less impact than you would think. The
actual mailing list is running off a Linux server in Dallas, and has been
unaffected. What we lost was two years of archives. But I may be able to
get them back as the Dallas server has copies of the archives in a MySQL
DB. If they are complete, I can simply write a Perl script to extract the
archives to a format the website can use.
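
As a rough illustration of that extraction step (in Python rather than the
Perl mentioned above), the sketch below turns database rows into an mbox
file. The row layout (sender, date, subject, body) is a hypothetical schema,
since the actual MySQL tables are not described, and mbox is only an assumed
target format.

```python
import mailbox
from email.message import EmailMessage


def rows_to_mbox(rows, mbox_path):
    """Append (sender, date, subject, body) rows to an mbox file; return the count."""
    box = mailbox.mbox(mbox_path)  # created on first use if it does not exist
    count = 0
    try:
        for sender, date, subject, body in rows:
            msg = EmailMessage()
            msg["From"] = sender
            msg["Date"] = date
            msg["Subject"] = subject
            msg.set_content(body)
            box.add(msg)
            count += 1
        box.flush()  # write pending messages to disk
    finally:
        box.close()
    return count
```

The rows could come from any database cursor that yields four-column tuples;
the function itself does not depend on MySQL.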

But there are other lists that are affected more, and frankly, being a techie,
I hate to give up. We know the data is most likely still on the disk. We
just need to find it. I feel that with the disk, a hex editor, and some
voodoo I could recover the data I need. Frankly I only need one of the
backup files that were created daily, not the entire disk. The problem is
the primary disk is where the VMWare OS is located, so I can't simply
remove it. And it currently resides in a data center in Atlanta, a 10-hour
drive from my house!
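
The hex-editor approach amounts to searching a raw disk image for a known
byte signature. A small Python sketch of that search follows. It assumes,
purely for illustration, that the daily backup files were gzip archives,
which begin with the bytes 1f 8b 08; the actual format of those files is not
stated. The image is read in chunks so that a very large file never has to
fit in memory.

```python
def find_signature(image_path, signature, chunk_size=1 << 20):
    """Return the byte offsets of every occurrence of signature in the file."""
    offsets = []
    overlap = len(signature) - 1
    tail = b""
    base = 0  # file offset of the first byte of tail + chunk
    with open(image_path, "rb") as image:
        while True:
            chunk = image.read(chunk_size)
            if not chunk:
                break
            window = tail + chunk
            start = 0
            while True:
                hit = window.find(signature, start)
                if hit == -1:
                    break
                offsets.append(base + hit)
                start = hit + 1
            # keep the last few bytes so a match split across chunks is found
            tail = window[-overlap:] if overlap else b""
            base += len(window) - len(tail)
    return offsets


GZIP_MAGIC = b"\x1f\x8b\x08"  # gzip header: magic bytes plus the deflate method
```

Each offset is only a candidate: a three-byte pattern will also occur by
chance in unrelated data, so every hit would still need to be checked, for
example by trying to decompress from that position.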

So now I am trying to MacGyver my way to the disk. The extra disk
contains about 400 GB of free space. It turns out VMWare does offer disk
dump and gzip on the ESXi platform. I am disk dumping the entire primary
hard drive to the extra drive, using gzip to compress the data. I am
praying for a much better than 2:1 compression ratio! If that works I
will download the dd file and restore to another 1 TB hard drive, thus
creating a copy of the primary drive. Then I need to figure out the
VMWare partition tables and vmdk disk formats. If the 400 GB of free
space is not enough, I am considering mounting an Amazon EC2 NFS server on
the VMWare file system and trying again. I also called the VMWare support
engineer (that poor guy) and asked him to send me any documentation he can
find about the VMFS and VMDK file structures. I also found an open
source VMFS driver (http://code.google.com/p/vmfs/) that may be of use.
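
To put a number on the compression hope: if the primary is a 1 TB drive (as
the 1 TB restore target suggests) and the image has to fit in 400 GB, the
whole-disk ratio needs to be at least 1000 / 400 = 2.5:1. The Python sketch
below shows that arithmetic together with a chunked dd-plus-gzip style copy.
The paths are placeholders, and this is an illustration of the technique, not
the commands actually run on the ESXi host.

```python
import gzip


def required_ratio(source_bytes, free_bytes):
    """Smallest compression ratio that lets source_bytes fit into free_bytes."""
    return source_bytes / free_bytes


def compress_image(source_path, dest_path, chunk_size=1 << 20):
    """Stream source_path through gzip into dest_path; return (bytes_in, bytes_out)."""
    bytes_in = 0
    with open(source_path, "rb") as src, open(dest_path, "wb") as raw:
        with gzip.GzipFile(fileobj=raw, mode="wb") as gz:
            while True:
                chunk = src.read(chunk_size)
                if not chunk:
                    break
                gz.write(chunk)
                bytes_in += len(chunk)
        # the gzip trailer has been written by now; raw is still open
        bytes_out = raw.tell()
    return bytes_in, bytes_out
```

Dividing bytes_in by bytes_out gives the achieved ratio, which can be compared
against required_ratio(10**12, 400 * 10**9). Unused regions of a disk that
happen to be zero-filled compress extremely well, so the overall ratio depends
heavily on how full the drive is.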

So if you have any suggestions, please send them to me! No matter how bad
it got, I kept on thinking: at least I am not Sony!

Paul Borghese

Received on Thu May 12 2011 - 16:40:27 ART
