Re: GroupStudy Server crash

From: Haroon <itguy.pro_at_gmail.com>
Date: Thu, 12 May 2011 18:32:21 -0400

Paul,

That's some ordeal... sorry to hear!

Your backup strategy is great. I also use rsync to back up over 110GB to
remote servers twice a day, but my stuff (Linux server, MySQL DBs, etc.) is
not on a VM.

ESXi 3.5 comes with very few tools for recovery.

Have you tried enabling SSH on ESXi? That may give you more control and let
you go under the hood as far as VMware is concerned.

As for your datacenters, which one do you use in Dallas? Softlayer?
Instead of copying 200GB to your house (residential cable/DSL?), I can
recommend a vendor that I use to back up data right in a Dallas datacenter
using rsync.

Please keep us posted.

Thanks,

Haroon

On Thu, May 12, 2011 at 2:00 PM, ccieagent <ccieagent_at_verizon.net> wrote:

> Paul,
> Good to see you got it working again. I was beginning to wonder if the
> pressure of our joining GroupStudy and the CCIE Flyer was going to happen!
> LOL
> Sorry to hear about your ordeal. Talk to you soon.
>
> -----Original Message-----
> From: nobody_at_groupstudy.com [mailto:nobody_at_groupstudy.com]
> On Behalf Of Paul Borghese
> Sent: Thursday, May 12, 2011 4:40 PM
> To: ccielab_at_groupstudy.com
> Subject: GroupStudy Server crash
>
> On Monday one of the GroupStudy servers in Atlanta had a catastrophic disk
> failure. This brought down the entire site and every mailing list. The
> good news is this list is impacted only minimally by the failure and
> should (may?) be back to normal shortly. I apologize for any
> inconvenience.
>
> Since this is a technical list I will go into the details for those that
> are interested. Please comment if you have any suggestions. On Monday I
> decided it had been too long since I backed up the Atlanta server, which
> is now running as a VM on a VMWare ESXi 3.5 server. At one time I had an
> elaborate backup mechanism where every evening critical database files
> were copied to a directory, then rsynced to a Linux server at my house.
> But over the last two years we moved a number of times and the remote
> server is still packed away in a storage unit.
>
> I decided since it was a VM I could simply scp the .vmdk file to my house,
> thus creating a perfect backup. The problem was the VM contained
> multiple 200 GB files. I noticed part of the storage size was due to a
> snapshot that had been taken on the GroupStudy server. To reduce the size
> of the backup, I decided to delete the snapshot. This action should have
> merged the snapshot with the primary. But instead it removed all .vmdk
> disks and left me with a .vmdk file that was only 26 KB (instead of
> 200+ GB)!! Oops. The GroupStudy disk was totally destroyed.
>
> If the VMWare server were running on another OS, I could have simply gone
> in with an undelete program and tried to recover the files. But VMWare
> ESXi runs a proprietary locked-down OS with very few tools. I hate to say
> this, but my last backup of the server was made two years ago, just before
> I moved (yeah, I know, but hey, it is a hobby). I called a number of
> people asking for advice. I would like to thank in particular Darby
> Weaver, who found a VMWare guy who was quite knowledgeable. But even he
> was stumped.
>
> I decided to take it slow and not do anything that might prevent a
> recovery. In the evenings (I still have a day job :-) ) I started reading
> everything I could find about VMWare ESXi. I have now read more VMWare
> knowledge base articles than I care to admit. I am thinking about seeing
> what certifications VMWare offers and simply taking the test.
>
> VMWare does offer support, for $300/call, which I really did not want to
> spend. But it was obvious I was not getting anywhere trying to figure it
> out myself. So I reluctantly plunked down my credit card. In my
> research I found VMWare did offer an undelete program for the ESX
> platform. I was hopeful my support request would, at minimum, give me
> access to the undelete program for ESXi. Or maybe some internal-use-only
> magic VMWare has in their back pocket. But in all honesty, they were not
> much help. The VMWare tech support is not that great. Cisco TAC will
> escalate your problem until it is fixed. VMWare support seems to be for
> people that don't know how to read manuals. In all fairness, it could be
> that my $300 support call is not going to the same people that handle
> high-paying corporate outages. VMWare suggested I call a data recovery
> service. The data recovery service said this happens all the time and
> they could recover the data for $3k-$5k. I simply do not want to spend
> that kind of money.
>
> The VMWare server contains two 1 TB disks, a primary and an extra disk.
> The original GroupStudy VM was running on the primary. I used the
> two-year-old backup disks to create a new GroupStudy VM on the extra disk,
> thus preserving the primary to the best of my ability. Of course the
> backups were created before migrating to VMWare, so none of the kernel
> drivers worked out of the box.
>
> After fixing the kernel and initrd boot files, the GroupStudy website has
> been restored literally to the date of the Obama Inauguration. So welcome
> back to January 2009 (quick, buy Apple stock and gold!). With regards
> to the CCIE List, this actually has less impact than you would think. The
> actual mailing list is running off a Linux server in Dallas, and has been
> unaffected. What we lost was two years of archives. But I may be able to
> get them back as the Dallas server has copies of the archives in a MySQL
> DB. If they are complete, I can simply write a Perl script to extract the
> archives to a format the website can use.
>
> But there are other lists that are affected more, and frankly, being a
> techie, I hate to give up. We know the data is most likely still on the
> disk. We just need to find it. I feel with the disk, a hex editor, and
> some voodoo I could recover the data I needed. Frankly I only need one of
> the backup files that was created daily, not the entire disk. The problem
> is the primary disk is where the VMWare OS is located, so I can't simply
> remove it. And it currently resides in a data center in Atlanta, a
> 10-hour drive from my house!
>
> So now I am trying to MacGyver my way to the disk. The extra disk
> contains about 400 GB of free space. It turns out VMWare does offer dd
> (disk dump) and gzip on the ESXi platform. I am disk dumping the entire
> primary hard drive to the extra drive, using gzip to compress the data. I
> am praying for a much better than 2:1 compression ratio! If that works I
> will download the dd file and restore it to another 1 TB hard drive, thus
> creating a copy of the primary drive. Then I need to figure out the
> VMWare partition tables and vmdk disk formats. If the 400 GB of free
> space is not enough, I am considering mounting an Amazon EC2 NFS server on
> the VMWare file system and trying again. I also called the VMWare support
> engineer (that poor guy) and asked him to send me any documentation he can
> find about the VMFS and VMDK file structures. I also found an open source
> VMFS driver (http://code.google.com/p/vmfs/) that may be of use.
>
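
For anyone wanting to try the same dd-through-gzip idea on an ordinary Linux
box, here is a minimal sketch in Python. It is not what Paul ran on ESXi (he
used the dd and gzip binaries that ship with it), and I am assuming Python is
available wherever you run it. The function names and paths are made up for
illustration. The last helper does the free-space arithmetic: assuming the
primary is roughly 1 TB (the restore target mentioned above is a 1 TB drive),
fitting it into 400 GB needs about 2.5:1, so "much better than 2:1" is indeed
what is required.

```python
import gzip
import os
import shutil

CHUNK_SIZE = 1024 * 1024  # read 1 MiB at a time so memory use stays flat


def image_to_gzip(source_path, dest_path, chunk_size=CHUNK_SIZE):
    """Copy source_path byte-for-byte into a gzip file.

    Returns (raw_bytes, compressed_bytes) so the achieved ratio can be checked.
    source_path may be a regular file or a block device (hypothetical paths).
    """
    raw_bytes = 0
    with open(source_path, "rb") as src, gzip.open(dest_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            raw_bytes += len(chunk)
    # Measure only after the gzip stream is closed and fully flushed.
    return raw_bytes, os.path.getsize(dest_path)


def gzip_to_image(gz_path, dest_path, chunk_size=CHUNK_SIZE):
    """Decompress a gzip image back to a raw file or device."""
    with gzip.open(gz_path, "rb") as src, open(dest_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)


def required_ratio(disk_bytes, free_bytes):
    """Minimum compression ratio needed to fit disk_bytes into free_bytes."""
    return disk_bytes / free_bytes


if __name__ == "__main__":
    # Assumed sizes: a 1 TB primary disk and 400 GB of free space.
    print(required_ratio(1000 * 10**9, 400 * 10**9))  # 2.5
```

Unused regions of a disk that are still zeroed compress extremely well, so
the real ratio depends mostly on how full the primary ever was.
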
> So if you have any suggestions, please send them to me! No matter how bad
> it got, I kept on thinking at least I am not Sony!
>
> Paul Borghese
>
>
> Blogs and organic groups at http://www.ccie.net
>
> _______________________________________________________________________
> Subscription information may be found at:
> http://www.groupstudy.com/list/CCIELab.html

Received on Thu May 12 2011 - 18:32:21 ART

This archive was generated by hypermail 2.2.0 : Wed Jun 01 2011 - 09:01:11 ART