Catastrophic failover of Critter.net

Apr 10, 2014 21:24


Earlier this morning (April 10), I noticed that the primary drive in the mirrored array on Critter.Net had failed. This has happened before: the drive either comes back after a reboot, or I get the hosting service to replace it. So I rebooted the server.

And the server did not come back.

This has also happened before, and the last time the hosting service said that it was because of a 'faulty kernel implementation for the ACPI services on the motherboard'. After that incident, I disabled the kernel support for those services. Apparently, this did not help.

After almost four hours of back and forth with the hosting company through their ticketing system, I finally got access to a console server so I could see where the server was failing during boot. It was failing to boot from both drives.

Well, I thought, let's try and replace the kernel on the primary drive. That didn't work.

Since the rest of the data was safe on the secondary drive (because of the mirroring), I thought, let's try installing a new operating system on the primary drive. That failed, because the drive itself was failing.

I asked the hosting service to replace the primary drive. By the time that happened it was 5pm Pacific (this started at 8am Pacific).

But when I tried to install the new operating system on the new drive, something happened to the secondary drive holding all the data. As near as I can work out, even though I never explicitly added the secondary drive to the new mirror I was setting up on the new primary, the secondary drive seems to have been added to that mirror anyway.

So when I created the new filesystems on what I thought was just the new drive... it looks like they were created on the secondary drive as well.

What that means is that all of the data on the safe secondary drive... became instantly inaccessible.

Everything. Every database, document, photo, website, email. Gone.

Even now I'm trying to see if I can somehow recover the data, but I suspect that the 'newfs' that was run may have overwritten the important 'superblocks' that contain the filesystem information. If that is the case, then I cannot get any of the data on that disk back at all.

"What about backups?" some will no doubt be asking. Yes, there were backups, but no, they weren't current. That's not the biggest problem, though. The backup script I was using to back up parts of the server to the local backup drives at the hosting service used some kind of PGP key to encrypt the backups... and I don't have that key any more. On top of that, while the script exists to make backups, I have not yet worked out how to actually restore from them. It still may be possible - I'm working on it.

At the moment, the best I can hope for is to rebuild the services to the bare minimum to get things like email working again. That is not going to happen until Friday April 11 at the earliest, maybe later.

Right now, emails are being held at the backup mail exchanger location, so nothing sent since 8am April 10 Pacific is technically lost. Anything before then... sorry.

Words seriously cannot express my frustration, anguish, disappointment and grief over this entire situation. I have not just let myself down, but I have let down every person who uses the service in some way - whether they have a critter.net or other email address hosted on the server, or a website, or even just use the websites and forums hosted there.

This absolutely sucks for you, and I apologize most profusely.

If you can take solace in anything about this, know that this whole thing is affecting me as much as it is you - all of my websites and email were hosted there as well, so I'm just as screwed.

I've done as much as I am able to tonight, and will be looking at trying to get the email services going again in the morning.

After that, I am going to seriously consider whether Critter.Net will be staying around. This is not and has never been a money maker for me. It has always been a loss leader. If I do decide to close the doors, I will do my best to refund any money for services that were already paid for in advance. This wouldn't happen instantly, but rather as I become able to afford to pay people back. (Note: None of the money ever went to me - it went to a separate PayPal account that was used to pay for all the services used to provide the service.)

Considering how badly I've screwed things up here, I will understand if people don't want to wait for me to restore some level of service.

Once again, my apologies.