Comments | lensman: Meanwhile back in the jungle....

lensman

Meanwhile back in the jungle....

Jun 06, 2011 02:32

So I mentioned there was a server problem at work...


ninjarat June 7 2011, 00:33:19 UTC

I'm not going to recommend hardware because I don't know your systems.

I am going to provide some advice. Doing HA right can be big and expensive. Redundant computers with redundant system disks and internal storage. Redundant public-facing network interfaces with redundant switches. Redundant private network interfaces with redundant switches there. Redundant heartbeat network interfaces with redundant switches there, too. Redundant Fibre Channel interfaces with redundant fibre switches connecting the redundant backend storage. Redundant power everywhere. Redundant everything. Literally.

Doing HA wrong is easy, and there are two ways to do it. One is to skimp on the hardware, such as using only one switch for the heartbeat or public-facing networks. Lose that switch and the whole cluster is effectively dead, or worse, in a split-brain state, which can lead to data corruption on the backend. The other is to use the "cold" nodes as live nodes. This is easy to rationalize: they're consuming power and cooling, so you might as well use them. But when something faults and those services fail over, you will find yourself operating over capacity and the whole thing will fall apart.
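To put numbers on that second failure mode, here is a minimal Python sketch of the capacity check. The function name and the load figures are made up for illustration, and it assumes every node has the same capacity and that load can be spread freely across the survivors, which real clusters rarely manage perfectly.

```python
def survives_node_loss(loads, capacity_per_node, failures=1):
    """Return True if the surviving nodes can carry the total load.

    loads: current load on each node (hypothetical units, e.g. percent).
    capacity_per_node: what one node can carry, assumed equal for all nodes.
    failures: how many nodes are lost at once.
    """
    surviving = len(loads) - failures
    if surviving <= 0:
        return False
    # Simplification: total load is compared against total surviving capacity.
    return sum(loads) <= surviving * capacity_per_node


# Three nodes all kept busy at 80%: losing one leaves 240 units of work
# for 200 units of capacity.
print(survives_node_loss([80, 80, 80], capacity_per_node=100))

# Two busy nodes and one cold spare: losing any one node is survivable.
print(survives_node_loss([80, 80, 0], capacity_per_node=100))
```

The first case is the "might as well use them" cluster; the second is the same hardware with one node held in reserve.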

Do it right. Do it right the first time. Pay the expense. And then demonstrate it working. Yank the power out of something "critical" and watch it keep on ticking without anyone noticing. Put it back, clear the fault, and do it again with something else. Test everything.

Or do it wrong, and demonstrate the catastrophe when the critical point faults.

Read this:
http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html
Specifically point 3.

lensman June 7 2011, 06:01:20 UTC

Yes, after a quick email from another colleague, I'm currently thinking this will be 2 EMC VNXe NAS units (I've already gotten 2 good internal references for these) cross-connected to 2 backend non-routed gigabit switches (not sure whether I should interconnect the two switches or not, but I don't think that will matter too much).
The switches in turn will be cross-connected to 6 VMware ESXi servers (3 are existing ESXi boxes that each keep their datastores on the machine itself; one is currently a standalone server that will be re-purposed, with its functions moved into the VMware cloud). These all have at least 2x quad-core CPUs with RAID 5 (unfortunately two of them do not have redundant power supplies), and of course everything has UPS battery backup and lives in a data center on campus (not on city power).
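As a rough sanity check on that layout, here is a small Python sketch that looks for single points of failure between the ESXi hosts and the storage. The node names (esx1, sw1, nas1 and so on) are placeholders, and it assumes either NAS can serve a host if the other is gone, which only holds if the data is actually replicated between them. It models the cabling only, not power supplies.

```python
from collections import deque


def reachable(links, start, dead):
    """Breadth-first search over undirected links, skipping the dead node."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for a, b in links:
            if a == node:
                nxt = b
            elif b == node:
                nxt = a
            else:
                continue
            if nxt != dead and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def single_points_of_failure(links, hosts, storage):
    """List components whose loss cuts some surviving host off from all storage."""
    nodes = {n for link in links for n in link}
    spofs = []
    for dead in sorted(nodes):
        live_storage = set(storage) - {dead}
        for host in hosts:
            if host == dead:
                continue
            if not (reachable(links, host, dead) & live_storage):
                spofs.append(dead)
                break
    return spofs


hosts = [f"esx{i}" for i in range(1, 7)]
storage = ["nas1", "nas2"]

# Every host and every NAS cabled to both switches (the cross-connected plan).
cross = [(h, s) for h in hosts for s in ("sw1", "sw2")]
cross += [(s, n) for s in ("sw1", "sw2") for n in storage]

# The same gear hung off a single switch, for comparison.
single = [(h, "sw1") for h in hosts] + [("sw1", n) for n in storage]

print(single_points_of_failure(cross, hosts, storage))
print(single_points_of_failure(single, hosts, storage))
```

Under these assumptions the cross-connected plan comes back empty without any link between the two switches, while the single-switch version flags the switch.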

I was able to confirm that the vSphere license is already covered by central. :-) So it'll just be hardware and OS licenses.

The outward-facing network is run by central, and we're comfortable washing our hands of it at that point. Otherwise I think I'd have to look at replicating off campus, and while we're growing, we're not THAT big yet. Although I could see the potential for our partner countries to handle one or more of those :-) Which would make things "interesting"... (Road trip) :-)
