More work rants

Jul 14, 2012 02:04

Today, just after lunch, I got one of those "drop everything and work on this problem" calls from the boss. It involves network storage; we have an ongoing problem getting the centrally provided system to play nice with Linux. We have various theories as to what is happening, but the upshot is that we can mount the area fine as root, yet when we log in as a user the files may be visible or they may not exist, and file permissions could be totally screwed. It's a mess.

This has been an ongoing problem since the beginning of the year. I was away for four months and got to skip most of it, but it still hasn't been fixed.

Central IT's solution was to give us a dedicated Red Hat server, supposedly with its own storage hanging off it. They gave us a test one earlier which apparently worked, but I was away. Today they gave us the production one... and it still doesn't work.

Plan B is that we're going to host things locally. I just happen to have an old NFS server with 2TB of space on it. Of course, that's running CentOS 5.8, and I don't know what they've done, but I could not get CentOS to pull user info from the central AD. I can query it fine, I can get Kerberos tickets, but it won't authenticate.

Anyway, on the drive home today I had one of those "ahh, that's what I should do" moments, got home and promptly spent the next two hours remotely connected, trying things. It didn't work.

The solution is to build a Red Hat box. I have some blank RH VMs ready to go, so I fire one of them up, but it's on one cluster and my 2TB of space is on another. Not a problem, I can re-assign it... but not from home; no, I need to be sitting at the console of my machine. I um and ah, then decide screw it. So at 10pm I hoon back into work.

I get in there, press the magic button and wait... and wait... and wait... then notice that some machines have hung. I do some digging and end up on one of the cross-mounted NFS servers, which decided to start generating lots of errors about an hour before I got in; because it's cross-mounted, anything that has an open file on it, such as the web server, is frozen.

Right, I can log into the console of the VM, configure the disk not to mount automatically so I can run some diagnostics on it, and reboot the VM. It doesn't come up. I go in and look: the ESX host isn't responding. I power-cycle that.

I'm very impressed with the high-availability clustering: it detects the host is down and shifts things off. The host comes back up, I power up the VM, run my tests, all's good, so I put it back into production, then get to run around bringing up the machines that hung because they now have stale mounts.

Midnight rolls around. I lose all enthusiasm for what I went in there to do, so I come home.

I decide to check how things are. A different host is now saying it's not connected, except two of the VMs on it are still running fine; the other is pinging but otherwise not responding. I'm ignoring it until Monday.

Apparently some network paths have failed, but the host has four paths, so I'm not sure why things haven't just failed over. Gives me something to look forward to on Monday.

In related news, while I was running tests on that from home, I set myself up an RH box on the other cluster, so I've now got my 2TB of space; it's just not a 64-bit machine, but it should be fine. At this point I really don't care.

work
