The joy and pain of upgrading servers... So there was a reason behind the downtime and not just some admin "fairies" tripping on a power cord at the DC ;)
I'm the only one who has access to the datacenter (I have two other folks who help me out with work, but they can't get in without me physically being there). So if someone tripped on a cable, it was probably me! ;-)
Thursday night's scheduled maintenance was an OS bare-metal install -- and I did this 100% remotely (2nd time I've done such; probably why I've written a fairly thorough document on the procedure). I ran into some snags (reboot, "Oh wait a minute, crap, this isn't going to work, I forgot to..."), which caused me to go into a frenzy/angry panic ("I'm going to have to go to the datacenter to fix this"), followed by pure anger because I couldn't find my datacenter badge or cage key. Since I'm in the process of moving, all sorts of boxes and crap are scattered throughout my flat -- so I literally tore the place up (it looked like a typhoon hit it, and I'm in no way exaggerating), yadda yadda... I was super upset/irritated. Only once I calmed down did I realise I didn't have a badge/key any longer because the datacenter had upgraded to biometric (hand) scan and a PIN + cage keypad locks. Sigh.
After I got things back up and working -- with ZFS in the picture -- I went to bed and thought all was well. Twelve hours later (Friday evening) I got a highly erratic call from our junior admin, who hadn't really done a good job of troubleshooting the problem and just wanted to reboot the box. That didn't work because of the actual problem (some kernel thread/operation was flat-out hung), and he didn't have access to the remote rebooter (that's my fault). I was groggy given that I had taken Nyquil + melatonin to sleep.
Once he described the problem slowly, I was like "...this sounds familiar, I think someone on freebsd-fs posted something like this recently". Yup, the situation we experienced was identical to that of another guy running on completely different hardware, a totally different software configuration, etc. His workaround was to remove ZFS from the picture.
So that's exactly what we did (migrated from ZFS to using a block/sector-level RAID-1 implementation called gmirror, with standard UFS2 filesystems in use). System's been stable so far, with no signs of processes being deadlocked waiting for internal ZFS operations to complete (since ZFS isn't in use).
And no, none of this is a hardware problem or OS/hardware incompatibility. It's purely a FreeBSD 8.1-STABLE software bug, and absolutely 100% related to ZFS.
Is ZFS worth using yet? I've just been using ReiserFS up to now.
ZFS is worth using only if you're running on Solaris or OpenSolaris. Linux's ZFS port uses FUSE, which means performance is going to suck. There are also two ZFS kernel-level ports (super-duper-insane-patches) in progress, but I don't know anything about them. ZFS on Solaris literally just works -- no tuning, nada. On FreeBSD, it's a complete nightmare, as this whole situation proves.
At my day/night job, we use Solaris 10 extensively across thousands of machines (all different hardware revisions/models/specs), with absolutely zero problems. It's wonderful.
As for ReiserFS, I have no personal experience with it, but I do have a colleague who has literally had to hex edit a hard disk to recover pieces of a ReiserFS filesystem that exploded horribly due to some software bug. He could comment on its stability. If I were using Linux, I'd probably stick to using md and nothing more (I'm a KISS admin). That said, Btrfs on Linux is looking very, very nice, and will definitely give ZFS a run for its money. Thumbs up to positive evolution that keeps it simple.