Although the story actually started about a year and a half ago, the interesting bits of the story started today, so I'll start there. As one might expect, it's approximately move-out time for college students, like me. Along with college students -- at least, the geeky ones -- go their computers, off of the brilliantly fast resnet connection, and on to their next place in life.
Now, for some students, their natural next place isn't good enough; if you run a file server, or an IRC server, or a web server, or anything like that, then that resnet connection -- or other university bandwidth -- is pretty critical to the machine's functioning. In my case, I was lucky; I had permission to take my machine ("nyus") to live over the summer in a lab that I had access to. The only stipulation was that the machine couldn't run 'caseless' like it used to -- I had to mount it in some sort of case for easy movement around the lab.
I had some grumblings about this, but only because the machine had never lived in a case. It'd always been mostly hanging out loose on my desk, or on a floor, or something like that. "Oh well", I said to myself -- if my professor was going to be so nice as to let me use his room and his IP allocations, the least I could do would be to put it in a case for him.
Earlier today, between commencement and going out to dinner with twinofmunin and her family, we shut the system down, and packed it up into a plastic box. Well -- I almost shut it down. I said to myself, "Wouldn't it be neat if I could preserve the uptime, and all of my state, too?". I decided that it would be a neat trick to try and hibernate the machine, move it, and then unhibernate it. "Why not? The worst that could happen is that the machine just won't resume, right?"
I killed enough processes to make the running system fit in swap, and hit the big button. The machine happily dumped its state out to disk, although instead of powering off, it panicked. I paid it no mind, unplugged the machine, and began putting it in its plastic box.
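For reference, the swsusp dance I was attempting boils down to two pieces of configuration -- sketched here assuming the in-kernel swsusp; the device names are stand-ins, not this machine's real layout:

```shell
# Suspend to disk with the in-kernel swsusp (needs root; the running
# state is written to the swap partition, which is why everything had
# to fit in swap first):
#
#   echo disk > /sys/power/state
#
# On the next boot, the kernel has to be told where that image lives, or
# it boots fresh and silently ignores the image.  In the bootloader
# config, that means adding a resume= parameter:
#
#   kernel /vmlinuz root=/dev/hda1 resume=/dev/hda2
#
# (/dev/hda1 and /dev/hda2 are hypothetical; use your real root and swap.)
```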
On the way to dinner, twinofmunin and I swung by the lab, dumped the box in a room, got back in the car, and hauled out for a while.
A healthy five hours later, most of her stuff was packed for her departure. With clear minds, we went back to the lab to put the machine back together. We started by cleaning off the accumulated dust from my dorm room, and there was plenty of it. While Car cleaned out the motherboard, I began mounting the drives in a bracket.
nyus's drives had never been mounted in a bracket before; to protect them on my desk, I'd machined these nice covers for them out of various colors of Lexan. The drives didn't quite fit in the bracket with the covers on, so I did something interesting -- I mounted the drives labeled "hdc" and "hdd" upside down, with the boards facing each other. This required that I cable them in an "interesting fashion", but hey -- this is nyus, and no part of nyus's casing arrangement has ever been done in a standard way.
The rest of the machine's assembly was uneventful. It was somewhat later than I had intended -- reassembling the machine was supposed to be a quick stop, not the project it was turning into. I hurried up a bit, and powered the machine up for testing.
The machine happily booted on the first go-around -- at least hda was alive! As I saw it beginning to mount the root filesystem, I realized that I'd forgotten to resume the machine. I hit the power switch, and four seconds later, it was powered off again. I booted it again, and this time remembered to pass the resume= option to the kernel. To my surprise, it happily dropped me at a shell, just where I'd left it!
I poked around a bit, and brought up the network interface. As I brought various other services on the machine back online, I had a moment of realization. What if I had swapped hdc and hdd on the chain? I hoped that linux-md (the RAID layer) would have figured it out on resume, but I wasn't counting on it. I did some analysis to see how things were going. mdadm --detail gave some indication that things were OK:
    Number   Major   Minor   RaidDevice   State
       0       3       65        0        active sync   /dev/hdb1
       1      22        1        1        active sync   /dev/hdd1
       2      22       65        2        active sync   /dev/hdc1
"Phew", I thought -- it seemed like I had dodged a bullet. Maybe the MD layer was actually smart enough to avoid human stupidity at its worst! I started looking up e-mail addresses for the linux-raid mailing list to send an e-mail thanking them for building this support in.
As I opened up my e-mail client, I also ssh'ed into nyus so I could copy and paste logs over. I realized that I hadn't remembered to start sshd yet, so I turned around and prepared to log in as root to relaunch the service.
It didn't go as well as planned. On screen, messages were starting to scroll, along the lines of:
    dm-0: rw=0, want=<big number>, limit=<smaller number on the same order of magnitude>
    Buffer I/O error on device dm-0, logical block <another big number>
This was bad. mdadm, evidently, was lying to me, and something had gone horribly wrong. At that point, I made the first smart decision I'd made all night. I didn't try to unmount the disks, or do any other diagnostics. I punched the switch on the power supply.
At this point, any damage that had happened was already done -- it was out of my hands. I rebooted the machine, and prepared to clean up whatever data corruption was there. There was some initial concern when I saw that MD didn't automatically assemble the array, but thankfully, that was just due to some fuzziness with autoloaded kernel modules.
In retrospect, I wish that that had been the only issue; if that were the case, then I could simply force-assemble the array with mdadm and call it a night. Sadly, the problems were deeper.
I convinced the array to assemble, and the LVM mapper came up with no problems. I decided that before I tried to mount any filesystem that was left, though, I should probably do some integrity checks.
My first stop was with e2fsck. I figured I'd run it in read-only mode, and if anything seemed screwy, I'd take it from there, depending on how screwy it was. Maybe it was time to just dump all the data off and recreate the filesystem, or maybe it was just a few sectors here and there that got nuked -- but it seemed foolish to come up with a plan before knowing what exactly had happened.
It seems that what happened was closer to the second; "a few" sectors were missing.
    nyus:~# e2fsck -vn /dev/storage/storage0
    e2fsck 1.41.3 (12-Oct-2008)
    e2fsck: Group descriptors look bad... trying backup blocks...
    e2fsck: Bad magic number in super-block while trying to open /dev/storage/storage0

    The superblock could not be read or does not describe a correct ext2
    filesystem.  If the device is valid and it really contains an ext2
    filesystem (and not swap or ufs or something else), then the superblock
    is corrupt, and you might try running e2fsck with an alternate superblock:
        e2fsck -b 8193
    nyus:~# e2fsck -vn -b 8193 /dev/storage/storage0
    e2fsck 1.41.3 (12-Oct-2008)
    e2fsck: Bad magic number in super-block while trying to open /dev/storage/storage0
    ...
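One aside worth knowing here: the backup-superblock location e2fsck suggests is block-size-dependent. 8193 is only right for 1k-block filesystems; a large ext2/ext3 volume almost certainly uses 4k blocks, which puts the first backup at 32768. A sketch of where the sparse_super backups live, from the defaults (block groups 1 and powers of 3, 5, 7; a group holds 8 * block_size blocks):

```shell
# With the (default) sparse_super feature, backup superblocks sit at the
# start of block groups 1, 3, 5, 7, 9, 25, 27, ... (powers of 3, 5, 7).
# A group holds 8 * block_size blocks, and 1k-block filesystems start at
# block 1 rather than block 0 -- which is where the magic 8193 comes from.
backup_superblocks() {
  bs=$1
  bpg=$(( bs * 8 ))                     # blocks per group
  first=$(( bs == 1024 ? 1 : 0 ))       # first data block offset
  for g in 1 3 5 7 9 25 27; do
    echo $(( g * bpg + first ))
  done
}

backup_superblocks 1024   # 8193, 24577, 40961, ...
backup_superblocks 4096   # 32768, 98304, 163840, ...
```

In practice the easier route is `mke2fs -n` against the device, which prints the real superblock locations without writing anything.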
This is what we refer to as 'bad news'. I have no idea what state the filesystem is in, and I have no idea what state the RAID/MD array is in. My current leading hypothesis is that a few sectors are swapped (i.e., sectors that should be on a parity disk are on a data disk, and some sectors that should be on a data disk are instead on the parity disk); but there's no easy way for me to tell right off the bat.
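A toy illustration of why that hypothesis is so hard to confirm from the array itself: RAID5 parity is a plain XOR across the data chunks of a stripe, and XOR doesn't care about order, so if two members' chunks trade places, every parity equation still holds. (The byte values here are made up for illustration.)

```shell
# One stripe of a toy 3-disk RAID5: two data chunks and their parity.
d_hdb=$(( 0xA5 ))   # chunk that belongs on one data member
d_hdc=$(( 0x3C ))   # chunk that belongs on the other
parity=$(( d_hdb ^ d_hdc ))

# Now let the two data members trade places, as the re-cabling could
# have done.  XOR is commutative, so the parity is still "correct" -- a
# scrub would pass -- even though every read now returns the wrong chunk
# for its offset.
parity_swapped=$(( d_hdc ^ d_hdb ))
[ "$parity" -eq "$parity_swapped" ] && echo "parity still checks out"
```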
I've decided that I'm not going to try to write anything to fix it tonight. Tomorrow, I'll go to Best Buy, and pick up three nice shiny SATA drives for the machine, and I'll image the current drives off. Only once I have an image of each of these drives will I try to make changes.
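The imaging step itself is the boring, safe part. A sketch of tomorrow's plan, demonstrated on a scratch file standing in for one of the drives -- in real life the input would be /dev/hdc and the output a file on one of the new SATA disks:

```shell
# Create a 1 MiB scratch file to stand in for a source drive.
dd if=/dev/zero of=fake_hdc.bin bs=1k count=1024 2>/dev/null

# conv=noerror,sync is the important part when imaging a suspect drive:
# noerror keeps dd going past read errors instead of aborting mid-image,
# and sync pads short reads with NULs so the image stays the same length
# (and every surviving sector stays at its original offset).
dd if=fake_hdc.bin of=hdc.img bs=64k conv=noerror,sync 2>/dev/null

cmp -s fake_hdc.bin hdc.img && echo "image matches source"
```

If a drive is actively throwing errors, GNU ddrescue is the better tool for the same job, since it retries bad regions and keeps a log of what it couldn't recover.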
What went wrong?
Well, it's obvious from the above what went wrong in the computer. But what went wrong in me to get the system into this situation?
All pilots, as part of the basic knowledge training for a private pilot certificate, have to learn about five "bad attitudes" that can lead to disaster. (For those of you who are non-pilots, that page does a pretty good job of talking about them in more detail). In essence, though, those attitudes are: anti-authority ("The rules are stupid, don't tell me what to do!"); impulsivity ("Make a decision, now"); invulnerability ("It can't happen to me"); macho ("Don't worry about it; I can do it"); and resignation ("Whatever I do, it won't make a difference"). From the above story, I can identify at least four of those five!
Like any good accident, too, this happened only as a chain of events. That is to say, without any one of the events in the chain, the final accident wouldn't have happened. To some extent, this is a systemic error, then, not a one-off; the same classes of mistakes had to happen repeatedly in order to get the result that I got.
So where did it go wrong? Let's think.
- Anti-authority. The most prominent example in the above case is violating the prime rule of suspending a machine: don't make changes while the machine is asleep. Don't boot it into another OS, don't modify the hardware, just don't do it. But when I said, "It's OK, I can put it together the same way" -- this is where we started to run into problems.
- Impulsivity. When the machine came back, I decided that a report of apparent integrity was sufficient to give me a full overview of the system. A problem that could have resulted, and eventually did result, in data loss got insufficient investigation. If the exhaust gas temperature spiked briefly while you were flying, would you forget about it? No, you'd land as soon as you could, even if you looked at the rest of the gauges and they looked OK. In front of the terminal, though, the impulsivity takes over, and potential problems go away when you stop looking at them.
- Macho. Arguably, the stupidest decision of the day was to try something new and untested on a complex system, all for the sake of a silly number (uptime). After I'd "committed" to that, all of the rest of the decisions went from there. There was to be no incremental testing, because then I'd lose the uptime all over again. Shutting down to investigate and fix the problem was out of the question if things appeared to be working normally. If you think, "I can do it" -- why take chances?
- Invulnerability. I said that this story actually started over a year and a half ago. A year and a half ago, I picked up a bunch of nice 500GB drives, and set them up to store a bunch of my data. I knew then that I would never have a way to back them up effectively -- I used them to back other machines up, so hopefully my data was all replicated. I said, "it'll never happen to me" -- I have them in RAID5, so I can survive a disk failure, and I haven't had a filesystem failure in over 8 years, since the days of ext2. What I didn't count on was the human factor. Human stupidity, as they say, knows no bounds. It might never happen to me -- but I sure happened to it.
The take-away, anyway, is that a series of conditions stacks up to produce terrible results. The best we can do is try to recognize those chains while they're forming, before it's too late, and break them. Remember those attitudes -- when something important is on the line, they're a good fallback.
There are old pilots, and there are bold pilots, but there are no old bold pilots.