I intended to only post this when I was done, and when the machine was back to life (or when I had admitted defeat); but this entry is getting seriously long already. zetorux also wanted an update, so I'll post this now, and make another entry after more progress in a few days.
I've ljcutted this, because this is an inordinate amount of text.
On Tuesday, the replacement drives arrived. Tuesday night, they were all cabled up, and I began imaging the 500GB sets.
I knew beforehand that it would be important to be able to cheaply experiment with the images; I anticipated making a lot of mistakes, and I figured that rolling back would be a common operation. Initially, I figured that the real thing that I wanted was ZFS, which has brilliant snapshot support, as well as plenty of other good things. Sadly, though, I'm running Linux on that machine, and the only way to get ZFS is with zfs-fuse, which is of unknown code quality, and unknown performance.
Regardless, I decided to use that, at least initially. zfs-fuse didn't have the big feature that would've made this helpful -- ZVols -- but I could emulate those with the loopback device. It seemed that I was CPU-bound on my copies from the 500GB sets to the ZFS mount, which was somewhat concerning; would it really have the performance I needed once I added layers of loopback and RAID emulation and mounts on top?
Thankfully, while I was waiting, slashclee told me of LVM's snapshot support. I quickly ditched ZFS, and proceeded to create a LVM2 volume group on top of the 1TB drives. While I was waiting for the data to be dd'ed over, I went to play Call of Duty 5: World at War. Nazi Zombies mode is hilarious.
I have iostat data from the copy, but I don't think it will be very meaningful, since, as best I could tell, I was pretty much saturating the IDE link. I might graph it some day.
Wednesday night, stretching out until Thursday morning, was productive. twinofmunin and kalenedrael read up on RAID5 (conclusion: the Linux kernel is insane, but we knew that), and I read up on ext3. By the end of the night, I managed to produce an "e3view" utility that could walk up through the stage of the block group descriptor table, and heuristically determine which sectors "looked funny" in that table. It is actually capable of repairing the BGDT, too, but that 'repair' strategy won't work for the rest of the metadata.
Potentially the highlight was seeing this:
Block group 1919
  Bitmap block : 62881792 (0x03bf8000)
  Inode block : 62881793 (0x03bf8001)
  Inode table : 62881794 (0x03bf8002)
Reading from new sector: 128
Block group 1920
  Bitmap block : 130023424 (0x07c00000)
    ...looks bad!
    ...but expected 03c00000! Fixing...
In case you don't know what you're looking at, you're looking at a valid block group descriptor that was in sector 127, and then the transition to sector 128, which got smashed a long time ago. e3view managed to heuristically determine that the bitmap block pointer in the descriptor pointed to a pretty bogus looking block, and then knew enough to repair it.
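The heuristic itself is simple enough to sketch. This is my own reconstruction of the idea, not e3view's actual code, and it assumes the common case of 4 KiB blocks with 32768 blocks per group: a descriptor "looks funny" when the bitmap block it points at doesn't even fall inside the block group it claims to describe.

```python
# A minimal sketch of the "looks funny" heuristic, assuming 32768 blocks
# per group and a first data block of 0 (true for 4 KiB-block filesystems).
# Names here are mine, not e3view's.

BLOCKS_PER_GROUP = 32768
FIRST_DATA_BLOCK = 0

def group_bounds(group):
    """Return the (first, last) block numbers belonging to a block group."""
    first = FIRST_DATA_BLOCK + group * BLOCKS_PER_GROUP
    return first, first + BLOCKS_PER_GROUP - 1

def looks_funny(group, bitmap_block):
    """A descriptor looks bogus if its bitmap pointer falls outside the
    block range of the group it claims to describe."""
    first, last = group_bounds(group)
    return not (first <= bitmap_block <= last)

# Group 1920 spans blocks 0x03c00000..0x03c07fff, so a bitmap pointer of
# 0x07c00000 is clearly out of range -- exactly the case in the output above.
assert looks_funny(1920, 0x07C00000)
assert not looks_funny(1919, 0x03BF8000)
```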
We've learned a little more about what data corruption looks like, but not quite enough yet. Tomorrow, we need to draw up tables as to what on-disk formats look like given some combination of inputs, and see what level of repair work we can do.
I think a pretty large amount of metadata has been pretty soundly trashed; in particular, a lot of inode tables have gotten it pretty badly. I'm not sure how well we'll be able to do, but it'll be a fun exercise no matter what. I think I'm pretty confident at this stage that at the end I'll at least be able to have a list of smashed files (files for which data was unrecoverable) -- I'm not sure how big that list will be just yet, though.
The findings for today are somewhat somber, although I'll not start with that. I'll instead start with some happy news -- the tools that I'm developing as I go are now on the interblags! I don't make any promises as to code quality or readability, and I especially don't make any promises as to usability or correctness. But, if you want an interesting read, here it is: e3tools on GitHub.
As it currently stands (as of commit 3c297b1d2c077603d965672e75ffb55a2cfbc2ea), you can run:
$ ln -s /dev/hda3 recover
$ sudo ./e3view -sd | less
And it'll give you a neat view of how your filesystem is linked together. You can also try the option -D, which tries to rebuild the block group descriptor table if it finds corruption. In the current version, it actually will never write it back to disk (it'll keep it in the internal exception table for future reads that it tries to do), and at exit, it will report how many sectors it has 'dirty'. (It's kind of like COW.) But, I'm not sure how that behavior will change in future versions; so you shouldn't rely on that option being safe to use unless I explicitly give a SHA1 hash of a commit that is safe.
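The COW-ish behavior is easy to picture as a sector overlay: repaired sectors live in an in-memory exception table, reads check that table before touching the backing store, and nothing ever writes through to disk. This toy sketch mirrors the idea (the class and its names are mine, not e3view's):

```python
import io

# A toy model of the exception-table behavior: 'repairs' go into a dict
# keyed by sector number, and reads prefer the dict over the backing store.
# The real disk is never modified.

SECTOR_SIZE = 512

class SectorOverlay:
    def __init__(self, backing):
        self.backing = backing      # e.g. an open image file or block device
        self.exceptions = {}        # sector number -> repaired sector data

    def write(self, sector, data):
        self.exceptions[sector] = data   # never hits the backing store

    def read(self, sector):
        if sector in self.exceptions:
            return self.exceptions[sector]
        self.backing.seek(sector * SECTOR_SIZE)
        return self.backing.read(SECTOR_SIZE)

    def dirty_count(self):
        """How many sectors we'd report as 'dirty' at exit."""
        return len(self.exceptions)

# e.g.: overlay a repaired sector on top of an image without modifying it
disk = SectorOverlay(io.BytesIO(b"\x00" * 1024))
disk.write(1, b"\xff" * 512)
```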
It might be time, now, to talk about how ext3 works internally, and what tactics we have available to recover from the "event" that happened.
In ext3, everything centers around the "block"; and in a way, this makes sense. The entire goal of a filesystem is to point to blocks containing data, and to organize them in a sane fashion. But instead of aggregating all of the blocks on a volume into a single pool, ext3 splits up the metadata so that it's somewhat more local to blocks -- i.e., there's less seeking to get from the metadata that you just read to the data. It does this by organizing blocks into "block groups", which are regions on disk comprising a fixed number of blocks.
With each block group, there are three tables that describe the contents -- the block bitmap (which tells which blocks have been used), the inode bitmap (which tells which inodes are in use), and the inode table (which gives all of the inodes available for that block group). Usually, they are placed right at the beginning of a block group, but there isn't any requirement in the on-disk format that they are. In fact, since they could be anywhere on disk, the pointers to these structures are stored in yet another structure called a "block group descriptor"; and a table of these block group descriptors (the aptly-named block group descriptor table) goes with every copy of the file system's superblock.
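For the curious, each entry in the block group descriptor table is a little 32-byte structure, and decoding one is a few lines of Python. This follows the classic ext2/ext3 on-disk layout (little-endian fields, in the order the kernel defines them); the function name is mine.

```python
import struct

# Decode one 32-byte block group descriptor from the raw BGDT bytes,
# following the classic ext2/ext3 on-disk layout: three 32-bit block
# pointers, three 16-bit counts, then padding/reserved space.

GD_FORMAT = "<LLLHHHH12s"            # 32 bytes per descriptor
GD_SIZE = struct.calcsize(GD_FORMAT)

def parse_descriptor(raw, index):
    """Decode descriptor `index` out of the raw descriptor table."""
    fields = struct.unpack_from(GD_FORMAT, raw, index * GD_SIZE)
    return {
        "block_bitmap":      fields[0],  # block no. of the block bitmap
        "inode_bitmap":      fields[1],  # block no. of the inode bitmap
        "inode_table":       fields[2],  # first block of the inode table
        "free_blocks_count": fields[3],
        "free_inodes_count": fields[4],
        "used_dirs_count":   fields[5],
    }
```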
So the first corruption that we had to contend with before we could read anything else out of the filesystem was that bits of the block group descriptor table got smashed -- in particular, parts of it were swapped around. Luckily, since every version of mke2fs that I know about puts the pointed-to tables in exactly the same place each time, I can pretty accurately reconstruct the block group descriptor table if something has gone terribly wrong.
Even if I couldn't, though, I have a few more tricks up my sleeve. First off, I can try to un-swap the swapped sectors on the array. This means that I'll have to work at a lower level than the filesystem, but that seems increasingly inevitable. The theory is that if we know where RAID5 intended to store each sector on disk, and where it actually ended up storing it, then we can recover "bad"-looking sectors by just looking in a different place from where RAID would otherwise expect to look.
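Knowing "where RAID5 intended to store each sector" means knowing the layout math. A sketch of Linux md's default RAID5 layout (left-symmetric) is below; the function and its chunk numbering are my own framing, but the parity rotation matches what the md driver does, so you can predict which disk a given data chunk should live on:

```python
# Where Linux md's default RAID5 layout (left-symmetric) puts a given data
# chunk. Parity rotates from the last disk downward each stripe, and the
# stripe's data chunks start on the disk right after parity, wrapping around.

def locate_chunk(data_chunk, n_disks):
    """Map a logical data chunk number to (disk, stripe) under the
    left-symmetric RAID5 layout."""
    data_disks = n_disks - 1                    # one chunk per stripe is parity
    stripe = data_chunk // data_disks
    within = data_chunk % data_disks
    parity_disk = (n_disks - 1) - (stripe % n_disks)
    disk = (parity_disk + 1 + within) % n_disks
    return disk, stripe

# With 4 disks: stripe 0 puts parity on disk 3 and D0..D2 on disks 0..2;
# stripe 1 rotates parity to disk 2, so D3 lands on disk 3 and D4 wraps
# around to disk 0.
assert locate_chunk(0, 4) == (0, 0)
assert locate_chunk(3, 4) == (3, 1)
assert locate_chunk(4, 4) == (0, 1)
```

If a sector "looks bad" at the location this predicts, the recovery idea is to go hunting at the locations the other layout variants (or an off-by-one stripe) would have chosen.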
The other thing that we can do involves a little bit more trickery with ext3. The enhancement of ext3 over ext2, you might recall, is that ext3 has a journal. So, if we find a block that we can't recover (I suspect that there are combinations of circumstances that could have led to blocks just getting wiped off the surface of the RAID array), we can go and look in ext3's journal file, and see if it tried to write to it at an earlier point in time! We might not have the most current metadata, then, but old metadata is about infinitely better than missing metadata. Obviously this works only for a small amount of missing metadata -- beyond the last 32MB written or so, it probably won't be in the journal any more.
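Digging through the journal is tractable because every journal block that carries JBD metadata starts with a recognizable big-endian magic number (0xc03b3998), followed by a block type and a transaction sequence number. A rough sketch of scanning for those headers (block size of 4096 is an assumption; the function name is mine):

```python
import struct

# Scan a raw journal image for JBD block headers. Each header is three
# big-endian 32-bit words: magic, blocktype, transaction sequence number.

JBD_MAGIC = 0xC03B3998
BLOCK_SIZE = 4096          # assumed; must match the filesystem's block size
BLOCKTYPES = {1: "descriptor", 2: "commit", 3: "superblock v1",
              4: "superblock v2", 5: "revoke"}

def scan_journal(raw):
    """Yield (block_no, blocktype_name, sequence) for each JBD header found."""
    for blkno in range(len(raw) // BLOCK_SIZE):
        magic, blocktype, seq = struct.unpack_from(">LLL", raw,
                                                   blkno * BLOCK_SIZE)
        if magic == JBD_MAGIC:
            yield blkno, BLOCKTYPES.get(blocktype, "?"), seq
```

Descriptor blocks are the interesting ones for recovery: they list which filesystem blocks the data blocks that follow them were destined for, which is exactly the "did the journal ever write to this block?" question.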
RAID recovery is preferable to ext3-level recovery, because RAID recovery is more likely to give us the newest information, and for that matter, it's more likely to give us any information at all. But I don't think ext3-level recovery will be sufficient on its own, either. My bet, then, and for that matter my only real hope, is that the combination of the two will be sufficient to recover the vast majority of the directory structure on the disk.
OK, enough of that, though. Popping off the stack... right, back to ext3. So, the next layer of ext3 metadata is the inode, which describes everything about a file except for its name. (Directories, then, are just mappings of names to inode numbers.) Inodes contain information like what type of a file it is, where it can be found on disk (with a few exceptions), size, block count, modification time, ... yeah. That sort of thing. Metadata. And, since one inode can be pointed to by many filenames, the inode structure has a count of how many times it's pointed to. (This will become important in a moment!) The upshot of all this is that if a file's inode is gone, then the file is essentially lost. Inodes are all stored in the inode tables in each block group -- they're not scattered in with the data; they're placed in contiguous blocks of A Bunch of them. (This is what the "Inodes per" under "group information" means in the output of e3view -s, for those of you following along at home.)
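The front of an inode is just as decodable as a group descriptor. This sketch follows the classic 128-byte ext2/ext3 inode layout (larger inodes keep the same initial fields) and pulls out just the fields mentioned above; the function name is mine.

```python
import struct

# Decode the front of a classic 128-byte ext2/ext3 inode: mode, uid, size,
# the four timestamps, gid, link count, and block count -- all little-endian.

def parse_inode(raw, offset=0):
    (mode, uid, size, atime, ctime, mtime, dtime,
     gid, links_count, blocks) = struct.unpack_from("<HHLLLLLHHL", raw, offset)
    return {
        "mode":        mode,          # file type + permission bits
        "size":        size,
        "mtime":       mtime,         # seconds since the epoch
        "links_count": links_count,   # how many filenames point here
        "blocks":      blocks,        # 512-byte sectors of data
    }
```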
Yesterday's task, then, was at the block group descriptor layer; today, I've been investigating what's happened to the block groups on the machine themselves, and in particular, the inode tables. (I haven't investigated block bitmaps yet). I've also done a lot of code reorganization, but that's not really relevant to specific tasks of the recovery effort for today.
I wanted to get a handle on the condition of the inode tables, so I decided to take a look through for inodes that looked obviously bogus. In this case, the heuristic that I used was for an inode having a link count above 4096 (who would ever have 4096 hard-links to a file???) -- if I saw one of those, then I assumed that the inode in the table had been smacked somehow. As it turned out, this was a pretty good heuristic -- I tried this on the ext3 partition on my laptop, and zero of the inodes on the system were marked as bogus.
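The bogus-inode heuristic itself amounts to a walk over an inode table, peeking at the two-byte link count in each 128-byte slot. A sketch of that scan (my own reconstruction; the names and the 128-byte inode size are assumptions):

```python
# Walk an inode table 128 bytes at a time and flag any inode whose
# link count is implausibly large -- the "who would ever have 4096
# hard-links to a file?" heuristic.

INODE_SIZE = 128
LINKS_OFFSET = 26          # i_links_count: little-endian u16 at byte 26
BOGUS_THRESHOLD = 4096

def count_bogus(table):
    """Return (ok, bogus) counts for the inodes in a raw inode table."""
    ok = bogus = 0
    for off in range(0, len(table), INODE_SIZE):
        links = int.from_bytes(
            table[off + LINKS_OFFSET:off + LINKS_OFFSET + 2], "little")
        if links > BOGUS_THRESHOLD:
            bogus += 1
        else:
            ok += 1
    return ok, bogus
```

On a healthy filesystem, like my laptop's, this should come back with zero bogus inodes; anything else means the table has been smacked.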
I ran this on nyus, and aggregated the counts of bogus blocks with a truly terrifying one-liner that took about a minute to run. As it turns out, the news... was not good.
nyus:~/de-a/e3tools# (grep bogus bogus.txt | cut -d' ' -f1,4) | \
    (o=0; b=0; while read a; do
       no=$(echo $a | cut -d' ' -f1); nb=$(echo $a | cut -d' ' -f2);
       o=$(($o+$no)); b=$(($b+$nb));
     done; echo OK $o BOGUS $b)
OK 118454426 BOGUS 3639142
Out of close to 120 million inodes, 3.6 million were bogus. This is far more than I think we can recover using the journal; that's 931 megabytes in inodes alone. I'm not sure how many of the bogus ones formerly had data in them. Now, it's of interest to note that this is about 3% of the inodes on the system. I'm left wondering a bit how that happened; when did the machine have time to write nearly a gigabyte of data to disk without me noticing?
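As a sanity check on that arithmetic: the 931 MB figure works out if each inode occupies 256 bytes on disk (an assumption on my part; older mke2fs defaults used 128-byte inodes, which would halve the figure).

```python
# Back-of-the-envelope check of the numbers above, assuming 256-byte inodes.

bogus = 3_639_142
ok = 118_454_426
total = ok + bogus

mb_of_inodes = bogus * 256 / 1_000_000   # about 931.6 decimal megabytes
percent_bogus = 100 * bogus / total      # about 3.0% of all inodes

print(mb_of_inodes, percent_bogus)
```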
I'm not sure what this means overall, because I'm not sure if they're permanently gone, or just displaced, or sad, or what. It doesn't seem good, but anything could happen. The next few days should be interesting.