I've done a bunch of hacking on the ext3 indexed directory patch, and made a lot of progress. In fact, I think it's just about ready for integration with the 2.5 tree. Testing and comments are appreciated. Note: the readdir() changes mean that we're now using the filp->private_data field. Is this going to be a problem for Lustre, Andreas? The code can be found either via bitkeeper, at: bk://extfs.bkbits.net/ext3-htree-2.4 and bk://extfs.bkbits.net/ext3-htree-2.5 Ext3 developers who sent me ssh keys can push changes to the read/write version of the repository using the URLs: bk://extfs@extfs.bkbits.net/ext3-htree-2.4 and bk://extfs@extfs.bkbits.net/ext3-htree-2.5 Or for those people who want straight diffs, patches against 2.4 and 2.5 can be found at: http://thunk.org/tytso/linux/ext3-dxdir - Ted Here are the Release Notes for my changes: *) The Ext3 hash-tree code uses the new TEA and half-MD4 hash that ships with e2fsprogs 1.28 *) The code has been massively reorganized and rewritten up to make it more maintainable. In particular, the horrible mess which had been ext3_find_entry and ext3_add_entry has been broken up into multiple functions. This eliminated a lot of goto's, and more importantly, found some memory leaks which have been eliminated. Some were only in the error paths, but some were in normal code paths that would be executed after a split, for example. *) As a side effect of the reorganization, the ext3_find_entry() and ext3_readdir() paths can now support arbitrary numbers of indirect levels in the tree. (The resulting code was simpler, too!) The one code path that still could use some work is the split handling on the insert path. Once that is generalized, we will be able to remove the depth restriction on the indexed tree. *) The error handling has been cleaned up, so that they are appropriately reported back up to userspace, where approach. Ext3_std_error() is now called when the journal routines report an error, in line with all of the other journalling calls in the rest of the ext3 codebase. *) Ext3_readdir now traverses the directory in hash sort order. This (mostly) solves the requirement that readdir not return doubled file entries, or skip a block of files, which could happen if another process managed to trigger a tree split while the readdir operation was going on. Unfortunately, on 32 bit machines, we have to use a 31-bit sort under all conditions, even if telldir64() is used in userspace. This is because the VFS uses a single path for sys_lseek() and sys_llseek(), and so the ext3 code can't tell whether to send back a 32-bit value or a 64-bit value. And, if we send a 64-bit positional value when the 32-bit lseek() was called, lseek() will bomb out. This is unfortunate, but the chances of our hitting this failure more are relatively small. Since readdir (well, sys_getdents64) returns 4k of data at at time, that means a directory with 400,000 entries have approximately 3,000 getdents() system calls, with each getdents() returning approximately 133 entries. In order to trigger the problem, the 31-bit hash collision must be positioned such that it straddles one of the 4k getdents blocks --- while a split operation is taking place. (If we were willing to modify the VFS layer, so that ext3_readdir() knew how much space was left in the filldir buffer, we could be even smarter about things by stopping if there is only space for less than all of the entries that have the same hash value. I'm not entirely convinced this is necessary, though. The risk is low and in the worst case, one or two filenames will simply be returned twice by readdir, which shouldn't cause problems for most applications.) *) Ext3_readdir(), ext3_add_entry(), and ext3_find_entry() now automatically fall back to the old linear directory code if they find a indexed directory format they don't understand. This allows us to add new hash versions, and/or allow the depth of the tree to be increased, without losing backwards compatiblity with deployed kernels.