The goal of the NVFS filesystem is to create a small and fast filesystem
usable on persistent memory.


COMPARISON WITH NOVAFS
======================

- nvfs is smaller: 5882 lines of code vs 21459 lines of code
- nvfs has more features - xattrs, security labels, acls, quota
- nvfs has fsck
- nvfs supports the current kernel, novafs supports kernels up to 5.1
- nvfs doesn't require patching the kernel, it uses the standard VFS interface
- nvfs is faster - see the "BENCHMARKS" file
- a downside: nvfs doesn't have snapshots


BLOCKS
======

The minimum allocation unit on the nvfs filesystem is one block. The
filesystem block size must be greater than or equal to the kernel page size.

The first block is the superblock - the first 2048 bytes are the static part
of the superblock and the following 2048 bytes are the dynamic part (which
may change). The static part of the superblock is also copied to the end of
the block device, so that the filesystem can be recovered if the first block
is corrupted.


GROUPS
======

The filesystem is divided into groups; the group size is
block_size*block_size*8, so that one bitmap block can map all the blocks in
a group. Each group contains:

        one data block (the first group has the superblock here)
        free block bitmap
        free block and free inode counters
        free inode bitmap
        inodes
        data blocks

The reason for using groups is that they allow the filesystem to be extended.
If bitmaps and inodes were at a fixed location, the filesystem could not
grow.


FILES
=====

File blocks are mapped using classic direct/indirect blocks, just like on
ext2. Each inode has 11 direct blocks and 5 indirect blocks.

Using direct/indirect blocks was a conscious decision. Disks and SSDs perform
better with linear access, so filesystems use various algorithms to reduce
file fragmentation. Persistent memory, on the other hand, can access a linear
sequence of pages as fast as a random sequence, so we don't need to care much
about fragmentation and we want as fast an algorithm as possible.

On disks and SSDs, accessing a block is as costly as accessing a word, so it
is reasonable to stuff as much information as possible into a block. Btrees
or b+trees are used to minimize the number of disk accesses. On persistent
memory, however, accessing a whole block is much more time consuming than
accessing a word, so b+trees bring no advantage. If a block is mapped using
3-level indirection, we need just three word accesses to find the location
of the block; with btrees, the number of word accesses would be much larger.


DIRECTORIES
===========

Directories are implemented differently than regular files - each directory
is a radix tree; the file name is hashed and the hash value is used for
lookup in the radix tree.

The directory inode has 16 pointers as the base of the radix tree. When the
directory is created, all 16 pointers point to the same directory block.

  +-----+   ===>   +-----+
  |     |   ===>   |     |
  |     |   ===>   |     |
  |inode|   ===>   |block|
  |     |   ===>   |     |
  |     |   ===>   |     |
  |     |   ===>   |     |
  +-----+   ===>   +-----+

When the block fills up with directory entries, it is split in two. The files
with the top hash bit clear are put into "block 1" and the files with the top
hash bit set are put into "block 2".

                   +-----+
                   |     |
                   |     |
                   |block|
  +-----+   ===>   |  1  |
  |     |   ===>   |     |
  |     |   ===>   |     |
  |inode|   ===>   +-----+
  |     |   ===>   +-----+
  |     |   ===>   |     |
  |     |   ===>   |     |
  +-----+   ===>   |block|
                   |  2  |
                   |     |
                   |     |
                   +-----+

When "block 1" or "block 2" fills up, it is split again.
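To make the lookup side of this concrete, here is a minimal sketch of how a
hashed radix-tree descent could look. The node layout, field names and the
16-way fan-out of the inner nodes are assumptions made for illustration only;
they do not describe the actual NVFS on-disk format.

/*
 * Illustrative sketch of a hashed radix-tree directory lookup.
 * The node layout, names and the 16-way fan-out of the inner nodes
 * are assumptions for this example, not the real NVFS format.
 */
#include <stdint.h>

#define DIR_FANOUT      16      /* the directory inode has 16 root pointers */

struct dir_node {
        int is_leaf;                            /* leaf = block holding directory entries */
        struct dir_node *child[DIR_FANOUT];     /* used when the node is an indirect block */
};

/*
 * Descend the radix tree: the top 4 bits of the 32-bit name hash pick
 * one of the 16 root pointers, each further level consumes 4 more bits,
 * until a leaf block is reached.  The entry is then found inside that
 * block by comparing the full file name.
 */
static struct dir_node *dir_find_leaf(struct dir_node *root[DIR_FANOUT],
                                      uint32_t hash)
{
        unsigned int shift = 32 - 4;
        struct dir_node *n = root[hash >> shift];

        while (!n->is_leaf) {
                shift -= 4;
                n = n->child[(hash >> shift) & (DIR_FANOUT - 1)];
        }
        return n;
}

When a leaf block fills up, its entries would be redistributed between two
blocks according to the next unused hash bit - this is exactly the split
shown in the diagrams above.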
If we are in a situation where we can no longer split a block, because there
is just one pointer from the inode to the block, we allocate a new block full
of indirect pointers:

  +-----+ -------->+-----+   ===>   +-----+
  |     |  --->    |     |   ===>   |     |
  |     |   ===>   |     |   ===>   |     |
  |     |   ===>   |     |   ===>   |     |
  |inode|   ===>   |     |   ===>   |     |
  |     |   ===>   |     |   ===>   |     |
  |     |   ===>   |     |   ===>   |     |
  |     |   ===>   |indir|   ===>   |block|
  +-----+   ===>   |block|   ===>   |     |
                   |     |   ===>   |     |
                   |     |   ===>   |     |
                   |     |   ===>   |     |
                   |     |   ===>   |     |
                   |     |   ===>   |     |
                   |     |   ===>   |     |
                   +-----+   ===>   +-----+

Now there are enough pointers to the "block", so it can be split as needed.

The filesystem uses 32-bit hash values. If the user fills up a directory with
too many files that have the same hash value, a singly linked list of
directory blocks is used as a last resort.


SOFT UPDATES
============

For data consistency, the NVFS filesystem uses a technique known as "soft
updates" (it was used on BSD filesystems). Updates to various locations are
ordered in such a way that, in case of a crash, there may be lost blocks or
lost inodes, but there cannot be any visible corruption. The fsck tool is run
with the argument "-a" on reboot and it will clean up the lost structures.

The fsck tool is multithreaded (using the OpenMP library); it mmaps the whole
device and performs the checking on the map. It can check up to 1.6 million
inodes per second.

Currently, there is a problem with the kernel - when we mmap the device
/dev/pmem0, the kernel uses the page cache instead of mapping the persistent
memory directly. Moving the blocks from pmem to the page cache slows down
fsck about 10 times. I hope that this can somehow be resolved.


MICRO JOURNALING
================

For operations that must be atomic (such as moving a file from one directory
to another or replacing a file's extended attributes), NVFS uses a small
journal. There is a separate journal for each CPU. The journal contains 16
address-value pairs. First, we write the addresses and values that must be
updated atomically to the journal, then we write an end-of-journal mark, then
we replay the journal content to the device and finally we invalidate the
journal.
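The following is a minimal sketch of that write-ahead sequence, assuming
hypothetical structure names and a hypothetical pmem_flush() helper for
flushing cache lines; the real NVFS journal format and flush primitives
differ from this.

/*
 * Illustrative sketch of the micro-journal discipline described above.
 * The structure layout, names and the flush helper are assumptions for
 * explanation only, not the real NVFS journal format.
 */
#include <stdint.h>
#include <stddef.h>

#define NVFS_JOURNAL_ENTRIES    16

struct journal_entry {
        uint64_t addr;          /* device offset to patch */
        uint64_t value;         /* value to store there */
};

struct micro_journal {
        struct journal_entry e[NVFS_JOURNAL_ENTRIES];
        uint64_t n_entries;     /* 0 = journal empty; nonzero acts as the end mark */
};

/* Hypothetical helper: write back the CPU cache lines and issue a fence. */
extern void pmem_flush(const void *addr, size_t len);

static void journal_commit(struct micro_journal *j, uint8_t *pmem_base,
                           const struct journal_entry *updates, unsigned n)
{
        unsigned i;

        /* 1. record the address/value pairs in the journal */
        for (i = 0; i < n; i++)
                j->e[i] = updates[i];
        pmem_flush(j->e, n * sizeof(j->e[0]));

        /* 2. write the end-of-journal mark; the transaction is now durable */
        j->n_entries = n;
        pmem_flush(&j->n_entries, sizeof(j->n_entries));

        /* 3. replay the journal content to the real locations */
        for (i = 0; i < n; i++) {
                *(uint64_t *)(pmem_base + j->e[i].addr) = j->e[i].value;
                pmem_flush(pmem_base + j->e[i].addr, sizeof(uint64_t));
        }

        /* 4. invalidate the journal so it is not replayed again */
        j->n_entries = 0;
        pmem_flush(&j->n_entries, sizeof(j->n_entries));
}

Because step 2 makes the whole set of updates durable before any target
location is touched, a crash at any later point can be handled by replaying
a journal whose end mark is present; replaying it more than once is harmless,
since it only stores the same values again.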
VERIFYING THE FILESYSTEM DRIVER
===============================

The tool "./verifier" can be used to run tests on the filesystem. You need to
compile the filesystem with "CONFIG_NVFS_LOG" defined - it is normally not
defined because it causes a slowdown. For continuous testing you can run
"while ./verifier; do :; done".

./verifier uses a persistent memory device at /dev/pmem0. It tests various
operations (defined in utils/verifier.c:tests). The function do_test tests
one operation, defined by struct test. It does this:

1. create and mount a new filesystem
2. run the "prepare" phase of the test
3. unmount the filesystem
4. create a binary backup of the filesystem
5. mount it with logging enabled
6. run the "test" phase of the test
7. unmount the filesystem
8. back up the filesystem again
9. copy the backup created in step 4 to /dev/pmem0
10. run the "./replay" tool - it will load the log created in step 6,
    randomly reorder the entries (to simulate reordering in the CPU
    write-combining buffers), cut it at a random position (to simulate a
    crash) and replay the modifications to /dev/pmem0
11. run nvfsck on /dev/pmem0
12. run nvfsck again to verify that the previous invocation fixed all the
    errors
13. mount the filesystem
14. run the "check" phase of the test

This sequence may be used, for example, to prove that if the system crashes
during a "rename" operation, then no matter where the crash happens, either
the old file or the new file is present in the directory (see the functions
"prepare_rename", "test_rename" and "check_rename").
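For orientation, a test entry of the kind referenced above might look roughly
like this; this is only a sketch - the actual definition of struct test lives
in utils/verifier.c and the field names here are guesses.

/*
 * Rough sketch of a verifier test entry.  The real struct test is
 * defined in utils/verifier.c and may differ; the field names below
 * are assumptions.
 */
void prepare_rename(void);      /* builds the initial directory tree */
void test_rename(void);         /* performs the rename under logging */
void check_rename(void);        /* checks that the old or the new name exists */

struct test {
        const char *name;
        void (*prepare)(void);  /* step 2 of the sequence above */
        void (*test)(void);     /* step 6 */
        void (*check)(void);    /* step 14 */
};

static const struct test tests[] = {
        { "rename", prepare_rename, test_rename, check_rename },
        /* ... */
};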