The goal of the NVFS filesystem is to create a small and fast filesystem
usable on persistent memory.


COMPARISON WITH NOVAFS
======================

- nvfs is smaller: 5882 lines of code vs 21459 lines of code
- nvfs has more features - xattrs, security labels, acls, quota
- nvfs has fsck
- nvfs supports the current kernel, novafs supports kernels up to 5.1
- nvfs doesn't require patching the kernel, it uses the standard VFS interface
- nvfs is faster - see the "BENCHMARKS" file
- a downside: nvfs doesn't have snapshots


BLOCKS
======

The minimum allocation unit on the nvfs filesystem is one block. The
filesystem block size must be greater than or equal to the kernel page size.

The first block is the superblock - the first 2048 bytes are the static part
of the superblock and the following 2048 bytes are the dynamic part (which
may change). The static part of the superblock is also copied to the end of
the block device, so that the filesystem can be recovered if the first block
is corrupted.


GROUPS
======

The filesystem is divided into groups; the group size is
block_size*block_size*8, so that one bitmap block can map all the blocks in
a group. Each group contains:

        one data block (the first group has the superblock here)
        free block bitmap
        free block and free inode counters
        free inode bitmap
        inodes
        data blocks

The reason for using groups is that they allow the filesystem to be extended.
If bitmaps and inodes were at a fixed location, the filesystem could not
grow.


FILES
=====

File blocks are mapped using classic direct/indirect blocks, just like on
ext2. Each inode has 11 direct blocks and 5 indirect blocks.

Using direct/indirect blocks was a conscious decision. Disks and SSDs perform
better with linear access, so filesystems use various algorithms to reduce
file fragmentation. Persistent memory, on the other hand, can access a linear
sequence of pages as fast as a random sequence, so we don't need to care much
about fragmentation and we want as fast an algorithm as possible.

On disks and SSDs, accessing a block is as costly as accessing a word, so it
is reasonable to stuff as much information as possible into a block. Btrees
or b+trees are used to minimize the number of disk accesses. On persistent
memory, however, accessing a whole block is much more time consuming than
accessing a word, so b+trees bring no advantage. If a block is mapped using
3-level indirection, we need just three word accesses to find the location
of the block; with btrees, the number of word accesses would be much larger.


DIRECTORIES
===========

Directories are implemented differently than regular files - each directory
is a radix tree; the file name is hashed and the hash value is used for
lookup in the radix tree.

The directory inode has 16 pointers as the base of the radix tree. When the
directory is created, all 16 pointers point to the same directory block.

  +-----+   ===>   +-----+
  |     |   ===>   |     |
  |     |   ===>   |     |
  |inode|   ===>   |block|
  |     |   ===>   |     |
  |     |   ===>   |     |
  |     |   ===>   |     |
  +-----+   ===>   +-----+

When the block fills up with directory entries, it is split in two. The files
with the top hash bit clear are put into "block 1" and the files with the top
hash bit set are put into "block 2".

                   +-----+
                   |     |
                   |     |
                   |block|
  +-----+   ===>   |  1  |
  |     |   ===>   |     |
  |     |   ===>   |     |
  |inode|   ===>   +-----+
  |     |   ===>   +-----+
  |     |   ===>   |     |
  |     |   ===>   |     |
  +-----+   ===>   |block|
                   |  2  |
                   |     |
                   |     |
                   +-----+

When "block 1" or "block 2" fills up, it is split again.
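To make the lookup side of this concrete, here is a minimal sketch of how a
hashed radix-tree descent could look. The node layout, field names and the
16-way fan-out of the inner nodes are assumptions made for illustration only;
they do not describe the actual NVFS on-disk format.

/*
 * Illustrative sketch of a hashed radix-tree directory lookup.
 * The node layout, names and the 16-way fan-out of the inner nodes
 * are assumptions for this example, not the real NVFS format.
 */
#include <stdint.h>

#define DIR_FANOUT      16      /* the directory inode has 16 root pointers */

struct dir_node {
        int is_leaf;                            /* leaf = block holding directory entries */
        struct dir_node *child[DIR_FANOUT];     /* used when the node is an indirect block */
};

/*
 * Descend the radix tree: the top 4 bits of the 32-bit name hash pick
 * one of the 16 root pointers, each further level consumes 4 more bits,
 * until a leaf block is reached.  The entry is then found inside that
 * block by comparing the full file name.
 */
static struct dir_node *dir_find_leaf(struct dir_node *root[DIR_FANOUT],
                                      uint32_t hash)
{
        unsigned int shift = 32 - 4;
        struct dir_node *n = root[hash >> shift];

        while (!n->is_leaf) {
                shift -= 4;
                n = n->child[(hash >> shift) & (DIR_FANOUT - 1)];
        }
        return n;
}

When a leaf block fills up, its entries would be redistributed between two
blocks according to the next unused hash bit - this is exactly the split
shown in the diagrams above.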
If we are in a situation where we can no longer split a block, because there
is just one pointer from the inode to the block, we allocate a new block full
of indirect pointers:

  +-----+ -------->+-----+   ===>   +-----+
  |     |  --->    |     |   ===>   |     |
  |     |   ===>   |     |   ===>   |     |
  |     |   ===>   |     |   ===>   |     |
  |inode|   ===>   |     |   ===>   |     |
  |     |   ===>   |     |   ===>   |     |
  |     |   ===>   |     |   ===>   |     |
  |     |   ===>   |indir|   ===>   |block|
  +-----+   ===>   |block|   ===>   |     |
                   |     |   ===>   |     |
                   |     |   ===>   |     |
                   |     |   ===>   |     |
                   |     |   ===>   |     |
                   |     |   ===>   |     |
                   |     |   ===>   |     |
                   +-----+   ===>   +-----+

Now there are enough pointers to the "block", so it can be split as needed.

The filesystem uses 32-bit hash values. If the user fills up a directory with
too many files that have the same hash value, a singly linked list of
directory blocks is used as a last resort.


SOFT UPDATES
============

For data consistency, the NVFS filesystem uses a technique known as "soft
updates" (it was used on BSD filesystems). Updates to various locations are
ordered in such a way that, in case of a crash, there may be lost blocks or
lost inodes, but there cannot be any visible corruption. The fsck tool is run
with the argument "-a" on reboot and it will clean up the lost structures.

The fsck tool is multithreaded (using the OpenMP library); it mmaps the whole
device and performs the checking on the map. It can check up to 1.6 million
inodes per second.

Currently, there is a problem with the kernel - when we mmap the device
/dev/pmem0, the kernel uses the page cache instead of mapping the persistent
memory directly. Moving the blocks from pmem to the page cache slows down
fsck about 10 times. I hope that this can somehow be resolved.


MICRO JOURNALING
================

For operations that must be atomic (such as moving a file from one directory
to another or replacing a file's extended attributes), NVFS uses a small
journal. There is a separate journal for each CPU. The journal contains 16
address-value pairs. First, we write the addresses and values that must be
updated atomically to the journal, then we write an end-of-journal mark, then
we replay the journal content to the device and finally we invalidate the
journal.
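The following is a minimal sketch of that write-ahead sequence, assuming
hypothetical structure names and a hypothetical pmem_flush() helper for
flushing cache lines; the real NVFS journal format and flush primitives
differ from this.

/*
 * Illustrative sketch of the micro-journal discipline described above.
 * The structure layout, names and the flush helper are assumptions for
 * explanation only, not the real NVFS journal format.
 */
#include <stdint.h>
#include <stddef.h>

#define NVFS_JOURNAL_ENTRIES    16

struct journal_entry {
        uint64_t addr;          /* device offset to patch */
        uint64_t value;         /* value to store there */
};

struct micro_journal {
        struct journal_entry e[NVFS_JOURNAL_ENTRIES];
        uint64_t n_entries;     /* 0 = journal empty; nonzero acts as the end mark */
};

/* Hypothetical helper: write back the CPU cache lines and issue a fence. */
extern void pmem_flush(const void *addr, size_t len);

static void journal_commit(struct micro_journal *j, uint8_t *pmem_base,
                           const struct journal_entry *updates, unsigned n)
{
        unsigned i;

        /* 1. record the address/value pairs in the journal */
        for (i = 0; i < n; i++)
                j->e[i] = updates[i];
        pmem_flush(j->e, n * sizeof(j->e[0]));

        /* 2. write the end-of-journal mark; the transaction is now durable */
        j->n_entries = n;
        pmem_flush(&j->n_entries, sizeof(j->n_entries));

        /* 3. replay the journal content to the real locations */
        for (i = 0; i < n; i++) {
                *(uint64_t *)(pmem_base + j->e[i].addr) = j->e[i].value;
                pmem_flush(pmem_base + j->e[i].addr, sizeof(uint64_t));
        }

        /* 4. invalidate the journal so it is not replayed again */
        j->n_entries = 0;
        pmem_flush(&j->n_entries, sizeof(j->n_entries));
}

Because step 2 makes the whole set of updates durable before any target
location is touched, a crash at any later point can be handled by replaying
a journal whose end mark is present; replaying it more than once is harmless,
since it only stores the same values again.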
VERIFYING THE FILESYSTEM DRIVER
===============================

The tool "./verifier" can be used to run tests on the filesystem. You need to
compile the filesystem with "CONFIG_NVFS_LOG" defined - it is normally not
defined because it causes a slowdown. For continuous testing you can run
"while ./verifier; do :; done".

./verifier uses a persistent memory device at /dev/pmem0. It tests various
operations (defined in utils/verifier.c:tests). The function do_test tests
one operation, defined by struct test. It does this:

1. create and mount a new filesystem
2. run the "prepare" phase of the test
3. unmount the filesystem
4. create a binary backup of the filesystem
5. mount it with logging enabled
6. run the "test" phase of the test
7. unmount the filesystem
8. back up the filesystem again
9. copy the backup created in step 4 to /dev/pmem0
10. run the "./replay" tool - it will load the log created in step 6,
    randomly reorder the entries (to simulate reordering in the CPU
    write-combining buffers), cut it at a random position (to simulate a
    crash) and replay the modifications to /dev/pmem0
11. run nvfsck on /dev/pmem0
12. run nvfsck again to verify that the previous invocation fixed all the
    errors
13. mount the filesystem
14. run the "check" phase of the test

This sequence may be used, for example, to prove that if the system crashes
during a "rename" operation, then no matter where the crash happens, either
the old file or the new file is present in the directory (see the functions
"prepare_rename", "test_rename" and "check_rename").
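For orientation, a test entry of the kind referenced above might look roughly
like this; this is only a sketch - the actual definition of struct test lives
in utils/verifier.c and the field names here are guesses.

/*
 * Rough sketch of a verifier test entry.  The real struct test is
 * defined in utils/verifier.c and may differ; the field names below
 * are assumptions.
 */
void prepare_rename(void);      /* builds the initial directory tree */
void test_rename(void);         /* performs the rename under logging */
void check_rename(void);        /* checks that the old or the new name exists */

struct test {
        const char *name;
        void (*prepare)(void);  /* step 2 of the sequence above */
        void (*test)(void);     /* step 6 */
        void (*check)(void);    /* step 14 */
};

static const struct test tests[] = {
        { "rename", prepare_rename, test_rename, check_rename },
        /* ... */
};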