---
 Documentation/device-mapper/cache.txt |  223 +++++++++++++++++++++-------------
 1 file changed, 139 insertions(+), 84 deletions(-)

Index: linux/Documentation/device-mapper/cache.txt
===================================================================
--- linux.orig/Documentation/device-mapper/cache.txt
+++ linux/Documentation/device-mapper/cache.txt
@@ -1,4 +1,5 @@
-* Introduction
+Introduction
+============
 
 dm-cache is a device mapper target written by Joe Thornber, Heinz
 Mauelshagen, and Mike Snitzer.
@@ -7,122 +8,137 @@ It aims to improve performance of a bloc
 dynamically migrating some of its data to a faster, smaller device
 (eg, an SSD).
 
-There are various caching solutions out there, for example bcache, we
-feel there is a need for a purely device-mapper solution that allows
-us to insert this caching at different levels of the dm stack.  For
+// MOVE TO PATCH HEADER
+There are various caching solutions out there, for example bcache.
+// Insert ref.
+We feel there is a need for a purely device-mapper solution.
+This device-mapper solution allows
+us to insert this caching at different levels of the dm stack, for
 instance above the data device for a thin-provisioning pool.
 
 Caching solutions that are integrated more closely with the virtual
 memory system should give better performance.
+// Any evidence for this yet?
 
 The target reuses the metadata library used in the thin-provisioning
 library.
 
-The decision of what and when to migrate data is left to a plug-in
+The decision as to what data to migrate and when is left to a plug-in
 policy module.  Several of these have been written as we experiment,
 and we hope other people will contribute others for specific io
 scenarios (eg. a vm image server).
 
-* Glossary
+Glossary
+========
 
-- Migration - Movement of a logical block from one device to the other.
-- Promotion - Migration from slow device to fast device.
-- Demotion  - Migration from fast device to slow device.
+ Migration - Movement of a logical block from one device to the other.
+ Promotion - Migration from slow device to fast device.
+ Demotion  - Migration from fast device to slow device.
 
-* Design
+[Clarify "migration"/"movement" can also involve leaving the old in
+place or maintaining two copies of the data.]
+
+Design
+======
 
-** Sub devices
-
-The target is constructed by passing three devices to it (along with
-other params detailed later):
-
-- An origin device (the big, slow one).
-
-- A cache device (the small, fast one).
+Sub-devices
+-----------
 
-- A small metadata device.
+The target is constructed by passing three devices to it (along with
+other parameters detailed later):
 
-  Device that records which blocks are in the cache.  Which are dirty,
-  and extra hints for use by the policy object.
+1. An origin device - the big, slow one.
 
-  This information could be put on the cache device, but having it
-  separate allows the volume manager to configure it differently.  eg,
-  as a mirror for extra robustness.
+2. A cache device - the small, fast one.
+3. A small metadata device - records which blocks are in the cache,
+   which are dirty, and extra hints for use by the policy object.
+   This information could be put on the cache device, but having it
+   separate allows the volume manager to configure it differently,
+   e.g. as a mirror for extra robustness.
 
-** Fixed block size
+Fixed block size
+----------------
 
 The origin is divided up into blocks of a fixed size.  This block size
 is configurable when you first create the cache.  Typically we've been
 using block sizes of 256k - 1024k.
+// What about smaller ones that I thought Mike had been testing?
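+//
+// For the record, the arithmetic (assuming the block size parameter is
+// given in 512-byte sectors, as is usual for dm targets): a 256k block
+// is 262144 / 512 = 512 sectors - the '512' in the example mapping at
+// the end of this document - and a 1024k block is 2048 sectors.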
+
 Having a fixed block size simplifies the target a lot.  But it is
-something of a compromise.  For instance a small part of a block may
-be getting hit a lot (eg, /etc/passwd), yet the whole block will be
-promoted to the cache.  So large block sizes are bad, because they
-waste cache space.  And small block sizes are bad because they
-increase the amount of metadata (both in core and on disk).
+something of a compromise.  For instance, a small part of a block may be
+getting hit a lot (e.g. a tiny /etc/passwd), yet the whole block will be
 
-** Writeback/writethrough
+// Is that example valid?  What about filesystem caching?
 
-The cache has these two modes.
+promoted to the cache.  So large block sizes are bad because they waste
+cache space.  And small block sizes are bad because they increase the
+amount of metadata (both in core and on disk).
 
-If writeback is selected then writes to blocks that are cached will
-only go to the cache, and the block will be marked dirty in the
+Writeback/writethrough
+----------------------
+
+The cache supports these two modes, writeback and writethrough.
+
+If writeback is selected then a write to a block that is cached will
+go only to the cache and the block will be marked dirty in the
 metadata.
 
-If writethrough mode is selected then a write to a cached block will
-not complete until has hit both the origin and cache device.  Clean
+If writethrough is selected then a write to a cached block will
+not complete until it has hit both the origin and cache devices.  Clean
 blocks should remain clean.
 
-A simple cleaner policy is provided, which will clean all dirty blocks
-in a cache.  Useful for decommissioning a cache.
+A simple cleaner policy is provided, which will clean (write back) all
+dirty blocks in a cache.  Useful for decommissioning a cache.
 
-** Migration throttling
+Migration throttling
+--------------------
 
 Migrating data between the origin and cache device uses bandwidth.
 The user can set a throttle to prevent more than a certain amount of
-migrations occuring at any one time.  Currently we're not taking any
-account of normal io traffic going to the devs.  More work needs to be
-done here to avoid migrating during those peak io moments.
+migration occurring at any one time.  Currently we're not taking any
+account of normal io traffic going to the devices.  More work needs
+doing here to avoid migrating during those peak io moments.
 
-** Updating on disk metadata
+Updating on-disk metadata
+-------------------------
 
-On disk metadata is committed everytime a REQ_SYNC or REQ_FUA bio is
+On-disk metadata is committed every time a REQ_SYNC or REQ_FUA bio is
 written.  If no such requests are made then commits will occur every
 second.  This means the cache behaves like a physical disk that has a
 write cache (the same is true of the thin-provisioning target).  If
 power is lost you may lose some recent writes.  The metadata should
-always be consistent in spite of a crash.
+always be consistent in spite of any crash.
 
 The 'dirty' state for a cache block changes far too frequently for us
 to keep updating it on the fly.  So we treat it as a hint.  In normal
 operation it will be written when the dm device is suspended.  If the
 system crashes all cache blocks will be assumed dirty when restarted.
 
-** per block policy hints
+Per-block policy hints
+----------------------
 
 Policy plug-ins can store a chunk of data per cache block.  It's up to
-the policy how big this chunk is (please keep it small).  Like the
+the policy how big this chunk is, but it should be kept small.  Like the
 dirty flags this data is lost if there's a crash so a safe fallback
 value should always be possible.
 
-For instance the 'mq' policy, which is currently the default policy,
+For instance, the 'mq' policy, which is currently the default policy,
 uses this facility to store the hit count of the cache blocks.  If
 there's a crash this information will be lost, which means the cache
 may be less efficient until those hit counts are regenerated.
 
-Policy hints effect performance, not correctness.
+Policy hints affect performance, not correctness.
 
-** Policy messaging
+Policy messaging
+----------------
 
-Policies will have different tunables, specific to each one.  So we
-need a generic way of getting and setting these.  One way would be
-through a sysfs interface; much as we do with a block device's queue
-parameters.  Another is to use the device-mapper message facility.
-We're using that latter method currently, though don't feel strongly
-one way or the other.
+Policies will have different tunables, specific to each one, so we
+need a generic way of getting and setting these.  Device-mapper
+messages are used.  (A sysfs interface would also be possible.)
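+
+// An illustrative message (sketch only - the device name 'my_cache' is
+// made up, and the available keys depend on the policy in use; the
+// 'migration_threshold' key is the example given in the constructor
+// section below):
+//
+//   dmsetup message my_cache 0 migration_threshold 1024000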
Like the +the policy how big this chunk is, but it should be kept small. Like the dirty flags this data is lost if there's a crash so a safe fallback value should always be possible. -For instance the 'mq' policy, which is currently the default policy, +For instance, the 'mq' policy, which is currently the default policy, uses this facility to store the hit count of the cache blocks. If there's a crash this information will be lost, which means the cache may be less efficient until those hit counts are regenerated. -Policy hints effect performance, not correctness. +Policy hints affect performance, not correctness. -** Policy messaging +Policy messaging +---------------- -Policies will have different tunables, specific to each one. So we -need a generic way of getting and setting these. One way would be -through a sysfs interface; much as we do with a block device's queue -parameters. Another is to use the device-mapper message facility. -We're using that latter method currently, though don't feel strongly -one way or the other. +Policies will have different tunables, specific to each one, so we +need a generic way of getting and setting these. Device-mapper +messages are used. (A sysfs interface would also be possible.) -** discard bitset resolution +Discard bitset resolution +------------------------- We can avoid copying data during migration if we know the block has been discarded. A prime example of this is when mkfs discards the @@ -132,16 +148,14 @@ from the cache blocks. This is because state for all of the origin device (compare with the dirty bitset which is just for the smaller cache device). -** Target interface +Target interface +================ - cache - - - +Constructor +----------- + cache <#feature args> []* - - <#policy args> - [policy args]* + <#policy args> [policy args]* metadata dev : fast device holding the persistent metadata cache dev : fast device holding cached data blocks @@ -156,10 +170,40 @@ which is just for the smaller cache devi key/value pairs passed to the policy. policy args : key/value pairs (eg, 'migration_threshold 1024000') +// Resolve mismatch when compared with equivalent source code comment! + A policy called 'default' is always registered. This is an alias for the policy we currently think is giving best all round performance. -* Example usage +N.B. As the default policy could vary between kernels, if you are +relying on the characteristics of a specific policy, always request it +by name. + +// FIXME Must policies just ignore unrecognised arguments? E.g. if +// default changes and old args aren't valid + +Status +------ + +// FIXME Needs writing and including in source code as comment too! 
+
+
+Examples
+========
 
 The test suite can be found here:
 
@@ -167,27 +211,36 @@ https://github.com/jthornber/thinp-test-
 0 41943040 cache /dev/mapper/metadata /dev/mapper/ssd /dev/mapper/origin
 512 1 writeback default 0
 
-* Policy interface
+// FIXME convert this to 'dmsetup' equivalent
+
+// FIXME More examples!
+
+
+Guidance for writing policies
+=============================
+
+Try to keep transactionality out of it.  The core is careful to
+avoid asking about anything that is migrating.  This is a pain, but
+makes it easier to write the policies.
 
-- Try to keep transactionality out of it.  The core is careful to
-  avoid asking about anything that is migrating.  This is a pain, but
-  makes it easier to write the policies.
-
-- Mappings are loaded into the policy at construction time.
-
-- Every bio that is mapped by the target is referred to the policy, it
-  can give a simple HIT or MISS or issue a migration.
-
-- Currently there's no way for the policy to issue background work,
-  eg, start writing back dirty blocks that are soon going to be evicted.
-
-- Because we map bios, rather than requests it's easy for the policy
-  to get fooled by many small bios.  For this reason the core target
-  issues periodic ticks to the policy.  It's suggested that the policy
-  doesn't update states (eg, hit counts) for a block more than once
-  for each tick.  [The core ticks by watching bios complete, and so
-  trying to see when the io scheduler has let the ios run]
+Mappings are loaded into the policy at construction time.
+
+Every bio that is mapped by the target is referred to the policy.
+The policy can return a simple HIT or MISS or issue a migration.
+
+Currently there's no way for the policy to issue background work,
+e.g. to start writing back dirty blocks that are going to be evicted
+soon.
+
+Because we map bios rather than requests, it's easy for the policy
+to get fooled by many small bios.  For this reason the core target
+issues periodic ticks to the policy.  It's suggested that the policy
+
+// FIXME Describe this better: define 'tick' first.
+
+doesn't update states (eg, hit counts) for a block more than once
+for each tick.  The core ticks by watching bios complete, and so
+trying to see when the io scheduler has let the ios run.
 
 void (*destroy)(struct dm_cache_policy *p);
 void (*map)(struct dm_cache_policy *p, dm_block_t origin_block, int data_dir,
@@ -207,3 +260,5 @@ https://github.com/jthornber/thinp-test-
 
 void (*tick)(struct dm_cache_policy *p);
+
+// FIXME Delete.  Annotated version should live in .h file and no point copying that here.
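+
+// One possible reading of the 'once per tick' suggestion above, as a
+// sketch (hypothetical code, not the real mq policy - all names here
+// are invented):
+//
+//   struct block_stats {
+//           unsigned hit_count;
+//           unsigned last_tick;  /* tick when hit_count last changed */
+//   };
+//
+//   static void record_hit(struct block_stats *b, unsigned tick)
+//   {
+//           /* Bump the hit count at most once per tick, so a burst of
+//              small bios within one tick registers as a single hit. */
+//           if (b->last_tick != tick) {
+//                   b->last_tick = tick;
+//                   b->hit_count++;
+//           }
+//   }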