---
 Documentation/device-mapper/cache.txt |  223 +++++++++++++++++++++-------------
 1 file changed, 139 insertions(+), 84 deletions(-)

Index: linux/Documentation/device-mapper/cache.txt
===================================================================
--- linux.orig/Documentation/device-mapper/cache.txt
+++ linux/Documentation/device-mapper/cache.txt
@@ -1,4 +1,5 @@
-* Introduction
+Introduction
+============
 
 dm-cache is a device mapper target written by Joe Thornber, Heinz
 Mauelshagen, and Mike Snitzer.
@@ -7,122 +8,137 @@ It aims to improve performance of a bloc
 dynamically migrating some of its data to a faster, smaller device
 (eg, an SSD).
 
-There are various caching solutions out there, for example bcache, we
-feel there is a need for a purely device-mapper solution that allows
-us to insert this caching at different levels of the dm stack.  For
+// MOVE TO PATCH HEADER
+There are various caching solutions out there, for example bcache.
+// Insert ref.
+We feel there is a need for a purely device-mapper solution.
+This device-mapper solution allows
+us to insert this caching at different levels of the dm stack, for
 instance above the data device for a thin-provisioning pool.
 
 Caching solutions that are integrated more closely with the virtual
 memory system should give better performance.
+// Any evidence for this yet?
 
 The target reuses the metadata library used in the thin-provisioning
 library.
 
-The decision of what and when to migrate data is left to a plug-in
+The decision as to what data to migrate and when is left to a plug-in
 policy module.  Several of these have been written as we experiment,
 and we hope other people will contribute others for specific io
 scenarios (eg. a vm image server).
 
-* Glossary
+Glossary
+========
 
-- Migration - Movement of a logical block from one device to the other.
-- Promotion - Migration from slow device to fast device.
-- Demotion  - Migration from fast device to slow device.
+ Migration - Movement of a logical block from one device to the other.
+ Promotion - Migration from slow device to fast device.
+ Demotion  - Migration from fast device to slow device.
 
-* Design
+[Clarify "migration"/"movement" can also involve leaving the old in
+place or maintaining two copies of the data.]
+
+Design
+======
 
-** Sub devices
-
-The target is constructed by passing three devices to it (along with
-other params detailed later):
-
-- An origin device (the big, slow one).
-
-- A cache device (the small, fast one).
+Sub-devices
+-----------
 
-- A small metadata device.
+The target is constructed by passing three devices to it (along with
+other parameters detailed later):
 
-  Device that records which blocks are in the cache.  Which are dirty,
-  and extra hints for use by the policy object.
+1. An origin device - the big, slow one.
 
-  This information could be put on the cache device, but having it
-  separate allows the volume manager to configure it differently.  eg,
-  as a mirror for extra robustness.
+2. A cache device - the small, fast one.
+3. A small metadata device - records which blocks are in the cache,
+   which are dirty, and extra hints for use by the policy object.
+   This information could be put on the cache device, but having it
+   separate allows the volume manager to configure it differently,
+   e.g. as a mirror for extra robustness.
 
-** Fixed block size
+Fixed block size
+----------------
 
 The origin is divided up into blocks of a fixed size.  This block size
 is configurable when you first create the cache.  Typically we've been
 using block sizes of 256k - 1024k.
+// What about smaller ones that I thought Mike had been testing?
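+//
+// For the record, the arithmetic (assuming the block size parameter is
+// given in 512-byte sectors, as is usual for dm targets): a 256k block
+// is 262144 / 512 = 512 sectors - the '512' in the example mapping at
+// the end of this document - and a 1024k block is 2048 sectors.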
+
 Having a fixed block size simplifies the target a lot.  But it is
-something of a compromise.  For instance a small part of a block may
-be getting hit a lot (eg, /etc/passwd), yet the whole block will be
-promoted to the cache.  So large block sizes are bad, because they
-waste cache space.  And small block sizes are bad because they
-increase the amount of metadata (both in core and on disk).
+something of a compromise.  For instance, a small part of a block may be
+getting hit a lot (e.g. a tiny /etc/passwd), yet the whole block will be
 
-** Writeback/writethrough
+// Is that example valid?  What about filesystem caching?
 
-The cache has these two modes.
+promoted to the cache.  So large block sizes are bad because they waste
+cache space.  And small block sizes are bad because they increase the
+amount of metadata (both in core and on disk).
 
-If writeback is selected then writes to blocks that are cached will
-only go to the cache, and the block will be marked dirty in the
+Writeback/writethrough
+----------------------
+
+The cache supports these two modes, writeback and writethrough.
+
+If writeback is selected then a write to a block that is cached will
+go only to the cache and the block will be marked dirty in the
 metadata.
 
-If writethrough mode is selected then a write to a cached block will
-not complete until has hit both the origin and cache device.  Clean
+If writethrough is selected then a write to a cached block will
+not complete until it has hit both the origin and cache devices.  Clean
 blocks should remain clean.
 
-A simple cleaner policy is provided, which will clean all dirty blocks
-in a cache.  Useful for decommissioning a cache.
+A simple cleaner policy is provided, which will clean (write back) all
+dirty blocks in a cache.  Useful for decommissioning a cache.
 
-** Migration throttling
+Migration throttling
+--------------------
 
 Migrating data between the origin and cache device uses bandwidth.
 The user can set a throttle to prevent more than a certain amount of
-migrations occuring at any one time.  Currently we're not taking any
-account of normal io traffic going to the devs.  More work needs to be
-done here to avoid migrating during those peak io moments.
+migration occurring at any one time.  Currently we're not taking any
+account of normal io traffic going to the devices.  More work needs
+doing here to avoid migrating during those peak io moments.
 
-** Updating on disk metadata
+Updating on-disk metadata
+-------------------------
 
-On disk metadata is committed everytime a REQ_SYNC or REQ_FUA bio is
+On-disk metadata is committed every time a REQ_SYNC or REQ_FUA bio is
 written.  If no such requests are made then commits will occur every
 second.  This means the cache behaves like a physical disk that has a
 write cache (the same is true of the thin-provisioning target).  If
 power is lost you may lose some recent writes.  The metadata should
-always be consistent in spite of a crash.
+always be consistent in spite of any crash.
 
 The 'dirty' state for a cache block changes far too frequently for us
 to keep updating it on the fly.  So we treat it as a hint.  In normal
 operation it will be written when the dm device is suspended.  If the
 system crashes all cache blocks will be assumed dirty when restarted.
 
-** per block policy hints
+Per-block policy hints
+----------------------
 
 Policy plug-ins can store a chunk of data per cache block.  It's up to
-the policy how big this chunk is (please keep it small).  Like the
+the policy how big this chunk is, but it should be kept small.  Like the
 dirty flags this data is lost if there's a crash so a safe fallback
 value should always be possible.
 
-For instance the 'mq' policy, which is currently the default policy,
+For instance, the 'mq' policy, which is currently the default policy,
 uses this facility to store the hit count of the cache blocks.  If
 there's a crash this information will be lost, which means the cache
 may be less efficient until those hit counts are regenerated.
 
-Policy hints effect performance, not correctness.
+Policy hints affect performance, not correctness.
 
-** Policy messaging
+Policy messaging
+----------------
 
-Policies will have different tunables, specific to each one.  So we
-need a generic way of getting and setting these.  One way would be
-through a sysfs interface; much as we do with a block device's queue
-parameters.  Another is to use the device-mapper message facility.
-We're using that latter method currently, though don't feel strongly
-one way or the other.
+Policies will have different tunables, specific to each one, so we
+need a generic way of getting and setting these.  Device-mapper
+messages are used.  (A sysfs interface would also be possible.)
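+
+// An illustrative message (sketch only - the device name 'my_cache' is
+// made up, and the available keys depend on the policy in use; the
+// 'migration_threshold' key is the example given in the constructor
+// section below):
+//
+//   dmsetup message my_cache 0 migration_threshold 1024000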
Like the +the policy how big this chunk is, but it should be kept small. Like the dirty flags this data is lost if there's a crash so a safe fallback value should always be possible. -For instance the 'mq' policy, which is currently the default policy, +For instance, the 'mq' policy, which is currently the default policy, uses this facility to store the hit count of the cache blocks. If there's a crash this information will be lost, which means the cache may be less efficient until those hit counts are regenerated. -Policy hints effect performance, not correctness. +Policy hints affect performance, not correctness. -** Policy messaging +Policy messaging +---------------- -Policies will have different tunables, specific to each one. So we -need a generic way of getting and setting these. One way would be -through a sysfs interface; much as we do with a block device's queue -parameters. Another is to use the device-mapper message facility. -We're using that latter method currently, though don't feel strongly -one way or the other. +Policies will have different tunables, specific to each one, so we +need a generic way of getting and setting these. Device-mapper +messages are used. (A sysfs interface would also be possible.) -** discard bitset resolution +Discard bitset resolution +------------------------- We can avoid copying data during migration if we know the block has been discarded. A prime example of this is when mkfs discards the @@ -132,16 +148,14 @@ from the cache blocks. This is because state for all of the origin device (compare with the dirty bitset which is just for the smaller cache device). -** Target interface +Target interface +================ - cache - - - +Constructor +----------- + cache <#feature args> []* - - <#policy args> - [policy args]* + <#policy args> [policy args]* metadata dev : fast device holding the persistent metadata cache dev : fast device holding cached data blocks @@ -156,10 +170,40 @@ which is just for the smaller cache devi key/value pairs passed to the policy. policy args : key/value pairs (eg, 'migration_threshold 1024000') +// Resolve mismatch when compared with equivalent source code comment! + A policy called 'default' is always registered. This is an alias for the policy we currently think is giving best all round performance. -* Example usage +N.B. As the default policy could vary between kernels, if you are +relying on the characteristics of a specific policy, always request it +by name. + +// FIXME Must policies just ignore unrecognised arguments? E.g. if +// default changes and old args aren't valid + +Status +------ + +// FIXME Needs writing and including in source code as comment too! 
+
+
+Examples
+========
 
 The test suite can be found here:
 
@@ -167,27 +211,36 @@ https://github.com/jthornber/thinp-test-
 0 41943040 cache /dev/mapper/metadata /dev/mapper/ssd /dev/mapper/origin
 512 1 writeback default 0
 
-* Policy interface
+// FIXME convert this to 'dmsetup' equivalent
+
+// FIXME More examples!
+
+
+Guidance for writing policies
+=============================
+
+Try to keep transactionality out of it.  The core is careful to
+avoid asking about anything that is migrating.  This is a pain, but
+makes it easier to write the policies.
 
-- Try to keep transactionality out of it.  The core is careful to
-  avoid asking about anything that is migrating.  This is a pain, but
-  makes it easier to write the policies.
-
-- Mappings are loaded into the policy at construction time.
-
-- Every bio that is mapped by the target is referred to the policy, it
-  can give a simple HIT or MISS or issue a migration.
-
-- Currently there's no way for the policy to issue background work,
-  eg, start writing back dirty blocks that are soon going to be evicted.
-
-- Because we map bios, rather than requests it's easy for the policy
-  to get fooled by many small bios.  For this reason the core target
-  issues periodic ticks to the policy.  It's suggested that the policy
-  doesn't update states (eg, hit counts) for a block more than once
-  for each tick.  [The core ticks by watching bios complete, and so
-  trying to see when the io scheduler has let the ios run]
+Mappings are loaded into the policy at construction time.
+
+Every bio that is mapped by the target is referred to the policy.
+The policy can return a simple HIT or MISS or issue a migration.
+
+Currently there's no way for the policy to issue background work,
+e.g. to start writing back dirty blocks that are going to be evicted
+soon.
+
+Because we map bios rather than requests, it's easy for the policy
+to get fooled by many small bios.  For this reason the core target
+issues periodic ticks to the policy.  It's suggested that the policy
+
+// FIXME Describe this better: define 'tick' first.
+
+doesn't update states (eg, hit counts) for a block more than once
+for each tick.  The core ticks by watching bios complete, and so
+trying to see when the io scheduler has let the ios run.
 
 void (*destroy)(struct dm_cache_policy *p);
 void (*map)(struct dm_cache_policy *p, dm_block_t origin_block, int data_dir,
@@ -207,3 +260,5 @@ https://github.com/jthornber/thinp-test-
 
 void (*tick)(struct dm_cache_policy *p);
+
+// FIXME Delete.  Annotated version should live in .h file and no point copying that here.
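+
+// One possible reading of the 'once per tick' suggestion above, as a
+// sketch (hypothetical code, not the real mq policy - all names here
+// are invented):
+//
+//   struct block_stats {
+//           unsigned hit_count;
+//           unsigned last_tick;  /* tick when hit_count last changed */
+//   };
+//
+//   static void record_hit(struct block_stats *b, unsigned tick)
+//   {
+//           /* Bump the hit count at most once per tick, so a burst of
+//              small bios within one tick registers as a single hit. */
+//           if (b->last_tick != tick) {
+//                   b->last_tick = tick;
+//                   b->hit_count++;
+//           }
+//   }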