dm-replicator
=============

Device-mapper replicator is designed to enable redundant copies of
storage devices to be made - preferentially, to remote locations.
RAID1 (aka mirroring) is often used to maintain redundant copies of
storage for fault tolerance purposes.  Unlike RAID1, which often
assumes similar device characteristics, dm-replicator is designed to
handle devices with different latency and bandwidth characteristics,
which are often the result of the geographic disparity of multi-site
architectures.  Simply put, you might choose RAID1 to protect from a
single device failure, but you would choose remote replication via
dm-replicator for protection against a site failure.

dm-replicator works by first sending write requests to the
"replicator log".  Not to be confused with the device-mapper dirty
log, this replicator log behaves similarly to a journal.  Write
requests go to this log first and then are copied to all the replica
devices at their various locations.  Requests are cleared from the
log once all replica devices confirm the data is received/copied.
This architecture allows dm-replicator to be flexible in terms of
device characteristics.  If one device should fall behind the others
- perhaps due to high latency - the slack is picked up by the log.
The user has a great deal of flexibility in specifying to what
degree a particular site is allowed to fall behind - if at all.

Device-mapper's dm-replicator has two targets, "replicator" and
"replicator-dev".  The "replicator" target is used to set up the
aforementioned log and allows the specification of site link
properties.  Through the "replicator" target, the user might specify
that writes that are copied to the local site must happen
synchronously (i.e. the writes are complete only after they have
passed through the log device and have landed on the local site's
disk).
They may also specify that a remote link should complete writes
asynchronously, but that the remote link should never fall more than
100MB behind in terms of processing.  Again, the "replicator" target
is used to define the replicator log and the characteristics of each
site link.

The "replicator-dev" target is used to define the devices used and
associate them with a particular replicator log.  You might think of
this stage in a similar way to setting up RAID1 (mirroring).  You
define a set of devices which will be copies of each other, but
access the device through the mirror virtual device, which takes
care of the copying.  The user-accessible replicator device is
analogous to the mirror virtual device, while the set of devices
being copied to are analogous to the mirror images (sometimes called
'legs').  When creating a replicator device via the "replicator-dev"
target, it must be associated with the replicator log (created with
the aforementioned "replicator" target).  When each redundant device
is specified as part of the replicator device, it is associated with
a site link whose properties were defined when the "replicator"
target was created.

The user can go further than simply replicating one device.  They
can continue to add replicator devices - associating them with a
particular replicator log.  Writes that go through the replicator
log are guaranteed to have their write ordering preserved.  So, if
you associate more than one replicator device with a particular
replicator log, you are preserving write ordering across multiple
devices.  This might be useful if you had a database that spanned
multiple disks where write ordering must be preserved, or any
transaction accounting scheme would be foiled.  (You can imagine
this like preserving write ordering across a number of mirrored
devices, where each mirror has images/legs in different geographic
locations.)

dm-replicator has a modular architecture.
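The two-stage setup described above might be sketched with dmsetup as
follows.  This is only an illustration: dm-replicator was proposed as
a patch set and may not be present in your kernel, and every device
path, name, and size below is invented.

```shell
# Hypothetical sketch only - the dm-replicator module may not exist on
# your system, so the dmsetup calls are left commented out.

# Stage 1: the "replicator" target creates the replicator log on an
# invented backing device, with a synchronous local site link (0) and
# an asynchronous remote site link (1) allowed to fall at most 100MB
# behind before it is forced synchronous.
LOG_TABLE='0 8 replicator ringbuffer 4 /dev/sdb 0 auto 204800 blockdev 2 0 sync blockdev 4 1 async size 100m'
# dmsetup create replog --table "$LOG_TABLE"

# Stage 2: two "replicator-dev" devices hang off the same log, so
# write ordering is preserved across both - e.g. for a database that
# spans two disks.  Each has a local (slink 0) and remote (slink 1)
# image with a 'core' dirty log using 1024-sector regions.
DEV1_TABLE='0 2097152 replicator-dev /dev/mapper/replog 1 0 1 /dev/sdc core 1 1024 1 1 /dev/sdd core 1 1024'
DEV2_TABLE='0 2097152 replicator-dev /dev/mapper/replog 2 0 1 /dev/sde core 1 1024 1 1 /dev/sdf core 1 1024'
# dmsetup create repdev1 --table "$DEV1_TABLE"
# dmsetup create repdev2 --table "$DEV2_TABLE"
echo "$LOG_TABLE"
```

Note that applications then write only to /dev/mapper/repdev1 and
/dev/mapper/repdev2; the log and site links do the copying behind the
scenes, just as one writes to a mirror virtual device rather than to
its legs.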
Future implementations for the replicator log and site link modules
are allowed.  The current replication log is "ringbuffer" - utilized
to store all writes being subject to replication and enforce write
ordering.  The current site link code is based on accessing block
devices (iSCSI, FC, etc) and performs device recovery, including
(initial) resynchronization.

Picture of a configuration with n local devices (LDs) in a primary
site being resynchronized to m remote sites with n remote devices
(RDs) each via site links (SLINK) 1-m, with site link 0 as a special
case to handle the local devices:

 Local (primary) site    |           Remote sites
 --------------------    |           ------------
                         |
  D1   D2     Dn         |
   |    |      |         |
   +----+-...--+         |
        |                |
      REPLOG-------------+-- SLINK1 --------------+
        |                |                        |
      SLINK0             |                        |
  (special case)         |                        |
        |                |                        |
   +----+--...--+        |           +----+--...--+
   |    |       |        |           |    |       |
  LD1  LD2     LDn       |          RD1  RD2     RDn
                         |
                         +-- SLINK2 --------------+
                         |                        |
                         |           +----+--...--+
                         |           |    |       |
                         |          RD1  RD2     RDn
                         |
                         +-- SLINKm --------------+
                         |                        |
                         |           +----+--...--+
                         |           |    |       |
                         |          RD1  RD2     RDn

The following are descriptions of the device-mapper tables used to
construct the "replicator" and "replicator-dev" targets.

"replicator" target parameters:
-------------------------------
<start> <length> replicator \
        <replog_type> <#replog_params> <replog_params> \
        [<slink_type_0> <#slink_params_0> <slink_params_0>]{1..N}

<replog_type>    = "ringbuffer" is currently the only available type
<#replog_params> = # of args following this one intended for the
                   replog (2 or 4)
<replog_params>  = <dev_path> <dev_start> [auto/create/open <size>]
        <dev_path>  = device path of the replication log (REPLOG)
                      backing store
        <dev_start> = offset to the REPLOG header
        create      = The replication log will be initialized if not
                      active and sized to <size>.  (If already
                      active, the create will fail.)  Size is always
                      in sectors.
        open        = The replication log must be initialized and
                      valid or the constructor will fail.
        auto        = If a valid replication log header is found on
                      the replication device, this will behave like
                      'open'.  Otherwise, this option behaves like
                      'create'.
<slink_type>    = "blockdev" is currently the only available type
<#slink_params> = 1/2/4
<slink_params>  = <slink#> [<policy> [<fall_behind> <N>]]
        <slink#>      = This is a unique number that is used to
                        identify a particular site/location.  '0' is
                        always used to identify the local site, while
                        increasing integers are used to identify
                        remote sites.
        <policy>      = The policy can be either 'sync' or 'async'.
                        'sync' means write requests will not return
                        until the data is on the storage device.
                        'async' allows a device to "fall behind";
                        that is, outstanding write requests are
                        waiting in the replication log to be
                        processed for this site, but they are not
                        delaying the writes of other sites.
        <fall_behind> = This field is used to specify how far the
                        user is willing to allow write requests to
                        this specific site to "fall behind" in
                        processing before switching to a 'sync'
                        policy.  This "fall behind" threshold can be
                        specified in three ways: ios, size, or
                        timeout.  'ios' is the number of pending I/Os
                        allowed (e.g. "ios 10000").  'size' is the
                        amount of pending data allowed (e.g.
                        "size 200m").  Size labels include:
                        s (sectors), k, m, g, t, p, and e.  'timeout'
                        is the amount of time allowed for writes to
                        be outstanding.  Time labels include: s, m,
                        h, and d.

"replicator-dev" target parameters:
-----------------------------------
<start> <length> replicator-dev \
        <replicator_device> <dev#> \
        [<slink#> <#dev_params> <dev_params>
         <dlog_type> <#dlog_params> <dlog_params>]{1..N}

<replicator_device> = device previously constructed via the
                      "replicator" target
<dev#>              = An integer that is used to 'tag' write requests
                      as belonging to a particular set of devices -
                      specifically, the devices that follow this
                      argument (i.e. the site link devices).
<slink#>            = This number identifies the site/location where
                      the next device to be specified comes from.  It
                      is exactly the same number used to identify the
                      site/location (and its policies) in the
                      "replicator" target.  Interestingly, while one
                      might normally expect a "dev_type" argument
                      here, it can be deduced from the site link
                      number and the 'slink_type' given in the
                      "replicator" target.
<#dev_params>  = '1'
                 (The number of allowed parameters actually depends
                 on the 'slink_type' given in the "replicator"
                 target.  Since our only option there is "blockdev",
                 the only allowable number here is currently '1'.)
<dev_params>   = 'dev_path'
                 (Again, since "blockdev" is the only 'slink_type'
                 available, the only allowable argument here is the
                 path to the device.)
<dlog_type>    = Not to be confused with the "replicator log", this
                 is the type of dirty log associated with this
                 particular device.  Dirty logs are used for
                 synchronization, during initialization or fall
                 behind conditions, to bring a device into a coherent
                 state with its peers - analogous to rebuilding a
                 RAID1 (mirror) device.  Available dirty log types
                 include: 'nolog', 'core', and 'disk'.
<#dlog_params> = The number of arguments required for a particular
                 log type - 'nolog' = 0, 'core' = 1/2, 'disk' = 2/3.
<dlog_params>  = 'nolog' => ~no arguments~
                 'core'  => <region_size> [sync | nosync]
                 'disk'  => <dlog_dev_path> <region_size>
                            [sync | nosync]
        <region_size>   = This sets the granularity at which the
                          dirty log tracks which areas of the device
                          are in-sync.
        [sync | nosync] = Optionally specify whether the sync should
                          be forced or avoided initially.