dm-replicator
=============

Device-mapper replicator is designed to enable redundant copies of
storage devices to be made - preferentially, to remote locations.
RAID1 (aka mirroring) is often used to maintain redundant copies of
storage for fault tolerance purposes.  Unlike RAID1, which often
assumes similar device characteristics, dm-replicator is designed to
handle devices with different latency and bandwidth characteristics,
which are often the result of the geographic disparity of multi-site
architectures.  Simply put, you might choose RAID1 to protect from a
single device failure, but you would choose remote replication via
dm-replicator for protection against a site failure.

dm-replicator works by first sending write requests to the
"replicator log".  Not to be confused with the device-mapper dirty
log, this replicator log behaves similarly to a journal.  Write
requests go to this log first and then are copied to all the replica
devices at their various locations.  Requests are cleared from the
log once all replica devices confirm the data is received/copied.
This architecture allows dm-replicator to be flexible in terms of
device characteristics.  If one device should fall behind the others
- perhaps due to high latency - the slack is picked up by the log.
The user has a great deal of flexibility in specifying to what
degree a particular site is allowed to fall behind - if at all.

Device-mapper's dm-replicator has two targets, "replicator" and
"replicator-dev".  The "replicator" target is used to set up the
aforementioned log and allows the specification of site link
properties.  Through the "replicator" target, the user might specify
that writes that are copied to the local site must happen
synchronously (i.e. the writes are complete only after they have
passed through the log device and have landed on the local site's
disk).
They may also specify that a remote link should complete writes
asynchronously, but that the remote link should never fall more than
100MB behind in terms of processing.  Again, the "replicator" target
is used to define the replicator log and the characteristics of each
site link.

The "replicator-dev" target is used to define the devices used and
associate them with a particular replicator log.  You might think of
this stage in a similar way to setting up RAID1 (mirroring).  You
define a set of devices which will be copies of each other, but
access the device through the mirror virtual device, which takes
care of the copying.  The user-accessible replicator device is
analogous to the mirror virtual device, while the set of devices
being copied to are analogous to the mirror images (sometimes called
'legs').  When creating a replicator device via the "replicator-dev"
target, it must be associated with the replicator log (created with
the aforementioned "replicator" target).  When each redundant device
is specified as part of the replicator device, it is associated with
a site link whose properties were defined when the "replicator"
target was created.

The user can go further than simply replicating one device.  They
can continue to add replicator devices - associating them with a
particular replicator log.  Writes that go through the replicator
log are guaranteed to have their write ordering preserved.  So, if
you associate more than one replicator device with a particular
replicator log, you are preserving write ordering across multiple
devices.  This might be useful if you had a database that spanned
multiple disks where write ordering must be preserved, or any
transaction accounting scheme would be foiled.  (You can imagine
this like preserving write ordering across a number of mirrored
devices, where each mirror has images/legs in different geographic
locations.)

dm-replicator has a modular architecture.
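The two-stage setup described above might be sketched with dmsetup as
follows.  This is only an illustration: dm-replicator was proposed as
a patch set and may not be present in your kernel, and every device
path, name, and size below is invented.

```shell
# Hypothetical sketch only - the dm-replicator module may not exist on
# your system, so the dmsetup calls are left commented out.

# Stage 1: the "replicator" target creates the replicator log on an
# invented backing device, with a synchronous local site link (0) and
# an asynchronous remote site link (1) allowed to fall at most 100MB
# behind before it is forced synchronous.
LOG_TABLE='0 8 replicator ringbuffer 4 /dev/sdb 0 auto 204800 blockdev 2 0 sync blockdev 4 1 async size 100m'
# dmsetup create replog --table "$LOG_TABLE"

# Stage 2: two "replicator-dev" devices hang off the same log, so
# write ordering is preserved across both - e.g. for a database that
# spans two disks.  Each has a local (slink 0) and remote (slink 1)
# image with a 'core' dirty log using 1024-sector regions.
DEV1_TABLE='0 2097152 replicator-dev /dev/mapper/replog 1 0 1 /dev/sdc core 1 1024 1 1 /dev/sdd core 1 1024'
DEV2_TABLE='0 2097152 replicator-dev /dev/mapper/replog 2 0 1 /dev/sde core 1 1024 1 1 /dev/sdf core 1 1024'
# dmsetup create repdev1 --table "$DEV1_TABLE"
# dmsetup create repdev2 --table "$DEV2_TABLE"
echo "$LOG_TABLE"
```

Note that applications then write only to /dev/mapper/repdev1 and
/dev/mapper/repdev2; the log and site links do the copying behind the
scenes, just as one writes to a mirror virtual device rather than to
its legs.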
Future implementations for the replicator log and site link modules
are allowed.  The current replication log is "ringbuffer" - utilized
to store all writes being subject to replication and enforce write
ordering.  The current site link code is based on accessing block
devices (iSCSI, FC, etc) and performs device recovery, including
(initial) resynchronization.

Picture of a configuration with n local devices (LDs) in a primary
site being resynchronized to m remote sites with n remote devices
(RDs) each via site links (SLINK) 1-m, with site link 0 as a special
case to handle the local devices:

 Local (primary) site    |           Remote sites
 --------------------    |           ------------
                         |
  D1   D2     Dn         |
   |    |      |         |
   +----+-...--+         |
        |                |
      REPLOG-------------+-- SLINK1 --------------+
        |                |                        |
      SLINK0             |                        |
  (special case)         |                        |
        |                |                        |
   +----+--...--+        |           +----+--...--+
   |    |       |        |           |    |       |
  LD1  LD2     LDn       |          RD1  RD2     RDn
                         |
                         +-- SLINK2 --------------+
                         |                        |
                         |           +----+--...--+
                         |           |    |       |
                         |          RD1  RD2     RDn
                         |
                         +-- SLINKm --------------+
                         |                        |
                         |           +----+--...--+
                         |           |    |       |
                         |          RD1  RD2     RDn

The following are descriptions of the device-mapper tables used to
construct the "replicator" and "replicator-dev" targets.

"replicator" target parameters:
-------------------------------
<start> <length> replicator \
        <replog_type> <#replog_params> <replog_params> \
        [<slink_type_0> <#slink_params_0> <slink_params_0>]{1..N}

<replog_type>    = "ringbuffer" is currently the only available type
<#replog_params> = # of args following this one intended for the
                   replog (2 or 4)
<replog_params>  = <dev_path> <dev_start> [auto/create/open <size>]
        <dev_path>  = device path of the replication log (REPLOG)
                      backing store
        <dev_start> = offset to the REPLOG header
        create      = The replication log will be initialized if not
                      active and sized to <size>.  (If already
                      active, the create will fail.)  Size is always
                      in sectors.
        open        = The replication log must be initialized and
                      valid or the constructor will fail.
        auto        = If a valid replication log header is found on
                      the replication device, this will behave like
                      'open'.  Otherwise, this option behaves like
                      'create'.
<slink_type>    = "blockdev" is currently the only available type
<#slink_params> = 1/2/4
<slink_params>  = <slink#> [<policy> [<fall_behind> <N>]]
        <slink#>      = This is a unique number that is used to
                        identify a particular site/location.  '0' is
                        always used to identify the local site, while
                        increasing integers are used to identify
                        remote sites.
        <policy>      = The policy can be either 'sync' or 'async'.
                        'sync' means write requests will not return
                        until the data is on the storage device.
                        'async' allows a device to "fall behind";
                        that is, outstanding write requests are
                        waiting in the replication log to be
                        processed for this site, but they are not
                        delaying the writes of other sites.
        <fall_behind> = This field is used to specify how far the
                        user is willing to allow write requests to
                        this specific site to "fall behind" in
                        processing before switching to a 'sync'
                        policy.  This "fall behind" threshold can be
                        specified in three ways: ios, size, or
                        timeout.  'ios' is the number of pending I/Os
                        allowed (e.g. "ios 10000").  'size' is the
                        amount of pending data allowed (e.g.
                        "size 200m").  Size labels include:
                        s (sectors), k, m, g, t, p, and e.  'timeout'
                        is the amount of time allowed for writes to
                        be outstanding.  Time labels include: s, m,
                        h, and d.

"replicator-dev" target parameters:
-----------------------------------
<start> <length> replicator-dev \
        <replicator_device> <dev#> \
        [<slink#> <#dev_params> <dev_params>
         <dlog_type> <#dlog_params> <dlog_params>]{1..N}

<replicator_device> = device previously constructed via the
                      "replicator" target
<dev#>              = An integer that is used to 'tag' write requests
                      as belonging to a particular set of devices -
                      specifically, the devices that follow this
                      argument (i.e. the site link devices).
<slink#>            = This number identifies the site/location where
                      the next device to be specified comes from.  It
                      is exactly the same number used to identify the
                      site/location (and its policies) in the
                      "replicator" target.  Interestingly, while one
                      might normally expect a "dev_type" argument
                      here, it can be deduced from the site link
                      number and the 'slink_type' given in the
                      "replicator" target.
<#dev_params>  = '1'
                 (The number of allowed parameters actually depends
                 on the 'slink_type' given in the "replicator"
                 target.  Since our only option there is "blockdev",
                 the only allowable number here is currently '1'.)
<dev_params>   = 'dev_path'
                 (Again, since "blockdev" is the only 'slink_type'
                 available, the only allowable argument here is the
                 path to the device.)
<dlog_type>    = Not to be confused with the "replicator log", this
                 is the type of dirty log associated with this
                 particular device.  Dirty logs are used for
                 synchronization, during initialization or fall
                 behind conditions, to bring a device into a coherent
                 state with its peers - analogous to rebuilding a
                 RAID1 (mirror) device.  Available dirty log types
                 include: 'nolog', 'core', and 'disk'.
<#dlog_params> = The number of arguments required for a particular
                 log type - 'nolog' = 0, 'core' = 1/2, 'disk' = 2/3.
<dlog_params>  = 'nolog' => ~no arguments~
                 'core'  => <region_size> [sync | nosync]
                 'disk'  => <dlog_dev_path> <region_size>
                            [sync | nosync]
        <region_size>   = This sets the granularity at which the
                          dirty log tracks which areas of the device
                          are in-sync.
        [sync | nosync] = Optionally specify whether the sync should
                          be forced or avoided initially.