I/O Limits: block sizes, alignment and I/O hints

TOC:
====
* Overview
* Userspace access
* Standards
* Stacking I/O Limits
* LVM
* Partition and Filesystem tools

Overview:
=========
The Linux I/O stack has been enhanced to consume vendor-provided
"I/O Limits" information that allows Linux tools (parted, lvm, mkfs.*,
etc.) to optimize placement of and access to data. I/O that is not
properly aligned relative to the device's "I/O Limits" will result in
reduced performance or, in the worst case, application failure (see:
"Direct I/O best practices" in "Userspace access" below).

Not all storage devices export this "I/O Limits" information yet. Such
"legacy" devices will work fine because the various RHEL6 tools'
defaults conservatively align all I/O on a 4K (or larger power-of-2)
boundary.

Using this "I/O Limits" information enables 4K sector devices to be
fully supported for data volumes. Boot support for 4K sector devices is
planned but not yet available.

The kernel provides both block device ioctl and sysfs access to each
device's various "I/O Limits".

I/O Limits
----------
Certain 4K sector devices may use a 4K 'physical_block_size' internally
but expose a finer-grained 512 byte 'logical_block_size' to Linux. This
discrepancy introduces potential for misaligned I/O. Linux will attempt
to start all data areas on a naturally aligned ('physical_block_size')
boundary by accounting for any 'alignment_offset' when the beginning of
the Linux block device is offset from the underlying physical alignment.

Storage vendors can also supply "I/O hints" about a device's preferred
minimum unit for random I/O ('minimum_io_size') and streaming I/O
('optimal_io_size'). For example, these hints may correspond to a RAID
device's chunk size and stripe size respectively.

Userspace access
================

Direct I/O best practices
-------------------------
Users must always take care to use properly aligned and sized I/O.
This is especially important for Direct I/O access. Direct I/O should
be aligned on a 'logical_block_size' boundary and sized in multiples of
the 'logical_block_size'.

With native 4K devices (logical_block_size is 4K) it is now critical
that applications perform Direct I/O that is a multiple of the device's
'logical_block_size'. This means that applications that perform
512-byte aligned I/O rather than 4K aligned I/O will break on native 4K
devices.

Applications may consult a device's "I/O Limits" to ensure they are
using properly aligned and sized I/O. The "I/O Limits" are exposed
through both sysfs and block device ioctl interfaces (also see:
libblkid).

sysfs interface
---------------
/sys/block//alignment_offset
/sys/block///alignment_offset
/sys/block//queue/physical_block_size
/sys/block//queue/logical_block_size
/sys/block//queue/minimum_io_size
/sys/block//queue/optimal_io_size

The kernel will still export these sysfs attributes for "legacy"
devices that do not provide "I/O Limits" information, for example:

alignment_offset:    0
physical_block_size: 512
logical_block_size:  512
minimum_io_size:     512
optimal_io_size:     0

block device ioctls
-------------------
BLKALIGNOFF: alignment_offset
BLKPBSZGET:  physical_block_size
BLKSSZGET:   logical_block_size
BLKIOMIN:    minimum_io_size
BLKIOOPT:    optimal_io_size

Standards
=========

ATA
---
ATA devices must report appropriate information via the IDENTIFY DEVICE
command. ATA devices only report "I/O Limits" for
'physical_block_size', 'logical_block_size' and 'alignment_offset'. The
additional "I/O hints" are outside the scope of the ATA Command Set.

SCSI
----
The kernel's "I/O Limits" support requires at least version 3 of the
SCSI Primary Commands protocol (SPC-3). Linux will only send a
READ CAPACITY(16) and "extended inquiry" (which gains access to the
BLOCK LIMITS VPD page) to devices which claim conformance to SPC-3.
1) READ CAPACITY(16) provides the block sizes and alignment offset:
   LOGICAL BLOCK LENGTH IN BYTES:
     /sys/block//queue/logical_block_size
   LOGICAL BLOCKS PER PHYSICAL BLOCK EXPONENT is used to derive:
     /sys/block//queue/physical_block_size
   LOWEST ALIGNED LOGICAL BLOCK ADDRESS:
     /sys/block//alignment_offset
     /sys/block///alignment_offset

2) BLOCK LIMITS VPD provides the "I/O hints":
   OPTIMAL TRANSFER LENGTH GRANULARITY and OPTIMAL TRANSFER LENGTH are
   used to derive:
     /sys/block//queue/minimum_io_size
     /sys/block//queue/optimal_io_size

The sg3_utils package provides the 'sg_inq' utility that can be used to
access the BLOCK LIMITS VPD page (0xb0), using:

  sg_inq -p 0xb0

Stacking I/O Limits
===================
All layers of the Linux I/O stack have been engineered to propagate the
various "I/O Limits" up the stack. When a layer consumes an attribute
or aggregates many devices, it must expose appropriate "I/O Limits" so
that upper-layer devices or tools will have an accurate view of the
storage as it is transformed. Some practical examples are:

- only one layer in the I/O stack should adjust for a non-zero
  'alignment_offset'; once a layer adjusts for it, it will export a
  device with an 'alignment_offset' of zero
- a striped Device Mapper (DM) device, created with LVM, must export a
  'minimum_io_size' and 'optimal_io_size' relative to the stripe count
  (number of disks) and user-provided chunk size

Linux Device Mapper (DM) and Software RAID (MD) device drivers can be
used to arbitrarily combine devices with different "I/O Limits". The
kernel's block layer goes to great lengths to reasonably combine the
"I/O Limits" of the individual devices. The kernel will not prevent
combining heterogeneous devices but the user should be aware of the
risk associated with doing so.

For instance, a 512 byte device and a 4K device may be combined into a
single logical DM device; the resulting DM device would have a
'logical_block_size' of 4K.
Filesystems layered on such a hybrid device assume that 4K will be
written atomically but in reality it will span 8 LBAs when issued to
the 512 byte device. Using a 4K 'logical_block_size' for the
higher-level DM device increases the potential for a partial write to
the 512 byte device if there is a system crash.

If combining multiple devices' "I/O Limits" results in a conflict, the
block layer may report a warning that the device is susceptible to
partial writes and/or is misaligned.

Logical Volume Manager (LVM)
============================
LVM provides userspace tools that are used to manage the kernel's DM
devices. LVM will shift the start of the data area that a given DM
device will use to account for a non-zero 'alignment_offset' associated
with any device LVM manages. This means LVM logical volumes will be
properly aligned (alignment_offset=0).

LVM will adjust for any 'alignment_offset' by default but this may be
disabled through lvm.conf's 'data_alignment_offset_detection'.
Disabling this is not recommended.

LVM will also detect the "I/O hints" for a device. The start of a
device's data area will be a multiple of the 'minimum_io_size' or
'optimal_io_size' exposed in sysfs. 'minimum_io_size' is used if
'optimal_io_size' is undefined (0).

LVM will automatically determine these "I/O hints" by default but this
may be disabled through lvm.conf's 'data_alignment_detection'.
Disabling this is not recommended.

Partition and Filesystem tools
==============================

util-linux-ng's libblkid and fdisk
----------------------------------
The libblkid library provided with the util-linux-ng package includes a
programmatic API to access a device's "I/O Limits". libblkid allows
applications, especially those that use Direct I/O, to properly size
their I/O requests.

util-linux-ng's fdisk uses libblkid to determine a device's
"I/O Limits" for optimal placement of all partitions.
If a device doesn't provide "I/O Limits" information, fdisk will align
all partitions on a 1MB boundary.

parted and libparted
--------------------
parted's libparted also uses libblkid's "I/O Limits" API. The RHEL6
installer (anaconda) uses libparted. This means that all partitions
created with either the installer or parted will be properly aligned.
The default alignment for all partitions created on a device that
doesn't appear to provide "I/O Limits" information will be 1MB.

The heuristic parted uses is:

1) Always use the reported 'alignment_offset' as the offset for the
   start of the first primary partition.
2a) If 'optimal_io_size' is defined (not 0), align all partitions on an
    'optimal_io_size' boundary.
2b) If 'optimal_io_size' is undefined (0) and 'alignment_offset' is 0
    and 'minimum_io_size' is a power of 2: use a 1MB default alignment.
    - This is the catch-all for "legacy" devices which don't appear to
      provide "I/O hints", so in the default case all partitions will
      align on a 1MB boundary.
    - NOTE: we can't distinguish between a "legacy" device and a modern
      device that provides "I/O hints" with alignment_offset=0 and
      optimal_io_size=0. Such a device might be a single SAS 4K device.
      So worst case we lose < 1MB of space at the start of the disk.

Filesystem tools
----------------
mkfs.ext[234], mkfs.xfs, and mkfs.gfs2 have been enhanced to consume a
device's "I/O Limits". Linux filesystems are not allowed to be
formatted to use a block size that is smaller than the underlying
storage's 'logical_block_size'.

mkfs.ext[234] and mkfs.xfs also use the "I/O hints" to lay out on-disk
data structures and data areas relative to the underlying storage's
'minimum_io_size' and 'optimal_io_size'. This allows filesystems to be
optimally formatted for various RAID (striped) layouts.
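
Worked example: partition alignment heuristic
---------------------------------------------
To make the parted heuristic from the "parted and libparted" section
concrete, the following is a minimal sketch of rules 1, 2a and 2b. It
is not parted's code. The function names are hypothetical, and the
final fallback to 'minimum_io_size' (for devices that match neither 2a
nor 2b) is an assumption, since the text above does not say what
happens in that case.

```python
MIB = 1024 * 1024  # the 1MB default alignment, in bytes


def is_power_of_two(n):
    """True for 1, 2, 4, 8, ...; False for 0 and everything else."""
    return n > 0 and (n & (n - 1)) == 0


def partition_alignment(alignment_offset, minimum_io_size, optimal_io_size):
    """Return (offset, grain) in bytes for partition starts.

    Partition starts are then expected at offset + k * grain.
    """
    if optimal_io_size != 0:
        # Rule 2a: the device reports a preferred streaming I/O size.
        grain = optimal_io_size
    elif alignment_offset == 0 and is_power_of_two(minimum_io_size):
        # Rule 2b: catch-all for "legacy" devices, use the 1MB default.
        grain = MIB
    else:
        # ASSUMPTION (not stated in the text): fall back to minimum_io_size.
        grain = minimum_io_size
    # Rule 1: the reported alignment_offset is always the starting offset.
    return alignment_offset, grain


def align_up(byte_pos, offset, grain):
    """Smallest position >= byte_pos of the form offset + k * grain."""
    if byte_pos <= offset:
        return offset
    k = -(-(byte_pos - offset) // grain)  # ceiling division
    return offset + k * grain


if __name__ == "__main__":
    # "Legacy" device values as exported by the kernel (see sysfs interface).
    offset, grain = partition_alignment(0, 512, 0)
    print(offset, grain, align_up(1, offset, grain))
```

With the "legacy" values shown under "sysfs interface"
(alignment_offset 0, minimum_io_size 512, optimal_io_size 0) the sketch
selects the 1MB grain, which matches the default described above.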