BTRFS RAID Stripe Tree Design
=============================


Problem Statement
-----------------


RAID on zoned devices
---------------------
The current implementation of RAID profiles in BTRFS is based on the implicit
assumption that data placement is deterministic in the device chunks used for
mapping block groups.
With deterministic data placement, all physical on-disk extents of one logical
file extent are positioned at the same offset relative to the starting LBA of
a device chunk. This prevents the need for reading any meta-data to access an
on-disk file extent. Figure 1 shows an example of it.


	+------------+      +------------+
	|            |      |            |
	|            |      |            |
	|            |      |            |
	+------------+      +------------+
	|            |      |            |
	|    D 1     |      |    D 1     |
	|            |      |            |
	+------------+      +------------+
	|            |      |            |
	|    D 2     |      |    D 2     |
	|            |      |            |
	+------------+      +------------+
	|            |      |            |
	|            |      |            |
	|            |      |            |
	|            |      |            |
	+------------+      +------------+
	Figure 1: Deterministic data placement



With non-deterministic data placement, the on-disk extents of a logical file
extent can be scattered around inside the chunk. To read back the data with
non-deterministic data placement, additional meta-data describing the position
of the extents inside a chunk is needed. Figure 2 shows an example for this
style of data placement.


	+------------+      +------------+
	|            |      |            |
	|            |      |            |
	|            |      |            |
	+------------+      +------------+
	|            |      |            |
	|    D 1     |      |    D 2     |
	|            |      |            |
	+------------+      +------------+
	|            |      |            |
	|    D 2     |      |    D 1     |
	|            |      |            |
	+------------+      +------------+
	|            |      |            |
	|            |      |            |
	|            |      |            |
	|            |      |            |
	+------------+      +------------+
	Figure 2: Non-deterministic data placement

As BTRFS support for zoned block devices uses the Zone Append operation for
writing file data extents, there is no guarantee that the written extents have
the same offset within different device chunks. This implies that to be able
to use RAID with zoned devices, non-deterministic data placement must be
supported and additional meta-data describing the location of file extents
within device chunks is needed.


Lessons learned from RAID 5
---------------------------
BTRFS implementation of RAID levels 5 and 6 suffer from the well-known RAID
write hole problem. This problem exists because sub-stripe write operations
are not done using a copy-on-write (COW) but done using Read-Modify-Write
(RMW). With out-of-place writing like COW, no blocks will be overwritten and
there is no risk of exposing bad data or corrupting a data stripe parity in
case of sudden power loss or other unexpected events preventing correctly
writing to the device.

RAID Stripe Tree Design overview
--------------------------------

To solve the problems stated above, additional meta-data is introduced: a RAID
Stripe Tree holds the logical to physical translation for the RAID stripes.
For each logical file extent (struct btrfs_file_extent_item) a stripe extent
is created (struct btrfs_stripe_extent). Each btrfs_stripe_extent entry is a
container for an array of struct btrfs_raid_stride. A struct btrfs_raid_stride
holds the device ID and the physical start location on that device of the
sub-stripe of a file extent, as well as the stride's length.
Each struct btrfs_stripe_extent is keyed by the struct btrfs_file_extent_item
disk_bytenr and disk_num_bytes, with disk_bytenr as the objectid for the
btrfs_key and disk_num_bytes as the offset. The key’s type is
BTRFS_STRIPE_EXTENT_KEY.

On-disk format
--------------

struct btrfs_file_extent_item {
        /* […] */
        __le64 disk_bytenr; ------------------------------------+
        __le64 disk_num_bytes;  --------------------------------|----+
	/* […] */                                               |    |
};                                                              |    |
                                                                |    |
struct btrfs_key key = {   -------------------------------------|----|--+
	.objectid = btrfs_file_extent_item::disk_bytenr, <------+    |  |
	.type = BTRFS_STRIPE_EXTENT_KEY,                             |  |
	.offset = btrfs_file_extent_item::disk_num_bytes, <----------+  |
};                                                                      |
                                                                        |
struct btrfs_raid_stride { <------------------+                         |
       __le64 devid;                          |                         |
       __le64 physical;                       |                         |
       __le64 length;                         |                         |
};                                            |                         |
                                              |                         |
struct btrfs_stripe_extent {  <---------------|-------------------------+
	u8 encoding;                          |
	u8 reserved[7];                       |
       struct btrfs_raid_stride strides[]; ---+
};


User-space support
------------------


mkfs
----
Supporting the RAID Stripe Tree in user-space consists of three things to do
for mkfs. The first step is creating the root of the RAID Stripe Tree itself.
Then mkfs must set the incompat flag, so mounting a filesystem with a RAID
Stripe Tree is impossible for a kernel version without appropriate support.
Lastly it must allow RAID profiles on zoned devices once the tree is present.


Check
-----
The ‘btrfs check’ support for RAID Stripe Tree is not implemented yet, but its
task is to read the struct btrfs_stripe_extent entries for each struct
btrfs_file_extent_item and verify that a correct mapping between these two
exists. If data checksum verification is requested as well, the tree also must
be read to perform the logical to physical translation, otherwise the data
cannot be read, and the checksums cannot be verified.


Example tree dumps
------------------

Example 1: Write 128k to and empty FS

RAID0
        item 0 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 56
                        encoding: RAID0
                        stripe 0 devid 1 physical XXXXXXXXX length 65536
                        stripe 1 devid 2 physical XXXXXXXXX length 65536
RAID1
                        stripe 0 devid 1 physical XXXXXXXXX length 131072
                        stripe 1 devid 2 physical XXXXXXXXX length 131072
RAID10
        item 0 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 104
                        encoding: RAID10
                        stripe 0 devid 1 physical XXXXXXXXX length 65536
                        stripe 1 devid 2 physical XXXXXXXXX length 65536
                        stripe 2 devid 3 physical XXXXXXXXX length 65536
                        stripe 3 devid 4 physical XXXXXXXXX length 65536

Example 2: Pre-fill one 65k stripe, write 4k to 2nd stripe, write 64k then
write 16k.

RAID0
        item 0 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 32
                        encoding: RAID0
                        stripe 0 devid 1 physical XXXXXXXXX length 65536
        item 1 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 32
                        encoding: RAID0
                        stripe 0 devid 2 physical XXXXXXXXX length 4096
        item 2 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 56
                        encoding: RAID0
                        stripe 0 devid 2 physical XXXXXXXXX length 61440
                        stripe 1 devid 1 physical XXXXXXXXX length 4096
        item 3 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 32
                        encoding: RAID0
                        stripe 0 devid 1 physical XXXXXXXXX length 4096
RAID1
        item 0 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 56
                        encoding: RAID1
                        stripe 0 devid 1 physical XXXXXXXXX length 65536
                        stripe 1 devid 2 physical XXXXXXXXX length 65536
        item 1 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 56
                        encoding: RAID1
                        stripe 0 devid 1 physical XXXXXXXXX length 4096
                        stripe 1 devid 2 physical XXXXXXXXX length 4096
        item 2 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 56
                        encoding: RAID1
                        stripe 0 devid 1 physical XXXXXXXXX length 65536
                        stripe 1 devid 2 physical XXXXXXXXX length 65536
        item 3 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 56
                        encoding: RAID1
                        stripe 0 devid 1 physical XXXXXXXXX length 4096
                        stripe 1 devid 2 physical XXXXXXXXX length 4096
RAID10
        item 0 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 56
                        encoding: RAID10
                        stripe 0 devid 1 physical XXXXXXXXX length 65536
                        stripe 1 devid 2 physical XXXXXXXXX length 65536
        item 1 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 56
                        encoding: RAID10
                        stripe 0 devid 3 physical XXXXXXXXX length 4096
                        stripe 1 devid 4 physical XXXXXXXXX length 4096
        item 2 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 104
                        encoding: RAID10
                        stripe 0 devid 3 physical XXXXXXXXX length 61440
                        stripe 1 devid 4 physical XXXXXXXXX length 61440
                        stripe 2 devid 1 physical XXXXXXXXX length 4096
                        stripe 3 devid 2 physical XXXXXXXXX length 4096
        item 3 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 56
                        encoding: RAID10
                        stripe 0 devid 1 physical XXXXXXXXX length 4096
                        stripe 1 devid 2 physical XXXXXXXXX length 4096

Example 3: Pre-fill stripe with 32K data, then write 64K of data and then
overwrite 8k in the middle.

RAID0
        item 0 key (XXXXXX RAID_STRIPE_KEY 32768) itemoff XXXXX itemsize 32
                        encoding: RAID0
                        stripe 0 devid 1 physical XXXXXXXXX length 32768
        item 1 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 80
                        encoding: RAID0
                        stripe 0 devid 1 physical XXXXXXXXX length 32768
                        stripe 1 devid 2 physical XXXXXXXXX length 65536
                        stripe 2 devid 1 physical XXXXXXXXX length 32768
        item 2 key (XXXXXX RAID_STRIPE_KEY 8192) itemoff XXXXX itemsize 32
                        encoding: RAID0
                        stripe 0 devid 1 physical XXXXXXXXX length 8192
RAID1
        item 0 key (XXXXXX RAID_STRIPE_KEY 32768) itemoff XXXXX itemsize 56
                        encoding: RAID1
                        stripe 0 devid 1 physical XXXXXXXXX length 32768
                        stripe 1 devid 2 physical XXXXXXXXX length 32768
        item 1 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 56
                        encoding: RAID1
                        stripe 0 devid 1 physical XXXXXXXXX length 131072
                        stripe 1 devid 2 physical XXXXXXXXX length 131072
        item 2 key (XXXXXX RAID_STRIPE_KEY 8192) itemoff XXXXX itemsize 56
                        encoding: RAID1
                        stripe 0 devid 1 physical XXXXXXXXX length 8192
                        stripe 1 devid 2 physical XXXXXXXXX length 8192
RAID10
        item 0 key (XXXXXX RAID_STRIPE_KEY 32768) itemoff XXXXX itemsize 56
                        encoding: RAID10
                        stripe 0 devid 1 physical XXXXXXXXX length 32768
                        stripe 1 devid 2 physical XXXXXXXXX length 32768
        item 1 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 152
                        encoding: RAID10
                        stripe 0 devid 1 physical XXXXXXXXX length 32768
                        stripe 1 devid 2 physical XXXXXXXXX length 32768
                        stripe 2 devid 3 physical XXXXXXXXX length 65536
                        stripe 3 devid 4 physical XXXXXXXXX length 65536
                        stripe 4 devid 1 physical XXXXXXXXX length 32768
                        stripe 5 devid 2 physical XXXXXXXXX length 32768
        item 2 key (XXXXXX RAID_STRIPE_KEY 8192) itemoff XXXXX itemsize 56
                        encoding: RAID10
                        stripe 0 devid 1 physical XXXXXXXXX length 8192
                        stripe 1 devid 2 physical XXXXXXXXX length 8192


Glossary
--------


RAID
	Redundant Array of Independent Disks. This is a storage mechanism
	where data is not stored on a single disk alone but either mirrored
	(in case of RAID 1) or split across multiple disks (RAID 0). Other
	RAID levels like RAID5 and RAID6 stripe the data across multiple disks
	and add parity information to enable data recovery in case of a disk
	failure.


LBA 
	Logical Block Address. LBAs describe the address space of a block
	device as a linearly increasing address map. LBAs are internally
	mapped to different physical locations  by the device firmware.


Zoned Block Device
	Zoned Block Devices are a special kind of block devices that partition
	their LBA space into so called zones. These zones can impose write
	constraints on the host, e.g., allowing only sequential writes aligned
	to a zone write pointer.


Zone Append
	A write operation where the start LBA of a Zone is specified instead
	of a destination LBA for the data to be written. On completion the
	device reports the starting LBA used to write the data back to the
	host.


Copy-on-Write
	A write technique where the data is not overwritten, but a new version
	of it is written out of place.


Read-Modify-Write
	A write technique where the data to be written is first read from the
	block device, modified in memory and the modified data written back in
	place on the block device.
