The SPDK "reduce" block compression scheme is based on using SSDs for storing compressed blocks of storage and persistent memory for metadata. This metadata includes mappings of logical blocks requested by a user to the compressed blocks on SSD. The scheme described in this document is generic and not tied to any specific block device framework such as the SPDK block device (bdev) framework. This algorithm will be implemented in a library called "libreduce". Higher-level software modules can be built on top of this library to create and present block devices in a specific block device framework. For SPDK, a bdev_reduce module will serve as a wrapper around the libreduce library, presenting the compressed block devices as SPDK bdevs.
This scheme only describes how compressed blocks are stored on an SSD and the metadata for tracking those compressed blocks. It relies on the higher-level software module to perform the compression algorithm itself. For SPDK, the bdev_reduce module will utilize the DPDK compressdev framework to perform compression and decompression on behalf of the libreduce library.
(Note that in some cases, blocks of storage may not be compressible, or cannot be compressed enough to realize savings from the compression. In these cases, the data may be stored uncompressed on disk. The phrase "compressed blocks of storage" includes these uncompressed blocks.)
A compressed block device is a logical entity built on top of a similarly-sized backing storage device. The backing storage device must be thin-provisioned to realize any savings from compression for reasons described later in this document. This algorithm has no direct knowledge of the implementation of the backing storage device, except that it will always use the lowest-numbered blocks available on the backing storage device. This will ensure that when this algorithm is used on a thin-provisioned backing storage device, blocks will not be allocated until they are actually needed.
The backing storage device must be sized for the worst case scenario, where no data can be compressed. In this case, the size of the backing storage device would be the same as the compressed block device. Since this algorithm ensures atomicity by never overwriting data in place, some additional backing storage is required to temporarily store data for writes in progress before the associated metadata is updated.
Storage from the backing storage device will be allocated, read, and written to in 4KB units for best NVMe performance. These 4KB units are called "backing IO units". They are indexed from 0 to N-1 with the indices called "backing IO unit indices". At start, the full set of indices represents the "free backing IO unit list".
A compressed block device compresses and decompresses data in units of chunks, where a chunk is a multiple of at least two 4KB backing IO units. The number of backing IO units per chunk determines the chunk size and is specified when the compressed block device is created. A chunk consumes a number of 4KB backing IO units between 1 and the number of 4KB units in the chunk. For example, a 16KB chunk consumes 1, 2, 3 or 4 backing IO units. The number of backing IO units depends on how much the chunk was able to be compressed. The blocks on disk associated with a chunk are stored in a "chunk map" in persistent memory. Each chunk map consists of N 64-bit values, where N is the maximum number of backing IO units in the chunk. Each 64-bit value corresponds to a backing IO unit index. A special value (for example, 2^64-1) is used for backing IO units not needed due to compression. The number of chunk maps allocated is equal to the size of the compressed block device divided by its chunk size, plus some number of extra chunk maps. These extra chunk maps are used to ensure atomicity on writes and will be explained later in this document. At start, all of the chunk maps represent the "free chunk map list".
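The chunk map structure described above can be sketched as follows. This is an illustrative Python sketch, not libreduce's actual (C) implementation; `UNIT_SIZE`, `EMPTY`, and `make_chunk_map` are hypothetical names introduced only for this example:

```python
# Illustrative sketch only: libreduce itself is written in C, and these
# names are hypothetical. A chunk map holds one 64-bit backing IO unit
# index per slot; the special value 2^64-1 marks slots left unused
# because the chunk compressed into fewer backing IO units.

UNIT_SIZE = 4096      # 4KB backing IO unit
EMPTY = 2**64 - 1     # special value for unused slots

def make_chunk_map(chunk_size, backing_indices):
    """Build a chunk map for a chunk whose compressed data landed in the
    given backing IO units (taken from the free backing IO unit list)."""
    slots = chunk_size // UNIT_SIZE
    assert 1 <= len(backing_indices) <= slots
    return backing_indices + [EMPTY] * (slots - len(backing_indices))

# A 16KB chunk that compressed into 2 backing IO units (indices 8 and 9):
cmap = make_chunk_map(16384, [8, 9])
assert cmap == [8, 9, EMPTY, EMPTY]
```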
Finally, the logical view of the compressed block device is represented by the "logical map". The logical map is a mapping of chunk offsets into the compressed block device to the corresponding chunk map. Each entry in the logical map is a 64-bit value, denoting the associated chunk map. A special value (UINT64_MAX) is used if there is no associated chunk map. The mapping is determined by dividing the byte offset by the chunk size to get an index, which is used as an array index into the array of chunk map entries. At start, all entries in the logical map have no associated chunk map. Note that while access to the backing storage device is in 4KB units, the logical view may allow 4KB or 512B unit access and should perform similarly.
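The logical map lookup described above is simple integer arithmetic. A hedged Python sketch (names are assumptions for illustration, not libreduce's API):

```python
# Byte offset -> logical map index, as described above: divide the byte
# offset by the chunk size. NO_CHUNK_MAP stands in for the special value
# (UINT64_MAX) used for entries with no associated chunk map.

CHUNK_SIZE = 16384            # 16KB chunks
NO_CHUNK_MAP = 2**64 - 1

def logical_map_index(byte_offset):
    return byte_offset // CHUNK_SIZE

# A 64KB compressed block device has 64KB / 16KB = 4 logical map entries,
# all initially having no associated chunk map:
logical_map = [NO_CHUNK_MAP] * (65536 // CHUNK_SIZE)

assert len(logical_map) == 4
assert logical_map_index(0) == 0       # offset 0 falls in chunk 0
assert logical_map_index(20480) == 1   # offset 20KB falls in chunk 1
```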
To illustrate this algorithm, we will use a real example at a very small scale.
The size of the compressed block device is 64KB, with a chunk size of 16KB. This will realize the following:

* Four chunks in the compressed block device (chunks 0 through 3), and therefore four entries in the logical map
* Sixteen 4KB backing IO units (indices 0 through 15), since the backing storage device is sized for the worst case (64KB)
* Chunk maps of four 64-bit values each, since a 16KB chunk contains at most four 4KB backing IO units
In these examples, the value "X" will represent the special value (2^64-1) described above.
Operations that span a chunk boundary are logically split into multiple operations, each of which is associated with a single chunk.
Example: 20KB write at offset 4KB
In this case, the write operation is split into a 12KB write at offset 4KB (affecting only chunk 0 in the logical map) and an 8KB write at offset 16KB (affecting only chunk 1 in the logical map). Each write is processed independently using the algorithm described above. Completion of the 20KB write does not occur until both operations have completed.
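The chunk-boundary split can be sketched as follows (illustrative Python; `split_by_chunk` is a hypothetical name, not part of libreduce):

```python
# Split an I/O that may span chunk boundaries into per-chunk pieces, as
# described above. Returns (offset, length) pairs, one per affected chunk.

CHUNK_SIZE = 16384  # 16KB, as in the example

def split_by_chunk(offset, length):
    pieces = []
    end = offset + length
    while offset < end:
        # End of the chunk containing `offset`:
        chunk_end = (offset // CHUNK_SIZE + 1) * CHUNK_SIZE
        piece_len = min(end, chunk_end) - offset
        pieces.append((offset, piece_len))
        offset += piece_len
    return pieces

# The 20KB write at offset 4KB from the example splits into a 12KB write
# at offset 4KB and an 8KB write at offset 16KB:
assert split_by_chunk(4096, 20480) == [(4096, 12288), (16384, 8192)]
```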
Unmap operations on an entire chunk are achieved by removing the chunk map entry (if any) from the logical map. The chunk map is returned to the free chunk map list, and any backing IO units associated with the chunk map are returned to the free backing IO unit list.
Unmap operations that affect only part of a chunk can be treated as writing zeroes to that region of the chunk. If the entire chunk ends up unmapped through a series of such operations, this can be detected when the chunk's uncompressed data is all zeroes; when this occurs, the chunk map entry may be removed from the logical map.
After an entire chunk has been unmapped, subsequent reads to the chunk will return all zeroes. This is similar to the "Read 16KB at offset 16KB" example above.
Write zeroes operations are handled similarly to unmap operations. If a write zeroes operation covers an entire chunk, we can remove the chunk's entry in the logical map completely. Then subsequent reads to that chunk will return all zeroes.
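A full-chunk unmap (or full-chunk write zeroes) therefore amounts to dropping the logical map entry and returning the chunk's resources to the free lists. A hedged Python sketch of this bookkeeping, with all names hypothetical:

```python
# Illustrative sketch only (libreduce is implemented in C). Drop the
# logical map entry for a fully unmapped chunk and return its chunk map
# and backing IO units to the respective free lists.

NO_CHUNK_MAP = 2**64 - 1   # special logical map value: no chunk map
EMPTY = 2**64 - 1          # special chunk map value: unused slot

def unmap_chunk(logical_map, chunk_maps, chunk_index,
                free_chunk_maps, free_units):
    cm_index = logical_map[chunk_index]
    if cm_index == NO_CHUNK_MAP:
        return                        # nothing was ever written here
    for unit in chunk_maps[cm_index]:
        if unit != EMPTY:
            free_units.append(unit)   # backing IO units back to free list
    free_chunk_maps.append(cm_index)  # chunk map back to free list
    logical_map[chunk_index] = NO_CHUNK_MAP
```

After this, a read of the chunk sees no chunk map entry and returns all zeroes.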
An application using libreduce will periodically exit and need to be restarted. When the application restarts, it will reload compressed volumes so they can be used again from the same state as when the application exited.
When the compressed volume is reloaded, the free chunk map list and free backing IO unit list are reconstructed by walking the logical map. The logical map will only point to valid chunk maps, and the valid chunk maps will only point to valid backing IO units. Any chunk maps and backing IO units not referenced go into their respective free lists.
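The reconstruction walk described above might look like this (illustrative Python sketch; libreduce's actual C implementation may differ):

```python
# Walk the logical map, mark every referenced chunk map and backing IO
# unit as in use, and put everything unreferenced on the free lists.
# All names here are illustrative, not libreduce's API.

NO_CHUNK_MAP = 2**64 - 1
EMPTY = 2**64 - 1

def rebuild_free_lists(logical_map, chunk_maps, total_units):
    used_maps, used_units = set(), set()
    for cm_index in logical_map:
        if cm_index == NO_CHUNK_MAP:
            continue
        used_maps.add(cm_index)
        for unit in chunk_maps[cm_index]:
            if unit != EMPTY:
                used_units.add(unit)
    free_chunk_maps = [i for i in range(len(chunk_maps)) if i not in used_maps]
    free_units = [i for i in range(total_units) if i not in used_units]
    return free_chunk_maps, free_units
```

Any chunk map written by an in-progress operation that never reached the logical map is simply never visited by this walk, so it lands back on the free list.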
This ensures that if the system crashes in the middle of a write operation - i.e. during or after a chunk map is updated, but before its index is written to the logical map - everything related to that in-progress write will be ignored after the compressed volume is restarted.
Implementations must take care to handle overlapping operations on the same chunk. For example, operation 1 writes some data to chunk A, and while this is in progress, operation 2 also writes some data to chunk A. In this case, operation 2 should not start until operation 1 has completed. Further optimizations are outside the scope of this document.
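One simple way to serialize overlapping operations is a per-chunk queue where only the head operation runs, and completing it starts the next. This is an illustrative sketch of that idea, not libreduce's implementation:

```python
# Per-chunk serialization sketch (hypothetical names). An operation
# submitted against a busy chunk is queued; completing the in-progress
# operation starts the next queued one.

from collections import defaultdict, deque

class ChunkSerializer:
    def __init__(self):
        self.queues = defaultdict(deque)

    def submit(self, chunk_index, op):
        q = self.queues[chunk_index]
        q.append(op)
        if len(q) == 1:     # no operation in progress on this chunk
            op()

    def complete(self, chunk_index):
        q = self.queues[chunk_index]
        q.popleft()         # head operation is done
        if q:
            q[0]()          # start the next queued operation
```

Operations on different chunks never queue behind each other, so only truly overlapping operations are serialized.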
Backing storage must be thin provisioned to realize any savings from compression. This algorithm will always use (and reuse) the backing IO units closest to offset 0 on the backing device. This ensures that even though the backing storage device may be sized to match the compressed volume, storage on the backing storage device will not actually be allocated until the backing IO units are actually needed.
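The lowest-first allocation policy can be illustrated with a min-heap free list (a sketch only; libreduce may represent its free list differently):

```python
# Keeping the free backing IO unit list as a min-heap means allocation
# and reuse always pick the smallest available index, so a
# thin-provisioned backing device only ever touches the low end of its
# LBA range. Names here are illustrative.

import heapq

free_units = list(range(16))        # 16 backing IO units, all free
heapq.heapify(free_units)

def alloc_unit():
    return heapq.heappop(free_units)

def free_unit(index):
    heapq.heappush(free_units, index)

a, b = alloc_unit(), alloc_unit()   # lowest indices first: 0, then 1
free_unit(a)
assert alloc_unit() == 0            # a freed low index is reused first
```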