This programmer's guide is intended for developers authoring applications that use the SPDK Blobstore. It supplements the source code by providing an overall understanding of how to integrate Blobstore into an application, along with some high level insight into how Blobstore works behind the scenes. It is not intended to serve as a design document or an API reference, although in some cases source code snippets and high level sequences are discussed; for the latest source code reference, refer to the repo.
Blobstore is a persistent, power-fail safe block allocator designed to be used as the local storage system backing a higher level storage service, typically in lieu of a traditional filesystem. These higher level services can be local databases or key/value stores (MySQL, RocksDB), dedicated appliances (SAN, NAS), or distributed storage systems (e.g. Ceph, Cassandra). It is not designed to be a general purpose filesystem, however, and it is intentionally not POSIX compliant. To avoid confusion, we avoid references to files or objects, instead using the term 'blob'. The Blobstore is designed to allow asynchronous, uncached, parallel reads and writes to groups of blocks on a block device; these groups are called 'blobs'. Blobs are typically large, measured in at least hundreds of kilobytes, and are always a multiple of the underlying block size.
The Blobstore is designed primarily to run on "next generation" media, which means the device supports fast random reads and writes, with no required background garbage collection. However, in practice the design will run well on NAND too.
The Blobstore defines a hierarchy of storage abstractions as follows.
For all Blobstore operations regarding atomicity, there is a dependency on the underlying device to guarantee atomic operations of at least one page size. Atomicity here can refer to multiple operations:
Blobstore is callback driven; in the event that any Blobstore API is unable to make forward progress it will not block but instead return control at that point and make a call to the callback function provided in the API, along with arguments, when the original call is completed. The callback will be made on the same thread that the call was made from (threads are discussed later). Some APIs, however, offer no callback arguments; in these cases the calls are fully synchronous. Examples of asynchronous calls that use callbacks include those that involve disk IO, where some amount of polling is required before the IO is completed.
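The callback pattern described above can be sketched in plain C without SPDK. This is a minimal model under stated assumptions, not the Blobstore API: `toy_op_complete`, `toy_read_async`, `toy_poll`, and `read_done` are hypothetical names invented for illustration. It shows a call that returns before the operation finishes, and a callback that later receives the caller's argument and an error number.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical completion callback: caller's argument plus an error number. */
typedef void (*toy_op_complete)(void *cb_arg, int bserrno);

/* One pending operation; this toy model allows a single outstanding call. */
struct toy_pending {
    toy_op_complete cb_fn;
    void *cb_arg;
    int in_flight;
};

static struct toy_pending g_pending;

/* Submit: record the callback and return immediately without blocking. */
static int toy_read_async(toy_op_complete cb_fn, void *cb_arg)
{
    assert(cb_fn != NULL);
    if (g_pending.in_flight) {
        return -1; /* already busy in this simplified model */
    }
    g_pending.cb_fn = cb_fn;
    g_pending.cb_arg = cb_arg;
    g_pending.in_flight = 1;
    return 0;
}

/* Poll: stands in for the device finishing the IO. Returns completions handled. */
static int toy_poll(void)
{
    if (!g_pending.in_flight) {
        return 0;
    }
    g_pending.in_flight = 0;
    g_pending.cb_fn(g_pending.cb_arg, 0); /* 0 means success */
    return 1;
}

/* Application-side context and callback. */
struct read_ctx {
    int done;
    int status;
};

static void read_done(void *cb_arg, int bserrno)
{
    struct read_ctx *ctx = cb_arg;

    ctx->status = bserrno; /* non-zero would indicate an error */
    ctx->done = 1;
}
```

In this model the application would call `toy_read_async(read_done, &ctx)`, keep polling with `toy_poll()`, and only inspect `ctx.status` once `ctx.done` is set.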
Blobstore requires a backing storage device that can be integrated using the `bdev` layer, or by directly integrating a device driver to Blobstore. The blobstore performs operations on a backing block device by calling function pointers supplied to it at initialization time. For convenience, an implementation of these function pointers that routes I/O to the `bdev` layer is available in `bdev_blob.c`. Alternatively, for example, the SPDK NVMe driver may be directly integrated, bypassing a small amount of `bdev` layer overhead. These options will be discussed further in the upcoming section on examples.
Because Blobstore is designed to be lock-free, metadata operations need to be isolated to a single thread to avoid taking locks on in-memory data structures that maintain data on the layout of definitions of blobs (along with other data). In Blobstore this is implemented as the metadata thread, which is defined to be the thread on which the application makes metadata related calls. It is up to the application to set up a separate thread to make these calls on and to ensure that it does not mix relevant IO operations with metadata operations even if they are on separate threads. This will be discussed further in the Design Considerations section.
An application using Blobstore with the SPDK NVMe driver, for example, can support a variety of thread scenarios. The simplest would be a single threaded application where the application, the Blobstore code and the NVMe driver share a single core. In this case, the single thread would be used to submit both metadata operations and IO operations, and it would be up to the application to ensure that only one metadata operation is issued at a time and not intermingled with affected IO operations.
Channels are an SPDK-wide abstraction and with Blobstore the best way to think about them is that they are required in order to do IO. The application will perform IO to the channel and channels are best thought of as being associated 1:1 with a thread.
With external snapshots (see External Snapshots and Esnap Clones), a read from a blob may lead to reading from the device containing the blobstore or an external snapshot device. To support this, each blobstore IO channel maintains a tree of channels to be used when reading from external snapshot devices.
When an application creates a blob, it does not provide a name, as is the case with many other similar storage systems. Instead, the Blobstore returns a unique identifier that the application needs to use on subsequent APIs to perform operations on the Blobstore.
When the Blobstore is initialized, there are multiple configuration options to consider. The options and their defaults are:
Blobstore is only capable of doing page sized read/write operations. If the application requires finer granularity it will have to accommodate that itself.
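As a rough illustration of what accommodating finer granularity can involve, the sketch below maps an arbitrary byte range onto the whole pages that cover it. The 4096-byte page size and the names `TOY_PAGE_SIZE`, `toy_page_range`, and `toy_pages_for_bytes` are assumptions made for this example only; they are not part of Blobstore.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u /* assumed page size, for illustration only */

struct toy_page_range {
    uint64_t first_page; /* index of the first page touched */
    uint64_t num_pages;  /* number of whole pages covering the range */
};

/* Map the byte range [offset, offset + length) onto whole pages. */
static struct toy_page_range toy_pages_for_bytes(uint64_t offset, uint64_t length)
{
    struct toy_page_range r;
    uint64_t last_byte;

    assert(length > 0);
    last_byte = offset + length - 1; /* overflow is not handled in this sketch */
    r.first_page = offset / TOY_PAGE_SIZE;
    r.num_pages = last_byte / TOY_PAGE_SIZE - r.first_page + 1;
    return r;
}
```

An application that wants to update two bytes straddling a page boundary would, under these assumptions, read both covering pages, modify the bytes in its own buffer, and write both pages back.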
As mentioned earlier, Blobstore can share a single thread with an application or the application can define any number of threads, within resource constraints, that makes sense. The basic considerations that must be followed are:
As with all SPDK based applications, Blobstore requires memory used for data buffers to be allocated with SPDK API.
Asynchronous Blobstore callbacks all include an error number that should be checked; non-zero values indicate an error. Synchronous calls will typically return an error value if applicable.
Asynchronous calls will return control not immediately, but at the point in execution where no more forward progress can be made without blocking. Therefore, no assumptions can be made about the progress of an asynchronous call until the callback has completed.
Setting and removing xattrs in Blobstore is a metadata operation; xattrs are stored in per blob metadata. Therefore, xattrs are not persisted until a blob synchronization call is made and completed. Having a separate step for persisting per blob metadata allows applications to perform batches of xattr updates, for example, with only one more expensive call to synchronize and persist the values.
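The separation between updating and persisting can be modelled with two copies of a value, one in memory and one standing in for the on-disk state. This is a toy sketch, not Blobstore code: `toy_xattr_blob`, `toy_set_xattr`, and `toy_sync_md` are hypothetical names, and a single fixed-size value stands in for the full set of xattrs.

```c
#include <assert.h>
#include <string.h>

#define TOY_VALUE_MAX 32

/* Toy blob holding one xattr value in two copies. */
struct toy_xattr_blob {
    char active[TOY_VALUE_MAX];    /* current in-memory value */
    char persisted[TOY_VALUE_MAX]; /* value as of the last sync */
    int sync_count;                /* number of persist operations performed */
};

/* Setting an xattr only touches the in-memory copy. */
static int toy_set_xattr(struct toy_xattr_blob *blob, const char *value)
{
    size_t len = strlen(value);

    if (len >= TOY_VALUE_MAX) {
        return -1; /* value does not fit in this toy model */
    }
    memcpy(blob->active, value, len + 1);
    return 0;
}

/* Sync persists whatever is currently in memory, in a single step. */
static void toy_sync_md(struct toy_xattr_blob *blob)
{
    assert(blob != NULL);
    memcpy(blob->persisted, blob->active, TOY_VALUE_MAX);
    blob->sync_count++;
}
```

Several calls to `toy_set_xattr` followed by one `toy_sync_md` leave only the final value persisted, at the cost of a single synchronization.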
As described earlier, there are two types of metadata in Blobstore, per blob and one global metadata for the Blobstore itself. Only the per blob metadata can be explicitly synchronized via API. The global metadata will be inconsistent during run-time and only synchronized on proper shutdown. The implication, however, of an improper shutdown is only a performance penalty on the next startup as the global metadata will need to be rebuilt based on a parsing of the per blob metadata. For consistent start times, it is important to always close down the Blobstore properly via API.
Multiple examples of how to iterate through the blobs are included in the sample code and tools. It is worth noting, however, that when walking through the existing blobs via the iter API, if your application finds the blob it is looking for it will either need to explicitly close it (because it was opened internally by the Blobstore) or complete walking the full list.
The super blob is simply a single blob ID that can be stored as part of the global metadata to act as sort of a "root" blob. The application may choose to use this blob to store any information that it needs or finds relevant in understanding any kind of structure for what is on the Blobstore.
There are multiple examples of Blobstore usage in the repo:
`hello_blob.c` is a very basic example of a single threaded application that does nothing more than demonstrate the very basic API. Although Blobstore is optimized for NVMe, this example uses a RAM disk (malloc) back-end so that it can be executed easily in any development environment. The malloc back-end is a `bdev` module, thus this example uses not only the SPDK Framework but the `bdev` layer as well.
The `blobcli.c` example is a command line utility intended to serve not only as example code but also as a test and development tool for Blobstore itself. It is also a simple single threaded application that relies on both the SPDK Framework and the `bdev` layer but offers multiple modes of operation to accomplish some real-world tasks. In command mode, it accepts single-shot commands, which can be a little time consuming if there are many commands to get through as each one will take a few seconds waiting for DPDK initialization. It therefore has a shell mode that allows the developer to get to a `blob>` prompt and then very quickly interact with Blobstore with simple commands that include the ability to import/export blobs from/to regular files. Lastly there is a scripting mode to automate a series of tasks, again handy for development and/or test type activities.
Blobstore configuration options are described in the initialization options section under Design Considerations.
The information in this section is not necessarily relevant to designing an application for use with Blobstore, but understanding a little more about the internals may be interesting and is also included here for those wanting to contribute to the Blobstore effort itself.
The Blobstore owns the entire storage device. The device is divided into clusters starting from the beginning, such that cluster 0 begins at the first logical block.
Cluster 0 is special and has the following format, where page 0 is the first page of the cluster:
The super block is a single page located at the beginning of the partition. It contains basic information about the Blobstore. The metadata region is the remainder of cluster 0 and may extend to additional clusters. Refer to the latest source code for complete structural details of the super block and metadata region.
Each blob is allocated a non-contiguous set of pages inside the metadata region for its metadata. These pages form a linked list. The first page in the list will be written in place on update, while all other pages will be written to fresh locations. This requires the backing device to support an atomic write size greater than or equal to the page size to guarantee that the operation is atomic. See the section on atomicity for details.
Each blob is an ordered list of clusters, where the starting LBA of a cluster is called an extent. A blob can be thin provisioned, resulting in no extent for some of the clusters. When the first write operation occurs to an unallocated cluster, a new extent is chosen. This information is stored in RAM and on-disk.
There are two extent representations on-disk, dependent on the `use_extent_table` (default: true) option used when creating a blob.
Each in-use cluster is allocated to blobstore metadata or to a particular blob. Once a cluster is allocated to a blob it is considered owned by that blob and that particular blob's metadata maintains a reference to the cluster as a record of ownership. Cluster ownership is transferred during snapshot operations described later in Snapshots and Clones.
Through the use of thin provisioning, snapshots, and/or clones, a blob may be backed by clusters it owns, clusters owned by another blob, or by a zeroes device. The behavior of reads and writes depends on whether the operation targets blocks that are backed by a cluster owned by the blob or not.
External snapshots allow some external data source to act as a snapshot. This allows clones to be created of data that resides outside of the blobstore containing the clone.
As mentioned in Blob cluster layout, a blob may be thin provisioned. A thin provisioned blob starts out with no allocated clusters. Clusters are allocated as writes occur. A thin provisioned blob's back device is a zeroes device. A read from a zeroes device fills the read buffer with zeroes.
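The behavior described above can be sketched with a tiny in-memory model. Everything here is an assumption made for illustration (the names `toy_thin_blob`, `toy_read_cluster`, `toy_write_cluster`, the 8-byte cluster size, and whole-cluster IO); real Blobstore IO is page granular and asynchronous.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_CLUSTER_SIZE 8 /* tiny cluster size, for illustration only */
#define TOY_NUM_CLUSTERS 4

struct toy_thin_blob {
    unsigned char *clusters[TOY_NUM_CLUSTERS]; /* NULL means not allocated */
    int allocated;                             /* number of allocated clusters */
};

/* Read one cluster; unallocated clusters read as zeroes (the zeroes device). */
static void toy_read_cluster(const struct toy_thin_blob *blob, int idx,
                             unsigned char *buf)
{
    assert(idx >= 0 && idx < TOY_NUM_CLUSTERS);
    if (blob->clusters[idx] == NULL) {
        memset(buf, 0, TOY_CLUSTER_SIZE);
    } else {
        memcpy(buf, blob->clusters[idx], TOY_CLUSTER_SIZE);
    }
}

/* Write one whole cluster, allocating it on first write. Returns 0 or -1. */
static int toy_write_cluster(struct toy_thin_blob *blob, int idx,
                             const unsigned char *buf)
{
    assert(idx >= 0 && idx < TOY_NUM_CLUSTERS);
    if (blob->clusters[idx] == NULL) {
        blob->clusters[idx] = calloc(1, TOY_CLUSTER_SIZE);
        if (blob->clusters[idx] == NULL) {
            return -1; /* allocation failed */
        }
        blob->allocated++;
    }
    memcpy(blob->clusters[idx], buf, TOY_CLUSTER_SIZE);
    return 0;
}
```

A freshly zero-initialized `toy_thin_blob` reads back zeroes everywhere and consumes no cluster storage until the first write to each cluster.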
When a thin provisioned volume writes to a block that does not have an allocated cluster, the following steps are performed:
A snapshot is a read-only blob that may have clones. A snapshot may itself be a clone of one other blob. While the interface gives the illusion of being able to create many snapshots of a blob, under the covers this results in a chain of snapshots that are clones of the previous snapshot.
When blob1 is snapshotted, a new read-only blob is created and blob1 becomes a clone of this new blob. That is:
Step | Action | State |
---|---|---|
1 | Create blob1 | blob1 (rw) |
2 | Create snapshot blob2 of blob1 | blob1 (rw) --> blob2 (ro) |
2a | Write to blob1 | blob1 (rw) --> blob2 (ro) |
3 | Create snapshot blob3 of blob1 | blob1 (rw) --> blob3 (ro) --> blob2 (ro) |
Supposing blob1 was not thin provisioned, step 1 would have allocated clusters needed to perform a full write of blob1. As blob2 is created in step 2, the ownership of all of blob1's clusters is transferred to blob2 and blob2 becomes blob1's back device. During step 2a, the writes to blob1 cause one or more clusters to be allocated to blob1. When blob3 is created in step 3, the clusters allocated in step 2a are given to blob3, blob3's back device becomes blob2, and blob1's back device becomes blob3.
It is important to understand the chain above when considering strategies to use a golden image from which many clones are made. The IO path is more efficient when one snapshot is cloned many times than when a new snapshot is created for every clone. The following illustrates the difference.
Using a single snapshot means the data originally referenced by the golden image is always one hop away.
Using a snapshot per clone means that the chain of back devices grows with every new snapshot and clone pair. Reading a block from clone3 may result in a read from clone3's back device (snap3), from clone2's back device (snap2), then finally clone1's back device (snap1, the current owner of the blocks originally allocated to golden).
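To make the difference in chain length concrete, the sketch below counts how many back devices must be consulted before the owner of a cluster is found. It is a simplified model under stated assumptions: `toy_chain_blob` and `toy_read_hops` are hypothetical names, and ownership is tracked for a single cluster only.

```c
#include <assert.h>
#include <stddef.h>

/* Toy blob: either owns the cluster of interest or defers to its back device. */
struct toy_chain_blob {
    const char *name;
    int owns_cluster;                  /* nonzero if this blob owns the cluster */
    const struct toy_chain_blob *back; /* back device; NULL models the zeroes device */
};

/*
 * Walk the chain of back devices until the cluster's owner is found.
 * Returns the number of hops taken and stores the owner in *owner
 * (NULL if the walk ended at the zeroes device).
 */
static int toy_read_hops(const struct toy_chain_blob *blob,
                         const struct toy_chain_blob **owner)
{
    int hops = 0;

    assert(owner != NULL);
    while (blob != NULL && !blob->owns_cluster) {
        blob = blob->back;
        hops++;
    }
    *owner = blob;
    return hops;
}
```

With one shared snapshot, every clone that has not written the cluster reaches the owner in one hop. With a snapshot per clone arranged as clone3, snap3, snap2, snap1, the same read from clone3 takes three hops.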
A snapshot with no more than one clone can be deleted. When a snapshot with one clone is deleted, the clone becomes a regular blob. The clusters owned by the snapshot are transferred to the clone or freed, depending on whether the clone already owns a cluster for a particular block range.
Removal of the last clone leaves the snapshot in place. This snapshot continues to be read-only and can serve as the snapshot for future clones.
A clone can remove its dependence on a snapshot with the following operations:
A blobstore that is loaded with the `esnap_bs_dev_create` callback defined will support external snapshots (esnaps). An external snapshot is not useful on its own: it needs to be cloned by a blob. A clone of an external snapshot is referred to as an esnap clone. An esnap clone supports IO and other operations just like any other clone.
An esnap clone can be recognized in various ways:
On disk, the `SPDK_BLOB_EXTERNAL_SNAPSHOT` (0x8) bit is set in `invalid_flags` and an internal XATTR with name `BLOB_EXTERNAL_SNAPSHOT_ID` ("EXTSNAP") exists.
In memory, the `spdk_blob` structure contains the metadata read from disk, `blob->parent_id` is set to `SPDK_BLOBID_EXTERNAL_SNAPSHOT`, and `blob->back_bs_dev` references a blobstore device which is neither a blob in the same blobstore nor a zeroes device.
A read only blob can be copied over a blob store device in a way that only clusters allocated to the blob will be written on the device. This device must have a size equal to or greater than the blob's size, and the blob store's block size must be an integer multiple of the device's block size. This functionality can be used to recreate the entire snapshot stack of a blob into a different blob store.
We can change the parent of a thin provisioned blob, making the blob a clone of a snapshot of the same blobstore or a clone of an external snapshot. The previous parent of the blob can be a snapshot, an external snapshot or none.
If the new parent of the blob is a snapshot of the same blobstore, blob and snapshot must have the same number of clusters.
If the new parent of the blob is an external snapshot, the size of the esnap must be an integer multiple of blob's cluster size.
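The two size constraints above amount to simple checks, sketched here. The function names `toy_can_set_snapshot_parent` and `toy_can_set_esnap_parent` are hypothetical, and the sketch covers only the size rules stated in the text, not the thin provisioning requirement or any other validation Blobstore may perform.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* New parent is a snapshot in the same blobstore: cluster counts must match. */
static bool toy_can_set_snapshot_parent(uint64_t blob_clusters,
                                        uint64_t snapshot_clusters)
{
    return blob_clusters == snapshot_clusters;
}

/* New parent is an external snapshot: its size must be a whole number of clusters. */
static bool toy_can_set_esnap_parent(uint64_t esnap_size_bytes,
                                     uint64_t cluster_size_bytes)
{
    assert(cluster_size_bytes > 0);
    return esnap_size_bytes % cluster_size_bytes == 0;
}
```

For example, with a 1 MiB cluster size, an 8 MiB external snapshot passes the second check while an 8.5 MiB one does not.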
A copy-on-write operation is somewhat expensive, with the cost being proportional to the cluster size. Typical copy-on-write involves the following steps:
If the source cluster is backed by a zeroes device, steps 2 through 4 are skipped. Alternatively, if the blobstore resides on a device that can perform the copy on its own, steps 2 through 4 are offloaded to the device. Neither of these optimizations is available when the back device is an external snapshot.
Internally Blobstore uses the concepts of sequences and batches to submit IO to the underlying device in either a serial fashion or in parallel, respectively. Both are defined using the following structure:
These request sets are basically bookkeeping mechanisms to help Blobstore efficiently deal with related groups of IO. They are an internal construct only and are pre-allocated on a per channel basis (channels were discussed earlier). They are removed from a channel-associated linked list when the set (sequence or batch) is started and then returned to the list when completed.
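The pre-allocate, take, and return cycle can be illustrated with a simple free list. This is a sketch of the general technique only; `toy_req_set`, `toy_channel`, and the pool size are hypothetical and do not reflect the actual Blobstore structures.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_REQ 4 /* hypothetical number of pre-allocated request sets */

struct toy_req_set {
    struct toy_req_set *next; /* link used while the set sits on the free list */
    int id;
};

struct toy_channel {
    struct toy_req_set pool[TOY_MAX_REQ]; /* storage allocated once per channel */
    struct toy_req_set *free_list;        /* sets currently available */
};

/* Pre-allocate: put every set in the pool onto the channel's free list. */
static void toy_channel_init(struct toy_channel *ch)
{
    ch->free_list = NULL;
    for (int i = TOY_MAX_REQ - 1; i >= 0; i--) {
        ch->pool[i].id = i;
        ch->pool[i].next = ch->free_list;
        ch->free_list = &ch->pool[i];
    }
}

/* Start a sequence or batch: remove a set from the list (NULL if none left). */
static struct toy_req_set *toy_req_start(struct toy_channel *ch)
{
    struct toy_req_set *set = ch->free_list;

    if (set != NULL) {
        ch->free_list = set->next;
        set->next = NULL;
    }
    return set;
}

/* Complete a sequence or batch: return the set to the list. */
static void toy_req_finish(struct toy_channel *ch, struct toy_req_set *set)
{
    assert(set != NULL);
    set->next = ch->free_list;
    ch->free_list = set;
}
```

Because each channel is used from one thread, a list like this needs no locking, and exhausting it is a natural back-pressure signal for the caller.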
Each request set maintains a reference to a `channel` and a `back_channel`. The `channel` is used for performing IO on the blobstore device. The `back_channel` is used for performing IO on the blob's back device, `blob->back_bs_dev`. For blobs that are not esnap clones, `channel` and `back_channel` reference an IO channel used with the device that contains the blobstore. For blobs that are esnap clones, `channel` is the same as with any other blob and `back_channel` is an IO channel for the external snapshot device.
`blobstore.h` contains many of the key structures for the internal workings of Blobstore. Only a few notable ones are reviewed here. Note that `blobstore.h` is an internal header file; the header file for Blobstore that defines the public API is `blob.h`.
This is an in-memory data structure that contains key elements like the blob identifier, its current state and two copies of the mutable metadata for the blob; one copy is the current metadata and the other is the last copy written to disk.
This is a per blob structure, included in the `struct spdk_blob` struct, that actually defines the blob itself. It has the specific information on size and makeup of the blob (i.e. how many clusters are allocated for this blob and which ones).
This is the main in-memory structure for the entire Blobstore. It defines the global on disk metadata region and maintains information relevant to the entire system - initialization options such as cluster size, etc.
The super block is an on-disk structure that contains all of the relevant information that's in the in-memory Blobstore structure just discussed along with other elements one would expect to see here such as signature, version, checksum, etc.
In general, `Blobstore.c` is laid out with groups of related functions blocked together with descriptive comments. For example,
And for the most part the following conventions are followed throughout:
`cpl` are related to set or callback completions