Welcome to bcachefs’s documentation!
Introduction
Bcachefs is a modern, general purpose, copy on write filesystem descended from bcache, a block layer cache.
The internal architecture is very different from most existing filesystems where the inode is central and many data structures hang off of the inode. Instead, bcachefs is architected more like a filesystem on top of a relational database, with tables for the different filesystem data types - extents, inodes, dirents, xattrs, et cetera.
bcachefs supports almost all of the same features as other modern COW filesystems, such as ZFS and btrfs, but in general with a cleaner, simpler, higher performance design.
Performance
The core of the architecture is a very high performance and very low latency b+ tree, which is not a conventional b+ tree but more of a hybrid, taking concepts from compacting data structures: btree nodes are very large, log structured, and compacted (resorted) as necessary in memory. This means our b+ trees are very shallow compared to other filesystems.
What this means for the end user is that since we require very few seeks or disk reads, filesystem latency is extremely good - especially cache cold filesystem latency, which does not show up in most benchmarks but has a huge impact on real world performance, as well as how fast the system “feels” in normal interactive usage. Latency has been a major focus throughout the codebase - notably, we have assertions that we never hold b+ tree locks while doing IO, and the btree transaction layer makes it easy to aggressively drop and retake locks as needed - one major goal of bcachefs is to be the first general purpose soft realtime filesystem.
Additionally, unlike other COW btrees, btree updates are journalled. This greatly improves our write efficiency on random update workloads, as it means btree writes are only done when we have a large block of updates, or when required by memory reclaim or journal reclaim.
Bucket based allocation
As mentioned bcachefs is descended from bcache, where the ability to efficiently invalidate cached data and reuse disk space was a core design requirement. To make this possible the allocator divides the disk up into buckets, typically 512k to 2M but possibly larger or smaller. Buckets and data pointers have generation numbers: we can reuse a bucket with cached data in it without finding and deleting all the data pointers by incrementing the generation number.
In keeping with the copy-on-write theme of avoiding update in place wherever possible, we never rewrite or overwrite data within a bucket - when we allocate a bucket, we write to it sequentially and then we don’t write to it again until the bucket has been invalidated and the generation number incremented.
This means we require a copying garbage collector to deal with internal fragmentation, when patterns of random writes leave us with many buckets that are partially empty (because the data they contained was overwritten) - copy GC evacuates buckets that are mostly empty by writing the data they contain to new buckets. This also means that we need to reserve space on the device for the copy GC reserve when formatting - typically 8% or 12%.
There are some advantages to structuring the allocator this way, besides being able to support cached data:
By maintaining multiple write points that are writing to different buckets, we’re able to easily and naturally segregate unrelated IO from different processes, which helps greatly with fragmentation.
The fast path of the allocator is essentially a simple bump allocator - disk space allocation is extremely fast.
Fragmentation is generally a non issue unless copygc has to kick in, and it usually doesn’t under typical usage patterns. The allocator and copygc are doing essentially the same things as the flash translation layer in SSDs, but within the filesystem we have much greater visibility into where writes are coming from and how to segregate them, as well as which data is actually live - performance is generally more predictable than with SSDs under similar usage patterns.
The same algorithms will in the future be used for managing SMR hard drives directly, avoiding the translation layer in the hard drive - doing this work within the filesystem should give much better performance and much more predictable latency.
IO path options
Most options that control the IO path can be set at either the filesystem level or on individual inodes (files and directories). When set on a directory via the bcachefs setattr command, they will be automatically applied recursively.
Checksumming
bcachefs supports both metadata and data checksumming - crc32c by default, but stronger checksums are available as well. Enabling data checksumming incurs some performance overhead - besides the checksum calculation, writes have to be bounced for checksum stability (Linux generally cannot guarantee that the buffer being written is not modified in flight), but reads generally do not have to be bounced.
Checksum granularity in bcachefs is at the level of individual extents,
which results in smaller metadata but means we have to read entire
extents in order to verify the checksum. By default, checksummed and
compressed extents are capped at 64k. For most applications and usage
scenarios this is an ideal trade off, but small random O_DIRECT
reads will incur significant overhead. In the future, checksum
granularity will be a per-inode option.
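For example, to select a stronger checksum at format time (a sketch: this assumes data_checksum and metadata_checksum are accepted as format flags, like the other options listed later in this document):
bcachefs format --metadata_checksum=crc64 --data_checksum=crc64 /dev/sda1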
Encryption
bcachefs supports authenticated (AEAD style) encryption - ChaCha20/Poly1305. When encryption is enabled, the poly1305 MAC replaces the normal data and metadata checksums. This style of encryption is superior to typical block layer or filesystem level encryption (usually AES-XTS), which only operates on blocks and doesn’t have a way to store nonces or MACs. In contrast, we store a nonce and cryptographic MAC alongside data pointers - meaning we have a chain of trust up to the superblock (or journal, in the case of unclean shutdowns) and can definitely tell if metadata has been modified, dropped, or replaced with an earlier version - replay attacks are not possible.
Encryption can only be specified for the entire filesystem, not per file or directory - this is because metadata blocks do not belong to a particular file. All metadata except for the superblock is encrypted.
In the future we’ll probably add AES-GCM for platforms that have hardware acceleration for AES, but in the meantime software implementations of ChaCha20 are also quite fast on most platforms.
scrypt
is used for the key derivation function - for converting the
user supplied passphrase to an encryption key.
To format a filesystem with encryption, use
bcachefs format --encrypted /dev/sda1
You will be prompted for a passphrase. Then, to use an encrypted filesystem use the command
bcachefs unlock /dev/sda1
You will be prompted for the passphrase and the encryption key will be added to your in-kernel keyring; mount, fsck and other commands will then work as usual.
The passphrase on an existing encrypted filesystem can be changed with
the bcachefs set-passphrase
command. To permanently unlock an
encrypted filesystem, use the bcachefs remove-passphrase
command -
this can be useful when dumping filesystem metadata for debugging by the
developers.
There is a wide_macs
option which controls the size of the
cryptographic MACs stored on disk. By default, only 80 bits are stored,
which should be sufficient security for most applications. With the
wide_macs
option enabled we store the full 128 bit MAC, at the cost
of making extents 8 bytes bigger.
Compression
bcachefs supports gzip, lz4 and zstd compression. As with data checksumming, we compress entire extents, not individual disk blocks - this gives us better compression ratios than other filesystems, at the cost of reduced small random read performance.
Data can also be compressed or recompressed with a different algorithm
in the background by the rebalance thread, if the
background_compression
option is set.
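For example, to compress foreground writes with lz4 and have the rebalance thread recompress data with zstd in the background (a sketch: this assumes background_compression is accepted as a format flag, like --compression in the formatting example later in this document):
bcachefs format --compression=lz4 --background_compression=zstd /dev/sda1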
Multiple devices
bcachefs is a multi-device filesystem. Devices need not be the same size: by default, the allocator will stripe across all available devices but biasing in favor of the devices with more free space, so that all devices in the filesystem fill up at the same rate. Devices need not have the same performance characteristics: we track device IO latency and direct reads to the device that is currently fastest.
Replication
bcachefs supports standard RAID1/10 style redundancy with the
data_replicas
and metadata_replicas
options. Layout is not fixed
as with RAID10: a given extent can be replicated across any set of
devices; the bcachefs fs usage
command shows how data is replicated
within a filesystem.
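For example, a simple two-device mirror, followed by a check of how data ended up replicated (device and mount point names are illustrative):
bcachefs format --replicas=2 /dev/sda /dev/sdb
mount -t bcachefs /dev/sda:/dev/sdb /mnt
bcachefs fs usage /mnt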
Erasure coding
bcachefs also supports Reed-Solomon erasure coding - the same algorithm used by most RAID5/6 implementations. When enabled with the ec option, the desired redundancy is taken from the data_replicas option - erasure coding of metadata is not supported.
Erasure coding works significantly differently from both conventional RAID implementations and other filesystems with similar features. In conventional RAID, the “write hole” is a significant problem - doing a small write within a stripe requires the P and Q (recovery) blocks to be updated as well, and since those writes cannot be done atomically there is a window where the P and Q blocks are inconsistent - meaning that if the system crashes and recovers with a drive missing, reconstruct reads for unrelated data within that stripe will be corrupted.
ZFS avoids this by fragmenting individual writes so that every write becomes a new stripe - this works, but the fragmentation has a negative effect on performance: metadata becomes bigger, and both read and write requests are excessively fragmented. Btrfs’s erasure coding implementation is more conventional, and still subject to the write hole problem.
bcachefs’s erasure coding takes advantage of our copy on write nature - since updating stripes in place is a problem, we simply don’t do that. And since excessively small stripes is a problem for fragmentation, we don’t erasure code individual extents, we erasure code entire buckets - taking advantage of bucket based allocation and copying garbage collection.
When erasure coding is enabled, writes are initially replicated, but one of the replicas is allocated from a bucket that is queued up to be part of a new stripe. When we finish filling up the new stripe, we write out the P and Q buckets and then drop the extra replicas for all the data within that stripe - the effect is similar to full data journalling, and it means that after erasure coding is done the layout of our data on disk is ideal.
Since disks have write caches that are only flushed when we issue a cache flush command - which we only do on journal commit - if we can tweak the allocator so that the buckets used for the extra replicas are reused (and then overwritten again) immediately, this full data journalling should have negligible overhead - this optimization is not implemented yet, however.
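A sketch of enabling erasure coding at format time - the flag name here is an assumption, taken from the erasure_code entry in the option list below; the desired redundancy still comes from the replicas options:
bcachefs format --replicas=2 --erasure_code /dev/sda /dev/sdb /dev/sdc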
Device labels and targets
By default, writes are striped across all devices in a filesystem, but they may be directed to a specific device or set of devices with the various target options. The allocator only prefers to allocate from devices matching the specified target; if those devices are full, it will fall back to allocating from any device in the filesystem.
Target options may refer to a device directly, e.g.
foreground_target=/dev/sda1
, or they may refer to a device label. A
device label is a path delimited by periods - e.g. ssd.ssd1 (and labels
need not be unique). This gives us ways of referring to multiple devices
in target options: If we specify ssd in a target option, that will refer
to all devices with the label ssd or labels that start with ssd. (e.g.
ssd.ssd1, ssd.ssd2).
Four target options exist. These options all may be set at the filesystem level (at format time, at mount time, or at runtime via sysfs), or on a particular file or directory:
foreground_target: normal foreground data writes, and metadata if metadata_target is not set
metadata_target: btree writes
background_target: if set, user data (not metadata) will be moved to this target in the background
promote_target: if set, a cached copy will be added to this target on read, if none exists
Caching
When an extent has multiple copies on different devices, some of those copies may be marked as cached. Buckets containing only cached data are discarded as needed by the allocator in LRU order.
With the background_target option, the original copy is left in place but marked as cached. With the promote_target option, the original copy is left unchanged and the new copy on the promote_target device is marked as cached.
To do writeback caching, set foreground_target and promote_target to the cache device, and background_target to the backing device. To do writearound caching, set foreground_target to the backing device and promote_target to the cache device.
Durability
Some devices may be considered to be more reliable than others. For example, we might have a filesystem composed of a hardware RAID array and several NVME flash devices, to be used as cache. We can set replicas=2 so that losing any of the NVME flash devices will not cause us to lose data, and then additionally we can set durability=2 for the hardware RAID device to tell bcachefs that we don’t need extra replicas for data on that device - data on that device will count as two replicas, not just one.
The durability option can also be used for writethrough caching: by setting durability=0 for a device, it can be used as a cache and only as a cache - bcachefs won’t consider copies on that device to count towards the number of replicas we’re supposed to keep.
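A sketch of the example above, assuming durability is accepted as a per-device format option that applies to the device following it, like --label in the formatting example later in this document:
bcachefs format --replicas=2      \
    --durability=2 /dev/md0       \
    --durability=1 /dev/nvme0n1   \
    --durability=1 /dev/nvme1n1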
Reflink
bcachefs supports reflink, similarly to other filesystems with the same feature. cp --reflink will create a copy that shares the underlying storage. Reading from that file will become slightly slower - the extent pointing to that data is moved to the reflink btree (with a refcount added) and in the extents btree we leave a key that points to the indirect extent in the reflink btree, meaning that we now have to do two btree lookups to read from that data instead of just one.
Inline data extents
bcachefs supports inline data extents, controlled by the inline_data
option (on by default). When the end of a file is being written and is
smaller than half of the filesystem blocksize, it will be written as an
inline data extent. Inline data extents can also be reflinked (moved to
the reflink btree with a refcount added): as a todo item we also intend
to support compressed inline data extents.
Subvolumes and snapshots
bcachefs supports subvolumes and snapshots with a similar userspace interface as btrfs. A new subvolume may be created empty, or it may be created as a snapshot of another subvolume. Snapshots are writeable and may be snapshotted again, creating a tree of snapshots.
Snapshots are very cheap to create: they’re not based on cloning of COW btrees as with btrfs, but instead are based on versioning of individual keys in the btrees. Many thousands or millions of snapshots can be created, with the only limitation being disk space.
The following subcommands exist for managing subvolumes and snapshots:
bcachefs subvolume create: Create a new, empty subvolume
bcachefs subvolume destroy: Delete an existing subvolume or snapshot
bcachefs subvolume snapshot: Create a snapshot of an existing subvolume
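For example (a sketch; the snapshot subcommand is assumed to take the source subvolume followed by the destination path):
bcachefs subvolume create /mnt/myvol
bcachefs subvolume snapshot /mnt/myvol /mnt/myvol.snap
bcachefs subvolume destroy /mnt/myvol.snap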
A subvolume can also be deleted with a normal rmdir after deleting all of its contents, as with rm -rf. Still to be implemented: read-only snapshots, recursive snapshot creation, and a method for recursively listing subvolumes.
Quotas
bcachefs supports conventional user/group/project quotas. Quotas do not currently apply to snapshot subvolumes, because if a file changes ownership in the snapshot it would be ambiguous as to what quota data within that file should be charged to.
When a directory has a project ID set it is inherited automatically by descendants on creation and rename. When renaming a directory would cause the project ID to change we return -EXDEV, so that the move is done file by file and the project ID is propagated correctly to descendants - thus, project quotas can be used as subdirectory quotas.
Formatting
To format a new bcachefs filesystem use the subcommand
bcachefs format
, or mkfs.bcachefs
. All persistent
filesystem-wide options can be specified at format time. For an example
of a multi device filesystem with compression, encryption, replication
and writeback caching:
bcachefs format --compression=lz4   \
    --encrypted                     \
    --replicas=2                    \
    --label=ssd.ssd1 /dev/sda       \
    --label=ssd.ssd2 /dev/sdb       \
    --label=hdd.hdd1 /dev/sdc       \
    --label=hdd.hdd2 /dev/sdd       \
    --label=hdd.hdd3 /dev/sde       \
    --label=hdd.hdd4 /dev/sdf       \
    --foreground_target=ssd         \
    --promote_target=ssd            \
    --background_target=hdd
Mounting
To mount a multi device filesystem, there are two options. You can specify all component devices, separated by colons, e.g.
mount -t bcachefs /dev/sda:/dev/sdb:/dev/sdc /mnt
Or, use the mount.bcachefs tool to mount by filesystem UUID. Still todo: improve the mount.bcachefs tool to support mounting by filesystem label.
No special handling is needed for recovering from unclean shutdown. Journal replay happens automatically, and diagnostic messages in the dmesg log will indicate whether recovery was from clean or unclean shutdown.
The -o degraded option will allow a filesystem to be mounted without all the devices, but will fail if data would be missing. The -o very_degraded option can be used to attempt mounting when data would be missing.
Also relevant is the -o nochanges option. It disallows any and all writes to the underlying devices, pinning dirty data in memory as needed if, for example, journal replay has to be done - think of it as a “super read-only” mode. It can be used for data recovery and for testing version upgrades.
The -o verbose option enables additional log output during the mount process.
Checking Filesystem Integrity
It is possible to run fsck either in userspace with the bcachefs fsck subcommand (also available as fsck.bcachefs), or in the kernel while mounting by specifying the -o fsck mount option. In either case the exact same fsck implementation is run; only the environment differs. Running fsck in the kernel at mount time has the advantage of somewhat better performance, while running in userspace can be stopped with ctrl-c and can prompt the user before fixing errors. To fix errors while running fsck in the kernel, use the -o fix_errors option.
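For example, to run fsck at mount time and fix errors without prompting (device and mount point names are illustrative):
mount -t bcachefs -o fsck,fix_errors /dev/sda1 /mnt
or, to run the same checks from userspace:
bcachefs fsck /dev/sda1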
The -n
option passed to fsck implies the -o nochanges
option;
bcachefs fsck -ny
can be used to test filesystem repair in dry-run
mode.
Status of data
The bcachefs fs usage subcommand may be used to display filesystem usage broken out in various ways. Data usage is broken out by type - superblock, journal, btree, data, cached data, and parity - and by which sets of devices extents are replicated across. We also give per-device usage, which includes fragmentation due to partially used buckets.
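For example (a sketch; the -h flag for human-readable sizes is an assumption):
bcachefs fs usage -h /mnt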
Journal
The journal has a number of tunables that affect filesystem performance.
Journal commits are fairly expensive operations as they require issuing
FLUSH and FUA operations to the underlying devices. By default, we issue
a journal flush one second after a filesystem update has been done; this
is controlled with the journal_flush_delay
option, which takes a
parameter in milliseconds.
Filesystem sync and fsync operations issue journal flushes; this can be
disabled with the journal_flush_disabled
option - the
journal_flush_delay
option will still apply, and in the event of a
system crash we will never lose more than (by default) one second of
work. This option may be useful on a personal workstation or laptop, and
perhaps less appropriate on a server.
The journal reclaim thread runs in the background, kicking off btree
node writes and btree key cache flushes to free up space in the journal.
Even in the absence of space pressure it will run slowly in the
background: this is controlled by the journal_reclaim_delay
parameter, with a default of 100 milliseconds.
The journal should be sized so that bursts of activity do not fill it up too quickly; a larger journal also means that we can queue up larger btree writes. The bcachefs device resize-journal subcommand can be used to resize the journal on a particular device - it can be used on a mounted or unmounted filesystem.
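For example, to raise the flush delay to two seconds on a mounted filesystem and grow the journal on one device (a sketch; the size argument syntax for resize-journal is an assumption):
echo 2000 > /sys/fs/bcachefs/<uuid>/options/journal_flush_delay
bcachefs device resize-journal /dev/sda1 2G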
In the future, we should implement a method to see how much space is currently utilized in the journal.
Device management
Filesystem resize
A filesystem can be resized on a particular device with the
bcachefs device resize
subcommand. Currently only growing is
supported, not shrinking.
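For example, after enlarging the underlying partition (a sketch; whether the new size may be omitted to use the whole device is an assumption):
bcachefs device resize /dev/sda1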
Device add/removal
The following subcommands exist for adding and removing devices from a mounted filesystem:
bcachefs device add: Formats and adds a new device to an existing filesystem.
bcachefs device remove: Permanently removes a device from an existing filesystem.
bcachefs device online: Connects a device to a running filesystem that was mounted without it (i.e. in degraded mode).
bcachefs device offline: Disconnects a device from a mounted filesystem without removing it.
bcachefs device evacuate: Migrates data off of a particular device to prepare for removal, setting it read-only if necessary.
bcachefs device set-state: Changes the state of a member device: one of rw (readwrite), ro (readonly), failed, or spare.
A failed device is considered to have 0 durability, and replicas on that device won't be counted towards the number of replicas an extent should have by rereplicate - however, bcachefs will still attempt to read from devices marked as failed.
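For example, adding a new device to a mounted filesystem, then migrating data off it before removing it (a sketch; the argument order shown here is an assumption):
bcachefs device add /mnt /dev/sdg
bcachefs device evacuate /dev/sdg
bcachefs device remove /dev/sdg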
The bcachefs device remove
, bcachefs device offline
and
bcachefs device set-state
commands take force options for when they
would leave the filesystem degraded or with data missing. Todo:
regularize and improve those options.
Data management
Data rereplicate
The bcachefs data rereplicate
command may be used to scan for
extents that have insufficient replicas and write additional replicas,
e.g. after a device has been removed from a filesystem or after
replication has been enabled or increased.
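For example, after raising data_replicas at runtime (a sketch; the sysfs path is described in the Options section below):
echo 2 > /sys/fs/bcachefs/<uuid>/options/data_replicas
bcachefs data rereplicate /mnt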
Rebalance
To be implemented: a command for moving data between devices to equalize usage on each device. This is not normally required, because the allocator attempts to equalize usage across devices as it stripes, but it can be necessary in certain scenarios - e.g. when a third device is added to a nearly full two-device filesystem with replication enabled.
Scrub
To be implemented: a command for reading all data within a filesystem and ensuring that checksums are valid, fixing bitrot when a valid copy can be found.
Options
Most bcachefs options can be set filesystem wide, and a significant
subset can also be set on inodes (files and directories), overriding the
global defaults. Filesystem wide options may be set when formatting,
when mounting, or at runtime via /sys/fs/bcachefs/<uuid>/options/
.
When set at runtime via sysfs the persistent options in the superblock
are updated as well; when options are passed as mount parameters the
persistent options are unmodified.
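For example, changing the compression option on a mounted filesystem (a sketch; writing the option value by name is an assumption):
echo zstd > /sys/fs/bcachefs/<uuid>/options/compression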
File and directory options
Options on individual files and directories are set with the bcachefs setattr command, or via the extended attribute interface described below.
Options set on inodes (files and directories) are automatically inherited by their descendants, and inodes also record whether a given option was explicitly set or inherited from their parent. When renaming a directory would cause inherited attributes to change we fail the rename with -EXDEV, causing userspace to do the rename file by file so that inherited attributes stay consistent.
Inode options are available as extended attributes. The options that
have been explicitly set are available under the bcachefs
namespace,
and the effective options (explicitly set and inherited options) are
available under the bcachefs_effective
namespace. Examples of
listing options with the getfattr command:
$ getfattr -d -m '^bcachefs\.' filename
$ getfattr -d -m '^bcachefs_effective\.' filename
Options may be set via the extended attribute interface, but it is
preferable to use the bcachefs setattr
command as it will correctly
propagate options recursively.
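For example, setting compression recursively on a directory with bcachefs setattr, or on a single file via the extended attribute interface (a sketch; the exact flag and value syntax are assumptions):
bcachefs setattr --compression=zstd /mnt/photos
setfattr -n bcachefs.compression -v zstd /mnt/photos/img.raw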
Full option list
block_size (format): Filesystem block size (default 4k)
btree_node_size (format)
errors (format,mount,runtime)
metadata_replicas (format,mount,runtime)
data_replicas (format,mount,runtime,inode)
replicas (format)
metadata_checksum (format,mount,runtime)
data_checksum (format,mount,runtime,inode)
compression (format,mount,runtime,inode)
background_compression (format,mount,runtime,inode)
str_hash (format,mount,runtime,inode)
metadata_target (format,mount,runtime,inode)
foreground_target (format,mount,runtime,inode)
background_target (format,mount,runtime,inode)
promote_target (format,mount,runtime,inode)
erasure_code (format,mount,runtime,inode)
inodes_32bit (format,mount,runtime)
shard_inode_numbers (format,mount,runtime)
wide_macs (format,mount,runtime)
inline_data (format,mount,runtime)
journal_flush_delay (format,mount,runtime)
journal_flush_disabled (format,mount,runtime): Disables journal flush on sync/fsync. journal_flush_delay remains in effect, thus with the default setting not more than 1 second of work will be lost.
journal_reclaim_delay (format,mount,runtime)
acl (format,mount)
usrquota (format,mount)
grpquota (format,mount)
prjquota (format,mount)
degraded (mount)
very_degraded (mount)
verbose (mount)
fsck (mount)
fix_errors (mount)
ratelimit_errors (mount)
read_only (mount)
nochanges (mount)
norecovery (mount)
noexcl (mount)
version_upgrade (mount)
discard (device)
Error actions
The errors
option is used for inconsistencies that indicate some
sort of a bug. Valid error actions are:
continue
Log the error but continue normal operation
ro
Emergency read only, immediately halting any changes to the filesystem on disk
panic
Immediately halt the entire machine, printing a backtrace on the system console
Checksum types
Valid checksum types are:
none
crc32c
(default)
crc64
Compression types
Valid compression types are:
none
(default)
lz4
gzip
zstd
String hash types
Valid hash types for string hash tables are:
crc32c
crc64
siphash
(default)
Debugging tools
Sysfs interface
Mounted filesystems are available in sysfs at
/sys/fs/bcachefs/<uuid>/
with various options, performance counters
and internal debugging aids.
Options
The filesystem options described above are exposed in /sys/fs/bcachefs/<uuid>/options/, and settings changed via sysfs will be persistently changed in the superblock as well.
Time stats
bcachefs tracks the latency and frequency of various operations and
events, with quantiles for latency/duration in the
/sys/fs/bcachefs/<uuid>/time_stats/
directory.
blocked_allocate
blocked_allocate_open_bucket
blocked_journal
btree_gc
btree_lock_contended_read
btree_lock_contended_intent
btree_lock_contended_write
btree_node_mem_alloc
btree_node_split
btree_node_compact
btree_node_merge
btree_node_sort
btree_node_read
btree_interior_update_foreground
btree_interior_update_total
data_read
data_write
data_promote
A cached copy is written to the promote_target; this is done asynchronously from the original read.
journal_flush_write
journal_noflush_write
journal_flush_seq
Internals
btree_cache
dirty_btree_nodes
For each dirty btree node, prints:
Whether the need_write flag is set
The level of the btree node
The number of sectors written
Whether writing this node is blocked, waiting for other nodes to be written
Whether it is waiting on a btree_update to complete and make it reachable on-disk
btree_key_cache
btree_transactions
btree_updates
journal_debug
journal_pins
Lists items pinning journal entries, preventing them
from being reclaimed.
new_stripes
stripes_heap
open_buckets
io_timers_read
io_timers_write
trigger_journal_flush
trigger_gc
prune_cache
read_realloc_races
extent_migrate_done
extent_migrate_raced
Unit and performance tests
Echoing into /sys/fs/bcachefs/<uuid>/perf_test
runs various low
level btree tests, some intended as unit tests and others as performance
tests. The syntax is
echo <test_name> <nr_iterations> <nr_threads> > perf_test
When complete, the elapsed time will be printed in the dmesg log. The
full list of tests that can be run can be found near the bottom of
fs/bcachefs/tests.c
.
Debugfs interface
The contents of every btree, as well as various internal per-btree-node
information, are available under /sys/kernel/debug/bcachefs/<uuid>/
.
For every btree, we have the following files:
-formats
-bfloat-failed
Listing and dumping filesystem metadata
bcachefs show-super
This subcommand is used for examining and printing bcachefs superblocks. It takes two optional parameters:
-l
: Print superblock layout, which records the amount of space
reserved for the superblock and the locations of the backup
superblocks.
-f, --fields=(fields)
: List of superblock sections to print,
all
to print all sections.
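For example, to print a superblock together with its layout:
bcachefs show-super -l /dev/sda1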
bcachefs list
This subcommand gives access to the same functionality as the debugfs interface, listing btree nodes and contents, but for offline filesystems.
bcachefs list_journal
This subcommand lists the contents of the journal, which primarily records btree updates ordered by when they occurred.
bcachefs dump
This subcommand can dump all metadata in a filesystem (including multi
device filesystems) as qcow2 images: when encountering issues that
fsck
can not recover from and need attention from the developers,
this makes it possible to send the developers only the required
metadata. Encrypted filesystems must first be unlocked with
bcachefs remove-passphrase
.
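For example (a sketch; the -o flag naming the output image is an assumption):
bcachefs dump -o /tmp/metadata.qcow2 /dev/sda1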
ioctl interface
This section documents bcachefs-specific ioctls:
BCH_IOCTL_QUERY_UUID
BCH_IOCTL_FS_USAGE
Returns filesystem usage, broken out by bch_replicas entry.
BCH_IOCTL_DEV_USAGE
BCH_IOCTL_READ_SUPER
BCH_IOCTL_DISK_ADD
BCH_IOCTL_DISK_REMOVE
BCH_IOCTL_DISK_ONLINE
BCH_IOCTL_DISK_OFFLINE
BCH_IOCTL_DISK_SET_STATE
BCH_IOCTL_DISK_GET_IDX
BCH_IOCTL_DISK_RESIZE
BCH_IOCTL_DISK_RESIZE_JOURNAL
BCH_IOCTL_DATA
BCH_IOCTL_SUBVOLUME_CREATE
BCH_IOCTL_SUBVOLUME_DESTROY
BCHFS_IOC_REINHERIT_ATTRS
On disk format
Superblock
The superblock is the first thing to be read when accessing a bcachefs filesystem. It is located 4kb from the start of the device, with redundant copies elsewhere - typically one immediately after the first superblock, and one at the end of the device.
The bch_sb_layout
records the amount of space reserved for the
superblock as well as the locations of all the superblocks. It is
included with every superblock, and additionally written 3584 bytes from
the start of the device (512 bytes before the first superblock).
Most of the superblock is identical across each device. The exceptions
are the dev_idx
field, and the journal section which gives the
location of the journal.
The main section of the superblock contains UUIDs, version numbers, number of devices within the filesystem and device index, block size, filesystem creation time, and various options and settings. The superblock also has a number of variable length sections:
BCH_SB_FIELD_journal
BCH_SB_FIELD_members
BCH_SB_FIELD_crypt
BCH_SB_FIELD_replicas
BCH_SB_FIELD_quota
BCH_SB_FIELD_disk_groups
BCH_SB_FIELD_clean
When the filesystem is clean, this section contains a list of journal entries (struct jset): btree roots, as well as filesystem usage and read/write counters (total amount of data read/written to this filesystem). This allows reading the journal to be skipped after clean shutdowns.
Journal
Every journal write (struct jset
) contains a list of entries:
struct jset_entry
. Below are listed the various journal entry types.
BCH_JSET_ENTRY_btree_key
Contains a btree update, as a single key (struct bkey); the btree_id and level fields of jset_entry record the btree ID and level the key belongs to.
BCH_JSET_ENTRY_btree_root
Contains a btree root, as a key of type KEY_TYPE_btree_ptr_v2; the btree_id and level fields of jset_entry record the btree ID and depth.
BCH_JSET_ENTRY_clock
BCH_JSET_ENTRY_usage
BCH_JSET_ENTRY_data_usage
BCH_JSET_ENTRY_dev_usage
Btrees
Btree keys
KEY_TYPE_deleted
KEY_TYPE_whiteout
KEY_TYPE_error
KEY_TYPE_cookie
KEY_TYPE_hash_whiteout
KEY_TYPE_btree_ptr
KEY_TYPE_extent
KEY_TYPE_reservation
KEY_TYPE_inode
KEY_TYPE_inode_generation
KEY_TYPE_dirent
KEY_TYPE_xattr
KEY_TYPE_alloc
KEY_TYPE_quota
KEY_TYPE_stripe
KEY_TYPE_reflink_p
KEY_TYPE_reflink_v
KEY_TYPE_inline_data
KEY_TYPE_btree_ptr_v2
KEY_TYPE_indirect_inline_data
KEY_TYPE_alloc_v2
KEY_TYPE_subvolume
KEY_TYPE_snapshot
KEY_TYPE_inode_v2
KEY_TYPE_alloc_v3