v6.19-rc1: Filesystem, Storage and Block layer updates
Linus has just announced the v6.19-rc1 kernel release on a usual Sunday afternoon.
This brings a number of features across various subsystems. This article tries
to cover some of the filesystem, storage and block level updates:
EXT4 continues to march towards conversion to iomap, a modern filesystem framework that provides generic, filesystem-centric block mapping abstractions for handling I/O efficiently.
In v6.19, EXT4 got some good optimizations for its online defragmentation code by moving away from PAGE_SIZE-based data movement and extent swapping in move_extent_per_page() to the folio-based mext_move_extent(). EXT4 added large folio support in the v6.16 kernel, so this new interface can iterate over and copy data based on the extents of the original file instead of PAGE_SIZE units, thereby utilizing large folios. This brings ext4 a step closer to iomap conversion. There are also some good performance improvements, as mentioned by the author in the patch series.
EXT4 also gained support for large block sizes (bs > ps, i.e. block size larger than page size) in v6.19.
Since large folios are already supported for regular files, the required
changes are not substantial, but they are scattered across the code. The
changes primarily focus on cleaning up potential division-by-zero errors,
resolving negative left/right shifts, and correctly handling mutually
exclusive mount options.
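The kind of arithmetic that breaks once the block size exceeds the page size can be illustrated with a small user-space model. The sketch below is Python, not ext4 code; the 4 KiB page size and all function names are assumptions made for illustration. In C a negative shift count is undefined behaviour, while Python raises ValueError, which makes the problem easy to see.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages
PAGE_SIZE = 1 << PAGE_SHIFT


def blocks_per_page_naive(blkbits: int) -> int:
    """Arithmetic that assumes bs <= ps: truncates to 0 when a block is larger than a page."""
    return PAGE_SIZE >> blkbits


def first_block_naive(page_index: int, blkbits: int) -> int:
    """Shifts by (PAGE_SHIFT - blkbits), which is negative when bs > ps."""
    return page_index << (PAGE_SHIFT - blkbits)


def first_block(page_index: int, blkbits: int) -> int:
    """Go through the byte offset instead, which works for any block size."""
    return (page_index << PAGE_SHIFT) >> blkbits
```

With 64 KiB blocks (blkbits = 16), blocks_per_page_naive(16) evaluates to 0, so any later division by it would fault, and first_block_naive() raises on the negative shift. first_block(16, 16) gives block 1, as expected.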
Btrfs updates
- Added experimental shutdown ioctl support, similar to other filesystems
- A suspend trigger will now cancel scrub and device replace. The scrub state is saved and scrub will resume from where it left off. Device replace, however, will have to be restarted from the beginning
- Zone stats are exported in sysfs; from the perspective of the filesystem this includes active, reclaimable, relocation, etc. zones
- Improvements when processing space reservation tickets by optimizing locking and shrinking critical sections; cumulative improvements in lockstat numbers show +15%
- Added more filesystem operations for blocksize > pagesize support
- Prep work for fscrypt support
- Other bug fixes and improvements
FUSE added iomap support for its buffered read and readahead operations:
FUSE adopted iomap for buffered write operations in v6.17. In this release, it extends that support by adopting iomap for its buffered read and readahead operations too.
Background: v6.17 buffered write support
In v6.17, FUSE transitioned its buffered write operations to use iomap,
unlocking two key advantages:
- Granular large folio synchronous reads: For example, with a 1 MB large folio,
a write issued from position 1 to position 1 MB - 2 only requires reading and
marking the head and tail pages as uptodate, rather than reading the entire
folio. Non-relevant trailing pages are also skipped.
- Granular large folio dirty tracking: During writeback, only the dirty portions
of a large folio need to be written instead of the entire folio. For instance,
if only 2 bytes in a 1 MB large folio are dirty, only the page containing those
bytes gets written out.
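The two advantages above come down to simple range arithmetic. The following is a minimal Python sketch of that arithmetic, not the iomap implementation: the function names are hypothetical, and the tracking granularity is passed in as block_size (4096 is assumed in the example to match the "page" wording above).

```python
def blocks_needing_read(block_size: int, start: int, end: int) -> list[int]:
    """Blocks that a write to bytes [start, end) covers only partially.

    Only these must be read before the write; fully overwritten blocks and
    blocks the write does not touch can be skipped.
    """
    if not 0 <= start < end:
        raise ValueError("expected 0 <= start < end")
    need = []
    if start % block_size:
        need.append(start // block_size)
    last = (end - 1) // block_size
    if end % block_size and last not in need:
        need.append(last)
    return need


def dirty_blocks(block_size: int, start: int, end: int) -> list[int]:
    """Blocks touched by a write to bytes [start, end); only these need writeback."""
    return list(range(start // block_size, (end - 1) // block_size + 1))
```

For the example in the text, a write covering positions 1 through 1 MB - 2 is the half-open range [1, 2**20 - 1), and blocks_needing_read(4096, 1, 2**20 - 1) returns only the head and tail blocks, 0 and 255. A 2-byte write inside one block dirties just that block.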
v6.19 updates
FUSE now extends iomap support to buffered reads and readahead operations. To enable this, struct iomap_read_ops and struct iomap_read_folio_ctx were added:
- struct iomap_read_ops: Allows callers to provide custom ->read_folio_range and ->submit_read callbacks
- struct iomap_read_folio_ctx: Context structure for managing read operations with caller-specific behavior
FUSE now implements fuse_iomap_read_folio_range to handle buffered reads through iomap, enabling more granular non-uptodate reads when large folios are enabled. Instead of reading the entire folio, only the non-uptodate portions need to be read from the server, which improves performance.
As the pull request explains:
It also is needed in order to turn on large folios for servers that use the
writeback cache since otherwise there is a race condition that may lead to data corruption if there is a partial write, then a read and the read happens before the write has undergone writeback, since otherwise the folio will not be marked uptodate from the partial write so the read will read in the entire folio from disk, which will overwrite the partial write.
For those who are curious, FUSE has not yet turned on large folio support.
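To make the read side concrete: if uptodate state is tracked per block within a large folio, a read only has to fetch the ranges that are not yet uptodate, and blocks already filled by a partial write are left alone rather than overwritten. The Python sketch below models that idea in user space. It is not the FUSE or iomap implementation; the list-of-booleans bitmap and the function name are assumptions.

```python
def non_uptodate_ranges(uptodate: list[bool], block_size: int) -> list[tuple[int, int]]:
    """Return (offset, length) byte ranges of a folio that still need to be read.

    uptodate holds one flag per block; adjacent stale blocks are merged into
    a single range so that each range maps to one read request.
    """
    ranges = []
    run_start = None
    for i, ok in enumerate(uptodate):
        if not ok and run_start is None:
            run_start = i  # a stale run begins here
        elif ok and run_start is not None:
            ranges.append((run_start * block_size, (i - run_start) * block_size))
            run_start = None
    if run_start is not None:  # stale run reaches the end of the folio
        ranges.append((run_start * block_size, (len(uptodate) - run_start) * block_size))
    return ranges
```

A fully uptodate folio yields an empty list, so no request is sent at all; a folio with no uptodate blocks yields a single range covering the whole folio.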
iomap also added folio batch support for iomap_zero_range() to handle dirty
DIO Write Completions from Interrupt Context
iomap improved direct I/O latencies for high-performance workloads like ScyllaDB. Instead of running DIO completion processing inline, iomap used to offload it to a workqueue. This added a lot of context switching which, at least on older kernels with workqueue scheduling issues, caused really high tail latencies. The series queued for v6.19 ensures that write completions can run inline (e.g. for pure overwrites) instead of being deferred to workqueues.
Directory locking: NeilBrown continues to work towards adding support for
concurrent directory updates. The ultimate goal is to lock the target dentry(s) rather than the whole parent directory. To help with changing the locking protocol, this series centralizes locking and lookup in new helper functions.
Allow filesystems to increase the minimum writeback chunk size: To ensure fairness when flushing dirty pages, the writeback logic switches between inodes after writing a minimum chunk of data. The default minimum writeback chunk size is 4 MiB. In v6.19, a new superblock field, s_min_writeback_pages, allows filesystems to override this default and increase the minimum writeback chunk size per inode. This can be beneficial for:
- Rotational media: Reduces seeks by writing larger contiguous chunks before switching between files
- Zoned storage devices (SMR HDDs and ZNS SSDs): Prevents file fragmentation and avoids unnecessary garbage collection (GC) runs
XFS sets s_min_writeback_pages to the zone size for zoned filesystems, ensuring that writeback writes at least one complete zone’s worth of data per
inode before switching, thereby eliminating spurious fragmentation and the
associated garbage collection overhead.
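The effect of the chunk size on data layout can be seen with a toy model. The Python sketch below is a strong simplification of writeback, not kernel code: it assumes 4 KiB pages, strict round-robin between inodes, and sequential placement on the device, and the function names are hypothetical.

```python
PAGE_SIZE = 4096                   # assumption: 4 KiB pages
DEFAULT_MIN_CHUNK_BYTES = 4 << 20  # the 4 MiB default mentioned above


def bytes_to_pages(nbytes: int) -> int:
    return nbytes // PAGE_SIZE


def writeback_order(dirty_pages: dict[str, int], chunk_pages: int) -> list[tuple[str, int]]:
    """Round-robin over inodes, writing at most chunk_pages per turn.

    Returns the (inode, pages) runs in the order they would reach the device.
    """
    if chunk_pages <= 0:
        raise ValueError("chunk_pages must be positive")
    remaining = dict(dirty_pages)
    order = []
    while any(remaining.values()):
        for inode in remaining:
            n = min(chunk_pages, remaining[inode])
            if n:
                order.append((inode, n))
                remaining[inode] -= n
    return order
```

With two files that each dirty one zone's worth of pages (taking 256 MiB as a hypothetical zone size), the default chunk of 1024 pages produces 128 interleaved runs, while a zone-sized chunk produces two contiguous runs, one per file.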
A new folio_next_pos() helper function was added that returns the file position of the first byte after the current folio. This is a common operation in filesystems that need to know where the current folio ends. The helper is lifted from btrfs, which already had its own version, and is now used across multiple filesystems and subsystems.
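What folio_next_pos() computes can be modeled in a few lines. The sketch below is a user-space Python model based only on the description above (position of the folio plus its size); the Folio class, its fields, and the 4 KiB page size are assumptions, and the kernel helper may be implemented differently.

```python
from dataclasses import dataclass

PAGE_SHIFT = 12  # assumption: 4 KiB pages
PAGE_SIZE = 1 << PAGE_SHIFT


@dataclass(frozen=True)
class Folio:
    index: int  # page-cache index of the folio's first page
    order: int  # the folio spans 2**order pages


def folio_pos(folio: Folio) -> int:
    """File position of the first byte of the folio."""
    return folio.index << PAGE_SHIFT


def folio_size(folio: Folio) -> int:
    """Size of the folio in bytes."""
    return PAGE_SIZE << folio.order


def folio_next_pos(folio: Folio) -> int:
    """File position of the first byte after the folio."""
    return folio_pos(folio) + folio_size(folio)
```

For a 64 KiB folio starting at page index 16, folio_pos() is 65536 and folio_next_pos() is 131072, which is where the next folio in the file would start.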
Per-cpu bio cache enabled by default:
The block layer maintains a per-cpu bio cache, mainly for use cases like polling-mode io_uring, which can be latency-critical workloads. This cache allows bios to be recycled quickly instead of going through the slab allocator. This series now enables the per-cpu bio cache by default for all bio-based I/O submissions.
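The general idea of a per-cpu object cache in front of a slower allocator can be sketched as follows. This is a Python illustration of the caching pattern only, not the block layer's code: the class, its limits, and the dictionary standing in for a bio are all hypothetical.

```python
class PerCpuBioCache:
    """Toy per-cpu free list that recycles objects before falling back to an allocator."""

    def __init__(self, ncpus: int, max_cached: int):
        self.free = [[] for _ in range(ncpus)]  # one free list per CPU
        self.max_cached = max_cached
        self.slab_allocs = 0  # counts trips to the (slow) fallback allocator

    def alloc(self, cpu: int) -> dict:
        if self.free[cpu]:
            return self.free[cpu].pop()  # fast path: reuse a cached object
        self.slab_allocs += 1            # slow path: allocate a fresh one
        return {"id": self.slab_allocs}

    def put(self, cpu: int, bio: dict) -> None:
        # Keep the object for reuse unless this CPU's list is already full.
        if len(self.free[cpu]) < self.max_cached:
            self.free[cpu].append(bio)
```

An alloc/put/alloc sequence on the same CPU touches the fallback allocator only once; a different CPU has its own list and therefore allocates separately.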
Zoned device caching support: Added zone information caching to avoid unnecessary repeated zone report queries, improving performance for zoned
block devices (SMR/ZNS)
Polled I/O performance: Improved polled I/O handling speed by manually managing hardware context (hctx) lookups instead of using xarray
Block integrity improvements: Enhanced the auto-integrity code to be less
deadlock-prone and better handle Protection Information (PI) generation/validation
REQ_NOWAIT fixes: Corrected NOWAIT handling in loop/zloop drivers. The NOWAIT flag is now properly cleared when requests are punted to threads for handling. Also reverted loop DIO nowait support due to excessive stack usage issues
blk-throttle fixes: Improvements for SSD device throttling
ublk driver updates: Series of cleanups and simplifications to the user copy
code, laying groundwork for future batching support
MD/RAID updates
Scheduler switching improvements: Restructured the elevator I/O scheduler switching logic and fixed lockdep warnings.
Block tracing for zoned devices: Added support for zoned device information in block layer tracing
Various other fixes and improvements.
F2FS updates (as per cover letter)
This series focuses on minor clean-ups and performance optimizations across sysfs, documentation, debugfs, tracepoints, slab allocation, and GC. Furthermore, it resolves several corner-case bugs caught by xfstests, as well as issues related to 16KB page support and f2fs_enable_checkpoint.
Enhancement:
- wrap ASCII tables in literal blocks to fix LaTeX build
- optimize trace_f2fs_write_checkpoint with enums
- support to show curseg.next_blkoff in debugfs
- add a sysfs entry to show max open zones
- add fadvise tracepoint
- use global inline_xattr_slab instead of per-sb slab cache
- set default valid_thresh_ratio to 80 for zoned devices
- maintain one time GC mode is enabled during whole zoned GC cycle
- Other bug fixes and improvements
NTFS3 updates (as per cover letter)
- support timestamps prior to epoch
- do not overwrite uptodate pages
- disable readahead for compressed files
- setting of dummy blocksize to read boot_block when mounting
- the run_lock initialization when loading $Extend
- initialization of allocated memory before use
- support for the NTFS3_IOC_SHUTDOWN ioctl
- check for minimum alignment when performing direct I/O reads
- check for shutdown in fsync
- Other bug fixes and improvements


