6.17 Merge window - Memory Management & Filesystem Updates
Linux kernel v6.16 was released just last week, and the merge window for v6.17 is now open. This release brings numerous improvements across various subsystems. In this article, we’ll look in more detail at the memory management and filesystem updates. A follow-up article will cover changes in other areas such as architecture, storage, and drivers.
Memory Management Updates
Optimize memory block registration to reduce boot time:
This patch series optimizes the memory block registration needed during bootup. As per the patch series, this has reduced boot time for large-memory systems with discontiguous memory ranges by roughly 78% (from 1 min 16 sec to 17 sec).
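As a quick sanity check, the quoted percentage follows directly from the two boot times. The helper name below is purely illustrative.

```python
def boot_time_reduction(before_s: float, after_s: float) -> tuple[float, float]:
    """Return (percent reduction, speedup factor) for two durations in seconds."""
    reduction_pct = (before_s - after_s) / before_s * 100
    speedup = before_s / after_s
    return reduction_pct, speedup

before = 1 * 60 + 16   # 1 min 16 sec, as quoted from the patch series
after = 17             # 17 sec

pct, speedup = boot_time_reduction(before, after)
print(f"reduction: {pct:.1f}%, speedup: {speedup:.1f}x")
# prints "reduction: 77.6%, speedup: 4.5x"
```

In other words, the "around 78%" figure is a reduction in boot time, which corresponds to booting roughly 4.5 times faster.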
Optimize mprotect() & mremap() for large folios:
mremap() and mprotect() have been converted to use folio_pte_batch() and set_ptes(), which batch PTE updates across a large folio, instead of set_pte_at(), which is called once for each PTE of a folio.
Adds NUMA node notifier similar to memory notifier:
This lets consumers register for NUMA node state changes instead of being notified of a state change on every {online,offline}_pages() call. The consumers converted to the NUMA node notifier are memory-tier, slub, cpuset, hmat, cxl and autoweight-mempolicy.
Introduces a per-node proactive reclaim interface:
A new per-node proactive reclaim interface has been added to help systems with memory tiers. Use cases that do not use memcg can use this interface for proactive reclaim. The memcg interface is not NUMA-aware in any case, and some use cases focus on NUMA balancing rather than on a workload's memory footprint. Proactive reclaim on top-tier nodes triggers demotion, while reclaim on bottom-tier nodes triggers eviction to swap.
mm: introduce snapshot_page:
This provides (and uses) a means by which debug-style functions can grab a copy of a pageframe and inspect it locklessly without tripping over the races inherent in operating on the live pageframe directly.
/proc/pid/maps read is now converted to use per vma locks:
Instead of taking mmap_lock, which blocks on the entire address space, reads of /proc/pid/maps now use per-vma locks. This has shown around 2x latency improvements in a few microbenchmarks.
More preparation work for separating struct slab from struct page:
Convert struct slab to use its own flags instead of referencing page flags, which is another preparation step before separating it from struct page completely.
DAMON (Data Access MONitor) Improvements:
DAMON_STAT module for production use: A new kernel module that simplifies DAMON setup and usage in production environments, making memory access monitoring more accessible for real-world deployments.
Enhanced sysfs interface: The DAMON sysfs interface now supports periodic and automated statistics updates with tunable intervals, extending beyond the previous manual userspace-requested model and reducing monitoring overhead.
Improved NUMA migration support: DAMON's memory migration actions (DAMOS_MIGRATE_HOT/COLD) now support interleaving policies, allowing dynamic alteration of inter-node allocation strategies for better NUMA-aware memory management.
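The /proc/pid/maps locking change described earlier in this section is internal to the kernel, so existing readers should keep working unchanged. As a reminder of what such a reader looks like, here is a minimal sketch that parses maps lines. MapEntry, parse_maps_line() and read_maps() are illustrative names, not part of any kernel or library API, and the sample line is made up for the example.

```python
from typing import NamedTuple

class MapEntry(NamedTuple):
    start: int
    end: int
    perms: str
    offset: int
    dev: str
    inode: int
    path: str

def parse_maps_line(line: str) -> MapEntry:
    # Line format: start-end perms offset dev inode [pathname]
    fields = line.split(None, 5)
    start_s, end_s = fields[0].split("-")
    path = fields[5].strip() if len(fields) > 5 else ""
    return MapEntry(int(start_s, 16), int(end_s, 16), fields[1],
                    int(fields[2], 16), fields[3], int(fields[4]), path)

def read_maps(pid: str = "self") -> list[MapEntry]:
    # Each read walks the target's VMAs; on 6.17 this is the path that
    # the article says now takes per-vma locks instead of mmap_lock.
    with open(f"/proc/{pid}/maps") as f:
        return [parse_maps_line(line) for line in f if line.strip()]

# Hypothetical sample line, so the sketch also runs outside Linux.
sample = "7f1c2a400000-7f1c2a421000 r-xp 00000000 08:01 1234 /usr/lib/libfoo.so"
entry = parse_maps_line(sample)
print(entry.path, hex(entry.start), entry.end - entry.start)
```

Tools that poll this file frequently for many processes, such as monitoring agents, are the ones most likely to notice the reduced contention.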
Filesystem Updates
EXT4 block allocation scalability improvements:
This series brings significant scalability improvements to ext4 when running up to ~96 containers, all executing fallocate2 + will-it-scale workloads. The primary bottleneck identified with this workload was block group lock contention while scanning block groups for allocation. The core changes in this series include:
Introducing the ext4_try_lock_group() API to check whether a block group is busy, allowing the allocator to skip locked groups instead of waiting.
Converting free group order lists to XArrays, which enables linear group scanning across multiple order XArrays rather than non-adjacent block group scanning caused by the previous linked-list implementation. This approach significantly reduces contention on block group locks.
These changes, along with a few other optimizations in the series, improved workload scalability by approximately 1400%.
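The effect of skipping busy groups can be illustrated with a small userspace model. This is only a toy sketch in Python, not the ext4 code: BlockGroup and pick_group() are hypothetical names, and a threading.Lock stands in for the per-group lock.

```python
import threading

class BlockGroup:
    """Toy stand-in for an ext4 block group (hypothetical, for illustration)."""
    def __init__(self, index: int, free_blocks: int):
        self.index = index
        self.free_blocks = free_blocks
        self.lock = threading.Lock()

def pick_group(groups, needed):
    """Scan groups linearly, skipping any whose lock is busy instead of waiting.

    Returns the chosen group's index, or None if no unlocked group
    has enough free blocks.
    """
    for g in groups:
        # Non-blocking acquire: a busy group is skipped, not waited on.
        if not g.lock.acquire(blocking=False):
            continue
        try:
            if g.free_blocks >= needed:
                g.free_blocks -= needed
                return g.index
        finally:
            g.lock.release()
    return None

groups = [BlockGroup(i, free_blocks=100) for i in range(4)]
groups[0].lock.acquire()          # simulate another allocator holding group 0
chosen = pick_group(groups, needed=10)
groups[0].lock.release()
print(chosen)                     # group 0 was busy, so the scan moved on to group 1
```

With many allocators running in parallel, each one tends to settle on a different free group rather than queueing behind the same lock, which is the contention the series set out to remove.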
Uncached buffered I/O support added to EXT4:
EXT4 now supports the IOCB_DONTCACHE flag through refactored address_space_operations write_begin() and write_end() callbacks. Uncached buffered I/O allows applications to use buffered I/O while instructing the kernel not to retain pages in the page cache after the I/O operation completes (if these pages were newly instantiated for doing RWF_DONTCACHE I/O). This is particularly useful for applications that manage their own caching mechanisms or want to avoid polluting the page cache with large sequential writes that won't be reused.
EROFS metadata compression support:
The EROFS filesystem now supports compressing metadata blocks, which can be particularly useful for embedded use cases or when archiving large numbers of small files. Additionally, readdir performance has been improved by enabling readahead for directory blocks.
New file_getattr/file_setattr system calls:
Introduces extensible successors to the FS_IOC_FSGETXATTR and FS_IOC_FSSETXATTR ioctls. These new syscalls allow userspace to set filesystem inode attributes on special files (FIFO, SOCK, BLK, etc.), which was previously impossible. This is particularly useful for XFS quota projects where special files need to inherit project IDs.
FALLOC_FL_WRITE_ZEROES fallocate support:
A new fallocate() flag has been introduced that zeroes a specified file range in such a way that subsequent writes to that range do not require further changes to file mapping metadata, making them pure overwrites. This is particularly beneficial for flash-based storage devices that support efficient write zeroes commands (SCSI UNMAP bit or NVMe DEAC bit), as it avoids write amplification and improves performance. Users can now issue zeroes through the new fallocate() flag instead of using "dd" with large block sizes, which can consume an unnecessarily large amount of disk bandwidth.
PIDFS persistent information and extended attributes:
PIDFS now persists exit and coredump information independent of whether anyone holds a pidfd for the struct pid. The lifetime of this information is now bound to the struct pid itself rather than to the pidfs inode/dentry. Additionally, pidfs now supports extended attributes, allowing userspace to attach meta information to tasks. The series also introduces autonomous pidfs file handles that can function as a full replacement for storing PIDs in files.
NFSD write delegations and compound operation improvements:
NFSD can now offer write delegations to clients that open files with O_WRONLY, which should accelerate certain corner cases as per the patch series.
IOMAP writeback refactoring and FUSE support:
The iomap writeback code has been refactored to split the generic code from the ioend/bio based writeback code. FUSE now has iomap support for buffered writes and dirty folio writeback, enabling granular uptodate and dirty tracking with large folios. This means only the relevant portions need to be read instead of entire folios, and only dirty portions need to be written back. IOMAP has supported dirty block tracking since v6.6.
VFS mmap_prepare() conversion:
Major conversion of filesystems from the legacy f_op->mmap() hook to the new f_op->mmap_prepare() interface. This allows mmap logic to invoke the hook much earlier, prior to inserting a VMA into the virtual address space, enabling simpler error unwinding and preventing manipulation of incompletely initialized VMA state. Most filesystems including ext4, xfs, and many others have been converted.
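From userspace, the uncached buffered I/O support described above for ext4 is requested per call through the RWF_DONTCACHE flag. The sketch below uses Python's os.pwritev(), which accepts per-call flags on Linux. The numeric fallback value 0x80 for RWF_DONTCACHE is an assumption based on the kernel UAPI headers and should be checked against your own headers, and write_uncached() is an illustrative helper name. If the kernel, filesystem or platform rejects the flag, the sketch falls back to an ordinary buffered write.

```python
import os
import tempfile

# Assumption: RWF_DONTCACHE is 0x80 in the kernel UAPI headers. Prefer the
# constant from the os module if this Python build exposes it.
RWF_DONTCACHE = getattr(os, "RWF_DONTCACHE", 0x80)

def write_uncached(fd: int, data: bytes, offset: int = 0) -> tuple[int, bool]:
    """Write data at offset, asking the kernel not to keep the pages cached.

    Returns (bytes_written, used_dontcache).
    """
    try:
        return os.pwritev(fd, [data], offset, RWF_DONTCACHE), True
    except (OSError, AttributeError, NotImplementedError):
        # Flag unsupported by this kernel, filesystem or platform: plain write.
        return os.pwrite(fd, data, offset), False

with tempfile.TemporaryFile() as f:
    payload = b"x" * 8192
    written, uncached = write_uncached(f.fileno(), payload)
    readback = os.pread(f.fileno(), len(payload), 0)
    print(written, uncached, readback == payload)
```

Whether the second printed value is True depends on the running kernel and on the filesystem backing the temporary directory; the data written is the same either way, only the page cache behaviour differs.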
This concludes the memory management and filesystem changes that have been pulled into Linus' tree during the 6.17 merge window so far. The window will remain open until the 10th of August, so expect more changes to follow.