Linux multi-grain timestamp making its way to the 6.13 kernel

Nov 30, 2024

During the 6.6 merge window, the multi-grain timestamp feature was introduced but ultimately reverted after it caused regressions in tools like make and rsync. With its reappearance in the 6.13 merge window, it is worth examining what the feature entails, the problems it aims to solve, and the challenges it encountered during the earlier 6.6 merge window.

Inode Timestamps

Linux filesystems maintain a set of timestamps to track key events for each file:

Access Time (atime): Updated when the file is read.
Modification Time (mtime): Updated when the file’s contents are modified.
Change Time (ctime): Updated when the file’s metadata (such as permissions or ownership) changes.

These timestamps are stored in the file’s inode. One notable behavior is that any mtime update also implicitly triggers a ctime update. However, the resolution of these timestamps is coarse, typically at the granularity of a jiffy (a few milliseconds, depending on the kernel's HZ setting). While this level of precision suffices for most applications, it presents challenges for certain filesystems and use cases.
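To see these three timestamps from userspace, a program can call stat() and read the nanosecond fields of the result. The following is a minimal sketch, assuming Linux with glibc (where struct stat exposes st_atim, st_mtim and st_ctim); the helper names read_inode_times and print_inode_times are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

/*
 * Hypothetical helper: copy atime, mtime and ctime of `path` into
 * out[0], out[1] and out[2]. Returns 0 on success, -1 if stat() fails.
 */
int read_inode_times(const char *path, struct timespec out[3])
{
	struct stat st;

	if (stat(path, &st) != 0)
		return -1;

	out[0] = st.st_atim;	/* last access */
	out[1] = st.st_mtim;	/* last content modification */
	out[2] = st.st_ctim;	/* last inode (metadata) change */
	return 0;
}

/* Print the three timestamps as seconds.nanoseconds. */
void print_inode_times(const char *path)
{
	static const char *names[3] = { "atime", "mtime", "ctime" };
	struct timespec ts[3];

	if (read_inode_times(path, ts) != 0) {
		perror(path);
		return;
	}
	for (int i = 0; i < 3; i++)
		printf("%s: %lld.%09ld\n", names[i],
		       (long long)ts[i].tv_sec, ts[i].tv_nsec);
}
```

Calling print_inode_times() on a freshly written file is an easy way to observe how much of the nanosecond field a given filesystem actually fills in.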

Problems with existing coarse-grained timestamps

Network File System (NFS), especially versions like NFSv3, is one area where this limitation becomes problematic. When a server experiences frequent file updates within a single jiffy, the client cannot reliably determine whether its cached file contents have become stale. Modern NFS implementations aim to cache file contents more aggressively to improve performance, but this requires accurate mechanisms to invalidate stale data. Since NFS clients rely on mtime and ctime comparisons to detect changes on the server side, coarse-grained timestamps hinder their effectiveness. Similar issues affect backup applications like rsync, which also depend on precise timestamps to detect file modifications.
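The missed-update case can be sketched with a toy model. This is not NFS client code: the 4 ms tick, the cached_attrs struct and both function names are assumptions chosen for illustration, and timestamps are plain nanosecond counters.

```c
#include <stdbool.h>
#include <stdint.h>

#define TICK_NS 4000000LL	/* assumed 4 ms tick; real systems vary */

/* Attributes a client remembers from its last look at the server. */
struct cached_attrs {
	int64_t mtime_ns;
	int64_t ctime_ns;
};

/* Coarse timestamp: the start of the tick that contains now_ns. */
int64_t coarse_stamp_ns(int64_t now_ns)
{
	return now_ns - now_ns % TICK_NS;
}

/* The client treats its cache as stale only if a timestamp differs. */
bool cache_is_stale(const struct cached_attrs *cached,
		    const struct cached_attrs *server)
{
	return cached->mtime_ns != server->mtime_ns ||
	       cached->ctime_ns != server->ctime_ns;
}
```

In this model, a write at 8.1 ms and a second write at 9.5 ms both stamp the file with 8.0 ms. A client that cached the attributes between the two writes sees identical timestamps afterwards and keeps serving stale data; only a write in a later tick (say at 12.1 ms) makes the change visible.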

A natural question arises: why not simply switch to higher-resolution timestamps across the board?
The answer lies in filesystem performance. Updates to mtime and ctime involve changes to inode metadata. If every read or write operation triggered frequent, fine-grained timestamp updates, it would significantly increase the volume of metadata writes. These metadata updates are often journalled to ensure filesystem integrity, adding further overhead. This tradeoff explains the kernel's reliance on coarse-grained timestamps by default: they strike a balance between functionality and performance.

For example, the on-disk ext4 inode (struct ext4_inode) has dedicated fields for atime, mtime, and ctime, so every timestamp update dirties the inode and must eventually be written back to disk.

/*
 * Structure of an inode on the disk
 */
struct ext4_inode {
	__le16	i_mode;		/* File mode */
	__le16	i_uid;		/* Low 16 bits of Owner Uid */
	__le32	i_size_lo;	/* Size in bytes */
	__le32	i_atime;	/* Access time */
	__le32	i_ctime;	/* Inode Change time */
	__le32	i_mtime;	/* Modification time */
	/* ... remaining fields omitted ... */
};

Coarse-grained timestamps keep metadata updates infrequent, which is good for performance. However, as discussed above, NFS and certain applications need finer-grained timestamps. This is where multi-grain timestamps help.

Multi-grain Timestamp

The feature addresses this limitation by dynamically adjusting the resolution of timestamps. When an inode’s attributes are being actively observed via ->getattr(), the kernel uses a higher-resolution timestamp for mtime and ctime. For inodes that are not being actively monitored, coarse-grained timestamps remain in use. This adaptive approach provides finer granularity where needed while preserving the performance benefits of coarser timestamps for other cases.
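The idea can be modelled in a few lines of userspace C. This is a deliberately simplified sketch and not the kernel's implementation: toy_inode, its queried flag and both functions are hypothetical names, the 4 ms tick is an assumption, and the model ignores ordering between timestamps entirely.

```c
#include <stdbool.h>
#include <stdint.h>

#define TICK_NS 4000000LL	/* assumed 4 ms tick; real systems vary */

/* Toy inode: one timestamp plus a "somebody looked at it" flag. */
struct toy_inode {
	int64_t ctime_ns;
	bool queried;
};

/* Reading the attributes marks the inode as observed. */
int64_t toy_getattr(struct toy_inode *inode)
{
	inode->queried = true;
	return inode->ctime_ns;
}

/*
 * Updating the timestamp: an observed inode gets the exact time and the
 * flag is cleared; an unobserved inode gets the start of the current tick.
 */
void toy_update_ctime(struct toy_inode *inode, int64_t now_ns)
{
	if (inode->queried) {
		inode->ctime_ns = now_ns;
		inode->queried = false;
	} else {
		inode->ctime_ns = now_ns - now_ns % TICK_NS;
	}
}
```

With this model, an update at 8.1 ms on an unobserved inode is stamped 8.0 ms. If the inode is then read with toy_getattr() and updated again at 9.5 ms, it is stamped 9.5 ms, so the observer can tell that something changed even though both updates fell within the same tick.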

Problems with the 6.6 multi-grain implementation

The initial implementation, however, exposed a significant problem. Suppose two files, f1 and f2, are modified in close succession, f1 first and then f2. If f1 receives a fine-grained timestamp while f2 receives a coarse-grained one, f2's timestamp can end up earlier than f1's. The timestamps then imply that f2 was modified before f1, violating the VFS ordering guarantees. This issue was one of the primary reasons for reverting the feature during the 6.6 merge window.

6.13 multi-grain timestamp fix and filesystem documentation

After discussions at LSFMM 2024, Jeff Layton revisited the feature and proposed a rather simple fix to the problem. Christian Brauner described the fix in the pull request as:

To prevent this, a floor value is maintained for multigrain timestamps. Whenever a fine-grained timestamp is handed out, record it, and when later coarse-grained stamps are handed out, ensure they are not earlier than that value. If the coarse-grained timestamp is earlier than the fine-grained floor, return the floor value instead.

This approach preserves the integrity of VFS ordering guarantees while allowing the kernel to use multi-grain timestamps effectively. Jeff Layton has since converted most major filesystems to support the feature.
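Both the ordering inversion and the floor fix can be reproduced in a small userspace model. The sketch below is an illustration of the idea in the quoted description, not the kernel code: toy_clock, hand_out_fine, hand_out_coarse and the use_floor switch are hypothetical names, and the 4 ms tick is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

#define TICK_NS 4000000LL	/* assumed 4 ms tick; real systems vary */

/* Toy clock that remembers the newest fine-grained stamp handed out. */
struct toy_clock {
	int64_t floor_ns;
	bool use_floor;		/* false models the behaviour without the fix */
};

/* Fine-grained stamp: the exact time, recorded as the new floor. */
int64_t hand_out_fine(struct toy_clock *clk, int64_t now_ns)
{
	if (now_ns > clk->floor_ns)
		clk->floor_ns = now_ns;
	return now_ns;
}

/* Coarse stamp: start of the current tick, but never below the floor. */
int64_t hand_out_coarse(const struct toy_clock *clk, int64_t now_ns)
{
	int64_t coarse = now_ns - now_ns % TICK_NS;

	if (clk->use_floor && coarse < clk->floor_ns)
		return clk->floor_ns;
	return coarse;
}
```

Take f1 modified at 9.5 ms with a fine-grained stamp and f2 modified at 9.7 ms with a coarse one. Without the floor, f2 is stamped 8.0 ms and appears older than f1. With the floor enabled, f2 is stamped 9.5 ms, so it is never ordered before f1, and once the next tick begins (for example at 12.1 ms) coarse stamps go back to plain tick boundaries.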

The documentation outlines how filesystems can opt into multi-grain timestamps with minimal changes:

For most filesystems, it's sufficient to just set the FS_MGTIME flag in the fstype->fs_flags in order to opt-in, providing the ctime is only ever set via inode_set_ctime_current(). If the filesystem has a ->getattr routine that doesn't call generic_fillattr, then it should call fill_mg_cmtime() to fill those values. For setattr, it should use setattr_copy() to update the timestamps, or otherwise mimic its behavior.

Conclusion

The inclusion of multi-grain timestamps in kernel 6.13 is a significant milestone for the Linux filesystem community. For filesystems like NFS, this feature will enable more efficient caching behavior, improving performance and addressing long-standing issues with timestamp granularity.

References:
[1]: https://lore.kernel.org/all/20241115-vfs-mgtime-1dd54cc6d322@brauner/
[2]: https://lwn.net/Articles/975863/
[3]: https://lwn.net/Articles/946394/
[4]: https://lore.kernel.org/all/20240711-mgtime-v5-0-37bb5b465feb@kernel.org/
[5]: https://www.kernel.org/doc/html/next/filesystems/multigrain-ts.htm

LinuxNews.Dev 🐧
