Commit 9b01018

iomap: buffered write failure should not truncate the page cache
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2155605
Tested: With xfstests and bz reproducer

iomap_file_buffered_write_punch_delalloc() currently invalidates the page cache over the unused range of the delalloc extent that was allocated. While the write allocated the delalloc extent, it does not own it exclusively, as the write does not hold any locks that prevent either writeback or mmap page faults from changing the state of either the page cache or the extent state backing this range.

Whilst xfs_bmap_punch_delalloc_range() already handles races in extent conversion - it will only punch out delalloc extents and it ignores any other type of extent - the page cache truncate does not discriminate between data written by this write and data written by some other task. As a result, truncating the page cache can result in data corruption if the write races with mmap modifications to the file over the same range.

generic/346 exercises this workload, and if we randomly fail writes (as will happen when iomap gets stale iomap detection later in the patchset), it will randomly corrupt the file data because it removes data written by mmap() in the same page as the write() that failed.

Hence we do not want to punch out the page cache over the range of the extent we failed to write to - what we actually need to do is detect the ranges that have dirty data in cache over them and *not punch them out*.

To do this, we have to walk the page cache over the range of the delalloc extent we want to remove. This is made complex by the fact that we have to handle partially up-to-date folios correctly, and this can happen even when the FSB size == PAGE_SIZE because we now support multi-page folios in the page cache.

Because we are only interested in discovering the edges of data ranges in the page cache (i.e. hole-data boundaries), we can make use of mapping_seek_hole_data() to find those transitions in the page cache. As we hold the invalidate_lock, we know that the boundaries are not going to change while we walk the range. This interface is also byte-based and is sub-page block aware, so we can find the data ranges in the cache based on byte offsets rather than page, folio or fs block sized chunks. This greatly simplifies the logic of finding dirty cached ranges in the page cache.

Once we've identified a range that contains cached data, we can then iterate the range folio by folio. This allows us to determine if the data is dirty and hence perform the correct delalloc extent punching operations. The seek interface we use to iterate data ranges will give us sub-folio start/end granularity, so we may end up looking up the same folio multiple times as the seek interface iterates across each discontiguous data region in the folio.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
(cherry picked from commit f43dc4d)
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
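As an aside (not part of the commit message or the patch): the hole/data scan described above has a direct userspace analogue in lseek() with SEEK_DATA/SEEK_HOLE, which makes the open-ended [data_start, data_end) interval iteration easy to picture. A minimal sketch, assuming only a sparse file path passed on the command line:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "sparse.file";
	int fd = open(path, O_RDONLY);
	off_t pos = 0, end;

	if (fd < 0)
		return 1;
	end = lseek(fd, 0, SEEK_END);
	while (pos < end) {
		/* find the start of the next cached/allocated data range */
		off_t data_start = lseek(fd, pos, SEEK_DATA);
		off_t data_end;

		if (data_start < 0)	/* ENXIO: nothing but holes remain */
			break;
		/* the next hole bounds the data range: [data_start, data_end) */
		data_end = lseek(fd, data_start, SEEK_HOLE);
		printf("data: [%lld, %lld)\n",
		       (long long)data_start, (long long)data_end);
		pos = data_end;
	}
	close(fd);
	return 0;
}

The -ENXIO "no more data" termination check in the new kernel loop below is the same condition this userspace loop hits when SEEK_DATA fails with ENXIO.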
1 parent f323bcd commit 9b01018

1 file changed (+180, -15 lines)

1 file changed

+180
-15
lines changed

fs/iomap/buffered-io.c

Lines changed: 180 additions & 15 deletions
@@ -810,6 +810,165 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
 }
 EXPORT_SYMBOL_GPL(iomap_file_buffered_write);
 
+/*
+ * Scan the data range passed to us for dirty page cache folios. If we find a
+ * dirty folio, punch out the preceding range and update the offset from which
+ * the next punch will start from.
+ *
+ * We can punch out storage reservations under clean pages because they either
+ * contain data that has been written back - in which case the delalloc punch
+ * over that range is a no-op - or they have been read faults in which case they
+ * contain zeroes and we can remove the delalloc backing range and any new
+ * writes to those pages will do the normal hole filling operation...
+ *
+ * This makes the logic simple: we only need to keep the delalloc extents over
+ * the dirty ranges of the page cache.
+ *
+ * This function uses [start_byte, end_byte) intervals (i.e. open ended) to
+ * simplify range iterations.
+ */
+static int iomap_write_delalloc_scan(struct inode *inode,
+		loff_t *punch_start_byte, loff_t start_byte, loff_t end_byte,
+		int (*punch)(struct inode *inode, loff_t offset, loff_t length))
+{
+	while (start_byte < end_byte) {
+		struct folio *folio;
+
+		/* grab locked page */
+		folio = filemap_lock_folio(inode->i_mapping,
+				start_byte >> PAGE_SHIFT);
+		if (!folio) {
+			start_byte = ALIGN_DOWN(start_byte, PAGE_SIZE) +
+					PAGE_SIZE;
+			continue;
+		}
+
+		/* if dirty, punch up to offset */
+		if (folio_test_dirty(folio)) {
+			if (start_byte > *punch_start_byte) {
+				int error;
+
+				error = punch(inode, *punch_start_byte,
+						start_byte - *punch_start_byte);
+				if (error) {
+					folio_unlock(folio);
+					folio_put(folio);
+					return error;
+				}
+			}
+
+			/*
+			 * Make sure the next punch start is correctly bound to
+			 * the end of this data range, not the end of the folio.
+			 */
+			*punch_start_byte = min_t(loff_t, end_byte,
+					folio_next_index(folio) << PAGE_SHIFT);
+		}
+
+		/* move offset to start of next folio in range */
+		start_byte = folio_next_index(folio) << PAGE_SHIFT;
+		folio_unlock(folio);
+		folio_put(folio);
+	}
+	return 0;
+}
+
+/*
+ * Punch out all the delalloc blocks in the range given except for those that
+ * have dirty data still pending in the page cache - those are going to be
+ * written and so must still retain the delalloc backing for writeback.
+ *
+ * As we are scanning the page cache for data, we don't need to reimplement the
+ * wheel - mapping_seek_hole_data() does exactly what we need to identify the
+ * start and end of data ranges correctly even for sub-folio block sizes. This
+ * byte range based iteration is especially convenient because it means we
+ * don't have to care about variable size folios, nor where the start or end of
+ * the data range lies within a folio, if they lie within the same folio or even
+ * if there are multiple discontiguous data ranges within the folio.
+ *
+ * It should be noted that mapping_seek_hole_data() is not aware of EOF, and so
+ * can return data ranges that exist in the cache beyond EOF. e.g. a page fault
+ * spanning EOF will initialise the post-EOF data to zeroes and mark it up to
+ * date. A write page fault can then mark it dirty. If we then fail a write()
+ * beyond EOF into that up to date cached range, we allocate a delalloc block
+ * beyond EOF and then have to punch it out. Because the range is up to date,
+ * mapping_seek_hole_data() will return it, and we will skip the punch because
+ * the folio is dirty. This is incorrect - we always need to punch out delalloc
+ * beyond EOF in this case as writeback will never write back and convert that
+ * delalloc block beyond EOF. Hence we limit the cached data scan range to EOF,
+ * resulting in always punching out the range from the EOF to the end of the
+ * range the iomap spans.
+ *
+ * Intervals are of the form [start_byte, end_byte) (i.e. open ended) because it
+ * matches the intervals returned by mapping_seek_hole_data(). i.e. SEEK_DATA
+ * returns the start of a data range (start_byte), and SEEK_HOLE(start_byte)
+ * returns the end of the data range (data_end). Using closed intervals would
+ * require sprinkling this code with magic "+ 1" and "- 1" arithmetic and expose
+ * the code to subtle off-by-one bugs....
+ */
+static int iomap_write_delalloc_release(struct inode *inode,
+		loff_t start_byte, loff_t end_byte,
+		int (*punch)(struct inode *inode, loff_t pos, loff_t length))
+{
+	loff_t punch_start_byte = start_byte;
+	loff_t scan_end_byte = min(i_size_read(inode), end_byte);
+	int error = 0;
+
+	/*
+	 * Lock the mapping to avoid races with page faults re-instantiating
+	 * folios and dirtying them via ->page_mkwrite whilst we walk the
+	 * cache and perform delalloc extent removal. Failing to do this can
+	 * leave dirty pages with no space reservation in the cache.
+	 */
+	filemap_invalidate_lock(inode->i_mapping);
+	while (start_byte < scan_end_byte) {
+		loff_t data_end;
+
+		start_byte = mapping_seek_hole_data(inode->i_mapping,
+				start_byte, scan_end_byte, SEEK_DATA);
+		/*
+		 * If there is no more data to scan, all that is left is to
+		 * punch out the remaining range.
+		 */
+		if (start_byte == -ENXIO || start_byte == scan_end_byte)
+			break;
+		if (start_byte < 0) {
+			error = start_byte;
+			goto out_unlock;
+		}
+		WARN_ON_ONCE(start_byte < punch_start_byte);
+		WARN_ON_ONCE(start_byte > scan_end_byte);
+
+		/*
+		 * We find the end of this contiguous cached data range by
+		 * seeking from start_byte to the beginning of the next hole.
+		 */
+		data_end = mapping_seek_hole_data(inode->i_mapping, start_byte,
+				scan_end_byte, SEEK_HOLE);
+		if (data_end < 0) {
+			error = data_end;
+			goto out_unlock;
+		}
+		WARN_ON_ONCE(data_end <= start_byte);
+		WARN_ON_ONCE(data_end > scan_end_byte);
+
+		error = iomap_write_delalloc_scan(inode, &punch_start_byte,
+				start_byte, data_end, punch);
+		if (error)
+			goto out_unlock;
+
+		/* The next data search starts at the end of this one. */
+		start_byte = data_end;
+	}
+
+	if (punch_start_byte < end_byte)
+		error = punch(inode, punch_start_byte,
+				end_byte - punch_start_byte);
+out_unlock:
+	filemap_invalidate_unlock(inode->i_mapping);
+	return error;
+}
+
 /*
  * When a short write occurs, the filesystem may need to remove reserved space
  * that was allocated in ->iomap_begin from it's ->iomap_end method. For
@@ -820,8 +979,25 @@ EXPORT_SYMBOL_GPL(iomap_file_buffered_write);
  * allocated for this iomap.
  *
  * This function uses [start_byte, end_byte) intervals (i.e. open ended) to
- * simplify range iterations, but converts them back to {offset,len} tuples for
- * the punch callback.
+ * simplify range iterations.
+ *
+ * The punch() callback *must* only punch delalloc extents in the range passed
+ * to it. It must skip over all other types of extents in the range and leave
+ * them completely unchanged. It must do this punch atomically with respect to
+ * other extent modifications.
+ *
+ * The punch() callback may be called with a folio locked to prevent writeback
+ * extent allocation racing at the edge of the range we are currently punching.
+ * The locked folio may or may not cover the range being punched, so it is not
+ * safe for the punch() callback to lock folios itself.
+ *
+ * Lock order is:
+ *
+ * inode->i_rwsem (shared or exclusive)
+ *   inode->i_mapping->invalidate_lock (exclusive)
+ *     folio_lock()
+ *       ->punch
+ *         internal filesystem allocation lock
  */
 int iomap_file_buffered_write_punch_delalloc(struct inode *inode,
 		struct iomap *iomap, loff_t pos, loff_t length,
@@ -831,7 +1007,6 @@ int iomap_file_buffered_write_punch_delalloc(struct inode *inode,
 	loff_t start_byte;
 	loff_t end_byte;
 	int blocksize = i_blocksize(inode);
-	int error = 0;
 
 	if (iomap->type != IOMAP_DELALLOC)
 		return 0;
@@ -855,18 +1030,8 @@ int iomap_file_buffered_write_punch_delalloc(struct inode *inode,
 	if (start_byte >= end_byte)
 		return 0;
 
-	/*
-	 * Lock the mapping to avoid races with page faults re-instantiating
-	 * folios and dirtying them via ->page_mkwrite between the page cache
-	 * truncation and the delalloc extent removal. Failing to do this can
-	 * leave dirty pages with no space reservation in the cache.
-	 */
-	filemap_invalidate_lock(inode->i_mapping);
-	truncate_pagecache_range(inode, start_byte, end_byte - 1);
-	error = punch(inode, start_byte, end_byte - start_byte);
-	filemap_invalidate_unlock(inode->i_mapping);
-
-	return error;
+	return iomap_write_delalloc_release(inode, start_byte, end_byte,
+			punch);
 }
 EXPORT_SYMBOL_GPL(iomap_file_buffered_write_punch_delalloc);
 
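For context (not part of this commit), the punch() contract documented above is what a filesystem plugs into iomap_file_buffered_write_punch_delalloc() from its buffered-write ->iomap_end handler. The sketch below is illustrative only: the myfs_* names are hypothetical, and the trailing parameters of iomap_file_buffered_write_punch_delalloc() (the short-write byte count and the callback pointer) are assumed from the upstream prototype, since this diff only shows the first four.

#include <linux/fs.h>
#include <linux/iomap.h>

/* Assumed filesystem helper: removes only delalloc extents in [start, end). */
int myfs_punch_delalloc_range(struct inode *inode, loff_t start_byte,
		loff_t end_byte);

/*
 * Punch callback handed to iomap: it must remove delalloc extents over
 * [offset, offset + length) and leave every other extent type untouched,
 * atomically with respect to other extent modifications.
 */
static int myfs_buffered_write_delalloc_punch(struct inode *inode,
		loff_t offset, loff_t length)
{
	return myfs_punch_delalloc_range(inode, offset, offset + length);
}

/* Buffered-write ->iomap_end: clean up delalloc left behind by a short write. */
static int myfs_buffered_write_iomap_end(struct inode *inode, loff_t pos,
		loff_t length, ssize_t written, unsigned flags,
		struct iomap *iomap)
{
	/*
	 * iomap_file_buffered_write_punch_delalloc() returns immediately for
	 * non-delalloc iomaps, so it is safe to call it unconditionally here.
	 */
	return iomap_file_buffered_write_punch_delalloc(inode, iomap, pos,
			length, written, myfs_buffered_write_delalloc_punch);
}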
