
Commit 998ef75

hansendc authored and torvalds committed
fs: do not prefault sys_write() user buffer pages
=== Short summary ===

iov_iter_fault_in_readable() works around a really rare case, and we can
avoid the deadlock it addresses in another way: disable page faults and
work around copy failures by faulting after the copy in a slow path
instead of before the copy in a hot one.

I have a little microbenchmark that does repeated, small writes to tmpfs.
This patch speeds that microbenchmark up by 6.2%.

=== Long version ===

When doing a sys_write() we have a source buffer in userspace and a
target file page. If both of those are the same physical page, there is
a potential deadlock that we avoid. It would happen something like this:

1. We start the write to the file
2. Allocate page cache page and set it !Uptodate
3. Touch the userspace buffer to copy in the user data
4. Page fault (since the source of the write is not yet mapped)
5. Page fault code tries to lock the page and deadlocks
   (more details on this below)

To avoid this, we prefault the page to guarantee that this fault does
not occur. But this prefault comes at a cost: it is one of the most
expensive things that we do in a hot write() path (especially if we
compare it to the read path), and it is working around a pretty rare
case.

The fix is pretty simple: we move the "prefault" code to run after we
attempt the copy. We explicitly disable page faults _during_ the copy,
detect the copy failure, then execute the "prefault" outside of where
the page lock needs to be held.

iov_iter_copy_from_user_atomic() actually already has an implicit
pagefault_disable() inside of it (at least on x86), but we add an
explicit one. I don't think we can depend on every kmap_atomic()
implementation to pagefault_disable() for eternity.

===================================================

The stack trace when this happens looks like this:

	wait_on_page_bit_killable+0xc0/0xd0
	__lock_page_or_retry+0x84/0xa0
	filemap_fault+0x1ed/0x3d0
	__do_fault+0x41/0xc0
	handle_mm_fault+0x9bb/0x1210
	__do_page_fault+0x17f/0x3d0
	do_page_fault+0xc/0x10
	page_fault+0x22/0x30
	generic_perform_write+0xca/0x1a0
	__generic_file_write_iter+0x190/0x1f0
	ext4_file_write_iter+0xe9/0x460
	__vfs_write+0xaa/0xe0
	vfs_write+0xa6/0x1a0
	SyS_write+0x46/0xa0
	entry_SYSCALL_64_fastpath+0x12/0x6a
	0xffffffffffffffff

(Note, this does *NOT* happen in practice today because kmap_atomic()
does a pagefault_disable(). The trace above was obtained by taking out
the pagefault_disable().)

You can trigger the deadlock with this little code snippet:

	fd = open("foo", O_RDWR);
	fdmap = mmap(NULL, len, PROT_WRITE|PROT_READ, MAP_SHARED, fd, 0);
	write(fd, &fdmap[0], 1);

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jens Axboe <axboe@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: NeilBrown <neilb@suse.de>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Paul Cassella <cassella@cray.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
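For completeness, a self-contained version of that reproducer might look
like the sketch below. It is a hypothetical test program, not part of the
commit; the file name "foo", the one-page len, and the error handling are
illustrative assumptions:

	/* Hypothetical standalone reproducer, expanded from the snippet above. */
	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		size_t len = 4096;	/* one page is enough; any mappable length works */
		char *fdmap;
		int fd = open("foo", O_RDWR | O_CREAT, 0644);

		if (fd < 0 || ftruncate(fd, len) < 0) {
			perror("open/ftruncate");
			return 1;
		}
		/* Map the file we are about to write to... */
		fdmap = mmap(NULL, len, PROT_WRITE | PROT_READ, MAP_SHARED, fd, 0);
		if (fdmap == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		/*
		 * ...and use that same mapping as the source buffer of the
		 * write, so the source page and the target file page are the
		 * same physical page -- the case described above.
		 */
		if (write(fd, &fdmap[0], 1) != 1)
			perror("write");
		return 0;
	}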
1 parent 8334b96 commit 998ef75

1 file changed (+18 -16)

mm/filemap.c

Lines changed: 18 additions & 16 deletions
@@ -2473,30 +2473,24 @@ ssize_t generic_perform_write(struct file *file,
 						iov_iter_count(i));
 
 again:
-		/*
-		 * Bring in the user page that we will copy from _first_.
-		 * Otherwise there's a nasty deadlock on copying from the
-		 * same page as we're writing to, without it being marked
-		 * up-to-date.
-		 *
-		 * Not only is this an optimisation, but it is also required
-		 * to check that the address is actually valid, when atomic
-		 * usercopies are used, below.
-		 */
-		if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
-			status = -EFAULT;
-			break;
-		}
-
 		status = a_ops->write_begin(file, mapping, pos, bytes, flags,
 						&page, &fsdata);
 		if (unlikely(status < 0))
 			break;
 
 		if (mapping_writably_mapped(mapping))
 			flush_dcache_page(page);
-
+		/*
+		 * 'page' is now locked. If we are trying to copy from a
+		 * mapping of 'page' in userspace, the copy might fault and
+		 * would need PageUptodate() to complete. But, page can not be
+		 * made Uptodate without acquiring the page lock, which we hold.
+		 * Deadlock. Avoid with pagefault_disable(). Fix up below with
+		 * iov_iter_fault_in_readable().
+		 */
+		pagefault_disable();
 		copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
+		pagefault_enable();
 		flush_dcache_page(page);
 
 		status = a_ops->write_end(file, mapping, pos, bytes, copied,
@@ -2519,6 +2513,14 @@ ssize_t generic_perform_write(struct file *file,
 			 */
 			bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
 						iov_iter_single_seg_count(i));
+			/*
+			 * This is the fallback to recover if the copy from
+			 * userspace above faults.
+			 */
+			if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
+				status = -EFAULT;
+				break;
+			}
 			goto again;
 		}
 		pos += copied;
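Taken together, the two hunks give the copy loop in generic_perform_write()
the shape sketched below. This is a condensed, non-runnable paraphrase for
orientation, not the verbatim kernel source; the enclosing do/while loop,
the write_begin()/write_end() error checks, and the rest of the slow path
are elided:

	again:
		status = a_ops->write_begin(file, mapping, pos, bytes, flags,
						&page, &fsdata);

		/*
		 * 'page' is locked from here on, so a fault on the source
		 * buffer must not be taken while copying: forbid faults
		 * during the copy and recover from a short copy below.
		 */
		pagefault_disable();
		copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
		pagefault_enable();

		status = a_ops->write_end(file, mapping, pos, bytes, copied,
						page, fsdata);

		if (unlikely(copied == 0)) {
			bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
					iov_iter_single_seg_count(i));
			/*
			 * Slow path: 'page' is unlocked now, so it is safe to
			 * fault the source buffer in, then retry the copy.
			 */
			if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
				status = -EFAULT;
				break;
			}
			goto again;
		}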
