Message ID | 4f091776142f2ebf7b94018146de72318474e686.1647008754.git.quic_charante@quicinc.com |
---|---|
State | Accepted |
Commit | 08095d6310a7ce43256b4251577bc66a25c6e1a6 |
On Tue, 15 Mar 2022 15:58:28 -0700 Minchan Kim <minchan@kernel.org> wrote:

> On Fri, Mar 11, 2022 at 08:59:06PM +0530, Charan Teja Kalla wrote:
> > The process_madvise() system call is expected to skip holes in vma
> > passed through 'struct iovec' vector list. But do_madvise, which
> > process_madvise() calls for each vma, returns ENOMEM in case of unmapped
> > holes, despite the VMA is processed.
> > Thus process_madvise() should treat ENOMEM as expected and consider the
> > VMA passed to as processed and continue processing other vma's in the
> > vector list. Returning -ENOMEM to user, despite the VMA is processed,
> > will be unable to figure out where to start the next madvise.
> > Fixes: ecb8ac8b1f14("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
> > Cc: <stable@vger.kernel.org> # 5.10+
>
> Hmm, not sure whether it's stable material since it changes semantic of
> API. It would be better to change the semantic from 5.19 with man page
> update to specify the change.

It's a very desirable change and it makes the code match the manpage
and it's cc:stable. I think we should just absorb any transitory
damage which this causes people. I doubt if there will be much - if
anyone was affected by this they would have already told us that it's
broken?
On Wed, 16 Mar 2022 19:49:38 +0530 Charan Teja Kalla <quic_charante@quicinc.com> wrote:

> > IMO, it's worth to note in man page.
> >
>
> Or the current patch for just ENOMEM is sufficient here and we just have
> to update the man page?

I think the "On success, process_madvise() returns the number of bytes
advised" behaviour sounds useful. But madvise() doesn't do that.

RETURN VALUE
       On success, madvise() returns zero. On error, it returns -1 and errno
       is set to indicate the error.

So why is it desirable in the case of process_madvise()?

And why was process_madvise() designed this way? Or was it
always simply an error in the manpage?
On Wed, Mar 16, 2022 at 07:49:38PM +0530, Charan Teja Kalla wrote:
> Thanks Andrew and Minchan.
>
> On 3/16/2022 7:13 AM, Minchan Kim wrote:
> > On Tue, Mar 15, 2022 at 04:48:07PM -0700, Andrew Morton wrote:
> >> On Tue, 15 Mar 2022 15:58:28 -0700 Minchan Kim <minchan@kernel.org> wrote:
> >>
> >>> On Fri, Mar 11, 2022 at 08:59:06PM +0530, Charan Teja Kalla wrote:
> >>>> The process_madvise() system call is expected to skip holes in vma
> >>>> passed through 'struct iovec' vector list. But do_madvise, which
> >>>> process_madvise() calls for each vma, returns ENOMEM in case of unmapped
> >>>> holes, despite the VMA is processed.
> >>>> Thus process_madvise() should treat ENOMEM as expected and consider the
> >>>> VMA passed to as processed and continue processing other vma's in the
> >>>> vector list. Returning -ENOMEM to user, despite the VMA is processed,
> >>>> will be unable to figure out where to start the next madvise.
> >>>> Fixes: ecb8ac8b1f14("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
> >>>> Cc: <stable@vger.kernel.org> # 5.10+
> >>>
> >>> Hmm, not sure whether it's stable material since it changes semantic of
> >>> API. It would be better to change the semantic from 5.19 with man page
> >>> update to specify the change.
> >>
> >> It's a very desirable change and it makes the code match the manpage
> >> and it's cc:stable. I think we should just absorb any transitory
> >> damage which this causes people. I doubt if there will be much - if
> >> anyone was affected by this they would have already told us that it's
> >> broken?
> >
> >
> > process_madvise fails to return exact processed bytes at several cases
> > if it encounters the error, such as, -EINVAL, -EINTR, -ENOMEM in the
> > middle of processing vmas. And now we are trying to make exception for
> > change for only hole?
> I think EINTR will never return in the middle of processing VMA's for
> the behaviours supported by process_madvise().
>
> It can return EINTR when:
> -------------------------
> 1) PTRACE_MODE_READ is being checked in mm_access() where it is waiting
> on task->signal->exec_update_lock. EINTR returned from here guarantees
> that process_madvise() didn't event start processing.
> https://elixir.bootlin.com/linux/v5.16.14/source/mm/madvise.c#L1264 -->
> https://elixir.bootlin.com/linux/v5.16.14/source/kernel/fork.c#L1318
>
> 2) The process_madvise() started processing VMA's but the required
> behavior on a VMA needs mmap_write_lock_killable(), from where EINTR is
> returned. The current behaviours supported by process_madvise(),
> MADV_COLD, PAGEOUT, WILLNEED, just need read lock here.
> https://elixir.bootlin.com/linux/v5.16.14/source/mm/madvise.c#L1164
> **Thus I think no way for EINTR can be returned by process_madvise() in
> the middle of processing.** . No?
>
> for EINVAL:
> -----------
> The only case, I can think of, where EINVAL can be returned in the
> middle of processing is in examples like, given range contains VMA's
> with a hole in between and one of the VMA contains the pages that fails
> can_madv_lru_vma() condition.
> So, it's a limitation that this returns -EINVAL though some bytes are
> processed.
> OR
> Since there exists still some invalid bytes processed it is valid to
> return -EINVAL here and user has to check the address range sent?
>
> for ENOMEM:
> ----------
> Though complete range is processed still returns ENOMEM. IMO, This
> shouldn't be treated as error which the patch is targeted for. Then
> there is limitation case that you mentioned below where it returns
> positive processes bytes even though it didn't process anything if it
> couldn't find any vma for the first iteration in madvise_walk_vmas
>
> I think the above limitations with EINVAL and ENOMEM are arising because
> we are relying on do_madvise() functionality which madvise() call uses
> to process a single VMA. When 'struct iovec' vector processing interface
> is given in a system call, it is the expectation by the caller that this
> system call should return the correct bytes processed to help the user
> to take the correct decisions. Please correct me If i am wrong here.
>
> So, should we add the new function say do_process_madvise(), which take
> cares of above limitations? or any alternative suggestions here please?

What I am thinking now is that process_madvise needs its own
iterator (i.e., do_process_madvise), and it should report the exact bytes
it addressed with exact ranges, like process_vm_readv/writev. Providing
valid ranges is the responsibility of the user.

>
> > IMO, it's worth to note in man page.
> >
>
> Or the current patch for just ENOMEM is sufficient here and we just have
> to update the man page?
>
> > In addition, this change returns positive processes bytes even though
> > it didn't process anything if it couldn't find any vma for the first
> > iteration in madvise_walk_vmas.
>
> Thanks,
> Charan
>
On Wed, Mar 16, 2022 at 02:29:06PM -0700, Andrew Morton wrote:
> On Wed, 16 Mar 2022 19:49:38 +0530 Charan Teja Kalla <quic_charante@quicinc.com> wrote:
>
> > > IMO, it's worth to note in man page.
> > >
> >
> > Or the current patch for just ENOMEM is sufficient here and we just have
> > to update the man page?
>
> I think the "On success, process_madvise() returns the number of bytes
> advised" behaviour sounds useful. But madvise() doesn't do that.
>
> RETURN VALUE
>        On success, madvise() returns zero. On error, it returns -1 and errno
>        is set to indicate the error.
>
> So why is it desirable in the case of process_madvise()?

Since process_madvise deals with multiple ranges and could fail at one of
them in the middle of processing, people could work out where the call
failed and then decide whether to abort at that point or continue to
hint the next addresses. The problem with that strategy is that the API
doesn't return any error value if it has processed any bytes, so callers
are limited in the policy they can choose. That's the limitation of
every vector IO syscall, unfortunately.

>
>
>
> And why was process_madvise() designed this way? Or was it
> always simply an error in the manpage?
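
To make the "decide where the call failed" point concrete, here is a minimal userspace sketch in C. It assumes the return value is a count of bytes advised, counted from the start of the iovec list, which is the behaviour discussed in this thread. The names `struct resume_point` and `find_resume_point()` are hypothetical helpers invented for this illustration; they are not part of the kernel or of any libc.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical helper type: where a caller would resume advising. */
struct resume_point {
	size_t index;	/* first iovec element that was not fully advised */
	size_t offset;	/* bytes of that element that were already advised */
};

/*
 * Map a "bytes advised" count back onto the iovec list. If every element
 * was consumed, index == iovcnt and offset == 0.
 */
static struct resume_point find_resume_point(const struct iovec *iov,
					     size_t iovcnt, size_t advised)
{
	struct resume_point rp = { 0, 0 };

	assert(iov != NULL || iovcnt == 0);

	/* Skip every element that the byte count covers completely. */
	while (rp.index < iovcnt && advised >= iov[rp.index].iov_len) {
		advised -= iov[rp.index].iov_len;
		rp.index++;
	}
	/* Whatever is left over falls inside the next element, if any. */
	if (rp.index < iovcnt)
		rp.offset = advised;
	return rp;
}
```

A caller could retry from `iov[rp.index]`, or skip that element and continue with the next one, depending on its policy. The sketch cannot tell the caller why processing stopped, which is the limitation described above.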
On 3/21/2022 8:32 PM, Michal Hocko wrote:
>> It can return EINTR when:
>> -------------------------
>> 1) PTRACE_MODE_READ is being checked in mm_access() where it is waiting
>> on task->signal->exec_update_lock. EINTR returned from here guarantees
>> that process_madvise() didn't event start processing.
>> https://elixir.bootlin.com/linux/v5.16.14/source/mm/madvise.c#L1264 -->
>> https://elixir.bootlin.com/linux/v5.16.14/source/kernel/fork.c#L1318
>>
>> 2) The process_madvise() started processing VMA's but the required
>> behavior on a VMA needs mmap_write_lock_killable(), from where EINTR is
>> returned.
> Please note this will happen if the task has been killed. The return
> value doesn't really matter because the process won't run in userspace.

Okay, thanks here.

>
>> The current behaviours supported by process_madvise(),
>> MADV_COLD, PAGEOUT, WILLNEED, just need read lock here.
>> https://elixir.bootlin.com/linux/v5.16.14/source/mm/madvise.c#L1164
>> **Thus I think no way for EINTR can be returned by process_madvise() in
>> the middle of processing.** . No?
> Maybe not with the current implementation but I can easily imagine that
> there is a requirement to break out early when there is a signal pending
> (e.g. to support terminating madvise on a large memory rage). You would
> get EINTR then somehow need to communicate that to the userspace.

Agree. Will implement this.
Thanks Michal for the inputs.

On 3/21/2022 9:04 PM, Michal Hocko wrote:
> On Fri 11-03-22 20:59:06, Charan Teja Kalla wrote:
>> The process_madvise() system call is expected to skip holes in vma
>> passed through 'struct iovec' vector list.
> Where is this assumption coming from? From the man page I can see:
> : The advice might be applied to only a part of iovec if one of its
> : elements points to an invalid memory region in the remote
> : process. No further elements will be processed beyond that
> : point.

I assumed this from how a single element of an iovec is processed. In a
scenario where the range passed contains multiple VMAs plus holes, on
encountering a VMA with VM_LOCKED|VM_HUGETLB|VM_PFNMAP we immediately
stop processing that iovec element and return EINVAL. Whereas on
encountering a hole, we simply remember it as ENOMEM, continue
processing that iovec element, and return ENOMEM at the end. This means
that the complete range is processed but ENOMEM is still returned, hence
the assumption of skipping holes in a vma.

The other problem is that, within an individual iovec element, we may
still end up returning EINVAL even though some bytes were processed.
That makes it hard for the user to take decisions, i.e. they don't know
at exactly which address the advice failed.

Anyway, both of these will be addressed in the next version of this
patch with the suggestions from Minchan [1], where he mentioned that:
"it should represent exact bytes it addressed with exact ranges like
process_vm_readv/writev. Providing valid ranges is responsibility from
the user."

[1] https://lore.kernel.org/linux-mm/YjNgoeg1yOocsjWC@google.com/

>
>> But do_madvise, which
>> process_madvise() calls for each vma, returns ENOMEM in case of unmapped
>> holes, despite the VMA is processed.
>> Thus process_madvise() should treat ENOMEM as expected and consider the
>> VMA passed to as processed and continue processing other vma's in the
>> vector list. Returning -ENOMEM to user, despite the VMA is processed,
>> will be unable to figure out where to start the next madvise.
> I am not sure I follow. With your previous patch and -ENOMEM from
> do_madvise you get the the answer you are looking for, no?
> With this applied you are loosing the information that some of the iters
> are not mapped or has a hole. Which might be a useful information
> especially when processing on remote tasks which are free to manipulate
> their address spaces.

Yes, it should return ENOMEM. The same will be fixed in the next revision.

>
>> Fixes: ecb8ac8b1f14("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
>> Cc: <stable@vger.kernel.org> # 5.10+
>> Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
>> ---
>> Changes in V2:
>> -- Fixed handling of ENOMEM by process_madvise().
>> -- Patch doesn't exist in V1.
>>
>> mm/madvise.c | 9 ++++++++-
>> 1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index e97e6a9..14fb76d 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -1426,9 +1426,16 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
>>
>>  	while (iov_iter_count(&iter)) {
>>  		iovec = iov_iter_iovec(&iter);
>> +		/*
>> +		 * do_madvise returns ENOMEM if unmapped holes are present
>> +		 * in the passed VMA. process_madvise() is expected to skip
>> +		 * unmapped holes passed to it in the 'struct iovec' list
>> +		 * and not fail because of them. Thus treat -ENOMEM return
>> +		 * from do_madvise as valid and continue processing.
>> +		 */
>>  		ret = do_madvise(mm, (unsigned long)iovec.iov_base,
>>  					iovec.iov_len, behavior);
>> -		if (ret < 0)
>> +		if (ret < 0 && ret != -ENOMEM)
>>  			break;
>>  		iov_iter_advance(&iter, iovec.iov_len);
>>  	}
>> --
>> 2.7.4
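
The per-element behaviour described in this reply (stop at once with EINVAL on an ineligible VMA, but only remember ENOMEM on a hole and keep going) can be sketched as a small userspace model in C. This is an illustration, not the kernel code: `struct model_vma`, its `eligible` flag, and `model_madvise_walk()` are hypothetical stand-ins, and the model assumes the VMAs are sorted and do not overlap.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified stand-in for a VMA; not the kernel's struct vm_area_struct. */
struct model_vma {
	unsigned long start;
	unsigned long end;	/* exclusive */
	int eligible;		/* 0 models VM_LOCKED|VM_HUGETLB|VM_PFNMAP */
};

/*
 * Walk [start, end) over VMAs sorted by address. Returns 0, -ENOMEM or
 * -EINVAL, and stores the number of bytes actually advised in *advised.
 */
static int model_madvise_walk(const struct model_vma *vmas, size_t n,
			      unsigned long start, unsigned long end,
			      unsigned long *advised)
{
	int unmapped_error = 0;
	size_t i = 0;

	assert(start <= end);
	*advised = 0;

	/* Skip VMAs that lie entirely below the requested range. */
	while (i < n && vmas[i].end <= start)
		i++;

	while (start < end) {
		unsigned long tmp;

		/* No VMA left inside the range: the rest is a hole. */
		if (i == n || vmas[i].start >= end)
			return -ENOMEM;

		/* Hole before this VMA: remember it and keep walking. */
		if (start < vmas[i].start) {
			unmapped_error = -ENOMEM;
			start = vmas[i].start;
		}

		/* Ineligible VMA: stop immediately. */
		if (!vmas[i].eligible)
			return -EINVAL;

		tmp = vmas[i].end < end ? vmas[i].end : end;
		*advised += tmp - start;
		start = tmp;
		i++;
	}
	return unmapped_error;
}
```

With two eligible VMAs separated by a hole, the model advises both VMAs in full and still returns -ENOMEM, which is the "complete range is processed but ENOMEM is still returned" case. If an ineligible VMA follows, the model returns -EINVAL although some bytes were already advised.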
diff --git a/mm/madvise.c b/mm/madvise.c
index e97e6a9..14fb76d 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1426,9 +1426,16 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
 
 	while (iov_iter_count(&iter)) {
 		iovec = iov_iter_iovec(&iter);
+		/*
+		 * do_madvise returns ENOMEM if unmapped holes are present
+		 * in the passed VMA. process_madvise() is expected to skip
+		 * unmapped holes passed to it in the 'struct iovec' list
+		 * and not fail because of them. Thus treat -ENOMEM return
+		 * from do_madvise as valid and continue processing.
+		 */
 		ret = do_madvise(mm, (unsigned long)iovec.iov_base,
 					iovec.iov_len, behavior);
-		if (ret < 0)
+		if (ret < 0 && ret != -ENOMEM)
 			break;
 		iov_iter_advance(&iter, iovec.iov_len);
 	}
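
The effect of the one-line change in the loop can be illustrated with a small userspace model in C. It replays a list of per-element results that do_madvise() is assumed to have returned and reports how far the iterator advanced. `struct model_elem`, `struct loop_result`, `model_loop()` and the `skip_enomem` flag are hypothetical names for this sketch. How the syscall turns the loop state into its final return value lies outside the hunk above and is deliberately not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* One iovec element plus the result do_madvise() is assumed to return. */
struct model_elem {
	size_t len;
	int ret;
};

/* State of the loop when it exits. */
struct loop_result {
	size_t advanced;	/* bytes the iterator moved past */
	int last_ret;		/* last do_madvise() result seen */
};

/* skip_enomem == 0 models the old check, non-zero models the patched one. */
static struct loop_result model_loop(const struct model_elem *elems, size_t n,
				     int skip_enomem)
{
	struct loop_result r = { 0, 0 };

	assert(elems != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		r.last_ret = elems[i].ret;
		/* Old: any error stops. New: -ENOMEM is tolerated. */
		if (r.last_ret < 0 &&
		    !(skip_enomem && r.last_ret == -ENOMEM))
			break;
		r.advanced += elems[i].len;
	}
	return r;
}
```

For a list whose second element contains a hole, the old check stops after the first element, while the patched check advances past the hole and stops only at a later non-ENOMEM error. Note that in the patched model a hole in the final element leaves `last_ret` at -ENOMEM, the information Michal Hocko points out should not be lost.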
The process_madvise() system call is expected to skip holes in vma
passed through 'struct iovec' vector list. But do_madvise, which
process_madvise() calls for each vma, returns ENOMEM in case of unmapped
holes, despite the VMA is processed.
Thus process_madvise() should treat ENOMEM as expected and consider the
VMA passed to as processed and continue processing other vma's in the
vector list. Returning -ENOMEM to user, despite the VMA is processed,
will be unable to figure out where to start the next madvise.

Fixes: ecb8ac8b1f14("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
Cc: <stable@vger.kernel.org> # 5.10+
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
---
Changes in V2:
  -- Fixed handling of ENOMEM by process_madvise().
  -- Patch doesn't exist in V1.

 mm/madvise.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)