mbox series

[RFC,0/6] RLIMIT_NPROC in ucounts fixups

Message ID 20220207121800.5079-1-mkoutny@suse.com
Headers show
Series RLIMIT_NPROC in ucounts fixups | expand

Message

Michal Koutný Feb. 7, 2022, 12:17 p.m. UTC
This series is a result of looking deeper into breakage of
tools/testing/selftests/rlimits/rlimits-per-userns.c after 
https://lore.kernel.org/r/20220204181144.24462-1-mkoutny@suse.com/
is applied.

The description of the original problem that lead to RLIMIT_NPROC et al.
ucounts rewrite could be ambiguously interpretted as supporting either
the case of:
- never-fork service or
- fork (RLIMIT_NPROC-1) times service.

The scenario is weird anyway given existence of pids controller.

The realization of that scenario relies not only on tracking number of
processes per user_ns but also newly allows the root to override limit through
set*uid. The commit message didn't mention that, so it's unclear if it
was the intention too.

I also noticed that the RLIMIT_NPROC enforcing in fork seems subject to TOCTOU
race (check(nr_tasks),...,nr_tasks++) so the limit is rather advisory (but
that's not a new thing related to ucounts rewrite).

This series is RFC to discuss relevance of the subtle changes RLIMIT_NPROC to
ucounts rewrite introduced.

Michal Koutný (6):
  set_user: Perform RLIMIT_NPROC capability check against new user
    credentials
  set*uid: Check RLIMIT_PROC against new credentials
  cred: Count tasks by their real uid into RLIMIT_NPROC
  ucounts: Allow root to override RLIMIT_NPROC
  selftests: Challenge RLIMIT_NPROC in user namespaces
  selftests: Test RLIMIT_NPROC in clone-created user namespaces

 fs/exec.c                                     |   2 +-
 include/linux/cred.h                          |   2 +-
 kernel/cred.c                                 |  29 ++-
 kernel/fork.c                                 |   2 +-
 kernel/sys.c                                  |  20 +-
 kernel/ucount.c                               |   3 +
 kernel/user_namespace.c                       |   2 +-
 .../selftests/rlimits/rlimits-per-userns.c    | 233 +++++++++++++++---
 8 files changed, 229 insertions(+), 64 deletions(-)

Comments

Eric W. Biederman Feb. 11, 2022, 8:32 p.m. UTC | #1
Solar Designer <solar@openwall.com> writes:

> Hi Michal,
>
> On Mon, Feb 07, 2022 at 01:17:55PM +0100, Michal Koutný wrote:
>> The check is currently against the current->cred but since those are
>> going to change and we want to check RLIMIT_NPROC condition after the
>> switch, supply the capability check with the new cred.
>> But since we're checking new_user being INIT_USER any new cred's
>> capability-based allowance may be redundant when the check fails and the
>> alternative solution would be revert of the commit 2863643fb8b9
>> ("set_user: add capability check when rlimit(RLIMIT_NPROC) exceeds")
>> 
>> Fixes: 2863643fb8b9 ("set_user: add capability check when rlimit(RLIMIT_NPROC) exceeds")
>> 
>> Cc: Solar Designer <solar@openwall.com>
>> Cc: Christian Brauner <christian.brauner@ubuntu.com>
>> Signed-off-by: Michal Koutný <mkoutny@suse.com>
>> ---
>>  kernel/sys.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>> 
>> diff --git a/kernel/sys.c b/kernel/sys.c
>> index 8ea20912103a..48c90dcceff3 100644
>> --- a/kernel/sys.c
>> +++ b/kernel/sys.c
>> @@ -481,7 +481,8 @@ static int set_user(struct cred *new)
>>  	 */
>>  	if (ucounts_limit_cmp(new->ucounts, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)) >= 0 &&
>>  			new_user != INIT_USER &&
>> -			!capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
>> +			!security_capable(new, &init_user_ns, CAP_SYS_RESOURCE, CAP_OPT_NONE) &&
>> +			!security_capable(new, &init_user_ns, CAP_SYS_ADMIN, CAP_OPT_NONE))
>>  		current->flags |= PF_NPROC_EXCEEDED;
>>  	else
>>  		current->flags &= ~PF_NPROC_EXCEEDED;
>
> Thank you for working on this and CC'ing me on it.  This is related to
> the discussion Christian and I had in September:
>
> https://lore.kernel.org/all/20210913100140.bxqlg47pushoqa3r@wittgenstein/
>
> Christian was going to revert 2863643fb8b9, but apparently that never
> happened.  Back then, I also suggested:
>
> "Alternatively, we could postpone the set_user() calls until we're
> running with the new user's capabilities, but that's an invasive change
> that's likely to create its own issues."

Back then you mentioned that apache suexec was broken.  Do you have
any more details?

I would like to make certain the apache suexec issue is fixed but
without a few details I can't do that.  I tried looking but I can't
find an public report about apache suexec being broken.

My goal is to come up with a very careful and conservative set of
patches that fix all of the known issues with RLIMIT_NPROC.

Eric
Etienne Dechamps Feb. 12, 2022, 3:32 p.m. UTC | #2
Hello there,

On 07/02/2022 12:17, Michal Koutný wrote:
> This series is a result of looking deeper into breakage of
> tools/testing/selftests/rlimits/rlimits-per-userns.c after
> https://lore.kernel.org/r/20220204181144.24462-1-mkoutny@suse.com/
> is applied.

Pardon the intrusion, but I thought you might be interested to know that 
as a humble user I noticed actual user-visible breakage from 59ec715 
"ucounts: Fix rlimit max values check": 
https://bugzilla.kernel.org/show_bug.cgi?id=215596

I'm not sure I understand everything that's going on in this thread but 
it does seem very relevant. You guys might want to double-check the 
behavior in the particular scenario described there. I'm mostly sending 
this to make sure everything is cross-linked.
Solar Designer Feb. 12, 2022, 10:14 p.m. UTC | #3
On Fri, Feb 11, 2022 at 02:32:47PM -0600, Eric W. Biederman wrote:
> Solar Designer <solar@openwall.com> writes:
> > https://lore.kernel.org/all/20210913100140.bxqlg47pushoqa3r@wittgenstein/
> >
> > Christian was going to revert 2863643fb8b9, but apparently that never
> > happened.  Back then, I also suggested:
> >
> > "Alternatively, we could postpone the set_user() calls until we're
> > running with the new user's capabilities, but that's an invasive change
> > that's likely to create its own issues."
> 
> Back then you mentioned that apache suexec was broken.  Do you have
> any more details?
> 
> I would like to make certain the apache suexec issue is fixed but
> without a few details I can't do that.  I tried looking but I can't
> find an public report about apache suexec being broken.

I'm not aware of anyone actually running into this issue and reporting
it.  The systems that I personally know use suexec along with rlimits
still run older/distro kernels, so would not yet be affected.

So my mention was based on my understanding of how suexec works, and
code review.  Specifically, Apache httpd has the setting RLimitNPROC,
which makes it set RLIMIT_NPROC:

https://httpd.apache.org/docs/2.4/mod/core.html#rlimitnproc

The above documentation for it includes:

"This applies to processes forked from Apache httpd children servicing
requests, not the Apache httpd children themselves. This includes CGI
scripts and SSI exec commands, but not any processes forked from the
Apache httpd parent, such as piped logs."

In code, there are:

./modules/generators/mod_cgid.c:        ( (cgid_req.limits.limit_nproc_set) && ((rc = apr_procattr_limit_set(procattr, APR_LIMIT_NPROC,
./modules/generators/mod_cgi.c:        ((rc = apr_procattr_limit_set(procattr, APR_LIMIT_NPROC,
./modules/filters/mod_ext_filter.c:    rv = apr_procattr_limit_set(procattr, APR_LIMIT_NPROC, conf->limit_nproc);

For example, in mod_cgi.c this is in run_cgi_child().

I think this means an httpd child sets RLIMIT_NPROC shortly before it
execs suexec, which is a SUID root program.  suexec then switches to the
target user and execs the CGI script.

Before 2863643fb8b9, the setuid() in suexec would set the flag, and the
target user's process count would be checked against RLIMIT_NPROC on
execve().  After 2863643fb8b9, the setuid() in suexec wouldn't set the
flag because setuid() is (naturally) called when the process is still
running as root (thus, has those limits bypass capabilities), and
accordingly execve() would not check the target user's process count
against RLIMIT_NPROC.

> My goal is to come up with a very careful and conservative set of
> patches that fix all of the known issues with RLIMIT_NPROC.

The most conservative fix for this one would be to revert 2863643fb8b9
(preserving other changes that were made on top of it).  I think this
commit did not fix a real issue - it attempted to fix what someone
thought was a discrepancy, but actually made it worse.

However, your recent patch trying to fix that commit looks like it'd
also repair the behavior for suexec.

Thanks,

Alexander
Michal Koutný Feb. 15, 2022, 10:11 a.m. UTC | #4
On Sat, Feb 12, 2022 at 03:32:30PM +0000, Etienne Dechamps <etienne@edechamps.fr> wrote:
> I'm not sure I understand everything that's going on in this thread but it
> does seem very relevant. You guys might want to double-check the behavior in
> the particular scenario described there. I'm mostly sending this to make
> sure everything is cross-linked.

Thanks for the report with strace.

AFAICT, it's caused by setresuid() after unshare(), i.e. all root's
tasks are (wrongly) compared against the lowered RLIMIT_NPROC.

This is tackled by my RFC patch 2/6 [1] or Eric's variant but 3/8
(equivalent fix for this case but I haven't run that build).

Michal

[1] I could run your test (LimitNPROC=1 actually) against kernel with my
patches and the service starts.
Michal Koutný Feb. 15, 2022, 11:55 a.m. UTC | #5
On Thu, Feb 10, 2022 at 02:14:05AM +0100, Solar Designer <solar@openwall.com> wrote:
> However, I think you need to drop the negations of the return value from
> security_capable().
> security_capable() returns 0 or -EPERM, while capable() returns a
> bool, in kernel/capability.c: ns_capable_common():

Oops. Yeah, I only blindly applied replacement with a predicate for
(new) cred and overlooked this inverse semantics. Thanks for pointing
that out to me!

Nevertheless, this will likely be incorporated via Eric's series
anyway.


Michal
Eric W. Biederman Feb. 23, 2022, 12:57 a.m. UTC | #6
Michal Koutný <mkoutny@suse.com> writes:

> On Sat, Feb 12, 2022 at 03:32:30PM +0000, Etienne Dechamps <etienne@edechamps.fr> wrote:
>> I'm not sure I understand everything that's going on in this thread but it
>> does seem very relevant. You guys might want to double-check the behavior in
>> the particular scenario described there. I'm mostly sending this to make
>> sure everything is cross-linked.
>
> Thanks for the report with strace.
>
> AFAICT, it's caused by setresuid() after unshare(), i.e. all root's
> tasks are (wrongly) compared against the lowered RLIMIT_NPROC.
>
> This is tackled by my RFC patch 2/6 [1] or Eric's variant but 3/8
> (equivalent fix for this case but I haven't run that build).
>
> Michal
>
> [1] I could run your test (LimitNPROC=1 actually) against kernel with my
> patches and the service starts.


So I looked into this and our previous patchsets (but not my final one)
did resolve this.

What fixed it and what is needed to fix this is not enforcing
RLIMIT_NPROC when the user who creates the user namespace is INIT_USER.

AKA something like the patch below.  It is a regression so if at all
possible it needs to be fixed, and it is certainly possible.

The patch below feels right at first glance, but I am not convinced that
testing cred->user or cred->ucounts is the proper test so I am going to
sleep on this a little bit.

I did want everyone to know I looked into this and I am going to ensure
this gets fixed.

diff --git a/kernel/fork.c b/kernel/fork.c
index 17d8a8c85e3b..532ce5cbf851 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2027,7 +2027,7 @@ static __latent_entropy struct task_struct *copy_process(
 
 	retval = -EAGAIN;
 	if (is_ucounts_overlimit(task_ucounts(p), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) {
-		if (p->real_cred->user != INIT_USER &&
+		if (p->real_cred->ucounts != &init_ucounts &&
 		    !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
 			goto bad_fork_cleanup_count;
 	}
diff --git a/kernel/sys.c b/kernel/sys.c
index 97dc9e5d6bf9..7b5d74a7845c 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -490,7 +490,7 @@ static void flag_nproc_exceeded(struct cred *new)
 	 * failure to the execve() stage.
 	 */
 	if (is_ucounts_overlimit(new->ucounts, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)) &&
-			new->user != INIT_USER)
+			new->ucounts != &init_ucounts)
 		current->flags |= PF_NPROC_EXCEEDED;
 	else
 		current->flags &= ~PF_NPROC_EXCEEDED;
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 6b2e3ca7ee99..925fb3579ef3 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -123,6 +123,8 @@ int create_user_ns(struct cred *new)
 		ns->ucount_max[i] = INT_MAX;
 	}
 	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC));
+	if (new->ucounts == &init_ucounts)
+		set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, RLIM_INFINITY);
 	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MSGQUEUE, rlimit(RLIMIT_MSGQUEUE));
 	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_SIGPENDING, rlimit(RLIMIT_SIGPENDING));
 	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MEMLOCK, rlimit(RLIMIT_MEMLOCK));
Eric W. Biederman Feb. 23, 2022, 6 p.m. UTC | #7
[CC'd the security list because I really don't know who the right people
 are to drag into this discussion]
 
While looking at some issues that have cropped up with making it so
that RLIMIT_NPROC cannot be escaped by creating a user namespace I have
stumbled upon a very old issue of how rlimits and suid exec interact
poorly.

This specific saga starts with commit 909cc4ae86f3 ("[PATCH] Fix two
bugs with process limits (RLIMIT_NPROC)") from
https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git which
essentially replaced a capable() check with a an open-coded
implementation of suser(), for RLIMIT_NPROC.

The description from Neil Brown was:

   1/ If a setuid process swaps it's real and effective uids and then forks,
     the fork fails if the new realuid has more processes
     than the original process was limited to.
     This is particularly a problem if a user with a process limit
     (e.g. 256) runs a setuid-root program which does setuid() + fork()
     (e.g. lprng) while root already has more than 256 process (which
     is quite possible).
    
     The root problem here is that a limit which should be a per-user
     limit is being implemented as a per-process limit with
     per-process (e.g. CAP_SYS_RESOURCE) controls.
     Being a per-user limit, it should be that the root-user can over-ride
     it, not just some process with CAP_SYS_RESOURCE.
    
     This patch adds a test to ignore process limits if the real user is root.

The test to see if the real user is root was:
	if (p->real_cred->user != INIT_USER) ...
which persists to this day in fs/fork.c:copy_process().

The practical problem with this test is that it works like nothing else
in the kernel, and so does not look like what it is.  Saying:
	if (!uid_eq(p->real_cred->uid, GLOBAL_ROOT_USER)) ...

would at least be more recognizable.

Really this entire test should be if (!capable(CAP_SYS_RESOURCE) because
CAP_SYS_RESOURCE is the capability that controls if you are allowed to
exceed your rlimits.

Which brings us to the practical issues of how all of these things are
wired together today.

The per-user rlimits are accounted based upon a processes real user, not
the effective user.  All other permission checks are based upon the
effective user.  This has the practical effect that uids are swapped as
above that the processes are charged to root, but use the permissions of
an ordinary user.

The problems get worse when you realize that suid exec does not reset
any of the rlimits except for RLIMIT_STACK.

The rlimits that are particularly affected and are per-user are:
RLIMIT_NPROC, RLIMIT_MSGQUEUE, RLIMIT_SIGPENDING, RLIMIT_MEMLOCK.

But I think failing to reset rlimits during exec has the potential to
effect any suid exec.

Does anyone have any historical knowledge or sense of how this should
work?

Right now it feels like we have coded ourselves into a corner and will
have to risk breaking userspace to get out of it.  AKA I think we need
a policy of reseting rlimits on suid exec, and I think we need to store
global rlimits based upon the effective user not the real user.  Those
changes should allow making capable calls where they belong, and
removing the much too magic user == INIT_USER test for RLIMIT_NPROC.

Eric
Andy Lutomirski Feb. 23, 2022, 7:44 p.m. UTC | #8
On Wed, Feb 23, 2022 at 10:00 AM Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
>
> [CC'd the security list because I really don't know who the right people
>  are to drag into this discussion]
>
> While looking at some issues that have cropped up with making it so
> that RLIMIT_NPROC cannot be escaped by creating a user namespace I have
> stumbled upon a very old issue of how rlimits and suid exec interact
> poorly.

Once upon a time, these resource limits were effectively the only way
to control memory consumption and consumption of historically limited
resources like processes.  (The scheduler used to have serious issues
with too many processes -- this is not so true any more.  And without
cgroups, too many processes could use too much CPU collectively.)
This all worked pretty poorly.  Now we have cgroups, fancy memory
accounting, etc.  So I'm wondering if NPROC is even useful anymore.  I
don't have a brilliant idea of how to deprecate it, but I think it
wouldn't be entirely nuts to take it much less seriously and maybe
even eventually get rid of it.

I doubt there is much existing userspace that would break if a
previously failing fork() started succeeding.

--Andy]
Linus Torvalds Feb. 23, 2022, 7:50 p.m. UTC | #9
On Wed, Feb 23, 2022 at 10:00 AM Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> Which brings us to the practical issues of how all of these things are
> wired together today.

I honestly think you should treat the limits as "approximate".

We do that for a number of reasons:

 - sometimes we have racy tests because we don't want to do excessive
locking just for a limit: nobody cares if you can go a couple of
entries past a limit because you were lucky, it's important that you
can't go *much* past the limit.

 - sometimes the limits themselves are fuzzy (example: time. it's
incremented by "ticks", but it's simply not that precise, and it
depends a bit when the ticks happen)

 - sometimes it's ambiguous who we're talking about.

I think suid execs tend to fall in that third category. Be generous.
If the limit doesn't trigger at the suid exec, nobody cares. You want
to make sure it triggers eventually.

For example, let's say that you are the admin, and you made a mistake,
and you had a runaway fork() bomb that was caught by the limits.

Optimally, you still want to be able to be able to log in (one process
that was root when it did the fork(), and did a 'setresuid()' or
similar to drop the things, and then one process that does 'sudo' to
get privileges to kill the darn fork bomb).

See how that 'user' technically went over the limit, and that was A-OK!

Basic rule: it's better to be too lenient than to be too strict.

                  Linus
Willy Tarreau Feb. 23, 2022, 9:28 p.m. UTC | #10
Hi Andy,

On Wed, Feb 23, 2022 at 11:44:51AM -0800, Andy Lutomirski wrote:
> On Wed, Feb 23, 2022 at 10:00 AM Eric W. Biederman
> <ebiederm@xmission.com> wrote:
> >
> >
> > [CC'd the security list because I really don't know who the right people
> >  are to drag into this discussion]
> >
> > While looking at some issues that have cropped up with making it so
> > that RLIMIT_NPROC cannot be escaped by creating a user namespace I have
> > stumbled upon a very old issue of how rlimits and suid exec interact
> > poorly.
> 
> Once upon a time, these resource limits were effectively the only way
> to control memory consumption and consumption of historically limited
> resources like processes.  (The scheduler used to have serious issues
> with too many processes -- this is not so true any more.  And without
> cgroups, too many processes could use too much CPU collectively.)
> This all worked pretty poorly.  Now we have cgroups, fancy memory
> accounting, etc.  So I'm wondering if NPROC is even useful anymore.  I
> don't have a brilliant idea of how to deprecate it, but I think it
> wouldn't be entirely nuts to take it much less seriously and maybe
> even eventually get rid of it.
> 
> I doubt there is much existing userspace that would break if a
> previously failing fork() started succeeding.

I strongly disagree. I've been using it for a long time as a security
measure. Setting NPROC to 0 after daemonizing remains a particularly
effective and portable method to mitigate the possible consequences of
an in-process intrusion. While I wouldn't care about approximate non-zero
values, for me it would be a significant security regression to drop the
inability to fork() when the limit is zero. Thus at least I do want to
keep that feature when NPROC is zero.

Willy
Eric W. Biederman Feb. 24, 2022, 1:24 a.m. UTC | #11
Linus Torvalds <linus@torvalds.org> writes:

> Basic rule: it's better to be too lenient than to be too strict.

Thank you.  With that guideline I can explore the space of what is
possible.

Question: Running a suid program today charges the activity of that
program to the user who ran that program, not to the user the program
runs as.  Does anyone see a problem with charging the user the program
runs as?

The reason I want to change who is charged with a process (besides it
making more sense in my head) is so that capable(CAP_SYS_RESOURCE) can
be used instead of the magic incantation (cred->user == INIT_USER).

An accidental experiment happened in v5.14-rc1 in July when the ucount
rlimit code was merged.  It was only this last week when after Michal
Koutný discovered the discrepency through code inspect a bug fix was
merged.

This changes the behavior that has existed in some form since Linux v1.0
when per user process limits were added.

The original code in v1.0 looked like:
> static int find_empty_process(void)
> {
>        int free_task;
>        int i, tasks_free;
>        int this_user_tasks;
> 
> repeat:
>        if ((++last_pid) & 0xffff8000)
>                last_pid=1;
>        this_user_tasks = 0;
>        tasks_free = 0;
>        free_task = -EAGAIN;
>        i = NR_TASKS;
>        while (--i > 0) {
>                if (!task[i]) {
>                        free_task = i;
>                        tasks_free++;
>                        continue;
>                }
>                if (task[i]->uid == current->uid)
>                        this_user_tasks++;
>                if (task[i]->pid == last_pid || task[i]->pgrp == last_pid ||
>                    task[i]->session == last_pid)
>                        goto repeat;
>        }
>        if (tasks_free <= MIN_TASKS_LEFT_FOR_ROOT ||
>            this_user_tasks > MAX_TASKS_PER_USER)
>                if (current->uid)
>                        return -EAGAIN;
>        return free_task;
> }

Having tracked the use of real uid in limits back this far my guess
is that it was an accident of the implementation and real uid vs
effective uid had not be considered.

Does anyone know if choosing the real uid was a deliberate decision
anywhere in the history of Linux?



Linus you were talking about making it possible to login as I think a
non-root user to be able to use sudo and kill a fork bomb.

The counter case is apache having a dedicated user for running
cgi-scripts and using RLIMIT_NPROC to limit how many of those processes
can exist.  Unless I am misunderstanding something that looks exactly
like your login as non-root so you can run sudo to kill a fork-bomb.

A comment from an in-process cleanup patch explains this as best I can:
         /*
         * In general rlimits are only enforced when a new resource
         * is acquired.  That would be during fork for RLIMIT_NPROC.
         * That is insufficient for RLIMIT_NPROC as many attributes of
         * a new process must be set between fork and exec.
         *
         * A case where this matter is when apache runs forks a process
         * and calls setuid to run cgi-scripts as a different user.
         * Generating those processes through a code sequence like:
         *
         *      fork()
 	 *	setrlimit(RLIMIT_NPROC, ...)         
         *      execve()  -- suid wrapper
         *      setuid()
         *      execve()  -- cgi script
         *
         * The cgi-scripts are unlikely to fork on their own so unless
         * RLIMIT_NPROC is checked after the user change and before
         * the cgi-script starts, RLIMIT_NPROC simply will not be enforced
         * for the cgi-scripts.
         *
         * So the code tracks if between fork and exec if an operation
         * occurs that could cause the RLIMIT_NPROC check to fail.  If
         * such an operation has happened re-check RLIMIT_NPROC.
         */


Answered-Question: I was trying to ask if anyone knows of a reason why
we can't just sanitize the rlimits of the process during suid exec?
Linus your guideline would appear to allow that behavior.  Unfortunately
that looks like it would break current usage of apache suexec.

Eric
Linus Torvalds Feb. 24, 2022, 1:41 a.m. UTC | #12
On Wed, Feb 23, 2022 at 5:24 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Question: Running a suid program today charges the activity of that
> program to the user who ran that program, not to the user the program
> runs as.  Does anyone see a problem with charging the user the program
> runs as?

So I think that there's actually two independent issues with limits
when you have situations like this where the actual user might be
ambiguous.

 - the "who to charge" question

 - the "how do we *check* the limit" question

and honestly, I think that when it comes to suid binaries, the first
question is fundamentally ambiguous, because it almost certainly
depends on the user.

Which to me implies that there probably isn't an answer that is always
right, and that what you should look at is that second option.

So I would actually suggest that the "execute a suid binary" should
charge the real user, but *because* it is suid, it should then not
check the limit (or, perhaps, should check the hard limit?).

You have to charge somebody, but at that point it's a bit ambiguous
whether it should be allowed.

Exactly so that if you're over a process limit (or something similar -
think "too many files open" or whatever because you screwed up and
opened everything) you could still log in as yourself (ssh/login
charges some admin thing, which probably has high limits or is
unlimited), and hopefully get shell access, and then be able to "exec
sudo" to actually get admin access that should be disabled from the
network.

The above is just one (traditional) example of a fork/open bomb case
where a user isn't really able to no longer function as himself, but
wants to fix things (maybe the user has another terminal open, but
then he can hopefully use a shell-buiiltin 'kill' instead).

And I'm not saying it's "the thing that needs to work". I'm more
making up an example.

So I'm only saying that the above actually has two examples to the two
sides of the coin: "login" lowering privileges to a user that may be
over some limit - and succeeding despite that - and 'suid' succeeding
despite the original user perhaps being over-committed.

So it's intended exactly as an example of "picking the new or the old
user would be wrong in either case if you check limits at the
transition point".

Hmm?

            Linus
Kees Cook Feb. 24, 2022, 4:28 p.m. UTC | #13
typo: Subject's LimigtNPROC -> LimitNPROC

On Thu, Feb 24, 2022 at 09:41:44AM -0600, Eric W. Biederman wrote:
> 
> Long story short recursively enforcing RLIMIT_NPROC when it is not
> enforced on the process that creates a new user namespace, causes
> currently working code to fail.  There is no reason to enforce
> RLIMIT_NPROC recursively when we don't enforce it normally so update
> the code to detect this case.
> 
> I would like to simply use capable(CAP_SYS_RESOURCE) to detect when
> RLIMIT_NPROC is not enforced upon the caller.  Unfortunately because
> RLIMIT_NPROC is charged and checked for enforcement based upon the
> real uid, using capable() wich is euid based is inconsistent with reality.

typo: wich -> which

> Come as close as possible to testing for capable(CAP_SYS_RESOURCE) by
> testing for when the real uid would match the conditions when
> CAP_SYS_RESOURCE would be present if the real uid was the effective
> uid.
> 
> Reported-by: Etienne Dechamps <etienne@edechamps.fr>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=215596
> Link: https://lkml.kernel.org/r/e9589141-cfeb-90cd-2d0e-83a62787239a@edechamps.fr
> Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
> 
> The previous conversation has given me enough clarity that I can see
> which tests I am comfortable with use for this pending regression fix.
> 
> I have tested this and it works for me.  Does anyone have any concerns
> with this change?

I'd really love some kind of selftest that exercises the edge cases; do
you have your tests in some form that could be converted?

But otherwise, yes, this looks like the best option here.

Reviewed-by: Kees Cook <keescook@chromium.org>

> 
>  kernel/user_namespace.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 6b2e3ca7ee99..5481ba44a8d6 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -58,6 +58,18 @@ static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns)
>  	cred->user_ns = user_ns;
>  }
>  
> +static unsigned long enforced_nproc_rlimit(void)
> +{
> +	unsigned long limit = RLIM_INFINITY;
> +
> +	/* Is RLIMIT_NPROC currently enforced? */
> +	if (!uid_eq(current_uid(), GLOBAL_ROOT_UID) ||
> +	    (current_user_ns() != &init_user_ns))
> +		limit = rlimit(RLIMIT_NPROC);
> +
> +	return limit;
> +}
> +
>  /*
>   * Create a new user namespace, deriving the creator from the user in the
>   * passed credentials, and replacing that user with the new root user for the
> @@ -122,7 +134,7 @@ int create_user_ns(struct cred *new)
>  	for (i = 0; i < MAX_PER_NAMESPACE_UCOUNTS; i++) {
>  		ns->ucount_max[i] = INT_MAX;
>  	}
> -	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC));
> +	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, enforced_nproc_rlimit());
>  	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MSGQUEUE, rlimit(RLIMIT_MSGQUEUE));
>  	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_SIGPENDING, rlimit(RLIMIT_SIGPENDING));
>  	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MEMLOCK, rlimit(RLIMIT_MEMLOCK));
> -- 
> 2.29.2
>
Michal Koutný Feb. 24, 2022, 6:53 p.m. UTC | #14
On Thu, Feb 24, 2022 at 08:28:41AM -0800, Kees Cook <keescook@chromium.org> wrote:
> I'd really love some kind of selftest that exercises the edge cases; do
> you have your tests in some form that could be converted?

There's the original
tools/testing/selftests/rlimits/rlimits-per-userns.c selftest.

I've been rewriting it to cover more situations, I'm sending it as one
monster patch (I'd need spend more time reordering my commits into some
logical patch order) if anyone wishes to try it.

I've tried it on 5c1ee569660d4a205dced9cb4d0306b907fb7599 + this Eric's
patch.

The test rlimit-per-userns-root passes
- together with that I claim this patch

Reviewed-by: Michal Koutný <mkoutny@suse.com>

The test rlimit-per-userns-nonroot fails. It's similar off-by-one
mistake as was in the fork path, but it's in the do_execveat_common():

        if ((current->flags & PF_NPROC_EXCEEDED) &&
            is_ucounts_overlimit(current_ucounts(), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) {
                retval = -EAGAIN;
                goto out_ret;
        }

(If RLIMIT_NPROC should be strictly honored, setuid+execve should fail
when given uid's ucount is at the limit already.)

Funnily, the original
tools/testing/selftests/rlimits/rlimits-per-userns.c passes thanks to
the off-by-one check even though it should not pass because unshare(2)
is called after setuid(2).

Michal

-- >8 --
Eric W. Biederman Feb. 25, 2022, 12:29 a.m. UTC | #15
Kees Cook <keescook@chromium.org> writes:

> typo: Subject's LimigtNPROC -> LimitNPROC
>
> On Thu, Feb 24, 2022 at 09:41:44AM -0600, Eric W. Biederman wrote:
>> 
>> Long story short recursively enforcing RLIMIT_NPROC when it is not
>> enforced on the process that creates a new user namespace, causes
>> currently working code to fail.  There is no reason to enforce
>> RLIMIT_NPROC recursively when we don't enforce it normally so update
>> the code to detect this case.
>> 
>> I would like to simply use capable(CAP_SYS_RESOURCE) to detect when
>> RLIMIT_NPROC is not enforced upon the caller.  Unfortunately because
>> RLIMIT_NPROC is charged and checked for enforcement based upon the
>> real uid, using capable() wich is euid based is inconsistent with reality.
>
> typo: wich -> which

Ahh... Typos.

>> Come as close as possible to testing for capable(CAP_SYS_RESOURCE) by
>> testing for when the real uid would match the conditions when
>> CAP_SYS_RESOURCE would be present if the real uid was the effective
>> uid.
>> 
>> Reported-by: Etienne Dechamps <etienne@edechamps.fr>
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=215596
>> Link: https://lkml.kernel.org/r/e9589141-cfeb-90cd-2d0e-83a62787239a@edechamps.fr
>> Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>> 
>> The previous conversation has given me enough clarity that I can see
>> which tests I am comfortable with use for this pending regression fix.
>> 
>> I have tested this and it works for me.  Does anyone have any concerns
>> with this change?
>
> I'd really love some kind of selftest that exercises the edge cases; do
> you have your tests in some form that could be converted?
>
> But otherwise, yes, this looks like the best option here.

Let's start with Michal Koutný tests.  I keep forgetting to look at
them.  This cold has really been kicking my butt.

For this issue the test case was a systemd unit file.  Which is simple
and demonstrates the real-world regression but not really minimal in the
way a kernel selftest should be.

> Reviewed-by: Kees Cook <keescook@chromium.org>
>
>> 
>>  kernel/user_namespace.c | 14 +++++++++++++-
>>  1 file changed, 13 insertions(+), 1 deletion(-)
>> 
>> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
>> index 6b2e3ca7ee99..5481ba44a8d6 100644
>> --- a/kernel/user_namespace.c
>> +++ b/kernel/user_namespace.c
>> @@ -58,6 +58,18 @@ static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns)
>>  	cred->user_ns = user_ns;
>>  }
>>  
>> +static unsigned long enforced_nproc_rlimit(void)
>> +{
>> +	unsigned long limit = RLIM_INFINITY;
>> +
>> +	/* Is RLIMIT_NPROC currently enforced? */
>> +	if (!uid_eq(current_uid(), GLOBAL_ROOT_UID) ||
>> +	    (current_user_ns() != &init_user_ns))
>> +		limit = rlimit(RLIMIT_NPROC);
>> +
>> +	return limit;
>> +}
>> +
>>  /*
>>   * Create a new user namespace, deriving the creator from the user in the
>>   * passed credentials, and replacing that user with the new root user for the
>> @@ -122,7 +134,7 @@ int create_user_ns(struct cred *new)
>>  	for (i = 0; i < MAX_PER_NAMESPACE_UCOUNTS; i++) {
>>  		ns->ucount_max[i] = INT_MAX;
>>  	}
>> -	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC));
>> +	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, enforced_nproc_rlimit());
>>  	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MSGQUEUE, rlimit(RLIMIT_MSGQUEUE));
>>  	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_SIGPENDING, rlimit(RLIMIT_SIGPENDING));
>>  	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MEMLOCK, rlimit(RLIMIT_MEMLOCK));
>> -- 
>> 2.29.2
>> 

Eric