mbox series

[v7,0/6] cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus

Message ID 20210825213750.6933-1-longman@redhat.com
Headers show
Series cgroup/cpuset: Add new cpuset partition type & empty effecitve cpus | expand

Message

Waiman Long Aug. 25, 2021, 9:37 p.m. UTC
v7:
 - Simplify the documentation patch (patch 5) as suggested by Tejun.
 - Fix a typo in patch 2 and improper commit log in patch 3.

v6:
 - Remove duplicated tmpmask from update_prstate() which should fix the
   frame size too large problem reported by kernel test robot.

v5:
 - Rebased to the latest for-5.15 branch of cgroup git tree and drop the
   1st v4 patch as it has been merged.
 - Update patch 1 to always allow changing partition root back to member
   even if it invalidates child partitions undeneath it.
 - Adjust the empty effective cpu partition patch to not allow 0 effective
   cpu for terminal partition which will make it invalid).
 - Add a new patch to enable reading of cpuset.cpus.partition to display
   the reason that causes invalid partition.
 - Adjust the documentation and testing patch accordingly.

This patchset makes four enhancements to the cpuset v2 code.

 Patch 1: Properly handle partition root tree and make partition
 invalid in case changes to cpuset.cpus violate any of the partition
 root constraints.

 Patch 2: Enable the "cpuset.cpus.partition" file to show the reason
 that causes invalid partition like "root invalid (No cpu available
 due to hotplug)".

 Patch 3: Add a new partition state "isolated" to create a partition
 root without load balancing. This is for handling intermitten workloads
 that have a strict low latency requirement.

 Patch 4: Allow partition roots that are not the top cpuset to distribute
 all its cpus to child partitions as long as there is no task associated
 with that partition root. This allows more flexibility for middleware
 to manage multiple partitions.

Patch 5 updates the cgroup-v2.rst file accordingly. Patch 6 adds a new
cpuset test to test the new cpuset partition code.


Waiman Long (6):
  cgroup/cpuset: Properly transition to invalid partition
  cgroup/cpuset: Show invalid partition reason string
  cgroup/cpuset: Add a new isolated cpus.partition type
  cgroup/cpuset: Allow non-top parent partition to distribute out all
    CPUs
  cgroup/cpuset: Update description of cpuset.cpus.partition in
    cgroup-v2.rst
  kselftest/cgroup: Add cpuset v2 partition root state test

 Documentation/admin-guide/cgroup-v2.rst       | 112 +--
 kernel/cgroup/cpuset.c                        | 337 ++++++---
 tools/testing/selftests/cgroup/Makefile       |   5 +-
 .../selftests/cgroup/test_cpuset_prs.sh       | 663 ++++++++++++++++++
 tools/testing/selftests/cgroup/wait_inotify.c |  86 +++
 5 files changed, 1050 insertions(+), 153 deletions(-)
 create mode 100755 tools/testing/selftests/cgroup/test_cpuset_prs.sh
 create mode 100644 tools/testing/selftests/cgroup/wait_inotify.c

Comments

Tejun Heo Aug. 26, 2021, 5:35 p.m. UTC | #1
Hello, Waiman.

Let's stop iterating on the patchset until we reach a consensus.

On Wed, Aug 25, 2021 at 05:37:49PM -0400, Waiman Long wrote:
>  	1) The "cpuset.cpus" is not empty and the list of CPUs are

>  	   exclusive, i.e. they are not shared by any of its siblings.


Part of it can be reached by cpus going offline.

>  	2) The parent cgroup is a partition root.


This condition can happen if a parent stop being a partition.

> -	3) The "cpuset.cpus" is also a proper subset of the parent's

> +	3) The "cpuset.cpus" is a subset of the parent's

>  	   "cpuset.cpus.effective".


This can happen if cpus go offline.

>  	4) There is no child cgroups with cpuset enabled.  This is for

>  	   eliminating corner cases that have to be handled if such a

>  	   condition is allowed.


This may make sense as a short cut for us but doesn't really stem from
interface or behavior requirements.

Of the four conditions listed, two are bogus (the states can be
reached through a different path and the configuration success or
failure can be timing dependent if configuration racaes against cpu
hotplug operations) and one maybe makes sense half-way and one is more
of a shortcut.

Can't we just replace these with transitions to invalid state with
proper explanation? That'd get rid of the error handling duplications
from both the kernel and user side, make automated configurations
which may race against hot plug operations reliable, and consistently
provide users with why something failed.

Thank you.

-- 
tejun
Waiman Long Aug. 27, 2021, 3:01 a.m. UTC | #2
On 8/26/21 1:35 PM, Tejun Heo wrote:
> Hello, Waiman.

>

> Let's stop iterating on the patchset until we reach a consensus.

>

> On Wed, Aug 25, 2021 at 05:37:49PM -0400, Waiman Long wrote:

>>   	1) The "cpuset.cpus" is not empty and the list of CPUs are

>>   	   exclusive, i.e. they are not shared by any of its siblings.

> Part of it can be reached by cpus going offline.

>

>>   	2) The parent cgroup is a partition root.

> This condition can happen if a parent stop being a partition.

>

>> -	3) The "cpuset.cpus" is also a proper subset of the parent's

>> +	3) The "cpuset.cpus" is a subset of the parent's

>>   	   "cpuset.cpus.effective".

> This can happen if cpus go offline.

>

>>   	4) There is no child cgroups with cpuset enabled.  This is for

>>   	   eliminating corner cases that have to be handled if such a

>>   	   condition is allowed.

> This may make sense as a short cut for us but doesn't really stem from

> interface or behavior requirements.

>

> Of the four conditions listed, two are bogus (the states can be

> reached through a different path and the configuration success or

> failure can be timing dependent if configuration racaes against cpu

> hotplug operations) and one maybe makes sense half-way and one is more

> of a shortcut.

>

> Can't we just replace these with transitions to invalid state with

> proper explanation? That'd get rid of the error handling duplications

> from both the kernel and user side, make automated configurations

> which may race against hot plug operations reliable, and consistently

> provide users with why something failed.


What I am doing here is setting a high bar for transitioning from member 
to either "root" or "isolated". Once it becomes a partition, there are 
multiple ways that can make it invalid. I am fine with that. However, I 
am not sure it is a good idea to allow users to echo "root" to 
cpuset.cpus.partition anywhere in the cgroup hierarchy and require them 
to read it back to see if it succeed.

All the checking are done with cpuset_rwsem held. So there shouldn't be 
any racing. Of course, a hotplug can immediately follow and make the 
partition invalid.

Cheers,
Longman
Tejun Heo Aug. 27, 2021, 4 a.m. UTC | #3
Hello,

On Thu, Aug 26, 2021 at 11:01:30PM -0400, Waiman Long wrote:
> What I am doing here is setting a high bar for transitioning from member to

> either "root" or "isolated". Once it becomes a partition, there are multiple

> ways that can make it invalid. I am fine with that. However, I am not sure

> it is a good idea to allow users to echo "root" to cpuset.cpus.partition

> anywhere in the cgroup hierarchy and require them to read it back to see if

> it succeed.


The problem is that the "high" bar is rather arbitrary. It might feel like a
good idea to some but not to others. There are no clear technical reasons or
principles for rules to be set this particular way.

> All the checking are done with cpuset_rwsem held. So there shouldn't be any

> racing. Of course, a hotplug can immediately follow and make the partition

> invalid.


Imagine a system which dynamically on/offlines its cpus based on load or
whatever and also configures partitions for cases where the needed cpus are
online. If the partitions are set up while the cpus are online, it'd work as
expected - partitions are in effect when the system can support them and
ignored otherwise. However, if the partition configuration is attempted
while the cpus happen to be offline, the configuration will fail, and there
is no guaranteed way to make that configuration stick short of disabling
hotplug operations. This is a pretty jarring brekage happening exactly
because the behavior is an inconsistent amalgam.

It's usually not a good sign if interface restrictions can be added or
removed because how one feels without clear functional reasons and often
indicates that there's something broken, which seems to be the case here
too.

Thanks.

-- 
tejun
Waiman Long Aug. 27, 2021, 9:19 p.m. UTC | #4
On 8/27/21 12:00 AM, Tejun Heo wrote:
> Hello,

>

> On Thu, Aug 26, 2021 at 11:01:30PM -0400, Waiman Long wrote:

>> What I am doing here is setting a high bar for transitioning from member to

>> either "root" or "isolated". Once it becomes a partition, there are multiple

>> ways that can make it invalid. I am fine with that. However, I am not sure

>> it is a good idea to allow users to echo "root" to cpuset.cpus.partition

>> anywhere in the cgroup hierarchy and require them to read it back to see if

>> it succeed.

> The problem is that the "high" bar is rather arbitrary. It might feel like a

> good idea to some but not to others. There are no clear technical reasons or

> principles for rules to be set this particular way.

>

>> All the checking are done with cpuset_rwsem held. So there shouldn't be any

>> racing. Of course, a hotplug can immediately follow and make the partition

>> invalid.

> Imagine a system which dynamically on/offlines its cpus based on load or

> whatever and also configures partitions for cases where the needed cpus are

> online. If the partitions are set up while the cpus are online, it'd work as

> expected - partitions are in effect when the system can support them and

> ignored otherwise. However, if the partition configuration is attempted

> while the cpus happen to be offline, the configuration will fail, and there

> is no guaranteed way to make that configuration stick short of disabling

> hotplug operations. This is a pretty jarring brekage happening exactly

> because the behavior is an inconsistent amalgam.

>

> It's usually not a good sign if interface restrictions can be added or

> removed because how one feels without clear functional reasons and often

> indicates that there's something broken, which seems to be the case here

> too.


Well, that is a valid point. The cpus may have been offlined when a 
partition is being created. I can certainly relent on this check in 
forming a partition. IOW, cpus_allowed can contain some or all offline 
cpus and a valid (some are online) or invalid (all are offline) 
partition can be formed. I can also allow an invalid child partition to 
be formed with an invalid parent partition. However, the cpu exclusivity 
rules will still apply.

Other than that, do you envision any other circumstances where we should 
allow an invalid partition to be formed?

Cheers,
Longman
Tejun Heo Aug. 27, 2021, 9:27 p.m. UTC | #5
Hello,

On Fri, Aug 27, 2021 at 05:19:31PM -0400, Waiman Long wrote:
> Well, that is a valid point. The cpus may have been offlined when a

> partition is being created. I can certainly relent on this check in forming

> a partition. IOW, cpus_allowed can contain some or all offline cpus and a

> valid (some are online) or invalid (all are offline) partition can be

> formed. I can also allow an invalid child partition to be formed with an

> invalid parent partition. However, the cpu exclusivity rules will still

> apply.

> 

> Other than that, do you envision any other circumstances where we should

> allow an invalid partition to be formed?


Now that most restrictions are removed from configuration side, just go all
the way? Given that the user must check the status afterwards anyway, I
don't see technical or even usability reasons for leaving some pieces
behind. Going all the way would be easier to use too - bang in the target
config and read the resulting state to reliably find out why a partition
isn't valid, especially if we list *all* the reasons so that the user can
tell whether the configuration is as intended immediately.

Thanks.

-- 
tejun
Waiman Long Aug. 27, 2021, 10:50 p.m. UTC | #6
On 8/27/21 5:27 PM, Tejun Heo wrote:
> Hello,

>

> On Fri, Aug 27, 2021 at 05:19:31PM -0400, Waiman Long wrote:

>> Well, that is a valid point. The cpus may have been offlined when a

>> partition is being created. I can certainly relent on this check in forming

>> a partition. IOW, cpus_allowed can contain some or all offline cpus and a

>> valid (some are online) or invalid (all are offline) partition can be

>> formed. I can also allow an invalid child partition to be formed with an

>> invalid parent partition. However, the cpu exclusivity rules will still

>> apply.

>>

>> Other than that, do you envision any other circumstances where we should

>> allow an invalid partition to be formed?

> Now that most restrictions are removed from configuration side, just go all

> the way? Given that the user must check the status afterwards anyway, I

> don't see technical or even usability reasons for leaving some pieces

> behind. Going all the way would be easier to use too - bang in the target

> config and read the resulting state to reliably find out why a partition

> isn't valid, especially if we list *all* the reasons so that the user

> tell whether the configuration is as intended immediately.


The cpu exclusivity rule is due to the setting of CPU_EXCLUSIVE bit. 
This is a pre-existing condition unless you want to change how the 
cpuset.cpu_exclusive works.

So the new rules will be:

1) The "cpuset.cpus" is not empty and the list of CPUs are exclusive.
2) The parent cgroup is a partition root (can be an invalid one).
3) The "cpuset.cpus" is a subset of the parent's cpuset.cpus.allowed.
4) No child cgroup with cpuset enabled.

I think they are reasonable. What do you think?

Cheers,
Longman
Tejun Heo Aug. 27, 2021, 11:35 p.m. UTC | #7
Hello,

On Fri, Aug 27, 2021 at 06:50:10PM -0400, Waiman Long wrote:
> The cpu exclusivity rule is due to the setting of CPU_EXCLUSIVE bit. This is

> a pre-existing condition unless you want to change how the

> cpuset.cpu_exclusive works.

>

> So the new rules will be:

> 

> 1) The "cpuset.cpus" is not empty and the list of CPUs are exclusive.


Empty cpu list can be considered an exclusive one.

> 2) The parent cgroup is a partition root (can be an invalid one).


Does this mean a partition parent can't stop being a partition if one or
more of its children become partitions? If so, it violates the rule that a
descendant shouldn't be able to restrict what its ancestors can do.

> 3) The "cpuset.cpus" is a subset of the parent's cpuset.cpus.allowed.


Why not just go by effective? This would mean that a parent can't withdraw
CPUs from its allowed set once descendants are configured. Restrictions like
this are fine when the entire hierarchy is configured by a single entity but
become awkward when configurations are multi-tiered, automated and dynamic.

> 4) No child cgroup with cpuset enabled.


idk, maybe? I'm having a hard time seeing the point in adding these
restrictions when the state transitions are asynchronous anyway. Would it
help if we try to separate what's absoluately and technically necessary and
what seems reasonable or high bar and try to justify why each of the latter
should be added?

Thanks.

-- 
tejun
Waiman Long Aug. 28, 2021, 1:14 a.m. UTC | #8
On 8/27/21 7:35 PM, Tejun Heo wrote:
> Hello,

>

> On Fri, Aug 27, 2021 at 06:50:10PM -0400, Waiman Long wrote:

>> The cpu exclusivity rule is due to the setting of CPU_EXCLUSIVE bit. This is

>> a pre-existing condition unless you want to change how the

>> cpuset.cpu_exclusive works.

>>

>> So the new rules will be:

>>

>> 1) The "cpuset.cpus" is not empty and the list of CPUs are exclusive.

> Empty cpu list can be considered an exclusive one.

It doesn't make sense to me to have a partition with no cpu configured 
at all. I very much prefer the users to set cpuset.cpus first before 
turning it into a partition.
>

>> 2) The parent cgroup is a partition root (can be an invalid one).

> Does this mean a partition parent can't stop being a partition if one or

> more of its children become partitions? If so, it violates the rule that a

> descendant shouldn't be able to restrict what its ancestors can do.


No. As I said in the documentation, transitioning from partition root to 
member is allowed. Against, it is illogical to allow a cpuset to become 
a potential partition if it parent is not even a partition root at all. 
In the case that the parent is reverted back to a member, the child 
partitions will stay invalid forever unless the parent become a valid 
partition again.

>

>> 3) The "cpuset.cpus" is a subset of the parent's cpuset.cpus.allowed.

> Why not just go by effective? This would mean that a parent can't withdraw

> CPUs from its allowed set once descendants are configured. Restrictions like

> this are fine when the entire hierarchy is configured by a single entity but

> become awkward when configurations are multi-tiered, automated and dynamic.


The original rule is to be based on effective cpus. However, to properly 
handle the case of allowing offlined cpus to be included in the 
partition, I have to change it to cpu_allowed instead. I can certainly 
change it back to effective if you prefer.

>

>> 4) No child cgroup with cpuset enabled.

> idk, maybe? I'm having a hard time seeing the point in adding these

> restrictions when the state transitions are asynchronous anyway. Would it

> help if we try to separate what's absoluately and technically necessary and

> what seems reasonable or high bar and try to justify why each of the latter

> should be added?


This rule is there mainly for ease of implementation. Otherwise, I need 
to add additional code to handle the conversion of child cpusets which 
can be rather complex and require a lot more debugging. This rule will 
no longer apply once the cpuset becomes a partition root.

Cheers,
Longman
Michal Koutný Aug. 30, 2021, 5:59 p.m. UTC | #9
Hello.

On Fri, Aug 27, 2021 at 06:50:10PM -0400, Waiman Long <llong@redhat.com> wrote:
> So the new rules will be:


When I followed the thread, it seemed to me you're talking past each
other a bit. I'd suggest the following terminology:

- config space: what's written by the user and saved,

- reality space: what's currently available (primarily subject to
  on-/offlinng but I think it'd be helpful to consider here also what's
  given by the parent),

- effect space: what's actually possible and happening.

Not all elements of config_space x reality_space (Cartesian product) can
be represented in the effect_space (e.g. root partition with no
(effective) cpus).

IIUC, Waiman's "high bar" is supposed to be defined over transitions in
the config_space. However, there can be independent changes in the
reality_space so the rules should be actually formulated in the
effect_space:

The conditions for being a valid partition root rewritten into the effect
space:

> 1) The "cpuset.cpus" is not empty and the list of CPUs are exclusive.

- effective CPUs are non-empty and exclusive wrt siblings
- (E.g. setting empty cpuset.cpus might be possible but it invalidates
  the partition root, same as offlining or removal by an ancestor.)

> 2) The parent cgroup is a partition root (can be an invalid one).

- parent cgroup is a (valid) partition
- (Being valid partition means owning "stolen" cpus from the parent, if
  the parent is not valid partition itself, you can't steal what is not
  owned.)
- (And I think it's OK that: "the child partitions will stay invalid
  forever unless the parent become a valid partition again" [1].)

> 3) The "cpuset.cpus" is a subset of the parent's cpuset.cpus.allowed.

- I'm not sure what is the use of this condition (together with the
  rewrite of the 1st condition which covers effective cpus). I think it
  would make sense if being a valid parition root guaranteed that all
  configured cpuset.cpus will be available, however, that's not the case
  IIUC (e.g. due to offlining).

> 4) No child cgroup with cpuset enabled.

- A child cgroup with cpuset enabled is OK in the effect space
  (achievable by switching first and creating children later).
- For technical reasons this may be a condition on the transitions in
  the config_space.

Generally, most config changes should succeed and user should check (or
watch) how they landed in combination with the reality_space.

Regards,
Michal

[1] This follows the general model where ancestors can "preempt"
resources from their subtree.