Message ID | 20240223143833.1509961-1-guanyulin@google.com |
---|---|
State | New |
Headers | show |
Series | [v3] PM / core: conditionally skip system pm in device/driver model | expand |
On Fri, Feb 23, 2024 at 02:38:29PM +0000, Guan-Yu Lin wrote: > In systems with a main processor and a co-processor, asynchronous > controller management can lead to conflicts. One example is the main > processor attempting to suspend a device while the co-processor is > actively using it. To address this, we introduce a new sysfs entry > called "conditional_skip". This entry allows the system to selectively > skip certain device power management state transitions. To use this > feature, set the value in "conditional_skip" to indicate the type of > state transition you want to avoid. Please review /Documentation/ABI/ > testing/sysfs-devices-power for more detailed information. ... > +static ssize_t conditional_skip_store(struct device *dev, > + struct device_attribute *attr, > + const char *buf, size_t n) > +{ > + int ret; > + if (kstrtoint(buf, 0, &ret)) Why is it int? It seems like flags, should not be unsigned as u32 or so? > + return -EINVAL; Do not shadow the real error code without justification. > + ret &= (PM_EVENT_FREEZE|PM_EVENT_SUSPEND|PM_EVENT_HIBERNATE); > + > + dev->power.conditional_skip_pm = ret; > + > + return n; > +} > + Redundant blank line. > +static DEVICE_ATTR_RW(conditional_skip);
On Fri, Feb 23, 2024 at 3:38 PM Guan-Yu Lin <guanyulin@google.com> wrote: > > In systems with a main processor and a co-processor, asynchronous > controller management can lead to conflicts. One example is the main > processor attempting to suspend a device while the co-processor is > actively using it. To address this, we introduce a new sysfs entry > called "conditional_skip". This entry allows the system to selectively > skip certain device power management state transitions. To use this > feature, set the value in "conditional_skip" to indicate the type of > state transition you want to avoid. Please review /Documentation/ABI/ > testing/sysfs-devices-power for more detailed information. > > Signed-off-by: Guan-Yu Lin <guanyulin@google.com> Please explain how this is intended to work. That is, what exactly you expect to happen when the new attribute is set.
On Fri, Feb 23, 2024 at 11:18 PM Andy Shevchenko <andriy.shevchenko@linux.intel.com> wrote: > > On Fri, Feb 23, 2024 at 02:38:29PM +0000, Guan-Yu Lin wrote: > > In systems with a main processor and a co-processor, asynchronous > > controller management can lead to conflicts. One example is the main > > processor attempting to suspend a device while the co-processor is > > actively using it. To address this, we introduce a new sysfs entry > > called "conditional_skip". This entry allows the system to selectively > > skip certain device power management state transitions. To use this > > feature, set the value in "conditional_skip" to indicate the type of > > state transition you want to avoid. Please review /Documentation/ABI/ > > testing/sysfs-devices-power for more detailed information. > > ... > > > +static ssize_t conditional_skip_store(struct device *dev, > > + struct device_attribute *attr, > > + const char *buf, size_t n) > > +{ > > > + int ret; > > > + if (kstrtoint(buf, 0, &ret)) > > Why is it int? It seems like flags, should not be unsigned as u32 or so? > The ".event" member in struct pm_message is an int, but the values assigned to it are used like bit flags (e.g. PM_EVENT_FREEZE=0x1, PM_EVENT_SUSPEND=0x2, PM_EVENT_HIBERNATE=0x4). Is this an intentional design choice? We might need to change the design accordingly. > > + return -EINVAL; > > Do not shadow the real error code without justification. > Thanks for suggesting the desired implementation. I'll refactor it in the next version. > > + ret &= (PM_EVENT_FREEZE|PM_EVENT_SUSPEND|PM_EVENT_HIBERNATE); > > + > > + dev->power.conditional_skip_pm = ret; > > + > > + return n; > > +} > > > + > > Redundant blank line. > Thanks for the heads-up. > > +static DEVICE_ATTR_RW(conditional_skip); > > -- > With Best Regards, > Andy Shevchenko > >
On Sat, Feb 24, 2024 at 1:44 AM Rafael J. Wysocki <rafael@kernel.org> wrote: > > On Fri, Feb 23, 2024 at 3:38 PM Guan-Yu Lin <guanyulin@google.com> wrote: > > > > In systems with a main processor and a co-processor, asynchronous > > controller management can lead to conflicts. One example is the main > > processor attempting to suspend a device while the co-processor is > > actively using it. To address this, we introduce a new sysfs entry > > called "conditional_skip". This entry allows the system to selectively > > skip certain device power management state transitions. To use this > > feature, set the value in "conditional_skip" to indicate the type of > > state transition you want to avoid. Please review /Documentation/ABI/ > > testing/sysfs-devices-power for more detailed information. > > > > Signed-off-by: Guan-Yu Lin <guanyulin@google.com> > > Please explain how this is intended to work. That is, what exactly > you expect to happen when the new attribute is set. The sysfs entry 'conditional_skip' for a device indicates which power transitions (e.g. PM_EVENT_SUSPEND) are omitted within the system power management flow. During the processing of an identified power transition, the device's power.entry will not be added to the dpm_prepared_list, preventing the device from undergoing that transition. As 'conditional_skip' is modifiable at runtime, a device's participation in system power management can be dynamically enabled or disabled.
On Sat, Feb 24, 2024 at 2:20 AM Florian Fainelli <f.fainelli@gmail.com> wrote: > > On 2/23/24 06:38, Guan-Yu Lin wrote: > > In systems with a main processor and a co-processor, asynchronous > > controller management can lead to conflicts. One example is the main > > processor attempting to suspend a device while the co-processor is > > actively using it. To address this, we introduce a new sysfs entry > > called "conditional_skip". This entry allows the system to selectively > > skip certain device power management state transitions. To use this > > feature, set the value in "conditional_skip" to indicate the type of > > state transition you want to avoid. Please review /Documentation/ABI/ > > testing/sysfs-devices-power for more detailed information. > > This looks like a poor way of dealing with a lack of adequate resource > tracking from Linux on behalf of the co-processor(s) and I really do not > understand how someone is supposed to use that in a way that works. > > Cannot you use a HW maintained spinlock between your host processor and > the co-processor such that they can each claim exclusive access to the > hardware and you can busy-wait until one or the other is done using the > device? How is your partitioning between host processor owned blocks and > co-processor(s) owned blocks? Is it static or is it dynamic? > -- > Florian > This patch enables devices to selectively participate in system power transitions. This is crucial when multiple processors, managed by different operating system kernels, share the same controller. One processor shouldn't enforce the same power transition procedures on the controller – another processor might be using it at that moment. While a spinlock is necessary for synchronizing controller access, we still need to add the flexibility to dynamically customize power transition behavior for each device. And that's what this patch is trying to do. In our use case, the host processor and co-processor are managed by separate operating system kernels. This arrangement is static.
On Mon, Feb 26, 2024 at 05:15:00PM +0800, Guan-Yu Lin wrote: > On Fri, Feb 23, 2024 at 11:18 PM Andy Shevchenko > <andriy.shevchenko@linux.intel.com> wrote: > > On Fri, Feb 23, 2024 at 02:38:29PM +0000, Guan-Yu Lin wrote: ... > > > + if (kstrtoint(buf, 0, &ret)) > > > > Why is it int? It seems like flags, should not be unsigned as u32 or so? > > The ".event" member in struct pm_message is an int, but the values > assigned to it are used like bit flags (e.g. PM_EVENT_FREEZE=0x1, > PM_EVENT_SUSPEND=0x2, PM_EVENT_HIBERNATE=0x4). Is this an intentional > design choice? We might need to change the design accordingly. It might give a subtle errors related to promoted signdness.
On 2/26/24 02:28, Guan-Yu Lin wrote: > On Sat, Feb 24, 2024 at 2:20 AM Florian Fainelli <f.fainelli@gmail.com> wrote: >> >> On 2/23/24 06:38, Guan-Yu Lin wrote: >>> In systems with a main processor and a co-processor, asynchronous >>> controller management can lead to conflicts. One example is the main >>> processor attempting to suspend a device while the co-processor is >>> actively using it. To address this, we introduce a new sysfs entry >>> called "conditional_skip". This entry allows the system to selectively >>> skip certain device power management state transitions. To use this >>> feature, set the value in "conditional_skip" to indicate the type of >>> state transition you want to avoid. Please review /Documentation/ABI/ >>> testing/sysfs-devices-power for more detailed information. >> >> This looks like a poor way of dealing with a lack of adequate resource >> tracking from Linux on behalf of the co-processor(s) and I really do not >> understand how someone is supposed to use that in a way that works. >> >> Cannot you use a HW maintained spinlock between your host processor and >> the co-processor such that they can each claim exclusive access to the >> hardware and you can busy-wait until one or the other is done using the >> device? How is your partitioning between host processor owned blocks and >> co-processor(s) owned blocks? Is it static or is it dynamic? >> -- >> Florian >> > > This patch enables devices to selectively participate in system power > transitions. This is crucial when multiple processors, managed by > different operating system kernels, share the same controller. One > processor shouldn't enforce the same power transition procedures on > the controller – another processor might be using it at that moment. > While a spinlock is necessary for synchronizing controller access, we > still need to add the flexibility to dynamically customize power > transition behavior for each device. And that's what this patch is > trying to do. > In our use case, the host processor and co-processor are managed by > separate operating system kernels. This arrangement is static. OK, so now the question is whether the peripheral is entirely visible to Linux, or is it entirely owned by the co-processor, or is there a combination of both and the usage of the said device driver is dynamic between Linux and your co-processor? A sysfs entry does not seem like the appropriate way to described which states need to be skipped and which ones can remain under control of Linux, you would have to use your firmware's description for that (ACPI, Device Tree, etc.) such that you have a more comprehensive solution that can span a bigger scope.
On Mon, Feb 26, 2024 at 10:03 PM Andy Shevchenko <andriy.shevchenko@linux.intel.com> wrote: > > On Mon, Feb 26, 2024 at 05:15:00PM +0800, Guan-Yu Lin wrote: > > On Fri, Feb 23, 2024 at 11:18 PM Andy Shevchenko > > <andriy.shevchenko@linux.intel.com> wrote: > > > On Fri, Feb 23, 2024 at 02:38:29PM +0000, Guan-Yu Lin wrote: > > ... > > > > > + if (kstrtoint(buf, 0, &ret)) > > > > > > Why is it int? It seems like flags, should not be unsigned as u32 or so? > > > > The ".event" member in struct pm_message is an int, but the values > > assigned to it are used like bit flags (e.g. PM_EVENT_FREEZE=0x1, > > PM_EVENT_SUSPEND=0x2, PM_EVENT_HIBERNATE=0x4). Is this an intentional > > design choice? We might need to change the design accordingly. > > It might give a subtle errors related to promoted signdness. > Should we refrain from using the bitwise operation here? Or should we just change the type here to u32? > -- > With Best Regards, > Andy Shevchenko > >
On Tue, Feb 27, 2024 at 2:40 AM Florian Fainelli <f.fainelli@gmail.com> wrote: > > On 2/26/24 02:28, Guan-Yu Lin wrote: > > On Sat, Feb 24, 2024 at 2:20 AM Florian Fainelli <f.fainelli@gmail.com> wrote: > >> > >> On 2/23/24 06:38, Guan-Yu Lin wrote: > >>> In systems with a main processor and a co-processor, asynchronous > >>> controller management can lead to conflicts. One example is the main > >>> processor attempting to suspend a device while the co-processor is > >>> actively using it. To address this, we introduce a new sysfs entry > >>> called "conditional_skip". This entry allows the system to selectively > >>> skip certain device power management state transitions. To use this > >>> feature, set the value in "conditional_skip" to indicate the type of > >>> state transition you want to avoid. Please review /Documentation/ABI/ > >>> testing/sysfs-devices-power for more detailed information. > >> > >> This looks like a poor way of dealing with a lack of adequate resource > >> tracking from Linux on behalf of the co-processor(s) and I really do not > >> understand how someone is supposed to use that in a way that works. > >> > >> Cannot you use a HW maintained spinlock between your host processor and > >> the co-processor such that they can each claim exclusive access to the > >> hardware and you can busy-wait until one or the other is done using the > >> device? How is your partitioning between host processor owned blocks and > >> co-processor(s) owned blocks? Is it static or is it dynamic? > >> -- > >> Florian > >> > > > > This patch enables devices to selectively participate in system power > > transitions. This is crucial when multiple processors, managed by > > different operating system kernels, share the same controller. One > > processor shouldn't enforce the same power transition procedures on > > the controller – another processor might be using it at that moment. > > While a spinlock is necessary for synchronizing controller access, we > > still need to add the flexibility to dynamically customize power > > transition behavior for each device. And that's what this patch is > > trying to do. > > In our use case, the host processor and co-processor are managed by > > separate operating system kernels. This arrangement is static. > > OK, so now the question is whether the peripheral is entirely visible to > Linux, or is it entirely owned by the co-processor, or is there a > combination of both and the usage of the said device driver is dynamic > between Linux and your co-processor? > > A sysfs entry does not seem like the appropriate way to described which > states need to be skipped and which ones can remain under control of > Linux, you would have to use your firmware's description for that (ACPI, > Device Tree, etc.) such that you have a more comprehensive solution that > can span a bigger scope. > -- > Florian > We anticipate that control of the peripheral (e.g., controller) will be shared between operating system kernels. Each kernel will need its own driver for peripheral communication. To accommodate different tasks, the operating system managing the peripheral can change dynamically at runtime. We dynamically select the operating system kernel controlling the target peripheral based on the task at hand, which looks more like a software behavior rather than hardware behavior to me. I agree that we might need a firmware description for "whether another operating system exists for this peripheral", but we also need to store the information about "whether another operating system is actively using this peripheral". To me, the latter one looks more like a sysfs entry rather than a firmware description as it's not determined statically.
On Tue, Feb 27, 2024 at 04:56:00PM +0800, Guan-Yu Lin wrote: > On Tue, Feb 27, 2024 at 2:40 AM Florian Fainelli <f.fainelli@gmail.com> wrote: > > > > On 2/26/24 02:28, Guan-Yu Lin wrote: > > > On Sat, Feb 24, 2024 at 2:20 AM Florian Fainelli <f.fainelli@gmail.com> wrote: > > >> > > >> On 2/23/24 06:38, Guan-Yu Lin wrote: > > >>> In systems with a main processor and a co-processor, asynchronous > > >>> controller management can lead to conflicts. One example is the main > > >>> processor attempting to suspend a device while the co-processor is > > >>> actively using it. To address this, we introduce a new sysfs entry > > >>> called "conditional_skip". This entry allows the system to selectively > > >>> skip certain device power management state transitions. To use this > > >>> feature, set the value in "conditional_skip" to indicate the type of > > >>> state transition you want to avoid. Please review /Documentation/ABI/ > > >>> testing/sysfs-devices-power for more detailed information. > > >> > > >> This looks like a poor way of dealing with a lack of adequate resource > > >> tracking from Linux on behalf of the co-processor(s) and I really do not > > >> understand how someone is supposed to use that in a way that works. > > >> > > >> Cannot you use a HW maintained spinlock between your host processor and > > >> the co-processor such that they can each claim exclusive access to the > > >> hardware and you can busy-wait until one or the other is done using the > > >> device? How is your partitioning between host processor owned blocks and > > >> co-processor(s) owned blocks? Is it static or is it dynamic? > > >> -- > > >> Florian > > >> > > > > > > This patch enables devices to selectively participate in system power > > > transitions. This is crucial when multiple processors, managed by > > > different operating system kernels, share the same controller. One > > > processor shouldn't enforce the same power transition procedures on > > > the controller – another processor might be using it at that moment. > > > While a spinlock is necessary for synchronizing controller access, we > > > still need to add the flexibility to dynamically customize power > > > transition behavior for each device. And that's what this patch is > > > trying to do. > > > In our use case, the host processor and co-processor are managed by > > > separate operating system kernels. This arrangement is static. > > > > OK, so now the question is whether the peripheral is entirely visible to > > Linux, or is it entirely owned by the co-processor, or is there a > > combination of both and the usage of the said device driver is dynamic > > between Linux and your co-processor? > > > > A sysfs entry does not seem like the appropriate way to described which > > states need to be skipped and which ones can remain under control of > > Linux, you would have to use your firmware's description for that (ACPI, > > Device Tree, etc.) such that you have a more comprehensive solution that > > can span a bigger scope. > > -- > > Florian > > > > We anticipate that control of the peripheral (e.g., controller) will > be shared between operating system kernels. Each kernel will need its > own driver for peripheral communication. To accommodate different > tasks, the operating system managing the peripheral can change > dynamically at runtime. That sounds like a nightmare of control and handling, how are you going to do any of that? Where is the code for that? > We dynamically select the operating system kernel controlling the > target peripheral based on the task at hand, which looks more like a > software behavior rather than hardware behavior to me. I agree that we > might need a firmware description for "whether another operating > system exists for this peripheral", but we also need to store the > information about "whether another operating system is actively using > this peripheral". To me, the latter one looks more like a sysfs entry > rather than a firmware description as it's not determined statically. So you want to download different firmware to the device depending on "something". What is going to control that "something"? Is that coming from the kernel, or from userspace? If userspace, why is any of this an issue and just load whatever firmware you decide at that point in time? Why does the kernel care? confused, greg k-h
On Mon, Feb 26, 2024 at 10:45 AM Guan-Yu Lin <guanyulin@google.com> wrote: > > On Sat, Feb 24, 2024 at 1:44 AM Rafael J. Wysocki <rafael@kernel.org> wrote: > > > > On Fri, Feb 23, 2024 at 3:38 PM Guan-Yu Lin <guanyulin@google.com> wrote: > > > > > > In systems with a main processor and a co-processor, asynchronous > > > controller management can lead to conflicts. One example is the main > > > processor attempting to suspend a device while the co-processor is > > > actively using it. To address this, we introduce a new sysfs entry > > > called "conditional_skip". This entry allows the system to selectively > > > skip certain device power management state transitions. To use this > > > feature, set the value in "conditional_skip" to indicate the type of > > > state transition you want to avoid. Please review /Documentation/ABI/ > > > testing/sysfs-devices-power for more detailed information. > > > > > > Signed-off-by: Guan-Yu Lin <guanyulin@google.com> > > > > Please explain how this is intended to work. That is, what exactly > > you expect to happen when the new attribute is set. > > The sysfs entry 'conditional_skip' for a device indicates which power > transitions (e.g. PM_EVENT_SUSPEND) are omitted within the system > power management flow. During the processing of an identified power > transition, the device's power.entry will not be added to the > dpm_prepared_list, preventing the device from undergoing that > transition. As 'conditional_skip' is modifiable at runtime, a device's > participation in system power management can be dynamically enabled or > disabled. So this idea is completely misguided AFAICS. First off, why would a device be skipped in system-wide suspend and not in hibernation? Or the other way around? Or why would it be skipped in one phase of hibernation and not in the other? Second, but not less important, why is skipping a device in system-wide transitions a good idea at all? What about dependencies between that device and the other devices in the system? Generally speaking, system-wide PM is designed to cover the entire system and there are good reasons for that. If you don't want it to cover the entire system, you cannot use it at all.
On 2/27/24 00:56, Guan-Yu Lin wrote: > On Tue, Feb 27, 2024 at 2:40 AM Florian Fainelli <f.fainelli@gmail.com> wrote: >> >> On 2/26/24 02:28, Guan-Yu Lin wrote: >>> On Sat, Feb 24, 2024 at 2:20 AM Florian Fainelli <f.fainelli@gmail.com> wrote: >>>> >>>> On 2/23/24 06:38, Guan-Yu Lin wrote: >>>>> In systems with a main processor and a co-processor, asynchronous >>>>> controller management can lead to conflicts. One example is the main >>>>> processor attempting to suspend a device while the co-processor is >>>>> actively using it. To address this, we introduce a new sysfs entry >>>>> called "conditional_skip". This entry allows the system to selectively >>>>> skip certain device power management state transitions. To use this >>>>> feature, set the value in "conditional_skip" to indicate the type of >>>>> state transition you want to avoid. Please review /Documentation/ABI/ >>>>> testing/sysfs-devices-power for more detailed information. >>>> >>>> This looks like a poor way of dealing with a lack of adequate resource >>>> tracking from Linux on behalf of the co-processor(s) and I really do not >>>> understand how someone is supposed to use that in a way that works. >>>> >>>> Cannot you use a HW maintained spinlock between your host processor and >>>> the co-processor such that they can each claim exclusive access to the >>>> hardware and you can busy-wait until one or the other is done using the >>>> device? How is your partitioning between host processor owned blocks and >>>> co-processor(s) owned blocks? Is it static or is it dynamic? >>>> -- >>>> Florian >>>> >>> >>> This patch enables devices to selectively participate in system power >>> transitions. This is crucial when multiple processors, managed by >>> different operating system kernels, share the same controller. One >>> processor shouldn't enforce the same power transition procedures on >>> the controller – another processor might be using it at that moment. >>> While a spinlock is necessary for synchronizing controller access, we >>> still need to add the flexibility to dynamically customize power >>> transition behavior for each device. And that's what this patch is >>> trying to do. >>> In our use case, the host processor and co-processor are managed by >>> separate operating system kernels. This arrangement is static. >> >> OK, so now the question is whether the peripheral is entirely visible to >> Linux, or is it entirely owned by the co-processor, or is there a >> combination of both and the usage of the said device driver is dynamic >> between Linux and your co-processor? >> >> A sysfs entry does not seem like the appropriate way to described which >> states need to be skipped and which ones can remain under control of >> Linux, you would have to use your firmware's description for that (ACPI, >> Device Tree, etc.) such that you have a more comprehensive solution that >> can span a bigger scope. >> -- >> Florian >> > > We anticipate that control of the peripheral (e.g., controller) will > be shared between operating system kernels. Each kernel will need its > own driver for peripheral communication. To accommodate different > tasks, the operating system managing the peripheral can change > dynamically at runtime. OK, that seems like this ought to be resolved at various layer other than just user-space, starting possibly with an overarching/reconciliation layer between the various operating systems? > > We dynamically select the operating system kernel controlling the > target peripheral based on the task at hand, which looks more like a > software behavior rather than hardware behavior to me. I agree that we > might need a firmware description for "whether another operating > system exists for this peripheral", but we also need to store the > information about "whether another operating system is actively using > this peripheral". To me, the latter one looks more like a sysfs entry > rather than a firmware description as it's not determined statically. I can understand why moving this sort of decisions to user-space might sound appealing, but it also seems like if the peripheral is going to be "stolen" away from Linux, then maybe Linux should not be managing it at all, e.g.: unbind the device from its driver, and then rebind it when Linux needs to use it. You would have to write your drivers such that they can skip the peripheral's initialization if you need to preserve state from the previous agent after an ownership change for instance? I do not think you are painting a full picture of your use case, hopefully not intentionally but at first glance it sounds like you need a combination of kernel-level changes to your drivers, and possibly more. Seems like more details need to be provided about the overall intended use cases such that people can guide you with a fuller picture of the use cases.
On Tue, Feb 27, 2024 at 7:28 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > On Mon, Feb 26, 2024 at 10:45 AM Guan-Yu Lin <guanyulin@google.com> wrote: > > > > On Sat, Feb 24, 2024 at 1:44 AM Rafael J. Wysocki <rafael@kernel.org> wrote: > > > > > > On Fri, Feb 23, 2024 at 3:38 PM Guan-Yu Lin <guanyulin@google.com> wrote: > > > > > > > > In systems with a main processor and a co-processor, asynchronous > > > > controller management can lead to conflicts. One example is the main > > > > processor attempting to suspend a device while the co-processor is > > > > actively using it. To address this, we introduce a new sysfs entry > > > > called "conditional_skip". This entry allows the system to selectively > > > > skip certain device power management state transitions. To use this > > > > feature, set the value in "conditional_skip" to indicate the type of > > > > state transition you want to avoid. Please review /Documentation/ABI/ > > > > testing/sysfs-devices-power for more detailed information. > > > > > > > > Signed-off-by: Guan-Yu Lin <guanyulin@google.com> > > > > > > Please explain how this is intended to work. That is, what exactly > > > you expect to happen when the new attribute is set. > > > > The sysfs entry 'conditional_skip' for a device indicates which power > > transitions (e.g. PM_EVENT_SUSPEND) are omitted within the system > > power management flow. During the processing of an identified power > > transition, the device's power.entry will not be added to the > > dpm_prepared_list, preventing the device from undergoing that > > transition. As 'conditional_skip' is modifiable at runtime, a device's > > participation in system power management can be dynamically enabled or > > disabled. > > So this idea is completely misguided AFAICS. > > First off, why would a device be skipped in system-wide suspend and > not in hibernation? Or the other way around? Or why would it be > skipped in one phase of hibernation and not in the other? > The motto is to set less constraints and let users configure them properly by themselves. But, we could redesign it to better match with existing conventions/regulations. > Second, but not less important, why is skipping a device in > system-wide transitions a good idea at all? What about dependencies > between that device and the other devices in the system? > If a device is also used by another operating system kernel, then Linux kernel shouldn't always do the system power transition on that device to avoid affecting another operating system kernel. Since device dependencies can be highly system-specific, maybe we should let users handle it as users are the only ones who understand their system's unique configuration. > Generally speaking, system-wide PM is designed to cover the entire > system and there are good reasons for that. If you don't want it to > cover the entire system, you cannot use it at all. We discovered a use case that the existing system-wide PM may not fully support: devices shared by multiple operating system kernels. I think supporting this case would be beneficial to the system-wide PM, as it would increase its compatibility.
On Tue, Feb 27, 2024 at 5:15 PM Greg KH <gregkh@linuxfoundation.org> wrote: > > On Tue, Feb 27, 2024 at 04:56:00PM +0800, Guan-Yu Lin wrote: > > On Tue, Feb 27, 2024 at 2:40 AM Florian Fainelli <f.fainelli@gmail.com> wrote: > > > > > > On 2/26/24 02:28, Guan-Yu Lin wrote: > > > > On Sat, Feb 24, 2024 at 2:20 AM Florian Fainelli <f.fainelli@gmail.com> wrote: > > > >> > > > >> On 2/23/24 06:38, Guan-Yu Lin wrote: > > > >>> In systems with a main processor and a co-processor, asynchronous > > > >>> controller management can lead to conflicts. One example is the main > > > >>> processor attempting to suspend a device while the co-processor is > > > >>> actively using it. To address this, we introduce a new sysfs entry > > > >>> called "conditional_skip". This entry allows the system to selectively > > > >>> skip certain device power management state transitions. To use this > > > >>> feature, set the value in "conditional_skip" to indicate the type of > > > >>> state transition you want to avoid. Please review /Documentation/ABI/ > > > >>> testing/sysfs-devices-power for more detailed information. > > > >> > > > >> This looks like a poor way of dealing with a lack of adequate resource > > > >> tracking from Linux on behalf of the co-processor(s) and I really do not > > > >> understand how someone is supposed to use that in a way that works. > > > >> > > > >> Cannot you use a HW maintained spinlock between your host processor and > > > >> the co-processor such that they can each claim exclusive access to the > > > >> hardware and you can busy-wait until one or the other is done using the > > > >> device? How is your partitioning between host processor owned blocks and > > > >> co-processor(s) owned blocks? Is it static or is it dynamic? > > > >> -- > > > >> Florian > > > >> > > > > > > > > This patch enables devices to selectively participate in system power > > > > transitions. This is crucial when multiple processors, managed by > > > > different operating system kernels, share the same controller. One > > > > processor shouldn't enforce the same power transition procedures on > > > > the controller – another processor might be using it at that moment. > > > > While a spinlock is necessary for synchronizing controller access, we > > > > still need to add the flexibility to dynamically customize power > > > > transition behavior for each device. And that's what this patch is > > > > trying to do. > > > > In our use case, the host processor and co-processor are managed by > > > > separate operating system kernels. This arrangement is static. > > > > > > OK, so now the question is whether the peripheral is entirely visible to > > > Linux, or is it entirely owned by the co-processor, or is there a > > > combination of both and the usage of the said device driver is dynamic > > > between Linux and your co-processor? > > > > > > A sysfs entry does not seem like the appropriate way to described which > > > states need to be skipped and which ones can remain under control of > > > Linux, you would have to use your firmware's description for that (ACPI, > > > Device Tree, etc.) such that you have a more comprehensive solution that > > > can span a bigger scope. > > > -- > > > Florian > > > > > > > We anticipate that control of the peripheral (e.g., controller) will > > be shared between operating system kernels. Each kernel will need its > > own driver for peripheral communication. To accommodate different > > tasks, the operating system managing the peripheral can change > > dynamically at runtime. > > That sounds like a nightmare of control and handling, how are you going > to do any of that? Where is the code for that? > Since the peripheral can issue different types of interrupts, we plan to assign the handling of those interrupts to separate operating system kernels. Additionally, only one operating system kernel will have the privilege to issue commands to the peripheral. We think that this could resolve potential conflicts. > > We dynamically select the operating system kernel controlling the > > target peripheral based on the task at hand, which looks more like a > > software behavior rather than hardware behavior to me. I agree that we > > might need a firmware description for "whether another operating > > system exists for this peripheral", but we also need to store the > > information about "whether another operating system is actively using > > this peripheral". To me, the latter one looks more like a sysfs entry > > rather than a firmware description as it's not determined statically. > > So you want to download different firmware to the device depending on > "something". What is going to control that "something"? Is that coming > from the kernel, or from userspace? If userspace, why is any of this an > issue and just load whatever firmware you decide at that point in time? > Why does the kernel care? > > confused, > > greg k-h In our design, no single firmware can fully control or communicate with the peripheral. Different functions of the peripheral are supported by different operating system kernels. Therefore, we should keep both firmwares active to use the peripheral effectively.
On Thu, Feb 29, 2024 at 05:08:00PM +0800, Guan-Yu Lin wrote: > We want to introduce a mechanism that allows the Linux kernel to make > power transitions for the peripheral based on whether the other > operating system kernel is actively using it. To achieve this, we > propose this patch that adds a sysfs attribute, providing the Linux > kernel with the necessary information. Don't create random user/kernel apis in sysfs for no good reason just because it is "easy" :( If the "other operating system is actively using it" isn't able to be detected by Linux, then Linux shouldn't be able to change the PM state, so this sounds like you need to fix your Linux driver to properly know this information, just like any other device type (think about a sound device that needs to know if it is being used or not, nothing different here.) So please post your Linux driver and we can see what needs to be done there to get this to work properly, odds are you are just missing something. Have a pointer to the code anywhere? Also, as you know, we can NOT add interfaces to the kernel without any real user, so without a driver for your hardware, none of this is able to go anywhere at all, sorry. thanks, greg k-h
On Fri, Mar 1, 2024 at 4:34 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > On Thu, Feb 29, 2024 at 05:08:00PM +0800, Guan-Yu Lin wrote: > > We want to introduce a mechanism that allows the Linux kernel to make > > power transitions for the peripheral based on whether the other > > operating system kernel is actively using it. To achieve this, we > > propose this patch that adds a sysfs attribute, providing the Linux > > kernel with the necessary information. > > Don't create random user/kernel apis in sysfs for no good reason just > because it is "easy" :( > We initially considered using sysfs because it could provide a universal interface regardless of which operating system kernel shares the device with the Linux kernel. This would allow users to modify the feature through simple sysfs interactions. However, the current method of using information in sysfs doesn't seem to integrate well with the existing system power management framework. Could we refine how sysfs is used in system power management to enable cross-kernel communication? Alternatively, should we avoid exposing the information of whether a device is used by multiple operating systems to user space? > If the "other operating system is actively using it" isn't able to be > detected by Linux, then Linux shouldn't be able to change the PM state, > so this sounds like you need to fix your Linux driver to properly know > this information, just like any other device type (think about a sound > device that needs to know if it is being used or not, nothing different > here.) > I think the variable `usage_count` in struct `dev_pm_info` records whether the device is being used. Could we leverage this information in the design? We could modify the device tree to record which devices are shared across operating system kernels. Then, we could conditionally skip system power management steps for those devices if they're actively in use. We'll need to carefully consider potential corner cases and assess any impact on runtime power management. If this proposal seems worthwhile, I can prepare a more detailed v4 for discussion. > So please post your Linux driver and we can see what needs to be done > there to get this to work properly, odds are you are just missing > something. Have a pointer to the code anywhere? > > Also, as you know, we can NOT add interfaces to the kernel without any > real user, so without a driver for your hardware, none of this is able > to go anywhere at all, sorry. > The Linux device driver we're using here is the upstream xhci driver. We configure only partial interrupts for the controller in the device tree. This prevents the Linux kernel from accessing other interrupts handled by another operating system kernel. Consequently, the controller can function normally even with two operating system kernels accessing it, as long as we completely disable system power management functionality. > thanks, > > greg k-h
diff --git a/Documentation/ABI/testing/sysfs-devices-power b/Documentation/ABI/testing/sysfs-devices-power index 54195530e97a..3ac4e40f07a0 100644 --- a/Documentation/ABI/testing/sysfs-devices-power +++ b/Documentation/ABI/testing/sysfs-devices-power @@ -305,3 +305,14 @@ Description: Reports the runtime PM children usage count of a device, or 0 if the children will be ignored. +What: /sys/devices/.../power/conditional_skip +Date: Feburary 2024 +Contact: Guan-Yu Lin <guanyulin@google.com> +Description: + The /sys/devices/.../conditional_skip attribute provides a way + to selectively skip system-wide power transitions like + suspend-to-RAM or hibernation. To skip a specific transition, + write its corresponding value to this attribute. For skipping + multiple transitions, combine their values using a bitwise OR + and write the result to this attribute. + diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c index fadcd0379dc2..d507626c7892 100644 --- a/drivers/base/power/main.c +++ b/drivers/base/power/main.c @@ -1881,6 +1881,7 @@ static int device_prepare(struct device *dev, pm_message_t state) */ int dpm_prepare(pm_message_t state) { + struct list_head list; int error = 0; trace_suspend_resume(TPS("dpm_prepare"), state.event, true); @@ -1900,12 +1901,26 @@ int dpm_prepare(pm_message_t state) */ device_block_probing(); + INIT_LIST_HEAD(&list); mutex_lock(&dpm_list_mtx); while (!list_empty(&dpm_list) && !error) { struct device *dev = to_device(dpm_list.next); get_device(dev); + if (dev->power.conditional_skip_pm & state.event) { + dev_info(dev, "skip system PM transition (event = 0x%04x)\n", + state.event); + + if (!list_empty(&dev->power.entry)) + list_move_tail(&dev->power.entry, &list); + + mutex_unlock(&dpm_list_mtx); + put_device(dev); + mutex_lock(&dpm_list_mtx); + continue; + } + mutex_unlock(&dpm_list_mtx); trace_device_pm_callback_start(dev, "", state.event); @@ -1931,6 +1946,7 @@ int dpm_prepare(pm_message_t state) mutex_lock(&dpm_list_mtx); } + list_splice(&list, &dpm_list); mutex_unlock(&dpm_list_mtx); trace_suspend_resume(TPS("dpm_prepare"), state.event, false); return error; diff --git a/drivers/base/power/sysfs.c b/drivers/base/power/sysfs.c index a1474fb67db9..1feacb01b1e9 100644 --- a/drivers/base/power/sysfs.c +++ b/drivers/base/power/sysfs.c @@ -610,6 +610,31 @@ static DEVICE_ATTR_RW(async); #endif /* CONFIG_PM_SLEEP */ #endif /* CONFIG_PM_ADVANCED_DEBUG */ +static ssize_t conditional_skip_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + return sysfs_emit(buf, "0x%04x\n", dev->power.conditional_skip_pm); +} + +static ssize_t conditional_skip_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t n) +{ + int ret; + + if (kstrtoint(buf, 0, &ret)) + return -EINVAL; + + ret &= (PM_EVENT_FREEZE|PM_EVENT_SUSPEND|PM_EVENT_HIBERNATE); + + dev->power.conditional_skip_pm = ret; + + return n; +} + +static DEVICE_ATTR_RW(conditional_skip); + static struct attribute *power_attrs[] = { #ifdef CONFIG_PM_ADVANCED_DEBUG #ifdef CONFIG_PM_SLEEP @@ -620,6 +645,7 @@ static struct attribute *power_attrs[] = { &dev_attr_runtime_active_kids.attr, &dev_attr_runtime_enabled.attr, #endif /* CONFIG_PM_ADVANCED_DEBUG */ + &dev_attr_conditional_skip.attr, NULL, }; static const struct attribute_group pm_attr_group = { diff --git a/include/linux/device.h b/include/linux/device.h index 97c4b046c09d..f2c73dd00211 100644 --- a/include/linux/device.h +++ b/include/linux/device.h @@ -968,6 +968,13 @@ static inline void device_set_pm_not_required(struct device *dev) dev->power.no_pm = true; } +static inline void device_set_pm_conditional_skip(struct device *dev, + int condition) +{ + condition &= (PM_EVENT_FREEZE|PM_EVENT_SUSPEND|PM_EVENT_HIBERNATE); + dev->power.conditional_skip_pm = condition; +} + static inline void dev_pm_syscore_device(struct device *dev, bool val) { #ifdef CONFIG_PM_SLEEP diff --git a/include/linux/pm.h b/include/linux/pm.h index a2f3e53a8196..890c7a601c2a 100644 --- a/include/linux/pm.h +++ b/include/linux/pm.h @@ -713,6 +713,7 @@ struct dev_pm_info { enum rpm_status last_status; int runtime_error; int autosuspend_delay; + int conditional_skip_pm; u64 last_busy; u64 active_time; u64 suspended_time;
In systems with a main processor and a co-processor, asynchronous controller management can lead to conflicts. One example is the main processor attempting to suspend a device while the co-processor is actively using it. To address this, we introduce a new sysfs entry called "conditional_skip". This entry allows the system to selectively skip certain device power management state transitions. To use this feature, set the value in "conditional_skip" to indicate the type of state transition you want to avoid. Please review /Documentation/ABI/ testing/sysfs-devices-power for more detailed information. Signed-off-by: Guan-Yu Lin <guanyulin@google.com> --- V2 -> V3: Integrate the feature with the pm core framework. V1 -> V2: Cosmetics changes on coding style. [v2] usb: host: enable suspend-to-RAM control in userspace [v1] [RFC] usb: host: Allow userspace to control usb suspend flows --- Documentation/ABI/testing/sysfs-devices-power | 11 ++++++++ drivers/base/power/main.c | 16 ++++++++++++ drivers/base/power/sysfs.c | 26 +++++++++++++++++++ include/linux/device.h | 7 +++++ include/linux/pm.h | 1 + 5 files changed, 61 insertions(+)