Message ID | 20241018144755.7875-1-ilpo.jarvinen@linux.intel.com |
---|---|
Headers | show |
Series | PCI: Add PCIe bandwidth controller | expand |
On Fri, Oct 18, 2024 at 05:47:46PM +0300, Ilpo Järvinen wrote: > Hi all, > > This series adds PCIe bandwidth controller (bwctrl) and associated PCIe > cooling driver to the thermal core side for limiting PCIe Link Speed > due to thermal reasons. PCIe bandwidth controller is a PCI express bus > port service driver. A cooling device is created for each port the > service driver finds to support changing speeds. > > This series only adds support for controlling PCIe Link Speed. > Controlling PCIe Link Width might also be useful but there is no > mechanism for that until PCIe 6.0 (L0p) so Link Width throttling is not > added by this series. > > > v9: > - Split RMW ops doc reformat into own patch before adding LNKCTL2. > - Comment reserved 0 LSB even better than it already was. > - Consider portdrv future plans: > - Use devm helpers for mem alloc, IRQ, and mutex init. > - Don't use get/set_service_data(). > - Split rwsem into two to avoid recursive locking splat through > pcie_retrain_link(). > - Small wording improvements to commit messages (from Jonathan) > > v8: > - Removed CONFIG_PCIE_BWCTRL (use CONFIG_PCIEPORTBUS) > - Removed locking wrappers that dealt with the CONFIG variations > - Protect macro parameter with parenthesis to be on the safe side > > v7: > - Rebased on top of Maciej's latest Target Speed quirk patches > - Target Speed quirk runs very early, w/o ->subordinate existing yet. > This required adapting logic: > - Move Supported Link Speeds back to pci_dev > - Check for ->subordinate == NULL where necessary > - Cannot always take bwctrl's per port mutex (in pcie_bwctrl_data) > - Cleaned up locking in pcie_set_target_speed() using wrappers > - Allowed removing confusing __pcie_set_target_speed() > - Fix building with CONFIG_PCI=n > - Correct error check in pcie_lbms_seen() > - Don't return error for an empty bus that remains at 2.5GT > - Use rwsem to protect ->link_bwctrl setup and bwnotif enable > - Clear LBMS in remove_board() > - Adding export for pcie_get_supported_speeds() was unnecessary > - Call bwctrl's init before hotplug. > - Added local variable 'bus' into a few functions > > v6: > - Removed unnecessary PCI_EXP_LNKCAP_SLS mask from PCIE_LNKCAP_SLS2SPEED() > - Split error handling in pcie_bwnotif_irq_thread() > - pci_info() -> pci_dbg() on bwctrl probe success path > - Handle cooling device pointer -Exx codes in bwctrl probe > - Reorder port->link_bwctrl setup / bwnotif enable for symmetry > - Handle LBMS count == 0 in PCIe quirk by checking LBMS (avoids a race > between quirk and bwctrl) > - Use cleanup.h in PCIe cooling device's register > > v5: > - Removed patches: LNKCTL2 RMW driver patches went in separately > - Refactor pcie_update_link_speed() to read LNKSTA + add __ variant > for hotplug that has LNKSTA value at hand > - Make series fully compatible with the Target Speed quirk > - LBMS counter added, quirk falls back to LBMS bit when bwctrl =n > - Separate LBMS patch from set target speed patches > - Always provide pcie_bwctrl_change_speed() even if bwctrl =n so drivers > don't need to come up their own version (also required by the Target > Speed quirk) > - Remove devm_* (based on Lukas' comment on some other service > driver patch) > - Convert to use cleanup.h > - Renamed functions/struct to have shorter names > > v4: > - Merge Port's and Endpoint's Supported Link Speeds Vectors into > supported_speeds in the struct pci_bus > - Reuse pcie_get_speed_cap()'s code for pcie_get_supported_speeds() > - Setup supported_speeds with PCI_EXP_LNKCAP2_SLS_2_5GB when no > Endpoint exists > - Squash revert + add bwctrl patches into one > - Change to use threaded IRQ + IRQF_ONESHOT > - Enable also LABIE / LABS > - Convert Link Speed selection to use bit logic instead of loop > - Allocate before requesting IRQ during probe > - Use devm_*() > - Use u8 for speed_conv array instead of u16 > - Removed READ_ONCE() > - Improve changelogs, comments, and Kconfig > - Name functions slightly more consistently > - Use bullet list for RMW protected registers in docs > > v3: > - Correct hfi1 shortlog prefix > - Improve error prints in hfi1 > - Add L: linux-pci to the MAINTAINERS entry > > v2: > - Adds LNKCTL2 to RMW safe list in Documentation/PCI/pciebus-howto.rst > - Renamed cooling devices from PCIe_Port_* to PCIe_Port_Link_Speed_* in > order to plan for possibility of adding Link Width cooling devices > later on > - Moved struct thermal_cooling_device declaration to the correct patch > - Small tweaks to Kconfig texts > - Series rebased to resolve conflict (in the selftest list) > > Ilpo Järvinen (9): > Documentation PCI: Reformat RMW ops documentation > PCI: Protect Link Control 2 Register with RMW locking > PCI: Store all PCIe Supported Link Speeds > PCI: Refactor pcie_update_link_speed() > PCI/quirks: Abstract LBMS seen check into own function > PCI/bwctrl: Re-add BW notification portdrv as PCIe BW controller > PCI/bwctrl: Add API to set PCIe Link Speed > thermal: Add PCIe cooling driver > selftests/pcie_bwctrl: Create selftests > > Documentation/PCI/pciebus-howto.rst | 14 +- > MAINTAINERS | 9 + > drivers/pci/hotplug/pciehp_ctrl.c | 5 + > drivers/pci/hotplug/pciehp_hpc.c | 2 +- > drivers/pci/pci.c | 62 ++- > drivers/pci/pci.h | 38 +- > drivers/pci/pcie/Makefile | 2 +- > drivers/pci/pcie/bwctrl.c | 366 ++++++++++++++++++ > drivers/pci/pcie/portdrv.c | 9 +- > drivers/pci/pcie/portdrv.h | 6 +- > drivers/pci/probe.c | 15 +- > drivers/pci/quirks.c | 32 +- > drivers/thermal/Kconfig | 9 + > drivers/thermal/Makefile | 2 + > drivers/thermal/pcie_cooling.c | 80 ++++ > include/linux/pci-bwctrl.h | 28 ++ > include/linux/pci.h | 24 +- > include/uapi/linux/pci_regs.h | 1 + > tools/testing/selftests/Makefile | 1 + > tools/testing/selftests/pcie_bwctrl/Makefile | 2 + > .../pcie_bwctrl/set_pcie_cooling_state.sh | 122 ++++++ > .../selftests/pcie_bwctrl/set_pcie_speed.sh | 67 ++++ > 22 files changed, 843 insertions(+), 53 deletions(-) > create mode 100644 drivers/pci/pcie/bwctrl.c > create mode 100644 drivers/thermal/pcie_cooling.c > create mode 100644 include/linux/pci-bwctrl.h > create mode 100644 tools/testing/selftests/pcie_bwctrl/Makefile > create mode 100755 tools/testing/selftests/pcie_bwctrl/set_pcie_cooling_state.sh > create mode 100755 tools/testing/selftests/pcie_bwctrl/set_pcie_speed.sh Applied to pci/bwctrl for v6.13, thanks Ilpo (and Alexandru, for the bandwidth notification interrupt support)!
[+cc Bart, Amit, Neil, Caleb] On Wed, Oct 23, 2024 at 05:19:04PM -0500, Bjorn Helgaas wrote: > On Fri, Oct 18, 2024 at 05:47:46PM +0300, Ilpo Järvinen wrote: > > This series adds PCIe bandwidth controller (bwctrl) and associated PCIe > > cooling driver to the thermal core side for limiting PCIe Link Speed > > due to thermal reasons. PCIe bandwidth controller is a PCI express bus > > port service driver. A cooling device is created for each port the > > service driver finds to support changing speeds. > > drivers/pci/pcie/bwctrl.c | 366 ++++++++++++++++++ > > include/linux/pci-bwctrl.h | 28 ++ > > tools/testing/selftests/pcie_bwctrl/Makefile | 2 + > > .../pcie_bwctrl/set_pcie_cooling_state.sh | 122 ++++++ > > .../selftests/pcie_bwctrl/set_pcie_speed.sh | 67 ++++ > Applied to pci/bwctrl for v6.13, thanks Ilpo (and Alexandru, for the > bandwidth notification interrupt support)! How attached are we to "bwctrl" and "pwrctl" as the names? I just noticed that we have "ctrl" for bandwidth control but "ctl" for power control, which is slightly annoying to keep straight. "ctrl" is more common in the tree: $ find -name \*ctrl\*[ch] | wc -l 691 $ find -name \*ctl\*[ch] | wc -l 291 so I would prefer that, although "pwrctl" is already in the tree (as of v6.11), so it would be more disruptive to change it than to rename "bwctrl".
On Wed, 13 Nov 2024 at 22:48, Bjorn Helgaas <helgaas@kernel.org> wrote: > > [+cc Bart, Amit, Neil, Caleb] > > On Wed, Oct 23, 2024 at 05:19:04PM -0500, Bjorn Helgaas wrote: > > On Fri, Oct 18, 2024 at 05:47:46PM +0300, Ilpo Järvinen wrote: > > > This series adds PCIe bandwidth controller (bwctrl) and associated PCIe > > > cooling driver to the thermal core side for limiting PCIe Link Speed > > > due to thermal reasons. PCIe bandwidth controller is a PCI express bus > > > port service driver. A cooling device is created for each port the > > > service driver finds to support changing speeds. > > > > drivers/pci/pcie/bwctrl.c | 366 ++++++++++++++++++ > > > include/linux/pci-bwctrl.h | 28 ++ > > > tools/testing/selftests/pcie_bwctrl/Makefile | 2 + > > > .../pcie_bwctrl/set_pcie_cooling_state.sh | 122 ++++++ > > > .../selftests/pcie_bwctrl/set_pcie_speed.sh | 67 ++++ > > > Applied to pci/bwctrl for v6.13, thanks Ilpo (and Alexandru, for the > > bandwidth notification interrupt support)! > > How attached are we to "bwctrl" and "pwrctl" as the names? > > I just noticed that we have "ctrl" for bandwidth control but "ctl" for > power control, which is slightly annoying to keep straight. > > "ctrl" is more common in the tree: > > $ find -name \*ctrl\*[ch] | wc -l > 691 > $ find -name \*ctl\*[ch] | wc -l > 291 > > so I would prefer that, although "pwrctl" is already in the tree (as > of v6.11), so it would be more disruptive to change it than to rename > "bwctrl". "pwrctl" is shorter and it was the only reason for me to choose it over pwrctrl. Bart
Hello, [...] > How attached are we to "bwctrl" and "pwrctl" as the names? > > I just noticed that we have "ctrl" for bandwidth control but "ctl" for > power control, which is slightly annoying to keep straight. > > "ctrl" is more common in the tree: > > $ find -name \*ctrl\*[ch] | wc -l > 691 > $ find -name \*ctl\*[ch] | wc -l > 291 > > so I would prefer that, although "pwrctl" is already in the tree (as > of v6.11), so it would be more disruptive to change it than to rename > "bwctrl". If I may, I would also lean towards the "pwrctrl" name. The "ctl" suffix makes me think of a command-line utility, a CLI, so to speak, where such suffix is commonly used e.g., sysctl, etcdctl, kubectl; also, all the systemd binaries, etc. Krzysztof
On Tue, 12 Nov 2024 18:01:50 +0200 (EET) Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> wrote: > On Tue, 12 Nov 2024, Lukas Wunner wrote: > > > On Fri, Oct 18, 2024 at 05:47:53PM +0300, Ilpo Järvinen wrote: > > > +EXPORT_SYMBOL_GPL(pcie_set_target_speed); > > > > My apologies for another belated comment on this series. > > This patch is now a688ab21eb72 on pci/bwctrl: > > > > I note that pcie_set_target_speed() is not called my a modular user > > (CONFIG_PCIE_THERMAL is bool, not tristate), so the above-quoted export > > isn't really necessary right now. I don't know if it was added > > intentionally because some modular user is expected to show up > > in the near future. > > Its probably a thinko to add it at all but then there have been talk about > other users interested in the API too so it's not far fetched we could see > a user. No idea about timelines though. > > There are some AMD GPU drivers tweaking the TLS field on their own but > they also touch some HW specific registers (although, IIRC, they only > touch Endpoint'sTLS). I was thinking of converting them but I'm unsure if > that yields something very straightforward and ends up producing a working > conversion or not (without ability to test with the HW). But TBH, not on > my highest priority item. > > > > @@ -135,6 +296,7 @@ static int pcie_bwnotif_probe(struct pcie_device *srv) > > > if (!data) > > > return -ENOMEM; > > > > > > + devm_mutex_init(&srv->device, &data->set_speed_mutex); > > > ret = devm_request_threaded_irq(&srv->device, srv->irq, NULL, > > > pcie_bwnotif_irq_thread, > > > IRQF_SHARED | IRQF_ONESHOT, > > > > We generally try to avoid devm_*() functions in port service drivers > > because if we later on move them into the PCI core (which is the plan), > > we'll have to unroll them. Not the end of the world that they're used > > here, just not ideal. > > I think Jonathan disagrees with you on that: > > https://lore.kernel.org/linux-pci/20241017114812.00005e67@Huawei.com/ Indeed - you beat me to it ;) There is no practical way to move most of the port driver code into the PCI core and definitely not interrupts. It is a shame though as I'd much prefer if we could do so. At LPC other issues some as power management were called out as being very hard to handle, but to me the interrupt bit is a single relatively easy to understand blocker. I've been very slow on getting back to this, but my current plan would 1) Split up the bits of portdrv subdrivers that are actually core code (things like the AER counters etc) The current mix in the same files makes it hard to reason about lifetimes etc. 2) Squash all the portdrv sub drivers into simple library style calls so no pretend devices, everything registered directly. That cleans up a lot of the layering and still provides reusable code if it makes sense to have multiple drivers for ports or to reuse this code for something else. Note that along the way I'll check we can build the portdrv as a module - I don't plan to actually do that, but it shows the layering / abstractions all work if it is possible. That will probably make use of devm_ where appropriate as it simplifies a lot of paths. 3) Add the MSIX stuff to support 'new' child drivers but only where dynamic MSIX is supported. 4) Register new child devices where the layering does make sense. So where we are actually binding drivers that can be unbound etc. Only for cases where dynamic MSI-X is available. I hope to get back to this in a week or so. Jonathan > > :-) >
On Mon, 18 Nov 2024, Jonathan Cameron wrote: > On Tue, 12 Nov 2024 18:01:50 +0200 (EET) > Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> wrote: > > > On Tue, 12 Nov 2024, Lukas Wunner wrote: > > > > > On Fri, Oct 18, 2024 at 05:47:53PM +0300, Ilpo Järvinen wrote: > > > > +EXPORT_SYMBOL_GPL(pcie_set_target_speed); > > > > > > My apologies for another belated comment on this series. > > > This patch is now a688ab21eb72 on pci/bwctrl: > > > > > > I note that pcie_set_target_speed() is not called my a modular user > > > (CONFIG_PCIE_THERMAL is bool, not tristate), so the above-quoted export > > > isn't really necessary right now. I don't know if it was added > > > intentionally because some modular user is expected to show up > > > in the near future. > > > > Its probably a thinko to add it at all but then there have been talk about > > other users interested in the API too so it's not far fetched we could see > > a user. No idea about timelines though. > > > > There are some AMD GPU drivers tweaking the TLS field on their own but > > they also touch some HW specific registers (although, IIRC, they only > > touch Endpoint'sTLS). I was thinking of converting them but I'm unsure if > > that yields something very straightforward and ends up producing a working > > conversion or not (without ability to test with the HW). But TBH, not on > > my highest priority item. > > > > > > @@ -135,6 +296,7 @@ static int pcie_bwnotif_probe(struct pcie_device *srv) > > > > if (!data) > > > > return -ENOMEM; > > > > > > > > + devm_mutex_init(&srv->device, &data->set_speed_mutex); > > > > ret = devm_request_threaded_irq(&srv->device, srv->irq, NULL, > > > > pcie_bwnotif_irq_thread, > > > > IRQF_SHARED | IRQF_ONESHOT, > > > > > > We generally try to avoid devm_*() functions in port service drivers > > > because if we later on move them into the PCI core (which is the plan), > > > we'll have to unroll them. Not the end of the world that they're used > > > here, just not ideal. > > > > I think Jonathan disagrees with you on that: > > > > https://lore.kernel.org/linux-pci/20241017114812.00005e67@Huawei.com/ > > Indeed - you beat me to it ;) > > There is no practical way to move most of the port driver code into the PCI > core and definitely not interrupts. It is a shame though as I'd much prefer > if we could do so. At LPC other issues some as power management were called > out as being very hard to handle, but to me the interrupt bit is a single > relatively easy to understand blocker. > > I've been very slow on getting back to this, but my current plan would > > 1) Split up the bits of portdrv subdrivers that are actually core code > (things like the AER counters etc) The current mix in the same files > makes it hard to reason about lifetimes etc. > > 2) Squash all the portdrv sub drivers into simple library style calls so > no pretend devices, everything registered directly. That cleans up > a lot of the layering and still provides reusable code if it makes > sense to have multiple drivers for ports or to reuse this code for > something else. Note that along the way I'll check we can build the > portdrv as a module - I don't plan to actually do that, but it shows > the layering / abstractions all work if it is possible. That will > probably make use of devm_ where appropriate as it simplifies a lot > of paths. I'm sorry to be a bit of a spoilsport here but quirks make calls to ASPM code and now also to bwctrl set Link Speed API so I'm not entire sure if you can actual succeed in that module test. (It's just something that again indicates both would belong to PCI core but sadly that direction is out of options).
On Mon, 18 Nov 2024 15:17:53 +0200 (EET) Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> wrote: > On Mon, 18 Nov 2024, Jonathan Cameron wrote: > > > On Tue, 12 Nov 2024 18:01:50 +0200 (EET) > > Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> wrote: > > > > > On Tue, 12 Nov 2024, Lukas Wunner wrote: > > > > > > > On Fri, Oct 18, 2024 at 05:47:53PM +0300, Ilpo Järvinen wrote: > > > > > +EXPORT_SYMBOL_GPL(pcie_set_target_speed); > > > > > > > > My apologies for another belated comment on this series. > > > > This patch is now a688ab21eb72 on pci/bwctrl: > > > > > > > > I note that pcie_set_target_speed() is not called my a modular user > > > > (CONFIG_PCIE_THERMAL is bool, not tristate), so the above-quoted export > > > > isn't really necessary right now. I don't know if it was added > > > > intentionally because some modular user is expected to show up > > > > in the near future. > > > > > > Its probably a thinko to add it at all but then there have been talk about > > > other users interested in the API too so it's not far fetched we could see > > > a user. No idea about timelines though. > > > > > > There are some AMD GPU drivers tweaking the TLS field on their own but > > > they also touch some HW specific registers (although, IIRC, they only > > > touch Endpoint'sTLS). I was thinking of converting them but I'm unsure if > > > that yields something very straightforward and ends up producing a working > > > conversion or not (without ability to test with the HW). But TBH, not on > > > my highest priority item. > > > > > > > > @@ -135,6 +296,7 @@ static int pcie_bwnotif_probe(struct pcie_device *srv) > > > > > if (!data) > > > > > return -ENOMEM; > > > > > > > > > > + devm_mutex_init(&srv->device, &data->set_speed_mutex); > > > > > ret = devm_request_threaded_irq(&srv->device, srv->irq, NULL, > > > > > pcie_bwnotif_irq_thread, > > > > > IRQF_SHARED | IRQF_ONESHOT, > > > > > > > > We generally try to avoid devm_*() functions in port service drivers > > > > because if we later on move them into the PCI core (which is the plan), > > > > we'll have to unroll them. Not the end of the world that they're used > > > > here, just not ideal. > > > > > > I think Jonathan disagrees with you on that: > > > > > > https://lore.kernel.org/linux-pci/20241017114812.00005e67@Huawei.com/ > > > > Indeed - you beat me to it ;) > > > > There is no practical way to move most of the port driver code into the PCI > > core and definitely not interrupts. It is a shame though as I'd much prefer > > if we could do so. At LPC other issues some as power management were called > > out as being very hard to handle, but to me the interrupt bit is a single > > relatively easy to understand blocker. > > > > I've been very slow on getting back to this, but my current plan would > > > > 1) Split up the bits of portdrv subdrivers that are actually core code > > (things like the AER counters etc) The current mix in the same files > > makes it hard to reason about lifetimes etc. > > > > 2) Squash all the portdrv sub drivers into simple library style calls so > > no pretend devices, everything registered directly. That cleans up > > a lot of the layering and still provides reusable code if it makes > > sense to have multiple drivers for ports or to reuse this code for > > something else. Note that along the way I'll check we can build the > > portdrv as a module - I don't plan to actually do that, but it shows > > the layering / abstractions all work if it is possible. That will > > probably make use of devm_ where appropriate as it simplifies a lot > > of paths. > > I'm sorry to be a bit of a spoilsport here but quirks make calls to ASPM > code and now also to bwctrl set Link Speed API so I'm not entire sure if > you can actual succeed in that module test. > > (It's just something that again indicates both would belong to PCI core > but sadly that direction is out of options). > It may involve some bodges, but it is still a path to checking the layer splits at least make 'some' sense. Also that the resulting library style code is suitable for reuse. Possibly with an exception for a few parts. Thanks for the pointers to where the pitfalls lie! Jonathan