Message ID | 20220511171754.avfrrqg6eihku55s@bsd-mbp.dhcp.thefacebook.com |
---|---|
State | New |
Headers | show |
Series | : Revert "ACPI: Remove side effect of partly creating a node in acpi_get_node()" | expand |
On Wed, May 11, 2022 at 7:24 PM Jonathan Lemon <jonathan.lemon@gmail.com> wrote: > > This reverts commit a62d07e0006a3a3ce77041ca07f3c488ec880790. > > The change calls pxm_to_node(), which ends up returning -1 > (NUMA_NO_NODE) on some systems for the pci bus, as opposed > to the prior call to acpi_map_pxm_to_node(), which returns 0. > > The default numa node is then inherited by all pci devices, and is > visible in /sys/bus/pci/devices/*/numa_node > > The prior behavior shows: > # cat /sys/bus/pci/devices/*/numa_node | sort | uniq -c > 122 0 > > While the new behavior has: > # cat /sys/bus/pci/devices/*/numa_node | sort | uniq -c > 1 0 > 121 -1 > > While arguably NUMA_NO_NODE is correct on single-socket systems which > have only one numa domain, this breaks scripts that attempt to read the > NIC numa_node and pass that to numactl in order to pin memory allocation > when running applications (like iperf). E.g.: > > # numactl -p -1 iperf3 > libnuma: Warning: node argument -1 is out of range > <-1> is invalid > > Reverting this change restores the prior behavior. Well, that's not a recent commit and it fixed a real and serious issue. Isn't there a way to fix this other than reverting it? > > Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> > --- > drivers/acpi/numa/srat.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c > index 3b818ab186be..f150c5c1d0a8 100644 > --- a/drivers/acpi/numa/srat.c > +++ b/drivers/acpi/numa/srat.c > @@ -564,6 +564,6 @@ int acpi_get_node(acpi_handle handle) > > pxm = acpi_get_pxm(handle); > > - return pxm_to_node(pxm); > + return acpi_map_pxm_to_node(pxm); > } > EXPORT_SYMBOL(acpi_get_node); > -- > 2.30.2 >
On 11 May 2022, at 10:33, Rafael J. Wysocki wrote: > On Wed, May 11, 2022 at 7:24 PM Jonathan Lemon <jonathan.lemon@gmail.com> wrote: >> >> This reverts commit a62d07e0006a3a3ce77041ca07f3c488ec880790. >> >> The change calls pxm_to_node(), which ends up returning -1 >> (NUMA_NO_NODE) on some systems for the pci bus, as opposed >> to the prior call to acpi_map_pxm_to_node(), which returns 0. >> >> The default numa node is then inherited by all pci devices, and is >> visible in /sys/bus/pci/devices/*/numa_node >> >> The prior behavior shows: >> # cat /sys/bus/pci/devices/*/numa_node | sort | uniq -c >> 122 0 >> >> While the new behavior has: >> # cat /sys/bus/pci/devices/*/numa_node | sort | uniq -c >> 1 0 >> 121 -1 >> >> While arguably NUMA_NO_NODE is correct on single-socket systems which >> have only one numa domain, this breaks scripts that attempt to read the >> NIC numa_node and pass that to numactl in order to pin memory allocation >> when running applications (like iperf). E.g.: >> >> # numactl -p -1 iperf3 >> libnuma: Warning: node argument -1 is out of range >> <-1> is invalid >> >> Reverting this change restores the prior behavior. > > Well, that's not a recent commit and it fixed a real and serious issue. > > Isn't there a way to fix this other than reverting it? The userspace behavior changed - is there another way to fix things so that a valid numa_node is returned?
On Wed, May 11, 2022 at 7:42 PM Jonathan Lemon <jonathan.lemon@gmail.com> wrote: > > On 11 May 2022, at 10:33, Rafael J. Wysocki wrote: > > > On Wed, May 11, 2022 at 7:24 PM Jonathan Lemon <jonathan.lemon@gmail.com> wrote: > >> > >> This reverts commit a62d07e0006a3a3ce77041ca07f3c488ec880790. > >> > >> The change calls pxm_to_node(), which ends up returning -1 > >> (NUMA_NO_NODE) on some systems for the pci bus, as opposed > >> to the prior call to acpi_map_pxm_to_node(), which returns 0. > >> > >> The default numa node is then inherited by all pci devices, and is > >> visible in /sys/bus/pci/devices/*/numa_node > >> > >> The prior behavior shows: > >> # cat /sys/bus/pci/devices/*/numa_node | sort | uniq -c > >> 122 0 > >> > >> While the new behavior has: > >> # cat /sys/bus/pci/devices/*/numa_node | sort | uniq -c > >> 1 0 > >> 121 -1 > >> > >> While arguably NUMA_NO_NODE is correct on single-socket systems which > >> have only one numa domain, this breaks scripts that attempt to read the > >> NIC numa_node and pass that to numactl in order to pin memory allocation > >> when running applications (like iperf). E.g.: > >> > >> # numactl -p -1 iperf3 > >> libnuma: Warning: node argument -1 is out of range > >> <-1> is invalid > >> > >> Reverting this change restores the prior behavior. > > > > Well, that's not a recent commit and it fixed a real and serious issue. > > > > Isn't there a way to fix this other than reverting it? > > The userspace behavior changed - is there another way to fix things > so that a valid numa_node is returned? Well, that's my question.
On Wed, 11 May 2022 19:44:14 +0200 "Rafael J. Wysocki" <rafael@kernel.org> wrote: > On Wed, May 11, 2022 at 7:42 PM Jonathan Lemon <jonathan.lemon@gmail.com> wrote: > > > > On 11 May 2022, at 10:33, Rafael J. Wysocki wrote: > > > > > On Wed, May 11, 2022 at 7:24 PM Jonathan Lemon <jonathan.lemon@gmail.com> wrote: > > >> > > >> This reverts commit a62d07e0006a3a3ce77041ca07f3c488ec880790. > > >> > > >> The change calls pxm_to_node(), which ends up returning -1 > > >> (NUMA_NO_NODE) on some systems for the pci bus, as opposed > > >> to the prior call to acpi_map_pxm_to_node(), which returns 0. > > >> > > >> The default numa node is then inherited by all pci devices, and is > > >> visible in /sys/bus/pci/devices/*/numa_node > > >> > > >> The prior behavior shows: > > >> # cat /sys/bus/pci/devices/*/numa_node | sort | uniq -c > > >> 122 0 > > >> > > >> While the new behavior has: > > >> # cat /sys/bus/pci/devices/*/numa_node | sort | uniq -c > > >> 1 0 Curious, which device is turning up in node 0? > > >> 121 -1 > > >> > > >> While arguably NUMA_NO_NODE is correct on single-socket systems which > > >> have only one numa domain, this breaks scripts that attempt to read the > > >> NIC numa_node and pass that to numactl in order to pin memory allocation > > >> when running applications (like iperf). E.g.: > > >> > > >> # numactl -p -1 iperf3 > > >> libnuma: Warning: node argument -1 is out of range > > >> <-1> is invalid > > >> > > >> Reverting this change restores the prior behavior. > > > > > > Well, that's not a recent commit and it fixed a real and serious issue. > > > > > > Isn't there a way to fix this other than reverting it? > > > > The userspace behavior changed - is there another way to fix things > > so that a valid numa_node is returned? > > Well, that's my question. As Rafael noted, we don't want to change the internal kernel representation because previous kernel behavior resulting in several paths where you could get NULL pointer de-references, but maybe we could special case it at the userspace boundary. e.g. override dev_to_node() return value here https://elixir.bootlin.com/linux/v5.18-rc6/source/drivers/pci/pci-sysfs.c#L358 What's problematic is we missed this being being an issue until now and hence have shipping kernels with both behaviors. +CC Bjorn and linux-pci Jonathan
On 12 May 2022, at 3:15, Jonathan Cameron wrote: > On Wed, 11 May 2022 19:44:14 +0200 > "Rafael J. Wysocki" <rafael@kernel.org> wrote: > >> On Wed, May 11, 2022 at 7:42 PM Jonathan Lemon <jonathan.lemon@gmail.com> wrote: >>> >>> On 11 May 2022, at 10:33, Rafael J. Wysocki wrote: >>> >>>> On Wed, May 11, 2022 at 7:24 PM Jonathan Lemon <jonathan.lemon@gmail.com> wrote: >>>>> >>>>> This reverts commit a62d07e0006a3a3ce77041ca07f3c488ec880790. >>>>> >>>>> The change calls pxm_to_node(), which ends up returning -1 >>>>> (NUMA_NO_NODE) on some systems for the pci bus, as opposed >>>>> to the prior call to acpi_map_pxm_to_node(), which returns 0. >>>>> >>>>> The default numa node is then inherited by all pci devices, and is >>>>> visible in /sys/bus/pci/devices/*/numa_node >>>>> >>>>> The prior behavior shows: >>>>> # cat /sys/bus/pci/devices/*/numa_node | sort | uniq -c >>>>> 122 0 >>>>> >>>>> While the new behavior has: >>>>> # cat /sys/bus/pci/devices/*/numa_node | sort | uniq -c >>>>> 1 0 > > Curious, which device is turning up in node 0? Oddly enough, the NVME drive: 01:00.0 Non-Volatile memory controller: SK hynix PC401 NVMe Solid State Drive 256GB (prog-if 02 [NVM Express]) Subsystem: SK hynix PC401 NVMe Solid State Drive 256GB NUMA node: 0 These are single-socket Skylake DE platforms. >>>>> >>>>> While arguably NUMA_NO_NODE is correct on single-socket systems which >>>>> have only one numa domain, this breaks scripts that attempt to read the >>>>> NIC numa_node and pass that to numactl in order to pin memory allocation >>>>> when running applications (like iperf). E.g.: >>>>> >>>>> # numactl -p -1 iperf3 >>>>> libnuma: Warning: node argument -1 is out of range >>>>> <-1> is invalid >>>>> >>>>> Reverting this change restores the prior behavior. >>>> >>>> Well, that's not a recent commit and it fixed a real and serious issue. >>>> >>>> Isn't there a way to fix this other than reverting it? >>> >>> The userspace behavior changed - is there another way to fix things >>> so that a valid numa_node is returned? >> >> Well, that's my question. This also could be a BIOS issue that wasn’t noticed until the platforms were updated to a newer kernel. — Jonathan > As Rafael noted, we don't want to change the internal kernel representation because > previous kernel behavior resulting in several paths where you could > get NULL pointer de-references, but maybe we could special case > it at the userspace boundary. > > e.g. override dev_to_node() return value here > https://elixir.bootlin.com/linux/v5.18-rc6/source/drivers/pci/pci-sysfs.c#L358 > > What's problematic is we missed this being being an issue until now and hence > have shipping kernels with both behaviors. > > +CC Bjorn and linux-pci > > Jonathan
diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c index 3b818ab186be..f150c5c1d0a8 100644 --- a/drivers/acpi/numa/srat.c +++ b/drivers/acpi/numa/srat.c @@ -564,6 +564,6 @@ int acpi_get_node(acpi_handle handle) pxm = acpi_get_pxm(handle); - return pxm_to_node(pxm); + return acpi_map_pxm_to_node(pxm); } EXPORT_SYMBOL(acpi_get_node);
This reverts commit a62d07e0006a3a3ce77041ca07f3c488ec880790. The change calls pxm_to_node(), which ends up returning -1 (NUMA_NO_NODE) on some systems for the pci bus, as opposed to the prior call to acpi_map_pxm_to_node(), which returns 0. The default numa node is then inherited by all pci devices, and is visible in /sys/bus/pci/devices/*/numa_node The prior behavior shows: # cat /sys/bus/pci/devices/*/numa_node | sort | uniq -c 122 0 While the new behavior has: # cat /sys/bus/pci/devices/*/numa_node | sort | uniq -c 1 0 121 -1 While arguably NUMA_NO_NODE is correct on single-socket systems which have only one numa domain, this breaks scripts that attempt to read the NIC numa_node and pass that to numactl in order to pin memory allocation when running applications (like iperf). E.g.: # numactl -p -1 iperf3 libnuma: Warning: node argument -1 is out of range <-1> is invalid Reverting this change restores the prior behavior. Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> --- drivers/acpi/numa/srat.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)