mbox series

[00/18] CXL RAM and the 'Soft Reserved' => 'System RAM' default

Message ID 167564534874.847146.5222419648551436750.stgit@dwillia2-xfh.jf.intel.com
Headers show
Series CXL RAM and the 'Soft Reserved' => 'System RAM' default | expand

Message

Dan Williams Feb. 6, 2023, 1:02 a.m. UTC
Summary:
--------

CXL RAM support allows for the dynamic provisioning of new CXL RAM
regions, and more routinely, assembling a region from an existing
configuration established by platform-firmware. The latter is motivated
by CXL memory RAS (Reliability, Availability and Serviceability)
support, that requires associating device events with System Physical
Address ranges and vice versa.

The 'Soft Reserved' policy rework arranges for performance
differentiated memory like CXL attached DRAM, or high-bandwidth memory,
to be designated for 'System RAM' by default, rather than the device-dax
dedicated access mode. That current device-dax default is confusing and
surprising for the Pareto of users that do not expect memory to be
quarantined for dedicated access by default. Most users expect all
'System RAM'-capable memory to show up in FREE(1).


Details:
--------

Recall that the Linux 'Soft Reserved' designation for memory is a
reaction to platform-firmware, like EFI EDK2, delineating memory with
the EFI Specific Purpose Memory attribute (EFI_MEMORY_SP). An
alternative way to think of that attribute is that it specifies the
*not* general-purpose memory pool. It is memory that may be too precious
for general usage or not performant enough for some hot data structures.
However, in the absence of explicit policy it should just be 'System
RAM' by default.

Rather than require every distribution to ship a udev policy to assign
dax devices to dax_kmem (the device-memory hotplug driver) just make
that the kernel default. This is similar to the rationale in:

commit 8604d9e534a3 ("memory_hotplug: introduce CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE")

With this change the relatively niche use case of accessing this memory
via mapping a device-dax instance can be achieved by building with
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=n, or specifying
memhp_default_state=offline at boot, and then use:

    daxctl reconfigure-device $device -m devdax --force

...to shift the corresponding address range to device-dax access.

The process of assembling a device-dax instance for a given CXL region
device configuration is similar to the process of assembling a
Device-Mapper or MDRAID storage-device array. Specifically, asynchronous
probing by the PCI and driver core enumerates all CXL endpoints and
their decoders. Then, once enough decoders have arrived to a describe a
given region, that region is passed to the device-dax subsystem where it
is subject to the above 'dax_kmem' policy. This assignment and policy
choice is only possible if memory is set aside by the 'Soft Reserved'
designation. Otherwise, CXL that is mapped as 'System RAM' becomes
immutable by CXL driver mechanisms, but is still enumerated for RAS
purposes.

This series is also available via:

https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/log/?h=for-6.3/cxl-ram-region

...and has gone through some preview testing in various forms.

---

Dan Williams (18):
      cxl/Documentation: Update references to attributes added in v6.0
      cxl/region: Add a mode attribute for regions
      cxl/region: Support empty uuids for non-pmem regions
      cxl/region: Validate region mode vs decoder mode
      cxl/region: Add volatile region creation support
      cxl/region: Refactor attach_target() for autodiscovery
      cxl/region: Move region-position validation to a helper
      kernel/range: Uplevel the cxl subsystem's range_contains() helper
      cxl/region: Enable CONFIG_CXL_REGION to be toggled
      cxl/region: Fix passthrough-decoder detection
      cxl/region: Add region autodiscovery
      tools/testing/cxl: Define a fixed volatile configuration to parse
      dax/hmem: Move HMAT and Soft reservation probe initcall level
      dax/hmem: Drop unnecessary dax_hmem_remove()
      dax/hmem: Convey the dax range via memregion_info()
      dax/hmem: Move hmem device registration to dax_hmem.ko
      dax: Assign RAM regions to memory-hotplug by default
      cxl/dax: Create dax devices for CXL RAM regions


 Documentation/ABI/testing/sysfs-bus-cxl |   64 +-
 MAINTAINERS                             |    1 
 drivers/acpi/numa/hmat.c                |    4 
 drivers/cxl/Kconfig                     |   12 
 drivers/cxl/acpi.c                      |    3 
 drivers/cxl/core/core.h                 |    7 
 drivers/cxl/core/hdm.c                  |    8 
 drivers/cxl/core/pci.c                  |    5 
 drivers/cxl/core/port.c                 |   34 +
 drivers/cxl/core/region.c               |  848 ++++++++++++++++++++++++++++---
 drivers/cxl/cxl.h                       |   46 ++
 drivers/cxl/cxlmem.h                    |    3 
 drivers/cxl/port.c                      |   26 +
 drivers/dax/Kconfig                     |   17 +
 drivers/dax/Makefile                    |    2 
 drivers/dax/bus.c                       |   53 +-
 drivers/dax/bus.h                       |   12 
 drivers/dax/cxl.c                       |   53 ++
 drivers/dax/device.c                    |    3 
 drivers/dax/hmem/Makefile               |    3 
 drivers/dax/hmem/device.c               |  102 ++--
 drivers/dax/hmem/hmem.c                 |  148 +++++
 drivers/dax/kmem.c                      |    1 
 include/linux/dax.h                     |    7 
 include/linux/memregion.h               |    2 
 include/linux/range.h                   |    5 
 lib/stackinit_kunit.c                   |    6 
 tools/testing/cxl/test/cxl.c            |  146 +++++
 28 files changed, 1355 insertions(+), 266 deletions(-)
 create mode 100644 drivers/dax/cxl.c

base-commit: 172738bbccdb4ef76bdd72fc72a315c741c39161

Comments

Gregory Price Feb. 6, 2023, 5:36 a.m. UTC | #1
On Sun, Feb 05, 2023 at 05:02:29PM -0800, Dan Williams wrote:
> Summary:
> --------
> 
> CXL RAM support allows for the dynamic provisioning of new CXL RAM
> regions, and more routinely, assembling a region from an existing
> configuration established by platform-firmware. The latter is motivated
> by CXL memory RAS (Reliability, Availability and Serviceability)
> support, that requires associating device events with System Physical
> Address ranges and vice versa.
> 
> The 'Soft Reserved' policy rework arranges for performance
> differentiated memory like CXL attached DRAM, or high-bandwidth memory,
> to be designated for 'System RAM' by default, rather than the device-dax
> dedicated access mode. That current device-dax default is confusing and
> surprising for the Pareto of users that do not expect memory to be
> quarantined for dedicated access by default. Most users expect all
> 'System RAM'-capable memory to show up in FREE(1).

Leverage the same QEMU branch, machine, and configuration as my prior
tests, i'm now experiencing a kernel panic on boot.  Will debug a bit
in the morning, but here is the stack trace i'm seeing

Saw this in both 1 and 2 root port configurations

(note: I also have the region reset issue previously discussed on top of
your branch).  

QEMU configuration:

sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \
-drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \
-m 2G,slots=4,maxmem=4G \
-smp 4 \
-machine type=q35,accel=kvm,cxl=on \
-enable-kvm \
-nographic \
-device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
-device cxl-rp,id=rp0,bus=cxl.0,chassis=0,port=0,slot=0 \
-object memory-backend-file,id=mem0,mem-path=/tmp/mem0,size=1G,share=true \
-device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0 \
-M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=1G

[   13.936817] Call Trace:
[   13.970691]  <TASK>
[   13.990690]  device_add+0x39d/0x9a0
[   14.024690]  ? kobject_set_name_vargs+0x6d/0x90
[   14.066690]  ? dev_set_name+0x4b/0x60
[   14.090691]  devm_cxl_add_port+0x29a/0x4d0
[   14.135946]  cxl_acpi_probe+0xd9/0x2f0
[   14.167691]  ? device_pm_check_callbacks+0x36/0x100
[   14.203691]  platform_probe+0x44/0x90
[   14.247691]  really_probe+0xde/0x380
[   14.277690]  ? pm_runtime_barrier+0x50/0x90
[   14.324693]  __driver_probe_device+0x78/0x170
[   14.356694]  driver_probe_device+0x1f/0x90
[   14.396692]  __driver_attach+0xce/0x1c0
[   14.435691]  ? __pfx___driver_attach+0x10/0x10
[   14.471692]  bus_for_each_dev+0x73/0xa0
[   14.508693]  bus_add_driver+0x1ae/0x200
[   14.551691]  driver_register+0x89/0xe0
[   14.587691]  ? __pfx_cxl_acpi_init+0x10/0x10
[   14.625690]  do_one_initcall+0x59/0x230
[   14.814691]  kernel_init_freeable+0x204/0x24e
[   14.846710]  ? __pfx_kernel_init+0x10/0x10
[   14.899692]  kernel_init+0x16/0x140
[   14.954691]  ret_from_fork+0x2c/0x50
[   14.986692]  </TASK>
[   15.023689] Modules linked in:
[   15.057693] CR2: 0000000000000060
[   15.105691] ---[ end trace 0000000000000000 ]---
[   15.162837] RIP: 0010:bus_add_device+0x5b/0x150
[   15.217693] Code: 49 8b 74 24 20 48 89 df e8 92 88 ff ff 89 c5 85 c0 75 3b 48 8b 53 50 48 85 d2 75 03 48 8b 13 49 8b 84 24 a8 00 00 00 48 89 0
[   15.427859] RSP: 0000:ffffbd2a40013bc0 EFLAGS: 00010246
[   15.475692] RAX: 0000000000000000 RBX: ffff955c419e1800 RCX: 0000000000000000
[   15.591691] RDX: ffff955c41921778 RSI: ffff955c419e1800 RDI: ffff955c419e1800
[   15.703693] RBP: 0000000000000000 R08: 0000000000000228 R09: ffff955c4119e550
[   15.802711] R10: ffff955c414bedc8 R11: 0000000000000000 R12: ffffffff9e259d60
[   15.907692] R13: ffffffff9d3cee40 R14: ffff955c419e1800 R15: ffff955c41f06010
[   15.983693] FS:  0000000000000000(0000) GS:ffff955cbdd00000(0000) knlGS:0000000000000000
[   16.126698] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   16.206694] CR2: 0000000000000060 CR3: 0000000036010000 CR4: 00000000000006e0
[   16.347694] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
[   16.348686] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009 ]---
Jonathan Cameron Feb. 6, 2023, 3:17 p.m. UTC | #2
On Sun, 05 Feb 2023 17:02:35 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> Prior to Linus deciding that the kernel that following v5.19 would be
> v6.0, the CXL ABI documentation already referenced v5.20. In preparation
> for updating these entries update the kernel version to v6.0.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Fair enough though that fun bit of history disappears :)

Might as well queue this up ahead of the rest if you want to.
FWIW
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
 
> ---
>  Documentation/ABI/testing/sysfs-bus-cxl |   30 +++++++++++++++---------------
>  1 file changed, 15 insertions(+), 15 deletions(-)
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index 329a7e46c805..5be032313e29 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -198,7 +198,7 @@ Description:
>  
>  What:		/sys/bus/cxl/devices/endpointX/CDAT
>  Date:		July, 2022
> -KernelVersion:	v5.20
> +KernelVersion:	v6.0
>  Contact:	linux-cxl@vger.kernel.org
>  Description:
>  		(RO) If this sysfs entry is not present no DOE mailbox was
> @@ -209,7 +209,7 @@ Description:
>  
>  What:		/sys/bus/cxl/devices/decoderX.Y/mode
>  Date:		May, 2022
> -KernelVersion:	v5.20
> +KernelVersion:	v6.0
>  Contact:	linux-cxl@vger.kernel.org
>  Description:
>  		(RW) When a CXL decoder is of devtype "cxl_decoder_endpoint" it
> @@ -229,7 +229,7 @@ Description:
>  
>  What:		/sys/bus/cxl/devices/decoderX.Y/dpa_resource
>  Date:		May, 2022
> -KernelVersion:	v5.20
> +KernelVersion:	v6.0
>  Contact:	linux-cxl@vger.kernel.org
>  Description:
>  		(RO) When a CXL decoder is of devtype "cxl_decoder_endpoint",
> @@ -240,7 +240,7 @@ Description:
>  
>  What:		/sys/bus/cxl/devices/decoderX.Y/dpa_size
>  Date:		May, 2022
> -KernelVersion:	v5.20
> +KernelVersion:	v6.0
>  Contact:	linux-cxl@vger.kernel.org
>  Description:
>  		(RW) When a CXL decoder is of devtype "cxl_decoder_endpoint" it
> @@ -260,7 +260,7 @@ Description:
>  
>  What:		/sys/bus/cxl/devices/decoderX.Y/interleave_ways
>  Date:		May, 2022
> -KernelVersion:	v5.20
> +KernelVersion:	v6.0
>  Contact:	linux-cxl@vger.kernel.org
>  Description:
>  		(RO) The number of targets across which this decoder's host
> @@ -275,7 +275,7 @@ Description:
>  
>  What:		/sys/bus/cxl/devices/decoderX.Y/interleave_granularity
>  Date:		May, 2022
> -KernelVersion:	v5.20
> +KernelVersion:	v6.0
>  Contact:	linux-cxl@vger.kernel.org
>  Description:
>  		(RO) The number of consecutive bytes of host physical address
> @@ -287,7 +287,7 @@ Description:
>  
>  What:		/sys/bus/cxl/devices/decoderX.Y/create_pmem_region
>  Date:		May, 2022
> -KernelVersion:	v5.20
> +KernelVersion:	v6.0
>  Contact:	linux-cxl@vger.kernel.org
>  Description:
>  		(RW) Write a string in the form 'regionZ' to start the process
> @@ -303,7 +303,7 @@ Description:
>  
>  What:		/sys/bus/cxl/devices/decoderX.Y/delete_region
>  Date:		May, 2022
> -KernelVersion:	v5.20
> +KernelVersion:	v6.0
>  Contact:	linux-cxl@vger.kernel.org
>  Description:
>  		(WO) Write a string in the form 'regionZ' to delete that region,
> @@ -312,7 +312,7 @@ Description:
>  
>  What:		/sys/bus/cxl/devices/regionZ/uuid
>  Date:		May, 2022
> -KernelVersion:	v5.20
> +KernelVersion:	v6.0
>  Contact:	linux-cxl@vger.kernel.org
>  Description:
>  		(RW) Write a unique identifier for the region. This field must
> @@ -322,7 +322,7 @@ Description:
>  
>  What:		/sys/bus/cxl/devices/regionZ/interleave_granularity
>  Date:		May, 2022
> -KernelVersion:	v5.20
> +KernelVersion:	v6.0
>  Contact:	linux-cxl@vger.kernel.org
>  Description:
>  		(RW) Set the number of consecutive bytes each device in the
> @@ -333,7 +333,7 @@ Description:
>  
>  What:		/sys/bus/cxl/devices/regionZ/interleave_ways
>  Date:		May, 2022
> -KernelVersion:	v5.20
> +KernelVersion:	v6.0
>  Contact:	linux-cxl@vger.kernel.org
>  Description:
>  		(RW) Configures the number of devices participating in the
> @@ -343,7 +343,7 @@ Description:
>  
>  What:		/sys/bus/cxl/devices/regionZ/size
>  Date:		May, 2022
> -KernelVersion:	v5.20
> +KernelVersion:	v6.0
>  Contact:	linux-cxl@vger.kernel.org
>  Description:
>  		(RW) System physical address space to be consumed by the region.
> @@ -360,7 +360,7 @@ Description:
>  
>  What:		/sys/bus/cxl/devices/regionZ/resource
>  Date:		May, 2022
> -KernelVersion:	v5.20
> +KernelVersion:	v6.0
>  Contact:	linux-cxl@vger.kernel.org
>  Description:
>  		(RO) A region is a contiguous partition of a CXL root decoder
> @@ -372,7 +372,7 @@ Description:
>  
>  What:		/sys/bus/cxl/devices/regionZ/target[0..N]
>  Date:		May, 2022
> -KernelVersion:	v5.20
> +KernelVersion:	v6.0
>  Contact:	linux-cxl@vger.kernel.org
>  Description:
>  		(RW) Write an endpoint decoder object name to 'targetX' where X
> @@ -391,7 +391,7 @@ Description:
>  
>  What:		/sys/bus/cxl/devices/regionZ/commit
>  Date:		May, 2022
> -KernelVersion:	v5.20
> +KernelVersion:	v6.0
>  Contact:	linux-cxl@vger.kernel.org
>  Description:
>  		(RW) Write a boolean 'true' string value to this attribute to
> 
>
Jonathan Cameron Feb. 6, 2023, 3:54 p.m. UTC | #3
On Sun, 05 Feb 2023 17:02:45 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> Shipping versions of the cxl-cli utility expect all regions to have a
> 'uuid' attribute. In preparation for 'ram' regions, update the 'uuid'
> attribute to return an empty string which satisfies the current
> expectations of 'cxl list -R'. Otherwise, 'cxl list -R' fails in the
> presence of regions with the 'uuid' attribute missing. Force the
> attribute to be read-only as there is no facility or expectation for a
> 'ram' region to recall its uuid from one boot to the next.

This is new functionality, so do we need the backwards compatibility?
Or does that cxl list -R not just skip these, but fail hard?
It's inelegant to have this visible at all (and it confused me when
I was running tests as I assumed I needed to poke something in it).

If nothing else, good to have a comment to remind us in the code
why there is the oddity of setting it to read only.

Jonathan

> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  Documentation/ABI/testing/sysfs-bus-cxl |    3 ++-
>  drivers/cxl/core/region.c               |    7 +++++--
>  2 files changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index 058b0c45001f..4c4e1cbb1169 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -317,7 +317,8 @@ Contact:	linux-cxl@vger.kernel.org
>  Description:
>  		(RW) Write a unique identifier for the region. This field must
>  		be set for persistent regions and it must not conflict with the
> -		UUID of another region.
> +		UUID of another region. For volatile ram regions this
> +		attribute is a read-only empty string.
>  
>  
>  What:		/sys/bus/cxl/devices/regionZ/interleave_granularity
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 17d2d0c12725..c9e7f05caa0f 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -45,7 +45,10 @@ static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
>  	rc = down_read_interruptible(&cxl_region_rwsem);
>  	if (rc)
>  		return rc;
> -	rc = sysfs_emit(buf, "%pUb\n", &p->uuid);
> +	if (cxlr->mode != CXL_DECODER_PMEM)
> +		rc = sysfs_emit(buf, "\n");
> +	else
> +		rc = sysfs_emit(buf, "%pUb\n", &p->uuid);
>  	up_read(&cxl_region_rwsem);
>  
>  	return rc;
> @@ -301,7 +304,7 @@ static umode_t cxl_region_visible(struct kobject *kobj, struct attribute *a,
>  	struct cxl_region *cxlr = to_cxl_region(dev);
>  
>  	if (a == &dev_attr_uuid.attr && cxlr->mode != CXL_DECODER_PMEM)
> -		return 0;
> +		return 0444;

Definitely a comment here as this is very odd and we may forget why
it is like this.

>  	return a->mode;
>  }
>  
>
Gregory Price Feb. 6, 2023, 4:37 p.m. UTC | #4
On Sun, Feb 05, 2023 at 05:02:35PM -0800, Dan Williams wrote:
> Prior to Linus deciding that the kernel that following v5.19 would be
> v6.0, the CXL ABI documentation already referenced v5.20. In preparation
> for updating these entries update the kernel version to v6.0.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  Documentation/ABI/testing/sysfs-bus-cxl |   30 +++++++++++++++---------------
>  1 file changed, 15 insertions(+), 15 deletions(-)
> 

Reviewed-by: Gregory Price <gregory.price@memverge.com>
Davidlohr Bueso Feb. 6, 2023, 4:40 p.m. UTC | #5
On Mon, 06 Feb 2023, Gregory Price wrote:

>On Sun, Feb 05, 2023 at 05:02:29PM -0800, Dan Williams wrote:
>> Summary:
>> --------
>>
>> CXL RAM support allows for the dynamic provisioning of new CXL RAM
>> regions, and more routinely, assembling a region from an existing
>> configuration established by platform-firmware. The latter is motivated
>> by CXL memory RAS (Reliability, Availability and Serviceability)
>> support, that requires associating device events with System Physical
>> Address ranges and vice versa.
>>
>> The 'Soft Reserved' policy rework arranges for performance
>> differentiated memory like CXL attached DRAM, or high-bandwidth memory,
>> to be designated for 'System RAM' by default, rather than the device-dax
>> dedicated access mode. That current device-dax default is confusing and
>> surprising for the Pareto of users that do not expect memory to be
>> quarantined for dedicated access by default. Most users expect all
>> 'System RAM'-capable memory to show up in FREE(1).
>
>Leverage the same QEMU branch, machine, and configuration as my prior
>tests, i'm now experiencing a kernel panic on boot.  Will debug a bit
>in the morning, but here is the stack trace i'm seeing
>
>Saw this in both 1 and 2 root port configurations

I also see it in "regular" pmem setups, and narrowed it down to this
change in the last patch:

-module_init(cxl_acpi_init);
+/* load before dax_hmem sees 'Soft Reserved' CXL ranges */
+subsys_initcall(cxl_acpi_init);
Gregory Price Feb. 6, 2023, 4:55 p.m. UTC | #6
On Sun, Feb 05, 2023 at 05:02:56PM -0800, Dan Williams wrote:
> Expand the region creation infrastructure to enable 'ram'
> (volatile-memory) regions. The internals of create_pmem_region_store()
> and create_pmem_region_show() are factored out into helpers
> __create_region() and __create_region_show() for the 'ram' case to
> reuse.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  
[...]
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 8566451cb22f..47e450c3a5a9 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -294,6 +294,7 @@ static struct attribute *cxl_decoder_root_attrs[] = {
>  	&dev_attr_cap_type3.attr,
>  	&dev_attr_target_list.attr,
>  	SET_CXL_REGION_ATTR(create_pmem_region)
> +	SET_CXL_REGION_ATTR(create_ram_region)
>  	SET_CXL_REGION_ATTR(delete_region)
>  	NULL,
>  };
> @@ -305,6 +306,13 @@ static bool can_create_pmem(struct cxl_root_decoder *cxlrd)
>  	return (cxlrd->cxlsd.cxld.flags & flags) == flags;
>  }
>  
> +static bool can_create_ram(struct cxl_root_decoder *cxlrd)
> +{
> +	unsigned long flags = CXL_DECODER_F_TYPE3 | CXL_DECODER_F_RAM;
> +
> +	return (cxlrd->cxlsd.cxld.flags & flags) == flags;
> +}
> +

does this collide with either CXL_DECODE_F_ENABLE or CXL_DECODER_F_AUTO?

I think obviously if it's already enabled creating new regions in a
decoder doesn't make sense, but if F_AUTO is set, does that imply
the region settings cannot be changed?
Gregory Price Feb. 6, 2023, 5:03 p.m. UTC | #7
On Sun, Feb 05, 2023 at 05:03:18PM -0800, Dan Williams wrote:
> Add help text and a label so the CXL_REGION config option can be
> toggled. This is mainly to enable compile testing without region
> support.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/cxl/Kconfig |   12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
> 

Reviewed-by: Gregory Price <gregory.price@memverge.com
Gregory Price Feb. 6, 2023, 5:26 p.m. UTC | #8
On Sun, Feb 05, 2023 at 05:04:05PM -0800, Dan Williams wrote:
> The default mode for device-dax instances is backwards for RAM-regions
> as evidenced by the fact that it tends to catch end users by surprise.
> "Where is my memory?". Recall that platforms are increasingly shipping
> with performance-differentiated memory pools beyond typical DRAM and
> NUMA effects. This includes HBM (high-bandwidth-memory) and CXL (dynamic
> interleave, varied media types, and future fabric attached
> possibilities).
> 
> For this reason the EFI_MEMORY_SP (EFI Special Purpose Memory => Linux
> 'Soft Reserved') attribute is expected to be applied to all memory-pools
> that are not the general purpose pool. This designation gives an
> Operating System a chance to defer usage of a memory pool until later in
> the boot process where its performance properties can be interrogated
> and administrator policy can be applied.
> 
> 'Soft Reserved' memory can be anything from too limited and precious to
> be part of the general purpose pool (HBM), too slow to host hot kernel
> data structures (some PMEM media), or anything in between. However, in
> the absence of an explicit policy, the memory should at least be made
> usable by default. The current device-dax default hides all
> non-general-purpose memory behind a device interface.
> 
> The expectation is that the distribution of users that want the memory
> online by default vs device-dedicated-access by default follows the
> Pareto principle. A small number of enlightened users may want to do
> userspace memory management through a device, but general users just
> want the kernel to make the memory available with an option to get more
> advanced later.
> 
> Arrange for all device-dax instances not backed by PMEM to default to
> attaching to the dax_kmem driver. From there the baseline memory hotplug
> policy (CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE / memhp_default_state=)
> gates whether the memory comes online or stays offline. Where, if it
> stays offline, it can be reliably converted back to device-mode where it
> can be partitioned, or fronted by a userspace allocator.
> 
> So, if someone wants device-dax instances for their 'Soft Reserved'
> memory:
> 
> 1/ Build a kernel with CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=n or boot
>    with memhp_default_state=offline, or roll the dice and hope that the
>    kernel has not pinned a page in that memory before step 2.
> 
> 2/ Write a udev rule to convert the target dax device(s) from
>    'system-ram' mode to 'devdax' mode:
> 
>    daxctl reconfigure-device $dax -m devdax -f
> 
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Stupid question: when defaulting to online, do these devices get placed
into Zone Normal?  Is there a way for us, at a minimum, to online this
as Zone Moveable in an effort to assist the "hope the kernel has not
pinned a page" problem (and to try to keep kernel resources out of this
zone in general).

If this is covered by a different patch or already set up this way,
ignore me :]

~Gregory
Ira Weiny Feb. 6, 2023, 5:44 p.m. UTC | #9
Dan Williams wrote:
> In preparation for region autodiscovery, that needs all devices
> discovered before their relative position in the region can be
> determined, consolidate all position dependent validation in a helper.
> 
> Recall that in the on-demand region creation flow the end-user picks the
> position of a given endpoint decoder in a region. In the autodiscovery
> case the position of an endpoint decoder can only be determined after
> all other endpoint decoders that claim to decode the region's address
> range have been enumerated and attached. So, in the autodiscovery case
> endpoint decoders may be attached before their relative position is
> known. Once all decoders arrive, then positions can be determined and
> validated with cxl_region_validate_position() the same as user initiated
> on-demand creation.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/cxl/core/region.c |  119 +++++++++++++++++++++++++++++----------------
>  1 file changed, 76 insertions(+), 43 deletions(-)
> 
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 97eafdd75675..c82d3b6f3d1f 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -1207,35 +1207,13 @@ static int cxl_region_setup_targets(struct cxl_region *cxlr)
>  	return 0;
>  }
>  

[snip]

> @@ -1274,6 +1252,71 @@ static int cxl_region_attach(struct cxl_region *cxlr,
>  		}
>  	}
>  
> +	return 0;
> +}
> +
> +static int cxl_region_attach_position(struct cxl_region *cxlr,
> +				      struct cxl_root_decoder *cxlrd,
> +				      struct cxl_endpoint_decoder *cxled,
> +				      const struct cxl_dport *dport, int pos)
> +{
> +	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> +	struct cxl_port *iter;
> +	int rc;
> +
> +	if (cxlrd->calc_hb(cxlrd, pos) != dport) {
> +		dev_dbg(&cxlr->dev, "%s:%s invalid target position for %s\n",
> +			dev_name(&cxlmd->dev), dev_name(&cxled->cxld.dev),
> +			dev_name(&cxlrd->cxlsd.cxld.dev));
> +		return -ENXIO;
> +	}

I think I know the answer but I'm curious why this check is not part of
validating the position?  Is it because this is validating the position
relative to the hostbridge which is not strictly part of the region
validation?

Ira
Dan Williams Feb. 6, 2023, 6:19 p.m. UTC | #10
Jonathan Cameron wrote:
> On Sun, 05 Feb 2023 17:02:56 -0800
> Dan Williams <dan.j.williams@intel.com> wrote:
> 
> > Expand the region creation infrastructure to enable 'ram'
> > (volatile-memory) regions. The internals of create_pmem_region_store()
> > and create_pmem_region_show() are factored out into helpers
> > __create_region() and __create_region_show() for the 'ram' case to
> > reuse.
> > 
> > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> Entirely trivial comments inline.
> 
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
[..]
> > @@ -1727,29 +1762,37 @@ static ssize_t create_pmem_region_store(struct device *dev,
> >  {
> >  	struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(dev);
> >  	struct cxl_region *cxlr;
> > -	int id, rc;
> > +	int rc, id;
> >  
> >  	rc = sscanf(buf, "region%d\n", &id);
> >  	if (rc != 1)
> >  		return -EINVAL;
> >  
> > -	rc = memregion_alloc(GFP_KERNEL);
> > -	if (rc < 0)
> > -		return rc;
> > +	cxlr = __create_region(cxlrd, CXL_DECODER_PMEM, id);
> > +	if (IS_ERR(cxlr))
> > +		return PTR_ERR(cxlr);
> 
> I'd have a blank line here - see below.
> 
[..]
> >  	if (IS_ERR(cxlr))
> >  		return PTR_ERR(cxlr);
> > -
> 
> Just so you know I read to the end!
> 
> Spurious unrelated white space change :)

...and I can't even blame that on clang-format! Will fix.
Ira Weiny Feb. 6, 2023, 7:02 p.m. UTC | #11
Dan Williams wrote:

[snip]

> +
> +static int cmp_decode_pos(const void *a, const void *b)
> +{
> +	struct cxl_endpoint_decoder *cxled_a = *(typeof(cxled_a) *)a;
> +	struct cxl_endpoint_decoder *cxled_b = *(typeof(cxled_b) *)b;
> +	struct cxl_memdev *cxlmd_a = cxled_to_memdev(cxled_a);
> +	struct cxl_memdev *cxlmd_b = cxled_to_memdev(cxled_b);
> +	struct cxl_port *port_a = cxled_to_port(cxled_a);
> +	struct cxl_port *port_b = cxled_to_port(cxled_b);
> +	struct cxl_port *iter_a, *iter_b;
> +
> +	/* Exit early if any prior sorting failed */
> +	if (cxled_a->pos < 0 || cxled_b->pos < 0)
> +		return 0;
> +
> +	/*
> +	 * Walk up the hierarchy to find a shared port, find the decoder
> +	 * that maps the range, compare the relative position of those
> +	 * dport mappings.
> +	 */
> +	for (iter_a = port_a; iter_a; iter_a = next_port(iter_a)) {
> +		struct cxl_port *next_a, *next_b, *port;
> +		struct cxl_switch_decoder *cxlsd;
> +
> +		next_a = next_port(iter_a);
> +		for (iter_b = port_b; iter_b; iter_b = next_port(iter_b)) {
> +			int a_pos, b_pos, result;
> +			struct device *dev;
> +			unsigned int seq;
> +
> +			next_b = next_port(iter_b);
> +			if (next_a != next_b)
> +				continue;
> +			if (!next_a)
> +				goto out;

To me this check makes more sense before the inner loop.

> +			port = next_a;
> +			dev = device_find_child(&port->dev, cxled_a,
> +						decoder_match_range);
> +			if (!dev) {
> +				struct range *range = &cxled_a->cxld.hpa_range;
> +
> +				dev_err(port->uport,
> +					"failed to find decoder that maps %#llx-:%#llx\n",
> +					range->start, range->end);
> +				cxled_a->pos = -1;
> +				return 0;
> +			}
> +
> +			cxlsd = to_cxl_switch_decoder(dev);
> +			do {
> +				seq = read_seqbegin(&cxlsd->target_lock);
> +				find_positions(cxlsd, iter_a, iter_b, &a_pos,
> +					       &b_pos);
> +			} while (read_seqretry(&cxlsd->target_lock, seq));
> +
> +			if (a_pos < 0 || b_pos < 0) {
> +				dev_err(port->uport,
> +					"failed to find shared decoder for %s and %s\n",
> +					dev_name(cxlmd_a->dev.parent),
> +					dev_name(cxlmd_b->dev.parent));
> +				cxled_a->pos = -1;
> +				result = 0;
> +			} else {
> +				result = a_pos - b_pos;
> +				dev_dbg(port->uport, "%s: %s comes %s %s\n",
> +					dev_name(&cxlsd->cxld.dev),
> +					dev_name(cxlmd_a->dev.parent),
> +					result < 0 ? "before" : "after",
> +					dev_name(cxlmd_b->dev.parent));
> +			}
> +
> +			put_device(dev);
> +
> +			return result;
> +		}
> +	}
> +out:
> +	dev_err(cxlmd_a->dev.parent, "failed to find shared port with %s\n",
> +		dev_name(cxlmd_b->dev.parent));
> +	cxled_a->pos = -1;
> +	return 0;
> +}
> +

[snip]

> @@ -1500,8 +1766,8 @@ static int detach_target(struct cxl_region *cxlr, int pos)
>  	return rc;
>  }
>  
> -static size_t store_targetN(struct cxl_region *cxlr, const char *buf, int pos,
> -			    size_t len)
> +static ssize_t store_targetN(struct cxl_region *cxlr, const char *buf, int pos,
> +			     size_t len)

Is this a separate fix?

>  {
>  	int rc;
>  

[snip]

> +
> +/* Establish an empty region covering the given HPA range */
> +static struct cxl_region *construct_region(struct cxl_root_decoder *cxlrd,
> +					   struct cxl_endpoint_decoder *cxled)

Rather than a comment would this be better named construct_empty_region()?

Ira
Gregory Price Feb. 6, 2023, 7:05 p.m. UTC | #12
On Mon, Feb 06, 2023 at 02:15:56PM -0800, Dan Williams wrote:
> Gregory Price wrote:
> > On Sun, Feb 05, 2023 at 05:04:05PM -0800, Dan Williams wrote:
> [..]
> > 
> > Stupid question: when defaulting to online, do these devices get placed
> > into Zone Normal?  Is there a way for us, at a minimum, to online this
> > as Zone Moveable in an effort to assist the "hope the kernel has not
> > pinned a page" problem (and to try to keep kernel resources out of this
> > zone in general).
> > 
> > If this is covered by a different patch or already set up this way,
> > ignore me :]
> 
> Have a look in Documentation/admin-guide/mm/memory-hotplug.rst, the
> 'daxctl recconfigure-device' man page and the the policy options of how
> hot-added memory is brought online.
> 
> The routing can be anything from fully offline device-dax, to fully
> online ZONE_NORMAL, or even a mix of device-dax subdivision,
> ZONE_NORMAL, and ZONE_MOVABLE memblocks all within one CXL ram region.

Hm.

I'm just thinking, for early expander devices the assumption that the
default behavior should be auto-online is good.  Just kinda working
through this in my head for multi-headed devices and early pools, and I
suppose the kernel paremeter covers that. Most of those will be used in
specialty, vertically integrated systems, so we're good to go.

In the future when we get fully featured DCD's, I imagine we will have a
memory region of size X, with N memory blocks, but only some of those
blocks are online.  I suppose it could also be done such that DCD capacity
add events drive the creation of new blocks under the region, as opposed
to having the region pre-create the blocks.

Either way I think the kernel parameter probably covers this case as
well, I'm just wondering if at some point the default will yet again
wish to be false, but i think that's a "CXL has become so successful all
memory is now CXL DCDs and computers look radically different" scenario.

Anyway, this seems good.

Reviewed-by: Gregory Price <gregory.price@memverge.com>
Ira Weiny Feb. 6, 2023, 7:22 p.m. UTC | #13
Dan Williams wrote:
> Shipping versions of the cxl-cli utility expect all regions to have a
> 'uuid' attribute. In preparation for 'ram' regions, update the 'uuid'
> attribute to return an empty string which satisfies the current
> expectations of 'cxl list -R'. Otherwise, 'cxl list -R' fails in the
> presence of regions with the 'uuid' attribute missing.

Would this be more appropriate as a change to cxl-cli?

> Force the
> attribute to be read-only as there is no facility or expectation for a
> 'ram' region to recall its uuid from one boot to the next.

This seems reasonable.

> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  Documentation/ABI/testing/sysfs-bus-cxl |    3 ++-
>  drivers/cxl/core/region.c               |    7 +++++--
>  2 files changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index 058b0c45001f..4c4e1cbb1169 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -317,7 +317,8 @@ Contact:	linux-cxl@vger.kernel.org
>  Description:
>  		(RW) Write a unique identifier for the region. This field must
>  		be set for persistent regions and it must not conflict with the
> -		UUID of another region.
> +		UUID of another region. For volatile ram regions this
> +		attribute is a read-only empty string.
>  
>  
>  What:		/sys/bus/cxl/devices/regionZ/interleave_granularity
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 17d2d0c12725..c9e7f05caa0f 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -45,7 +45,10 @@ static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
>  	rc = down_read_interruptible(&cxl_region_rwsem);
>  	if (rc)
>  		return rc;
> -	rc = sysfs_emit(buf, "%pUb\n", &p->uuid);

I guess it all depends on what p->uuid is...  Shouldn't this be all 0's
for a ram region?  Does sysfs_emit() choke on that?

Ira
Ira Weiny Feb. 6, 2023, 7:25 p.m. UTC | #14
Dan Williams wrote:
> Expand the region creation infrastructure to enable 'ram'
> (volatile-memory) regions. The internals of create_pmem_region_store()
> and create_pmem_region_show() are factored out into helpers
> __create_region() and __create_region_show() for the 'ram' case to
> reuse.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Gregory Price Feb. 6, 2023, 7:56 p.m. UTC | #15
On Mon, Feb 06, 2023 at 01:57:05PM -0800, Dan Williams wrote:
> Gregory Price wrote:
> > On Sun, Feb 05, 2023 at 05:02:56PM -0800, Dan Williams wrote:
> > > Expand the region creation infrastructure to enable 'ram'
> > > (volatile-memory) regions. The internals of create_pmem_region_store()
> > > and create_pmem_region_show() are factored out into helpers
> > > __create_region() and __create_region_show() for the 'ram' case to
> > > reuse.
> > > 
> > > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> > > ---
> > >  
> [..]
> > > @@ -305,6 +306,13 @@ static bool can_create_pmem(struct cxl_root_decoder *cxlrd)
> > >  	return (cxlrd->cxlsd.cxld.flags & flags) == flags;
> > >  }
> > >  
> > > +static bool can_create_ram(struct cxl_root_decoder *cxlrd)
> > > +{
> > > +	unsigned long flags = CXL_DECODER_F_TYPE3 | CXL_DECODER_F_RAM;
> > > +
> > > +	return (cxlrd->cxlsd.cxld.flags & flags) == flags;
> > > +}
> > > +
> > 
> > does this collide with either CXL_DECODE_F_ENABLE or CXL_DECODER_F_AUTO?
> > 
> > I think obviously if it's already enabled creating new regions in a
> > decoder doesn't make sense, but if F_AUTO is set, does that imply
> > the region settings cannot be changed?
> 
> That just cares if the root decoder supports TYPE3 and RAM independent
> of ENABLE or AUTO. Root decoders are always enabled. The AUTO flag,
> which is not applicable to root decoders, is just there to hold off
> userspace racing the attachment of endpoint decoders to a region until
> the autodiscovery process has completed. Once that completes and the
> region has been enabled then it can be destroyed to clear AUTO.

Reviewed-by: Gregory Price <gregory.price@memverge.com>
Dave Jiang Feb. 6, 2023, 9:04 p.m. UTC | #16
On 2/5/23 6:02 PM, Dan Williams wrote:
> Prior to Linus deciding that the kernel that following v5.19 would be
> v6.0, the CXL ABI documentation already referenced v5.20. In preparation
> for updating these entries update the kernel version to v6.0.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>

> ---
>   Documentation/ABI/testing/sysfs-bus-cxl |   30 +++++++++++++++---------------
>   1 file changed, 15 insertions(+), 15 deletions(-)
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index 329a7e46c805..5be032313e29 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -198,7 +198,7 @@ Description:
>   
>   What:		/sys/bus/cxl/devices/endpointX/CDAT
>   Date:		July, 2022
> -KernelVersion:	v5.20
> +KernelVersion:	v6.0
>   Contact:	linux-cxl@vger.kernel.org
>   Description:
>   		(RO) If this sysfs entry is not present no DOE mailbox was
> @@ -209,7 +209,7 @@ Description:
>   
>   What:		/sys/bus/cxl/devices/decoderX.Y/mode
>   Date:		May, 2022
> -KernelVersion:	v5.20
> +KernelVersion:	v6.0
>   Contact:	linux-cxl@vger.kernel.org
>   Description:
>   		(RW) When a CXL decoder is of devtype "cxl_decoder_endpoint" it
> @@ -229,7 +229,7 @@ Description:
>   
>   What:		/sys/bus/cxl/devices/decoderX.Y/dpa_resource
>   Date:		May, 2022
> -KernelVersion:	v5.20
> +KernelVersion:	v6.0
>   Contact:	linux-cxl@vger.kernel.org
>   Description:
>   		(RO) When a CXL decoder is of devtype "cxl_decoder_endpoint",
> @@ -240,7 +240,7 @@ Description:
>   
>   What:		/sys/bus/cxl/devices/decoderX.Y/dpa_size
>   Date:		May, 2022
> -KernelVersion:	v5.20
> +KernelVersion:	v6.0
>   Contact:	linux-cxl@vger.kernel.org
>   Description:
>   		(RW) When a CXL decoder is of devtype "cxl_decoder_endpoint" it
> @@ -260,7 +260,7 @@ Description:
>   
>   What:		/sys/bus/cxl/devices/decoderX.Y/interleave_ways
>   Date:		May, 2022
> -KernelVersion:	v5.20
> +KernelVersion:	v6.0
>   Contact:	linux-cxl@vger.kernel.org
>   Description:
>   		(RO) The number of targets across which this decoder's host
> @@ -275,7 +275,7 @@ Description:
>   
>   What:		/sys/bus/cxl/devices/decoderX.Y/interleave_granularity
>   Date:		May, 2022
> -KernelVersion:	v5.20
> +KernelVersion:	v6.0
>   Contact:	linux-cxl@vger.kernel.org
>   Description:
>   		(RO) The number of consecutive bytes of host physical address
> @@ -287,7 +287,7 @@ Description:
>   
>   What:		/sys/bus/cxl/devices/decoderX.Y/create_pmem_region
>   Date:		May, 2022
> -KernelVersion:	v5.20
> +KernelVersion:	v6.0
>   Contact:	linux-cxl@vger.kernel.org
>   Description:
>   		(RW) Write a string in the form 'regionZ' to start the process
> @@ -303,7 +303,7 @@ Description:
>   
>   What:		/sys/bus/cxl/devices/decoderX.Y/delete_region
>   Date:		May, 2022
> -KernelVersion:	v5.20
> +KernelVersion:	v6.0
>   Contact:	linux-cxl@vger.kernel.org
>   Description:
>   		(WO) Write a string in the form 'regionZ' to delete that region,
> @@ -312,7 +312,7 @@ Description:
>   
>   What:		/sys/bus/cxl/devices/regionZ/uuid
>   Date:		May, 2022
> -KernelVersion:	v5.20
> +KernelVersion:	v6.0
>   Contact:	linux-cxl@vger.kernel.org
>   Description:
>   		(RW) Write a unique identifier for the region. This field must
> @@ -322,7 +322,7 @@ Description:
>   
>   What:		/sys/bus/cxl/devices/regionZ/interleave_granularity
>   Date:		May, 2022
> -KernelVersion:	v5.20
> +KernelVersion:	v6.0
>   Contact:	linux-cxl@vger.kernel.org
>   Description:
>   		(RW) Set the number of consecutive bytes each device in the
> @@ -333,7 +333,7 @@ Description:
>   
>   What:		/sys/bus/cxl/devices/regionZ/interleave_ways
>   Date:		May, 2022
> -KernelVersion:	v5.20
> +KernelVersion:	v6.0
>   Contact:	linux-cxl@vger.kernel.org
>   Description:
>   		(RW) Configures the number of devices participating in the
> @@ -343,7 +343,7 @@ Description:
>   
>   What:		/sys/bus/cxl/devices/regionZ/size
>   Date:		May, 2022
> -KernelVersion:	v5.20
> +KernelVersion:	v6.0
>   Contact:	linux-cxl@vger.kernel.org
>   Description:
>   		(RW) System physical address space to be consumed by the region.
> @@ -360,7 +360,7 @@ Description:
>   
>   What:		/sys/bus/cxl/devices/regionZ/resource
>   Date:		May, 2022
> -KernelVersion:	v5.20
> +KernelVersion:	v6.0
>   Contact:	linux-cxl@vger.kernel.org
>   Description:
>   		(RO) A region is a contiguous partition of a CXL root decoder
> @@ -372,7 +372,7 @@ Description:
>   
>   What:		/sys/bus/cxl/devices/regionZ/target[0..N]
>   Date:		May, 2022
> -KernelVersion:	v5.20
> +KernelVersion:	v6.0
>   Contact:	linux-cxl@vger.kernel.org
>   Description:
>   		(RW) Write an endpoint decoder object name to 'targetX' where X
> @@ -391,7 +391,7 @@ Description:
>   
>   What:		/sys/bus/cxl/devices/regionZ/commit
>   Date:		May, 2022
> -KernelVersion:	v5.20
> +KernelVersion:	v6.0
>   Contact:	linux-cxl@vger.kernel.org
>   Description:
>   		(RW) Write a boolean 'true' string value to this attribute to
>
Dan Williams Feb. 6, 2023, 9:57 p.m. UTC | #17
Gregory Price wrote:
> On Sun, Feb 05, 2023 at 05:02:56PM -0800, Dan Williams wrote:
> > Expand the region creation infrastructure to enable 'ram'
> > (volatile-memory) regions. The internals of create_pmem_region_store()
> > and create_pmem_region_show() are factored out into helpers
> > __create_region() and __create_region_show() for the 'ram' case to
> > reuse.
> > 
> > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> > ---
> >  
[..]
> > @@ -305,6 +306,13 @@ static bool can_create_pmem(struct cxl_root_decoder *cxlrd)
> >  	return (cxlrd->cxlsd.cxld.flags & flags) == flags;
> >  }
> >  
> > +static bool can_create_ram(struct cxl_root_decoder *cxlrd)
> > +{
> > +	unsigned long flags = CXL_DECODER_F_TYPE3 | CXL_DECODER_F_RAM;
> > +
> > +	return (cxlrd->cxlsd.cxld.flags & flags) == flags;
> > +}
> > +
> 
> does this collide with either CXL_DECODE_F_ENABLE or CXL_DECODER_F_AUTO?
> 
> I think obviously if it's already enabled creating new regions in a
> decoder doesn't make sense, but if F_AUTO is set, does that imply
> the region settings cannot be changed?

That just cares if the root decoder supports TYPE3 and RAM independent
of ENABLE or AUTO. Root decoders are always enabled. The AUTO flag,
which is not applicable to root decoders, is just there to hold off
userspace racing the attachment of endpoint decoders to a region until
the autodiscovery process has completed. Once that completes and the
region has been enabled then it can be destroyed to clear AUTO.
Dan Williams Feb. 6, 2023, 10:15 p.m. UTC | #18
Gregory Price wrote:
> On Sun, Feb 05, 2023 at 05:04:05PM -0800, Dan Williams wrote:
[..]
> 
> Stupid question: when defaulting to online, do these devices get placed
> into Zone Normal?  Is there a way for us, at a minimum, to online this
> as Zone Moveable in an effort to assist the "hope the kernel has not
> pinned a page" problem (and to try to keep kernel resources out of this
> zone in general).
> 
> If this is covered by a different patch or already set up this way,
> ignore me :]

Have a look in Documentation/admin-guide/mm/memory-hotplug.rst, the
'daxctl recconfigure-device' man page and the the policy options of how
hot-added memory is brought online.

The routing can be anything from fully offline device-dax, to fully
online ZONE_NORMAL, or even a mix of device-dax subdivision,
ZONE_NORMAL, and ZONE_MOVABLE memblocks all within one CXL ram region.
Dave Jiang Feb. 6, 2023, 10:31 p.m. UTC | #19
On 2/5/23 6:02 PM, Dan Williams wrote:
> Expand the region creation infrastructure to enable 'ram'
> (volatile-memory) regions. The internals of create_pmem_region_store()
> and create_pmem_region_show() are factored out into helpers
> __create_region() and __create_region_show() for the 'ram' case to
> reuse.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Minor comment below, otherwise:
Reviewed-by: Dave Jiang <dave.jiang@intel.com>

> ---
>   Documentation/ABI/testing/sysfs-bus-cxl |   22 +++++-----
>   drivers/cxl/core/core.h                 |    1
>   drivers/cxl/core/port.c                 |   14 ++++++
>   drivers/cxl/core/region.c               |   71 +++++++++++++++++++++++++------
>   4 files changed, 82 insertions(+), 26 deletions(-)
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index 4c4e1cbb1169..3acf2f17a73f 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -285,20 +285,20 @@ Description:
>   		interleave_granularity).
>   
>   
> -What:		/sys/bus/cxl/devices/decoderX.Y/create_pmem_region
> -Date:		May, 2022
> -KernelVersion:	v6.0
> +What:		/sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram}_region
> +Date:		May, 2022, January, 2023
> +KernelVersion:	v6.0 (pmem), v6.3 (ram)
>   Contact:	linux-cxl@vger.kernel.org
>   Description:
>   		(RW) Write a string in the form 'regionZ' to start the process
> -		of defining a new persistent memory region (interleave-set)
> -		within the decode range bounded by root decoder 'decoderX.Y'.
> -		The value written must match the current value returned from
> -		reading this attribute. An atomic compare exchange operation is
> -		done on write to assign the requested id to a region and
> -		allocate the region-id for the next creation attempt. EBUSY is
> -		returned if the region name written does not match the current
> -		cached value.
> +		of defining a new persistent, or volatile memory region
> +		(interleave-set) within the decode range bounded by root decoder
> +		'decoderX.Y'. The value written must match the current value
> +		returned from reading this attribute. An atomic compare exchange
> +		operation is done on write to assign the requested id to a
> +		region and allocate the region-id for the next creation attempt.
> +		EBUSY is returned if the region name written does not match the

-EBUSY?

DJ
> +		current cached value.
>   
>   
>   What:		/sys/bus/cxl/devices/decoderX.Y/delete_region
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 8c04672dca56..5eb873da5a30 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -11,6 +11,7 @@ extern struct attribute_group cxl_base_attribute_group;
>   
>   #ifdef CONFIG_CXL_REGION
>   extern struct device_attribute dev_attr_create_pmem_region;
> +extern struct device_attribute dev_attr_create_ram_region;
>   extern struct device_attribute dev_attr_delete_region;
>   extern struct device_attribute dev_attr_region;
>   extern const struct device_type cxl_pmem_region_type;
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 8566451cb22f..47e450c3a5a9 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -294,6 +294,7 @@ static struct attribute *cxl_decoder_root_attrs[] = {
>   	&dev_attr_cap_type3.attr,
>   	&dev_attr_target_list.attr,
>   	SET_CXL_REGION_ATTR(create_pmem_region)
> +	SET_CXL_REGION_ATTR(create_ram_region)
>   	SET_CXL_REGION_ATTR(delete_region)
>   	NULL,
>   };
> @@ -305,6 +306,13 @@ static bool can_create_pmem(struct cxl_root_decoder *cxlrd)
>   	return (cxlrd->cxlsd.cxld.flags & flags) == flags;
>   }
>   
> +static bool can_create_ram(struct cxl_root_decoder *cxlrd)
> +{
> +	unsigned long flags = CXL_DECODER_F_TYPE3 | CXL_DECODER_F_RAM;
> +
> +	return (cxlrd->cxlsd.cxld.flags & flags) == flags;
> +}
> +
>   static umode_t cxl_root_decoder_visible(struct kobject *kobj, struct attribute *a, int n)
>   {
>   	struct device *dev = kobj_to_dev(kobj);
> @@ -313,7 +321,11 @@ static umode_t cxl_root_decoder_visible(struct kobject *kobj, struct attribute *
>   	if (a == CXL_REGION_ATTR(create_pmem_region) && !can_create_pmem(cxlrd))
>   		return 0;
>   
> -	if (a == CXL_REGION_ATTR(delete_region) && !can_create_pmem(cxlrd))
> +	if (a == CXL_REGION_ATTR(create_ram_region) && !can_create_ram(cxlrd))
> +		return 0;
> +
> +	if (a == CXL_REGION_ATTR(delete_region) &&
> +	    !(can_create_pmem(cxlrd) || can_create_ram(cxlrd)))
>   		return 0;
>   
>   	return a->mode;
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 53d6dbe4de6d..8dea49c021b8 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -1685,6 +1685,15 @@ static struct cxl_region *devm_cxl_add_region(struct cxl_root_decoder *cxlrd,
>   	struct device *dev;
>   	int rc;
>   
> +	switch (mode) {
> +	case CXL_DECODER_RAM:
> +	case CXL_DECODER_PMEM:
> +		break;
> +	default:
> +		dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %d\n", mode);
> +		return ERR_PTR(-EINVAL);
> +	}
> +
>   	cxlr = cxl_region_alloc(cxlrd, id);
>   	if (IS_ERR(cxlr))
>   		return cxlr;
> @@ -1713,12 +1722,38 @@ static struct cxl_region *devm_cxl_add_region(struct cxl_root_decoder *cxlrd,
>   	return ERR_PTR(rc);
>   }
>   
> +static ssize_t __create_region_show(struct cxl_root_decoder *cxlrd, char *buf)
> +{
> +	return sysfs_emit(buf, "region%u\n", atomic_read(&cxlrd->region_id));
> +}
> +
>   static ssize_t create_pmem_region_show(struct device *dev,
>   				       struct device_attribute *attr, char *buf)
>   {
> -	struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(dev);
> +	return __create_region_show(to_cxl_root_decoder(dev), buf);
> +}
>   
> -	return sysfs_emit(buf, "region%u\n", atomic_read(&cxlrd->region_id));
> +static ssize_t create_ram_region_show(struct device *dev,
> +				      struct device_attribute *attr, char *buf)
> +{
> +	return __create_region_show(to_cxl_root_decoder(dev), buf);
> +}
> +
> +static struct cxl_region *__create_region(struct cxl_root_decoder *cxlrd,
> +					  enum cxl_decoder_mode mode, int id)
> +{
> +	int rc;
> +
> +	rc = memregion_alloc(GFP_KERNEL);
> +	if (rc < 0)
> +		return ERR_PTR(rc);
> +
> +	if (atomic_cmpxchg(&cxlrd->region_id, id, rc) != id) {
> +		memregion_free(rc);
> +		return ERR_PTR(-EBUSY);
> +	}
> +
> +	return devm_cxl_add_region(cxlrd, id, mode, CXL_DECODER_EXPANDER);
>   }
>   
>   static ssize_t create_pmem_region_store(struct device *dev,
> @@ -1727,29 +1762,37 @@ static ssize_t create_pmem_region_store(struct device *dev,
>   {
>   	struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(dev);
>   	struct cxl_region *cxlr;
> -	int id, rc;
> +	int rc, id;
>   
>   	rc = sscanf(buf, "region%d\n", &id);
>   	if (rc != 1)
>   		return -EINVAL;
>   
> -	rc = memregion_alloc(GFP_KERNEL);
> -	if (rc < 0)
> -		return rc;
> +	cxlr = __create_region(cxlrd, CXL_DECODER_PMEM, id);
> +	if (IS_ERR(cxlr))
> +		return PTR_ERR(cxlr);
> +	return len;
> +}
> +DEVICE_ATTR_RW(create_pmem_region);
>   
> -	if (atomic_cmpxchg(&cxlrd->region_id, id, rc) != id) {
> -		memregion_free(rc);
> -		return -EBUSY;
> -	}
> +static ssize_t create_ram_region_store(struct device *dev,
> +				       struct device_attribute *attr,
> +				       const char *buf, size_t len)
> +{
> +	struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(dev);
> +	struct cxl_region *cxlr;
> +	int rc, id;
>   
> -	cxlr = devm_cxl_add_region(cxlrd, id, CXL_DECODER_PMEM,
> -				   CXL_DECODER_EXPANDER);
> +	rc = sscanf(buf, "region%d\n", &id);
> +	if (rc != 1)
> +		return -EINVAL;
> +
> +	cxlr = __create_region(cxlrd, CXL_DECODER_RAM, id);
>   	if (IS_ERR(cxlr))
>   		return PTR_ERR(cxlr);
> -
>   	return len;
>   }
> -DEVICE_ATTR_RW(create_pmem_region);
> +DEVICE_ATTR_RW(create_ram_region);
>   
>   static ssize_t region_show(struct device *dev, struct device_attribute *attr,
>   			   char *buf)
>
Dan Williams Feb. 6, 2023, 10:37 p.m. UTC | #20
Dave Jiang wrote:
> 
> 
> On 2/5/23 6:02 PM, Dan Williams wrote:
> > Expand the region creation infrastructure to enable 'ram'
> > (volatile-memory) regions. The internals of create_pmem_region_store()
> > and create_pmem_region_show() are factored out into helpers
> > __create_region() and __create_region_show() for the 'ram' case to
> > reuse.
> > 
> > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> 
> Minor comment below, otherwise:
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> 
[..]
> > diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> > index 4c4e1cbb1169..3acf2f17a73f 100644
> > --- a/Documentation/ABI/testing/sysfs-bus-cxl
> > +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> > @@ -285,20 +285,20 @@ Description:
> >   		interleave_granularity).
> >   
> >   
> > -What:		/sys/bus/cxl/devices/decoderX.Y/create_pmem_region
> > -Date:		May, 2022
> > -KernelVersion:	v6.0
> > +What:		/sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram}_region
> > +Date:		May, 2022, January, 2023
> > +KernelVersion:	v6.0 (pmem), v6.3 (ram)
> >   Contact:	linux-cxl@vger.kernel.org
> >   Description:
> >   		(RW) Write a string in the form 'regionZ' to start the process
> > -		of defining a new persistent memory region (interleave-set)
> > -		within the decode range bounded by root decoder 'decoderX.Y'.
> > -		The value written must match the current value returned from
> > -		reading this attribute. An atomic compare exchange operation is
> > -		done on write to assign the requested id to a region and
> > -		allocate the region-id for the next creation attempt. EBUSY is
> > -		returned if the region name written does not match the current
> > -		cached value.
> > +		of defining a new persistent, or volatile memory region
> > +		(interleave-set) within the decode range bounded by root decoder
> > +		'decoderX.Y'. The value written must match the current value
> > +		returned from reading this attribute. An atomic compare exchange
> > +		operation is done on write to assign the requested id to a
> > +		region and allocate the region-id for the next creation attempt.
> > +		EBUSY is returned if the region name written does not match the
> 
> -EBUSY?

Userspace errno values are positive. So "$?" after "echo $region >
create_pmem_region" will be 16 not -16 in the busy case.
Dan Williams Feb. 6, 2023, 11:20 p.m. UTC | #21
Gregory Price wrote:
> On Mon, Feb 06, 2023 at 02:15:56PM -0800, Dan Williams wrote:
> > Gregory Price wrote:
> > > On Sun, Feb 05, 2023 at 05:04:05PM -0800, Dan Williams wrote:
> > [..]
> > > 
> > > Stupid question: when defaulting to online, do these devices get placed
> > > into Zone Normal?  Is there a way for us, at a minimum, to online this
> > > as Zone Moveable in an effort to assist the "hope the kernel has not
> > > pinned a page" problem (and to try to keep kernel resources out of this
> > > zone in general).
> > > 
> > > If this is covered by a different patch or already set up this way,
> > > ignore me :]
> > 
> > Have a look in Documentation/admin-guide/mm/memory-hotplug.rst, the
> > 'daxctl recconfigure-device' man page and the the policy options of how
> > hot-added memory is brought online.
> > 
> > The routing can be anything from fully offline device-dax, to fully
> > online ZONE_NORMAL, or even a mix of device-dax subdivision,
> > ZONE_NORMAL, and ZONE_MOVABLE memblocks all within one CXL ram region.
> 
> Hm.
> 
> I'm just thinking, for early expander devices the assumption that the
> default behavior should be auto-online is good.  Just kinda working
> through this in my head for multi-headed devices and early pools, and I
> suppose the kernel paremeter covers that. Most of those will be used in
> specialty, vertically integrated systems, so we're good to go.
> 
> In the future when we get fully featured DCD's, I imagine we will have a
> memory region of size X, with N memory blocks, but only some of those
> blocks are online.  I suppose it could also be done such that DCD capacity
> add events drive the creation of new blocks under the region, as opposed
> to having the region pre-create the blocks.

In the case of DCD the proposal is "sparse" dax regions. So there will
be nothing to online by default until extents start arriving. I do
expect DCD systems will opt for memhp_default_state=offline so that
userspace policy can be applied.

> Either way I think the kernel parameter probably covers this case as
> well, I'm just wondering if at some point the default will yet again
> wish to be false, but i think that's a "CXL has become so successful all
> memory is now CXL DCDs and computers look radically different" scenario.

It's a distro policy. Fedora and RHEL, for example, pick different
defaults.
Dave Jiang Feb. 6, 2023, 11:57 p.m. UTC | #22
On 2/5/23 6:03 PM, Dan Williams wrote:
> Add help text and a label so the CXL_REGION config option can be
> toggled. This is mainly to enable compile testing without region
> support.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Reviewed-by: Dave Jiang <dave.jiang@intel.com>

> ---
>   drivers/cxl/Kconfig |   12 +++++++++++-
>   1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> index 0ac53c422c31..163c094e67ae 100644
> --- a/drivers/cxl/Kconfig
> +++ b/drivers/cxl/Kconfig
> @@ -104,12 +104,22 @@ config CXL_SUSPEND
>   	depends on SUSPEND && CXL_MEM
>   
>   config CXL_REGION
> -	bool
> +	bool "CXL: Region Support"
>   	default CXL_BUS
>   	# For MAX_PHYSMEM_BITS
>   	depends on SPARSEMEM
>   	select MEMREGION
>   	select GET_FREE_REGION
> +	help
> +	  Enable the CXL core to enumerate and provision CXL regions. A CXL
> +	  region is defined by one or more CXL expanders that decode a given
> +	  system-physical address range. For CXL regions established by
> +	  platform-firmware this option enables memory error handling to
> +	  identify the devices participating in a given interleaved memory
> +	  range. Otherwise, platform-firmware managed CXL is enabled by being
> +	  placed in the system address map and does not need a driver.
> +
> +	  If unsure say 'y'
>   
>   config CXL_REGION_INVALIDATION_TEST
>   	bool "CXL: Region Cache Management Bypass (TEST)"
>
Dave Jiang Feb. 7, 2023, 11:54 p.m. UTC | #23
On 2/5/23 6:03 PM, Dan Williams wrote:
> Region autodiscovery is an asynchrounous state machine advanced by
> cxl_port_probe(). After the decoders on an endpoint port are enumerated
> they are scanned for actively enabled instances. Each active decoder is
> flagged for auto-assembly CXL_DECODER_F_AUTO and attached to a region.
> If a region does not already exist for the address range setting of the
> decoder one is created. That creation process may race with other
> decoders of the same region being discovered since cxl_port_probe() is
> asynchronous. A new 'struct cxl_root_decoder' lock, @range_lock, is
> introduced to mitigate that race.
> 
> Once all decoders have arrived, "p->nr_targets == p->interleave_ways",
> they are sorted by their relative decode position. The sort algorithm
> involves finding the point in the cxl_port topology where one leg of the
> decode leads to deviceA and the other deviceB. At that point in the
> topology the target order in the 'struct cxl_switch_decoder' indicates
> the relative position of those endpoint decoders in the region.
> 
>>From that point the region goes through the same setup and validation
> steps as user-created regions, but instead of programming the decoders
> it validates that driver would have written the same values to the
> decoders as were already present.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

May want to run checkpatch on this one.

DJ

> ---
>   drivers/cxl/core/hdm.c    |    5
>   drivers/cxl/core/port.c   |    2
>   drivers/cxl/core/region.c |  496 ++++++++++++++++++++++++++++++++++++++++++++-
>   drivers/cxl/cxl.h         |   16 +
>   drivers/cxl/port.c        |   26 ++
>   5 files changed, 531 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index dcc16d7cb8f3..174cddfec6e8 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -674,7 +674,7 @@ static int cxl_decoder_reset(struct cxl_decoder *cxld)
>   	up_read(&cxl_dpa_rwsem);
>   
>   	port->commit_end--;
> -	cxld->flags &= ~CXL_DECODER_F_ENABLE;
> +	cxld->flags &= ~(CXL_DECODER_F_ENABLE | CXL_DECODER_F_AUTO);
>   
>   	return 0;
>   }
> @@ -719,7 +719,7 @@ static int init_hdm_decoder(struct cxl_port *port, struct cxl_decoder *cxld,
>   
>   	/* decoders are enabled if committed */
>   	if (committed) {
> -		cxld->flags |= CXL_DECODER_F_ENABLE;
> +		cxld->flags |= CXL_DECODER_F_ENABLE | CXL_DECODER_F_AUTO;
>   		if (ctrl & CXL_HDM_DECODER0_CTRL_LOCK)
>   			cxld->flags |= CXL_DECODER_F_LOCK;
>   		if (FIELD_GET(CXL_HDM_DECODER0_CTRL_TYPE, ctrl))
> @@ -783,6 +783,7 @@ static int init_hdm_decoder(struct cxl_port *port, struct cxl_decoder *cxld,
>   		return rc;
>   	}
>   	*dpa_base += dpa_size + skip;
> +
>   	return 0;
>   }
>   
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 47e450c3a5a9..8130430ffbcf 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -446,6 +446,7 @@ bool is_endpoint_decoder(struct device *dev)
>   {
>   	return dev->type == &cxl_decoder_endpoint_type;
>   }
> +EXPORT_SYMBOL_NS_GPL(is_endpoint_decoder, CXL);
>   
>   bool is_root_decoder(struct device *dev)
>   {
> @@ -1622,6 +1623,7 @@ struct cxl_root_decoder *cxl_root_decoder_alloc(struct cxl_port *port,
>   	}
>   
>   	cxlrd->calc_hb = calc_hb;
> +	mutex_init(&cxlrd->range_lock);
>   
>   	cxld = &cxlsd->cxld;
>   	cxld->dev.type = &cxl_decoder_root_type;
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 34cf95217901..6fe8c70790df 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -6,6 +6,7 @@
>   #include <linux/module.h>
>   #include <linux/slab.h>
>   #include <linux/uuid.h>
> +#include <linux/sort.h>
>   #include <linux/idr.h>
>   #include <cxlmem.h>
>   #include <cxl.h>
> @@ -520,7 +521,12 @@ static void cxl_region_iomem_release(struct cxl_region *cxlr)
>   	if (device_is_registered(&cxlr->dev))
>   		lockdep_assert_held_write(&cxl_region_rwsem);
>   	if (p->res) {
> -		remove_resource(p->res);
> +		/*
> +		 * Autodiscovered regions may not have been able to insert their
> +		 * resource.
> +		 */
> +		if (p->res->parent)
> +			remove_resource(p->res);
>   		kfree(p->res);
>   		p->res = NULL;
>   	}
> @@ -1101,12 +1107,35 @@ static int cxl_port_setup_targets(struct cxl_port *port,
>   		return rc;
>   	}
>   
> -	cxld->interleave_ways = iw;
> -	cxld->interleave_granularity = ig;
> -	cxld->hpa_range = (struct range) {
> -		.start = p->res->start,
> -		.end = p->res->end,
> -	};
> +	if (test_bit(CXL_REGION_F_AUTO, &cxlr->flags)) {
> +		if (cxld->interleave_ways != iw ||
> +		    cxld->interleave_granularity != ig ||
> +		    cxld->hpa_range.start != p->res->start ||
> +		    cxld->hpa_range.end != p->res->end ||
> +		    ((cxld->flags & CXL_DECODER_F_ENABLE) == 0)) {
> +			dev_err(&cxlr->dev,
> +				"%s:%s %s expected iw: %d ig: %d %pr\n",
> +				dev_name(port->uport), dev_name(&port->dev),
> +				__func__, iw, ig, p->res);
> +			dev_err(&cxlr->dev,
> +				"%s:%s %s got iw: %d ig: %d state: %s %#llx:%#llx\n",
> +				dev_name(port->uport), dev_name(&port->dev),
> +				__func__, cxld->interleave_ways,
> +				cxld->interleave_granularity,
> +				(cxld->flags & CXL_DECODER_F_ENABLE) ?
> +					"enabled" :
> +					"disabled",
> +				cxld->hpa_range.start, cxld->hpa_range.end);
> +			return -ENXIO;
> +		}
> +	} else {
> +		cxld->interleave_ways = iw;
> +		cxld->interleave_granularity = ig;
> +		cxld->hpa_range = (struct range) {
> +			.start = p->res->start,
> +			.end = p->res->end,
> +		};
> +	}
>   	dev_dbg(&cxlr->dev, "%s:%s iw: %d ig: %d\n", dev_name(port->uport),
>   		dev_name(&port->dev), iw, ig);
>   add_target:
> @@ -1117,7 +1146,17 @@ static int cxl_port_setup_targets(struct cxl_port *port,
>   			dev_name(&cxlmd->dev), dev_name(&cxled->cxld.dev), pos);
>   		return -ENXIO;
>   	}
> -	cxlsd->target[cxl_rr->nr_targets_set] = ep->dport;
> +	if (test_bit(CXL_REGION_F_AUTO, &cxlr->flags)) {
> +		if (cxlsd->target[cxl_rr->nr_targets_set] != ep->dport) {
> +			dev_dbg(&cxlr->dev, "%s:%s: %s expected %s at %d\n",
> +				dev_name(port->uport), dev_name(&port->dev),
> +				dev_name(&cxlsd->cxld.dev),
> +				dev_name(ep->dport->dport),
> +				cxl_rr->nr_targets_set);
> +			return -ENXIO;
> +		}
> +	} else
> +		cxlsd->target[cxl_rr->nr_targets_set] = ep->dport;
>   	inc = 1;
>   out_target_set:
>   	cxl_rr->nr_targets_set += inc;
> @@ -1159,6 +1198,13 @@ static void cxl_region_teardown_targets(struct cxl_region *cxlr)
>   	struct cxl_ep *ep;
>   	int i;
>   
> +	/*
> +	 * In the auto-discovery case skip automatic teardown since the
> +	 * address space is already active
> +	 */
> +	if (test_bit(CXL_REGION_F_AUTO, &cxlr->flags))
> +		return;
> +
>   	for (i = 0; i < p->nr_targets; i++) {
>   		cxled = p->targets[i];
>   		cxlmd = cxled_to_memdev(cxled);
> @@ -1191,8 +1237,8 @@ static int cxl_region_setup_targets(struct cxl_region *cxlr)
>   			iter = to_cxl_port(iter->dev.parent);
>   
>   		/*
> -		 * Descend the topology tree programming targets while
> -		 * looking for conflicts.
> +		 * Descend the topology tree programming / validating
> +		 * targets while looking for conflicts.
>   		 */
>   		for (ep = cxl_ep_load(iter, cxlmd); iter;
>   		     iter = ep->next, ep = cxl_ep_load(iter, cxlmd)) {
> @@ -1287,6 +1333,182 @@ static int cxl_region_attach_position(struct cxl_region *cxlr,
>   	return rc;
>   }
>   
> +static int cxl_region_attach_auto(struct cxl_region *cxlr,
> +				  struct cxl_endpoint_decoder *cxled, int pos)
> +{
> +	struct cxl_region_params *p = &cxlr->params;
> +
> +	if (!(cxled->cxld.flags & CXL_DECODER_F_AUTO)) {
> +		dev_err(&cxlr->dev,
> +			"%s: unable to add decoder to autodetected region\n",
> +			dev_name(&cxled->cxld.dev));
> +		return -EINVAL;
> +	}
> +
> +	if (pos >= 0) {
> +		dev_dbg(&cxlr->dev, "%s: expected auto position, not %d\n",
> +			dev_name(&cxled->cxld.dev), pos);
> +		return -EINVAL;
> +	}
> +
> +	if (p->nr_targets >= p->interleave_ways) {
> +		dev_err(&cxlr->dev, "%s: no more target slots available\n",
> +			dev_name(&cxled->cxld.dev));
> +		return -ENXIO;
> +	}
> +
> +	/*
> +	 * Temporarily record the endpoint decoder into the target array. Yes,
> +	 * this means that userspace can view devices in the wrong position
> +	 * before the region activates, and must be careful to understand when
> +	 * it might be racing region autodiscovery.
> +	 */
> +	pos = p->nr_targets;
> +	p->targets[pos] = cxled;
> +	cxled->pos = pos;
> +	p->nr_targets++;
> +
> +	return 0;
> +}
> +
> +static struct cxl_port *next_port(struct cxl_port *port)
> +{
> +	if (!port->parent_dport)
> +		return NULL;
> +	return port->parent_dport->port;
> +}
> +
> +static int decoder_match_range(struct device *dev, void *data)
> +{
> +	struct cxl_endpoint_decoder *cxled = data;
> +	struct cxl_switch_decoder *cxlsd;
> +
> +	if (!is_switch_decoder(dev))
> +		return 0;
> +
> +	cxlsd = to_cxl_switch_decoder(dev);
> +	return range_contains(&cxlsd->cxld.hpa_range, &cxled->cxld.hpa_range);
> +}
> +
> +static void find_positions(const struct cxl_switch_decoder *cxlsd,
> +			   const struct cxl_port *iter_a,
> +			   const struct cxl_port *iter_b, int *a_pos,
> +			   int *b_pos)
> +{
> +	int i;
> +
> +	for (i = 0, *a_pos = -1, *b_pos = -1; i < cxlsd->nr_targets; i++) {
> +		if (cxlsd->target[i] == iter_a->parent_dport)
> +			*a_pos = i;
> +		else if (cxlsd->target[i] == iter_b->parent_dport)
> +			*b_pos = i;
> +		if (*a_pos >= 0 && *b_pos >= 0)
> +			break;
> +	}
> +}
> +
> +static int cmp_decode_pos(const void *a, const void *b)
> +{
> +	struct cxl_endpoint_decoder *cxled_a = *(typeof(cxled_a) *)a;
> +	struct cxl_endpoint_decoder *cxled_b = *(typeof(cxled_b) *)b;
> +	struct cxl_memdev *cxlmd_a = cxled_to_memdev(cxled_a);
> +	struct cxl_memdev *cxlmd_b = cxled_to_memdev(cxled_b);
> +	struct cxl_port *port_a = cxled_to_port(cxled_a);
> +	struct cxl_port *port_b = cxled_to_port(cxled_b);
> +	struct cxl_port *iter_a, *iter_b;
> +
> +	/* Exit early if any prior sorting failed */
> +	if (cxled_a->pos < 0 || cxled_b->pos < 0)
> +		return 0;
> +
> +	/*
> +	 * Walk up the hierarchy to find a shared port, find the decoder
> +	 * that maps the range, compare the relative position of those
> +	 * dport mappings.
> +	 */
> +	for (iter_a = port_a; iter_a; iter_a = next_port(iter_a)) {
> +		struct cxl_port *next_a, *next_b, *port;
> +		struct cxl_switch_decoder *cxlsd;
> +
> +		next_a = next_port(iter_a);
> +		for (iter_b = port_b; iter_b; iter_b = next_port(iter_b)) {
> +			int a_pos, b_pos, result;
> +			struct device *dev;
> +			unsigned int seq;
> +
> +			next_b = next_port(iter_b);
> +			if (next_a != next_b)
> +				continue;
> +			if (!next_a)
> +				goto out;
> +			port = next_a;
> +			dev = device_find_child(&port->dev, cxled_a,
> +						decoder_match_range);
> +			if (!dev) {
> +				struct range *range = &cxled_a->cxld.hpa_range;
> +
> +				dev_err(port->uport,
> +					"failed to find decoder that maps %#llx-:%#llx\n",
> +					range->start, range->end);
> +				cxled_a->pos = -1;
> +				return 0;
> +			}
> +
> +			cxlsd = to_cxl_switch_decoder(dev);
> +			do {
> +				seq = read_seqbegin(&cxlsd->target_lock);
> +				find_positions(cxlsd, iter_a, iter_b, &a_pos,
> +					       &b_pos);
> +			} while (read_seqretry(&cxlsd->target_lock, seq));
> +
> +			if (a_pos < 0 || b_pos < 0) {
> +				dev_err(port->uport,
> +					"failed to find shared decoder for %s and %s\n",
> +					dev_name(cxlmd_a->dev.parent),
> +					dev_name(cxlmd_b->dev.parent));
> +				cxled_a->pos = -1;
> +				result = 0;
> +			} else {
> +				result = a_pos - b_pos;
> +				dev_dbg(port->uport, "%s: %s comes %s %s\n",
> +					dev_name(&cxlsd->cxld.dev),
> +					dev_name(cxlmd_a->dev.parent),
> +					result < 0 ? "before" : "after",
> +					dev_name(cxlmd_b->dev.parent));
> +			}
> +
> +			put_device(dev);
> +
> +			return result;
> +		}
> +	}
> +out:
> +	dev_err(cxlmd_a->dev.parent, "failed to find shared port with %s\n",
> +		dev_name(cxlmd_b->dev.parent));
> +	cxled_a->pos = -1;
> +	return 0;
> +}
> +
> +static int cxl_region_sort_targets(struct cxl_region *cxlr)
> +{
> +	struct cxl_region_params *p = &cxlr->params;
> +	int i, rc = 0;
> +
> +	sort(p->targets, p->nr_targets, sizeof(p->targets[0]), cmp_decode_pos,
> +	     NULL);
> +
> +	for (i = 0; i < p->nr_targets; i++) {
> +		struct cxl_endpoint_decoder *cxled = p->targets[i];
> +
> +		if (cxled->pos < 0)
> +			rc = -ENXIO;
> +		cxled->pos = i;
> +	}
> +
> +	dev_dbg(&cxlr->dev, "region sort %s\n", rc ? "failed" : "successful");
> +	return rc;
> +}
> +
>   static int cxl_region_attach(struct cxl_region *cxlr,
>   			     struct cxl_endpoint_decoder *cxled, int pos)
>   {
> @@ -1350,6 +1572,50 @@ static int cxl_region_attach(struct cxl_region *cxlr,
>   		return -EINVAL;
>   	}
>   
> +	if (test_bit(CXL_REGION_F_AUTO, &cxlr->flags)) {
> +		int i;
> +
> +		rc = cxl_region_attach_auto(cxlr, cxled, pos);
> +		if (rc)
> +			return rc;
> +
> +		/* await more targets to arrive... */
> +		if (p->nr_targets < p->interleave_ways)
> +			return 0;
> +
> +		/*
> +		 * All targets are here, which implies all PCI enumeration that
> +		 * affects this region has been completed. Walk the topology to
> +		 * sort the devices into their relative region decode position.
> +		 */
> +		rc = cxl_region_sort_targets(cxlr);
> +		if (rc)
> +			return rc;
> +
> +		for (i = 0; i < p->nr_targets; i++) {
> +			cxled = p->targets[i];
> +			ep_port = cxled_to_port(cxled);
> +			dport = cxl_find_dport_by_dev(root_port,
> +						      ep_port->host_bridge);
> +			rc = cxl_region_attach_position(cxlr, cxlrd, cxled,
> +							dport, i);
> +			if (rc)
> +				return rc;
> +		}
> +
> +		rc = cxl_region_setup_targets(cxlr);
> +		if (rc)
> +			return rc;
> +
> +		/*
> +		 * If target setup succeeds in the autodiscovery case
> +		 * then the region is already committed.
> +		 */
> +		p->state = CXL_CONFIG_COMMIT;
> +
> +		return 0;
> +	}
> +
>   	rc = cxl_region_validate_position(cxlr, cxled, pos);
>   	if (rc)
>   		return rc;
> @@ -1500,8 +1766,8 @@ static int detach_target(struct cxl_region *cxlr, int pos)
>   	return rc;
>   }
>   
> -static size_t store_targetN(struct cxl_region *cxlr, const char *buf, int pos,
> -			    size_t len)
> +static ssize_t store_targetN(struct cxl_region *cxlr, const char *buf, int pos,
> +			     size_t len)
>   {
>   	int rc;
>   
> @@ -2079,6 +2345,191 @@ static int devm_cxl_add_pmem_region(struct cxl_region *cxlr)
>   	return rc;
>   }
>   
> +static int match_decoder_by_range(struct device *dev, void *data)
> +{
> +	struct range *r1, *r2 = data;
> +	struct cxl_root_decoder *cxlrd;
> +
> +	if (!is_root_decoder(dev))
> +		return 0;
> +
> +	cxlrd = to_cxl_root_decoder(dev);
> +	r1 = &cxlrd->cxlsd.cxld.hpa_range;
> +	return range_contains(r1, r2);
> +}
> +
> +static int match_region_by_range(struct device *dev, void *data)
> +{
> +	struct cxl_region_params *p;
> +	struct cxl_region *cxlr;
> +	struct range *r = data;
> +	int rc = 0;
> +
> +	if (!is_cxl_region(dev))
> +		return 0;
> +
> +	cxlr = to_cxl_region(dev);
> +	p = &cxlr->params;
> +
> +	down_read(&cxl_region_rwsem);
> +	if (p->res && p->res->start == r->start && p->res->end == r->end)
> +		rc = 1;
> +	up_read(&cxl_region_rwsem);
> +
> +	return rc;
> +}
> +
> +/* Establish an empty region covering the given HPA range */
> +static struct cxl_region *construct_region(struct cxl_root_decoder *cxlrd,
> +					   struct cxl_endpoint_decoder *cxled)
> +{
> +	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> +	struct cxl_port *port = cxlrd_to_port(cxlrd);
> +	struct range *hpa = &cxled->cxld.hpa_range;
> +	struct cxl_region_params *p;
> +	struct cxl_region *cxlr;
> +	struct resource *res;
> +	int rc;
> +
> +	do {
> +		cxlr = __create_region(cxlrd, cxled->mode,
> +				       atomic_read(&cxlrd->region_id));
> +	} while (IS_ERR(cxlr) && PTR_ERR(cxlr) == -EBUSY);
> +
> +	if (IS_ERR(cxlr)) {
> +		dev_err(cxlmd->dev.parent,
> +			"%s:%s: %s failed assign region: %ld\n",
> +			dev_name(&cxlmd->dev), dev_name(&cxled->cxld.dev),
> +			__func__, PTR_ERR(cxlr));
> +		return cxlr;
> +	}
> +
> +	down_write(&cxl_region_rwsem);
> +	p = &cxlr->params;
> +	if (p->state >= CXL_CONFIG_INTERLEAVE_ACTIVE) {
> +		dev_err(cxlmd->dev.parent,
> +			"%s:%s: %s autodiscovery interrupted\n",
> +			dev_name(&cxlmd->dev), dev_name(&cxled->cxld.dev),
> +			__func__);
> +		rc = -EBUSY;
> +		goto err;
> +	}
> +
> +	set_bit(CXL_REGION_F_AUTO, &cxlr->flags);
> +
> +	res = kmalloc(sizeof(*res), GFP_KERNEL);
> +	if (!res) {
> +		rc = -ENOMEM;
> +		goto err;
> +	}
> +
> +	*res = DEFINE_RES_MEM_NAMED(hpa->start, range_len(hpa),
> +				    dev_name(&cxlr->dev));
> +	rc = insert_resource(cxlrd->res, res);
> +	if (rc) {
> +		/*
> +		 * Platform-firmware may not have split resources like "System
> +		 * RAM" on CXL window boundaries see cxl_region_iomem_release()
> +		 */
> +		dev_warn(cxlmd->dev.parent,
> +			 "%s:%s: %s %s cannot insert resource\n",
> +			 dev_name(&cxlmd->dev), dev_name(&cxled->cxld.dev),
> +			 __func__, dev_name(&cxlr->dev));
> +	}
> +
> +	p->res = res;
> +	p->interleave_ways = cxled->cxld.interleave_ways;
> +	p->interleave_granularity = cxled->cxld.interleave_granularity;
> +	p->state = CXL_CONFIG_INTERLEAVE_ACTIVE;
> +
> +	rc = sysfs_update_group(&cxlr->dev.kobj, get_cxl_region_target_group());
> +	if (rc)
> +		goto err;
> +
> +	dev_dbg(cxlmd->dev.parent, "%s:%s: %s %s res: %pr iw: %d ig: %d\n",
> +		dev_name(&cxlmd->dev), dev_name(&cxled->cxld.dev), __func__,
> +		dev_name(&cxlr->dev), p->res, p->interleave_ways,
> +		p->interleave_granularity);
> +
> +	/* ...to match put_device() in cxl_add_to_region() */
> +	get_device(&cxlr->dev);
> +	up_write(&cxl_region_rwsem);
> +
> +	return cxlr;
> +
> +err:
> +	up_write(&cxl_region_rwsem);
> +	devm_release_action(port->uport, unregister_region, cxlr);
> +	return ERR_PTR(rc);
> +}
> +
> +void cxl_add_to_region(struct cxl_endpoint_decoder *cxled)
> +{
> +	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> +	struct range *hpa = &cxled->cxld.hpa_range;
> +	struct cxl_decoder *cxld = &cxled->cxld;
> +	struct cxl_root_decoder *cxlrd;
> +	struct cxl_region_params *p;
> +	struct cxl_region *cxlr;
> +	struct cxl_port *root;
> +	bool attach = false;
> +	struct device *dev;
> +
> +	root = find_cxl_root(&cxlmd->dev);
> +	if (!root) {
> +		dev_err(cxlmd->dev.parent, "%s: failed to map CXL root\n",
> +			dev_name(&cxlmd->dev));
> +		return;
> +	}
> +
> +	dev = device_find_child(&root->dev, &cxld->hpa_range,
> +				  match_decoder_by_range);
> +	if (!dev) {
> +		dev_err(cxlmd->dev.parent,
> +			"%s:%s no CXL window for range %#llx:%#llx\n",
> +			dev_name(&cxlmd->dev), dev_name(&cxld->dev),
> +			cxld->hpa_range.start, cxld->hpa_range.end);
> +		goto out;
> +	}
> +	cxlrd = to_cxl_root_decoder(dev);
> +
> +	mutex_lock(&cxlrd->range_lock);
> +	dev = device_find_child(&cxlrd->cxlsd.cxld.dev, hpa,
> +				match_region_by_range);
> +	if (!dev)
> +		cxlr = construct_region(cxlrd, cxled);
> +	else
> +		cxlr = to_cxl_region(dev);
> +	mutex_unlock(&cxlrd->range_lock);
> +
> +	if (IS_ERR(cxlr))
> +		goto out;
> +
> +	attach_target(cxlr, cxled, -1, TASK_UNINTERRUPTIBLE);
> +
> +	down_read(&cxl_region_rwsem);
> +	p = &cxlr->params;
> +	attach = p->state >= CXL_CONFIG_COMMIT;
> +	up_read(&cxl_region_rwsem);
> +
> +	if (attach) {
> +		int rc = device_attach(&cxlr->dev);
> +
> +		/*
> +		 * If device_attach() fails the range may still be
> +		 * active via the platform-firmware memory map
> +		 */
> +		if (rc < 0)
> +			dev_err(&cxlr->dev, "failed to enable, range: %pr\n",
> +				p->res);
> +	}
> +
> +	put_device(&cxlr->dev);
> +out:
> +	put_device(&root->dev);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_add_to_region, CXL);
> +
>   static int cxl_region_invalidate_memregion(struct cxl_region *cxlr)
>   {
>   	if (!test_bit(CXL_REGION_F_INCOHERENT, &cxlr->flags))
> @@ -2103,6 +2554,15 @@ static int cxl_region_invalidate_memregion(struct cxl_region *cxlr)
>   	return 0;
>   }
>   
> +static int is_system_ram(struct resource *res, void *arg)
> +{
> +	struct cxl_region *cxlr = arg;
> +	struct cxl_region_params *p = &cxlr->params;
> +
> +	dev_dbg(&cxlr->dev, "%pr has System RAM: %pr\n", p->res, res);
> +	return 1;
> +}
> +
>   static int cxl_region_probe(struct device *dev)
>   {
>   	struct cxl_region *cxlr = to_cxl_region(dev);
> @@ -2136,6 +2596,18 @@ static int cxl_region_probe(struct device *dev)
>   	switch (cxlr->mode) {
>   	case CXL_DECODER_PMEM:
>   		return devm_cxl_add_pmem_region(cxlr);
> +	case CXL_DECODER_RAM:
> +		/*
> +		 * The region can not be manged by CXL if any portion of
> +		 * it is already online as 'System RAM'
> +		 */
> +		if (walk_iomem_res_desc(IORES_DESC_NONE,
> +					IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
> +					p->res->start, p->res->end, cxlr,
> +					is_system_ram) > 0)
> +			return 0;
> +		dev_dbg(dev, "TODO: hookup devdax\n");
> +		return 0;
>   	default:
>   		dev_dbg(&cxlr->dev, "unsupported region mode: %d\n",
>   			cxlr->mode);
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index ca76879af1de..9b3765c5c81a 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -261,6 +261,8 @@ resource_size_t cxl_rcrb_to_component(struct device *dev,
>    * cxl_decoder flags that define the type of memory / devices this
>    * decoder supports as well as configuration lock status See "CXL 2.0
>    * 8.2.5.12.7 CXL HDM Decoder 0 Control Register" for details.
> + * Additionally indicate whether decoder settings were autodetected,
> + * user customized.
>    */
>   #define CXL_DECODER_F_RAM   BIT(0)
>   #define CXL_DECODER_F_PMEM  BIT(1)
> @@ -268,6 +270,7 @@ resource_size_t cxl_rcrb_to_component(struct device *dev,
>   #define CXL_DECODER_F_TYPE3 BIT(3)
>   #define CXL_DECODER_F_LOCK  BIT(4)
>   #define CXL_DECODER_F_ENABLE    BIT(5)
> +#define CXL_DECODER_F_AUTO  BIT(6)
>   #define CXL_DECODER_F_MASK  GENMASK(5, 0)
>   
>   enum cxl_decoder_type {
> @@ -380,6 +383,7 @@ typedef struct cxl_dport *(*cxl_calc_hb_fn)(struct cxl_root_decoder *cxlrd,
>    * @region_id: region id for next region provisioning event
>    * @calc_hb: which host bridge covers the n'th position by granularity
>    * @platform_data: platform specific configuration data
> + * @range_lock: sync region autodiscovery by address range
>    * @cxlsd: base cxl switch decoder
>    */
>   struct cxl_root_decoder {
> @@ -387,6 +391,7 @@ struct cxl_root_decoder {
>   	atomic_t region_id;
>   	cxl_calc_hb_fn calc_hb;
>   	void *platform_data;
> +	struct mutex range_lock;
>   	struct cxl_switch_decoder cxlsd;
>   };
>   
> @@ -436,6 +441,13 @@ struct cxl_region_params {
>    */
>   #define CXL_REGION_F_INCOHERENT 0
>   
> +/*
> + * Indicate whether this region has been assembled by autodetection or
> + * userspace assembly. Prevent endpoint decoders outside of automatic
> + * detection from being added to the region.
> + */
> +#define CXL_REGION_F_AUTO 1
> +
>   /**
>    * struct cxl_region - CXL region
>    * @dev: This region's device
> @@ -699,6 +711,7 @@ struct cxl_nvdimm_bridge *cxl_find_nvdimm_bridge(struct device *dev);
>   #ifdef CONFIG_CXL_REGION
>   bool is_cxl_pmem_region(struct device *dev);
>   struct cxl_pmem_region *to_cxl_pmem_region(struct device *dev);
> +void cxl_add_to_region(struct cxl_endpoint_decoder *cxled);
>   #else
>   static inline bool is_cxl_pmem_region(struct device *dev)
>   {
> @@ -708,6 +721,9 @@ static inline struct cxl_pmem_region *to_cxl_pmem_region(struct device *dev)
>   {
>   	return NULL;
>   }
> +static inline void cxl_add_to_region(struct cxl_endpoint_decoder *cxled)
> +{
> +}
>   #endif
>   
>   /*
> diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c
> index 5453771bf330..012a0c6f8476 100644
> --- a/drivers/cxl/port.c
> +++ b/drivers/cxl/port.c
> @@ -30,6 +30,23 @@ static void schedule_detach(void *cxlmd)
>   	schedule_cxl_memdev_detach(cxlmd);
>   }
>   
> +static int discover_region(struct device *dev, void *data)
> +{
> +	const unsigned long flags = CXL_DECODER_F_ENABLE | CXL_DECODER_F_AUTO;
> +	struct cxl_endpoint_decoder *cxled;
> +
> +	if (!is_endpoint_decoder(dev))
> +		return 0;
> +
> +	cxled = to_cxl_endpoint_decoder(dev);
> +	if ((cxled->cxld.flags & flags) != flags)
> +		return 0;
> +
> +	cxl_add_to_region(cxled);
> +
> +	return 0;
> +}
> +
>   static int cxl_port_probe(struct device *dev)
>   {
>   	struct cxl_port *port = to_cxl_port(dev);
> @@ -78,6 +95,15 @@ static int cxl_port_probe(struct device *dev)
>   		return rc;
>   	}
>   
> +	if (!is_cxl_endpoint(port))
> +		return 0;
> +
> +	/*
> +	 * Now that all endpoint decoders are successfully enumerated,
> +	 * try to assemble regions from committed decoders
> +	 */
> +	device_for_each_child(dev, NULL, discover_region);
> +
>   	return 0;
>   }
>   
>
Jonathan Cameron Feb. 8, 2023, 12:30 p.m. UTC | #24
On Sun, 05 Feb 2023 17:03:07 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> In preparation for region autodiscovery, that needs all devices
> discovered before their relative position in the region can be
> determined, consolidate all position dependent validation in a helper.
> 
> Recall that in the on-demand region creation flow the end-user picks the
> position of a given endpoint decoder in a region. In the autodiscovery
> case the position of an endpoint decoder can only be determined after
> all other endpoint decoders that claim to decode the region's address
> range have been enumerated and attached. So, in the autodiscovery case
> endpoint decoders may be attached before their relative position is
> known. Once all decoders arrive, then positions can be determined and
> validated with cxl_region_validate_position() the same as user initiated
> on-demand creation.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Hi Dan,

A few comments inline, but mostly reflect the original code rather than
the refactoring you have done in this patch.

Jonathan


> +static int cxl_region_attach(struct cxl_region *cxlr,
> +			     struct cxl_endpoint_decoder *cxled, int pos)
> +{
> +	struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(cxlr->dev.parent);
> +	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> +	struct cxl_region_params *p = &cxlr->params;
> +	struct cxl_port *ep_port, *root_port;
> +	struct cxl_dport *dport;
> +	int rc = -ENXIO;
> +
> +	if (cxled->mode != cxlr->mode) {
> +		dev_dbg(&cxlr->dev, "%s region mode: %d mismatch: %d\n",
> +			dev_name(&cxled->cxld.dev), cxlr->mode, cxled->mode);
> +		return -EINVAL;
> +	}
> +
> +	if (cxled->mode == CXL_DECODER_DEAD) {
> +		dev_dbg(&cxlr->dev, "%s dead\n", dev_name(&cxled->cxld.dev));
> +		return -ENODEV;
> +	}
> +
> +	/* all full of members, or interleave config not established? */
> +	if (p->state > CXL_CONFIG_INTERLEAVE_ACTIVE) {
> +		dev_dbg(&cxlr->dev, "region already active\n");
> +		return -EBUSY;
> +	} else if (p->state < CXL_CONFIG_INTERLEAVE_ACTIVE) {
> +		dev_dbg(&cxlr->dev, "interleave config missing\n");
> +		return -ENXIO;
> +	}
> +
>  	ep_port = cxled_to_port(cxled);
>  	root_port = cxlrd_to_port(cxlrd);
>  	dport = cxl_find_dport_by_dev(root_port, ep_port->host_bridge);
> @@ -1284,13 +1327,6 @@ static int cxl_region_attach(struct cxl_region *cxlr,
>  		return -ENXIO;
>  	}
>  
> -	if (cxlrd->calc_hb(cxlrd, pos) != dport) {
> -		dev_dbg(&cxlr->dev, "%s:%s invalid target position for %s\n",
> -			dev_name(&cxlmd->dev), dev_name(&cxled->cxld.dev),
> -			dev_name(&cxlrd->cxlsd.cxld.dev));
> -		return -ENXIO;
> -	}
> -

In an ideal world, this would have been nice as two patches.
One that reorders the various checks so that they are in the order
after you have factored things out (easy to review for correctness)
then one that factored it out.

>  	if (cxled->cxld.target_type != cxlr->type) {
>  		dev_dbg(&cxlr->dev, "%s:%s type mismatch: %d vs %d\n",
>  			dev_name(&cxlmd->dev), dev_name(&cxled->cxld.dev),
> @@ -1314,12 +1350,13 @@ static int cxl_region_attach(struct cxl_region *cxlr,
>  		return -EINVAL;
>  	}
>  
> -	for (iter = ep_port; !is_cxl_root(iter);
> -	     iter = to_cxl_port(iter->dev.parent)) {
> -		rc = cxl_port_attach_region(iter, cxlr, cxled, pos);
> -		if (rc)
> -			goto err;
> -	}
> +	rc = cxl_region_validate_position(cxlr, cxled, pos);
> +	if (rc)
> +		return rc;
> +
> +	rc = cxl_region_attach_position(cxlr, cxlrd, cxled, dport, pos);
> +	if (rc)
> +		return rc;
>  
>  	p->targets[pos] = cxled;
>  	cxled->pos = pos;

More something about original code than the refactoring...

I'm not keen on the side effects that aren't unwound in the error paths.

p->targets[pos] and cxled->pos are left set.  Probably never matters
but not elegant or as easy to reason about as it would be if they
were cleared in error cases.  In particular there is a check on
whether p->targets[pos] is set that will result in a dev_dbg even
though setting it up actually failed.


> @@ -1343,10 +1380,6 @@ static int cxl_region_attach(struct cxl_region *cxlr,
>  
>  err_decrement:
>  	p->nr_targets--;
> -err:
> -	for (iter = ep_port; !is_cxl_root(iter);
> -	     iter = to_cxl_port(iter->dev.parent))
> -		cxl_port_detach_region(iter, cxlr, cxled);
>  	return rc;
>  }
>  
>
Jonathan Cameron Feb. 8, 2023, 12:36 p.m. UTC | #25
On Sun, 05 Feb 2023 17:03:18 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> Add help text and a label so the CXL_REGION config option can be
> toggled. This is mainly to enable compile testing without region
> support.

Hmm. Possibly pull the reasoning up here for why this might
want to be configurable at all.  I'm not sure I fully follow
your reasoning as enumerating existing regions 'should' be harmless
gathering of information, not something that could do any damage
- so who would turn this off?

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>


> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/cxl/Kconfig |   12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> index 0ac53c422c31..163c094e67ae 100644
> --- a/drivers/cxl/Kconfig
> +++ b/drivers/cxl/Kconfig
> @@ -104,12 +104,22 @@ config CXL_SUSPEND
>  	depends on SUSPEND && CXL_MEM
>  
>  config CXL_REGION
> -	bool
> +	bool "CXL: Region Support"
>  	default CXL_BUS
>  	# For MAX_PHYSMEM_BITS
>  	depends on SPARSEMEM
>  	select MEMREGION
>  	select GET_FREE_REGION
> +	help
> +	  Enable the CXL core to enumerate and provision CXL regions. A CXL
> +	  region is defined by one or more CXL expanders that decode a given
> +	  system-physical address range. For CXL regions established by
> +	  platform-firmware this option enables memory error handling to
> +	  identify the devices participating in a given interleaved memory
> +	  range. Otherwise, platform-firmware managed CXL is enabled by being
> +	  placed in the system address map and does not need a driver.
> +
> +	  If unsure say 'y'
>  
>  config CXL_REGION_INVALIDATION_TEST
>  	bool "CXL: Region Cache Management Bypass (TEST)"
> 
>
Jonathan Cameron Feb. 8, 2023, 5:07 p.m. UTC | #26
On Sun, 05 Feb 2023 17:03:29 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> Region autodiscovery is an asynchrounous state machine advanced by
> cxl_port_probe(). After the decoders on an endpoint port are enumerated
> they are scanned for actively enabled instances. Each active decoder is
> flagged for auto-assembly CXL_DECODER_F_AUTO and attached to a region.
> If a region does not already exist for the address range setting of the
> decoder one is created. That creation process may race with other
> decoders of the same region being discovered since cxl_port_probe() is
> asynchronous. A new 'struct cxl_root_decoder' lock, @range_lock, is
> introduced to mitigate that race.
> 
> Once all decoders have arrived, "p->nr_targets == p->interleave_ways",
> they are sorted by their relative decode position. The sort algorithm
> involves finding the point in the cxl_port topology where one leg of the
> decode leads to deviceA and the other deviceB. At that point in the
> topology the target order in the 'struct cxl_switch_decoder' indicates
> the relative position of those endpoint decoders in the region.
> 
> >From that point the region goes through the same setup and validation  
> steps as user-created regions, but instead of programming the decoders
> it validates that driver would have written the same values to the
> decoders as were already present.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
A few comments inline,

Thanks,

Jonathan

> ---
>  drivers/cxl/core/hdm.c    |    5 
>  drivers/cxl/core/port.c   |    2 
>  drivers/cxl/core/region.c |  496 ++++++++++++++++++++++++++++++++++++++++++++-
>  drivers/cxl/cxl.h         |   16 +
>  drivers/cxl/port.c        |   26 ++
>  5 files changed, 531 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index dcc16d7cb8f3..174cddfec6e8 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -674,7 +674,7 @@ static int cxl_decoder_reset(struct cxl_decoder *cxld)
>  	up_read(&cxl_dpa_rwsem);
>  
>  	port->commit_end--;
> -	cxld->flags &= ~CXL_DECODER_F_ENABLE;
> +	cxld->flags &= ~(CXL_DECODER_F_ENABLE | CXL_DECODER_F_AUTO);
>  
>  	return 0;
>  }
> @@ -719,7 +719,7 @@ static int init_hdm_decoder(struct cxl_port *port, struct cxl_decoder *cxld,
>  
>  	/* decoders are enabled if committed */
>  	if (committed) {
> -		cxld->flags |= CXL_DECODER_F_ENABLE;
> +		cxld->flags |= CXL_DECODER_F_ENABLE | CXL_DECODER_F_AUTO;
>  		if (ctrl & CXL_HDM_DECODER0_CTRL_LOCK)
>  			cxld->flags |= CXL_DECODER_F_LOCK;
>  		if (FIELD_GET(CXL_HDM_DECODER0_CTRL_TYPE, ctrl))
> @@ -783,6 +783,7 @@ static int init_hdm_decoder(struct cxl_port *port, struct cxl_decoder *cxld,
>  		return rc;
>  	}
>  	*dpa_base += dpa_size + skip;
> +

:)  No comment.

>  	return 0;
>  }
>  


...


> +static int cmp_decode_pos(const void *a, const void *b)
> +{
> +	struct cxl_endpoint_decoder *cxled_a = *(typeof(cxled_a) *)a;
> +	struct cxl_endpoint_decoder *cxled_b = *(typeof(cxled_b) *)b;
> +	struct cxl_memdev *cxlmd_a = cxled_to_memdev(cxled_a);
> +	struct cxl_memdev *cxlmd_b = cxled_to_memdev(cxled_b);
> +	struct cxl_port *port_a = cxled_to_port(cxled_a);
> +	struct cxl_port *port_b = cxled_to_port(cxled_b);
> +	struct cxl_port *iter_a, *iter_b;
> +
> +	/* Exit early if any prior sorting failed */
> +	if (cxled_a->pos < 0 || cxled_b->pos < 0)
> +		return 0;
> +
> +	/*
> +	 * Walk up the hierarchy to find a shared port, find the decoder
> +	 * that maps the range, compare the relative position of those
> +	 * dport mappings.
> +	 */
> +	for (iter_a = port_a; iter_a; iter_a = next_port(iter_a)) {
> +		struct cxl_port *next_a, *next_b, *port;
> +		struct cxl_switch_decoder *cxlsd;
> +
> +		next_a = next_port(iter_a);
> +		for (iter_b = port_b; iter_b; iter_b = next_port(iter_b)) {
> +			int a_pos, b_pos, result;
> +			struct device *dev;
> +			unsigned int seq;
> +
> +			next_b = next_port(iter_b);
> +			if (next_a != next_b)
> +				continue;
> +			if (!next_a)
> +				goto out;
Perhaps treat this as a 'find the match' then carry on, so as to avoid having the
guts of the handling nested so deep.  So something like.

	struct cxl_switch_decoder *cxlsd;
	struct cxl_port *next_a, *next_b, *port;
	bool matched = false; // May be simpler test - I'm just lazy

	// possibly pull this out as a utility function so that
	// error handling can be tidied up by simply returning 
	// rather than gotos.
	for (iter_a = port_a; iter_a; iter_a = next_port(iter_a)) {
		next_a = next_port(iter_a);
		for (iter_b = port_b; iter_b; iter_b = next_port(iter_b)) {
			int a_pos, b_pos, result;
			struct device *dev;
			unsigned int seq;

			next_b = next_port(iter_b);
			if (next_a != next_b)
				continue;
			if (!next_a)
				goto out;
			port = next_a;
			found = true;
			break;
		}
		if (found)
			break;
	}
	if (!found)
		goto out;

	dev = device_find_child(&port->dev, cxled_a,
				decoder_match_range);
	if (!dev) {
		struct range *range = &cxled_a->cxld.hpa_range;
		dev_err(port->uport,
			"failed to find decoder that maps %#llx-:%#llx\n",
		range->start, range->end);
		cxled_a->pos = -1;
		return 0;
	}

	cxlsd = to_cxl_switch_decoder(dev);
	do {
		seq = read_seqbegin(&cxlsd->target_lock);
		find_positions(cxlsd, iter_a, iter_b, &a_pos, &b_pos);
	} while (read_seqretry(&cxlsd->target_lock, seq));

	if (a_pos < 0 || b_pos < 0) {
		dev_err(port->uport,
			"failed to find shared decoder for %s and %s\n",
			dev_name(cxlmd_a->dev.parent),
			dev_name(cxlmd_b->dev.parent));
		cxled_a->pos = -1;
		result = 0;
//if factored as suggested above then a goto here and drop
// the else would be neater again.

	} else {
		result = a_pos - b_pos;
		dev_dbg(port->uport, "%s: %s comes %s %s\n",
			dev_name(&cxlsd->cxld.dev),
			dev_name(cxlmd_a->dev.parent),
			result < 0 ? "before" : "after",
			dev_name(cxlmd_b->dev.parent));
		}
	put_device(dev);

	return result;
out:
	dev_err(cxlmd_a->dev.parent, "failed to find shared port with %s\n",
		dev_name(cxlmd_b->dev.parent));
	cxled_a->pos = -1;
	return 0;
}


> +			dev = device_find_child(&port->dev, cxled_a,
> +						decoder_match_range);
> +			if (!dev) {
> +				struct range *range = &cxled_a->cxld.hpa_range;
> +
> +				dev_err(port->uport,
> +					"failed to find decoder that maps %#llx-:%#llx\n",
> +					range->start, range->end);
> +				cxled_a->pos = -1;
> +				return 0;
> +			}
> +
> +			cxlsd = to_cxl_switch_decoder(dev);
> +			do {
> +				seq = read_seqbegin(&cxlsd->target_lock);
> +				find_positions(cxlsd, iter_a, iter_b, &a_pos,
> +					       &b_pos);
> +			} while (read_seqretry(&cxlsd->target_lock, seq));
> +
> +			if (a_pos < 0 || b_pos < 0) {
> +				dev_err(port->uport,
> +					"failed to find shared decoder for %s and %s\n",
> +					dev_name(cxlmd_a->dev.parent),
> +					dev_name(cxlmd_b->dev.parent));
> +				cxled_a->pos = -1;
> +				result = 0;
> +			} else {
> +				result = a_pos - b_pos;
> +				dev_dbg(port->uport, "%s: %s comes %s %s\n",
> +					dev_name(&cxlsd->cxld.dev),
> +					dev_name(cxlmd_a->dev.parent),
> +					result < 0 ? "before" : "after",
> +					dev_name(cxlmd_b->dev.parent));
> +			}
> +
> +			put_device(dev);
> +
> +			return result;
> +		}
> +	}
> +out:
> +	dev_err(cxlmd_a->dev.parent, "failed to find shared port with %s\n",
> +		dev_name(cxlmd_b->dev.parent));
> +	cxled_a->pos = -1;
> +	return 0;
> +}
> +
> +static int cxl_region_sort_targets(struct cxl_region *cxlr)
> +{
> +	struct cxl_region_params *p = &cxlr->params;
> +	int i, rc = 0;
> +
> +	sort(p->targets, p->nr_targets, sizeof(p->targets[0]), cmp_decode_pos,
> +	     NULL);
> +
> +	for (i = 0; i < p->nr_targets; i++) {
> +		struct cxl_endpoint_decoder *cxled = p->targets[i];
> +
> +		if (cxled->pos < 0)
> +			rc = -ENXIO;
> +		cxled->pos = i;
> +	}
> +
> +	dev_dbg(&cxlr->dev, "region sort %s\n", rc ? "failed" : "successful");
> +	return rc;
> +}
> +
>  static int cxl_region_attach(struct cxl_region *cxlr,
>  			     struct cxl_endpoint_decoder *cxled, int pos)
>  {
> @@ -1350,6 +1572,50 @@ static int cxl_region_attach(struct cxl_region *cxlr,
>  		return -EINVAL;
>  	}
>  
> +	if (test_bit(CXL_REGION_F_AUTO, &cxlr->flags)) {
> +		int i;
> +
> +		rc = cxl_region_attach_auto(cxlr, cxled, pos);
> +		if (rc)
> +			return rc;
> +
> +		/* await more targets to arrive... */
> +		if (p->nr_targets < p->interleave_ways)
> +			return 0;
> +
> +		/*
> +		 * All targets are here, which implies all PCI enumeration that
> +		 * affects this region has been completed. Walk the topology to
> +		 * sort the devices into their relative region decode position.
> +		 */
> +		rc = cxl_region_sort_targets(cxlr);
> +		if (rc)
> +			return rc;
> +
> +		for (i = 0; i < p->nr_targets; i++) {
> +			cxled = p->targets[i];
> +			ep_port = cxled_to_port(cxled);
> +			dport = cxl_find_dport_by_dev(root_port,
> +						      ep_port->host_bridge);
> +			rc = cxl_region_attach_position(cxlr, cxlrd, cxled,
> +							dport, i);
> +			if (rc)
> +				return rc;
> +		}
> +
> +		rc = cxl_region_setup_targets(cxlr);
> +		if (rc)
> +			return rc;
> +
> +		/*
> +		 * If target setup succeeds in the autodiscovery case
> +		 * then the region is already committed.
> +		 */
> +		p->state = CXL_CONFIG_COMMIT;
> +
> +		return 0;
> +	}
> +
>  	rc = cxl_region_validate_position(cxlr, cxled, pos);
>  	if (rc)
>  		return rc;
> @@ -1500,8 +1766,8 @@ static int detach_target(struct cxl_region *cxlr, int pos)
>  	return rc;
>  }
>  
> -static size_t store_targetN(struct cxl_region *cxlr, const char *buf, int pos,
> -			    size_t len)
> +static ssize_t store_targetN(struct cxl_region *cxlr, const char *buf, int pos,
> +			     size_t len)

?



...


> +
> +void cxl_add_to_region(struct cxl_endpoint_decoder *cxled)
> +{
> +	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> +	struct range *hpa = &cxled->cxld.hpa_range;
> +	struct cxl_decoder *cxld = &cxled->cxld;
> +	struct cxl_root_decoder *cxlrd;
> +	struct cxl_region_params *p;
> +	struct cxl_region *cxlr;
> +	struct cxl_port *root;
> +	bool attach = false;
> +	struct device *dev;
> +
> +	root = find_cxl_root(&cxlmd->dev);
> +	if (!root) {
> +		dev_err(cxlmd->dev.parent, "%s: failed to map CXL root\n",
> +			dev_name(&cxlmd->dev));
> +		return;

I'm not a fan of muddling on from errors at lower layers of code.
I'd push that decision up to the caller - so return an error code from here.
If caller just wants to ignore it (with appropriate comment on why) then
that is fine by me.


> +	}
> +
> +	dev = device_find_child(&root->dev, &cxld->hpa_range,
> +				  match_decoder_by_range);

device_find_child is acquiring a reference to the child I think.

> +	if (!dev) {
> +		dev_err(cxlmd->dev.parent,
> +			"%s:%s no CXL window for range %#llx:%#llx\n",
> +			dev_name(&cxlmd->dev), dev_name(&cxld->dev),
> +			cxld->hpa_range.start, cxld->hpa_range.end);
> +		goto out;
> +	}
> +	cxlrd = to_cxl_root_decoder(dev);
Here we've stashed the dev that we have a reference to in cxlrd.
> +
> +	mutex_lock(&cxlrd->range_lock);
> +	dev = device_find_child(&cxlrd->cxlsd.cxld.dev, hpa,
> +				match_region_by_range);
> +	if (!dev)
> +		cxlr = construct_region(cxlrd, cxled);
> +	else
> +		cxlr = to_cxl_region(dev);
> +	mutex_unlock(&cxlrd->range_lock);
> +
> +	if (IS_ERR(cxlr))

In this path, do we drop the reference to cxlrd->dev?

> +		goto out;
> +
> +	attach_target(cxlr, cxled, -1, TASK_UNINTERRUPTIBLE);
> +
> +	down_read(&cxl_region_rwsem);
> +	p = &cxlr->params;
> +	attach = p->state >= CXL_CONFIG_COMMIT;

Test is a bit non obvious.  In the tree I'm looking at
CXL_CONFIG_COMMIT is last value in the enum.


> +	up_read(&cxl_region_rwsem);
> +
> +	if (attach) {
> +		int rc = device_attach(&cxlr->dev);
> +
> +		/*
> +		 * If device_attach() fails the range may still be
> +		 * active via the platform-firmware memory map
> +		 */
> +		if (rc < 0)

We also want to know if the no matching driver found case happened so perhaps
		if (rc != 1)

> +			dev_err(&cxlr->dev, "failed to enable, range: %pr\n",
> +				p->res);
> +	}
> +
> +	put_device(&cxlr->dev);
> +out:
> +	put_device(&root->dev);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_add_to_region, CXL);
> +
>  static int cxl_region_invalidate_memregion(struct cxl_region *cxlr)
>  {
>  	if (!test_bit(CXL_REGION_F_INCOHERENT, &cxlr->flags))
> @@ -2103,6 +2554,15 @@ static int cxl_region_invalidate_memregion(struct cxl_region *cxlr)
>  	return 0;
>  }

> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index ca76879af1de..9b3765c5c81a 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -261,6 +261,8 @@ resource_size_t cxl_rcrb_to_component(struct device *dev,
>   * cxl_decoder flags that define the type of memory / devices this
>   * decoder supports as well as configuration lock status See "CXL 2.0
>   * 8.2.5.12.7 CXL HDM Decoder 0 Control Register" for details.
> + * Additionally indicate whether decoder settings were autodetected,
> + * user customized.
>   */
>  #define CXL_DECODER_F_RAM   BIT(0)
>  #define CXL_DECODER_F_PMEM  BIT(1)
> @@ -268,6 +270,7 @@ resource_size_t cxl_rcrb_to_component(struct device *dev,
>  #define CXL_DECODER_F_TYPE3 BIT(3)
>  #define CXL_DECODER_F_LOCK  BIT(4)
>  #define CXL_DECODER_F_ENABLE    BIT(5)
> +#define CXL_DECODER_F_AUTO  BIT(6)

That's a bit nasty and unexpected - can we rename to make it really clear
it's not a hardware flag at the point of usage.
Or just put it in a a new structure field rather than pushing it in here.

>  #define CXL_DECODER_F_MASK  GENMASK(5, 0)

Hmm. This mask isn't currently used and now has
an odd meaning because of hiding a flag in bit(6).


...

>  
>  /*
> diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c
> index 5453771bf330..012a0c6f8476 100644
> --- a/drivers/cxl/port.c
> +++ b/drivers/cxl/port.c
> @@ -30,6 +30,23 @@ static void schedule_detach(void *cxlmd)
>  	schedule_cxl_memdev_detach(cxlmd);
>  }
>  
> +static int discover_region(struct device *dev, void *data)
> +{
> +	const unsigned long flags = CXL_DECODER_F_ENABLE | CXL_DECODER_F_AUTO;
> +	struct cxl_endpoint_decoder *cxled;
> +
> +	if (!is_endpoint_decoder(dev))
> +		return 0;
> +
> +	cxled = to_cxl_endpoint_decoder(dev);
> +	if ((cxled->cxld.flags & flags) != flags)
> +		return 0;
> +
> +	cxl_add_to_region(cxled);
> +
> +	return 0;
> +}
> +
>  static int cxl_port_probe(struct device *dev)
>  {
>  	struct cxl_port *port = to_cxl_port(dev);
> @@ -78,6 +95,15 @@ static int cxl_port_probe(struct device *dev)
>  		return rc;
>  	}
>  
> +	if (!is_cxl_endpoint(port))
> +		return 0;
> +
> +	/*
> +	 * Now that all endpoint decoders are successfully enumerated,
> +	 * try to assemble regions from committed decoders
> +	 */
> +	device_for_each_child(dev, NULL, discover_region);
> +

There is a steady reduction in common code vs specific code for the two
cases in here.  Maybe it's time just to split the probe into

static int cxl_port_probe(struct device *dev)
{
	if (is_cxl_endpoint(port))
		cxl_port_probe_endpoint();
	else
		cxl_port_probe_not endpoint(); //with better naming.
}

Shared code is about 7 lines vs 40+ unshared.



>  	return 0;
>  }
>  
> 
>
Jonathan Cameron Feb. 8, 2023, 5:35 p.m. UTC | #27
On Sun, 05 Feb 2023 17:03:53 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> In preparation for hmem platform devices to be unregistered, stop using
> platform_device_add_resources() to convey the address range. The
> platform_device_add_resources() API causes an existing "Soft Reserved"
> iomem resource to be re-parented under an inserted platform device
> resource. When that platform device is deleted it removes the platform
> device resource and all children.
> 
> Instead, it is sufficient to convey just the address range and let
> request_mem_region() insert resources to indicate the devices active in
> the range. This allows the "Soft Reserved" resource to be re-enumerated
> upon the next probe event.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Seems sensible to me.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

> ---
>  drivers/dax/hmem/device.c |   37 ++++++++++++++-----------------------
>  drivers/dax/hmem/hmem.c   |   14 +++-----------
>  include/linux/memregion.h |    2 ++
>  3 files changed, 19 insertions(+), 34 deletions(-)
> 
> diff --git a/drivers/dax/hmem/device.c b/drivers/dax/hmem/device.c
> index 20749c7fab81..b1b339bccfe5 100644
> --- a/drivers/dax/hmem/device.c
> +++ b/drivers/dax/hmem/device.c
> @@ -15,15 +15,8 @@ static struct resource hmem_active = {
>  	.flags = IORESOURCE_MEM,
>  };
>  
> -void hmem_register_device(int target_nid, struct resource *r)
> +void hmem_register_device(int target_nid, struct resource *res)
>  {
> -	/* define a clean / non-busy resource for the platform device */
> -	struct resource res = {
> -		.start = r->start,
> -		.end = r->end,
> -		.flags = IORESOURCE_MEM,
> -		.desc = IORES_DESC_SOFT_RESERVED,
> -	};
>  	struct platform_device *pdev;
>  	struct memregion_info info;
>  	int rc, id;
> @@ -31,55 +24,53 @@ void hmem_register_device(int target_nid, struct resource *r)
>  	if (nohmem)
>  		return;
>  
> -	rc = region_intersects(res.start, resource_size(&res), IORESOURCE_MEM,
> -			IORES_DESC_SOFT_RESERVED);
> +	rc = region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
> +			       IORES_DESC_SOFT_RESERVED);
>  	if (rc != REGION_INTERSECTS)
>  		return;
>  
>  	id = memregion_alloc(GFP_KERNEL);
>  	if (id < 0) {
> -		pr_err("memregion allocation failure for %pr\n", &res);
> +		pr_err("memregion allocation failure for %pr\n", res);
>  		return;
>  	}
>  
>  	pdev = platform_device_alloc("hmem", id);
>  	if (!pdev) {
> -		pr_err("hmem device allocation failure for %pr\n", &res);
> +		pr_err("hmem device allocation failure for %pr\n", res);
>  		goto out_pdev;
>  	}
>  
> -	if (!__request_region(&hmem_active, res.start, resource_size(&res),
> +	if (!__request_region(&hmem_active, res->start, resource_size(res),
>  			      dev_name(&pdev->dev), 0)) {
> -		dev_dbg(&pdev->dev, "hmem range %pr already active\n", &res);
> +		dev_dbg(&pdev->dev, "hmem range %pr already active\n", res);
>  		goto out_active;
>  	}
>  
>  	pdev->dev.numa_node = numa_map_to_online_node(target_nid);
>  	info = (struct memregion_info) {
>  		.target_node = target_nid,
> +		.range = {
> +			.start = res->start,
> +			.end = res->end,
> +		},
>  	};
>  	rc = platform_device_add_data(pdev, &info, sizeof(info));
>  	if (rc < 0) {
> -		pr_err("hmem memregion_info allocation failure for %pr\n", &res);
> -		goto out_resource;
> -	}
> -
> -	rc = platform_device_add_resources(pdev, &res, 1);
> -	if (rc < 0) {
> -		pr_err("hmem resource allocation failure for %pr\n", &res);
> +		pr_err("hmem memregion_info allocation failure for %pr\n", res);
>  		goto out_resource;
>  	}
>  
>  	rc = platform_device_add(pdev);
>  	if (rc < 0) {
> -		dev_err(&pdev->dev, "device add failed for %pr\n", &res);
> +		dev_err(&pdev->dev, "device add failed for %pr\n", res);
>  		goto out_resource;
>  	}
>  
>  	return;
>  
>  out_resource:
> -	__release_region(&hmem_active, res.start, resource_size(&res));
> +	__release_region(&hmem_active, res->start, resource_size(res));
>  out_active:
>  	platform_device_put(pdev);
>  out_pdev:
> diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
> index c7351e0dc8ff..5025a8c9850b 100644
> --- a/drivers/dax/hmem/hmem.c
> +++ b/drivers/dax/hmem/hmem.c
> @@ -15,25 +15,17 @@ static int dax_hmem_probe(struct platform_device *pdev)
>  	struct memregion_info *mri;
>  	struct dev_dax_data data;
>  	struct dev_dax *dev_dax;
> -	struct resource *res;
> -	struct range range;
> -
> -	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
> -	if (!res)
> -		return -ENOMEM;
>  
>  	mri = dev->platform_data;
> -	range.start = res->start;
> -	range.end = res->end;
> -	dax_region = alloc_dax_region(dev, pdev->id, &range, mri->target_node,
> -			PMD_SIZE, 0);
> +	dax_region = alloc_dax_region(dev, pdev->id, &mri->range,
> +				      mri->target_node, PMD_SIZE, 0);
>  	if (!dax_region)
>  		return -ENOMEM;
>  
>  	data = (struct dev_dax_data) {
>  		.dax_region = dax_region,
>  		.id = -1,
> -		.size = region_idle ? 0 : resource_size(res),
> +		.size = region_idle ? 0 : range_len(&mri->range),
>  	};
>  	dev_dax = devm_create_dev_dax(&data);
>  	if (IS_ERR(dev_dax))
> diff --git a/include/linux/memregion.h b/include/linux/memregion.h
> index bf83363807ac..c01321467789 100644
> --- a/include/linux/memregion.h
> +++ b/include/linux/memregion.h
> @@ -3,10 +3,12 @@
>  #define _MEMREGION_H_
>  #include <linux/types.h>
>  #include <linux/errno.h>
> +#include <linux/range.h>
>  #include <linux/bug.h>
>  
>  struct memregion_info {
>  	int target_node;
> +	struct range range;
>  };
>  
>  #ifdef CONFIG_MEMREGION
>
Fan Ni Feb. 8, 2023, 5:37 p.m. UTC | #28
On Sun, Feb 05, 2023 at 05:02:29PM -0800, Dan Williams wrote:


> Summary:
> --------
> 
> CXL RAM support allows for the dynamic provisioning of new CXL RAM
> regions, and more routinely, assembling a region from an existing
> configuration established by platform-firmware. The latter is motivated
> by CXL memory RAS (Reliability, Availability and Serviceability)
> support, that requires associating device events with System Physical
> Address ranges and vice versa.
> 
> The 'Soft Reserved' policy rework arranges for performance
> differentiated memory like CXL attached DRAM, or high-bandwidth memory,
> to be designated for 'System RAM' by default, rather than the device-dax
> dedicated access mode. That current device-dax default is confusing and
> surprising for the Pareto of users that do not expect memory to be
> quarantined for dedicated access by default. Most users expect all
> 'System RAM'-capable memory to show up in FREE(1).
> 
> 
> Details:
> --------
> 
> Recall that the Linux 'Soft Reserved' designation for memory is a
> reaction to platform-firmware, like EFI EDK2, delineating memory with
> the EFI Specific Purpose Memory attribute (EFI_MEMORY_SP). An
> alternative way to think of that attribute is that it specifies the
> *not* general-purpose memory pool. It is memory that may be too precious
> for general usage or not performant enough for some hot data structures.
> However, in the absence of explicit policy it should just be 'System
> RAM' by default.
> 
> Rather than require every distribution to ship a udev policy to assign
> dax devices to dax_kmem (the device-memory hotplug driver) just make
> that the kernel default. This is similar to the rationale in:
> 
> commit 8604d9e534a3 ("memory_hotplug: introduce CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE")
> 
> With this change the relatively niche use case of accessing this memory
> via mapping a device-dax instance can be achieved by building with
> CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=n, or specifying
> memhp_default_state=offline at boot, and then use:
> 
>     daxctl reconfigure-device $device -m devdax --force
> 
> ...to shift the corresponding address range to device-dax access.
> 
> The process of assembling a device-dax instance for a given CXL region
> device configuration is similar to the process of assembling a
> Device-Mapper or MDRAID storage-device array. Specifically, asynchronous
> probing by the PCI and driver core enumerates all CXL endpoints and
> their decoders. Then, once enough decoders have arrived to a describe a
> given region, that region is passed to the device-dax subsystem where it
> is subject to the above 'dax_kmem' policy. This assignment and policy
> choice is only possible if memory is set aside by the 'Soft Reserved'
> designation. Otherwise, CXL that is mapped as 'System RAM' becomes
> immutable by CXL driver mechanisms, but is still enumerated for RAS
> purposes.
> 
> This series is also available via:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/log/?h=for-6.3/cxl-ram-region
> 
> ...and has gone through some preview testing in various forms.
> 
> ---

Tested-by: Fan Ni <fan.ni@samsung.com>


Run the following tests with the patch (with the volatile support at qemu).
Note: cxl related code are compiled as modules and loaded before used.

For pmem setup, tried three topologies (1HB1RP1Mem, 1HB2RP2Mem, 1HB2RP4Mem with
a cxl switch). The memdev is either provided in the command line when launching
qemu or hot added to the guest with device_add command in qemu monitor.

The following operations are performed,
1. create-region with cxl cmd
2. create name-space with ndctl cmd
3. convert cxl mem to ram with daxctl cmd
4. online the memory with daxctl cmd
5. Let app use the memory (numactl --membind=1 htop)

Results: No regression.

For volatile memory (hot add with device_add command), mainly tested 1HB1RP1Mem
case (passthrough).
1. the device can be correctly discovered after hot add (cxl list, may need
cxl enable-memdev)
2. creating ram region (cxl create-region) succeeded, after creating the
region, a dax device under /dev/ is shown.
3. online the memory passes, and the memory is shown on another NUMA node.
4. Let app use the memory (numactl --membind=1 htop) passed.




> 
> Dan Williams (18):
>       cxl/Documentation: Update references to attributes added in v6.0
>       cxl/region: Add a mode attribute for regions
>       cxl/region: Support empty uuids for non-pmem regions
>       cxl/region: Validate region mode vs decoder mode
>       cxl/region: Add volatile region creation support
>       cxl/region: Refactor attach_target() for autodiscovery
>       cxl/region: Move region-position validation to a helper
>       kernel/range: Uplevel the cxl subsystem's range_contains() helper
>       cxl/region: Enable CONFIG_CXL_REGION to be toggled
>       cxl/region: Fix passthrough-decoder detection
>       cxl/region: Add region autodiscovery
>       tools/testing/cxl: Define a fixed volatile configuration to parse
>       dax/hmem: Move HMAT and Soft reservation probe initcall level
>       dax/hmem: Drop unnecessary dax_hmem_remove()
>       dax/hmem: Convey the dax range via memregion_info()
>       dax/hmem: Move hmem device registration to dax_hmem.ko
>       dax: Assign RAM regions to memory-hotplug by default
>       cxl/dax: Create dax devices for CXL RAM regions
> 
> 
>  Documentation/ABI/testing/sysfs-bus-cxl |   64 +-
>  MAINTAINERS                             |    1 
>  drivers/acpi/numa/hmat.c                |    4 
>  drivers/cxl/Kconfig                     |   12 
>  drivers/cxl/acpi.c                      |    3 
>  drivers/cxl/core/core.h                 |    7 
>  drivers/cxl/core/hdm.c                  |    8 
>  drivers/cxl/core/pci.c                  |    5 
>  drivers/cxl/core/port.c                 |   34 +
>  drivers/cxl/core/region.c               |  848 ++++++++++++++++++++++++++++---
>  drivers/cxl/cxl.h                       |   46 ++
>  drivers/cxl/cxlmem.h                    |    3 
>  drivers/cxl/port.c                      |   26 +
>  drivers/dax/Kconfig                     |   17 +
>  drivers/dax/Makefile                    |    2 
>  drivers/dax/bus.c                       |   53 +-
>  drivers/dax/bus.h                       |   12 
>  drivers/dax/cxl.c                       |   53 ++
>  drivers/dax/device.c                    |    3 
>  drivers/dax/hmem/Makefile               |    3 
>  drivers/dax/hmem/device.c               |  102 ++--
>  drivers/dax/hmem/hmem.c                 |  148 +++++
>  drivers/dax/kmem.c                      |    1 
>  include/linux/dax.h                     |    7 
>  include/linux/memregion.h               |    2 
>  include/linux/range.h                   |    5 
>  lib/stackinit_kunit.c                   |    6 
>  tools/testing/cxl/test/cxl.c            |  146 +++++
>  28 files changed, 1355 insertions(+), 266 deletions(-)
>  create mode 100644 drivers/dax/cxl.c
> 
> base-commit: 172738bbccdb4ef76bdd72fc72a315c741c39161
>
Verma, Vishal L Feb. 9, 2023, 12:20 a.m. UTC | #29
On Sun, 2023-02-05 at 17:02 -0800, Dan Williams wrote:
> Prior to Linus deciding that the kernel that following v5.19 would be
> v6.0, the CXL ABI documentation already referenced v5.20. In preparation
> for updating these entries update the kernel version to v6.0.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  Documentation/ABI/testing/sysfs-bus-cxl |   30 +++++++++++++++---------------
>  1 file changed, 15 insertions(+), 15 deletions(-)

Looks good,

Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>

> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index 329a7e46c805..5be032313e29 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -198,7 +198,7 @@ Description:
>  
>  What:          /sys/bus/cxl/devices/endpointX/CDAT
>  Date:          July, 2022
> -KernelVersion: v5.20
> +KernelVersion: v6.0
>  Contact:       linux-cxl@vger.kernel.org
>  Description:
>                 (RO) If this sysfs entry is not present no DOE mailbox was
> @@ -209,7 +209,7 @@ Description:
>  
>  What:          /sys/bus/cxl/devices/decoderX.Y/mode
>  Date:          May, 2022
> -KernelVersion: v5.20
> +KernelVersion: v6.0
>  Contact:       linux-cxl@vger.kernel.org
>  Description:
>                 (RW) When a CXL decoder is of devtype "cxl_decoder_endpoint" it
> @@ -229,7 +229,7 @@ Description:
>  
>  What:          /sys/bus/cxl/devices/decoderX.Y/dpa_resource
>  Date:          May, 2022
> -KernelVersion: v5.20
> +KernelVersion: v6.0
>  Contact:       linux-cxl@vger.kernel.org
>  Description:
>                 (RO) When a CXL decoder is of devtype "cxl_decoder_endpoint",
> @@ -240,7 +240,7 @@ Description:
>  
>  What:          /sys/bus/cxl/devices/decoderX.Y/dpa_size
>  Date:          May, 2022
> -KernelVersion: v5.20
> +KernelVersion: v6.0
>  Contact:       linux-cxl@vger.kernel.org
>  Description:
>                 (RW) When a CXL decoder is of devtype "cxl_decoder_endpoint" it
> @@ -260,7 +260,7 @@ Description:
>  
>  What:          /sys/bus/cxl/devices/decoderX.Y/interleave_ways
>  Date:          May, 2022
> -KernelVersion: v5.20
> +KernelVersion: v6.0
>  Contact:       linux-cxl@vger.kernel.org
>  Description:
>                 (RO) The number of targets across which this decoder's host
> @@ -275,7 +275,7 @@ Description:
>  
>  What:          /sys/bus/cxl/devices/decoderX.Y/interleave_granularity
>  Date:          May, 2022
> -KernelVersion: v5.20
> +KernelVersion: v6.0
>  Contact:       linux-cxl@vger.kernel.org
>  Description:
>                 (RO) The number of consecutive bytes of host physical address
> @@ -287,7 +287,7 @@ Description:
>  
>  What:          /sys/bus/cxl/devices/decoderX.Y/create_pmem_region
>  Date:          May, 2022
> -KernelVersion: v5.20
> +KernelVersion: v6.0
>  Contact:       linux-cxl@vger.kernel.org
>  Description:
>                 (RW) Write a string in the form 'regionZ' to start the process
> @@ -303,7 +303,7 @@ Description:
>  
>  What:          /sys/bus/cxl/devices/decoderX.Y/delete_region
>  Date:          May, 2022
> -KernelVersion: v5.20
> +KernelVersion: v6.0
>  Contact:       linux-cxl@vger.kernel.org
>  Description:
>                 (WO) Write a string in the form 'regionZ' to delete that region,
> @@ -312,7 +312,7 @@ Description:
>  
>  What:          /sys/bus/cxl/devices/regionZ/uuid
>  Date:          May, 2022
> -KernelVersion: v5.20
> +KernelVersion: v6.0
>  Contact:       linux-cxl@vger.kernel.org
>  Description:
>                 (RW) Write a unique identifier for the region. This field must
> @@ -322,7 +322,7 @@ Description:
>  
>  What:          /sys/bus/cxl/devices/regionZ/interleave_granularity
>  Date:          May, 2022
> -KernelVersion: v5.20
> +KernelVersion: v6.0
>  Contact:       linux-cxl@vger.kernel.org
>  Description:
>                 (RW) Set the number of consecutive bytes each device in the
> @@ -333,7 +333,7 @@ Description:
>  
>  What:          /sys/bus/cxl/devices/regionZ/interleave_ways
>  Date:          May, 2022
> -KernelVersion: v5.20
> +KernelVersion: v6.0
>  Contact:       linux-cxl@vger.kernel.org
>  Description:
>                 (RW) Configures the number of devices participating in the
> @@ -343,7 +343,7 @@ Description:
>  
>  What:          /sys/bus/cxl/devices/regionZ/size
>  Date:          May, 2022
> -KernelVersion: v5.20
> +KernelVersion: v6.0
>  Contact:       linux-cxl@vger.kernel.org
>  Description:
>                 (RW) System physical address space to be consumed by the region.
> @@ -360,7 +360,7 @@ Description:
>  
>  What:          /sys/bus/cxl/devices/regionZ/resource
>  Date:          May, 2022
> -KernelVersion: v5.20
> +KernelVersion: v6.0
>  Contact:       linux-cxl@vger.kernel.org
>  Description:
>                 (RO) A region is a contiguous partition of a CXL root decoder
> @@ -372,7 +372,7 @@ Description:
>  
>  What:          /sys/bus/cxl/devices/regionZ/target[0..N]
>  Date:          May, 2022
> -KernelVersion: v5.20
> +KernelVersion: v6.0
>  Contact:       linux-cxl@vger.kernel.org
>  Description:
>                 (RW) Write an endpoint decoder object name to 'targetX' where X
> @@ -391,7 +391,7 @@ Description:
>  
>  What:          /sys/bus/cxl/devices/regionZ/commit
>  Date:          May, 2022
> -KernelVersion: v5.20
> +KernelVersion: v6.0
>  Contact:       linux-cxl@vger.kernel.org
>  Description:
>                 (RW) Write a boolean 'true' string value to this attribute to
> 
>
Verma, Vishal L Feb. 9, 2023, 12:24 a.m. UTC | #30
On Sun, 2023-02-05 at 17:02 -0800, Dan Williams wrote:
> Shipping versions of the cxl-cli utility expect all regions to have a
> 'uuid' attribute. In preparation for 'ram' regions, update the 'uuid'
> attribute to return an empty string which satisfies the current
> expectations of 'cxl list -R'. Otherwise, 'cxl list -R' fails in the
> presence of regions with the 'uuid' attribute missing. Force the
> attribute to be read-only as there is no facility or expectation for a
> 'ram' region to recall its uuid from one boot to the next.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  Documentation/ABI/testing/sysfs-bus-cxl |    3 ++-
>  drivers/cxl/core/region.c               |    7 +++++--
>  2 files changed, 7 insertions(+), 3 deletions(-)

Looks good,

Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>

> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index 058b0c45001f..4c4e1cbb1169 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -317,7 +317,8 @@ Contact:    linux-cxl@vger.kernel.org
>  Description:
>                 (RW) Write a unique identifier for the region. This field must
>                 be set for persistent regions and it must not conflict with the
> -               UUID of another region.
> +               UUID of another region. For volatile ram regions this
> +               attribute is a read-only empty string.
>  
>  
>  What:          /sys/bus/cxl/devices/regionZ/interleave_granularity
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 17d2d0c12725..c9e7f05caa0f 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -45,7 +45,10 @@ static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
>         rc = down_read_interruptible(&cxl_region_rwsem);
>         if (rc)
>                 return rc;
> -       rc = sysfs_emit(buf, "%pUb\n", &p->uuid);
> +       if (cxlr->mode != CXL_DECODER_PMEM)
> +               rc = sysfs_emit(buf, "\n");
> +       else
> +               rc = sysfs_emit(buf, "%pUb\n", &p->uuid);
>         up_read(&cxl_region_rwsem);
>  
>         return rc;
> @@ -301,7 +304,7 @@ static umode_t cxl_region_visible(struct kobject *kobj, struct attribute *a,
>         struct cxl_region *cxlr = to_cxl_region(dev);
>  
>         if (a == &dev_attr_uuid.attr && cxlr->mode != CXL_DECODER_PMEM)
> -               return 0;
> +               return 0444;
>         return a->mode;
>  }
>  
> 
>
Verma, Vishal L Feb. 9, 2023, 1:02 a.m. UTC | #31
On Sun, 2023-02-05 at 17:02 -0800, Dan Williams wrote:
> Expand the region creation infrastructure to enable 'ram'
> (volatile-memory) regions. The internals of create_pmem_region_store()
> and create_pmem_region_show() are factored out into helpers
> __create_region() and __create_region_show() for the 'ram' case to
> reuse.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  Documentation/ABI/testing/sysfs-bus-cxl |   22 +++++-----
>  drivers/cxl/core/core.h                 |    1 
>  drivers/cxl/core/port.c                 |   14 ++++++
>  drivers/cxl/core/region.c               |   71 +++++++++++++++++++++++++------
>  4 files changed, 82 insertions(+), 26 deletions(-)

Looks good,

Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>

> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index 4c4e1cbb1169..3acf2f17a73f 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -285,20 +285,20 @@ Description:
>                 interleave_granularity).
>  
>  
> -What:          /sys/bus/cxl/devices/decoderX.Y/create_pmem_region
> -Date:          May, 2022
> -KernelVersion: v6.0
> +What:          /sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram}_region
> +Date:          May, 2022, January, 2023
> +KernelVersion: v6.0 (pmem), v6.3 (ram)
>  Contact:       linux-cxl@vger.kernel.org
>  Description:
>                 (RW) Write a string in the form 'regionZ' to start the process
> -               of defining a new persistent memory region (interleave-set)
> -               within the decode range bounded by root decoder 'decoderX.Y'.
> -               The value written must match the current value returned from
> -               reading this attribute. An atomic compare exchange operation is
> -               done on write to assign the requested id to a region and
> -               allocate the region-id for the next creation attempt. EBUSY is
> -               returned if the region name written does not match the current
> -               cached value.
> +               of defining a new persistent, or volatile memory region
> +               (interleave-set) within the decode range bounded by root decoder
> +               'decoderX.Y'. The value written must match the current value
> +               returned from reading this attribute. An atomic compare exchange
> +               operation is done on write to assign the requested id to a
> +               region and allocate the region-id for the next creation attempt.
> +               EBUSY is returned if the region name written does not match the
> +               current cached value.
>  
>  
>  What:          /sys/bus/cxl/devices/decoderX.Y/delete_region
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 8c04672dca56..5eb873da5a30 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -11,6 +11,7 @@ extern struct attribute_group cxl_base_attribute_group;
>  
>  #ifdef CONFIG_CXL_REGION
>  extern struct device_attribute dev_attr_create_pmem_region;
> +extern struct device_attribute dev_attr_create_ram_region;
>  extern struct device_attribute dev_attr_delete_region;
>  extern struct device_attribute dev_attr_region;
>  extern const struct device_type cxl_pmem_region_type;
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 8566451cb22f..47e450c3a5a9 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -294,6 +294,7 @@ static struct attribute *cxl_decoder_root_attrs[] = {
>         &dev_attr_cap_type3.attr,
>         &dev_attr_target_list.attr,
>         SET_CXL_REGION_ATTR(create_pmem_region)
> +       SET_CXL_REGION_ATTR(create_ram_region)
>         SET_CXL_REGION_ATTR(delete_region)
>         NULL,
>  };
> @@ -305,6 +306,13 @@ static bool can_create_pmem(struct cxl_root_decoder *cxlrd)
>         return (cxlrd->cxlsd.cxld.flags & flags) == flags;
>  }
>  
> +static bool can_create_ram(struct cxl_root_decoder *cxlrd)
> +{
> +       unsigned long flags = CXL_DECODER_F_TYPE3 | CXL_DECODER_F_RAM;
> +
> +       return (cxlrd->cxlsd.cxld.flags & flags) == flags;
> +}
> +
>  static umode_t cxl_root_decoder_visible(struct kobject *kobj, struct attribute *a, int n)
>  {
>         struct device *dev = kobj_to_dev(kobj);
> @@ -313,7 +321,11 @@ static umode_t cxl_root_decoder_visible(struct kobject *kobj, struct attribute *
>         if (a == CXL_REGION_ATTR(create_pmem_region) && !can_create_pmem(cxlrd))
>                 return 0;
>  
> -       if (a == CXL_REGION_ATTR(delete_region) && !can_create_pmem(cxlrd))
> +       if (a == CXL_REGION_ATTR(create_ram_region) && !can_create_ram(cxlrd))
> +               return 0;
> +
> +       if (a == CXL_REGION_ATTR(delete_region) &&
> +           !(can_create_pmem(cxlrd) || can_create_ram(cxlrd)))
>                 return 0;
>  
>         return a->mode;
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 53d6dbe4de6d..8dea49c021b8 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -1685,6 +1685,15 @@ static struct cxl_region *devm_cxl_add_region(struct cxl_root_decoder *cxlrd,
>         struct device *dev;
>         int rc;
>  
> +       switch (mode) {
> +       case CXL_DECODER_RAM:
> +       case CXL_DECODER_PMEM:
> +               break;
> +       default:
> +               dev_err(&cxlrd->cxlsd.cxld.dev, "unsupported mode %d\n", mode);
> +               return ERR_PTR(-EINVAL);
> +       }
> +
>         cxlr = cxl_region_alloc(cxlrd, id);
>         if (IS_ERR(cxlr))
>                 return cxlr;
> @@ -1713,12 +1722,38 @@ static struct cxl_region *devm_cxl_add_region(struct cxl_root_decoder *cxlrd,
>         return ERR_PTR(rc);
>  }
>  
> +static ssize_t __create_region_show(struct cxl_root_decoder *cxlrd, char *buf)
> +{
> +       return sysfs_emit(buf, "region%u\n", atomic_read(&cxlrd->region_id));
> +}
> +
>  static ssize_t create_pmem_region_show(struct device *dev,
>                                        struct device_attribute *attr, char *buf)
>  {
> -       struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(dev);
> +       return __create_region_show(to_cxl_root_decoder(dev), buf);
> +}
>  
> -       return sysfs_emit(buf, "region%u\n", atomic_read(&cxlrd->region_id));
> +static ssize_t create_ram_region_show(struct device *dev,
> +                                     struct device_attribute *attr, char *buf)
> +{
> +       return __create_region_show(to_cxl_root_decoder(dev), buf);
> +}
> +
> +static struct cxl_region *__create_region(struct cxl_root_decoder *cxlrd,
> +                                         enum cxl_decoder_mode mode, int id)
> +{
> +       int rc;
> +
> +       rc = memregion_alloc(GFP_KERNEL);
> +       if (rc < 0)
> +               return ERR_PTR(rc);
> +
> +       if (atomic_cmpxchg(&cxlrd->region_id, id, rc) != id) {
> +               memregion_free(rc);
> +               return ERR_PTR(-EBUSY);
> +       }
> +
> +       return devm_cxl_add_region(cxlrd, id, mode, CXL_DECODER_EXPANDER);
>  }
>  
>  static ssize_t create_pmem_region_store(struct device *dev,
> @@ -1727,29 +1762,37 @@ static ssize_t create_pmem_region_store(struct device *dev,
>  {
>         struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(dev);
>         struct cxl_region *cxlr;
> -       int id, rc;
> +       int rc, id;
>  
>         rc = sscanf(buf, "region%d\n", &id);
>         if (rc != 1)
>                 return -EINVAL;
>  
> -       rc = memregion_alloc(GFP_KERNEL);
> -       if (rc < 0)
> -               return rc;
> +       cxlr = __create_region(cxlrd, CXL_DECODER_PMEM, id);
> +       if (IS_ERR(cxlr))
> +               return PTR_ERR(cxlr);
> +       return len;
> +}
> +DEVICE_ATTR_RW(create_pmem_region);
>  
> -       if (atomic_cmpxchg(&cxlrd->region_id, id, rc) != id) {
> -               memregion_free(rc);
> -               return -EBUSY;
> -       }
> +static ssize_t create_ram_region_store(struct device *dev,
> +                                      struct device_attribute *attr,
> +                                      const char *buf, size_t len)
> +{
> +       struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(dev);
> +       struct cxl_region *cxlr;
> +       int rc, id;
>  
> -       cxlr = devm_cxl_add_region(cxlrd, id, CXL_DECODER_PMEM,
> -                                  CXL_DECODER_EXPANDER);
> +       rc = sscanf(buf, "region%d\n", &id);
> +       if (rc != 1)
> +               return -EINVAL;
> +
> +       cxlr = __create_region(cxlrd, CXL_DECODER_RAM, id);
>         if (IS_ERR(cxlr))
>                 return PTR_ERR(cxlr);
> -
>         return len;
>  }
> -DEVICE_ATTR_RW(create_pmem_region);
> +DEVICE_ATTR_RW(create_ram_region);
>  
>  static ssize_t region_show(struct device *dev, struct device_attribute *attr,
>                            char *buf)
> 
>
Dan Williams Feb. 9, 2023, 4:07 a.m. UTC | #32
Jonathan Cameron wrote:
> On Sun, 05 Feb 2023 17:03:29 -0800
> Dan Williams <dan.j.williams@intel.com> wrote:
> 
> > Region autodiscovery is an asynchrounous state machine advanced by
> > cxl_port_probe(). After the decoders on an endpoint port are enumerated
> > they are scanned for actively enabled instances. Each active decoder is
> > flagged for auto-assembly CXL_DECODER_F_AUTO and attached to a region.
> > If a region does not already exist for the address range setting of the
> > decoder one is created. That creation process may race with other
> > decoders of the same region being discovered since cxl_port_probe() is
> > asynchronous. A new 'struct cxl_root_decoder' lock, @range_lock, is
> > introduced to mitigate that race.
> > 
> > Once all decoders have arrived, "p->nr_targets == p->interleave_ways",
> > they are sorted by their relative decode position. The sort algorithm
> > involves finding the point in the cxl_port topology where one leg of the
> > decode leads to deviceA and the other deviceB. At that point in the
> > topology the target order in the 'struct cxl_switch_decoder' indicates
> > the relative position of those endpoint decoders in the region.
> > 
> > >From that point the region goes through the same setup and validation  
> > steps as user-created regions, but instead of programming the decoders
> > it validates that driver would have written the same values to the
> > decoders as were already present.
> > 
> > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> A few comments inline,
> 
> Thanks,
> 
> Jonathan
> 
> > ---
> >  drivers/cxl/core/hdm.c    |    5 
> >  drivers/cxl/core/port.c   |    2 
> >  drivers/cxl/core/region.c |  496 ++++++++++++++++++++++++++++++++++++++++++++-
> >  drivers/cxl/cxl.h         |   16 +
> >  drivers/cxl/port.c        |   26 ++
> >  5 files changed, 531 insertions(+), 14 deletions(-)
> > 
> > diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> > index dcc16d7cb8f3..174cddfec6e8 100644
> > --- a/drivers/cxl/core/hdm.c
> > +++ b/drivers/cxl/core/hdm.c
> > @@ -674,7 +674,7 @@ static int cxl_decoder_reset(struct cxl_decoder *cxld)
> >  	up_read(&cxl_dpa_rwsem);
> >  
> >  	port->commit_end--;
> > -	cxld->flags &= ~CXL_DECODER_F_ENABLE;
> > +	cxld->flags &= ~(CXL_DECODER_F_ENABLE | CXL_DECODER_F_AUTO);
> >  
> >  	return 0;
> >  }
> > @@ -719,7 +719,7 @@ static int init_hdm_decoder(struct cxl_port *port, struct cxl_decoder *cxld,
> >  
> >  	/* decoders are enabled if committed */
> >  	if (committed) {
> > -		cxld->flags |= CXL_DECODER_F_ENABLE;
> > +		cxld->flags |= CXL_DECODER_F_ENABLE | CXL_DECODER_F_AUTO;
> >  		if (ctrl & CXL_HDM_DECODER0_CTRL_LOCK)
> >  			cxld->flags |= CXL_DECODER_F_LOCK;
> >  		if (FIELD_GET(CXL_HDM_DECODER0_CTRL_TYPE, ctrl))
> > @@ -783,6 +783,7 @@ static int init_hdm_decoder(struct cxl_port *port, struct cxl_decoder *cxld,
> >  		return rc;
> >  	}
> >  	*dpa_base += dpa_size + skip;
> > +
> 
> :)  No comment.

Spurious newlines... my old friend.

> 
> >  	return 0;
> >  }
> >  
> 
> 
> ...
> 
> 
> > +static int cmp_decode_pos(const void *a, const void *b)
> > +{
> > +	struct cxl_endpoint_decoder *cxled_a = *(typeof(cxled_a) *)a;
> > +	struct cxl_endpoint_decoder *cxled_b = *(typeof(cxled_b) *)b;
> > +	struct cxl_memdev *cxlmd_a = cxled_to_memdev(cxled_a);
> > +	struct cxl_memdev *cxlmd_b = cxled_to_memdev(cxled_b);
> > +	struct cxl_port *port_a = cxled_to_port(cxled_a);
> > +	struct cxl_port *port_b = cxled_to_port(cxled_b);
> > +	struct cxl_port *iter_a, *iter_b;
> > +
> > +	/* Exit early if any prior sorting failed */
> > +	if (cxled_a->pos < 0 || cxled_b->pos < 0)
> > +		return 0;
> > +
> > +	/*
> > +	 * Walk up the hierarchy to find a shared port, find the decoder
> > +	 * that maps the range, compare the relative position of those
> > +	 * dport mappings.
> > +	 */
> > +	for (iter_a = port_a; iter_a; iter_a = next_port(iter_a)) {
> > +		struct cxl_port *next_a, *next_b, *port;
> > +		struct cxl_switch_decoder *cxlsd;
> > +
> > +		next_a = next_port(iter_a);
> > +		for (iter_b = port_b; iter_b; iter_b = next_port(iter_b)) {
> > +			int a_pos, b_pos, result;
> > +			struct device *dev;
> > +			unsigned int seq;
> > +
> > +			next_b = next_port(iter_b);
> > +			if (next_a != next_b)
> > +				continue;
> > +			if (!next_a)
> > +				goto out;
> Perhaps treat this as a 'find the match' then carry on, so as to avoid having the
> guts of the handling nested so deep.  So something like.

I always like a good out-indent, will give it a look.

> 
> 	struct cxl_switch_decoder *cxlsd;
> 	struct cxl_port *next_a, *next_b, *port;
> 	bool matched = false; // May be simpler test - I'm just lazy
> 
> 	// possibly pull this out as a utility function so that
> 	// error handling can be tidied up by simply returning 
> 	// rather than gotos.
> 	for (iter_a = port_a; iter_a; iter_a = next_port(iter_a)) {
> 		next_a = next_port(iter_a);
> 		for (iter_b = port_b; iter_b; iter_b = next_port(iter_b)) {
> 			int a_pos, b_pos, result;
> 			struct device *dev;
> 			unsigned int seq;
> 
> 			next_b = next_port(iter_b);
> 			if (next_a != next_b)
> 				continue;
> 			if (!next_a)
> 				goto out;
> 			port = next_a;
> 			found = true;

@found can just be @port.

> 			break;
> 		}
> 		if (found)
> 			break;
> 	}
> 	if (!found)
> 		goto out;
> 
> 	dev = device_find_child(&port->dev, cxled_a,
> 				decoder_match_range);
> 	if (!dev) {
> 		struct range *range = &cxled_a->cxld.hpa_range;
> 		dev_err(port->uport,
> 			"failed to find decoder that maps %#llx-:%#llx\n",
> 		range->start, range->end);
> 		cxled_a->pos = -1;
> 		return 0;
> 	}
> 
> 	cxlsd = to_cxl_switch_decoder(dev);
> 	do {
> 		seq = read_seqbegin(&cxlsd->target_lock);
> 		find_positions(cxlsd, iter_a, iter_b, &a_pos, &b_pos);
> 	} while (read_seqretry(&cxlsd->target_lock, seq));
> 
> 	if (a_pos < 0 || b_pos < 0) {
> 		dev_err(port->uport,
> 			"failed to find shared decoder for %s and %s\n",
> 			dev_name(cxlmd_a->dev.parent),
> 			dev_name(cxlmd_b->dev.parent));
> 		cxled_a->pos = -1;
> 		result = 0;
> //if factored as suggested above then a goto here and drop
> // the else would be neater again.

Perhaps, I'll take a look.

> 
> 	} else {
> 		result = a_pos - b_pos;
> 		dev_dbg(port->uport, "%s: %s comes %s %s\n",
> 			dev_name(&cxlsd->cxld.dev),
> 			dev_name(cxlmd_a->dev.parent),
> 			result < 0 ? "before" : "after",
> 			dev_name(cxlmd_b->dev.parent));
> 		}
> 	put_device(dev);
> 
> 	return result;
> out:
> 	dev_err(cxlmd_a->dev.parent, "failed to find shared port with %s\n",
> 		dev_name(cxlmd_b->dev.parent));
> 	cxled_a->pos = -1;
> 	return 0;
> }
> 
> 
> > +			dev = device_find_child(&port->dev, cxled_a,
> > +						decoder_match_range);
> > +			if (!dev) {
> > +				struct range *range = &cxled_a->cxld.hpa_range;
> > +
> > +				dev_err(port->uport,
> > +					"failed to find decoder that maps %#llx-:%#llx\n",
> > +					range->start, range->end);
> > +				cxled_a->pos = -1;
> > +				return 0;
> > +			}
> > +
> > +			cxlsd = to_cxl_switch_decoder(dev);
> > +			do {
> > +				seq = read_seqbegin(&cxlsd->target_lock);
> > +				find_positions(cxlsd, iter_a, iter_b, &a_pos,
> > +					       &b_pos);
> > +			} while (read_seqretry(&cxlsd->target_lock, seq));
> > +
> > +			if (a_pos < 0 || b_pos < 0) {
> > +				dev_err(port->uport,
> > +					"failed to find shared decoder for %s and %s\n",
> > +					dev_name(cxlmd_a->dev.parent),
> > +					dev_name(cxlmd_b->dev.parent));
> > +				cxled_a->pos = -1;
> > +				result = 0;
> > +			} else {
> > +				result = a_pos - b_pos;
> > +				dev_dbg(port->uport, "%s: %s comes %s %s\n",
> > +					dev_name(&cxlsd->cxld.dev),
> > +					dev_name(cxlmd_a->dev.parent),
> > +					result < 0 ? "before" : "after",
> > +					dev_name(cxlmd_b->dev.parent));
> > +			}
> > +
> > +			put_device(dev);
> > +
> > +			return result;
> > +		}
> > +	}
> > +out:
> > +	dev_err(cxlmd_a->dev.parent, "failed to find shared port with %s\n",
> > +		dev_name(cxlmd_b->dev.parent));
> > +	cxled_a->pos = -1;
> > +	return 0;
> > +}
> > +
> > +static int cxl_region_sort_targets(struct cxl_region *cxlr)
> > +{
> > +	struct cxl_region_params *p = &cxlr->params;
> > +	int i, rc = 0;
> > +
> > +	sort(p->targets, p->nr_targets, sizeof(p->targets[0]), cmp_decode_pos,
> > +	     NULL);
> > +
> > +	for (i = 0; i < p->nr_targets; i++) {
> > +		struct cxl_endpoint_decoder *cxled = p->targets[i];
> > +
> > +		if (cxled->pos < 0)
> > +			rc = -ENXIO;
> > +		cxled->pos = i;
> > +	}
> > +
> > +	dev_dbg(&cxlr->dev, "region sort %s\n", rc ? "failed" : "successful");
> > +	return rc;
> > +}
> > +
> >  static int cxl_region_attach(struct cxl_region *cxlr,
> >  			     struct cxl_endpoint_decoder *cxled, int pos)
> >  {
> > @@ -1350,6 +1572,50 @@ static int cxl_region_attach(struct cxl_region *cxlr,
> >  		return -EINVAL;
> >  	}
> >  
> > +	if (test_bit(CXL_REGION_F_AUTO, &cxlr->flags)) {
> > +		int i;
> > +
> > +		rc = cxl_region_attach_auto(cxlr, cxled, pos);
> > +		if (rc)
> > +			return rc;
> > +
> > +		/* await more targets to arrive... */
> > +		if (p->nr_targets < p->interleave_ways)
> > +			return 0;
> > +
> > +		/*
> > +		 * All targets are here, which implies all PCI enumeration that
> > +		 * affects this region has been completed. Walk the topology to
> > +		 * sort the devices into their relative region decode position.
> > +		 */
> > +		rc = cxl_region_sort_targets(cxlr);
> > +		if (rc)
> > +			return rc;
> > +
> > +		for (i = 0; i < p->nr_targets; i++) {
> > +			cxled = p->targets[i];
> > +			ep_port = cxled_to_port(cxled);
> > +			dport = cxl_find_dport_by_dev(root_port,
> > +						      ep_port->host_bridge);
> > +			rc = cxl_region_attach_position(cxlr, cxlrd, cxled,
> > +							dport, i);
> > +			if (rc)
> > +				return rc;
> > +		}
> > +
> > +		rc = cxl_region_setup_targets(cxlr);
> > +		if (rc)
> > +			return rc;
> > +
> > +		/*
> > +		 * If target setup succeeds in the autodiscovery case
> > +		 * then the region is already committed.
> > +		 */
> > +		p->state = CXL_CONFIG_COMMIT;
> > +
> > +		return 0;
> > +	}
> > +
> >  	rc = cxl_region_validate_position(cxlr, cxled, pos);
> >  	if (rc)
> >  		return rc;
> > @@ -1500,8 +1766,8 @@ static int detach_target(struct cxl_region *cxlr, int pos)
> >  	return rc;
> >  }
> >  
> > -static size_t store_targetN(struct cxl_region *cxlr, const char *buf, int pos,
> > -			    size_t len)
> > +static ssize_t store_targetN(struct cxl_region *cxlr, const char *buf, int pos,
> > +			     size_t len)
> 
> ?
> 
> 
> 
> ...
> 
> 
> > +
> > +void cxl_add_to_region(struct cxl_endpoint_decoder *cxled)
> > +{
> > +	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> > +	struct range *hpa = &cxled->cxld.hpa_range;
> > +	struct cxl_decoder *cxld = &cxled->cxld;
> > +	struct cxl_root_decoder *cxlrd;
> > +	struct cxl_region_params *p;
> > +	struct cxl_region *cxlr;
> > +	struct cxl_port *root;
> > +	bool attach = false;
> > +	struct device *dev;
> > +
> > +	root = find_cxl_root(&cxlmd->dev);
> > +	if (!root) {
> > +		dev_err(cxlmd->dev.parent, "%s: failed to map CXL root\n",
> > +			dev_name(&cxlmd->dev));
> > +		return;
> 
> I'm not a fan of muddling on from errors at lower layers of code.
> I'd push that decision up to the caller - so return an error code from here.
> If caller just wants to ignore it (with appropriate comment on why) then
> that is fine by me.

Yeah, in fact the caller should already know about the root at this
point since this is called from cxl_port_probe() that did that port
establishment. Perhaps this can just take @root as an argument.

Note though that all errors here are ignored by the caller because there
is simply no way to know ahead of time whether this region will succeed
in receiving all its members. So the failure scenario is missing
regions, not anything that would make the port driver fail to load.

> 
> 
> > +	}
> > +
> > +	dev = device_find_child(&root->dev, &cxld->hpa_range,
> > +				  match_decoder_by_range);
> 
> device_find_child is acquiring a reference to the child I think.
> 
> > +	if (!dev) {
> > +		dev_err(cxlmd->dev.parent,
> > +			"%s:%s no CXL window for range %#llx:%#llx\n",
> > +			dev_name(&cxlmd->dev), dev_name(&cxld->dev),
> > +			cxld->hpa_range.start, cxld->hpa_range.end);
> > +		goto out;
> > +	}
> > +	cxlrd = to_cxl_root_decoder(dev);
> Here we've stashed the dev that we have a reference to in cxlrd.
> > +
> > +	mutex_lock(&cxlrd->range_lock);
> > +	dev = device_find_child(&cxlrd->cxlsd.cxld.dev, hpa,
> > +				match_region_by_range);
> > +	if (!dev)
> > +		cxlr = construct_region(cxlrd, cxled);
> > +	else
> > +		cxlr = to_cxl_region(dev);
> > +	mutex_unlock(&cxlrd->range_lock);
> > +
> > +	if (IS_ERR(cxlr))
> 
> In this path, do we drop the reference to cxlrd->dev?

No, good eye. The reference can be dropped immediately since the root
decoder will not be unregistered until the root port is removed and by
this point this port has registered itself in the device_unregister()
chain when its ancestors are taken out.

> 
> > +		goto out;
> > +
> > +	attach_target(cxlr, cxled, -1, TASK_UNINTERRUPTIBLE);
> > +
> > +	down_read(&cxl_region_rwsem);
> > +	p = &cxlr->params;
> > +	attach = p->state >= CXL_CONFIG_COMMIT;
> 
> Test is a bit non obvious.  In the tree I'm looking at
> CXL_CONFIG_COMMIT is last value in the enum.

Can switch to "=="

> 
> 
> > +	up_read(&cxl_region_rwsem);
> > +
> > +	if (attach) {
> > +		int rc = device_attach(&cxlr->dev);
> > +
> > +		/*
> > +		 * If device_attach() fails the range may still be
> > +		 * active via the platform-firmware memory map
> > +		 */
> > +		if (rc < 0)
> 
> We also want to know if the no matching driver found case happened so perhaps
> 		if (rc != 1)

That can't happen as the only possible driver, cxl_region_driver, is in
this same file. I'll add a comment to that effect as its a small
layering violation to know that.

> 
> > +			dev_err(&cxlr->dev, "failed to enable, range: %pr\n",
> > +				p->res);
> > +	}
> > +
> > +	put_device(&cxlr->dev);
> > +out:
> > +	put_device(&root->dev);
> > +}
> > +EXPORT_SYMBOL_NS_GPL(cxl_add_to_region, CXL);
> > +
> >  static int cxl_region_invalidate_memregion(struct cxl_region *cxlr)
> >  {
> >  	if (!test_bit(CXL_REGION_F_INCOHERENT, &cxlr->flags))
> > @@ -2103,6 +2554,15 @@ static int cxl_region_invalidate_memregion(struct cxl_region *cxlr)
> >  	return 0;
> >  }
> 
> > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> > index ca76879af1de..9b3765c5c81a 100644
> > --- a/drivers/cxl/cxl.h
> > +++ b/drivers/cxl/cxl.h
> > @@ -261,6 +261,8 @@ resource_size_t cxl_rcrb_to_component(struct device *dev,
> >   * cxl_decoder flags that define the type of memory / devices this
> >   * decoder supports as well as configuration lock status See "CXL 2.0
> >   * 8.2.5.12.7 CXL HDM Decoder 0 Control Register" for details.
> > + * Additionally indicate whether decoder settings were autodetected,
> > + * user customized.
> >   */
> >  #define CXL_DECODER_F_RAM   BIT(0)
> >  #define CXL_DECODER_F_PMEM  BIT(1)
> > @@ -268,6 +270,7 @@ resource_size_t cxl_rcrb_to_component(struct device *dev,
> >  #define CXL_DECODER_F_TYPE3 BIT(3)
> >  #define CXL_DECODER_F_LOCK  BIT(4)
> >  #define CXL_DECODER_F_ENABLE    BIT(5)
> > +#define CXL_DECODER_F_AUTO  BIT(6)
> 
> That's a bit nasty and unexpected - can we rename to make it really clear
> it's not a hardware flag at the point of usage.
> Or just put it in a a new structure field rather than pushing it in here.

I'll move it to an endpoint-flag value since it is only relevant for
those.

> >  #define CXL_DECODER_F_MASK  GENMASK(5, 0)
> 
> Hmm. This mask isn't currently used and now has
> an odd meaning because of hiding a flag in bit(6).

True, it can be deleted by a separate cleanup.

> 
> 
> ...
> 
> >  
> >  /*
> > diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c
> > index 5453771bf330..012a0c6f8476 100644
> > --- a/drivers/cxl/port.c
> > +++ b/drivers/cxl/port.c
> > @@ -30,6 +30,23 @@ static void schedule_detach(void *cxlmd)
> >  	schedule_cxl_memdev_detach(cxlmd);
> >  }
> >  
> > +static int discover_region(struct device *dev, void *data)
> > +{
> > +	const unsigned long flags = CXL_DECODER_F_ENABLE | CXL_DECODER_F_AUTO;
> > +	struct cxl_endpoint_decoder *cxled;
> > +
> > +	if (!is_endpoint_decoder(dev))
> > +		return 0;
> > +
> > +	cxled = to_cxl_endpoint_decoder(dev);
> > +	if ((cxled->cxld.flags & flags) != flags)
> > +		return 0;
> > +
> > +	cxl_add_to_region(cxled);
> > +
> > +	return 0;
> > +}
> > +
> >  static int cxl_port_probe(struct device *dev)
> >  {
> >  	struct cxl_port *port = to_cxl_port(dev);
> > @@ -78,6 +95,15 @@ static int cxl_port_probe(struct device *dev)
> >  		return rc;
> >  	}
> >  
> > +	if (!is_cxl_endpoint(port))
> > +		return 0;
> > +
> > +	/*
> > +	 * Now that all endpoint decoders are successfully enumerated,
> > +	 * try to assemble regions from committed decoders
> > +	 */
> > +	device_for_each_child(dev, NULL, discover_region);
> > +
> 
> There is a steady reduction in common code vs specific code for the two
> cases in here.  Maybe it's time just to split the probe into
> 
> static int cxl_port_probe(struct device *dev)
> {
> 	if (is_cxl_endpoint(port))
> 		cxl_port_probe_endpoint();
> 	else
> 		cxl_port_probe_not endpoint(); //with better naming.
> }
> 
> Shared code is about 7 lines vs 40+ unshared.

Works for me.
Dan Williams Feb. 9, 2023, 4:09 a.m. UTC | #33
Jonathan Cameron wrote:
> On Sun, 05 Feb 2023 17:03:07 -0800
> Dan Williams <dan.j.williams@intel.com> wrote:
> 
> > In preparation for region autodiscovery, that needs all devices
> > discovered before their relative position in the region can be
> > determined, consolidate all position dependent validation in a helper.
> > 
> > Recall that in the on-demand region creation flow the end-user picks the
> > position of a given endpoint decoder in a region. In the autodiscovery
> > case the position of an endpoint decoder can only be determined after
> > all other endpoint decoders that claim to decode the region's address
> > range have been enumerated and attached. So, in the autodiscovery case
> > endpoint decoders may be attached before their relative position is
> > known. Once all decoders arrive, then positions can be determined and
> > validated with cxl_region_validate_position() the same as user initiated
> > on-demand creation.
> > 
> > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> Hi Dan,
> 
> A few comments inline, but mostly reflect the original code rather than
> the refactoring you have done in this patch.
> 
> Jonathan
> 
> 
> > +static int cxl_region_attach(struct cxl_region *cxlr,
> > +			     struct cxl_endpoint_decoder *cxled, int pos)
> > +{
> > +	struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(cxlr->dev.parent);
> > +	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> > +	struct cxl_region_params *p = &cxlr->params;
> > +	struct cxl_port *ep_port, *root_port;
> > +	struct cxl_dport *dport;
> > +	int rc = -ENXIO;
> > +
> > +	if (cxled->mode != cxlr->mode) {
> > +		dev_dbg(&cxlr->dev, "%s region mode: %d mismatch: %d\n",
> > +			dev_name(&cxled->cxld.dev), cxlr->mode, cxled->mode);
> > +		return -EINVAL;
> > +	}
> > +
> > +	if (cxled->mode == CXL_DECODER_DEAD) {
> > +		dev_dbg(&cxlr->dev, "%s dead\n", dev_name(&cxled->cxld.dev));
> > +		return -ENODEV;
> > +	}
> > +
> > +	/* all full of members, or interleave config not established? */
> > +	if (p->state > CXL_CONFIG_INTERLEAVE_ACTIVE) {
> > +		dev_dbg(&cxlr->dev, "region already active\n");
> > +		return -EBUSY;
> > +	} else if (p->state < CXL_CONFIG_INTERLEAVE_ACTIVE) {
> > +		dev_dbg(&cxlr->dev, "interleave config missing\n");
> > +		return -ENXIO;
> > +	}
> > +
> >  	ep_port = cxled_to_port(cxled);
> >  	root_port = cxlrd_to_port(cxlrd);
> >  	dport = cxl_find_dport_by_dev(root_port, ep_port->host_bridge);
> > @@ -1284,13 +1327,6 @@ static int cxl_region_attach(struct cxl_region *cxlr,
> >  		return -ENXIO;
> >  	}
> >  
> > -	if (cxlrd->calc_hb(cxlrd, pos) != dport) {
> > -		dev_dbg(&cxlr->dev, "%s:%s invalid target position for %s\n",
> > -			dev_name(&cxlmd->dev), dev_name(&cxled->cxld.dev),
> > -			dev_name(&cxlrd->cxlsd.cxld.dev));
> > -		return -ENXIO;
> > -	}
> > -
> 
> In an ideal world, this would have been nice as two patches.
> One that reorders the various checks so that they are in the order
> after you have factored things out (easy to review for correctness)
> then one that factored it out.
> 
> >  	if (cxled->cxld.target_type != cxlr->type) {
> >  		dev_dbg(&cxlr->dev, "%s:%s type mismatch: %d vs %d\n",
> >  			dev_name(&cxlmd->dev), dev_name(&cxled->cxld.dev),
> > @@ -1314,12 +1350,13 @@ static int cxl_region_attach(struct cxl_region *cxlr,
> >  		return -EINVAL;
> >  	}
> >  
> > -	for (iter = ep_port; !is_cxl_root(iter);
> > -	     iter = to_cxl_port(iter->dev.parent)) {
> > -		rc = cxl_port_attach_region(iter, cxlr, cxled, pos);
> > -		if (rc)
> > -			goto err;
> > -	}
> > +	rc = cxl_region_validate_position(cxlr, cxled, pos);
> > +	if (rc)
> > +		return rc;
> > +
> > +	rc = cxl_region_attach_position(cxlr, cxlrd, cxled, dport, pos);
> > +	if (rc)
> > +		return rc;
> >  
> >  	p->targets[pos] = cxled;
> >  	cxled->pos = pos;
> 
> More something about original code than the refactoring...
> 
> I'm not keen on the side effects that aren't unwound in the error paths.
> 
> p->targets[pos] and cxled->pos are left set.  Probably never matters
> but not elegant or as easy to reason about as it would be if they
> were cleared in error cases.  In particular there is a check on
> whether p->targets[pos] is set that will result in a dev_dbg even
> though setting it up actually failed.

I'll clean that up with a lead-in fixup.
Dan Williams Feb. 9, 2023, 4:26 a.m. UTC | #34
Jonathan Cameron wrote:
[..]
> > @@ -1284,13 +1327,6 @@ static int cxl_region_attach(struct cxl_region *cxlr,
> >  		return -ENXIO;
> >  	}
> >  
> > -	if (cxlrd->calc_hb(cxlrd, pos) != dport) {
> > -		dev_dbg(&cxlr->dev, "%s:%s invalid target position for %s\n",
> > -			dev_name(&cxlmd->dev), dev_name(&cxled->cxld.dev),
> > -			dev_name(&cxlrd->cxlsd.cxld.dev));
> > -		return -ENXIO;
> > -	}
> > -
> 
> In an ideal world, this would have been nice as two patches.
> One that reorders the various checks so that they are in the order
> after you have factored things out (easy to review for correctness)
> then one that factored it out.

I played with this a bit and the only way I could see to make the diff
come out significantly nicer would be to use a forward declaration to
move the new helpers below cxl_region_attach().
Dan Williams Feb. 9, 2023, 4:56 a.m. UTC | #35
Fan Ni wrote:
> On Sun, Feb 05, 2023 at 05:02:29PM -0800, Dan Williams wrote:
> 
> 
> > Summary:
> > --------
> > 
> > CXL RAM support allows for the dynamic provisioning of new CXL RAM
> > regions, and more routinely, assembling a region from an existing
> > configuration established by platform-firmware. The latter is motivated
> > by CXL memory RAS (Reliability, Availability and Serviceability)
> > support, that requires associating device events with System Physical
> > Address ranges and vice versa.
> > 
> > The 'Soft Reserved' policy rework arranges for performance
> > differentiated memory like CXL attached DRAM, or high-bandwidth memory,
> > to be designated for 'System RAM' by default, rather than the device-dax
> > dedicated access mode. That current device-dax default is confusing and
> > surprising for the Pareto of users that do not expect memory to be
> > quarantined for dedicated access by default. Most users expect all
> > 'System RAM'-capable memory to show up in FREE(1).
> > 
> > 
> > Details:
> > --------
> > 
> > Recall that the Linux 'Soft Reserved' designation for memory is a
> > reaction to platform-firmware, like EFI EDK2, delineating memory with
> > the EFI Specific Purpose Memory attribute (EFI_MEMORY_SP). An
> > alternative way to think of that attribute is that it specifies the
> > *not* general-purpose memory pool. It is memory that may be too precious
> > for general usage or not performant enough for some hot data structures.
> > However, in the absence of explicit policy it should just be 'System
> > RAM' by default.
> > 
> > Rather than require every distribution to ship a udev policy to assign
> > dax devices to dax_kmem (the device-memory hotplug driver) just make
> > that the kernel default. This is similar to the rationale in:
> > 
> > commit 8604d9e534a3 ("memory_hotplug: introduce CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE")
> > 
> > With this change the relatively niche use case of accessing this memory
> > via mapping a device-dax instance can be achieved by building with
> > CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=n, or specifying
> > memhp_default_state=offline at boot, and then use:
> > 
> >     daxctl reconfigure-device $device -m devdax --force
> > 
> > ...to shift the corresponding address range to device-dax access.
> > 
> > The process of assembling a device-dax instance for a given CXL region
> > device configuration is similar to the process of assembling a
> > Device-Mapper or MDRAID storage-device array. Specifically, asynchronous
> > probing by the PCI and driver core enumerates all CXL endpoints and
> > their decoders. Then, once enough decoders have arrived to a describe a
> > given region, that region is passed to the device-dax subsystem where it
> > is subject to the above 'dax_kmem' policy. This assignment and policy
> > choice is only possible if memory is set aside by the 'Soft Reserved'
> > designation. Otherwise, CXL that is mapped as 'System RAM' becomes
> > immutable by CXL driver mechanisms, but is still enumerated for RAS
> > purposes.
> > 
> > This series is also available via:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/log/?h=for-6.3/cxl-ram-region
> > 
> > ...and has gone through some preview testing in various forms.
> > 
> > ---
> 
> Tested-by: Fan Ni <fan.ni@samsung.com>
> 
> 
> Run the following tests with the patch (with the volatile support at qemu).
> Note: cxl related code are compiled as modules and loaded before used.
> 
> For pmem setup, tried three topologies (1HB1RP1Mem, 1HB2RP2Mem, 1HB2RP4Mem with
> a cxl switch). The memdev is either provided in the command line when launching
> qemu or hot added to the guest with device_add command in qemu monitor.
> 
> The following operations are performed,
> 1. create-region with cxl cmd
> 2. create name-space with ndctl cmd
> 3. convert cxl mem to ram with daxctl cmd
> 4. online the memory with daxctl cmd
> 5. Let app use the memory (numactl --membind=1 htop)
> 
> Results: No regression.
> 
> For volatile memory (hot add with device_add command), mainly tested 1HB1RP1Mem
> case (passthrough).
> 1. the device can be correctly discovered after hot add (cxl list, may need
> cxl enable-memdev)
> 2. creating ram region (cxl create-region) succeeded, after creating the
> region, a dax device under /dev/ is shown.
> 3. online the memory passes, and the memory is shown on another NUMA node.
> 4. Let app use the memory (numactl --membind=1 htop) passed.

Thank you, Fan!
Jonathan Cameron Feb. 9, 2023, 11:07 a.m. UTC | #36
On Wed, 8 Feb 2023 20:26:26 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> Jonathan Cameron wrote:
> [..]
> > > @@ -1284,13 +1327,6 @@ static int cxl_region_attach(struct cxl_region *cxlr,
> > >  		return -ENXIO;
> > >  	}
> > >  
> > > -	if (cxlrd->calc_hb(cxlrd, pos) != dport) {
> > > -		dev_dbg(&cxlr->dev, "%s:%s invalid target position for %s\n",
> > > -			dev_name(&cxlmd->dev), dev_name(&cxled->cxld.dev),
> > > -			dev_name(&cxlrd->cxlsd.cxld.dev));
> > > -		return -ENXIO;
> > > -	}
> > > -  
> > 
> > In an ideal world, this would have been nice as two patches.
> > One that reorders the various checks so that they are in the order
> > after you have factored things out (easy to review for correctness)
> > then one that factored it out.  
> 
> I played with this a bit and the only way I could see to make the diff
> come out significantly nicer would be to use a forward declaration to
> move the new helpers below cxl_region_attach().

Don't bother then!  Unless you've already done it.

In the ideal world diff would magically present everything in the most
human readable form :)  What are all these AI folk working on that we
don't already have this!

Jonathan
Verma, Vishal L Feb. 9, 2023, 7:45 p.m. UTC | #37
On Sun, 2023-02-05 at 17:03 -0800, Dan Williams wrote:
> In preparation for region autodiscovery, that needs all devices
> discovered before their relative position in the region can be
> determined, consolidate all position dependent validation in a helper.
> 
> Recall that in the on-demand region creation flow the end-user picks the
> position of a given endpoint decoder in a region. In the autodiscovery
> case the position of an endpoint decoder can only be determined after
> all other endpoint decoders that claim to decode the region's address
> range have been enumerated and attached. So, in the autodiscovery case
> endpoint decoders may be attached before their relative position is
> known. Once all decoders arrive, then positions can be determined and
> validated with cxl_region_validate_position() the same as user initiated
> on-demand creation.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/cxl/core/region.c |  119 +++++++++++++++++++++++++++++----------------
>  1 file changed, 76 insertions(+), 43 deletions(-)

Looks good,

Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>

> 
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 97eafdd75675..c82d3b6f3d1f 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -1207,35 +1207,13 @@ static int cxl_region_setup_targets(struct cxl_region *cxlr)
>         return 0;
>  }
>  
> -static int cxl_region_attach(struct cxl_region *cxlr,
> -                            struct cxl_endpoint_decoder *cxled, int pos)
> +static int cxl_region_validate_position(struct cxl_region *cxlr,
> +                                       struct cxl_endpoint_decoder *cxled,
> +                                       int pos)
>  {
> -       struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(cxlr->dev.parent);
>         struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> -       struct cxl_port *ep_port, *root_port, *iter;
>         struct cxl_region_params *p = &cxlr->params;
> -       struct cxl_dport *dport;
> -       int i, rc = -ENXIO;
> -
> -       if (cxled->mode != cxlr->mode) {
> -               dev_dbg(&cxlr->dev, "%s region mode: %d mismatch: %d\n",
> -                       dev_name(&cxled->cxld.dev), cxlr->mode, cxled->mode);
> -               return -EINVAL;
> -       }
> -
> -       if (cxled->mode == CXL_DECODER_DEAD) {
> -               dev_dbg(&cxlr->dev, "%s dead\n", dev_name(&cxled->cxld.dev));
> -               return -ENODEV;
> -       }
> -
> -       /* all full of members, or interleave config not established? */
> -       if (p->state > CXL_CONFIG_INTERLEAVE_ACTIVE) {
> -               dev_dbg(&cxlr->dev, "region already active\n");
> -               return -EBUSY;
> -       } else if (p->state < CXL_CONFIG_INTERLEAVE_ACTIVE) {
> -               dev_dbg(&cxlr->dev, "interleave config missing\n");
> -               return -ENXIO;
> -       }
> +       int i;
>  
>         if (pos < 0 || pos >= p->interleave_ways) {
>                 dev_dbg(&cxlr->dev, "position %d out of range %d\n", pos,
> @@ -1274,6 +1252,71 @@ static int cxl_region_attach(struct cxl_region *cxlr,
>                 }
>         }
>  
> +       return 0;
> +}
> +
> +static int cxl_region_attach_position(struct cxl_region *cxlr,
> +                                     struct cxl_root_decoder *cxlrd,
> +                                     struct cxl_endpoint_decoder *cxled,
> +                                     const struct cxl_dport *dport, int pos)
> +{
> +       struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> +       struct cxl_port *iter;
> +       int rc;
> +
> +       if (cxlrd->calc_hb(cxlrd, pos) != dport) {
> +               dev_dbg(&cxlr->dev, "%s:%s invalid target position for %s\n",
> +                       dev_name(&cxlmd->dev), dev_name(&cxled->cxld.dev),
> +                       dev_name(&cxlrd->cxlsd.cxld.dev));
> +               return -ENXIO;
> +       }
> +
> +       for (iter = cxled_to_port(cxled); !is_cxl_root(iter);
> +            iter = to_cxl_port(iter->dev.parent)) {
> +               rc = cxl_port_attach_region(iter, cxlr, cxled, pos);
> +               if (rc)
> +                       goto err;
> +       }
> +
> +       return 0;
> +
> +err:
> +       for (iter = cxled_to_port(cxled); !is_cxl_root(iter);
> +            iter = to_cxl_port(iter->dev.parent))
> +               cxl_port_detach_region(iter, cxlr, cxled);
> +       return rc;
> +}
> +
> +static int cxl_region_attach(struct cxl_region *cxlr,
> +                            struct cxl_endpoint_decoder *cxled, int pos)
> +{
> +       struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(cxlr->dev.parent);
> +       struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> +       struct cxl_region_params *p = &cxlr->params;
> +       struct cxl_port *ep_port, *root_port;
> +       struct cxl_dport *dport;
> +       int rc = -ENXIO;
> +
> +       if (cxled->mode != cxlr->mode) {
> +               dev_dbg(&cxlr->dev, "%s region mode: %d mismatch: %d\n",
> +                       dev_name(&cxled->cxld.dev), cxlr->mode, cxled->mode);
> +               return -EINVAL;
> +       }
> +
> +       if (cxled->mode == CXL_DECODER_DEAD) {
> +               dev_dbg(&cxlr->dev, "%s dead\n", dev_name(&cxled->cxld.dev));
> +               return -ENODEV;
> +       }
> +
> +       /* all full of members, or interleave config not established? */
> +       if (p->state > CXL_CONFIG_INTERLEAVE_ACTIVE) {
> +               dev_dbg(&cxlr->dev, "region already active\n");
> +               return -EBUSY;
> +       } else if (p->state < CXL_CONFIG_INTERLEAVE_ACTIVE) {
> +               dev_dbg(&cxlr->dev, "interleave config missing\n");
> +               return -ENXIO;
> +       }
> +
>         ep_port = cxled_to_port(cxled);
>         root_port = cxlrd_to_port(cxlrd);
>         dport = cxl_find_dport_by_dev(root_port, ep_port->host_bridge);
> @@ -1284,13 +1327,6 @@ static int cxl_region_attach(struct cxl_region *cxlr,
>                 return -ENXIO;
>         }
>  
> -       if (cxlrd->calc_hb(cxlrd, pos) != dport) {
> -               dev_dbg(&cxlr->dev, "%s:%s invalid target position for %s\n",
> -                       dev_name(&cxlmd->dev), dev_name(&cxled->cxld.dev),
> -                       dev_name(&cxlrd->cxlsd.cxld.dev));
> -               return -ENXIO;
> -       }
> -
>         if (cxled->cxld.target_type != cxlr->type) {
>                 dev_dbg(&cxlr->dev, "%s:%s type mismatch: %d vs %d\n",
>                         dev_name(&cxlmd->dev), dev_name(&cxled->cxld.dev),
> @@ -1314,12 +1350,13 @@ static int cxl_region_attach(struct cxl_region *cxlr,
>                 return -EINVAL;
>         }
>  
> -       for (iter = ep_port; !is_cxl_root(iter);
> -            iter = to_cxl_port(iter->dev.parent)) {
> -               rc = cxl_port_attach_region(iter, cxlr, cxled, pos);
> -               if (rc)
> -                       goto err;
> -       }
> +       rc = cxl_region_validate_position(cxlr, cxled, pos);
> +       if (rc)
> +               return rc;
> +
> +       rc = cxl_region_attach_position(cxlr, cxlrd, cxled, dport, pos);
> +       if (rc)
> +               return rc;
>  
>         p->targets[pos] = cxled;
>         cxled->pos = pos;
> @@ -1343,10 +1380,6 @@ static int cxl_region_attach(struct cxl_region *cxlr,
>  
>  err_decrement:
>         p->nr_targets--;
> -err:
> -       for (iter = ep_port; !is_cxl_root(iter);
> -            iter = to_cxl_port(iter->dev.parent))
> -               cxl_port_detach_region(iter, cxlr, cxled);
>         return rc;
>  }
>  
>
Verma, Vishal L Feb. 9, 2023, 8:17 p.m. UTC | #38
On Sun, 2023-02-05 at 17:03 -0800, Dan Williams wrote:
> Add help text and a label so the CXL_REGION config option can be
> toggled. This is mainly to enable compile testing without region
> support.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/cxl/Kconfig |   12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> index 0ac53c422c31..163c094e67ae 100644
> --- a/drivers/cxl/Kconfig
> +++ b/drivers/cxl/Kconfig
> @@ -104,12 +104,22 @@ config CXL_SUSPEND
>         depends on SUSPEND && CXL_MEM
>  
>  config CXL_REGION
> -       bool
> +       bool "CXL: Region Support"
>         default CXL_BUS
>         # For MAX_PHYSMEM_BITS
>         depends on SPARSEMEM
>         select MEMREGION
>         select GET_FREE_REGION
> +       help
> +         Enable the CXL core to enumerate and provision CXL regions. A CXL
> +         region is defined by one or more CXL expanders that decode a given
> +         system-physical address range. For CXL regions established by
> +         platform-firmware this option enables memory error handling to

unnecessary-hyphenation?

But regardless,

Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>


> +         identify the devices participating in a given interleaved memory
> +         range. Otherwise, platform-firmware managed CXL is enabled by being
> +         placed in the system address map and does not need a driver.
> +
> +         If unsure say 'y'
>  
>  config CXL_REGION_INVALIDATION_TEST
>         bool "CXL: Region Cache Management Bypass (TEST)"
> 
>
Dan Williams Feb. 9, 2023, 8:52 p.m. UTC | #39
Jonathan Cameron wrote:
> On Wed, 8 Feb 2023 20:26:26 -0800
> Dan Williams <dan.j.williams@intel.com> wrote:
> 
> > Jonathan Cameron wrote:
> > [..]
> > > > @@ -1284,13 +1327,6 @@ static int cxl_region_attach(struct cxl_region *cxlr,
> > > >  		return -ENXIO;
> > > >  	}
> > > >  
> > > > -	if (cxlrd->calc_hb(cxlrd, pos) != dport) {
> > > > -		dev_dbg(&cxlr->dev, "%s:%s invalid target position for %s\n",
> > > > -			dev_name(&cxlmd->dev), dev_name(&cxled->cxld.dev),
> > > > -			dev_name(&cxlrd->cxlsd.cxld.dev));
> > > > -		return -ENXIO;
> > > > -	}
> > > > -  
> > > 
> > > In an ideal world, this would have been nice as two patches.
> > > One that reorders the various checks so that they are in the order
> > > after you have factored things out (easy to review for correctness)
> > > then one that factored it out.  
> > 
> > I played with this a bit and the only way I could see to make the diff
> > come out significantly nicer would be to use a forward declaration to
> > move the new helpers below cxl_region_attach().
> 
> Don't bother then!  Unless you've already done it.

No worries, abandoned.

> In the ideal world diff would magically present everything in the most
> human readable form :)  What are all these AI folk working on that we
> don't already have this!

+1 to this.
David Hildenbrand Feb. 13, 2023, 12:13 p.m. UTC | #40
On 06.02.23 02:02, Dan Williams wrote:
> Summary:
> --------
> 
> CXL RAM support allows for the dynamic provisioning of new CXL RAM
> regions, and more routinely, assembling a region from an existing
> configuration established by platform-firmware. The latter is motivated
> by CXL memory RAS (Reliability, Availability and Serviceability)
> support, that requires associating device events with System Physical
> Address ranges and vice versa.
> 
> The 'Soft Reserved' policy rework arranges for performance
> differentiated memory like CXL attached DRAM, or high-bandwidth memory,
> to be designated for 'System RAM' by default, rather than the device-dax
> dedicated access mode. That current device-dax default is confusing and
> surprising for the Pareto of users that do not expect memory to be
> quarantined for dedicated access by default. Most users expect all
> 'System RAM'-capable memory to show up in FREE(1).
> 
> 
> Details:
> --------
> 
> Recall that the Linux 'Soft Reserved' designation for memory is a
> reaction to platform-firmware, like EFI EDK2, delineating memory with
> the EFI Specific Purpose Memory attribute (EFI_MEMORY_SP). An
> alternative way to think of that attribute is that it specifies the
> *not* general-purpose memory pool. It is memory that may be too precious
> for general usage or not performant enough for some hot data structures.
> However, in the absence of explicit policy it should just be 'System
> RAM' by default.
> 
> Rather than require every distribution to ship a udev policy to assign
> dax devices to dax_kmem (the device-memory hotplug driver) just make
> that the kernel default. This is similar to the rationale in:
> 
> commit 8604d9e534a3 ("memory_hotplug: introduce CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE")
> 
> With this change the relatively niche use case of accessing this memory
> via mapping a device-dax instance can be achieved by building with
> CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=n, or specifying
> memhp_default_state=offline at boot, and then use:
> 
>      daxctl reconfigure-device $device -m devdax --force
> 
> ...to shift the corresponding address range to device-dax access.
> 
> The process of assembling a device-dax instance for a given CXL region
> device configuration is similar to the process of assembling a
> Device-Mapper or MDRAID storage-device array. Specifically, asynchronous
> probing by the PCI and driver core enumerates all CXL endpoints and
> their decoders. Then, once enough decoders have arrived to a describe a
> given region, that region is passed to the device-dax subsystem where it
> is subject to the above 'dax_kmem' policy. This assignment and policy
> choice is only possible if memory is set aside by the 'Soft Reserved'
> designation. Otherwise, CXL that is mapped as 'System RAM' becomes
> immutable by CXL driver mechanisms, but is still enumerated for RAS
> purposes.
> 
> This series is also available via:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/log/?h=for-6.3/cxl-ram-region
> 
> ...and has gone through some preview testing in various forms.
> 

My concern would be that in setups with a lot of CXL memory 
(soft-reserved), having that much offline memory during boot might make 
the kernel run out of memory. After all, offline memory consumes memory 
for the memmap.

Is the assumption that something like that cannot happen because we'll 
never ever have that much soft-reserved memory? :)

Note that this is a concern only applies when not using auto-onlining in 
the kernel during boot, which (IMHO) is or will be the default in the 
future.
Gregory Price Feb. 14, 2023, 6:27 p.m. UTC | #41
On Sun, Feb 05, 2023 at 05:02:29PM -0800, Dan Williams wrote:
> Summary:
> --------
> 
> CXL RAM support allows for the dynamic provisioning of new CXL RAM
> regions, and more routinely, assembling a region from an existing
> configuration established by platform-firmware. The latter is motivated
> by CXL memory RAS (Reliability, Availability and Serviceability)
> support, that requires associating device events with System Physical
> Address ranges and vice versa.
> 

Ok, I simplified down my tests and reverted a bunch of stuff, figured i
should report this before I dive further in.

Earlier i was carrying the DOE patches and others, I've dropped most of
that to make sure i could replicate on the base kernel and qemu images

QEMU branch: 
https://gitlab.com/jic23/qemu/-/tree/cxl-2023-01-26
this is a little out of date at this point i think? but it shouldn't
matter, the results are the same regardless of what else i pull in.

Kernel branch:
https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/log/?h=for-6.3/cxl-ram-region
I had been carrying a bunch of other hotfixes and work like the DOE
patches, but but i wanted to know if the base branch saw the same
behavior.

Config:
sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \
-drive file=/data/qemu/images/cxl.qcow2,format=qcow2,index=0,media=disk,id=hd \
-m 4G,slots=4,maxmem=8G \
-smp 4 \
-machine type=q35,accel=kvm,cxl=on \
-enable-kvm \
-nographic \
-netdev bridge,id=hn0,br=virbr0 \
-device virtio-net-pci,netdev=hn0,id=nic1 \
-device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
-device cxl-rp,id=rp0,bus=cxl.0,chassis=0,port=0,slot=0 \
-object memory-backend-ram,id=mem0,size=4G \
-device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0 \
-M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G


boot is fine
# ./ndctl/build/cxl/cxl create-region -m -t ram -d decoder0.0 \
  -w 1 -g 4096 mem0

[   28.183276] cxl_region region0: Bypassing cpu_cache_invalidate_memregion() for testing!
{
  "region":"region0",
  "resource":"0x390000000",
  "size":"4.00 GiB (4.29 GB)",
  "type":"ram",
  "interleave_ways":1,
  "interleave_granularity":4096,
  "decode_state":"commit",
  "mappings":[
    {
      "position":0,
      "memdev":"mem0",
      "decoder":"decoder2.0"
    }
  ]
}
cxl region: cmd_create_region: created 1 region
[   28.247144] Built 1 zonelists, mobility grouping on.  Total pages: 979754
[   28.247844] Policy zone: Normal
[   28.258449] Fallback order for Node 0: 0 1
[   28.258945] Fallback order for Node 1: 1 0
[   28.259422] Built 2 zonelists, mobility grouping on.  Total pages: 1012522
[   28.260087] Policy zone: Normal


top shows the memory has been onlined
MiB Mem :   8022.1 total,   7631.6 free,    200.9 used,    189.7 buff/cache
MiB Swap:   3926.0 total,   3926.0 free,      0.0 used.   7567.8 avail Mem


Lets attempt to use the memory
[root@fedora ~]# numactl --membind=1 python
KVM internal error. Suberror: 3
extra data[0]: 0x0000000080000b0e
extra data[1]: 0x0000000000000031
extra data[2]: 0x0000000000000d81
extra data[3]: 0x0000000390074ac0
extra data[4]: 0x0000000000000010
RAX=0000000080000000 RBX=0000000000000000 RCX=0000000000000000 RDX=0000000000000001
RSI=0000000000000000 RDI=0000000390074000 RBP=ffffac1c4067bca0 RSP=ffffac1c4067bc88
R8 =0000000000000000 R9 =0000000000000001 R10=0000000000000000 R11=0000000000000000
R12=0000000000000000 R13=ffff99eed0074000 R14=0000000000000000 R15=0000000000000000
RIP=ffffffff812b3d62 RFL=00010006 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 0000000000000000 ffffffff 00c00000
CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =0000 0000000000000000 ffffffff 00c00000
FS =0000 0000000000000000 ffffffff 00c00000
GS =0000 ffff99ec3bc00000 ffffffff 00c00000
LDT=0000 0000000000000000 ffffffff 00c00000
TR =0040 fffffe1d13135000 00004087 00008b00 DPL=0 TSS64-busy
GDT=     fffffe1d13133000 0000007f
IDT=     fffffe0000000000 00000fff
CR0=80050033 CR2=ffffffff812b3d62 CR3=0000000390074000 CR4=000006f0
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000fffe0ff0 DR7=0000000000000400
EFER=0000000000000d01
Code=5d 9c 01 0f b7 db 48 09 df 48 0f ba ef 3f 0f 22 df 0f 1f 00 <5b> 41 5c 41 5d 5d c3 cc cc cc cc 48 c7 c0 00 00 00 80 48 2b 05 cd 0d 76 01



I also tested lowering the ram sizes (2GB ram, 1GB "CXL") to see if
there's something going on with the PCI hole or something, but no, same
results.

Double checked if there was an issue using a single root port so i
registered a second one - same results.


In prior tests i accessed the memory directly via devmem2

This still works when mapping the memory manually

[root@fedora map] ./map_memory.sh
echo ram > /sys/bus/cxl/devices/decoder2.0/mode
echo 0x40000000 > /sys/bus/cxl/devices/decoder2.0/dpa_size
echo region0 > /sys/bus/cxl/devices/decoder0.0/create_ram_region
echo 4096 > /sys/bus/cxl/devices/region0/interleave_granularity
echo 1 > /sys/bus/cxl/devices/region0/interleave_ways
echo 0x40000000 > /sys/bus/cxl/devices/region0/size
echo decoder2.0 > /sys/bus/cxl/devices/region0/target0
echo 1 > /sys/bus/cxl/devices/region0/commit


[root@fedora devmem]# ./devmem2 0x290000000 w 0x12345678
/dev/mem opened.
Memory mapped at address 0x7fb4d4ed3000.
Value at address 0x290000000 (0x7fb4d4ed3000): 0x0
Written 0x12345678; readback 0x12345678



This kind of implies there's a disagreement about the state of memory
between linux and qemu.


but even just onlining a region produces memory usage:

[root@fedora ~]# cat /sys/bus/node/devices/node1/meminfo
Node 1 MemTotal:        1048576 kB
Node 1 MemFree:         1048112 kB
Node 1 MemUsed:             464 kB


Which I would expect to set off some fireworks.

Maybe an issue at the NUMA level? I just... i have no idea.


I will need to dig through the email chains to figure out what others
have been doing that i'm missing.  Everything *looks* nominal, but the
reactors are exploding so... ¯\_(ツ)_/¯

I'm not sure where to start here, but i'll bash my face on the keyboard
for a bit until i have some ideas.


~Gregory
Dan Williams Feb. 14, 2023, 6:39 p.m. UTC | #42
Gregory Price wrote:
> On Sun, Feb 05, 2023 at 05:02:29PM -0800, Dan Williams wrote:
> > Summary:
> > --------
> > 
> > CXL RAM support allows for the dynamic provisioning of new CXL RAM
> > regions, and more routinely, assembling a region from an existing
> > configuration established by platform-firmware. The latter is motivated
> > by CXL memory RAS (Reliability, Availability and Serviceability)
> > support, that requires associating device events with System Physical
> > Address ranges and vice versa.
> > 
> 
> Ok, I simplified down my tests and reverted a bunch of stuff, figured i
> should report this before I dive further in.
> 
> Earlier i was carrying the DOE patches and others, I've dropped most of
> that to make sure i could replicate on the base kernel and qemu images
> 
> QEMU branch: 
> https://gitlab.com/jic23/qemu/-/tree/cxl-2023-01-26
> this is a little out of date at this point i think? but it shouldn't
> matter, the results are the same regardless of what else i pull in.
> 
> Kernel branch:
> https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/log/?h=for-6.3/cxl-ram-region

Note that I acted on this feedback from Greg to break out a fix and
merge it for v6.2-final

http://lore.kernel.org/r/Y+CSOeHVLKudN0A6@kroah.com

...i.e. you are missing at least the passthrough decoder fix, but that
would show up as a region creation failure not a QEMU crash.

So I would move to testing cxl/next.

[..]
> Lets attempt to use the memory
> [root@fedora ~]# numactl --membind=1 python
> KVM internal error. Suberror: 3
> extra data[0]: 0x0000000080000b0e
> extra data[1]: 0x0000000000000031
> extra data[2]: 0x0000000000000d81
> extra data[3]: 0x0000000390074ac0
> extra data[4]: 0x0000000000000010
> RAX=0000000080000000 RBX=0000000000000000 RCX=0000000000000000 RDX=0000000000000001
> RSI=0000000000000000 RDI=0000000390074000 RBP=ffffac1c4067bca0 RSP=ffffac1c4067bc88
> R8 =0000000000000000 R9 =0000000000000001 R10=0000000000000000 R11=0000000000000000
> R12=0000000000000000 R13=ffff99eed0074000 R14=0000000000000000 R15=0000000000000000
> RIP=ffffffff812b3d62 RFL=00010006 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
> ES =0000 0000000000000000 ffffffff 00c00000
> CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
> SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> DS =0000 0000000000000000 ffffffff 00c00000
> FS =0000 0000000000000000 ffffffff 00c00000
> GS =0000 ffff99ec3bc00000 ffffffff 00c00000
> LDT=0000 0000000000000000 ffffffff 00c00000
> TR =0040 fffffe1d13135000 00004087 00008b00 DPL=0 TSS64-busy
> GDT=     fffffe1d13133000 0000007f
> IDT=     fffffe0000000000 00000fff
> CR0=80050033 CR2=ffffffff812b3d62 CR3=0000000390074000 CR4=000006f0
> DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
> DR6=00000000fffe0ff0 DR7=0000000000000400
> EFER=0000000000000d01
> Code=5d 9c 01 0f b7 db 48 09 df 48 0f ba ef 3f 0f 22 df 0f 1f 00 <5b> 41 5c 41 5d 5d c3 cc cc cc cc 48 c7 c0 00 00 00 80 48 2b 05 cd 0d 76 01
> 

At first glance that looks like a QEMU issue, but I would capture a cxl
list -vvv before attempting to use the memory just to verify the decoder
setup looks sane.

> 
> 
> I also tested lowering the ram sizes (2GB ram, 1GB "CXL") to see if
> there's something going on with the PCI hole or something, but no, same
> results.
> 
> Double checked if there was an issue using a single root port so i
> registered a second one - same results.
> 
> 
> In prior tests i accessed the memory directly via devmem2
> 
> This still works when mapping the memory manually
> 
> [root@fedora map] ./map_memory.sh
> echo ram > /sys/bus/cxl/devices/decoder2.0/mode
> echo 0x40000000 > /sys/bus/cxl/devices/decoder2.0/dpa_size
> echo region0 > /sys/bus/cxl/devices/decoder0.0/create_ram_region
> echo 4096 > /sys/bus/cxl/devices/region0/interleave_granularity
> echo 1 > /sys/bus/cxl/devices/region0/interleave_ways
> echo 0x40000000 > /sys/bus/cxl/devices/region0/size
> echo decoder2.0 > /sys/bus/cxl/devices/region0/target0
> echo 1 > /sys/bus/cxl/devices/region0/commit
> 
> 
> [root@fedora devmem]# ./devmem2 0x290000000 w 0x12345678
> /dev/mem opened.
> Memory mapped at address 0x7fb4d4ed3000.
> Value at address 0x290000000 (0x7fb4d4ed3000): 0x0
> Written 0x12345678; readback 0x12345678

Likely it is sensitive to crossing an interleave threshold.

> This kind of implies there's a disagreement about the state of memory
> between linux and qemu.
> 
> 
> but even just onlining a region produces memory usage:
> 
> [root@fedora ~]# cat /sys/bus/node/devices/node1/meminfo
> Node 1 MemTotal:        1048576 kB
> Node 1 MemFree:         1048112 kB
> Node 1 MemUsed:             464 kB
> 
> 
> Which I would expect to set off some fireworks.
> 
> Maybe an issue at the NUMA level? I just... i have no idea.
> 
> 
> I will need to dig through the email chains to figure out what others
> have been doing that i'm missing.  Everything *looks* nominal, but the
> reactors are exploding so... ¯\_(ツ)_/¯
> 
> I'm not sure where to start here, but i'll bash my face on the keyboard
> for a bit until i have some ideas.

Not ruling out the driver yet, but Fan's tests with hardware has me
leaning more towards QEMU.
Dan Williams Feb. 14, 2023, 6:45 p.m. UTC | #43
David Hildenbrand wrote:
[..]
> My concern would be that in setups with a lot of CXL memory 
> (soft-reserved), having that much offline memory during boot might make 
> the kernel run out of memory. After all, offline memory consumes memory 
> for the memmap.
> 
> Is the assumption that something like that cannot happen because we'll 
> never ever have that much soft-reserved memory? :)

Right, the assumption is that platform firmware is going to pick a
reasonable general purpose pool that lets an OS boot. In the near term
it is difficult to get that wrong since it will encompass all locally
attached DDR-DRAM. In the longer term, where it becomes possible that
all memory is CXL attached, then platform firmware will need to mark
some or all of CXL memory as general purpose (not Soft Reserved).

> Note that this is a concern only applies when not using auto-onlining in 
> the kernel during boot, which (IMHO) is or will be the default in the 
> future.

Yes, I expect that to be the default as well and let those small few
that have custom requirements figure out how to override that default.
Gregory Price Feb. 14, 2023, 7:01 p.m. UTC | #44
On Tue, Feb 14, 2023 at 10:39:42AM -0800, Dan Williams wrote:
> Gregory Price wrote:
> > On Sun, Feb 05, 2023 at 05:02:29PM -0800, Dan Williams wrote:
> > > Summary:
> > > --------
> > > 
> > > CXL RAM support allows for the dynamic provisioning of new CXL RAM
> > > regions, and more routinely, assembling a region from an existing
> > > configuration established by platform-firmware. The latter is motivated
> > > by CXL memory RAS (Reliability, Availability and Serviceability)
> > > support, that requires associating device events with System Physical
> > > Address ranges and vice versa.
> > > 
> > 
> > Ok, I simplified down my tests and reverted a bunch of stuff, figured i
> > should report this before I dive further in.
> > 
> > Earlier i was carrying the DOE patches and others, I've dropped most of
> > that to make sure i could replicate on the base kernel and qemu images
> > 
> > QEMU branch: 
> > https://gitlab.com/jic23/qemu/-/tree/cxl-2023-01-26
> > this is a little out of date at this point i think? but it shouldn't
> > matter, the results are the same regardless of what else i pull in.
> > 
> > Kernel branch:
> > https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/log/?h=for-6.3/cxl-ram-region
> 
> Note that I acted on this feedback from Greg to break out a fix and
> merge it for v6.2-final
> 
> http://lore.kernel.org/r/Y+CSOeHVLKudN0A6@kroah.com
> 
> ...i.e. you are missing at least the passthrough decoder fix, but that
> would show up as a region creation failure not a QEMU crash.
> 
> So I would move to testing cxl/next.
> 

I just noticed this, already spinning a new kernel.  Will report back

> Not ruling out the driver yet, but Fan's tests with hardware has me
> leaning more towards QEMU.

Same, not much has changed and I haven't tested with hardware yet. Was
planning to install it on our local boxes sometime later this week.

Was just so close to setting up a virtual memory pool in the lab, was
getting antsy :]
Jonathan Cameron Feb. 14, 2023, 9:18 p.m. UTC | #45
On Tue, 14 Feb 2023 14:01:23 -0500
Gregory Price <gregory.price@memverge.com> wrote:

> On Tue, Feb 14, 2023 at 10:39:42AM -0800, Dan Williams wrote:
> > Gregory Price wrote:  
> > > On Sun, Feb 05, 2023 at 05:02:29PM -0800, Dan Williams wrote:  
> > > > Summary:
> > > > --------
> > > > 
> > > > CXL RAM support allows for the dynamic provisioning of new CXL RAM
> > > > regions, and more routinely, assembling a region from an existing
> > > > configuration established by platform-firmware. The latter is motivated
> > > > by CXL memory RAS (Reliability, Availability and Serviceability)
> > > > support, that requires associating device events with System Physical
> > > > Address ranges and vice versa.
> > > >   
> > > 
> > > Ok, I simplified down my tests and reverted a bunch of stuff, figured i
> > > should report this before I dive further in.
> > > 
> > > Earlier i was carrying the DOE patches and others, I've dropped most of
> > > that to make sure i could replicate on the base kernel and qemu images
> > > 
> > > QEMU branch: 
> > > https://gitlab.com/jic23/qemu/-/tree/cxl-2023-01-26
> > > this is a little out of date at this point i think? but it shouldn't
> > > matter, the results are the same regardless of what else i pull in.
> > > 
> > > Kernel branch:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/log/?h=for-6.3/cxl-ram-region  
> > 
> > Note that I acted on this feedback from Greg to break out a fix and
> > merge it for v6.2-final
> > 
> > http://lore.kernel.org/r/Y+CSOeHVLKudN0A6@kroah.com
> > 
> > ...i.e. you are missing at least the passthrough decoder fix, but that
> > would show up as a region creation failure not a QEMU crash.
> > 
> > So I would move to testing cxl/next.
> >   
> 
> I just noticed this, already spinning a new kernel.  Will report back
> 
> > Not ruling out the driver yet, but Fan's tests with hardware has me
> > leaning more towards QEMU.  
> 
> Same, not much has changed and I haven't tested with hardware yet. Was
> planning to install it on our local boxes sometime later this week.
> 
> Was just so close to setting up a virtual memory pool in the lab, was
> getting antsy :]

Could you test it with TCG (just drop --enable-kvm)?  We have a known
limitation with x86 instructions running out of CXL emulated memory
(side effect of emulating the interleave).  You'll need a fix even on TCG
for the corner case of an instruction bridging from normal ram to cxl memory.
https://lore.kernel.org/qemu-devel/20230206193809.1153124-1-richard.henderson@linaro.org/

Performance will be bad, but so far this is only way we can do it correctly.

Jonathan
Gregory Price Feb. 14, 2023, 9:51 p.m. UTC | #46
On Tue, Feb 14, 2023 at 09:18:24PM +0000, Jonathan Cameron wrote:
> On Tue, 14 Feb 2023 14:01:23 -0500
> Gregory Price <gregory.price@memverge.com> wrote:
> 
> > On Tue, Feb 14, 2023 at 10:39:42AM -0800, Dan Williams wrote:
> > > Gregory Price wrote:  
> > > > On Sun, Feb 05, 2023 at 05:02:29PM -0800, Dan Williams wrote:  
> > > > > Summary:
> > > > > --------
> > > > > 
> > > > > CXL RAM support allows for the dynamic provisioning of new CXL RAM
> > > > > regions, and more routinely, assembling a region from an existing
> > > > > configuration established by platform-firmware. The latter is motivated
> > > > > by CXL memory RAS (Reliability, Availability and Serviceability)
> > > > > support, that requires associating device events with System Physical
> > > > > Address ranges and vice versa.
> > > > >   
> > > > 
> > > > Ok, I simplified down my tests and reverted a bunch of stuff, figured i
> > > > should report this before I dive further in.
> > > > 
> > > > Earlier i was carrying the DOE patches and others, I've dropped most of
> > > > that to make sure i could replicate on the base kernel and qemu images
> > > > 
> > > > QEMU branch: 
> > > > https://gitlab.com/jic23/qemu/-/tree/cxl-2023-01-26
> > > > this is a little out of date at this point i think? but it shouldn't
> > > > matter, the results are the same regardless of what else i pull in.
> > > > 
> > > > Kernel branch:
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/log/?h=for-6.3/cxl-ram-region  
> > > 
> > > Note that I acted on this feedback from Greg to break out a fix and
> > > merge it for v6.2-final
> > > 
> > > http://lore.kernel.org/r/Y+CSOeHVLKudN0A6@kroah.com
> > > 
> > > ...i.e. you are missing at least the passthrough decoder fix, but that
> > > would show up as a region creation failure not a QEMU crash.
> > > 
> > > So I would move to testing cxl/next.
> > >   
> > 
> > I just noticed this, already spinning a new kernel.  Will report back
> > 
> > > Not ruling out the driver yet, but Fan's tests with hardware has me
> > > leaning more towards QEMU.  
> > 
> > Same, not much has changed and I haven't tested with hardware yet. Was
> > planning to install it on our local boxes sometime later this week.
> > 
> > Was just so close to setting up a virtual memory pool in the lab, was
> > getting antsy :]
> 
> Could you test it with TCG (just drop --enable-kvm)?  We have a known
> limitation with x86 instructions running out of CXL emulated memory
> (side effect of emulating the interleave).  You'll need a fix even on TCG
> for the corner case of an instruction bridging from normal ram to cxl memory.
> https://lore.kernel.org/qemu-devel/20230206193809.1153124-1-richard.henderson@linaro.org/
> 
> Performance will be bad, but so far this is only way we can do it correctly.
> 
> Jonathan
> 

Siiiggghh... i had this patch and dropped --enable-kvm, but forgot to
drop "accel=kvm" from the -machine line

This was the issue.

And let me tell you, if you numactl --membind=1 python, it is
IMPRESSIVELY slow.  I wonder if it's even hitting a few 100k
instructions a second.


This appears to be the issue.  When I get a bit more time, try to dive
into the deep dark depths of qemu memory regions to see how difficult
a non-mmio fork might be, unless someone else is already looking at it.

~Gregory
Gregory Price Feb. 14, 2023, 9:54 p.m. UTC | #47
On Tue, Feb 14, 2023 at 04:51:53PM -0500, Gregory Price wrote:
> On Tue, Feb 14, 2023 at 09:18:24PM +0000, Jonathan Cameron wrote:
> > On Tue, 14 Feb 2023 14:01:23 -0500
> > Gregory Price <gregory.price@memverge.com> wrote:
> > 
> > Could you test it with TCG (just drop --enable-kvm)?  We have a known
> > limitation with x86 instructions running out of CXL emulated memory
> > (side effect of emulating the interleave).  You'll need a fix even on TCG
> > for the corner case of an instruction bridging from normal ram to cxl memory.
> > https://lore.kernel.org/qemu-devel/20230206193809.1153124-1-richard.henderson@linaro.org/
> > 
> > Performance will be bad, but so far this is only way we can do it correctly.
> > 
> > Jonathan
> > 
> 
> Siiiggghh... i had this patch and dropped --enable-kvm, but forgot to
> drop "accel=kvm" from the -machine line
> 
> This was the issue.
> 
> And let me tell you, if you numactl --membind=1 python, it is
> IMPRESSIVELY slow.  I wonder if it's even hitting a few 100k
> instructions a second.
> 
> 
> This appears to be the issue.  When I get a bit more time, try to dive
> into the deep dark depths of qemu memory regions to see how difficult
> a non-mmio fork might be, unless someone else is already looking at it.
> 
> ~Gregory

Just clarifying one thing:  Even with the patch, KVM blows up.
Disabling KVM fixes this entirely.  I haven't tested without KVM but
with the patch, i will do that now.
Jonathan Cameron Feb. 15, 2023, 10:03 a.m. UTC | #48
On Tue, 14 Feb 2023 16:54:02 -0500
Gregory Price <gregory.price@memverge.com> wrote:

> On Tue, Feb 14, 2023 at 04:51:53PM -0500, Gregory Price wrote:
> > On Tue, Feb 14, 2023 at 09:18:24PM +0000, Jonathan Cameron wrote:  
> > > On Tue, 14 Feb 2023 14:01:23 -0500
> > > Gregory Price <gregory.price@memverge.com> wrote:
> > > 
> > > Could you test it with TCG (just drop --enable-kvm)?  We have a known
> > > limitation with x86 instructions running out of CXL emulated memory
> > > (side effect of emulating the interleave).  You'll need a fix even on TCG
> > > for the corner case of an instruction bridging from normal ram to cxl memory.
> > > https://lore.kernel.org/qemu-devel/20230206193809.1153124-1-richard.henderson@linaro.org/
> > > 
> > > Performance will be bad, but so far this is only way we can do it correctly.
> > > 
> > > Jonathan
> > >   
> > 
> > Siiiggghh... i had this patch and dropped --enable-kvm, but forgot to
> > drop "accel=kvm" from the -machine line
> > 
> > This was the issue.
> > 
> > And let me tell you, if you numactl --membind=1 python, it is
> > IMPRESSIVELY slow.  I wonder if it's even hitting a few 100k
> > instructions a second.
> > 
> > 
> > This appears to be the issue.  When I get a bit more time, try to dive
> > into the deep dark depths of qemu memory regions to see how difficult
> > a non-mmio fork might be, unless someone else is already looking at it.
> > 
> > ~Gregory  
> 
> Just clarifying one thing:  Even with the patch, KVM blows up.
> Disabling KVM fixes this entirely.  I haven't tested without KVM but
> with the patch, i will do that now.

yup.  The patch only fixes TCG so that's expected behavior.

Fingers crossed on this 'working'.

I'm open to suggestions on how to work around the problem with KVM
or indeed allow TCG to cache the instructions (right not it has
to fetch and emulate each instruction on it's own).

I can envision how we might do it for KVM with userspace page fault handling
used to get a fault up to QEMU which can then stitch in a cache
of the underlying memory as a stage 2 translation to the page (a little
bit like how post migration copy works) though I've not prototyped
anything...

I think it would be complex code that would be little used
so we may just have to cope with the emulation being slow.

Intent is very much to be able to test the kernel code etc, not
test it quickly :)

Jonathan
Gregory Price Feb. 18, 2023, 9:47 a.m. UTC | #49
On Wed, Feb 15, 2023 at 10:03:27AM +0000, Jonathan Cameron wrote:
> On Tue, 14 Feb 2023 16:54:02 -0500
> Gregory Price <gregory.price@memverge.com> wrote:
> 
> > Just clarifying one thing:  Even with the patch, KVM blows up.
> > Disabling KVM fixes this entirely.  I haven't tested without KVM but
> > with the patch, i will do that now.
> 
> yup.  The patch only fixes TCG so that's expected behavior.
> 
> Fingers crossed on this 'working'.
> 
> I'm open to suggestions on how to work around the problem with KVM
> or indeed allow TCG to cache the instructions (right not it has
> to fetch and emulate each instruction on it's own).
> 
> I can envision how we might do it for KVM with userspace page fault handling
> used to get a fault up to QEMU which can then stitch in a cache
> of the underlying memory as a stage 2 translation to the page (a little
> bit like how post migration copy works) though I've not prototyped
> anything...
> 

Just following up.  With the patch applied and KVM turned off, no crash.
I've been working with this for a while.

We should move the instruction alignment issue into a separate
discussion thread.