Message ID: 20250220161320.518450-3-jean-philippe@linaro.org
State: New
Series: arm: Add DMA remapping for CCA
On 20.02.25 17:13, Jean-Philippe Brucker wrote: > For Arm CCA we'd like the guest_memfd discard notifier to call the IOMMU > notifiers and create e.g. VFIO mappings. The default VFIO discard > notifier isn't sufficient for CCA because the DMA addresses need a > translation (even without vIOMMU). > > At the moment: > * guest_memfd_state_change() calls the populate() notifier > * the populate notifier() calls IOMMU notifiers > * the IOMMU notifier handler calls memory_get_xlat_addr() to get a VA > * it calls ram_discard_manager_is_populated() which fails. > > guest_memfd_state_change() only changes the section's state after > calling the populate() notifier. We can't easily invert the order of > operation because it uses the old state bitmap to know which pages need > the populate() notifier. I assume we talk about this code: [1] [1] https://lkml.kernel.org/r/20250217081833.21568-1-chenyi.qiang@intel.com +static int memory_attribute_state_change(MemoryAttributeManager *mgr, uint64_t offset, + uint64_t size, bool shared_to_private) +{ + int block_size = memory_attribute_manager_get_block_size(mgr); + int ret = 0; + + if (!memory_attribute_is_valid_range(mgr, offset, size)) { + error_report("%s, invalid range: offset 0x%lx, size 0x%lx", + __func__, offset, size); + return -1; + } + + if ((shared_to_private && memory_attribute_is_range_discarded(mgr, offset, size)) || + (!shared_to_private && memory_attribute_is_range_populated(mgr, offset, size))) { + return 0; + } + + if (shared_to_private) { + memory_attribute_notify_discard(mgr, offset, size); + } else { + ret = memory_attribute_notify_populate(mgr, offset, size); + } + + if (!ret) { + unsigned long first_bit = offset / block_size; + unsigned long nbits = size / block_size; + + g_assert((first_bit + nbits) <= mgr->bitmap_size); + + if (shared_to_private) { + bitmap_clear(mgr->shared_bitmap, first_bit, nbits); + } else { + bitmap_set(mgr->shared_bitmap, first_bit, nbits); + } + + return 0; + } + + return ret; +} Then, in memory_attribute_notify_populate(), we walk the bitmap again. Why? We just checked that it's all in the expected state, no? virtio-mem doesn't handle it that way, so I'm curious why we would have to do it here? > > For now add a flag to the IOMMU notifier to tell memory_get_xlat_addr() > that we're aware of the RAM discard manager state. > > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> > --- > > Definitely not the prettiest hack, any idea how to do this cleanly? > --- > include/exec/memory.h | 5 +++++ > system/memory.c | 3 ++- > 2 files changed, 7 insertions(+), 1 deletion(-) > > diff --git a/include/exec/memory.h b/include/exec/memory.h > index 9f73b59867..6fcd98fe58 100644 > --- a/include/exec/memory.h > +++ b/include/exec/memory.h > @@ -116,6 +116,11 @@ typedef enum { > IOMMU_RO = 1, > IOMMU_WO = 2, > IOMMU_RW = 3, > + /* > + * Allow mapping a discarded page, because we're in the process of > + * populating it. > + */ > + IOMMU_POPULATING = 4, > } IOMMUAccessFlags; > > #define IOMMU_ACCESS_FLAG(r, w) (((r) ? IOMMU_RO : 0) | ((w) ? IOMMU_WO : 0)) > diff --git a/system/memory.c b/system/memory.c > index 4c829793a0..8e884f9c15 100644 > --- a/system/memory.c > +++ b/system/memory.c > @@ -2221,7 +2221,8 @@ bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr, > * Disallow that. vmstate priorities make sure any RamDiscardManager > * were already restored before IOMMUs are restored. 
> */ > - if (!ram_discard_manager_is_populated(rdm, &tmp)) { > + if (!(iotlb->perm & IOMMU_POPULATING) && > + !ram_discard_manager_is_populated(rdm, &tmp)) { > error_setg(errp, "iommu map to discarded memory (e.g., unplugged" > " via virtio-mem): %" HWADDR_PRIx "", > iotlb->translated_addr);
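For illustration, here is a minimal sketch of how the IOMMU_POPULATING flag from the patch above could be consumed: a guest_memfd populate notifier that replays a DMA mapping through the IOMMU notifiers with the flag set, so the VFIO handler's call to memory_get_xlat_addr() is not tripped up by the not-yet-updated RamDiscardManager state. The CCADmaNotifier wrapper and cca_dma_address() translation are names made up for this sketch; only IOMMU_POPULATING, the RamDiscardListener callback shape and memory_region_notify_iommu() come from the patch and existing QEMU code.

/* Illustrative sketch only, not part of the posted series. */
typedef struct CCADmaNotifier {
    RamDiscardListener rdl;
    IOMMUMemoryRegion *iommu_mr;   /* region whose notifiers we replay into */
} CCADmaNotifier;

static int cca_dma_notify_populate(RamDiscardListener *rdl,
                                   MemoryRegionSection *section)
{
    CCADmaNotifier *n = container_of(rdl, CCADmaNotifier, rdl);
    hwaddr gpa = section->offset_within_address_space;
    IOMMUTLBEvent event = {
        .type = IOMMU_NOTIFIER_MAP,
        .entry = {
            .iova = cca_dma_address(gpa),   /* assumed CCA DMA translation */
            .translated_addr = gpa,
            .addr_mask = int128_get64(section->size) - 1,
            /* We are populating this range right now, skip the RDM check. */
            .perm = IOMMU_RW | IOMMU_POPULATING,
        },
    };

    memory_region_notify_iommu(n->iommu_mr, 0, event);
    return 0;
}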
On 2/21/2025 3:39 AM, David Hildenbrand wrote: > On 20.02.25 17:13, Jean-Philippe Brucker wrote: >> For Arm CCA we'd like the guest_memfd discard notifier to call the IOMMU >> notifiers and create e.g. VFIO mappings. The default VFIO discard >> notifier isn't sufficient for CCA because the DMA addresses need a >> translation (even without vIOMMU). >> >> At the moment: >> * guest_memfd_state_change() calls the populate() notifier >> * the populate notifier() calls IOMMU notifiers >> * the IOMMU notifier handler calls memory_get_xlat_addr() to get a VA >> * it calls ram_discard_manager_is_populated() which fails. >> >> guest_memfd_state_change() only changes the section's state after >> calling the populate() notifier. We can't easily invert the order of >> operation because it uses the old state bitmap to know which pages need >> the populate() notifier. > > I assume we talk about this code: [1] > > [1] https://lkml.kernel.org/r/20250217081833.21568-1-chenyi.qiang@intel.com > > > +static int memory_attribute_state_change(MemoryAttributeManager *mgr, > uint64_t offset, > + uint64_t size, bool > shared_to_private) > +{ > + int block_size = memory_attribute_manager_get_block_size(mgr); > + int ret = 0; > + > + if (!memory_attribute_is_valid_range(mgr, offset, size)) { > + error_report("%s, invalid range: offset 0x%lx, size 0x%lx", > + __func__, offset, size); > + return -1; > + } > + > + if ((shared_to_private && memory_attribute_is_range_discarded(mgr, > offset, size)) || > + (!shared_to_private && memory_attribute_is_range_populated(mgr, > offset, size))) { > + return 0; > + } > + > + if (shared_to_private) { > + memory_attribute_notify_discard(mgr, offset, size); > + } else { > + ret = memory_attribute_notify_populate(mgr, offset, size); > + } > + > + if (!ret) { > + unsigned long first_bit = offset / block_size; > + unsigned long nbits = size / block_size; > + > + g_assert((first_bit + nbits) <= mgr->bitmap_size); > + > + if (shared_to_private) { > + bitmap_clear(mgr->shared_bitmap, first_bit, nbits); > + } else { > + bitmap_set(mgr->shared_bitmap, first_bit, nbits); > + } > + > + return 0; > + } > + > + return ret; > +} > > Then, in memory_attribute_notify_populate(), we walk the bitmap again. > > Why? > > We just checked that it's all in the expected state, no? > > > virtio-mem doesn't handle it that way, so I'm curious why we would have > to do it here? I was concerned about the case where the guest issues a request that only partial of the range is in the desired state. I think the main problem is the policy for the guest conversion request. My current handling is: 1. When a conversion request is made for a range already in the desired state, the helper simply returns success. 2. For requests involving a range partially in the desired state, only the necessary segments are converted, ensuring the entire range complies with the request efficiently. 3. In scenarios where a conversion request is declined by other systems, such as a failure from VFIO during notify_populate(), the helper will roll back the request, maintaining consistency. And the policy of virtio-mem is to refuse the state change if not all blocks are in the opposite state. Actually, this part is still a uncertain to me. BTW, per the status/bitmap track, the virtio-mem also changes the bitmap after the plug/unplug notifier. This is the same, correct? > > >> >> For now add a flag to the IOMMU notifier to tell memory_get_xlat_addr() >> that we're aware of the RAM discard manager state. 
>> >> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> >> --- >> >> Definitely not the prettiest hack, any idea how to do this cleanly? >> --- >> include/exec/memory.h | 5 +++++ >> system/memory.c | 3 ++- >> 2 files changed, 7 insertions(+), 1 deletion(-) >> >> diff --git a/include/exec/memory.h b/include/exec/memory.h >> index 9f73b59867..6fcd98fe58 100644 >> --- a/include/exec/memory.h >> +++ b/include/exec/memory.h >> @@ -116,6 +116,11 @@ typedef enum { >> IOMMU_RO = 1, >> IOMMU_WO = 2, >> IOMMU_RW = 3, >> + /* >> + * Allow mapping a discarded page, because we're in the process of >> + * populating it. >> + */ >> + IOMMU_POPULATING = 4, >> } IOMMUAccessFlags; >> #define IOMMU_ACCESS_FLAG(r, w) (((r) ? IOMMU_RO : 0) | ((w) ? >> IOMMU_WO : 0)) >> diff --git a/system/memory.c b/system/memory.c >> index 4c829793a0..8e884f9c15 100644 >> --- a/system/memory.c >> +++ b/system/memory.c >> @@ -2221,7 +2221,8 @@ bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, >> void **vaddr, >> * Disallow that. vmstate priorities make sure any >> RamDiscardManager >> * were already restored before IOMMUs are restored. >> */ >> - if (!ram_discard_manager_is_populated(rdm, &tmp)) { >> + if (!(iotlb->perm & IOMMU_POPULATING) && >> + !ram_discard_manager_is_populated(rdm, &tmp)) { >> error_setg(errp, "iommu map to discarded memory (e.g., >> unplugged" >> " via virtio-mem): %" HWADDR_PRIx "", >> iotlb->translated_addr); > >
On 21.02.25 03:25, Chenyi Qiang wrote: > > > On 2/21/2025 3:39 AM, David Hildenbrand wrote: >> On 20.02.25 17:13, Jean-Philippe Brucker wrote: >>> For Arm CCA we'd like the guest_memfd discard notifier to call the IOMMU >>> notifiers and create e.g. VFIO mappings. The default VFIO discard >>> notifier isn't sufficient for CCA because the DMA addresses need a >>> translation (even without vIOMMU). >>> >>> At the moment: >>> * guest_memfd_state_change() calls the populate() notifier >>> * the populate notifier() calls IOMMU notifiers >>> * the IOMMU notifier handler calls memory_get_xlat_addr() to get a VA >>> * it calls ram_discard_manager_is_populated() which fails. >>> >>> guest_memfd_state_change() only changes the section's state after >>> calling the populate() notifier. We can't easily invert the order of >>> operation because it uses the old state bitmap to know which pages need >>> the populate() notifier. >> >> I assume we talk about this code: [1] >> >> [1] https://lkml.kernel.org/r/20250217081833.21568-1-chenyi.qiang@intel.com >> >> >> +static int memory_attribute_state_change(MemoryAttributeManager *mgr, >> uint64_t offset, >> + uint64_t size, bool >> shared_to_private) >> +{ >> + int block_size = memory_attribute_manager_get_block_size(mgr); >> + int ret = 0; >> + >> + if (!memory_attribute_is_valid_range(mgr, offset, size)) { >> + error_report("%s, invalid range: offset 0x%lx, size 0x%lx", >> + __func__, offset, size); >> + return -1; >> + } >> + >> + if ((shared_to_private && memory_attribute_is_range_discarded(mgr, >> offset, size)) || >> + (!shared_to_private && memory_attribute_is_range_populated(mgr, >> offset, size))) { >> + return 0; >> + } >> + >> + if (shared_to_private) { >> + memory_attribute_notify_discard(mgr, offset, size); >> + } else { >> + ret = memory_attribute_notify_populate(mgr, offset, size); >> + } >> + >> + if (!ret) { >> + unsigned long first_bit = offset / block_size; >> + unsigned long nbits = size / block_size; >> + >> + g_assert((first_bit + nbits) <= mgr->bitmap_size); >> + >> + if (shared_to_private) { >> + bitmap_clear(mgr->shared_bitmap, first_bit, nbits); >> + } else { >> + bitmap_set(mgr->shared_bitmap, first_bit, nbits); >> + } >> + >> + return 0; >> + } >> + >> + return ret; >> +} >> >> Then, in memory_attribute_notify_populate(), we walk the bitmap again. >> >> Why? >> >> We just checked that it's all in the expected state, no? >> >> >> virtio-mem doesn't handle it that way, so I'm curious why we would have >> to do it here? > > I was concerned about the case where the guest issues a request that > only partial of the range is in the desired state. > I think the main problem is the policy for the guest conversion request. > My current handling is: > > 1. When a conversion request is made for a range already in the desired > state, the helper simply returns success. Yes. > 2. For requests involving a range partially in the desired state, only > the necessary segments are converted, ensuring the entire range > complies with the request efficiently. Ah, now I get: + if ((shared_to_private && memory_attribute_is_range_discarded(mgr, offset, size)) || + (!shared_to_private && memory_attribute_is_range_populated(mgr, offset, size))) { + return 0; + } + We're not failing if it might already partially be in the other state. > 3. In scenarios where a conversion request is declined by other systems, > such as a failure from VFIO during notify_populate(), the helper will > roll back the request, maintaining consistency. 
> > And the policy of virtio-mem is to refuse the state change if not all > blocks are in the opposite state. Yes. > > Actually, this part is still a uncertain to me. > IIUC, the problem does not exist if we only convert a single page at a time. Is there a known use case where such partial conversions could happen? > BTW, per the status/bitmap track, the virtio-mem also changes the bitmap > after the plug/unplug notifier. This is the same, correct? Right. But because we reject these partial requests, we don't have to traverse the bitmap and could just adjust the bitmap operations.
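To make the virtio-mem-style policy described here concrete, below is a rough, untested sketch of memory_attribute_state_change() that rejects mixed (partial) requests outright, so the bitmap can be updated with a single bitmap_set()/bitmap_clear() and no extra traversal. It reuses the helper names from Chenyi's series and keeps the existing notify helpers unchanged; only the reject policy is being illustrated, this is not code from the series.

static int memory_attribute_state_change(MemoryAttributeManager *mgr,
                                         uint64_t offset, uint64_t size,
                                         bool shared_to_private)
{
    const int block_size = memory_attribute_manager_get_block_size(mgr);
    const unsigned long first_bit = offset / block_size;
    const unsigned long nbits = size / block_size;
    int ret = 0;

    if (!memory_attribute_is_valid_range(mgr, offset, size)) {
        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
                     __func__, offset, size);
        return -1;
    }
    g_assert(first_bit + nbits <= mgr->bitmap_size);

    if (shared_to_private) {
        if (memory_attribute_is_range_discarded(mgr, offset, size)) {
            return 0;           /* already private, nothing to do */
        }
        if (!memory_attribute_is_range_populated(mgr, offset, size)) {
            return -EINVAL;     /* mixed state: reject the partial request */
        }
        memory_attribute_notify_discard(mgr, offset, size);
        bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
    } else {
        if (memory_attribute_is_range_populated(mgr, offset, size)) {
            return 0;           /* already shared, nothing to do */
        }
        if (!memory_attribute_is_range_discarded(mgr, offset, size)) {
            return -EINVAL;     /* mixed state: reject the partial request */
        }
        ret = memory_attribute_notify_populate(mgr, offset, size);
        if (!ret) {
            bitmap_set(mgr->shared_bitmap, first_bit, nbits);
        }
    }
    return ret;
}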
On 2/21/2025 4:09 PM, David Hildenbrand wrote: > On 21.02.25 03:25, Chenyi Qiang wrote: >> >> >> On 2/21/2025 3:39 AM, David Hildenbrand wrote: >>> On 20.02.25 17:13, Jean-Philippe Brucker wrote: >>>> For Arm CCA we'd like the guest_memfd discard notifier to call the >>>> IOMMU >>>> notifiers and create e.g. VFIO mappings. The default VFIO discard >>>> notifier isn't sufficient for CCA because the DMA addresses need a >>>> translation (even without vIOMMU). >>>> >>>> At the moment: >>>> * guest_memfd_state_change() calls the populate() notifier >>>> * the populate notifier() calls IOMMU notifiers >>>> * the IOMMU notifier handler calls memory_get_xlat_addr() to get a VA >>>> * it calls ram_discard_manager_is_populated() which fails. >>>> >>>> guest_memfd_state_change() only changes the section's state after >>>> calling the populate() notifier. We can't easily invert the order of >>>> operation because it uses the old state bitmap to know which pages need >>>> the populate() notifier. >>> >>> I assume we talk about this code: [1] >>> >>> [1] https://lkml.kernel.org/r/20250217081833.21568-1- >>> chenyi.qiang@intel.com >>> >>> >>> +static int memory_attribute_state_change(MemoryAttributeManager *mgr, >>> uint64_t offset, >>> + uint64_t size, bool >>> shared_to_private) >>> +{ >>> + int block_size = memory_attribute_manager_get_block_size(mgr); >>> + int ret = 0; >>> + >>> + if (!memory_attribute_is_valid_range(mgr, offset, size)) { >>> + error_report("%s, invalid range: offset 0x%lx, size 0x%lx", >>> + __func__, offset, size); >>> + return -1; >>> + } >>> + >>> + if ((shared_to_private && memory_attribute_is_range_discarded(mgr, >>> offset, size)) || >>> + (!shared_to_private && memory_attribute_is_range_populated(mgr, >>> offset, size))) { >>> + return 0; >>> + } >>> + >>> + if (shared_to_private) { >>> + memory_attribute_notify_discard(mgr, offset, size); >>> + } else { >>> + ret = memory_attribute_notify_populate(mgr, offset, size); >>> + } >>> + >>> + if (!ret) { >>> + unsigned long first_bit = offset / block_size; >>> + unsigned long nbits = size / block_size; >>> + >>> + g_assert((first_bit + nbits) <= mgr->bitmap_size); >>> + >>> + if (shared_to_private) { >>> + bitmap_clear(mgr->shared_bitmap, first_bit, nbits); >>> + } else { >>> + bitmap_set(mgr->shared_bitmap, first_bit, nbits); >>> + } >>> + >>> + return 0; >>> + } >>> + >>> + return ret; >>> +} >>> >>> Then, in memory_attribute_notify_populate(), we walk the bitmap again. >>> >>> Why? >>> >>> We just checked that it's all in the expected state, no? >>> >>> >>> virtio-mem doesn't handle it that way, so I'm curious why we would have >>> to do it here? >> >> I was concerned about the case where the guest issues a request that >> only partial of the range is in the desired state. >> I think the main problem is the policy for the guest conversion request. >> My current handling is: >> >> 1. When a conversion request is made for a range already in the desired >> state, the helper simply returns success. > > Yes. > >> 2. For requests involving a range partially in the desired state, only >> the necessary segments are converted, ensuring the entire range >> complies with the request efficiently. > > > Ah, now I get: > > + if ((shared_to_private && memory_attribute_is_range_discarded(mgr, > offset, size)) || > + (!shared_to_private && memory_attribute_is_range_populated(mgr, > offset, size))) { > + return 0; > + } > + > > We're not failing if it might already partially be in the other state. > >> 3. 
In scenarios where a conversion request is declined by other systems, >> such as a failure from VFIO during notify_populate(), the helper will >> roll back the request, maintaining consistency. >> >> And the policy of virtio-mem is to refuse the state change if not all >> blocks are in the opposite state. > > Yes. > >> >> Actually, this part is still a uncertain to me. >> > > IIUC, the problem does not exist if we only convert a single page at a > time. > > Is there a known use case where such partial conversions could happen? I don't see such a case yet. Actually, I'm trying to follow the behavior of the KVM_SET_MEMORY_ATTRIBUTES ioctl during page conversion. In KVM, it doesn't reject the request if the whole range isn't in the opposite state; it just uses xa_store() to update it. Also, I don't see the spec saying how to handle such a case. To be robust, I just allow this special case. > >> BTW, per the status/bitmap track, the virtio-mem also changes the bitmap >> after the plug/unplug notifier. This is the same, correct? > Right. But because we reject these partial requests, we don't have to > traverse the bitmap and could just adjust the bitmap operations. Yes, if we treat it as a guest error/bug, we can adjust it. >
On 2/21/2025 6:04 PM, Chenyi Qiang wrote: > > > On 2/21/2025 4:09 PM, David Hildenbrand wrote: >> On 21.02.25 03:25, Chenyi Qiang wrote: >>> >>> >>> On 2/21/2025 3:39 AM, David Hildenbrand wrote: >>>> On 20.02.25 17:13, Jean-Philippe Brucker wrote: >>>>> For Arm CCA we'd like the guest_memfd discard notifier to call the >>>>> IOMMU >>>>> notifiers and create e.g. VFIO mappings. The default VFIO discard >>>>> notifier isn't sufficient for CCA because the DMA addresses need a >>>>> translation (even without vIOMMU). >>>>> >>>>> At the moment: >>>>> * guest_memfd_state_change() calls the populate() notifier >>>>> * the populate notifier() calls IOMMU notifiers >>>>> * the IOMMU notifier handler calls memory_get_xlat_addr() to get a VA >>>>> * it calls ram_discard_manager_is_populated() which fails. >>>>> >>>>> guest_memfd_state_change() only changes the section's state after >>>>> calling the populate() notifier. We can't easily invert the order of >>>>> operation because it uses the old state bitmap to know which pages need >>>>> the populate() notifier. >>>> >>>> I assume we talk about this code: [1] >>>> >>>> [1] https://lkml.kernel.org/r/20250217081833.21568-1- >>>> chenyi.qiang@intel.com >>>> >>>> >>>> +static int memory_attribute_state_change(MemoryAttributeManager *mgr, >>>> uint64_t offset, >>>> + uint64_t size, bool >>>> shared_to_private) >>>> +{ >>>> + int block_size = memory_attribute_manager_get_block_size(mgr); >>>> + int ret = 0; >>>> + >>>> + if (!memory_attribute_is_valid_range(mgr, offset, size)) { >>>> + error_report("%s, invalid range: offset 0x%lx, size 0x%lx", >>>> + __func__, offset, size); >>>> + return -1; >>>> + } >>>> + >>>> + if ((shared_to_private && memory_attribute_is_range_discarded(mgr, >>>> offset, size)) || >>>> + (!shared_to_private && memory_attribute_is_range_populated(mgr, >>>> offset, size))) { >>>> + return 0; >>>> + } >>>> + >>>> + if (shared_to_private) { >>>> + memory_attribute_notify_discard(mgr, offset, size); >>>> + } else { >>>> + ret = memory_attribute_notify_populate(mgr, offset, size); >>>> + } >>>> + >>>> + if (!ret) { >>>> + unsigned long first_bit = offset / block_size; >>>> + unsigned long nbits = size / block_size; >>>> + >>>> + g_assert((first_bit + nbits) <= mgr->bitmap_size); >>>> + >>>> + if (shared_to_private) { >>>> + bitmap_clear(mgr->shared_bitmap, first_bit, nbits); >>>> + } else { >>>> + bitmap_set(mgr->shared_bitmap, first_bit, nbits); >>>> + } >>>> + >>>> + return 0; >>>> + } >>>> + >>>> + return ret; >>>> +} >>>> >>>> Then, in memory_attribute_notify_populate(), we walk the bitmap again. >>>> >>>> Why? >>>> >>>> We just checked that it's all in the expected state, no? >>>> >>>> >>>> virtio-mem doesn't handle it that way, so I'm curious why we would have >>>> to do it here? >>> >>> I was concerned about the case where the guest issues a request that >>> only partial of the range is in the desired state. >>> I think the main problem is the policy for the guest conversion request. >>> My current handling is: >>> >>> 1. When a conversion request is made for a range already in the desired >>> state, the helper simply returns success. >> >> Yes. >> >>> 2. For requests involving a range partially in the desired state, only >>> the necessary segments are converted, ensuring the entire range >>> complies with the request efficiently. 
>> >> >> Ah, now I get: >> >> + if ((shared_to_private && memory_attribute_is_range_discarded(mgr, >> offset, size)) || >> + (!shared_to_private && memory_attribute_is_range_populated(mgr, >> offset, size))) { >> + return 0; >> + } >> + >> >> We're not failing if it might already partially be in the other state. >> >>> 3. In scenarios where a conversion request is declined by other systems, >>> such as a failure from VFIO during notify_populate(), the helper will >>> roll back the request, maintaining consistency. >>> >>> And the policy of virtio-mem is to refuse the state change if not all >>> blocks are in the opposite state. >> >> Yes. >> >>> >>> Actually, this part is still a uncertain to me. >>> >> >> IIUC, the problem does not exist if we only convert a single page at a >> time. >> >> Is there a known use case where such partial conversions could happen? > > I don't see such case yet. Actually, I'm trying to follow the behavior > of KVM_SET_MEMORY_ATTRIBUTES ioctl during page conversion. In KVM, it > doesn't reject the request if the whole range isn't in the opposite > state. It just uses xa_store() to update it. Also, I don't see the spec > says how to handle such case. To be robust, I just allow this special case. > >> >>> BTW, per the status/bitmap track, the virtio-mem also changes the bitmap >>> after the plug/unplug notifier. This is the same, correct? >> Right. But because we reject these partial requests, we don't have to >> traverse the bitmap and could just adjust the bitmap operations. > > Yes, If we treat it as a guest error/bug, we can adjust it. Hi David, which option do you think is better? If you prefer to reject the partial requests, I'll change it in my next version. > >> >
On 25.02.25 03:00, Chenyi Qiang wrote: > > > On 2/21/2025 6:04 PM, Chenyi Qiang wrote: >> >> >> On 2/21/2025 4:09 PM, David Hildenbrand wrote: >>> On 21.02.25 03:25, Chenyi Qiang wrote: >>>> >>>> >>>> On 2/21/2025 3:39 AM, David Hildenbrand wrote: >>>>> On 20.02.25 17:13, Jean-Philippe Brucker wrote: >>>>>> For Arm CCA we'd like the guest_memfd discard notifier to call the >>>>>> IOMMU >>>>>> notifiers and create e.g. VFIO mappings. The default VFIO discard >>>>>> notifier isn't sufficient for CCA because the DMA addresses need a >>>>>> translation (even without vIOMMU). >>>>>> >>>>>> At the moment: >>>>>> * guest_memfd_state_change() calls the populate() notifier >>>>>> * the populate notifier() calls IOMMU notifiers >>>>>> * the IOMMU notifier handler calls memory_get_xlat_addr() to get a VA >>>>>> * it calls ram_discard_manager_is_populated() which fails. >>>>>> >>>>>> guest_memfd_state_change() only changes the section's state after >>>>>> calling the populate() notifier. We can't easily invert the order of >>>>>> operation because it uses the old state bitmap to know which pages need >>>>>> the populate() notifier. >>>>> >>>>> I assume we talk about this code: [1] >>>>> >>>>> [1] https://lkml.kernel.org/r/20250217081833.21568-1- >>>>> chenyi.qiang@intel.com >>>>> >>>>> >>>>> +static int memory_attribute_state_change(MemoryAttributeManager *mgr, >>>>> uint64_t offset, >>>>> + uint64_t size, bool >>>>> shared_to_private) >>>>> +{ >>>>> + int block_size = memory_attribute_manager_get_block_size(mgr); >>>>> + int ret = 0; >>>>> + >>>>> + if (!memory_attribute_is_valid_range(mgr, offset, size)) { >>>>> + error_report("%s, invalid range: offset 0x%lx, size 0x%lx", >>>>> + __func__, offset, size); >>>>> + return -1; >>>>> + } >>>>> + >>>>> + if ((shared_to_private && memory_attribute_is_range_discarded(mgr, >>>>> offset, size)) || >>>>> + (!shared_to_private && memory_attribute_is_range_populated(mgr, >>>>> offset, size))) { >>>>> + return 0; >>>>> + } >>>>> + >>>>> + if (shared_to_private) { >>>>> + memory_attribute_notify_discard(mgr, offset, size); >>>>> + } else { >>>>> + ret = memory_attribute_notify_populate(mgr, offset, size); >>>>> + } >>>>> + >>>>> + if (!ret) { >>>>> + unsigned long first_bit = offset / block_size; >>>>> + unsigned long nbits = size / block_size; >>>>> + >>>>> + g_assert((first_bit + nbits) <= mgr->bitmap_size); >>>>> + >>>>> + if (shared_to_private) { >>>>> + bitmap_clear(mgr->shared_bitmap, first_bit, nbits); >>>>> + } else { >>>>> + bitmap_set(mgr->shared_bitmap, first_bit, nbits); >>>>> + } >>>>> + >>>>> + return 0; >>>>> + } >>>>> + >>>>> + return ret; >>>>> +} >>>>> >>>>> Then, in memory_attribute_notify_populate(), we walk the bitmap again. >>>>> >>>>> Why? >>>>> >>>>> We just checked that it's all in the expected state, no? >>>>> >>>>> >>>>> virtio-mem doesn't handle it that way, so I'm curious why we would have >>>>> to do it here? >>>> >>>> I was concerned about the case where the guest issues a request that >>>> only partial of the range is in the desired state. >>>> I think the main problem is the policy for the guest conversion request. >>>> My current handling is: >>>> >>>> 1. When a conversion request is made for a range already in the desired >>>> state, the helper simply returns success. >>> >>> Yes. >>> >>>> 2. For requests involving a range partially in the desired state, only >>>> the necessary segments are converted, ensuring the entire range >>>> complies with the request efficiently. 
>>> >>> >>> Ah, now I get: >>> >>> + if ((shared_to_private && memory_attribute_is_range_discarded(mgr, >>> offset, size)) || >>> + (!shared_to_private && memory_attribute_is_range_populated(mgr, >>> offset, size))) { >>> + return 0; >>> + } >>> + >>> >>> We're not failing if it might already partially be in the other state. >>> >>>> 3. In scenarios where a conversion request is declined by other systems, >>>> such as a failure from VFIO during notify_populate(), the helper will >>>> roll back the request, maintaining consistency. >>>> >>>> And the policy of virtio-mem is to refuse the state change if not all >>>> blocks are in the opposite state. >>> >>> Yes. >>> >>>> >>>> Actually, this part is still a uncertain to me. >>>> >>> >>> IIUC, the problem does not exist if we only convert a single page at a >>> time. >>> >>> Is there a known use case where such partial conversions could happen? >> >> I don't see such case yet. Actually, I'm trying to follow the behavior >> of KVM_SET_MEMORY_ATTRIBUTES ioctl during page conversion. In KVM, it >> doesn't reject the request if the whole range isn't in the opposite >> state. It just uses xa_store() to update it. Also, I don't see the spec >> says how to handle such case. To be robust, I just allow this special case. >> >>> >>>> BTW, per the status/bitmap track, the virtio-mem also changes the bitmap >>>> after the plug/unplug notifier. This is the same, correct? >>> Right. But because we reject these partial requests, we don't have to >>> traverse the bitmap and could just adjust the bitmap operations. >> >> Yes, If we treat it as a guest error/bug, we can adjust it. > > Hi David, do you think which option is better? If prefer to reject the > partial requests, I'll change it in my next version. Hi, still scratching my head. Having to work around it as in this patch here is suboptimal. Could we simplify the whole thing while still allowing for (unexpected) partial conversions? Essentially: If states are mixed, fallback to a "1 block at a time" handling. The only problem is: what to do if we fail halfway through? Well, we can only have such partial completions for "populate", not for discard. Option a) Just leave it as "partially completed populate" and return the error. The bitmap and the notifiers are consistent. Option b) Just discard everything: someone tried to convert something "partial shared" to "shared". So maybe, if anything goes wrong, we can just have "all private". The question is also, what the expectation from the caller is: can the caller even make progress on failure or do we have to retry until it works? Both options would be compatible with "the caller must retry either way until it works". a) is probably easiest. 
Something like the following on top of your code: From 40c001a57c3c1350f3a44288ccb013d903d300cf Mon Sep 17 00:00:00 2001 From: David Hildenbrand <david@redhat.com> Date: Tue, 25 Feb 2025 09:55:38 +0100 Subject: [PATCH] tmp Signed-off-by: David Hildenbrand <david@redhat.com> --- system/memory-attribute-manager.c | 66 +++++++++++++++++-------------- 1 file changed, 36 insertions(+), 30 deletions(-) diff --git a/system/memory-attribute-manager.c b/system/memory-attribute-manager.c index 17c70cf677..31960e4a54 100644 --- a/system/memory-attribute-manager.c +++ b/system/memory-attribute-manager.c @@ -274,9 +274,7 @@ static void memory_attribute_notify_discard(MemoryAttributeManager *mgr, if (!memory_region_section_intersect_range(&tmp, offset, size)) { continue; } - - memory_attribute_for_each_populated_section(mgr, &tmp, rdl, - memory_attribute_notify_discard_cb); + rdl->notify_discard(rdl, &tmp); } } @@ -292,9 +290,7 @@ static int memory_attribute_notify_populate(MemoryAttributeManager *mgr, if (!memory_region_section_intersect_range(&tmp, offset, size)) { continue; } - - ret = memory_attribute_for_each_discarded_section(mgr, &tmp, rdl, - memory_attribute_notify_populate_cb); + ret = rdl->notify_populate(rdl, &tmp); if (ret) { break; } @@ -311,9 +307,7 @@ static int memory_attribute_notify_populate(MemoryAttributeManager *mgr, if (!memory_region_section_intersect_range(&tmp, offset, size)) { continue; } - - memory_attribute_for_each_discarded_section(mgr, &tmp, rdl2, - memory_attribute_notify_discard_cb); + rdl2->notify_discard(rdl2, &tmp); } } return ret; @@ -348,7 +342,10 @@ static bool memory_attribute_is_range_discarded(MemoryAttributeManager *mgr, static int memory_attribute_state_change(MemoryAttributeManager *mgr, uint64_t offset, uint64_t size, bool shared_to_private) { - int block_size = memory_attribute_manager_get_block_size(mgr); + const int block_size = memory_attribute_manager_get_block_size(mgr); + const unsigned long first_bit = offset / block_size; + const unsigned long nbits = size / block_size; + uint64_t cur_offset; int ret = 0; if (!memory_attribute_is_valid_range(mgr, offset, size)) { @@ -357,32 +354,41 @@ static int memory_attribute_state_change(MemoryAttributeManager *mgr, uint64_t o return -1; } - if ((shared_to_private && memory_attribute_is_range_discarded(mgr, offset, size)) || - (!shared_to_private && memory_attribute_is_range_populated(mgr, offset, size))) { - return 0; - } - if (shared_to_private) { - memory_attribute_notify_discard(mgr, offset, size); - } else { - ret = memory_attribute_notify_populate(mgr, offset, size); - } - - if (!ret) { - unsigned long first_bit = offset / block_size; - unsigned long nbits = size / block_size; - - g_assert((first_bit + nbits) <= mgr->bitmap_size); - - if (shared_to_private) { + if (memory_attribute_is_range_discarded(mgr, offset, size)) { + /* Already private. */ + } else if (!memory_attribute_is_range_populated(mgr, offset, size)) { + /* Unexpected mixture: not completely shared. */ + for (cur_offset = 0; cur_offset < offset; cur_offset += block_size) { + memory_attribute_state_change(mgr, cur_offset, block_size, + true); + } + } else { + /* Completely shared. */ bitmap_clear(mgr->shared_bitmap, first_bit, nbits); + memory_attribute_notify_discard(mgr, offset, size); + } + } else { + if (memory_attribute_is_range_populated(mgr, offset, size)) { + /* Already shared. */ + } else if (!memory_attribute_is_range_discarded(mgr, offset, size)) { + /* Unexpected mixture: not completely private. 
*/ + for (cur_offset = 0; cur_offset < offset; cur_offset += block_size) { + ret = memory_attribute_state_change(mgr, cur_offset, block_size, + false); + if (ret) { + break; + } + } } else { + /* Completely private. */ bitmap_set(mgr->shared_bitmap, first_bit, nbits); + ret = memory_attribute_notify_populate(mgr, offset, size); + if (ret) { + bitmap_clear(mgr->shared_bitmap, first_bit, nbits); + } } - - return 0; } - return ret; }
On 2/25/2025 5:41 PM, David Hildenbrand wrote: > On 25.02.25 03:00, Chenyi Qiang wrote: >> >> >> On 2/21/2025 6:04 PM, Chenyi Qiang wrote: >>> >>> >>> On 2/21/2025 4:09 PM, David Hildenbrand wrote: >>>> On 21.02.25 03:25, Chenyi Qiang wrote: >>>>> >>>>> >>>>> On 2/21/2025 3:39 AM, David Hildenbrand wrote: >>>>>> On 20.02.25 17:13, Jean-Philippe Brucker wrote: >>>>>>> For Arm CCA we'd like the guest_memfd discard notifier to call the >>>>>>> IOMMU >>>>>>> notifiers and create e.g. VFIO mappings. The default VFIO discard >>>>>>> notifier isn't sufficient for CCA because the DMA addresses need a >>>>>>> translation (even without vIOMMU). >>>>>>> >>>>>>> At the moment: >>>>>>> * guest_memfd_state_change() calls the populate() notifier >>>>>>> * the populate notifier() calls IOMMU notifiers >>>>>>> * the IOMMU notifier handler calls memory_get_xlat_addr() to get >>>>>>> a VA >>>>>>> * it calls ram_discard_manager_is_populated() which fails. >>>>>>> >>>>>>> guest_memfd_state_change() only changes the section's state after >>>>>>> calling the populate() notifier. We can't easily invert the order of >>>>>>> operation because it uses the old state bitmap to know which >>>>>>> pages need >>>>>>> the populate() notifier. >>>>>> >>>>>> I assume we talk about this code: [1] >>>>>> >>>>>> [1] https://lkml.kernel.org/r/20250217081833.21568-1- >>>>>> chenyi.qiang@intel.com >>>>>> >>>>>> >>>>>> +static int memory_attribute_state_change(MemoryAttributeManager >>>>>> *mgr, >>>>>> uint64_t offset, >>>>>> + uint64_t size, bool >>>>>> shared_to_private) >>>>>> +{ >>>>>> + int block_size = memory_attribute_manager_get_block_size(mgr); >>>>>> + int ret = 0; >>>>>> + >>>>>> + if (!memory_attribute_is_valid_range(mgr, offset, size)) { >>>>>> + error_report("%s, invalid range: offset 0x%lx, size 0x%lx", >>>>>> + __func__, offset, size); >>>>>> + return -1; >>>>>> + } >>>>>> + >>>>>> + if ((shared_to_private && >>>>>> memory_attribute_is_range_discarded(mgr, >>>>>> offset, size)) || >>>>>> + (!shared_to_private && >>>>>> memory_attribute_is_range_populated(mgr, >>>>>> offset, size))) { >>>>>> + return 0; >>>>>> + } >>>>>> + >>>>>> + if (shared_to_private) { >>>>>> + memory_attribute_notify_discard(mgr, offset, size); >>>>>> + } else { >>>>>> + ret = memory_attribute_notify_populate(mgr, offset, size); >>>>>> + } >>>>>> + >>>>>> + if (!ret) { >>>>>> + unsigned long first_bit = offset / block_size; >>>>>> + unsigned long nbits = size / block_size; >>>>>> + >>>>>> + g_assert((first_bit + nbits) <= mgr->bitmap_size); >>>>>> + >>>>>> + if (shared_to_private) { >>>>>> + bitmap_clear(mgr->shared_bitmap, first_bit, nbits); >>>>>> + } else { >>>>>> + bitmap_set(mgr->shared_bitmap, first_bit, nbits); >>>>>> + } >>>>>> + >>>>>> + return 0; >>>>>> + } >>>>>> + >>>>>> + return ret; >>>>>> +} >>>>>> >>>>>> Then, in memory_attribute_notify_populate(), we walk the bitmap >>>>>> again. >>>>>> >>>>>> Why? >>>>>> >>>>>> We just checked that it's all in the expected state, no? >>>>>> >>>>>> >>>>>> virtio-mem doesn't handle it that way, so I'm curious why we would >>>>>> have >>>>>> to do it here? >>>>> >>>>> I was concerned about the case where the guest issues a request that >>>>> only partial of the range is in the desired state. >>>>> I think the main problem is the policy for the guest conversion >>>>> request. >>>>> My current handling is: >>>>> >>>>> 1. When a conversion request is made for a range already in the >>>>> desired >>>>> state, the helper simply returns success. >>>> >>>> Yes. >>>> >>>>> 2. 
For requests involving a range partially in the desired state, only >>>>> the necessary segments are converted, ensuring the entire range >>>>> complies with the request efficiently. >>>> >>>> >>>> Ah, now I get: >>>> >>>> + if ((shared_to_private && memory_attribute_is_range_discarded(mgr, >>>> offset, size)) || >>>> + (!shared_to_private && >>>> memory_attribute_is_range_populated(mgr, >>>> offset, size))) { >>>> + return 0; >>>> + } >>>> + >>>> >>>> We're not failing if it might already partially be in the other state. >>>> >>>>> 3. In scenarios where a conversion request is declined by other >>>>> systems, >>>>> such as a failure from VFIO during notify_populate(), the >>>>> helper will >>>>> roll back the request, maintaining consistency. >>>>> >>>>> And the policy of virtio-mem is to refuse the state change if not all >>>>> blocks are in the opposite state. >>>> >>>> Yes. >>>> >>>>> >>>>> Actually, this part is still a uncertain to me. >>>>> >>>> >>>> IIUC, the problem does not exist if we only convert a single page at a >>>> time. >>>> >>>> Is there a known use case where such partial conversions could happen? >>> >>> I don't see such case yet. Actually, I'm trying to follow the behavior >>> of KVM_SET_MEMORY_ATTRIBUTES ioctl during page conversion. In KVM, it >>> doesn't reject the request if the whole range isn't in the opposite >>> state. It just uses xa_store() to update it. Also, I don't see the spec >>> says how to handle such case. To be robust, I just allow this special >>> case. >>> >>>> >>>>> BTW, per the status/bitmap track, the virtio-mem also changes the >>>>> bitmap >>>>> after the plug/unplug notifier. This is the same, correct? >>>> Right. But because we reject these partial requests, we don't have to >>>> traverse the bitmap and could just adjust the bitmap operations. >>> >>> Yes, If we treat it as a guest error/bug, we can adjust it. >> >> Hi David, do you think which option is better? If prefer to reject the >> partial requests, I'll change it in my next version. > > Hi, > > still scratching my head. Having to work around it as in this patch here is > suboptimal. > > Could we simplify the whole thing while still allowing for (unexpected) > partial > conversions? > > Essentially: If states are mixed, fallback to a "1 block at a time" > handling. > > The only problem is: what to do if we fail halfway through? Well, we can > only have > such partial completions for "populate", not for discard. > > Option a) Just leave it as "partially completed populate" and return the > error. The > bitmap and the notifiers are consistent. > > Option b) Just discard everything: someone tried to convert something > "partial > shared" to "shared". So maybe, if anything goes wrong, we can just have > "all private". > > The question is also, what the expectation from the caller is: can the > caller > even make progress on failure or do we have to retry until it works? Yes, That's the key problem. For core mm side conversion, The caller (guest) handles three case: success, failure and retry. guest can continue on failure but will keep the memory in its original attribute and trigger some calltrace. While in QEMU side, it would cause VM stop if kvm_set_memory_attributes() failed. As for the VFIO conversion, at present, we allow it to fail and don't return error code to guest as long as we undo the conversion. It only causes the device not work in guest. 
I think if we view the attribute mismatch between core mm and IOMMU as a fatal error, we can call VM stop or let guest retry until it converts successfully. > > Both options would be compatible with "the caller must retry either way > until it works". > > a) is probably easiest. Something like the following on top of your code: LGTM, with option a), We need to return the retry status to guest. the caller (guest) must handle the retry. Not doing so with "partially completed populate" would cause some problem. > > > > From 40c001a57c3c1350f3a44288ccb013d903d300cf Mon Sep 17 00:00:00 2001 > From: David Hildenbrand <david@redhat.com> > Date: Tue, 25 Feb 2025 09:55:38 +0100 > Subject: [PATCH] tmp > > Signed-off-by: David Hildenbrand <david@redhat.com> > --- > system/memory-attribute-manager.c | 66 +++++++++++++++++-------------- > 1 file changed, 36 insertions(+), 30 deletions(-) > > diff --git a/system/memory-attribute-manager.c b/system/memory- > attribute-manager.c > index 17c70cf677..31960e4a54 100644 > --- a/system/memory-attribute-manager.c > +++ b/system/memory-attribute-manager.c > @@ -274,9 +274,7 @@ static void > memory_attribute_notify_discard(MemoryAttributeManager *mgr, > if (!memory_region_section_intersect_range(&tmp, offset, size)) { > continue; > } > - > - memory_attribute_for_each_populated_section(mgr, &tmp, rdl, > - > memory_attribute_notify_discard_cb); > + rdl->notify_discard(rdl, &tmp); > } > } > > @@ -292,9 +290,7 @@ static int > memory_attribute_notify_populate(MemoryAttributeManager *mgr, > if (!memory_region_section_intersect_range(&tmp, offset, size)) { > continue; > } > - > - ret = memory_attribute_for_each_discarded_section(mgr, &tmp, rdl, > - > memory_attribute_notify_populate_cb); > + ret = rdl->notify_populate(rdl, &tmp); > if (ret) { > break; > } > @@ -311,9 +307,7 @@ static int > memory_attribute_notify_populate(MemoryAttributeManager *mgr, > if (!memory_region_section_intersect_range(&tmp, offset, > size)) { > continue; > } > - > - memory_attribute_for_each_discarded_section(mgr, &tmp, rdl2, > - > memory_attribute_notify_discard_cb); > + rdl2->notify_discard(rdl2, &tmp); > } > } > return ret; > @@ -348,7 +342,10 @@ static bool > memory_attribute_is_range_discarded(MemoryAttributeManager *mgr, > static int memory_attribute_state_change(MemoryAttributeManager *mgr, > uint64_t offset, > uint64_t size, bool > shared_to_private) > { > - int block_size = memory_attribute_manager_get_block_size(mgr); > + const int block_size = memory_attribute_manager_get_block_size(mgr); > + const unsigned long first_bit = offset / block_size; > + const unsigned long nbits = size / block_size; > + uint64_t cur_offset; > int ret = 0; > > if (!memory_attribute_is_valid_range(mgr, offset, size)) { > @@ -357,32 +354,41 @@ static int > memory_attribute_state_change(MemoryAttributeManager *mgr, uint64_t o > return -1; > } > > - if ((shared_to_private && memory_attribute_is_range_discarded(mgr, > offset, size)) || > - (!shared_to_private && memory_attribute_is_range_populated(mgr, > offset, size))) { > - return 0; > - } > - > if (shared_to_private) { > - memory_attribute_notify_discard(mgr, offset, size); > - } else { > - ret = memory_attribute_notify_populate(mgr, offset, size); > - } > - > - if (!ret) { > - unsigned long first_bit = offset / block_size; > - unsigned long nbits = size / block_size; > - > - g_assert((first_bit + nbits) <= mgr->bitmap_size); > - > - if (shared_to_private) { > + if (memory_attribute_is_range_discarded(mgr, offset, size)) { > + /* Already private. 
*/ > + } else if (!memory_attribute_is_range_populated(mgr, offset, > size)) { > + /* Unexpected mixture: not completely shared. */ > + for (cur_offset = 0; cur_offset < offset; cur_offset += > block_size) { > + memory_attribute_state_change(mgr, cur_offset, block_size, > + true); > + } > + } else { > + /* Completely shared. */ > bitmap_clear(mgr->shared_bitmap, first_bit, nbits); > + memory_attribute_notify_discard(mgr, offset, size); > + } > + } else { > + if (memory_attribute_is_range_populated(mgr, offset, size)) { > + /* Already shared. */ > + } else if (!memory_attribute_is_range_discarded(mgr, offset, > size)) { > + /* Unexpected mixture: not completely private. */ > + for (cur_offset = 0; cur_offset < offset; cur_offset += > block_size) { > + ret = memory_attribute_state_change(mgr, cur_offset, > block_size, > + false); > + if (ret) { > + break; > + } > + } > } else { > + /* Completely private. */ > bitmap_set(mgr->shared_bitmap, first_bit, nbits); > + ret = memory_attribute_notify_populate(mgr, offset, size); > + if (ret) { > + bitmap_clear(mgr->shared_bitmap, first_bit, nbits); > + } > } > - > - return 0; > } > - > return ret; > } >
On 2/26/2025 8:43 PM, Chenyi Qiang wrote: > > > On 2/25/2025 5:41 PM, David Hildenbrand wrote: >> On 25.02.25 03:00, Chenyi Qiang wrote: >>> >>> >>> On 2/21/2025 6:04 PM, Chenyi Qiang wrote: >>>> >>>> >>>> On 2/21/2025 4:09 PM, David Hildenbrand wrote: >>>>> On 21.02.25 03:25, Chenyi Qiang wrote: >>>>>> >>>>>> >>>>>> On 2/21/2025 3:39 AM, David Hildenbrand wrote: >>>>>>> On 20.02.25 17:13, Jean-Philippe Brucker wrote: >>>>>>>> For Arm CCA we'd like the guest_memfd discard notifier to call the >>>>>>>> IOMMU >>>>>>>> notifiers and create e.g. VFIO mappings. The default VFIO discard >>>>>>>> notifier isn't sufficient for CCA because the DMA addresses need a >>>>>>>> translation (even without vIOMMU). >>>>>>>> >>>>>>>> At the moment: >>>>>>>> * guest_memfd_state_change() calls the populate() notifier >>>>>>>> * the populate notifier() calls IOMMU notifiers >>>>>>>> * the IOMMU notifier handler calls memory_get_xlat_addr() to get >>>>>>>> a VA >>>>>>>> * it calls ram_discard_manager_is_populated() which fails. >>>>>>>> >>>>>>>> guest_memfd_state_change() only changes the section's state after >>>>>>>> calling the populate() notifier. We can't easily invert the order of >>>>>>>> operation because it uses the old state bitmap to know which >>>>>>>> pages need >>>>>>>> the populate() notifier. >>>>>>> >>>>>>> I assume we talk about this code: [1] >>>>>>> >>>>>>> [1] https://lkml.kernel.org/r/20250217081833.21568-1- >>>>>>> chenyi.qiang@intel.com >>>>>>> >>>>>>> >>>>>>> +static int memory_attribute_state_change(MemoryAttributeManager >>>>>>> *mgr, >>>>>>> uint64_t offset, >>>>>>> + uint64_t size, bool >>>>>>> shared_to_private) >>>>>>> +{ >>>>>>> + int block_size = memory_attribute_manager_get_block_size(mgr); >>>>>>> + int ret = 0; >>>>>>> + >>>>>>> + if (!memory_attribute_is_valid_range(mgr, offset, size)) { >>>>>>> + error_report("%s, invalid range: offset 0x%lx, size 0x%lx", >>>>>>> + __func__, offset, size); >>>>>>> + return -1; >>>>>>> + } >>>>>>> + >>>>>>> + if ((shared_to_private && >>>>>>> memory_attribute_is_range_discarded(mgr, >>>>>>> offset, size)) || >>>>>>> + (!shared_to_private && >>>>>>> memory_attribute_is_range_populated(mgr, >>>>>>> offset, size))) { >>>>>>> + return 0; >>>>>>> + } >>>>>>> + >>>>>>> + if (shared_to_private) { >>>>>>> + memory_attribute_notify_discard(mgr, offset, size); >>>>>>> + } else { >>>>>>> + ret = memory_attribute_notify_populate(mgr, offset, size); >>>>>>> + } >>>>>>> + >>>>>>> + if (!ret) { >>>>>>> + unsigned long first_bit = offset / block_size; >>>>>>> + unsigned long nbits = size / block_size; >>>>>>> + >>>>>>> + g_assert((first_bit + nbits) <= mgr->bitmap_size); >>>>>>> + >>>>>>> + if (shared_to_private) { >>>>>>> + bitmap_clear(mgr->shared_bitmap, first_bit, nbits); >>>>>>> + } else { >>>>>>> + bitmap_set(mgr->shared_bitmap, first_bit, nbits); >>>>>>> + } >>>>>>> + >>>>>>> + return 0; >>>>>>> + } >>>>>>> + >>>>>>> + return ret; >>>>>>> +} >>>>>>> >>>>>>> Then, in memory_attribute_notify_populate(), we walk the bitmap >>>>>>> again. >>>>>>> >>>>>>> Why? >>>>>>> >>>>>>> We just checked that it's all in the expected state, no? >>>>>>> >>>>>>> >>>>>>> virtio-mem doesn't handle it that way, so I'm curious why we would >>>>>>> have >>>>>>> to do it here? >>>>>> >>>>>> I was concerned about the case where the guest issues a request that >>>>>> only partial of the range is in the desired state. >>>>>> I think the main problem is the policy for the guest conversion >>>>>> request. >>>>>> My current handling is: >>>>>> >>>>>> 1. 
When a conversion request is made for a range already in the >>>>>> desired >>>>>> state, the helper simply returns success. >>>>> >>>>> Yes. >>>>> >>>>>> 2. For requests involving a range partially in the desired state, only >>>>>> the necessary segments are converted, ensuring the entire range >>>>>> complies with the request efficiently. >>>>> >>>>> >>>>> Ah, now I get: >>>>> >>>>> + if ((shared_to_private && memory_attribute_is_range_discarded(mgr, >>>>> offset, size)) || >>>>> + (!shared_to_private && >>>>> memory_attribute_is_range_populated(mgr, >>>>> offset, size))) { >>>>> + return 0; >>>>> + } >>>>> + >>>>> >>>>> We're not failing if it might already partially be in the other state. >>>>> >>>>>> 3. In scenarios where a conversion request is declined by other >>>>>> systems, >>>>>> such as a failure from VFIO during notify_populate(), the >>>>>> helper will >>>>>> roll back the request, maintaining consistency. >>>>>> >>>>>> And the policy of virtio-mem is to refuse the state change if not all >>>>>> blocks are in the opposite state. >>>>> >>>>> Yes. >>>>> >>>>>> >>>>>> Actually, this part is still a uncertain to me. >>>>>> >>>>> >>>>> IIUC, the problem does not exist if we only convert a single page at a >>>>> time. >>>>> >>>>> Is there a known use case where such partial conversions could happen? >>>> >>>> I don't see such case yet. Actually, I'm trying to follow the behavior >>>> of KVM_SET_MEMORY_ATTRIBUTES ioctl during page conversion. In KVM, it >>>> doesn't reject the request if the whole range isn't in the opposite >>>> state. It just uses xa_store() to update it. Also, I don't see the spec >>>> says how to handle such case. To be robust, I just allow this special >>>> case. >>>> >>>>> >>>>>> BTW, per the status/bitmap track, the virtio-mem also changes the >>>>>> bitmap >>>>>> after the plug/unplug notifier. This is the same, correct? >>>>> Right. But because we reject these partial requests, we don't have to >>>>> traverse the bitmap and could just adjust the bitmap operations. >>>> >>>> Yes, If we treat it as a guest error/bug, we can adjust it. >>> >>> Hi David, do you think which option is better? If prefer to reject the >>> partial requests, I'll change it in my next version. >> >> Hi, >> >> still scratching my head. Having to work around it as in this patch here is >> suboptimal. >> >> Could we simplify the whole thing while still allowing for (unexpected) >> partial >> conversions? >> >> Essentially: If states are mixed, fallback to a "1 block at a time" >> handling. >> >> The only problem is: what to do if we fail halfway through? Well, we can >> only have >> such partial completions for "populate", not for discard. >> >> Option a) Just leave it as "partially completed populate" and return the >> error. The >> bitmap and the notifiers are consistent. >> >> Option b) Just discard everything: someone tried to convert something >> "partial >> shared" to "shared". So maybe, if anything goes wrong, we can just have >> "all private". >> >> The question is also, what the expectation from the caller is: can the >> caller >> even make progress on failure or do we have to retry until it works? > > Yes, That's the key problem. > > For core mm side conversion, The caller (guest) handles three case: > success, failure and retry. guest can continue on failure but will keep > the memory in its original attribute and trigger some calltrace. While > in QEMU side, it would cause VM stop if kvm_set_memory_attributes() failed. 
> > As for the VFIO conversion, at present, we allow it to fail and don't > return error code to guest as long as we undo the conversion. It only > causes the device not work in guest. > > I think if we view the attribute mismatch between core mm and IOMMU as a > fatal error, we can call VM stop or let guest retry until it converts > successfully. >

Thinking more about the options for handling the failure case theoretically, as we haven't hit such a state_change() failure yet:

1. Undo + return invalid error
Pros: The guest can make progress.
Cons: Complicated undo operations. Option a) is not applicable, because it leaves the range as a partially completed populate while the guest thinks the operation has failed. We also need to add an undo for set_memory_attribute() after state_change() fails. Maybe also apply the attribute bitmap to the set_memory_attribute() operation to handle the mixed request case.

2. Undo in VFIO and no undo for set_memory_attribute() + return success (current approach in my series)
Pros: The guest can make progress although the device doesn't work.
Cons: The attribute bitmap only tracks the status in the IOMMU.

3. No undo + return retry
Pros: It keeps the attribute bitmap aligned between core mm and the IOMMU.
Cons: The guest doesn't know how to handle the retry; it would cause an infinite loop.

4. No undo + no return. Just VM stop.
Pros: Simple.
Cons: Maybe overkill.

Maybe option 1 or 4 is better?

>
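For reference, a rough sketch of what option 1 could look like from the conversion path's point of view. convert_to_shared() is a made-up wrapper, gpa is the guest-physical address and offset the corresponding offset into the managed region, and the ordering relative to kvm_set_memory_attributes_shared()/_private() (the helpers QEMU's guest_memfd conversion path uses) is an assumption, not taken from the series.

static int convert_to_shared(MemoryAttributeManager *mgr, hwaddr gpa,
                             uint64_t offset, uint64_t size)
{
    /* Flip the KVM memory attribute first (option 1 ordering, assumed). */
    int ret = kvm_set_memory_attributes_shared(gpa, size);
    if (ret) {
        return ret;
    }

    ret = memory_attribute_state_change(mgr, offset, size, false);
    if (ret) {
        /*
         * A listener (e.g. VFIO) refused the populate: undo the KVM
         * attribute change so that core mm and IOMMU stay in sync, and
         * report the failure so the guest can decide whether to retry.
         */
        kvm_set_memory_attributes_private(gpa, size);
        return ret;
    }
    return 0;
}

With this shape, a state_change() failure surfaces to the guest as a failed conversion instead of leaving the bitmap and the listeners disagreeing.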
On 27.02.25 04:26, Chenyi Qiang wrote: > > > On 2/26/2025 8:43 PM, Chenyi Qiang wrote: >> >> >> On 2/25/2025 5:41 PM, David Hildenbrand wrote: >>> On 25.02.25 03:00, Chenyi Qiang wrote: >>>> >>>> >>>> On 2/21/2025 6:04 PM, Chenyi Qiang wrote: >>>>> >>>>> >>>>> On 2/21/2025 4:09 PM, David Hildenbrand wrote: >>>>>> On 21.02.25 03:25, Chenyi Qiang wrote: >>>>>>> >>>>>>> >>>>>>> On 2/21/2025 3:39 AM, David Hildenbrand wrote: >>>>>>>> On 20.02.25 17:13, Jean-Philippe Brucker wrote: >>>>>>>>> For Arm CCA we'd like the guest_memfd discard notifier to call the >>>>>>>>> IOMMU >>>>>>>>> notifiers and create e.g. VFIO mappings. The default VFIO discard >>>>>>>>> notifier isn't sufficient for CCA because the DMA addresses need a >>>>>>>>> translation (even without vIOMMU). >>>>>>>>> >>>>>>>>> At the moment: >>>>>>>>> * guest_memfd_state_change() calls the populate() notifier >>>>>>>>> * the populate notifier() calls IOMMU notifiers >>>>>>>>> * the IOMMU notifier handler calls memory_get_xlat_addr() to get >>>>>>>>> a VA >>>>>>>>> * it calls ram_discard_manager_is_populated() which fails. >>>>>>>>> >>>>>>>>> guest_memfd_state_change() only changes the section's state after >>>>>>>>> calling the populate() notifier. We can't easily invert the order of >>>>>>>>> operation because it uses the old state bitmap to know which >>>>>>>>> pages need >>>>>>>>> the populate() notifier. >>>>>>>> >>>>>>>> I assume we talk about this code: [1] >>>>>>>> >>>>>>>> [1] https://lkml.kernel.org/r/20250217081833.21568-1- >>>>>>>> chenyi.qiang@intel.com >>>>>>>> >>>>>>>> >>>>>>>> +static int memory_attribute_state_change(MemoryAttributeManager >>>>>>>> *mgr, >>>>>>>> uint64_t offset, >>>>>>>> + uint64_t size, bool >>>>>>>> shared_to_private) >>>>>>>> +{ >>>>>>>> + int block_size = memory_attribute_manager_get_block_size(mgr); >>>>>>>> + int ret = 0; >>>>>>>> + >>>>>>>> + if (!memory_attribute_is_valid_range(mgr, offset, size)) { >>>>>>>> + error_report("%s, invalid range: offset 0x%lx, size 0x%lx", >>>>>>>> + __func__, offset, size); >>>>>>>> + return -1; >>>>>>>> + } >>>>>>>> + >>>>>>>> + if ((shared_to_private && >>>>>>>> memory_attribute_is_range_discarded(mgr, >>>>>>>> offset, size)) || >>>>>>>> + (!shared_to_private && >>>>>>>> memory_attribute_is_range_populated(mgr, >>>>>>>> offset, size))) { >>>>>>>> + return 0; >>>>>>>> + } >>>>>>>> + >>>>>>>> + if (shared_to_private) { >>>>>>>> + memory_attribute_notify_discard(mgr, offset, size); >>>>>>>> + } else { >>>>>>>> + ret = memory_attribute_notify_populate(mgr, offset, size); >>>>>>>> + } >>>>>>>> + >>>>>>>> + if (!ret) { >>>>>>>> + unsigned long first_bit = offset / block_size; >>>>>>>> + unsigned long nbits = size / block_size; >>>>>>>> + >>>>>>>> + g_assert((first_bit + nbits) <= mgr->bitmap_size); >>>>>>>> + >>>>>>>> + if (shared_to_private) { >>>>>>>> + bitmap_clear(mgr->shared_bitmap, first_bit, nbits); >>>>>>>> + } else { >>>>>>>> + bitmap_set(mgr->shared_bitmap, first_bit, nbits); >>>>>>>> + } >>>>>>>> + >>>>>>>> + return 0; >>>>>>>> + } >>>>>>>> + >>>>>>>> + return ret; >>>>>>>> +} >>>>>>>> >>>>>>>> Then, in memory_attribute_notify_populate(), we walk the bitmap >>>>>>>> again. >>>>>>>> >>>>>>>> Why? >>>>>>>> >>>>>>>> We just checked that it's all in the expected state, no? >>>>>>>> >>>>>>>> >>>>>>>> virtio-mem doesn't handle it that way, so I'm curious why we would >>>>>>>> have >>>>>>>> to do it here? >>>>>>> >>>>>>> I was concerned about the case where the guest issues a request that >>>>>>> only partial of the range is in the desired state. 
>>>>>>> I think the main problem is the policy for the guest conversion >>>>>>> request. >>>>>>> My current handling is: >>>>>>> >>>>>>> 1. When a conversion request is made for a range already in the >>>>>>> desired >>>>>>> state, the helper simply returns success. >>>>>> >>>>>> Yes. >>>>>> >>>>>>> 2. For requests involving a range partially in the desired state, only >>>>>>> the necessary segments are converted, ensuring the entire range >>>>>>> complies with the request efficiently. >>>>>> >>>>>> >>>>>> Ah, now I get: >>>>>> >>>>>> + if ((shared_to_private && memory_attribute_is_range_discarded(mgr, >>>>>> offset, size)) || >>>>>> + (!shared_to_private && >>>>>> memory_attribute_is_range_populated(mgr, >>>>>> offset, size))) { >>>>>> + return 0; >>>>>> + } >>>>>> + >>>>>> >>>>>> We're not failing if it might already partially be in the other state. >>>>>> >>>>>>> 3. In scenarios where a conversion request is declined by other >>>>>>> systems, >>>>>>> such as a failure from VFIO during notify_populate(), the >>>>>>> helper will >>>>>>> roll back the request, maintaining consistency. >>>>>>> >>>>>>> And the policy of virtio-mem is to refuse the state change if not all >>>>>>> blocks are in the opposite state. >>>>>> >>>>>> Yes. >>>>>> >>>>>>> >>>>>>> Actually, this part is still a uncertain to me. >>>>>>> >>>>>> >>>>>> IIUC, the problem does not exist if we only convert a single page at a >>>>>> time. >>>>>> >>>>>> Is there a known use case where such partial conversions could happen? >>>>> >>>>> I don't see such case yet. Actually, I'm trying to follow the behavior >>>>> of KVM_SET_MEMORY_ATTRIBUTES ioctl during page conversion. In KVM, it >>>>> doesn't reject the request if the whole range isn't in the opposite >>>>> state. It just uses xa_store() to update it. Also, I don't see the spec >>>>> says how to handle such case. To be robust, I just allow this special >>>>> case. >>>>> >>>>>> >>>>>>> BTW, per the status/bitmap track, the virtio-mem also changes the >>>>>>> bitmap >>>>>>> after the plug/unplug notifier. This is the same, correct? >>>>>> Right. But because we reject these partial requests, we don't have to >>>>>> traverse the bitmap and could just adjust the bitmap operations. >>>>> >>>>> Yes, If we treat it as a guest error/bug, we can adjust it. >>>> >>>> Hi David, do you think which option is better? If prefer to reject the >>>> partial requests, I'll change it in my next version. >>> >>> Hi, >>> >>> still scratching my head. Having to work around it as in this patch here is >>> suboptimal. >>> >>> Could we simplify the whole thing while still allowing for (unexpected) >>> partial >>> conversions? >>> >>> Essentially: If states are mixed, fallback to a "1 block at a time" >>> handling. >>> >>> The only problem is: what to do if we fail halfway through? Well, we can >>> only have >>> such partial completions for "populate", not for discard. >>> >>> Option a) Just leave it as "partially completed populate" and return the >>> error. The >>> bitmap and the notifiers are consistent. >>> >>> Option b) Just discard everything: someone tried to convert something >>> "partial >>> shared" to "shared". So maybe, if anything goes wrong, we can just have >>> "all private". >>> >>> The question is also, what the expectation from the caller is: can the >>> caller >>> even make progress on failure or do we have to retry until it works? >> >> Yes, That's the key problem. >> >> For core mm side conversion, The caller (guest) handles three case: >> success, failure and retry. 
>> guest can continue on failure but will keep
>> the memory in its original attribute and trigger some calltrace. While
>> in QEMU side, it would cause VM stop if kvm_set_memory_attributes() failed.
>>
>> As for the VFIO conversion, at present, we allow it to fail and don't
>> return error code to guest as long as we undo the conversion. It only
>> causes the device not work in guest.
>>
>> I think if we view the attribute mismatch between core mm and IOMMU as a
>> fatal error, we can call VM stop or let guest retry until it converts
>> successfully.
>>
>
> Just think more about the options for the failure case handling
> theoretically as we haven't hit such state_change() failure:
>
> 1. Undo + return invalid error
> Pros: The guest can make progress
> Cons: Complicated undo operations: Option a) is not applicable, because
> it leaves it as partially completed populate, but the guest thinks the
> operation has failed.
> Also need to add the undo for set_memory_attribute() after
> state_change() failed. Maybe also apply the attribute bitmap to
> set_memory_attribute() operation to handle the mixed request case
>
> 2. Undo in VFIO and no undo for set_memory_attribute() + return success
> (Current approach in my series)
> Pros: The guest can make progress although device doesn't work.
> Cons: the attribute bitmap only tracks the status in iommu.

Right, we should avoid that. Bitmap + notifiers should stay in sync.

>
> 3. No undo + return retry
> Pros: It keeps the attribute bitmap aligned in core mm and iommu.
> Cons: The guest doesn't know how to handle the retry. It would cause
> infinite loop.
>
> 4. No undo + no return. Just VM stop.
> Pros: simple
> Cons: maybe overkill.
>
> Maybe option 1 or 4 is better?

Well, we can get proper undo working using a temporary bitmap when
converting to shared and running into this "mixed" scenario.
Something like this on top of your work:


From f36e3916596ed5952f15233eb7747c65a6424949 Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Tue, 25 Feb 2025 09:55:38 +0100
Subject: [PATCH] tmp

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 system/memory-attribute-manager.c | 95 +++++++++++++++++++++----------
 1 file changed, 65 insertions(+), 30 deletions(-)

diff --git a/system/memory-attribute-manager.c b/system/memory-attribute-manager.c
index 17c70cf677..e98e7367c1 100644
--- a/system/memory-attribute-manager.c
+++ b/system/memory-attribute-manager.c
@@ -274,9 +274,7 @@ static void memory_attribute_notify_discard(MemoryAttributeManager *mgr,
         if (!memory_region_section_intersect_range(&tmp, offset, size)) {
             continue;
         }
-
-        memory_attribute_for_each_populated_section(mgr, &tmp, rdl,
-                                                    memory_attribute_notify_discard_cb);
+        rdl->notify_discard(rdl, &tmp);
     }
 }
 
@@ -292,9 +290,7 @@ static int memory_attribute_notify_populate(MemoryAttributeManager *mgr,
         if (!memory_region_section_intersect_range(&tmp, offset, size)) {
             continue;
         }
-
-        ret = memory_attribute_for_each_discarded_section(mgr, &tmp, rdl,
-                                                          memory_attribute_notify_populate_cb);
+        ret = rdl->notify_populate(rdl, &tmp);
         if (ret) {
             break;
         }
@@ -311,9 +307,7 @@ static int memory_attribute_notify_populate(MemoryAttributeManager *mgr,
             if (!memory_region_section_intersect_range(&tmp, offset, size)) {
                 continue;
             }
-
-            memory_attribute_for_each_discarded_section(mgr, &tmp, rdl2,
-                                                        memory_attribute_notify_discard_cb);
+            rdl2->notify_discard(rdl2, &tmp);
         }
     }
     return ret;
@@ -348,7 +342,12 @@ static bool memory_attribute_is_range_discarded(MemoryAttributeManager *mgr,
 static int memory_attribute_state_change(MemoryAttributeManager *mgr, uint64_t offset,
                                          uint64_t size, bool shared_to_private)
 {
-    int block_size = memory_attribute_manager_get_block_size(mgr);
+    const int block_size = memory_attribute_manager_get_block_size(mgr);
+    const unsigned long first_bit = offset / block_size;
+    const unsigned long nbits = size / block_size;
+    const uint64_t end = offset + size;
+    unsigned long bit;
+    uint64_t cur;
     int ret = 0;
 
     if (!memory_attribute_is_valid_range(mgr, offset, size)) {
@@ -357,32 +356,68 @@ static int memory_attribute_state_change(MemoryAttributeManager *mgr, uint64_t o
         return -1;
     }
 
-    if ((shared_to_private && memory_attribute_is_range_discarded(mgr, offset, size)) ||
-        (!shared_to_private && memory_attribute_is_range_populated(mgr, offset, size))) {
-        return 0;
-    }
-
     if (shared_to_private) {
-        memory_attribute_notify_discard(mgr, offset, size);
-    } else {
-        ret = memory_attribute_notify_populate(mgr, offset, size);
-    }
-
-    if (!ret) {
-        unsigned long first_bit = offset / block_size;
-        unsigned long nbits = size / block_size;
-
-        g_assert((first_bit + nbits) <= mgr->bitmap_size);
-
-        if (shared_to_private) {
+        if (memory_attribute_is_range_discarded(mgr, offset, size)) {
+            /* Already private. */
+        } else if (!memory_attribute_is_range_populated(mgr, offset, size)) {
+            /* Unexpected mixture: process individual blocks. */
+            for (cur = offset; cur < end; cur += block_size) {
+                bit = cur / block_size;
+                if (!test_bit(bit, mgr->shared_bitmap))
+                    continue;
+                clear_bit(bit, mgr->shared_bitmap);
+                memory_attribute_notify_discard(mgr, cur, block_size);
+            }
+        } else {
+            /* Completely shared. */
             bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
+            memory_attribute_notify_discard(mgr, offset, size);
+        }
+    } else {
+        if (memory_attribute_is_range_populated(mgr, offset, size)) {
+            /* Already shared. */
+        } else if (!memory_attribute_is_range_discarded(mgr, offset, size)) {
+            /* Unexpected mixture: process individual blocks. */
+            unsigned long *modified_bitmap = bitmap_new(nbits);
+
+            for (cur = offset; cur < end; cur += block_size) {
+                bit = cur / block_size;
+                if (test_bit(bit, mgr->shared_bitmap))
+                    continue;
+                set_bit(bit, mgr->shared_bitmap);
+                ret = memory_attribute_notify_populate(mgr, cur, block_size);
+                if (!ret) {
+                    set_bit(bit - first_bit, modified_bitmap);
+                    continue;
+                }
+                clear_bit(bit, mgr->shared_bitmap);
+                break;
+            }
+            if (ret) {
+                /*
+                 * Very unexpected: something went wrong. Revert to the old
+                 * state, marking only the blocks as private that we converted
+                 * to shared.
+                 */
+                for (cur = offset; cur < end; cur += block_size) {
+                    bit = cur / block_size;
+                    if (!test_bit(bit - first_bit, modified_bitmap))
+                        continue;
+                    assert(test_bit(bit, mgr->shared_bitmap));
+                    clear_bit(bit, mgr->shared_bitmap);
+                    memory_attribute_notify_discard(mgr, offset, block_size);
+                }
+            }
+            g_free(modified_bitmap);
         } else {
+            /* Completely private. */
             bitmap_set(mgr->shared_bitmap, first_bit, nbits);
+            ret = memory_attribute_notify_populate(mgr, offset, size);
+            if (ret) {
+                bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
+            }
         }
-
-        return 0;
     }
-
     return ret;
 }
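To see the undo idea in isolation: the sketch below distills the same pattern outside of QEMU -- convert block by block, record in a scratch structure which blocks were actually flipped, and on failure discard only those. All names here (convert_to_shared, try_populate, do_discard) and the plain bool arrays are hypothetical stand-ins for illustration, not code from the series.

/*
 * Hypothetical, self-contained illustration of the revert-on-partial-failure
 * pattern used in the patch above.
 */
#include <stdbool.h>

#define BLOCKS 64

static bool shared[BLOCKS];             /* stand-in for mgr->shared_bitmap */

static int try_populate(unsigned long blk) { (void)blk; return 0; }
static void do_discard(unsigned long blk)  { (void)blk; }

static int convert_to_shared(unsigned long first, unsigned long nblocks)
{
    bool modified[BLOCKS] = { false };  /* temporary "what did I change" map */
    int ret = 0;

    for (unsigned long i = first; i < first + nblocks; i++) {
        if (shared[i]) {
            continue;                   /* already shared, nothing to do */
        }
        shared[i] = true;
        ret = try_populate(i);
        if (ret) {
            shared[i] = false;          /* this block failed, don't record it */
            break;
        }
        modified[i] = true;
    }

    if (ret) {
        /* Roll back only the blocks this call actually converted. */
        for (unsigned long i = first; i < first + nblocks; i++) {
            if (modified[i]) {
                shared[i] = false;
                do_discard(i);
            }
        }
    }
    return ret;
}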
On 2/27/2025 7:27 PM, David Hildenbrand wrote: > On 27.02.25 04:26, Chenyi Qiang wrote: >> >> >> On 2/26/2025 8:43 PM, Chenyi Qiang wrote: >>> >>> >>> On 2/25/2025 5:41 PM, David Hildenbrand wrote: >>>> On 25.02.25 03:00, Chenyi Qiang wrote: >>>>> >>>>> >>>>> On 2/21/2025 6:04 PM, Chenyi Qiang wrote: >>>>>> >>>>>> >>>>>> On 2/21/2025 4:09 PM, David Hildenbrand wrote: >>>>>>> On 21.02.25 03:25, Chenyi Qiang wrote: >>>>>>>> >>>>>>>> >>>>>>>> On 2/21/2025 3:39 AM, David Hildenbrand wrote: >>>>>>>>> On 20.02.25 17:13, Jean-Philippe Brucker wrote: >>>>>>>>>> For Arm CCA we'd like the guest_memfd discard notifier to call >>>>>>>>>> the >>>>>>>>>> IOMMU >>>>>>>>>> notifiers and create e.g. VFIO mappings. The default VFIO discard >>>>>>>>>> notifier isn't sufficient for CCA because the DMA addresses >>>>>>>>>> need a >>>>>>>>>> translation (even without vIOMMU). >>>>>>>>>> >>>>>>>>>> At the moment: >>>>>>>>>> * guest_memfd_state_change() calls the populate() notifier >>>>>>>>>> * the populate notifier() calls IOMMU notifiers >>>>>>>>>> * the IOMMU notifier handler calls memory_get_xlat_addr() to get >>>>>>>>>> a VA >>>>>>>>>> * it calls ram_discard_manager_is_populated() which fails. >>>>>>>>>> >>>>>>>>>> guest_memfd_state_change() only changes the section's state after >>>>>>>>>> calling the populate() notifier. We can't easily invert the >>>>>>>>>> order of >>>>>>>>>> operation because it uses the old state bitmap to know which >>>>>>>>>> pages need >>>>>>>>>> the populate() notifier. >>>>>>>>> >>>>>>>>> I assume we talk about this code: [1] >>>>>>>>> >>>>>>>>> [1] https://lkml.kernel.org/r/20250217081833.21568-1- >>>>>>>>> chenyi.qiang@intel.com >>>>>>>>> >>>>>>>>> >>>>>>>>> +static int memory_attribute_state_change(MemoryAttributeManager >>>>>>>>> *mgr, >>>>>>>>> uint64_t offset, >>>>>>>>> + uint64_t size, bool >>>>>>>>> shared_to_private) >>>>>>>>> +{ >>>>>>>>> + int block_size = >>>>>>>>> memory_attribute_manager_get_block_size(mgr); >>>>>>>>> + int ret = 0; >>>>>>>>> + >>>>>>>>> + if (!memory_attribute_is_valid_range(mgr, offset, size)) { >>>>>>>>> + error_report("%s, invalid range: offset 0x%lx, size >>>>>>>>> 0x%lx", >>>>>>>>> + __func__, offset, size); >>>>>>>>> + return -1; >>>>>>>>> + } >>>>>>>>> + >>>>>>>>> + if ((shared_to_private && >>>>>>>>> memory_attribute_is_range_discarded(mgr, >>>>>>>>> offset, size)) || >>>>>>>>> + (!shared_to_private && >>>>>>>>> memory_attribute_is_range_populated(mgr, >>>>>>>>> offset, size))) { >>>>>>>>> + return 0; >>>>>>>>> + } >>>>>>>>> + >>>>>>>>> + if (shared_to_private) { >>>>>>>>> + memory_attribute_notify_discard(mgr, offset, size); >>>>>>>>> + } else { >>>>>>>>> + ret = memory_attribute_notify_populate(mgr, offset, >>>>>>>>> size); >>>>>>>>> + } >>>>>>>>> + >>>>>>>>> + if (!ret) { >>>>>>>>> + unsigned long first_bit = offset / block_size; >>>>>>>>> + unsigned long nbits = size / block_size; >>>>>>>>> + >>>>>>>>> + g_assert((first_bit + nbits) <= mgr->bitmap_size); >>>>>>>>> + >>>>>>>>> + if (shared_to_private) { >>>>>>>>> + bitmap_clear(mgr->shared_bitmap, first_bit, nbits); >>>>>>>>> + } else { >>>>>>>>> + bitmap_set(mgr->shared_bitmap, first_bit, nbits); >>>>>>>>> + } >>>>>>>>> + >>>>>>>>> + return 0; >>>>>>>>> + } >>>>>>>>> + >>>>>>>>> + return ret; >>>>>>>>> +} >>>>>>>>> >>>>>>>>> Then, in memory_attribute_notify_populate(), we walk the bitmap >>>>>>>>> again. >>>>>>>>> >>>>>>>>> Why? >>>>>>>>> >>>>>>>>> We just checked that it's all in the expected state, no? 
>>>>>>>>> >>>>>>>>> >>>>>>>>> virtio-mem doesn't handle it that way, so I'm curious why we would >>>>>>>>> have >>>>>>>>> to do it here? >>>>>>>> >>>>>>>> I was concerned about the case where the guest issues a request >>>>>>>> that >>>>>>>> only partial of the range is in the desired state. >>>>>>>> I think the main problem is the policy for the guest conversion >>>>>>>> request. >>>>>>>> My current handling is: >>>>>>>> >>>>>>>> 1. When a conversion request is made for a range already in the >>>>>>>> desired >>>>>>>> state, the helper simply returns success. >>>>>>> >>>>>>> Yes. >>>>>>> >>>>>>>> 2. For requests involving a range partially in the desired >>>>>>>> state, only >>>>>>>> the necessary segments are converted, ensuring the entire >>>>>>>> range >>>>>>>> complies with the request efficiently. >>>>>>> >>>>>>> >>>>>>> Ah, now I get: >>>>>>> >>>>>>> + if ((shared_to_private && >>>>>>> memory_attribute_is_range_discarded(mgr, >>>>>>> offset, size)) || >>>>>>> + (!shared_to_private && >>>>>>> memory_attribute_is_range_populated(mgr, >>>>>>> offset, size))) { >>>>>>> + return 0; >>>>>>> + } >>>>>>> + >>>>>>> >>>>>>> We're not failing if it might already partially be in the other >>>>>>> state. >>>>>>> >>>>>>>> 3. In scenarios where a conversion request is declined by other >>>>>>>> systems, >>>>>>>> such as a failure from VFIO during notify_populate(), the >>>>>>>> helper will >>>>>>>> roll back the request, maintaining consistency. >>>>>>>> >>>>>>>> And the policy of virtio-mem is to refuse the state change if >>>>>>>> not all >>>>>>>> blocks are in the opposite state. >>>>>>> >>>>>>> Yes. >>>>>>> >>>>>>>> >>>>>>>> Actually, this part is still a uncertain to me. >>>>>>>> >>>>>>> >>>>>>> IIUC, the problem does not exist if we only convert a single page >>>>>>> at a >>>>>>> time. >>>>>>> >>>>>>> Is there a known use case where such partial conversions could >>>>>>> happen? >>>>>> >>>>>> I don't see such case yet. Actually, I'm trying to follow the >>>>>> behavior >>>>>> of KVM_SET_MEMORY_ATTRIBUTES ioctl during page conversion. In KVM, it >>>>>> doesn't reject the request if the whole range isn't in the opposite >>>>>> state. It just uses xa_store() to update it. Also, I don't see the >>>>>> spec >>>>>> says how to handle such case. To be robust, I just allow this special >>>>>> case. >>>>>> >>>>>>> >>>>>>>> BTW, per the status/bitmap track, the virtio-mem also changes the >>>>>>>> bitmap >>>>>>>> after the plug/unplug notifier. This is the same, correct? >>>>>>> Right. But because we reject these partial requests, we don't >>>>>>> have to >>>>>>> traverse the bitmap and could just adjust the bitmap operations. >>>>>> >>>>>> Yes, If we treat it as a guest error/bug, we can adjust it. >>>>> >>>>> Hi David, do you think which option is better? If prefer to reject the >>>>> partial requests, I'll change it in my next version. >>>> >>>> Hi, >>>> >>>> still scratching my head. Having to work around it as in this patch >>>> here is >>>> suboptimal. >>>> >>>> Could we simplify the whole thing while still allowing for (unexpected) >>>> partial >>>> conversions? >>>> >>>> Essentially: If states are mixed, fallback to a "1 block at a time" >>>> handling. >>>> >>>> The only problem is: what to do if we fail halfway through? Well, we >>>> can >>>> only have >>>> such partial completions for "populate", not for discard. >>>> >>>> Option a) Just leave it as "partially completed populate" and return >>>> the >>>> error. The >>>> bitmap and the notifiers are consistent. 
>>>> >>>> Option b) Just discard everything: someone tried to convert something >>>> "partial >>>> shared" to "shared". So maybe, if anything goes wrong, we can just have >>>> "all private". >>>> >>>> The question is also, what the expectation from the caller is: can the >>>> caller >>>> even make progress on failure or do we have to retry until it works? >>> >>> Yes, That's the key problem. >>> >>> For core mm side conversion, The caller (guest) handles three case: >>> success, failure and retry. guest can continue on failure but will keep >>> the memory in its original attribute and trigger some calltrace. While >>> in QEMU side, it would cause VM stop if kvm_set_memory_attributes() >>> failed. >>> >>> As for the VFIO conversion, at present, we allow it to fail and don't >>> return error code to guest as long as we undo the conversion. It only >>> causes the device not work in guest. >>> >>> I think if we view the attribute mismatch between core mm and IOMMU as a >>> fatal error, we can call VM stop or let guest retry until it converts >>> successfully. >>> >> >> Just think more about the options for the failure case handling >> theoretically as we haven't hit such state_change() failure: >> >> 1. Undo + return invalid error >> Pros: The guest can make progress >> Cons: Complicated undo operations: Option a) is not appliable, because >> it leaves it as partial completed populate, but the guest thinks the >> operation has failed. >> Also need to add the undo for set_memory_attribute() after >> state_change() failed. Maybe also apply the attribute bitmap to >> set_memory_attribute() operation to handle the mixed request case >> >> 2. Undo in VFIO and no undo for set_memory_attribute() + return success >> (Current approach in my series) >> Pros: The guest can make progress although device doesn't work. >> Cons: the attribute bitmap only tracks the status in iommu. > > Right, we should avoid that. Bitmap + notifiers should stay in sync. > >> >> 3. No undo + return retry >> Pros: It keeps the attribute bitmap aligned in core mm and iommu. >> Cons: The guest doesn't know how to handle the retry. It would cause >> infinite loop. >> >> 4. No undo + no return. Just VM stop. >> Pros: simple >> Cons: maybe overkill. >> >> Maybe option 1 or 4 is better? > > Well, we can proper undo working using a temporary bitmap when > converting to shared and we run into this "mixed" scenario. > > Something like this on top of your work: LGTM. I'll choose option 1 and add the change in my next version. Thanks a lot!
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 9f73b59867..6fcd98fe58 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -116,6 +116,11 @@ typedef enum {
     IOMMU_RO = 1,
     IOMMU_WO = 2,
     IOMMU_RW = 3,
+    /*
+     * Allow mapping a discarded page, because we're in the process of
+     * populating it.
+     */
+    IOMMU_POPULATING = 4,
 } IOMMUAccessFlags;
 
 #define IOMMU_ACCESS_FLAG(r, w) (((r) ? IOMMU_RO : 0) | ((w) ? IOMMU_WO : 0))
diff --git a/system/memory.c b/system/memory.c
index 4c829793a0..8e884f9c15 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -2221,7 +2221,8 @@ bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
          * Disallow that. vmstate priorities make sure any RamDiscardManager
          * were already restored before IOMMUs are restored.
          */
-        if (!ram_discard_manager_is_populated(rdm, &tmp)) {
+        if (!(iotlb->perm & IOMMU_POPULATING) &&
+            !ram_discard_manager_is_populated(rdm, &tmp)) {
             error_setg(errp, "iommu map to discarded memory (e.g., unplugged"
                        " via virtio-mem): %" HWADDR_PRIx "",
                        iotlb->translated_addr);
For Arm CCA we'd like the guest_memfd discard notifier to call the IOMMU
notifiers and create e.g. VFIO mappings. The default VFIO discard
notifier isn't sufficient for CCA because the DMA addresses need a
translation (even without vIOMMU).

At the moment:
* guest_memfd_state_change() calls the populate() notifier
* the populate() notifier calls IOMMU notifiers
* the IOMMU notifier handler calls memory_get_xlat_addr() to get a VA
* it calls ram_discard_manager_is_populated() which fails.

guest_memfd_state_change() only changes the section's state after
calling the populate() notifier. We can't easily invert the order of
operation because it uses the old state bitmap to know which pages need
the populate() notifier.

For now add a flag to the IOMMU notifier to tell memory_get_xlat_addr()
that we're aware of the RAM discard manager state.

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---

Definitely not the prettiest hack, any idea how to do this cleanly?
---
 include/exec/memory.h | 5 +++++
 system/memory.c       | 3 ++-
 2 files changed, 7 insertions(+), 1 deletion(-)
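On the producer side, whatever builds the IOMMU mapping request from the populate path would set the new flag so that memory_get_xlat_addr() accepts the range it is in the middle of populating. A minimal sketch, assuming a hypothetical helper (the name and parameters are made up; only IOMMUTLBEntry and IOMMU_POPULATING come from the patch):

#include "qemu/osdep.h"
#include "exec/memory.h"

/* Build a mapping request for a range that is currently being populated. */
static IOMMUTLBEntry make_populating_entry(AddressSpace *as, hwaddr iova,
                                           hwaddr translated, hwaddr size)
{
    IOMMUTLBEntry entry = {
        .target_as = as,
        .iova = iova,                  /* DMA address seen by the device */
        .translated_addr = translated, /* where it lands in guest RAM */
        .addr_mask = size - 1,         /* size must be a power of two */
        .perm = IOMMU_RW | IOMMU_POPULATING,
    };
    return entry;
}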