[v10,00/10] virtio-iommu: VFIO integration

Message ID 20201008171558.410886-1-jean-philippe@linaro.org

Message

Jean-Philippe Brucker Oct. 8, 2020, 5:15 p.m. UTC
This series adds support for VFIO endpoints to virtio-iommu.

Versions 1 to 9 were posted by Bharat Bhushan, but I am taking over for
now since he doesn't have much time to spend on it. Thanks again Bharat
for the work!

Two major changes since [v9]:

* Don't use per-endpoint page_size_mask properties. Instead keep a global
  page size for the virtio-iommu device, updated when adding a VFIO
  endpoint. Reject hotplug if the page size is incompatible.

* Try to make the MAP/UNMAP paths more efficient, by keeping track of
  memory region within the endpoint structure.

More testing would be appreciated, since I can only test using a software
model as host at the moment. But it does seem to hold well with PCIe
hotplug/unplug, and pass-through to guest userspace, which is no mean
feat.

Note that one page size combination is not supported: host 64kB guest
4kB cannot work, because the host IOMMU driver will automatically pick
64kB pages which prevents mapping at a smaller granule. Supporting this
case would require introducing a page size negotiation mechanism from
the guest all the way to the host IOMMU driver. Possible, but not
planned at the moment.

[v9] https://lore.kernel.org/qemu-devel/20200323084617.1782-1-bbhushan2@marvell.com/

Bharat Bhushan (7):
  virtio-iommu: Add memory notifiers for map/unmap
  virtio-iommu: Call memory notifiers in attach/detach
  virtio-iommu: Add replay() memory region callback
  virtio-iommu: Add notify_flag_changed() memory region callback
  memory: Add interface to set iommu page size mask
  vfio: Set IOMMU page size as per host supported page size
  virtio-iommu: Set supported page size mask

Jean-Philippe Brucker (3):
  virtio-iommu: Fix virtio_iommu_mr()
  virtio-iommu: Store memory region in endpoint struct
  vfio: Don't issue full 2^64 unmap

 include/exec/memory.h    |  26 +++++
 hw/vfio/common.c         |  19 ++++
 hw/virtio/virtio-iommu.c | 204 ++++++++++++++++++++++++++++++++++++++-
 softmmu/memory.c         |  13 +++
 hw/virtio/trace-events   |   5 +
 5 files changed, 265 insertions(+), 2 deletions(-)

Comments

Alex Williamson Oct. 8, 2020, 9:22 p.m. UTC | #1
On Thu,  8 Oct 2020 19:15:56 +0200
Jean-Philippe Brucker <jean-philippe@linaro.org> wrote:

> From: Bharat Bhushan <bbhushan2@marvell.com>
> 
> Set IOMMU supported page size mask same as host Linux supported page
> size mask.
> 
> Signed-off-by: Bharat Bhushan <bbhushan2@marvell.com>
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> ---
>  hw/vfio/common.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 13471ae2943..e66054b02a7 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -636,6 +636,14 @@ static void vfio_listener_region_add(MemoryListener *listener,
>                              int128_get64(llend),
>                              iommu_idx);
>  
> +        ret = memory_region_iommu_set_page_size_mask(giommu->iommu,
> +                                                     container->pgsizes,
> +                                                     &err);
> +        if (ret) {
> +            g_free(giommu);
> +            goto fail;
> +        }
> +
>          ret = memory_region_register_iommu_notifier(section->mr, &giommu->n,
>                                                      &err);
>          if (ret) {

Acked-by: Alex Williamson <alex.williamson@redhat.com>
Eric Auger Oct. 16, 2020, 7:58 a.m. UTC | #2
Hi Jean,

On 10/8/20 7:15 PM, Jean-Philippe Brucker wrote:
> From: Bharat Bhushan <bbhushan2@marvell.com>
> 
> Extend VIRTIO_IOMMU_T_MAP/UNMAP request to notify memory listeners. It
> will call VFIO notifier to map/unmap regions in the physical IOMMU.
> 
> Signed-off-by: Bharat Bhushan <bbhushan2@marvell.com>
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> ---
> v10:
> * Use the flags from IOMMUMemoryRegion
> * Pass virt_start/virt_end rather than size, to avoid dealing with
>   overflow and for consistency.
> ---
>  hw/virtio/virtio-iommu.c | 53 ++++++++++++++++++++++++++++++++++++++++
>  hw/virtio/trace-events   |  2 ++
>  2 files changed, 55 insertions(+)
> 
> diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
> index 33115e82186..fcdf3a819f8 100644
> --- a/hw/virtio/virtio-iommu.c
> +++ b/hw/virtio/virtio-iommu.c
> @@ -125,6 +125,49 @@ static gint interval_cmp(gconstpointer a, gconstpointer b, gpointer user_data)
>      }
>  }
>  
> +static void virtio_iommu_notify_map(IOMMUMemoryRegion *mr, hwaddr virt_start,
> +                                    hwaddr virt_end, hwaddr paddr)
> +{
> +    IOMMUTLBEntry entry;
> +    IOMMUNotifierFlag flags = mr->iommu_notify_flags;
> +
> +    if (!(flags & IOMMU_NOTIFIER_MAP)) {
> +        return;
> +    }
> +
> +    trace_virtio_iommu_notify_map(mr->parent_obj.name, virt_start, virt_end,
> +                                  paddr);
> +
> +    entry.target_as = &address_space_memory;
> +    entry.addr_mask = virt_end - virt_start;
> +    entry.iova = virt_start;
> +    entry.perm = IOMMU_RW;
Logically you should be able to cascade the struct virtio_iommu_req_map
*req flags field instead.
> +    entry.translated_addr = paddr;
> +
> +    memory_region_notify_iommu(mr, 0, entry);
> +}
> +
> +static void virtio_iommu_notify_unmap(IOMMUMemoryRegion *mr, hwaddr virt_start,
> +                                      hwaddr virt_end)
> +{
> +    IOMMUTLBEntry entry;
> +    IOMMUNotifierFlag flags = mr->iommu_notify_flags;
> +
> +    if (!(flags & IOMMU_NOTIFIER_UNMAP)) {
> +        return;
> +    }
> +
> +    trace_virtio_iommu_notify_unmap(mr->parent_obj.name, virt_start, virt_end);
> +
> +    entry.target_as = &address_space_memory;
> +    entry.addr_mask = virt_end - virt_start;
> +    entry.iova = virt_start;
> +    entry.perm = IOMMU_NONE;
> +    entry.translated_addr = 0;
> +
> +    memory_region_notify_iommu(mr, 0, entry);
> +}
> +
>  static void virtio_iommu_detach_endpoint_from_domain(VirtIOIOMMUEndpoint *ep)
>  {
>      if (!ep->domain) {
> @@ -315,6 +358,7 @@ static int virtio_iommu_map(VirtIOIOMMU *s,
>      VirtIOIOMMUDomain *domain;
>      VirtIOIOMMUInterval *interval;
>      VirtIOIOMMUMapping *mapping;
> +    VirtIOIOMMUEndpoint *ep;
>  
>      if (flags & ~VIRTIO_IOMMU_MAP_F_MASK) {
>          return VIRTIO_IOMMU_S_INVAL;
> @@ -344,6 +388,10 @@ static int virtio_iommu_map(VirtIOIOMMU *s,
>  
>      g_tree_insert(domain->mappings, interval, mapping);
>  
> +    QLIST_FOREACH(ep, &domain->endpoint_list, next) {
> +        virtio_iommu_notify_map(ep->iommu_mr, virt_start, virt_end, phys_start);
> +    }
> +
>      return VIRTIO_IOMMU_S_OK;
>  }
>  
> @@ -356,6 +404,7 @@ static int virtio_iommu_unmap(VirtIOIOMMU *s,
>      VirtIOIOMMUMapping *iter_val;
>      VirtIOIOMMUInterval interval, *iter_key;
>      VirtIOIOMMUDomain *domain;
> +    VirtIOIOMMUEndpoint *ep;
>      int ret = VIRTIO_IOMMU_S_OK;
>  
>      trace_virtio_iommu_unmap(domain_id, virt_start, virt_end);
> @@ -373,6 +422,10 @@ static int virtio_iommu_unmap(VirtIOIOMMU *s,
>          uint64_t current_high = iter_key->high;
>  
>          if (interval.low <= current_low && interval.high >= current_high) {
> +            QLIST_FOREACH(ep, &domain->endpoint_list, next) {
> +                virtio_iommu_notify_unmap(ep->iommu_mr, current_low,
> +                                          current_high);
> +            }
>              g_tree_remove(domain->mappings, iter_key);
>              trace_virtio_iommu_unmap_done(domain_id, current_low, current_high);
>          } else {
> diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> index cf1e59de302..65a48555c78 100644
> --- a/hw/virtio/trace-events
> +++ b/hw/virtio/trace-events
> @@ -106,6 +106,8 @@ virtio_iommu_put_domain(uint32_t domain_id) "Free domain=%d"
>  virtio_iommu_translate_out(uint64_t virt_addr, uint64_t phys_addr, uint32_t sid) "0x%"PRIx64" -> 0x%"PRIx64 " for sid=%d"
>  virtio_iommu_report_fault(uint8_t reason, uint32_t flags, uint32_t endpoint, uint64_t addr) "FAULT reason=%d flags=%d endpoint=%d address =0x%"PRIx64
>  virtio_iommu_fill_resv_property(uint32_t devid, uint8_t subtype, uint64_t start, uint64_t end) "dev= %d, type=%d start=0x%"PRIx64" end=0x%"PRIx64
> +virtio_iommu_notify_map(const char *name, uint64_t virt_start, uint64_t virt_end, uint64_t phys_start) "mr=%s virt_start=0x%"PRIx64" virt_end=0x%"PRIx64" phys_start=0x%"PRIx64
> +virtio_iommu_notify_unmap(const char *name, uint64_t virt_start, uint64_t virt_end) "mr=%s virt_start=0x%"PRIx64" virt_end=0x%"PRIx64
>  
>  # virtio-mem.c
>  virtio_mem_send_response(uint16_t type) "type=%" PRIu16
> 
Besides it looks good to me.

Thanks

Eric
Eric Auger Oct. 16, 2020, 1:13 p.m. UTC | #3
Hi Jean,

On 10/8/20 7:15 PM, Jean-Philippe Brucker wrote:
> This series adds support for VFIO endpoints to virtio-iommu.
> 
> Versions 1 to 9 were posted by Bharat Bhushan, but I am taking over for
> now since he doesn't have much time to spend on it. Thanks again Bharat
> for the work!
> 
> Two major changes since [v9]:
> 
> * Don't use per-endpoint page_size_mask properties. Instead keep a global
>   page size for the virtio-iommu device, updated when adding a VFIO
>   endpoint. Reject hotplug if the page size is incompatible.
> 
> * Try to make the MAP/UNMAP paths more efficient, by keeping track of
>   memory region within the endpoint structure.
> 
> More testing would be appreciated, since I can only test using a software
> model as host at the moment. But it does seem to hold well with PCIe
> hotplug/unplug, and pass-through to guest userspace, which is no mean
> feat.

I tested vhost migration and the following configurations:
host 4K - guest 4K: vhost, vfio, vfio hotplug
host 64K - guest 64K: vhost, vfio, vfio hotplug
host 4K - guest 64K: vhost, vfio, vfio hotplug

All those configs worked for me. I haven't noticed any issues with them.

Thanks

Eric
> 
> Note that one page size combination is not supported: host 64kB guest
> 4kB cannot work, because the host IOMMU driver will automatically pick
> 64kB pages which prevents mapping at a smaller granule. Supporting this
> case would require introducing a page size negotiation mechanism from
> the guest all the way to the host IOMMU driver. Possible, but not
> planned at the moment.
> 
> [v9] https://lore.kernel.org/qemu-devel/20200323084617.1782-1-bbhushan2@marvell.com/
> 
> Bharat Bhushan (7):
>   virtio-iommu: Add memory notifiers for map/unmap
>   virtio-iommu: Call memory notifiers in attach/detach
>   virtio-iommu: Add replay() memory region callback
>   virtio-iommu: Add notify_flag_changed() memory region callback
>   memory: Add interface to set iommu page size mask
>   vfio: Set IOMMU page size as per host supported page size
>   virtio-iommu: Set supported page size mask
> 
> Jean-Philippe Brucker (3):
>   virtio-iommu: Fix virtio_iommu_mr()
>   virtio-iommu: Store memory region in endpoint struct
>   vfio: Don't issue full 2^64 unmap
> 
>  include/exec/memory.h    |  26 +++++
>  hw/vfio/common.c         |  19 ++++
>  hw/virtio/virtio-iommu.c | 204 ++++++++++++++++++++++++++++++++++++++-
>  softmmu/memory.c         |  13 +++
>  hw/virtio/trace-events   |   5 +
>  5 files changed, 265 insertions(+), 2 deletions(-)
>
Peter Xu Oct. 19, 2020, 9:35 p.m. UTC | #4
On Thu, Oct 08, 2020 at 07:15:57PM +0200, Jean-Philippe Brucker wrote:
> From: Bharat Bhushan <bbhushan2@marvell.com>
> 
> The virtio-iommu device can deal with arbitrary page sizes for virtual
> endpoints, but for endpoints assigned with VFIO it must follow the page
> granule used by the host IOMMU driver.
> 
> Implement the interface to set the vIOMMU page size mask, called by VFIO
> for each endpoint. We assume that all host IOMMU drivers use the same
> page granule (the host page granule). Override the page_size_mask field
> in the virtio config space.
> 
> Signed-off-by: Bharat Bhushan <bbhushan2@marvell.com>
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> ---
> v10: Use global page mask, allowing VFIO to override it until boot.
> ---
>  hw/virtio/virtio-iommu.c | 51 ++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 51 insertions(+)
> 
> diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
> index 8823bfc804a..dd0b3093d1b 100644
> --- a/hw/virtio/virtio-iommu.c
> +++ b/hw/virtio/virtio-iommu.c
> @@ -914,6 +914,56 @@ static int virtio_iommu_notify_flag_changed(IOMMUMemoryRegion *iommu_mr,
>      return 0;
>  }
>  
> +static int virtio_iommu_set_page_size_mask(IOMMUMemoryRegion *mr,
> +                                           uint64_t page_size_mask,
> +                                           Error **errp)
> +{
> +    int new_granule, old_granule;
> +    IOMMUDevice *sdev = container_of(mr, IOMMUDevice, iommu_mr);
> +    VirtIOIOMMU *s = sdev->viommu;
> +
> +    if (!page_size_mask) {
> +        return -1;
> +    }
> +
> +    new_granule = ctz64(page_size_mask);
> +    old_granule = ctz64(s->config.page_size_mask);
> +
> +    /*
> +     * Modifying the page size after machine initialization isn't supported.
> +     * Having a different mask is possible but the guest will use sub-optimal
> +     * block sizes, so warn about it.
> +     */
> +    if (qdev_hotplug) {
> +        if (new_granule != old_granule) {
> +            error_setg(errp,
> +                       "virtio-iommu page mask 0x%"PRIx64
> +                       " is incompatible with mask 0x%"PRIx64,
> +                       s->config.page_size_mask, page_size_mask);
> +            return -1;
> +        } else if (page_size_mask != s->config.page_size_mask) {
> +            warn_report("virtio-iommu page mask 0x%"PRIx64
> +                        " does not match 0x%"PRIx64,
> +                        s->config.page_size_mask, page_size_mask);
> +        }
> +        return 0;
> +    }
> +
> +    /*
> +     * Disallow shrinking the page size. For example if an endpoint only
> +     * supports 64kB pages, we can't globally enable 4kB pages. But that
> +     * shouldn't happen, the host is unlikely to setup differing page granules.
> +     * The other bits are only hints describing optimal block sizes.
> +     */
> +    if (new_granule < old_granule) {
> +        error_setg(errp, "memory region shrinks the virtio-iommu page granule");
> +        return -1;
> +    }

My understanding is that shrink is actually allowed, instead we should forbid
growing of the mask?  For example, initially the old_granule will always points
to the guest page size.  Then as long as the host page size (which new_granule
represents) is smaller than the old_granule, then it seems fine... Or am I wrong?

Another thing, IIUC this function will be majorly called in vfio code when the
container page mask will be passed into it.  If there're multiple vfio
containers that support different host IOMMU page sizes, then IIUC the order of
the call to virtio_iommu_set_page_size_mask() is undefined.  It's probably
related to which "-device vfio-pci,..." parameter is earlier.

To make this simpler, I'm thinking whether we should just forbid the case where
devices have different iommu page sizes.  So when assigned devices are used, we
make sure all host iommu page sizes are the same, and the value should be
smaller than guest page size.  Otherwise we'll simply fall back to guest psize.

Thanks,

> +
> +    s->config.page_size_mask = page_size_mask;
> +    return 0;
> +}
> +
>  static void virtio_iommu_device_realize(DeviceState *dev, Error **errp)
>  {
>      VirtIODevice *vdev = VIRTIO_DEVICE(dev);
> @@ -1146,6 +1196,7 @@ static void virtio_iommu_memory_region_class_init(ObjectClass *klass,
>      imrc->translate = virtio_iommu_translate;
>      imrc->replay = virtio_iommu_replay;
>      imrc->notify_flag_changed = virtio_iommu_notify_flag_changed;
> +    imrc->iommu_set_page_size_mask = virtio_iommu_set_page_size_mask;
>  }
>  
>  static const TypeInfo virtio_iommu_info = {
> -- 
> 2.28.0
>
Jean-Philippe Brucker Oct. 22, 2020, 4:39 p.m. UTC | #5
On Mon, Oct 19, 2020 at 05:35:39PM -0400, Peter Xu wrote:
> > +    /*
> > +     * Disallow shrinking the page size. For example if an endpoint only
> > +     * supports 64kB pages, we can't globally enable 4kB pages. But that
> > +     * shouldn't happen, the host is unlikely to setup differing page granules.
> > +     * The other bits are only hints describing optimal block sizes.
> > +     */
> > +    if (new_granule < old_granule) {
> > +        error_setg(errp, "memory region shrinks the virtio-iommu page granule");
> > +        return -1;
> > +    }
> 
> My understanding is that shrink is actually allowed, instead we should forbid
> growing of the mask?  For example, initially the old_granule will always points
> to the guest page size.  Then as long as the host page size (which new_granule
> represents) is smaller than the old_granule, then it seems fine... Or am I wrong?

The case I was checking against is two assigned devices with different
page sizes. First one sets a 64kB page size, then the second one shouldn't
be able to shrink it back to 4kB, because the guest would create mappings
not aligned on 64kB, which can't be applied by the pIOMMU of the first
device.

But let's forget this case for now, in practice all assigned devices use
the host page size.

> 
> Another thing, IIUC this function will be majorly called in vfio code when the
> container page mask will be passed into it.  If there're multiple vfio
> containers that support different host IOMMU page sizes, then IIUC the order of
> the call to virtio_iommu_set_page_size_mask() is undefined.  It's probably
> related to which "-device vfio-pci,..." parameter is earlier.
> 
> To make this simpler, I'm thinking whether we should just forbid the case where
> devices have different iommu page sizes.  So when assigned devices are used, we
> make sure all host iommu page sizes are the same, and the value should be
> smaller than guest page size.  Otherwise we'll simply fall back to guest psize.

Mostly agree, I need to simplify this function.

I don't think we care about guest page size, though. Currently our default
mask is TARGET_PAGE_MASK, which is the smallest size supported by vCPUs
(generally 4kB), but it doesn't really mean guest page size, since the
guest can choose a larger granule at runtime. Besides virtio-iommu can in
theory map at byte granule if there isn't any assigned device, so our
default mask could as well be ~0ULL (but doesn't work at the moment, I've
tried).

So what I'd like to do for next version:

* Set qemu_real_host_page_mask as the default page mask, instead of the
  rather arbitrary TARGET_PAGE_MASK. Otherwise we cannot hotplug assigned
  devices on a 64kB host, since TARGET_PAGE_MASK is pretty much always
  4kB.

* Disallow changing the page size. It's simpler and works in
  practice if we default to qemu_real_host_page_mask.

* For non-hotplug devices, allow changing the rest of the mask. For
  hotplug devices, only warn about it.

Thanks,
Jean
Jean-Philippe Brucker Oct. 22, 2020, 4:41 p.m. UTC | #6
On Fri, Oct 16, 2020 at 09:58:28AM +0200, Auger Eric wrote:
> > +static void virtio_iommu_notify_map(IOMMUMemoryRegion *mr, hwaddr virt_start,
> > +                                    hwaddr virt_end, hwaddr paddr)
> > +{
> > +    IOMMUTLBEntry entry;
> > +    IOMMUNotifierFlag flags = mr->iommu_notify_flags;
> > +
> > +    if (!(flags & IOMMU_NOTIFIER_MAP)) {
> > +        return;
> > +    }
> > +
> > +    trace_virtio_iommu_notify_map(mr->parent_obj.name, virt_start, virt_end,
> > +                                  paddr);
> > +
> > +    entry.target_as = &address_space_memory;
> > +    entry.addr_mask = virt_end - virt_start;
> > +    entry.iova = virt_start;
> > +    entry.perm = IOMMU_RW;
> logically you should be able to cascade the struct virtio_iommu_req_map
> *req flags field instead.

Agreed.

I'm also thinking of adding a check for VIRTIO_IOMMU_MAP_F_MMIO, to avoid
going further into the notifier and maybe do the same for unmap.

Thanks,
Jean
Peter Xu Oct. 22, 2020, 8:56 p.m. UTC | #7
On Thu, Oct 22, 2020 at 06:39:37PM +0200, Jean-Philippe Brucker wrote:
> So what I'd like to do for next version:
> 
> * Set qemu_real_host_page_mask as the default page mask, instead of the
>   rather arbitrary TARGET_PAGE_MASK.

Oh, I thought TARGET_PAGE_MASK was intended - kernel commit 39b3b3c9cac1
("iommu/virtio: Reject IOMMU page granule larger than PAGE_SIZE", 2020-03-27)
explicitly introduced a check that virtio-iommu kernel driver will fail
directly if this psize is bigger than PAGE_SIZE in the guest.  So it sounds
reasonable to have the default value as PAGE_SIZE (if it's the same as
TARGET_PAGE_SIZE in QEMU, which seems true?).

For example, I'm thinking whether qemu_real_host_page_mask could be bigger than
PAGE_SIZE in the guest in some environments, then it seems virtio-iommu won't
boot anymore without assigned devices, because that extra check above will
always fail.

>   Otherwise we cannot hotplug assigned
>   devices on a 64kB host, since TARGET_PAGE_MASK is pretty much always
>   4kB.
> 
> * Disallow changing the page size. It's simpler and works in
>   practice if we default to qemu_real_host_page_mask.
> 
> * For non-hotplug devices, allow changing the rest of the mask. For
>   hotplug devices, only warn about it.

Could I ask what's "the rest of the mask"?  On the driver side, I see that
viommu_domain_finalise() will pick the largest supported page size to use, if
so then we seem to be quite restricted on what page size we can use.

I'm also a bit curious about what scenario we plan to support in this initial
version, especially for ARM.  For x86, I think it's probably always 4k
everywhere so it's fairly simple.  Know little on ARM side...

Thanks,
Jean-Philippe Brucker Oct. 23, 2020, 7:48 a.m. UTC | #8
On Thu, Oct 22, 2020 at 04:56:16PM -0400, Peter Xu wrote:
> On Thu, Oct 22, 2020 at 06:39:37PM +0200, Jean-Philippe Brucker wrote:
> > So what I'd like to do for next version:
> > 
> > * Set qemu_real_host_page_mask as the default page mask, instead of the
> >   rather arbitrary TARGET_PAGE_MASK.
> 
> Oh, I thought TARGET_PAGE_MASK was intended - kernel commit 39b3b3c9cac1
> ("iommu/virtio: Reject IOMMU page granule larger than PAGE_SIZE", 2020-03-27)
> explicitly introduced a check that virtio-iommu kernel driver will fail
> directly if this psize is bigger than PAGE_SIZE in the guest.  So it sounds
> reasonable to have the default value as PAGE_SIZE (if it's the same as
> TARGET_PAGE_SIZE in QEMU, which seems true?).
> 
> For example, I'm thinking whether qemu_real_host_page_mask could be bigger than
> PAGE_SIZE in the guest in some environments, then it seems virtio-iommu won't
> boot anymore without assigned devices, because that extra check above will
> always fail.

Right, I missed this problem again. Switching to qemu_real_host_page_mask
is probably not the best idea until we solve the host64k-guest4k problem.

> 
> >   Otherwise we cannot hotplug assigned
> >   devices on a 64kB host, since TARGET_PAGE_MASK is pretty much always
> >   4kB.
> > 
> > * Disallow changing the page size. It's simpler and works in
> >   practice if we default to qemu_real_host_page_mask.
> > 
> > * For non-hotplug devices, allow changing the rest of the mask. For
> >   hotplug devices, only warn about it.
> 
> Could I ask what's "the rest of the mask"?

The LSB in the mask defines the page size. The other bits define which
block sizes are supported, for example 2MB and 1GB blocks with a 4k page
size. These are only for optimization, the upper bits of the mask could
also be all 1s. If the guest aligns its mappings on those block sizes,
then the host can use intermediate levels in the page tables resulting in
fewer IOTLB entries.

> On the driver side, I see that
> viommu_domain_finalise() will pick the largest supported page size to use, if
> so then we seem to be quite restricted on what page size we can use.

In Linux iommu_dma_alloc_remap() tries to allocate blocks based on the
page mask (copied by viommu_domain_finalise() into domain->pgsize_bitmap).

> I'm also a bit curious about what scenario we plan to support in this initial
> version, especially for ARM.  For x86, I think it's probably always 4k
> everywhere so it's fairly simple.  Know little on ARM side...

Arm CPUs and SMMU support 4k, 16k and 64k page sizes. I don't think 16k is
used anywhere but some distributions chose 64k (RHEL, I think?), others
4k, so we need to support both.

Unfortunately as noted above host64k-guest4k is not possible without
adding a negotiation mechanism to virtio-iommu, host VFIO and IOMMU
driver.

Thanks,
Jean
Peter Xu Oct. 23, 2020, 4:47 p.m. UTC | #9
On Fri, Oct 23, 2020 at 09:48:58AM +0200, Jean-Philippe Brucker wrote:
> Arm CPUs and SMMU support 4k, 16k and 64k page sizes. I don't think 16k is
> used anywhere but some distributions chose 64k (RHEL, I think?), others
> 4k, so we need to support both.
> 
> Unfortunately as noted above host64k-guest4k is not possible without
> adding a negotiation mechanism to virtio-iommu, host VFIO and IOMMU
> driver.

I see.  Then it seems we would still need to support host4k-guest64k.

Maybe for assigned case, we can simply AND all the psize_masks of all the vfio
containers that supported to replace the default psize mask (TARGET_PAGE_SIZE)
without caring about whether it's shrinking or not?  Note that current patch
only update config.psize_mask to the new one, but I think we need to calculate
the subset of all containers rather than a simply update.  Then with the help
of 39b3b3c9cac1 imho we'll gracefully fail the probe if the psize is not
suitable anyway, e.g., host64k-guest4k.

Thanks,
Jean-Philippe Brucker Oct. 27, 2020, 5:38 p.m. UTC | #10
On Fri, Oct 23, 2020 at 12:47:02PM -0400, Peter Xu wrote:
> On Fri, Oct 23, 2020 at 09:48:58AM +0200, Jean-Philippe Brucker wrote:
> > Arm CPUs and SMMU support 4k, 16k and 64k page sizes. I don't think 16k is
> > used anywhere but some distributions chose 64k (RHEL, I think?), others
> > 4k, so we need to support both.
> > 
> > Unfortunately as noted above host64k-guest4k is not possible without
> > adding a negotiation mechanism to virtio-iommu, host VFIO and IOMMU
> > driver.
> 
> I see.  Then it seems we would still need to support host4k-guest64k.
> 
> Maybe for assigned case, we can simply AND all the psize_masks of all the vfio
> containers that supported to replace the default psize mask (TARGET_PAGE_SIZE)
> without caring about whether it's shrinking or not?  Note that current patch
> only update config.psize_mask to the new one, but I think we need to calculate
> the subset of all containers rather than a simply update.

Yes I think an AND is the right operation. We'll return an error if the
resulting mask is 0. Then for hotplug, I think I'll keep the current "best
effort" code from this patch. If necessary we could later add a parameter
to set a default mask and guarantee hotplug success.

Thanks,
Jean

> Then with the help
> of 39b3b3c9cac1 imho we'll gracefully fail the probe if the psize is not
> suitable anyway, e.g., host64k-guest4k.
> 
> Thanks,
> 
> -- 
> Peter Xu
>
Michael S. Tsirkin Oct. 30, 2020, 10:24 a.m. UTC | #11
On Tue, Oct 27, 2020 at 06:38:40PM +0100, Jean-Philippe Brucker wrote:
> On Fri, Oct 23, 2020 at 12:47:02PM -0400, Peter Xu wrote:
> > On Fri, Oct 23, 2020 at 09:48:58AM +0200, Jean-Philippe Brucker wrote:
> > > Arm CPUs and SMMU support 4k, 16k and 64k page sizes. I don't think 16k is
> > > used anywhere but some distributions chose 64k (RHEL, I think?), others
> > > 4k, so we need to support both.
> > > 
> > > Unfortunately as noted above host64k-guest4k is not possible without
> > > adding a negotiation mechanism to virtio-iommu, host VFIO and IOMMU
> > > driver.
> > 
> > I see.  Then it seems we would still need to support host4k-guest64k.
> > 
> > Maybe for assigned case, we can simply AND all the psize_masks of all the vfio
> > containers that supported to replace the default psize mask (TARGET_PAGE_SIZE)
> > without caring about whether it's shrinking or not?  Note that current patch
> > only update config.psize_mask to the new one, but I think we need to calculate
> > the subset of all containers rather than a simply update.
> 
> Yes I think an AND is the right operation. We'll return an error if the
> resulting mask is 0. Then for hotplug, I think I'll keep the current "best
> effort" code from this patch. If necessary we could later add a parameter
> to set a default mask and guarantee hotplug success.
> 
> Thanks,
> Jean

So I should expect a new version with that?

> > Then with the help
> > of 39b3b3c9cac1 imho we'll gracefully fail the probe if the psize is not
> > suitable anyway, e.g., host64k-guest4k.
> > 
> > Thanks,
> > 
> > -- 
> > Peter Xu
> >
Michael S. Tsirkin Oct. 30, 2020, 10:26 a.m. UTC | #12
On Thu, Oct 08, 2020 at 03:22:54PM -0600, Alex Williamson wrote:
> On Thu,  8 Oct 2020 19:15:56 +0200
> Jean-Philippe Brucker <jean-philippe@linaro.org> wrote:
> 
> > From: Bharat Bhushan <bbhushan2@marvell.com>
> > 
> > Set IOMMU supported page size mask same as host Linux supported page
> > size mask.
> > 
> > Signed-off-by: Bharat Bhushan <bbhushan2@marvell.com>
> > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > ---
> >  hw/vfio/common.c | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > index 13471ae2943..e66054b02a7 100644
> > --- a/hw/vfio/common.c
> > +++ b/hw/vfio/common.c
> > @@ -636,6 +636,14 @@ static void vfio_listener_region_add(MemoryListener *listener,
> >                              int128_get64(llend),
> >                              iommu_idx);
> >  
> > +        ret = memory_region_iommu_set_page_size_mask(giommu->iommu,
> > +                                                     container->pgsizes,
> > +                                                     &err);
> > +        if (ret) {
> > +            g_free(giommu);
> > +            goto fail;
> > +        }
> > +
> >          ret = memory_region_register_iommu_notifier(section->mr, &giommu->n,
> >                                                      &err);
> >          if (ret) {
> 
> Acked-by: Alex Williamson <alex.williamson@redhat.com>

This one too, seems independent of the rest of the
patchset and can be merged separately, right?
If so

Acked-by: Michael S. Tsirkin <mst@redhat.com>
Michael S. Tsirkin Oct. 30, 2020, 10:27 a.m. UTC | #13
On Thu, Oct 08, 2020 at 07:15:48PM +0200, Jean-Philippe Brucker wrote:
> This series adds support for VFIO endpoints to virtio-iommu.
> 
> Versions 1 to 9 were posted by Bharat Bhushan, but I am taking over for
> now since he doesn't have much time to spend on it. Thanks again Bharat
> for the work!

ok so just minor things left, correct? Do you plan to post v11?

> Two major changes since [v9]:
> 
> * Don't use per-endpoint page_size_mask properties. Instead keep a global
>   page size for the virtio-iommu device, updated when adding a VFIO
>   endpoint. Reject hotplug if the page size is incompatible.
> 
> * Try to make the MAP/UNMAP paths more efficient, by keeping track of
>   memory region within the endpoint structure.
> 
> More testing would be appreciated, since I can only test using a software
> model as host at the moment. But it does seem to hold well with PCIe
> hotplug/unplug, and pass-through to guest userspace, which is no mean
> feat.
> 
> Note that one page size combination is not supported: host 64kB guest
> 4kB cannot work, because the host IOMMU driver will automatically pick
> 64kB pages which prevents mapping at a smaller granule. Supporting this
> case would require introducing a page size negotiation mechanism from
> the guest all the way to the host IOMMU driver. Possible, but not
> planned at the moment.
> 
> [v9] https://lore.kernel.org/qemu-devel/20200323084617.1782-1-bbhushan2@marvell.com/
> 
> Bharat Bhushan (7):
>   virtio-iommu: Add memory notifiers for map/unmap
>   virtio-iommu: Call memory notifiers in attach/detach
>   virtio-iommu: Add replay() memory region callback
>   virtio-iommu: Add notify_flag_changed() memory region callback
>   memory: Add interface to set iommu page size mask
>   vfio: Set IOMMU page size as per host supported page size
>   virtio-iommu: Set supported page size mask
> 
> Jean-Philippe Brucker (3):
>   virtio-iommu: Fix virtio_iommu_mr()
>   virtio-iommu: Store memory region in endpoint struct
>   vfio: Don't issue full 2^64 unmap
> 
>  include/exec/memory.h    |  26 +++++
>  hw/vfio/common.c         |  19 ++++
>  hw/virtio/virtio-iommu.c | 204 ++++++++++++++++++++++++++++++++++++++-
>  softmmu/memory.c         |  13 +++
>  hw/virtio/trace-events   |   5 +
>  5 files changed, 265 insertions(+), 2 deletions(-)
> 
> -- 
> 2.28.0
Jean-Philippe Brucker Oct. 30, 2020, 10:48 a.m. UTC | #14
On Fri, Oct 30, 2020 at 06:27:35AM -0400, Michael S. Tsirkin wrote:
> On Thu, Oct 08, 2020 at 07:15:48PM +0200, Jean-Philippe Brucker wrote:
> > This series adds support for VFIO endpoints to virtio-iommu.
> > 
> > Versions 1 to 9 were posted by Bharat Bhushan, but I am taking over for
> > now since he doesn't have much time to spend on it. Thanks again Bharat
> > for the work!
> 
> ok so just minor things left, correct? Do you plan to post v11?

Yes, today or early next week

Thanks,
Jean
Jean-Philippe Brucker Oct. 30, 2020, 3:19 p.m. UTC | #15
On Fri, Oct 30, 2020 at 06:26:31AM -0400, Michael S. Tsirkin wrote:
> On Thu, Oct 08, 2020 at 03:22:54PM -0600, Alex Williamson wrote:
> > On Thu,  8 Oct 2020 19:15:56 +0200
> > Jean-Philippe Brucker <jean-philippe@linaro.org> wrote:
> > 
> > > From: Bharat Bhushan <bbhushan2@marvell.com>
> > > 
> > > Set IOMMU supported page size mask same as host Linux supported page
> > > size mask.
> > > 
> > > Signed-off-by: Bharat Bhushan <bbhushan2@marvell.com>
> > > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > > ---
> > >  hw/vfio/common.c | 8 ++++++++
> > >  1 file changed, 8 insertions(+)
> > > 
> > > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > > index 13471ae2943..e66054b02a7 100644
> > > --- a/hw/vfio/common.c
> > > +++ b/hw/vfio/common.c
> > > @@ -636,6 +636,14 @@ static void vfio_listener_region_add(MemoryListener *listener,
> > >                              int128_get64(llend),
> > >                              iommu_idx);
> > >  
> > > +        ret = memory_region_iommu_set_page_size_mask(giommu->iommu,
> > > +                                                     container->pgsizes,
> > > +                                                     &err);
> > > +        if (ret) {
> > > +            g_free(giommu);
> > > +            goto fail;
> > > +        }
> > > +
> > >          ret = memory_region_register_iommu_notifier(section->mr, &giommu->n,
> > >                                                      &err);
> > >          if (ret) {
> > 
> > Acked-by: Alex Williamson <alex.williamson@redhat.com>
> 
> This one too, seems independent of the rest of the
> patchset and can be merged separately, right?

This depends on patch 7 which introduces the function

Thanks,
Jean

> If so
> 
> Acked-by: Michael S. Tsirkin <mst@redhat.com>
>