diff mbox series

[RFC,KERNEL,v2,2/3] xen/pvh: Unmask irq for passthrough device in PVH dom0

Message ID 20231124103123.3263471-3-Jiqian.Chen@amd.com
State New
Headers show
Series Support device passthrough when dom0 is PVH on Xen | expand

Commit Message

Jiqian Chen Nov. 24, 2023, 10:31 a.m. UTC
This patch is to solve two problems we encountered when we try to
passthrough a device to hvm domU base on Xen PVH dom0.

First, hvm guest will alloc a pirq and irq for a passthrough device
by using gsi, before that, the gsi must first has a mapping in dom0,
see Xen code pci_add_dm_done->xc_domain_irq_permission, it will call
into Xen and check whether dom0 has the mapping. See
XEN_DOMCTL_irq_permission->pirq_access_permitted, "current" is PVH
dom0 and it return irq is 0, and then return -EPERM.
This is because the passthrough device doesn't do PHYSDEVOP_map_pirq
when thay are enabled.

Second, in PVH dom0, the gsi of a passthrough device doesn't get
registered, but gsi must be configured for it to be able to be
mapped into a domU.

After searching codes, we can find map_pirq and register_gsi will be
done in function vioapic_write_redirent->vioapic_hwdom_map_gsi when
the gsi(aka ioapic's pin) is unmasked in PVH dom0. So the problems
can be conclude to that the gsi of a passthrough device doesn't be
unmasked.

To solve the unmaske problem, this patch call the unmask_irq when we
assign a device to be passthrough. So that the gsi can get registered
and mapped in PVH dom0.

Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com>
Signed-off-by: Huang Rui <ray.huang@amd.com>
---
 drivers/xen/xen-pciback/pci_stub.c | 7 +++++++
 include/linux/irq.h                | 1 +
 kernel/irq/chip.c                  | 1 +
 kernel/irq/internals.h             | 1 -
 kernel/irq/irqdesc.c               | 2 +-
 5 files changed, 10 insertions(+), 2 deletions(-)

Comments

Roger Pau Monné Nov. 30, 2023, 4:02 p.m. UTC | #1
On Wed, Nov 29, 2023 at 07:53:59PM -0800, Stefano Stabellini wrote:
> On Fri, 24 Nov 2023, Jiqian Chen wrote:
> > This patch is to solve two problems we encountered when we try to
> > passthrough a device to hvm domU base on Xen PVH dom0.
> > 
> > First, hvm guest will alloc a pirq and irq for a passthrough device
> > by using gsi, before that, the gsi must first has a mapping in dom0,
> > see Xen code pci_add_dm_done->xc_domain_irq_permission, it will call
> > into Xen and check whether dom0 has the mapping. See
> > XEN_DOMCTL_irq_permission->pirq_access_permitted, "current" is PVH
> > dom0 and it return irq is 0, and then return -EPERM.
> > This is because the passthrough device doesn't do PHYSDEVOP_map_pirq
> > when thay are enabled.
> > 
> > Second, in PVH dom0, the gsi of a passthrough device doesn't get
> > registered, but gsi must be configured for it to be able to be
> > mapped into a domU.
> > 
> > After searching codes, we can find map_pirq and register_gsi will be
> > done in function vioapic_write_redirent->vioapic_hwdom_map_gsi when
> > the gsi(aka ioapic's pin) is unmasked in PVH dom0. So the problems
> > can be conclude to that the gsi of a passthrough device doesn't be
> > unmasked.
> > 
> > To solve the unmaske problem, this patch call the unmask_irq when we
> > assign a device to be passthrough. So that the gsi can get registered
> > and mapped in PVH dom0.
> 
> 
> Roger, this seems to be more of a Xen issue than a Linux issue. Why do
> we need the unmask check in Xen? Couldn't we just do:
> 
> 
> diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
> index 4e40d3609a..df262a4a18 100644
> --- a/xen/arch/x86/hvm/vioapic.c
> +++ b/xen/arch/x86/hvm/vioapic.c
> @@ -287,7 +287,7 @@ static void vioapic_write_redirent(
>              hvm_dpci_eoi(d, gsi);
>      }
>  
> -    if ( is_hardware_domain(d) && unmasked )
> +    if ( is_hardware_domain(d) )
>      {
>          /*
>           * NB: don't call vioapic_hwdom_map_gsi while holding hvm.irq_lock

There are some issues with this approach.

mp_register_gsi() will only setup the trigger and polarity of the
IO-APIC pin once, so we do so once the guest unmask the pin in order
to assert that the configuration is the intended one.  A guest is
allowed to write all kind of nonsense stuff to the IO-APIC RTE, but
that doesn't take effect unless the pin is unmasked.

Overall the question would be whether we have any guarantees that
the hardware domain has properly configured the pin, even if it's not
using it itself (as it hasn't been unmasked).

IIRC PCI legacy interrupts are level triggered and low polarity, so we
could configure any pins that are not setup at bind time?

Thanks, Roger.
Stefano Stabellini Dec. 1, 2023, 3:15 a.m. UTC | #2
On Thu, 30 Nov 2023, Roger Pau Monné wrote:
> On Wed, Nov 29, 2023 at 07:53:59PM -0800, Stefano Stabellini wrote:
> > On Fri, 24 Nov 2023, Jiqian Chen wrote:
> > > This patch is to solve two problems we encountered when we try to
> > > passthrough a device to hvm domU base on Xen PVH dom0.
> > > 
> > > First, hvm guest will alloc a pirq and irq for a passthrough device
> > > by using gsi, before that, the gsi must first has a mapping in dom0,
> > > see Xen code pci_add_dm_done->xc_domain_irq_permission, it will call
> > > into Xen and check whether dom0 has the mapping. See
> > > XEN_DOMCTL_irq_permission->pirq_access_permitted, "current" is PVH
> > > dom0 and it return irq is 0, and then return -EPERM.
> > > This is because the passthrough device doesn't do PHYSDEVOP_map_pirq
> > > when thay are enabled.
> > > 
> > > Second, in PVH dom0, the gsi of a passthrough device doesn't get
> > > registered, but gsi must be configured for it to be able to be
> > > mapped into a domU.
> > > 
> > > After searching codes, we can find map_pirq and register_gsi will be
> > > done in function vioapic_write_redirent->vioapic_hwdom_map_gsi when
> > > the gsi(aka ioapic's pin) is unmasked in PVH dom0. So the problems
> > > can be conclude to that the gsi of a passthrough device doesn't be
> > > unmasked.
> > > 
> > > To solve the unmaske problem, this patch call the unmask_irq when we
> > > assign a device to be passthrough. So that the gsi can get registered
> > > and mapped in PVH dom0.
> > 
> > 
> > Roger, this seems to be more of a Xen issue than a Linux issue. Why do
> > we need the unmask check in Xen? Couldn't we just do:
> > 
> > 
> > diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
> > index 4e40d3609a..df262a4a18 100644
> > --- a/xen/arch/x86/hvm/vioapic.c
> > +++ b/xen/arch/x86/hvm/vioapic.c
> > @@ -287,7 +287,7 @@ static void vioapic_write_redirent(
> >              hvm_dpci_eoi(d, gsi);
> >      }
> >  
> > -    if ( is_hardware_domain(d) && unmasked )
> > +    if ( is_hardware_domain(d) )
> >      {
> >          /*
> >           * NB: don't call vioapic_hwdom_map_gsi while holding hvm.irq_lock
> 
> There are some issues with this approach.
> 
> mp_register_gsi() will only setup the trigger and polarity of the
> IO-APIC pin once, so we do so once the guest unmask the pin in order
> to assert that the configuration is the intended one.  A guest is
> allowed to write all kind of nonsense stuff to the IO-APIC RTE, but
> that doesn't take effect unless the pin is unmasked.
> 
> Overall the question would be whether we have any guarantees that
> the hardware domain has properly configured the pin, even if it's not
> using it itself (as it hasn't been unmasked).
> 
> IIRC PCI legacy interrupts are level triggered and low polarity, so we
> could configure any pins that are not setup at bind time?

That could work.

Another idea is to move only the call to allocate_and_map_gsi_pirq at
bind time? That might be enough to pass a pirq_access_permitted check.
Roger Pau Monné Dec. 1, 2023, 8:58 a.m. UTC | #3
On Thu, Nov 30, 2023 at 07:15:17PM -0800, Stefano Stabellini wrote:
> On Thu, 30 Nov 2023, Roger Pau Monné wrote:
> > On Wed, Nov 29, 2023 at 07:53:59PM -0800, Stefano Stabellini wrote:
> > > On Fri, 24 Nov 2023, Jiqian Chen wrote:
> > > > This patch is to solve two problems we encountered when we try to
> > > > passthrough a device to hvm domU base on Xen PVH dom0.
> > > > 
> > > > First, hvm guest will alloc a pirq and irq for a passthrough device
> > > > by using gsi, before that, the gsi must first has a mapping in dom0,
> > > > see Xen code pci_add_dm_done->xc_domain_irq_permission, it will call
> > > > into Xen and check whether dom0 has the mapping. See
> > > > XEN_DOMCTL_irq_permission->pirq_access_permitted, "current" is PVH
> > > > dom0 and it return irq is 0, and then return -EPERM.
> > > > This is because the passthrough device doesn't do PHYSDEVOP_map_pirq
> > > > when thay are enabled.
> > > > 
> > > > Second, in PVH dom0, the gsi of a passthrough device doesn't get
> > > > registered, but gsi must be configured for it to be able to be
> > > > mapped into a domU.
> > > > 
> > > > After searching codes, we can find map_pirq and register_gsi will be
> > > > done in function vioapic_write_redirent->vioapic_hwdom_map_gsi when
> > > > the gsi(aka ioapic's pin) is unmasked in PVH dom0. So the problems
> > > > can be conclude to that the gsi of a passthrough device doesn't be
> > > > unmasked.
> > > > 
> > > > To solve the unmaske problem, this patch call the unmask_irq when we
> > > > assign a device to be passthrough. So that the gsi can get registered
> > > > and mapped in PVH dom0.
> > > 
> > > 
> > > Roger, this seems to be more of a Xen issue than a Linux issue. Why do
> > > we need the unmask check in Xen? Couldn't we just do:
> > > 
> > > 
> > > diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
> > > index 4e40d3609a..df262a4a18 100644
> > > --- a/xen/arch/x86/hvm/vioapic.c
> > > +++ b/xen/arch/x86/hvm/vioapic.c
> > > @@ -287,7 +287,7 @@ static void vioapic_write_redirent(
> > >              hvm_dpci_eoi(d, gsi);
> > >      }
> > >  
> > > -    if ( is_hardware_domain(d) && unmasked )
> > > +    if ( is_hardware_domain(d) )
> > >      {
> > >          /*
> > >           * NB: don't call vioapic_hwdom_map_gsi while holding hvm.irq_lock
> > 
> > There are some issues with this approach.
> > 
> > mp_register_gsi() will only setup the trigger and polarity of the
> > IO-APIC pin once, so we do so once the guest unmask the pin in order
> > to assert that the configuration is the intended one.  A guest is
> > allowed to write all kind of nonsense stuff to the IO-APIC RTE, but
> > that doesn't take effect unless the pin is unmasked.
> > 
> > Overall the question would be whether we have any guarantees that
> > the hardware domain has properly configured the pin, even if it's not
> > using it itself (as it hasn't been unmasked).
> > 
> > IIRC PCI legacy interrupts are level triggered and low polarity, so we
> > could configure any pins that are not setup at bind time?
> 
> That could work.
> 
> Another idea is to move only the call to allocate_and_map_gsi_pirq at
> bind time? That might be enough to pass a pirq_access_permitted check.

Maybe, albeit that would change the behavior of XEN_DOMCTL_bind_pt_irq
just for PT_IRQ_TYPE_PCI and only when called from a PVH dom0 (as the
parameter would be a GSI instead of a previously mapped IRQ).  Such
difference just for PT_IRQ_TYPE_PCI is slightly weird - if we go that
route I would recommend that we instead introduce a new dmop that has
this syntax regardless of the domain type it's called from.

Thanks, Roger.
Stefano Stabellini Dec. 2, 2023, 3:37 a.m. UTC | #4
On Fri, 1 Dec 2023, Roger Pau Monné wrote:
> On Thu, Nov 30, 2023 at 07:15:17PM -0800, Stefano Stabellini wrote:
> > On Thu, 30 Nov 2023, Roger Pau Monné wrote:
> > > On Wed, Nov 29, 2023 at 07:53:59PM -0800, Stefano Stabellini wrote:
> > > > On Fri, 24 Nov 2023, Jiqian Chen wrote:
> > > > > This patch is to solve two problems we encountered when we try to
> > > > > passthrough a device to hvm domU base on Xen PVH dom0.
> > > > > 
> > > > > First, hvm guest will alloc a pirq and irq for a passthrough device
> > > > > by using gsi, before that, the gsi must first has a mapping in dom0,
> > > > > see Xen code pci_add_dm_done->xc_domain_irq_permission, it will call
> > > > > into Xen and check whether dom0 has the mapping. See
> > > > > XEN_DOMCTL_irq_permission->pirq_access_permitted, "current" is PVH
> > > > > dom0 and it return irq is 0, and then return -EPERM.
> > > > > This is because the passthrough device doesn't do PHYSDEVOP_map_pirq
> > > > > when thay are enabled.
> > > > > 
> > > > > Second, in PVH dom0, the gsi of a passthrough device doesn't get
> > > > > registered, but gsi must be configured for it to be able to be
> > > > > mapped into a domU.
> > > > > 
> > > > > After searching codes, we can find map_pirq and register_gsi will be
> > > > > done in function vioapic_write_redirent->vioapic_hwdom_map_gsi when
> > > > > the gsi(aka ioapic's pin) is unmasked in PVH dom0. So the problems
> > > > > can be conclude to that the gsi of a passthrough device doesn't be
> > > > > unmasked.
> > > > > 
> > > > > To solve the unmaske problem, this patch call the unmask_irq when we
> > > > > assign a device to be passthrough. So that the gsi can get registered
> > > > > and mapped in PVH dom0.
> > > > 
> > > > 
> > > > Roger, this seems to be more of a Xen issue than a Linux issue. Why do
> > > > we need the unmask check in Xen? Couldn't we just do:
> > > > 
> > > > 
> > > > diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
> > > > index 4e40d3609a..df262a4a18 100644
> > > > --- a/xen/arch/x86/hvm/vioapic.c
> > > > +++ b/xen/arch/x86/hvm/vioapic.c
> > > > @@ -287,7 +287,7 @@ static void vioapic_write_redirent(
> > > >              hvm_dpci_eoi(d, gsi);
> > > >      }
> > > >  
> > > > -    if ( is_hardware_domain(d) && unmasked )
> > > > +    if ( is_hardware_domain(d) )
> > > >      {
> > > >          /*
> > > >           * NB: don't call vioapic_hwdom_map_gsi while holding hvm.irq_lock
> > > 
> > > There are some issues with this approach.
> > > 
> > > mp_register_gsi() will only setup the trigger and polarity of the
> > > IO-APIC pin once, so we do so once the guest unmask the pin in order
> > > to assert that the configuration is the intended one.  A guest is
> > > allowed to write all kind of nonsense stuff to the IO-APIC RTE, but
> > > that doesn't take effect unless the pin is unmasked.
> > > 
> > > Overall the question would be whether we have any guarantees that
> > > the hardware domain has properly configured the pin, even if it's not
> > > using it itself (as it hasn't been unmasked).
> > > 
> > > IIRC PCI legacy interrupts are level triggered and low polarity, so we
> > > could configure any pins that are not setup at bind time?
> > 
> > That could work.
> > 
> > Another idea is to move only the call to allocate_and_map_gsi_pirq at
> > bind time? That might be enough to pass a pirq_access_permitted check.
> 
> Maybe, albeit that would change the behavior of XEN_DOMCTL_bind_pt_irq
> just for PT_IRQ_TYPE_PCI and only when called from a PVH dom0 (as the
> parameter would be a GSI instead of a previously mapped IRQ).  Such
> difference just for PT_IRQ_TYPE_PCI is slightly weird - if we go that
> route I would recommend that we instead introduce a new dmop that has
> this syntax regardless of the domain type it's called from.

Looking at the code it is certainly a bit confusing. My point was that
we don't need to wait until polarity and trigger are set appropriately
to allow Dom0 to pass successfully a pirq_access_permitted() check. Xen
should be able to figure out that Dom0 is permitted pirq access.

So the idea was to move the call to allocate_and_map_gsi_pirq() earlier
somewhere because allocate_and_map_gsi_pirq doesn't require trigger or
polarity to be configured to work. But the suggestion of doing it a
"bind time" (meaning: XEN_DOMCTL_bind_pt_irq) was a bad idea.

But maybe we can find another location, maybe within
xen/arch/x86/hvm/vioapic.c, to call allocate_and_map_gsi_pirq() before
trigger and polarity are set and before the interrupt is unmasked.

Then we change the implementation of vioapic_hwdom_map_gsi to skip the
call to allocate_and_map_gsi_pirq, because by the time
vioapic_hwdom_map_gsi we assume that allocate_and_map_gsi_pirq had
already been done.

I am not familiar with vioapic.c but to give you an idea of what I was
thinking:


diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
index 4e40d3609a..16d56fe851 100644
--- a/xen/arch/x86/hvm/vioapic.c
+++ b/xen/arch/x86/hvm/vioapic.c
@@ -189,14 +189,6 @@ static int vioapic_hwdom_map_gsi(unsigned int gsi, unsigned int trig,
         return ret;
     }
 
-    ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq);
-    if ( ret )
-    {
-        gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n",
-                 gsi, ret);
-        return ret;
-    }
-
     pcidevs_lock();
     ret = pt_irq_create_bind(currd, &pt_irq_bind);
     if ( ret )
@@ -287,6 +279,17 @@ static void vioapic_write_redirent(
             hvm_dpci_eoi(d, gsi);
     }
 
+    if ( is_hardware_domain(d) ) 
+    {
+        int pirq = gsi, ret;
+        ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq);
+        if ( ret )
+        {
+            gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n",
+                    gsi, ret);
+            return ret;
+        }
+    }
     if ( is_hardware_domain(d) && unmasked )
     {
         /*
Thomas Gleixner Dec. 4, 2023, 8:13 a.m. UTC | #5
On Fri, Nov 24 2023 at 18:31, Jiqian Chen wrote:
> diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c
> index 5a96b6c66c07..b83d02bcc76c 100644
> --- a/drivers/xen/xen-pciback/pci_stub.c
> +++ b/drivers/xen/xen-pciback/pci_stub.c
> @@ -357,6 +357,7 @@ static int pcistub_match(struct pci_dev *dev)
>  static int pcistub_init_device(struct pci_dev *dev)
>  {
>  	struct xen_pcibk_dev_data *dev_data;
> +	struct irq_desc *desc = NULL;
>  	int err = 0;
>  
>  	dev_dbg(&dev->dev, "initializing...\n");
> @@ -399,6 +400,12 @@ static int pcistub_init_device(struct pci_dev *dev)
>  	if (err)
>  		goto config_release;
>  
> +	if (xen_initial_domain() && xen_pvh_domain()) {
> +		if (dev->irq <= 0 || !(desc = irq_to_desc(dev->irq)))

Driver code has absolutely no business to access irq_desc.

> +			goto config_release;
> +		unmask_irq(desc);

Or to invoke any internal function.

> --- a/kernel/irq/chip.c
> +++ b/kernel/irq/chip.c
> @@ -439,6 +439,7 @@ void unmask_irq(struct irq_desc *desc)
>  		irq_state_clr_masked(desc);
>  	}
>  }
> +EXPORT_SYMBOL_GPL(unmask_irq);

Not going to happen.

> --- a/kernel/irq/irqdesc.c
> +++ b/kernel/irq/irqdesc.c
> @@ -380,7 +380,7 @@ struct irq_desc *irq_to_desc(unsigned int irq)
>  {
>  	return mtree_load(&sparse_irqs, irq);
>  }
> -#ifdef CONFIG_KVM_BOOK3S_64_HV_MODULE
> +#if defined CONFIG_KVM_BOOK3S_64_HV_MODULE || defined CONFIG_XEN_PVH

Neither that.

This all smells badly like a XEN internal issue and we are not going to
hack around it by exposing interrupt internals.

Thanks,

        tglx
Roger Pau Monné Dec. 4, 2023, 10:28 a.m. UTC | #6
On Fri, Dec 01, 2023 at 07:37:55PM -0800, Stefano Stabellini wrote:
> On Fri, 1 Dec 2023, Roger Pau Monné wrote:
> > On Thu, Nov 30, 2023 at 07:15:17PM -0800, Stefano Stabellini wrote:
> > > On Thu, 30 Nov 2023, Roger Pau Monné wrote:
> > > > On Wed, Nov 29, 2023 at 07:53:59PM -0800, Stefano Stabellini wrote:
> > > > > On Fri, 24 Nov 2023, Jiqian Chen wrote:
> > > > > > This patch is to solve two problems we encountered when we try to
> > > > > > passthrough a device to hvm domU base on Xen PVH dom0.
> > > > > > 
> > > > > > First, hvm guest will alloc a pirq and irq for a passthrough device
> > > > > > by using gsi, before that, the gsi must first has a mapping in dom0,
> > > > > > see Xen code pci_add_dm_done->xc_domain_irq_permission, it will call
> > > > > > into Xen and check whether dom0 has the mapping. See
> > > > > > XEN_DOMCTL_irq_permission->pirq_access_permitted, "current" is PVH
> > > > > > dom0 and it return irq is 0, and then return -EPERM.
> > > > > > This is because the passthrough device doesn't do PHYSDEVOP_map_pirq
> > > > > > when thay are enabled.
> > > > > > 
> > > > > > Second, in PVH dom0, the gsi of a passthrough device doesn't get
> > > > > > registered, but gsi must be configured for it to be able to be
> > > > > > mapped into a domU.
> > > > > > 
> > > > > > After searching codes, we can find map_pirq and register_gsi will be
> > > > > > done in function vioapic_write_redirent->vioapic_hwdom_map_gsi when
> > > > > > the gsi(aka ioapic's pin) is unmasked in PVH dom0. So the problems
> > > > > > can be conclude to that the gsi of a passthrough device doesn't be
> > > > > > unmasked.
> > > > > > 
> > > > > > To solve the unmaske problem, this patch call the unmask_irq when we
> > > > > > assign a device to be passthrough. So that the gsi can get registered
> > > > > > and mapped in PVH dom0.
> > > > > 
> > > > > 
> > > > > Roger, this seems to be more of a Xen issue than a Linux issue. Why do
> > > > > we need the unmask check in Xen? Couldn't we just do:
> > > > > 
> > > > > 
> > > > > diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
> > > > > index 4e40d3609a..df262a4a18 100644
> > > > > --- a/xen/arch/x86/hvm/vioapic.c
> > > > > +++ b/xen/arch/x86/hvm/vioapic.c
> > > > > @@ -287,7 +287,7 @@ static void vioapic_write_redirent(
> > > > >              hvm_dpci_eoi(d, gsi);
> > > > >      }
> > > > >  
> > > > > -    if ( is_hardware_domain(d) && unmasked )
> > > > > +    if ( is_hardware_domain(d) )
> > > > >      {
> > > > >          /*
> > > > >           * NB: don't call vioapic_hwdom_map_gsi while holding hvm.irq_lock
> > > > 
> > > > There are some issues with this approach.
> > > > 
> > > > mp_register_gsi() will only setup the trigger and polarity of the
> > > > IO-APIC pin once, so we do so once the guest unmask the pin in order
> > > > to assert that the configuration is the intended one.  A guest is
> > > > allowed to write all kind of nonsense stuff to the IO-APIC RTE, but
> > > > that doesn't take effect unless the pin is unmasked.
> > > > 
> > > > Overall the question would be whether we have any guarantees that
> > > > the hardware domain has properly configured the pin, even if it's not
> > > > using it itself (as it hasn't been unmasked).
> > > > 
> > > > IIRC PCI legacy interrupts are level triggered and low polarity, so we
> > > > could configure any pins that are not setup at bind time?
> > > 
> > > That could work.
> > > 
> > > Another idea is to move only the call to allocate_and_map_gsi_pirq at
> > > bind time? That might be enough to pass a pirq_access_permitted check.
> > 
> > Maybe, albeit that would change the behavior of XEN_DOMCTL_bind_pt_irq
> > just for PT_IRQ_TYPE_PCI and only when called from a PVH dom0 (as the
> > parameter would be a GSI instead of a previously mapped IRQ).  Such
> > difference just for PT_IRQ_TYPE_PCI is slightly weird - if we go that
> > route I would recommend that we instead introduce a new dmop that has
> > this syntax regardless of the domain type it's called from.
> 
> Looking at the code it is certainly a bit confusing. My point was that
> we don't need to wait until polarity and trigger are set appropriately
> to allow Dom0 to pass successfully a pirq_access_permitted() check. Xen
> should be able to figure out that Dom0 is permitted pirq access.

The logic is certainly not straightforward, and it could benefit from
some comments.

The irq permissions are a bit special, in that they get setup when the
IRQ is mapped.

The problem however is not so much with IRQ permissions, that we can
indeed sort out internally in Xen.  Such check in dom0 has the side
effect of preventing the IRQ from being assigned to a domU without the
hardware source being properly configured AFAICT.

> 
> So the idea was to move the call to allocate_and_map_gsi_pirq() earlier
> somewhere because allocate_and_map_gsi_pirq doesn't require trigger or
> polarity to be configured to work. But the suggestion of doing it a
> "bind time" (meaning: XEN_DOMCTL_bind_pt_irq) was a bad idea.
> 
> But maybe we can find another location, maybe within
> xen/arch/x86/hvm/vioapic.c, to call allocate_and_map_gsi_pirq() before
> trigger and polarity are set and before the interrupt is unmasked.
> 
> Then we change the implementation of vioapic_hwdom_map_gsi to skip the
> call to allocate_and_map_gsi_pirq, because by the time
> vioapic_hwdom_map_gsi we assume that allocate_and_map_gsi_pirq had
> already been done.

But then we would end up in a situation where the
pirq_access_permitted() check will pass, but the IO-APIC pin won't be
configured, which I think it's not what we want.

One option would be to allow mp_register_gsi() to be called multiple
times, and update the IO-APIC pin configuration as long as the pin is
not unmasked.  That would propagate each dom0 RTE update to the
underlying IO-APIC.  However such approach relies on dom0 configuring
all possible IO-APIC pins, even if no device on dom0 is using them, I
think it's not a very reliable option.

Another option would be to modify the toolstack to setup the GSI
itself using the PHYSDEVOP_setup_gsi hypercall.  As said in a previous
email, since we only care about PCI device passthrough the legacy INTx
should always be level triggered and low polarity.

> I am not familiar with vioapic.c but to give you an idea of what I was
> thinking:
> 
> 
> diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
> index 4e40d3609a..16d56fe851 100644
> --- a/xen/arch/x86/hvm/vioapic.c
> +++ b/xen/arch/x86/hvm/vioapic.c
> @@ -189,14 +189,6 @@ static int vioapic_hwdom_map_gsi(unsigned int gsi, unsigned int trig,
>          return ret;
>      }
>  
> -    ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq);
> -    if ( ret )
> -    {
> -        gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n",
> -                 gsi, ret);
> -        return ret;
> -    }
> -
>      pcidevs_lock();
>      ret = pt_irq_create_bind(currd, &pt_irq_bind);
>      if ( ret )
> @@ -287,6 +279,17 @@ static void vioapic_write_redirent(
>              hvm_dpci_eoi(d, gsi);
>      }
>  
> +    if ( is_hardware_domain(d) ) 
> +    {
> +        int pirq = gsi, ret;
> +        ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq);
> +        if ( ret )
> +        {
> +            gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n",
> +                    gsi, ret);
> +            return ret;
> +        }
> +    }
>      if ( is_hardware_domain(d) && unmasked )
>      {
>          /*

As said above, such approach relies on dom0 writing to the IO-APIC RTE
of likely each IO-APIC pin, which is IMO not quite reliable.  In there
are two different issues here that need to be fixed for PVH dom0:

 - Fix the XEN_DOMCTL_irq_permission pirq_access_permitted() call to
   succeed for a PVH dom0, even if dom0 is not using the GSI itself.

 - Configure IO-APIC pins for PCI interrupts even if dom0 is not using
   the IO-APIC pin itself.

First one needs to be fixed internally in Xen, second one will require
the toolstack to issue an extra hypercall in order to ensure the
IO-APIC pin is properly configured.

Thanks, Roger.
Stefano Stabellini Dec. 4, 2023, 10:19 p.m. UTC | #7
On Mon, 4 Dec 2023, Roger Pau Monné wrote:
> On Fri, Dec 01, 2023 at 07:37:55PM -0800, Stefano Stabellini wrote:
> > On Fri, 1 Dec 2023, Roger Pau Monné wrote:
> > > On Thu, Nov 30, 2023 at 07:15:17PM -0800, Stefano Stabellini wrote:
> > > > On Thu, 30 Nov 2023, Roger Pau Monné wrote:
> > > > > On Wed, Nov 29, 2023 at 07:53:59PM -0800, Stefano Stabellini wrote:
> > > > > > On Fri, 24 Nov 2023, Jiqian Chen wrote:
> > > > > > > This patch is to solve two problems we encountered when we try to
> > > > > > > passthrough a device to hvm domU base on Xen PVH dom0.
> > > > > > > 
> > > > > > > First, hvm guest will alloc a pirq and irq for a passthrough device
> > > > > > > by using gsi, before that, the gsi must first has a mapping in dom0,
> > > > > > > see Xen code pci_add_dm_done->xc_domain_irq_permission, it will call
> > > > > > > into Xen and check whether dom0 has the mapping. See
> > > > > > > XEN_DOMCTL_irq_permission->pirq_access_permitted, "current" is PVH
> > > > > > > dom0 and it return irq is 0, and then return -EPERM.
> > > > > > > This is because the passthrough device doesn't do PHYSDEVOP_map_pirq
> > > > > > > when thay are enabled.
> > > > > > > 
> > > > > > > Second, in PVH dom0, the gsi of a passthrough device doesn't get
> > > > > > > registered, but gsi must be configured for it to be able to be
> > > > > > > mapped into a domU.
> > > > > > > 
> > > > > > > After searching codes, we can find map_pirq and register_gsi will be
> > > > > > > done in function vioapic_write_redirent->vioapic_hwdom_map_gsi when
> > > > > > > the gsi(aka ioapic's pin) is unmasked in PVH dom0. So the problems
> > > > > > > can be conclude to that the gsi of a passthrough device doesn't be
> > > > > > > unmasked.
> > > > > > > 
> > > > > > > To solve the unmaske problem, this patch call the unmask_irq when we
> > > > > > > assign a device to be passthrough. So that the gsi can get registered
> > > > > > > and mapped in PVH dom0.
> > > > > > 
> > > > > > 
> > > > > > Roger, this seems to be more of a Xen issue than a Linux issue. Why do
> > > > > > we need the unmask check in Xen? Couldn't we just do:
> > > > > > 
> > > > > > 
> > > > > > diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
> > > > > > index 4e40d3609a..df262a4a18 100644
> > > > > > --- a/xen/arch/x86/hvm/vioapic.c
> > > > > > +++ b/xen/arch/x86/hvm/vioapic.c
> > > > > > @@ -287,7 +287,7 @@ static void vioapic_write_redirent(
> > > > > >              hvm_dpci_eoi(d, gsi);
> > > > > >      }
> > > > > >  
> > > > > > -    if ( is_hardware_domain(d) && unmasked )
> > > > > > +    if ( is_hardware_domain(d) )
> > > > > >      {
> > > > > >          /*
> > > > > >           * NB: don't call vioapic_hwdom_map_gsi while holding hvm.irq_lock
> > > > > 
> > > > > There are some issues with this approach.
> > > > > 
> > > > > mp_register_gsi() will only setup the trigger and polarity of the
> > > > > IO-APIC pin once, so we do so once the guest unmask the pin in order
> > > > > to assert that the configuration is the intended one.  A guest is
> > > > > allowed to write all kind of nonsense stuff to the IO-APIC RTE, but
> > > > > that doesn't take effect unless the pin is unmasked.
> > > > > 
> > > > > Overall the question would be whether we have any guarantees that
> > > > > the hardware domain has properly configured the pin, even if it's not
> > > > > using it itself (as it hasn't been unmasked).
> > > > > 
> > > > > IIRC PCI legacy interrupts are level triggered and low polarity, so we
> > > > > could configure any pins that are not setup at bind time?
> > > > 
> > > > That could work.
> > > > 
> > > > Another idea is to move only the call to allocate_and_map_gsi_pirq at
> > > > bind time? That might be enough to pass a pirq_access_permitted check.
> > > 
> > > Maybe, albeit that would change the behavior of XEN_DOMCTL_bind_pt_irq
> > > just for PT_IRQ_TYPE_PCI and only when called from a PVH dom0 (as the
> > > parameter would be a GSI instead of a previously mapped IRQ).  Such
> > > difference just for PT_IRQ_TYPE_PCI is slightly weird - if we go that
> > > route I would recommend that we instead introduce a new dmop that has
> > > this syntax regardless of the domain type it's called from.
> > 
> > Looking at the code it is certainly a bit confusing. My point was that
> > we don't need to wait until polarity and trigger are set appropriately
> > to allow Dom0 to pass successfully a pirq_access_permitted() check. Xen
> > should be able to figure out that Dom0 is permitted pirq access.
> 
> The logic is certainly not straightforward, and it could benefit from
> some comments.
> 
> The irq permissions are a bit special, in that they get setup when the
> IRQ is mapped.
> 
> The problem however is not so much with IRQ permissions, that we can
> indeed sort out internally in Xen.  Such check in dom0 has the side
> effect of preventing the IRQ from being assigned to a domU without the
> hardware source being properly configured AFAICT.

Now I understand why you made a comment previously about Xen having to
configure trigger and polarity for these interrupts on its own.


> > So the idea was to move the call to allocate_and_map_gsi_pirq() earlier
> > somewhere because allocate_and_map_gsi_pirq doesn't require trigger or
> > polarity to be configured to work. But the suggestion of doing it a
> > "bind time" (meaning: XEN_DOMCTL_bind_pt_irq) was a bad idea.
> > 
> > But maybe we can find another location, maybe within
> > xen/arch/x86/hvm/vioapic.c, to call allocate_and_map_gsi_pirq() before
> > trigger and polarity are set and before the interrupt is unmasked.
> > 
> > Then we change the implementation of vioapic_hwdom_map_gsi to skip the
> > call to allocate_and_map_gsi_pirq, because by the time
> > vioapic_hwdom_map_gsi we assume that allocate_and_map_gsi_pirq had
> > already been done.
> 
> But then we would end up in a situation where the
> pirq_access_permitted() check will pass, but the IO-APIC pin won't be
> configured, which I think it's not what we want.
> 
> One option would be to allow mp_register_gsi() to be called multiple
> times, and update the IO-APIC pin configuration as long as the pin is
> not unmasked.  That would propagate each dom0 RTE update to the
> underlying IO-APIC.  However such approach relies on dom0 configuring
> all possible IO-APIC pins, even if no device on dom0 is using them, I
> think it's not a very reliable option.
> 
> Another option would be to modify the toolstack to setup the GSI
> itself using the PHYSDEVOP_setup_gsi hypercall.  As said in a previous
> email, since we only care about PCI device passthrough the legacy INTx
> should always be level triggered and low polarity.
> 
> > I am not familiar with vioapic.c but to give you an idea of what I was
> > thinking:
> > 
> > 
> > diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
> > index 4e40d3609a..16d56fe851 100644
> > --- a/xen/arch/x86/hvm/vioapic.c
> > +++ b/xen/arch/x86/hvm/vioapic.c
> > @@ -189,14 +189,6 @@ static int vioapic_hwdom_map_gsi(unsigned int gsi, unsigned int trig,
> >          return ret;
> >      }
> >  
> > -    ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq);
> > -    if ( ret )
> > -    {
> > -        gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n",
> > -                 gsi, ret);
> > -        return ret;
> > -    }
> > -
> >      pcidevs_lock();
> >      ret = pt_irq_create_bind(currd, &pt_irq_bind);
> >      if ( ret )
> > @@ -287,6 +279,17 @@ static void vioapic_write_redirent(
> >              hvm_dpci_eoi(d, gsi);
> >      }
> >  
> > +    if ( is_hardware_domain(d) ) 
> > +    {
> > +        int pirq = gsi, ret;
> > +        ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq);
> > +        if ( ret )
> > +        {
> > +            gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n",
> > +                    gsi, ret);
> > +            return ret;
> > +        }
> > +    }
> >      if ( is_hardware_domain(d) && unmasked )
> >      {
> >          /*
> 
> As said above, such approach relies on dom0 writing to the IO-APIC RTE
> of likely each IO-APIC pin, which is IMO not quite reliable.  In there
> are two different issues here that need to be fixed for PVH dom0:
> 
>  - Fix the XEN_DOMCTL_irq_permission pirq_access_permitted() call to
>    succeed for a PVH dom0, even if dom0 is not using the GSI itself.

Yes makes sense


>  - Configure IO-APIC pins for PCI interrupts even if dom0 is not using
>    the IO-APIC pin itself.
> 
> First one needs to be fixed internally in Xen, second one will require
> the toolstack to issue an extra hypercall in order to ensure the
> IO-APIC pin is properly configured.
 
On ARM, Xen doesn't need to wait for dom0 to configure interrupts
correctly. Xen configures them all on its own at boot based on Device
Tree information. I guess it is not possible to do the same on x86? If
not, then I can see why we would need 1 extra toolstack hypercall for
that (or to bundle the operation of configuring IO-APIC pins together
with an existing toolstack hypercall).
Jiqian Chen Dec. 5, 2023, 6:46 a.m. UTC | #8
On 2023/12/4 18:28, Roger Pau Monné wrote:
> On Fri, Dec 01, 2023 at 07:37:55PM -0800, Stefano Stabellini wrote:
>> On Fri, 1 Dec 2023, Roger Pau Monné wrote:
>>> On Thu, Nov 30, 2023 at 07:15:17PM -0800, Stefano Stabellini wrote:
>>>> On Thu, 30 Nov 2023, Roger Pau Monné wrote:
>>>>> On Wed, Nov 29, 2023 at 07:53:59PM -0800, Stefano Stabellini wrote:
>>>>>> On Fri, 24 Nov 2023, Jiqian Chen wrote:
>>>>>>> This patch is to solve two problems we encountered when we try to
>>>>>>> passthrough a device to hvm domU base on Xen PVH dom0.
>>>>>>>
>>>>>>> First, hvm guest will alloc a pirq and irq for a passthrough device
>>>>>>> by using gsi, before that, the gsi must first has a mapping in dom0,
>>>>>>> see Xen code pci_add_dm_done->xc_domain_irq_permission, it will call
>>>>>>> into Xen and check whether dom0 has the mapping. See
>>>>>>> XEN_DOMCTL_irq_permission->pirq_access_permitted, "current" is PVH
>>>>>>> dom0 and it return irq is 0, and then return -EPERM.
>>>>>>> This is because the passthrough device doesn't do PHYSDEVOP_map_pirq
>>>>>>> when thay are enabled.
>>>>>>>
>>>>>>> Second, in PVH dom0, the gsi of a passthrough device doesn't get
>>>>>>> registered, but gsi must be configured for it to be able to be
>>>>>>> mapped into a domU.
>>>>>>>
>>>>>>> After searching codes, we can find map_pirq and register_gsi will be
>>>>>>> done in function vioapic_write_redirent->vioapic_hwdom_map_gsi when
>>>>>>> the gsi(aka ioapic's pin) is unmasked in PVH dom0. So the problems
>>>>>>> can be conclude to that the gsi of a passthrough device doesn't be
>>>>>>> unmasked.
>>>>>>>
>>>>>>> To solve the unmaske problem, this patch call the unmask_irq when we
>>>>>>> assign a device to be passthrough. So that the gsi can get registered
>>>>>>> and mapped in PVH dom0.
>>>>>>
>>>>>>
>>>>>> Roger, this seems to be more of a Xen issue than a Linux issue. Why do
>>>>>> we need the unmask check in Xen? Couldn't we just do:
>>>>>>
>>>>>>
>>>>>> diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
>>>>>> index 4e40d3609a..df262a4a18 100644
>>>>>> --- a/xen/arch/x86/hvm/vioapic.c
>>>>>> +++ b/xen/arch/x86/hvm/vioapic.c
>>>>>> @@ -287,7 +287,7 @@ static void vioapic_write_redirent(
>>>>>>              hvm_dpci_eoi(d, gsi);
>>>>>>      }
>>>>>>  
>>>>>> -    if ( is_hardware_domain(d) && unmasked )
>>>>>> +    if ( is_hardware_domain(d) )
>>>>>>      {
>>>>>>          /*
>>>>>>           * NB: don't call vioapic_hwdom_map_gsi while holding hvm.irq_lock
>>>>>
>>>>> There are some issues with this approach.
>>>>>
>>>>> mp_register_gsi() will only setup the trigger and polarity of the
>>>>> IO-APIC pin once, so we do so once the guest unmask the pin in order
>>>>> to assert that the configuration is the intended one.  A guest is
>>>>> allowed to write all kind of nonsense stuff to the IO-APIC RTE, but
>>>>> that doesn't take effect unless the pin is unmasked.
>>>>>
>>>>> Overall the question would be whether we have any guarantees that
>>>>> the hardware domain has properly configured the pin, even if it's not
>>>>> using it itself (as it hasn't been unmasked).
>>>>>
>>>>> IIRC PCI legacy interrupts are level triggered and low polarity, so we
>>>>> could configure any pins that are not setup at bind time?
>>>>
>>>> That could work.
>>>>
>>>> Another idea is to move only the call to allocate_and_map_gsi_pirq at
>>>> bind time? That might be enough to pass a pirq_access_permitted check.
>>>
>>> Maybe, albeit that would change the behavior of XEN_DOMCTL_bind_pt_irq
>>> just for PT_IRQ_TYPE_PCI and only when called from a PVH dom0 (as the
>>> parameter would be a GSI instead of a previously mapped IRQ).  Such
>>> difference just for PT_IRQ_TYPE_PCI is slightly weird - if we go that
>>> route I would recommend that we instead introduce a new dmop that has
>>> this syntax regardless of the domain type it's called from.
>>
>> Looking at the code it is certainly a bit confusing. My point was that
>> we don't need to wait until polarity and trigger are set appropriately
>> to allow Dom0 to pass successfully a pirq_access_permitted() check. Xen
>> should be able to figure out that Dom0 is permitted pirq access.
> 
> The logic is certainly not straightforward, and it could benefit from
> some comments.
> 
> The irq permissions are a bit special, in that they get setup when the
> IRQ is mapped.
> 
> The problem however is not so much with IRQ permissions, that we can
> indeed sort out internally in Xen.  Such check in dom0 has the side
> effect of preventing the IRQ from being assigned to a domU without the
> hardware source being properly configured AFAICT.
> 
>>
>> So the idea was to move the call to allocate_and_map_gsi_pirq() earlier
>> somewhere because allocate_and_map_gsi_pirq doesn't require trigger or
>> polarity to be configured to work. But the suggestion of doing it a
>> "bind time" (meaning: XEN_DOMCTL_bind_pt_irq) was a bad idea.
>>
>> But maybe we can find another location, maybe within
>> xen/arch/x86/hvm/vioapic.c, to call allocate_and_map_gsi_pirq() before
>> trigger and polarity are set and before the interrupt is unmasked.
>>
>> Then we change the implementation of vioapic_hwdom_map_gsi to skip the
>> call to allocate_and_map_gsi_pirq, because by the time
>> vioapic_hwdom_map_gsi we assume that allocate_and_map_gsi_pirq had
>> already been done.
> 
> But then we would end up in a situation where the
> pirq_access_permitted() check will pass, but the IO-APIC pin won't be
> configured, which I think it's not what we want.
> 
> One option would be to allow mp_register_gsi() to be called multiple
> times, and update the IO-APIC pin configuration as long as the pin is
> not unmasked.  That would propagate each dom0 RTE update to the
> underlying IO-APIC.  However such approach relies on dom0 configuring
> all possible IO-APIC pins, even if no device on dom0 is using them, I
> think it's not a very reliable option.
> 
> Another option would be to modify the toolstack to setup the GSI
> itself using the PHYSDEVOP_setup_gsi hypercall.  As said in a previous
> email, since we only care about PCI device passthrough the legacy INTx
> should always be level triggered and low polarity.

I am thinking if we can do PHYSDEVOP_map_pirq and PHYSDEVOP_setup_gsi only for passthrough devices(in function pcistub_init_device).
Then it can pass permission check and setup gsi without affecting other devices that do not use IO-APIC pins.
What do you think?

> 
>> I am not familiar with vioapic.c but to give you an idea of what I was
>> thinking:
>>
>>
>> diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
>> index 4e40d3609a..16d56fe851 100644
>> --- a/xen/arch/x86/hvm/vioapic.c
>> +++ b/xen/arch/x86/hvm/vioapic.c
>> @@ -189,14 +189,6 @@ static int vioapic_hwdom_map_gsi(unsigned int gsi, unsigned int trig,
>>          return ret;
>>      }
>>  
>> -    ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq);
>> -    if ( ret )
>> -    {
>> -        gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n",
>> -                 gsi, ret);
>> -        return ret;
>> -    }
>> -
>>      pcidevs_lock();
>>      ret = pt_irq_create_bind(currd, &pt_irq_bind);
>>      if ( ret )
>> @@ -287,6 +279,17 @@ static void vioapic_write_redirent(
>>              hvm_dpci_eoi(d, gsi);
>>      }
>>  
>> +    if ( is_hardware_domain(d) ) 
>> +    {
>> +        int pirq = gsi, ret;
>> +        ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq);
>> +        if ( ret )
>> +        {
>> +            gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n",
>> +                    gsi, ret);
>> +            return ret;
>> +        }
>> +    }
>>      if ( is_hardware_domain(d) && unmasked )
>>      {
>>          /*
> 
> As said above, such approach relies on dom0 writing to the IO-APIC RTE
> of likely each IO-APIC pin, which is IMO not quite reliable.  In there
> are two different issues here that need to be fixed for PVH dom0:
> 
>  - Fix the XEN_DOMCTL_irq_permission pirq_access_permitted() call to
>    succeed for a PVH dom0, even if dom0 is not using the GSI itself.
> 
>  - Configure IO-APIC pins for PCI interrupts even if dom0 is not using
>    the IO-APIC pin itself.
> 
> First one needs to be fixed internally in Xen, second one will require
> the toolstack to issue an extra hypercall in order to ensure the
> IO-APIC pin is properly configured.
> 
> Thanks, Roger.
Jiqian Chen Dec. 5, 2023, 7:03 a.m. UTC | #9
Hi Thomas Gleixner,

Thank you for review, and you are right, it seems more like a XEN internal issue. We are discussing it and maybe will fix it in Xen code next version.

On 2023/12/4 16:13, Thomas Gleixner wrote:
> On Fri, Nov 24 2023 at 18:31, Jiqian Chen wrote:
>> diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c
>> index 5a96b6c66c07..b83d02bcc76c 100644
>> --- a/drivers/xen/xen-pciback/pci_stub.c
>> +++ b/drivers/xen/xen-pciback/pci_stub.c
>> @@ -357,6 +357,7 @@ static int pcistub_match(struct pci_dev *dev)
>>  static int pcistub_init_device(struct pci_dev *dev)
>>  {
>>  	struct xen_pcibk_dev_data *dev_data;
>> +	struct irq_desc *desc = NULL;
>>  	int err = 0;
>>  
>>  	dev_dbg(&dev->dev, "initializing...\n");
>> @@ -399,6 +400,12 @@ static int pcistub_init_device(struct pci_dev *dev)
>>  	if (err)
>>  		goto config_release;
>>  
>> +	if (xen_initial_domain() && xen_pvh_domain()) {
>> +		if (dev->irq <= 0 || !(desc = irq_to_desc(dev->irq)))
> 
> Driver code has absolutely no business to access irq_desc.
> 
>> +			goto config_release;
>> +		unmask_irq(desc);
> 
> Or to invoke any internal function.
> 
>> --- a/kernel/irq/chip.c
>> +++ b/kernel/irq/chip.c
>> @@ -439,6 +439,7 @@ void unmask_irq(struct irq_desc *desc)
>>  		irq_state_clr_masked(desc);
>>  	}
>>  }
>> +EXPORT_SYMBOL_GPL(unmask_irq);
> 
> Not going to happen.
> 
>> --- a/kernel/irq/irqdesc.c
>> +++ b/kernel/irq/irqdesc.c
>> @@ -380,7 +380,7 @@ struct irq_desc *irq_to_desc(unsigned int irq)
>>  {
>>  	return mtree_load(&sparse_irqs, irq);
>>  }
>> -#ifdef CONFIG_KVM_BOOK3S_64_HV_MODULE
>> +#if defined CONFIG_KVM_BOOK3S_64_HV_MODULE || defined CONFIG_XEN_PVH
> 
> Neither that.
> 
> This all smells badly like a XEN internal issue and we are not going to
> hack around it by exposing interrupt internals.
> 
> Thanks,
> 
>         tglx
Roger Pau Monné Dec. 5, 2023, 9:19 a.m. UTC | #10
On Mon, Dec 04, 2023 at 02:19:33PM -0800, Stefano Stabellini wrote:
> On Mon, 4 Dec 2023, Roger Pau Monné wrote:
> > On Fri, Dec 01, 2023 at 07:37:55PM -0800, Stefano Stabellini wrote:
> > > On Fri, 1 Dec 2023, Roger Pau Monné wrote:
> > > > On Thu, Nov 30, 2023 at 07:15:17PM -0800, Stefano Stabellini wrote:
> > > > > On Thu, 30 Nov 2023, Roger Pau Monné wrote:
> > > > > > On Wed, Nov 29, 2023 at 07:53:59PM -0800, Stefano Stabellini wrote:
> > > > > > > On Fri, 24 Nov 2023, Jiqian Chen wrote:
> > > > > > > > This patch is to solve two problems we encountered when we try to
> > > > > > > > passthrough a device to hvm domU base on Xen PVH dom0.
> > > > > > > > 
> > > > > > > > First, hvm guest will alloc a pirq and irq for a passthrough device
> > > > > > > > by using gsi, before that, the gsi must first has a mapping in dom0,
> > > > > > > > see Xen code pci_add_dm_done->xc_domain_irq_permission, it will call
> > > > > > > > into Xen and check whether dom0 has the mapping. See
> > > > > > > > XEN_DOMCTL_irq_permission->pirq_access_permitted, "current" is PVH
> > > > > > > > dom0 and it return irq is 0, and then return -EPERM.
> > > > > > > > This is because the passthrough device doesn't do PHYSDEVOP_map_pirq
> > > > > > > > when thay are enabled.
> > > > > > > > 
> > > > > > > > Second, in PVH dom0, the gsi of a passthrough device doesn't get
> > > > > > > > registered, but gsi must be configured for it to be able to be
> > > > > > > > mapped into a domU.
> > > > > > > > 
> > > > > > > > After searching codes, we can find map_pirq and register_gsi will be
> > > > > > > > done in function vioapic_write_redirent->vioapic_hwdom_map_gsi when
> > > > > > > > the gsi(aka ioapic's pin) is unmasked in PVH dom0. So the problems
> > > > > > > > can be conclude to that the gsi of a passthrough device doesn't be
> > > > > > > > unmasked.
> > > > > > > > 
> > > > > > > > To solve the unmaske problem, this patch call the unmask_irq when we
> > > > > > > > assign a device to be passthrough. So that the gsi can get registered
> > > > > > > > and mapped in PVH dom0.
> > > > > > > 
> > > > > > > 
> > > > > > > Roger, this seems to be more of a Xen issue than a Linux issue. Why do
> > > > > > > we need the unmask check in Xen? Couldn't we just do:
> > > > > > > 
> > > > > > > 
> > > > > > > diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
> > > > > > > index 4e40d3609a..df262a4a18 100644
> > > > > > > --- a/xen/arch/x86/hvm/vioapic.c
> > > > > > > +++ b/xen/arch/x86/hvm/vioapic.c
> > > > > > > @@ -287,7 +287,7 @@ static void vioapic_write_redirent(
> > > > > > >              hvm_dpci_eoi(d, gsi);
> > > > > > >      }
> > > > > > >  
> > > > > > > -    if ( is_hardware_domain(d) && unmasked )
> > > > > > > +    if ( is_hardware_domain(d) )
> > > > > > >      {
> > > > > > >          /*
> > > > > > >           * NB: don't call vioapic_hwdom_map_gsi while holding hvm.irq_lock
> > > > > > 
> > > > > > There are some issues with this approach.
> > > > > > 
> > > > > > mp_register_gsi() will only setup the trigger and polarity of the
> > > > > > IO-APIC pin once, so we do so once the guest unmask the pin in order
> > > > > > to assert that the configuration is the intended one.  A guest is
> > > > > > allowed to write all kind of nonsense stuff to the IO-APIC RTE, but
> > > > > > that doesn't take effect unless the pin is unmasked.
> > > > > > 
> > > > > > Overall the question would be whether we have any guarantees that
> > > > > > the hardware domain has properly configured the pin, even if it's not
> > > > > > using it itself (as it hasn't been unmasked).
> > > > > > 
> > > > > > IIRC PCI legacy interrupts are level triggered and low polarity, so we
> > > > > > could configure any pins that are not setup at bind time?
> > > > > 
> > > > > That could work.
> > > > > 
> > > > > Another idea is to move only the call to allocate_and_map_gsi_pirq at
> > > > > bind time? That might be enough to pass a pirq_access_permitted check.
> > > > 
> > > > Maybe, albeit that would change the behavior of XEN_DOMCTL_bind_pt_irq
> > > > just for PT_IRQ_TYPE_PCI and only when called from a PVH dom0 (as the
> > > > parameter would be a GSI instead of a previously mapped IRQ).  Such
> > > > difference just for PT_IRQ_TYPE_PCI is slightly weird - if we go that
> > > > route I would recommend that we instead introduce a new dmop that has
> > > > this syntax regardless of the domain type it's called from.
> > > 
> > > Looking at the code it is certainly a bit confusing. My point was that
> > > we don't need to wait until polarity and trigger are set appropriately
> > > to allow Dom0 to pass successfully a pirq_access_permitted() check. Xen
> > > should be able to figure out that Dom0 is permitted pirq access.
> > 
> > The logic is certainly not straightforward, and it could benefit from
> > some comments.
> > 
> > The irq permissions are a bit special, in that they get setup when the
> > IRQ is mapped.
> > 
> > The problem however is not so much with IRQ permissions, that we can
> > indeed sort out internally in Xen.  Such check in dom0 has the side
> > effect of preventing the IRQ from being assigned to a domU without the
> > hardware source being properly configured AFAICT.
> 
> Now I understand why you made a comment previously about Xen having to
> configure trigger and polarity for these interrupts on its own.
> 
> 
> > > So the idea was to move the call to allocate_and_map_gsi_pirq() earlier
> > > somewhere because allocate_and_map_gsi_pirq doesn't require trigger or
> > > polarity to be configured to work. But the suggestion of doing it a
> > > "bind time" (meaning: XEN_DOMCTL_bind_pt_irq) was a bad idea.
> > > 
> > > But maybe we can find another location, maybe within
> > > xen/arch/x86/hvm/vioapic.c, to call allocate_and_map_gsi_pirq() before
> > > trigger and polarity are set and before the interrupt is unmasked.
> > > 
> > > Then we change the implementation of vioapic_hwdom_map_gsi to skip the
> > > call to allocate_and_map_gsi_pirq, because by the time
> > > vioapic_hwdom_map_gsi we assume that allocate_and_map_gsi_pirq had
> > > already been done.
> > 
> > But then we would end up in a situation where the
> > pirq_access_permitted() check will pass, but the IO-APIC pin won't be
> > configured, which I think it's not what we want.
> > 
> > One option would be to allow mp_register_gsi() to be called multiple
> > times, and update the IO-APIC pin configuration as long as the pin is
> > not unmasked.  That would propagate each dom0 RTE update to the
> > underlying IO-APIC.  However such approach relies on dom0 configuring
> > all possible IO-APIC pins, even if no device on dom0 is using them, I
> > think it's not a very reliable option.
> > 
> > Another option would be to modify the toolstack to setup the GSI
> > itself using the PHYSDEVOP_setup_gsi hypercall.  As said in a previous
> > email, since we only care about PCI device passthrough the legacy INTx
> > should always be level triggered and low polarity.
> > 
> > > I am not familiar with vioapic.c but to give you an idea of what I was
> > > thinking:
> > > 
> > > 
> > > diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
> > > index 4e40d3609a..16d56fe851 100644
> > > --- a/xen/arch/x86/hvm/vioapic.c
> > > +++ b/xen/arch/x86/hvm/vioapic.c
> > > @@ -189,14 +189,6 @@ static int vioapic_hwdom_map_gsi(unsigned int gsi, unsigned int trig,
> > >          return ret;
> > >      }
> > >  
> > > -    ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq);
> > > -    if ( ret )
> > > -    {
> > > -        gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n",
> > > -                 gsi, ret);
> > > -        return ret;
> > > -    }
> > > -
> > >      pcidevs_lock();
> > >      ret = pt_irq_create_bind(currd, &pt_irq_bind);
> > >      if ( ret )
> > > @@ -287,6 +279,17 @@ static void vioapic_write_redirent(
> > >              hvm_dpci_eoi(d, gsi);
> > >      }
> > >  
> > > +    if ( is_hardware_domain(d) ) 
> > > +    {
> > > +        int pirq = gsi, ret;
> > > +        ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq);
> > > +        if ( ret )
> > > +        {
> > > +            gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n",
> > > +                    gsi, ret);
> > > +            return ret;
> > > +        }
> > > +    }
> > >      if ( is_hardware_domain(d) && unmasked )
> > >      {
> > >          /*
> > 
> > As said above, such approach relies on dom0 writing to the IO-APIC RTE
> > of likely each IO-APIC pin, which is IMO not quite reliable.  In there
> > are two different issues here that need to be fixed for PVH dom0:
> > 
> >  - Fix the XEN_DOMCTL_irq_permission pirq_access_permitted() call to
> >    succeed for a PVH dom0, even if dom0 is not using the GSI itself.
> 
> Yes makes sense
> 
> 
> >  - Configure IO-APIC pins for PCI interrupts even if dom0 is not using
> >    the IO-APIC pin itself.
> > 
> > First one needs to be fixed internally in Xen, second one will require
> > the toolstack to issue an extra hypercall in order to ensure the
> > IO-APIC pin is properly configured.
>  
> On ARM, Xen doesn't need to wait for dom0 to configure interrupts
> correctly. Xen configures them all on its own at boot based on Device
> Tree information. I guess it is not possible to do the same on x86?

No, not exactly.  There's some interrupt information in the ACPI MADT,
but that's just for very specific sources (Interrupt Source Override
Structures)

Then on AML devices can have resource descriptors that contain
information about how interrupts are setup.  However Xen is not able
to read any of this information on AML.

Legacy PCI interrupts are (always?) level triggered and low polarity,
because it's assumed that an interrupt source can be shared between
multiple devices.

I'm however not able to find any reference to this in the PCI spec,
hence I'm reluctant to take this for granted in Xen, and default all
GSIs >= 16 to such mode.

OTOH legacy PCI interrupts are not that used anymore, as almost all
devices will support MSI(-X) (because PCIe mandates it) and OSes
should prefer the latter.  SR-IOV VF don't even support legacy PCI
interrupts anymore.

> If
> not, then I can see why we would need 1 extra toolstack hypercall for
> that (or to bundle the operation of configuring IO-APIC pins together
> with an existing toolstack hypercall).

One suitable compromise would be to default unconfigured GSIs >= 16 to
level-triggered and low-polarity, as I would expect that to work in
almost all cases.  We can always introduce the usage of
PHYSDEVOP_setup_gsi later if required.

Maybe Jan has more input here, would you agree to defaulting non-ISA
GSIs to level-triggered, low-polarity in the absence of a specific
setup provided by dom0?

Thanks, Roger.
Jiqian Chen Dec. 5, 2023, 9:39 a.m. UTC | #11
On 2023/12/5 17:19, Roger Pau Monné wrote:
> On Mon, Dec 04, 2023 at 02:19:33PM -0800, Stefano Stabellini wrote:
>> On Mon, 4 Dec 2023, Roger Pau Monné wrote:
>>> On Fri, Dec 01, 2023 at 07:37:55PM -0800, Stefano Stabellini wrote:
>>>> On Fri, 1 Dec 2023, Roger Pau Monné wrote:
>>>>> On Thu, Nov 30, 2023 at 07:15:17PM -0800, Stefano Stabellini wrote:
>>>>>> On Thu, 30 Nov 2023, Roger Pau Monné wrote:
>>>>>>> On Wed, Nov 29, 2023 at 07:53:59PM -0800, Stefano Stabellini wrote:
>>>>>>>> On Fri, 24 Nov 2023, Jiqian Chen wrote:
>>>>>>>>> This patch is to solve two problems we encountered when we try to
>>>>>>>>> passthrough a device to hvm domU base on Xen PVH dom0.
>>>>>>>>>
>>>>>>>>> First, hvm guest will alloc a pirq and irq for a passthrough device
>>>>>>>>> by using gsi, before that, the gsi must first has a mapping in dom0,
>>>>>>>>> see Xen code pci_add_dm_done->xc_domain_irq_permission, it will call
>>>>>>>>> into Xen and check whether dom0 has the mapping. See
>>>>>>>>> XEN_DOMCTL_irq_permission->pirq_access_permitted, "current" is PVH
>>>>>>>>> dom0 and it return irq is 0, and then return -EPERM.
>>>>>>>>> This is because the passthrough device doesn't do PHYSDEVOP_map_pirq
>>>>>>>>> when thay are enabled.
>>>>>>>>>
>>>>>>>>> Second, in PVH dom0, the gsi of a passthrough device doesn't get
>>>>>>>>> registered, but gsi must be configured for it to be able to be
>>>>>>>>> mapped into a domU.
>>>>>>>>>
>>>>>>>>> After searching codes, we can find map_pirq and register_gsi will be
>>>>>>>>> done in function vioapic_write_redirent->vioapic_hwdom_map_gsi when
>>>>>>>>> the gsi(aka ioapic's pin) is unmasked in PVH dom0. So the problems
>>>>>>>>> can be conclude to that the gsi of a passthrough device doesn't be
>>>>>>>>> unmasked.
>>>>>>>>>
>>>>>>>>> To solve the unmaske problem, this patch call the unmask_irq when we
>>>>>>>>> assign a device to be passthrough. So that the gsi can get registered
>>>>>>>>> and mapped in PVH dom0.
>>>>>>>>
>>>>>>>>
>>>>>>>> Roger, this seems to be more of a Xen issue than a Linux issue. Why do
>>>>>>>> we need the unmask check in Xen? Couldn't we just do:
>>>>>>>>
>>>>>>>>
>>>>>>>> diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
>>>>>>>> index 4e40d3609a..df262a4a18 100644
>>>>>>>> --- a/xen/arch/x86/hvm/vioapic.c
>>>>>>>> +++ b/xen/arch/x86/hvm/vioapic.c
>>>>>>>> @@ -287,7 +287,7 @@ static void vioapic_write_redirent(
>>>>>>>>              hvm_dpci_eoi(d, gsi);
>>>>>>>>      }
>>>>>>>>  
>>>>>>>> -    if ( is_hardware_domain(d) && unmasked )
>>>>>>>> +    if ( is_hardware_domain(d) )
>>>>>>>>      {
>>>>>>>>          /*
>>>>>>>>           * NB: don't call vioapic_hwdom_map_gsi while holding hvm.irq_lock
>>>>>>>
>>>>>>> There are some issues with this approach.
>>>>>>>
>>>>>>> mp_register_gsi() will only setup the trigger and polarity of the
>>>>>>> IO-APIC pin once, so we do so once the guest unmask the pin in order
>>>>>>> to assert that the configuration is the intended one.  A guest is
>>>>>>> allowed to write all kind of nonsense stuff to the IO-APIC RTE, but
>>>>>>> that doesn't take effect unless the pin is unmasked.
>>>>>>>
>>>>>>> Overall the question would be whether we have any guarantees that
>>>>>>> the hardware domain has properly configured the pin, even if it's not
>>>>>>> using it itself (as it hasn't been unmasked).
>>>>>>>
>>>>>>> IIRC PCI legacy interrupts are level triggered and low polarity, so we
>>>>>>> could configure any pins that are not setup at bind time?
>>>>>>
>>>>>> That could work.
>>>>>>
>>>>>> Another idea is to move only the call to allocate_and_map_gsi_pirq at
>>>>>> bind time? That might be enough to pass a pirq_access_permitted check.
>>>>>
>>>>> Maybe, albeit that would change the behavior of XEN_DOMCTL_bind_pt_irq
>>>>> just for PT_IRQ_TYPE_PCI and only when called from a PVH dom0 (as the
>>>>> parameter would be a GSI instead of a previously mapped IRQ).  Such
>>>>> difference just for PT_IRQ_TYPE_PCI is slightly weird - if we go that
>>>>> route I would recommend that we instead introduce a new dmop that has
>>>>> this syntax regardless of the domain type it's called from.
>>>>
>>>> Looking at the code it is certainly a bit confusing. My point was that
>>>> we don't need to wait until polarity and trigger are set appropriately
>>>> to allow Dom0 to pass successfully a pirq_access_permitted() check. Xen
>>>> should be able to figure out that Dom0 is permitted pirq access.
>>>
>>> The logic is certainly not straightforward, and it could benefit from
>>> some comments.
>>>
>>> The irq permissions are a bit special, in that they get setup when the
>>> IRQ is mapped.
>>>
>>> The problem however is not so much with IRQ permissions, that we can
>>> indeed sort out internally in Xen.  Such check in dom0 has the side
>>> effect of preventing the IRQ from being assigned to a domU without the
>>> hardware source being properly configured AFAICT.
>>
>> Now I understand why you made a comment previously about Xen having to
>> configure trigger and polarity for these interrupts on its own.
>>
>>
>>>> So the idea was to move the call to allocate_and_map_gsi_pirq() earlier
>>>> somewhere because allocate_and_map_gsi_pirq doesn't require trigger or
>>>> polarity to be configured to work. But the suggestion of doing it a
>>>> "bind time" (meaning: XEN_DOMCTL_bind_pt_irq) was a bad idea.
>>>>
>>>> But maybe we can find another location, maybe within
>>>> xen/arch/x86/hvm/vioapic.c, to call allocate_and_map_gsi_pirq() before
>>>> trigger and polarity are set and before the interrupt is unmasked.
>>>>
>>>> Then we change the implementation of vioapic_hwdom_map_gsi to skip the
>>>> call to allocate_and_map_gsi_pirq, because by the time
>>>> vioapic_hwdom_map_gsi we assume that allocate_and_map_gsi_pirq had
>>>> already been done.
>>>
>>> But then we would end up in a situation where the
>>> pirq_access_permitted() check will pass, but the IO-APIC pin won't be
>>> configured, which I think it's not what we want.
>>>
>>> One option would be to allow mp_register_gsi() to be called multiple
>>> times, and update the IO-APIC pin configuration as long as the pin is
>>> not unmasked.  That would propagate each dom0 RTE update to the
>>> underlying IO-APIC.  However such approach relies on dom0 configuring
>>> all possible IO-APIC pins, even if no device on dom0 is using them, I
>>> think it's not a very reliable option.
>>>
>>> Another option would be to modify the toolstack to setup the GSI
>>> itself using the PHYSDEVOP_setup_gsi hypercall.  As said in a previous
>>> email, since we only care about PCI device passthrough the legacy INTx
>>> should always be level triggered and low polarity.
>>>
>>>> I am not familiar with vioapic.c but to give you an idea of what I was
>>>> thinking:
>>>>
>>>>
>>>> diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
>>>> index 4e40d3609a..16d56fe851 100644
>>>> --- a/xen/arch/x86/hvm/vioapic.c
>>>> +++ b/xen/arch/x86/hvm/vioapic.c
>>>> @@ -189,14 +189,6 @@ static int vioapic_hwdom_map_gsi(unsigned int gsi, unsigned int trig,
>>>>          return ret;
>>>>      }
>>>>  
>>>> -    ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq);
>>>> -    if ( ret )
>>>> -    {
>>>> -        gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n",
>>>> -                 gsi, ret);
>>>> -        return ret;
>>>> -    }
>>>> -
>>>>      pcidevs_lock();
>>>>      ret = pt_irq_create_bind(currd, &pt_irq_bind);
>>>>      if ( ret )
>>>> @@ -287,6 +279,17 @@ static void vioapic_write_redirent(
>>>>              hvm_dpci_eoi(d, gsi);
>>>>      }
>>>>  
>>>> +    if ( is_hardware_domain(d) ) 
>>>> +    {
>>>> +        int pirq = gsi, ret;
>>>> +        ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq);
>>>> +        if ( ret )
>>>> +        {
>>>> +            gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n",
>>>> +                    gsi, ret);
>>>> +            return ret;
>>>> +        }
>>>> +    }
>>>>      if ( is_hardware_domain(d) && unmasked )
>>>>      {
>>>>          /*
>>>
>>> As said above, such approach relies on dom0 writing to the IO-APIC RTE
>>> of likely each IO-APIC pin, which is IMO not quite reliable.  In there
>>> are two different issues here that need to be fixed for PVH dom0:
>>>
>>>  - Fix the XEN_DOMCTL_irq_permission pirq_access_permitted() call to
>>>    succeed for a PVH dom0, even if dom0 is not using the GSI itself.
>>
>> Yes makes sense
>>
>>
>>>  - Configure IO-APIC pins for PCI interrupts even if dom0 is not using
>>>    the IO-APIC pin itself.
>>>
>>> First one needs to be fixed internally in Xen, second one will require
>>> the toolstack to issue an extra hypercall in order to ensure the
>>> IO-APIC pin is properly configured.
>>  
>> On ARM, Xen doesn't need to wait for dom0 to configure interrupts
>> correctly. Xen configures them all on its own at boot based on Device
>> Tree information. I guess it is not possible to do the same on x86?
> 
> No, not exactly.  There's some interrupt information in the ACPI MADT,
> but that's just for very specific sources (Interrupt Source Override
> Structures)
> 
> Then on AML devices can have resource descriptors that contain
> information about how interrupts are setup.  However Xen is not able
> to read any of this information on AML.
> 
> Legacy PCI interrupts are (always?) level triggered and low polarity,
> because it's assumed that an interrupt source can be shared between
> multiple devices.
> 
> I'm however not able to find any reference to this in the PCI spec,
> hence I'm reluctant to take this for granted in Xen, and default all
> GSIs >= 16 to such mode.
> 
> OTOH legacy PCI interrupts are not that used anymore, as almost all
> devices will support MSI(-X) (because PCIe mandates it) and OSes
> should prefer the latter.  SR-IOV VF don't even support legacy PCI
> interrupts anymore.
> 
>> If
>> not, then I can see why we would need 1 extra toolstack hypercall for
>> that (or to bundle the operation of configuring IO-APIC pins together
>> with an existing toolstack hypercall).
> 
> One suitable compromise would be to default unconfigured GSIs >= 16 to
> level-triggered and low-polarity, as I would expect that to work in
> almost all cases.  We can always introduce the usage of
> PHYSDEVOP_setup_gsi later if required.
> 
> Maybe Jan has more input here, would you agree to defaulting non-ISA
> GSIs to level-triggered, low-polarity in the absence of a specific
> setup provided by dom0?
> 
> Thanks, Roger.

No intention to disturb if I am incorrect, just a little input. On dom0 PVH, when it enables devices, it will call acpi_pci_irq_enable, and in that function, its default trigger is level and polarity is low for pci interrupt.
Jan Beulich Dec. 5, 2023, 10:32 a.m. UTC | #12
On 05.12.2023 10:19, Roger Pau Monné wrote:
> On Mon, Dec 04, 2023 at 02:19:33PM -0800, Stefano Stabellini wrote:
>> On Mon, 4 Dec 2023, Roger Pau Monné wrote:
>>> On Fri, Dec 01, 2023 at 07:37:55PM -0800, Stefano Stabellini wrote:
>>>> On Fri, 1 Dec 2023, Roger Pau Monné wrote:
>>>>> On Thu, Nov 30, 2023 at 07:15:17PM -0800, Stefano Stabellini wrote:
>>>>>> On Thu, 30 Nov 2023, Roger Pau Monné wrote:
>>>>>>> On Wed, Nov 29, 2023 at 07:53:59PM -0800, Stefano Stabellini wrote:
>>>>>>>> On Fri, 24 Nov 2023, Jiqian Chen wrote:
>>>>>>>>> This patch is to solve two problems we encountered when we try to
>>>>>>>>> passthrough a device to hvm domU base on Xen PVH dom0.
>>>>>>>>>
>>>>>>>>> First, hvm guest will alloc a pirq and irq for a passthrough device
>>>>>>>>> by using gsi, before that, the gsi must first has a mapping in dom0,
>>>>>>>>> see Xen code pci_add_dm_done->xc_domain_irq_permission, it will call
>>>>>>>>> into Xen and check whether dom0 has the mapping. See
>>>>>>>>> XEN_DOMCTL_irq_permission->pirq_access_permitted, "current" is PVH
>>>>>>>>> dom0 and it return irq is 0, and then return -EPERM.
>>>>>>>>> This is because the passthrough device doesn't do PHYSDEVOP_map_pirq
>>>>>>>>> when thay are enabled.
>>>>>>>>>
>>>>>>>>> Second, in PVH dom0, the gsi of a passthrough device doesn't get
>>>>>>>>> registered, but gsi must be configured for it to be able to be
>>>>>>>>> mapped into a domU.
>>>>>>>>>
>>>>>>>>> After searching codes, we can find map_pirq and register_gsi will be
>>>>>>>>> done in function vioapic_write_redirent->vioapic_hwdom_map_gsi when
>>>>>>>>> the gsi(aka ioapic's pin) is unmasked in PVH dom0. So the problems
>>>>>>>>> can be conclude to that the gsi of a passthrough device doesn't be
>>>>>>>>> unmasked.
>>>>>>>>>
>>>>>>>>> To solve the unmaske problem, this patch call the unmask_irq when we
>>>>>>>>> assign a device to be passthrough. So that the gsi can get registered
>>>>>>>>> and mapped in PVH dom0.
>>>>>>>>
>>>>>>>>
>>>>>>>> Roger, this seems to be more of a Xen issue than a Linux issue. Why do
>>>>>>>> we need the unmask check in Xen? Couldn't we just do:
>>>>>>>>
>>>>>>>>
>>>>>>>> diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
>>>>>>>> index 4e40d3609a..df262a4a18 100644
>>>>>>>> --- a/xen/arch/x86/hvm/vioapic.c
>>>>>>>> +++ b/xen/arch/x86/hvm/vioapic.c
>>>>>>>> @@ -287,7 +287,7 @@ static void vioapic_write_redirent(
>>>>>>>>              hvm_dpci_eoi(d, gsi);
>>>>>>>>      }
>>>>>>>>  
>>>>>>>> -    if ( is_hardware_domain(d) && unmasked )
>>>>>>>> +    if ( is_hardware_domain(d) )
>>>>>>>>      {
>>>>>>>>          /*
>>>>>>>>           * NB: don't call vioapic_hwdom_map_gsi while holding hvm.irq_lock
>>>>>>>
>>>>>>> There are some issues with this approach.
>>>>>>>
>>>>>>> mp_register_gsi() will only setup the trigger and polarity of the
>>>>>>> IO-APIC pin once, so we do so once the guest unmask the pin in order
>>>>>>> to assert that the configuration is the intended one.  A guest is
>>>>>>> allowed to write all kind of nonsense stuff to the IO-APIC RTE, but
>>>>>>> that doesn't take effect unless the pin is unmasked.
>>>>>>>
>>>>>>> Overall the question would be whether we have any guarantees that
>>>>>>> the hardware domain has properly configured the pin, even if it's not
>>>>>>> using it itself (as it hasn't been unmasked).
>>>>>>>
>>>>>>> IIRC PCI legacy interrupts are level triggered and low polarity, so we
>>>>>>> could configure any pins that are not setup at bind time?
>>>>>>
>>>>>> That could work.
>>>>>>
>>>>>> Another idea is to move only the call to allocate_and_map_gsi_pirq at
>>>>>> bind time? That might be enough to pass a pirq_access_permitted check.
>>>>>
>>>>> Maybe, albeit that would change the behavior of XEN_DOMCTL_bind_pt_irq
>>>>> just for PT_IRQ_TYPE_PCI and only when called from a PVH dom0 (as the
>>>>> parameter would be a GSI instead of a previously mapped IRQ).  Such
>>>>> difference just for PT_IRQ_TYPE_PCI is slightly weird - if we go that
>>>>> route I would recommend that we instead introduce a new dmop that has
>>>>> this syntax regardless of the domain type it's called from.
>>>>
>>>> Looking at the code it is certainly a bit confusing. My point was that
>>>> we don't need to wait until polarity and trigger are set appropriately
>>>> to allow Dom0 to pass successfully a pirq_access_permitted() check. Xen
>>>> should be able to figure out that Dom0 is permitted pirq access.
>>>
>>> The logic is certainly not straightforward, and it could benefit from
>>> some comments.
>>>
>>> The irq permissions are a bit special, in that they get setup when the
>>> IRQ is mapped.
>>>
>>> The problem however is not so much with IRQ permissions, that we can
>>> indeed sort out internally in Xen.  Such check in dom0 has the side
>>> effect of preventing the IRQ from being assigned to a domU without the
>>> hardware source being properly configured AFAICT.
>>
>> Now I understand why you made a comment previously about Xen having to
>> configure trigger and polarity for these interrupts on its own.
>>
>>
>>>> So the idea was to move the call to allocate_and_map_gsi_pirq() earlier
>>>> somewhere because allocate_and_map_gsi_pirq doesn't require trigger or
>>>> polarity to be configured to work. But the suggestion of doing it a
>>>> "bind time" (meaning: XEN_DOMCTL_bind_pt_irq) was a bad idea.
>>>>
>>>> But maybe we can find another location, maybe within
>>>> xen/arch/x86/hvm/vioapic.c, to call allocate_and_map_gsi_pirq() before
>>>> trigger and polarity are set and before the interrupt is unmasked.
>>>>
>>>> Then we change the implementation of vioapic_hwdom_map_gsi to skip the
>>>> call to allocate_and_map_gsi_pirq, because by the time
>>>> vioapic_hwdom_map_gsi we assume that allocate_and_map_gsi_pirq had
>>>> already been done.
>>>
>>> But then we would end up in a situation where the
>>> pirq_access_permitted() check will pass, but the IO-APIC pin won't be
>>> configured, which I think it's not what we want.
>>>
>>> One option would be to allow mp_register_gsi() to be called multiple
>>> times, and update the IO-APIC pin configuration as long as the pin is
>>> not unmasked.  That would propagate each dom0 RTE update to the
>>> underlying IO-APIC.  However such approach relies on dom0 configuring
>>> all possible IO-APIC pins, even if no device on dom0 is using them, I
>>> think it's not a very reliable option.
>>>
>>> Another option would be to modify the toolstack to setup the GSI
>>> itself using the PHYSDEVOP_setup_gsi hypercall.  As said in a previous
>>> email, since we only care about PCI device passthrough the legacy INTx
>>> should always be level triggered and low polarity.
>>>
>>>> I am not familiar with vioapic.c but to give you an idea of what I was
>>>> thinking:
>>>>
>>>>
>>>> diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
>>>> index 4e40d3609a..16d56fe851 100644
>>>> --- a/xen/arch/x86/hvm/vioapic.c
>>>> +++ b/xen/arch/x86/hvm/vioapic.c
>>>> @@ -189,14 +189,6 @@ static int vioapic_hwdom_map_gsi(unsigned int gsi, unsigned int trig,
>>>>          return ret;
>>>>      }
>>>>  
>>>> -    ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq);
>>>> -    if ( ret )
>>>> -    {
>>>> -        gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n",
>>>> -                 gsi, ret);
>>>> -        return ret;
>>>> -    }
>>>> -
>>>>      pcidevs_lock();
>>>>      ret = pt_irq_create_bind(currd, &pt_irq_bind);
>>>>      if ( ret )
>>>> @@ -287,6 +279,17 @@ static void vioapic_write_redirent(
>>>>              hvm_dpci_eoi(d, gsi);
>>>>      }
>>>>  
>>>> +    if ( is_hardware_domain(d) ) 
>>>> +    {
>>>> +        int pirq = gsi, ret;
>>>> +        ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq);
>>>> +        if ( ret )
>>>> +        {
>>>> +            gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n",
>>>> +                    gsi, ret);
>>>> +            return ret;
>>>> +        }
>>>> +    }
>>>>      if ( is_hardware_domain(d) && unmasked )
>>>>      {
>>>>          /*
>>>
>>> As said above, such approach relies on dom0 writing to the IO-APIC RTE
>>> of likely each IO-APIC pin, which is IMO not quite reliable.  In there
>>> are two different issues here that need to be fixed for PVH dom0:
>>>
>>>  - Fix the XEN_DOMCTL_irq_permission pirq_access_permitted() call to
>>>    succeed for a PVH dom0, even if dom0 is not using the GSI itself.
>>
>> Yes makes sense
>>
>>
>>>  - Configure IO-APIC pins for PCI interrupts even if dom0 is not using
>>>    the IO-APIC pin itself.
>>>
>>> First one needs to be fixed internally in Xen, second one will require
>>> the toolstack to issue an extra hypercall in order to ensure the
>>> IO-APIC pin is properly configured.
>>  
>> On ARM, Xen doesn't need to wait for dom0 to configure interrupts
>> correctly. Xen configures them all on its own at boot based on Device
>> Tree information. I guess it is not possible to do the same on x86?
> 
> No, not exactly.  There's some interrupt information in the ACPI MADT,
> but that's just for very specific sources (Interrupt Source Override
> Structures)
> 
> Then on AML devices can have resource descriptors that contain
> information about how interrupts are setup.  However Xen is not able
> to read any of this information on AML.
> 
> Legacy PCI interrupts are (always?) level triggered and low polarity,
> because it's assumed that an interrupt source can be shared between
> multiple devices.

Except that as per what you said just in the earlier paragraph ACPI can
tell us otherwise.

> I'm however not able to find any reference to this in the PCI spec,
> hence I'm reluctant to take this for granted in Xen, and default all
> GSIs >= 16 to such mode.
> 
> OTOH legacy PCI interrupts are not that used anymore, as almost all
> devices will support MSI(-X) (because PCIe mandates it) and OSes
> should prefer the latter.  SR-IOV VF don't even support legacy PCI
> interrupts anymore.
> 
>> If
>> not, then I can see why we would need 1 extra toolstack hypercall for
>> that (or to bundle the operation of configuring IO-APIC pins together
>> with an existing toolstack hypercall).
> 
> One suitable compromise would be to default unconfigured GSIs >= 16 to
> level-triggered and low-polarity, as I would expect that to work in
> almost all cases.  We can always introduce the usage of
> PHYSDEVOP_setup_gsi later if required.
> 
> Maybe Jan has more input here, would you agree to defaulting non-ISA
> GSIs to level-triggered, low-polarity in the absence of a specific
> setup provided by dom0?

Well, such defaulting is an option, but in case it's wrong we might
end up with hard to diagnose issues. Personally I'd prefer if we
didn't take shortcuts here, i.e. if we followed what Dom0 is able
to read from ACPI.

Jan
Jiqian Chen Dec. 6, 2023, 6:07 a.m. UTC | #13
On 2023/12/5 18:32, Jan Beulich wrote:
> On 05.12.2023 10:19, Roger Pau Monné wrote:
>> On Mon, Dec 04, 2023 at 02:19:33PM -0800, Stefano Stabellini wrote:
>>> On Mon, 4 Dec 2023, Roger Pau Monné wrote:
>>>> On Fri, Dec 01, 2023 at 07:37:55PM -0800, Stefano Stabellini wrote:
>>>>> On Fri, 1 Dec 2023, Roger Pau Monné wrote:
>>>>>> On Thu, Nov 30, 2023 at 07:15:17PM -0800, Stefano Stabellini wrote:
>>>>>>> On Thu, 30 Nov 2023, Roger Pau Monné wrote:
>>>>>>>> On Wed, Nov 29, 2023 at 07:53:59PM -0800, Stefano Stabellini wrote:
>>>>>>>>> On Fri, 24 Nov 2023, Jiqian Chen wrote:
>>>>>>>>>> This patch is to solve two problems we encountered when we try to
>>>>>>>>>> passthrough a device to hvm domU base on Xen PVH dom0.
>>>>>>>>>>
>>>>>>>>>> First, hvm guest will alloc a pirq and irq for a passthrough device
>>>>>>>>>> by using gsi, before that, the gsi must first has a mapping in dom0,
>>>>>>>>>> see Xen code pci_add_dm_done->xc_domain_irq_permission, it will call
>>>>>>>>>> into Xen and check whether dom0 has the mapping. See
>>>>>>>>>> XEN_DOMCTL_irq_permission->pirq_access_permitted, "current" is PVH
>>>>>>>>>> dom0 and it return irq is 0, and then return -EPERM.
>>>>>>>>>> This is because the passthrough device doesn't do PHYSDEVOP_map_pirq
>>>>>>>>>> when thay are enabled.
>>>>>>>>>>
>>>>>>>>>> Second, in PVH dom0, the gsi of a passthrough device doesn't get
>>>>>>>>>> registered, but gsi must be configured for it to be able to be
>>>>>>>>>> mapped into a domU.
>>>>>>>>>>
>>>>>>>>>> After searching codes, we can find map_pirq and register_gsi will be
>>>>>>>>>> done in function vioapic_write_redirent->vioapic_hwdom_map_gsi when
>>>>>>>>>> the gsi(aka ioapic's pin) is unmasked in PVH dom0. So the problems
>>>>>>>>>> can be conclude to that the gsi of a passthrough device doesn't be
>>>>>>>>>> unmasked.
>>>>>>>>>>
>>>>>>>>>> To solve the unmaske problem, this patch call the unmask_irq when we
>>>>>>>>>> assign a device to be passthrough. So that the gsi can get registered
>>>>>>>>>> and mapped in PVH dom0.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Roger, this seems to be more of a Xen issue than a Linux issue. Why do
>>>>>>>>> we need the unmask check in Xen? Couldn't we just do:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
>>>>>>>>> index 4e40d3609a..df262a4a18 100644
>>>>>>>>> --- a/xen/arch/x86/hvm/vioapic.c
>>>>>>>>> +++ b/xen/arch/x86/hvm/vioapic.c
>>>>>>>>> @@ -287,7 +287,7 @@ static void vioapic_write_redirent(
>>>>>>>>>              hvm_dpci_eoi(d, gsi);
>>>>>>>>>      }
>>>>>>>>>  
>>>>>>>>> -    if ( is_hardware_domain(d) && unmasked )
>>>>>>>>> +    if ( is_hardware_domain(d) )
>>>>>>>>>      {
>>>>>>>>>          /*
>>>>>>>>>           * NB: don't call vioapic_hwdom_map_gsi while holding hvm.irq_lock
>>>>>>>>
>>>>>>>> There are some issues with this approach.
>>>>>>>>
>>>>>>>> mp_register_gsi() will only setup the trigger and polarity of the
>>>>>>>> IO-APIC pin once, so we do so once the guest unmask the pin in order
>>>>>>>> to assert that the configuration is the intended one.  A guest is
>>>>>>>> allowed to write all kind of nonsense stuff to the IO-APIC RTE, but
>>>>>>>> that doesn't take effect unless the pin is unmasked.
>>>>>>>>
>>>>>>>> Overall the question would be whether we have any guarantees that
>>>>>>>> the hardware domain has properly configured the pin, even if it's not
>>>>>>>> using it itself (as it hasn't been unmasked).
>>>>>>>>
>>>>>>>> IIRC PCI legacy interrupts are level triggered and low polarity, so we
>>>>>>>> could configure any pins that are not setup at bind time?
>>>>>>>
>>>>>>> That could work.
>>>>>>>
>>>>>>> Another idea is to move only the call to allocate_and_map_gsi_pirq at
>>>>>>> bind time? That might be enough to pass a pirq_access_permitted check.
>>>>>>
>>>>>> Maybe, albeit that would change the behavior of XEN_DOMCTL_bind_pt_irq
>>>>>> just for PT_IRQ_TYPE_PCI and only when called from a PVH dom0 (as the
>>>>>> parameter would be a GSI instead of a previously mapped IRQ).  Such
>>>>>> difference just for PT_IRQ_TYPE_PCI is slightly weird - if we go that
>>>>>> route I would recommend that we instead introduce a new dmop that has
>>>>>> this syntax regardless of the domain type it's called from.
>>>>>
>>>>> Looking at the code it is certainly a bit confusing. My point was that
>>>>> we don't need to wait until polarity and trigger are set appropriately
>>>>> to allow Dom0 to pass successfully a pirq_access_permitted() check. Xen
>>>>> should be able to figure out that Dom0 is permitted pirq access.
>>>>
>>>> The logic is certainly not straightforward, and it could benefit from
>>>> some comments.
>>>>
>>>> The irq permissions are a bit special, in that they get setup when the
>>>> IRQ is mapped.
>>>>
>>>> The problem however is not so much with IRQ permissions, that we can
>>>> indeed sort out internally in Xen.  Such check in dom0 has the side
>>>> effect of preventing the IRQ from being assigned to a domU without the
>>>> hardware source being properly configured AFAICT.
>>>
>>> Now I understand why you made a comment previously about Xen having to
>>> configure trigger and polarity for these interrupts on its own.
>>>
>>>
>>>>> So the idea was to move the call to allocate_and_map_gsi_pirq() earlier
>>>>> somewhere because allocate_and_map_gsi_pirq doesn't require trigger or
>>>>> polarity to be configured to work. But the suggestion of doing it a
>>>>> "bind time" (meaning: XEN_DOMCTL_bind_pt_irq) was a bad idea.
>>>>>
>>>>> But maybe we can find another location, maybe within
>>>>> xen/arch/x86/hvm/vioapic.c, to call allocate_and_map_gsi_pirq() before
>>>>> trigger and polarity are set and before the interrupt is unmasked.
>>>>>
>>>>> Then we change the implementation of vioapic_hwdom_map_gsi to skip the
>>>>> call to allocate_and_map_gsi_pirq, because by the time
>>>>> vioapic_hwdom_map_gsi we assume that allocate_and_map_gsi_pirq had
>>>>> already been done.
>>>>
>>>> But then we would end up in a situation where the
>>>> pirq_access_permitted() check will pass, but the IO-APIC pin won't be
>>>> configured, which I think it's not what we want.
>>>>
>>>> One option would be to allow mp_register_gsi() to be called multiple
>>>> times, and update the IO-APIC pin configuration as long as the pin is
>>>> not unmasked.  That would propagate each dom0 RTE update to the
>>>> underlying IO-APIC.  However such approach relies on dom0 configuring
>>>> all possible IO-APIC pins, even if no device on dom0 is using them, I
>>>> think it's not a very reliable option.
>>>>
>>>> Another option would be to modify the toolstack to setup the GSI
>>>> itself using the PHYSDEVOP_setup_gsi hypercall.  As said in a previous
>>>> email, since we only care about PCI device passthrough the legacy INTx
>>>> should always be level triggered and low polarity.
>>>>
>>>>> I am not familiar with vioapic.c but to give you an idea of what I was
>>>>> thinking:
>>>>>
>>>>>
>>>>> diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
>>>>> index 4e40d3609a..16d56fe851 100644
>>>>> --- a/xen/arch/x86/hvm/vioapic.c
>>>>> +++ b/xen/arch/x86/hvm/vioapic.c
>>>>> @@ -189,14 +189,6 @@ static int vioapic_hwdom_map_gsi(unsigned int gsi, unsigned int trig,
>>>>>          return ret;
>>>>>      }
>>>>>  
>>>>> -    ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq);
>>>>> -    if ( ret )
>>>>> -    {
>>>>> -        gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n",
>>>>> -                 gsi, ret);
>>>>> -        return ret;
>>>>> -    }
>>>>> -
>>>>>      pcidevs_lock();
>>>>>      ret = pt_irq_create_bind(currd, &pt_irq_bind);
>>>>>      if ( ret )
>>>>> @@ -287,6 +279,17 @@ static void vioapic_write_redirent(
>>>>>              hvm_dpci_eoi(d, gsi);
>>>>>      }
>>>>>  
>>>>> +    if ( is_hardware_domain(d) ) 
>>>>> +    {
>>>>> +        int pirq = gsi, ret;
>>>>> +        ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq);
>>>>> +        if ( ret )
>>>>> +        {
>>>>> +            gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n",
>>>>> +                    gsi, ret);
>>>>> +            return ret;
>>>>> +        }
>>>>> +    }
>>>>>      if ( is_hardware_domain(d) && unmasked )
>>>>>      {
>>>>>          /*
>>>>
>>>> As said above, such approach relies on dom0 writing to the IO-APIC RTE
>>>> of likely each IO-APIC pin, which is IMO not quite reliable.  In there
>>>> are two different issues here that need to be fixed for PVH dom0:
>>>>
>>>>  - Fix the XEN_DOMCTL_irq_permission pirq_access_permitted() call to
>>>>    succeed for a PVH dom0, even if dom0 is not using the GSI itself.
>>>
>>> Yes makes sense
>>>
>>>
>>>>  - Configure IO-APIC pins for PCI interrupts even if dom0 is not using
>>>>    the IO-APIC pin itself.
>>>>
>>>> First one needs to be fixed internally in Xen, second one will require
>>>> the toolstack to issue an extra hypercall in order to ensure the
>>>> IO-APIC pin is properly configured.
>>>  
>>> On ARM, Xen doesn't need to wait for dom0 to configure interrupts
>>> correctly. Xen configures them all on its own at boot based on Device
>>> Tree information. I guess it is not possible to do the same on x86?
>>
>> No, not exactly.  There's some interrupt information in the ACPI MADT,
>> but that's just for very specific sources (Interrupt Source Override
>> Structures)
>>
>> Then on AML devices can have resource descriptors that contain
>> information about how interrupts are setup.  However Xen is not able
>> to read any of this information on AML.
>>
>> Legacy PCI interrupts are (always?) level triggered and low polarity,
>> because it's assumed that an interrupt source can be shared between
>> multiple devices.
> 
> Except that as per what you said just in the earlier paragraph ACPI can
> tell us otherwise.
> 
>> I'm however not able to find any reference to this in the PCI spec,
>> hence I'm reluctant to take this for granted in Xen, and default all
>> GSIs >= 16 to such mode.
>>
>> OTOH legacy PCI interrupts are not that used anymore, as almost all
>> devices will support MSI(-X) (because PCIe mandates it) and OSes
>> should prefer the latter.  SR-IOV VF don't even support legacy PCI
>> interrupts anymore.
>>
>>> If
>>> not, then I can see why we would need 1 extra toolstack hypercall for
>>> that (or to bundle the operation of configuring IO-APIC pins together
>>> with an existing toolstack hypercall).
>>
>> One suitable compromise would be to default unconfigured GSIs >= 16 to
>> level-triggered and low-polarity, as I would expect that to work in
>> almost all cases.  We can always introduce the usage of
>> PHYSDEVOP_setup_gsi later if required.
>>
>> Maybe Jan has more input here, would you agree to defaulting non-ISA
>> GSIs to level-triggered, low-polarity in the absence of a specific
>> setup provided by dom0?
> 
> Well, such defaulting is an option, but in case it's wrong we might
> end up with hard to diagnose issues. Personally I'd prefer if we
> didn't take shortcuts here, i.e. if we followed what Dom0 is able
> to read from ACPI.

When PVH dom0 enable a device, it will get trigger and polarity from ACPI (see acpi_pci_irq_enable)
I have a version of patch which tried that way, see below:

diff --git a/arch/x86/xen/enlighten_pvh.c b/arch/x86/xen/enlighten_pvh.c
index ada3868c02c2..43e1bda9f946 100644
--- a/arch/x86/xen/enlighten_pvh.c
+++ b/arch/x86/xen/enlighten_pvh.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/acpi.h>
 #include <linux/export.h>
+#include <linux/pci.h>

 #include <xen/hvc-console.h>

@@ -25,6 +26,127 @@
 bool __ro_after_init xen_pvh;
 EXPORT_SYMBOL_GPL(xen_pvh);

+typedef struct gsi_info {
+       int gsi;
+       int trigger;
+       int polarity;
+       int pirq;
+} gsi_info_t;
+
+struct acpi_prt_entry {
+       struct acpi_pci_id      id;
+       u8                      pin;
+       acpi_handle             link;
+       u32                     index;          /* GSI, or link _CRS index */
+};
+
+static int xen_pvh_get_gsi_info(struct pci_dev *dev,
+                                                               gsi_info_t *gsi_info)
+{
+       int gsi;
+       u8 pin = 0;
+       struct acpi_prt_entry *entry;
+       int trigger = ACPI_LEVEL_SENSITIVE;
+       int polarity = acpi_irq_model == ACPI_IRQ_MODEL_GIC ?
+                                     ACPI_ACTIVE_HIGH : ACPI_ACTIVE_LOW;
+
+       if (dev)
+               pin = dev->pin;
+       if (!pin) {
+               xen_raw_printk("No interrupt pin configured\n");
+               return -EINVAL;
+       }
+
+       entry = acpi_pci_irq_lookup(dev, pin);
+       if (entry) {
+               if (entry->link)
+                       gsi = acpi_pci_link_allocate_irq(entry->link,
+                                                        entry->index,
+                                                        &trigger, &polarity,
+                                                        NULL);
+               else
+                       gsi = entry->index;
+       } else
+               return -EINVAL;
+
+       gsi_info->gsi = gsi;
+       gsi_info->trigger = trigger;
+       gsi_info->polarity = polarity;
+
+       return 0;
+}
+
+static int xen_pvh_map_pirq(gsi_info_t *gsi_info)
+{
+       struct physdev_map_pirq map_irq;
+       int ret;
+
+       map_irq.domid = DOMID_SELF;
+       map_irq.type = MAP_PIRQ_TYPE_GSI;
+       map_irq.index = gsi_info->gsi;
+       map_irq.pirq = gsi_info->gsi;
+
+       ret = HYPERVISOR_physdev_op(PHYSDEVOP_map_pirq, &map_irq);
+       gsi_info->pirq = map_irq.pirq;
+
+       return ret;
+}
+
+static int xen_pvh_unmap_pirq(gsi_info_t *gsi_info)
+{
+       struct physdev_unmap_pirq unmap_irq;
+
+       unmap_irq.domid = DOMID_SELF;
+       unmap_irq.pirq = gsi_info->pirq;
+
+       return HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq, &unmap_irq);
+}
+
+static int xen_pvh_setup_gsi(gsi_info_t *gsi_info)
+{
+       struct physdev_setup_gsi setup_gsi;
+
+       setup_gsi.gsi = gsi_info->gsi;
+       setup_gsi.triggering = (gsi_info->trigger == ACPI_EDGE_SENSITIVE ? 0 : 1);
+       setup_gsi.polarity = (gsi_info->polarity == ACPI_ACTIVE_HIGH ? 0 : 1);
+
+       return HYPERVISOR_physdev_op(PHYSDEVOP_setup_gsi, &setup_gsi);
+}
+
+int xen_pvh_passthrough_gsi(struct pci_dev *dev)
+{
+       int ret;
+       gsi_info_t gsi_info;
+
+       if (!dev) {
+               return -EINVAL;
+       }
+
+       ret = xen_pvh_get_gsi_info(dev, &gsi_info);
+       if (ret) {
+               xen_raw_printk("Fail to get gsi info!\n");
+               return ret;
+       }
+
+       ret = xen_pvh_map_pirq(&gsi_info);
+       if (ret) {
+               xen_raw_printk("Fail to map pirq for gsi (%d)!\n", gsi_info.gsi);
+               return ret;
+       }
+
+       ret = xen_pvh_setup_gsi(&gsi_info);
+       if (ret == -EEXIST) {
+               ret = 0;
+               xen_raw_printk("Already setup the GSI :%u\n", gsi_info.gsi);
+       } else if (ret) {
+               xen_raw_printk("Fail to setup gsi (%d)!\n", gsi_info.gsi);
+               xen_pvh_unmap_pirq(&gsi_info);
+       }
+
+       return ret;
+}
+EXPORT_SYMBOL_GPL(xen_pvh_passthrough_gsi);
+
 void __init xen_pvh_init(struct boot_params *boot_params)
 {
        u32 msr;
diff --git a/drivers/acpi/pci_irq.c b/drivers/acpi/pci_irq.c
index ff30ceca2203..630fe0a34bc6 100644
--- a/drivers/acpi/pci_irq.c
+++ b/drivers/acpi/pci_irq.c
@@ -288,7 +288,7 @@ static int acpi_reroute_boot_interrupt(struct pci_dev *dev,
 }
 #endif /* CONFIG_X86_IO_APIC */

-static struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin)
+struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin)
 {
        struct acpi_prt_entry *entry = NULL;
        struct pci_dev *bridge;
diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c
index e34b623e4b41..1abd4dad6f40 100644
--- a/drivers/xen/xen-pciback/pci_stub.c
+++ b/drivers/xen/xen-pciback/pci_stub.c
@@ -20,6 +20,7 @@
 #include <linux/atomic.h>
 #include <xen/events.h>
 #include <xen/pci.h>
+#include <xen/acpi.h>
 #include <xen/xen.h>
 #include <asm/xen/hypervisor.h>
 #include <xen/interface/physdev.h>
@@ -399,6 +400,12 @@ static int pcistub_init_device(struct pci_dev *dev)
        if (err)
                goto config_release;

+       if (xen_initial_domain() && xen_pvh_domain()) {
+               err = xen_pvh_passthrough_gsi(dev);
+               if (err)
+                       goto config_release;
+       }
+
        if (dev->msix_cap) {
                struct physdev_pci_device ppdev = {
                        .seg = pci_domain_nr(dev->bus),
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 641dc4843987..368d56ba2c5e 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -375,6 +375,7 @@ void acpi_unregister_gsi (u32 gsi);

 struct pci_dev;

+struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin);
 int acpi_pci_irq_enable (struct pci_dev *dev);
 void acpi_penalize_isa_irq(int irq, int active);
 bool acpi_isa_irq_available(int irq);
diff --git a/include/xen/acpi.h b/include/xen/acpi.h
index b1e11863144d..ce7f5554f88e 100644
--- a/include/xen/acpi.h
+++ b/include/xen/acpi.h
@@ -67,6 +67,7 @@ static inline void xen_acpi_sleep_register(void)
                acpi_suspend_lowlevel = xen_acpi_suspend_lowlevel;
        }
 }
+int xen_pvh_passthrough_gsi(struct pci_dev *dev);
 #else
 static inline void xen_acpi_sleep_register(void)
 {

> 
> Jan
Stefano Stabellini Dec. 7, 2023, 2:18 a.m. UTC | #14
On Tue, 5 Dec 2023, Chen, Jiqian wrote:
> When PVH dom0 enable a device, it will get trigger and polarity from ACPI (see acpi_pci_irq_enable)
> I have a version of patch which tried that way, see below:

This approach looks much better. I think this patch is OKish. Juergen,
what do you think?


> diff --git a/arch/x86/xen/enlighten_pvh.c b/arch/x86/xen/enlighten_pvh.c
> index ada3868c02c2..43e1bda9f946 100644
> --- a/arch/x86/xen/enlighten_pvh.c
> +++ b/arch/x86/xen/enlighten_pvh.c
> @@ -1,6 +1,7 @@
>  // SPDX-License-Identifier: GPL-2.0
>  #include <linux/acpi.h>
>  #include <linux/export.h>
> +#include <linux/pci.h>
> 
>  #include <xen/hvc-console.h>
> 
> @@ -25,6 +26,127 @@
>  bool __ro_after_init xen_pvh;
>  EXPORT_SYMBOL_GPL(xen_pvh);
> 
> +typedef struct gsi_info {
> +       int gsi;
> +       int trigger;
> +       int polarity;
> +       int pirq;
> +} gsi_info_t;
> +
> +struct acpi_prt_entry {
> +       struct acpi_pci_id      id;
> +       u8                      pin;
> +       acpi_handle             link;
> +       u32                     index;          /* GSI, or link _CRS index */
> +};
> +
> +static int xen_pvh_get_gsi_info(struct pci_dev *dev,
> +                                                               gsi_info_t *gsi_info)
> +{
> +       int gsi;
> +       u8 pin = 0;
> +       struct acpi_prt_entry *entry;
> +       int trigger = ACPI_LEVEL_SENSITIVE;
> +       int polarity = acpi_irq_model == ACPI_IRQ_MODEL_GIC ?
> +                                     ACPI_ACTIVE_HIGH : ACPI_ACTIVE_LOW;
> +
> +       if (dev)
> +               pin = dev->pin;
> +       if (!pin) {
> +               xen_raw_printk("No interrupt pin configured\n");
> +               return -EINVAL;
> +       }
> +
> +       entry = acpi_pci_irq_lookup(dev, pin);
> +       if (entry) {
> +               if (entry->link)
> +                       gsi = acpi_pci_link_allocate_irq(entry->link,
> +                                                        entry->index,
> +                                                        &trigger, &polarity,
> +                                                        NULL);
> +               else
> +                       gsi = entry->index;
> +       } else
> +               return -EINVAL;
> +
> +       gsi_info->gsi = gsi;
> +       gsi_info->trigger = trigger;
> +       gsi_info->polarity = polarity;
> +
> +       return 0;
> +}
> +
> +static int xen_pvh_map_pirq(gsi_info_t *gsi_info)
> +{
> +       struct physdev_map_pirq map_irq;
> +       int ret;
> +
> +       map_irq.domid = DOMID_SELF;
> +       map_irq.type = MAP_PIRQ_TYPE_GSI;
> +       map_irq.index = gsi_info->gsi;
> +       map_irq.pirq = gsi_info->gsi;
> +
> +       ret = HYPERVISOR_physdev_op(PHYSDEVOP_map_pirq, &map_irq);
> +       gsi_info->pirq = map_irq.pirq;
> +
> +       return ret;
> +}
> +
> +static int xen_pvh_unmap_pirq(gsi_info_t *gsi_info)
> +{
> +       struct physdev_unmap_pirq unmap_irq;
> +
> +       unmap_irq.domid = DOMID_SELF;
> +       unmap_irq.pirq = gsi_info->pirq;
> +
> +       return HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq, &unmap_irq);
> +}
> +
> +static int xen_pvh_setup_gsi(gsi_info_t *gsi_info)
> +{
> +       struct physdev_setup_gsi setup_gsi;
> +
> +       setup_gsi.gsi = gsi_info->gsi;
> +       setup_gsi.triggering = (gsi_info->trigger == ACPI_EDGE_SENSITIVE ? 0 : 1);
> +       setup_gsi.polarity = (gsi_info->polarity == ACPI_ACTIVE_HIGH ? 0 : 1);
> +
> +       return HYPERVISOR_physdev_op(PHYSDEVOP_setup_gsi, &setup_gsi);
> +}
> +
> +int xen_pvh_passthrough_gsi(struct pci_dev *dev)
> +{
> +       int ret;
> +       gsi_info_t gsi_info;
> +
> +       if (!dev) {
> +               return -EINVAL;
> +       }
> +
> +       ret = xen_pvh_get_gsi_info(dev, &gsi_info);
> +       if (ret) {
> +               xen_raw_printk("Fail to get gsi info!\n");
> +               return ret;
> +       }
> +
> +       ret = xen_pvh_map_pirq(&gsi_info);
> +       if (ret) {
> +               xen_raw_printk("Fail to map pirq for gsi (%d)!\n", gsi_info.gsi);
> +               return ret;
> +       }
> +
> +       ret = xen_pvh_setup_gsi(&gsi_info);
> +       if (ret == -EEXIST) {
> +               ret = 0;
> +               xen_raw_printk("Already setup the GSI :%u\n", gsi_info.gsi);
> +       } else if (ret) {
> +               xen_raw_printk("Fail to setup gsi (%d)!\n", gsi_info.gsi);
> +               xen_pvh_unmap_pirq(&gsi_info);
> +       }
> +
> +       return ret;
> +}
> +EXPORT_SYMBOL_GPL(xen_pvh_passthrough_gsi);
> +
>  void __init xen_pvh_init(struct boot_params *boot_params)
>  {
>         u32 msr;
> diff --git a/drivers/acpi/pci_irq.c b/drivers/acpi/pci_irq.c
> index ff30ceca2203..630fe0a34bc6 100644
> --- a/drivers/acpi/pci_irq.c
> +++ b/drivers/acpi/pci_irq.c
> @@ -288,7 +288,7 @@ static int acpi_reroute_boot_interrupt(struct pci_dev *dev,
>  }
>  #endif /* CONFIG_X86_IO_APIC */
> 
> -static struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin)
> +struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin)
>  {
>         struct acpi_prt_entry *entry = NULL;
>         struct pci_dev *bridge;
> diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c
> index e34b623e4b41..1abd4dad6f40 100644
> --- a/drivers/xen/xen-pciback/pci_stub.c
> +++ b/drivers/xen/xen-pciback/pci_stub.c
> @@ -20,6 +20,7 @@
>  #include <linux/atomic.h>
>  #include <xen/events.h>
>  #include <xen/pci.h>
> +#include <xen/acpi.h>
>  #include <xen/xen.h>
>  #include <asm/xen/hypervisor.h>
>  #include <xen/interface/physdev.h>
> @@ -399,6 +400,12 @@ static int pcistub_init_device(struct pci_dev *dev)
>         if (err)
>                 goto config_release;
> 
> +       if (xen_initial_domain() && xen_pvh_domain()) {
> +               err = xen_pvh_passthrough_gsi(dev);
> +               if (err)
> +                       goto config_release;
> +       }
> +
>         if (dev->msix_cap) {
>                 struct physdev_pci_device ppdev = {
>                         .seg = pci_domain_nr(dev->bus),
> diff --git a/include/linux/acpi.h b/include/linux/acpi.h
> index 641dc4843987..368d56ba2c5e 100644
> --- a/include/linux/acpi.h
> +++ b/include/linux/acpi.h
> @@ -375,6 +375,7 @@ void acpi_unregister_gsi (u32 gsi);
> 
>  struct pci_dev;
> 
> +struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin);
>  int acpi_pci_irq_enable (struct pci_dev *dev);
>  void acpi_penalize_isa_irq(int irq, int active);
>  bool acpi_isa_irq_available(int irq);
> diff --git a/include/xen/acpi.h b/include/xen/acpi.h
> index b1e11863144d..ce7f5554f88e 100644
> --- a/include/xen/acpi.h
> +++ b/include/xen/acpi.h
> @@ -67,6 +67,7 @@ static inline void xen_acpi_sleep_register(void)
>                 acpi_suspend_lowlevel = xen_acpi_suspend_lowlevel;
>         }
>  }
> +int xen_pvh_passthrough_gsi(struct pci_dev *dev);
>  #else
>  static inline void xen_acpi_sleep_register(void)
>  {
> 
> > 
> > Jan
> 
> -- 
> Best regards,
> Jiqian Chen.
>
Jiqian Chen Dec. 7, 2023, 3:38 a.m. UTC | #15
(Adding Juergen to the "To" list.)

Hi Juergen, 
Looking forward to your opinions.

On 2023/12/7 10:18, Stefano Stabellini wrote:
> On Tue, 5 Dec 2023, Chen, Jiqian wrote:
>> When PVH dom0 enable a device, it will get trigger and polarity from ACPI (see acpi_pci_irq_enable)
>> I have a version of patch which tried that way, see below:
> 
> This approach looks much better. I think this patch is OKish. Juergen,
> what do you think?
> 
> 
>> diff --git a/arch/x86/xen/enlighten_pvh.c b/arch/x86/xen/enlighten_pvh.c
>> index ada3868c02c2..43e1bda9f946 100644
>> --- a/arch/x86/xen/enlighten_pvh.c
>> +++ b/arch/x86/xen/enlighten_pvh.c
>> @@ -1,6 +1,7 @@
>>  // SPDX-License-Identifier: GPL-2.0
>>  #include <linux/acpi.h>
>>  #include <linux/export.h>
>> +#include <linux/pci.h>
>>
>>  #include <xen/hvc-console.h>
>>
>> @@ -25,6 +26,127 @@
>>  bool __ro_after_init xen_pvh;
>>  EXPORT_SYMBOL_GPL(xen_pvh);
>>
>> +typedef struct gsi_info {
>> +       int gsi;
>> +       int trigger;
>> +       int polarity;
>> +       int pirq;
>> +} gsi_info_t;
>> +
>> +struct acpi_prt_entry {
>> +       struct acpi_pci_id      id;
>> +       u8                      pin;
>> +       acpi_handle             link;
>> +       u32                     index;          /* GSI, or link _CRS index */
>> +};
>> +
>> +static int xen_pvh_get_gsi_info(struct pci_dev *dev,
>> +                                                               gsi_info_t *gsi_info)
>> +{
>> +       int gsi;
>> +       u8 pin = 0;
>> +       struct acpi_prt_entry *entry;
>> +       int trigger = ACPI_LEVEL_SENSITIVE;
>> +       int polarity = acpi_irq_model == ACPI_IRQ_MODEL_GIC ?
>> +                                     ACPI_ACTIVE_HIGH : ACPI_ACTIVE_LOW;
>> +
>> +       if (dev)
>> +               pin = dev->pin;
>> +       if (!pin) {
>> +               xen_raw_printk("No interrupt pin configured\n");
>> +               return -EINVAL;
>> +       }
>> +
>> +       entry = acpi_pci_irq_lookup(dev, pin);
>> +       if (entry) {
>> +               if (entry->link)
>> +                       gsi = acpi_pci_link_allocate_irq(entry->link,
>> +                                                        entry->index,
>> +                                                        &trigger, &polarity,
>> +                                                        NULL);
>> +               else
>> +                       gsi = entry->index;
>> +       } else
>> +               return -EINVAL;
>> +
>> +       gsi_info->gsi = gsi;
>> +       gsi_info->trigger = trigger;
>> +       gsi_info->polarity = polarity;
>> +
>> +       return 0;
>> +}
>> +
>> +static int xen_pvh_map_pirq(gsi_info_t *gsi_info)
>> +{
>> +       struct physdev_map_pirq map_irq;
>> +       int ret;
>> +
>> +       map_irq.domid = DOMID_SELF;
>> +       map_irq.type = MAP_PIRQ_TYPE_GSI;
>> +       map_irq.index = gsi_info->gsi;
>> +       map_irq.pirq = gsi_info->gsi;
>> +
>> +       ret = HYPERVISOR_physdev_op(PHYSDEVOP_map_pirq, &map_irq);
>> +       gsi_info->pirq = map_irq.pirq;
>> +
>> +       return ret;
>> +}
>> +
>> +static int xen_pvh_unmap_pirq(gsi_info_t *gsi_info)
>> +{
>> +       struct physdev_unmap_pirq unmap_irq;
>> +
>> +       unmap_irq.domid = DOMID_SELF;
>> +       unmap_irq.pirq = gsi_info->pirq;
>> +
>> +       return HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq, &unmap_irq);
>> +}
>> +
>> +static int xen_pvh_setup_gsi(gsi_info_t *gsi_info)
>> +{
>> +       struct physdev_setup_gsi setup_gsi;
>> +
>> +       setup_gsi.gsi = gsi_info->gsi;
>> +       setup_gsi.triggering = (gsi_info->trigger == ACPI_EDGE_SENSITIVE ? 0 : 1);
>> +       setup_gsi.polarity = (gsi_info->polarity == ACPI_ACTIVE_HIGH ? 0 : 1);
>> +
>> +       return HYPERVISOR_physdev_op(PHYSDEVOP_setup_gsi, &setup_gsi);
>> +}
>> +
>> +int xen_pvh_passthrough_gsi(struct pci_dev *dev)
>> +{
>> +       int ret;
>> +       gsi_info_t gsi_info;
>> +
>> +       if (!dev) {
>> +               return -EINVAL;
>> +       }
>> +
>> +       ret = xen_pvh_get_gsi_info(dev, &gsi_info);
>> +       if (ret) {
>> +               xen_raw_printk("Fail to get gsi info!\n");
>> +               return ret;
>> +       }
>> +
>> +       ret = xen_pvh_map_pirq(&gsi_info);
>> +       if (ret) {
>> +               xen_raw_printk("Fail to map pirq for gsi (%d)!\n", gsi_info.gsi);
>> +               return ret;
>> +       }
>> +
>> +       ret = xen_pvh_setup_gsi(&gsi_info);
>> +       if (ret == -EEXIST) {
>> +               ret = 0;
>> +               xen_raw_printk("Already setup the GSI :%u\n", gsi_info.gsi);
>> +       } else if (ret) {
>> +               xen_raw_printk("Fail to setup gsi (%d)!\n", gsi_info.gsi);
>> +               xen_pvh_unmap_pirq(&gsi_info);
>> +       }
>> +
>> +       return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(xen_pvh_passthrough_gsi);
>> +
>>  void __init xen_pvh_init(struct boot_params *boot_params)
>>  {
>>         u32 msr;
>> diff --git a/drivers/acpi/pci_irq.c b/drivers/acpi/pci_irq.c
>> index ff30ceca2203..630fe0a34bc6 100644
>> --- a/drivers/acpi/pci_irq.c
>> +++ b/drivers/acpi/pci_irq.c
>> @@ -288,7 +288,7 @@ static int acpi_reroute_boot_interrupt(struct pci_dev *dev,
>>  }
>>  #endif /* CONFIG_X86_IO_APIC */
>>
>> -static struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin)
>> +struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin)
>>  {
>>         struct acpi_prt_entry *entry = NULL;
>>         struct pci_dev *bridge;
>> diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c
>> index e34b623e4b41..1abd4dad6f40 100644
>> --- a/drivers/xen/xen-pciback/pci_stub.c
>> +++ b/drivers/xen/xen-pciback/pci_stub.c
>> @@ -20,6 +20,7 @@
>>  #include <linux/atomic.h>
>>  #include <xen/events.h>
>>  #include <xen/pci.h>
>> +#include <xen/acpi.h>
>>  #include <xen/xen.h>
>>  #include <asm/xen/hypervisor.h>
>>  #include <xen/interface/physdev.h>
>> @@ -399,6 +400,12 @@ static int pcistub_init_device(struct pci_dev *dev)
>>         if (err)
>>                 goto config_release;
>>
>> +       if (xen_initial_domain() && xen_pvh_domain()) {
>> +               err = xen_pvh_passthrough_gsi(dev);
>> +               if (err)
>> +                       goto config_release;
>> +       }
>> +
>>         if (dev->msix_cap) {
>>                 struct physdev_pci_device ppdev = {
>>                         .seg = pci_domain_nr(dev->bus),
>> diff --git a/include/linux/acpi.h b/include/linux/acpi.h
>> index 641dc4843987..368d56ba2c5e 100644
>> --- a/include/linux/acpi.h
>> +++ b/include/linux/acpi.h
>> @@ -375,6 +375,7 @@ void acpi_unregister_gsi (u32 gsi);
>>
>>  struct pci_dev;
>>
>> +struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin);
>>  int acpi_pci_irq_enable (struct pci_dev *dev);
>>  void acpi_penalize_isa_irq(int irq, int active);
>>  bool acpi_isa_irq_available(int irq);
>> diff --git a/include/xen/acpi.h b/include/xen/acpi.h
>> index b1e11863144d..ce7f5554f88e 100644
>> --- a/include/xen/acpi.h
>> +++ b/include/xen/acpi.h
>> @@ -67,6 +67,7 @@ static inline void xen_acpi_sleep_register(void)
>>                 acpi_suspend_lowlevel = xen_acpi_suspend_lowlevel;
>>         }
>>  }
>> +int xen_pvh_passthrough_gsi(struct pci_dev *dev);
>>  #else
>>  static inline void xen_acpi_sleep_register(void)
>>  {
>>
>>>
>>> Jan
>>
>> -- 
>> Best regards,
>> Jiqian Chen.
>>
Jürgen Groß Dec. 7, 2023, 6:43 a.m. UTC | #16
On 07.12.23 03:18, Stefano Stabellini wrote:
> On Tue, 5 Dec 2023, Chen, Jiqian wrote:
>> When PVH dom0 enable a device, it will get trigger and polarity from ACPI (see acpi_pci_irq_enable)
>> I have a version of patch which tried that way, see below:
> 
> This approach looks much better. I think this patch is OKish. Juergen,
> what do you think?

The approach seems to be fine.


Juergen

> 
> 
>> diff --git a/arch/x86/xen/enlighten_pvh.c b/arch/x86/xen/enlighten_pvh.c
>> index ada3868c02c2..43e1bda9f946 100644
>> --- a/arch/x86/xen/enlighten_pvh.c
>> +++ b/arch/x86/xen/enlighten_pvh.c
>> @@ -1,6 +1,7 @@
>>   // SPDX-License-Identifier: GPL-2.0
>>   #include <linux/acpi.h>
>>   #include <linux/export.h>
>> +#include <linux/pci.h>
>>
>>   #include <xen/hvc-console.h>
>>
>> @@ -25,6 +26,127 @@
>>   bool __ro_after_init xen_pvh;
>>   EXPORT_SYMBOL_GPL(xen_pvh);
>>
>> +typedef struct gsi_info {
>> +       int gsi;
>> +       int trigger;
>> +       int polarity;
>> +       int pirq;
>> +} gsi_info_t;
>> +
>> +struct acpi_prt_entry {
>> +       struct acpi_pci_id      id;
>> +       u8                      pin;
>> +       acpi_handle             link;
>> +       u32                     index;          /* GSI, or link _CRS index */
>> +};
>> +
>> +static int xen_pvh_get_gsi_info(struct pci_dev *dev,
>> +                                                               gsi_info_t *gsi_info)
>> +{
>> +       int gsi;
>> +       u8 pin = 0;
>> +       struct acpi_prt_entry *entry;
>> +       int trigger = ACPI_LEVEL_SENSITIVE;
>> +       int polarity = acpi_irq_model == ACPI_IRQ_MODEL_GIC ?
>> +                                     ACPI_ACTIVE_HIGH : ACPI_ACTIVE_LOW;
>> +
>> +       if (dev)
>> +               pin = dev->pin;
>> +       if (!pin) {
>> +               xen_raw_printk("No interrupt pin configured\n");
>> +               return -EINVAL;
>> +       }
>> +
>> +       entry = acpi_pci_irq_lookup(dev, pin);
>> +       if (entry) {
>> +               if (entry->link)
>> +                       gsi = acpi_pci_link_allocate_irq(entry->link,
>> +                                                        entry->index,
>> +                                                        &trigger, &polarity,
>> +                                                        NULL);
>> +               else
>> +                       gsi = entry->index;
>> +       } else
>> +               return -EINVAL;
>> +
>> +       gsi_info->gsi = gsi;
>> +       gsi_info->trigger = trigger;
>> +       gsi_info->polarity = polarity;
>> +
>> +       return 0;
>> +}
>> +
>> +static int xen_pvh_map_pirq(gsi_info_t *gsi_info)
>> +{
>> +       struct physdev_map_pirq map_irq;
>> +       int ret;
>> +
>> +       map_irq.domid = DOMID_SELF;
>> +       map_irq.type = MAP_PIRQ_TYPE_GSI;
>> +       map_irq.index = gsi_info->gsi;
>> +       map_irq.pirq = gsi_info->gsi;
>> +
>> +       ret = HYPERVISOR_physdev_op(PHYSDEVOP_map_pirq, &map_irq);
>> +       gsi_info->pirq = map_irq.pirq;
>> +
>> +       return ret;
>> +}
>> +
>> +static int xen_pvh_unmap_pirq(gsi_info_t *gsi_info)
>> +{
>> +       struct physdev_unmap_pirq unmap_irq;
>> +
>> +       unmap_irq.domid = DOMID_SELF;
>> +       unmap_irq.pirq = gsi_info->pirq;
>> +
>> +       return HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq, &unmap_irq);
>> +}
>> +
>> +static int xen_pvh_setup_gsi(gsi_info_t *gsi_info)
>> +{
>> +       struct physdev_setup_gsi setup_gsi;
>> +
>> +       setup_gsi.gsi = gsi_info->gsi;
>> +       setup_gsi.triggering = (gsi_info->trigger == ACPI_EDGE_SENSITIVE ? 0 : 1);
>> +       setup_gsi.polarity = (gsi_info->polarity == ACPI_ACTIVE_HIGH ? 0 : 1);
>> +
>> +       return HYPERVISOR_physdev_op(PHYSDEVOP_setup_gsi, &setup_gsi);
>> +}
>> +
>> +int xen_pvh_passthrough_gsi(struct pci_dev *dev)
>> +{
>> +       int ret;
>> +       gsi_info_t gsi_info;
>> +
>> +       if (!dev) {
>> +               return -EINVAL;
>> +       }
>> +
>> +       ret = xen_pvh_get_gsi_info(dev, &gsi_info);
>> +       if (ret) {
>> +               xen_raw_printk("Fail to get gsi info!\n");
>> +               return ret;
>> +       }
>> +
>> +       ret = xen_pvh_map_pirq(&gsi_info);
>> +       if (ret) {
>> +               xen_raw_printk("Fail to map pirq for gsi (%d)!\n", gsi_info.gsi);
>> +               return ret;
>> +       }
>> +
>> +       ret = xen_pvh_setup_gsi(&gsi_info);
>> +       if (ret == -EEXIST) {
>> +               ret = 0;
>> +               xen_raw_printk("Already setup the GSI :%u\n", gsi_info.gsi);
>> +       } else if (ret) {
>> +               xen_raw_printk("Fail to setup gsi (%d)!\n", gsi_info.gsi);
>> +               xen_pvh_unmap_pirq(&gsi_info);
>> +       }
>> +
>> +       return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(xen_pvh_passthrough_gsi);
>> +
>>   void __init xen_pvh_init(struct boot_params *boot_params)
>>   {
>>          u32 msr;
>> diff --git a/drivers/acpi/pci_irq.c b/drivers/acpi/pci_irq.c
>> index ff30ceca2203..630fe0a34bc6 100644
>> --- a/drivers/acpi/pci_irq.c
>> +++ b/drivers/acpi/pci_irq.c
>> @@ -288,7 +288,7 @@ static int acpi_reroute_boot_interrupt(struct pci_dev *dev,
>>   }
>>   #endif /* CONFIG_X86_IO_APIC */
>>
>> -static struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin)
>> +struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin)
>>   {
>>          struct acpi_prt_entry *entry = NULL;
>>          struct pci_dev *bridge;
>> diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c
>> index e34b623e4b41..1abd4dad6f40 100644
>> --- a/drivers/xen/xen-pciback/pci_stub.c
>> +++ b/drivers/xen/xen-pciback/pci_stub.c
>> @@ -20,6 +20,7 @@
>>   #include <linux/atomic.h>
>>   #include <xen/events.h>
>>   #include <xen/pci.h>
>> +#include <xen/acpi.h>
>>   #include <xen/xen.h>
>>   #include <asm/xen/hypervisor.h>
>>   #include <xen/interface/physdev.h>
>> @@ -399,6 +400,12 @@ static int pcistub_init_device(struct pci_dev *dev)
>>          if (err)
>>                  goto config_release;
>>
>> +       if (xen_initial_domain() && xen_pvh_domain()) {
>> +               err = xen_pvh_passthrough_gsi(dev);
>> +               if (err)
>> +                       goto config_release;
>> +       }
>> +
>>          if (dev->msix_cap) {
>>                  struct physdev_pci_device ppdev = {
>>                          .seg = pci_domain_nr(dev->bus),
>> diff --git a/include/linux/acpi.h b/include/linux/acpi.h
>> index 641dc4843987..368d56ba2c5e 100644
>> --- a/include/linux/acpi.h
>> +++ b/include/linux/acpi.h
>> @@ -375,6 +375,7 @@ void acpi_unregister_gsi (u32 gsi);
>>
>>   struct pci_dev;
>>
>> +struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin);
>>   int acpi_pci_irq_enable (struct pci_dev *dev);
>>   void acpi_penalize_isa_irq(int irq, int active);
>>   bool acpi_isa_irq_available(int irq);
>> diff --git a/include/xen/acpi.h b/include/xen/acpi.h
>> index b1e11863144d..ce7f5554f88e 100644
>> --- a/include/xen/acpi.h
>> +++ b/include/xen/acpi.h
>> @@ -67,6 +67,7 @@ static inline void xen_acpi_sleep_register(void)
>>                  acpi_suspend_lowlevel = xen_acpi_suspend_lowlevel;
>>          }
>>   }
>> +int xen_pvh_passthrough_gsi(struct pci_dev *dev);
>>   #else
>>   static inline void xen_acpi_sleep_register(void)
>>   {
>>
>>>
>>> Jan
>>
>> -- 
>> Best regards,
>> Jiqian Chen.
>>
Jiqian Chen Dec. 8, 2023, 5:53 a.m. UTC | #17
Thank Stefano and Juergen, I will use this approach in next version.

On 2023/12/7 14:43, Juergen Gross wrote:
> On 07.12.23 03:18, Stefano Stabellini wrote:
>> On Tue, 5 Dec 2023, Chen, Jiqian wrote:
>>> When PVH dom0 enable a device, it will get trigger and polarity from ACPI (see acpi_pci_irq_enable)
>>> I have a version of patch which tried that way, see below:
>>
>> This approach looks much better. I think this patch is OKish. Juergen,
>> what do you think?
> 
> The approach seems to be fine.
> 
> 
> Juergen
> 
>>
>>
>>> diff --git a/arch/x86/xen/enlighten_pvh.c b/arch/x86/xen/enlighten_pvh.c
>>> index ada3868c02c2..43e1bda9f946 100644
>>> --- a/arch/x86/xen/enlighten_pvh.c
>>> +++ b/arch/x86/xen/enlighten_pvh.c
>>> @@ -1,6 +1,7 @@
>>>   // SPDX-License-Identifier: GPL-2.0
>>>   #include <linux/acpi.h>
>>>   #include <linux/export.h>
>>> +#include <linux/pci.h>
>>>
>>>   #include <xen/hvc-console.h>
>>>
>>> @@ -25,6 +26,127 @@
>>>   bool __ro_after_init xen_pvh;
>>>   EXPORT_SYMBOL_GPL(xen_pvh);
>>>
>>> +typedef struct gsi_info {
>>> +       int gsi;
>>> +       int trigger;
>>> +       int polarity;
>>> +       int pirq;
>>> +} gsi_info_t;
>>> +
>>> +struct acpi_prt_entry {
>>> +       struct acpi_pci_id      id;
>>> +       u8                      pin;
>>> +       acpi_handle             link;
>>> +       u32                     index;          /* GSI, or link _CRS index */
>>> +};
>>> +
>>> +static int xen_pvh_get_gsi_info(struct pci_dev *dev,
>>> +                                                               gsi_info_t *gsi_info)
>>> +{
>>> +       int gsi;
>>> +       u8 pin = 0;
>>> +       struct acpi_prt_entry *entry;
>>> +       int trigger = ACPI_LEVEL_SENSITIVE;
>>> +       int polarity = acpi_irq_model == ACPI_IRQ_MODEL_GIC ?
>>> +                                     ACPI_ACTIVE_HIGH : ACPI_ACTIVE_LOW;
>>> +
>>> +       if (dev)
>>> +               pin = dev->pin;
>>> +       if (!pin) {
>>> +               xen_raw_printk("No interrupt pin configured\n");
>>> +               return -EINVAL;
>>> +       }
>>> +
>>> +       entry = acpi_pci_irq_lookup(dev, pin);
>>> +       if (entry) {
>>> +               if (entry->link)
>>> +                       gsi = acpi_pci_link_allocate_irq(entry->link,
>>> +                                                        entry->index,
>>> +                                                        &trigger, &polarity,
>>> +                                                        NULL);
>>> +               else
>>> +                       gsi = entry->index;
>>> +       } else
>>> +               return -EINVAL;
>>> +
>>> +       gsi_info->gsi = gsi;
>>> +       gsi_info->trigger = trigger;
>>> +       gsi_info->polarity = polarity;
>>> +
>>> +       return 0;
>>> +}
>>> +
>>> +static int xen_pvh_map_pirq(gsi_info_t *gsi_info)
>>> +{
>>> +       struct physdev_map_pirq map_irq;
>>> +       int ret;
>>> +
>>> +       map_irq.domid = DOMID_SELF;
>>> +       map_irq.type = MAP_PIRQ_TYPE_GSI;
>>> +       map_irq.index = gsi_info->gsi;
>>> +       map_irq.pirq = gsi_info->gsi;
>>> +
>>> +       ret = HYPERVISOR_physdev_op(PHYSDEVOP_map_pirq, &map_irq);
>>> +       gsi_info->pirq = map_irq.pirq;
>>> +
>>> +       return ret;
>>> +}
>>> +
>>> +static int xen_pvh_unmap_pirq(gsi_info_t *gsi_info)
>>> +{
>>> +       struct physdev_unmap_pirq unmap_irq;
>>> +
>>> +       unmap_irq.domid = DOMID_SELF;
>>> +       unmap_irq.pirq = gsi_info->pirq;
>>> +
>>> +       return HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq, &unmap_irq);
>>> +}
>>> +
>>> +static int xen_pvh_setup_gsi(gsi_info_t *gsi_info)
>>> +{
>>> +       struct physdev_setup_gsi setup_gsi;
>>> +
>>> +       setup_gsi.gsi = gsi_info->gsi;
>>> +       setup_gsi.triggering = (gsi_info->trigger == ACPI_EDGE_SENSITIVE ? 0 : 1);
>>> +       setup_gsi.polarity = (gsi_info->polarity == ACPI_ACTIVE_HIGH ? 0 : 1);
>>> +
>>> +       return HYPERVISOR_physdev_op(PHYSDEVOP_setup_gsi, &setup_gsi);
>>> +}
>>> +
>>> +int xen_pvh_passthrough_gsi(struct pci_dev *dev)
>>> +{
>>> +       int ret;
>>> +       gsi_info_t gsi_info;
>>> +
>>> +       if (!dev) {
>>> +               return -EINVAL;
>>> +       }
>>> +
>>> +       ret = xen_pvh_get_gsi_info(dev, &gsi_info);
>>> +       if (ret) {
>>> +               xen_raw_printk("Fail to get gsi info!\n");
>>> +               return ret;
>>> +       }
>>> +
>>> +       ret = xen_pvh_map_pirq(&gsi_info);
>>> +       if (ret) {
>>> +               xen_raw_printk("Fail to map pirq for gsi (%d)!\n", gsi_info.gsi);
>>> +               return ret;
>>> +       }
>>> +
>>> +       ret = xen_pvh_setup_gsi(&gsi_info);
>>> +       if (ret == -EEXIST) {
>>> +               ret = 0;
>>> +               xen_raw_printk("Already setup the GSI :%u\n", gsi_info.gsi);
>>> +       } else if (ret) {
>>> +               xen_raw_printk("Fail to setup gsi (%d)!\n", gsi_info.gsi);
>>> +               xen_pvh_unmap_pirq(&gsi_info);
>>> +       }
>>> +
>>> +       return ret;
>>> +}
>>> +EXPORT_SYMBOL_GPL(xen_pvh_passthrough_gsi);
>>> +
>>>   void __init xen_pvh_init(struct boot_params *boot_params)
>>>   {
>>>          u32 msr;
>>> diff --git a/drivers/acpi/pci_irq.c b/drivers/acpi/pci_irq.c
>>> index ff30ceca2203..630fe0a34bc6 100644
>>> --- a/drivers/acpi/pci_irq.c
>>> +++ b/drivers/acpi/pci_irq.c
>>> @@ -288,7 +288,7 @@ static int acpi_reroute_boot_interrupt(struct pci_dev *dev,
>>>   }
>>>   #endif /* CONFIG_X86_IO_APIC */
>>>
>>> -static struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin)
>>> +struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin)
>>>   {
>>>          struct acpi_prt_entry *entry = NULL;
>>>          struct pci_dev *bridge;
>>> diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c
>>> index e34b623e4b41..1abd4dad6f40 100644
>>> --- a/drivers/xen/xen-pciback/pci_stub.c
>>> +++ b/drivers/xen/xen-pciback/pci_stub.c
>>> @@ -20,6 +20,7 @@
>>>   #include <linux/atomic.h>
>>>   #include <xen/events.h>
>>>   #include <xen/pci.h>
>>> +#include <xen/acpi.h>
>>>   #include <xen/xen.h>
>>>   #include <asm/xen/hypervisor.h>
>>>   #include <xen/interface/physdev.h>
>>> @@ -399,6 +400,12 @@ static int pcistub_init_device(struct pci_dev *dev)
>>>          if (err)
>>>                  goto config_release;
>>>
>>> +       if (xen_initial_domain() && xen_pvh_domain()) {
>>> +               err = xen_pvh_passthrough_gsi(dev);
>>> +               if (err)
>>> +                       goto config_release;
>>> +       }
>>> +
>>>          if (dev->msix_cap) {
>>>                  struct physdev_pci_device ppdev = {
>>>                          .seg = pci_domain_nr(dev->bus),
>>> diff --git a/include/linux/acpi.h b/include/linux/acpi.h
>>> index 641dc4843987..368d56ba2c5e 100644
>>> --- a/include/linux/acpi.h
>>> +++ b/include/linux/acpi.h
>>> @@ -375,6 +375,7 @@ void acpi_unregister_gsi (u32 gsi);
>>>
>>>   struct pci_dev;
>>>
>>> +struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin);
>>>   int acpi_pci_irq_enable (struct pci_dev *dev);
>>>   void acpi_penalize_isa_irq(int irq, int active);
>>>   bool acpi_isa_irq_available(int irq);
>>> diff --git a/include/xen/acpi.h b/include/xen/acpi.h
>>> index b1e11863144d..ce7f5554f88e 100644
>>> --- a/include/xen/acpi.h
>>> +++ b/include/xen/acpi.h
>>> @@ -67,6 +67,7 @@ static inline void xen_acpi_sleep_register(void)
>>>                  acpi_suspend_lowlevel = xen_acpi_suspend_lowlevel;
>>>          }
>>>   }
>>> +int xen_pvh_passthrough_gsi(struct pci_dev *dev);
>>>   #else
>>>   static inline void xen_acpi_sleep_register(void)
>>>   {
>>>
>>>>
>>>> Jan
>>>
>>> -- 
>>> Best regards,
>>> Jiqian Chen.
>>>
>
Roger Pau Monné Dec. 11, 2023, 3:45 p.m. UTC | #18
On Wed, Dec 06, 2023 at 06:07:26AM +0000, Chen, Jiqian wrote:
> On 2023/12/5 18:32, Jan Beulich wrote:
> > On 05.12.2023 10:19, Roger Pau Monné wrote:
> >> On Mon, Dec 04, 2023 at 02:19:33PM -0800, Stefano Stabellini wrote:
> >>> On Mon, 4 Dec 2023, Roger Pau Monné wrote:
> >>>> On Fri, Dec 01, 2023 at 07:37:55PM -0800, Stefano Stabellini wrote:
> >>>>> On Fri, 1 Dec 2023, Roger Pau Monné wrote:
> >>>>>> On Thu, Nov 30, 2023 at 07:15:17PM -0800, Stefano Stabellini wrote:
> >>>>>>> On Thu, 30 Nov 2023, Roger Pau Monné wrote:
> >>>>>>>> On Wed, Nov 29, 2023 at 07:53:59PM -0800, Stefano Stabellini wrote:
> >>>>>>>>> On Fri, 24 Nov 2023, Jiqian Chen wrote:
> >>>>>>>>>> This patch is to solve two problems we encountered when we try to
> >>>>>>>>>> passthrough a device to hvm domU base on Xen PVH dom0.
> >>>>>>>>>>
> >>>>>>>>>> First, hvm guest will alloc a pirq and irq for a passthrough device
> >>>>>>>>>> by using gsi, before that, the gsi must first has a mapping in dom0,
> >>>>>>>>>> see Xen code pci_add_dm_done->xc_domain_irq_permission, it will call
> >>>>>>>>>> into Xen and check whether dom0 has the mapping. See
> >>>>>>>>>> XEN_DOMCTL_irq_permission->pirq_access_permitted, "current" is PVH
> >>>>>>>>>> dom0 and it return irq is 0, and then return -EPERM.
> >>>>>>>>>> This is because the passthrough device doesn't do PHYSDEVOP_map_pirq
> >>>>>>>>>> when thay are enabled.
> >>>>>>>>>>
> >>>>>>>>>> Second, in PVH dom0, the gsi of a passthrough device doesn't get
> >>>>>>>>>> registered, but gsi must be configured for it to be able to be
> >>>>>>>>>> mapped into a domU.
> >>>>>>>>>>
> >>>>>>>>>> After searching codes, we can find map_pirq and register_gsi will be
> >>>>>>>>>> done in function vioapic_write_redirent->vioapic_hwdom_map_gsi when
> >>>>>>>>>> the gsi(aka ioapic's pin) is unmasked in PVH dom0. So the problems
> >>>>>>>>>> can be conclude to that the gsi of a passthrough device doesn't be
> >>>>>>>>>> unmasked.
> >>>>>>>>>>
> >>>>>>>>>> To solve the unmaske problem, this patch call the unmask_irq when we
> >>>>>>>>>> assign a device to be passthrough. So that the gsi can get registered
> >>>>>>>>>> and mapped in PVH dom0.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Roger, this seems to be more of a Xen issue than a Linux issue. Why do
> >>>>>>>>> we need the unmask check in Xen? Couldn't we just do:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
> >>>>>>>>> index 4e40d3609a..df262a4a18 100644
> >>>>>>>>> --- a/xen/arch/x86/hvm/vioapic.c
> >>>>>>>>> +++ b/xen/arch/x86/hvm/vioapic.c
> >>>>>>>>> @@ -287,7 +287,7 @@ static void vioapic_write_redirent(
> >>>>>>>>>              hvm_dpci_eoi(d, gsi);
> >>>>>>>>>      }
> >>>>>>>>>  
> >>>>>>>>> -    if ( is_hardware_domain(d) && unmasked )
> >>>>>>>>> +    if ( is_hardware_domain(d) )
> >>>>>>>>>      {
> >>>>>>>>>          /*
> >>>>>>>>>           * NB: don't call vioapic_hwdom_map_gsi while holding hvm.irq_lock
> >>>>>>>>
> >>>>>>>> There are some issues with this approach.
> >>>>>>>>
> >>>>>>>> mp_register_gsi() will only setup the trigger and polarity of the
> >>>>>>>> IO-APIC pin once, so we do so once the guest unmask the pin in order
> >>>>>>>> to assert that the configuration is the intended one.  A guest is
> >>>>>>>> allowed to write all kind of nonsense stuff to the IO-APIC RTE, but
> >>>>>>>> that doesn't take effect unless the pin is unmasked.
> >>>>>>>>
> >>>>>>>> Overall the question would be whether we have any guarantees that
> >>>>>>>> the hardware domain has properly configured the pin, even if it's not
> >>>>>>>> using it itself (as it hasn't been unmasked).
> >>>>>>>>
> >>>>>>>> IIRC PCI legacy interrupts are level triggered and low polarity, so we
> >>>>>>>> could configure any pins that are not setup at bind time?
> >>>>>>>
> >>>>>>> That could work.
> >>>>>>>
> >>>>>>> Another idea is to move only the call to allocate_and_map_gsi_pirq at
> >>>>>>> bind time? That might be enough to pass a pirq_access_permitted check.
> >>>>>>
> >>>>>> Maybe, albeit that would change the behavior of XEN_DOMCTL_bind_pt_irq
> >>>>>> just for PT_IRQ_TYPE_PCI and only when called from a PVH dom0 (as the
> >>>>>> parameter would be a GSI instead of a previously mapped IRQ).  Such
> >>>>>> difference just for PT_IRQ_TYPE_PCI is slightly weird - if we go that
> >>>>>> route I would recommend that we instead introduce a new dmop that has
> >>>>>> this syntax regardless of the domain type it's called from.
> >>>>>
> >>>>> Looking at the code it is certainly a bit confusing. My point was that
> >>>>> we don't need to wait until polarity and trigger are set appropriately
> >>>>> to allow Dom0 to pass successfully a pirq_access_permitted() check. Xen
> >>>>> should be able to figure out that Dom0 is permitted pirq access.
> >>>>
> >>>> The logic is certainly not straightforward, and it could benefit from
> >>>> some comments.
> >>>>
> >>>> The irq permissions are a bit special, in that they get setup when the
> >>>> IRQ is mapped.
> >>>>
> >>>> The problem however is not so much with IRQ permissions, that we can
> >>>> indeed sort out internally in Xen.  Such check in dom0 has the side
> >>>> effect of preventing the IRQ from being assigned to a domU without the
> >>>> hardware source being properly configured AFAICT.
> >>>
> >>> Now I understand why you made a comment previously about Xen having to
> >>> configure trigger and polarity for these interrupts on its own.
> >>>
> >>>
> >>>>> So the idea was to move the call to allocate_and_map_gsi_pirq() earlier
> >>>>> somewhere because allocate_and_map_gsi_pirq doesn't require trigger or
> >>>>> polarity to be configured to work. But the suggestion of doing it a
> >>>>> "bind time" (meaning: XEN_DOMCTL_bind_pt_irq) was a bad idea.
> >>>>>
> >>>>> But maybe we can find another location, maybe within
> >>>>> xen/arch/x86/hvm/vioapic.c, to call allocate_and_map_gsi_pirq() before
> >>>>> trigger and polarity are set and before the interrupt is unmasked.
> >>>>>
> >>>>> Then we change the implementation of vioapic_hwdom_map_gsi to skip the
> >>>>> call to allocate_and_map_gsi_pirq, because by the time
> >>>>> vioapic_hwdom_map_gsi we assume that allocate_and_map_gsi_pirq had
> >>>>> already been done.
> >>>>
> >>>> But then we would end up in a situation where the
> >>>> pirq_access_permitted() check will pass, but the IO-APIC pin won't be
> >>>> configured, which I think it's not what we want.
> >>>>
> >>>> One option would be to allow mp_register_gsi() to be called multiple
> >>>> times, and update the IO-APIC pin configuration as long as the pin is
> >>>> not unmasked.  That would propagate each dom0 RTE update to the
> >>>> underlying IO-APIC.  However such approach relies on dom0 configuring
> >>>> all possible IO-APIC pins, even if no device on dom0 is using them, I
> >>>> think it's not a very reliable option.
> >>>>
> >>>> Another option would be to modify the toolstack to setup the GSI
> >>>> itself using the PHYSDEVOP_setup_gsi hypercall.  As said in a previous
> >>>> email, since we only care about PCI device passthrough the legacy INTx
> >>>> should always be level triggered and low polarity.
> >>>>
> >>>>> I am not familiar with vioapic.c but to give you an idea of what I was
> >>>>> thinking:
> >>>>>
> >>>>>
> >>>>> diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
> >>>>> index 4e40d3609a..16d56fe851 100644
> >>>>> --- a/xen/arch/x86/hvm/vioapic.c
> >>>>> +++ b/xen/arch/x86/hvm/vioapic.c
> >>>>> @@ -189,14 +189,6 @@ static int vioapic_hwdom_map_gsi(unsigned int gsi, unsigned int trig,
> >>>>>          return ret;
> >>>>>      }
> >>>>>  
> >>>>> -    ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq);
> >>>>> -    if ( ret )
> >>>>> -    {
> >>>>> -        gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n",
> >>>>> -                 gsi, ret);
> >>>>> -        return ret;
> >>>>> -    }
> >>>>> -
> >>>>>      pcidevs_lock();
> >>>>>      ret = pt_irq_create_bind(currd, &pt_irq_bind);
> >>>>>      if ( ret )
> >>>>> @@ -287,6 +279,17 @@ static void vioapic_write_redirent(
> >>>>>              hvm_dpci_eoi(d, gsi);
> >>>>>      }
> >>>>>  
> >>>>> +    if ( is_hardware_domain(d) ) 
> >>>>> +    {
> >>>>> +        int pirq = gsi, ret;
> >>>>> +        ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq);
> >>>>> +        if ( ret )
> >>>>> +        {
> >>>>> +            gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n",
> >>>>> +                    gsi, ret);
> >>>>> +            return ret;
> >>>>> +        }
> >>>>> +    }
> >>>>>      if ( is_hardware_domain(d) && unmasked )
> >>>>>      {
> >>>>>          /*
> >>>>
> >>>> As said above, such approach relies on dom0 writing to the IO-APIC RTE
> >>>> of likely each IO-APIC pin, which is IMO not quite reliable.  In there
> >>>> are two different issues here that need to be fixed for PVH dom0:
> >>>>
> >>>>  - Fix the XEN_DOMCTL_irq_permission pirq_access_permitted() call to
> >>>>    succeed for a PVH dom0, even if dom0 is not using the GSI itself.
> >>>
> >>> Yes makes sense
> >>>
> >>>
> >>>>  - Configure IO-APIC pins for PCI interrupts even if dom0 is not using
> >>>>    the IO-APIC pin itself.
> >>>>
> >>>> First one needs to be fixed internally in Xen, second one will require
> >>>> the toolstack to issue an extra hypercall in order to ensure the
> >>>> IO-APIC pin is properly configured.
> >>>  
> >>> On ARM, Xen doesn't need to wait for dom0 to configure interrupts
> >>> correctly. Xen configures them all on its own at boot based on Device
> >>> Tree information. I guess it is not possible to do the same on x86?
> >>
> >> No, not exactly.  There's some interrupt information in the ACPI MADT,
> >> but that's just for very specific sources (Interrupt Source Override
> >> Structures)
> >>
> >> Then on AML devices can have resource descriptors that contain
> >> information about how interrupts are setup.  However Xen is not able
> >> to read any of this information on AML.
> >>
> >> Legacy PCI interrupts are (always?) level triggered and low polarity,
> >> because it's assumed that an interrupt source can be shared between
> >> multiple devices.
> > 
> > Except that as per what you said just in the earlier paragraph ACPI can
> > tell us otherwise.
> > 
> >> I'm however not able to find any reference to this in the PCI spec,
> >> hence I'm reluctant to take this for granted in Xen, and default all
> >> GSIs >= 16 to such mode.
> >>
> >> OTOH legacy PCI interrupts are not that used anymore, as almost all
> >> devices will support MSI(-X) (because PCIe mandates it) and OSes
> >> should prefer the latter.  SR-IOV VF don't even support legacy PCI
> >> interrupts anymore.
> >>
> >>> If
> >>> not, then I can see why we would need 1 extra toolstack hypercall for
> >>> that (or to bundle the operation of configuring IO-APIC pins together
> >>> with an existing toolstack hypercall).
> >>
> >> One suitable compromise would be to default unconfigured GSIs >= 16 to
> >> level-triggered and low-polarity, as I would expect that to work in
> >> almost all cases.  We can always introduce the usage of
> >> PHYSDEVOP_setup_gsi later if required.
> >>
> >> Maybe Jan has more input here, would you agree to defaulting non-ISA
> >> GSIs to level-triggered, low-polarity in the absence of a specific
> >> setup provided by dom0?
> > 
> > Well, such defaulting is an option, but in case it's wrong we might
> > end up with hard to diagnose issues. Personally I'd prefer if we
> > didn't take shortcuts here, i.e. if we followed what Dom0 is able
> > to read from ACPI.
> 
> When PVH dom0 enable a device, it will get trigger and polarity from ACPI (see acpi_pci_irq_enable)
> I have a version of patch which tried that way, see below:
> 
> diff --git a/arch/x86/xen/enlighten_pvh.c b/arch/x86/xen/enlighten_pvh.c
> index ada3868c02c2..43e1bda9f946 100644
> --- a/arch/x86/xen/enlighten_pvh.c
> +++ b/arch/x86/xen/enlighten_pvh.c
> @@ -1,6 +1,7 @@
>  // SPDX-License-Identifier: GPL-2.0
>  #include <linux/acpi.h>
>  #include <linux/export.h>
> +#include <linux/pci.h>
> 
>  #include <xen/hvc-console.h>
> 
> @@ -25,6 +26,127 @@
>  bool __ro_after_init xen_pvh;
>  EXPORT_SYMBOL_GPL(xen_pvh);
> 
> +typedef struct gsi_info {
> +       int gsi;
> +       int trigger;
> +       int polarity;
> +       int pirq;
> +} gsi_info_t;
> +
> +struct acpi_prt_entry {
> +       struct acpi_pci_id      id;
> +       u8                      pin;
> +       acpi_handle             link;
> +       u32                     index;          /* GSI, or link _CRS index */
> +};
> +
> +static int xen_pvh_get_gsi_info(struct pci_dev *dev,
> +                                                               gsi_info_t *gsi_info)
> +{
> +       int gsi;
> +       u8 pin = 0;
> +       struct acpi_prt_entry *entry;
> +       int trigger = ACPI_LEVEL_SENSITIVE;
> +       int polarity = acpi_irq_model == ACPI_IRQ_MODEL_GIC ?
> +                                     ACPI_ACTIVE_HIGH : ACPI_ACTIVE_LOW;
> +
> +       if (dev)
> +               pin = dev->pin;
> +       if (!pin) {
> +               xen_raw_printk("No interrupt pin configured\n");
> +               return -EINVAL;
> +       }
> +
> +       entry = acpi_pci_irq_lookup(dev, pin);
> +       if (entry) {
> +               if (entry->link)
> +                       gsi = acpi_pci_link_allocate_irq(entry->link,
> +                                                        entry->index,
> +                                                        &trigger, &polarity,
> +                                                        NULL);
> +               else
> +                       gsi = entry->index;
> +       } else
> +               return -EINVAL;
> +
> +       gsi_info->gsi = gsi;
> +       gsi_info->trigger = trigger;
> +       gsi_info->polarity = polarity;
> +
> +       return 0;
> +}
> +
> +static int xen_pvh_map_pirq(gsi_info_t *gsi_info)
> +{
> +       struct physdev_map_pirq map_irq;
> +       int ret;
> +
> +       map_irq.domid = DOMID_SELF;
> +       map_irq.type = MAP_PIRQ_TYPE_GSI;
> +       map_irq.index = gsi_info->gsi;
> +       map_irq.pirq = gsi_info->gsi;
> +
> +       ret = HYPERVISOR_physdev_op(PHYSDEVOP_map_pirq, &map_irq);
> +       gsi_info->pirq = map_irq.pirq;
> +
> +       return ret;
> +}
> +
> +static int xen_pvh_unmap_pirq(gsi_info_t *gsi_info)
> +{
> +       struct physdev_unmap_pirq unmap_irq;
> +
> +       unmap_irq.domid = DOMID_SELF;
> +       unmap_irq.pirq = gsi_info->pirq;
> +
> +       return HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq, &unmap_irq);
> +}
> +
> +static int xen_pvh_setup_gsi(gsi_info_t *gsi_info)
> +{
> +       struct physdev_setup_gsi setup_gsi;
> +
> +       setup_gsi.gsi = gsi_info->gsi;
> +       setup_gsi.triggering = (gsi_info->trigger == ACPI_EDGE_SENSITIVE ? 0 : 1);
> +       setup_gsi.polarity = (gsi_info->polarity == ACPI_ACTIVE_HIGH ? 0 : 1);
> +
> +       return HYPERVISOR_physdev_op(PHYSDEVOP_setup_gsi, &setup_gsi);
> +}

Hm, why not simply call pcibios_enable_device() from pciback?  What
you are doing here using the hypercalls is a backdoor into what's done
automatically by Xen on IO-APIC accesses by a PVH dom0.

It will be much more natural for the PVH dom0 model to simply use the
native way to configure and unmask the IO-APIC pin, and that would
correctly setup the triggering/polarity and bind it to dom0 without
requiring the usage of any hypercalls.

Is that an issue since in that case the gsi will get mapped and bound
to dom0?

Otherwise I would prefer if the gsi is just configured from pciback
(PHYSDEVOP_setup_gsi) but not mapped, as allowing a PVH dom0 to map
interrupts using PHYSDEVOP_{,un}map_pirq to itself introduces a new
interface to manage interrupts that clashes with the native way that a
PVH dom0 uses.

Thanks, Roger.
Jiqian Chen Dec. 12, 2023, 6:16 a.m. UTC | #19
On 2023/12/11 23:45, Roger Pau Monné wrote:
> On Wed, Dec 06, 2023 at 06:07:26AM +0000, Chen, Jiqian wrote:
>>
>> When PVH dom0 enable a device, it will get trigger and polarity from ACPI (see acpi_pci_irq_enable)
>> I have a version of patch which tried that way, see below:
>>
>> diff --git a/arch/x86/xen/enlighten_pvh.c b/arch/x86/xen/enlighten_pvh.c
>> index ada3868c02c2..43e1bda9f946 100644
>> --- a/arch/x86/xen/enlighten_pvh.c
>> +++ b/arch/x86/xen/enlighten_pvh.c
>> @@ -1,6 +1,7 @@
>>  // SPDX-License-Identifier: GPL-2.0
>>  #include <linux/acpi.h>
>>  #include <linux/export.h>
>> +#include <linux/pci.h>
>>
>>  #include <xen/hvc-console.h>
>>
>> @@ -25,6 +26,127 @@
>>  bool __ro_after_init xen_pvh;
>>  EXPORT_SYMBOL_GPL(xen_pvh);
>>
>> +typedef struct gsi_info {
>> +       int gsi;
>> +       int trigger;
>> +       int polarity;
>> +       int pirq;
>> +} gsi_info_t;
>> +
>> +struct acpi_prt_entry {
>> +       struct acpi_pci_id      id;
>> +       u8                      pin;
>> +       acpi_handle             link;
>> +       u32                     index;          /* GSI, or link _CRS index */
>> +};
>> +
>> +static int xen_pvh_get_gsi_info(struct pci_dev *dev,
>> +                                                               gsi_info_t *gsi_info)
>> +{
>> +       int gsi;
>> +       u8 pin = 0;
>> +       struct acpi_prt_entry *entry;
>> +       int trigger = ACPI_LEVEL_SENSITIVE;
>> +       int polarity = acpi_irq_model == ACPI_IRQ_MODEL_GIC ?
>> +                                     ACPI_ACTIVE_HIGH : ACPI_ACTIVE_LOW;
>> +
>> +       if (dev)
>> +               pin = dev->pin;
>> +       if (!pin) {
>> +               xen_raw_printk("No interrupt pin configured\n");
>> +               return -EINVAL;
>> +       }
>> +
>> +       entry = acpi_pci_irq_lookup(dev, pin);
>> +       if (entry) {
>> +               if (entry->link)
>> +                       gsi = acpi_pci_link_allocate_irq(entry->link,
>> +                                                        entry->index,
>> +                                                        &trigger, &polarity,
>> +                                                        NULL);
>> +               else
>> +                       gsi = entry->index;
>> +       } else
>> +               return -EINVAL;
>> +
>> +       gsi_info->gsi = gsi;
>> +       gsi_info->trigger = trigger;
>> +       gsi_info->polarity = polarity;
>> +
>> +       return 0;
>> +}
>> +
>> +static int xen_pvh_map_pirq(gsi_info_t *gsi_info)
>> +{
>> +       struct physdev_map_pirq map_irq;
>> +       int ret;
>> +
>> +       map_irq.domid = DOMID_SELF;
>> +       map_irq.type = MAP_PIRQ_TYPE_GSI;
>> +       map_irq.index = gsi_info->gsi;
>> +       map_irq.pirq = gsi_info->gsi;
>> +
>> +       ret = HYPERVISOR_physdev_op(PHYSDEVOP_map_pirq, &map_irq);
>> +       gsi_info->pirq = map_irq.pirq;
>> +
>> +       return ret;
>> +}
>> +
>> +static int xen_pvh_unmap_pirq(gsi_info_t *gsi_info)
>> +{
>> +       struct physdev_unmap_pirq unmap_irq;
>> +
>> +       unmap_irq.domid = DOMID_SELF;
>> +       unmap_irq.pirq = gsi_info->pirq;
>> +
>> +       return HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq, &unmap_irq);
>> +}
>> +
>> +static int xen_pvh_setup_gsi(gsi_info_t *gsi_info)
>> +{
>> +       struct physdev_setup_gsi setup_gsi;
>> +
>> +       setup_gsi.gsi = gsi_info->gsi;
>> +       setup_gsi.triggering = (gsi_info->trigger == ACPI_EDGE_SENSITIVE ? 0 : 1);
>> +       setup_gsi.polarity = (gsi_info->polarity == ACPI_ACTIVE_HIGH ? 0 : 1);
>> +
>> +       return HYPERVISOR_physdev_op(PHYSDEVOP_setup_gsi, &setup_gsi);
>> +}
> 
> Hm, why not simply call pcibios_enable_device() from pciback?  What
pcibios_enable_device had been called when using cmd "xl pci-assignable-add sbdf" from pciback. But it didn't do map_pirq and setup_gsi.
Because pcibios_enable_device-> pcibios_enable_irq-> __acpi_register_gsi(acpi_register_gsi_ioapic PVH specific)
> you are doing here using the hypercalls is a backdoor into what's done
> automatically by Xen on IO-APIC accesses by a PVH dom0.
But the gsi didn't be unmasked, and vioapic_hwdom_map_gsi is never called.
So, I think in pciback, if we can do what vioapic_hwdom_map_gsi does.
> 
> It will be much more natural for the PVH dom0 model to simply use the
> native way to configure and unmask the IO-APIC pin, and that would
> correctly setup the triggering/polarity and bind it to dom0 without
> requiring the usage of any hypercalls.
Do you still prefer that I called unmask_irq in pcistub_init_device, as this v2 patch do?
But Thomas Gleixner think it is not suitable to export unmask_irq.
> 
> Is that an issue since in that case the gsi will get mapped and bound
> to dom0?
Dom0 do map_pirq is to pass the check xc_domain_irq_permission()-> pirq_access_permitted(), 
> 
> Otherwise I would prefer if the gsi is just configured from pciback
> (PHYSDEVOP_setup_gsi) but not mapped, as allowing a PVH dom0 to map
> interrupts using PHYSDEVOP_{,un}map_pirq to itself introduces a new
> interface to manage interrupts that clashes with the native way that a
> PVH dom0 uses.
This method does map_pirq and setup_gsi only when a device is assigned to passthrough, it won't affect the other device using native way.

> 
> Thanks, Roger.
Roger Pau Monné Dec. 12, 2023, 8:49 a.m. UTC | #20
On Tue, Dec 12, 2023 at 06:16:43AM +0000, Chen, Jiqian wrote:
> On 2023/12/11 23:45, Roger Pau Monné wrote:
> > On Wed, Dec 06, 2023 at 06:07:26AM +0000, Chen, Jiqian wrote:
> >>
> >> When PVH dom0 enable a device, it will get trigger and polarity from ACPI (see acpi_pci_irq_enable)
> >> I have a version of patch which tried that way, see below:
> >>
> >> diff --git a/arch/x86/xen/enlighten_pvh.c b/arch/x86/xen/enlighten_pvh.c
> >> index ada3868c02c2..43e1bda9f946 100644
> >> --- a/arch/x86/xen/enlighten_pvh.c
> >> +++ b/arch/x86/xen/enlighten_pvh.c
> >> @@ -1,6 +1,7 @@
> >>  // SPDX-License-Identifier: GPL-2.0
> >>  #include <linux/acpi.h>
> >>  #include <linux/export.h>
> >> +#include <linux/pci.h>
> >>
> >>  #include <xen/hvc-console.h>
> >>
> >> @@ -25,6 +26,127 @@
> >>  bool __ro_after_init xen_pvh;
> >>  EXPORT_SYMBOL_GPL(xen_pvh);
> >>
> >> +typedef struct gsi_info {
> >> +       int gsi;
> >> +       int trigger;
> >> +       int polarity;
> >> +       int pirq;
> >> +} gsi_info_t;
> >> +
> >> +struct acpi_prt_entry {
> >> +       struct acpi_pci_id      id;
> >> +       u8                      pin;
> >> +       acpi_handle             link;
> >> +       u32                     index;          /* GSI, or link _CRS index */
> >> +};
> >> +
> >> +static int xen_pvh_get_gsi_info(struct pci_dev *dev,
> >> +                                                               gsi_info_t *gsi_info)
> >> +{
> >> +       int gsi;
> >> +       u8 pin = 0;
> >> +       struct acpi_prt_entry *entry;
> >> +       int trigger = ACPI_LEVEL_SENSITIVE;
> >> +       int polarity = acpi_irq_model == ACPI_IRQ_MODEL_GIC ?
> >> +                                     ACPI_ACTIVE_HIGH : ACPI_ACTIVE_LOW;
> >> +
> >> +       if (dev)
> >> +               pin = dev->pin;
> >> +       if (!pin) {
> >> +               xen_raw_printk("No interrupt pin configured\n");
> >> +               return -EINVAL;
> >> +       }
> >> +
> >> +       entry = acpi_pci_irq_lookup(dev, pin);
> >> +       if (entry) {
> >> +               if (entry->link)
> >> +                       gsi = acpi_pci_link_allocate_irq(entry->link,
> >> +                                                        entry->index,
> >> +                                                        &trigger, &polarity,
> >> +                                                        NULL);
> >> +               else
> >> +                       gsi = entry->index;
> >> +       } else
> >> +               return -EINVAL;
> >> +
> >> +       gsi_info->gsi = gsi;
> >> +       gsi_info->trigger = trigger;
> >> +       gsi_info->polarity = polarity;
> >> +
> >> +       return 0;
> >> +}
> >> +
> >> +static int xen_pvh_map_pirq(gsi_info_t *gsi_info)
> >> +{
> >> +       struct physdev_map_pirq map_irq;
> >> +       int ret;
> >> +
> >> +       map_irq.domid = DOMID_SELF;
> >> +       map_irq.type = MAP_PIRQ_TYPE_GSI;
> >> +       map_irq.index = gsi_info->gsi;
> >> +       map_irq.pirq = gsi_info->gsi;
> >> +
> >> +       ret = HYPERVISOR_physdev_op(PHYSDEVOP_map_pirq, &map_irq);
> >> +       gsi_info->pirq = map_irq.pirq;
> >> +
> >> +       return ret;
> >> +}
> >> +
> >> +static int xen_pvh_unmap_pirq(gsi_info_t *gsi_info)
> >> +{
> >> +       struct physdev_unmap_pirq unmap_irq;
> >> +
> >> +       unmap_irq.domid = DOMID_SELF;
> >> +       unmap_irq.pirq = gsi_info->pirq;
> >> +
> >> +       return HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq, &unmap_irq);
> >> +}
> >> +
> >> +static int xen_pvh_setup_gsi(gsi_info_t *gsi_info)
> >> +{
> >> +       struct physdev_setup_gsi setup_gsi;
> >> +
> >> +       setup_gsi.gsi = gsi_info->gsi;
> >> +       setup_gsi.triggering = (gsi_info->trigger == ACPI_EDGE_SENSITIVE ? 0 : 1);
> >> +       setup_gsi.polarity = (gsi_info->polarity == ACPI_ACTIVE_HIGH ? 0 : 1);
> >> +
> >> +       return HYPERVISOR_physdev_op(PHYSDEVOP_setup_gsi, &setup_gsi);
> >> +}
> > 
> > Hm, why not simply call pcibios_enable_device() from pciback?  What
> pcibios_enable_device had been called when using cmd "xl pci-assignable-add sbdf" from pciback. But it didn't do map_pirq and setup_gsi.
> Because pcibios_enable_device-> pcibios_enable_irq-> __acpi_register_gsi(acpi_register_gsi_ioapic PVH specific)
> > you are doing here using the hypercalls is a backdoor into what's done
> > automatically by Xen on IO-APIC accesses by a PVH dom0.
> But the gsi didn't be unmasked, and vioapic_hwdom_map_gsi is never called.
> So, I think in pciback, if we can do what vioapic_hwdom_map_gsi does.
>

I see, it does setup the IO-APIC pin but doesn't unmask it, that's
what I feared.

> > It will be much more natural for the PVH dom0 model to simply use the
> > native way to configure and unmask the IO-APIC pin, and that would
> > correctly setup the triggering/polarity and bind it to dom0 without
> > requiring the usage of any hypercalls.
> Do you still prefer that I called unmask_irq in pcistub_init_device, as this v2 patch do?
> But Thomas Gleixner think it is not suitable to export unmask_irq.

Yeah, that wasn't good.

> > 
> > Is that an issue since in that case the gsi will get mapped and bound
> > to dom0?
> Dom0 do map_pirq is to pass the check xc_domain_irq_permission()-> pirq_access_permitted(), 

Can we see about finding another way to fix this check?

One option would be granting permissions over the IRQ in
PHYSDEVOP_setup_gsi?

Otherwise we could see about modifying the logic in PHYSDEVOP_map_pirq
so that the hardware domain can map IRQs to other domains without
having it mapped to itself first?

I think the call to PHYSDEVOP_setup_gsi in pciback is fine, but I
would really like to avoid the usage of PHYSDEVOP_{,un}map_pirq on a
PVH dom0 against itself.

> > 
> > Otherwise I would prefer if the gsi is just configured from pciback
> > (PHYSDEVOP_setup_gsi) but not mapped, as allowing a PVH dom0 to map
> > interrupts using PHYSDEVOP_{,un}map_pirq to itself introduces a new
> > interface to manage interrupts that clashes with the native way that a
> > PVH dom0 uses.
> This method does map_pirq and setup_gsi only when a device is assigned to passthrough, it won't affect the other device using native way.

It's not affected because of the specific usage in Linux, but allowing
the interface to be used against itself (so to manage interrupts
from assigned to dom0) is opening a whole new way to setup interrupts,
and it's unclear to me how that will affect the current way we use to
manage interrupts on a PVH dom0.

Thanks, Roger.
Jan Beulich Dec. 12, 2023, 9:38 a.m. UTC | #21
(I think the Cc list is too long here, but then I don't know who to
keep and who to possibly drop.)

On 12.12.2023 09:49, Roger Pau Monné wrote:
> On Tue, Dec 12, 2023 at 06:16:43AM +0000, Chen, Jiqian wrote:
>> On 2023/12/11 23:45, Roger Pau Monné wrote:
>>> On Wed, Dec 06, 2023 at 06:07:26AM +0000, Chen, Jiqian wrote:
>>>> +static int xen_pvh_setup_gsi(gsi_info_t *gsi_info)
>>>> +{
>>>> +       struct physdev_setup_gsi setup_gsi;
>>>> +
>>>> +       setup_gsi.gsi = gsi_info->gsi;
>>>> +       setup_gsi.triggering = (gsi_info->trigger == ACPI_EDGE_SENSITIVE ? 0 : 1);
>>>> +       setup_gsi.polarity = (gsi_info->polarity == ACPI_ACTIVE_HIGH ? 0 : 1);
>>>> +
>>>> +       return HYPERVISOR_physdev_op(PHYSDEVOP_setup_gsi, &setup_gsi);
>>>> +}
>>>
>>> Hm, why not simply call pcibios_enable_device() from pciback?  What
>> pcibios_enable_device had been called when using cmd "xl pci-assignable-add sbdf" from pciback. But it didn't do map_pirq and setup_gsi.
>> Because pcibios_enable_device-> pcibios_enable_irq-> __acpi_register_gsi(acpi_register_gsi_ioapic PVH specific)
>>> you are doing here using the hypercalls is a backdoor into what's done
>>> automatically by Xen on IO-APIC accesses by a PVH dom0.
>> But the gsi didn't be unmasked, and vioapic_hwdom_map_gsi is never called.
>> So, I think in pciback, if we can do what vioapic_hwdom_map_gsi does.
>>
> 
> I see, it does setup the IO-APIC pin but doesn't unmask it, that's
> what I feared.
> 
>>> It will be much more natural for the PVH dom0 model to simply use the
>>> native way to configure and unmask the IO-APIC pin, and that would
>>> correctly setup the triggering/polarity and bind it to dom0 without
>>> requiring the usage of any hypercalls.
>> Do you still prefer that I called unmask_irq in pcistub_init_device, as this v2 patch do?
>> But Thomas Gleixner think it is not suitable to export unmask_irq.
> 
> Yeah, that wasn't good.
> 
>>>
>>> Is that an issue since in that case the gsi will get mapped and bound
>>> to dom0?
>> Dom0 do map_pirq is to pass the check xc_domain_irq_permission()-> pirq_access_permitted(), 
> 
> Can we see about finding another way to fix this check?
> 
> One option would be granting permissions over the IRQ in
> PHYSDEVOP_setup_gsi?

There's no domain available there, and imo it's also the wrong interface to
possibly grant any permissions.

> Otherwise we could see about modifying the logic in PHYSDEVOP_map_pirq
> so that the hardware domain can map IRQs to other domains without
> having it mapped to itself first?

While this may be possible to arrange for, it still would feel wrong. How
would you express the same in a disaggregated environment? I.e. how would
you make sure a domain trying to grant permission is actually permitted to
do so for the resource in question?

> I think the call to PHYSDEVOP_setup_gsi in pciback is fine, but I
> would really like to avoid the usage of PHYSDEVOP_{,un}map_pirq on a
> PVH dom0 against itself.

+1

Jan
Roger Pau Monné Dec. 12, 2023, 11:18 a.m. UTC | #22
On Tue, Dec 12, 2023 at 10:38:08AM +0100, Jan Beulich wrote:
> (I think the Cc list is too long here, but then I don't know who to
> keep and who to possibly drop.)
> 
> On 12.12.2023 09:49, Roger Pau Monné wrote:
> > On Tue, Dec 12, 2023 at 06:16:43AM +0000, Chen, Jiqian wrote:
> >> On 2023/12/11 23:45, Roger Pau Monné wrote:
> >>> On Wed, Dec 06, 2023 at 06:07:26AM +0000, Chen, Jiqian wrote:
> >>>> +static int xen_pvh_setup_gsi(gsi_info_t *gsi_info)
> >>>> +{
> >>>> +       struct physdev_setup_gsi setup_gsi;
> >>>> +
> >>>> +       setup_gsi.gsi = gsi_info->gsi;
> >>>> +       setup_gsi.triggering = (gsi_info->trigger == ACPI_EDGE_SENSITIVE ? 0 : 1);
> >>>> +       setup_gsi.polarity = (gsi_info->polarity == ACPI_ACTIVE_HIGH ? 0 : 1);
> >>>> +
> >>>> +       return HYPERVISOR_physdev_op(PHYSDEVOP_setup_gsi, &setup_gsi);
> >>>> +}
> >>>
> >>> Hm, why not simply call pcibios_enable_device() from pciback?  What
> >> pcibios_enable_device had been called when using cmd "xl pci-assignable-add sbdf" from pciback. But it didn't do map_pirq and setup_gsi.
> >> Because pcibios_enable_device-> pcibios_enable_irq-> __acpi_register_gsi(acpi_register_gsi_ioapic PVH specific)
> >>> you are doing here using the hypercalls is a backdoor into what's done
> >>> automatically by Xen on IO-APIC accesses by a PVH dom0.
> >> But the gsi didn't be unmasked, and vioapic_hwdom_map_gsi is never called.
> >> So, I think in pciback, if we can do what vioapic_hwdom_map_gsi does.
> >>
> > 
> > I see, it does setup the IO-APIC pin but doesn't unmask it, that's
> > what I feared.
> > 
> >>> It will be much more natural for the PVH dom0 model to simply use the
> >>> native way to configure and unmask the IO-APIC pin, and that would
> >>> correctly setup the triggering/polarity and bind it to dom0 without
> >>> requiring the usage of any hypercalls.
> >> Do you still prefer that I called unmask_irq in pcistub_init_device, as this v2 patch do?
> >> But Thomas Gleixner think it is not suitable to export unmask_irq.
> > 
> > Yeah, that wasn't good.
> > 
> >>>
> >>> Is that an issue since in that case the gsi will get mapped and bound
> >>> to dom0?
> >> Dom0 do map_pirq is to pass the check xc_domain_irq_permission()-> pirq_access_permitted(), 
> > 
> > Can we see about finding another way to fix this check?
> > 
> > One option would be granting permissions over the IRQ in
> > PHYSDEVOP_setup_gsi?
> 
> There's no domain available there, and imo it's also the wrong interface to
> possibly grant any permissions.

Well, the domain is the caller.

> > Otherwise we could see about modifying the logic in PHYSDEVOP_map_pirq
> > so that the hardware domain can map IRQs to other domains without
> > having it mapped to itself first?
> 
> While this may be possible to arrange for, it still would feel wrong. How
> would you express the same in a disaggregated environment? I.e. how would
> you make sure a domain trying to grant permission is actually permitted to
> do so for the resource in question?

I've been looking again at the original issue, and I think I was
confused.  The issue is not that dom0 doesn't have permissions over
the GSIs (as we do grant those in dom0_setup_permissions()), but
rather that XEN_DOMCTL_irq_permission attempts to use
domain_pirq_to_irq() in order to get the IRQ from the PIRQ parameter.

I've always been a bit confused with the PIRQ GSI relation, but IIRC
GSIs are always identity mapped to PIRQs, and hence you could possibly
adjust XEN_DOMCTL_irq_permission to use irq_permit_access() to check
if the caller domain has permissions over the target PIRQ.

Thanks, Roger.
Jan Beulich Dec. 12, 2023, 11:19 a.m. UTC | #23
On 12.12.2023 12:18, Roger Pau Monné wrote:
> On Tue, Dec 12, 2023 at 10:38:08AM +0100, Jan Beulich wrote:
>> (I think the Cc list is too long here, but then I don't know who to
>> keep and who to possibly drop.)
>>
>> On 12.12.2023 09:49, Roger Pau Monné wrote:
>>> On Tue, Dec 12, 2023 at 06:16:43AM +0000, Chen, Jiqian wrote:
>>>> On 2023/12/11 23:45, Roger Pau Monné wrote:
>>>>> On Wed, Dec 06, 2023 at 06:07:26AM +0000, Chen, Jiqian wrote:
>>>>>> +static int xen_pvh_setup_gsi(gsi_info_t *gsi_info)
>>>>>> +{
>>>>>> +       struct physdev_setup_gsi setup_gsi;
>>>>>> +
>>>>>> +       setup_gsi.gsi = gsi_info->gsi;
>>>>>> +       setup_gsi.triggering = (gsi_info->trigger == ACPI_EDGE_SENSITIVE ? 0 : 1);
>>>>>> +       setup_gsi.polarity = (gsi_info->polarity == ACPI_ACTIVE_HIGH ? 0 : 1);
>>>>>> +
>>>>>> +       return HYPERVISOR_physdev_op(PHYSDEVOP_setup_gsi, &setup_gsi);
>>>>>> +}
>>>>>
>>>>> Hm, why not simply call pcibios_enable_device() from pciback?  What
>>>> pcibios_enable_device had been called when using cmd "xl pci-assignable-add sbdf" from pciback. But it didn't do map_pirq and setup_gsi.
>>>> Because pcibios_enable_device-> pcibios_enable_irq-> __acpi_register_gsi(acpi_register_gsi_ioapic PVH specific)
>>>>> you are doing here using the hypercalls is a backdoor into what's done
>>>>> automatically by Xen on IO-APIC accesses by a PVH dom0.
>>>> But the gsi didn't be unmasked, and vioapic_hwdom_map_gsi is never called.
>>>> So, I think in pciback, if we can do what vioapic_hwdom_map_gsi does.
>>>>
>>>
>>> I see, it does setup the IO-APIC pin but doesn't unmask it, that's
>>> what I feared.
>>>
>>>>> It will be much more natural for the PVH dom0 model to simply use the
>>>>> native way to configure and unmask the IO-APIC pin, and that would
>>>>> correctly setup the triggering/polarity and bind it to dom0 without
>>>>> requiring the usage of any hypercalls.
>>>> Do you still prefer that I called unmask_irq in pcistub_init_device, as this v2 patch do?
>>>> But Thomas Gleixner think it is not suitable to export unmask_irq.
>>>
>>> Yeah, that wasn't good.
>>>
>>>>>
>>>>> Is that an issue since in that case the gsi will get mapped and bound
>>>>> to dom0?
>>>> Dom0 do map_pirq is to pass the check xc_domain_irq_permission()-> pirq_access_permitted(), 
>>>
>>> Can we see about finding another way to fix this check?
>>>
>>> One option would be granting permissions over the IRQ in
>>> PHYSDEVOP_setup_gsi?
>>
>> There's no domain available there, and imo it's also the wrong interface to
>> possibly grant any permissions.
> 
> Well, the domain is the caller.

Granting permission to itself?

Jan
Roger Pau Monné Dec. 12, 2023, 11:39 a.m. UTC | #24
On Tue, Dec 12, 2023 at 12:19:49PM +0100, Jan Beulich wrote:
> On 12.12.2023 12:18, Roger Pau Monné wrote:
> > On Tue, Dec 12, 2023 at 10:38:08AM +0100, Jan Beulich wrote:
> >> (I think the Cc list is too long here, but then I don't know who to
> >> keep and who to possibly drop.)
> >>
> >> On 12.12.2023 09:49, Roger Pau Monné wrote:
> >>> On Tue, Dec 12, 2023 at 06:16:43AM +0000, Chen, Jiqian wrote:
> >>>> On 2023/12/11 23:45, Roger Pau Monné wrote:
> >>>>> On Wed, Dec 06, 2023 at 06:07:26AM +0000, Chen, Jiqian wrote:
> >>>>>> +static int xen_pvh_setup_gsi(gsi_info_t *gsi_info)
> >>>>>> +{
> >>>>>> +       struct physdev_setup_gsi setup_gsi;
> >>>>>> +
> >>>>>> +       setup_gsi.gsi = gsi_info->gsi;
> >>>>>> +       setup_gsi.triggering = (gsi_info->trigger == ACPI_EDGE_SENSITIVE ? 0 : 1);
> >>>>>> +       setup_gsi.polarity = (gsi_info->polarity == ACPI_ACTIVE_HIGH ? 0 : 1);
> >>>>>> +
> >>>>>> +       return HYPERVISOR_physdev_op(PHYSDEVOP_setup_gsi, &setup_gsi);
> >>>>>> +}
> >>>>>
> >>>>> Hm, why not simply call pcibios_enable_device() from pciback?  What
> >>>> pcibios_enable_device had been called when using cmd "xl pci-assignable-add sbdf" from pciback. But it didn't do map_pirq and setup_gsi.
> >>>> Because pcibios_enable_device-> pcibios_enable_irq-> __acpi_register_gsi(acpi_register_gsi_ioapic PVH specific)
> >>>>> you are doing here using the hypercalls is a backdoor into what's done
> >>>>> automatically by Xen on IO-APIC accesses by a PVH dom0.
> >>>> But the gsi didn't be unmasked, and vioapic_hwdom_map_gsi is never called.
> >>>> So, I think in pciback, if we can do what vioapic_hwdom_map_gsi does.
> >>>>
> >>>
> >>> I see, it does setup the IO-APIC pin but doesn't unmask it, that's
> >>> what I feared.
> >>>
> >>>>> It will be much more natural for the PVH dom0 model to simply use the
> >>>>> native way to configure and unmask the IO-APIC pin, and that would
> >>>>> correctly setup the triggering/polarity and bind it to dom0 without
> >>>>> requiring the usage of any hypercalls.
> >>>> Do you still prefer that I called unmask_irq in pcistub_init_device, as this v2 patch do?
> >>>> But Thomas Gleixner think it is not suitable to export unmask_irq.
> >>>
> >>> Yeah, that wasn't good.
> >>>
> >>>>>
> >>>>> Is that an issue since in that case the gsi will get mapped and bound
> >>>>> to dom0?
> >>>> Dom0 do map_pirq is to pass the check xc_domain_irq_permission()-> pirq_access_permitted(), 
> >>>
> >>> Can we see about finding another way to fix this check?
> >>>
> >>> One option would be granting permissions over the IRQ in
> >>> PHYSDEVOP_setup_gsi?
> >>
> >> There's no domain available there, and imo it's also the wrong interface to
> >> possibly grant any permissions.
> > 
> > Well, the domain is the caller.
> 
> Granting permission to itself?

See below in the previous email, the issue is not with the
permissions, which are correctly assigned from
dom0_setup_permissions(), but the usage of domain_pirq_to_irq() in
pirq_access_permitted() as called by XEN_DOMCTL_irq_permission.
There's no need to play with the permissions at all.

Regards, Roger.
diff mbox series

Patch

diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c
index 5a96b6c66c07..b83d02bcc76c 100644
--- a/drivers/xen/xen-pciback/pci_stub.c
+++ b/drivers/xen/xen-pciback/pci_stub.c
@@ -357,6 +357,7 @@  static int pcistub_match(struct pci_dev *dev)
 static int pcistub_init_device(struct pci_dev *dev)
 {
 	struct xen_pcibk_dev_data *dev_data;
+	struct irq_desc *desc = NULL;
 	int err = 0;
 
 	dev_dbg(&dev->dev, "initializing...\n");
@@ -399,6 +400,12 @@  static int pcistub_init_device(struct pci_dev *dev)
 	if (err)
 		goto config_release;
 
+	if (xen_initial_domain() && xen_pvh_domain()) {
+		if (dev->irq <= 0 || !(desc = irq_to_desc(dev->irq)))
+			goto config_release;
+		unmask_irq(desc);
+	}
+
 	if (dev->msix_cap) {
 		struct physdev_pci_device ppdev = {
 			.seg = pci_domain_nr(dev->bus),
diff --git a/include/linux/irq.h b/include/linux/irq.h
index 90081afa10ce..44650ca178d9 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -659,6 +659,7 @@  extern void handle_percpu_irq(struct irq_desc *desc);
 extern void handle_percpu_devid_irq(struct irq_desc *desc);
 extern void handle_bad_irq(struct irq_desc *desc);
 extern void handle_nested_irq(unsigned int irq);
+extern void unmask_irq(struct irq_desc *desc);
 
 extern void handle_fasteoi_nmi(struct irq_desc *desc);
 extern void handle_percpu_devid_fasteoi_nmi(struct irq_desc *desc);
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index dc94e0bf2c94..fd67b40b678d 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -439,6 +439,7 @@  void unmask_irq(struct irq_desc *desc)
 		irq_state_clr_masked(desc);
 	}
 }
+EXPORT_SYMBOL_GPL(unmask_irq);
 
 void unmask_threaded_irq(struct irq_desc *desc)
 {
diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
index bcc7f21db9ee..d08e3e7b2819 100644
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -95,7 +95,6 @@  extern void irq_disable(struct irq_desc *desc);
 extern void irq_percpu_enable(struct irq_desc *desc, unsigned int cpu);
 extern void irq_percpu_disable(struct irq_desc *desc, unsigned int cpu);
 extern void mask_irq(struct irq_desc *desc);
-extern void unmask_irq(struct irq_desc *desc);
 extern void unmask_threaded_irq(struct irq_desc *desc);
 
 #ifdef CONFIG_SPARSE_IRQ
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 27ca1c866f29..5977efed31b5 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -380,7 +380,7 @@  struct irq_desc *irq_to_desc(unsigned int irq)
 {
 	return mtree_load(&sparse_irqs, irq);
 }
-#ifdef CONFIG_KVM_BOOK3S_64_HV_MODULE
+#if defined CONFIG_KVM_BOOK3S_64_HV_MODULE || defined CONFIG_XEN_PVH
 EXPORT_SYMBOL_GPL(irq_to_desc);
 #endif