
[RFC/RFT] vfio/pci: Create feature to disable MSI virtualization

Message ID 20240812170014.1583783-1-alex.williamson@redhat.com
State New
Series [RFC/RFT] vfio/pci: Create feature to disable MSI virtualization

Commit Message

Alex Williamson Aug. 12, 2024, 4:59 p.m. UTC
vfio-pci has always virtualized the MSI address and data registers as
MSI programming is performed through the SET_IRQS ioctl.  Often this
virtualization is not used, and in specific cases can be unhelpful.

One such case where the virtualization is a hindrance is when the
device contains an onboard interrupt controller programmed by the guest
driver.  Userspace VMMs have a chance to quirk this programming,
injecting the host physical MSI information, but only if the userspace
driver can get access to the host physical address and data registers.

This introduces a device feature which allows the userspace driver to
disable virtualization of the MSI capability address and data registers
in order to provide read-only access to the physical values.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=216055
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/vfio/pci/vfio_pci_config.c | 26 ++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_core.c   | 21 +++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_priv.h   |  1 +
 include/uapi/linux/vfio.h          | 14 ++++++++++++++
 4 files changed, 62 insertions(+)
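
For illustration, a minimal userspace sketch (not part of this series) of how
a VMM would enable the proposed feature.  It assumes device_fd is an
already-open vfio device fd and that <linux/vfio.h> carries the
VFIO_DEVICE_FEATURE_PCI_MSI_NOVIRT definition added by this patch; error
handling is elided.

#include <sys/ioctl.h>
#include <linux/vfio.h>

static int enable_msi_novirt(int device_fd)
{
	struct vfio_device_feature feature = {
		.argsz = sizeof(feature),
		/* SET with no payload, matching the feature handler below */
		.flags = VFIO_DEVICE_FEATURE_SET |
			 VFIO_DEVICE_FEATURE_PCI_MSI_NOVIRT,
	};

	return ioctl(device_fd, VFIO_DEVICE_FEATURE, &feature);
}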

Comments

Jason Gunthorpe Aug. 13, 2024, 4:30 p.m. UTC | #1
On Mon, Aug 12, 2024 at 10:59:12AM -0600, Alex Williamson wrote:
> vfio-pci has always virtualized the MSI address and data registers as
> MSI programming is performed through the SET_IRQS ioctl.  Often this
> virtualization is not used, and in specific cases can be unhelpful.
> 
> One such case where the virtualization is a hindrance is when the
> device contains an onboard interrupt controller programmed by the guest
> driver.  Userspace VMMs have a chance to quirk this programming,
> injecting the host physical MSI information, but only if the userspace
> driver can get access to the host physical address and data registers.
> 
> This introduces a device feature which allows the userspace driver to
> disable virtualization of the MSI capability address and data registers
> in order to provide read-only access to the physical values.

Personally, I very much dislike this. Encouraging such hacky driver
use of the interrupt subsystem is not a good direction. Enabling this
in VMs will further complicate fixing the IRQ usages in these drivers
over the long run.

If the device has its own interrupt sources then the device needs to
create an irq_chip and related and hook them up properly. Not hackily
read the MSI-X registers and write them someplace else.

Thomas Gleixner has done a lot of great work recently to clean this up.

So if you imagine the driver is fixed, then this is not necessary.

However, it will still not work in a VM. Making IMS and non-MSI
interrupt controllers work within VMs is still something that needs to
be done.

Jason
Thomas Gleixner Aug. 13, 2024, 5:30 p.m. UTC | #2
On Tue, Aug 13 2024 at 13:30, Jason Gunthorpe wrote:
> On Mon, Aug 12, 2024 at 10:59:12AM -0600, Alex Williamson wrote:
>> vfio-pci has always virtualized the MSI address and data registers as
>> MSI programming is performed through the SET_IRQS ioctl.  Often this
>> virtualization is not used, and in specific cases can be unhelpful.
>> 
>> One such case where the virtualization is a hindrance is when the
>> device contains an onboard interrupt controller programmed by the guest
>> driver.  Userspace VMMs have a chance to quirk this programming,
>> injecting the host physical MSI information, but only if the userspace
>> driver can get access to the host physical address and data registers.
>> 
>> This introduces a device feature which allows the userspace driver to
>> disable virtualization of the MSI capability address and data registers
>> in order to provide read-only access to the physical values.
>
> Personally, I very much dislike this. Encouraging such hacky driver
> use of the interrupt subsystem is not a good direction. Enabling this
> in VMs will further complicate fixing the IRQ usages in these drivers
> over the long run.
>
> If the device has its own interrupt sources then the device needs to
> create an irq_chip and related and hook them up properly. Not hackily
> read the MSI-X registers and write them someplace else.
>
> Thomas Gleixner has done a lot of great work recently to clean this up.
>
> So if you imagine the driver is fixed, then this is not necessary.

Yes. I looked at the ath11k driver when I was reworking the PCI/MSI
subsystem and that's a perfect candidate for a proper device specific
interrupt domain to replace the horrible MSI hackery it has.

> However, it will still not work in a VM. Making IMS and non-MSI
> interrupt controllers work within VMs is still something that needs to
> be done.

Sure, but we really want to do that in a generic way and not based on ad
hoc workarounds.

Did the debate around this go anywhere?

Thanks,

        tglx
Alex Williamson Aug. 13, 2024, 9:14 p.m. UTC | #3
On Tue, 13 Aug 2024 13:30:53 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Mon, Aug 12, 2024 at 10:59:12AM -0600, Alex Williamson wrote:
> > vfio-pci has always virtualized the MSI address and data registers as
> > MSI programming is performed through the SET_IRQS ioctl.  Often this
> > virtualization is not used, and in specific cases can be unhelpful.
> > 
> > One such case where the virtualization is a hindrance is when the
> > device contains an onboard interrupt controller programmed by the guest
> > driver.  Userspace VMMs have a chance to quirk this programming,
> > injecting the host physical MSI information, but only if the userspace
> > driver can get access to the host physical address and data registers.
> > 
> > This introduces a device feature which allows the userspace driver to
> > disable virtualization of the MSI capability address and data registers
> > in order to provide read-only access to the physical values.
> 
> Personally, I very much dislike this. Encouraging such hacky driver
> use of the interrupt subsystem is not a good direction. Enabling this
> in VMs will further complicate fixing the IRQ usages in these drivers
> over the long run.

Clearly these _guest_ drivers are doing this regardless of the
interfaces provided by vfio, so I don't see how we're encouraging hacky
driver behavior, especially when it comes to Windows guest drivers.

> > If the device has its own interrupt sources then the device needs to
> create an irq_chip and related and hook them up properly. Not hackily
> read the MSI-X registers and write them someplace else.

This is how the hardware works, regardless of whether the guest driver
represents the hardware using an irq_chip.

> > Thomas Gleixner has done a lot of great work recently to clean this up.
> 
> So if you imagine the driver is fixed, then this is not necessary.

How so?  Regardless of the guest driver structure, something is writing
the MSI address and data values elsewhere in the device.  AFAICT the
only way to avoid needing to fixup those values is to give the guest
ownership of the address space as you suggested in the other patch.
That also seems to have a pile of issues though.

> > However, it will still not work in a VM. Making IMS and non-MSI
> > interrupt controllers work within VMs is still something that needs to
> be done.

Making it work in a VM is sort of the point here.  Thanks,

Alex
Jason Gunthorpe Aug. 13, 2024, 11:16 p.m. UTC | #4
On Tue, Aug 13, 2024 at 03:14:01PM -0600, Alex Williamson wrote:

> > Personally, I very much dislike this. Encouraging such hacky driver
> > use of the interrupt subsystem is not a good direction. Enabling this
> > in VMs will further complicate fixing the IRQ usages in these drivers
> > over the long run.
> 
> Clearly these _guest_ drivers are doing this regardless of the
> interfaces provided by vfio, so I don't see how we're encouraging hacky
> driver behavior, especially when it comes to Windows guest drivers.

Because people will then say the Linux driver can't be fixed to
properly use an irq_domain/etc, since the only option that works in
VMs will be the hacky copy-from-MSI-X approach :\

> > Thomas Gleixner has done a lot of great work recently to clean this up.
> > 
> > So if you imagine the driver is fixed, then this is not necessary.
> 
> How so? 

Because if the driver is properly using the new irq_domain/etc
infrastructure to model its additional interrupt source then this
patch won't make it work in the VM anyhow, so it is not necessary..

Your other patch would be the only short term answer.

Jason
Jason Gunthorpe Aug. 13, 2024, 11:39 p.m. UTC | #5
On Tue, Aug 13, 2024 at 07:30:41PM +0200, Thomas Gleixner wrote:
> > However, it will still not work in a VM. Making IMS and non-MSI
> > interrupt controllers work within VMs is still something that needs to
> > be done.
> 
> Sure, but we really want to do that in a generic way and not based on ad
> hoc workarounds.
>
> Did the debate around this go anywhere?

No, it got stuck on the impossible situation that there is no existing
way for the VM to have any idea if IMS will work or is broken. Recall
Intel was planning to "solve" this by sticking a DVSEC in their
virtual config space that said to turn off IMS :\

So using IMS in the real world looked impractical and interest faded a
bit.

But the underlying reasons for IMS haven't gone away and more work is
coming that will bring it up again...

Jason
Alex Williamson Aug. 14, 2024, 2:55 p.m. UTC | #6
On Tue, 13 Aug 2024 20:16:42 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Tue, Aug 13, 2024 at 03:14:01PM -0600, Alex Williamson wrote:
> 
> > > Personally, I very much dislike this. Encouraging such hacky driver
> > > use of the interrupt subsystem is not a good direction. Enabling this
> > > in VMs will further complicate fixing the IRQ usages in these drivers
> > > over the long run.  
> > 
> > Clearly these _guest_ drivers are doing this regardless of the
> > interfaces provided by vfio, so I don't see how we're encouraging hacky
> > driver behavior, especially when it comes to Windows guest drivers.  
> 
> Because people will then say the Linux driver can't be fixed to
> properly use an irq_domain/etc, since the only option that works in
> VMs will be the hacky copy-from-MSI-X approach :\

Ironically QEMU already has direct access to the MSI-X vector table in
MMIO space and could implement this type of quirk with no kernel
changes.  It's MSI that is now blocked by virtualization of the address
and data registers.  Note also that QEMU is still virtualizing these
registers; the values seen in the guest are unchanged.  It's only the
VMM that can bypass that virtualization to see the host values.
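
A rough sketch of what that bypass looks like from the VMM side once the
feature has been set (illustrative only, not taken from the QEMU patch): the
host values become readable through the ordinary vfio PCI config space
region.  Here cfg_offset is the region offset reported by
VFIO_DEVICE_GET_REGION_INFO for VFIO_PCI_CONFIG_REGION_INDEX, msi_cap is the
MSI capability offset, and a 64-bit MSI capability is assumed.

#include <unistd.h>
#include <stdint.h>
#include <linux/pci_regs.h>

static int read_host_msi(int device_fd, off_t cfg_offset, uint8_t msi_cap,
			 uint64_t *addr, uint16_t *data)
{
	uint32_t lo, hi;

	/* Without the feature these reads return the emulated values */
	if (pread(device_fd, &lo, sizeof(lo),
		  cfg_offset + msi_cap + PCI_MSI_ADDRESS_LO) != sizeof(lo))
		return -1;
	if (pread(device_fd, &hi, sizeof(hi),
		  cfg_offset + msi_cap + PCI_MSI_ADDRESS_HI) != sizeof(hi))
		return -1;
	if (pread(device_fd, data, sizeof(*data),
		  cfg_offset + msi_cap + PCI_MSI_DATA_64) != sizeof(*data))
		return -1;

	*addr = ((uint64_t)hi << 32) | lo;
	return 0;
}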

Let's imagine the guest driver does change to implement an irq_domain.
How does that fundamentally change the problem for the VMM that guest
MSI values are being written to other portions of the device?  The
guest driver can have whatever architecture it wants (we don't know
the architecture of the Windows driver) but we still need to trap
writes of the guest MSI address/data and replace it with host values.

> > > Thomas Gleixner has done a lot of great work recently to clean this up.
> > > 
> > > So if you imagine the driver is fixed, then this is not necessary.  
> > 
> > How so?   
> 
> Because if the driver is properly using the new irq_domain/etc
> infrastructure to model its additional interrupt source then this
> patch won't make it work in the VM anyhow, so it is not necessary..
> 
> Your other patch would be the only short term answer.

The QEMU patch relies on this kernel patch in order to be able to
access the host physical MSI address and data values through the vfio
interface.  Otherwise QEMU has no host values with which to patch up
guest values.  As noted above, this does not provide any visible change
to a QEMU guest, it only enables QEMU to implement the quirk in the
other patch.  Thanks,

Alex
Jason Gunthorpe Aug. 14, 2024, 3:20 p.m. UTC | #7
On Wed, Aug 14, 2024 at 08:55:05AM -0600, Alex Williamson wrote:
> Let's imagine the guest driver does change to implement an irq_domain.
> How does that fundamentally change the problem for the VMM that guest
> MSI values are being written to other portions of the device?

If changed to irq_domain the VM will write addr/data pairs into those
special registers that are unique to that interrupt source and will not
re-use values already set in the MSI table.

This means the VMM doesn't get any value from inspecting the MSI table
because the value it needs won't be there, and also that no interrupt
routing will have been set up. The VMM must call VFIO_DEVICE_SET_IRQS
to set up the unique routing.

These two patches are avoiding VFIO_DEVICE_SET_IRQS based on the
assumption that the VM will re-use an addr/data pair already set up in
the MSI table. Invalidating that assumption is the fundamental change
irq_domain in the VM will make.
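
For concreteness, the SET_IRQS call being referred to looks roughly like this
from userspace (a hedged sketch, not taken from either patch; it routes a
single vector to an eventfd that the VMM then forwards, e.g. into KVM via an
irqfd):

#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int set_msi_trigger(int device_fd, int efd)
{
	char buf[sizeof(struct vfio_irq_set) + sizeof(int)];
	struct vfio_irq_set *set = (struct vfio_irq_set *)buf;

	set->argsz = sizeof(buf);
	set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
	set->index = VFIO_PCI_MSI_IRQ_INDEX;	/* a per-device IMS index would slot in here */
	set->start = 0;
	set->count = 1;
	memcpy(set->data, &efd, sizeof(int));

	/* The kernel picks and programs the physical addr/data pair */
	return ioctl(device_fd, VFIO_DEVICE_SET_IRQS, set);
}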

> The guest driver can have whatever architecture it wants (we don't
> know the architecture of the Windows driver) but we still need to
> trap writes of the guest MSI address/data and replace it with host
> values.

Yes you do. But the wrinkle is you can't just assume one of the
existing MSI entries is a valid replacement and copy from the MSI
table. That works right now only because the Linux/Windows driver is
re-using an MSI vector in the IMS registers.

I suggest the general path is something like:

 1) A vfio variant driver sets up an irq_domain for the additional
    interrupt source registers
 2) Somehow wire up VFIO_DEVICE_SET_IRQS so it can target vectors in
    the additional interrupt domain
 3) Have the VMM trap writes to the extra interrupt source registers
    and execute VFIO_DEVICE_SET_IRQS
 4) IRQ layer will set up an appropriate unique IRQ and route it to the
    guest/whatever just like MSI. Callbacks into the variant driver's
    irq_domain will program the HW registers.

Basically exactly the same flow as MSI, except instead of targeting a
vector in the PCI core's MSI irq_domain it targets a vector in the
variant driver's IMS IRQ domain.

Then we don't make any assumptions about how the VM is using these
interrupt vectors, and crucially, SET_IRQs is called for every
interrupt source and we rely on the kernel to produce the correct
addr/data pair. No need for copying addr/data pairs from MSI tables.
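
A very rough sketch of what step 1) could look like, using the IMS helpers
the PCI/MSI core grew during the irq_domain rework (pci_create_ims_domain()
and friends); the chip name, vector count and the register programming are
placeholders, and mask/unmask wiring is omitted:

#include <linux/irq.h>
#include <linux/msi.h>
#include <linux/pci.h>

static void my_ims_write_msi_msg(struct irq_data *data, struct msi_msg *msg)
{
	/*
	 * Program msg->address_hi / msg->address_lo / msg->data into the
	 * device's interrupt source slot selected by data->hwirq.  The
	 * core hands us a unique, already-routed addr/data pair, so
	 * nothing is copied out of the MSI/MSI-X tables.
	 */
}

static const struct msi_domain_template my_ims_template = {
	.chip = {
		.name			= "my-variant-ims",
		.irq_write_msi_msg	= my_ims_write_msi_msg,
		/* .irq_mask / .irq_unmask as the hardware requires */
	},
	.info = {
		.bus_token	= DOMAIN_BUS_PCI_DEVICE_IMS,
	},
};

static int my_variant_init_ims(struct pci_dev *pdev)
{
	/* One secondary interrupt domain covering the device's extra vectors */
	if (!pci_create_ims_domain(pdev, &my_ims_template, 32, NULL))
		return -ENODEV;
	return 0;
}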

> As noted above, this does not provide any visible change to a QEMU
> guest, it only enables QEMU to implement the quirk in the other
> patch.

I see, I definitely didn't understand from the commit message that it
only reaches QEMU.

Jason
David Woodhouse Dec. 13, 2024, 9:10 a.m. UTC | #8
On Tue, 2024-08-13 at 19:30 +0200, Thomas Gleixner wrote:
> On Tue, Aug 13 2024 at 13:30, Jason Gunthorpe wrote:
> > On Mon, Aug 12, 2024 at 10:59:12AM -0600, Alex Williamson wrote:
> > > vfio-pci has always virtualized the MSI address and data registers as
> > > MSI programming is performed through the SET_IRQS ioctl.  Often this
> > > virtualization is not used, and in specific cases can be unhelpful.
> > > 
> > > One such case where the virtualization is a hindrance is when the
> > > device contains an onboard interrupt controller programmed by the guest
> > > driver.  Userspace VMMs have a chance to quirk this programming,
> > > injecting the host physical MSI information, but only if the userspace
> > > driver can get access to the host physical address and data registers.
> > > 
> > > This introduces a device feature which allows the userspace driver to
> > > disable virtualization of the MSI capability address and data registers
> > > in order to provide read-only access to the physical values.
> > 
> > Personally, I very much dislike this. Encouraging such hacky driver
> > use of the interrupt subsystem is not a good direction. Enabling this
> > in VMs will further complicate fixing the IRQ usages in these drivers
> > over the long run.
> > 
> > If the device has its own interrupt sources then the device needs to
> > create an irq_chip and related and hook them up properly. Not hackily
> > read the MSI-X registers and write them someplace else.
> > 
> > Thomas Gleixner has done a lot of great work recently to clean this up.
> > 
> > So if you imagine the driver is fixed, then this is not necessary.
> 
> Yes. I looked at the ath11k driver when I was reworking the PCI/MSI
> subsystem and that's a perfect candidate for a proper device specific
> interrupt domain to replace the horrible MSI hackery it has.

The ath11k hacks may be awful, but in their defence, that's because the
whole way the hardware works is awful.

Q: With PCI passthrough to a guest, how does the guest OS tell the
device where to do DMA?

A: The guest OS just hands the device a guest physical address and the
IOMMU does the rest. Nothing 'intercedes' between the guest and the
device to mess with that address.

Q: MSIs are just DMA. So with PCI passthrough to a guest, how does the
guest OS configure the device's MSIs? 

<fantasy>
A: The guest OS just hands the device a standard MSI message encoding
the target guest APIC ID and vector (etc.), and the IOMMU does the
rest. Nothing 'intercedes' between the guest and the device to mess
with that MSI message.

And thus ath11k didn't need to do *any* hacks to work around a stupid
hardware design with the VMM snooping on stuff it ideally shouldn't
have had any business touching in the first place.

Posted interrupts are almost the *default* because the IOMMU receives a
<source-id, vCPU APIC ID, vector> tuple on the bus. If receiving an
interrupt for a vCPU which isn't currently running, that's when the
IOMMU sets a bit in a table somewhere and notifies the host OS.

All that special case MSI handling and routing code that I had
nightmares about because it fell through a wormhole from a parallel
universe, doesn't exist.

And look, DPDK drivers which run in polling mode and 'abuse' MSIs by
using real memory addresses and asking the device to "write <these> 32
bits to <this> structure if you want attention" just work nicely in
virtual machines too, just as they do on real hardware.
</fantasy>

/me wakes up...

Shit.

And we have to enable this Interrupt Remapping crap even to address
more than 255 CPUs *without* virtualization? Even a *guest* has to see
a virtual IOMMU and enable Interrupt Remapping to be able to use more
than 255 vCPUs? Even though there were a metric shitload of spare bits
in the MSI message we could have used¹.

Wait, so that means we have to offer an IOMMU with *DMA* remapping to
guests, which means 2-stage translations and/or massive overhead, just
for that guest to be able to use >255 vCPUs?

Screw you all, I'm going back to bed.



¹ And *should* use, if we ever do something similar like, say, expand
  the vector# space past 8 bits. Intel and AMD take note.

Patch

diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index 97422aafaa7b..5f86e75ea6ca 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -1259,6 +1259,32 @@  static int vfio_msi_cap_len(struct vfio_pci_core_device *vdev, u8 pos)
 	return len;
 }
 
+/* Disable virtualization of the MSI address and data fields */
+int vfio_pci_msi_novirt(struct vfio_pci_core_device *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	struct perm_bits *perm = vdev->msi_perm;
+	u16 flags;
+	int ret;
+
+	if (!perm)
+		return -EINVAL;
+
+	ret = pci_read_config_word(pdev, pdev->msi_cap + PCI_MSI_FLAGS, &flags);
+	if (ret)
+		return pcibios_err_to_errno(ret);
+
+	p_setd(perm, PCI_MSI_ADDRESS_LO, NO_VIRT, NO_WRITE);
+	if (flags & PCI_MSI_FLAGS_64BIT) {
+		p_setd(perm, PCI_MSI_ADDRESS_HI, NO_VIRT, NO_WRITE);
+		p_setw(perm, PCI_MSI_DATA_64, (u16)NO_VIRT, (u16)NO_WRITE);
+	} else {
+		p_setw(perm, PCI_MSI_DATA_32, (u16)NO_VIRT, (u16)NO_WRITE);
+	}
+
+	return 0;
+}
+
 /* Determine extended capability length for VC (2 & 9) and MFVC */
 static int vfio_vc_cap_len(struct vfio_pci_core_device *vdev, u16 pos)
 {
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index ba0ce0075b2f..acdced212be2 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1518,6 +1518,24 @@  static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
 	return 0;
 }
 
+static int vfio_pci_core_feature_msi_novirt(struct vfio_device *device,
+					    u32 flags, void __user *arg,
+					    size_t argsz)
+{
+	struct vfio_pci_core_device *vdev =
+		container_of(device, struct vfio_pci_core_device, vdev);
+	int ret;
+
+	if (!vdev->msi_perm)
+		return -ENOTTY;
+
+	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET, 0);
+	if (ret != 1)
+		return ret;
+
+	return vfio_pci_msi_novirt(vdev);
+}
+
 int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
 				void __user *arg, size_t argsz)
 {
@@ -1531,6 +1549,9 @@  int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
 		return vfio_pci_core_pm_exit(device, flags, arg, argsz);
 	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
 		return vfio_pci_core_feature_token(device, flags, arg, argsz);
+	case VFIO_DEVICE_FEATURE_PCI_MSI_NOVIRT:
+		return vfio_pci_core_feature_msi_novirt(device, flags,
+							arg, argsz);
 	default:
 		return -ENOTTY;
 	}
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index 5e4fa69aee16..6e6cc74c6579 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -53,6 +53,7 @@  int vfio_pci_ioeventfd(struct vfio_pci_core_device *vdev, loff_t offset,
 
 int vfio_pci_init_perm_bits(void);
 void vfio_pci_uninit_perm_bits(void);
+int vfio_pci_msi_novirt(struct vfio_pci_core_device *vdev);
 
 int vfio_config_init(struct vfio_pci_core_device *vdev);
 void vfio_config_free(struct vfio_pci_core_device *vdev);
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 2b68e6cdf190..ddf5dd9245fb 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1458,6 +1458,20 @@  struct vfio_device_feature_bus_master {
 };
 #define VFIO_DEVICE_FEATURE_BUS_MASTER 10
 
+/**
+ * Toggle virtualization of PCI MSI address and data fields off.  By default
+ * vfio-pci-core based drivers virtualize the MSI address and data fields of
+ * the MSI capability to emulate direct access to the device, i.e. writes are
+ * allowed and buffered where subsequent reads return the buffered data.
+ * VMMs often virtualize these registers anyway and there are cases in user-
+ * space where having access to the host MSI fields can be useful, such as
+ * quirking an embedded interrupt controller on the device to generate physical
+ * MSI interrupts.  Upon VFIO_DEVICE_FEATURE_SET of the PCI_MSI_NOVIRT feature
+ * this virtualization is disabled; reads of the MSI address and data fields
+ * will return the physical values and writes are dropped.
+ */
+#define VFIO_DEVICE_FEATURE_PCI_MSI_NOVIRT 11
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**