diff mbox series

[v2,4/4] vfio/pci: Allow MMIO regions to be exported through dma-buf

Message ID 4-v2-472615b3877e+28f7-vfio_dma_buf_jgg@nvidia.com
State New
Headers show
Series Allow MMIO regions to be exported through dma-buf | expand

Commit Message

Jason Gunthorpe Aug. 31, 2022, 11:12 p.m. UTC
dma-buf has become a way to safely acquire a handle to non-struct page
memory that can still have lifetime controlled by the exporter. Notably
RDMA can now import dma-buf FDs and build them into MRs which allows for
PCI P2P operations. Extend this to allow vfio-pci to export MMIO memory
from PCI device BARs.

The patch design loosely follows the pattern in commit
db1a8dd916aa ("habanalabs: add support for dma-buf exporter") except this
does not support pinning.

Instead, this implements what, in the past, we've called a revocable
attachment using move. In normal situations the attachment is pinned, as a
BAR does not change physical address. However when the VFIO device is
closed, or a PCI reset is issued, access to the MMIO memory is revoked.

Revoked means that move occurs, but an attempt to immediately re-map the
memory will fail. In the reset case a future move will be triggered when
MMIO access returns. As both close and reset are under userspace control
it is expected that userspace will suspend use of the dma-buf before doing
these operations, the revoke is purely for kernel self-defense against a
hostile userspace.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/vfio/pci/Makefile          |   1 +
 drivers/vfio/pci/dma_buf.c         | 269 +++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_config.c |   8 +-
 drivers/vfio/pci/vfio_pci_core.c   |  28 ++-
 drivers/vfio/pci/vfio_pci_priv.h   |  23 +++
 include/linux/vfio_pci_core.h      |   1 +
 include/uapi/linux/vfio.h          |  18 ++
 7 files changed, 340 insertions(+), 8 deletions(-)
 create mode 100644 drivers/vfio/pci/dma_buf.c

Comments

Jason Gunthorpe Sept. 6, 2022, 11:48 a.m. UTC | #1
On Tue, Sep 06, 2022 at 12:38:44PM +0200, Christian König wrote:
> Am 06.09.22 um 11:51 schrieb Christoph Hellwig:
> > > +{
> > > +	struct vfio_pci_dma_buf *priv = dmabuf->priv;
> > > +	int rc;
> > > +
> > > +	rc = pci_p2pdma_distance_many(priv->vdev->pdev, &attachment->dev, 1,
> > > +				      true);
> > This should just use pci_p2pdma_distance.

OK

> > > +	/*
> > > +	 * Since the memory being mapped is a device memory it could never be in
> > > +	 * CPU caches.
> > > +	 */
> > DMA_ATTR_SKIP_CPU_SYNC doesn't even apply to dma_map_resource, not sure
> > where this wisdom comes from.

Habana driver

> > > +	dma_addr = dma_map_resource(
> > > +		attachment->dev,
> > > +		pci_resource_start(priv->vdev->pdev, priv->index) +
> > > +			priv->offset,
> > > +		priv->dmabuf->size, dir, DMA_ATTR_SKIP_CPU_SYNC);
> > This is not how P2P addresses are mapped.  You need to use
> > dma_map_sgtable and have the proper pgmap for it.
> 
> The problem is once more that this is MMIO space, in other words register
> BARs which needs to be exported/imported.
> 
> Adding struct pages for it generally sounds like the wrong approach here.
> You can't even access this with the CPU or would trigger potentially
> unwanted hardware actions.

Right, this whole thing is the "standard" that dmabuf has adopted
instead of the struct pages. Once the AMD GPU driver started doing
this some time ago other drivers followed.

Now we have struct pages, almost, but I'm not sure if their limits are
compatible with VFIO? This has to work for small bars as well.

Jason
'Christoph Hellwig' Sept. 7, 2022, 3:32 p.m. UTC | #2
On Wed, Sep 07, 2022 at 12:23:28PM -0300, Jason Gunthorpe wrote:
>  2) DMABUF abuses dma_map_resource() for P2P and thus doesn't work in
>     certain special cases.

Not just certain special cases, but one of the main use cases.
Basically P2P can happen in two ways:

 a) through a PCIe switch, or
 b) through connected root ports

The open code version here only supports a), only supports it when there
is no offset between the 'phyiscal' address of the BAR seen PCIe
endpoint and the Linux way.  x86 usually (always?) doesn't have an
offset there, but other architectures often do.

Last but not least I don't really see how the code would even work
when an IOMMU is used, as dma_map_resource will return an IOVA that
is only understood by the IOMMU itself, and not the other endpoint.

How was this code even tested?
Jason Gunthorpe Sept. 7, 2022, 4:12 p.m. UTC | #3
On Wed, Sep 07, 2022 at 08:32:23AM -0700, Christoph Hellwig wrote:
> On Wed, Sep 07, 2022 at 12:23:28PM -0300, Jason Gunthorpe wrote:
> >  2) DMABUF abuses dma_map_resource() for P2P and thus doesn't work in
> >     certain special cases.
> 
> Not just certain special cases, but one of the main use cases.
> Basically P2P can happen in two ways:
>
>  a) through a PCIe switch, or
>  b) through connected root ports

Yes, we tested both, both work.

> The open code version here only supports a), only supports it when there
> is no offset between the 'phyiscal' address of the BAR seen PCIe
> endpoint and the Linux way.  x86 usually (always?) doesn't have an
> offset there, but other architectures often do.

The PCI offset is some embedded thing - I've never seen it in a server
platform.

Frankly, it is just bad SOC design and there is good reason why
non-zero needs to be avoided. As soon as you create aliases between
the address spaces you invite trouble. IIRC a SOC I used once put the
memory at 0 -> 4G then put the only PCI aperture at 4g ->
4g+N. However this design requires 64 bit PCI support, which at the
time, the platform didn't have. So they used PCI offset to hackily
alias the aperture over the DDR. I don't remember if they threw out a
bit of DDR to resolve the alias, or if they just didn't support PCI
switches.

In any case, it is a complete mess. You either drastically limit your
BAR size, don't support PCI switches or loose a lot of DDR.

I also seem to remember that iommu and PCI offset don't play nice
together - so for the VFIO use case where the iommu is present I'm
pretty sure we can very safely assume 0 offset. That seems confirmed
by the fact that VFIO has never handled PCI offset in its own P2P path
and P2P works fine in VMs across a wide range of platforms.

That said, I agree we should really have APIs that support this
properly, and dma_map_resource is certainly technically wrong.

So, would you be OK with this series if I try to make a dma_map_p2p()
that resolves the offset issue?

> Last but not least I don't really see how the code would even work
> when an IOMMU is used, as dma_map_resource will return an IOVA that
> is only understood by the IOMMU itself, and not the other endpoint.

I don't understand this.

__iommu_dma_map() will put the given phys into the iommu_domain
associated with 'dev' and return the IOVA it picked.

Here 'dev' is the importing device, it is the device that will issue
the DMA:

+       dma_addr = dma_map_resource(
+               attachment->dev,
+               pci_resource_start(priv->vdev->pdev, priv->index) +
+                       priv->offset,
+               priv->dmabuf->size, dir, DMA_ATTR_SKIP_CPU_SYNC);

eg attachment->dev is the PCI device of the RDMA device, not the VFIO
device.

'phys' is the CPU physical of the PCI BAR page, which with 0 PCI
offset is the right thing to program into the IO page table.

> How was this code even tested?

It was tested on a few platforms, like I said above, the cases where
it doesn't work are special, largely embedded, and not anything we
have in our labs - AFAIK.

Jason
Robin Murphy Sept. 7, 2022, 4:31 p.m. UTC | #4
On 2022-09-07 16:23, Jason Gunthorpe wrote:
> On Wed, Sep 07, 2022 at 07:29:58AM -0700, Christoph Hellwig wrote:
>> On Wed, Sep 07, 2022 at 09:33:11AM -0300, Jason Gunthorpe wrote:
>>> Yes, you said that, and I said that when the AMD driver first merged
>>> it - but it went in anyhow and now people are using it in a bunch of
>>> places.
>>
>> drm folks made up their own weird rules, if they internally stick
>> to it they have to listen to it given that they ignore review comments,
>> but it violates the scatterlist API and has not business anywhere
>> else in the kernel.  And yes, there probably is a reason or two why
>> the drm code is unusually error prone.
> 
> That may be, but it is creating problems if DRM gets to do X crazy
> thing and nobody else can..
> 
> So, we have two issues here
> 
>   1) DMABUF abuses the scatter list, but this is very constrainted we have
>      this weird special "DMABUF scatterlist" that is only touched by DMABUF
>      importers. The imports signal that they understand the format with
>      a flag. This is ugly and would be nice to clean to a dma mapped
>      address list of some sort.
> 
>      I spent alot of time a few years ago removing driver touches of
>      the SGL and preparing the RDMA stack to do this kind of change, at
>      least.
> 
>   2) DMABUF abuses dma_map_resource() for P2P and thus doesn't work in
>      certain special cases.

FWIW, dma_map_resource() *is* for P2P in general. The classic case of 
one device poking at another's registers that was the original 
motivation is a standalone DMA engine reading/writing a peripheral 
device's FIFO, so the very similar inter-device doorbell signal is 
absolutely in scope too; VRAM might be a slightly greyer area, but if 
it's still not page-backed kernel memory then I reckon that's fair game.

The only trouble is that it's not geared for *PCI* P2P when that may or 
may not happen entirely upstream of IOMMU translation.

Robin.
Jason Gunthorpe Sept. 7, 2022, 4:47 p.m. UTC | #5
On Wed, Sep 07, 2022 at 05:31:14PM +0100, Robin Murphy wrote:

> The only trouble is that it's not geared for *PCI* P2P when that may or may
> not happen entirely upstream of IOMMU translation.

This is why PCI users have to call the pci_distance stuff before using
dma_map_resource(), it ensures the PCI fabric is setup in a way that
is consistent with the iommu. eg if we have IOMMU turned on then the
fabric must have ACS/etc to ensure that all TLPs are translated.

PCI P2P is very complicated and fragile, sadly.

Thanks,
Jason
Dan Williams Sept. 7, 2022, 5:03 p.m. UTC | #6
Christoph Hellwig wrote:
> On Wed, Sep 07, 2022 at 09:33:11AM -0300, Jason Gunthorpe wrote:
> > Yes, you said that, and I said that when the AMD driver first merged
> > it - but it went in anyhow and now people are using it in a bunch of
> > places.
> 
> drm folks made up their own weird rules, if they internally stick
> to it they have to listen to it given that they ignore review comments,
> but it violates the scatterlist API and has not business anywhere
> else in the kernel.  And yes, there probably is a reason or two why
> the drm code is unusually error prone.
> 
> > > Why would small BARs be problematic for the pages?  The pages are more
> > > a problem for gigantic BARs do the memory overhead.
> > 
> > How do I get a struct page * for a 4k BAR in vfio?
> 
> I guess we have different definitions of small then :)
> 
> But unless my understanding of the code is out out of data,
> memremap_pages just requires the (virtual) start address to be 2MB
> aligned, not the size.  Adding Dan for comments.

The minimum granularity for sparse_add_section() that memremap_pages
uses internally is 2MB, so start and end need to be 2MB aligned. Details
here:

ba72b4c8cf60 mm/sparsemem: support sub-section hotplug
'Christoph Hellwig' Sept. 9, 2022, 1:24 p.m. UTC | #7
On Wed, Sep 07, 2022 at 01:12:52PM -0300, Jason Gunthorpe wrote:
> The PCI offset is some embedded thing - I've never seen it in a server
> platform.

That's not actually true, e.g. some power system definitively had it,
althiugh I don't know if the current ones do.

But that's not that point.  The offset is a configuration fully
supported by Linux, and someone that just works by using the proper
APIs.  Doing some handwaiving about embedded only or bad design doesn't
matter.  There is a reason why we have these proper APIs and no one
has any business bypassing them.

> I also seem to remember that iommu and PCI offset don't play nice
> together - so for the VFIO use case where the iommu is present I'm
> pretty sure we can very safely assume 0 offset. That seems confirmed
> by the fact that VFIO has never handled PCI offset in its own P2P path
> and P2P works fine in VMs across a wide range of platforms.

I think the offset is one of the reasons why IOVA windows can be
reserved (and maybe also why ppc is so weird).

> So, would you be OK with this series if I try to make a dma_map_p2p()
> that resolves the offset issue?

Well, if it also solves the other issue of invalid scatterlists leaking
outside of drm we can think about it.

> 
> > Last but not least I don't really see how the code would even work
> > when an IOMMU is used, as dma_map_resource will return an IOVA that
> > is only understood by the IOMMU itself, and not the other endpoint.
> 
> I don't understand this.
> 
> __iommu_dma_map() will put the given phys into the iommu_domain
> associated with 'dev' and return the IOVA it picked.

Yes, __iommu_dma_map creates an IOVA for the mapped remote BAR.  That
is the right thing if the I/O goes through the host bridge, but it is
the wrong thing if the I/O goes through the switch - in that case the
IOVA generated is not something that the endpoint that owns the BAR
can even understand.

Take a look at iommu_dma_map_sg and pci_p2pdma_map_segment to see how
this is handled.
Jason Gunthorpe Sept. 9, 2022, 2:09 p.m. UTC | #8
On Fri, Sep 09, 2022 at 06:24:35AM -0700, Christoph Hellwig wrote:
> On Wed, Sep 07, 2022 at 01:12:52PM -0300, Jason Gunthorpe wrote:
> > The PCI offset is some embedded thing - I've never seen it in a server
> > platform.
> 
> That's not actually true, e.g. some power system definitively had it,
> althiugh I don't know if the current ones do.

I thought those were all power embedded systems.

> There is a reason why we have these proper APIs and no one has any
> business bypassing them.

Yes, we should try to support these things, but you said this patch
didn't work and wasn't tested - that is not true at all.

And it isn't like we have APIs just sitting here to solve this
specific problem. So lets make something.

> > So, would you be OK with this series if I try to make a dma_map_p2p()
> > that resolves the offset issue?
> 
> Well, if it also solves the other issue of invalid scatterlists leaking
> outside of drm we can think about it.

The scatterlist stuff has already leaked outside of DRM anyhow.

Again, I think it is very problematic to let DRM get away with things
and then insist all the poor non-DRM people be responsible to clean up
their mess.

I'm skeptical I can fix AMD GPU, but I can try to create a DMABUF op
that returns something that is not a scatterlist and teach RDMA to use
it. So at least the VFIO/RDMA part can avoid the scatter list abuse. I
expected to need non-scatterlist for iommufd anyhow.

Coupled with a series to add some dma_map_resource_pci() that handles
the PCI_P2PDMA_MAP_BUS_ADDR and the PCI offset, would it be an
agreeable direction?

> Take a look at iommu_dma_map_sg and pci_p2pdma_map_segment to see how
> this is handled.

So there is a bug in all these DMABUF implementations, they do ignore
the PCI_P2PDMA_MAP_BUS_ADDR "distance type".

This isn't a real-world problem for VFIO because VFIO is largely
incompatible with the non-ACS configuration that would trigger
PCI_P2PDMA_MAP_BUS_ADDR, and explains why we never saw any
problem. All our systems have ACS turned on so we can use VFIO.

I'm unclear how Habana or AMD have avoided a problem here..

This is much more serious than the pci offset in my mind.

Thanks,
Jason
diff mbox series

Patch

diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 24c524224da5a3..514c12a81127d6 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -2,6 +2,7 @@ 
 
 vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV_KVM) += vfio_pci_zdev.o
+vfio-pci-core-$(CONFIG_DMA_SHARED_BUFFER) += dma_buf.o
 obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
 
 vfio-pci-y := vfio_pci.o
diff --git a/drivers/vfio/pci/dma_buf.c b/drivers/vfio/pci/dma_buf.c
new file mode 100644
index 00000000000000..11396aeabff405
--- /dev/null
+++ b/drivers/vfio/pci/dma_buf.c
@@ -0,0 +1,269 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES.
+ */
+#include <linux/dma-buf.h>
+#include <linux/pci-p2pdma.h>
+#include <linux/dma-resv.h>
+
+#include "vfio_pci_priv.h"
+
+MODULE_IMPORT_NS(DMA_BUF);
+
+struct vfio_pci_dma_buf {
+	struct dma_buf *dmabuf;
+	struct vfio_pci_core_device *vdev;
+	struct list_head dmabufs_elm;
+	unsigned int index;
+	unsigned int orig_nents;
+	size_t offset;
+	bool revoked;
+};
+
+static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
+				   struct dma_buf_attachment *attachment)
+{
+	struct vfio_pci_dma_buf *priv = dmabuf->priv;
+	int rc;
+
+	rc = pci_p2pdma_distance_many(priv->vdev->pdev, &attachment->dev, 1,
+				      true);
+	if (rc < 0)
+		attachment->peer2peer = false;
+	return 0;
+}
+
+static void vfio_pci_dma_buf_unpin(struct dma_buf_attachment *attachment)
+{
+}
+
+static int vfio_pci_dma_buf_pin(struct dma_buf_attachment *attachment)
+{
+	/*
+	 * Uses the dynamic interface but must always allow for
+	 * dma_buf_move_notify() to do revoke
+	 */
+	return -EINVAL;
+}
+
+static struct sg_table *
+vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
+		     enum dma_data_direction dir)
+{
+	size_t sgl_size = dma_get_max_seg_size(attachment->dev);
+	struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
+	struct scatterlist *sgl;
+	struct sg_table *sgt;
+	dma_addr_t dma_addr;
+	unsigned int nents;
+	size_t offset;
+	int ret;
+
+	dma_resv_assert_held(priv->dmabuf->resv);
+
+	if (!attachment->peer2peer)
+		return ERR_PTR(-EPERM);
+
+	if (priv->revoked)
+		return ERR_PTR(-ENODEV);
+
+	sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);
+	if (!sgt)
+		return ERR_PTR(-ENOMEM);
+
+	nents = DIV_ROUND_UP(priv->dmabuf->size, sgl_size);
+	ret = sg_alloc_table(sgt, nents, GFP_KERNEL);
+	if (ret)
+		goto err_kfree_sgt;
+
+	/*
+	 * Since the memory being mapped is a device memory it could never be in
+	 * CPU caches.
+	 */
+	dma_addr = dma_map_resource(
+		attachment->dev,
+		pci_resource_start(priv->vdev->pdev, priv->index) +
+			priv->offset,
+		priv->dmabuf->size, dir, DMA_ATTR_SKIP_CPU_SYNC);
+	ret = dma_mapping_error(attachment->dev, dma_addr);
+	if (ret)
+		goto err_free_sgt;
+
+	/*
+	 * Break the BAR's physical range up into max sized SGL's according to
+	 * the device's requirement.
+	 */
+	sgl = sgt->sgl;
+	for (offset = 0; offset != priv->dmabuf->size;) {
+		size_t chunk_size = min(priv->dmabuf->size - offset, sgl_size);
+
+		sg_set_page(sgl, NULL, chunk_size, 0);
+		sg_dma_address(sgl) = dma_addr + offset;
+		sg_dma_len(sgl) = chunk_size;
+		sgl = sg_next(sgl);
+		offset += chunk_size;
+	}
+
+	/*
+	 * Because we are not going to include a CPU list we want to have some
+	 * chance that other users will detect this by setting the orig_nents to
+	 * 0 and using only nents (length of DMA list) when going over the sgl
+	 */
+	priv->orig_nents = sgt->orig_nents;
+	sgt->orig_nents = 0;
+	return sgt;
+
+err_free_sgt:
+	sg_free_table(sgt);
+err_kfree_sgt:
+	kfree(sgt);
+	return ERR_PTR(ret);
+}
+
+static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment,
+				   struct sg_table *sgt,
+				   enum dma_data_direction dir)
+{
+	struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
+
+	sgt->orig_nents = priv->orig_nents;
+	dma_unmap_resource(attachment->dev, sg_dma_address(sgt->sgl),
+			   priv->dmabuf->size, dir, DMA_ATTR_SKIP_CPU_SYNC);
+	sg_free_table(sgt);
+	kfree(sgt);
+}
+
+static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
+{
+	struct vfio_pci_dma_buf *priv = dmabuf->priv;
+
+	/*
+	 * Either this or vfio_pci_dma_buf_cleanup() will remove from the list.
+	 * The refcount prevents both.
+	 */
+	if (priv->vdev) {
+		down_write(&priv->vdev->memory_lock);
+		list_del_init(&priv->dmabufs_elm);
+		up_write(&priv->vdev->memory_lock);
+		vfio_device_put(&priv->vdev->vdev);
+	}
+	kfree(priv);
+}
+
+static const struct dma_buf_ops vfio_pci_dmabuf_ops = {
+	.attach = vfio_pci_dma_buf_attach,
+	.map_dma_buf = vfio_pci_dma_buf_map,
+	.pin = vfio_pci_dma_buf_pin,
+	.unpin = vfio_pci_dma_buf_unpin,
+	.release = vfio_pci_dma_buf_release,
+	.unmap_dma_buf = vfio_pci_dma_buf_unmap,
+};
+
+int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
+				  struct vfio_device_feature_dma_buf __user *arg,
+				  size_t argsz)
+{
+	struct vfio_device_feature_dma_buf get_dma_buf;
+	DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
+	struct vfio_pci_dma_buf *priv;
+	int ret;
+
+	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
+				 sizeof(get_dma_buf));
+	if (ret != 1)
+		return ret;
+
+	if (copy_from_user(&get_dma_buf, arg, sizeof(get_dma_buf)))
+		return -EFAULT;
+
+	/* For PCI the region_index is the BAR number like everything else */
+	if (get_dma_buf.region_index >= VFIO_PCI_ROM_REGION_INDEX)
+		return -EINVAL;
+
+	exp_info.ops = &vfio_pci_dmabuf_ops;
+	exp_info.size = pci_resource_len(vdev->pdev, get_dma_buf.region_index);
+	if (!exp_info.size)
+		return -EINVAL;
+	if (get_dma_buf.offset || get_dma_buf.length) {
+		if (get_dma_buf.length > exp_info.size ||
+		    get_dma_buf.offset >= exp_info.size ||
+		    get_dma_buf.length > exp_info.size - get_dma_buf.offset ||
+		    get_dma_buf.offset % PAGE_SIZE ||
+		    get_dma_buf.length % PAGE_SIZE)
+			return -EINVAL;
+		exp_info.size = get_dma_buf.length;
+	}
+	exp_info.flags = get_dma_buf.open_flags;
+
+	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
+	if (!priv)
+		return -ENOMEM;
+	INIT_LIST_HEAD(&priv->dmabufs_elm);
+	priv->offset = get_dma_buf.offset;
+	priv->index = get_dma_buf.region_index;
+
+	exp_info.priv = priv;
+	priv->dmabuf = dma_buf_export(&exp_info);
+	if (IS_ERR(priv->dmabuf)) {
+		ret = PTR_ERR(priv->dmabuf);
+		kfree(priv);
+		return ret;
+	}
+
+	/* dma_buf_put() now frees priv */
+
+	down_write(&vdev->memory_lock);
+	dma_resv_lock(priv->dmabuf->resv, NULL);
+	priv->revoked = !__vfio_pci_memory_enabled(vdev);
+	priv->vdev = vdev;
+	vfio_device_get(&vdev->vdev);
+	list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs);
+	dma_resv_unlock(priv->dmabuf->resv);
+	up_write(&vdev->memory_lock);
+
+	/*
+	 * dma_buf_fd() consumes the reference, when the file closes the dmabuf
+	 * will be released.
+	 */
+	return dma_buf_fd(priv->dmabuf, get_dma_buf.open_flags);
+}
+
+void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
+{
+	struct vfio_pci_dma_buf *priv;
+	struct vfio_pci_dma_buf *tmp;
+
+	lockdep_assert_held_write(&vdev->memory_lock);
+
+	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
+		if (!dma_buf_try_get(priv->dmabuf))
+			continue;
+		if (priv->revoked != revoked) {
+			dma_resv_lock(priv->dmabuf->resv, NULL);
+			priv->revoked = revoked;
+			dma_buf_move_notify(priv->dmabuf);
+			dma_resv_unlock(priv->dmabuf->resv);
+		}
+		dma_buf_put(priv->dmabuf);
+	}
+}
+
+void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
+{
+	struct vfio_pci_dma_buf *priv;
+	struct vfio_pci_dma_buf *tmp;
+
+	down_write(&vdev->memory_lock);
+	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
+		if (!dma_buf_try_get(priv->dmabuf))
+			continue;
+		dma_resv_lock(priv->dmabuf->resv, NULL);
+		list_del_init(&priv->dmabufs_elm);
+		priv->vdev = NULL;
+		priv->revoked = true;
+		dma_buf_move_notify(priv->dmabuf);
+		dma_resv_unlock(priv->dmabuf->resv);
+		vfio_device_put(&vdev->vdev);
+		dma_buf_put(priv->dmabuf);
+	}
+	up_write(&vdev->memory_lock);
+}
diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index d22921efa25987..f8a9c24d04aeb7 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -584,10 +584,12 @@  static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos,
 		virt_mem = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_MEMORY);
 		new_mem = !!(new_cmd & PCI_COMMAND_MEMORY);
 
-		if (!new_mem)
+		if (!new_mem) {
 			vfio_pci_zap_and_down_write_memory_lock(vdev);
-		else
+			vfio_pci_dma_buf_move(vdev, true);
+		} else {
 			down_write(&vdev->memory_lock);
+		}
 
 		/*
 		 * If the user is writing mem/io enable (new_mem/io) and we
@@ -622,6 +624,8 @@  static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos,
 		*virt_cmd &= cpu_to_le16(~mask);
 		*virt_cmd |= cpu_to_le16(new_cmd & mask);
 
+		if (__vfio_pci_memory_enabled(vdev))
+			vfio_pci_dma_buf_move(vdev, false);
 		up_write(&vdev->memory_lock);
 	}
 
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 885706944e4d96..7e27bfc60440c2 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -504,6 +504,8 @@  void vfio_pci_core_close_device(struct vfio_device *core_vdev)
 	vfio_spapr_pci_eeh_release(vdev->pdev);
 	vfio_pci_core_disable(vdev);
 
+	vfio_pci_dma_buf_cleanup(vdev);
+
 	mutex_lock(&vdev->igate);
 	if (vdev->err_trigger) {
 		eventfd_ctx_put(vdev->err_trigger);
@@ -980,7 +982,10 @@  int vfio_pci_try_reset_function(struct vfio_pci_core_device *vdev)
 	 */
 	vfio_pci_set_power_state(vdev, PCI_D0);
 
+	vfio_pci_dma_buf_move(vdev, true);
 	ret = pci_try_reset_function(vdev->pdev);
+	if (__vfio_pci_memory_enabled(vdev))
+		vfio_pci_dma_buf_move(vdev, false);
 	up_write(&vdev->memory_lock);
 
 	return ret;
@@ -1210,11 +1215,10 @@  long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
 
-static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
-				       uuid_t __user *arg, size_t argsz)
+static int vfio_pci_core_feature_token(struct vfio_pci_core_device *vdev,
+				       u32 flags, uuid_t __user *arg,
+				       size_t argsz)
 {
-	struct vfio_pci_core_device *vdev =
-		container_of(device, struct vfio_pci_core_device, vdev);
 	uuid_t uuid;
 	int ret;
 
@@ -1241,9 +1245,14 @@  static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
 int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
 				void __user *arg, size_t argsz)
 {
+	struct vfio_pci_core_device *vdev =
+		container_of(device, struct vfio_pci_core_device, vdev);
+
 	switch (flags & VFIO_DEVICE_FEATURE_MASK) {
 	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
-		return vfio_pci_core_feature_token(device, flags, arg, argsz);
+		return vfio_pci_core_feature_token(vdev, flags, arg, argsz);
+	case VFIO_DEVICE_FEATURE_DMA_BUF:
+		return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz);
 	default:
 		return -ENOTTY;
 	}
@@ -1881,6 +1890,7 @@  void vfio_pci_core_init_device(struct vfio_pci_core_device *vdev,
 	INIT_LIST_HEAD(&vdev->vma_list);
 	INIT_LIST_HEAD(&vdev->sriov_pfs_item);
 	init_rwsem(&vdev->memory_lock);
+	INIT_LIST_HEAD(&vdev->dmabufs);
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_init_device);
 
@@ -2227,11 +2237,17 @@  static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 	 * cause the PCI config space reset without restoring the original
 	 * state (saved locally in 'vdev->pm_save').
 	 */
-	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
+	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
+		vfio_pci_dma_buf_move(cur, true);
 		vfio_pci_set_power_state(cur, PCI_D0);
+	}
 
 	ret = pci_reset_bus(pdev);
 
+	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
+		if (__vfio_pci_memory_enabled(cur))
+			vfio_pci_dma_buf_move(cur, false);
+
 err_undo:
 	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
 		if (cur == cur_mem)
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index 5b1cb9a9838442..c295a1166e7a9c 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -102,4 +102,27 @@  static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
 	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
 }
 
+#ifdef CONFIG_DMA_SHARED_BUFFER
+int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
+				  struct vfio_device_feature_dma_buf __user *arg,
+				  size_t argsz);
+void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev);
+void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked);
+#else
+static int
+vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
+			      struct vfio_device_feature_dma_buf __user *arg,
+			      size_t argsz)
+{
+	return -ENOTTY;
+}
+static inline void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
+{
+}
+static inline void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev,
+					 bool revoked)
+{
+}
+#endif
+
 #endif
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index e5cf0d3313a694..fb9769d9e5d149 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -93,6 +93,7 @@  struct vfio_pci_core_device {
 	struct mutex		vma_lock;
 	struct list_head	vma_list;
 	struct rw_semaphore	memory_lock;
+	struct list_head	dmabufs;
 };
 
 /* Will be exported for vfio pci drivers usage */
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 733a1cddde30a5..1dcfad6dcb6863 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -986,6 +986,24 @@  enum vfio_device_mig_state {
 	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
 };
 
+/**
+ * Upon VFIO_DEVICE_FEATURE_GET create a dma_buf fd for the
+ * region selected.
+ *
+ * open_flags are the typical flags passed to open(2), eg O_RDWR, O_CLOEXEC,
+ * etc. offset/length specify a slice of the region to create the dmabuf from.
+ * If both are 0 then the whole region is used.
+ *
+ * Return: The fd number on success, -1 and errno is set on failure.
+ */
+struct vfio_device_feature_dma_buf {
+	__u32 region_index;
+	__u32 open_flags;
+	__u32 offset;
+	__u64 length;
+};
+#define VFIO_DEVICE_FEATURE_DMA_BUF 3
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**