mbox series

[RFC,v2,0/4] vfio/hisilicon: add acc live migration driver

Message ID 20210702095849.1610-1-shameerali.kolothum.thodi@huawei.com
Headers show
Series vfio/hisilicon: add acc live migration driver | expand

Message

Shameerali Kolothum Thodi July 2, 2021, 9:58 a.m. UTC
Hi,

This series attempts to add vfio live migration support for
HiSilicon ACC VF devices. HiSilicon ACC VF device MMIO space
includes both the functional register space and migration 
control register space. As discussed in RFCv1[0], this may create
security issues as these regions get shared between the Guest
driver and the migration driver. Based on the feedback, we tried
to address those concerns in this version. 

This is now based on the new vfio-pci-core framework proposal[1].
Understand that the framework proposal is still under discussion,
but really appreciate any feedback on the approach taken here
to mitigate the security risks.

The major changes from v1 are,

 -Adds a new vendor-specific vfio_pci driver(hisi-acc-vfio-pci)
  for HiSilicon ACC VF devices based on the new vfio-pci-core
  framework proposal.

 -Since HiSilicon ACC VF device MMIO space contains both the
  functional register space and migration control register space,
  override the vfio_device_ops ioctl method to report only the
  functional space to VMs.

 -For a successful migration, we still need access to VF dev
  functional register space mainly to read the status registers.
  But accessing these while the Guest vCPUs are running may leave
  a security hole. To avoid any potential security issues, we
  map/unmap the MMIO regions on a need basis and is safe to do so.
  (Please see hisi_acc_vf_ioremap/unmap() fns in patch #4).
 
 -Dropped debugfs support for now.
 -Uses common QM functions for mailbox access(patch #3).

This is now sanity tested on HiSilicon platforms that support these
ACC devices.

From v1:
 -In order to ensure the compatibility of the devices before and
  after the migration, the device compatibility information check
  will be performed in the Pre-copy stage. If the check fails,
  an error will be returned and the source VM will exit the migration.
 -After the compatibility check is passed, it will enter the
  Stop-and-copy stage. At this time, all the live migration data
  will be copied and saved to the VF device of the destination.
  Finally, the VF device of the destination will be started.

Thanks,
Shameer
[0] https://lore.kernel.org/lkml/20210415220137.GA1672608@nvidia.com/
[1] https://lore.kernel.org/lkml/20210603160809.15845-1-mgurtovoy@nvidia.com/

Longfang Liu (2):
  crypto: hisilicon/qm - Export mailbox functions for common use
  hisi_acc_vfio_pci: Add support for vfio live migration

Shameer Kolothum (2):
  hisi-acc-vfio-pci: add new vfio_pci driver for HiSilicon ACC devices
  hisi_acc_vfio_pci: Override ioctl method to limit BAR2 region size

 drivers/crypto/hisilicon/qm.c        |    8 +-
 drivers/crypto/hisilicon/qm.h        |    4 +
 drivers/vfio/pci/Kconfig             |   13 +
 drivers/vfio/pci/Makefile            |    2 +
 drivers/vfio/pci/hisi_acc_vfio_pci.c | 1155 ++++++++++++++++++++++++++
 drivers/vfio/pci/hisi_acc_vfio_pci.h |  144 ++++
 6 files changed, 1323 insertions(+), 3 deletions(-)
 create mode 100644 drivers/vfio/pci/hisi_acc_vfio_pci.c
 create mode 100644 drivers/vfio/pci/hisi_acc_vfio_pci.h

-- 
2.17.1

Comments

Jason Gunthorpe Feb. 3, 2022, 3:18 p.m. UTC | #1
On Wed, Feb 02, 2022 at 07:05:02PM +0000, Joao Martins wrote:
> On 2/2/22 17:03, Jason Gunthorpe wrote:

> > how to integrate that with the iommufd work, which I hope will allow
> > that series, and the other IOMMU drivers that can support this to be
> > merged..
> 
> The iommu-fd thread wasn't particularly obvious on how dirty tracking is done
> there, but TBH I am not up to speed on iommu-fd yet so I missed something
> obvious for sure. When you say 'integrate that with the iommufd' can you
> expand on that?

The general idea is that iommufd is the place to put all the iommu
driver uAPI for consumption by userspace. The IOMMU feature of dirty
tracking would belong there.

So, some kind of API needs to be designed to meet the needs of the
IOMMU drivers.

> Did you meant to use interface in the link, or perhaps VFIO would use an iommufd
> /internally/ but still export the same UAPI as VFIO dirty tracking ioctls() (even if it's
> not that efficient with a lot of bitmap copying). And optionally give a iommu_fd for the
> VMM to scan iommu pagetables itself and see what was marked dirty or
> not?

iommufd and VFIO container's don't co-exist, either iommufd is
providing the IOMMU interface, or the current type 1 code - not both
together.

iommfd current approach presents the same ABI as the type1 container
as compatability, and it is a possible direction to provide the
iommu_domain stored dirty bits through that compat API.

But, as you say, it looks unnatural and inefficient when the domain
itself is storing the dirty bits inside the IOPTE.

It need some study I haven't got into yet :)

> My over-simplistic/naive view was that the proposal in the link
> above sounded a lot simpler. While iommu-fd had more longevity for
> many other usecases outside dirty tracking, no?

I'd prefer we don't continue to hack on the type1 code if iommufd is
expected to take over in this role - especially for a multi-vendor
feature like dirty tracking.

It is actually a pretty complicated topic because migration capable
PCI devices are also include their own dirty tracking HW, all this
needs to be harmonized somehow. VFIO proposed to squash everything
into the container code, but I've been mulling about having iommufd
only do system iommu and push the PCI device internal tracking over to
VFIO.

> I have a PoC-ish using the interface in the link, with AMD IOMMU
> dirty bit supported (including Qemu emulated amd-iommu for folks
> lacking the hardware). Albeit the eager-spliting + collapsing of
> IOMMU hugepages is not yet done there, and I wanted to play around
> the emulated intel-iommu SLADS from specs looks quite similar. Happy
> to join existing effort anyways.

This sounds great, I would love to bring the AMD IOMMU along with a
dirty tracking implementation! Can you share some patches so we can
see what the HW implementation looks like?

Jason
Joao Martins Feb. 4, 2022, 7:53 p.m. UTC | #2
On 2/3/22 15:18, Jason Gunthorpe wrote:
> On Wed, Feb 02, 2022 at 07:05:02PM +0000, Joao Martins wrote:
>> On 2/2/22 17:03, Jason Gunthorpe wrote:
>>> how to integrate that with the iommufd work, which I hope will allow
>>> that series, and the other IOMMU drivers that can support this to be
>>> merged..
>>
>> The iommu-fd thread wasn't particularly obvious on how dirty tracking is done
>> there, but TBH I am not up to speed on iommu-fd yet so I missed something
>> obvious for sure. When you say 'integrate that with the iommufd' can you
>> expand on that?
> 
> The general idea is that iommufd is the place to put all the iommu
> driver uAPI for consumption by userspace. The IOMMU feature of dirty
> tracking would belong there.
> 
> So, some kind of API needs to be designed to meet the needs of the
> IOMMU drivers.
> 
/me nods

I am gonna assume below is the most up-to-date to iommufd (as you pointed
out in another thread IIRC):

  https://github.com/jgunthorpe/linux iommufd

Let me know if it's not :)

>> Did you meant to use interface in the link, or perhaps VFIO would use an iommufd
>> /internally/ but still export the same UAPI as VFIO dirty tracking ioctls() (even if it's
>> not that efficient with a lot of bitmap copying). And optionally give a iommu_fd for the
>> VMM to scan iommu pagetables itself and see what was marked dirty or
>> not?
> 
> iommufd and VFIO container's don't co-exist, either iommufd is
> providing the IOMMU interface, or the current type 1 code - not both
> together.
> 
> iommfd current approach presents the same ABI as the type1 container
> as compatability, and it is a possible direction to provide the
> iommu_domain stored dirty bits through that compat API.
> 
> But, as you say, it looks unnatural and inefficient when the domain
> itself is storing the dirty bits inside the IOPTE.
> 
How much is this already represented as the io-pgtable in IOMMU internal kAPI
(if we exclude the UAPI portion of iommufd for now) ? FWIW, that is today
used by the AMD IOMMU and ARM IOMMUs. Albeit, not Intel :(

> It need some study I haven't got into yet :)
> 
Heh :)

Depending on how easy it is to obtain full extent of IO pagetables via iommu_fd
and whether userspace code can scan the dirty bits on its own ... then potentially
VMM/process can more efficiently scan the dirtied set? But if some layer needs to
somehow mediate between the vendor IOPTE representation and an UAPI IOPTE representation,
to be able to make that delegation to userspace ... then maybe both might be inefficient?
I didn't see how iommu-fd would abstract the IOPTEs lookup as far as I glanced through the
code, perhaps that's another ioctl().

>> My over-simplistic/naive view was that the proposal in the link
>> above sounded a lot simpler. While iommu-fd had more longevity for
>> many other usecases outside dirty tracking, no?
> 
> I'd prefer we don't continue to hack on the type1 code if iommufd is
> expected to take over in this role - especially for a multi-vendor
> feature like dirty tracking.
> 
But what strikes /specifically/ on the dirty bit feature is that it looks
simpler with the current VFIO, the heavy lifting seems to be
mostly on the IOMMU vendor. The proposed API above for VFIO looking at
the container (small changes), and IOMMU vendor would do most of it:

* Toggling API for start/stop dirty tracking
* API to get the IOVAs dirtied
* API to clear the IOVAs dirtied
[this last one I am not sure it is needed as the clear could be done as we scan
 the ones, thus minimize TLB flush cost if these are separate]

At the same time, what particularly scares me perf-wise (for the device being migrated)
... is the fact that we need to dynamically split and collapse page tables to
increase the granularity of which we track. In the above interface it splits/collapses
when you turn on/off the dirty tracking (respectively). That's *probably* where we
need more flexibility, not sure.

> It is actually a pretty complicated topic because migration capable
> PCI devices are also include their own dirty tracking HW, all this
> needs to be harmonized somehow. 

Do you have thoughts on what such device-dirty interface could look like?
(Perhaps too early to poke while the FSM/UAPI is being worked out)

I was wondering if container has a dirty scan/sync callback funnelled
by a vendor IOMMU ops implemented (as Shameerali patches proposed), and vfio vendor driver
provides one per device. Or propagate the dirty tracking API to vendor vfio driver[*]. The
reporting of the dirtying, though, looks hazzy to achieve if you try to make
it uniform even to userspace. Perhaps with iommu-fd you're thinking to mmap()
the dirty region back to userspace, or an iommu-fd ioctl() updates the PTEs,
while letting the kernel clear the dirty status via the mmap() object. And that
would be the common API regardless of dirty-hw scheme. Anyway, just thinking
out loud.

[*] considering the device may choose where to place its tracking storage, and
which scheme (bitmap, ring, etc) it might be.

> VFIO proposed to squash everything
> into the container code, but I've been mulling about having iommufd
> only do system iommu and push the PCI device internal tracking over to
> VFIO.
> 

Seems to me that the juicy part falls mostly in IOMMU vendor code, I am
not sure yet how much one can we 'offload' to a generic layer, at least
compared with this other proposal.

>> I have a PoC-ish using the interface in the link, with AMD IOMMU
>> dirty bit supported (including Qemu emulated amd-iommu for folks
>> lacking the hardware). Albeit the eager-spliting + collapsing of
>> IOMMU hugepages is not yet done there, and I wanted to play around
>> the emulated intel-iommu SLADS from specs looks quite similar. Happy
>> to join existing effort anyways.
> 
> This sounds great, I would love to bring the AMD IOMMU along with a
> dirty tracking implementation! Can you share some patches so we can
> see what the HW implementation looks like?

Oh yes for sure! As I said I'm happy to help&implement along.
I would really love to leverage the IOMMU feature, as that relieves the
migrateable PCI device having to do the dirty tracking themselves.

And well, we seem to be getting there -- spec-wise everybody has that
feature *at least* documented :)

Give me some time (few days only, as I gotta sort some things) and I'll
respond here as follow up with link to a branch with the WIP/PoC patches.

Summing up a couple of remarks on hw, hopefully they enlight:

1) On AMD the feature is advertised by their extended feature register
as supported or not. On Intel, same in its equivalent (ECAP). On course
different bits on different IOMMUs registers. If it is supported, the
IOMMU updates (when activated) two bits per page table entry to indicate
'access' and 'dirty'. Slightly different page table format bit-wise on
where access/dirty is located (between the two vendors).

2) Change protection domain flags to enable the dirty/access tracking.
On AMD, it's the DTE flags (a new flag for dirty tracking).
[Intel does this in the PASID table entry]

3) Dirty bit is sticky, hardware never clears it. Reading the access/dirty
bit is cheap, clearing them is 'expensive' because one needs to flush
IOTLB as the IOMMU hardware may cache the bits in the IOTLB as a result
of an address-translation/io-page-walk. Even though the IOMMU uses interlocked
operations to actually update the Access/Dirty bit in concurrency with
the CPU. The AMD manuals are a tad misleading as they talk about marking
non-present, but that would be catastrophic for migration as it would
mean a DMA target abort for the PCI device, unless I missed something obvious.
In any case, this means that the dirty bit *clearing* needs to be
batched as much as possible, to amortize the cost of flushing the IOTLB.
This is the same for Intel *IIUC*.

4) Adjust the granularity of pagetables in place:
[This item wasn't done, but it is generic to any IOMMU because it
is mostly the ability to split existing IO pages in place.]

4.a) Eager splitting and late collapsing IO hugepages -- essentially
to have tracking at finer granularity. as you know generally the
IOMMU map is generally done on the biggest IO page size, specially on
guests. We sadly can't write-protect or anything fancier that we
usually do so otherwise the PCI device gets a DMA target-abort :(
So splitting, when we start dirty tracking (what I mean by eager), and
do for the whole IO page table. When finished tracking, collapse the
pages (which should probably be deemed optional in the common case of
succeeding the migration). Guest expected to have higher IOTLB miss.
This part is particularly worrying for the IO guest performance, but
hopefully it doesn't turn out to be too bad.

4.b) Optionally starting dirtying earlier (at provisioning) and let
userspace dynamically split pages. This is to hopefully minimize the
IOTLB miss we induce ourselves in item 4.a) if we were to do eagerly.
So dirty tracking would be enabled at creation of the protection domain
after the vfio container is set up, and we would use pages dirtied
as a indication of what needs to be splited. Problem is for IO page
sizes bigger than 1G, which might unnecessarily lead to marking too
much as dirty early on; but at least it's better than transferring the
whole set.

	Joao
Jason Gunthorpe Feb. 4, 2022, 11:07 p.m. UTC | #3
On Fri, Feb 04, 2022 at 07:53:12PM +0000, Joao Martins wrote:
> On 2/3/22 15:18, Jason Gunthorpe wrote:
> > On Wed, Feb 02, 2022 at 07:05:02PM +0000, Joao Martins wrote:
> >> On 2/2/22 17:03, Jason Gunthorpe wrote:
> >>> how to integrate that with the iommufd work, which I hope will allow
> >>> that series, and the other IOMMU drivers that can support this to be
> >>> merged..
> >>
> >> The iommu-fd thread wasn't particularly obvious on how dirty tracking is done
> >> there, but TBH I am not up to speed on iommu-fd yet so I missed something
> >> obvious for sure. When you say 'integrate that with the iommufd' can you
> >> expand on that?
> > 
> > The general idea is that iommufd is the place to put all the iommu
> > driver uAPI for consumption by userspace. The IOMMU feature of dirty
> > tracking would belong there.
> > 
> > So, some kind of API needs to be designed to meet the needs of the
> > IOMMU drivers.
> > 
> /me nods
> 
> I am gonna assume below is the most up-to-date to iommufd (as you pointed
> out in another thread IIRC):
> 
>   https://github.com/jgunthorpe/linux iommufd
> 
> Let me know if it's not :)

The iommufd part is pretty good, but there is hacky patch to hook it
into vfio that isn't there, if you want to actually try it.

> > But, as you say, it looks unnatural and inefficient when the domain
> > itself is storing the dirty bits inside the IOPTE.
> 
> How much is this already represented as the io-pgtable in IOMMU internal kAPI
> (if we exclude the UAPI portion of iommufd for now) ? FWIW, that is today
> used by the AMD IOMMU and ARM IOMMUs. Albeit, not Intel :(

Which are you looking at? AFACIT there is no diry page support in
iommu_ops ?

> then potentially VMM/process can more efficiently scan the dirtied
> set? But if some layer needs to somehow mediate between the vendor
> IOPTE representation and an UAPI IOPTE representation, to be able to
> make that delegation to userspace ... then maybe both might be
> inefficient?  I didn't see how iommu-fd would abstract the IOPTEs
> lookup as far as I glanced through the code, perhaps that's another
> ioctl().

It is based around the same model as VFIO container - map/unmap of
user address space into the IOPTEs and the user space doesn't see
anything resembling a 'pte' - at least for kernel owned IO page
tables.

User space page tables will not be abstracted and the userspace must
know the direct HW format of the IOMMU they is being used.

> But what strikes /specifically/ on the dirty bit feature is that it looks
> simpler with the current VFIO, the heavy lifting seems to be
> mostly on the IOMMU vendor. The proposed API above for VFIO looking at
> the container (small changes), and IOMMU vendor would do most of it:

It is basically the same, almost certainly the user API in iommufd
will be some 'get dirty bits' and 'unmap and give me the dirty bits'
just like vfio has.
 
The tricky details are around how do you manage this when the system
may have multiple things invovled capable, or not, of actualy doing
dirty tracking.

> At the same time, what particularly scares me perf-wise (for the
> device being migrated) ... is the fact that we need to dynamically
> split and collapse page tables to increase the granularity of which
> we track. In the above interface it splits/collapses when you turn
> on/off the dirty tracking (respectively). That's *probably* where we
> need more flexibility, not sure.

For sure that is a particularly big adventure in the iommu driver..

> Do you have thoughts on what such device-dirty interface could look like?
> (Perhaps too early to poke while the FSM/UAPI is being worked out)

I've been thinking the same general read-and-clear of a dirty
bitmap. It matches nicely the the KVM interface.

> I was wondering if container has a dirty scan/sync callback funnelled
> by a vendor IOMMU ops implemented (as Shameerali patches proposed), 

Yes, this is almost certainly how the in-kernel parts will look

> and vfio vendor driver provides one per device. 

But this is less clear..

> Or propagate the dirty tracking API to vendor vfio driver[*]. 
> [*] considering the device may choose where to place its tracking storage, and
> which scheme (bitmap, ring, etc) it might be.

This has been my thinking, yes

> The reporting of the dirtying, though, looks hazzy to achieve if you
> try to make it uniform even to userspace. Perhaps with iommu-fd
> you're thinking to mmap() the dirty region back to userspace, or an
> iommu-fd ioctl() updates the PTEs, while letting the kernel clear
> the dirty status via the mmap() object. And that would be the common
> API regardless of dirty-hw scheme. Anyway, just thinking out loud.

My general thinking has be that iommufd would control only the system
IOMMU hardware. The FD interface directly exposes the iommu_domain as
a manipulable object, so I'd imagine making userspace have a simple
1:1 connection to the iommu_ops of a single iommu_domain.

Doing this avoids all the weirdo questions about what do you do if
there is non-uniformity in the iommu_domain's.

Keeping with that theme the vfio_device would provide a similar
interface, on its own device FD.

I don't know if mmap should be involed here, the dirty bitmaps are not
so big, I suspect a simple get_user_pages_fast() would be entirely OK.

> > VFIO proposed to squash everything
> > into the container code, but I've been mulling about having iommufd
> > only do system iommu and push the PCI device internal tracking over to
> > VFIO.
> > 
> 
> Seems to me that the juicy part falls mostly in IOMMU vendor code, I am
> not sure yet how much one can we 'offload' to a generic layer, at least
> compared with this other proposal.

Yes, I expect there is very little generic code here if we go this
way. The generic layer is just marshalling the ioctl(s) to the iommu
drivers. Certainly not providing storage or anything/

> Give me some time (few days only, as I gotta sort some things) and I'll
> respond here as follow up with link to a branch with the WIP/PoC patches.

Great!
  
> 3) Dirty bit is sticky, hardware never clears it. Reading the access/dirty
> bit is cheap, clearing them is 'expensive' because one needs to flush
> IOTLB as the IOMMU hardware may cache the bits in the IOTLB as a result
> of an address-translation/io-page-walk. Even though the IOMMU uses interlocked
> operations to actually update the Access/Dirty bit in concurrency with
> the CPU. The AMD manuals are a tad misleading as they talk about marking
> non-present, but that would be catastrophic for migration as it would
> mean a DMA target abort for the PCI device, unless I missed something obvious.
> In any case, this means that the dirty bit *clearing* needs to be
> batched as much as possible, to amortize the cost of flushing the IOTLB.
> This is the same for Intel *IIUC*.

You have to mark it as non-present to do the final read out if
something unmaps while the tracker is on - eg emulating a viommu or
something. Then you mark non-present, flush the iotlb and read back
the dirty bit.

Otherwise AFIAK, you flush the IOTLB to get the latest dirty bits and
then read and clear them.

> 4) Adjust the granularity of pagetables in place:
> [This item wasn't done, but it is generic to any IOMMU because it
> is mostly the ability to split existing IO pages in place.]

This seems like it would be some interesting amount of driver work,
but yes it could be a generic new iommu_domina op.

> 4.b) Optionally starting dirtying earlier (at provisioning) and let
> userspace dynamically split pages. This is to hopefully minimize the
> IOTLB miss we induce ourselves in item 4.a) if we were to do eagerly.
> So dirty tracking would be enabled at creation of the protection domain
> after the vfio container is set up, and we would use pages dirtied
> as a indication of what needs to be splited. Problem is for IO page
> sizes bigger than 1G, which might unnecessarily lead to marking too
> much as dirty early on; but at least it's better than transferring the
> whole set.

I'm not sure running with dirty tracking permanently on would be good
for guest performance either.

I'd suspect you'd be better to have a warm up period where you track
dirtys and split down pages.

It is interesting, this is a possible reason why device dirty tracking
might actually perfom better because it can operate at a different
granularity from the system iommu without disrupting the guest DMA
performance.

Jason
Jason Gunthorpe Feb. 11, 2022, 5:49 p.m. UTC | #4
On Fri, Feb 11, 2022 at 05:28:22PM +0000, Joao Martins wrote:

> But well, at the end of the day for an IOMMU driver the domain ops are
> the important stuff, maybe IO pgtable framework isn't as critical
> (Intel for example, doesn't use that at all).

Right, it doesn't matter what library was used to implement the
domain..

> > User space page tables will not be abstracted and the userspace must
> > know the direct HW format of the IOMMU they is being used.
> > 
> That's countering earlier sentence? Because hw format (for me at least)
> means PTE and protection domain config format too. And if iommufd
> abstracts the HW format, modelling after the IOMMU domain and its ops,
> then it's abstracting userspace from those details e.g. it works over
> IOVAs, but not over its vendor representation of how that IOVA is set up.
> 
> I am probably being dense.

It is both ways, one kind of domain provides a kernel supplied
map/unmap that implements the IO PTE manipulation in kernel memory

But if you want the IOPTE to be in user memory then user must
read/write it and it cannot use that - so a user domain will not have
map/unmap.

> > It is basically the same, almost certainly the user API in iommufd
> > will be some 'get dirty bits' and 'unmap and give me the dirty bits'
> > just like vfio has.
> > 
> 
> The 'unmap and give dirty bits' looks to be something TBD even in a VFIO
> migration flow. 

It is essential to implement any kind of viommu behavior where
map/unmap is occuring while the dirty tracking is running. It should
never make a difference except in some ugly edge cases where the dma
and unmap are racing.

> supposed to be happening (excluding P2P)? So perhaps the unmap is
> unneeded after quiescing the VF.

Yes, you don't need to unmap for migration, a simple fully synchronous
read and clear is sufficient. But that final read, while DMA is quite,
must obtain the latest dirty bit data.
 
> We have a bitmap based interface in KVM, but there's also a recent ring
> interface for dirty tracking, which has some probably more determinism than
> a big bitmap. And if we look at hardware, AMD needs to scan NPT pagetables
> and breaking its entries on-demand IIRC, whereas Intel resembles something
> closer to a 512 entries 'ring' with VMX PML, which tells what has been
> dirtied.

KVM has an advantage that it can throttle the rate of dirty generation
so it can rely on logging. The IOMMU can't throttle DMA, so we are
stuck with a bitmap

> > I don't know if mmap should be involed here, the dirty bitmaps are not
> > so big, I suspect a simple get_user_pages_fast() would be entirely OK.
> > 
> Considering that is 32MB of a bitmap per TB maybe it is cheap.

Rigt. GUP fasting a couple huge pages is nothing compared to scanning
1TB of IO page table.

> >> Give me some time (few days only, as I gotta sort some things) and I'll
> >> respond here as follow up with link to a branch with the WIP/PoC patches.
> > 
> > Great!
> >
> Here it is. "A few days" turn into a week sorry :/
> 
> https://github.com/jpemartins/qemu  amd-iommu-hdsup-wip
> https://github.com/jpemartins/linux  amd-vfio-iommu-hdsup-wip
> 
> Note, it is an early PoC. I still need to get the split/collapse thing going,
> and fix the FIXMEs there, and have a second good look at the iommu page tables.

Oh I'll try to look a this thanks

> > You have to mark it as non-present to do the final read out if
> > something unmaps while the tracker is on - eg emulating a viommu or
> > something. Then you mark non-present, flush the iotlb and read back
> > the dirty bit.
> > 
> You would be surprised that AMD IOMMUs have also an accelerated vIOMMU
> too :) without needing VMM intervention (that's also not supported
> in Linux).

I'm sure, but dirty tracking has to happen on the kernel owned page
table, not the user owned one I think..

> > Otherwise AFIAK, you flush the IOTLB to get the latest dirty bits and
> > then read and clear them.
> > 
> It's the other way around AIUI. The dirty bits are sticky, so you flush
> the IOTLB after clearing as means to notify the IOMMU to set the dirty bits
> again on the next memory transaction (or ATS translation).

I guess it depends on how the HW works, if it writes the dirty bit
back to ram instantly on the first dirty, or if it only writes the
dirty bit when flushing the iotlb.

In any case you have to synchronize with the HW in some way to ensure
that all dirty bits are 'pulled back' into system memory to resolve
races (ie you need the iommu HW to release and the CPU to acquire on
the dirty data) before trying to read them, at least for the final
quieced system read.

> I am not entirely sure we need to unmap + mark non-present for non-viommu
> That would actually mean something is not properly quiscieing the VF DMA.
> Maybe we should .. to gate whether if we should actually continue with LM
> if something kept doing DMA when it shouldn't have.

It is certainly an edge case. A device would be misbehaving to
continue DMA.

> > This seems like it would be some interesting amount of driver work,
> > but yes it could be a generic new iommu_domina op.
> 
> I am slightly at odds that .split and .collapse at .switch() are enough.
> But, with iommu if we are working on top of an IOMMU domain object and
> .split and .collapse are iommu_ops perhaps that looks to be enough
> flexibility to give userspace the ability to decide what it wants to
> split, if it starts eargerly/warming-up tracking dirty pages.
> 
> The split and collapsing is something I wanted to work on next, to get
> to a stage closer to that of an RFC on the AMD side.

split/collapse seems kind of orthogonal to me it doesn't really
connect to dirty tracking other than being mostly useful during dirty
tracking.

And I wonder how hard split is when trying to atomically preserve any
dirty bit..

> Hmmm, judging how the IOMMU works I am not sure this is particularly
> affecting DMA performance (not sure yet about RDMA, it's something I
> curious to see how it gets to perform with 4K IOPTEs, and with dirty
> tracking always enabled). Considering how the bits are sticky, and
> unless CPU clears it, it's short of a nop? Unless of course the checking
> for A^D during an atomic memory transaction is expensive. Needs some
> performance testing nonetheless.

If you leave the bits all dirty then why do it? The point is to see
the dirties, which means the iommu is generating a workload of dirty
cachelines while operating it didn't do before.

> I forgot to mention, but the early enablement of IOMMU dirty tracking
> was also meant to fully know since guest creation what needs to be
> sent to the destination. Otherwise, wouldn't we need to send the whole
> pinned set to destination, if we only start tracking dirty pages during
> migration?

? At the start of migration you have to send everything. Dirty
tracking is intended to allow the VM to be stopped and then have only
a small set of data to xfer.
 
> Also, this is probably a differentiator for iommufd, if we were to provide
> split and collapse semantics to IOMMU domain objects that userspace can use.
> That would get more freedom, to switch dirty-tracking, and then do the warm
> up thingie and piggy back on what it wants to split before migration.
> perhaps the switch() should get some flag to pick where to split, I guess.

Yes, right. Split/collapse should be completely seperate from dirty
tracking.

Jason
Joao Martins Feb. 11, 2022, 9:43 p.m. UTC | #5
On 2/11/22 17:49, Jason Gunthorpe wrote:
> On Fri, Feb 11, 2022 at 05:28:22PM +0000, Joao Martins wrote:
>>> It is basically the same, almost certainly the user API in iommufd
>>> will be some 'get dirty bits' and 'unmap and give me the dirty bits'
>>> just like vfio has.
>>>
>>
>> The 'unmap and give dirty bits' looks to be something TBD even in a VFIO
>> migration flow. 
> 
> It is essential to implement any kind of viommu behavior where
> map/unmap is occuring while the dirty tracking is running. It should
> never make a difference except in some ugly edge cases where the dma
> and unmap are racing.
> 
/me nods

>> supposed to be happening (excluding P2P)? So perhaps the unmap is
>> unneeded after quiescing the VF.
> 
> Yes, you don't need to unmap for migration, a simple fully synchronous
> read and clear is sufficient. But that final read, while DMA is quite,
> must obtain the latest dirty bit data.
>

The final read doesn't need special logic in VFIO/IOMMU IUC
(maybe only in the VMM)

Anyways, the above paragraph matches my understanding.

>> We have a bitmap based interface in KVM, but there's also a recent ring
>> interface for dirty tracking, which has some probably more determinism than
>> a big bitmap. And if we look at hardware, AMD needs to scan NPT pagetables
>> and breaking its entries on-demand IIRC, whereas Intel resembles something
>> closer to a 512 entries 'ring' with VMX PML, which tells what has been
>> dirtied.
> 
> KVM has an advantage that it can throttle the rate of dirty generation
> so it can rely on logging. The IOMMU can't throttle DMA, so we are
> stuck with a bitmap
> 
Yeap, sadly :(

>>> I don't know if mmap should be involed here, the dirty bitmaps are not
>>> so big, I suspect a simple get_user_pages_fast() would be entirely OK.
>>>
>> Considering that is 32MB of a bitmap per TB maybe it is cheap.
> 
> Rigt. GUP fasting a couple huge pages is nothing compared to scanning
> 1TB of IO page table.
> 
... With 4K PTEs which is even more ludricous expensive. Well yeah, a
lesser concern the mangling of the bitmap, when you put that way heh

>>> You have to mark it as non-present to do the final read out if
>>> something unmaps while the tracker is on - eg emulating a viommu or
>>> something. Then you mark non-present, flush the iotlb and read back
>>> the dirty bit.
>>>
>> You would be surprised that AMD IOMMUs have also an accelerated vIOMMU
>> too :) without needing VMM intervention (that's also not supported
>> in Linux).
> 
> I'm sure, but dirty tracking has to happen on the kernel owned page
> table, not the user owned one I think..
> 
The plumbing for the hw-accelerated vIOMMU is a little different that
a regular vIOMMU, at least IIUC host does not take an active part in the
GVA -> GPA translation. Suravee's preso explains it nicely, if you don't
have time to fiddle with the SDM:

https://static.sched.com/hosted_files/kvmforum2021/da/vIOMMU%20KVM%20Forum%202021%20-%20v4.pdf


>>> Otherwise AFIAK, you flush the IOTLB to get the latest dirty bits and
>>> then read and clear them.
>>>
>> It's the other way around AIUI. The dirty bits are sticky, so you flush
>> the IOTLB after clearing as means to notify the IOMMU to set the dirty bits
>> again on the next memory transaction (or ATS translation).
> 
> I guess it depends on how the HW works, if it writes the dirty bit
> back to ram instantly on the first dirty, or if it only writes the
> dirty bit when flushing the iotlb.
> 
The manual says roughly that the update is visible to CPU as soon as the
it updates. Particularly from the IOMMU SDM:

"Note that the setting of accessed and dirty status bits in the page tables is visible to
both the CPU and the peripheral when sharing guest page tables. The IOMMU interlocked
operations to update A and D bits must be 64-bit operations and naturally aligned on a
64-bit boundary"

And ...

1. Decodes the read and write intent from the memory access.
2. If P=0 in the page descriptor, fail the access.
3. Compare the A & D bits in the descriptor with the read and write intent in the request.
4. If the A or D bits need to be updated in the descriptor:
* Start atomic operation.
* Read the descriptor as a 64-bit access.
* If the descriptor no longer appears to require an update, release the atomic lock with
no further action and continue to step 5.
* Calculate the new A & D bits.
* Write the descriptor as a 64-bit access.
* End atomic operation.
5. Continue to the next stage of translation or to the memory access.

> In any case you have to synchronize with the HW in some way to ensure
> that all dirty bits are 'pulled back' into system memory to resolve
> races (ie you need the iommu HW to release and the CPU to acquire on
> the dirty data) before trying to read them, at least for the final
> quieced system read.
> 
/me nods

>>> This seems like it would be some interesting amount of driver work,
>>> but yes it could be a generic new iommu_domina op.
>>
>> I am slightly at odds that .split and .collapse at .switch() are enough.
>> But, with iommu if we are working on top of an IOMMU domain object and
>> .split and .collapse are iommu_ops perhaps that looks to be enough
>> flexibility to give userspace the ability to decide what it wants to
>> split, if it starts eargerly/warming-up tracking dirty pages.
>>
>> The split and collapsing is something I wanted to work on next, to get
>> to a stage closer to that of an RFC on the AMD side.
> 
> split/collapse seems kind of orthogonal to me it doesn't really
> connect to dirty tracking other than being mostly useful during dirty
> tracking.
> 
> And I wonder how hard split is when trying to atomically preserve any
> dirty bit..
> 
Would would it be hard? The D bit is supposed to be replicated when you
split to smaller page size.

>> Hmmm, judging how the IOMMU works I am not sure this is particularly
>> affecting DMA performance (not sure yet about RDMA, it's something I
>> curious to see how it gets to perform with 4K IOPTEs, and with dirty
>> tracking always enabled). Considering how the bits are sticky, and
>> unless CPU clears it, it's short of a nop? Unless of course the checking
>> for A^D during an atomic memory transaction is expensive. Needs some
>> performance testing nonetheless.
> 
> If you leave the bits all dirty then why do it? The point is to see
> the dirties, which means the iommu is generating a workload of dirty
> cachelines while operating it didn't do before.
> 
My thinking was that if it's dirtied and in the IOTLB most likely the
descriptor in the IOTLB is cached. And if you need to do a IOMMU page walk
to resolve an IOVA, perhaps the check for the A & D bits needing
to be updated is probably the least problem in this path. Naturally, if
it's not split, you have a much higher chance (e.g. with 1GB entries) to stay
in the IOTLB and just compare two bits *prior* to consider starting
the atomic operation to update the descriptor.

>> I forgot to mention, but the early enablement of IOMMU dirty tracking
>> was also meant to fully know since guest creation what needs to be
>> sent to the destination. Otherwise, wouldn't we need to send the whole
>> pinned set to destination, if we only start tracking dirty pages during
>> migration?
> 
> ? At the start of migration you have to send everything. Dirty
> tracking is intended to allow the VM to be stopped and then have only
> a small set of data to xfer.
>  
Right, that's how it works today.

This is just preemptive longterm thinking about the overal problem space (probably
unnecessary noise at this stage). Particularly whenever I need to migrate 1 to 2TB VMs.
Particular that the stage *prior* to precopy takes way too long to transfer the whole
memory. So I was thinking say only transfer the pages that are populated[*] in the
second-stage page tables (for the CPU) coupled with IOMMU tracking from the beginning
(prior to vcpus even entering). That could probably decrease 1024 1GB Dirtied IOVA
entries, to maybe only dirty a smaller subset, saving a whole bootload of time.

[*] VMs without VFIO would be even easier as the first stage page tables are non-present.

>> Also, this is probably a differentiator for iommufd, if we were to provide
>> split and collapse semantics to IOMMU domain objects that userspace can use.
>> That would get more freedom, to switch dirty-tracking, and then do the warm
>> up thingie and piggy back on what it wants to split before migration.
>> perhaps the switch() should get some flag to pick where to split, I guess.
> 
> Yes, right. Split/collapse should be completely seperate from dirty
> tracking.

Yeap.

I wonder if we could start progressing the dirty tracking as a first initial series and
then have the split + collapse handling as a second part? That would be quite
nice to get me going! :D

	Joao
Jason Gunthorpe Feb. 12, 2022, 12:01 a.m. UTC | #6
On Fri, Feb 11, 2022 at 09:43:56PM +0000, Joao Martins wrote:
> The plumbing for the hw-accelerated vIOMMU is a little different that
> a regular vIOMMU, at least IIUC host does not take an active part in the
> GVA -> GPA translation. Suravee's preso explains it nicely, if you don't
> have time to fiddle with the SDM:
> 
> https://static.sched.com/hosted_files/kvmforum2021/da/vIOMMU%20KVM%20Forum%202021%20-%20v4.pdf

That looks like the same nesting everyone else is talking about.

I've been refering to this as 'user space page tables' to avoid a lot
of ambiguity becuase what it fundamentally does is allow userspace to
directly manage an IO page table - building the IO PTEs, issue
invalidations, etc.

To be secure the userspace page table is nested under a kernel owned
page table and all the IO PTE offsets/etc are understood to be within
the address space of the kernel owned page table.

When combined with KVM the userspace is just inside a VM.

> 1. Decodes the read and write intent from the memory access.
> 2. If P=0 in the page descriptor, fail the access.
> 3. Compare the A & D bits in the descriptor with the read and write intent in the request.
> 4. If the A or D bits need to be updated in the descriptor:

Ah, so the dirty update is actually atomic on the first write before
any DMA happens - and I suppose all of this happens when the entry is
first loaded into the IOTLB.

So the flush is to allow the IOTLB to see the cleared D bit..

> > split/collapse seems kind of orthogonal to me it doesn't really
> > connect to dirty tracking other than being mostly useful during dirty
> > tracking.
> > 
> > And I wonder how hard split is when trying to atomically preserve any
> > dirty bit..
> > 
> Would would it be hard? The D bit is supposed to be replicated when you
> split to smaller page size.

I guess it depends on how the 'acquire' is done, as the CPU has to
atomically replace a large entry with a pointer to a small entry,
flush the IOTLB then 'acquire' the dirty bit. If the dirty bit is set
in the old entry then it has to sprinkle it into the new entries with
atomics.

> This is just preemptive longterm thinking about the overal problem
> space (probably unnecessary noise at this stage). Particularly
> whenever I need to migrate 1 to 2TB VMs.  Particular that the stage
> *prior* to precopy takes way too long to transfer the whole
> memory. So I was thinking say only transfer the pages that are
> populated[*] in the second-stage page tables (for the CPU) coupled
> with IOMMU tracking from the beginning (prior to vcpus even
> entering). That could probably decrease 1024 1GB Dirtied IOVA
> entries, to maybe only dirty a smaller subset, saving a whole
> bootload of time.

Oh, you want to effectively optimize zero page detection..

> I wonder if we could start progressing the dirty tracking as a first initial series and
> then have the split + collapse handling as a second part? That would be quite
> nice to get me going! :D

I think so, and I think we should. It is such a big problem space, it
needs to get broken up.

Jason
Joao Martins Feb. 14, 2022, 1:34 p.m. UTC | #7
On 2/12/22 00:01, Jason Gunthorpe wrote:
> On Fri, Feb 11, 2022 at 09:43:56PM +0000, Joao Martins wrote:
>> 1. Decodes the read and write intent from the memory access.
>> 2. If P=0 in the page descriptor, fail the access.
>> 3. Compare the A & D bits in the descriptor with the read and write intent in the request.
>> 4. If the A or D bits need to be updated in the descriptor:
> 
> Ah, so the dirty update is actually atomic on the first write before
> any DMA happens - and I suppose all of this happens when the entry is
> first loaded into the IOTLB.
> 
Intel's equivalent feature suggests me that this works the same way there.

ARM update of ioptes looks to be slightly different[*] but this FEAT_BBM
in SMMUv3.2 makes it work in similar way to Intel/AMD. But I could be
misunderstanding something there.

[*] apparently we need to write an invalid entry first, invalidate the {IO}TLB
and then write the new valid entry. Not sure I understood correctly that this
is the 'break-before-make' thingie.

> So the flush is to allow the IOTLB to see the cleared D bit..
> 
Right.

>>> split/collapse seems kind of orthogonal to me it doesn't really
>>> connect to dirty tracking other than being mostly useful during dirty
>>> tracking.
>>>
>>> And I wonder how hard split is when trying to atomically preserve any
>>> dirty bit..
>>>
>> Would would it be hard? The D bit is supposed to be replicated when you
>> split to smaller page size.
> 
> I guess it depends on how the 'acquire' is done, as the CPU has to
> atomically replace a large entry with a pointer to a small entry,
> flush the IOTLB then 'acquire' the dirty bit. If the dirty bit is set
> in the old entry then it has to sprinkle it into the new entries with
> atomics.
> 
ISTR some mention to what we are chatting here in the IOMMU SDM:

When a non-default page size is used , software must OR the Dirty bits in
all of the replicated host PTEs used to map the page. The IOMMU does not
guarantee the Dirty bits are set in all of the replicated PTEs. Any portion
of the page may have been written even if the Dirty bit is set in only one
of the replicated PTEs.

>> I wonder if we could start progressing the dirty tracking as a first initial series and
>> then have the split + collapse handling as a second part? That would be quite
>> nice to get me going! :D
> 
> I think so, and I think we should. It is such a big problem space, it
> needs to get broken up.

OK, cool! I'll stick with the same (slimmed down) IOMMU+VFIO interface as proposed in the
past except with the x86 support only[*]. And we poke holes there I guess.

[*] I might include Intel too, albeit emulated only.
Jason Gunthorpe Feb. 14, 2022, 2:06 p.m. UTC | #8
On Mon, Feb 14, 2022 at 01:34:15PM +0000, Joao Martins wrote:

> [*] apparently we need to write an invalid entry first, invalidate the {IO}TLB
> and then write the new valid entry. Not sure I understood correctly that this
> is the 'break-before-make' thingie.

Doesn't that explode if the invalid entry is DMA'd to?

> When a non-default page size is used , software must OR the Dirty bits in
> all of the replicated host PTEs used to map the page. The IOMMU does not
> guarantee the Dirty bits are set in all of the replicated PTEs. Any portion
> of the page may have been written even if the Dirty bit is set in only one
> of the replicated PTEs.

I suspect that is talking about something else

> >> I wonder if we could start progressing the dirty tracking as a first initial series and
> >> then have the split + collapse handling as a second part? That would be quite
> >> nice to get me going! :D
> > 
> > I think so, and I think we should. It is such a big problem space, it
> > needs to get broken up.
> 
> OK, cool! I'll stick with the same (slimmed down) IOMMU+VFIO interface as proposed in the
> past except with the x86 support only[*]. And we poke holes there I guess.
> 
> [*] I might include Intel too, albeit emulated only.

Like I said, I'd prefer we not build more on the VFIO type 1 code
until we have a conclusion for iommufd..

While returning the dirty data looks straight forward, it is hard to
see an obvious path to enabling and controlling the system iommu the
way vfio is now.

Jason
Jason Gunthorpe Feb. 18, 2022, 4:37 p.m. UTC | #9
On Fri, Jul 02, 2021 at 10:58:45AM +0100, Shameer Kolothum wrote:
> Hi,
> 
> This series attempts to add vfio live migration support for
> HiSilicon ACC VF devices. HiSilicon ACC VF device MMIO space
> includes both the functional register space and migration 
> control register space. As discussed in RFCv1[0], this may create
> security issues as these regions get shared between the Guest
> driver and the migration driver. Based on the feedback, we tried
> to address those concerns in this version. 
> 
> This is now based on the new vfio-pci-core framework proposal[1].
> Understand that the framework proposal is still under discussion,
> but really appreciate any feedback on the approach taken here
> to mitigate the security risks.

Do you want to try to get this into this kernel along with the mlx5
driver? How RFCy is this?

If you can post a v2 with the changes discussed soon I think there is
a good chance of this.

Thanks,
Jason
Joao Martins Feb. 22, 2022, 11:55 a.m. UTC | #10
On 2/15/22 16:21, Jason Gunthorpe wrote:
> On Tue, Feb 15, 2022 at 04:00:35PM +0000, Joao Martins wrote:
>> On 2/14/22 14:06, Jason Gunthorpe wrote:
>>> On Mon, Feb 14, 2022 at 01:34:15PM +0000, Joao Martins wrote:
>>>
>>>> [*] apparently we need to write an invalid entry first, invalidate the {IO}TLB
>>>> and then write the new valid entry. Not sure I understood correctly that this
>>>> is the 'break-before-make' thingie.
>>>
>>> Doesn't that explode if the invalid entry is DMA'd to?
>>>
>> Yes, IIUC. Also, the manual has this note:
> 
> Heh, sounds like "this doesn't work" to me :)
> 
Yeah, but I remember reading in manual that HTTUD (what ARM calls it for dirty
tracking, albeit DBM is another term for the same thing) requires FEAT_BBM
which avoids us to play the above games. So, supposedly, we can "just"
use atomics with IOPTE changes and IOTLB flush. Not if we need the latter
flush before or after on smmuv3.

>>> Like I said, I'd prefer we not build more on the VFIO type 1 code
>>> until we have a conclusion for iommufd..
>>>
>>
>> I didn't quite understand what you mean by conclusion.
> 
> If people are dead-set against doing iommufd, then lets abandon the
> idea and go back to hacking up vfio.
>  
Heh, I was under the impression everybody was investing so much *because*
that direction was set onto iommufd direction.

>> If by conclusion you mean the whole thing to be merged, how can the work be
>> broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant
>> in terms of direction...
> 
> I think go ahead and build it on top of iommufd, start working out the
> API details, etc. I think once the direction is concluded the new APIs
> will go forward.
>
/me nods, will do. Looking at your repository it is looking good.

>>> While returning the dirty data looks straight forward, it is hard to
>>> see an obvious path to enabling and controlling the system iommu the
>>> way vfio is now.
>>
>> It seems strange to have a whole UAPI for userspace [*] meant to
>> return dirty data to userspace, when dirty right now means the whole
>> pinned page set and so copying the whole guest ... 
> 
> Yes, the whole thing is only partially implemented, and doesn't have
> any in-kernel user. It is another place holder for an implementation
> to come someday.
> 
Yeap, seems like.

>> Hence my thinking was that the patches /if small/ would let us see how dirty
>> tracking might work for iommu kAPI (and iommufd) too.
> 
> It could be tried, but I think if you go into there you will find it
> quickly turns quite complicated to address all the edge cases. Eg what
> do you do if you have a mdev present after you turn on system
> tracking? What if the mdev is using a PASID?
> What about hotplug of new
> VFIO devices?
> 
> Remember, dirty tracking for vfio is totally useless without also
> having vfio device migration. 

Oh yes -- I am definitely aware. IOMMU/Device Dirty tracking is useless
if we can't do the device part first. But if quiescing DMA and saving
state are two hard requirements that are mandatory for a live migrateable
VF, having dirty tracking in the devices I suspect might be more rare.
So perhaps people will look at IOMMUs as a commodity-workaround to avoid
a whole bunch of hardware logic for dirty tracking, even bearing what it
entails for DMA performance (hisilicon might be an example).

> Do you already have a migration capable
> device to use with this?
> 
Not yet, but soon I hope.

>> Would it be better to do more iterative steps (when possible) as opposed to
>> scratch and rebuild VFIO type1 IOMMU handling?
> 
> Possibly, but every thing that gets added has to be carried over to
> the new code too, and energy has to be expended trying to figure out
> how the half implemented stuff should work while finishing it.
> 
/me nods I understand

> At the very least we must decide what to do with device-provided dirty
> tracking before the VFIO type1 stuff can be altered to use the system
> IOMMU.
> 
I, too, have been wondering what that is going to look like -- and how do we
convey the setup of dirty tracking versus the steering of it.

> This is very much like the migration FSM, the only appeal is the
> existing qemu implementation of the protocol.

Yeah.
Jason Gunthorpe Feb. 23, 2022, 1:03 a.m. UTC | #11
On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:

> > If people are dead-set against doing iommufd, then lets abandon the
> > idea and go back to hacking up vfio.
> >  
> Heh, I was under the impression everybody was investing so much *because*
> that direction was set onto iommufd direction.

Such is the hope :)

> >> If by conclusion you mean the whole thing to be merged, how can the work be
> >> broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant
> >> in terms of direction...
> > 
> > I think go ahead and build it on top of iommufd, start working out the
> > API details, etc. I think once the direction is concluded the new APIs
> > will go forward.
> >
> /me nods, will do. Looking at your repository it is looking good.

I would like to come with some plan for dirty tracking on iommufd and
combine that with a plan for dirty tracking inside the new migration
drivers.

> Oh yes -- I am definitely aware. IOMMU/Device Dirty tracking is useless
> if we can't do the device part first. But if quiescing DMA and saving
> state are two hard requirements that are mandatory for a live migrateable
> VF, having dirty tracking in the devices I suspect might be more
> rare.

So far all but one of the live migration devices I know about can do
dirty tracking internally..

> So perhaps people will look at IOMMUs as a commodity-workaround to avoid
> a whole bunch of hardware logic for dirty tracking, even bearing what it
> entails for DMA performance (hisilicon might be an example).

I do expect this will be true.

> > At the very least we must decide what to do with device-provided dirty
> > tracking before the VFIO type1 stuff can be altered to use the system
> > IOMMU.
> 
> I, too, have been wondering what that is going to look like -- and how do we
> convey the setup of dirty tracking versus the steering of it.

What I suggested was to just split them.

Some ioctl toward IOMMUFD will turn on the system iommu tracker - this
would be on a per-domain basis, not on the ioas.

Some ioctl toward the vfio device will turn on the device's tracker.

Userspace has to iterate and query each enabled tracker. This is not
so bad because the time to make the system call is going to be tiny
compared to the time to marshal XXGB of dirty bits.

This makes qemu more complicated because it has to decide what
trackers to turn on, but that is also the point because we do want
userspace to be able to decide.

The other idea that has some possible interest is to allow the
trackers to dump their dirty bits into the existing kvm tracker, then
userspace just does a single kvm centric dirty pass.

Jason
Joao Martins Feb. 25, 2022, 7:18 p.m. UTC | #12
On 2/23/22 01:03, Jason Gunthorpe wrote:
> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:
>>>> If by conclusion you mean the whole thing to be merged, how can the work be
>>>> broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant
>>>> in terms of direction...
>>>
>>> I think go ahead and build it on top of iommufd, start working out the
>>> API details, etc. I think once the direction is concluded the new APIs
>>> will go forward.
>>>
>> /me nods, will do. Looking at your repository it is looking good.
> 
> I would like to come with some plan for dirty tracking on iommufd and
> combine that with a plan for dirty tracking inside the new migration
> drivers.
> 
I had a few things going on my end over the past weeks, albeit it is
getting a bit better now and I will be coming back to this topic. I hope/want
to give you a more concrete update/feedback over the coming week or two wrt
to dirty-tracking+iommufd+amd.

So far, I am not particularly concerned that this will affect overall iommufd
design. The main thing is really lookups to get vendor iopte, upon on what might
be a iommu_sync_dirty_bitmap(domain, iova, size) API. For toggling the tracking,
I'll be simplifying the interface in the other type1 series I had and making it
a simple iommu_set_feature(domain, flag, value) behind an ioctl for iommufd that can
enable/disable over a domain. Perhaps same trick could be expanded to other
features to have a bit more control on what userspace is allowed to use. I think
this just needs to set/clear a feature bit in the domain, for VFIO or userspace
to have full control during the different stages of migration of dirty tracking.
In all of the IOMMU implementations/manuals I read it means setting a protection
domain descriptor flag: AMD is a 2-bit field in the DTE, on Intel likewise but on
the PASID table only for scalable-mode PTEs, on SMMUv3.2 there's an equivalent
(albeit past work had also it always-on).

Provided the iommufd does /separately/ more finer granularity on what we can
do with page tables. Thus the VMM can demote/promote the ioptes to a lower page size
at will as separate operations, before and after migration respectivally. That logic
would probably be better to be in separate iommufd ioctls(), as that it's going to be
expensive.


>> I, too, have been wondering what that is going to look like -- and how do we
>> convey the setup of dirty tracking versus the steering of it.
> 
> What I suggested was to just split them.
> 
> Some ioctl toward IOMMUFD will turn on the system iommu tracker - this
> would be on a per-domain basis, not on the ioas.
> 
> Some ioctl toward the vfio device will turn on the device's tracker.
> 
In the activation/fetching-data of either trackers I see some things in common in
terms of UAPI with the difference that whether a device or a list of devices are passed on
as an argument of exiting dirty-track vfio ioctls(). (At least that's how I am reading
your suggestion)

Albeit perhaps the main difference is going to be that one needs to setup with
hardware interface with the device tracker and how we carry the regions of memory that
want to be tracked i.e. GPA/IOVA ranges that the device should track. The tracking-GPA
space is not linear GPA space sadly. But at the same point perhaps the internal VFIO API
between core-VFIO and vendor-VFIO is just reading the @dma ranges we mapped.

In IOMMU this is sort of cheap and 'stateless', but on the setup of the
device tracker might mean giving all the IOVA ranges to the PF (once?).
Perhaps leaving to the vendor driver to pick when to register the IOVA space to
be tracked, or perhaps when you turn on the device's tracker. But on all cases,
the driver needs some form of awareness of and convey that to the PF for
tracking purposes.

> Userspace has to iterate and query each enabled tracker. This is not
> so bad because the time to make the system call is going to be tiny
> compared to the time to marshal XXGB of dirty bits.
> 
Yeap. The high cost is scanning vendor-iommu ioptes and marshaling to a bitmap,
following by a smaller cost copying back to userspace (which KVM does too, when it's using
a bitmap, same as VFIO today). Maybe could be optimized to either avoid the copy
(gup as you mentioned earlier in the thread), or just copying based on the input bitmap
(from PF) number of leading zeroes within some threshold.

> This makes qemu more complicated because it has to decide what
> trackers to turn on, but that is also the point because we do want
> userspace to be able to decide.
> 
If the interface wants extending to pass a device or an array of devices (if I understood
you correctly), it would free/simplify VFIO from having to concatenate potentially
different devices bitmaps into one. Albeit would require optimizing the marshalling a bit
more to avoid performing too much copying.

> The other idea that has some possible interest is to allow the
> trackers to dump their dirty bits into the existing kvm tracker, then
> userspace just does a single kvm centric dirty pass.

That would probably limit certain more modern options of ring based dirty tracking,
as that kvm dirty bitmap is mutually exclusive with kvm dirty ring. Or at least,
would require KVM to always use a bitmap during migration/dirty-rate-estimation with
the presence of vfio/vdpa devices. It's a nice idea, though. It would require making
core-iommu aware of bitmap as external storage for tracking (that is not iommufd as
it's a module).

Although, perhaps IOMMU sharing pgtables with CPUs would probably better align with
reusing KVM existing dirty bitmap interface. But I haven't given much thought on that one.
I just remember reading something on the IOMMU manual about this (ISTR that iommu_v2 was
made for that purpose).
Jason Gunthorpe Feb. 25, 2022, 8:44 p.m. UTC | #13
On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote:
> On 2/23/22 01:03, Jason Gunthorpe wrote:
> > On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:
> >>>> If by conclusion you mean the whole thing to be merged, how can the work be
> >>>> broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant
> >>>> in terms of direction...
> >>>
> >>> I think go ahead and build it on top of iommufd, start working out the
> >>> API details, etc. I think once the direction is concluded the new APIs
> >>> will go forward.
> >>>
> >> /me nods, will do. Looking at your repository it is looking good.
> > 
> > I would like to come with some plan for dirty tracking on iommufd and
> > combine that with a plan for dirty tracking inside the new migration
> > drivers.
> > 
> I had a few things going on my end over the past weeks, albeit it is
> getting a bit better now and I will be coming back to this topic. I hope/want
> to give you a more concrete update/feedback over the coming week or two wrt
> to dirty-tracking+iommufd+amd.
> 
> So far, I am not particularly concerned that this will affect overall iommufd
> design. The main thing is really lookups to get vendor iopte, upon on what might
> be a iommu_sync_dirty_bitmap(domain, iova, size) API. For toggling
> the tracking,

I'm not very keen on these multiplexer interfaces. I think you should
just add a new ops to the new iommu_domain_ops 'set_dirty_tracking'
'read_dirty_bits'

NULL op means not supported.

IMHO we don't need a kapi wrapper if only iommufd is going to call the
op.

> I'll be simplifying the interface in the other type1 series I had and making it
> a simple iommu_set_feature(domain, flag, value) behind an ioctl for iommufd that can
> enable/disable over a domain. Perhaps same trick could be expanded to other
> features to have a bit more control on what userspace is allowed to use. I think
> this just needs to set/clear a feature bit in the domain, for VFIO or userspace
> to have full control during the different stages of migration of dirty tracking.
> In all of the IOMMU implementations/manuals I read it means setting a protection
> domain descriptor flag: AMD is a 2-bit field in the DTE, on Intel likewise but on
> the PASID table only for scalable-mode PTEs, on SMMUv3.2 there's an equivalent
> (albeit past work had also it always-on).
> 
> Provided the iommufd does /separately/ more finer granularity on what we can
> do with page tables. Thus the VMM can demote/promote the ioptes to a lower page size
> at will as separate operations, before and after migration respectivally. That logic
> would probably be better to be in separate iommufd ioctls(), as that it's going to be
> expensive.

This all sounds right to me

Questions I have:
 - Do we need ranges for some reason? You mentioned ARM SMMU wants
   ranges? how/what/why?

 - What about the unmap and read dirty without races operation that
   vfio has?

> >> I, too, have been wondering what that is going to look like -- and how do we
> >> convey the setup of dirty tracking versus the steering of it.
> > 
> > What I suggested was to just split them.
> > 
> > Some ioctl toward IOMMUFD will turn on the system iommu tracker - this
> > would be on a per-domain basis, not on the ioas.
> > 
> > Some ioctl toward the vfio device will turn on the device's tracker.
> > 
> In the activation/fetching-data of either trackers I see some things in common in
> terms of UAPI with the difference that whether a device or a list of devices are passed on
> as an argument of exiting dirty-track vfio ioctls(). (At least that's how I am reading
> your suggestion)

I was thinking a VFIO_DEVICE ioctl located on the device FD
implemented in the end VFIO driver (like mlx5_vfio). No lists..

As you say the driver should just take in the request to set dirty
tracking and take core of it somehow. There is no value the core VFIO
code can add here.

> Albeit perhaps the main difference is going to be that one needs to
> setup with hardware interface with the device tracker and how we
> carry the regions of memory that want to be tracked i.e. GPA/IOVA
> ranges that the device should track. The tracking-GPA space is not
> linear GPA space sadly. But at the same point perhaps the internal
> VFIO API between core-VFIO and vendor-VFIO is just reading the @dma
> ranges we mapped.

Yes, this is a point that needs some answering. One option is to pass
in the tracking range list from userspace. Another is to query it in
the driver from the currently mapped areas in IOAS.

I know devices have limitations here in terms of how many/how big the
ranges can be, and devices probably can't track dynamic changes.

> In IOMMU this is sort of cheap and 'stateless', but on the setup of the
> device tracker might mean giving all the IOVA ranges to the PF (once?).
> Perhaps leaving to the vendor driver to pick when to register the IOVA space to
> be tracked, or perhaps when you turn on the device's tracker. But on all cases,
> the driver needs some form of awareness of and convey that to the PF for
> tracking purposes.

Yes, this is right
 
> Yeap. The high cost is scanning vendor-iommu ioptes and marshaling to a bitmap,
> following by a smaller cost copying back to userspace (which KVM does too, when it's using
> a bitmap, same as VFIO today). Maybe could be optimized to either avoid the copy
> (gup as you mentioned earlier in the thread), or just copying based on the input bitmap
> (from PF) number of leading zeroes within some threshold.

What I would probably strive for is an API that deliberately OR's in
the dirty bits. So GUP and kmap a 4k page then call the driver to 'or
in your dirty data', then do the next page. etc. That is 134M of IOVA
per chunk which seems OK.

> > This makes qemu more complicated because it has to decide what
> > trackers to turn on, but that is also the point because we do want
> > userspace to be able to decide.
> > 
> If the interface wants extending to pass a device or an array of devices (if I understood
> you correctly), it would free/simplify VFIO from having to concatenate potentially
> different devices bitmaps into one. Albeit would require optimizing the marshalling a bit
> more to avoid performing too much copying.

Yes. Currently VFIO maintains its own bitmap so it also saves that
memory by keeping the dirty bits stored in the IOPTEs until read out.
 
> > The other idea that has some possible interest is to allow the
> > trackers to dump their dirty bits into the existing kvm tracker, then
> > userspace just does a single kvm centric dirty pass.
> 
> That would probably limit certain more modern options of ring based dirty tracking,
> as that kvm dirty bitmap is mutually exclusive with kvm dirty ring. Or at least,
> would require KVM to always use a bitmap during migration/dirty-rate-estimation with
> the presence of vfio/vdpa devices. It's a nice idea, though. It would require making
> core-iommu aware of bitmap as external storage for tracking (that is not iommufd as
> it's a module).

Yes, I don't know enough about kvm to say if that is a great idea or
not. The fact the CPUs seem to be going to logging instead of bitmaps
suggests it isn't. I don't think DMA devices can work effectively with
logging..

Jason
Joao Martins Feb. 28, 2022, 1:01 p.m. UTC | #14
On 2/25/22 20:44, Jason Gunthorpe wrote:
> On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote:
>> On 2/23/22 01:03, Jason Gunthorpe wrote:
>>> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:
>>>>>> If by conclusion you mean the whole thing to be merged, how can the work be
>>>>>> broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant
>>>>>> in terms of direction...
>>>>>
>>>>> I think go ahead and build it on top of iommufd, start working out the
>>>>> API details, etc. I think once the direction is concluded the new APIs
>>>>> will go forward.
>>>>>
>>>> /me nods, will do. Looking at your repository it is looking good.
>>>
>>> I would like to come with some plan for dirty tracking on iommufd and
>>> combine that with a plan for dirty tracking inside the new migration
>>> drivers.
>>>
>> I had a few things going on my end over the past weeks, albeit it is
>> getting a bit better now and I will be coming back to this topic. I hope/want
>> to give you a more concrete update/feedback over the coming week or two wrt
>> to dirty-tracking+iommufd+amd.
>>
>> So far, I am not particularly concerned that this will affect overall iommufd
>> design. The main thing is really lookups to get vendor iopte, upon on what might
>> be a iommu_sync_dirty_bitmap(domain, iova, size) API. For toggling
>> the tracking,
> 
> I'm not very keen on these multiplexer interfaces. I think you should
> just add a new ops to the new iommu_domain_ops 'set_dirty_tracking'
> 'read_dirty_bits'
> 
> NULL op means not supported.
> 
> IMHO we don't need a kapi wrapper if only iommufd is going to call the
> op.
> 

OK, good point.

Even without a kapi wrapper I am still wondering whether the iommu op needs to
be something like a generic iommu feature toggling (e.g. .set_feature()), rather
than one that sits "hardcoded" as set_dirty(). Unless dirty right now is about
the only feature we will change out-of-band in the protection-domain.
I guess I can stay with set_ad_tracking/set_dirty_tracking and if should
need arise we will expand with a generic .set_feature(dom, IOMMU_DIRTY | IOMMU_ACCESS).

Regarding the dirty 'data' that's one that I am wondering about. I called it 'sync'
because the sync() doesn't only read, but also "writes" to root pagetable to update
the dirty bit (and then IOTLB flush). And that's about what VFIO current interface
does (i.e. there's only a GET_BITMAP in the ioctl) and no explicit interface to clear.

And TBH, the question on whether we need a clear op isn't immediately obvious: reading
the access/dirty bit is cheap for the IOMMU, the problem OTOH is the expensive
io page table walk thus expensive in sw. The clear-dirty part, though, is precise on what
it wants to clear (in principle cheaper on io-page-table walk as you just iterate over
sets of bits to clear) but then it incurs a DMA perf hit given that we need to flush the
IOTLBs. Given the IOTLB flush is batched (over a course of a dirty updates) perhaps this
isn't immediately clear that is a problem in terms of total overall ioctl cost. Hence my
thinking in merging both in one sync_dirty_bitmap() as opposed to more KVM-style of
get_dirty_bitmap() and clear_dirty_ditmap().

Hopefully a futuristic IOMMUs would just get a new DTE/CD/etc field for stashing a set of
PFNs (for a bitmap) that the IOMMU would use for setting the dirty bits there. That is,
rather than forcing sw to split/merge pagetables for better granularity and having to
flush IOTLBs for dirty to be written again :(

>> I'll be simplifying the interface in the other type1 series I had and making it
>> a simple iommu_set_feature(domain, flag, value) behind an ioctl for iommufd that can
>> enable/disable over a domain. Perhaps same trick could be expanded to other
>> features to have a bit more control on what userspace is allowed to use. I think
>> this just needs to set/clear a feature bit in the domain, for VFIO or userspace
>> to have full control during the different stages of migration of dirty tracking.
>> In all of the IOMMU implementations/manuals I read it means setting a protection
>> domain descriptor flag: AMD is a 2-bit field in the DTE, on Intel likewise but on
>> the PASID table only for scalable-mode PTEs, on SMMUv3.2 there's an equivalent
>> (albeit past work had also it always-on).
>>
>> Provided the iommufd does /separately/ more finer granularity on what we can
>> do with page tables. Thus the VMM can demote/promote the ioptes to a lower page size
>> at will as separate operations, before and after migration respectivally. That logic
>> would probably be better to be in separate iommufd ioctls(), as that it's going to be
>> expensive.
> 
> This all sounds right to me
> 
> Questions I have:
>  - Do we need ranges for some reason? You mentioned ARM SMMU wants
>    ranges? how/what/why?
> 
Ignore that. I got mislead by the implementation and when I read the SDM
I realized that the implementation was doing the same thing I was doing
i.e. enabling dirty-bit in the protection domain at start rather than
dynamic toggling. So ARM is similar to Intel/AMD in which you set CD.HD
bit in the context descriptor to enable dirty bits or the STE.S2HD in the
stream table entry for the stage2 equivalent. Nothing here is per-range
basis. And the ranges was only used by the implementation for the eager
splitting/merging of IO page table levels.

>  - What about the unmap and read dirty without races operation that
>    vfio has?
> 
I am afraid that might need a new unmap iommu op that reads the dirty bit
after clearing the page table entry. It's marshalling the bits from
iopte into a bitmap as opposed to some logic added on top. As far as I
looked for AMD this isn't difficult to add, (same for Intel) it can
*I think* reuse all of the unmap code.

>>>> I, too, have been wondering what that is going to look like -- and how do we
>>>> convey the setup of dirty tracking versus the steering of it.
>>>
>>> What I suggested was to just split them.
>>>
>>> Some ioctl toward IOMMUFD will turn on the system iommu tracker - this
>>> would be on a per-domain basis, not on the ioas.
>>>
>>> Some ioctl toward the vfio device will turn on the device's tracker.
>>>
>> In the activation/fetching-data of either trackers I see some things in common in
>> terms of UAPI with the difference that whether a device or a list of devices are passed on
>> as an argument of exiting dirty-track vfio ioctls(). (At least that's how I am reading
>> your suggestion)
> 
> I was thinking a VFIO_DEVICE ioctl located on the device FD
> implemented in the end VFIO driver (like mlx5_vfio). No lists..
> 
Interesting. I was assuming that we wanted to keep the same ioctl()
interface for dirty-tracking hence me mentioning the device/device-list on
top of the DIRTY ioctl. But well on a second thought it doesn't make sense
given that we query a container and we want to move away from vfio for iopgtbl
related-work and rather into iommufd direct access instead by the VMM. So
placing more dependency on that ioctl wouldn't align with that. I suppose, it
makes sense that this is on a per-vfio-device basis, hence device-vfio ioctl().

> As you say the driver should just take in the request to set dirty
> tracking and take core of it somehow. There is no value the core VFIO
> code can add here.
> 
>> Albeit perhaps the main difference is going to be that one needs to
>> setup with hardware interface with the device tracker and how we
>> carry the regions of memory that want to be tracked i.e. GPA/IOVA
>> ranges that the device should track. The tracking-GPA space is not
>> linear GPA space sadly. But at the same point perhaps the internal
>> VFIO API between core-VFIO and vendor-VFIO is just reading the @dma
>> ranges we mapped.
> 
> Yes, this is a point that needs some answering. One option is to pass
> in the tracking range list from userspace. Another is to query it in
> the driver from the currently mapped areas in IOAS.
> 
I sort of like the second given that we de-duplicate info that is already
tracked by IOAS -- it would be mostly internal and then it would be a
matter of when does this device tracker is set up, and whether we should
differentiate tracker "start"/"stop" vs "setup"/"teardown".

> I know devices have limitations here in terms of how many/how big the
> ranges can be, and devices probably can't track dynamic changes.
> 
/me nods

Should this be some sort of capability perhaps? Given that this may limit
how many concurrent VFs can be migrated and how much ranges it can store,
for example.

(interestingly and speaking of VF capabilities, it's yet another thing to
tackle in the migration stream between src and dst hosts)

>> In IOMMU this is sort of cheap and 'stateless', but on the setup of the
>> device tracker might mean giving all the IOVA ranges to the PF (once?).
>> Perhaps leaving to the vendor driver to pick when to register the IOVA space to
>> be tracked, or perhaps when you turn on the device's tracker. But on all cases,
>> the driver needs some form of awareness of and convey that to the PF for
>> tracking purposes.
> 
> Yes, this is right
>  
>> Yeap. The high cost is scanning vendor-iommu ioptes and marshaling to a bitmap,
>> following by a smaller cost copying back to userspace (which KVM does too, when it's using
>> a bitmap, same as VFIO today). Maybe could be optimized to either avoid the copy
>> (gup as you mentioned earlier in the thread), or just copying based on the input bitmap
>> (from PF) number of leading zeroes within some threshold.
> 
> What I would probably strive for is an API that deliberately OR's in
> the dirty bits. So GUP and kmap a 4k page then call the driver to 'or
> in your dirty data', then do the next page. etc. That is 134M of IOVA
> per chunk which seems OK.
> 
Yeap, sort of what I was aiming for.

>>> This makes qemu more complicated because it has to decide what
>>> trackers to turn on, but that is also the point because we do want
>>> userspace to be able to decide.
>>>
>> If the interface wants extending to pass a device or an array of devices (if I understood
>> you correctly), it would free/simplify VFIO from having to concatenate potentially
>> different devices bitmaps into one. Albeit would require optimizing the marshalling a bit
>> more to avoid performing too much copying.
> 
> Yes. Currently VFIO maintains its own bitmap so it also saves that
> memory by keeping the dirty bits stored in the IOPTEs until read out.
>  
/me nods with the iommu, yes.
Joao Martins March 1, 2022, 1:06 p.m. UTC | #15
On 2/28/22 21:01, Jason Gunthorpe wrote:
> On Mon, Feb 28, 2022 at 01:01:39PM +0000, Joao Martins wrote:
>> On 2/25/22 20:44, Jason Gunthorpe wrote:
>>> On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote:
>>>> On 2/23/22 01:03, Jason Gunthorpe wrote:
>>>>> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:
>>>>>>>> If by conclusion you mean the whole thing to be merged, how can the work be
>>>>>>>> broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant
>>>>>>>> in terms of direction...
>>>>>>>
>>>>>>> I think go ahead and build it on top of iommufd, start working out the
>>>>>>> API details, etc. I think once the direction is concluded the new APIs
>>>>>>> will go forward.
>>>>>>>
>>>>>> /me nods, will do. Looking at your repository it is looking good.
>>>>>
>>>>> I would like to come with some plan for dirty tracking on iommufd and
>>>>> combine that with a plan for dirty tracking inside the new migration
>>>>> drivers.
>>>>>
>>>> I had a few things going on my end over the past weeks, albeit it is
>>>> getting a bit better now and I will be coming back to this topic. I hope/want
>>>> to give you a more concrete update/feedback over the coming week or two wrt
>>>> to dirty-tracking+iommufd+amd.
>>>>
>>>> So far, I am not particularly concerned that this will affect overall iommufd
>>>> design. The main thing is really lookups to get vendor iopte, upon on what might
>>>> be a iommu_sync_dirty_bitmap(domain, iova, size) API. For toggling
>>>> the tracking,
>>>
>>> I'm not very keen on these multiplexer interfaces. I think you should
>>> just add a new ops to the new iommu_domain_ops 'set_dirty_tracking'
>>> 'read_dirty_bits'
>>>
>>> NULL op means not supported.
>>>
>>> IMHO we don't need a kapi wrapper if only iommufd is going to call the
>>> op.
>>>
>>
>> OK, good point.
>>
>> Even without a kapi wrapper I am still wondering whether the iommu op needs to
>> be something like a generic iommu feature toggling (e.g. .set_feature()), rather
>> than one that sits "hardcoded" as set_dirty(). Unless dirty right now is about
>> the only feature we will change out-of-band in the protection-domain.
>> I guess I can stay with set_ad_tracking/set_dirty_tracking and if should
>> need arise we will expand with a generic .set_feature(dom, IOMMU_DIRTY | IOMMU_ACCESS).
> 
> I just generally dislike multiplexers like this. We are already
> calling through a function pointer struct, why should the driver
> implement another switch/case just to find out which function pointer
> the caller really ment to use? It doesn't make things faster, it
> doesn't make things smaller, it doesn't use less LOC. Why do it?
> 
I concur with you in the above but I don't mean it like a multiplexer.
Rather, mimicking the general nature of feature bits in the protection domain
(or the hw pagetable abstraction). So hypothetically for every bit ... if you
wanted to create yet another op that just flips a bit of the
DTEs/PASID-table/CD it would be an excessive use of callbacks to get
to in the iommu_domain_ops if they all set do the same thing.
Right now it's only Dirty (Access bit I don't see immediate use for it
right now) bits, but could there be more? Perhaps this is something to
think about.

Anyways, I am anyways sticking with plain ops function.

>> Regarding the dirty 'data' that's one that I am wondering about. I called it 'sync'
>> because the sync() doesn't only read, but also "writes" to root pagetable to update
>> the dirty bit (and then IOTLB flush). And that's about what VFIO current interface
>> does (i.e. there's only a GET_BITMAP in the ioctl) and no explicit interface to clear.
> 
> 'read and clear' is what I'd suggest
> 
/me nods


>>>  - What about the unmap and read dirty without races operation that
>>>    vfio has?
>>>
>> I am afraid that might need a new unmap iommu op that reads the dirty bit
>> after clearing the page table entry. It's marshalling the bits from
>> iopte into a bitmap as opposed to some logic added on top. As far as I
>> looked for AMD this isn't difficult to add, (same for Intel) it can
>> *I think* reuse all of the unmap code.
> 
> Ok. It feels necessary to be complete
> 
Yes, I guess it's mandatory if we want to fully implement vfio-compat dirty
tracking IOCTLs, too.

>>> Yes, this is a point that needs some answering. One option is to pass
>>> in the tracking range list from userspace. Another is to query it in
>>> the driver from the currently mapped areas in IOAS.
>>>
>> I sort of like the second given that we de-duplicate info that is already
>> tracked by IOAS -- it would be mostly internal and then it would be a
>> matter of when does this device tracker is set up, and whether we should
>> differentiate tracker "start"/"stop" vs "setup"/"teardown".
> 
> One problem with this is that devices that don't support dynamic
> tracking are stuck in vIOMMU cases where the IOAS will have some tiny
> set of all memory mapped. 
> 
Sorry to be pedantic, when you say 'dynamic tracking' for you it just means
that you have no limitation of ranges and fw/hw can cope with being fed of
'new-ranges' to enable dirty-tracking. Or does it mean something else at
hw level? Like dirty bitmap being pulled out of device memory, and hw keeps
its own state of dirtying and sw 'just' needs to sync the bitmap every scan.
Jason Gunthorpe March 1, 2022, 1:54 p.m. UTC | #16
On Tue, Mar 01, 2022 at 01:06:30PM +0000, Joao Martins wrote:

> I concur with you in the above but I don't mean it like a multiplexer.
> Rather, mimicking the general nature of feature bits in the protection domain
> (or the hw pagetable abstraction). So hypothetically for every bit ... if you
> wanted to create yet another op that just flips a bit of the
> DTEs/PASID-table/CD it would be an excessive use of callbacks to get
> to in the iommu_domain_ops if they all set do the same thing.
> Right now it's only Dirty (Access bit I don't see immediate use for it
> right now) bits, but could there be more? Perhaps this is something to
> think about.

Not sure it matters :)

> > One problem with this is that devices that don't support dynamic
> > tracking are stuck in vIOMMU cases where the IOAS will have some tiny
> > set of all memory mapped. 
> > 
> Sorry to be pedantic, when you say 'dynamic tracking' for you it just means
> that you have no limitation of ranges and fw/hw can cope with being fed of
> 'new-ranges' to enable dirty-tracking. 

Yes, the ranges can change once the tracking starts, like the normal
IOMMU can do

We are looking at devices where the HW can track a range at the start
of tracking and that range cannot change over time.

So, we need to track the whole guest memory and some extra for memory
hot plug, not just the currently viommu mapped things.

Jason
Joao Martins March 1, 2022, 2:27 p.m. UTC | #17
On 3/1/22 13:54, Jason Gunthorpe wrote:
> On Tue, Mar 01, 2022 at 01:06:30PM +0000, Joao Martins wrote:
> 
>> I concur with you in the above but I don't mean it like a multiplexer.
>> Rather, mimicking the general nature of feature bits in the protection domain
>> (or the hw pagetable abstraction). So hypothetically for every bit ... if you
>> wanted to create yet another op that just flips a bit of the
>> DTEs/PASID-table/CD it would be an excessive use of callbacks to get
>> to in the iommu_domain_ops if they all set do the same thing.
>> Right now it's only Dirty (Access bit I don't see immediate use for it
>> right now) bits, but could there be more? Perhaps this is something to
>> think about.
> 
> Not sure it matters :)
> 
Hehe, most likely it doesn't :)

>>> One problem with this is that devices that don't support dynamic
>>> tracking are stuck in vIOMMU cases where the IOAS will have some tiny
>>> set of all memory mapped. 
>>>
>> Sorry to be pedantic, when you say 'dynamic tracking' for you it just means
>> that you have no limitation of ranges and fw/hw can cope with being fed of
>> 'new-ranges' to enable dirty-tracking. 
> 
> Yes, the ranges can change once the tracking starts, like the normal
> IOMMU can do
> 
> We are looking at devices where the HW can track a range at the start
> of tracking and that range cannot change over time.
> 
> So, we need to track the whole guest memory and some extra for memory
> hot plug, not just the currently viommu mapped things.

I'm aware of the guest memmap accounting foo, so I was already factoring
that in the equation. I was just mainly grokking at hardware limitations
you are coming from to be captured in these future VFIO extensions.

Tracking the max possible GPA (as you hinted earlier) could solve the two natures
of write-tracking, also given that the IOMMU ultimately protects what's really
allowed to be DMA-written to. Unless such tracking-bitmap is placed on device
memory, where space probably needs to be more carefully managed.

Anyways, thanks for the insights!
Joao Martins March 11, 2022, 1:51 p.m. UTC | #18
On 2/28/22 13:01, Joao Martins wrote:
> On 2/25/22 20:44, Jason Gunthorpe wrote:
>> On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote:
>>> On 2/23/22 01:03, Jason Gunthorpe wrote:
>>>> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:
>>> I'll be simplifying the interface in the other type1 series I had and making it
>>> a simple iommu_set_feature(domain, flag, value) behind an ioctl for iommufd that can
>>> enable/disable over a domain. Perhaps same trick could be expanded to other
>>> features to have a bit more control on what userspace is allowed to use. I think
>>> this just needs to set/clear a feature bit in the domain, for VFIO or userspace
>>> to have full control during the different stages of migration of dirty tracking.
>>> In all of the IOMMU implementations/manuals I read it means setting a protection
>>> domain descriptor flag: AMD is a 2-bit field in the DTE, on Intel likewise but on
>>> the PASID table only for scalable-mode PTEs, on SMMUv3.2 there's an equivalent
>>> (albeit past work had also it always-on).
>>>
>>> Provided the iommufd does /separately/ more finer granularity on what we can
>>> do with page tables. Thus the VMM can demote/promote the ioptes to a lower page size
>>> at will as separate operations, before and after migration respectivally. That logic
>>> would probably be better to be in separate iommufd ioctls(), as that it's going to be
>>> expensive.
>>
>> This all sounds right to me
>>
>> Questions I have:
>>  - Do we need ranges for some reason? You mentioned ARM SMMU wants
>>    ranges? how/what/why?
>>
> Ignore that. I got mislead by the implementation and when I read the SDM
> I realized that the implementation was doing the same thing I was doing
> i.e. enabling dirty-bit in the protection domain at start rather than
> dynamic toggling. So ARM is similar to Intel/AMD in which you set CD.HD
> bit in the context descriptor to enable dirty bits or the STE.S2HD in the
> stream table entry for the stage2 equivalent. Nothing here is per-range
> basis. And the ranges was only used by the implementation for the eager
> splitting/merging of IO page table levels.
> 
>>  - What about the unmap and read dirty without races operation that
>>    vfio has?
>>
> I am afraid that might need a new unmap iommu op that reads the dirty bit
> after clearing the page table entry. It's marshalling the bits from
> iopte into a bitmap as opposed to some logic added on top. As far as I
> looked for AMD this isn't difficult to add, (same for Intel) it can
> *I think* reuse all of the unmap code.
> 

OK, made some progress.

It's a WIP (here be dragons!) and still missing things e.g. iommufd selftests,
revising locking, bugs, and more -- works with my emulated qemu patches which
is a good sign. But hopefully starts some sort of skeleton of what we were
talking about in the above thread.

The bigger TODO, though is the equivalent UAPI for IOMMUFD; I started with
the vfio-compat one as it was easier provided there's existing userspace to work
with (Qemu). To be fair the API is not that "far" from what would be IOMMUFD onto
steering the dirty tracking, read-clear the dirty bits, unmap and get dirty. But
as we discussed earlier, the one gap of VFIO dirty API is that it lacks controls
for upgrading/downgrading area/IOPTE sizes which is where IOMMUFD would most
likely shine. When that latter part is done we can probably adopt an eager-split
approach inside vfio-compat.

Additionally I also sort of want to skeleton ARM and Intel to see how it looks.
Some of the commits made notes of some of research I made, so *I think* the APIs
introduced capture h/w semantics for all the three IOMMUs supporting dirty
tracking.

I am storing my dirty-tracking bits here:

	https://github.com/jpemartins/linux	iommufd

It's a version of it rebased on top of your iommufd branch.
Jason Gunthorpe March 15, 2022, 7:29 p.m. UTC | #19
On Fri, Mar 11, 2022 at 01:51:32PM +0000, Joao Martins wrote:
> On 2/28/22 13:01, Joao Martins wrote:
> > On 2/25/22 20:44, Jason Gunthorpe wrote:
> >> On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote:
> >>> On 2/23/22 01:03, Jason Gunthorpe wrote:
> >>>> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:
> >>> I'll be simplifying the interface in the other type1 series I had and making it
> >>> a simple iommu_set_feature(domain, flag, value) behind an ioctl for iommufd that can
> >>> enable/disable over a domain. Perhaps same trick could be expanded to other
> >>> features to have a bit more control on what userspace is allowed to use. I think
> >>> this just needs to set/clear a feature bit in the domain, for VFIO or userspace
> >>> to have full control during the different stages of migration of dirty tracking.
> >>> In all of the IOMMU implementations/manuals I read it means setting a protection
> >>> domain descriptor flag: AMD is a 2-bit field in the DTE, on Intel likewise but on
> >>> the PASID table only for scalable-mode PTEs, on SMMUv3.2 there's an equivalent
> >>> (albeit past work had also it always-on).
> >>>
> >>> Provided the iommufd does /separately/ more finer granularity on what we can
> >>> do with page tables. Thus the VMM can demote/promote the ioptes to a lower page size
> >>> at will as separate operations, before and after migration respectivally. That logic
> >>> would probably be better to be in separate iommufd ioctls(), as that it's going to be
> >>> expensive.
> >>
> >> This all sounds right to me
> >>
> >> Questions I have:
> >>  - Do we need ranges for some reason? You mentioned ARM SMMU wants
> >>    ranges? how/what/why?
> >>
> > Ignore that. I got mislead by the implementation and when I read the SDM
> > I realized that the implementation was doing the same thing I was doing
> > i.e. enabling dirty-bit in the protection domain at start rather than
> > dynamic toggling. So ARM is similar to Intel/AMD in which you set CD.HD
> > bit in the context descriptor to enable dirty bits or the STE.S2HD in the
> > stream table entry for the stage2 equivalent. Nothing here is per-range
> > basis. And the ranges was only used by the implementation for the eager
> > splitting/merging of IO page table levels.
> > 
> >>  - What about the unmap and read dirty without races operation that
> >>    vfio has?
> >>
> > I am afraid that might need a new unmap iommu op that reads the dirty bit
> > after clearing the page table entry. It's marshalling the bits from
> > iopte into a bitmap as opposed to some logic added on top. As far as I
> > looked for AMD this isn't difficult to add, (same for Intel) it can
> > *I think* reuse all of the unmap code.
> > 
> 
> OK, made some progress.

Nice!

 
> It's a WIP (here be dragons!) and still missing things e.g. iommufd selftests,
> revising locking, bugs, and more -- works with my emulated qemu patches which
> is a good sign. But hopefully starts some sort of skeleton of what we were
> talking about in the above thread.

I'm a bit bogged with the coming merge window right now, but will have
more to say in a bit
 
> The bigger TODO, though is the equivalent UAPI for IOMMUFD; I started with
> the vfio-compat one as it was easier provided there's existing userspace to work
> with (Qemu). To be fair the API is not that "far" from what would be IOMMUFD onto
> steering the dirty tracking, read-clear the dirty bits, unmap and
> get dirty. 

I think this is fine to start experimenting with

> But as we discussed earlier, the one gap of VFIO dirty API is that
> it lacks controls for upgrading/downgrading area/IOPTE sizes which
> is where IOMMUFD would most likely shine. When that latter part is
> done we can probably adopt an eager-split approach inside
> vfio-compat.

I think the native API should be new ioctls that operate on the
hw_pagetable object to:
  - enable/disable dirty tracking 
  - read&clear a bitmap from a range
  - read&unmap a bitmap from a range
  - Manipulate IOPTE sizes

As you say it should not be much distance from the VFIO compat stuff

Most probably I would say to leave dirty tracking out of the type1 api
and compat for it. Maybe we can make some limited cases work back
compat, like the whole ioas supports iommu dirty tracking or
something..

Need to understand if it is wortwhile - remember to use any of this
you need a qemu that is updated to the v2 migration interface,
so there is little practical value in back compat to old qemu if we
expect qemu will use the native interface anyhow.

> Additionally I also sort of want to skeleton ARM and Intel to see how it looks.
> Some of the commits made notes of some of research I made, so *I think* the APIs
> introduced capture h/w semantics for all the three IOMMUs supporting dirty
> tracking.

I think the primitives are pretty basic, I'd be surprised if there is
something different :)

Though things to be thinking about are how does this work with nested
translation and other advanced features.

Thanks,
Jason
Jason Gunthorpe March 18, 2022, 5:34 p.m. UTC | #20
On Fri, Mar 18, 2022 at 05:12:01PM +0000, Joao Martins wrote:
> I updated my branch and added an SMMUv3 implementation of the whole thing (slightly based
> on the past work) and adjusted the 'set tracking' structure to cover this slightly
> different h/w construct above. At the high-level we have 'set_dirty_tracking_range' API,
> which is internal in iommufd obviously. The UAPI won't change ofc.

Shameerali has HW, AFAIK..

Jason