mbox series

[QEMU,v25,00/17] Add migration support for VFIO devices

Message ID 1600817059-26721-1-git-send-email-kwankhede@nvidia.com
Headers show
Series Add migration support for VFIO devices | expand

Message

Kirti Wankhede Sept. 22, 2020, 11:24 p.m. UTC
Hi,

This Patch set adds migration support for VFIO devices in QEMU.

This Patch set include patches as below:
Patch 1-2:
- Few code refactor

Patch 3:
- Added save and restore functions for PCI configuration space. Used
  pci_device_save() and pci_device_load() so that config space cache is saved
  and restored.

Patch 4-9:
- Generic migration functionality for VFIO device.
  * This patch set adds functionality for PCI devices, but can be
    extended to other VFIO devices.
  * Added all the basic functions required for pre-copy, stop-and-copy and
    resume phases of migration.
  * Added state change notifier and from that notifier function, VFIO
    device's state changed is conveyed to VFIO device driver.
  * During save setup phase and resume/load setup phase, migration region
    is queried and is used to read/write VFIO device data.
  * .save_live_pending and .save_live_iterate are implemented to use QEMU's
    functionality of iteration during pre-copy phase.
  * In .save_live_complete_precopy, that is in stop-and-copy phase,
    iteration to read data from VFIO device driver is implemented till pending
    bytes returned by driver are not zero.

Patch 10
- Set DIRTY_MEMORY_MIGRATION flag in dirty log mask for migration with vIOMMU
  enabled.

Patch 11:
- Get migration capability from kernel module.

Patch 12-14:
- Add function to start and stop dirty pages tracking.
- Add vfio_listerner_log_sync to mark dirty pages. Dirty pages bitmap is queried
  per container. All pages pinned by vendor driver through vfio_pin_pages
  external API has to be marked as dirty during  migration.
  When there are CPU writes, CPU dirty page tracking can identify dirtied
  pages, but any page pinned by vendor driver can also be written by
  device. As of now there is no device which has hardware support for
  dirty page tracking. So all pages which are pinned by vendor driver
  should be considered as dirty.
  In Qemu, marking pages dirty is only done when device is in stop-and-copy
  phase because if pages are marked dirty during pre-copy phase and content is
  transfered from source to distination, there is no way to know newly dirtied
  pages from the point they were copied earlier until device stops. To avoid
  repeated copy of same content, pinned pages are marked dirty only during
  stop-and-copy phase.

Patch 15:
- With vIOMMU, IO virtual address range can get unmapped while in pre-copy
  phase of migration. In that case, unmap ioctl should return pages pinned
  in that range and QEMU should report corresponding guest physical pages
  dirty.
Note: Comments from v25 for this patch are not addressed in this series.
  https://www.mail-archive.com/qemu-devel@nongnu.org/msg714646.html

Patch 16:
- Make VFIO PCI device migration capable. If migration region is not provided by
  driver, migration is blocked.

Patch 17:
- Added VFIO device stats to MigrationInfo
Note: Comments from v25 for this patch are not addressed yet.
https://www.mail-archive.com/qemu-devel@nongnu.org/msg715620.html


Yet TODO:
Since there is no device which has hardware support for system memmory
dirty bitmap tracking, right now there is no other API from vendor driver
to VFIO IOMMU module to report dirty pages. In future, when such hardware
support will be implemented, an API will be required in kernel such that
vendor driver could report dirty pages to VFIO module during migration phases.

Below is the flow of state change for live migration where states in brackets
represent VM state, migration state and VFIO device state as:
    (VM state, MIGRATION_STATUS, VFIO_DEVICE_STATE)

Live migration save path:
        QEMU normal running state
        (RUNNING, _NONE, _RUNNING)
                        |
    migrate_init spawns migration_thread.
    (RUNNING, _SETUP, _RUNNING|_SAVING)
    Migration thread then calls each device's .save_setup()
                        |
    (RUNNING, _ACTIVE, _RUNNING|_SAVING)
    If device is active, get pending bytes by .save_live_pending()
    if pending bytes >= threshold_size,  call save_live_iterate()
    Data of VFIO device for pre-copy phase is copied.
    Iterate till pending bytes converge and are less than threshold
                        |
    On migration completion, vCPUs stops and calls .save_live_complete_precopy
    for each active device. VFIO device is then transitioned in
     _SAVING state.
    (FINISH_MIGRATE, _DEVICE, _SAVING)
    For VFIO device, iterate in  .save_live_complete_precopy  until
    pending data is 0.
    (FINISH_MIGRATE, _DEVICE, _STOPPED)
                        |
    (FINISH_MIGRATE, _COMPLETED, STOPPED)
    Migraton thread schedule cleanup bottom half and exit

Live migration resume path:
    Incomming migration calls .load_setup for each device
    (RESTORE_VM, _ACTIVE, STOPPED)
                        |
    For each device, .load_state is called for that device section data
                        |
    At the end, called .load_cleanup for each device and vCPUs are started.
                        |
        (RUNNING, _NONE, _RUNNING)

Note that:
- Migration post copy is not supported.


v25 -> 26
- Removed emulated_config_bits cache and vdev->pdev.wmask from config space save
  load functions.
- Used VMStateDescription for config space save and load functionality.
- Major fixes from previous version review.
  https://www.mail-archive.com/qemu-devel@nongnu.org/msg714625.html

v23 -> 25
- Updated config space save and load to save config cache, emulated bits cache
  and wmask cache.
- Created idr string as suggested by Dr Dave that includes bus path.
- Updated save and load function to read/write data to mixed regions, mapped or
  trapped.
- When vIOMMU is enabled, created mapped iova range list which also keeps
  translated address. This list is used to mark dirty pages. This reduces
  downtime significantly with vIOMMU enabled than migration patches from
   previous version. 
- Removed get_address_limit() function from v23 patch as this not required now.

v22 -> v23
-- Fixed issue reported by Yan
https://lore.kernel.org/kvm/97977ede-3c5b-c5a5-7858-7eecd7dd531c@nvidia.com/
- Sending this version to test v23 kernel version patches:
https://lore.kernel.org/kvm/1589998088-3250-1-git-send-email-kwankhede@nvidia.com/

v18 -> v22
- Few fixes from v18 review. But not yet fixed all concerns. I'll address those
  concerns in subsequent iterations.
- Sending this version to test v22 kernel version patches:
https://lore.kernel.org/kvm/1589781397-28368-1-git-send-email-kwankhede@nvidia.com/

v16 -> v18
- Nit fixes
- Get migration capability flags from container
- Added VFIO stats to MigrationInfo
- Fixed bug reported by Yan
    https://lists.gnu.org/archive/html/qemu-devel/2020-04/msg00004.html

v9 -> v16
- KABI almost finalised on kernel patches.
- Added support for migration with vIOMMU enabled.

v8 -> v9:
- Split patch set in 2 sets, Kernel and QEMU sets.
- Dirty pages bitmap is queried from IOMMU container rather than from
  vendor driver for per device. Added 2 ioctls to achieve this.

v7 -> v8:
- Updated comments for KABI
- Added BAR address validation check during PCI device's config space load as
  suggested by Dr. David Alan Gilbert.
- Changed vfio_migration_set_state() to set or clear device state flags.
- Some nit fixes.

v6 -> v7:
- Fix build failures.

v5 -> v6:
- Fix build failure.

v4 -> v5:
- Added decriptive comment about the sequence of access of members of structure
  vfio_device_migration_info to be followed based on Alex's suggestion
- Updated get dirty pages sequence.
- As per Cornelia Huck's suggestion, added callbacks to VFIODeviceOps to
  get_object, save_config and load_config.
- Fixed multiple nit picks.
- Tested live migration with multiple vfio device assigned to a VM.

v3 -> v4:
- Added one more bit for _RESUMING flag to be set explicitly.
- data_offset field is read-only for user space application.
- data_size is read for every iteration before reading data from migration, that
  is removed assumption that data will be till end of migration region.
- If vendor driver supports mappable sparsed region, map those region during
  setup state of save/load, similarly unmap those from cleanup routines.
- Handles race condition that causes data corruption in migration region during
  save device state by adding mutex and serialiaing save_buffer and
  get_dirty_pages routines.
- Skip called get_dirty_pages routine for mapped MMIO region of device.
- Added trace events.
- Splitted into multiple functional patches.

v2 -> v3:
- Removed enum of VFIO device states. Defined VFIO device state with 2 bits.
- Re-structured vfio_device_migration_info to keep it minimal and defined action
  on read and write access on its members.

v1 -> v2:
- Defined MIGRATION region type and sub-type which should be used with region
  type capability.
- Re-structured vfio_device_migration_info. This structure will be placed at 0th
  offset of migration region.
- Replaced ioctl with read/write for trapped part of migration region.
- Added both type of access support, trapped or mmapped, for data section of the
  region.
- Moved PCI device functions to pci file.
- Added iteration to get dirty page bitmap until bitmap for all requested pages
  are copied.

Thanks,
Kirti



Kirti Wankhede (17):
  vfio: Add function to unmap VFIO region
  vfio: Add vfio_get_object callback to VFIODeviceOps
  vfio: Add save and load functions for VFIO PCI devices
  vfio: Add migration region initialization and finalize function
  vfio: Add VM state change handler to know state of VM
  vfio: Add migration state change notifier
  vfio: Register SaveVMHandlers for VFIO device
  vfio: Add save state functions to SaveVMHandlers
  vfio: Add load state functions to SaveVMHandlers
  memory: Set DIRTY_MEMORY_MIGRATION when IOMMU is enabled
  vfio: Get migration capability flags for container
  vfio: Add function to start and stop dirty pages tracking
  vfio: create mapped iova list when vIOMMU is enabled
  vfio: Add vfio_listener_log_sync to mark dirty pages
  vfio: Add ioctl to get dirty pages bitmap during dma unmap.
  vfio: Make vfio-pci device migration capable
  qapi: Add VFIO devices migration stats in Migration stats

 hw/vfio/common.c              | 426 ++++++++++++++++++--
 hw/vfio/meson.build           |   1 +
 hw/vfio/migration.c           | 892 ++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/pci.c                 | 170 +++++++-
 hw/vfio/pci.h                 |   2 +-
 hw/vfio/trace-events          |  20 +
 include/hw/vfio/vfio-common.h |  30 ++
 include/qemu/vfio-helpers.h   |   3 +
 migration/migration.c         |  14 +
 monitor/hmp-cmds.c            |   6 +
 qapi/migration.json           |  17 +
 softmmu/memory.c              |   2 +-
 12 files changed, 1539 insertions(+), 44 deletions(-)
 create mode 100644 hw/vfio/migration.c

Comments

Zenghui Yu Sept. 23, 2020, 7:06 a.m. UTC | #1
On 2020/9/23 7:24, Kirti Wankhede wrote:

> Patch 4-9:
> - Generic migration functionality for VFIO device.
>   * This patch set adds functionality for PCI devices, but can be
>     extended to other VFIO devices.
>   * Added all the basic functions required for pre-copy, stop-and-copy and
>     resume phases of migration.
>   * Added state change notifier and from that notifier function, VFIO
>     device's state changed is conveyed to VFIO device driver.
>   * During save setup phase and resume/load setup phase, migration region
>     is queried and is used to read/write VFIO device data.
>   * .save_live_pending and .save_live_iterate are implemented to use QEMU's
>     functionality of iteration during pre-copy phase.
>   * In .save_live_complete_precopy, that is in stop-and-copy phase,
>     iteration to read data from VFIO device driver is implemented till pending
>     bytes returned by driver are not zero.

s/are not zero/are zero/ ?

[...]

> Live migration resume path:
>     Incomming migration calls .load_setup for each device
>     (RESTORE_VM, _ACTIVE, STOPPED)

The _RESUMING device state is missed here?

>                         |
>     For each device, .load_state is called for that device section data
>                         |
>     At the end, called .load_cleanup for each device and vCPUs are started.
>                         |
>         (RUNNING, _NONE, _RUNNING)


Thanks,
Zenghui
Alex Williamson Sept. 25, 2020, 8:20 p.m. UTC | #2
On Wed, 23 Sep 2020 04:54:08 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Added migration state change notifier to get notification on migration state
> change. These states are translated to VFIO device state and conveyed to vendor
> driver.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  hw/vfio/migration.c           | 29 +++++++++++++++++++++++++++++
>  hw/vfio/trace-events          |  5 +++--
>  include/hw/vfio/vfio-common.h |  1 +
>  3 files changed, 33 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index a30d628ba963..f650fe9fc3c8 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -199,6 +199,28 @@ static void vfio_vmstate_change(void *opaque, int running, RunState state)
>      }
>  }
>  
> +static void vfio_migration_state_notifier(Notifier *notifier, void *data)
> +{
> +    MigrationState *s = data;
> +    VFIODevice *vbasedev = container_of(notifier, VFIODevice, migration_state);
> +    int ret;
> +
> +    trace_vfio_migration_state_notifier(vbasedev->name,
> +                                        MigrationStatus_str(s->state));
> +
> +    switch (s->state) {
> +    case MIGRATION_STATUS_CANCELLING:
> +    case MIGRATION_STATUS_CANCELLED:
> +    case MIGRATION_STATUS_FAILED:
> +        ret = vfio_migration_set_state(vbasedev,
> +                      ~(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING),
> +                      VFIO_DEVICE_STATE_RUNNING);
> +        if (ret) {
> +            error_report("%s: Failed to set state RUNNING", vbasedev->name);
> +        }

Here again the caller assumes success means the device has entered the
desired state, but as implemented it only means the device is in some
non-error state.

> +    }
> +}
> +
>  static int vfio_migration_init(VFIODevice *vbasedev,
>                                 struct vfio_region_info *info)
>  {
> @@ -221,6 +243,8 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>  
>      vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
>                                                            vbasedev);
> +    vbasedev->migration_state.notify = vfio_migration_state_notifier;
> +    add_migration_state_change_notifier(&vbasedev->migration_state);
>      return ret;
>  }
>  
> @@ -263,6 +287,11 @@ add_blocker:
>  
>  void vfio_migration_finalize(VFIODevice *vbasedev)
>  {
> +
> +    if (vbasedev->migration_state.notify) {
> +        remove_migration_state_change_notifier(&vbasedev->migration_state);
> +    }
> +
>      if (vbasedev->vm_state) {
>          qemu_del_vm_change_state_handler(vbasedev->vm_state);
>      }
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 6524734bf7b4..bcb3fa7314d7 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -149,5 +149,6 @@ vfio_display_edid_write_error(void) ""
>  
>  # migration.c
>  vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
> -vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
> -vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
> +vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d"
> +vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
> +vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 25e3b1a3b90a..49c7c7a0e29a 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -123,6 +123,7 @@ typedef struct VFIODevice {
>      VMChangeStateEntry *vm_state;
>      uint32_t device_state;
>      int vm_running;
> +    Notifier migration_state;

Can this live in VFIOMigration?  Thanks,

Alex

>  } VFIODevice;
>  
>  struct VFIODeviceOps {
Alex Williamson Sept. 25, 2020, 8:20 p.m. UTC | #3
On Wed, 23 Sep 2020 04:54:09 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Define flags to be used as delimeter in migration file stream.
> Added .save_setup and .save_cleanup functions. Mapped & unmapped migration
> region from these functions at source during saving or pre-copy phase.
> Set VFIO device state depending on VM's state. During live migration, VM is
> running when .save_setup is called, _SAVING | _RUNNING state is set for VFIO
> device. During save-restore, VM is paused, _SAVING state is set for VFIO device.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  hw/vfio/migration.c  | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/trace-events |  2 ++
>  2 files changed, 93 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index f650fe9fc3c8..8e8adaa25779 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -8,12 +8,15 @@
>   */
>  
>  #include "qemu/osdep.h"
> +#include "qemu/main-loop.h"
> +#include "qemu/cutils.h"
>  #include <linux/vfio.h>
>  
>  #include "sysemu/runstate.h"
>  #include "hw/vfio/vfio-common.h"
>  #include "cpu.h"
>  #include "migration/migration.h"
> +#include "migration/vmstate.h"
>  #include "migration/qemu-file.h"
>  #include "migration/register.h"
>  #include "migration/blocker.h"
> @@ -25,6 +28,17 @@
>  #include "trace.h"
>  #include "hw/hw.h"
>  
> +/*
> + * Flags used as delimiter:
> + * 0xffffffff => MSB 32-bit all 1s
> + * 0xef10     => emulated (virtual) function IO
> + * 0x0000     => 16-bits reserved for flags
> + */
> +#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
> +#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
> +
>  static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
>                                    off_t off, bool iswrite)
>  {
> @@ -166,6 +180,65 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
>      return 0;
>  }
>  
> +/* ---------------------------------------------------------------------- */
> +
> +static int vfio_save_setup(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    int ret;
> +
> +    trace_vfio_save_setup(vbasedev->name);
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
> +
> +    if (migration->region.mmaps) {
> +        qemu_mutex_lock_iothread();
> +        ret = vfio_region_mmap(&migration->region);
> +        qemu_mutex_unlock_iothread();

Please add a comment identifying why the iothread mutex lock is
necessary here.

> +        if (ret) {
> +            error_report("%s: Failed to mmap VFIO migration region %d: %s",
> +                         vbasedev->name, migration->region.nr,

We don't support multiple migration regions, is it useful to include
the region index here?

> +                         strerror(-ret));
> +            error_report("%s: Falling back to slow path", vbasedev->name);
> +        }
> +    }
> +
> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_MASK,
> +                                   VFIO_DEVICE_STATE_SAVING);
> +    if (ret) {
> +        error_report("%s: Failed to set state SAVING", vbasedev->name);
> +        return ret;
> +    }

Again, doesn't match the function semantics that success only means the
device is in a non-error state, maybe the one that was asked for.

> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);

What's the overall purpose of writing these markers into the migration
stream?  vfio_load_state() doesn't do anything with this other than
validate that the end-of-state immediately follows.  Is this a
placeholder for something in the future?

> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    return 0;
> +}
> +
> +static void vfio_save_cleanup(void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    if (migration->region.mmaps) {
> +        vfio_region_unmap(&migration->region);
> +    }
> +    trace_vfio_save_cleanup(vbasedev->name);
> +}
> +
> +static SaveVMHandlers savevm_vfio_handlers = {
> +    .save_setup = vfio_save_setup,
> +    .save_cleanup = vfio_save_cleanup,
> +};
> +
> +/* ---------------------------------------------------------------------- */
> +
>  static void vfio_vmstate_change(void *opaque, int running, RunState state)
>  {
>      VFIODevice *vbasedev = opaque;
> @@ -225,6 +298,8 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>                                 struct vfio_region_info *info)
>  {
>      int ret = -EINVAL;
> +    char id[256] = "";
> +    Object *obj;
>  
>      if (!vbasedev->ops->vfio_get_object) {
>          return ret;
> @@ -241,6 +316,22 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>          return ret;
>      }
>  
> +    obj = vbasedev->ops->vfio_get_object(vbasedev);
> +
> +    if (obj) {
> +        DeviceState *dev = DEVICE(obj);
> +        char *oid = vmstate_if_get_id(VMSTATE_IF(dev));
> +
> +        if (oid) {
> +            pstrcpy(id, sizeof(id), oid);
> +            pstrcat(id, sizeof(id), "/");
> +            g_free(oid);
> +        }
> +    }

Here's where vfio_migration_init() starts using vfio_get_object() as I
referenced back on patch 04.  We might as well get the object before
calling vfio_migration_region_init() and then pass the object.  The
conditional branch to handle obj is strange here too, it's fatal if
vfio_migration_region_init() doesn't find an object, why do we handle
it as optional here?  Also, what is this doing?  Comments would be
nice...  Thanks,

Alex

> +    pstrcat(id, sizeof(id), "vfio");
> +
> +    register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1, &savevm_vfio_handlers,
> +                         vbasedev);
>      vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
>                                                            vbasedev);
>      vbasedev->migration_state.notify = vfio_migration_state_notifier;
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index bcb3fa7314d7..982d8dccb219 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -152,3 +152,5 @@ vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
>  vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d"
>  vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
>  vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
> +vfio_save_setup(const char *name) " (%s)"
> +vfio_save_cleanup(const char *name) " (%s)"
Dr. David Alan Gilbert Sept. 29, 2020, 10:19 a.m. UTC | #4
* Philippe Mathieu-Daudé (philmd@redhat.com) wrote:
> On 9/23/20 1:24 AM, Kirti Wankhede wrote:
> > Define flags to be used as delimeter in migration file stream.
> 
> Typo "delimiter".
> 
> > Added .save_setup and .save_cleanup functions. Mapped & unmapped migration
> > region from these functions at source during saving or pre-copy phase.
> > Set VFIO device state depending on VM's state. During live migration, VM is
> > running when .save_setup is called, _SAVING | _RUNNING state is set for VFIO
> > device. During save-restore, VM is paused, _SAVING state is set for VFIO device.
> > 
> > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > Reviewed-by: Neo Jia <cjia@nvidia.com>
> > ---
> >  hw/vfio/migration.c  | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  hw/vfio/trace-events |  2 ++
> >  2 files changed, 93 insertions(+)
> > 
> > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > index f650fe9fc3c8..8e8adaa25779 100644
> > --- a/hw/vfio/migration.c
> > +++ b/hw/vfio/migration.c
> > @@ -8,12 +8,15 @@
> >   */
> >  
> >  #include "qemu/osdep.h"
> > +#include "qemu/main-loop.h"
> > +#include "qemu/cutils.h"
> >  #include <linux/vfio.h>
> >  
> >  #include "sysemu/runstate.h"
> >  #include "hw/vfio/vfio-common.h"
> >  #include "cpu.h"
> >  #include "migration/migration.h"
> > +#include "migration/vmstate.h"
> >  #include "migration/qemu-file.h"
> >  #include "migration/register.h"
> >  #include "migration/blocker.h"
> > @@ -25,6 +28,17 @@
> >  #include "trace.h"
> >  #include "hw/hw.h"
> >  
> > +/*
> > + * Flags used as delimiter:
> > + * 0xffffffff => MSB 32-bit all 1s
> > + * 0xef10     => emulated (virtual) function IO
> > + * 0x0000     => 16-bits reserved for flags
> > + */
> > +#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
> > +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
> > +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
> > +#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
> > +
> >  static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
> >                                    off_t off, bool iswrite)
> >  {
> > @@ -166,6 +180,65 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
> >      return 0;
> >  }
> >  
> > +/* ---------------------------------------------------------------------- */
> > +
> > +static int vfio_save_setup(QEMUFile *f, void *opaque)
> > +{
> > +    VFIODevice *vbasedev = opaque;
> > +    VFIOMigration *migration = vbasedev->migration;
> > +    int ret;
> > +
> > +    trace_vfio_save_setup(vbasedev->name);
> > +
> > +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
> > +
> > +    if (migration->region.mmaps) {
> > +        qemu_mutex_lock_iothread();
> > +        ret = vfio_region_mmap(&migration->region);
> > +        qemu_mutex_unlock_iothread();
> > +        if (ret) {
> > +            error_report("%s: Failed to mmap VFIO migration region %d: %s",
> > +                         vbasedev->name, migration->region.nr,
> > +                         strerror(-ret));
> > +            error_report("%s: Falling back to slow path", vbasedev->name);
> > +        }
> > +    }
> > +
> > +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_MASK,
> > +                                   VFIO_DEVICE_STATE_SAVING);
> > +    if (ret) {
> > +        error_report("%s: Failed to set state SAVING", vbasedev->name);
> > +        return ret;
> > +    }
> > +
> > +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> > +
> > +    ret = qemu_file_get_error(f);
> > +    if (ret) {
> > +        return ret;
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +static void vfio_save_cleanup(void *opaque)
> > +{
> > +    VFIODevice *vbasedev = opaque;
> > +    VFIOMigration *migration = vbasedev->migration;
> > +
> > +    if (migration->region.mmaps) {
> > +        vfio_region_unmap(&migration->region);
> > +    }
> > +    trace_vfio_save_cleanup(vbasedev->name);
> > +}
> > +
> > +static SaveVMHandlers savevm_vfio_handlers = {
> > +    .save_setup = vfio_save_setup,
> > +    .save_cleanup = vfio_save_cleanup,
> > +};
> > +
> > +/* ---------------------------------------------------------------------- */
> > +
> >  static void vfio_vmstate_change(void *opaque, int running, RunState state)
> >  {
> >      VFIODevice *vbasedev = opaque;
> > @@ -225,6 +298,8 @@ static int vfio_migration_init(VFIODevice *vbasedev,
> >                                 struct vfio_region_info *info)
> >  {
> >      int ret = -EINVAL;
> > +    char id[256] = "";
> > +    Object *obj;
> >  
> >      if (!vbasedev->ops->vfio_get_object) {
> >          return ret;
> > @@ -241,6 +316,22 @@ static int vfio_migration_init(VFIODevice *vbasedev,
> >          return ret;
> >      }
> >  
> > +    obj = vbasedev->ops->vfio_get_object(vbasedev);
> > +
> > +    if (obj) {
> > +        DeviceState *dev = DEVICE(obj);
> > +        char *oid = vmstate_if_get_id(VMSTATE_IF(dev));
> > +
> > +        if (oid) {
> > +            pstrcpy(id, sizeof(id), oid);
> > +            pstrcat(id, sizeof(id), "/");
> > +            g_free(oid);
> > +        }
> > +    }
> > +    pstrcat(id, sizeof(id), "vfio");
> 
> Alternatively (easier to review, matter of taste):
> 
>  g_autofree char *path = NULL;
> 
>  if (oid) {
>    path = g_strdup_printf("%s/vfio",
>                           vmstate_if_get_id(VMSTATE_IF(obj)););
>  } else {
>    path = g_strdup("vfio");
>  }
>  strpadcpy(id, sizeof(id), path, '\0');

Maybe, although it's a straight copy of the magic in unregister_savevm.

Dave

> > +
> > +    register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1, &savevm_vfio_handlers,
> > +                         vbasedev);
> >      vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
> >                                                            vbasedev);
> >      vbasedev->migration_state.notify = vfio_migration_state_notifier;
> > diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> > index bcb3fa7314d7..982d8dccb219 100644
> > --- a/hw/vfio/trace-events
> > +++ b/hw/vfio/trace-events
> > @@ -152,3 +152,5 @@ vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
> >  vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d"
> >  vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
> >  vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
> > +vfio_save_setup(const char *name) " (%s)"
> > +vfio_save_cleanup(const char *name) " (%s)"
> > 
>
Dr. David Alan Gilbert Sept. 29, 2020, 10:40 a.m. UTC | #5
* Kirti Wankhede (kwankhede@nvidia.com) wrote:
> Added amount of bytes transferred to the target VM by all VFIO devices
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> ---
> 
> Note: Comments from v25 for this patch are not addressed yet.
> https://www.mail-archive.com/qemu-devel@nongnu.org/msg715620.html
> 
> Alex, need more pointer on documentation part raised Markus Armbruster.

I think I'm OK with this from the migration side, except for the minor
5.2's below, so

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Markus is right that we do have lots more information that falls out of
the migration stats, but I'm not sure what more you have to collect.

Dave

> 
>  hw/vfio/common.c            | 20 ++++++++++++++++++++
>  hw/vfio/migration.c         | 10 ++++++++++
>  include/qemu/vfio-helpers.h |  3 +++
>  migration/migration.c       | 14 ++++++++++++++
>  monitor/hmp-cmds.c          |  6 ++++++
>  qapi/migration.json         | 17 +++++++++++++++++
>  6 files changed, 70 insertions(+)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 7eeaa368187a..286cdaac8674 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -39,6 +39,7 @@
>  #include "trace.h"
>  #include "qapi/error.h"
>  #include "migration/migration.h"
> +#include "qemu/vfio-helpers.h"
>  
>  VFIOGroupList vfio_group_list =
>      QLIST_HEAD_INITIALIZER(vfio_group_list);
> @@ -292,6 +293,25 @@ const MemoryRegionOps vfio_region_ops = {
>   * Device state interfaces
>   */
>  
> +bool vfio_mig_active(void)
> +{
> +    VFIOGroup *group;
> +    VFIODevice *vbasedev;
> +
> +    if (QLIST_EMPTY(&vfio_group_list)) {
> +        return false;
> +    }
> +
> +    QLIST_FOREACH(group, &vfio_group_list, next) {
> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> +            if (vbasedev->migration_blocker) {
> +                return false;
> +            }
> +        }
> +    }
> +    return true;
> +}
> +
>  static bool vfio_devices_all_stopped_and_saving(VFIOContainer *container)
>  {
>      VFIOGroup *group;
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 822b68b4e015..c4226fa8b183 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -28,6 +28,7 @@
>  #include "pci.h"
>  #include "trace.h"
>  #include "hw/hw.h"
> +#include "qemu/vfio-helpers.h"
>  
>  /*
>   * Flags used as delimiter:
> @@ -40,6 +41,8 @@
>  #define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
>  #define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
>  
> +static int64_t bytes_transferred;
> +
>  static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
>                                    off_t off, bool iswrite)
>  {
> @@ -289,6 +292,7 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size)
>          *size = data_size;
>      }
>  
> +    bytes_transferred += data_size;
>      return ret;
>  }
>  
> @@ -770,6 +774,7 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
>          }
>  
>          vfio_set_dirty_page_tracking(vbasedev, false);
> +        bytes_transferred = 0;
>      }
>  }
>  
> @@ -820,6 +825,11 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>  
>  /* ---------------------------------------------------------------------- */
>  
> +int64_t vfio_mig_bytes_transferred(void)
> +{
> +    return bytes_transferred;
> +}
> +
>  int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
>  {
>      VFIOContainer *container = vbasedev->group->container;
> diff --git a/include/qemu/vfio-helpers.h b/include/qemu/vfio-helpers.h
> index 1f057c2b9e40..26a7df0767b1 100644
> --- a/include/qemu/vfio-helpers.h
> +++ b/include/qemu/vfio-helpers.h
> @@ -29,4 +29,7 @@ void qemu_vfio_pci_unmap_bar(QEMUVFIOState *s, int index, void *bar,
>  int qemu_vfio_pci_init_irq(QEMUVFIOState *s, EventNotifier *e,
>                             int irq_type, Error **errp);
>  
> +bool vfio_mig_active(void);
> +int64_t vfio_mig_bytes_transferred(void);
> +
>  #endif
> diff --git a/migration/migration.c b/migration/migration.c
> index 58a5452471f9..b204bb1f6cd9 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -56,6 +56,7 @@
>  #include "net/announce.h"
>  #include "qemu/queue.h"
>  #include "multifd.h"
> +#include "qemu/vfio-helpers.h"
>  
>  #define MAX_THROTTLE  (32 << 20)      /* Migration transfer speed throttling */
>  
> @@ -996,6 +997,17 @@ static void populate_disk_info(MigrationInfo *info)
>      }
>  }
>  
> +static void populate_vfio_info(MigrationInfo *info)
> +{
> +#ifdef CONFIG_LINUX
> +    if (vfio_mig_active()) {
> +        info->has_vfio = true;
> +        info->vfio = g_malloc0(sizeof(*info->vfio));
> +        info->vfio->transferred = vfio_mig_bytes_transferred();
> +    }
> +#endif
> +}
> +
>  static void fill_source_migration_info(MigrationInfo *info)
>  {
>      MigrationState *s = migrate_get_current();
> @@ -1020,6 +1032,7 @@ static void fill_source_migration_info(MigrationInfo *info)
>          populate_time_info(info, s);
>          populate_ram_info(info, s);
>          populate_disk_info(info);
> +        populate_vfio_info(info);
>          break;
>      case MIGRATION_STATUS_COLO:
>          info->has_status = true;
> @@ -1028,6 +1041,7 @@ static void fill_source_migration_info(MigrationInfo *info)
>      case MIGRATION_STATUS_COMPLETED:
>          populate_time_info(info, s);
>          populate_ram_info(info, s);
> +        populate_vfio_info(info);
>          break;
>      case MIGRATION_STATUS_FAILED:
>          info->has_status = true;
> diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
> index 7711726fd222..40d60d6a6651 100644
> --- a/monitor/hmp-cmds.c
> +++ b/monitor/hmp-cmds.c
> @@ -355,6 +355,12 @@ void hmp_info_migrate(Monitor *mon, const QDict *qdict)
>          }
>          monitor_printf(mon, "]\n");
>      }
> +
> +    if (info->has_vfio) {
> +        monitor_printf(mon, "vfio device transferred: %" PRIu64 " kbytes\n",
> +                       info->vfio->transferred >> 10);
> +    }
> +
>      qapi_free_MigrationInfo(info);
>  }
>  
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 675f70bb6734..3535977123d3 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -147,6 +147,18 @@
>              'active', 'postcopy-active', 'postcopy-paused',
>              'postcopy-recover', 'completed', 'failed', 'colo',
>              'pre-switchover', 'device', 'wait-unplug' ] }
> +##
> +# @VfioStats:
> +#
> +# Detailed VFIO devices migration statistics
> +#
> +# @transferred: amount of bytes transferred to the target VM by VFIO devices
> +#
> +# Since: 5.1
> +#
> +##
> +{ 'struct': 'VfioStats',
> +  'data': {'transferred': 'int' } }
>  
>  ##
>  # @MigrationInfo:
> @@ -208,11 +220,16 @@
>  #
>  # @socket-address: Only used for tcp, to know what the real port is (Since 4.0)
>  #
> +# @vfio: @VfioStats containing detailed VFIO devices migration statistics,
> +#        only returned if VFIO device is present, migration is supported by all
> +#         VFIO devices and status is 'active' or 'completed' (since 5.1)
> +#
>  # Since: 0.14.0
>  ##
>  { 'struct': 'MigrationInfo',
>    'data': {'*status': 'MigrationStatus', '*ram': 'MigrationStats',
>             '*disk': 'MigrationStats',
> +           '*vfio': 'VfioStats',
>             '*xbzrle-cache': 'XBZRLECacheStats',
>             '*total-time': 'int',
>             '*expected-downtime': 'int',
> -- 
> 2.7.0
>
Kirti Wankhede Oct. 17, 2020, 8:35 p.m. UTC | #6
On 9/26/2020 1:50 AM, Alex Williamson wrote:
> On Wed, 23 Sep 2020 04:54:08 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> Added migration state change notifier to get notification on migration state
>> change. These states are translated to VFIO device state and conveyed to vendor
>> driver.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>> ---
>>   hw/vfio/migration.c           | 29 +++++++++++++++++++++++++++++
>>   hw/vfio/trace-events          |  5 +++--
>>   include/hw/vfio/vfio-common.h |  1 +
>>   3 files changed, 33 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index a30d628ba963..f650fe9fc3c8 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -199,6 +199,28 @@ static void vfio_vmstate_change(void *opaque, int running, RunState state)
>>       }
>>   }
>>   
>> +static void vfio_migration_state_notifier(Notifier *notifier, void *data)
>> +{
>> +    MigrationState *s = data;
>> +    VFIODevice *vbasedev = container_of(notifier, VFIODevice, migration_state);
>> +    int ret;
>> +
>> +    trace_vfio_migration_state_notifier(vbasedev->name,
>> +                                        MigrationStatus_str(s->state));
>> +
>> +    switch (s->state) {
>> +    case MIGRATION_STATUS_CANCELLING:
>> +    case MIGRATION_STATUS_CANCELLED:
>> +    case MIGRATION_STATUS_FAILED:
>> +        ret = vfio_migration_set_state(vbasedev,
>> +                      ~(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING),
>> +                      VFIO_DEVICE_STATE_RUNNING);
>> +        if (ret) {
>> +            error_report("%s: Failed to set state RUNNING", vbasedev->name);
>> +        }
> 
> Here again the caller assumes success means the device has entered the
> desired state, but as implemented it only means the device is in some
> non-error state.
> 
>> +    }
>> +}
>> +
>>   static int vfio_migration_init(VFIODevice *vbasedev,
>>                                  struct vfio_region_info *info)
>>   {
>> @@ -221,6 +243,8 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>>   
>>       vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
>>                                                             vbasedev);
>> +    vbasedev->migration_state.notify = vfio_migration_state_notifier;
>> +    add_migration_state_change_notifier(&vbasedev->migration_state);
>>       return ret;
>>   }
>>   
>> @@ -263,6 +287,11 @@ add_blocker:
>>   
>>   void vfio_migration_finalize(VFIODevice *vbasedev)
>>   {
>> +
>> +    if (vbasedev->migration_state.notify) {
>> +        remove_migration_state_change_notifier(&vbasedev->migration_state);
>> +    }
>> +
>>       if (vbasedev->vm_state) {
>>           qemu_del_vm_change_state_handler(vbasedev->vm_state);
>>       }
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index 6524734bf7b4..bcb3fa7314d7 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -149,5 +149,6 @@ vfio_display_edid_write_error(void) ""
>>   
>>   # migration.c
>>   vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
>> -vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
>> -vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
>> +vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d"
>> +vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
>> +vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 25e3b1a3b90a..49c7c7a0e29a 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -123,6 +123,7 @@ typedef struct VFIODevice {
>>       VMChangeStateEntry *vm_state;
>>       uint32_t device_state;
>>       int vm_running;
>> +    Notifier migration_state;
> 
> Can this live in VFIOMigration?  Thanks,
> 

No, callback vfio_migration_state_notifier() has notifier argument and 
to reach its corresponding device structure as below, its should be in 
VFIODevice.

VFIODevice *vbasedev = container_of(notifier, VFIODevice, migration_state);

Thanks,
Kirti

> Alex
> 
>>   } VFIODevice;
>>   
>>   struct VFIODeviceOps {
>
Kirti Wankhede Oct. 17, 2020, 8:36 p.m. UTC | #7
On 9/29/2020 3:49 PM, Dr. David Alan Gilbert wrote:
> * Philippe Mathieu-Daudé (philmd@redhat.com) wrote:

>> On 9/23/20 1:24 AM, Kirti Wankhede wrote:

>>> Define flags to be used as delimeter in migration file stream.

>>

>> Typo "delimiter".

>>

>>> Added .save_setup and .save_cleanup functions. Mapped & unmapped migration

>>> region from these functions at source during saving or pre-copy phase.

>>> Set VFIO device state depending on VM's state. During live migration, VM is

>>> running when .save_setup is called, _SAVING | _RUNNING state is set for VFIO

>>> device. During save-restore, VM is paused, _SAVING state is set for VFIO device.

>>>

>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>

>>> Reviewed-by: Neo Jia <cjia@nvidia.com>

>>> ---

>>>   hw/vfio/migration.c  | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++++

>>>   hw/vfio/trace-events |  2 ++

>>>   2 files changed, 93 insertions(+)

>>>

>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c

>>> index f650fe9fc3c8..8e8adaa25779 100644

>>> --- a/hw/vfio/migration.c

>>> +++ b/hw/vfio/migration.c

>>> @@ -8,12 +8,15 @@

>>>    */

>>>   

>>>   #include "qemu/osdep.h"

>>> +#include "qemu/main-loop.h"

>>> +#include "qemu/cutils.h"

>>>   #include <linux/vfio.h>

>>>   

>>>   #include "sysemu/runstate.h"

>>>   #include "hw/vfio/vfio-common.h"

>>>   #include "cpu.h"

>>>   #include "migration/migration.h"

>>> +#include "migration/vmstate.h"

>>>   #include "migration/qemu-file.h"

>>>   #include "migration/register.h"

>>>   #include "migration/blocker.h"

>>> @@ -25,6 +28,17 @@

>>>   #include "trace.h"

>>>   #include "hw/hw.h"

>>>   

>>> +/*

>>> + * Flags used as delimiter:

>>> + * 0xffffffff => MSB 32-bit all 1s

>>> + * 0xef10     => emulated (virtual) function IO

>>> + * 0x0000     => 16-bits reserved for flags

>>> + */

>>> +#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)

>>> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)

>>> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)

>>> +#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)

>>> +

>>>   static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,

>>>                                     off_t off, bool iswrite)

>>>   {

>>> @@ -166,6 +180,65 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,

>>>       return 0;

>>>   }

>>>   

>>> +/* ---------------------------------------------------------------------- */

>>> +

>>> +static int vfio_save_setup(QEMUFile *f, void *opaque)

>>> +{

>>> +    VFIODevice *vbasedev = opaque;

>>> +    VFIOMigration *migration = vbasedev->migration;

>>> +    int ret;

>>> +

>>> +    trace_vfio_save_setup(vbasedev->name);

>>> +

>>> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);

>>> +

>>> +    if (migration->region.mmaps) {

>>> +        qemu_mutex_lock_iothread();

>>> +        ret = vfio_region_mmap(&migration->region);

>>> +        qemu_mutex_unlock_iothread();

>>> +        if (ret) {

>>> +            error_report("%s: Failed to mmap VFIO migration region %d: %s",

>>> +                         vbasedev->name, migration->region.nr,

>>> +                         strerror(-ret));

>>> +            error_report("%s: Falling back to slow path", vbasedev->name);

>>> +        }

>>> +    }

>>> +

>>> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_MASK,

>>> +                                   VFIO_DEVICE_STATE_SAVING);

>>> +    if (ret) {

>>> +        error_report("%s: Failed to set state SAVING", vbasedev->name);

>>> +        return ret;

>>> +    }

>>> +

>>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);

>>> +

>>> +    ret = qemu_file_get_error(f);

>>> +    if (ret) {

>>> +        return ret;

>>> +    }

>>> +

>>> +    return 0;

>>> +}

>>> +

>>> +static void vfio_save_cleanup(void *opaque)

>>> +{

>>> +    VFIODevice *vbasedev = opaque;

>>> +    VFIOMigration *migration = vbasedev->migration;

>>> +

>>> +    if (migration->region.mmaps) {

>>> +        vfio_region_unmap(&migration->region);

>>> +    }

>>> +    trace_vfio_save_cleanup(vbasedev->name);

>>> +}

>>> +

>>> +static SaveVMHandlers savevm_vfio_handlers = {

>>> +    .save_setup = vfio_save_setup,

>>> +    .save_cleanup = vfio_save_cleanup,

>>> +};

>>> +

>>> +/* ---------------------------------------------------------------------- */

>>> +

>>>   static void vfio_vmstate_change(void *opaque, int running, RunState state)

>>>   {

>>>       VFIODevice *vbasedev = opaque;

>>> @@ -225,6 +298,8 @@ static int vfio_migration_init(VFIODevice *vbasedev,

>>>                                  struct vfio_region_info *info)

>>>   {

>>>       int ret = -EINVAL;

>>> +    char id[256] = "";

>>> +    Object *obj;

>>>   

>>>       if (!vbasedev->ops->vfio_get_object) {

>>>           return ret;

>>> @@ -241,6 +316,22 @@ static int vfio_migration_init(VFIODevice *vbasedev,

>>>           return ret;

>>>       }

>>>   

>>> +    obj = vbasedev->ops->vfio_get_object(vbasedev);

>>> +

>>> +    if (obj) {

>>> +        DeviceState *dev = DEVICE(obj);

>>> +        char *oid = vmstate_if_get_id(VMSTATE_IF(dev));

>>> +

>>> +        if (oid) {

>>> +            pstrcpy(id, sizeof(id), oid);

>>> +            pstrcat(id, sizeof(id), "/");

>>> +            g_free(oid);

>>> +        }

>>> +    }

>>> +    pstrcat(id, sizeof(id), "vfio");

>>

>> Alternatively (easier to review, matter of taste):

>>

>>   g_autofree char *path = NULL;

>>

>>   if (oid) {

>>     path = g_strdup_printf("%s/vfio",

>>                            vmstate_if_get_id(VMSTATE_IF(obj)););

>>   } else {

>>     path = g_strdup("vfio");

>>   }

>>   strpadcpy(id, sizeof(id), path, '\0');

> 

> Maybe, although it's a straight copy of the magic in unregister_savevm.

> 


Ok. Changing it as Philippe suggested.

Thanks,
Kirti

> Dave

> 

>>> +

>>> +    register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1, &savevm_vfio_handlers,

>>> +                         vbasedev);

>>>       vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,

>>>                                                             vbasedev);

>>>       vbasedev->migration_state.notify = vfio_migration_state_notifier;

>>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events

>>> index bcb3fa7314d7..982d8dccb219 100644

>>> --- a/hw/vfio/trace-events

>>> +++ b/hw/vfio/trace-events

>>> @@ -152,3 +152,5 @@ vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"

>>>   vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d"

>>>   vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"

>>>   vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"

>>> +vfio_save_setup(const char *name) " (%s)"

>>> +vfio_save_cleanup(const char *name) " (%s)"

>>>

>>
Kirti Wankhede Oct. 18, 2020, 5:40 p.m. UTC | #8
On 9/26/2020 1:50 AM, Alex Williamson wrote:
> On Wed, 23 Sep 2020 04:54:09 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> Define flags to be used as delimeter in migration file stream.
>> Added .save_setup and .save_cleanup functions. Mapped & unmapped migration
>> region from these functions at source during saving or pre-copy phase.
>> Set VFIO device state depending on VM's state. During live migration, VM is
>> running when .save_setup is called, _SAVING | _RUNNING state is set for VFIO
>> device. During save-restore, VM is paused, _SAVING state is set for VFIO device.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>   hw/vfio/migration.c  | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>   hw/vfio/trace-events |  2 ++
>>   2 files changed, 93 insertions(+)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index f650fe9fc3c8..8e8adaa25779 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -8,12 +8,15 @@
>>    */
>>   
>>   #include "qemu/osdep.h"
>> +#include "qemu/main-loop.h"
>> +#include "qemu/cutils.h"
>>   #include <linux/vfio.h>
>>   
>>   #include "sysemu/runstate.h"
>>   #include "hw/vfio/vfio-common.h"
>>   #include "cpu.h"
>>   #include "migration/migration.h"
>> +#include "migration/vmstate.h"
>>   #include "migration/qemu-file.h"
>>   #include "migration/register.h"
>>   #include "migration/blocker.h"
>> @@ -25,6 +28,17 @@
>>   #include "trace.h"
>>   #include "hw/hw.h"
>>   
>> +/*
>> + * Flags used as delimiter:
>> + * 0xffffffff => MSB 32-bit all 1s
>> + * 0xef10     => emulated (virtual) function IO
>> + * 0x0000     => 16-bits reserved for flags
>> + */
>> +#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
>> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
>> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
>> +#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
>> +
>>   static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
>>                                     off_t off, bool iswrite)
>>   {
>> @@ -166,6 +180,65 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
>>       return 0;
>>   }
>>   
>> +/* ---------------------------------------------------------------------- */
>> +
>> +static int vfio_save_setup(QEMUFile *f, void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    int ret;
>> +
>> +    trace_vfio_save_setup(vbasedev->name);
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
>> +
>> +    if (migration->region.mmaps) {
>> +        qemu_mutex_lock_iothread();
>> +        ret = vfio_region_mmap(&migration->region);
>> +        qemu_mutex_unlock_iothread();
> 
> Please add a comment identifying why the iothread mutex lock is
> necessary here.
> 
>> +        if (ret) {
>> +            error_report("%s: Failed to mmap VFIO migration region %d: %s",
>> +                         vbasedev->name, migration->region.nr,
> 
> We don't support multiple migration regions, is it useful to include
> the region index here?
> 

Ok. Removing region.nr


>> +                         strerror(-ret));
>> +            error_report("%s: Falling back to slow path", vbasedev->name);
>> +        }
>> +    }
>> +
>> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_MASK,
>> +                                   VFIO_DEVICE_STATE_SAVING);
>> +    if (ret) {
>> +        error_report("%s: Failed to set state SAVING", vbasedev->name);
>> +        return ret;
>> +    }
> 
> Again, doesn't match the function semantics that success only means the
> device is in a non-error state, maybe the one that was asked for.
> 

Fixed in patch 05.

>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> 
> What's the overall purpose of writing these markers into the migration
> stream?  vfio_load_state() doesn't do anything with this other than
> validate that the end-of-state immediately follows.  Is this a
> placeholder for something in the future?
> 

Its not placeholder, it is used in vfio_load_state() to determine upto 
what point to loop to fetch data for each state, otherwise how would we 
know when to stop reading data from stream for that VFIO device.

>> +
>> +    ret = qemu_file_get_error(f);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static void vfio_save_cleanup(void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +
>> +    if (migration->region.mmaps) {
>> +        vfio_region_unmap(&migration->region);
>> +    }
>> +    trace_vfio_save_cleanup(vbasedev->name);
>> +}
>> +
>> +static SaveVMHandlers savevm_vfio_handlers = {
>> +    .save_setup = vfio_save_setup,
>> +    .save_cleanup = vfio_save_cleanup,
>> +};
>> +
>> +/* ---------------------------------------------------------------------- */
>> +
>>   static void vfio_vmstate_change(void *opaque, int running, RunState state)
>>   {
>>       VFIODevice *vbasedev = opaque;
>> @@ -225,6 +298,8 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>>                                  struct vfio_region_info *info)
>>   {
>>       int ret = -EINVAL;
>> +    char id[256] = "";
>> +    Object *obj;
>>   
>>       if (!vbasedev->ops->vfio_get_object) {
>>           return ret;
>> @@ -241,6 +316,22 @@ static int vfio_migration_init(VFIODevice *vbasedev,
>>           return ret;
>>       }
>>   
>> +    obj = vbasedev->ops->vfio_get_object(vbasedev);
>> +
>> +    if (obj) {
>> +        DeviceState *dev = DEVICE(obj);
>> +        char *oid = vmstate_if_get_id(VMSTATE_IF(dev));
>> +
>> +        if (oid) {
>> +            pstrcpy(id, sizeof(id), oid);
>> +            pstrcat(id, sizeof(id), "/");
>> +            g_free(oid);
>> +        }
>> +    }
> 
> Here's where vfio_migration_init() starts using vfio_get_object() as I
> referenced back on patch 04.  We might as well get the object before
> calling vfio_migration_region_init() and then pass the object.  The
> conditional branch to handle obj is strange here too, it's fatal if
> vfio_migration_region_init() doesn't find an object, why do we handle
> it as optional here?  Also, what is this doing?  Comments would be
> nice...  Thanks,
> 

Changing it as I mentioned in other patch reply.

Thanks.

> Alex
> 
>> +    pstrcat(id, sizeof(id), "vfio");
>> +
>> +    register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1, &savevm_vfio_handlers,
>> +                         vbasedev);
>>       vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
>>                                                             vbasedev);
>>       vbasedev->migration_state.notify = vfio_migration_state_notifier;
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index bcb3fa7314d7..982d8dccb219 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -152,3 +152,5 @@ vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
>>   vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d"
>>   vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
>>   vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
>> +vfio_save_setup(const char *name) " (%s)"
>> +vfio_save_cleanup(const char *name) " (%s)"
>
Kirti Wankhede Oct. 18, 2020, 6 p.m. UTC | #9
On 9/26/2020 2:32 AM, Alex Williamson wrote:
> On Wed, 23 Sep 2020 04:54:10 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
>> functions. These functions handles pre-copy and stop-and-copy phase.
>>
>> In _SAVING|_RUNNING device state or pre-copy phase:
>> - read pending_bytes. If pending_bytes > 0, go through below steps.
>> - read data_offset - indicates kernel driver to write data to staging
>>    buffer.
>> - read data_size - amount of data in bytes written by vendor driver in
>>    migration region.
>> - read data_size bytes of data from data_offset in the migration region.
>> - Write data packet to file stream as below:
>> {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
>> VFIO_MIG_FLAG_END_OF_STATE }
>>
>> In _SAVING device state or stop-and-copy phase
>> a. read config space of device and save to migration file stream. This
>>     doesn't need to be from vendor driver. Any other special config state
>>     from driver can be saved as data in following iteration.
>> b. read pending_bytes. If pending_bytes > 0, go through below steps.
>> c. read data_offset - indicates kernel driver to write data to staging
>>     buffer.
>> d. read data_size - amount of data in bytes written by vendor driver in
>>     migration region.
>> e. read data_size bytes of data from data_offset in the migration region.
>> f. Write data packet as below:
>>     {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
>> g. iterate through steps b to f while (pending_bytes > 0)
>> h. Write {VFIO_MIG_FLAG_END_OF_STATE}
>>
>> When data region is mapped, its user's responsibility to read data from
>> data_offset of data_size before moving to next steps.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>> ---
>>   hw/vfio/migration.c           | 273 ++++++++++++++++++++++++++++++++++++++++++
>>   hw/vfio/trace-events          |   6 +
>>   include/hw/vfio/vfio-common.h |   1 +
>>   3 files changed, 280 insertions(+)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 8e8adaa25779..4611bb972228 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -180,6 +180,154 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
>>       return 0;
>>   }
>>   
>> +static void *get_data_section_size(VFIORegion *region, uint64_t data_offset,
>> +                                   uint64_t data_size, uint64_t *size)
>> +{
>> +    void *ptr = NULL;
>> +    uint64_t limit = 0;
>> +    int i;
>> +
>> +    if (!region->mmaps) {
>> +        if (size) {
>> +            *size = data_size;
>> +        }
>> +        return ptr;
>> +    }
>> +
>> +    for (i = 0; i < region->nr_mmaps; i++) {
>> +        VFIOMmap *map = region->mmaps + i;
>> +
>> +        if ((data_offset >= map->offset) &&
>> +            (data_offset < map->offset + map->size)) {
>> +
>> +            /* check if data_offset is within sparse mmap areas */
>> +            ptr = map->mmap + data_offset - map->offset;
>> +            if (size) {
>> +                *size = MIN(data_size, map->offset + map->size - data_offset);
>> +            }
>> +            break;
>> +        } else if ((data_offset < map->offset) &&
>> +                   (!limit || limit > map->offset)) {
>> +            /*
>> +             * data_offset is not within sparse mmap areas, find size of
>> +             * non-mapped area. Check through all list since region->mmaps list
>> +             * is not sorted.
>> +             */
>> +            limit = map->offset;
>> +        }
>> +    }
>> +
>> +    if (!ptr && size) {
>> +        *size = limit ? limit - data_offset : data_size;
> 
> 'limit - data_offset' doesn't take data_size into account, this should
> be MIN(data_size, limit - data_offset).
> 

Done.

>> +    }
>> +    return ptr;
>> +}
>> +
>> +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIORegion *region = &migration->region;
>> +    uint64_t data_offset = 0, data_size = 0, sz;
>> +    int ret;
>> +
>> +    ret = vfio_mig_read(vbasedev, &data_offset, sizeof(data_offset),
>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                             data_offset));
>> +    if (ret < 0) {
>> +        return ret;
>> +    }
>> +
>> +    ret = vfio_mig_read(vbasedev, &data_size, sizeof(data_size),
>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                             data_size));
>> +    if (ret < 0) {
>> +        return ret;
>> +    }
>> +
>> +    trace_vfio_save_buffer(vbasedev->name, data_offset, data_size,
>> +                           migration->pending_bytes);
>> +
>> +    qemu_put_be64(f, data_size);
>> +    sz = data_size;
>> +
>> +    while (sz) {
>> +        void *buf = NULL;
> 
> Unnecessary initialization.
> 
>> +        uint64_t sec_size;
>> +        bool buf_allocated = false;
>> +
>> +        buf = get_data_section_size(region, data_offset, sz, &sec_size);
>> +
>> +        if (!buf) {
>> +            buf = g_try_malloc(sec_size);
>> +            if (!buf) {
>> +                error_report("%s: Error allocating buffer ", __func__);
>> +                return -ENOMEM;
>> +            }
>> +            buf_allocated = true;
>> +
>> +            ret = vfio_mig_read(vbasedev, buf, sec_size,
>> +                                region->fd_offset + data_offset);
>> +            if (ret < 0) {
>> +                g_free(buf);
>> +                return ret;
>> +            }
>> +        }
>> +
>> +        qemu_put_buffer(f, buf, sec_size);
>> +
>> +        if (buf_allocated) {
>> +            g_free(buf);
>> +        }
>> +        sz -= sec_size;
>> +        data_offset += sec_size;
>> +    }
>> +
>> +    ret = qemu_file_get_error(f);
>> +
>> +    if (!ret && size) {
>> +        *size = data_size;
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +static int vfio_update_pending(VFIODevice *vbasedev)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    VFIORegion *region = &migration->region;
>> +    uint64_t pending_bytes = 0;
>> +    int ret;
>> +
>> +    ret = vfio_mig_read(vbasedev, &pending_bytes, sizeof(pending_bytes),
>> +                region->fd_offset + offsetof(struct vfio_device_migration_info,
>> +                                             pending_bytes));
>> +    if (ret < 0) {
>> +        migration->pending_bytes = 0;
>> +        return ret;
>> +    }
>> +
>> +    migration->pending_bytes = pending_bytes;
>> +    trace_vfio_update_pending(vbasedev->name, pending_bytes);
>> +    return 0;
>> +}
>> +
>> +static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_CONFIG_STATE);
>> +
>> +    if (vbasedev->ops && vbasedev->ops->vfio_save_config) {
>> +        vbasedev->ops->vfio_save_config(vbasedev, f);
>> +    }
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> +
>> +    trace_vfio_save_device_config_state(vbasedev->name);
>> +
>> +    return qemu_file_get_error(f);
>> +}
>> +
>>   /* ---------------------------------------------------------------------- */
>>   
>>   static int vfio_save_setup(QEMUFile *f, void *opaque)
>> @@ -232,9 +380,134 @@ static void vfio_save_cleanup(void *opaque)
>>       trace_vfio_save_cleanup(vbasedev->name);
>>   }
>>   
>> +static void vfio_save_pending(QEMUFile *f, void *opaque,
>> +                              uint64_t threshold_size,
>> +                              uint64_t *res_precopy_only,
>> +                              uint64_t *res_compatible,
>> +                              uint64_t *res_postcopy_only)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    int ret;
>> +
>> +    ret = vfio_update_pending(vbasedev);
>> +    if (ret) {
>> +        return;
>> +    }
>> +
>> +    *res_precopy_only += migration->pending_bytes;
>> +
>> +    trace_vfio_save_pending(vbasedev->name, *res_precopy_only,
>> +                            *res_postcopy_only, *res_compatible);
>> +}
>> +
>> +static int vfio_save_iterate(QEMUFile *f, void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    uint64_t data_size;
>> +    int ret;
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
>> +
>> +    if (migration->pending_bytes == 0) {
>> +        ret = vfio_update_pending(vbasedev);
>> +        if (ret) {
>> +            return ret;
>> +        }
>> +
>> +        if (migration->pending_bytes == 0) {
>> +            /* indicates data finished, goto complete phase */
>> +            return 1;
>> +        }
>> +    }
>> +
>> +    ret = vfio_save_buffer(f, vbasedev, &data_size);
>> +
>> +    if (ret) {
>> +        error_report("%s: vfio_save_buffer failed %s", vbasedev->name,
>> +                     strerror(errno));
>> +        return ret;
>> +    }
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> +
>> +    ret = qemu_file_get_error(f);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    trace_vfio_save_iterate(vbasedev->name, data_size);
>> +
>> +    return 0;
>> +}
>> +
>> +static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    uint64_t data_size;
>> +    int ret;
>> +
>> +    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_RUNNING,
>> +                                   VFIO_DEVICE_STATE_SAVING);
>> +    if (ret) {
>> +        error_report("%s: Failed to set state STOP and SAVING",
>> +                     vbasedev->name);
>> +        return ret;
>> +    }
> 
> This also assumes success implies desired state.
> 
>> +
>> +    ret = vfio_save_device_config_state(f, opaque);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    ret = vfio_update_pending(vbasedev);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    while (migration->pending_bytes > 0) {
>> +        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
>> +        ret = vfio_save_buffer(f, vbasedev, &data_size);
>> +        if (ret < 0) {
>> +            error_report("%s: Failed to save buffer", vbasedev->name);
>> +            return ret;
>> +        }
>> +
>> +        if (data_size == 0) {
>> +            break;
>> +        }
>> +
>> +        ret = vfio_update_pending(vbasedev);
>> +        if (ret) {
>> +            return ret;
>> +        }
>> +    }
>> +
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> +
>> +    ret = qemu_file_get_error(f);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_SAVING, 0);
>> +    if (ret) {
>> +        error_report("%s: Failed to set state STOPPED", vbasedev->name);
>> +        return ret;
>> +    }
> 
> And another.  Thanks,
>


Fixing in patch 5.

Thanks,
Kirti


> Alex
> 
>> +
>> +    trace_vfio_save_complete_precopy(vbasedev->name);
>> +    return ret;
>> +}
>> +
>>   static SaveVMHandlers savevm_vfio_handlers = {
>>       .save_setup = vfio_save_setup,
>>       .save_cleanup = vfio_save_cleanup,
>> +    .save_live_pending = vfio_save_pending,
>> +    .save_live_iterate = vfio_save_iterate,
>> +    .save_live_complete_precopy = vfio_save_complete_precopy,
>>   };
>>   
>>   /* ---------------------------------------------------------------------- */
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index 982d8dccb219..118b5547c921 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -154,3 +154,9 @@ vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t
>>   vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
>>   vfio_save_setup(const char *name) " (%s)"
>>   vfio_save_cleanup(const char *name) " (%s)"
>> +vfio_save_buffer(const char *name, uint64_t data_offset, uint64_t data_size, uint64_t pending) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64" pending 0x%"PRIx64
>> +vfio_update_pending(const char *name, uint64_t pending) " (%s) pending 0x%"PRIx64
>> +vfio_save_device_config_state(const char *name) " (%s)"
>> +vfio_save_pending(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t compatible) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" compatible 0x%"PRIx64
>> +vfio_save_iterate(const char *name, int data_size) " (%s) data_size %d"
>> +vfio_save_complete_precopy(const char *name) " (%s)"
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 49c7c7a0e29a..471e444a364c 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -60,6 +60,7 @@ typedef struct VFIORegion {
>>   
>>   typedef struct VFIOMigration {
>>       VFIORegion region;
>> +    uint64_t pending_bytes;
>>   } VFIOMigration;
>>   
>>   typedef struct VFIOAddressSpace {
>
Kirti Wankhede Oct. 18, 2020, 8:52 p.m. UTC | #10
On 9/26/2020 3:25 AM, Alex Williamson wrote:
> On Wed, 23 Sep 2020 04:54:14 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> Call VFIO_IOMMU_DIRTY_PAGES ioctl to start and stop dirty pages tracking
>> for VFIO devices.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>> ---
>>   hw/vfio/migration.c | 36 ++++++++++++++++++++++++++++++++++++
>>   1 file changed, 36 insertions(+)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 4306f6316417..822b68b4e015 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -11,6 +11,7 @@
>>   #include "qemu/main-loop.h"
>>   #include "qemu/cutils.h"
>>   #include <linux/vfio.h>
>> +#include <sys/ioctl.h>
>>   
>>   #include "sysemu/runstate.h"
>>   #include "hw/vfio/vfio-common.h"
>> @@ -355,6 +356,32 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
>>       return qemu_file_get_error(f);
>>   }
>>   
>> +static int vfio_set_dirty_page_tracking(VFIODevice *vbasedev, bool start)
>> +{
>> +    int ret;
>> +    VFIOContainer *container = vbasedev->group->container;
>> +    struct vfio_iommu_type1_dirty_bitmap dirty = {
>> +        .argsz = sizeof(dirty),
>> +    };
>> +
>> +    if (start) {
>> +        if (vbasedev->device_state & VFIO_DEVICE_STATE_SAVING) {
>> +            dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START;
>> +        } else {
>> +            return -EINVAL;
>> +        }
>> +    } else {
>> +            dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP;
>> +    }
>> +
>> +    ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty);
>> +    if (ret) {
>> +        error_report("Failed to set dirty tracking flag 0x%x errno: %d",
>> +                     dirty.flags, errno);
>> +    }
> 
> Maybe doesn't matter in the long run, but do you want to use -errno for
> the return rather than -1 from the ioctl on error?  Thanks,
> 

Makes sense. Changing it.

Thanks,
Kirti

> Alex
> 
>> +    return ret;
>> +}
>> +
>>   /* ---------------------------------------------------------------------- */
>>   
>>   static int vfio_save_setup(QEMUFile *f, void *opaque)
>> @@ -386,6 +413,11 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
>>           return ret;
>>       }
>>   
>> +    ret = vfio_set_dirty_page_tracking(vbasedev, true);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>>       qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>>   
>>       ret = qemu_file_get_error(f);
>> @@ -401,6 +433,8 @@ static void vfio_save_cleanup(void *opaque)
>>       VFIODevice *vbasedev = opaque;
>>       VFIOMigration *migration = vbasedev->migration;
>>   
>> +    vfio_set_dirty_page_tracking(vbasedev, false);
>> +
>>       if (migration->region.mmaps) {
>>           vfio_region_unmap(&migration->region);
>>       }
>> @@ -734,6 +768,8 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
>>           if (ret) {
>>               error_report("%s: Failed to set state RUNNING", vbasedev->name);
>>           }
>> +
>> +        vfio_set_dirty_page_tracking(vbasedev, false);
>>       }
>>   }
>>   
>
Alex Williamson Oct. 19, 2020, 5:57 p.m. UTC | #11
On Sun, 18 Oct 2020 02:05:03 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 9/26/2020 1:50 AM, Alex Williamson wrote:
> > On Wed, 23 Sep 2020 04:54:08 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> Added migration state change notifier to get notification on migration state
> >> change. These states are translated to VFIO device state and conveyed to vendor
> >> driver.
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> >> ---
> >>   hw/vfio/migration.c           | 29 +++++++++++++++++++++++++++++
> >>   hw/vfio/trace-events          |  5 +++--
> >>   include/hw/vfio/vfio-common.h |  1 +
> >>   3 files changed, 33 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> >> index a30d628ba963..f650fe9fc3c8 100644
> >> --- a/hw/vfio/migration.c
> >> +++ b/hw/vfio/migration.c
> >> @@ -199,6 +199,28 @@ static void vfio_vmstate_change(void *opaque, int running, RunState state)
> >>       }
> >>   }
> >>   
> >> +static void vfio_migration_state_notifier(Notifier *notifier, void *data)
> >> +{
> >> +    MigrationState *s = data;
> >> +    VFIODevice *vbasedev = container_of(notifier, VFIODevice, migration_state);
> >> +    int ret;
> >> +
> >> +    trace_vfio_migration_state_notifier(vbasedev->name,
> >> +                                        MigrationStatus_str(s->state));
> >> +
> >> +    switch (s->state) {
> >> +    case MIGRATION_STATUS_CANCELLING:
> >> +    case MIGRATION_STATUS_CANCELLED:
> >> +    case MIGRATION_STATUS_FAILED:
> >> +        ret = vfio_migration_set_state(vbasedev,
> >> +                      ~(VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING),
> >> +                      VFIO_DEVICE_STATE_RUNNING);
> >> +        if (ret) {
> >> +            error_report("%s: Failed to set state RUNNING", vbasedev->name);
> >> +        }  
> > 
> > Here again the caller assumes success means the device has entered the
> > desired state, but as implemented it only means the device is in some
> > non-error state.
> >   
> >> +    }
> >> +}
> >> +
> >>   static int vfio_migration_init(VFIODevice *vbasedev,
> >>                                  struct vfio_region_info *info)
> >>   {
> >> @@ -221,6 +243,8 @@ static int vfio_migration_init(VFIODevice *vbasedev,
> >>   
> >>       vbasedev->vm_state = qemu_add_vm_change_state_handler(vfio_vmstate_change,
> >>                                                             vbasedev);
> >> +    vbasedev->migration_state.notify = vfio_migration_state_notifier;
> >> +    add_migration_state_change_notifier(&vbasedev->migration_state);
> >>       return ret;
> >>   }
> >>   
> >> @@ -263,6 +287,11 @@ add_blocker:
> >>   
> >>   void vfio_migration_finalize(VFIODevice *vbasedev)
> >>   {
> >> +
> >> +    if (vbasedev->migration_state.notify) {
> >> +        remove_migration_state_change_notifier(&vbasedev->migration_state);
> >> +    }
> >> +
> >>       if (vbasedev->vm_state) {
> >>           qemu_del_vm_change_state_handler(vbasedev->vm_state);
> >>       }
> >> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> >> index 6524734bf7b4..bcb3fa7314d7 100644
> >> --- a/hw/vfio/trace-events
> >> +++ b/hw/vfio/trace-events
> >> @@ -149,5 +149,6 @@ vfio_display_edid_write_error(void) ""
> >>   
> >>   # migration.c
> >>   vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
> >> -vfio_migration_set_state(char *name, uint32_t state) " (%s) state %d"
> >> -vfio_vmstate_change(char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
> >> +vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d"
> >> +vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
> >> +vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
> >> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> >> index 25e3b1a3b90a..49c7c7a0e29a 100644
> >> --- a/include/hw/vfio/vfio-common.h
> >> +++ b/include/hw/vfio/vfio-common.h
> >> @@ -123,6 +123,7 @@ typedef struct VFIODevice {
> >>       VMChangeStateEntry *vm_state;
> >>       uint32_t device_state;
> >>       int vm_running;
> >> +    Notifier migration_state;  
> > 
> > Can this live in VFIOMigration?  Thanks,
> >   
> 
> No, callback vfio_migration_state_notifier() has notifier argument and 
> to reach its corresponding device structure as below, its should be in 
> VFIODevice.
> 
> VFIODevice *vbasedev = container_of(notifier, VFIODevice, migration_state);

An alternative would be to place migration_state within VFIOMigration,
along with a pointer back to vbasedev (like we do in VFIORegion) then
the notifier could use container_of to get the VFIOMigration structure,
from which we could get to the VFIODevice via the vbasedev pointer.
This would better compartmentalize the migration related data into a
single structure.  Thanks,

Alex
Cornelia Huck Oct. 20, 2020, 10:55 a.m. UTC | #12
On Mon, 19 Oct 2020 11:57:47 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Sun, 18 Oct 2020 02:05:03 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> > On 9/26/2020 1:50 AM, Alex Williamson wrote:  
> > > On Wed, 23 Sep 2020 04:54:08 +0530
> > > Kirti Wankhede <kwankhede@nvidia.com> wrote:

> > >> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> > >> index 25e3b1a3b90a..49c7c7a0e29a 100644
> > >> --- a/include/hw/vfio/vfio-common.h
> > >> +++ b/include/hw/vfio/vfio-common.h
> > >> @@ -123,6 +123,7 @@ typedef struct VFIODevice {
> > >>       VMChangeStateEntry *vm_state;
> > >>       uint32_t device_state;
> > >>       int vm_running;
> > >> +    Notifier migration_state;    
> > > 
> > > Can this live in VFIOMigration?  Thanks,
> > >     
> > 
> > No, callback vfio_migration_state_notifier() has notifier argument and 
> > to reach its corresponding device structure as below, its should be in 
> > VFIODevice.
> > 
> > VFIODevice *vbasedev = container_of(notifier, VFIODevice, migration_state);  
> 
> An alternative would be to place migration_state within VFIOMigration,
> along with a pointer back to vbasedev (like we do in VFIORegion) then
> the notifier could use container_of to get the VFIOMigration structure,
> from which we could get to the VFIODevice via the vbasedev pointer.
> This would better compartmentalize the migration related data into a
> single structure.  Thanks,
> 
> Alex

+1, I think having everything in VFIOMigration would be nicer.
Cornelia Huck Oct. 20, 2020, 3:51 p.m. UTC | #13
On Mon, 19 Oct 2020 02:25:28 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 9/25/2020 5:23 PM, Cornelia Huck wrote:
> > On Wed, 23 Sep 2020 04:54:09 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> Define flags to be used as delimeter in migration file stream.
> >> Added .save_setup and .save_cleanup functions. Mapped & unmapped migration
> >> region from these functions at source during saving or pre-copy phase.
> >> Set VFIO device state depending on VM's state. During live migration, VM is
> >> running when .save_setup is called, _SAVING | _RUNNING state is set for VFIO
> >> device. During save-restore, VM is paused, _SAVING state is set for VFIO device.
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >> ---
> >>   hw/vfio/migration.c  | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>   hw/vfio/trace-events |  2 ++
> >>   2 files changed, 93 insertions(+)
> >>  
> > 
> > (...)
> >   
> >> +/*
> >> + * Flags used as delimiter:
> >> + * 0xffffffff => MSB 32-bit all 1s
> >> + * 0xef10     => emulated (virtual) function IO  
> > 
> > Where is this value coming from?
> >   
> 
> Delimiter flags should be unique and this is a magic number that 
> represents (e)mulated (f)unction (10) representing IO.
> 
> >> + * 0x0000     => 16-bits reserved for flags
> >> + */
> >> +#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
> >> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
> >> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
> >> +#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)  
> > 
> > I think we need some more documentation what these values mean and how
> > they are used. From reading ahead a bit, it seems there is always
> > supposed to be a pair of DEV_*_STATE and END_OF_STATE framing some kind
> > of data?
> >   
> 
> Adding comment as below, hope it helps.
> 
> /*
>   * Flags used as delimiter for VFIO devices should be unique in 
> migration stream

Maybe

"Flags to be used as unique delimiters for VFIO devices in the
migration stream" ?

>   * These flags are composed as:
>   * 0xffffffff => MSB 32-bit all 1s
>   * 0xef10     => Magic ID, represents emulated (virtual) function IO
>   * 0x0000     => 16-bits reserved for flags
>   *
>   * Flags _DEV_CONFIG_STATE, _DEV_SETUP_STATE and _DEV_DATA_STATE marks 
> start of
>   * respective states in migration stream.
>   * FLAG _END_OF_STATE indicates end of current state, state could be any
>   * of above states.
>   */

"The beginning of state information is marked by _DEV_CONFIG_STATE,
_DEV_SETUP_STATE, or _DEV_DATA_STATE, respectively. The end of a
certain state information is marked by _END_OF_STATE." ?

> 
> Thanks,
> Kirti
>