Message ID | 20220221114043.2030-1-shameerali.kolothum.thodi@huawei.com
---|---
Series | vfio/hisilicon: add ACC live migration driver
On Mon, 21 Feb 2022 20:49:43 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Feb 21, 2022 at 11:40:35AM +0000, Shameer Kolothum wrote:
> >
> > Hi,
> >
> > This series attempts to add vfio live migration support for
> > HiSilicon ACC VF devices based on the new v2 migration protocol
> > definition and mlx5 v8 series discussed here[0].
> >
> > RFCv4 --> v5
> >   - Dropped RFC tag as v2 migration APIs are more stable now.
> >   - Addressed review comments from Jason and Alex (Thanks!).
> >
> > This is sanity tested on a HiSilicon platform using the Qemu branch
> > provided here[1].
> >
> > Please take a look and let me know your feedback.
> >
> > Thanks,
> > Shameer
> > [0] https://lore.kernel.org/kvm/20220220095716.153757-1-yishaih@nvidia.com/
> > [1] https://github.com/jgunthorpe/qemu/commits/vfio_migration_v2
> >
> > v3 --> RFCv4
> >   - Based on migration v2 protocol and mlx5 v7 series.
> >   - Added RFC tag again as migration v2 protocol is still under discussion.
> >   - Added new patch #6 to retrieve the PF QM data.
> >   - PRE_COPY compatibility check is now done after the migration data
> >     transfer. This is not ideal and needs discussion.
>
> Alex, do you want to keep the PRE_COPY in just for acc for now? Or do
> you think this is not a good temporary use for it?
>
> We have some work toward doing the compatibility more generally, but I
> think it will be a while before that is all settled.

In the original migration protocol I recall that we discussed using the
pre-copy phase for compatibility testing, even without additional
device data, as a valid use case. The migration driver of course needs
to account for the fact that userspace is not required to perform a
pre-copy, and therefore cannot rely on that exclusively for
compatibility testing, but failing a migration earlier due to detection
of an incompatibility is generally a good thing.

If the ACC driver wants to re-incorporate this behavior into a non-RFC
proposed series and we could align accepting them into the same kernel
release, that sounds ok to me. Thanks,

Alex
> -----Original Message-----
> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: 22 February 2022 19:30
> To: Jason Gunthorpe <jgg@nvidia.com>
> Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> linux-crypto@vger.kernel.org; cohuck@redhat.com; mgurtovoy@nvidia.com;
> yishaih@nvidia.com; Linuxarm <linuxarm@huawei.com>; liulongfang
> <liulongfang@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>;
> Jonathan Cameron <jonathan.cameron@huawei.com>; Wangzhou (B)
> <wangzhou1@hisilicon.com>
> Subject: Re: [PATCH v5 0/8] vfio/hisilicon: add ACC live migration driver
>
> On Mon, 21 Feb 2022 20:49:43 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> > On Mon, Feb 21, 2022 at 11:40:35AM +0000, Shameer Kolothum wrote:
> > >
> > > Hi,
> > >
> > > This series attempts to add vfio live migration support for
> > > HiSilicon ACC VF devices based on the new v2 migration protocol
> > > definition and mlx5 v8 series discussed here[0].
> > >
> > > RFCv4 --> v5
> > >   - Dropped RFC tag as v2 migration APIs are more stable now.
> > >   - Addressed review comments from Jason and Alex (Thanks!).
> > >
> > > This is sanity tested on a HiSilicon platform using the Qemu branch
> > > provided here[1].
> > >
> > > Please take a look and let me know your feedback.
> > >
> > > Thanks,
> > > Shameer
> > > [0] https://lore.kernel.org/kvm/20220220095716.153757-1-yishaih@nvidia.com/
> > > [1] https://github.com/jgunthorpe/qemu/commits/vfio_migration_v2
> > >
> > > v3 --> RFCv4
> > >   - Based on migration v2 protocol and mlx5 v7 series.
> > >   - Added RFC tag again as migration v2 protocol is still under discussion.
> > >   - Added new patch #6 to retrieve the PF QM data.
> > >   - PRE_COPY compatibility check is now done after the migration data
> > >     transfer. This is not ideal and needs discussion.
> >
> > Alex, do you want to keep the PRE_COPY in just for acc for now? Or do
> > you think this is not a good temporary use for it?
> >
> > We have some work toward doing the compatibility more generally, but I
> > think it will be a while before that is all settled.
>
> In the original migration protocol I recall that we discussed using the
> pre-copy phase for compatibility testing, even without additional
> device data, as a valid use case. The migration driver of course needs
> to account for the fact that userspace is not required to perform a
> pre-copy, and therefore cannot rely on that exclusively for
> compatibility testing, but failing a migration earlier due to detection
> of an incompatibility is generally a good thing.
>
> If the ACC driver wants to re-incorporate this behavior into a non-RFC
> proposed series and we could align accepting them into the same kernel
> release, that sounds ok to me. Thanks,

Ok. I will add support for PRE_COPY and check compatibility early.
From an FSM arc point of view, I guess it is adding:

STATE_RUNNING --> STATE_PRE_COPY
    create the saving file;
    get_match_data();
    return fd;

STATE_PRE_COPY --> STATE_STOP_COPY
    stop_device();
    get_device_data();
    update the saving migf total_len;

resume_write()
    check compatibility once we have enough bytes.

Also add support for the VFIO_DEVICE_MIG_PRECOPY ioctl.

I will have a go and send out a revised one.

Thanks,
Shameer
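For illustration, the two arcs sketched above could map onto the v2
migration_set_state step roughly as follows. This is only a sketch: the
helpers (hisi_acc_open_saving_migf(), hisi_acc_vf_read_match_data(),
hisi_acc_vf_stop_copy()) and the saving_migf/total_length fields are
assumed names for this example, not the driver's actual API.

/*
 * Sketch only: helper names and migf fields below are assumptions for
 * illustration, not the actual driver code.
 */
static struct file *
hisi_acc_vf_step_state(struct hisi_acc_vf_core_device *hisi_acc_vdev,
                       enum vfio_device_mig_state cur,
                       enum vfio_device_mig_state new)
{
        if (cur == VFIO_DEVICE_STATE_RUNNING &&
            new == VFIO_DEVICE_STATE_PRE_COPY) {
                struct hisi_acc_vf_migration_file *migf;

                /* Create the saving file and emit only the match data now. */
                migf = hisi_acc_open_saving_migf(hisi_acc_vdev);
                if (IS_ERR(migf))
                        return ERR_CAST(migf);
                hisi_acc_vf_read_match_data(hisi_acc_vdev, migf);
                hisi_acc_vdev->saving_migf = migf;
                return migf->filp;      /* fd handed back to userspace */
        }

        if (cur == VFIO_DEVICE_STATE_PRE_COPY &&
            new == VFIO_DEVICE_STATE_STOP_COPY) {
                struct hisi_acc_vf_migration_file *migf =
                                        hisi_acc_vdev->saving_migf;

                /* Stop the VF and append the full device state. */
                hisi_acc_vf_stop_copy(hisi_acc_vdev);
                migf->total_length = sizeof(struct acc_vf_data);
                return NULL;    /* keep using the fd opened in PRE_COPY */
        }

        return ERR_PTR(-EINVAL);
}

On the resume side, the compatibility check would then happen in the
resume_write() handler once at least the match-data bytes have been
received, as described above.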
On Tue, 22 Feb 2022 20:52:51 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Feb 21, 2022 at 11:40:42AM +0000, Shameer Kolothum wrote:
>
> > +        /*
> > +         * ACC VF dev BAR2 region consists of both functional register space
> > +         * and migration control register space. For migration to work, we
> > +         * need access to both. Hence, we map the entire BAR2 region here.
> > +         * But from a security point of view, we restrict access to the
> > +         * migration control space from Guest(Please see mmap/ioctl/read/write
> > +         * override functions).
> > +         *
> > +         * Also the HiSilicon ACC VF devices supported by this driver on
> > +         * HiSilicon hardware platforms are integrated end point devices
> > +         * and has no capability to perform PCIe P2P.
>
> If that is the case why not implement the RUNNING_P2P as well as a
> NOP?
>
> Alex expressed concern about proliferation of non-P2P devices as it
> complicates qemu to support mixes

I read the above as more of a statement about isolation, i.e. grouping.
Given that all DMA from the device is translated by the IOMMU, how is
it possible that a device can entirely lack p2p support, or even know
that the target address post-translation is to a peer device rather
than system memory? If this is the case, it sounds like a restriction
of the SMMU not supporting translations that reflect back to the I/O
bus rather than a feature of the device itself. Thanks,

Alex
> -----Original Message-----
> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: 23 February 2022 16:35
> To: Jason Gunthorpe <jgg@nvidia.com>
> Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> linux-crypto@vger.kernel.org; cohuck@redhat.com; mgurtovoy@nvidia.com;
> yishaih@nvidia.com; Linuxarm <linuxarm@huawei.com>; liulongfang
> <liulongfang@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>;
> Jonathan Cameron <jonathan.cameron@huawei.com>; Wangzhou (B)
> <wangzhou1@hisilicon.com>
> Subject: Re: [PATCH v5 7/8] hisi_acc_vfio_pci: Add support for VFIO live
> migration
>
> On Tue, 22 Feb 2022 20:52:51 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> > On Mon, Feb 21, 2022 at 11:40:42AM +0000, Shameer Kolothum wrote:
> >
> > > +        /*
> > > +         * ACC VF dev BAR2 region consists of both functional register space
> > > +         * and migration control register space. For migration to work, we
> > > +         * need access to both. Hence, we map the entire BAR2 region here.
> > > +         * But from a security point of view, we restrict access to the
> > > +         * migration control space from Guest(Please see mmap/ioctl/read/write
> > > +         * override functions).
> > > +         *
> > > +         * Also the HiSilicon ACC VF devices supported by this driver on
> > > +         * HiSilicon hardware platforms are integrated end point devices
> > > +         * and has no capability to perform PCIe P2P.
> >
> > If that is the case why not implement the RUNNING_P2P as well as a
> > NOP?
> >
> > Alex expressed concern about proliferation of non-P2P devices as it
> > complicates qemu to support mixes
>
> I read the above as more of a statement about isolation, i.e. grouping.

That's right. That's what I meant by "no capability to perform PCIe P2P".

Thanks,
Shameer

> Given that all DMA from the device is translated by the IOMMU, how is
> it possible that a device can entirely lack p2p support, or even know
> that the target address post-translation is to a peer device rather
> than system memory? If this is the case, it sounds like a restriction
> of the SMMU not supporting translations that reflect back to the I/O
> bus rather than a feature of the device itself. Thanks,
>
> Alex
On Wed, Feb 23, 2022 at 09:34:43AM -0700, Alex Williamson wrote:
> On Tue, 22 Feb 2022 20:52:51 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> > On Mon, Feb 21, 2022 at 11:40:42AM +0000, Shameer Kolothum wrote:
> >
> > > +        /*
> > > +         * ACC VF dev BAR2 region consists of both functional register space
> > > +         * and migration control register space. For migration to work, we
> > > +         * need access to both. Hence, we map the entire BAR2 region here.
> > > +         * But from a security point of view, we restrict access to the
> > > +         * migration control space from Guest(Please see mmap/ioctl/read/write
> > > +         * override functions).
> > > +         *
> > > +         * Also the HiSilicon ACC VF devices supported by this driver on
> > > +         * HiSilicon hardware platforms are integrated end point devices
> > > +         * and has no capability to perform PCIe P2P.
> >
> > If that is the case why not implement the RUNNING_P2P as well as a
> > NOP?
> >
> > Alex expressed concern about proliferation of non-P2P devices as it
> > complicates qemu to support mixes
>
> I read the above as more of a statement about isolation, i.e. grouping.
> Given that all DMA from the device is translated by the IOMMU, how is
> it possible that a device can entirely lack p2p support, or even know
> that the target address post-translation is to a peer device rather
> than system memory? If this is the case, it sounds like a restriction
> of the SMMU not supporting translations that reflect back to the I/O
> bus rather than a feature of the device itself. Thanks,

This is an interesting point. Arguably, if P2P addresses are invalid in
an IOPTE then pci_p2pdma_distance() should fail and we shouldn't have
installed them into the iommu in the first place.

Jason
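As an aside on the RUNNING_P2P suggestion above: since the VF cannot
generate P2P traffic, the P2P arcs could be advertised and treated as
no-ops. A rough sketch, using the state names from the proposed v2
uAPI (the helper itself is hypothetical, not code from this series);
the driver would additionally report VFIO_MIGRATION_P2P alongside
VFIO_MIGRATION_STOP_COPY in migration_flags:

/* Hypothetical helper: with no P2P capability, quiescing outbound P2P
 * DMA is a no-op, so RUNNING <-> RUNNING_P2P needs no device action.
 */
static bool hisi_acc_vf_p2p_arc_is_nop(enum vfio_device_mig_state cur,
                                       enum vfio_device_mig_state new)
{
        return (cur == VFIO_DEVICE_STATE_RUNNING &&
                new == VFIO_DEVICE_STATE_RUNNING_P2P) ||
               (cur == VFIO_DEVICE_STATE_RUNNING_P2P &&
                new == VFIO_DEVICE_STATE_RUNNING);
}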
On Mon, 21 Feb 2022 11:40:40 +0000
Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> wrote:

> +static const struct vfio_device_ops hisi_acc_vfio_pci_migrn_ops = {
> +        .name = "hisi-acc-vfio-pci",

Use a different name from the ops below? Thanks,

Alex

> +        .open_device = hisi_acc_vfio_pci_open_device,
> +        .close_device = vfio_pci_core_close_device,
> +        .ioctl = hisi_acc_vfio_pci_ioctl,
> +        .device_feature = vfio_pci_core_ioctl_feature,
> +        .read = hisi_acc_vfio_pci_read,
> +        .write = hisi_acc_vfio_pci_write,
> +        .mmap = hisi_acc_vfio_pci_mmap,
> +        .request = vfio_pci_core_request,
> +        .match = vfio_pci_core_match,
> +};
> +
>  static const struct vfio_device_ops hisi_acc_vfio_pci_ops = {
>          .name = "hisi-acc-vfio-pci",
>          .open_device = hisi_acc_vfio_pci_open_device,
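One way to address this would be to give the migration-enabled ops a
distinct name, for example (the string below is a suggestion, not taken
from the series):

static const struct vfio_device_ops hisi_acc_vfio_pci_migrn_ops = {
        .name = "hisi-acc-vfio-pci-migration",
        /* ... callbacks as in the patch ... */
};

static const struct vfio_device_ops hisi_acc_vfio_pci_ops = {
        .name = "hisi-acc-vfio-pci",
        /* ... callbacks as in the patch ... */
};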
On Mon, 21 Feb 2022 11:40:42 +0000
Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> wrote:

> @@ -159,23 +1110,46 @@ static long hisi_acc_vfio_pci_ioctl(struct vfio_device *core_vdev, unsigned int
>
>  static int hisi_acc_vfio_pci_open_device(struct vfio_device *core_vdev)
>  {
> -        struct vfio_pci_core_device *vdev =
> -                container_of(core_vdev, struct vfio_pci_core_device, vdev);
> +        struct hisi_acc_vf_core_device *hisi_acc_vdev = container_of(core_vdev,
> +                        struct hisi_acc_vf_core_device, core_device.vdev);
> +        struct vfio_pci_core_device *vdev = &hisi_acc_vdev->core_device;
>          int ret;
>
>          ret = vfio_pci_core_enable(vdev);
>          if (ret)
>                  return ret;
>
> -        vfio_pci_core_finish_enable(vdev);
> +        if (core_vdev->migration_flags != VFIO_MIGRATION_STOP_COPY) {

This looks like a minor synchronization issue with
hisi_acc_vfio_pci_migrn_init(); I think it might be cleaner to test
core_vdev->ops against the migration enabled set.

> +                vfio_pci_core_finish_enable(vdev);
> +                return 0;
> +        }
> +
> +        ret = hisi_acc_vf_qm_init(hisi_acc_vdev);
> +        if (ret) {
> +                vfio_pci_core_disable(vdev);
> +                return ret;
> +        }
>
> +        hisi_acc_vdev->mig_state = VFIO_DEVICE_STATE_RUNNING;

Change the polarity of the if() above and encompass this all within
that branch scope so we can use the finish/return below for both cases?

> +
> +        vfio_pci_core_finish_enable(vdev);
>          return 0;
>  }
>
> +static void hisi_acc_vfio_pci_close_device(struct vfio_device *core_vdev)
> +{
> +        struct hisi_acc_vf_core_device *hisi_acc_vdev = container_of(core_vdev,
> +                        struct hisi_acc_vf_core_device, core_device.vdev);
> +        struct hisi_qm *vf_qm = &hisi_acc_vdev->vf_qm;
> +
> +        iounmap(vf_qm->io_base);
> +        vfio_pci_core_close_device(core_vdev);
> +}
> +
>  static const struct vfio_device_ops hisi_acc_vfio_pci_migrn_ops = {
>          .name = "hisi-acc-vfio-pci",
>          .open_device = hisi_acc_vfio_pci_open_device,
> -        .close_device = vfio_pci_core_close_device,
> +        .close_device = hisi_acc_vfio_pci_close_device,
>          .ioctl = hisi_acc_vfio_pci_ioctl,
>          .device_feature = vfio_pci_core_ioctl_feature,
>          .read = hisi_acc_vfio_pci_read,
> @@ -183,6 +1157,8 @@ static const struct vfio_device_ops hisi_acc_vfio_pci_migrn_ops = {
>          .mmap = hisi_acc_vfio_pci_mmap,
>          .request = vfio_pci_core_request,
>          .match = vfio_pci_core_match,
> +        .migration_set_state = hisi_acc_vfio_pci_set_device_state,
> +        .migration_get_state = hisi_acc_vfio_pci_get_device_state,
>  };
>
>  static const struct vfio_device_ops hisi_acc_vfio_pci_ops = {
> @@ -198,38 +1174,71 @@ static const struct vfio_device_ops hisi_acc_vfio_pci_ops = {
>          .match = vfio_pci_core_match,
>  };
>
> +static int
> +hisi_acc_vfio_pci_migrn_init(struct hisi_acc_vf_core_device *hisi_acc_vdev,
> +                             struct pci_dev *pdev, struct hisi_qm *pf_qm)
> +{
> +        int vf_id;
> +
> +        vf_id = pci_iov_vf_id(pdev);
> +        if (vf_id < 0)
> +                return vf_id;
> +
> +        hisi_acc_vdev->vf_id = vf_id + 1;
> +        hisi_acc_vdev->core_device.vdev.migration_flags =
> +                                        VFIO_MIGRATION_STOP_COPY;
> +        hisi_acc_vdev->pf_qm = pf_qm;
> +        hisi_acc_vdev->vf_dev = pdev;
> +        mutex_init(&hisi_acc_vdev->state_mutex);
> +
> +        return 0;
> +}
> +
>  static int hisi_acc_vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  {
> -        struct vfio_pci_core_device *vdev;
> +        struct hisi_acc_vf_core_device *hisi_acc_vdev;
> +        struct hisi_qm *pf_qm;
>          int ret;
>
> -        vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
> -        if (!vdev)
> +        hisi_acc_vdev = kzalloc(sizeof(*hisi_acc_vdev), GFP_KERNEL);
> +        if (!hisi_acc_vdev)
>                  return -ENOMEM;
>
> -        vfio_pci_core_init_device(vdev, pdev, &hisi_acc_vfio_pci_ops);
> +        pf_qm = hisi_acc_get_pf_qm(pdev);
> +        if (pf_qm && pf_qm->ver >= QM_HW_V3) {
> +                ret = hisi_acc_vfio_pci_migrn_init(hisi_acc_vdev, pdev, pf_qm);
> +                if (ret < 0) {
> +                        kfree(hisi_acc_vdev);
> +                        return ret;
> +                }

This error path can only occur if the VF ID lookup fails, but should we
fall through to the non-migration ops, maybe with a dev_warn()? Thanks,

Alex

> +
> +                vfio_pci_core_init_device(&hisi_acc_vdev->core_device, pdev,
> +                                          &hisi_acc_vfio_pci_migrn_ops);
> +        } else {
> +                vfio_pci_core_init_device(&hisi_acc_vdev->core_device, pdev,
> +                                          &hisi_acc_vfio_pci_ops);
> +        }
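For illustration, the fallback suggested above could take roughly the
following shape, selecting the ops via a local pointer and warning
instead of failing the probe (a sketch of the suggested restructuring,
not the posted patch; the warning text is made up):

        const struct vfio_device_ops *ops = &hisi_acc_vfio_pci_ops;

        pf_qm = hisi_acc_get_pf_qm(pdev);
        if (pf_qm && pf_qm->ver >= QM_HW_V3) {
                ret = hisi_acc_vfio_pci_migrn_init(hisi_acc_vdev, pdev, pf_qm);
                if (!ret)
                        ops = &hisi_acc_vfio_pci_migrn_ops;
                else
                        dev_warn(&pdev->dev,
                                 "VF ID lookup failed (%d), disabling migration support\n",
                                 ret);
        }

        vfio_pci_core_init_device(&hisi_acc_vdev->core_device, pdev, ops);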